MuSe '23: Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation


SESSION: Session 1: MuSe Challenge Introduction

The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions, Cross-Cultural Humour, and Personalisation

  • Lukas Christ
  • Shahin Amiriparian
  • Alice Baird
  • Alexander Kathan
  • Niklas Müller
  • Steffen Klug
  • Chris Gagne
  • Panagiotis Tzirakis
  • Lukas Stappen
  • Eva-Maria Meßner
  • Andreas König
  • Alan Cowen
  • Erik Cambria
  • Björn W. Schuller

The Multimodal Sentiment Analysis Challenge (MuSe) 2023 is a set of shared tasks addressing three different contemporary multimodal affect and sentiment analysis problems: In the Mimicked Emotions Sub-Challenge (MuSe-Mimic), participants predict three continuous emotion targets. This sub-challenge utilises the Hume-Vidmimic dataset comprising user-generated videos. For the Cross-Cultural Humour Detection Sub-Challenge (MuSe-Humour), an extension of the Passau Spontaneous Football Coach Humour (Passau-SFCH) dataset is provided. Participants predict the presence of spontaneous humour in a cross-cultural setting. The Personalisation Sub-Challenge (MuSe-Personalisation) is based on the Ulm-Trier Social Stress Test (Ulm-TSST) dataset, featuring recordings of subjects in a stressful situation. Here, arousal and valence signals are to be predicted, while parts of the test labels are made available to facilitate personalisation. MuSe 2023 seeks to bring together a broad audience from different research communities such as audio-visual emotion recognition, natural language processing, signal processing, and health informatics. In this baseline paper, we introduce the datasets, sub-challenges, and provided feature sets. As a competitive baseline system, a Gated Recurrent Unit (GRU)-Recurrent Neural Network (RNN) is employed. On the respective sub-challenges' test datasets, it achieves a mean (across three continuous intensity targets) Pearson's Correlation Coefficient of .4727 for MuSe-Mimic, an Area Under the Curve (AUC) value of .8310 for MuSe-Humour, and Concordance Correlation Coefficient (CCC) values of .7482 for arousal and .7827 for valence in the MuSe-Personalisation sub-challenge.
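
The three sub-challenges are scored with mean Pearson's Correlation Coefficient (MuSe-Mimic), AUC (MuSe-Humour), and CCC (MuSe-Personalisation). The following is a minimal NumPy sketch of these metrics for orientation; it is illustrative only and not the official evaluation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # AUC, as used for MuSe-Humour

def mean_pearson(preds: np.ndarray, labels: np.ndarray) -> float:
    """Mean Pearson's correlation over targets; arrays of shape (samples, targets)."""
    return float(np.mean([np.corrcoef(preds[:, i], labels[:, i])[0, 1]
                          for i in range(labels.shape[1])]))

def ccc(preds: np.ndarray, labels: np.ndarray) -> float:
    """Concordance Correlation Coefficient for a single continuous signal."""
    mean_p, mean_l = preds.mean(), labels.mean()
    cov = np.mean((preds - mean_p) * (labels - mean_l))
    return float(2 * cov / (preds.var() + labels.var() + (mean_p - mean_l) ** 2))

# MuSe-Humour: roc_auc_score(labels, scores);
# MuSe-Personalisation: ccc(...) computed separately for arousal and valence.
```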

SESSION: Session 2: MuSe-Mimic Subchallenge

Multimodal Sentiment Analysis via Efficient Multimodal Transformer and Modality-Aware Adaptive Training Strategy

  • Chaoyue Ding
  • Daoming Zong
  • Baoxiang Li
  • Song Zhang
  • Xiaoxu Zhu
  • Guiping Zhong
  • Dinghao Zhou

In this paper, we present our solution to the MuSe-Mimic sub-challenge of the 4th Multimodal Sentiment Analysis Challenge. This sub-challenge aims to predict the level of approval, disappointment and uncertainty in user-generated video clips. In our experiments, we found that naive joint training of multiple modalities by late fusion results in insufficient learning of unimodal features. Moreover, different modalities contribute differently to MuSe-Mimic. Relying solely on multimodal features or treating unimodal features equally may limit the model's generalization performance. To address these challenges, we propose an efficient multimodal transformer equipped with a modality-aware adaptive training strategy to facilitate optimal joint training on multimodal sequence inputs. This framework holds promise in leveraging cross-modal interactions while ensuring adequate learning of unimodal features. Our model achieves a mean Pearson's Correlation Coefficient of .729 (ranking 2nd), outperforming the official baseline result of .473. Our code is available at https://github.com/dingchaoyue/Multimodal-Emotion-Recognition-MER-and-Mu....
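
The modality-aware adaptive training strategy is the paper's own contribution and is not reproduced here. The sketch below only illustrates, under assumed layer sizes and loss weights, a late-fusion setup with auxiliary unimodal heads so that each modality branch still receives its own supervision, which is the kind of insufficient-unimodal-learning issue the paper addresses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusionRegressor(nn.Module):
    """Illustrative late-fusion model: one GRU encoder and head per modality,
    plus a fused head; unimodal losses keep each branch adequately trained."""
    def __init__(self, dims: dict, hidden: int = 64, n_targets: int = 3):
        super().__init__()
        self.encoders = nn.ModuleDict({m: nn.GRU(d, hidden, batch_first=True)
                                       for m, d in dims.items()})
        self.uni_heads = nn.ModuleDict({m: nn.Linear(hidden, n_targets) for m in dims})
        self.fused_head = nn.Linear(hidden * len(dims), n_targets)

    def forward(self, inputs: dict):
        pooled = {}
        for m, x in inputs.items():               # x: (batch, time, dim)
            out, _ = self.encoders[m](x)
            pooled[m] = out.mean(dim=1)           # temporal average pooling
        uni = {m: self.uni_heads[m](h) for m, h in pooled.items()}
        fused = self.fused_head(torch.cat([pooled[m] for m in sorted(pooled)], dim=-1))
        return fused, uni

def joint_loss(fused, uni, target, uni_weight: float = 0.3):
    """Fused regression loss plus weighted unimodal terms (weight is an assumption)."""
    return F.mse_loss(fused, target) + uni_weight * sum(F.mse_loss(p, target)
                                                        for p in uni.values())
```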

Exploring the Power of Cross-Contextual Large Language Model in Mimic Emotion Prediction

  • Guofeng Yi
  • Yuguang Yang
  • Yu Pan
  • Yuhang Cao
  • Jixun Yao
  • Xiang Lv
  • Cunhang Fan
  • Zhao Lv
  • Jianhua Tao
  • Shan Liang
  • Heng Lu

In the MuSe-Mimic sub-challenge, participants utilize multimodal data to predict the intensity of three emotional categories. In our work, we discovered that integrating multiple dimensions, modalities, and levels enhances the effectiveness of emotional judgment. In terms of feature extraction, we utilize over a dozen backbone networks, including W2V-MSP, GLM, and FAU, which are representative of the audio, text, and video modalities, respectively. Additionally, we utilize the LoRA framework and employ various domain adaptation methods to effectively adapt to the task at hand. Regarding model design, apart from the RNN model of the baseline, we extensively incorporate our transformer variant and a multi-modal fusion model. Finally, we propose a Hyper-parameter Search Strategy (HPSS) for late fusion to further enhance the effectiveness of the fusion model. For MuSe-Mimic, our method achieves Pearson's Correlation Coefficients of 0.7753, 0.7647, and 0.6653 for Approval, Disappointment, and Uncertainty, respectively, outperforming the baseline system (0.5536, 0.5139, and 0.3395) by a large margin on the test set. The final mean Pearson's Correlation Coefficient is 0.7351, surpassing all other participants and ranking first.
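
The details of the HPSS are specific to the paper; a generic stand-in is a random search over convex per-model fusion weights, scored by mean Pearson on a validation split, as sketched below (the Dirichlet sampling and trial count are assumptions).

```python
import numpy as np

def mean_pearson(preds: np.ndarray, labels: np.ndarray) -> float:
    return float(np.mean([np.corrcoef(preds[:, i], labels[:, i])[0, 1]
                          for i in range(labels.shape[1])]))

def search_fusion_weights(val_preds: list, val_labels: np.ndarray,
                          n_trials: int = 2000, seed: int = 0):
    """Randomly sample convex combinations of per-model predictions and keep the
    weights that maximise mean Pearson on the validation set."""
    rng = np.random.default_rng(seed)
    stacked = np.stack(val_preds)                  # (num_models, num_samples, num_targets)
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(val_preds))) # random weights summing to 1
        fused = np.tensordot(w, stacked, axes=1)   # weighted sum over models
        score = mean_pearson(fused, val_labels)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```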

SESSION: Session 3: MuSe-Humour Subchallenge

Discovering Relevant Sub-spaces of BERT, Wav2Vec 2.0, ELECTRA and ViT Embeddings for Humor and Mimicked Emotion Recognition with Integrated Gradients

  • Tamás Grósz
  • Anja Virkkunen
  • Dejan Porjazovski
  • Mikko Kurimo

Large-scale, pre-trained models revolutionized the field of sentiment analysis and enabled multimodal systems to be quickly developed. In this paper, we address two challenges posed by the Multimodal Sentiment Analysis (MuSe) 2023 competition by focusing on automatically detecting cross-cultural humor and predicting three continuous emotion targets from user-generated videos. Multiple methods in the literature already demonstrate the importance of embedded features generated by popular pre-trained neural solutions. Based on their success, we can assume that the embedded space consists of several sub-spaces relevant to different tasks. Our aim is to automatically identify the task-specific sub-spaces of various embeddings by interpreting the baseline neural models. Once the relevant dimensions are located, we train a new model using only those features, which leads to similar or slightly better results with a considerably smaller and faster model. The best Humor Detection model using only the relevant sub-space of audio embeddings contained approximately 54% fewer parameters than the one processing the whole encoded vector, required 48% less time to be trained and even outperformed the larger model. Our empirical results validate that, indeed, only a portion of the embedding space is needed to achieve good performance. Our solution could be considered a novel form of knowledge distillation, which enables new ways of transferring knowledge from one model into another.
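
As an illustration of the general idea, not the authors' exact procedure, the sketch below computes plain integrated gradients against a zero baseline for an embedding-level model and keeps the most attributed dimensions for retraining; the step count and keep ratio are assumptions.

```python
import torch
import torch.nn as nn

def integrated_gradients(model: nn.Module, x: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Integrated gradients w.r.t. a zero baseline for a scalar-output model.
    x: (batch, dim) embedding vectors; returns per-dimension attributions."""
    x = x.detach()
    baseline = torch.zeros_like(x)
    alphas = torch.linspace(0.0, 1.0, steps, device=x.device).view(-1, 1, 1)
    interpolated = (baseline + alphas * (x - baseline)).requires_grad_(True)
    out = model(interpolated.reshape(-1, x.shape[-1])).sum()
    grads = torch.autograd.grad(out, interpolated)[0]   # (steps, batch, dim)
    return (x - baseline) * grads.mean(dim=0)           # (batch, dim)

def select_relevant_dims(model: nn.Module, embeddings: torch.Tensor,
                         keep_ratio: float = 0.5) -> torch.Tensor:
    """Rank dimensions by mean absolute attribution and keep the top fraction,
    yielding the sub-space on which a smaller model can be retrained."""
    attr = integrated_gradients(model, embeddings).abs().mean(dim=0)
    k = int(keep_ratio * attr.numel())
    return torch.topk(attr, k).indices
```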

Humor Detection System for MuSE 2023: Contextual Modeling, Pseudo Labelling, and Post-smoothing

  • Mingyu Xu
  • Shun Chen
  • Zheng Lian
  • Bin Liu

Humor detection has emerged as an active research area within the field of artificial intelligence. Over the past few decades, it has made remarkable progress with the development of deep learning. This paper introduces a novel framework aimed at enhancing the model's understanding of humorous expressions. Specifically, we consider the impact of the correspondence between labels and features. To obtain more effective models with limited training samples, we employ a widely used semi-supervised learning technique, pseudo labeling. Furthermore, we use a post-smoothing strategy to eliminate abnormally high predictions. To alleviate over-fitting to the validation set, we additionally create 10 different random subsets of the training data and aggregate their predictions. To verify the effectiveness of our strategy, we evaluate its performance on the Cross-Cultural Humour sub-challenge of MuSe 2023. Experimental results demonstrate that our system achieves an AUC score of 0.9112, surpassing the performance of the baseline models by a substantial margin.
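
Two of the ingredients, post-smoothing and training on random subsets, can be sketched generically. The code below is an illustration under stated assumptions (moving-average smoothing, 80% subsets, an sklearn-style predict_proba interface), not the authors' implementation; the pseudo-labelling step is omitted.

```python
import numpy as np

def post_smooth(scores: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving-average smoothing over a prediction sequence to suppress isolated,
    abnormally high humour scores (the window size is an assumption)."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

def bagged_predictions(train_fn, X_train, y_train, X_test,
                       n_bags: int = 10, seed: int = 0) -> np.ndarray:
    """Train on 10 random subsets of the training data and average the predictions,
    mirroring the idea of reducing over-fitting to the validation set.
    train_fn(X, y) is user-supplied and must return a fitted classifier with
    a predict_proba method (an assumed, sklearn-style interface)."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_bags):
        idx = rng.choice(len(X_train), size=int(0.8 * len(X_train)), replace=False)
        model = train_fn(X_train[idx], y_train[idx])
        preds.append(model.predict_proba(X_test)[:, 1])
    return np.mean(preds, axis=0)
```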

MMT-GD: Multi-Modal Transformer with Graph Distillation for Cross-Cultural Humor Detection

  • Jun Yu
  • Wangyuan Zhu
  • Jichao Zhu
  • Xiaxin Shen
  • Jianqing Sun
  • Jiaen Liang

In this paper, we present a solution for the Cross-Cultural Humor Detection (MuSe-Humor) sub-challenge, which is part of the Multimodal Sentiment Analysis Challenge (MuSe) 2023. The MuSe-Humor task aims to detect humor from multimodal data, including video, audio, and text, in a cross-cultural context. The training data consists of German recordings, while the test data consists of English recordings. To tackle this sub-challenge, we propose a method called MMT-GD, which leverages a multimodal transformer model to effectively integrate the multimodal data. Additionally, we incorporate graph distillation to ensure that the fusion process captures discriminative features from each modality, avoiding excessive reliance on any single modality. Experimental results validate the effectiveness of our approach, achieving an Area Under the Curve (AUC) score of 0.8704 on the test set and securing the third position in the challenge.
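
The graph distillation component is the paper's own formulation and is not reproduced here. As a loose illustration of distillation-style coupling between modality branches, the sketch below penalises the KL divergence between every pair of per-modality humour logits; the temperature and averaging are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_distillation_loss(logits_per_modality: dict,
                               temperature: float = 2.0) -> torch.Tensor:
    """Illustrative distillation-style regulariser: each modality's humour logits
    are pulled towards every other modality's softened predictions via KL divergence.
    This is a simplified stand-in, not the paper's graph distillation formulation."""
    names = sorted(logits_per_modality)
    loss = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            p = F.log_softmax(logits_per_modality[a] / temperature, dim=-1)
            q = F.softmax(logits_per_modality[b] / temperature, dim=-1)
            loss = loss + F.kl_div(p, q.detach(), reduction="batchmean")
    n_pairs = max(len(names) * (len(names) - 1) // 2, 1)
    return loss / n_pairs
```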

Multimodal Cross-Lingual Features and Weight Fusion for Cross-Cultural Humor Detection

  • Heng Xie
  • Jizhou Cui
  • Yuhang Cao
  • Junjie Chen
  • Jianhua Tao
  • Cunhang Fan
  • Xuefei Liu
  • Zhengqi Wen
  • Heng Lu
  • Yuguang Yang
  • Zhao Lv
  • Yongwei Li

Sentiment analysis plays a crucial role in interpreting human interactions across different modalities. This paper addresses the Cross-Cultural Humour Detection Sub-Challenge in the MuSe 2023 Multimodal Sentiment Analysis Challenge, which aims to identify humor instances in cross-cultural situations using multimodal data. In this paper, we explore different window lengths for processing the audio data in order to capture humor information more accurately and completely. Furthermore, we employ a data augmentation technique for text to bridge the gap between the train set and the test set: the German and English samples in the dataset are translated into each other's language, and both languages are additionally translated into Chinese. These techniques aim to enhance the model's generalization across different languages and alleviate the scarcity of humor samples. For multimodal fusion, we apply late fusion to different Gated Recurrent Unit (GRU) models with modality-specific weights obtained through gradient descent. Experimental results demonstrate the effectiveness of our approach, achieving an Area Under the Curve (AUC) score of 0.872 on the MuSe-Humor test set. This performance ranked second in the sub-challenge.
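
The weight fusion step, learning modality-specific weights for late fusion by gradient descent, can be sketched generically as follows; the softmax parameterisation, optimiser, and loss are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def fit_fusion_weights(dev_preds: torch.Tensor, dev_labels: torch.Tensor,
                       steps: int = 500, lr: float = 0.05) -> torch.Tensor:
    """Learn softmax-normalised, modality-specific fusion weights by gradient descent
    on development-set predictions.
    dev_preds: (num_models, num_samples) humour probabilities per model,
    dev_labels: (num_samples,) binary labels."""
    w = torch.zeros(dev_preds.shape[0], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        fused = torch.softmax(w, dim=0) @ dev_preds          # convex combination
        loss = F.binary_cross_entropy(fused.clamp(1e-6, 1 - 1e-6), dev_labels.float())
        loss.backward()
        opt.step()
    return torch.softmax(w, dim=0).detach()                  # final fusion weights
```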

JTMA: Joint Multimodal Feature Fusion and Temporal Multi-head Attention for Humor Detection

  • Qi Li
  • Yangyang Xu
  • Zhuoer Zhao
  • Shulei Tang
  • Feixiang Zhang
  • Ruotong Wang
  • Xiao Sun
  • Meng Wang

In this paper, we propose a model named Joint multimodal feature fusion and Temporal Multi-head Attention (JTMA) to solve the MuSe-Humor sub-challenge in the Multimodal Sentiment Analysis Challenge 2023. The goal of the MuSe-Humor sub-challenge is to predict whether humor occurs in a given dataset that includes data from multiple modalities (e.g., text, audio and video). The cross-cultural test setting presents a new challenge that distinguishes this year's task from previous years. To solve the above problems, the proposed JTMA model first uses a 1-D CNN to aggregate temporal information within each unimodal feature sequence. Then, inter-modality and intra-modality interactions are modelled by the multimodal feature encoder module. Finally, we integrate the high-level representations learned from multiple modalities to accurately predict humor. The effectiveness of our proposed model is demonstrated through experimental results obtained on the official test set. Our model achieves an impressive AUC score of 0.8889, surpassing all other participants in the competition and securing the Top 1 ranking.
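
A minimal sketch of the two building blocks named above, a 1-D CNN for temporal aggregation within a modality and multi-head attention across modalities, is given below using standard PyTorch modules; all dimensions are illustrative and do not reflect the authors' configuration.

```python
import torch
import torch.nn as nn

class UnimodalTemporalEncoder(nn.Module):
    """1-D CNN over the time axis, aggregating temporal information within one
    unimodal feature sequence (dimensions are illustrative)."""
    def __init__(self, in_dim: int, out_dim: int = 128, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, time, in_dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, out_dim)

class CrossModalAttention(nn.Module):
    """Temporal multi-head attention where one modality's sequence attends to another's."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_seq: torch.Tensor, context_seq: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query_seq, context_seq, context_seq)
        return out
```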

SESSION: Session 4: MuSe-Personalisation Subchallenge

ECG-Coupled Multimodal Approach for Stress Detection

  • Misha Libman
  • Gelareh Mohammadi

Psychological stress undoubtedly impacts the lives of many people on a regular basis, and it is, therefore, necessary to have systems capable of adequately identifying the level of stress an individual is experiencing. With the emergence of compact devices capable of capturing biosignals such as ECG, it is now feasible to use such signals for different applications, including stress monitoring, in addition to the more popular audio and video modalities. As part of the MuSe-Personalisation sub-challenge for 2023, we develop a GRU-based regressor to continuously predict stress levels from features extracted from raw ECG signals, with our unimodal model achieving a reasonable combined CCC, which is comparable to almost all the baseline audio and video models. Moreover, we utilised this model in combination with the two best performing MuSe-Personalisation baseline models to construct a number of multimodal models via late fusion, with our best model attaining a combined CCC of .7808, thereby improving upon the best baseline system.
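
A GRU-based regressor over frame-level ECG features, of the general kind described above, might look like the following sketch; the layer sizes are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ECGStressRegressor(nn.Module):
    """Minimal GRU-based regressor over frame-level ECG features, predicting a
    continuous stress signal (arousal or valence) per time step. Its per-timestep
    outputs could then be late-fused (e.g. weighted-averaged) with audio/video
    model predictions, as in the multimodal variants described above."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, feat_dim)
        out, _ = self.gru(x)
        return self.head(out).squeeze(-1)                 # (batch, time)
```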

Exclusive Modeling for MuSe-Personalisation Challenge

  • Haiyang Sun
  • Zhuofan Wen
  • Mingyu Xu
  • Zheng Lian
  • Licai Sun
  • Bin Liu
  • Jianhua Tao

In this paper, we present our solution to the Personalisation sub-challenge in MuSe 2023. It aims to predict physio-arousal and valence for different individuals using multimodal inputs (such as audio, video, text, and physiological signals). During the competition, we observe notable performance variations among individuals with identical feature sets and model architectures, prompting us to propose an individual-specific framework. Additionally, we find that fusion methods lead to severe overfitting when predicting physio-arousal. To enhance the generalization capability, we integrate ensemble learning, pseudo-label training, and information delay into the model design. Finally, our proposed solution achieves 0.8275 for physio-arousal and 0.8876 for valence on the test data, significantly surpassing the baseline system (0.7450 for physio-arousal and 0.7827 for valence).
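
Of the three generalisation measures, information delay is the simplest to illustrate: the labels are shifted relative to the features so that each annotation is paired with the signal that preceded it. The sketch below is a generic version; the delay value and alignment convention are assumptions.

```python
import numpy as np

def apply_information_delay(features: np.ndarray, labels: np.ndarray, delay: int):
    """Pair each label with the features `delay` frames earlier, compensating for
    annotation reaction time. The delay would be tuned on the development set."""
    if delay <= 0:
        return features, labels
    return features[:-delay], labels[delay:]
```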

Exploiting Diverse Feature for Multimodal Sentiment Analysis

  • Jia Li
  • Wei Qian
  • Kun Li
  • Qi Li
  • Dan Guo
  • Meng Wang

In this paper, we present our solution to the MuSe-Personalisation sub-challenge in the MuSe 2023 Multimodal Sentiment Analysis Challenge. The task of MuSe-Personalisation is to predict the continuous arousal and valence values of a participant from their audio-visual, language, and physiological signal modalities. Since different people have distinct personal characteristics, the main challenge of this task is building robust feature representations for sentiment prediction. To address this issue, we propose exploiting diverse features. Specifically, we design a series of feature extraction methods to build robust representations and a model ensemble. We empirically evaluate the performance of our method on the officially provided dataset. As a result, we achieved 3rd place in the MuSe-Personalisation sub-challenge, with arousal and valence CCC scores of 0.8492 and 0.8439, respectively.

MuSe-Personalization 2023: Feature Engineering, Hyperparameter Optimization, and Transformer-Encoder Re-discovery

  • Ho-Min Park
  • Ganghyun Kim
  • Arnout Van Messem
  • Wesley De Neve

This paper presents our approach for the MuSe-Personalization sub-challenge of the fourth Multimodal Sentiment Analysis Challenge (MuSe 2023), with the goal of detecting human stress levels through multimodal sentiment analysis. We leverage and enhance a Transformer-encoder model, integrating improvements that mitigate issues related to memory leakage and segmentation faults. We propose novel feature extraction techniques, including a pose feature based on joint pair distance and self-supervised learning-based feature extraction for audio using Wav2Vec2.0 and Data2Vec. To optimize effectiveness, we conduct extensive hyperparameter tuning. Furthermore, we employ interpretable meta-learning to understand the importance of each hyperparameter. The outcomes obtained demonstrate that our approach excels in personalization tasks, with particular effectiveness in Valence prediction. Specifically, our approach significantly outperforms the baseline results, achieving an Arousal CCC score of 0.8262 (baseline: 0.7450), a Valence CCC score of 0.8844 (baseline: 0.7827), and a combined CCC score of 0.8553 (baseline: 0.7639) on the test set. These results secured us the second place in MuSe-Personalization.
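
The joint-pair-distance pose feature can be illustrated with a short sketch: for each frame, the Euclidean distance between every pair of detected body joints is computed and concatenated into a feature vector. The keypoint format and the absence of normalisation are assumptions, not the authors' exact recipe.

```python
import numpy as np
from itertools import combinations

def joint_pair_distances(keypoints: np.ndarray) -> np.ndarray:
    """Pose feature from pairwise joint distances.
    keypoints: (time, num_joints, 2) 2-D joint coordinates per frame;
    returns (time, num_joints * (num_joints - 1) // 2) Euclidean distances.
    Any normalisation (e.g. by torso length) is left out here."""
    T, J, _ = keypoints.shape
    pairs = list(combinations(range(J), 2))
    feats = np.empty((T, len(pairs)), dtype=np.float32)
    for k, (i, j) in enumerate(pairs):
        feats[:, k] = np.linalg.norm(keypoints[:, i] - keypoints[:, j], axis=-1)
    return feats
```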

Temporal-aware Multimodal Feature Fusion for Sentiment Analysis

  • Qi Li
  • Shulei Tang
  • Feixiang Zhang
  • Ruotong Wang
  • Yangyang Xu
  • Zhuoer Zhao
  • Xiao Sun
  • Meng Wang

In this paper, we present a solution to the MuSe-Personalisation sub-challenge in the Multimodal Sentiment Analysis Challenge 2023. The task of MuSe-Personalisation is to predict time-continuous emotional values (i.e., arousal and valence) from multimodal data. The sub-challenge faces the problem of individual variation, which results in poor generalization on unseen test sets. To solve this problem, we first extract several informative visual features and then propose a framework comprising feature selection, feature learning, and a fusion strategy to discover the best combination of features for sentiment analysis. Finally, our method achieved the Top 1 performance in the MuSe-Personalisation sub-challenge, with a combined CCC of physiological arousal and valence of 0.8681, outperforming the baseline system by a large margin (10.42%) on the test set.
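
The feature selection step can be illustrated, without claiming to match the authors' procedure, as a greedy forward search over named feature sets scored by development-set CCC; the stopping rule and the user-supplied train_eval_fn below are assumptions.

```python
import numpy as np

def greedy_feature_selection(feature_sets: dict, train_eval_fn, max_sets: int = 4):
    """Greedy forward search over named feature sets: repeatedly add the set whose
    inclusion most improves development-set CCC.
    train_eval_fn(list_of_names) -> float is user-supplied: it trains a model on the
    chosen feature sets and returns the resulting development-set CCC."""
    chosen, best = [], -np.inf
    remaining = set(feature_sets)
    while remaining and len(chosen) < max_sets:
        scored = [(train_eval_fn(chosen + [name]), name) for name in remaining]
        score, name = max(scored)
        if score <= best:          # stop once no candidate improves the score
            break
        chosen.append(name)
        remaining.remove(name)
        best = score
    return chosen, best
```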