MRAC '23: Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing

Full Citation in the ACM Digital Library

SESSION: Keynote Talks

Fairness for Affective and Wellbeing Computing

  • Hatice Gunes

Datasets, algorithms, machine learning models and AI-powered tools that are used for perception, prediction and decision making constitute the core of affective and wellbeing computing. The majority of these are prone to data or algorithmic bias (e.g., along the demographic attributes of race, age, gender, etc.) that could have catastrophic consequences for various members of society. Therefore, making considerations and providing solutions to avoid and/or mitigate these are of utmost importance for creating and deploying fair and unbiased affective and wellbeing computing systems. This talk will present the Cambridge Affective Intelligence and Robotics (AFAR) Lab's research explorations in this area.

The first part of the talk will discuss the lack of publicly available datasets with consideration for fair distribution across the human population, and it will present a systematic investigation of bias and fairness in facial expression recognition and mental health prediction by comparing various approaches on well-known benchmark datasets. The second part of the talk will question whether counterfactuals can provide a solution for data imbalance, and will introduce an attempt to achieve fairer prediction models for facial expression recognition, while noting the limitations of a counterfactual approach employed at the pre-processing, in-processing and post-processing stages to mitigate bias.

The majority of ML methods aiming to mitigate bias focus on balancing data distributions or learning to adapt to the imbalances by adjusting the learning algorithm. The third and last part of the talk will introduce our work demonstrating how continual learning (CL) approaches are well-suited for mitigating bias by balancing learning with respect to different attributes such as race and gender, without compromising on recognition accuracy. The talk at its various stages will also outline recommendations to achieve greater fairness for affective and wellbeing computing, while emphasising the need for such models to be deployed and tested in real-world settings and applications, for example for robotic wellbeing coaching via physical robots.

Sentiments and Bias in Automated Decision Making

  • Jesse Hoey

I will discuss how computational models of human (social) affect may be used to help mitigate biases in algorithmic decision making. I consider the more general "shortlist" problem of how to select the set of choices over which a decision maker can ponder. As the choices on a ballot are as important as the votes themselves, the decisions of who to hire, who to insure, or who to admit are directly dependent on who is considered, who is categorized, or who meets the threshold for admittance. I will frame this problem as one requiring additional non-epistemic (affective) context that normalizes expected values, and propose a computational model for this context based on a social-psychological model of affect in social interactions.

SESSION: Workshop Presentations

An Improved Method for Enhancing Robustness of Multimodal Sentiment Classification Models via Utilizing Modality Latent Information

  • Hanxu Ai
  • Xiaomei Tao
  • Yuan Zhang

Multi-modal emotion analysis has become an active research field. However, in real-world scenarios, it is often necessary to analyze and recognize emotion data with noise. Integrating information from different modalities effectively to enhance the overall robustness of the model remains a challenge. To address this, we propose an improved approach that leverages modality latent information to enhance cross-modal interaction and improve the robustness of multi-modal emotion classification models. Specifically, we apply a multi-period-based preprocessing technique to the audio modality data. Additionally, we introduce a random modality noise injection strategy to augment the training data and enhance generalization capabilities. Finally, we employ a composite fusion method to integrate information features from different modalities, effectively promoting cross-modal information interaction and enhancing the overall robustness of the model. We evaluate our proposed method in the MER-NOISE sub-challenge of MER2023. Experimental results demonstrate that our improved multi-modal emotion classification model achieves a weighted F1 score of 69.66% and an MSE score of 0.92 on the MER-NOISE test set, with an overall score of 46.69%, representing a 5.69% improvement over the baseline. These results prove the effectiveness of our proposed approach in further enhancing the robustness of the model.
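The random modality noise injection mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not specify the noise type, the injection probability, or the feature format, so everything here is an assumption: the function name `inject_modality_noise`, the parameters `noise_prob` and `noise_std`, the use of additive Gaussian noise, and the representation of a sample as a dictionary of feature lists are all hypothetical.

```python
import random

def inject_modality_noise(sample, noise_prob=0.3, noise_std=0.1, rng=None):
    """Return a copy of `sample` in which each modality is, with probability
    `noise_prob`, perturbed by additive Gaussian noise (assumed noise model)."""
    rng = rng or random.Random()
    noisy = {}
    for modality, features in sample.items():
        if rng.random() < noise_prob:
            # Perturb every feature of the selected modality.
            noisy[modality] = [x + rng.gauss(0.0, noise_std) for x in features]
        else:
            # Leave the modality untouched, but copy it so the input is not aliased.
            noisy[modality] = list(features)
    return noisy

# Hypothetical trimodal sample with made-up feature values.
sample = {
    "audio": [0.2, -0.1, 0.5],
    "video": [1.0, 0.0, -0.3],
    "text": [0.4, 0.4, 0.1],
}
augmented = inject_modality_noise(sample, noise_prob=0.5, noise_std=0.05,
                                  rng=random.Random(0))
```

Applying the function afresh at every training step would expose the model to a different subset of corrupted modalities each time, which is one plausible way such a strategy could encourage robustness.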

First-order Multi-label Learning with Cross-modal Interactions for Multimodal Emotion Recognition

  • Yunrui Cai
  • Jingran Xie
  • Boshi Tang
  • Yuanyuan Wang
  • Jun Chen
  • Haiwei Xue
  • Zhiyong Wu

Multimodal emotion recognition (MER) is essential for the machine to fully understand human intentions. Various deep neural network based models have been proposed, but it remains challenging to model and fuse multimodal features effectively. In addition, recent studies have focused on the classification task of predicting discrete labels, while lacking consideration of the dimension value. In this paper, we propose a multimodal fusion model based on the Transformer architecture and cross-modal interactions, and adopt a multi-label learning algorithm of first-order strategy to predict discrete labels and dimension values respectively. We also propose a semi-supervised learning method of moment injection with unlabeled data to enhance the robustness of the model. Finally, we use ensemble learning to further improve the performance of the model. We evaluate the proposed method on the MER-MULTI sub-challenge of the Multimodal Emotion Recognition Challenge (MER 2023). Experimental results demonstrate the promising performance of our proposed method, which achieves an evaluation metric of 0.6765 on the test set.

Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis

  • Chaoyue Ding
  • Daoming Zong
  • Baoxiang Li
  • Ken Zheng
  • Dinghao Zhou
  • Jiakui Li
  • Qunyan Zhou

In this paper, we present the solutions to the MER-SEMI subchallenge of Multimodal Emotion Recognition Challenge (MER 2023). This subchallenge focuses on predicting discrete emotions for a small subset of unlabeled videos within the context of semi-supervised learning. Participants are provided with a combination of labeled and large amounts of unlabeled videos. Our preliminary experiments on labeled videos demonstrate that this task is primarily driven by the video and audio modalities, while the text modality plays a relatively weaker role in emotion prediction. To address this challenge, we propose the Video-Audio Transformer (VAT), which takes raw signals as inputs and extracts multimodal representations. VAT comprises a video encoder, an audio encoder, and a cross-modal encoder. To leverage the vast amount of unlabeled data, we introduce a contrastive loss to align the image and audio representations before fusing them through cross-modal attention. Additionally, to enhance the model's ability to learn from noisy video data, we apply momentum distillation, a self-training method that learns from pseudo-targets generated by a momentum model. Furthermore, we fine-tune VAT on annotated video data specifically for emotion recognition. Experimental results on the MER-SEMI task have shown the effectiveness of the proposed VAT model. Notably, our model ranks first (0.891) on the leaderboard. Our project is publicly available at
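As a rough illustration of the two training ideas named in this abstract, the sketch below shows a symmetric contrastive (InfoNCE-style) loss over paired video and audio embeddings, and an exponential-moving-average update of the kind commonly used to maintain a momentum model. This is not the VAT implementation: the function names, the temperature value, the momentum coefficient `m`, and the plain-list embedding format are assumptions chosen for readability.

```python
import math

def _normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_alignment_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: the i-th video should match the i-th audio."""
    v = [_normalize(x) for x in video_emb]
    a = [_normalize(x) for x in audio_emb]
    sim = [[sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
           for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # audio-to-video direction
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))

def momentum_update(momentum_params, online_params, m=0.995):
    """Exponential moving average of the online parameters (assumed form)."""
    return [m * pm + (1.0 - m) * po
            for pm, po in zip(momentum_params, online_params)]

# Toy batch: matched pairs give a much lower loss than mismatched pairs.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_alignment_loss(video, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = contrastive_alignment_loss(video, [[0.0, 1.0], [1.0, 0.0]])
```

In a momentum-distillation setup, the slowly updated momentum model would supply soft pseudo-targets for the online model; that step is omitted here because the abstract does not describe how the targets are combined with the labels.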

CLIP-based Model for Effective and Explainable Apparent Personality Perception

  • Peter Zhuowei Gan
  • Arcot Sowmya
  • Gelareh Mohammadi

In the field of Apparent Personality Perception (APP), a central challenge involves drawing robust inferences from observable behaviour and appearance. Various existing methods attempt to accomplish this by deploying diverse neural network architectures, all aimed at extracting and integrating features from multimodal data sources such as text, audio, and video. Notwithstanding, these methods grapple with issues related to generalisability and explainability, which hamper their applicability and trustworthiness in real-world situations. Responding to this challenge, our paper presents a novel approach to APP that capitalizes on the unique strengths of CLIP (Contrastive Language-Image Pre-Training), a large-scale multimodal pre-training method. CLIP learns image and text representations via natural language supervision and demonstrates a remarkable ability to handle a range of vision-related tasks using natural language queries and prompts. In the context of APP, we harness CLIP as a feature extractor for both personality traits and visual cues. These features are then incorporated into a regression model trained to predict apparent personality traits of a target person from their image or video. A unique strength of our model lies in its inherent multimodal nature (using image and text) giving it the ability to perform competently using only visual data, as opposed to most other models that require both image and audio features. This highlights our method's versatility in handling varied multimodal inputs. We put our method to the test using the ChaLearn First Impressions V2 (CVPR'17) challenge dataset, comparing our results with several state-of-the-art baselines that employ different modalities and architectures for APP. We found that our method either matches or outperforms the competition, and notably, provides intuitive and interpretable explanations for its predictions. 
By relying on natural language supervision and queries, our approach significantly enhances the generalisability and explainability of APP models. Beyond this, our work underscores the prospective role that large-scale transformers can play in better comprehending psychological phenomena. As a result, our method contributes positively to the broader domain of Multimodal and Responsible Affective Computing.

Generalised Bias Mitigation for Personality Computing

  • Jian Jiang
  • Viswonathan Manoranjan
  • Hanan Salam
  • Oya Celiktutan

Building systems with the capability of predicting the socio-emotional states of humans has many promising applications. However, if not properly designed, such systems might lead to biased decisions if biased data was used for training. Bias mitigation remains an open problem, which tackles the correction of a model's disparate performance over different groups defined by particular sensitive attributes (e.g., gender, age, race). Most existing methods are designed and tested in simple settings, limiting their general applicability to more complex real-world scenarios. In this work, we design a novel fairness loss function named Multi-Group Parity (MGP) to provide a generalised approach for bias mitigation in personality computing. In contrast to existing works in the literature, MGP is generalised as it features four 'multiple' properties (4Mul): multiple tasks, multiple modalities, multiple sensitive attributes, and multi-valued attributes. Extensive experiments on two large multi-modal benchmark personality computing datasets demonstrate that the MGP sets new state-of-the-art performance both in the traditional and in the proposed 4Mul settings.
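The abstract does not give the form of the Multi-Group Parity loss, so the following is only a generic sketch of a parity-style penalty that handles several sensitive attributes, each with any number of values. It is not MGP. The function names, the choice of mean absolute error as the per-group performance measure, and the max-minus-min gap are all assumptions made for illustration.

```python
from collections import defaultdict

def group_error_gap(y_true, y_pred, groups):
    """Gap between the worst and best per-group mean absolute error."""
    errors = defaultdict(list)
    for t, p, g in zip(y_true, y_pred, groups):
        errors[g].append(abs(t - p))
    means = [sum(e) / len(e) for e in errors.values()]
    return max(means) - min(means)

def parity_penalty(y_true, y_pred, attributes):
    """Average the per-attribute error gaps over all sensitive attributes.

    `attributes` maps an attribute name to a list of group labels, one per
    sample; attributes may take more than two values.
    """
    gaps = [group_error_gap(y_true, y_pred, groups)
            for groups in attributes.values()]
    return sum(gaps) / len(gaps)

# Made-up trait scores and group labels, purely for illustration.
y_true = [0.5, 0.5, 0.5, 0.5]
y_pred = [0.5, 0.5, 0.7, 0.9]
attributes = {
    "gender": ["f", "f", "m", "m"],
    "age": ["young", "mid", "old", "old"],
}
penalty = parity_penalty(y_true, y_pred, attributes)
```

A penalty of this kind would typically be added to the task loss with a weighting coefficient, so that training trades off predictive accuracy against the disparity between groups.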

Label Distribution Adaptation for Multimodal Emotion Recognition with Multi-label Learning

  • Hailun Lian
  • Cheng Lu
  • Sunan Li
  • Yan Zhao
  • Chuangao Tang
  • Yuan Zong
  • Wenming Zheng

In the task of multimodal emotion recognition with multi-label learning (MER-MULTI), leveraging the correlation between discrete and dimensional emotions is crucial for improving the model's performance. However, there may be a mismatch between the feature distributions of the training set and the testing set, which could result in the trained model's inability to adapt to the correlations between labels in the testing set. Therefore, a significant challenge in MER-MULTI is how to match the feature distributions of the training set and testing set samples. To tackle this issue, we propose a method called Label Distribution Adaptation for MER-MULTI. More specifically, we adapt the label distribution between the training set and the testing set by removing training samples that do not match the features of the testing set. This can enhance the model's performance and generalization on testing data, enabling it to better capture the correlations between labels. Furthermore, to alleviate the difficulty of model training and inference, we design a novel loss function called Multi-label Emotion Joint Learning Loss (MEJL), which combines the correlations between discrete and dimensional emotions. Specifically, through contrastive learning, we transform the shared feature distribution of multiple labels into a space where discrete and dimensional emotions are consistent. This facilitates the model in learning the relationships between discrete and dimensional emotions. Finally, we have evaluated the proposed method, which achieved second place in the MER-MULTI task of the MER 2023 Challenge.

EfficienTransNet: An Automated Chest X-ray Report Generation Paradigm

  • Chayan Mondal
  • Duc-Son Pham
  • Ashu Gupta
  • Shreya Ghosh
  • Tele Tan
  • Tom Gedeon

The significance of chest X-ray imaging in diagnosing chest diseases is well-established in clinical and research domains. The automation of generating X-ray reports can address various challenges associated with manual diagnosis by speeding up the report generation system, becoming the perfect assistant for radiologists, and reducing their tedious workload. However, the key challenge of this automation is to accurately capture the abnormal findings and produce a fluent and natural report. In this paper, we introduce EfficienTransNet, an automatic chest X-ray report generation approach based on CNN-Transformers. EfficienTransNet prioritizes clinical accuracy and demonstrates improved text generation metrics. Our model incorporates clinical history or indications to enhance the report generation process and align with radiologists' workflow, which is mostly overlooked in recent research. On two publicly available X-ray report generation datasets, MIMIC-CXR and IU X-ray, our model yields promising results on natural language evaluation and clinical accuracy metrics. Qualitative results, demonstrated with Grad-CAM, provide disease location information for radiologists' better understanding. Our proposed model emphasizes radiologists' workflow, enhancing the explainability, transparency, and trustworthiness of radiologists in the report generation process.

Semi-supervised Multimodal Emotion Recognition with Consensus Decision-making and Label Correction

  • Jingguang Tian
  • Desheng Hu
  • Xiaohan Shi
  • Jiajun He
  • Xingfeng Li
  • Yuan Gao
  • Tomoki Toda
  • Xinkang Xu
  • Xinhui Hu

Multimodal emotion recognition is the task of identifying and understanding emotions by integrating information from multiple modalities, such as audio, visual, and textual data. However, the scarcity of labeled data poses a significant challenge for this task. To this end, this paper proposes a novel approach via a semi-supervised learning framework by incorporating consensus decision-making and label correction methods. Firstly, we employ supervised learning on the trimodal input data to establish robust initial models. Secondly, we generate reliable pseudo-labels for unlabeled data by leveraging consensus decision-making and label correction methods. Thirdly, we train the model in a supervised manner using both labeled and pseudo-labeled data. Moreover, the process of generating pseudo-labels and semi-supervised learning can be iterated to refine the model further. Experimental results on the MER 2023 dataset show the effectiveness of our proposed framework, achieving significant improvements on the MER-MULTI, MER-NOISE, and MER-SEMI subsets.
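The pseudo-labeling step described in this abstract can be sketched as a vote among several models, keeping only the samples on which enough of them agree. The abstract does not define its consensus rule or its label correction method, so the function name `consensus_pseudo_labels`, the `min_agreement` threshold, and the simple majority vote are assumptions; label correction is not covered here.

```python
from collections import Counter

def consensus_pseudo_labels(predictions, min_agreement=1.0):
    """Select pseudo-labels by agreement among models.

    `predictions` is a list with one entry per model; each entry is a list of
    predicted labels aligned by sample index. A sample receives a pseudo-label
    only if the fraction of models voting for the top label reaches
    `min_agreement`.
    """
    n_models = len(predictions)
    pseudo = {}
    for idx, votes in enumerate(zip(*predictions)):
        label, count = Counter(votes).most_common(1)[0]
        if count / n_models >= min_agreement:
            pseudo[idx] = label
    return pseudo

# Made-up predictions from three hypothetical models on three unlabeled clips.
preds = [
    ["happy", "sad", "angry"],
    ["happy", "sad", "neutral"],
    ["happy", "angry", "neutral"],
]
unanimous = consensus_pseudo_labels(preds, min_agreement=1.0)
majority = consensus_pseudo_labels(preds, min_agreement=0.6)
```

With unanimous agreement only the first clip is kept, while the looser threshold also admits the clips on which two of the three models agree. The selected samples could then be merged with the labeled set for the next round of supervised training, and the whole procedure repeated.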

FEENN: The Feature Enhancement Embedded Neural Network for Robust Multimodal Emotion Recognition

  • Chenchen Wang
  • Jing Dong
  • Yu Sui
  • Ping Xu
  • Ying Wu
  • Ying Xu

This paper introduces the Delta Team's submission to the Multimodal Emotion Recognition (MER 2023) MER-NOISE Challenge. Multimodal emotion recognition aims to improve sentiment analysis ability by integrating emotional information from multiple modalities. However, in real applications, various forms of interference, such as noise and complex backgrounds, significantly degrade model performance, making model construction more challenging. In this paper, we propose two simple and effective feature enhancement strategies, Hidden State Smoothing (HSS) and Embedding Vector Augmentation (EVA), to improve the robustness of the model. The HSS strategy extracts text and audio feature vectors by weighting encoded features at both shallow and deep hidden states, enhancing the model's representation of fine-grained sentiment categories. Additionally, EVA is introduced to perturb audio embedding vectors with random noise to improve the model's generalization ability. On the MER-NOISE track, the models are trained with limited video data, and both strategies effectively improve the robustness of the model to varying degrees. The HSS strategy proves to be particularly effective and concise, achieving a 30% performance improvement over the baseline on the noisy test set. It adapts to many models in various interference scenarios.
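The two strategies named in this abstract can be sketched under simple assumptions. For HSS, the sketch takes a weighted average of feature vectors drawn from several encoder layers; for EVA, it adds Gaussian noise to an embedding vector. The abstract does not state which layers are used, how the weights are chosen, or what noise distribution is applied, so the function names, the weights, and `noise_std` are all hypothetical.

```python
import random

def smooth_hidden_states(hidden_states, weights):
    """Weighted average of per-layer feature vectors.

    `hidden_states` holds one feature vector per selected layer (for example a
    shallow and a deep layer); `weights` are normalized to sum to 1.
    """
    total = sum(weights)
    dim = len(hidden_states[0])
    return [
        sum(w / total * layer[d] for w, layer in zip(weights, hidden_states))
        for d in range(dim)
    ]

def augment_embedding(embedding, noise_std=0.01, rng=None):
    """Perturb an embedding vector with additive Gaussian noise (assumed form)."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, noise_std) for x in embedding]

# Made-up features from a shallow and a deep layer of a hypothetical encoder.
shallow = [1.0, 2.0]
deep = [3.0, 4.0]
smoothed = smooth_hidden_states([shallow, deep], weights=[1.0, 1.0])
perturbed = augment_embedding(smoothed, noise_std=0.05, rng=random.Random(0))
```

In practice the layer weights could be fixed by hand or learned jointly with the classifier; the abstract does not say which, so the equal weights above are only a placeholder.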