AVEC'18: Proceedings of the 2018 Audio/Visual Emotion Challenge and Workshop


SESSION: Keynote

Session details: Keynote

  •      Fabien Ringeval

Interpersonal Behavior Modeling for Personality, Affect, and Mental States Recognition and Analysis

  •      Chi-Chun Lee

Imagine humans as complex dynamical systems: systems that are characterized by multiple interacting layers of hidden states (e.g., internal processes involving functions of cognition, perception, production, emotion, and social interaction) producing measurable multimodal signals (e.g., body gestures, facial expressions, physiology, and speech). This abstraction of humans within a signals-and-systems framework naturally brings a synergy between the engineering and behavioral science communities. Various research fields have emerged from such an interdisciplinary human-centered effort, e.g., behavioral signal processing [7], social signal processing [10], and affective computing [8], where technological advancements have continuously been made in order to robustly assess and infer individual speakers' states and traits.

The complexities in modeling human behavior center on its heterogeneity. Sources of variability originate from differences in the mechanisms of information encoding (behavior production) and decoding (behavior perception). Furthermore, a key additional layer of complexity exists because human behaviors occur largely during interactions with the environment and the agents therein. This interplay, which couples the behaviors of interacting humans, is the essence of interpersonal dynamics. This unique behavior dynamic has been at the core not only of human communication studies [2], but is also crucial for automatically characterizing speakers' social-affective behavior (e.g., emotion recognition [4, 5] and personality trait identification [3, 9]) and for understanding interactions across typical, distressed, and disordered manifestations [1, 6].

SESSION: Introduction

Session details: Introduction

  •      Chi Chun (Jeremy) Lee

AVEC 2018 Workshop and Challenge: Bipolar Disorder and Cross-Cultural Affect Recognition

  •      Fabien Ringeval
  • Björn Schuller
  • Michel Valstar
  • Roddy Cowie
  • Heysem Kaya
  • Maximilian Schmitt
  • Shahin Amiriparian
  • Nicholas Cummins
  • Denis Lalanne
  • Adrien Michaud
  • Elvan Ciftçi
  • Hüseyin Güleç
  • Albert Ali Salah
  • Maja Pantic

The Audio/Visual Emotion Challenge and Workshop (AVEC 2018) "Bipolar disorder, and cross-cultural affect recognition" is the eighth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: bipolar disorder classification, cross-cultural dimensional emotion recognition, and emotional label generation from individual ratings.

SESSION: Bipolar Disorder Sub-challenge

Session details: Bipolar Disorder Sub-challenge

  •      Fabien Ringeval

Bipolar Disorder Recognition with Histogram Features of Arousal and Body Gestures

  •      Le Yang
  • Yan Li
  • Haifeng Chen
  • Dongmei Jiang
  • Meshia Cédric Oveneke
  • Hichem Sahli

This paper targets the Bipolar Disorder Challenge (BDC) task of the Audio/Visual Emotion Challenge (AVEC) 2018. First, two novel features are proposed: 1) a histogram-based arousal feature, in which continuous arousal values are estimated from audio cues by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) model; 2) a Histogram of Displacement Range (HDR) based upper-body posture feature, which characterizes the displacement and velocity of key body points in a video segment. In addition, we propose a multi-stream bipolar disorder classification framework with Deep Neural Networks (DNNs) and a Random Forest, and adopt an ensemble learning strategy to alleviate the possible over-fitting caused by the limited training data. Experimental results show that the proposed arousal and upper-body posture features are discriminative for different bipolar episodes, and our framework achieves promising classification results on the development set, with an unweighted average recall (UAR) of 0.714, higher than the baseline of 0.635. On the test set, our system obtains the same UAR (0.574) as the challenge baseline.
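As a minimal sketch of the histogram idea described above (not the authors' exact implementation), frame-level arousal estimates produced by a sequence model can be pooled into a fixed-length, normalised histogram per recording; the bin count and value range below are hypothetical:

    import numpy as np

    def arousal_histogram(frame_arousal, n_bins=20, value_range=(-1.0, 1.0)):
        """Pool frame-level arousal estimates (e.g., LSTM-RNN outputs in [-1, 1])
        into a fixed-length, normalised histogram feature for one recording."""
        hist, _ = np.histogram(frame_arousal, bins=n_bins, range=value_range)
        return hist / max(len(frame_arousal), 1)  # relative frequencies

    # usage: frame_arousal would be the per-frame arousal track predicted for one session
    features = arousal_histogram(np.random.uniform(-1, 1, size=1000))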

Bipolar Disorder Recognition via Multi-scale Discriminative Audio Temporal Representation

  •      Zhengyin Du
  • Weixin Li
  • Di Huang
  • Yunhong Wang

Bipolar disorder (BD) is a prevalent mental illness that has a negative impact on work and social function. However, bipolar symptoms are episodic, with irregular variations among different episodes, making BD very difficult to diagnose accurately. To address this problem, this paper presents a novel audio-based approach, called IncepLSTM, which effectively integrates an Inception module and Long Short-Term Memory (LSTM) over the feature sequence to capture multi-scale temporal information for BD recognition. Moreover, in order to obtain a discriminative representation of BD severity, we propose a novel severity-sensitive loss based on the triplet loss to model the inter-severity relationship. Considering the small scale of the existing BD corpus and to avoid overfitting, we also apply L1 regularization to improve the sparsity of IncepLSTM. Evaluations are conducted on the Audio/Visual Emotion Challenge (AVEC) 2018 dataset, and the experimental results clearly demonstrate the effectiveness of our method.
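The severity-sensitive loss itself is specific to the paper, but its ingredients, a triplet-style margin loss over embeddings plus an L1 sparsity penalty on the network weights, can be sketched as follows (PyTorch; the margin, penalty weight, and function names are hypothetical):

    import torch

    def severity_triplet_loss(anchor, positive, negative, margin=0.2):
        """Triplet-style margin loss: pull same-severity embeddings together,
        push different-severity embeddings apart (a sketch, not the paper's exact loss)."""
        d_pos = torch.norm(anchor - positive, dim=1)
        d_neg = torch.norm(anchor - negative, dim=1)
        return torch.relu(d_pos - d_neg + margin).mean()

    def l1_penalty(model, weight=1e-4):
        """L1 regularization over all parameters to encourage sparsity."""
        return weight * sum(p.abs().sum() for p in model.parameters())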

Multi-modality Hierarchical Recall based on GBDTs for Bipolar Disorder Classification

  •      Xiaofen Xing
  • Bolun Cai
  • Yinhu Zhao
  • Shuzhen Li
  • Zhiwei He
  • Weiquan Fan

In this paper, we propose a novel hierarchical recall model that fuses multiple modalities (audio, video, and text) for bipolar disorder classification, where patients with different mania levels are recalled layer by layer. To address the complex distribution of the challenge data, the proposed framework combines multiple models, modalities, and layers to perform domain adaptation for each patient and hard-sample mining for particular patients. The experimental results show that our framework achieves competitive performance, with an Unweighted Average Recall (UAR) of 57.41% on the test set and 86.77% on the development set.
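A simplified illustration of a layer-by-layer recall scheme with gradient-boosted trees: each layer recalls one class and passes the remaining samples on to the next layer. This is a sketch under an assumed class ordering, not the paper's full multi-model, multi-modality pipeline:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    def hierarchical_recall_fit(X, y, class_order=(0, 1, 2)):
        """Train one binary GBDT per layer: layer k separates class_order[k]
        from everything that remains (a sketch of the layer-by-layer idea)."""
        layers, mask = [], np.ones(len(y), dtype=bool)
        for cls in class_order[:-1]:
            clf = GradientBoostingClassifier().fit(X[mask], (y[mask] == cls).astype(int))
            layers.append((cls, clf))
            mask &= (y != cls)          # remaining samples go to the next layer
        return layers, class_order[-1]  # last class is the fallback

    def hierarchical_recall_predict(x, layers, fallback):
        for cls, clf in layers:
            if clf.predict(x.reshape(1, -1))[0] == 1:
                return cls
        return fallback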

Automated Screening for Bipolar Disorder from Audio/Visual Modalities

  •      Zafi Sherhan Syed
  • Kirill Sidorov
  • David Marshall

This paper addresses the Bipolar Disorder sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2018, where the objective is to classify patients suffering from bipolar disorder into states of remission, hypo-mania, and mania from audio-visual recordings of structured interviews. To this end, we propose 'turbulence features' to capture sudden, erratic changes in feature contours from the audio and visual modalities, and demonstrate their efficacy for the task at hand. We introduce Fisher Vector encoding of ComParE low-level descriptors (LLDs) and demonstrate that these features are viable for screening of bipolar disorder from speech. We also perform several experiments with standard feature sets from the openSMILE toolkit as well as multimodal fusion. The best result achieved on the test set is a UAR of 57.41%, which matches the best result published as the official baseline.
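Fisher Vector encoding of frame-level LLDs over a Gaussian mixture model is a standard technique; a compact sketch of the common formulation (gradients with respect to means and variances of a diagonal-covariance GMM, followed by power and L2 normalisation) is given below. The component count is a hypothetical choice, and the paper's exact configuration may differ:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fisher_vector(lld_frames, gmm):
        """Encode a (T, D) matrix of low-level descriptors as a Fisher Vector
        over a diagonal-covariance GMM (standard formulation, sketched)."""
        T, _ = lld_frames.shape
        q = gmm.predict_proba(lld_frames)                  # (T, K) posteriors
        mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
        fv = []
        for k in range(gmm.n_components):
            diff = (lld_frames - mu[k]) / np.sqrt(var[k])  # (T, D)
            g_mu = (q[:, k, None] * diff).sum(0) / (T * np.sqrt(w[k]))
            g_var = (q[:, k, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w[k]))
            fv.extend([g_mu, g_var])
        fv = np.concatenate(fv)
        fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalisation
        return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalisation

    # usage: gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(training_llds)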

SESSION: Cross-cultural Emotion Sub-challenge

Session details: Cross-cultural Emotion Sub-challenge

  •      Chi Chun (Jeremy) Lee

Speech-based Continuous Emotion Prediction by Learning Perception Responses related to Salient Events: A Study based on Vocal Affect Bursts and Cross-Cultural Affect in AVEC 2018

  •      Kalani Wataraka Gamage
  • Ting Dang
  • Vidhyasaharan Sethu
  • Julien Epps
  • Eliathamby Ambikairajah

This paper presents a novel framework for speech-based continuous emotion prediction. The proposed model characterises perceived emotion estimation as time-invariant responses to salient events: arousal and valence variation over time is modelled as the output of a parallel array of time-invariant filters, where each filter represents a salient event and its impulse response represents the learned emotion perception response. The model is evaluated by considering vocal affect bursts/non-verbal vocal gestures as salient event candidates. It is validated on the development set of the AVEC 2018 challenge and achieves the highest valence prediction accuracy among single-modal methods based on speech or speech transcripts. We also tested the model in the cross-cultural setting provided by the AVEC 2018 test set, where it performs reasonably well for an unseen culture and outperforms the speech-based baselines. Further, we explore the inclusion of interlocutor-related cues in the proposed model and decision-level fusion with existing features. Since the proposed model was evaluated solely on laughter and slight-laughter affect bursts, which were nominated as salient by the model's proposed saliency constraints, the results highlight the significance of these gestures in human emotion expression and perception.
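Under the stated model, the emotion trace is the output of a bank of time-invariant filters driven by salient-event indicator signals. A toy numpy rendering of that superposition (the event tracks and learned impulse responses are hypothetical placeholders):

    import numpy as np

    def predict_trace(event_tracks, impulse_responses):
        """Sum of convolutions: each salient event type (e.g., laughter) drives a
        learned impulse response; the emotion trace is the superposed filter output."""
        length = len(next(iter(event_tracks.values())))
        trace = np.zeros(length)
        for name, track in event_tracks.items():
            trace += np.convolve(track, impulse_responses[name])[:length]
        return trace

    # usage: event_tracks = {'laughter': binary indicator over frames, ...}
    #        impulse_responses = {'laughter': learned 1-D response, ...}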

Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks

  •      Jian Huang
  • Ya Li
  • Jianhua Tao
  • Zheng Lian
  • Mingyue Niu
  • Minghao Yang

This paper presents our efforts for the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2018, whose goal is to predict the level of three emotional dimensions time-continuously in a cross-cultural setup. We extract emotional features from the audio, visual, and textual modalities. The state-of-the-art regressor for continuous emotion recognition, the long short-term memory recurrent neural network (LSTM-RNN), is utilized. We augment the training data by replacing the original training samples with shorter overlapping samples extracted from them, which multiplies the number of training samples and also benefits the training of the temporal emotion model with the LSTM-RNN. In addition, two strategies are explored to decrease the interlocutor's influence and improve performance. We also compare the performance of feature-level and decision-level fusion. The experimental results show the efficiency of the proposed method, and competitive results are obtained.
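A minimal sketch of the described augmentation, replacing each training sequence with shorter overlapping windows (the window and hop sizes below are hypothetical, not the paper's settings):

    import numpy as np

    def overlapping_windows(features, labels, win=300, hop=100):
        """Cut one (T, D) feature sequence and its aligned label track into
        shorter overlapping segments, multiplying the number of training samples."""
        segments = []
        for start in range(0, len(features) - win + 1, hop):
            segments.append((features[start:start + win], labels[start:start + win]))
        return segments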

Multi-modal Multi-cultural Dimensional Continuous Emotion Recognition in Dyadic Interactions

  •      Jinming Zhao
  • Ruichen Li
  • Shizhe Chen
  • Qin Jin

Automatic emotion recognition is a challenging task that can have a great impact on improving natural human-computer interaction. In this paper, we present our solutions for the Cross-cultural Emotion Sub-challenge (CES) of the Audio/Visual Emotion Challenge (AVEC) 2018. The videos were recorded in dyadic human-human interaction scenarios. In these complex scenarios, a person's emotional state is influenced by the interlocutor's behaviors, such as talking style/prosody, speech content, facial expression, and body language. We highlight two aspects of our solutions: 1) we explore efficient deep learning features for multiple modalities and use an LSTM network to capture long-term temporal information; 2) we propose several multimodal interaction strategies to imitate real interaction patterns, exploring which of the interlocutor's modalities is most informative, and we identify the interaction strategy that makes the fullest use of the interlocutor's information. Our solutions achieve the best CCC performance of 0.704 and 0.783 on arousal and valence, respectively, on the German portion of the challenge test set, significantly outperforming the baseline system (CCC of 0.524 and 0.577 on arousal and valence) and the winner of AVEC 2017 (CCC of 0.675 and 0.756). The experimental results show that our proposed interaction strategies have strong generalization ability and bring more robust performance.
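For reference, the Concordance Correlation Coefficient (CCC) used to report these results is a standard agreement measure between a gold-standard and a predicted trace; a direct numpy implementation:

    import numpy as np

    def ccc(gold, pred):
        """Concordance Correlation Coefficient:
        2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
        gold, pred = np.asarray(gold, dtype=float), np.asarray(pred, dtype=float)
        cov = np.mean((gold - gold.mean()) * (pred - pred.mean()))
        return 2 * cov / (gold.var() + pred.var() + (gold.mean() - pred.mean()) ** 2)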

SESSION: Gold-standard Emotion Sub-challenge

Session details: Gold-standard Emotion Sub-challenge

  •      Fabien Ringeval

Towards a Better Gold Standard: Denoising and Modelling Continuous Emotion Annotations Based on Feature Agglomeration and Outlier Regularisation

  •      Chen Wang
  • Phil Lopes
  • Thierry Pun
  • Guillaume Chanel

Emotions are often perceived by humans through a series of multimodal cues, such as verbal expressions, facial expressions, and gestures. In order to recognise emotions automatically, reliable emotional labels are required to learn a mapping from human expressions to the corresponding emotions. Dimensional emotion models have become popular and have been widely applied for annotating emotions continuously in the time domain. However, the statistical relationship between emotional dimensions is rarely studied. This paper provides a solution to automatic emotion recognition for the Audio/Visual Emotion Challenge (AVEC) 2018. The objective is to find a robust way to detect emotions using more reliable emotion annotations in the valence and arousal dimensions. The two main contributions of this paper are: 1) a new approach for generating more dependable emotional ratings for both arousal and valence from multiple annotators by extracting consistent annotation features; 2) an exploration of the valence and arousal distribution using outlier detection methods, which reveals a specific oblique elliptic shape. With the learned distribution, we are able to detect prediction outliers based on their local density deviations and correct them towards the learned distribution. The performance of the proposed method is evaluated on the RECOLA database, which contains audio, video, and physiological recordings. Our results show that a moving-average filter is sufficient to remove incidental errors in the annotations, and that unsupervised dimensionality reduction approaches can be used to determine a gold-standard annotation from multiple annotations. Compared with the AVEC 2018 baseline model, our approach significantly improves the concordance correlation coefficient for arousal and valence prediction to 0.821 and 0.589, respectively.
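As an illustration of the denoising step mentioned above (and only that step; the paper's feature-agglomeration fusion is more involved), a moving-average filter can be applied to each annotator's trace before fusing, with a hypothetical window length:

    import numpy as np

    def moving_average(trace, win=25):
        """Smooth one annotator's continuous trace to remove incidental errors."""
        kernel = np.ones(win) / win
        return np.convolve(trace, kernel, mode='same')

    def simple_gold_standard(annotations, win=25):
        """Average the smoothed traces of all annotators (a baseline fusion sketch)."""
        return np.mean([moving_average(a, win) for a in annotations], axis=0)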

Fusing Annotations with Majority Vote Triplet Embeddings

  •      Brandon M. Booth
  • Karel Mundnich
  • Shrikanth Narayanan

Human annotations of behavioral constructs are of great importance to the machine learning community because of the difficulty in quantifying states that cannot be directly observed, such as dimensional emotion. Disagreements between annotators and other personal biases complicate the goal of obtaining an accurate approximation of the true behavioral construct values for use as ground truth. We present a novel majority vote triplet embedding scheme for fusing real-time and continuous annotations of a stimulus to produce a gold-standard time series. We illustrate the validity of our approach by showing that the method produces reasonable gold standards for two separate annotation tasks from a human annotation data set where the true construct labels are known a priori. We also apply our method to the RECOLA dimensional emotion data set in conjunction with state-of-the-art time warping methods to produce gold-standard labels that are sufficiently representative of the annotations and that are more easily learned from features when evaluated using a battery of linear predictors, as prescribed in the 2018 AVEC gold-standard emotion sub-challenge. In particular, we find that the proposed method leads to gold-standard labels that aid in valence prediction.

SESSION: Deep Learning for Affective Computing

Session details: Deep Learning for Affective Computing

  •      Fabien Ringeval

Deep Learning for Continuous Multiple Time Series Annotations

  •      Jian Huang
  • Ya Li
  • Jianhua Tao
  • Zheng Lian
  • Mingyue Niu
  • Minghao Yang

Learning from multiple annotations is an increasingly important research topic. Compared with conventional classification or regression problems, continuous emotion recognition faces additional challenges because time-continuous annotations suffer from noise and temporal lags. In this paper, we address the problem with deep learning for continuous multiple time-series annotations. We attach a novel crowd layer to the output layer of a basic continuous emotion recognition system, which learns directly from the noisy labels of multiple annotators in an end-to-end manner. The inputs of the system are multimodal features and the targets are the multiple annotations, with the intention of learning an annotator-specific mapping. Our proposed method treats the ground truth as a latent variable and each annotation as a variant of the ground truth obtained through a linear mapping. The experimental results show that our system achieves superior performance and captures the reliabilities and biases of the different annotators.
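A crowd layer in the sense described here maps a shared latent prediction to each annotator's version through annotator-specific linear parameters trained end-to-end. A compact PyTorch sketch of one common variant (per-annotator scale and bias; the tensor shapes are assumptions, not the paper's exact architecture):

    import torch
    import torch.nn as nn

    class CrowdLayer(nn.Module):
        """Maps the latent (ground-truth) prediction to each annotator's trace via
        an annotator-specific linear transform, trained on the noisy labels."""
        def __init__(self, n_annotators):
            super().__init__()
            self.scale = nn.Parameter(torch.ones(n_annotators))
            self.bias = nn.Parameter(torch.zeros(n_annotators))

        def forward(self, latent):                     # latent: (batch, time, 1)
            return latent * self.scale + self.bias     # -> (batch, time, n_annotators)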

Learning an Arousal-Valence Speech Front-End Network using Media Data In-the-Wild for Emotion Recognition

  •      Chih-Chuan Lu
  • Jeng-Lin Li
  • Chi-Chun Lee

Recent progress in speech emotion recognition (SER) technology has benefited from the use of deep learning techniques. However, expensive human annotation and the difficulty of collecting emotion databases make rapid deployment of SER across diverse application domains challenging. An initialization and fine-tuning strategy helps mitigate these technical challenges. In this work, we propose an initialization network geared toward SER applications by learning a speech front-end network on large media data collected in the wild, jointly with proxy arousal-valence labels that are multimodally derived from audio and text information, termed the Arousal-Valence Speech Front-End Network (AV-SpNET). The AV-SpNET can then simply be stacked with supervised layers for the target emotion corpus of interest. We evaluate the proposed AV-SpNET on SER tasks for two separate emotion corpora, the USC IEMOCAP and the NNIME databases. The AV-SpNET outperforms other initialization techniques and reaches the best overall performance while requiring only 75% of the in-domain annotated data. We also observe that, by using the AV-SpNET as the front-end network, as little as 50% of the fine-tuning data is generally required to surpass a method based on a randomly-initialized network fine-tuned on the complete training set.
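The initialization/fine-tuning recipe can be sketched as: load the pretrained front-end, optionally freeze it, and train a small supervised head on the target emotion corpus. A PyTorch sketch under assumed module names and an assumed 256-dimensional front-end output (not the paper's actual architecture):

    import torch.nn as nn

    def build_ser_model(pretrained_front_end, n_classes, freeze_front_end=False):
        """Stack supervised layers on a pretrained speech front-end (e.g., an
        AV-SpNET-style network), optionally freezing the front-end weights."""
        if freeze_front_end:
            for p in pretrained_front_end.parameters():
                p.requires_grad = False
        # assumes the front-end emits a 256-dim embedding per example
        head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, n_classes))
        return nn.Sequential(pretrained_front_end, head)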