AVEC '19- Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop

SESSION: Keynote

Session details: Keynote

  • Fabien Ringeval

The Socio-Affective Robot: Aimed to Understand Human Links?

  • Véronique Aubergé

Is the social robot a product of artificial intelligence, or is it a product of perception by our natural intelligence, revealing some crucial aspects of social and cultural human processing? Among smart objects, the social robot cannot be distinguished by precise and well-defined technical or morphological cues. No serious, discriminative attributes can be given by scientific knowledge to explain how an object becomes, perceptually, a subject (a social robot); even the movement attribute and the "autonomous" cognitive attribute are not clearly defined. It is nevertheless a fact that automatons and talking artefacts are now called robots, a name that is particularly attractive to the general public, to scientists, and to engineers. However, is it a socio-cultural desire or a technical need to add the augmentation of the social space to the "augmented self" (the abilities of one's own body and environment)?

In this talk we will explore some social space perturbations in ecological conditions, such as elderly people suffering from isolation and interacting with a robot that can emit only non-verbal speech primitives. Long-term interactions were collected and analysed using the concepts of the Dynamic Affective Network for Social Entities (D.A.N.S.E.) theory. We will try to show that non-verbal speech primitives, organised in the D.A.N.S.E. "glue" paradigm, make it possible to predict whether the elderly perceive their relation with the robot as oriented inside or outside dominance, and also to explore an empathic dimension in particular. Through a Living Lab method, the evaluation of these hypotheses and the building of an empathic socio-affective HRI were conducted together, under strong ethical constraints. In particular, the development of frail robots for frail people will be proposed as a possible ethical perspective.

SESSION: Introduction

Session details: Introduction

  • Véronique Aubergé

AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition

  • Fabien Ringeval
  • Björn Schuller
  • Michel Valstar
  • Nicholas Cummins
  • Roddy Cowie
  • Leili Tavabi
  • Maximilian Schmitt
  • Sina Alisamir
  • Shahin Amiriparian
  • Eva-Maria Messner
  • Siyang Song
  • Shuo Liu
  • Ziping Zhao
  • Adria Mallol-Ragolta
  • Zhao Ren
  • Mohammad Soleymani
  • Maja Pantic

The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) 'State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition' is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: state-of-mind recognition, depression assessment with AI, and cross-cultural affect sensing, respectively.

SESSION: State-of-Mind Sub-challenge

Session details: State-of-Mind Sub-challenge

  • Nicholas Cummins

A Multimodal Framework for State of Mind Assessment with Sentiment Pre-classification

  • Yan Li
  • Tao Yang
  • Le Yang
  • Xiaohan Xia
  • Dongmei Jiang
  • Hichem Sahli

In this paper, we address the AVEC 2019 State of Mind Sub-Challenge (SoMS) and propose a multimodal state of mind (SoM) assessment framework for valence and arousal. For valence, sentiment analysis is first performed on the English text obtained via German speech recognition and translation, to classify the audio-visual session as a positive or negative narrative. Each overlapping 60 s segment of the session is then input into an audio-visual SoM assessment model trained for positive or negative narratives, accordingly. The mean prediction over all segments is adopted as the final prediction for the audio-visual session. For arousal, the positive/negative classification step is not performed. For the audio-visual SoM assessment models, we propose to extract functional features (Function) and VGGish-based deep learning features (VGGish) from speech, and abstract visual features, based on a convolutional neural network (CNN), from the baseline visual features. For each feature stream, a long short-term memory (LSTM) model is trained to predict the valence/arousal values of a segment, and a support vector regression (SVR) model is adopted for the final decision fusion. Experiments on the USoM dataset show that the model with Function, baseline ResNet features and baseline VGG features obtains promising prediction results for valence, with a concordance correlation coefficient (CCC) of up to 0.531 on the test set, which is much higher than the baseline result of 0.219.
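The segment-and-average step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives the 60 s window length but not the overlap, so the 30 s hop, the handling of the final partial window, and the function names `overlapping_segments` and `session_prediction` are all hypothetical and not taken from the paper.

```python
from statistics import fmean


def overlapping_segments(duration_s, window_s=60.0, hop_s=30.0):
    """Return (start, end) times of overlapping windows covering a session.

    window_s follows the 60 s segments mentioned in the abstract;
    hop_s (the step between window starts) is an assumed value.
    """
    if duration_s <= window_s:
        # Short session: use it as a single segment.
        return [(0.0, float(duration_s))]
    segments = []
    start = 0.0
    while start + window_s <= duration_s:
        segments.append((start, start + window_s))
        start += hop_s
    # Assumed tail handling: add one last window aligned to the session end.
    if segments[-1][1] < duration_s:
        segments.append((duration_s - window_s, float(duration_s)))
    return segments


def session_prediction(segment_predictions):
    """Average per-segment predictions into one session-level value."""
    return fmean(segment_predictions)
```

For example, a 100 s session yields three windows here, and the session-level valence is the mean of the three per-segment model outputs.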

SESSION: Cross-cultural Emotion Sub-challenge

Session details: Cross-cultural Emotion Sub-challenge

  • Dongmei Jiang

Efficient Spatial Temporal Convolutional Features for Audiovisual Continuous Affect Recognition

  • Haifeng Chen
  • Yifan Deng
  • Shiwen Cheng
  • Yixuan Wang
  • Dongmei Jiang
  • Hichem Sahli

Affective dimension prediction from multi-modal data is becoming an increasingly attractive research field in artificial intelligence (AI) and human-computer interaction (HCI). Previous works have shown that discriminative features from multiple modalities are important for accurately recognizing emotional states. Recently, deep representations have proved to be effective for emotional state recognition. To investigate new deep spatial-temporal features and evaluate their effectiveness for affective dimension recognition, in this paper we propose: (1) combining a pre-trained 2D-CNN and a 1D-CNN for learning deep spatial-temporal features from video images and audio spectrograms; and (2) a Spatial-Temporal Graph Convolutional Network (ST-GCN) adapted to the facial landmark graph. To evaluate the effectiveness of the proposed spatial-temporal features for affective dimension prediction, we propose a Deep Bidirectional Long Short-Term Memory Network (DBLSTM) model for single-modality, early-fusion and late-fusion predictions. For the liking dimension, we use the text modality for prediction. Experimental results on the AVEC 2019 CES dataset show that our proposed spatial-temporal features and recognition model obtain promising results. On the development set, the obtained concordance correlation coefficient (CCC) is up to 0.724 for arousal and 0.705 for valence. On the test set, the CCC is 0.513 for arousal and 0.515 for valence, which outperforms the baseline system, whose corresponding CCC is 0.355 for arousal and 0.468 for valence.
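Most results in these proceedings are reported as a concordance correlation coefficient (CCC). As a reading aid, the sketch below computes the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population (biased) variances. The function name `ccc` and the zero-denominator convention are assumptions made for this illustration; the challenge's official scoring script may differ in such details.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_g = fmean(pred), fmean(gold)
    # Population variances and covariance (divide by n, not n - 1).
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical; CCC is undefined.
        # Returning 0.0 is a convention assumed for this sketch only.
        return 0.0
    return 2.0 * cov / denom
```

Unlike Pearson correlation, the CCC penalises a constant offset: predictions that track the gold standard perfectly but are shifted by one unit score below 1.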

Predicting Depression and Emotions in the Cross-roads of Cultures, Para-linguistics, and Non-linguistics

  • Heysem Kaya
  • Dmitrii Fedotov
  • Denis Dresvyanskiy
  • Metehan Doyran
  • Danila Mamontov
  • Maxim Markitantov
  • Alkim Almila Akdag Salah
  • Evrim Kavcar
  • Alexey Karpov
  • Albert Ali Salah

Cross-language, cross-cultural emotion recognition and accurate prediction of affective disorders are two of the major challenges in affective computing today. In this work, we compare several systems for the Detecting Depression with AI Sub-challenge (DDS) and the Cross-cultural Emotion Sub-challenge (CES), which are published as part of the Audio-Visual Emotion Challenge (AVEC) 2019. For both sub-challenges, we benefit from the baselines, while introducing our own features and regression models. For the DDS challenge, where ASR transcripts are provided by the organizers, we propose simple linguistic and word-duration features. These ASR transcript-based features are shown to outperform the state-of-the-art audio-visual features for this task, reaching a test set Concordance Correlation Coefficient (CCC) of 0.344, compared with a challenge baseline of 0.120. Our results show that non-verbal parts of the signal are important for the detection of depression, and that combining them with linguistic information produces the best results. For CES, the proposed systems using unsupervised feature adaptation outperform the challenge baselines on emotional primitives, reaching test set CCCs of 0.466 and 0.499 for arousal and valence, respectively.

Adversarial Domain Adaption for Multi-Cultural Dimensional Emotion Recognition in Dyadic Interactions

  • Jinming Zhao
  • Ruichen Li
  • Jingjun Liang
  • Shizhe Chen
  • Qin Jin

Cross-cultural emotion recognition has been a challenging research problem in the affective computing field. In this paper, we present our solutions for the Cross-cultural Emotion Sub-challenge (CES) in the Audio/Visual Emotion Challenge (AVEC) 2019. The aim of this task is to investigate how emotion knowledge of Western European cultures (German and Hungarian) can be transferred to Chinese culture. Previous studies have shown that cultural differences can significantly affect the performance of emotion recognition across cultures. In this paper, we propose an unsupervised adversarial domain adaptation approach to bridge the gap across different cultures for emotion recognition. The highlights of our complete solution for the CES challenge task include: 1) several efficient deep features from multiple modalities and an LSTM network to capture the temporal information; 2) several multimodal interaction strategies to take advantage of the interlocutor's multimodal information; and 3) an unsupervised adversarial adaptation approach to bridge the emotion knowledge gap across different cultures. Our solutions achieve best CCCs of 0.4, 0.471 and 0.257 for arousal, valence and likability, respectively, on the Chinese challenge test set, which outperforms the baseline system, whose corresponding CCCs are 0.355, 0.468 and 0.041.

SESSION: Detecting Depression with AI Sub-challenge

Session details: Detecting Depression with AI Sub-challenge

  • Mohammad Soleymani

Evaluating Acoustic and Linguistic Features of Detecting Depression Sub-Challenge Dataset

  • Larry Zhang
  • Joshua Driscol
  • Xiaotong Chen
  • Reza Hosseini Ghomi

Depression affects hundreds of millions of individuals worldwide. With the prevalence of depression increasing, the economic costs of the illness are growing significantly. The AVEC 2019 Detecting Depression with AI (Artificial Intelligence) Sub-Challenge provides an opportunity to use novel signal processing, machine learning, and artificial intelligence technology to predict the presence and severity of depression in individuals through digital biomarkers such as vocal acoustics, the linguistic content of speech, and facial expression. In our analysis, we point out key factors to consider during pre-processing and modelling to effectively build voice biomarkers for depression. We additionally check the dataset for balance in its demographic and severity score distributions to evaluate the generalizability of our results.

Multimodal Fusion of BERT-CNN and Gated CNN Representations for Depression Detection

  • Mariana Rodrigues Makiuchi
  • Tifani Warnita
  • Kuniaki Uto
  • Koichi Shinoda

Depression is a common but serious mental disorder that affects people all over the world. Besides providing an easier way of diagnosing the disorder, a computer-aided automatic depression assessment system is needed in order to reduce subjective bias in the diagnosis. We propose a multimodal fusion of speech and linguistic representations for depression detection. We train our model to infer the Patient Health Questionnaire (PHQ) score of subjects from the AVEC 2019 DDS Challenge database, the E-DAIC corpus. For the speech modality, we use deep spectrum features extracted from a pretrained VGG-16 network and employ a Gated Convolutional Neural Network (GCNN) followed by an LSTM layer. For the textual embeddings, we extract BERT textual features and employ a Convolutional Neural Network (CNN) followed by an LSTM layer. We achieved CCC scores of 0.497 and 0.608 on the E-DAIC corpus development set using the unimodal speech and linguistic models, respectively. We further combine the two modalities using a feature fusion approach in which we feed the last representation of each single-modality model to a fully-connected layer in order to estimate the PHQ score. With this multimodal approach, we achieved a CCC score of 0.696 on the development set and 0.403 on the test set of the E-DAIC corpus, which is an absolute improvement of 0.283 points over the challenge baseline.

A Multi-Modal Hierarchical Recurrent Neural Network for Depression Detection

  • Shi Yin
  • Cong Liang
  • Heyan Ding
  • Shangfei Wang

We propose a multi-modal method with a hierarchical recurrent neural structure to integrate vision, audio and text features for depression detection. The method contains two hierarchies of bidirectional long short-term memory networks to fuse multi-modal features and predict the severity of depression. An adaptive sample weighting mechanism is introduced to adapt to the diversity of training samples. Experiments on the testing set of a depression detection challenge demonstrate the effectiveness of the proposed method.

Multi-modality Depression Detection via Multi-scale Temporal Dilated CNNs

  • Weiquan Fan
  • Zhiwei He
  • Xiaofen Xing
  • Bolun Cai
  • Weirui Lu

Depression, a prevalent mental illness, negatively impacts individuals and society. This paper targets the Detecting Depression with AI Sub-challenge (DDS) task of the Audio/Visual Emotion Challenge (AVEC) 2019. First, two task-specific features are proposed: 1) deep contextual text features, which incorporate global text features and sentiment scores estimated by fine-tuned Bidirectional Encoder Representations from Transformers (BERT); and 2) span-wise dense temporal statistical features, in which multiple statistical functions are computed over each continuous time span. Furthermore, we propose a multi-scale temporal dilated CNN to precisely capture the hidden temporal dependency in the data for automatic multi-modality depression detection. Our proposed framework achieves competitive performance, with a Concordance Correlation Coefficient (CCC) of 0.466 on the development set and 0.430 on the test set, which is remarkably higher than the baseline result of 0.269 on the development set and 0.120 on the test set.
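The idea of span-wise statistical features mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state which statistical functions or span lengths are used, so the choice of mean, standard deviation, minimum and maximum, the non-overlapping spans, and the function name `span_statistics` are all assumptions made here for illustration only.

```python
from statistics import fmean, pstdev


def span_statistics(frames, span_len):
    """Summarise a frame-level feature track span by span.

    frames: sequence of floats (one low-level feature value per frame).
    span_len: number of consecutive frames per span (assumed parameter).
    Returns one (mean, std, min, max) tuple per span, in temporal order.
    """
    if span_len <= 0:
        raise ValueError("span_len must be positive")
    features = []
    for start in range(0, len(frames), span_len):
        span = frames[start:start + span_len]  # last span may be shorter
        features.append((fmean(span), pstdev(span), min(span), max(span)))
    return features
```

The resulting sequence of per-span statistics is much shorter than the raw frame sequence, which is the kind of input a temporal model could then consume.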

Multi-level Attention Network using Text, Audio and Video for Depression Prediction

  • Anupama Ray
  • Siddharth Kumar
  • Rutvik Reddy
  • Prerana Mukherjee
  • Ritu Garg

Depression has been the leading cause of mental-health illness worldwide. Major depressive disorder (MDD) is a common mental health disorder that affects people both psychologically and physically, and it can lead to loss of life. Due to the lack of diagnostic tests and the subjectivity involved in detecting depression, there is growing interest in using behavioural cues to automate depression diagnosis and stage prediction. The absence of labelled behavioural datasets for such problems and the huge amount of variation possible in behaviour make the problem more challenging. This paper presents a novel multi-level attention-based network for multi-modal depression prediction that fuses features from audio, video and text modalities while learning the intra- and inter-modality relevance. The multi-level attention reinforces overall learning by selecting the most influential features within each modality for decision making. We perform exhaustive experimentation to create different regression models for the audio, video and text modalities. Several fusion models with different configurations are constructed to understand the impact of each feature and modality. We outperform the current baseline by 17.52% in terms of root mean squared error.