ASMMC-MMAC'18: Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and the First Multi-Modal Affective Computing of Large-Scale Multimedia Data

SESSION: Emotion Generation in Speech and Facial Animation

Session details: Emotion Generation in Speech and Facial Animation

  •      Jin Qin

A Kullback-Leibler Divergence Based Recurrent Mixture Density Network for Acoustic Modeling in Emotional Statistical Parametric Speech Synthesis

  •      Xiaochun An
  • Yuchao Zhang
  • Bing Liu
  • Liumeng Xue
  • Lei Xie

This paper proposes a Kullback-Leibler divergence (KLD) based recurrent mixture density network (RMDN) approach for acoustic modeling in emotional statistical parametric speech synthesis (SPSS), which aims at improving model accuracy and emotion naturalness. First, to improve model accuracy, we propose to use an RMDN as the acoustic model, which combines an LSTM with a mixture density network (MDN). Adding a mixture density layer allows us to perform multimodal regression as well as to predict variances, thus modeling more accurate probability density functions of acoustic features. Second, we further introduce Kullback-Leibler divergence regularization in model training. Inspired by KLD's success in acoustic model adaptation, we aim to improve emotion naturalness by maximizing the distances between the distributions of emotional speech and neutral speech. Objective and subjective evaluations show that the proposed approach improves the prediction accuracy of acoustic features and the naturalness of the synthesized emotional speech.
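The KLD regularizer described above can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the closed-form KL divergence between two univariate Gaussians and a hypothetical training objective in which a weighted KLD term is subtracted from the negative log-likelihood, so that pushing the emotional and neutral distributions apart lowers the loss. The `weight` coefficient is an assumption for illustration.

```python
import math

def kl_gaussians(mu1, var1, mu2, var2):
    """Closed-form KL divergence N(mu1, var1) || N(mu2, var2) for univariate Gaussians."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def kld_regularized_loss(nll, mu_emo, var_emo, mu_neu, var_neu, weight=0.1):
    """Hypothetical objective: NLL minus a weighted KLD between the
    emotional and neutral feature distributions (per dimension), so
    maximizing their distance reduces the training loss."""
    kld = sum(kl_gaussians(me, ve, mn, vn)
              for me, ve, mn, vn in zip(mu_emo, var_emo, mu_neu, var_neu))
    return nll - weight * kld
```

Note that the KLD of a distribution with itself is zero, so the regularizer only takes effect once the emotional model's predicted distributions diverge from the neutral ones.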

Visual Speech Emotion Conversion using Deep Learning for 3D Talking Head

  •      Dong-Yan Huang
  • Ellensi Chandra
  • Xiangting Yang
  • Ying Zhou
  • Huaiping Ming
  • Weisi Lin
  • Minghui Dong
  • Haizhou Li

In this paper, we present an audio-visual emotion conversion approach based on deep learning for a 3D talking head. The technology aims at retargeting neutral facial and speech expression into emotional ones. The challenging issue is how to control the dynamics and variations of different expressions in both speech and the face. The controllability of facial expressions is achieved by training on parallel neutral and emotional marker-based facial motion capture data using temporal restricted Boltzmann machines (TRBMs) for emotion transfer, while emotional voice conversion uses long short-term memory recurrent neural networks (LSTM-RNNs). Through the combination of a 3D skinning method and 3D motion capture, we can make the facial animation model close to physical reality for different expressions of the 3D talking head. Results on a subjective emotion recognition task show that recognition rates of the synthetic audio-visual emotion are comparable to those given the original videos of the speaker.

A Comparison of Expressive Speech Synthesis Approaches based on Neural Network

  •      Liumeng Xue
  • Xiaolian Zhu
  • Xiaochun An
  • Lei Xie

Adaptability and controllability in changing speaking styles and speaker characteristics are the advantages of deep neural network (DNN) based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study on the use of DNNs for expressive speech synthesis with a small set of emotional speech data. Specifically, we study three typical model adaptation approaches: (1) retraining a neural model on emotion-specific data (retrain), (2) augmenting the network input with emotion-specific codes (code) and (3) using emotion-dependent output layers with shared hidden layers (multi-head). Long short-term memory (LSTM) networks are used as the acoustic models. Objective and subjective evaluations have demonstrated that the multi-head approach consistently outperforms the other two approaches, delivering more natural emotion in the synthesized speech.
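The second adaptation approach above (emotion codes) can be sketched in a few lines. This is an illustrative assumption, not the paper's code: a one-hot emotion code is appended to every frame's linguistic feature vector before it enters the network, and the particular emotion set shown is hypothetical.

```python
def augment_with_emotion_code(frame_features, emotion,
                              emotions=("neutral", "happy", "sad", "angry")):
    """Append a one-hot emotion code to each frame's feature vector,
    so a single shared network can condition on the target emotion."""
    code = [1.0 if e == emotion else 0.0 for e in emotions]
    return [list(f) + code for f in frame_features]
```

At synthesis time the same network can then be steered toward any emotion simply by changing the appended code, with no retraining.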

SESSION: Emotion Recognition from Multimedia

Session details: Emotion Recognition from Multimedia

  •      Wei Huang

Speech Emotion Recognition via Contrastive Loss under Siamese Networks

  •      Zheng Lian
  • Ya Li
  • Jianhua Tao
  • Jian Huang

Speech emotion recognition is an important aspect of human-computer interaction. Prior work proposes various end-to-end models to improve the classification performance. However, most of them rely on the cross-entropy loss together with softmax as the supervision component, which does not explicitly encourage discriminative learning of features. In this paper, we introduce the contrastive loss function to encourage intra-class compactness and inter-class separability between learnable features. Furthermore, multiple feature selection methods and pairwise sample selection methods are evaluated. To verify the performance of the proposed system, we conduct experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, a common evaluation corpus. Experimental results reveal the advantages of the proposed method, which reaches 62.19% in weighted accuracy and 63.21% in unweighted accuracy. It outperforms the baseline system optimized without the contrastive loss function by 1.14% and 2.55% in weighted and unweighted accuracy, respectively.
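The contrastive loss used above has a standard form that can be sketched directly; this is a minimal generic version, not the paper's exact formulation or margin value. It pulls same-class pairs together and pushes different-class pairs at least a margin apart, which is what produces the intra-class compactness and inter-class separability the abstract describes.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(dist, same_class, margin=1.0):
    """Standard contrastive loss on the distance between a pair:
    same-class pairs are penalized for being far apart, different-class
    pairs for being closer than `margin`."""
    if same_class:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2
```

Different-class pairs already separated by more than the margin contribute zero loss, so training effort concentrates on the hard, confusable pairs.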

Deep Spectrum Feature Representations for Speech Emotion Recognition

  •      Ziping Zhao
  • Yiqin Zhao
  • Zhongtian Bao
  • Haishuai Wang
  • Zixing Zhang
  • Chao Li

Automatically detecting emotional state in human speech, which plays an effective role in human-machine interaction, has been a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on the extraction of carefully hand-crafted and tailored features. Recently, spectrogram representations of emotional speech have achieved competitive performance for automatic speech emotion recognition. In this work we propose a method to extract deep features, herein denoted as deep spectrum features, from the spectrogram by leveraging attention-based bidirectional long short-term memory recurrent neural networks with fully convolutional networks. The learned deep spectrum features are then fed into a deep neural network (DNN) to predict the final emotion. The proposed model is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset to validate its effectiveness. Promising results indicate that the deep spectrum representations extracted from the proposed model perform best, at 65.2% weighted accuracy and 68.0% unweighted accuracy, when compared to other existing methods. We then compare the performance of our deep spectrum features with two standard acoustic feature representations for speech-based emotion recognition. When combined with a support vector classifier, the performance of the extracted deep feature representations is comparable with the conventional features. Moreover, we also investigate the impact of different frequency resolutions of the input spectrogram on the performance of the system.
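The spectrogram input that such a model consumes can be computed with a short sketch. This is a generic log-magnitude STFT, offered only as an illustration of the kind of 2-D time-frequency image fed to a CNN/attention-BLSTM front end; the frame length, hop size, and window choice here are assumptions, not the paper's settings.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Frame the waveform, apply a Hann window, and take the
    log-magnitude of each frame's FFT, yielding a 2-D
    (time x frequency) array suitable as CNN input."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + eps)

# One second of a 440 Hz tone at 16 kHz as a toy input.
spec = log_spectrogram(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

Changing `frame_len` (and the FFT size with it) directly changes the frequency resolution of the resulting image, which is the axis the abstract's final experiment varies.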

Automatic Smile Detection of Infants in Mother-Infant Interaction via CNN-based Feature Learning

  •      Chuangao Tang
  • Wenming Zheng
  • Yuan Zong
  • Zhen Cui
  • Nana Qiu
  • Simeng Yan
  • Xiaoyan Ke

The smile is one of the most frequently occurring facial expressions of human beings. It is an explicit social signal from which to infer human emotion. Interpretation of infant smile behavior has received much attention in psychology and pediatric medicine, e.g., in the study of Autism Spectrum Disorder (ASD). Due to the low efficiency of manual emotion encoding of infant facial images, automatic smile detection systems have drawn many researchers' attention. However, the lack of large amounts of labeled data hinders research progress in this field. In this paper, a novel dataset named the RCLA&NBH_Smile Dataset is presented. Thirty-four infants' facial expression videos were recorded during mother-infant interaction and over 77,000 frames were manually labeled. For developing an automatic smile analysis system, we propose a Convolutional Neural Network (CNN) based method for smile detection. We first evaluated our proposed method on two public datasets, GENKI-4K and CelebA, where it achieved satisfactory mean accuracies of 94.55% and 92.60%, respectively. Experiments on the RCLA&NBH_Smile dataset show that our proposed approach can achieve a mean accuracy of 87.16% and an F1-score of 62.54%.

SESSION: Affective Social Multimedia Computing

Session details: Affective Social Multimedia Computing

  •      Chao Li

Photo to Family Tree: Deep Kinship Understanding for Nuclear Family Photos

  •      Mengyin Wang
  • Jiashi Feng
  • Xiangbo Shu
  • Zequn Jie
  • Jinhui Tang

Multi-person kinship recognition, which extends existing studies that recognize kinship in pairwise face images independently, is complicated and challenging, with little existing literature. In this paper, we extend kinship recognition from two persons to a nuclear family consisting of multiple persons. To generate the corresponding family tree from a nuclear family photo automatically, we propose a novel Deep Kinship Recognition (DKR) framework. Firstly, we propose a deep kinship classification model (named DKC-KGA) which leverages kin-or-not, gender and relative age attributes to predict kinship categories. Then, based on the outputs of DKC-KGA for an input nuclear family photo, we develop a reasoning conditional random field (R-CRF) model to infer the optimal corresponding family tree by utilizing common knowledge w.r.t. kinship of a nuclear family. Our DKR framework gains superior performance on both the Group-Face dataset and the TSKinFace dataset, compared with state-of-the-art methods.

Deep Full-scaled Metric Learning for Pedestrians Re-identification: A Pre-requisite Study on Multi-camera-based Affective Computing

  •      Wei Huang
  • Mingyuan Luo
  • Peng Zhang

In this study, a new full-scaled deep discriminant model is proposed to tackle the re-identification (re-id) problem for pedestrian targets, which aims to identify pedestrians across a network of cameras with non-overlapping fields of view and is a prerequisite for multi-camera-based affective computing. The new full-scaled model is realized by taking the concepts of depth, width, and cardinality simultaneously into consideration, and the challenging re-id problem is further tackled via a novel deep semi-supervised metric learning method based on the full-scaled model. Additionally, both the conventional stochastic gradient descent algorithm and an alternative, more efficient proximal gradient descent algorithm are derived to realize the new deep metric learning method. For experimental evaluation, the novel full-scaled deep metric learning method has been compared with 9 other popular re-id methods on 3 well-known databases. Comprehensive statistical analyses suggest the superiority of the new method when handling the balance learning problem in the re-id task.

Video Interestingness Prediction Based on Ranking Model

  •      Shuai Wang
  • Shizhe Chen
  • Jinming Zhao
  • Qin Jin

Predicting the interestingness of videos can greatly improve people's satisfaction in many applications such as video retrieval and recommendation. In order to obtain less subjective interestingness annotations, partial pairwise comparisons among videos are first annotated and all videos are then ranked globally to generate the interestingness value. We study two factors in interestingness prediction, namely comparison information and evaluation metric optimization. In this paper, we propose a novel deep ranking model which simulates the human annotation procedure for more reliable interestingness prediction. To be specific, we extract different visual and acoustic features and sample comparison video pairs with different strategies such as random and fixed-distance sampling. The richer information in human pairwise ranking annotations, compared with the plain interestingness value, is used as guidance to train our networks. In addition to comparison information, we also explore a reinforcement ranking model which directly optimizes the evaluation metric. Experimental results demonstrate that the fusion of the two ranking models makes better use of human labels and outperforms the regression baseline. It also reaches the best performance according to the results of the MediaEval 2017 interestingness prediction task.
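The pairwise-comparison supervision described above is commonly implemented as a margin ranking loss; the sketch below is a generic version of that idea, not the paper's exact objective, and the margin value is an assumption. Given scores for two videos and a label saying which one annotators found more interesting, the loss is zero only when the preferred video's score wins by at least the margin.

```python
def pairwise_ranking_loss(score_a, score_b, a_more_interesting, margin=0.2):
    """Margin ranking loss on one annotated comparison pair: penalize
    the model unless the more interesting video's predicted score
    exceeds the other's by at least `margin`."""
    diff = score_a - score_b if a_more_interesting else score_b - score_a
    return max(0.0, margin - diff)
```

Summing this loss over sampled pairs (e.g. random or fixed-distance pairs) trains the scorer directly on the relative judgments humans actually provided, rather than on the derived absolute interestingness values.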

Inferring Personality Traits from User Liked Images via Weakly Supervised Dual Convolutional Network

  •      Hancheng Zhu
  • Leida Li
  • Hongyan Jiang

With the prevalence of social media, users tend to use visual content to convey their preferences in online communication. This allows us to infer users' personality traits from their preferred images. Since users tend to have different preferences over image regions, modeling their traits from the holistic image alone may be problematic. In this paper, we propose an end-to-end weakly supervised dual convolutional network (WSDCN) for personality prediction, which consists of a classification network and a regression network. The classification network has the advantage of capturing class-specific localized image regions while requiring only image-level class labels. In order to obtain the personality class-specific localized activation map (class activation map), the Big-Five (BF) traits are first converted into ten personality class labels for the classification network. Secondly, the Multi-Personality Class Activation Map (MPCAM) is utilized as the localized activation of deep feature maps to generate local deep features, which are then combined with the holistic deep features for the regression network. Finally, the user-liked images and the associated personality labels are used to train the end-to-end WSDCN for predicting the BF personality traits. Experimental results on the annotated PsychoFlickr database show that the proposed method is superior to the state-of-the-art approaches, and that the BF personality traits can be predicted simultaneously by the trained WSDCN model.

SESSION: Keynote Talk

  •      Jia Jia

Psychological stress and depression are threatening people's health. It is non-trivial to detect stress or depression in a timely manner for proactive care. With the popularity of social media, people are used to sharing their daily activities and interacting with friends on social media platforms, making it feasible to leverage online social media data for stress and depression detection. In this talk, we will systematically introduce our work on stress and depression detection employing large-scale benchmark datasets from real-world social media platforms, including 1) stress-related and depression-related textual, visual and social attributes from various aspects, 2) novel hybrid models for binary stress detection, stress event and subject detection, and cross-domain depression detection, and 3) several intriguing phenomena indicating the special online behaviors of stressed as well as depressed people. We will also demonstrate our developed mental health care applications at the end of this talk.