HuMA'20: Proceedings of the 1st International Workshop on Human-centric Multimedia Analysis

Full Citation in the ACM Digital Library

SESSION: Keynote Talks I

Session details: Keynote Talks I

  • Chuang Gan

Human-Centric Object Interactions - A Fine-Grained Perspective from Egocentric Videos

  • Dima Damen

This talk argues for a fine(r)-grained perspective on human-object interactions.
Motivation: Observe a person chopping some parsley. Can you detect the moment at which
the parsley was first chopped? Whether the parsley was chopped coarsely or finely?
The skill level of the person chopping? Finally, can the knowledge learnt from this
task be useful to understand a video of another person slicing an apple? Problem statement:
Finer-grained understanding of object interactions from video is the topic of this
talk. This will be primarily studied from first-person (or egocentric) video, that
is, video captured using a wearable (head-worn or glass-worn) audio-visual sensor (camera).
Approach: Using multi-modal footage (appearance, motion, audio, language), I will
present approaches for determining skill or expertise from video sequences [CVPR 2019],
assessing action "completion", i.e. when an interaction is attempted but not completed
[BMVC 2018], dual-domain and dual-time learning [CVPR 2020, CVPR 2019, ICCVW 2019]
as well as multi-modal fusion using vision, audio and language [CVPR 2020, ICCV 2019,
BMVC 2019]. All project and publication details at:
I will also introduce EPIC-KITCHENS-100, the largest egocentric dataset in people's
homes. The dataset now includes 20M frames of 90K action segments and 100 hours of
recording fully annotated, based on unique annotations from the participants narrating
their own videos, thus reflecting true intention. Dataset available from:

Sensing, Understanding and Synthesizing Humans in an Open World

  • Ziwei Liu

Sensing, understanding and synthesizing humans in images and videos has been a long-pursued
goal of computer vision and graphics, with extensive real-life applications. It is
at the core of embodied intelligence. In this talk, I will discuss our work in human-centric
visual analysis (of faces, human bodies, scenes, videos and 3D scans), with an emphasis
on learning structural deep representations under complex scenarios. I will also discuss
the challenges related to naturally-distributed data (e.g. long-tailed and open-ended)
emerging from real-world sensors, and how we can overcome these challenges by incorporating
new neural computing mechanisms such as dynamic memory and routing. Our approach has
shown its effectiveness on both discriminative and generative tasks.

SESSION: Session 1: Multimedia Event Detection

Session details: Session 1: Multimedia Event Detection

  • Wu Liu

Intra and Inter-modality Interactions for Audio-visual Event Detection

  • Mathilde Brousmiche
  • Stéphane Dupont
  • Jean Rout

The presence of auditory and visual sensory streams enables human beings to obtain
a profound understanding of a scene. While audio and visual signals are able to provide
relevant information separately, the combination of both modalities offers more accurate
and precise information. In this paper, we address the problem of audio-visual event
detection. The goal is to identify events that are both visible and audible. For this,
we propose an audio-visual network that models intra and inter-modality interactions
with Multi-Head Attention layers. Furthermore, the proposed model captures the temporal
correlation between the two modalities with multimodal LSTMs. Our method achieves
state-of-the-art performance on the AVE dataset.
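
As a rough illustration of the intra- and inter-modality interactions described above, the following minimal sketch applies scaled dot-product attention within one modality and across two. It is not the authors' network: it uses a single head with no learned projections and no LSTM, and the feature values and the helper names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical per-segment features: two time segments, two dimensions each.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [1.0, -1.0]]

# Intra-modality: audio segments attend to other audio segments.
intra_audio = attend(audio, audio, audio)
# Inter-modality: audio segments attend to visual segments.
inter_audio = attend(audio, visual, visual)
```

In the intra-modality case each output row is a convex combination of the audio vectors; in the inter-modality case the same queries are answered with visual content instead.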

Personalized User Modelling for Sleep Insight

  • Dhruv Deepak Upadhyay
  • Vaibhav Pandey
  • Nitish Nag
  • Ramesh Jain

Sleep is critical to leading a healthy lifestyle. Each day, most people go to sleep
without any idea about how their night's rest is going to be. For an activity that
humans spend around a third of their life doing, there is a surprising amount of mystery
around it. Despite current research, creating personalized sleep models in real-world
settings has been challenging. Existing literature provides several connections between
daily activities and sleep quality. Unfortunately, these insights do not generalize
well in many individuals. Thus, it is essential to create a personalized sleep model.
This research proposes a user-centered sleep model that can identify causal relationships
between daily activities and sleep quality and present the user with specific feedback
about how their lifestyle affects their sleep. Our method uses N-of-1 experiments
on longitudinal multimodal user data and event mining to build an understanding of how
lifestyle choices (exercise, eating, circadian rhythm) affect sleep quality.
Our experimental results identified and quantified relationships while extracting
confounding variables through a causal framework.

AI at the Disco: Low Sample Frequency Human Activity Recognition for Night Club Experiences

  • Amritpal Singh Gill
  • Sergio Cabrero
  • Pablo Cesar
  • David A. Shamma

Human activity recognition (HAR) has grown in popularity as sensors have become more
ubiquitous. Beyond standard health applications, there exists a need for embedded,
low-cost, low-power, accurate activity sensing for entertainment experiences. We present
a system and method for HAR that uses a deep neural net with a low-cost accelerometer-only
sensor running at 0.8Hz to preserve battery power. Despite these limitations, we demonstrate
an accuracy of 94.79% over 6 activity classes with an order of magnitude less data.
This sensing system conserves power further by using a connectionless reading---embedding
accelerometer data in the Bluetooth Low Energy broadcast packet---which can deliver
over a year of human-activity recognition data on a single coin cell battery. Finally,
we discuss the integration of our HAR system in a smart-fashion wearable for a live
two-night deployment in an instrumented night club.

SESSION: Keynote Talk II

Session details: Keynote Talk II

  • Jingkuan Song

Unseen Activity Recognition in Space and Time

  • Cees Snoek

Progress in video understanding has been astonishing in the past decade. Classifying,
localizing, tracking and even segmenting actor instances at the pixel level are now
commonplace, thanks to label-supervised machine learning. Yet, it is becoming increasingly
clear that label-supervised knowledge transfer is expensive to obtain and scale, especially
as the need for spatiotemporal detail and compositional semantic specification in
long video sequences increases. In this talk we will discuss alternatives to label-supervision,
using semantics [1], language [2], ontologies [3], similarity [4] and time [5] as
the primary knowledge sources for various video understanding challenges. Despite
being less example-dependent, the proposed algorithmic solutions are naturally embedded
in modern (self-)learned representations and lead to state-of-the-art unseen activity
recognition in space and time.

SESSION: Session 2: Face, Gesture, and Body Pose

Session details: Session 2: Face, Gesture, and Body Pose

  • Dingwen Zhang

Towards Purely Unsupervised Disentanglement of Appearance and Shape for Person Images

  • Hongtao Yang
  • Tong Zhang
  • Wenbing Huang
  • Xuming He
  • Fatih Porikli

There has been a fair amount of research interest in exploring the disentanglement of
appearance and shape from human images. Most existing endeavours pursue this goal
by either using training images with annotations or regulating the training process
with external clues such as human skeletons, body segmentation or cloth patches.
In this paper, we aim to address this challenge in a more unsupervised manner---we
do not require any annotation nor any external task-specific clues. To this end, we
formulate an encoder-decoder-like network to extract both the shape and appearance
features from input images at the same time, and train the parameters by three losses:
feature adversarial loss, color consistency loss and reconstruction loss. The feature
adversarial loss mainly enforces little to no mutual information between the extracted
shape and appearance features, while the color consistency loss is to encourage the
invariance of person appearance conditioned on different shapes. More importantly,
our unsupervised framework utilizes learned shape features as masks which are applied
to the input itself in order to obtain clean appearance features. Without using fixed
input human skeleton, our network better preserves the conditional human posture while
requiring less supervision. Experimental results on DeepFashion and Market1501 demonstrate
that the proposed method achieves clean disentanglement and is able to synthesize novel
images of comparable quality with state-of-the-art weakly-supervised or even supervised

R-FENet: A Region-based Facial Expression Recognition Method Inspired by Semantic
Information of Action Units

  • Cong Wang
  • Ke Lu
  • Jian Xue
  • Yanfu Yan

Facial expression recognition is a challenging problem in real-world scenarios owing
to obstacles of illumination, occlusion, pose variations, and low-quality images.
Recent works have paid attention to the concept of the region of interest (RoI) to
strengthen local regional features in the presentation of facial expressions. However,
the regions are mostly assigned by general experience; for example, the average areas
of the eyes, mouth, and nose. In addition, features in the RoI are extracted from
cropped patches. This operation is repeated and inefficient because RoI areas mostly
overlap. This paper presents a region-based convolutional neural network for the recognition
of facial expression named R-FENet. The proposed network is constructed on the basis
of ResNet and predefined expert knowledge according to the Facial Action Coding System.
To locate the region related to facial expression, three RoI groups (i.e., the upper,
middle, and lower facial RoIs) including seven RoI areas are delimited according to
the semantic relationship between action units and facial expression. Furthermore,
aiming to avoid extracting features from the original image, the RoI pooling layer
is used to extract RoI features. The proposed R-FENet is validated on two public datasets
of facial expression captured in the wild: AffectNet and SFEW. Experiments show that
the proposed method achieves state-of-the-art results with accuracy of 60.95% on AffectNet
and 55.97% on SFEW, relative to single-model methods.
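
The efficiency argument above (pooling RoI features from one shared feature map instead of re-extracting them from overlapping cropped patches) can be sketched as follows. This is a generic RoI max-pooling illustration, not the R-FENet implementation; the function names, the toy feature map, and the RoI coordinates are all hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool roi = (top, left, bottom, right), end-exclusive, into out_h x out_w bins."""
    top, left, bottom, right = roi
    h, w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Row span of bin i; floor/ceil bounds guarantee a non-empty bin.
        r0 = top + (i * h) // out_h
        r1 = top + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map, computed once and shared by every RoI.
shared = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

# Two overlapping RoIs read from the same map; nothing is recomputed.
upper = roi_max_pool(shared, (0, 0, 2, 4), 1, 2)
lower = roi_max_pool(shared, (1, 1, 4, 4), 2, 2)
```

Because both regions index into the same precomputed map, the overlap between them costs no extra feature extraction, which is the inefficiency of per-patch cropping that the abstract points to.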

StarGAN-EgVA: Emotion Guided Continuous Affect Synthesis

  • Li Yu
  • Dolzodmaa Davaasuren
  • Shivansh Rao
  • Vikas Kumar

Recent advancement of Generative Adversarial Network (GAN) based architectures has
achieved impressive performance on static facial expression synthesis. Continuous
affect synthesis, which has applications in generating videos and movies, is underexplored.
Synthesizing continuous photo-realistic facial emotion expressions from a static image
is challenging because a) there is a lack of consensus on what parameters produce
a smooth shift of facial expressions and b) it is difficult to balance between the
consistency over personalized features and granularity among continuous emotional
states. We adapt one of the most successful networks, StarGAN, and propose StarGAN-EgVA
to generate continuous facial emotions based on 2D emotional representations, i.e.,
valence and arousal (VA). We propose to utilize categorical emotions (e.g., happy,
sad) to guide the regression training on VA intensities so that the model learns both
the domain-specific features and subtle changes introduced by different VA intensities.
A special trick at testing is also exploited to automatically infer emotion labels
from the VA point on a 2D emotional plane to ensure smooth transition among emotional
states. Qualitative and quantitative experiments demonstrate our proposed model's
ability to generate more photo-realistic and consistent affect sequences than the

SESSION: Session 3: Human Object Interaction

Session details: Session 3: Human Object Interaction

  • Wenbing Huang

Human-Object Interaction Detection: A Quick Survey and Examination of Methods

  • Trevor Bergstrom
  • Humphrey Shi

Human-object interaction detection is a relatively new task in the world of computer
vision and visual semantic information extraction. With the goal of machines identifying
interactions that humans perform on objects, there are many real-world use cases for
the research in this field. To our knowledge, this is the first general survey of
the state-of-the-art and milestone works in this field. We provide a basic survey
of the developments in the field of human-object interaction detection. Many works
in this field use multi-stream convolutional neural network architectures, which combine
features from multiple sources in the input image. Most commonly these are the humans
and objects in question, as well as the spatial quality of the two. As far as we are
aware, there have not been in-depth studies performed that look into the performance
of each component individually. In order to provide insight to future researchers,
we perform an individualized study that examines the performance of each component
of a multi-stream convolutional neural network architecture for human-object interaction
detection. Specifically, we examine the HORCNN architecture as it is a foundational
work in the field. In addition, we provide an in-depth look at the HICO-DET dataset,
a popular benchmark in the field of human-object interaction detection.

Online Video Object Detection via Local and Mid-Range Feature Propagation

  • Zhifan Zhu
  • Zechao Li

This work proposes a new Local and Mid-range feature Propagation (LMP) method for
video object detection that captures feature correlations effectively and reduces redundant
computation. Specifically, the proposed LMP model contains two modules with two individual
propagation schemes. The local module propagates motion and appearance context in the
short term. It is lightweight, greatly reducing redundant computation without considering
local attention. On the other hand, to explore feature correlations in the long term,
a mid-range module based on the non-local attention mechanism is introduced to capture
relatively longer-range relationships. By incorporating these two modules, LMP enriches
the feature representation while keeping computation fast. The proposed method is
evaluated on the ImageNet VID dataset. It achieves a 64.2% mAP score at a speed of
28.5 FPS on desktop GPUs, which is state-of-the-art performance among one-stage
MobileNet-based detectors.
Source code is available at

iWink: Exploring Eyelid Gestures on Mobile Devices

  • Zhen Li
  • Mingming Fan
  • Ying Han
  • Khai N. Truong

Although gaze has been widely studied for mobile interactions, eyelid-based gestures
are relatively understudied and limited to a few basic gestures (e.g., blink). In this
work, we propose a gesture grammar to construct both basic and compound eyelid gestures.
We present an algorithm to detect nine eyelid gestures in real-time on mobile devices
and evaluate its performance with 12 participants. Results show that our algorithm
is able to recognize nine eyelid gestures with 83% and 78% average accuracy using
user-dependent and user-independent models respectively. Further, we design a gesture
mapping scheme to allow for navigating between and within mobile apps using only eyelid
gestures. Moreover, we show how eyelid gestures can be used to enable cross-application
and sensitive interactions. Finally, we highlight future research directions.

Commonsense Learning: An Indispensable Path towards Human-centric Multimedia

  • Bin Huang
  • Siao Tang
  • Guangyao Shen
  • Guohao Li
  • Xin Wang
  • Wenwu Zhu

Learning commonsense knowledge and conducting commonsense reasoning are basic human
abilities for making presumptions about the type and essence of ordinary situations in
daily life, and they serve as very important goals in human-centric Artificial Intelligence
(AI). With the increasing number of media types and quantities provided by various
Internet services, commonsense learning and reasoning are undoubtedly playing key
roles in advancing human-centric multimedia analysis. Therefore, this
paper first introduces the basic concept of commonsense knowledge and commonsense
reasoning, then summarizes commonsense resources and benchmarks, gives an overview
of recent commonsense learning and reasoning methods, and discusses several popular
applications of commonsense knowledge in real-world scenarios. This work distinguishes
itself from existing literature, which attends only to natural language processing,
by focusing more on multimedia, which includes both natural language processing and
computer vision. Furthermore, we also present our insights and thinking on future
research directions for commonsense.