HUMA'21: Proceedings of the 2nd International Workshop on Human-centric Multimedia Analysis

Full Citation in the ACM Digital Library

SESSION: Invited Talks 1

Session details: Invited Talks 1

  • Jingkuan Song

Modern Learning Methodologies for Co-Saliency Detection

  • Junwei Han

Visual saliency computing aims to imitate the human visual attention mechanism to identify the most prominent or distinctive regions or objects in a visual scene. It is a basic low-level image processing technique that can be applied to many downstream computer vision tasks. Traditionally, visual saliency computing has been divided into eye fixation prediction and salient object detection. However, recent progress shows that many new research directions have emerged in this field, including weakly supervised, semi-supervised, and unsupervised saliency learning, co-saliency detection, and multi-modal saliency detection. This talk focuses on the key issues in co-saliency detection, introduces co-saliency methods based on advanced learning methods such as multi-instance learning, metric learning, and deep learning, and discusses potential future research directions in this area.

SESSION: Session 1: Pose, Action, and Interaction

Session details: Session 1: Pose, Action, and Interaction

  • Xinchen Liu

Learning Positional Priors for Pretraining 2D Pose Estimators

  • Kun Zhang
  • Ping Yao
  • Rui Wu
  • Chuanguang Yang
  • Ding Li
  • Min Du
  • Kai Deng
  • Renbiao Liu
  • Tianyao Zheng

The goal of 2D human pose estimation is to locate the keypoints of body parts in 2D images. State-of-the-art methods for pose estimation usually construct pixel-wise heatmaps from keypoints as labels for training neural networks, whose backbones are usually initialized randomly or from classification models trained on large datasets such as ImageNet. According to statistical data, there are strong positional priors for human keypoints, which are highly dependent on the relationships between image patches. To learn positional priors for pretraining pose estimators, we propose the Heatmap-Style Jigsaw Puzzles (HSJP) problem as a self-supervised pretext task, whose target is to predict the location of each patch from an image composed of shuffled patches. During pretraining, we use only person images in MS-COCO, rather than introducing an extra large dataset like ImageNet. A heatmap-style label for patch location is designed, and our learning process is non-contrastive. The weights learned by the HSJP pretext task are utilised as backbones of 2D human pose estimators, which are then finetuned on the MS-COCO human keypoints dataset. With two popular and strong 2D human pose estimators, HRNet and SimpleBaseline, we evaluate the mAP score on both the MS-COCO validation and test-dev datasets. Our experiments show that downstream pose estimators with our self-supervised pretraining obtain much better performance than those trained from scratch, and are comparable to those using ImageNet classification models as their initial backbones.
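
The pretext task described in this abstract (shuffle the patches of an image, then predict where each patch came from using heatmap-style labels) can be illustrated with a minimal sketch. The abstract does not give the grid size, the label resolution, or the exact shape of the heatmap, so the square grid, the Gaussian label, and the names `make_jigsaw` and `location_heatmap` below are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def make_jigsaw(image, grid, rng):
    """Split a 2D image (list of rows) into grid x grid patches and shuffle them.

    Returns (shuffled_image, perm), where perm[slot] is the index of the
    original patch that now sits at position `slot` (row-major order).
    The grid layout is an assumption; the abstract does not specify it.
    """
    height, width = len(image), len(image[0])
    ph, pw = height // grid, width // grid  # patch height and width
    perm = list(range(grid * grid))
    rng.shuffle(perm)
    shuffled = [[0] * (pw * grid) for _ in range(ph * grid)]
    for slot, src in enumerate(perm):
        slot_r, slot_c = divmod(slot, grid)
        src_r, src_c = divmod(src, grid)
        for dy in range(ph):
            for dx in range(pw):
                shuffled[slot_r * ph + dy][slot_c * pw + dx] = (
                    image[src_r * ph + dy][src_c * pw + dx]
                )
    return shuffled, perm


def location_heatmap(src, grid, size, sigma=1.0):
    """Build a size x size heatmap peaked at the centre of the patch's original cell.

    The Gaussian shape and sigma are assumptions, chosen by analogy with the
    keypoint heatmaps commonly used in pose estimation.
    """
    src_r, src_c = divmod(src, grid)
    cell = size / grid
    cy, cx = (src_r + 0.5) * cell, (src_c + 0.5) * cell
    return [
        [
            math.exp(-((y + 0.5 - cy) ** 2 + (x + 0.5 - cx) ** 2) / (2 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    shuffled, perm = make_jigsaw(image, grid=2, rng=rng)
    # One target heatmap per slot, encoding where that patch originally belonged.
    targets = [location_heatmap(src, grid=2, size=6) for src in perm]
    print(perm, len(targets))
```

In a sketch like this, a network would take `shuffled` as input and regress one heatmap per slot, which keeps the pretext output in the same form as the keypoint heatmaps used during finetuning.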

A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric

  • Yitian Yuan
  • Xiaohan Lan
  • Xin Wang
  • Long Chen
  • Zhi Wang
  • Wenwu Zhu

Temporal Sentence Grounding in Videos (TSGV), i.e., grounding a natural language sentence that describes complex human activities in a long, untrimmed video sequence, has received unprecedented attention over the last few years. Although each newly proposed method plausibly achieves better performance than previous ones, current TSGV models still tend to capture the moment annotation biases and fail to take full advantage of multi-modal inputs. Even more strikingly, several extremely simple baselines without training can also achieve state-of-the-art performance. In this paper, we take a closer look at the existing evaluation protocols for TSGV, and find that both the prevailing dataset splits and the evaluation metrics cause unreliable benchmarking. To this end, we propose to re-organize two widely used TSGV benchmarks (ActivityNet Captions and Charades-STA). Specifically, we deliberately make the ground-truth moment distribution different in the training and test splits, i.e., out-of-distribution (OOD) testing. Meanwhile, we introduce a new evaluation metric "dR@n,IoU@m" that calibrates the basic IoU scores by penalizing bias-influenced moment predictions and alleviates the inflated evaluations caused by dataset annotation biases such as overlong ground-truth moments. Under our new evaluation protocol, we conduct extensive experiments and ablation studies on eight state-of-the-art TSGV methods. All the results demonstrate that the re-organized dataset splits and the new metric can better monitor the progress in TSGV. Our reorganized datasets are available at
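
The idea of calibrating an IoU-based recall with a penalty can be sketched as follows. The abstract does not give the formula for "dR@n,IoU@m", so the discount used here (one minus the boundary error normalized by video duration, applied to both the start and the end) is an assumption chosen for illustration, and the sketch covers only the n = 1 case. The function names `temporal_iou` and `discounted_recall_at_1` are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) moments given in seconds."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    # The hull equals the union whenever the moments overlap; when they do
    # not overlap, the intersection is 0 and the result is 0 either way.
    hull = max(pe, ge) - min(ps, gs)
    return inter / hull if hull > 0 else 0.0


def discounted_recall_at_1(preds, gts, durations, iou_thresh):
    """Recall at top-1 where each hit is scaled by a boundary-error discount.

    Assumed discount (not taken from the abstract): a hit contributes
    (1 - |start error| / duration) * (1 - |end error| / duration)
    instead of 1, so a loose match that merely clears the IoU threshold
    on an overlong ground-truth moment earns less credit.
    """
    total = 0.0
    for (ps, pe), (gs, ge), dur in zip(preds, gts, durations):
        if temporal_iou((ps, pe), (gs, ge)) >= iou_thresh:
            alpha_start = 1.0 - abs(ps - gs) / dur
            alpha_end = 1.0 - abs(pe - ge) / dur
            total += alpha_start * alpha_end
    return total / len(preds)


if __name__ == "__main__":
    # The ground truth spans the whole 20 s video; the prediction covers 70% of it.
    preds, gts, durations = [(0.0, 14.0)], [(0.0, 20.0)], [20.0]
    print(discounted_recall_at_1(preds, gts, durations, iou_thresh=0.5))
```

In this example a plain R@1,IoU@0.5 would count the prediction as a full hit, while the discounted version gives it partial credit that shrinks as the predicted boundaries drift from the ground truth.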

NLOS Imaging Assisted Navigation for BVI

  • Yingxuan Zhu
  • Jian Li

Assistive navigation techniques support the activities of blind or visually impaired (BVI) people and improve their quality of life. However, current navigation systems cannot detect hidden objects that may suddenly emerge and become obstacles. In this paper, we propose systems and methods to detect hidden objects using non-line-of-sight (NLOS) imaging. Because NLOS imaging requires a wall to reflect light, which may not be available in the wild, we introduce a method to select the most suitable surface to serve as the wall. In practice, the environment is dynamic, with multiple objects moving at different speeds. We develop a method to track the surface object so as to maintain a proper distance and speed for the perception system. Based on these methods for surface object selection and tracking, we design a system to detect hidden objects. We discuss our methods in detail.

Using Feature Interaction among GPS Data for Road Intersection Detection

  • Rutian Qing
  • Yizhi Liu
  • Yijiang Zhao
  • Zhuhua Liao
  • YuXuan Liu

Road intersections play a vital role in road network construction, autonomous driving, and intelligent transportation systems. Most methods detect road intersections using only geometrical features, without spatio-temporal features, which leads to insufficient precision. In addition, existing methods do not consider the impact of feature interaction. To address these issues, this paper proposes a novel way to detect road intersections from GPS trajectories by extracting spatio-temporal features and using the interaction of these features to enhance detection precision. The proposed method is evaluated on DIDI's GPS data from Chengdu. The results show that the proposed method effectively improves the precision of road intersection detection compared with the existing method, which benefits road network construction and makes up for the deficiencies of existing methods.

SESSION: Session 2: Technical Demos

Modeling 3D Objects: Implications for Neuroscience, Behavioral and Medical Studies: A Case Demo

  • Andrey V. Vlasov
  • Alexey V. Tumyalis
  • Vladislav Aksiotis
  • Carlos D. Nieto
  • Arsenjy Karavaev

We have designed, developed and adapted 3D objects (3DOs) within an interactive environment for in-lab neuroscience research on motor control and the mirror neuron system (MNS) (Figure 1b; 3D view: The modeled 3DOs are implemented in an experimental design with hand movements. We have combined video hand motion capture, associative learning, immersive reality, 'mirror therapy' and non-invasive brain stimulation (NIBS) methods in order to explore the effects of functional motor activity of the MNS. We propose to explore these effects further for application with advanced NIBS protocols. This system demo for studying the MNS is an example of research that is widespread in neuroscience, behavioral and medical studies. It includes the implementation of human-object interaction in 3D reality. The 3DO models and hand movements are available for any research purposes (Figure 1a) [1].