HCMA '23: Proceedings of the 4th International Workshop on Human-centric Multimedia Analysis

SESSION: Keynote Talks

Training Data Optimization in Human-Centric Analysis

  • Liang Zheng

Training data are a pillar of computer vision applications. While existing works typically assume fixed training sets, I will discuss how training data optimization complements and benefits state-of-the-art computer vision models. In particular, this talk focuses on a few human-centric applications: person re-identification, multi-object tracking, and micro-expression recognition. For person re-identification, I will discuss how to search for training data that match the target domain distribution. For tracking, I will introduce our research utilizing synthetic human data to learn association policies. Finally, I will introduce a protocol to synthesize large amounts of micro-expression training images, leading to a state-of-the-art recognition system. I will conclude the talk with trends and new problems in data-centric computer vision.

Multi-Modal Multi-Task Joint 2D and 3D Scene Perception and Localization

  • Dan Xu

The talk will cover several important computer vision tasks within the context of visual 2D/3D scene understanding, including scene depth estimation, joint learning of scene depth and scene parsing, and visual deep SLAM. Specifically, the talk presents how to design effective deep learning models for these challenging problems with minimal human annotation effort, including a transformer network structure for deep multi-task learning, a weakly supervised model for scene parsing, and self-supervised joint learning of scene depth, camera pose, and moving object discovery. The talk will also cover explorations of large-scale implicit scene modeling in a self-supervised manner, which benefits various human-centric tasks and applications.

SESSION: Human-Centric Multimedia Analysis

Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

  • Shizhou Zhang
  • De Cheng
  • Wenlong Luo
  • Yinghui Xing
  • Duo Long
  • Hao Li
  • Kai Niu
  • Guoqiang Liang
  • Yanning Zhang

Finding target persons in full scene images with a query of text description has important practical applications in intelligent video surveillance. However, existing text-based person retrieval methods mainly focus on cross-modal matching between the query text descriptions and a gallery of cropped pedestrian images, unlike real-world scenarios where bounding boxes are not available. To close this gap, we study the problem of text-based person search in full images by proposing a new end-to-end learning framework which jointly optimizes the pedestrian detection, identification, and visual-semantic feature embedding tasks. To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to the text-described proposals. Besides, a cross-scale visual-semantic embedding mechanism is utilized to improve the performance. To validate the proposed method, we collect and annotate two large-scale benchmark datasets based on the widely adopted image-based person search datasets CUHK-SYSU and PRW. Comprehensive experiments are conducted on the two datasets; compared with the baseline methods, our method achieves state-of-the-art performance.
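
The semantic-driven proposal idea lends itself to a compact illustration. Below is a minimal, hypothetical PyTorch sketch (the module name SemanticProposalScorer and all dimensions are our assumptions, not the authors' implementation) of how a query text embedding could bias region-proposal objectness scores toward text-described regions.

```python
import torch
import torch.nn as nn

class SemanticProposalScorer(nn.Module):
    """Hypothetical sketch: text features re-weight proposal objectness."""
    def __init__(self, vis_dim=256, txt_dim=512, embed_dim=128):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)  # project proposal features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # project query text feature

    def forward(self, proposal_feats, text_feat, objectness):
        # proposal_feats: (N, vis_dim), text_feat: (txt_dim,), objectness: (N,)
        v = nn.functional.normalize(self.vis_proj(proposal_feats), dim=-1)
        t = nn.functional.normalize(self.txt_proj(text_feat), dim=-1)
        sim = v @ t                      # cosine similarity to the query, (N,)
        return objectness + sim          # bias toward text-described proposals

scorer = SemanticProposalScorer()
scores = scorer(torch.randn(100, 256), torch.randn(512), torch.randn(100))
print(scores.shape)  # torch.Size([100])
```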

Functas Usability for Human Activity Recognition using Wearable Sensor Data

  • Thibault Malherbe
  • Anis Takka

Recent advancements in data science have introduced implicit neural representations as a powerful approach for learning complex, high-dimensional functions, bypassing the need for explicit equations or manual feature engineering. In this paper, we present our research on employing the weights of these implicit neural representations, referred to as 'functas', to characterize and classify batches of data. This approach eliminates the need for manual feature engineering on raw data. Specifically, we showcase the efficacy of the functas method in the domain of human activity recognition, utilizing output data from sensors such as accelerometers and gyroscopes. Our results demonstrate the promise of the functas approach, suggesting a potential shift in the paradigm of data science methodologies.
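
To make the functa idea concrete, here is a minimal PyTorch sketch of one plausible reading: fit a tiny implicit network mapping normalized time to sensor readings for each window, then use the flattened network weights as that window's feature vector. The architecture, step count, and function name are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn

def fit_functa(window, steps=200, lr=1e-2):
    """Fit an implicit network t -> sensor reading; return its weights as a feature."""
    # window: (T, C) tensor of sensor samples; t is normalized time in [0, 1]
    T, C = window.shape
    t = torch.linspace(0, 1, T).unsqueeze(1)               # (T, 1) inputs
    net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, C))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(t), window)      # reconstruction loss
        loss.backward()
        opt.step()
    # the flattened weights serve as the 'functa' feature for a downstream classifier
    return torch.cat([p.detach().flatten() for p in net.parameters()])

feat = fit_functa(torch.randn(128, 6))  # e.g., 3-axis accelerometer + 3-axis gyroscope
print(feat.shape)
```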

An Analytical Study of Visual Attention Behavior in Viewing Panoramic Video

  • Feilin Han
  • Ying Zhong
  • Ke-Ao Zhao

Panoramic video offers an immersive viewing experience in which viewers can actively explore 360-degree motion pictures and engage with the narrative. Studying users' visual attention behavior can give us a better understanding of video processing, semantic learning, and coding in 360-degree videos. In this paper, we developed two attention visualization toolkits, a visual saliency map and semantic attention annotation, for collecting ROI data. A practice-based analytical methodology is employed to discuss user behavior while viewing panoramic shorts. We gathered viewing behavior data from 23 participants and visualized attention saliency to analyze the viewers' visual attention behavior and narrative cognition process. From the collected data, we summarize regularities in attention distribution and derive practical insights for decision-making in panorama production.
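
As a rough illustration of how such a saliency map can be built (our own NumPy sketch, not the paper's toolkit), gaze fixations pooled across viewers can be splatted as Gaussians onto the equirectangular frame; longitude wraparound is ignored here for brevity.

```python
import numpy as np

def saliency_map(fixations, width=360, height=180, sigma=5.0):
    """Aggregate gaze fixations into a normalized saliency heatmap."""
    # fixations: iterable of (longitude_deg, latitude_deg) viewing directions
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width))
    for lon, lat in fixations:
        x, y = lon % width, lat + height / 2   # map degrees to the pixel grid
        heat += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat / heat.max()                   # normalize to [0, 1]

heat = saliency_map([(120.0, 10.0), (125.0, 8.0), (300.0, -20.0)])
print(heat.shape, heat.max())  # (180, 360) 1.0
```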

CACEE: Computational Aesthetic Classification of Expressive Effects Based on Emotional Consistency

  • Haonan Cheng
  • Zhenjia Shi
  • Xiang Lei
  • Long Ye

Classifying professional and non-professional expressive effects plays a significant role in communication, video classification, and other fields. Nevertheless, existing Computational Aesthetic Classification (CAC) techniques mainly focus on images, audio, etc., and lack exploration of expressive effects. In this paper, we propose a novel Computational Aesthetic Classification of Expressive Effects (CACEE) method based on emotional consistency between modalities. Considering the significant contrast in emotional presentation between aesthetic and ordinary expressions, we take the first step toward CACEE by utilizing emotional consistency between modalities as the evaluation metric. In particular, we construct an emotional consistency feature space that successfully describes the level of emotional consistency across modalities. We then propose a classification model based on convolutional neural networks that combines these features and efficiently identifies whether an expression is aesthetic or not. To verify the validity of the proposed CACEE method, we construct CUC-RED, the first dataset of expressive effects, which provides data support for subsequent studies. Experiments validate the effectiveness of the constructed feature space, and comparison results show that the proposed classification model outperforms traditional methods on the CACEE task.
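
One plausible reading of the emotional-consistency feature is sketched below with NumPy (the function and the three example modalities are illustrative assumptions, not the paper's exact feature space): per-modality emotion distributions are compared pairwise, and the resulting agreement scores feed a CNN-based classifier.

```python
import numpy as np

def consistency_features(emotion_probs):
    """Pairwise cross-modal agreement scores from per-modality emotion distributions."""
    # emotion_probs: dict mapping modality name -> (K,) emotion probability vector
    mods = sorted(emotion_probs)
    feats = []
    for i in range(len(mods)):
        for j in range(i + 1, len(mods)):
            p, q = emotion_probs[mods[i]], emotion_probs[mods[j]]
            cos = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
            feats.append(cos)            # higher agreement -> more consistent
    return np.array(feats)               # fed to the downstream classifier

f = consistency_features({
    "audio": np.array([0.7, 0.2, 0.1]),
    "face":  np.array([0.6, 0.3, 0.1]),
    "text":  np.array([0.1, 0.8, 0.1]),
})
print(f)  # three pairwise consistency scores
```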

Exploiting Temporal Information in Real-time Portrait Video Segmentation

  • Weichen Xu
  • Yezhi Shen
  • Qian Lin
  • Jan P. Allebach
  • Fengqing Zhu

Portrait video segmentation has been widely used in applications such as online conferencing and content creation. However, it is challenging for mobile devices with limited computation resources to achieve accurate and temporally consistent real-time portrait segmentation. In this work, we propose a segmentation method based on the classic encoder-decoder architecture with a lightweight model design. To facilitate the efficient use of temporal guidance, our method takes an RGB-M input, where M is a guidance portrait mask concatenated to the RGB input. Furthermore, we leverage the temporal guidance to enable model inference on an adaptive portrait region of interest (ROI). We introduce a two-stage training strategy to compensate for the limited data variety of portrait video datasets. Our method is evaluated on portrait videos covering different types of daily activities and outperforms existing portrait segmentation methods in terms of segmentation accuracy. Without introducing significant delay, our method is suitable for applications requiring real-time processing.
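
The RGB-M input admits a compact illustration. The following is a minimal PyTorch sketch (the toy one-layer encoder and decoder are our assumption, not the paper's lightweight architecture) of concatenating the previous frame's predicted mask to the current RGB frame as a fourth input channel and feeding predictions back at inference time.

```python
import torch
import torch.nn as nn

class RGBMSegmenter(nn.Module):
    """Toy encoder-decoder taking an RGB frame plus a guidance mask channel."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(4, 16, 3, padding=1)  # 4 = RGB + guidance mask
        self.decoder = nn.Conv2d(16, 1, 3, padding=1)  # 1-channel portrait mask

    def forward(self, rgb, prev_mask):
        x = torch.cat([rgb, prev_mask], dim=1)         # (B, 4, H, W) RGB-M input
        return torch.sigmoid(self.decoder(torch.relu(self.encoder(x))))

model = RGBMSegmenter()
rgb = torch.randn(1, 3, 128, 128)
mask = torch.zeros(1, 1, 128, 128)   # first frame: empty guidance mask
for _ in range(3):                   # at inference, feed each prediction back in
    mask = model(rgb, mask)
print(mask.shape)  # torch.Size([1, 1, 128, 128])
```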

Unknown Fault Detection of Rolling Bearing Based on Similarity Mining of Stationary and Non-stationary Features

  • Ruoxi Li
  • Jie Nie
  • Chenglong Wang
  • Di Niu
  • Shusong Yu
  • Weizhi Nie
  • Xiangqian Ding

Rolling bearings are essential and critical components in rotating machinery, and their condition is vital for ensuring the reliable operation of the equipment. Numerous deep learning methods have been investigated for diagnosing the condition of bearings. However, these methods can only recognize fault categories that were included in the training of the model and are unable to detect unknown fault categories. In this paper, we propose a method for unknown fault detection of rolling bearings based on similarity mining of stationary and non-stationary features. This method analyzes the features of fault signals and explores the relationships between fault categories from both stationary and non-stationary perspectives. In particular, we propose a framework for bearing unknown fault detection based on stationary and non-stationary feature mining. The framework comprises two stages: pre-training and the discovery of unknown fault categories. Siamese networks are employed during the pre-training stage to acquire the similarity relationship between stationary and non-stationary features. During the discovery stage, this similarity information is utilized to facilitate the identification of new fault categories. Additionally, we introduce a discriminative feature time-frequency focusing module to focus the model's attention on non-stationary time-frequency fluctuations. The performance of our model is evaluated on the CWRU and PU datasets, and the experimental results demonstrate its superiority over other state-of-the-art methods, with a significant improvement in accuracy.
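
A rough PyTorch sketch of the pre-training idea as we read it follows (the twin-encoder architecture, dimensions, and the prototype-threshold rule for flagging unknowns are our assumptions, not the paper's model): two encoders embed the stationary and non-stationary views of a signal into a shared space where similarity is learned on known fault classes, and a test embedding far from every known-class prototype is treated as an unknown fault.

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """Twin encoders mapping stationary / non-stationary features to a shared space."""
    def __init__(self, in_dim=256, embed_dim=64):
        super().__init__()
        self.stat_enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                      nn.Linear(128, embed_dim))
        self.nonstat_enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                         nn.Linear(128, embed_dim))

    def forward(self, stat_feat, nonstat_feat):
        a = nn.functional.normalize(self.stat_enc(stat_feat), dim=-1)
        b = nn.functional.normalize(self.nonstat_enc(nonstat_feat), dim=-1)
        return a, b   # train with a similarity loss on known fault classes

def is_unknown(embedding, prototypes, threshold=0.5):
    """Flag a sample whose similarity to every known-class prototype is low."""
    # prototypes: (K, D) mean embeddings of the K known fault classes
    sims = embedding @ prototypes.T              # cosine similarities
    return sims.max().item() < threshold         # far from all knowns -> unknown

net = SiameseEncoder()
a, b = net(torch.randn(1, 256), torch.randn(1, 256))
protos = nn.functional.normalize(torch.randn(4, 64), dim=-1)
print(is_unknown(a, protos))
```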