HCMA '22: Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis

Full Citation in the ACM Digital Library

SESSION: Oral Session

Anonym-Recognizer: Relationship-preserving Face Anonymization and Recognition

  • Chunlei Peng
  • Shuang Wan
  • Zimin Miao
  • Decheng Liu
  • Yu Zheng
  • Nannan Wang

With the widespread application of big data technology, we are exposed to more and more video monitoring. To prevent serious social problems caused by face data leakage, face anonymization has become an important way to protect face privacy. In this paper, face anonymization refers to anonymizing the visual appearance of face images. Existing face anonymization methods mainly focus on removing identity information. However, in privacy-sensitive face recognition scenarios, existing methods produce anonymized faces that can no longer be used for face recognition, which limits the application scope of face anonymization. Therefore, it is equally important to ensure that anonymized face images can still be used for downstream tasks such as face recognition. To this end, we propose Anonym-Recognizer, a relationship-preserving face anonymization and recognition method. Our method uses a relationship cyphertext, which can be any binary identity number representing the identity of the image owner, and designs a generative adversarial network to perform face anonymization and relationship cyphertext embedding. In our framework, we first use the Visual Anonymizer to manipulate the visual appearance of the input image, then use the Cyphertext Embedder to obtain the anonymized image with the identity information embedded. With the help of the Anonym Recognizer, the face recognition system can extract the relationship cyphertext from the anonymized image as the credential for matching the identity information. The proposed Anonym-Recognizer provides a new perspective for the recognition and application of anonymized face images. Experiments on the Megaface dataset show that our method achieves 100% recognition accuracy on anonymized faces while completing the face anonymization task with high qualitative and quantitative quality.

Cycle-Consistent Learning for Weakly Supervised Semantic Segmentation

  • Bin Wang
  • Yu Qiao
  • Dahua Lin
  • Stephen D.H. Yang
  • Weijia Li

We investigate a principled way to accomplish weakly supervised semantic segmentation using only scribbles as supervision. The key challenge of this task lies in how to accurately propagate the semantic labels from the annotated scribbles to the unlabeled regions so that accurate pseudo masks can be harvested to learn better segmentation models. To tackle this issue, we propose a simple, strong, and unified framework named Cycle-Consistent Learning (CCL). Specifically, our CCL first uses the given scribbles for training and makes a prediction for the unlabeled regions. Then the predicted regions, in turn, serve as supervision for learning to predict the labeled scribbles. With such a cycle-consistent constraint, the accurate scribbles can in turn help suppress the potential noise in the unlabeled regions, resulting in better pseudo masks. The training process of our CCL is looped until the network converges in an end-to-end way. We conduct extensive experiments on the popular PASCAL VOC benchmark and achieve results comparable with the state-of-the-art method. The training mechanism of the CCL is straightforward and can be easily embedded into any future weakly supervised semantic segmentation approach.
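The forward and backward passes described in this abstract can be illustrated with a toy consistency check. This is only a sketch under strong assumptions, not the CCL training procedure: a 1-D nearest-centroid classifier stands in for the segmentation network, pixel intensities stand in for features, and the function names (fit_centroids, predict, cycle_consistency) are hypothetical.

```python
def fit_centroids(values, labels):
    """Fit a nearest-centroid 'model': one mean value per label."""
    sums, counts = {}, {}
    for v, y in zip(values, labels):
        sums[y] = sums.get(y, 0.0) + v
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, values):
    """Assign each value the label of its closest centroid."""
    return [min(centroids, key=lambda y: abs(v - centroids[y])) for v in values]


def cycle_consistency(scribble_vals, scribble_labels, unlabeled_vals):
    # Forward pass: learn from scribbles, predict the unlabeled pixels.
    forward = fit_centroids(scribble_vals, scribble_labels)
    pseudo = predict(forward, unlabeled_vals)
    # Backward pass: learn from the pseudo labels, predict the scribbles.
    backward = fit_centroids(unlabeled_vals, pseudo)
    recovered = predict(backward, scribble_vals)
    # Agreement with the trusted scribble labels measures cycle consistency.
    agree = sum(r == y for r, y in zip(recovered, scribble_labels))
    return pseudo, agree / len(scribble_labels)


# Toy data (assumed): low intensities are background, high are foreground.
scribble_vals = [0.1, 0.2, 0.9, 0.8]
scribble_labels = ["bg", "bg", "fg", "fg"]
unlabeled_vals = [0.0, 0.3, 0.7, 1.0]
pseudo, score = cycle_consistency(scribble_vals, scribble_labels, unlabeled_vals)
```

In an actual network the agreement term would presumably be a differentiable loss on the scribble pixels rather than a count; the sketch only shows the direction of supervision in each half of the cycle.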

Dual Domain-Adversarial Learning for Audio-Visual Saliency Prediction

  • Yingzi Fan
  • Longfei Han
  • Yue Zhang
  • Lechao Cheng
  • Chen Xia
  • Di Hu

Both visual and auditory information are valuable for determining the salient regions in videos. Deep convolutional neural networks (CNNs) show strong capacity in the audio-visual saliency prediction task. Due to factors such as shooting scenes and weather, there often exists a moderate distribution discrepancy between source training data and target testing data. This domain discrepancy leads to performance degradation on target testing data for CNN models. This paper makes an early attempt to tackle the unsupervised domain adaptation problem for audio-visual saliency prediction. We propose a dual domain-adversarial learning algorithm to mitigate the domain discrepancy between source and target data. First, a dedicated domain discrimination branch is built to align the auditory feature distributions. Then, those auditory features are fused into the visual features through a cross-modal self-attention module. The other domain discrimination branch is devised to reduce the domain discrepancy of the visual features and of the audio-visual correlations implied by the fused audio-visual features. Experiments on public benchmarks demonstrate that our method can relieve the performance degradation caused by domain discrepancy.

Multi-level Multi-modal Feature Fusion for Action Recognition in Videos

  • Xinghang Hu
  • Yanli Ji
  • Gedamu Alemu Kumie

Several multi-modal feature fusion approaches have been proposed in recent years to improve action recognition in videos. These approaches do not take full advantage of the multi-modal information in the videos, since they are biased towards a single modality or treat modalities separately. To address this problem, we propose Multi-Level Multi-modal feature Fusion (MLMF) for action recognition in videos. MLMF projects each modality into shared and specific feature spaces. According to the similarity between the shared features of the two modalities, we augment the features in the specific feature space. As a result, the fused features not only incorporate the unique characteristics of the two modalities, but also explicitly emphasize their similarities. Moreover, the action segments in a video differ in length, so the model needs to ensemble features at different levels for fine-grained action recognition. The optimal multi-level unified action feature representation is obtained by aggregating features at different levels. Our approach is evaluated on the EPIC-KITCHENS-100 dataset and achieves encouraging action recognition results.

Two-branch Objectness-centric Open World Detection

  • Yan Wu
  • Xiaowei Zhao
  • Yuqing Ma
  • Duorui Wang
  • Xianglong Liu

In recent years, with the development of deep learning, object detection has made great progress and has been widely used in many tasks. However, previous models all operate on closed sets, while there are many unknown categories in the real open world. Directly applying a model trained on known categories to unknown classes leads to misclassification. In this paper, we propose a two-branch objectness-centric open world object detection framework, consisting of a bias-guided detector and an objectness-centric calibrator, to effectively capture the objectness of both known and unknown instances and make accurate predictions for known classes. The bias-guided detector, trained with the known labels, can accurately predict the classes and boxes for known classes, while the objectness-centric calibrator can localize instances of any class without affecting the classification and regression of known classes. In the inference stage, we use the objectness-centric affirmation to confirm the results for known classes and predict the unknown instances. Comprehensive experiments on the open world object detection benchmark validate the effectiveness of our method compared to state-of-the-art open world object detection approaches.

SESSION: Poster Session

Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation

  • Shuning Chang
  • Pichao Wang
  • Fan Wang
  • Hao Li
  • Zheng Shou

Temporal action proposal generation (TAPG) is a fundamental and challenging task in media interpretation and video understanding, especially in temporal action detection. Most previous works focus on capturing the local temporal context and can locate simple action instances with clean frames and clear boundaries well. However, they generally fail in complicated scenarios where the actions of interest involve irrelevant frames and background clutter, and the local temporal context becomes less effective. To deal with these problems, we present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG. Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block. The resulting augmented transformer improves the ability to capture long-range dependencies and to learn robust features for noisy action instances. Moreover, an adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and the difference between adjacent features. The features from the two modules carry rich semantic information about the video and are fused for effective sequential proposal generation. Extensive experiments are conducted on two challenging datasets, THUMOS14 and ActivityNet1.3, and the results demonstrate that our method outperforms state-of-the-art TAPG methods. Our code will be released soon.

Domain Camera Adaptation and Collaborative Multiple Feature Clustering for Unsupervised Person Re-ID

  • Yuanpeng Tu

Recently, unsupervised person re-identification (re-ID) has drawn much attention due to its open-world scenario settings, where limited annotated data is available. Existing supervised methods often fail to generalize well to unseen domains, while unsupervised methods mostly lack multi-granularity information and are prone to confirmation bias. In this paper, we aim to find better feature representations on the unseen target domain from two aspects: 1) performing unsupervised domain adaptation on the labeled source domain and 2) mining potential similarities on the unlabeled target domain. In addition, a collaborative pseudo re-labeling strategy is proposed to alleviate the influence of confirmation bias. Firstly, a generative adversarial network is used to transfer images from the source domain to the target domain. Moreover, person identity preserving and identity mapping losses are introduced to improve the quality of the generated images. Secondly, we propose a novel collaborative multiple feature clustering framework (CMFC) to learn the internal data structure of the target domain, including global feature and partial feature branches. The global feature branch (GB) employs unsupervised clustering on the global feature of person images, while the partial feature branch (PB) mines similarities within different body regions. Finally, extensive experiments on two benchmark datasets show the competitive performance of our method under unsupervised person re-ID settings.

PSINet: Progressive Saliency Iteration Network for RGB-D Salient Object Detection

  • Songsong Duan
  • Chenxing Xia
  • Xianjin Fang
  • Bin Ge
  • Xiuju Gao
  • Kuan-Ching Li

RGB-D Salient Object Detection (RGB-D SOD) is a pixel-level dense prediction task that can highlight the prominent object in the scene by combining color information and depth constraints. Attention mechanisms have been widely employed in SOD due to their ability to capture important cues. However, most existing attentions (e.g., spatial attention, channel attention, self-attention) mainly exploit the pixel-level attention maps, ignoring the region properties of salient objects. To remedy this issue, we propose a progressive saliency iteration network (PSINet) with a region-wise saliency attention to improve the regional integrity of salient objects in an iterative manner. Specifically, two-stream Swin Transformers are first employed to extract RGB and depth features. Second, a multi-modality alternate and inverse module (AIM) is designed to extract complementary features from RGB-D images in an interleaved manner, which breaks down the barriers of inconsistency existing in the cross-modal data and also sufficiently captures the complementarity. Third, a triple progressive iteration decoder (TPID) is proposed to optimize the salient objects, where a coarse saliency map, generated by integrating multi-scale features with a U-Net, is viewed as region-wise attention maps to construct a region-wise saliency attention module (RSAM), which can emphasize the prominent region of features. Finally, the regional integrity of salient objects can be gradually optimized from coarse to fine by iterating the above steps on TPID. Quantitative and qualitative experiments demonstrate that the proposed model performs favorably against 19 state-of-the-art (SOTA) saliency detectors on five benchmark RGB-D SOD datasets.

Multimodal Network with Cross-Modal Attention for Audio-Visual Event Localization

  • Qianchao Tan

Audio-Visual Event (AVE) localization requires identifying which segments in a video contain an audio-visual event and which category the event belongs to. To effectively process the critical mutual information in the audio and visual modalities at the same time, we propose a multimodal network with cross-modal attention to jointly focus on audio and visual communication. Notably, we use the audio-visual fusion features as guidance to focus on the local regions related to the event in the visual area while enhancing the corresponding audio features. The semantic information expressed by segments of different modalities in a video varies in importance. Hence, we adopt a heterogeneous graph-like approach to aggregate features from the other two-modal segments. To address the impact of unsynchronized audio and visual information, our proposed Multimodal Segment Interaction Module (MSIM) establishes the connection between different segments by setting a similarity threshold, which can filter out the useless information brought by other segments. Furthermore, we introduce a gating mechanism to explore the cross-modality relationships without neglecting the intra-modality relation, which is vital for improving the final recognition performance. Extensive experiments on the AVE dataset show that our proposed method performs favorably against current methods in both supervised and weakly supervised settings.

Real-time Embedded Demo System for Fall Detection under 15W Power

  • Junyi Lu
  • Wenxiang Jiang
  • Yang Xiao
  • Tingbing Yan
  • Zhiguo Cao
  • Zhiwen Fang
  • Joey Tianyi Zhou

Falls are one of the major threats to the safety and quality of life of the elderly and patients. In this paper, an embedded demo system for real-time spatial-temporal fall detection on demo video is proposed. It is built upon Microsoft Kinect V2 for 3D visual sensing, which is insensitive to illumination change. Meanwhile, an NVIDIA Jetson AGX Xavier running under 15W power serves as the embedded computing processor. To fit the proposed demo system, a novel real-time spatial-temporal fall detection approach based on deep learning is also proposed, with end-to-end running capacity. To our knowledge, this is the first spatial-temporal fall detection method that runs end to end. In particular, it formulates the spatial-temporal fall detection problem as an object-detection-like task. It first compresses the depth video clip captured by a multi-scale temporal sliding window into a compact dynamic image that retains the rich motion information of the fall. The fall is then detected on the dynamic image with YOLOv3-Tiny, which offers high running efficiency and promising effectiveness from the spatial perspective. Besides real-time live running capacity of our demo system, the experiments on 1 challenging datasets also verify the superiority of the proposed spatial-temporal fall detection approach.
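The compression step in this abstract, turning a clip into one dynamic image, can be sketched as follows. The abstract does not say how the dynamic image is computed, so this sketch assumes the simplified approximate rank pooling weights alpha_t = 2t - T - 1 known from the dynamic image literature. The function names, window sizes, and stride are hypothetical, and frames are plain nested lists rather than real depth maps.

```python
def dynamic_image(frames):
    """Collapse T frames into one image as a weighted sum over time.

    Assumed weights: alpha_t = 2t - T - 1 for t = 1..T, so later frames
    count positively and earlier frames negatively.
    """
    T = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - T - 1
        for r in range(h):
            for c in range(w):
                out[r][c] += alpha * frame[r][c]
    return out


def sliding_windows(num_frames, window_sizes, stride):
    """Yield (start, end) frame ranges for every window size (multi-scale)."""
    for size in window_sizes:
        for start in range(0, num_frames - size + 1, stride):
            yield start, start + size


# Toy clip: 4 frames of 1x2 "depth" values. The first pixel changes over
# time (motion), the second stays constant (static background).
clip = [[[0.0, 5.0]], [[1.0, 5.0]], [[2.0, 5.0]], [[3.0, 5.0]]]
di = dynamic_image(clip)
windows = list(sliding_windows(6, window_sizes=[4, 6], stride=2))
```

Because the assumed weights sum to zero, static pixels cancel out and only pixels that change over the window survive, which is why a single 2D detector can then be applied to the result.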

Face Clustering via Adaptive Aggregation of Clean Neighbors

  • Shiyong Hong
  • Yaobin Zhang
  • Xu Ling
  • Weihong Deng
  • Yunfeng Yin
  • Yingjie Zhang
  • Hongzhi Shi
  • Dongchao Wen

Face clustering has been widely studied to solve the problem of data annotation in large-scale unlabeled face images. In recent years, state-of-the-art performance has advanced every year through the application of Graph Convolutional Networks (GCN) to face clustering tasks. Existing GCN-based methods make each node accept the feature information from its neighbors and then aggregate the neighbors' information with equal weights to learn an enhanced feature embedding. However, little attention has been paid to improving the quality of the aggregated information. In this paper, we aim to make each node aggregate the feature information that is more conducive to clustering. The proposed method, named Adaptive Aggregation of Clean Neighbors (AACN), has two stages of preparation before the graph is input into the GCN. Specifically, we first design a noise edge cleaner to remove the wrong neighbors of each node so that each node receives more accurate neighbor information. Then, we carefully allocate adaptive weights to the clean neighbors of each node and make all nodes aggregate the received information via adaptive aggregation instead of mean aggregation. The two-stage preparation enables nodes to learn more robust features through the GCN module. Experiments on the standard face clustering benchmark MS1M show that AACN achieves state-of-the-art performance, significantly boosting the pairwise F-score from 92.79% to 93.72% on 584K unlabeled face images and from 83.99% to 86.41% on 5.21M unlabeled face images.
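The contrast between mean aggregation and adaptive aggregation over cleaned neighbors can be made concrete with a small sketch. This is not the AACN implementation: the cosine-similarity threshold standing in for the noise edge cleaner, the softmax weighting, the temperature value, and all function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def clean_neighbors(node, neighbors, threshold=0.5):
    """Hypothetical noise edge cleaner: drop neighbors that are too dissimilar."""
    return [n for n in neighbors if cosine(node, n) >= threshold]


def mean_aggregate(neighbors):
    """Baseline: every neighbor contributes with equal weight."""
    dim = len(neighbors[0])
    return [sum(n[i] for n in neighbors) / len(neighbors) for i in range(dim)]


def adaptive_aggregate(node, neighbors, temperature=0.1):
    """Assumed adaptive weights: softmax over similarity to the node."""
    sims = [cosine(node, n) for n in neighbors]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node)
    return [sum(w * n[i] for w, n in zip(weights, neighbors)) for i in range(dim)]


node = [1.0, 0.0]
neighbors = [[0.9, 0.1], [0.8, 0.6], [-1.0, 0.2]]  # the last one is a noisy edge
kept = clean_neighbors(node, neighbors)
mean_feat = mean_aggregate(kept)
adaptive_feat = adaptive_aggregate(node, kept)
```

With these toy values the noisy third neighbor is removed, and the adaptive result sits closer to the node's own direction than the plain mean does, because the more similar neighbor receives the larger weight.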

Cross-modal Token Selection for Video Understanding

  • Liyong Pan
  • Zechao Li
  • Henghao Zhao
  • Rui Yan

Multi-modal action recognition is an essential task in human-centric machine learning. Humans perceive the world by processing and fusing information from multiple modalities such as vision and audio. We introduce a novel transformer-based multi-modal architecture that outperforms existing state-of-the-art methods while significantly reducing the computational cost. The key to our idea is a Token-Selector module that collates and condenses the most useful token combinations and shares only what is necessary for cross-modal modeling. We conduct extensive experiments on multiple multi-modal benchmark datasets and achieve state-of-the-art performance under similar experimental conditions while reducing computational cost by 30 percent. Extensive ablation studies showcase the benefits of our improved method over naive approaches.