PIC '22: Proceedings of the 4th on Person in Context Workshop

SESSION: Session 1: Human-centric Spatio-Temporal Video Grounding (HC-STVG) Challenge

STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding

  • Zihang Lin
  • Chaolei Tan
  • Jian-Fang Hu
  • Zhi Jin
  • Tiancai Ye
  • Wei-Shi Zheng

In this technical report, we introduce our solution to the human-centric spatio-temporal video grounding task. We propose a concise and effective framework named STVGFormer, which models spatio-temporal visual-linguistic dependencies with a static branch and a dynamic branch. The static branch performs cross-modal understanding within a single frame and learns to localize the target object spatially according to intra-frame visual cues such as object appearance. The dynamic branch performs cross-modal understanding across multiple frames and learns to predict the starting and ending time of the target moment according to dynamic visual cues such as motion. Both branches are designed as cross-modal transformers. We further design a novel static-dynamic interaction block that enables the two branches to exchange useful and complementary information, which proves effective in improving predictions on hard cases. Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the 4th Person in Context Challenge.
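The vIoU score reported above is the standard metric for spatio-temporal grounding: the per-frame box IoU, summed over the frames where the predicted and ground-truth tubes overlap in time, divided by the number of frames in their temporal union. A minimal sketch of that computation, assuming tubes are represented as dicts mapping frame index to an (x1, y1, x2, y2) box:

```python
def box_iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def viou(pred_tube, gt_tube):
    """pred_tube, gt_tube: dict frame_index -> box.
    Sum per-frame IoU over the temporal intersection,
    normalize by the size of the temporal union."""
    inter_frames = set(pred_tube) & set(gt_tube)
    union_frames = set(pred_tube) | set(gt_tube)
    if not union_frames:
        return 0.0
    total = sum(box_iou(pred_tube[t], gt_tube[t]) for t in inter_frames)
    return total / len(union_frames)
```

A prediction that is spatially perfect on its frames but temporally misaligned is still penalized, since non-overlapping frames enlarge the denominator.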

Cascaded Decoding and Multi-Stage Inference for Spatio-Temporal Video Grounding

  • Li Yang
  • Peixuan Wu
  • Chunfeng Yuan
  • Bing Li
  • Weiming Hu

Human-centric spatio-temporal video grounding (HC-STVG) is a challenging task that aims to localize the spatio-temporal tube of a target person in a video based on a natural language description. In this report, we present our approach to this task. Specifically, building on the TubeDETR framework, we propose two cascaded decoders that decouple spatial and temporal grounding, allowing the model to capture the features most favorable to each of the two grounding subtasks. We also devise a multi-stage inference strategy that reasons about the target in a coarse-to-fine manner, producing more precise grounding results. To further improve accuracy, we propose a model ensemble strategy that combines models that perform best on spatial or temporal grounding, respectively. We validated the effectiveness of our method on the HC-STVG 2.0 dataset and won second place in the HC-STVG track of the 4th Person in Context (PIC) workshop at ACM MM 2022.
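The report does not spell out the multi-stage inference procedure; the general coarse-to-fine pattern can be sketched as follows, under the assumption (mine, not the authors') of a per-frame relevance curve: first pick the coarse window with the highest mean score, then refine the boundaries around it by thresholding.

```python
def coarse_to_fine_localize(scores, window=8):
    """Illustrative two-stage temporal localization.
    scores: per-frame relevance scores (list of floats).
    Stage 1: pick the coarse window with the highest mean score.
    Stage 2: refine start/end by expanding while scores stay high."""
    n = len(scores)
    best_start, best_mean = 0, float("-inf")
    for s in range(0, n, window):
        chunk = scores[s:s + window]
        mean = sum(chunk) / len(chunk)
        if mean > best_mean:
            best_start, best_mean = s, mean
    lo, hi = best_start, min(best_start + window, n)
    thresh = 0.5 * best_mean
    while lo > 0 and scores[lo - 1] >= thresh:
        lo -= 1
    while hi < n and scores[hi] >= thresh:
        hi += 1
    return lo, hi  # half-open frame interval [lo, hi)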

Human-centric Spatio-Temporal Video Grounding via the Combination of Mutual Matching Network and TubeDETR

  • Fan Yu
  • Zhixiang Zhao
  • Yuchen Wang
  • Yi Xu
  • Tongwei Ren
  • Gangshan Wu

In this technical report, we present our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution is built on TubeDETR and the Mutual Matching Network (MMN). Specifically, TubeDETR exploits a video-text encoder and a space-time decoder to predict the starting time, the ending time, and the tube of the target person. MMN detects persons in images, links them into tubes, extracts features of the person tubes and the text description, and predicts the similarities between them to choose the most likely person tube as the grounding result. Our solution then refines the final result by combining the spatial localization of MMN with the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieved third place.
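The combination step described above reduces to a simple operation: keep MMN's per-frame boxes only inside TubeDETR's predicted temporal span. A minimal sketch (representations are assumptions, not the authors' exact data structures):

```python
def combine_grounding(mmn_tube, t_start, t_end):
    """Fuse spatial and temporal predictions.
    mmn_tube: dict frame_index -> box, MMN's spatial localization.
    [t_start, t_end]: inclusive span from TubeDETR's temporal localization.
    Returns the tube restricted to the predicted temporal span."""
    return {t: box for t, box in mmn_tube.items() if t_start <= t <= t_end}
```

This lets each model contribute the subtask it is stronger at: MMN supplies where the person is, TubeDETR supplies when the described moment happens.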

SESSION: Session 2: Make-up Dense Video Captioning (MDVC) Challenge

PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning

  • Yifan Lu
  • Ziqi Zhang
  • Yuxin Chen
  • Chunfeng Yuan
  • Bing Li
  • Weiming Hu

The task of Dense Video Captioning (DVC) aims to generate captions with timestamps for multiple events in one video. Semantic information plays an important role in both the localization and the description subtasks of DVC. We present a semantic-assisted dense video captioning model based on an encoding-decoding framework. In the encoding stage, we design a concept detector to extract semantic information, which is then fused with multi-modal visual features to sufficiently represent the input video. In the decoding stage, we design a classification head, parallel to the localization and captioning heads, to provide semantic supervision. Our method achieves significant improvements on the YouMakeup dataset [Wang et al., 2019] under DVC evaluation metrics and achieves high performance in the Makeup Dense Video Captioning (MDVC) task of the PIC 4th Challenge (http://picdataset.com/challenge/task/mdvc/).
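The fusion of detected concept scores with visual features can take many forms; one minimal sketch, assuming (my assumption, not the paper's stated design) that the top-scoring concept probabilities are appended to each clip's visual feature vector:

```python
def fuse_semantics(visual_feats, concept_probs, top_k=3):
    """Illustrative semantic-visual fusion.
    visual_feats: list of per-clip feature vectors (lists of floats).
    concept_probs: list of (concept_name, probability) pairs from
    a concept detector. Keep the top_k probabilities and append
    them to every clip's visual feature vector."""
    top = sorted(concept_probs, key=lambda cp: cp[1], reverse=True)[:top_k]
    semantic_vec = [p for _, p in top]
    return [feat + semantic_vec for feat in visual_feats]
```

Concatenation is the simplest choice; attention-based fusion over the concept embeddings would be a natural richer alternative.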

Fine-grained Video Captioning via Precise Key Point Positioning

  • Yunjie Zhang
  • Tiangyang Xu
  • Xiaoning Song
  • Zhenghua Feng
  • Xiao-Jun Wu

In recent years, a variety of excellent dense video captioning models have emerged. However, most of these models focus on global features and salient events in a video. In the makeup dataset used in this competition, the video content is highly similar across videos, with only slight variations, and because existing models lack the ability to attend to fine-grained features, they do not generate captions well. Motivated by this, we propose a key-point detection algorithm for the human face and hands that runs synchronously with video frame extraction, and we encapsulate the detected auxiliary features into the existing features so that an existing video captioning system can focus on fine-grained cues. To further improve caption quality, we use the TSP model to extract more effective video features. Our model outperforms the baseline.

SESSION: Session 3: Make-up Temporal Video Grounding (MTVG) Challenge

Exploiting Feature Diversity for Make-up Temporal Video Grounding

  • Xiujun Shu
  • Wei Wen
  • Taian Guo
  • Sunan He
  • Chen Wu
  • Ruizhi Qiao

This technical report presents the third-place solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022. MTVG aims at localizing the temporal boundary of a make-up step in an untrimmed video based on a textual description. The biggest challenge of this task is the fine-grained video-text semantics of make-up steps. However, current methods mainly extract video features using action-based pre-trained models. As actions are more coarse-grained than make-up steps, action-based features are not sufficient to provide fine-grained cues. To address this issue, we propose to achieve fine-grained representation by exploiting feature diversity. Specifically, we propose a series of methods spanning feature extraction, network optimization, and model ensembling. As a result, we achieved third place in the MTVG competition.
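The ensemble over models trained on diverse features can be sketched, in its simplest score-level form, as a weighted average of the per-frame relevance curves produced by each model (the function and its signature are illustrative assumptions, not the report's implementation):

```python
def ensemble_scores(per_model_scores, weights=None):
    """Illustrative score-level ensemble.
    per_model_scores: list of equal-length per-frame score lists,
    one per model (each trained on different features).
    weights: optional per-model weights; defaults to a uniform average."""
    m = len(per_model_scores)
    weights = weights or [1.0 / m] * m
    n = len(per_model_scores[0])
    return [sum(w * s[t] for w, s in zip(weights, per_model_scores))
            for t in range(n)]
```

Averaging diverse score curves suppresses each individual model's spurious peaks, which is precisely where feature diversity pays off for fine-grained steps.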