CoVieW'18: Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild

SESSION: Keynote & Invited Talks

Session details: Keynote & Invited Talks

Kwanghoon Sohn

Deep Video Understanding: Representation Learning, Action Recognition, and Language Generation

Tao Mei

Analyzing videos has been one of the fundamental problems of computer vision and multimedia analysis for decades. The task is very challenging because video is an information-intensive medium with large variations and complexities. Thanks to the recent development of deep learning techniques, researchers in both the computer vision and multimedia communities are now able to boost the performance of video analysis significantly and to initiate new research directions for analyzing video content. This talk will cover recent advances under the umbrella of video understanding, starting from basic networks that are widely adopted in state-of-the-art deep learning pipelines, moving to fundamental challenges of video representation learning and video classification/recognition, and ending with the emerging area of video and language.

Actor and Observer: Joint Modeling of First and Third-Person Videos

Karteek Alahari

Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor). Despite this, learning such models for human action recognition has not been well studied. We address this challenge by introducing Charades-Ego, a large-scale dataset of paired first-person and third-person videos, and presenting a formulation to learn a joint representation of actions from these two perspectives. This talk will present this dataset and our actor-observer model.

Explore Multi-Step Reasoning in Video Question Answering

Yahong Han

This invited talk gives a more detailed presentation of a paper accepted at ACM-MM 2018. Video question answering (VideoQA) always involves visual reasoning. When answering questions composed of multiple logical correlations, models need to perform multi-step reasoning. In this paper, we formulate multi-step reasoning in VideoQA as a new task: answering compositional and logically structured questions based on video content. Existing VideoQA datasets are inadequate as benchmarks for multi-step reasoning because of limitations such as a lack of logical structure and the presence of language biases. Thus we design a system to automatically generate a large-scale dataset, namely SVQA (Synthetic Video Question Answering). Compared with other VideoQA datasets, SVQA contains exclusively long and structured questions with various spatial and temporal relations between objects. More importantly, questions in SVQA can be decomposed into human-readable logical tree or chain layouts, each node of which represents a sub-task requiring a reasoning operation such as comparison or arithmetic. Towards automatic question answering in SVQA, we develop a new VideoQA model. In particular, we construct a new attention module, which contains a spatial attention mechanism to address the crucial and multiple logical sub-tasks embedded in questions, as well as a refined GRU called ta-GRU (temporal-attention GRU) to capture long-term temporal dependencies and gather complete visual cues. Experimental results show the multi-step reasoning capability of SVQA and the effectiveness of our model compared with other existing models.
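
The idea of decomposing a question into a logical tree layout can be illustrated with a minimal sketch. Everything below is hypothetical: the object attributes, the operation names (`filter_by`, `count`, `more_than`), and the example question are assumptions made for illustration and are not taken from SVQA or from the authors' model. Each function plays the role of one node in the layout.

```python
from typing import Dict, List

Obj = Dict[str, str]  # hypothetical object record, e.g. {"color": "red", "shape": "cube"}


def filter_by(objects: List[Obj], attribute: str, value: str) -> List[Obj]:
    """Leaf-level node: keep only the objects whose attribute equals value."""
    return [o for o in objects if o.get(attribute) == value]


def count(objects: List[Obj]) -> int:
    """Arithmetic node: number of objects that reached this node."""
    return len(objects)


def more_than(left: int, right: int) -> str:
    """Comparison node at the root of the tree."""
    return "yes" if left > right else "no"


def answer_more_red_than_cubes(objects: List[Obj]) -> str:
    """Tree layout for the hypothetical question 'Are there more red objects than cubes?'.

    Two filter->count branches feed one comparison node.
    """
    red = count(filter_by(objects, "color", "red"))
    cubes = count(filter_by(objects, "shape", "cube"))
    return more_than(red, cubes)
```

A chain layout would be the special case in which each node has a single child, for example a filter followed by a count.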

SESSION: Session 1: Regular Track

Session details: Session 1: Regular Track

Kwanghoon Sohn

Joint Object Tracking and Segmentation with Independent Convolutional Neural Networks

Hakjin Lee
Jongbin Ryu
Jongwoo Lim

Object tracking and segmentation are important research topics in computer vision. They provide the trajectory and boundary of an object based on its appearance and shape features. Most studies on tracking and segmentation focus on methods for encoding the features of an object. However, the tracking trajectory and segmentation mask are acquired separately, although both methods require similar visual information. Therefore, in this paper, we propose a CNN-based joint object tracking and segmentation framework that provides a segmentation mask while improving the performance of the object tracker. In our model, the tracking model determines the trajectory of the target object as a bounding box in each frame. Given the bounding box in each frame, the segmentation model predicts a dense mask of the target object within the bounding box. Then, the segmentation mask is used to refine the bounding box for the tracking model. We evaluate our algorithm on the DAVIS benchmark dataset using AUC score and mean IoU, and show that the proposed framework improves the performance of the original tracker.
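
The refinement step described above, in which a mask predicted inside the tracker's box is used to tighten that box, might look roughly like the following sketch. It is an illustration under stated assumptions rather than the authors' method: the mask is assumed to be a binary grid covering exactly the tracker box, boxes use inclusive pixel coordinates, and the function names are hypothetical. An IoU helper is included because mean IoU is one of the reported metrics.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel coordinates


def box_from_mask(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest box around all nonzero mask pixels, or None if the mask is empty."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def refine_box(tracker_box: Box, local_mask: List[List[int]]) -> Box:
    """Map the mask-derived box from crop coordinates back to frame coordinates.

    Assumes local_mask covers exactly the tracker box region.
    Falls back to the tracker box when the mask is empty.
    """
    local = box_from_mask(local_mask)
    if local is None:
        return tracker_box
    x0, y0 = tracker_box[0], tracker_box[1]
    return (x0 + local[0], y0 + local[1], x0 + local[2], y0 + local[3])


def iou(a: Box, b: Box) -> float:
    """Intersection over union for inclusive integer pixel boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(iw, 0) * max(ih, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)
```

In a tracking loop, the refined box from one frame would be handed back to the tracker before it processes the next frame.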

Stereo Vision aided Image Dehazing using Deep Neural Network

Jeong-Yun Na
Kuk-Jin Yoon

Image deterioration due to haze is one of the factors that degrade the performance of computer vision algorithms. Haze absorbs and reflects the light reflected from an object, distorting the original irradiance. The farther a scene point is from the camera, the more the image tends to deteriorate. Therefore, studies have been conducted to remove haze by estimating the distribution of haze as a function of distance. In this paper, we use a convolutional neural network to simultaneously perform depth estimation and haze removal from stereo images, and use the depth information to help improve haze removal performance. We propose a multitask network in which the encoder learns depth information and dehazing features simultaneously by performing depth estimation and dehazing with two decoders.

Training the network relies on stereo images, so a large number of left and right hazy images is required. However, existing hazy image datasets lack realism because they are created by adding fog components to indoor images. Therefore, we constructed a dataset by adding haze components corresponding to distance information to the KITTI road dataset, which consists of a large number of stereo outdoor driving images. Experimental results show that the proposed network dehazes robustly compared to existing methods across various haze levels, and improves visibility by strengthening the contrast of boundaries in areas made faint by haze.
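
The abstract does not say which haze model was used to build the dataset. One common choice for synthesizing haze from depth is the atmospheric scattering model, in which each pixel is blended toward a global airlight value according to its distance. The sketch below assumes that model; the function names and the default values of `beta` and `airlight` are illustrative and not taken from the paper.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale intensities in [0, 1]


def transmission(depth_m: float, beta: float) -> float:
    """Fraction of scene light that survives travel through haze over depth_m meters."""
    return math.exp(-beta * depth_m)


def add_haze(clear: Image, depth: Image, beta: float = 0.05, airlight: float = 0.9) -> Image:
    """Blend each pixel toward the airlight according to its depth.

    hazy = clear * t + airlight * (1 - t), with t = exp(-beta * depth).
    beta and airlight are illustrative defaults, not values from the paper.
    """
    hazy = []
    for clear_row, depth_row in zip(clear, depth):
        row = []
        for intensity, d in zip(clear_row, depth_row):
            t = transmission(d, beta)
            row.append(intensity * t + airlight * (1.0 - t))
        hazy.append(row)
    return hazy
```

Under this model a pixel at zero depth is unchanged, while a very distant pixel converges to the airlight, which matches the observation that deterioration grows with distance from the camera.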

Learning to Detect, Associate, and Recognize Human Actions and Surrounding Scenes in Untrimmed Videos

Jungin Park
Sangryul Jeon
Seungryong Kim
Jiyoung Lee
Sunok Kim
Kwanghoon Sohn

While recognizing human actions and recognizing surrounding scenes address different aspects of video understanding, the two tasks are strongly correlated, and each can complement the information of the other. In this paper, we propose an approach for joint action and scene recognition, formulated as an end-to-end learning framework based on temporal attention techniques and their fusion. By applying temporal attention modules to a generic feature network, action and scene features are extracted efficiently and then combined into a single feature vector by the proposed fusion module. Our experiments on the CoVieW18 dataset show that our model is able to detect temporal attention with only weak supervision, and that it remarkably improves multi-task action and scene classification accuracies.
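
A minimal sketch of temporal attention followed by fusion is given below. It is not the authors' architecture: the query vectors stand in for learned attention parameters, dot-product scoring with a softmax is assumed, and plain concatenation stands in for the fusion module, whose design the abstract does not describe.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: List[Vector], query: Vector) -> Vector:
    """Weight each frame by softmax(dot(frame, query)) and sum the frames over time."""
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]


def fuse(action_feat: Vector, scene_feat: Vector) -> Vector:
    """Simplest possible fusion (an assumption): concatenate the two pooled vectors."""
    return action_feat + scene_feat


def joint_feature(frames: List[Vector], action_query: Vector, scene_query: Vector) -> Vector:
    """Pool the same frame features twice, once per task, then fuse the results."""
    return fuse(attention_pool(frames, action_query), attention_pool(frames, scene_query))
```

Using a separate query per task lets the action branch and the scene branch attend to different frames of the same untrimmed video.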

SESSION: Session 2: Challenge Track

Session details: Session 2: Challenge Track

Kwanghoon Sohn

Multi-task Joint Learning for Videos in the Wild

Yong Won Hong
Hoseong Kim
Hyeran Byun

Most conventional state-of-the-art methods for video analysis achieve outstanding performance by combining two or more different inputs, e.g., an RGB image, a motion image, or an audio signal, in a two-stream manner. Although these approaches perform well, they implicitly assume that every feature is equally important for classifying a video. This obscures the fact that each class depends on different levels of information from different features. To account for the nature of each class, we present class nature specific fusion, which combines the features with class-dependent weights to obtain the best result for each class. In this work, we first represent each frame-level video feature as a spectral image to train convolutional neural networks (CNNs) on the RGB and audio features. We then revise the conventional two-stream fusion method into a class nature specific one by combining features with different weights for different classes. We evaluate our method on the Comprehensive Video Understanding in the Wild dataset to understand how each class responds to each feature in wild videos. Our experimental results not only show an advantage over conventional two-stream fusion, but also illustrate the correlation between the two features, RGB and audio, for each class.
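
The contrast between conventional two-stream fusion and class-dependent weighting can be sketched as follows. This is an illustration only: the abstract does not specify how the weights are obtained or applied, so a convex combination of per-class scores is assumed, and the function names and the `rgb_weight` parameter are hypothetical.

```python
from typing import List


def uniform_fusion(rgb: List[float], audio: List[float]) -> List[float]:
    """Conventional late fusion: every class averages the two streams equally."""
    return [0.5 * r + 0.5 * a for r, a in zip(rgb, audio)]


def class_specific_fusion(rgb: List[float], audio: List[float],
                          rgb_weight: List[float]) -> List[float]:
    """Per-class late fusion: class c trusts RGB with rgb_weight[c] and audio with the rest."""
    return [w * r + (1.0 - w) * a for r, a, w in zip(rgb, audio, rgb_weight)]


def argmax(scores: List[float]) -> int:
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With per-class scores `rgb = [0.3, 0.8]` and `audio = [0.9, 0.3]`, uniform fusion favors the first class, whereas weights such as `rgb_weight = [0.8, 0.9]` favor the second, showing how the weighting can change the prediction when a class is known to be mostly visual.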

New Feature-level Video Classification via Temporal Attention Model

Hongje Seong
Junhyuk Hyun
Suhyeon Lee
Suhan Woo
Hyunbae Chang
Euntai Kim

CoVieW 2018 is a new challenge that aims at simultaneous scene and action recognition for untrimmed video [1]. In the challenge, frame-level video features extracted by a pre-trained deep convolutional neural network (CNN) are provided for video-level classification. In this paper, a new approach to video-level classification is proposed. The proposed method focuses on analysis in the temporal domain, and a temporal attention model is developed. To compensate for differences in video length, a temporal padding method is also developed to unify the lengths of the videos. Further, data augmentation is performed to improve validation accuracy. Finally, on the train/validation split of the CoVieW 2018 dataset, we recorded 95.53% accuracy for scene and 87.17% accuracy for action using the temporal attention model, nonzero padding, and data augmentation. The top-1 Hamming score is the standard metric in the CoVieW 2018 challenge, and the proposed method obtains 91.35%.
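
Two of the ingredients above lend themselves to a short sketch. The abstract does not say how the nonzero padding fills the extra time steps, so the version below assumes the frame sequence is simply looped; that choice, like the function names, is an assumption. The Hamming score is written as the fraction of (scene, action) label slots predicted correctly, which for one scene and one action label per video equals the mean of the two accuracies. The reported numbers are consistent with that reading, since (95.53 + 87.17) / 2 = 91.35.

```python
from typing import List, Sequence, Tuple


def pad_by_repeating(frames: List[Sequence[float]], target_len: int) -> List[Sequence[float]]:
    """Loop the frame sequence until it reaches target_len; truncate if it is longer."""
    if not frames:
        raise ValueError("cannot pad an empty video")
    return [frames[i % len(frames)] for i in range(target_len)]


def hamming_score(predictions: List[Tuple[str, str]], labels: List[Tuple[str, str]]) -> float:
    """Fraction of (scene, action) label slots predicted correctly, over all videos."""
    correct = sum(
        (p_scene == t_scene) + (p_action == t_action)
        for (p_scene, p_action), (t_scene, t_action) in zip(predictions, labels)
    )
    return correct / (2 * len(labels))
```

Looping keeps every padded time step filled with real feature values, so an attention model over time never has to learn to ignore all-zero frames.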

Video Understanding via Convolutional Temporal Pooling Network and Multimodal Feature Fusion

Heeseung Kwon
Suha Kwak
Minsu Cho

In this paper, we present a new end-to-end convolutional neural network architecture for video classification, and apply the model to action and scene recognition in untrimmed videos for the Challenge on Comprehensive Video Understanding in the Wild. The proposed architecture takes densely sampled video frames as input and applies a temporal pooling operator inside the network to capture the temporal context of the input video. As a result, our architecture outputs distinct video-level features with a set of different temporal pooling operators. Furthermore, we design a multimodal feature fusion model by concatenating our video-level features with those given in the challenge dataset. Experimental results on the challenge dataset demonstrate that the proposed architecture and the multimodal feature fusion approach together achieve outstanding performance in action and scene recognition.
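
The combination of several temporal pooling operators with concatenation-based fusion can be sketched as below. The abstract does not name the pooling operators, so mean and max pooling are assumed here, and the pooling is applied to plain feature lists rather than inside a convolutional network. All names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def pool_over_time(frames: List[Vector], op: Callable[[List[float]], float]) -> Vector:
    """Collapse a (time x dim) feature sequence to one vector by applying op per dimension."""
    return [op(list(column)) for column in zip(*frames)]


def mean(values: List[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def video_descriptor(frames: List[Vector], provided: Vector) -> Vector:
    """Concatenate mean-pooled and max-pooled features with externally provided features.

    `provided` stands for video-level features supplied with the dataset.
    """
    pooled_mean = pool_over_time(frames, mean)
    pooled_max = pool_over_time(frames, max)
    return pooled_mean + pooled_max + provided
```

Each pooling operator summarizes the sequence differently (mean captures the typical response, max captures the strongest one), which is one reason to keep their outputs as distinct parts of the descriptor.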