MM '19: Proceedings of the 27th ACM International Conference on Multimedia

SESSION: Keynote I

Using Artificial Intelligence to Preserve Audiovisual Archives: New Horizons, More Questions

Jean Carrive

France has a long tradition of preserving its archives as well as its cultural heritage, as demonstrated by the "legal deposit". Established in the Renaissance for printed documents, the legal deposit aims to allow the collection and consultation of various kinds of documents. INA, the French National Audiovisual Institute [1], is in charge of this task for France's radio and television, as well as French media on the web. INA's mission is to make the most of its collections: commercially by selling programs, and academically by making these collections available to researchers working on humanities and social sciences.

Since its creation in 1975, INA has constantly developed its tools and methodologies for describing and documenting its collections: databases, thesauri, lexicons, documentation software, indexing methods, search engines, etc. Its Research and Innovation Department has for many years been interested in partnering with academic laboratories to explore the possibilities of automatic content analysis technologies. The emergence of AI-derived technologies is now making it possible to consider new uses of these collections, but also raises new questions.

INA has thus demonstrated that it is now possible to process large audiovisual corpora at scale to identify various kinds of information, thus facilitating indexing, documentation and search in order to provide better services to users. For researchers in humanities and social sciences working on these resources at Inathèque de France, these new means of analysis make it possible to conduct new types of Digital Humanities investigations, but also introduce new methodological challenges. For INA's archivists and librarians, AI's assistance facilitates the documentation process but also poses questions about the impact of these technologies on professional practices, as well as on the scalability of these technologies over time.

The presentation will address these questions, building on the research projects and experiments carried out at INA.

SESSION: Session 1A: Multimodal Fusion & Visual Relations

Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

Chunxiao Liu
Zhendong Mao
An-An Liu
Tianzhu Zhang
Bin Wang
Yongdong Zhang

Learning semantic correspondence between image and text is significant as it bridges the semantic gap between vision and language. The key challenge is to accurately find and correlate shared semantics in image and text. Most existing methods achieve this goal by representing the shared semantic as a weighted combination of all the fragments (image regions or text words), where fragments relevant to the shared semantic obtain more attention and irrelevant ones less. However, although relevant fragments contribute more to the shared semantic, irrelevant ones still disturb it to some degree, and thus lead to semantic misalignment in the correlation phase. To address this issue, we present a novel Bidirectional Focal Attention Network (BFAN), which not only attends to relevant fragments but also diverts all the attention to these relevant fragments to concentrate on them. The main difference from existing works is that they mostly focus on learning attention weights, while our BFAN focuses on eliminating irrelevant fragments from the shared semantic. The focal attention is achieved by preassigning attention based on inter-modality relation, identifying relevant fragments based on intra-modality relation, and reassigning attention. Furthermore, the focal attention is jointly applied in both image-to-text and text-to-image directions, which avoids a preference for long text or complex images. Experiments show our simple but effective framework significantly outperforms the state of the art, with relative Recall@1 gains of 2.2% on both the Flickr30K and MSCOCO benchmarks.
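
The preassign / identify / reassign steps named in this abstract can be illustrated with a minimal sketch. This is not the BFAN implementation: the function name `focal_attention`, the softmax used for preassignment, and the "at least average weight" relevance rule are all assumptions made for illustration (the abstract says relevant fragments are identified from intra-modality relations but does not give the rule).

```python
import math

def focal_attention(scores):
    """Toy focal attention over one query's fragment scores.

    Assumption: `scores` holds inter-modality similarities, e.g. one word
    compared against each image region. Names and rules are hypothetical.
    """
    # Step 1 (preassign): ordinary softmax attention over all fragments.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    pre = [e / total for e in exps]

    # Step 2 (identify): hypothetical relevance rule, keep fragments whose
    # preassigned weight is at least the uniform (mean) weight.
    mean_weight = 1.0 / len(pre)
    mask = [1.0 if w >= mean_weight else 0.0 for w in pre]

    # Step 3 (reassign): drop irrelevant fragments and renormalise so that
    # all attention is concentrated on the kept ones.
    kept = [w * k for w, k in zip(pre, mask)]
    norm = sum(kept)  # > 0 because the largest weight is always >= the mean
    return [w / norm for w in kept]

weights = focal_attention([2.0, 1.9, -1.0, -3.0])
```

In this toy run the two low-scoring fragments receive exactly zero weight, rather than the small non-zero weight a plain softmax would leave them.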

Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking

Tan Wang
Xing Xu
Yang Yang
Alan Hanjalic
Heng Tao Shen
Jingkuan Song

A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or classification, the first one mapping image and text instances into a common embedding space for distance measuring, and the second one regarding image-text matching as a binary classification problem. Neither of these approaches can, however, balance the matching accuracy and model complexity well. We propose a novel framework that achieves remarkable matching performance with acceptable model complexity. Specifically, in the training stage, we propose a novel Multi-modal Tensor Fusion Network (MTFN) to explicitly learn an accurate image-text similarity function with rank-based tensor fusion rather than seeking a common embedding space for each image-text instance. Then, during testing, we deploy a generic Cross-modal Re-ranking (RR) scheme for refinement without requiring additional training procedure. Extensive experiments on two datasets demonstrate that our MTFN-RR consistently achieves the state-of-the-art matching performance with much less time complexity.

Structured Stochastic Recurrent Network for Linguistic Video Prediction

Shijie Yang
Liang Li
Shuhui Wang
Dechao Meng
Qingming Huang
Qi Tian

Intelligent machines are expected to have the capability of predicting impending occurrences. Inspired by video frame prediction and video captioning, we introduce a new task of Linguistic Video Prediction (LVP), which aims to predict the forthcoming events based on past video content and generate corresponding linguistic descriptions. Different from traditional video captioning, which describes one event that has already happened, LVP is an open task involving one-to-many mappings between past and future. It explores different visual clues and associates them with potential events to generate corresponding descriptions. To address this task, we propose an end-to-end probabilistic approach named structured stochastic recurrent network (SRN) to characterize the one-to-many connections between past visual clues and possible future events. Specifically, we first propose hierarchical-structured latent variables to represent the choice of event theme. Second, we introduce a stochastic attention module to capture the variations of the focused visual clues. Given a video, our model is able to generate multiple linguistic predictions by focusing on different event themes and visual clues. Experiments on the ActivityNet dataset show that the proposed model not only yields more informative predictions as measured by BLEU, METEOR, ROUGE-L, CIDEr and SPICE scores, but also generates significantly more diverse predictions with higher recall rates in hitting the ground truth.

Visual Relationship Detection with Relative Location Mining

Hao Zhou
Chongyang Zhang
Chuanping Hu

Visual relationship detection, a challenging task that aims to find and distinguish the interactions between object pairs in an image, has received much attention recently. In this work, we propose a novel visual relationship detection framework that deeply mines and utilizes the relative location of each object pair in every stage of the procedure. In both stages, relative location information of each object pair is abstracted and encoded as an auxiliary feature to improve the distinguishing capability of object-pair proposing and predicate recognition, respectively. Moreover, a Gated Graph Neural Network (GGNN) is introduced to mine and measure the relevance of predicates using relative location. With the location-based GGNN, non-exclusive predicates with similar spatial positions can first be clustered and then smoothed with close classification scores, so that top-n recall can be further increased. Experiments on two widely used datasets, VRD and VG, show that, by deeply mining and exploiting relative location information, our proposed model significantly outperforms the current state of the art.

Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning

Tong Yu
Yilin Shen
Ruiyi Zhang
Xiangyu Zeng
Hongxia Jin

Interactive recommenders have demonstrated their advantage over traditional recommenders when items change dynamically. However, traditional user feedback in the form of clicks or ratings provides limited user preference information and limited history tracking capabilities. As a result, it takes a user many interactions to find a desired item. Data of other modalities, such as item visual appearance and user comments in natural language, may enable richer user feedback. However, there are several critical challenges to be addressed when utilizing these multimodal data: multimodal matching, user preference tracking, and adaptation to dynamic unseen items. Without properly handling these challenges, the recommendations can easily violate the users' preferences expressed in their past natural language feedback. In this paper, we introduce a novel approach, called vision-language recommendation, that enables users to provide natural language feedback on visual products to have more natural and effective interactions. To model more explicit and accurate multimodal matching, we propose a novel visual attribute augmented reinforcement learning approach that enhances the grounding of natural language to visual items. Furthermore, to effectively track the users' preferences and overcome the performance deficiency on dynamic unseen items after deployment, we propose a novel history multimodal matching reward to continuously adapt the model on-the-fly. Empirical results show that our system, augmented by visual attributes and history multimodal matching, can significantly increase the success rate, reduce the number of recommendations that violate the user's previous feedback, and require fewer user interactions to find the desired items.

Multi-modal Multi-layer Fusion Network with Average Binary Center Loss for Face Anti-spoofing

Huafeng Kuang
Rongrong Ji
Hong Liu
Shengchuan Zhang
Xiaoshuai Sun
Feiyue Huang
Baochang Zhang

Face anti-spoofing detection is critical to guarantee the security of biometric face recognition systems. Despite extensive advances in facial anti-spoofing based on single-modal images, little work has been devoted to multi-modal anti-spoofing, which is however widely encountered in real-world scenarios. Following the recent progress, this paper mainly focuses on multi-modal face anti-spoofing and aims to solve the following two challenges: (1) how to effectively fuse multi-modal information; and (2) how to effectively learn distinguishable features despite single cross-entropy loss. We propose a novel Multi-modal Multi-layer Fusion Convolutional Neural Network (mmfCNN), which aims to find a discriminative model for recognizing the subtle differences between live and spoof faces. The mmfCNN can fully use different information provided by diverse modalities, which is based on a weight-adaptation aggregation approach. Specifically, we utilize a multi-layer fusion model to further aggregate the features from different layers, which fuses the low-, mid- and high-level information from different modalities in a unified framework. Moreover, a novel Average Binary Center (ABC) loss is proposed to maximize the dissimilarity between the features of live and spoof faces, which helps to stabilize the training to generate a robust and discriminative model. Extensive experiments conducted on the CASIA-SURF and 3DMAD datasets verify the significance and generalization capability of the proposed method for the face anti-spoofing task. Code is available at: https://github.com/SkyKuang/Face-anti-spoofing.

Dual-alignment Feature Embedding for Cross-modality Person Re-identification

Yi Hao
Nannan Wang
Xinbo Gao
Jie Li
Xiaoyu Wang

Person re-identification aims at searching for pedestrians across different cameras, which is a key problem in video surveillance. With requirements in night environments, RGB-infrared person re-identification, which can be regarded as a cross-modality matching problem, has gained increasing attention in recent years. Aside from cross-modality discrepancy, RGB-infrared person re-identification also suffers from human pose and viewpoint differences. We design a dual-alignment feature embedding method to extract discriminative modality-invariant features. The concept of dual alignment is twofold: spatial and modality alignment. We adopt part-level features to extract fine-grained camera-invariant information. We introduce a distribution loss function and a correlation loss function to align the embedding features across visible and infrared modalities. Finally, we can extract modality-invariant features with robust and rich identity embeddings for cross-modality person re-identification. Experiments confirm that the proposed baseline and improvement achieve results competitive with state-of-the-art methods on two datasets. For instance, we achieve (57.5+12.6)% rank-1 accuracy and (57.3+11.8)% mAP on the RegDB dataset.

Video Text Detection by Attentive Spatiotemporal Fusion of Deep Convolutional Features

Lan Wang
Jiahao Shi
Yang Wang
Feng Su

Scene text in videos carries rich semantic information and plays an important role in various content-based video applications. Compared to text in static images, scene text in videos exhibits some distinct characteristics such as motion blur and temporal redundancy, which bring additional difficulties as well as exploitable clues to the text detection task. In this paper, we propose a novel end-to-end deep neural network for detecting scene text in the video, which combines complementary text features from multiple related frames to enhance the overall detection performance relative to single-frame detection schemes. Specifically, we first extract descriptive features from each video frame using a hierarchical convolutional neural network. Next, we spatiotemporally sample and warp supplementary features from adjacent frames surrounding the current frame using a multi-scale deformable convolution structure. We then aggregate the sampled features with an attention mechanism to adaptively focus on and augment relevant features and generate an enhanced feature representation of the current frame, which is further fed to the prediction network for localizing text candidates. The proposed model achieves state-of-the-art text detection performance on public scene text video datasets, demonstrating the superiority of the proposed multi-frame feature fusion based video text detection scheme to most single-frame and tracking-based detection schemes.

Cross-Modal Subspace Learning with Scheduled Adaptive Margin Constraints

David Semedo
Joao Magalhaes

Cross-modal embeddings, between textual and visual modalities, aim to organise multimodal instances by their semantic correlations. State-of-the-art approaches use maximum-margin methods, based on the hinge-loss, to enforce a constant margin m, to separate projections of multimodal instances from different categories. In this paper, we propose a novel scheduled adaptive maximum-margin (SAM) formulation that infers triplet-specific constraints during training, therefore organising instances by adaptively enforcing inter-category and inter-modality correlations. This is supported by a scheduled adaptive margin function, that is smoothly activated, replacing a static margin by an adaptively inferred one reflecting triplet-specific semantic correlations while accounting for the incremental learning behaviour of neural networks to enforce category cluster formation and enforcement. Experiments on widely used datasets show that our model improved upon state-of-the-art approaches, by achieving a relative improvement of up to ~12.5% over the second best method, thus confirming the effectiveness of our scheduled adaptive margin formulation.
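
As a rough illustration of replacing a constant margin with a smoothly activated, triplet-specific one, the sketch below pairs a standard hinge-based triplet loss with a sigmoid-gated margin schedule. The abstract does not give the SAM margin function, so `scheduled_margin`, its sigmoid gate, and the `midpoint` and `sharpness` parameters are assumptions made for illustration only.

```python
import math

def scheduled_margin(step, static_margin, adaptive_margin,
                     midpoint=1000, sharpness=0.01):
    """Blend a static margin into a triplet-specific one over training.

    Hypothetical schedule: a sigmoid gate in the training step, so the
    adaptive margin is activated smoothly rather than switched on abruptly.
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (step - midpoint)))
    return (1.0 - gate) * static_margin + gate * adaptive_margin

def triplet_hinge_loss(d_pos, d_neg, margin):
    """Standard max-margin triplet loss on two distances.

    Zero when the negative is farther than the positive by at least `margin`.
    """
    return max(0.0, margin + d_pos - d_neg)

# Early in training the static margin dominates; later the adaptive one does.
early = scheduled_margin(0, static_margin=0.2, adaptive_margin=0.6)
late = scheduled_margin(2000, static_margin=0.2, adaptive_margin=0.6)
loss = triplet_hinge_loss(d_pos=0.5, d_neg=0.3, margin=late)
```

How the triplet-specific `adaptive_margin` itself is inferred from semantic correlations is left open here, since the abstract does not specify it.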

Video Relation Detection with Spatio-Temporal Graph

Xufeng Qian
Yueting Zhuang
Yimeng Li
Shaoning Xiao
Shiliang Pu
Jun Xiao

What we perceive from visual content are not only collections of objects but the interactions between them. Visual relations, denoted by the triplet

Effective Sentiment-relevant Word Selection for Multi-modal Sentiment Analysis in Spoken Language

Dong Zhang
Shoushan Li
Qiaoming Zhu
Guodong Zhou

Computational modeling of human spoken language is an emerging research area in multimedia analysis spanning across the text and acoustic modalities. Multi-modal sentiment analysis is one of the most fundamental tasks in human spoken language understanding. In this paper, we propose a novel approach to selecting effective sentiment-relevant words for multi-modal sentiment analysis with focus on both the textual and acoustic modalities. Unlike the conventional soft attention mechanism, we employ a deep reinforcement learning mechanism to perform sentiment-relevant word selection and fully remove invalid words of each modality for multi-modal sentiment analysis. Specifically, we first align the raw text and audio at the word level and extract independent handcraft features for each modality to yield the textual and acoustic word sequence. Second, we establish two collaborative agents to deal with the textual and acoustic modalities in spoken language respectively. On this basis, we formulate the sentiment-relevant word selection process in a multi-modal setting as a multi-agent sequential decision problem and solve it with a multi-agent reinforcement learning approach. Detailed evaluations of multi-modal sentiment classification and emotion recognition on three benchmark datasets demonstrate the great effectiveness of our approach over several conventional competitive baselines.

Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition

Yue Gu
Xinyu Lyu
Weijia Sun
Weitian Li
Shuhong Chen
Xinyu Li
Ivan Marsic

Emotion recognition in dyadic communication is challenging because: 1. Extracting informative modality-specific representations requires disparate feature extractor designs due to the heterogeneous input data formats. 2. How to effectively and efficiently fuse unimodal features and learn associations between dyadic utterances is critical to model generalization in real scenarios. 3. Disagreeing annotations prevent previous approaches from precisely predicting emotions in context. To address the above issues, we propose an efficient dyadic fusion network that only relies on an attention mechanism to select representative vectors, fuse modality-specific features, and learn the sequence information. Our approach has three distinct characteristics: 1. Instead of using a recurrent neural network to extract temporal associations as in most previous research, we introduce multiple sub-view attention layers to compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across different modalities. 3. To overcome the label disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this method provides more accurate annotation information and fully uses the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets: IEMOCAP and MELD. Our model significantly outperforms previous state-of-the-art research by 3.8%-7.5% accuracy, using a more efficient model.
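
One simple way to picture embedding the labels from all annotators into a k-dimensional vector is to use the normalised vote counts over k emotion classes as a regression target. The sketch below does exactly that; it is an assumption about the general idea, not the paper's embedding, and the names `soft_label` and `mse` are hypothetical.

```python
def soft_label(votes, classes):
    """Turn annotator votes into a k-dimensional target (k = len(classes)).

    Assumption: each entry is the fraction of annotators who chose that class,
    so disagreement is preserved instead of being collapsed by majority vote.
    """
    counts = [votes.count(c) for c in classes]
    total = sum(counts)
    if total == 0:
        raise ValueError("no vote matches any known class")
    return [c / total for c in counts]

def mse(pred, target):
    """Mean squared error, a common regression loss for vector targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

classes = ["angry", "happy", "neutral", "sad"]
target = soft_label(["happy", "happy", "neutral"], classes)
loss = mse([0.0, 0.5, 0.5, 0.0], target)
```

With this kind of target, utterances without a majority label can still contribute to training, which is in the spirit of the abstract's remark about using the entire dataset.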

A Multimodal View into Music's Effect on Human Neural, Physiological, and Emotional Experience

Timothy Greer
Benjamin Ma
Matthew Sachs
Assal Habibi
Shrikanth Narayanan

Music has a powerful influence on human experience. In this paper, we investigate how music affects brain activity, physiological response, and human-reported behavior. Using auditory features related to dynamics, timbre, harmony, rhythm, and register, we predicted brain activity in the form of phase synchronizations in bilateral Heschl's gyri and superior temporal gyri; physiological response in the form of galvanic skin response and heart activity; and emotional experience in the form of continuous, subjective descriptions reported by music listeners. We found that multivariate time series models with attention mechanisms are effective in predicting emotional ratings, while vector-autoregressive models are effective in predicting involuntary human responses. Musical features related to dynamics, register, rhythm, and harmony were found to be particularly helpful in predicting these human reactions. This work adds to our understanding of how music affects multimodal human experience and has applications in affective computing, music emotion recognition, neuroscience, and music information retrieval.

Emotion Recognition using Multimodal Residual LSTM Network

Jiaxin Ma
Hao Tang
Wei-Long Zheng
Bao-Liang Lu

Various studies have shown that the temporal information captured by conventional long short-term memory (LSTM) networks is very useful for enhancing multimodal emotion recognition using electroencephalography (EEG) and other physiological signals. However, the dependency among multiple modalities and high-level temporal-feature learning using deeper LSTM networks are yet to be investigated. Thus, we propose a multimodal residual LSTM (MMResLSTM) network for emotion recognition. The MMResLSTM network shares the weights across the modalities in each LSTM layer to learn the correlation between the EEG and other physiological signals. It contains both the spatial shortcut paths provided by the residual network and temporal shortcut paths provided by LSTM for efficiently learning emotion-related high-level features. The proposed network was evaluated using a publicly available dataset for EEG-based emotion recognition, DEAP. The experimental results indicate that the proposed MMResLSTM network yielded a promising result, with a classification accuracy of 92.87% for arousal and 92.30% for valence.

Stereoscopic Visual Discomfort Prediction Using Multi-scale DCT Features

Yang Zhou
Wanli Yu
Zhu Li
Haibing Yin

Prior approaches to the problem of visual discomfort prediction (VDP) for stereo/3D images are built for the uncompressed image. This paper presents a novel VDP method based on the compressed image by using multi-scale discrete cosine transform (MsDCT). Three types of visual discomfort features, including basic disparity intensity (BDI), disparity gradient energy (DGE) and disparity texture complexity (DTC), are extracted from two-dimensional (2-D) DCT coefficients. Additionally, a multi-scale transformation approach based on the different sizes of transform units is applied to obtain the multi-scale sub-features for each of the features. Then, through experimental comparison, a random forest regressor is chosen to fuse twenty-three sub-features to get the final objective prediction value of the S3D images. Experimental results conducted on two datasets show that the proposed method improves the prediction accuracy compared to those of recent S3D visual (dis)comfort predictors.

PDANet: Polarity-consistent Deep Attention Network for Fine-grained Visual Emotion Regression

Sicheng Zhao
Zizhou Jia
Hui Chen
Leida Li
Guiguang Ding
Kurt Keutzer

Existing methods for visual emotion analysis mainly focus on coarse-grained emotion classification, i.e. assigning an image a dominant discrete emotion category. However, these methods cannot adequately reflect the complexity and subtlety of emotions. In this paper, we study the fine-grained regression problem of visual emotions based on convolutional neural networks (CNNs). Specifically, we develop a Polarity-consistent Deep Attention Network (PDANet), a novel network architecture that integrates attention into a CNN with an emotion polarity constraint. First, we propose to incorporate both spatial and channel-wise attentions into a CNN for visual emotion regression, which jointly considers the local spatial connectivity patterns along each channel and the interdependency between different channels. Second, we design a novel regression loss, i.e. polarity-consistent regression (PCR) loss, based on the weakly supervised emotion polarity to guide the attention generation. By optimizing the PCR loss, PDANet can generate a polarity-preserved attention map and thus improve the emotion regression performance. Extensive experiments are conducted on the IAPS, NAPS, and EMOTIC datasets, and the results demonstrate that the proposed PDANet outperforms the state-of-the-art approaches by a large margin for fine-grained visual emotion regression. Our source code is released at: https://github.com/ZizhouJia/PDANet.

Towards Increased Accessibility of Meme Images with the Help of Rich Face Emotion Captions

K R Prajwal
C V Jawahar
Ponnurangam Kumaraguru

In recent years, there has been an explosion in the number of memes being created and circulated in online social networks. Despite their rapidly increasing impact on how we communicate online, meme images are virtually inaccessible to the visually impaired users. Existing automated assistive systems that were primarily devised for natural photos in social media, overlook the specific fine-grained visual details in meme images. In this paper, we concentrate on describing one such prominent visual detail: the meme face emotion. We propose a novel automated method that enables visually impaired social media users to understand and appreciate meme face emotions with the help of rich textual captions. We first collect a challenging dataset of meme face emotion captions to support future research in face emotion understanding. We design a two-stage approach that significantly outperforms baseline approaches across all the standard captioning metrics and also generates richer discriminative captions. By validating our solution with the help of visually impaired social media users, we show that our emotion captions enable them to understand and appreciate one of the most popular classes of meme images encountered on the Internet for the first time. Code, data, and models are publicly available.

Comp-GAN: Compositional Generative Adversarial Network in Synthesizing and Recognizing Facial Expression

Wenxuan Wang
Qiang Sun
Yanwei Fu
Tao Chen
Chenjie Cao
Ziqi Zheng
Guoqiang Xu
Han Qiu
Yu-Gang Jiang
Xiangyang Xue

Facial expression is important in understanding our social interaction. Thus the ability to recognize facial expression enables novel multimedia applications. With the advance of recent deep architectures, research on facial expression recognition has achieved great progress. However, these models still suffer from a lack of sufficient and diverse high-quality training faces, vulnerability to facial variations, and the ability to recognize only a limited number of basic types of emotions. To tackle these problems, this paper proposes a novel end-to-end Compositional Generative Adversarial Network (Comp-GAN) that is able to synthesize new face images with specified poses and desired facial expressions; such synthesized images can be further utilized to help train a robust and generalized expression recognition model. Essentially, Comp-GAN can dynamically change the expression and pose of faces according to the input images while keeping the identity information. Specifically, the generator has two major components: one for generating images with the desired expression and the other for changing the pose of faces. Furthermore, a face reconstruction learning process is applied to re-generate the input image and constrain the generator to preserve key information such as facial identity. For the first time, various one/zero-shot facial expression recognition tasks have been created. We conduct extensive experiments to show that the images generated by Comp-GAN help improve the performance of one/zero-shot facial expression recognition.

TC-GAN: Triangle Cycle-Consistent GANs for Face Frontalization with Facial Features Preserved

Juntong Cheng
Yi-Ping Phoebe Chen
Minjun Li
Yu-Gang Jiang

Face frontalization has always been an important field. Recently, with the introduction of generative adversarial networks (GANs), face frontalization has achieved remarkable success. A critical challenge during face frontalization is to ensure the features of the original profile image are retained. Even though some state-of-the-art methods can preserve identity features while rotating the face to the frontal view, they still have difficulty preserving facial expression features. Therefore, we propose the novel triangle cycle-consistent generative adversarial networks for the face frontalization task, termed TC-GAN. Our networks contain two generators and one discriminator. One of the generators generates the frontal contour, and the other generates the facial features. They work together to generate a photo-realistic frontal view of the face. We also introduce cycle-consistent loss to retain feature information effectively. To validate the advantages of TC-GAN, we apply it to the face frontalization task on two datasets. The experimental results demonstrate that our method can perform large-pose face frontalization while preserving the facial features (both identity and expression). To the best of our knowledge, TC-GAN outperforms the state-of-the-art methods in the preservation of facial identity and expression features during face frontalization.

Fewer-Shots and Lower-Resolutions: Towards Ultrafast Face Recognition in the Wild

Shiming Ge
Shengwei Zhao
Xindi Gao
Jia Li

Is it possible to train an effective face recognition model with fewer shots that works efficiently on low-resolution faces in the wild? To answer this question, this paper proposes a few-shot knowledge distillation approach to learn an ultrafast face recognizer via two steps. In the first step, we initialize a simple yet effective face recognition model on synthetic low-resolution faces by distilling knowledge from an existing complex model. By removing the redundancies in both face images and the model structure, the initial model can provide an ultrafast speed with impressive recognition accuracy. To further adapt this model to in-the-wild scenarios with fewer faces per person, the second step refines the model via few-shot learning by incorporating a relation module that compares low-resolution query faces with faces in the support set. In this manner, the performance of the model can be further enhanced with only a few low-resolution faces in the wild. Experimental results show that the proposed approach performs favorably against state-of-the-art methods in recognizing low-resolution faces, with an extremely low memory footprint of 30KB, and runs at an ultrafast speed of 1,460 faces per second on CPU or 21,598 faces per second on GPU.

Identity- and Pose-Robust Facial Expression Recognition through Adversarial Feature Learning

Can Wang
Shangfei Wang
Guang Liang

Existing facial expression recognition methods either focus on pose variations or identity bias, but not both simultaneously. This paper proposes an adversarial feature learning method to address both of these issues. Specifically, the proposed method consists of five components: an encoder, an expression classifier, a pose discriminator, a subject discriminator, and a generator. An encoder extracts feature representations, and an expression classifier tries to perform facial expression recognition using the extracted feature representations. The encoder and the expression classifier are trained collaboratively, so that the extracted feature representations are discriminative for expression recognition. A pose discriminator and a subject discriminator classify the pose and the subject from the extracted feature representations respectively. They are trained adversarially with the encoder. Thus, the extracted feature representations are robust to poses and subjects. A generator reconstructs facial images to further favor the feature representations. Experiments on five benchmark databases demonstrate the superiority of the proposed method to state-of-the-art work.

Self-supervised Face-Grouping on Graphs

Veith Röthlingshöfer
Vivek Sharma
Rainer Stiefelhagen

We propose a novel self-supervised method for fine-tuning deep face representations called Face-Grouping on Graphs. We apply our method to automatic face grouping, where characters are to be separated based on their identity. To solve this problem, a graph structure with positive and negative edges over a set of face-tracks is induced based on their temporal overlap and similarity constraints, which requires no manual labor. We compute feature representations over sub-sequences of each track (sub-tracks) in order to obtain robust features whilst being able to utilize information contained in face variance. Each sub-track is given the ability to exchange information with adjacent sub-tracks via a typed graph neural network running over the induced graph. This allows us to push each representation in a direction in feature space that groups all representations of the same character together and separates representations of different characters. We show that our method is capable of improving clustering accuracy on the popular video face clustering datasets The Big Bang Theory and Buffy the Vampire Slayer by 4.9% and 17.0%, respectively, compared to baseline performance, and by 0.52% and 5.55%, respectively, compared to state-of-the-art methods. Additionally, we achieve a 19.0% absolute increase in B3 F-Score on Harry Potter 1 (ACCIO) over other state-of-the-art unsupervised methods. We provide performance metrics on all episodes of The Big Bang Theory and Buffy the Vampire Slayer to enable further comparison in the future.
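The idea of inducing edges without manual labels can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: two tracks that overlap in time are assumed to show different people (negative edge), and non-overlapping tracks with a cosine similarity above a hypothetical threshold are assumed to show the same person (positive edge). The track names, spans and features are made up.

```python
from itertools import combinations

def overlaps(a, b):
    """True if two (start_frame, end_frame) intervals share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v)

def induce_edges(tracks, sim_threshold=0.9):
    """tracks: dict name -> {"span": (start, end), "feat": [...]}.

    Returns (positive_edges, negative_edges) as sets of name pairs.
    The threshold is a hypothetical parameter, not a value from the paper.
    """
    positive, negative = set(), set()
    for a, b in combinations(sorted(tracks), 2):
        ta, tb = tracks[a], tracks[b]
        if overlaps(ta["span"], tb["span"]):
            # Faces visible at the same time are assumed to be different people.
            negative.add((a, b))
        elif cosine(ta["feat"], tb["feat"]) >= sim_threshold:
            positive.add((a, b))
    return positive, negative

tracks = {
    "t1": {"span": (0, 50), "feat": [1.0, 0.0]},
    "t2": {"span": (20, 70), "feat": [0.0, 1.0]},
    "t3": {"span": (100, 150), "feat": [0.99, 0.05]},
}
positive, negative = induce_edges(tracks)
```

Here `t1` and `t2` co-occur and receive a negative edge, while `t1` and `t3` never co-occur and have nearly identical features, so they receive a positive edge.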

SESSION: Session 1C: Fashion & Human Analysis

Who, Where, and What to Wear?: Extracting Fashion Knowledge from Social Media

Yunshan Ma
Xun Yang
Lizi Liao
Yixin Cao
Tat-Seng Chua

Fashion knowledge helps people to dress properly and addresses not only the physiological needs of users, but also the demands of social activities and conventions. It usually involves three mutually related aspects: occasion, person and clothing. However, few works focus on extracting such knowledge, which would greatly benefit many downstream applications, such as fashion recommendation. In this paper, we propose a novel method to automatically harvest fashion knowledge from social media. We unify the three tasks of occasion, person and clothing discovery from multiple modalities of images, texts and metadata. For person detection and analysis, we use off-the-shelf tools due to their flexibility and satisfactory performance. For clothing recognition and occasion prediction, we unify the two tasks by using a contextualized fashion concept learning module, which captures the dependencies and correlations among different fashion concepts. To alleviate the heavy burden of human annotations, we introduce a weak label modeling module which can effectively exploit machine-labeled data as a complement to clean data. In experiments, we contribute a benchmark dataset and conduct extensive experiments from both quantitative and qualitative perspectives. The results demonstrate the effectiveness of our model in fashion concept prediction, and the usefulness of the extracted knowledge through comprehensive analysis.

Virtually Trying on New Clothing with Arbitrary Poses

Na Zheng
Xuemeng Song
Zhaozheng Chen
Linmei Hu
Da Cao
Liqiang Nie

Thanks to recent advances in multimedia techniques, increasing research attention has been paid to the virtual try-on task, especially with 2D image modeling. The traditional try-on task aims to align the target clothing item naturally to the given person's body and hence present a try-on look of the person. However, in practice, people may also be interested in their try-on looks with different poses. Therefore, in this work, we introduce a new try-on setting, which enables changes of both the clothing item and the person's pose. Towards this end, we propose a pose-guided virtual try-on scheme based on generative adversarial networks (GANs) with a bi-stage strategy. In particular, in the first stage, we propose a shape enhanced clothing deformation model for deforming the clothing item, where the user body shape is incorporated as the intermediate guidance. For the second stage, we present an attentive bidirectional GAN, which jointly models the attentive clothing-person alignment and bidirectional generation consistency. For evaluation, we create a large-scale dataset, FashionTryOn, comprising 28,714 triplets, each consisting of a clothing item image and two model images in different poses. Extensive experiments on FashionTryOn validate the superiority of our model over the state-of-the-art methods.

FashionOn: Semantic-guided Image-based Virtual Try-on with Detailed Human and Clothing Information

Chia-Wei Hsieh
Chieh-Yun Chen
Chien-Lung Chou
Hong-Han Shuai
Jiaying Liu
Wen-Huang Cheng

The image-based virtual try-on system has attracted a lot of research attention. The virtual try-on task is challenging since synthesizing try-on images involves the estimation of 3D transformation from 2D images, which is an ill-posed problem. Therefore, most of the previous virtual try-on systems cannot solve difficult cases, e.g., body occlusions, wrinkles of clothes, and details of the hair. Moreover, the existing systems require the users to upload the image for the target pose, which is not user-friendly. In this paper, we aim to resolve the above challenges by proposing a novel FashionOn network to synthesize user images fitting different clothes in arbitrary poses to provide comprehensive information about how suitable the clothes are. Specifically, given a user image, an in-shop clothing image, and a target pose (can be arbitrarily manipulated by joint points), FashionOn learns to synthesize the try-on images by three important stages: pose-guided parsing translation, segmentation region coloring, and salient region refinement. Extensive experiments demonstrate that FashionOn maintains the details of clothing information (e.g., logo, pleat, lace), as well as resolves the body occlusion problem, and thus achieves the state-of-the-art virtual try-on performance both qualitatively and quantitatively.

POINet: Pose-Guided Ovonic Insight Network for Multi-Person Pose Tracking

Weijian Ruan
Wu Liu
Qian Bao
Jun Chen
Yuhao Cheng
Tao Mei

Multi-person pose tracking aims to jointly estimate and track multi-person keypoints in unconstrained videos. The most popular solution to this task follows the tracking-by-detection strategy that relies on human detection and data association. While human detection has been boosted by deep learning, existing works mainly exploit several separated stages with hand-crafted metrics to realize data association, leading to great uncertainty and weak adaptation in complex scenes. To handle these problems, we propose an end-to-end pose-guided ovonic insight network (POINet) for the data association in multi-person pose tracking, which jointly learns feature extraction, similarity estimation, and identity assignment. Specifically, we design a pose-guided representation network to integrate pose information into hierarchical convolutional features, generating a pose-aligned person representation, which helps handle partial occlusions. Moreover, we propose an ovonic insight network to adaptively encode the cross-frame identity transformation, which can cope with the tough tracking cases of persons leaving and entering the scene. In general, the proposed POINet provides a new insight to realize multi-person pose tracking in an end-to-end fashion. Extensive experiments conducted on the PoseTrack benchmark demonstrate that our POINet outperforms the state-of-the-art methods.

M2E-Try On Net: Fashion from Model to Everyone

Zhonghua Wu
Guosheng Lin
Qingyi Tao
Jianfei Cai

Most existing virtual try-on applications require clean clothes images. Instead, we present a novel virtual Try-On network, M2E-Try On Net, which transfers the clothes from a model image to a person image without the need for any clean product images. To obtain a realistic image of a person wearing the desired model clothes, we aim to solve the following challenges: 1) non-rigid nature of clothes - we need to align poses between the model and the user; 2) richness in textures of fashion items - preserving the fine details and characteristics of the clothes is critical for photo-realistic transfer; 3) variation of identity appearances - it is required to fit the desired model clothes to the person identity seamlessly. To tackle these challenges, we introduce three key components: the pose alignment network (PAN), the texture refinement network (TRN) and the fitting network (FTN). Since it is unlikely that image pairs of an input person image and the desired output image (i.e., the person wearing the desired clothes) can be gathered, our framework is trained in a self-supervised manner to gradually transfer the poses and textures of the model's clothes to the desired appearance. In the experiments, we verify on the Deep Fashion dataset and the MVC dataset that our method can generate photo-realistic images of the person trying on the model clothes. Furthermore, we explore the model capability for different fashion items, including both upper and lower garments.

Personalized Capsule Wardrobe Creation with Garment and User Modeling

Xue Dong
Xuemeng Song
Fuli Feng
Peiguang Jing
Xin-Shun Xu
Liqiang Nie

Recent years have witnessed a growing trend of building capsule wardrobes, in which people minimize and diversify the garments in their messy wardrobes. Thanks to the recent advances in multimedia techniques, many studies have promoted the automatic creation of capsule wardrobes through garment modeling. Nevertheless, most capsule wardrobes generated by existing methods fail to consider the user profile, including the user preferences, body shapes and consumption habits, which indeed largely affects the wardrobe creation. To this end, we introduce a combinatorial optimization-based personalized capsule wardrobe creation framework, named PCW-DC, which jointly integrates both garment modeling (i.e., wardrobe compatibility) and user modeling (i.e., preferences, body shapes). To justify our model, we construct a dataset, named bodyFashion, which consists of 116,532 user-item purchase records on Amazon involving 11,784 users and 75,695 fashion items. Extensive experiments on bodyFashion have demonstrated the effectiveness of our proposed model. As a byproduct, we have released the codes and the data to facilitate the research community.

Aesthetic Attributes Assessment of Images

Xin Jin
Le Wu
Geng Zhao
Xiaodong Li
Xiaokun Zhang
Shiming Ge
Dongqing Zou
Bin Zhou
Xinghui Zhou

Image aesthetic quality assessment has been a relatively hot topic during the last decade. Most recently, comment-type assessment (aesthetic captions) has been proposed to describe the general aesthetic impression of an image using text. In this paper, we propose Aesthetic Attributes Assessment of Images, that is, aesthetic attributes captioning. This is a new formulation of image aesthetic assessment, which predicts aesthetic attribute captions together with the aesthetic score of each attribute. We introduce a new dataset named DPC-Captions, which contains comments on up to 5 aesthetic attributes of one image, obtained through knowledge transfer from a fully-annotated small-scale dataset. Then, we propose the Aesthetic Multi-Attribute Network (AMAN), which is trained on a mixture of the fully-annotated small-scale PCCD dataset and the weakly-annotated large-scale DPC-Captions dataset. Our AMAN makes full use of transfer learning and an attention model in a single framework. The experimental results on our DPC-Captions and PCCD datasets reveal that our method can predict captions of 5 aesthetic attributes together with a numerical score assessment of each attribute. We use the evaluation criteria used in image captioning to show that our specially designed AMAN model outperforms the traditional CNN-LSTM model and the modern SCA-CNN model for image captioning.

GP-BPR: Personalized Compatibility Modeling for Clothing Matching

Xuemeng Song
Xianjing Han
Yunkai Li
Jingyuan Chen
Xin-Shun Xu
Liqiang Nie

Owing to the recent advances in the multimedia processing domain and the publicly available large-scale real-world data provided by online fashion communities, like IQON and Chictopia, researchers are able to investigate automatic clothing matching solutions. Existing methods mainly focus on modeling the general item-item compatibility from the aesthetic perspective, but fail to incorporate the user factor. In fact, aesthetics can be highly subjective, as different people may hold different clothing preferences. In light of this, in this work, we attempt to tackle the problem of personalized compatibility modeling from not only the general aesthetics but also the personal preference perspectives. In particular, we present a personalized compatibility modeling scheme GP-BPR, comprising two essential components: general compatibility modeling and personal preference modeling, which characterize the item-item and user-item interactions, respectively. Since both modalities (e.g., the image and context description) of fashion items can deliver important cues regarding user personal preference, we present a comprehensive personal preference modeling method. Moreover, for evaluation, we create a large-scale dataset, IQON3000, from the online fashion community IQON. Extensive experiment results on IQON3000 verify the effectiveness of the proposed scheme. As a byproduct, we have released the dataset, codes, and involved parameters to benefit other researchers.

Outfit Compatibility Prediction and Diagnosis with Multi-Layered Comparison Network

Xin Wang
Bo Wu
Yueqi Zhong

Existing works on fashion outfit compatibility focus on predicting the overall compatibility of a set of fashion items with their information from different modalities. However, few works explore how to explain the prediction, which limits the persuasiveness and effectiveness of the model. In this work, we propose an approach to not only predict but also diagnose the outfit compatibility. We introduce an end-to-end framework for this goal, which has two features: (1) The overall compatibility is learned from all type-specified pairwise similarities between items, and the backpropagation gradients are used to diagnose the incompatible factors. (2) We leverage the hierarchy of CNN and compare the features at different layers to take into account the compatibilities of different aspects from the low level (such as color, texture) to the high level (such as style). To support the proposed method, we build a new type-specified outfit dataset named Polyvore-T based on the Polyvore dataset. We compare our method with the prior state-of-the-art in two tasks: outfit compatibility prediction and fill-in-the-blank. Experiments show that our approach has advantages in both prediction performance and diagnosis ability.

BraidNet: Braiding Semantics and Details for Accurate Human Parsing

Xinchen Liu
Meng Zhang
Wu Liu
Jingkuan Song
Tao Mei

This paper focuses on fine-grained human parsing in images. This is a very challenging task due to the diverse person appearance, semantic ambiguity of different body parts and clothing, and extremely small parsing targets. Although existing approaches can achieve significant improvement by pyramid feature learning, multi-level supervision, and joint learning with pose estimation, human parsing is still far from being solved. Different from existing approaches, we propose a Braiding Network, named BraidNet, to learn complementary semantics and details for fine-grained human parsing. The BraidNet contains a two-stream braid-like architecture. The first stream is a semantic abstracting net with a deep yet narrow structure which can learn semantic knowledge by a hierarchy of fully convolutional layers to overcome the challenges of diverse person appearance. To capture low-level details of small targets, the detail-preserving net is designed to exploit a shallow yet wide network without down-sampling, which can retain sufficient local structures for small objects. Moreover, we design a group of braiding modules across the two sub-nets, by which complementary information can be exchanged during end-to-end training. In addition, at the end of BraidNet, a Pairwise Hard Region Embedding strategy is proposed to eliminate the semantic ambiguity of different body parts and clothing. Extensive experiments show that the proposed BraidNet achieves better performance than the state-of-the-art methods for fine-grained human parsing.

Modality-aware Collaborative Learning for Visible Thermal Person Re-Identification

Mang Ye
Xiangyuan Lan
Qingming Leng

Visible thermal person re-identification (VT-ReID) is a cross-modality pedestrian retrieval problem, which automatically searches persons between day-time visible images and night-time thermal images. Despite the extensive progress in single-modality ReID, the cross-modality pedestrian retrieval problem has received limited attention due to its challenges in modality discrepancy and large intra-class variations across cameras. Existing cross-modality ReID methods usually solve this problem by learning cross-modality feature representations with a modality-sharable classifier. However, this learning strategy may lose discriminative information in different modalities. In this paper, we propose a novel modality-aware collaborative (MAC) learning method on top of a two-stream network for VT-ReID, which handles the modality discrepancy at both the feature level and the classifier level. At the feature level, it handles the modality discrepancy by a two-stream network with different parameters. At the classifier level, it contains two separate modality-specific identity classifiers for the two modalities to capture the modality-specific information; they have the same network architecture but different parameters. In addition, we introduce a collaborative learning scheme, which regularizes the modality-sharable and modality-specific identity classifiers by utilizing the relationship between different classifiers. Extensive experiments on two cross-modality person re-identification datasets demonstrate the superiority of the proposed method, achieving much better performance than the state-of-the-art.

Adaptive Multi-Path Aggregation for Human DensePose Estimation in the Wild

Yuyu Guo
Lianli Gao
Jingkuan Song
Peng Wang
Wuyuan Xie
Heng Tao Shen

The dense human pose "in the wild" task aims to map all 2D pixels of the detected human body to a 3D surface by establishing surface correspondences, i.e., surface patch index and part-specific UV coordinates. It remains challenging especially under the condition of "in the wild", where RGB images capture complex, real-world scenes with background, occlusions, scale variations, and postural diversity. In this paper, we propose an end-to-end deep Adaptive Multi-path Aggregation network (AMA-net) for Dense Human Pose Estimation. In the proposed framework, we address two main problems: 1) how to design a simple yet effective pipeline for supporting distinct sub-tasks (e.g., instance segmentation, body part segmentation, and UV estimation); and 2) how to equip this pipeline with the ability of handling "in the wild". To solve these problems, we first extend FPN by adding a branch for mapping 2D pixels to a 3D surface in parallel with the existing branch for bounding box detection. Then, in AMA-net, we extract variable-sized object-level feature maps (e.g., 7×7, 14×14, and 28×28), named multi-path, from multi-layer feature maps, which capture rich information of objects and are then adaptively utilized in different tasks. AMA-net is simple to train and adds only a small overhead to FPN. We discover that aside from the deep feature map, Adaptive Multi-path Aggregation is of particular importance for improving the accuracy of dense human pose estimation "in the wild". The experimental results on the challenging Dense-COCO dataset demonstrate that our approach sets a new record for the Dense Human Pose Estimation task, and it significantly outperforms the state-of-the-art methods. Our code: https://github.com/nobody-g/AMA-net.

Illumination-Invariant Person Re-Identification

Yukun Huang
Zheng-Jun Zha
Xueyang Fu
Wei Zhang

Due to the effect of weak illumination, person images captured by surveillance cameras usually contain various degradations such as color shift, low contrast and noise. These degradations result in severe loss of discriminative information, which makes person re-identification (re-id) more challenging. However, existing person re-identification approaches are designed based on the assumption that the pedestrian images are captured under good lighting conditions, which is impractical in real-world scenarios. Inspired by the Retinex theory, we propose an illumination-invariant person re-identification framework which is able to simultaneously achieve Retinex illumination decomposition and person re-identification. We first verify that directly using weakly illuminated images can greatly reduce the performance of person re-id. We then design a bottom-up attention network to remove the effect of weak illumination and obtain the enhanced image without introducing over-enhancement. To effectively connect low-level and high-level vision tasks, a joint training strategy is further introduced to boost the performance of person re-id under weak illumination conditions. Experiments have demonstrated the advantages of our method on benchmarks with severe lighting changes and low light conditions.
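For readers unfamiliar with the Retinex theory mentioned above, its standard image model can be written as follows. The symbols are assumptions chosen here for illustration (S for the observed image, R for reflectance, L for illumination, x for a pixel location); the abstract does not give the paper's own notation or decomposition network.

```latex
S(x) = R(x) \cdot L(x)
\qquad\Longleftrightarrow\qquad
\log S(x) = \log R(x) + \log L(x)
```

Under this model, an illumination-invariant representation would ideally depend on R alone, since L carries the lighting conditions that vary between cameras and times of day.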

AI Coach: Deep Human Pose Estimation and Analysis for Personalized Athletic Training Assistance

Jianbo Wang
Kai Qiu
Houwen Peng
Jianlong Fu
Jianke Zhu

Recent years have witnessed unprecedented growth in sports videos, as different types of sports activities can be widely observed (i.e., from professional athletics to personal fitness). Existing computer vision approaches have predominantly focused on creating experiences of content browsing and searching by video tagging and summarization. These techniques have already enabled a wide range of applications for sports enthusiasts, such as text-based video search, highlight generation, and so on. In this paper, we take one step further to create an AI coach system to provide personalized athletic training experiences, especially for sports activities in which the training quality largely depends on the correctness of human poses in a video sequence. As sports videos often involve grand challenges of fast movement (e.g., skiing, skating) and complex actions (e.g., gymnastics), we propose to design the system with several distinct features: (1) trajectory extraction for a single human instance by leveraging deep visual tracking, (2) human pose estimation by proposing a novel human joints relation model in spatial and temporal domains, (3) pose correction by abnormality detection and exemplar-based visual suggestions. We have collected sports training videos from 30 sports enthusiasts, forming the Freestyle Skiing Aerials dataset (63 clips). We show through extensive user studies that the proposed system can lead to a remarkably better user training experience.

SESSION: Session 1D: Live Multimedia Applications & Streaming

Online Camera Pose Optimization for the Surround-view System

Xiao Liu
Lin Zhang
Ying Shen
Shaoming Zhang
Shengjie Zhao

The surround-view system is an important information medium for drivers to monitor the driving environment. A typical surround-view system consists of four to six fish-eye cameras arranged around the vehicle. From these camera inputs, a top-down image of the ground around the vehicle, namely the surround-view image, can be generated with well calibrated camera poses. Although existing surround-view system solutions can estimate camera poses accurately in an off-line environment, how to correct changes in the camera poses in an online environment is still an open issue. In this paper, we propose a camera pose optimization method for surround-view systems in an online environment. Our method consists of two models: the Ground Model and the Ground-Camera Model, both of which correct the camera poses by minimizing photometric errors between ground projections of adjacent cameras. Experiments show that our method can effectively correct the geometric misalignment of the surround-view image caused by changes in the camera poses. Since our method is highly automated with low requirements for a calibration site and manual operation, it has a wide range of applications and is convenient for end-users. To make the results reproducible, the source code is publicly available at https://cslinzhang.github.io/CamPoseOpt/.
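The notion of minimizing a photometric error between overlapping ground projections can be illustrated with a toy sketch. It is not the Ground Model or the Ground-Camera Model: a cyclic horizontal pixel shift stands in for a camera pose change, the error is a plain mean squared intensity difference, and a brute-force search replaces the optimizer. All names and values are hypothetical.

```python
def photometric_error(patch_a, patch_b):
    """Mean squared intensity difference between two equally sized grayscale patches."""
    diffs = [
        (pa - pb) ** 2
        for row_a, row_b in zip(patch_a, patch_b)
        for pa, pb in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)

def shift_columns(patch, dx):
    """Cyclic horizontal shift by dx pixels: a toy stand-in for re-projecting
    the ground with a changed camera pose."""
    width = len(patch[0])
    return [[row[(x - dx) % width] for x in range(width)] for row in patch]

def best_shift(reference, observed, candidates):
    """Pick the candidate correction with the lowest photometric error."""
    return min(
        candidates,
        key=lambda dx: photometric_error(reference, shift_columns(observed, dx)),
    )

# Overlap region as seen by one camera, and the same region seen by its
# neighbour after a simulated one-pixel misalignment.
reference = [[0, 10, 20, 30], [5, 15, 25, 35]]
observed = shift_columns(reference, 1)

recovered = best_shift(reference, observed, candidates=[-1, 0, 1])
residual = photometric_error(reference, shift_columns(observed, recovered))
```

The search undoes the simulated misalignment because the correct correction is the only candidate that makes the two projections agree pixel for pixel.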

LiveSense: Contextual Advertising in Live Streaming Videos

Xiang Chen
Tam V. Nguyen
Zhiqi Shen
Mohan Kankanhalli

Live streaming has become a new form of entertainment, which attracts hundreds of millions of users worldwide. The huge amount of multimedia data in live streaming platforms creates tremendous opportunities for online advertising. However, existing state-of-the-art video advertising strategies (e.g., pre-roll and contextual mid-roll advertising) that rely on analyzing the whole video, are not applicable to live streaming videos. This paper describes a novel monetization framework, named LiveSense, for live streaming videos, which is able to display a contextually relevant ad at a suitable timestamp in a non-intrusive way. Specifically, given a live streaming video, we first employ a deep neural network to determine whether the current moment is appropriate for displaying an ad using the historical streaming data. Then, we detect a set of candidate ad insertion areas by incorporating image saliency, background map, and location priorities, so that the ad is displayed over the non-important area. We introduce three types of relevance metrics including textual relevance, global visual relevance and local visual relevance to select the contextually relevant ad. To minimize user intrusiveness, we initially display the ad at a non-important area. If the user is interested in the ad, we will show the ad in an overlaid window with a translucent background. Empirical evaluation on a real-world dataset demonstrates that our proposed framework is able to effectively display ads in live streaming videos while maintaining users' online experience.

Real-Time Gesture Recognition Using 3D Sensory Data and a Light Convolutional Neural Network

Nicholas Diliberti
Chao Peng
Christopher Kaufman
Yangzi Dong
Jeffrey T. Hansberger

In this work, we propose an end-to-end system that provides both hardware and software support for real-time gesture recognition. We apply a convolutional neural network over 3D rotation data of finger joints rather than over vision-based data, in order to extract high-level intentions (features) users are trying to convey. A pair of customized motion capturing gloves are designed with inertial measurement unit (IMU) sensors to obtain gestural datasets for network training and real-time recognition. A network reduction strategy has been developed to appropriately reduce a network's complexity in both depth and width dimensions while maintaining a high recognition accuracy with the classification model produced by the network. The classification model is able to classify new data samples by scanning a real-time stream of joint rotations during the use of the gloves. Our evaluation results expose the relationships between the network reduction hyperparameters and the change of recognition accuracy. Based on the evaluation, we are able to determine an appropriate version of the light network and achieve 98% accuracy.

Embodied One-Shot Video Recognition: Learning from Actions of a Virtual Embodied Agent

Yuqian Fu
Chengrong Wang
Yanwei Fu
Yu-Xiong Wang
Cong Bai
Xiangyang Xue
Yu-Gang Jiang

One-shot learning aims to recognize novel target classes from few examples by transferring knowledge from source classes, under a general assumption that the source and target classes are semantically related but not exactly the same. Based on this assumption, recent work has focused on image-based one-shot learning, while little work has addressed video-based one shot learning. One of the challenges lies in that it is difficult to maintain the disjoint-class assumption for videos, since video clips of target classes may potentially appear in the videos of source classes. To address this issue, we introduce a novel setting, termed as embodied agents based one-shot learning, which leverages synthetic videos produced in a virtual environment to understand realistic videos of target classes. In this setting, we further propose two types of learning tasks: embodied one-shot video domain adaptation and embodied one-shot video transfer recognition. These tasks serve as a testbed for evaluating video related one-shot learning tasks. In addition, we propose a general video segment augmentation method, which significantly facilitates a variety of one-shot learning tasks. Experimental results validate the soundness of our setting and learning tasks, and also show the effectiveness of our augmentation approach to video recognition in the small-sample size regime.

Livesmart: A QoS-Guaranteed Cost-Minimum Framework of Viewer Scheduling for Crowdsourced Live Streaming

Rui-Xiao Zhang
Ming Ma
Tianchi Huang
Haitian Pang
Xin Yao
Chenglei Wu
Jiangchuan Liu
Lifeng Sun

Viewer scheduling among different CDN providers in crowdsourced live streaming (CLS) service is especially challenging due to the large-scale dynamic viewers as well as the time-variant performance of the content delivery network. A practical scheduling method should tackle the following challenges: 1) accurate modeling of viewer patterns and CDN performance; 2) intelligent workload offloading to save costs while guaranteeing the quality of service (QoS); and 3) ease of integration with practical CDN infrastructure in CLS platforms.

In this paper, we propose Livesmart, a novel framework that facilitates a QoS-guaranteed cost-efficient approach for CLS services. Specifically, we address the first challenge by carefully designing deep neural networks which let Livesmart capture the environment dynamics without any presumptions; we then tackle the second challenge by leveraging the Model Predictive Control (MPC) method, which enables Livesmart to make decisions with a long-term view. For the last challenge, we propose a probability shift model based on the realistic CLS delivery structure, thus allowing Livesmart to be practically deployed. We collect real-world data in cooperation with Kuaishou, one of the largest CLS providers in China, and evaluate Livesmart with trace-driven experiments. In comparison with prevalent methods, Livesmart can significantly reduce the CDN bandwidth costs (24.97%-63.45%) and improve the average QoS (5.79%-7.63%).

Comyco: Quality-Aware Adaptive Video Streaming via Imitation Learning

Tianchi Huang
Chao Zhou
Rui-Xiao Zhang
Chenglei Wu
Xin Yao
Lifeng Sun

Learning-based Adaptive Bit Rate (ABR) methods, which aim to learn outstanding strategies without any presumptions, have become one of the research hotspots for adaptive streaming. However, they still suffer from several issues, i.e., low sample efficiency and lack of awareness of the video quality information. In this paper, we propose Comyco, a video quality-aware ABR approach that enormously improves the learning-based methods by tackling the above issues. Comyco trains the policy via imitating expert trajectories given by the instant solver, which can not only avoid redundant exploration but also make better use of the collected samples. Meanwhile, Comyco attempts to pick the chunk with higher perceptual video quality rather than higher video bitrate. To achieve this, we construct Comyco's neural network architecture, video datasets and QoE metrics with video quality features. Using trace-driven and real-world experiments, we demonstrate significant improvements of Comyco's sample efficiency in comparison to prior work, with 1700x improvements in terms of the number of samples required and 16x improvements in the training time required. Moreover, results illustrate that Comyco outperforms previously proposed methods, with improvements in average QoE of 7.5% - 16.79%. In particular, Comyco also surpasses the state-of-the-art approach Pensieve by 7.37% on average video quality under the same rebuffering time.

Low-Latency Network-Adaptive Error Control for Interactive Streaming

Silas L. Fong
Salma Emara
Baochun Li
Ashish Khisti
Wai-Tian Tan
Xiaoqing Zhu
John Apostolopoulos

We introduce a novel network-adaptive algorithm that is suitable for alleviating network packet losses for low-latency interactive communications between a source and a destination. Network packet losses happen in a bursty manner as well as an arbitrary manner, where the former is usually due to network congestion and the latter can be caused by unreliable wireless links. Our network-adaptive algorithm estimates in real time the best parameters of a recently proposed streaming code that corrects both arbitrary losses (which cause crackling noise in audio) and burst losses (which cause undesirable jitters and pauses in audio) using forward error correction (FEC). The network-adaptive algorithm updates the coding parameters in real time as follows: The destination estimates appropriate coding parameters based on its observed packet loss pattern and then the parameters are fed back to the source for updating the underlying code. In addition, a new explicit construction of practical low-latency streaming codes that achieve the optimal tradeoff between the capability of correcting arbitrary losses and the capability of correcting burst losses is provided. Simulation evaluations based on real-world packet loss traces reveal that our proposed network-adaptive algorithm combined with our optimal streaming codes achieves significantly higher reliability compared to uncoded and non-adaptive FEC schemes over UDP (User Datagram Protocol).

Navigation Graph for Tiled Media Streaming

Jounsup Park
Klara Nahrstedt

After the emergence of video streaming services, more creative and diverse multimedia content has become available, and now the capability of streaming 360-degree videos will open a new era of multimedia experiences. However, streaming these videos requires larger bandwidth and less latency than what is found in conventional video streaming systems. Rate adaptation of tiled videos and view prediction techniques are used to solve this problem. In this paper, we introduce the Navigation Graph, which models viewing behaviors in the temporal (segments) and the spatial (tiles) domains to perform the rate adaptation of tiled media associated with the view prediction. The Navigation Graph allows clients to perform view prediction more easily by sharing the viewing model in the same way in which media description information is shared in DASH. It is also useful for encoding the trajectory information in the media description file, which could also allow for more efficient navigation of 360-degree videos. This paper provides information about the creation of the Navigation Graph and its uses. The performance evaluation shows that the Navigation Graph based view prediction and rate adaptation outperform other existing tiled media streaming solutions. Navigation Graph is not limited to 360-degree video streaming applications, but it can also be applied to other tiled media streaming systems, such as volumetric media streaming for augmented reality applications.

CACA: Learning-based Content-aware Cache Admission for Video Content in Edge Caching

Yu Guan
Xinggong Zhang
Zongming Guo

In the last decades, network caches (Content Distribution Networks, CDNs) have been widely deployed in video delivery systems. As caches have been pushed as far as possible toward the network edge, small cache sizes and irregular request patterns make it a great challenge for edge caches to catch popular video contents. Although cache admission policies can be applied to keep cold contents out, all current admission policies are still based on request patterns (content size, frequency), which perform poorly in edge caches. This paper proposes a novel feature-based cache admission policy, Content-feature Aware Cache Admission (CACA). It admits video objects to the cache by video features rather than by request pattern. The intuition is that, for a group of users, their preferred contents may change at any time, but their preferred content features remain stable for a while. The popularity of video features (such as topic or author) is much more predictable than that of a single video object. To mine critical features from a huge feature space, this paper proposes a tree-structured reinforcement learning algorithm. Critical features are learned from a feature-partition tree which is spanned and pruned by historical popularity. Then, an Exploration-and-Exploitation method is used to select the Top-K critical features. Video contents with these features are admitted to the cache. We carried out extensive experiments with 24-hour data traces from a commercial video content provider. The experimental results demonstrate that the proposed CACA is able to improve the hit ratio by up to 15%, reduce back-to-origin by up to 20% and save 95% of memory, compared with state-of-the-art cache admission policies.

Dense Feature Aggregation and Pruning for RGBT Tracking

Yabin Zhu
Chenglong Li
Bin Luo
Jin Tang
Xiao Wang

How to perform effective information fusion of different modalities is a core factor in boosting the performance of RGBT tracking. This paper presents a novel deep fusion algorithm based on the representations from an end-to-end trained convolutional neural network. To exploit the complementarity of features of all layers, we propose a recursive strategy to densely aggregate these features, yielding robust representations of target objects in each modality. Across modalities, we propose to prune the densely aggregated features of all modalities in a collaborative way. Specifically, we employ global average pooling and weighted random selection to perform channel scoring and selection, which can remove redundant and noisy features to achieve a more robust feature representation. Experimental results on two RGBT tracking benchmark datasets suggest that our tracker achieves clear state-of-the-art performance against other RGB and RGBT tracking methods.
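
As a rough illustration of channel scoring and selection, the sketch below scores each channel by global average pooling and then draws channels at random with probability proportional to that score. It is not the authors' implementation: the function name `select_channels`, the sampling without replacement, and the assumption of non-negative activations (for example after a ReLU) are choices made for this example only.

```python
import random


def select_channels(feature_maps, keep, seed=0):
    """Score channels by global average pooling, then sample `keep` of them.

    feature_maps: list of channels, each a 2D list of non-negative activations
    (hypothetical layout). Channels are drawn without replacement, with
    probability proportional to their pooled score.
    Returns the sorted indices of the kept channels.
    """
    rng = random.Random(seed)
    # Global average pooling: one scalar score per channel.
    scores = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]
    candidates = list(range(len(feature_maps)))
    kept = []
    for _ in range(min(keep, len(candidates))):
        # A small epsilon keeps the weights valid when every score is zero.
        weights = [scores[i] + 1e-12 for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        kept.append(pick)
        candidates.remove(pick)
    return sorted(kept)
```

Channels with larger average activation are more likely to survive, while the randomness still gives low-scoring channels an occasional chance to be kept.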

Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-based Object Tracking

Haosheng Chen
Qiangqiang Wu
Yanjie Liang
Xinbo Gao
Hanzi Wang

Event cameras, which are asynchronous bio-inspired vision sensors, have shown great potential in a variety of situations, such as fast motion and low illumination scenes. However, most of the event-based object tracking methods are designed for scenarios with untextured objects and uncluttered backgrounds. There are few event-based object tracking methods that support bounding box-based object tracking. The main idea behind this work is to propose an asynchronous Event-based Tracking-by-Detection (ETD) method for generic bounding box-based object tracking. To achieve this goal, we present an Adaptive Time-Surface with Linear Time Decay (ATSLTD) event-to-frame conversion algorithm, which asynchronously and effectively warps the spatio-temporal information of asynchronous retinal events to a sequence of ATSLTD frames with clear object contours. We feed the sequence of ATSLTD frames to the proposed ETD method to perform accurate and efficient object tracking, which leverages the high temporal resolution property of event cameras. We compare the proposed ETD method with seven popular object tracking methods, that are based on conventional cameras or event cameras, and two variants of ETD. The experimental results show the superiority of the proposed ETD method in handling various challenging environments.
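
To make the idea of an event-to-frame conversion with linear time decay concrete, the following sketch builds one frame from a list of events. It is not the ATSLTD algorithm itself: the fixed decay window, the event layout `(x, y, t)`, and the name `linear_decay_surface` are assumptions for illustration, and the adaptive part of the method is omitted.

```python
def linear_decay_surface(events, width, height, t_ref, window):
    """Build a frame where each pixel holds a linearly decayed recency value.

    events: iterable of (x, y, t) tuples (hypothetical layout); events later
    than t_ref are ignored. A pixel whose latest event occurred at t_ref gets
    1.0; the value falls linearly to 0.0 for events that are `window` or more
    time units old. Pixels without events stay at 0.0.
    """
    # Keep only the most recent event timestamp per pixel.
    last_t = [[None] * width for _ in range(height)]
    for x, y, t in events:
        if t <= t_ref and (last_t[y][x] is None or t > last_t[y][x]):
            last_t[y][x] = t

    frame = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            t = last_t[y][x]
            if t is not None:
                frame[y][x] = max(0.0, 1.0 - (t_ref - t) / window)
    return frame
```

Recent events produce bright pixels and older events fade out, which is one simple way to obtain frames in which moving contours stand out.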

Exploit the Connectivity: Multi-Object Tracking with TrackletNet

Gaoang Wang
Yizhou Wang
Haotian Zhang
Renshu Gu
Jenq-Neng Hwang

Multi-object tracking (MOT) is an important topic and critical task related to both static and moving camera applications, such as traffic flow analysis, autonomous driving and robotic vision. However, due to unreliable detection, occlusion and fast camera motion, tracked targets can be easily lost, which makes MOT very challenging. Most recent works exploit spatial and temporal information for MOT, but how to combine appearance and temporal features is still not well addressed. In this paper, we propose an innovative and effective tracking method called TrackletNet Tracker (TNT) that combines temporal and appearance information in a unified framework. First, we define a graph model which treats each tracklet as a vertex. The tracklets are generated by associating detection results frame by frame with the help of the appearance similarity and the spatial consistency. To compensate for camera movement, epipolar constraints are taken into consideration in the association. Then, for every pair of tracklets, the similarity, called the connectivity in the paper, is measured by our designed multi-scale TrackletNet. Afterwards, the tracklets are clustered into groups and each group represents a unique object ID. Our proposed TNT has the ability to handle most of the challenges in MOT, and achieves promising results on MOT16 and MOT17 benchmark datasets compared with other state-of-the-art methods.
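
The last step, grouping tracklets so that each group corresponds to one object ID, can be illustrated with a deliberately simple stand-in. The sketch below merges tracklets whose pairwise connectivity exceeds a threshold using union-find. The thresholding rule, the function name `cluster_tracklets`, and the score dictionary are assumptions for this example; the abstract does not say which clustering procedure TNT uses.

```python
def cluster_tracklets(num_tracklets, connectivity, threshold=0.5):
    """Group tracklets whose pairwise connectivity exceeds a threshold.

    connectivity: dict mapping (i, j) index pairs to a similarity score
    (hypothetical input; in the paper this score comes from TrackletNet).
    Returns a list of object IDs, one per tracklet, numbered from 0 in
    order of first appearance.
    """
    parent = list(range(num_tracklets))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in connectivity.items():
        if score > threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    roots = [find(i) for i in range(num_tracklets)]
    ids = {}
    return [ids.setdefault(r, len(ids)) for r in roots]
```

For instance, with four tracklets and strong connectivity only between the pairs (0, 1) and (2, 3), the function assigns two object IDs.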

Themis: Efficient and Adaptive Resource Partitioning for Reducing Response Delay in Cloud Gaming

Yusen Li
Haoyuan Liu
Xiwei Wang
Lingjun Pu
Trent Marbach
Shanjiang Tang
Gang Wang
Xiaoguang Liu

Cloud gaming has been increasing in popularity recently, but issues relating to maintaining low interaction delay for users to guarantee a satisfactory gaming experience are still prevalent. Interaction delays caused by server-side processing are heavily influenced by how the processes partition the resources. However, finding the optimal partitioning policy that minimizes the response delay is complicated by several critical challenges. In this paper, we propose Themis, a system that enables efficient and adaptive online resource partitioning for reducing response delay in cloud gaming. Briefly, Themis employs machine learning technology to build a performance model which is able to capture the complex relationships between resource partition and system performance. With this model, Themis divides the processes into disjoint groups and partitions resources among process groups, which greatly simplifies the resource partition problem while ensuring high partitioning effectiveness. To tackle dynamic workload changes, Themis leverages reinforcement learning to learn how different partitioning actions affect system performance in an online manner, and adaptively chooses the best actions for minimizing response delay in real time. We evaluate Themis in a real cloud gaming environment using several real games. The experimental results show that Themis can reduce the response delay by 17% to 36% compared to a system without resource partitioning, and outperforms other resource partitioning policies significantly. To the best of our knowledge, this is the first work to optimize response delay in cloud gaming through resource partitioning.

PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition

Can Zhang
Yuexian Zou
Guang Chen
Lei Gan

Despite the remarkable performance in video-based action recognition over the past several years, current state-of-the-art approaches heavily rely on optical flow as the motion representation. However, computing the optical flow in advance is computationally expensive, which prevents action recognition from running in real time. In this paper, we shed light on fast action recognition by lifting the reliance on optical flow. Inspired by Persistence of Vision in the human visual system, we design a novel motion cue called Persistence of Appearance (PA), which enables the network to distill motion information directly from adjacent RGB frames. Our PA derives from optical flow and focuses on the small displacements of motion boundaries. Compared with other motion representations, our PA enables the network to achieve competitive accuracy on UCF101. Meanwhile, the inference speed reaches 1855 fps, which is over 120x faster than that of the traditional optical flow based methods. Besides, we devise a decision strategy called Various-timescale Inference Pooling (VIP) to empower the network with the ability of long-range temporal modeling across various timescales. We further incorporate the proposed PA and VIP to form a unified framework called Persistent Appearance Network (PAN). Compared with methods using only RGB frames, our delicately designed PAN achieves state-of-the-art results on three benchmark datasets: UCF101, HMDB51 and Kinetics, where it reaches 96.2%, 74.8% and 82.5% accuracy respectively, with a run-time speed as high as 595 fps. The code for this project is available at: https://github.com/zhang-can/PAN-PyTorch .

SESSION: Keynote II

FemTech: Broadening Participation to Digital Technology Development

Pernille Bjørn
Maria Menendez-Blanco

In the digital age, the fields and professions related to computing are having an unprecedented impact on our lives, and on societies. As computing becomes integrated in fundamental ways in healthcare [10,11], labor markets [2,4], and political processes [3,6], questions about who participates and takes decisions in developing digital technologies are becoming increasingly crucial and unavoidable [7].

A bottom line is that, if a rather homogeneous group develops most of the digital technologies, there is a risk that these technologies only consider a part of the population, and therefore unwillingly introduce biases or trigger exclusion. There are many intersectional characteristics - such as race, gender, or class - by which people can be part of an excluded minority. This keynote focuses on women as a gender minority in computing.

In Western societies, the percentage of women participating in computing is low. According to a recent report for the European Commission, there are four times more men than women in Europe in studies related to Information and Communication Technologies [12]. Similarly, a study by the Department of Labor Bureau of Labor Statistics showed that only 26% of computing jobs in the USA were held by women [13].

SESSION: Session 2A: Knowledge Processing & Action Analysis

Training Efficient Saliency Prediction Models with Knowledge Distillation

Peng Zhang
Li Su
Liang Li
BingKun Bao
Pamela Cosman
GuoRong Li
Qingming Huang

Recently, deep learning-based saliency prediction methods have achieved significant accuracy improvements. However, they are hard to embed in practical multimedia applications due to large memory consumption and running time caused by complicated architectures. In addition, most methods are fine-tuned from pre-trained models for classification tasks, and networks cannot flexibly be transferred for a new task. In this paper, a condensed and randomly initialized student network is employed to achieve higher efficiency by transferring knowledge from complicated and well-trained teacher networks. This is the first use of knowledge distillation for efficient pixel-wise saliency prediction. Instead of directly minimizing Euclidean distance between feature maps, we propose two statistical representations of feature maps (i.e., first-order and second-order statistics) as knowledge. We conduct experiments on three kinds of teacher networks and four benchmark datasets to verify the effectiveness of the proposed method. Compared with the teacher networks, the student networks achieve an acceleration ratio of 4.56-4.73. Compared with state-of-the-art approaches, the proposed model achieves competitive accuracy with faster running speed (up to 4.38 times) and smaller model size (up to 93.27% reduction). We further embedded the proposed saliency prediction model into a video captioning application. The saliency-embedded approaches improve video captioning on all test metrics with a small complexity cost. The student-model embedded approach achieves 25% time saving with similar performance to the teacher embedded one.

Explainable Video Action Reasoning via Prior Knowledge and State Transitions

Tao Zhuo
Zhiyong Cheng
Peng Zhang
Yongkang Wong
Mohan Kankanhalli

Human action analysis and understanding in videos is an important and challenging task. Although substantial progress has been made in past years, the explainability of existing methods is still limited. In this work, we propose a novel action reasoning framework that uses prior knowledge to explain semantic-level observations of video state changes. Our method takes advantage of both classical reasoning and modern deep learning approaches. Specifically, prior knowledge is defined as the information of a target video domain, including a set of objects, attributes and relationships in the target video domain, as well as relevant actions defined by the temporal attribute and relationship changes (i.e. state transitions). Given a video sequence, we first generate a scene graph on each frame to represent concerned objects, attributes and relationships. Then those scene graphs are associated by tracking objects across frames to form a spatio-temporal graph (also called video graph), which represents semantic-level video states. Finally, by sequentially examining each state transition in the video graph, our method can detect and explain how those actions are executed with prior knowledge, just like the logical manner of thinking by humans. Compared to previous works, the action reasoning results of our method can be explained by both logical rules and semantic-level observations of video content changes. Besides, the proposed method can be used to detect multiple concurrent actions with detailed information, such as who (particular objects), when (time), where (object locations) and how (what kind of changes). Experiments on a re-annotated dataset CAD-120 show the effectiveness of our method.

Perceptual Visual Reasoning with Knowledge Propagation

Guohao Li
Xin Wang
Wenwu Zhu

Visual Question Answering (VQA) aims to answer natural language questions given images, where great challenges lie in comprehensive understanding and reasoning based on the rich contents provided by both questions and images. Most existing literature on VQA fuses the image and question features together with an attention mechanism to answer the questions. In order to obtain a more human-like inferential ability, there have been some preliminary module-based approaches which decompose the whole problem into modular sub-problems. However, these methods still suffer from unsolved challenges such as a lack of sufficient explainability and logical inference; the gap between these preliminary studies and real human reasoning behaviors is undoubtedly still extremely large. To tackle these challenges, we propose a Perceptual Visual Reasoning (PVR) model, which advances one important step towards more explainable VQA. Our proposed PVR model is a module-based approach which incorporates the concept of logical and/or for logic inference, introduces a richer group of perceptual modules for better logic generalization and utilizes the supervised information on each sub-module for more explainability. Knowledge propagation is therefore enabled by resorting to the modular design and supervision on sub-modules. We carry out extensive experiments with various evaluation metrics to demonstrate the superiority of the proposed PVR model against other state-of-the-art methods.

Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding

Xuejing Liu
Liang Li
Shuhui Wang
Zheng-Jun Zha
Li Su
Qingming Huang

Weakly supervised referring expression grounding (REG) aims at localizing the referential entity in an image according to linguistic query, where the mapping between the image region (proposal) and the query is unknown in the training stage. In referring expressions, people usually describe a target entity in terms of its relationship with other contextual entities as well as visual attributes. However, previous weakly supervised REG methods rarely pay attention to the relationship between the entities. In this paper, we propose a knowledge-guided pairwise reconstruction network (KPRN), which models the relationship between the target entity (subject) and contextual entity (object) as well as grounds these two entities. Specifically, we first design a knowledge extraction module to guide the proposal selection of subject and object. The prior knowledge is obtained in a specific form of semantic similarities between each proposal and the subject/object. Second, guided by such knowledge, we design the subject and object attention module to construct the subject-object proposal pairs. The subject attention excludes the unrelated proposals from the candidate proposals. The object attention selects the most suitable proposal as the contextual proposal. Third, we introduce a pairwise attention and an adaptive weighting scheme to learn the correspondence between these proposal pairs and the query. Finally, a pairwise reconstruction module is used to measure the grounding for weakly supervised learning. Extensive experiments on four large-scale datasets show our method outperforms existing state-of-the-art methods by a large margin.

Explainable Interaction-driven User Modeling over Knowledge Graph for Sequential Recommendation

Xiaowen Huang
Quan Fang
Shengsheng Qian
Jitao Sang
Yan Li
Changsheng Xu

Compared with the traditional recommendation system, sequential recommendation is able to capture the evolution of users' dynamic interests. Many previous studies in sequential recommendation focus on the accuracy of predicting the next item that a user might interact with, while generally ignoring explanations of why the item is recommended to the user. Appropriate explanations are critical to help users adopt the recommended item, and thus improve the transparency and trustworthiness of the recommendation system. In this paper, we propose a novel Explainable Interaction-driven User Modeling (EIUM) algorithm to exploit Knowledge Graph (KG) for constructing an effective and explainable sequential recommender. Qualified semantic paths between a specific user-item pair are extracted from KG. Encoding those semantic paths and learning the importance scores for each path provides the path-wise explanation for the recommendation system. Different from traditional item-level sequential modeling methods, we capture the interaction-level user dynamic preferences by modeling the sequential interactions. It is a high-level representation which contains auxiliary semantic information from KG. Furthermore, we adopt joint learning for better representation learning by employing multi-modal fusion, which benefits from the structural constraints in KG and involves three kinds of modalities. Extensive experiments on the large-scale dataset show the better performance of our approach in making sequential recommendations in terms of both accuracy and explainability.

Learning Using Privileged Information for Food Recognition

Lei Meng
Long Chen
Xun Yang
Dacheng Tao
Hanwang Zhang
Chunyan Miao
Tat-Seng Chua

Food recognition for user-uploaded images is crucial in visual diet tracking, an emerging application linking multimedia and healthcare domains. However, it is challenging due to the various visual appearances of food images. This is caused by different conditions when taking the photos, such as angles, distances, light conditions, food containers, and background scenes. To alleviate such a semantic gap, this paper presents a cross-modal alignment and transfer network (ATNet), which is motivated by the paradigm of learning using privileged information (LUPI). It additionally utilizes the ingredients in food images as an "intelligent teacher" in the training stage to facilitate cross-modal information passing. Specifically, ATNet first uses a pair of synchronized autoencoders to build the base image and ingredient channels for information flow. Subsequently, the information passing is enabled through a two-stage cross-modal interaction. The first stage of interaction adopts a two-step method, called partial heterogeneous transfer, to 1) alleviate the intrinsic heterogeneity between images and ingredients and 2) align them in a shared space to make their carried information about food classes interact. In the second stage, ATNet learns to map the visual embeddings of images to the ingredient channel for food recognition from the view of the "teacher". This leads to a refined recognition by multi-view fusion. Experiments on two real-world datasets show that ATNet can be incorporated with any state-of-the-art CNN models to consistently improve their performance.

Occluded Facial Expression Recognition Enhanced through Privileged Information

Bowen Pan
Shangfei Wang
Bin Xia

In this paper, we propose a novel approach to occluded facial expression recognition with the help of non-occluded facial images. The non-occluded facial images are used as privileged information, which is only required during training, but not required during testing. Specifically, two deep neural networks are first trained from occluded and non-occluded facial images respectively. Then the non-occluded network is fixed and is used to guide the fine-tuning of the occluded network from both label space and feature space. Similarity constraint and loss inequality regularization are imposed on the label space to make the output of the occluded network converge to that of the non-occluded network. Adversarial learning is adopted to force the distribution of the learned features from occluded facial images to be close to that from non-occluded facial images. Furthermore, a decoder network is employed to reconstruct the non-occluded facial images from occluded features. Under the guidance of non-occluded facial images, the occluded network is expected to learn better features and a better classifier during training. Experiments on the benchmark databases with both synthesized and realistic occluded facial images demonstrate the superiority of the proposed method to the state of the art.

Attention Transfer (ANT) Network for View-invariant Action Recognition

Yanli Ji
Feixiang Xu
Yang Yang
Ning Xie
Heng Tao Shen
Tatsuya Harada

With wide applications in surveillance and human-robot interaction, view-invariant human action recognition is critical yet challenging, due to the action occlusion and information loss caused by view change. Current methods mainly seek a common feature space for different views. However, such solutions become invalid when there are few common features, e.g. under large view change. To tackle the problem, we propose an AttentioN Transfer (ANT) Network for view-invariant action recognition. Rather than transferring features, ANT transfers attention from the reference view to arbitrary views, which correctly emphasizes crucial body joints and their relations for view-invariant representation. In addition, the attention calculation method, which takes into account both the recognition contribution and the reliability of skeleton joints, generates effective attention. Experiments show its effectiveness in correctly locating crucial body joints in action sequences. We exhaustively evaluate our approach on the UESTC and the NTU datasets with three types of view-invariant evaluations, i.e. X-view, X-sub, and Arbitrary-view evaluation. Experimental results demonstrate its superiority in view-invariant representation and recognition.

Action Recognition with Bootstrapping based Long-range Temporal Context Attention

Ziming Liu
Guangyu Gao
A. K. Qin
Tong Wu
Chi Harold Liu

Actions always refer to complex vision variations in a long-range redundant video sequence. Instead of focusing on a limited-range sequence, i.e. convolution on adjacent frames, in this paper we propose an action recognition approach with bootstrapping based long-range temporal context attention. Specifically, due to vision variations of the local region across frames, we aim to capture temporal context by proposing the Temporal Pixels based Parallel-head Attention (TPPA) block. In TPPA, we apply the self-attention mechanism between local regions at the same position across temporal frames to capture the interaction impacts. Meanwhile, to deal with video redundancy and capture long-range context, the TPPA is extended to the Random Frames based Bootstrapping Attention (RFBA) framework. Since the bootstrap-sampled frames have the same distribution as the whole video sequence, the RFBA not only captures longer temporal context with only a few sampled frames but also obtains a comprehensive representation through multiple sampling. Furthermore, we also try to apply this temporal context attention to image-based action recognition, by transforming the image into a "pseudo video" with the spatial shift. Finally, we conduct extensive experiments and empirical evaluations on two of the most popular datasets: UCF101 for videos and Stanford40 for images. In particular, our approach achieves top-1 accuracy of 91.7% on UCF101 and mAP of 90.9% on Stanford40.
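
Bootstrap sampling of frames can be illustrated in a few lines. The sketch below draws frame indices uniformly at random with replacement, which is the standard bootstrap. Whether RFBA samples in exactly this way is not stated in the abstract, and the function name `bootstrap_frame_indices` and its parameters are hypothetical.

```python
import random


def bootstrap_frame_indices(num_frames, num_samples, seed=None):
    """Draw frame indices uniformly at random with replacement.

    The indices are returned in temporal order so that the sampled frames
    can be stacked into a short clip. Repeated indices are possible, as in
    any bootstrap sample.
    """
    rng = random.Random(seed)
    return sorted(rng.choices(range(num_frames), k=num_samples))
```

Calling the function several times with different seeds yields multiple short clips that each cover the whole video, which is the intuition behind capturing long-range context from only a few frames.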

Sparse Temporal Causal Convolution for Efficient Action Modeling

Changmao Cheng
Chi Zhang
Yichen Wei
Yu-Gang Jiang

Recently, spatio-temporal convolutional networks have achieved prominent performance in action classification. However, debates on the importance of temporal information lead to the rethinking of these architectures. In this work, we propose to employ sparse temporal convolutional operations in networks for efficient action modeling. We demonstrate that the explicit temporal feature interactions can be largely reduced without any degradation. And towards better scalability, we use causal convolutions for temporal feature learning. Under causality constraints, we replenish the model with auxiliary self-supervised tasks, namely video prediction and frame order discrimination. Besides, a gradient based multi-task learning algorithm is introduced for guaranteeing the dominance of action recognition task. The proposed model matches or outperforms the state-of-the-art methods on Kinetics, Something-Something V2, UCF101 and HMDB51 datasets.
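
The causality constraint mentioned above means that the output at time step t may depend only on inputs at t and earlier. A minimal one-dimensional sketch follows. It shows only the causal part, not the sparsity or the network from the paper, and the function name `causal_conv1d` and the kernel layout are assumptions for illustration.

```python
def causal_conv1d(sequence, kernel):
    """Causal 1D convolution over a list of numbers.

    output[t] depends only on sequence[t], sequence[t-1], ...
    kernel[0] weights the current step, kernel[1] the previous step, and so
    on. Steps before the start of the sequence are treated as zeros.
    """
    out = []
    for t in range(len(sequence)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * sequence[t - k]
        out.append(acc)
    return out
```

Because no future value enters the sum, changing a later input never alters an earlier output, which allows such a layer to process frames as they arrive.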

Optimized Skeleton-based Action Recognition via Sparsified Graph Regression

Xiang Gao
Wei Hu
Jiaxiang Tang
Jiaying Liu
Zongming Guo

With the prevalence of accessible depth sensors, dynamic human body skeletons have attracted much attention as a robust modality for action recognition. Previous methods model skeletons based on RNNs or CNNs, which have limited expressive power for irregular skeleton joints. While graph convolutional networks (GCN) have been proposed to address irregular graph-structured data, the fundamental graph construction remains challenging. In this paper, we represent skeletons naturally on graphs, and propose a graph regression based GCN (GR-GCN) for skeleton-based action recognition, aiming to capture the spatio-temporal variation in the data. As the graph representation is crucial to graph convolution, we first propose graph regression to statistically learn the underlying graph from multiple observations. In particular, we provide spatio-temporal modeling of skeletons and pose an optimization problem on the graph structure over consecutive frames, which enforces the sparsity of the underlying graph for efficient representation. The optimized graph not only connects each joint to its neighboring joints in the same frame strongly or weakly, but also links it with relevant joints in the previous and subsequent frames. We then feed the optimized graph into the GCN along with the coordinates of the skeleton sequence for feature learning, where we deploy high-order and fast Chebyshev approximation of spectral graph convolution. Further, we provide analysis of the variation characterization by the Chebyshev approximation. Experimental results validate the effectiveness of the proposed graph regression and show that the proposed GR-GCN achieves state-of-the-art performance on the widely used NTU RGB+D, UT-Kinect and SYSU 3D datasets.
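
For readers unfamiliar with the Chebyshev approximation of spectral graph convolution, the sketch below applies a polynomial filter to a graph signal using the recurrence T_0 = I, T_1 = L, T_k = 2 L T_(k-1) - T_(k-2). It is the generic textbook form, not the GR-GCN layer. The names `matvec` and `chebyshev_graph_filter` are hypothetical, and the input matrix is assumed to be a Laplacian already rescaled so that its eigenvalues lie in [-1, 1].

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def chebyshev_graph_filter(l_scaled, x, theta):
    """Compute sum_k theta[k] * T_k(l_scaled) @ x with the Chebyshev recurrence.

    l_scaled: rescaled graph Laplacian as a list of rows (assumed input).
    x: graph signal, one value per node.
    theta: filter coefficients, at least one; len(theta) - 1 is the order.
    """
    t_prev = list(x)                          # T_0(L) x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_curr = matvec(l_scaled, x)              # T_1(L) x = L x
    out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        lt = matvec(l_scaled, t_curr)
        t_next = [2 * a - b for a, b in zip(lt, t_prev)]  # T_k = 2 L T_(k-1) - T_(k-2)
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out
```

The recurrence needs only repeated matrix-vector products, so no eigendecomposition of the Laplacian is required, which is why this approximation is considered fast.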

Prediction-CGAN: Human Action Prediction with Conditional Generative Adversarial Networks

Wanru Xu
Jian Yu
Zhenjiang Miao
Lili Wan
Qiang Ji

The underlying challenge of human action prediction, i.e. maintaining prediction accuracy at the very beginning of an action execution, is still not well handled. In this paper, we propose a Prediction Conditional Generative Adversarial Network (Prediction-CGAN) for predicting actions, which shares information between completely observed and partially observed videos. Instead of generating future frames, we aim at completing the visual representations of unfinished videos, which can be directly utilized to predict the action label at any progress level. The Prediction-CGAN incorporates the completion constraint to learn a transformation from incomplete actions to complete actions; the adversarial constraint to ensure the generation has similar discriminative power to the complete representation; the label consistency constraint to encourage label consistency between each segment and its corresponding complete video; and the monotonically increasing confidence constraint to yield increasingly accurate predictions as more frames are observed. Meanwhile, we introduce a novel adversarial criterion designed for the prediction task, which requires the generation to be more discriminative than its corresponding incomplete representation but less discriminative than its real complete representation. In experiments, we present adequate evaluations to show that the proposed Prediction-CGAN outperforms state-of-the-art methods in action prediction.
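One plausible way to express a constraint that prediction confidence should not drop as more of the video is observed is a hinge penalty on consecutive progress levels. The sketch below is only an assumed form chosen for illustration; the abstract does not give the exact loss, and the name `monotonic_confidence_penalty` is hypothetical.

```python
def monotonic_confidence_penalty(confidences):
    """Sum of max(0, c[t] - c[t+1]) over consecutive progress levels.

    `confidences` holds the predicted probability of the true action label
    at increasing observation ratios (an assumption for this example).
    The penalty is zero exactly when the sequence never decreases.
    """
    return sum(max(0.0, a - b) for a, b in zip(confidences, confidences[1:]))


print(monotonic_confidence_penalty([0.2, 0.5, 0.9]))  # 0.0, never decreases
print(monotonic_confidence_penalty([0.6, 0.4, 0.8]))  # positive, one drop of about 0.2
```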

Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition

Haoze Wu
Zheng-Jun Zha
Xin Wen
Zhenzhong Chen
Dong Liu
Xuejin Chen

3D convolutional neural networks have recently been applied to explore spatial-temporal content for video action recognition. However, they either suffer from the high computational cost of spatial-temporal feature extraction or ignore the correlation between appearance and motion. In this work, we propose a novel Cross-Fiber Spatial-Temporal Co-enhanced (CFST) architecture aiming to reduce the number of parameters tremendously while achieving accurate recognition of actions. We slice the complex 3D convolutional network into a group of lightweight fibers that run through the whole network. Across the separated fibers, we introduce the Cross-Fiber Recalibration unit, which shares extracted features from each fiber and measures the interaction between fibers to emphasize informative ones. Within each fiber, the Spatial-Temporal Co-enhanced unit is put forward to co-enhance the learning of spatial and temporal features, leading to a more discriminative spatial-temporal representation. An end-to-end deep network, CFST-Net, is also presented based on the proposed CFST architecture for video action recognition. Extensive experimental results show that our CFST-Net significantly boosts the performance of existing convolutional networks and achieves state-of-the-art accuracy on three challenging benchmarks, i.e., UCF-101, HMDB-51 and Kinetics-400, with far fewer parameters and FLOPs.

Long Short-Term Relation Networks for Video Action Detection

Dong Li
Ting Yao
Zhaofan Qiu
Houqiang Li
Tao Mei

It has been well recognized that modeling human-object or object-object relations is helpful for detection tasks. Nevertheless, the problem is not trivial, especially when exploring the interactions between human actor, object and scene (collectively referred to as human-context) to boost video action detectors. The difficulty originates from the fact that reliable relations in a video should depend not only on the short-term human-context relation in the present clip but also on the temporal dynamics distilled over a long-range span of the video. This motivates us to capture both short-term and long-term relations in a video. In this paper, we present new Long Short-Term Relation Networks, dubbed LSTR, that aggregate and propagate relations to augment features for video action detection. Technically, the Region Proposal Network (RPN) is remoulded to first produce 3D bounding boxes, i.e., tubelets, in each video clip. LSTR then models short-term human-context interactions within each clip through a spatio-temporal attention mechanism and reasons about long-term temporal dynamics across video clips via Graph Convolutional Networks (GCN) in a cascaded manner. Extensive experiments are conducted on four benchmark datasets, and superior results are reported in comparison with state-of-the-art methods.

SESSION: Session 2B: Adversarial Learning

Attacking Gait Recognition Systems via Silhouette Guided GANs

Meijuan Jia
Hongyu Yang
Di Huang
Yunhong Wang

This paper investigates a new attack method against gait recognition systems. Different from typical spoofing attacks that require impostors to mimic certain clothing or walking styles, it proposes to intercept the video stream captured by the on-site camera and replace it with synthesized samples. To this end, we present a novel Generative Adversarial Network (GAN) based approach, which is able to render a faked video from the source walking sequence of a specified subject and the target scene image with both good visual effects and sufficient discriminative details. A new generator architecture is built, where the features of the source foreground sequence and the target background image are combined at multiple scales, making the synthesized video vivid. To fool recognition systems, silhouette-conditioned losses are specially designed to constrain the static and dynamic consistency between the subjects in the source and generated videos. A triplet loss based on person re-identification similarity is exploited to guide the generator, which keeps the personalized appearance properties stable. The edge and flow-related losses further regulate the generation of the attacking video. Two state-of-the-art gait recognition systems are used for evaluation, namely GaitSet and CNN-Gait, and we analyze their performance under attack. Both the visual fidelity and attacking ability of the generated videos validate the effectiveness of the proposed method.

Mocycle-GAN: Unpaired Video-to-Video Translation

Yang Chen
Yingwei Pan
Ting Yao
Xinmei Tian
Tao Mei

Unsupervised image-to-image translation is the task of translating an image from one domain to another in the absence of any paired training examples and tends to be more applicable to practical applications. Nevertheless, the extension of such synthesis from image-to-image to video-to-video is not trivial, especially when capturing spatio-temporal structures in videos. The difficulty originates from the fact that not only the visual appearance in each frame but also the motion between consecutive frames should be realistic and consistent across the transformation. This motivates us to explore both appearance structure and temporal continuity in video synthesis. In this paper, we present a new Motion-guided Cycle GAN, dubbed Mocycle-GAN, that integrates motion estimation into an unpaired video translator. Technically, Mocycle-GAN capitalizes on three types of constraints: an adversarial constraint discriminating between synthetic and real frames, cycle consistency encouraging an inverse translation on both frame and motion, and motion translation validating the transfer of motion between consecutive frames. Extensive experiments are conducted on video-to-labels and labels-to-video translation, and superior results are reported in comparison with state-of-the-art methods. More remarkably, we qualitatively demonstrate our Mocycle-GAN for both flower-to-flower and ambient condition transfer.

Adversarial Preference Learning with Pairwise Comparisons

Zitai Wang
Qianqian Xu
Ke Ma
Yangbangyan Jiang
Xiaochun Cao
Qingming Huang

When facing rich multimedia content and making a decision, users tend to be overwhelmed with redundant options. Recommendation systems can improve the users' experience by predicting the possible preferences of a given user. The vast majority of the literature adopts the collaborative framework, which relies on a static and fixed formulation of the rating score prediction function (in most cases an inner product function). However, such a static learning paradigm is not consistent with the dynamic nature of human intelligence. Motivated by this, we present a novel adversarial framework for collaborative ranking. On one hand, we leverage a deep generator to approximate an arbitrary continuous score function in terms of pairwise comparison. On the other hand, a discriminator provides personalized supervision signals with increasing difficulty. Different from the traditional static learning framework, our proposed approach enjoys a dynamic nature and unifies both the generative and the discriminative model for collaborative ranking. Comprehensive empirical studies on three real-world datasets show significant improvements of the adversarial framework over the state-of-the-art methods.
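For context, the static baseline this abstract contrasts itself with can be sketched as an inner-product score combined with a standard pairwise logistic loss, which is small when the preferred item scores higher than the other one. This is a generic textbook formulation, not the adversarial framework of the paper; the function names and the toy vectors are assumptions.

```python
import math


def inner_product_score(user_vec, item_vec):
    """Fixed score function: dot product of user and item embeddings."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def pairwise_logistic_loss(score_preferred, score_other):
    """-log(sigmoid(s_i - s_j)), written as log(1 + exp(-(s_i - s_j)))."""
    diff = score_preferred - score_other
    return math.log1p(math.exp(-diff))


user = [1.0, 0.5]
liked, skipped = [0.9, 0.8], [0.1, 0.2]
s_liked = inner_product_score(user, liked)
s_skipped = inner_product_score(user, skipped)
print(pairwise_logistic_loss(s_liked, s_skipped))   # small: ranking agrees with the comparison
print(pairwise_logistic_loss(s_skipped, s_liked))   # larger: ranking contradicts it
```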

Deep Adversarial Graph Attention Convolution Network for Text-Based Person Search

Jiawei Liu
Zheng-Jun Zha
Richang Hong
Meng Wang
Yongdong Zhang

The newly emerging text-based person search task aims at retrieving the target pedestrian by a natural-language query with a fine-grained description of the pedestrian. It is more applicable in reality than image/video based person search, i.e., person re-identification, because it does not require an image or video query of the pedestrian. In this work, we propose a novel deep adversarial graph attention convolution network (A-GANet) for text-based person search. The A-GANet exploits both textual and visual scene graphs, consisting of object properties and relationships, from the text queries and gallery images of pedestrians, towards learning informative textual and visual representations. It learns an effective joint textual-visual latent feature space in an adversarial learning manner, bridging the modality gap and facilitating pedestrian matching. Specifically, the A-GANet consists of an image graph attention network, a text graph attention network and an adversarial learning module. The image and text graph attention networks are designed with a novel graph attention convolution layer, which effectively exploits graph structure in the learning of textual and visual features, leading to precise and discriminative representations. An adversarial learning module is developed with a feature transformer and a modality discriminator, to learn a joint textual-visual feature space for cross-modality matching. Extensive experimental results on two challenging benchmarks, i.e., the CUHK-PEDES and Flickr30k datasets, have demonstrated the effectiveness of the proposed method.

STDGAN: ResBlock Based Generative Adversarial Nets Using Spectral Normalization and Two Different Discriminators

Zhaoyu Zhang
Jun Yu

The generative adversarial network (GAN) is a powerful generative model. However, it suffers from two key problems: convergence and mode collapse. To overcome these drawbacks, this paper presents a novel GAN architecture, called STDGAN, which consists of one generator and two different discriminators. Since a GAN is analogous to a minimax game, the proposed architecture is as follows. The generator G aims to produce realistic-looking samples to fool both discriminators. The first discriminator D1 gives high scores to samples from the data distribution, while the second one, D2, conversely favors samples from the generator. Specifically, minibatch discrimination and Spectral Normalization (SN) are first adopted in D1. Then, based on the ResBlock architecture, Spectral Normalization (SN) and Scaled Exponential Linear Units (SELU) are adopted in the first and last half of the layers of D2, respectively. In particular, a novel loss function is designed to optimize the STDGAN by minimizing the KL divergence. Extensive experiments on the CIFAR-10/100 and ImageNet datasets demonstrate that the proposed STDGAN can effectively solve the problems of convergence and mode collapse and obtain a higher Inception Score (IS) and a lower Fréchet Inception Distance (FID) compared with other state-of-the-art GANs.
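Spectral Normalization, as generally described, divides a weight matrix by an estimate of its largest singular value, usually obtained by power iteration. The sketch below shows that generic idea in pure Python, assuming a small dense matrix, a fixed start vector and many iterations for clarity; it is not the STDGAN discriminator, and the function names are illustrative.

```python
import math


def spectral_norm(w, n_iter=50):
    """Estimate the largest singular value of matrix w by power iteration."""
    rows, cols = len(w), len(w[0])
    v = [1.0 / math.sqrt(cols)] * cols       # fixed unit start vector (assumption)
    sigma = 0.0
    for _ in range(n_iter):
        # u = W v, normalized
        u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        u_norm = math.sqrt(sum(x * x for x in u))
        u = [x / u_norm for x in u]
        # v = W^T u; its length converges to the top singular value
        v = [sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma


def spectrally_normalize(w, n_iter=50):
    """Rescale w so that its largest singular value is approximately 1."""
    sigma = spectral_norm(w, n_iter)
    return [[x / sigma for x in row] for row in w]


weights = [[3.0, 0.0], [0.0, 1.0]]
print(spectral_norm(weights))            # close to 3.0
print(spectrally_normalize(weights))     # roughly [[1.0, 0.0], [0.0, 0.333...]]
```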

Adversarial Colorization of Icons Based on Contour and Color Conditions

Tsai-Ho Sun
Chien-Hsun Lai
Sai-Keung Wong
Yu-Shuen Wang

We present a system to help designers create icons that are widely used in banners, signboards, billboards, homepages, and mobile apps. Designers are tasked with drawing contours, whereas our system colorizes contours in different styles. This goal is achieved by training a dual conditional generative adversarial network (GAN) on our collected icon dataset. One condition requires the generated image and the drawn contour to possess a similar contour, while the other requires the image and the referenced icon to be similar in color style. Accordingly, the generator takes a contour image and a man-made icon image to colorize the contour, and then the discriminators determine whether the result fulfills the two conditions. The trained network is able to colorize icons as demanded by designers and greatly reduces their workload. For the evaluation, we compared our dual conditional GAN to several state-of-the-art techniques. Experimental results demonstrate that our network outperforms the previous networks. Finally, we will provide the source code, icon dataset, and trained network for public use.

MetaAdvDet: Towards Robust Detection of Evolving Adversarial Attacks

Chen Ma
Chenxu Zhao
Hailin Shi
Li Chen
Junhai Yong
Dan Zeng

Deep neural networks (DNNs) are vulnerable to adversarial attacks, which are maliciously implemented by adding human-imperceptible perturbations to images and thus lead to incorrect predictions. Existing studies have proposed various methods to detect new adversarial attacks. However, new attack methods keep evolving and yield new adversarial examples that bypass the existing detectors. Training detectors requires collecting tens of thousands of samples, while new attacks evolve much more frequently than such high-cost data collection can keep up with. As a result, samples of newly evolved attacks remain available only at small scale. To solve this few-shot problem with evolving attacks, we propose a meta-learning based robust detection method to detect new adversarial attacks with limited examples. Specifically, the learning consists of a double-network framework: a task-dedicated network and a master network, which alternately learn the detection capability for either a seen attack or a new attack. To validate the effectiveness of our approach, we construct benchmarks with few-shot-fashion protocols based on three conventional datasets, i.e. CIFAR-10, MNIST and Fashion-MNIST. Comprehensive experiments are conducted on them to verify the superiority of our approach with respect to traditional adversarial attack detection methods. The implementation code is available online.

Tell Me Where It is Still Blurry: Adversarial Blurred Region Mining and Refining

Jen-Chun Lin
Wen-Li Wei
Tyng-Luh Liu
C.-C. Jay Kuo
Mark Liao

Mobile devices such as smart phones are ubiquitously being used to take photos and videos, thus increasing the importance of image deblurring. This study introduces a novel deep learning approach that can automatically and progressively achieve the task via adversarial blurred region mining and refining (adversarial BRMR). Starting with a collaborative mechanism of two coupled conditional generative adversarial networks (CGANs), our method first learns the image-scale CGAN, denoted as iGAN, to globally generate a deblurred image and locally uncover its still blurred regions through an adversarial mining process. Then, we construct the patch-scale CGAN, denoted as pGAN, to further improve sharpness of the most blurred region in each iteration. Owing to such complementary designs, the adversarial BRMR indeed functions as a bridge between iGAN and pGAN, and yields the performance synergy in better solving blind image deblurring. The overall formulation is self-explanatory and effective to globally and locally restore an underlying sharp image. Experimental results on benchmark datasets demonstrate that the proposed method outperforms the current state-of-the-art technique for blind image deblurring both quantitatively and qualitatively.

Joint-attention Discriminator for Accurate Super-resolution via Adversarial Training

Rong Chen
Yuan Xie
Xiaotong Luo
Yanyun Qu
Cuihua Li

Tremendous progress has been witnessed on single image super-resolution (SR), where existing deep SR models achieve impressive performance in objective criteria, e.g., PSNR and SSIM. However, most SR methods are limited in visual perception; for example, their outputs look too smooth. Generative adversarial networks (GANs) yield better SR visual effects than most deep SR models but perform poorly on objective criteria. In order to trade off the objective and subjective SR performance, we design a joint-attention discriminator with which a GAN improves the SR performance in PSNR and SSIM, while maintaining the visual effect compared with non-attention GAN based SR models. The joint-attention discriminator contains dense channel-wise attention and cross-layer attention blocks. The former is applied in the shallow layers of the discriminator for channel-wise weighted combination of feature maps. The latter is employed to select feature maps in some middle and deep layers for effective discrimination. Extensive experiments are conducted on six benchmark datasets and the experimental results show that our proposed discriminator, combined with different generators, can achieve more realistic visual performance.
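PSNR, one of the objective criteria named above, is defined from the mean squared error between a reference image and its reconstruction as 10 * log10(MAX^2 / MSE). The sketch below computes it for images given as flat lists of pixel values; the 8-bit peak value of 255 and the toy pixel lists are assumptions for the example.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [100, 100, 100, 100]
print(psnr(reference, [101, 99, 100, 100]))  # higher PSNR: small error
print(psnr(reference, [110, 90, 100, 100]))  # lower PSNR: larger error
```

Because PSNR depends only on pixel-wise squared error, an overly smooth result can still score well, which is the tension between objective and perceptual quality that the abstract describes.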

BasketballGAN: Generating Basketball Play Simulation Through Sketching

Hsin-Ying Hsieh
Chieh-Yu Chen
Yu-Shuen Wang
Jung-Hong Chuang

We present a data-driven basketball set play simulation. Given an offensive set play sketch, our method simulates potential scenarios that may occur in the game. The simulation provides coaches and players with insights on how a given set play can be executed. To achieve this goal, we train a conditional adversarial network on NBA movement data to imitate how players move around the court through two major components: a generator that learns to generate natural player movements based on latent noise and a user-sketched set play; and a discriminator that is used to evaluate the realism of the basketball play. To improve the quality of simulation, we minimize 1) a dribbler loss to prevent the ball from drifting away from the dribbler; 2) a defender loss to prevent the dribbler from being left undefended; 3) a ball passing loss to ensure the straightness of passing trajectories; and 4) an acceleration loss to minimize unnecessary player movements. To evaluate our system, we objectively compared real and simulated basketball set plays. In addition, a subjective test was conducted to judge whether a set play was real or generated by our network. On average, the correct rate on this binary test was 56.17%. Experimental results and the evaluations demonstrated the effectiveness of our system.

Joint Adversarial Domain Adaptation

Shuang Li
Chi Harold Liu
Binhui Xie
Limin Su
Zhengming Ding
Gao Huang

Domain adaptation aims to transfer the enriched label knowledge from large amounts of source data to unlabeled target data. It has attracted significant interest in multimedia analysis. Existing research mainly focuses on learning domain-wise transferable representations via statistical moment matching or adversarial adaptation techniques, while ignoring the class-wise mismatch across domains, resulting in inaccurate distribution alignment. To address this issue, we propose a Joint Adversarial Domain Adaptation (JADA) approach to simultaneously align domain-wise and class-wise distributions across source and target in a unified adversarial learning process. Specifically, JADA attempts to solve two complementary minimax problems jointly. The feature generator aims not only to fool the well-trained domain discriminator to learn domain-invariant features, but also to minimize the disagreement between the predictions of two distinct task-specific classifiers, so as to synthesize target features near the support of the source for each class. As a result, the learned transferable features will be equipped with more discriminative structures, and effectively avoid mode collapse. Additionally, JADA enables efficient end-to-end training via a simple back-propagation scheme. Extensive experiments on several real-world cross-domain benchmarks, including VisDA-2017, ImageCLEF, Office-31 and digits, verify that JADA can gain remarkable improvements over other state-of-the-art deep domain adaptation approaches.
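The disagreement between two task-specific classifiers can be measured in several ways; a common choice in the domain adaptation literature is the mean absolute difference between their softmax outputs. The sketch below uses that choice purely as an assumption, since the abstract does not specify the measure, and the logits are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_discrepancy(logits_a, logits_b):
    """Mean absolute difference between two classifiers' class probabilities."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return sum(abs(a - b) for a, b in zip(pa, pb)) / len(pa)


# Hypothetical logits from two classifiers for one target sample.
print(classifier_discrepancy([2.0, 0.1, -1.0], [2.0, 0.1, -1.0]))  # 0.0: full agreement
print(classifier_discrepancy([2.0, 0.1, -1.0], [-1.0, 0.1, 2.0]))  # positive: disagreement
```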

Adversarial Seeded Sequence Growing for Weakly-Supervised Temporal Action Localization

Chengwei Zhang
Yunlu Xu
Zhanzhan Cheng
Yi Niu
Shiliang Pu
Fei Wu
Futai Zou

Temporal action localization is an important yet challenging research topic due to its various applications. Since frame-level or segment-level annotations of untrimmed videos require a large amount of labor, studies on weakly-supervised action detection have been springing up. However, most existing frameworks rely on the Class Activation Sequence (CAS) to localize actions by minimizing the video-level classification loss, which exploits the most discriminative parts of actions but ignores the minor regions. In this paper, we propose a novel weakly-supervised framework that eliminates such demerits through adversarial learning of two modules. Specifically, the first module is a Seeded Sequence Growing (SSG) Network for progressively extending seed regions (namely the highly reliable regions initialized by a CAS-based framework) to their expected boundaries. The second module is a specific classifier for mining trivial or incomplete action regions, which is trained on the shared features after erasing the seeded regions activated by SSG. In this way, a whole network composed of these two modules can be trained in an adversarial manner. The goal of the adversary is to mine features that are difficult for the action classifier. That is, erasure by SSG will force the classifier to discover minor or even new action regions on the input feature sequence, and the classifier will drive the seeds to grow, alternately. Finally, we obtain the action locations and categories from the well-trained SSG and the classifier. Extensive experiments on two public benchmarks, THUMOS'14 and ActivityNet1.3, demonstrate the impressive performance of our proposed method compared with state-of-the-art methods.
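To make the CAS-based baseline concrete, a simple way to turn a class activation sequence into action segments is to threshold it and keep maximal runs of frames above the threshold. The sketch below shows that generic baseline only, not the SSG network; the threshold and the activation values are made up for the example.

```python
def localize_from_cas(cas, threshold):
    """Return (start, end) pairs, end exclusive, for runs with cas[t] >= threshold."""
    segments = []
    start = None
    for t, score in enumerate(cas):
        if score >= threshold and start is None:
            start = t                         # a new segment opens
        elif score < threshold and start is not None:
            segments.append((start, t))       # the current segment closes
            start = None
    if start is not None:
        segments.append((start, len(cas)))    # segment runs to the end
    return segments


activations = [0.1, 0.8, 0.9, 0.2, 0.7]
print(localize_from_cas(activations, 0.5))    # [(1, 3), (4, 5)]
```

With a high threshold only the most discriminative frames survive, which illustrates the incompleteness problem the abstract attributes to CAS-based localization.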

Cycle-consistent Conditional Adversarial Transfer Networks

Jingjing Li
Erpeng Chen
Zhengming Ding
Lei Zhu
Ke Lu
Zi Huang

Domain adaptation investigates the problem of cross-domain knowledge transfer where the labeled source domain and unlabeled target domain have distinctive data distributions. Recently, adversarial training has been successfully applied to domain adaptation and achieved state-of-the-art performance. However, current adversarial models still have a fatal weakness that arises from the equilibrium challenge of adversarial training. Specifically, although most existing methods are able to confuse the domain discriminator, they cannot guarantee that the source domain and target domain are sufficiently similar. In this paper, we propose a novel approach named cycle-consistent conditional adversarial transfer networks (3CATN) to handle this issue. Our approach takes care of the domain alignment by leveraging adversarial training. Specifically, we condition the adversarial networks with the cross-covariance of learned features and classifier predictions to capture the multimodal structures of data distributions. However, since the classifier predictions are not certain, strong conditioning on the predictions is risky when they are inaccurate. We therefore further propose that truly domain-invariant features should be translatable from one domain to the other. To this end, we introduce two feature translation losses and one cycle-consistent loss into the conditional adversarial domain adaptation networks. Extensive experiments on both classical and large-scale datasets verify that our model is able to outperform previous state-of-the-art methods with significant improvements.

GAN Flexible Lmser for Super-resolution

Peiying Li
Shikui Tu
Lei Xu

Existing single image super-resolution (SISR) methods usually focus on Low-Resolution (LR) images which are artificially generated from High-Resolution (HR) images by a down-sampling process, but are not robust when the training set and testing set are unmatched. This paper proposes a GAN Flexible Lmser (GFLmser) network that bidirectionally learns the High-to-Low (H2L) process that degrades HR images to LR images and the Low-to-High (L2H) process that recovers the LR images back to HR images. The two directions share the same architecture, with gated skip connections added from the H2L-net to the L2H-net in order to enhance information transfer for super-resolution. In comparison with several related state-of-the-art methods, experiments demonstrate that GFLmser is not only the most robust method on images from unmatched training and testing sets, but also achieves the best PSNR and a reasonably good FID on real-world face LR images.

SESSION: Session 2C: Captioning & Video Analysis

Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Longteng Guo
Jing Liu
Jinhui Tang
Jiangwei Li
Wei Luo
Hanqing Lu

Image captioning attempts to generate a sentence composed of several linguistic words, which are used to describe objects, attributes, and interactions in an image, denoted as visual semantic units in this paper. Based on this view, we propose to explicitly model the object interactions in semantics and geometry based on Graph Convolutional Networks (GCNs), and fully exploit the alignment between linguistic words and visual semantic units for image captioning. In particular, we construct a semantic graph and a geometry graph, where each node corresponds to a visual semantic unit, i.e., an object, an attribute, or a semantic (geometrical) interaction between two objects. Accordingly, the semantic (geometrical) context-aware embeddings for each unit are obtained through the corresponding GCN learning processes. At each time step, a context gated attention module takes as inputs the embeddings of the visual semantic units and hierarchically aligns the current word with these units by first deciding which type of visual semantic unit (object, attribute, or interaction) the current word is about, and then finding the most correlated visual semantic units under this type. Extensive experiments are conducted on the challenging MS-COCO image captioning dataset, and superior results are reported in comparison with state-of-the-art approaches.

Hierarchical Global-Local Temporal Modeling for Video Captioning

Yaosi Hu
Zhenzhong Chen
Zheng-Jun Zha
Feng Wu

In this paper, a Hierarchical Temporal Model (HTM) is proposed for the video captioning task, based on exploring the global and local temporal structure to better recognize fine-grained objects and actions. In our HTM, the encoder and decoder are hierarchically aligned according to different levels of features. The encoder applies two LSTM layers to construct temporal structures at both frame-level and object-level where the attention mechanism is applied to locate objects of interest, and the decoder uses corresponding LSTM layers to extract pivotal features from global to local through multi-level attention mechanism. Moreover, the local temporal structure is constructed implicitly from candidate object-oriented features under the guidance of global temporal-spatial representation, that could generate more accurate descriptions in handling shot-switching problems. Experiments on the widely used Microsoft Video Description Corpus (MSVD) and Charades datasets demonstrate the effectiveness of our proposed approach when compared to the state-of-the-art methods.

Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

Yuqing Song
Shizhe Chen
Yida Zhao
Qin Jin

Generating image descriptions in different languages is essential to satisfy users worldwide. However, it is prohibitively expensive to collect large-scale paired image-caption datasets for every target language, which are critical for training decent image captioning models. Previous works tackle the unpaired cross-lingual image captioning problem through a pivot language, relying on paired image-caption data in the pivot language and pivot-to-target machine translation models. However, such a language-pivoted approach suffers from inaccuracy introduced by the pivot-to-target translation, including disfluency and visual irrelevancy errors. In this paper, we propose to generate cross-lingual image captions with self-supervised rewards in the reinforcement learning framework to alleviate these two types of errors. We employ self-supervision from a mono-lingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards. We conduct extensive experiments for unpaired cross-lingual image captioning in both English and Chinese on two widely used image caption corpora. The proposed approach achieves significant performance improvement over state-of-the-art methods.

MUCH: Mutual Coupling Enhancement of Scene Recognition and Dense Captioning

Xinhang Song
Bohan Wang
Gongwei Chen
Shuqiang Jiang

Due to the abstraction of scenes, comprehensive scene understanding requires semantic modeling in both global and local aspects. Scene recognition is usually researched from a global point of view, while dense captioning is typically studied for local regions. Previous works model scene recognition and dense captioning separately. In contrast, we propose a joint learning framework that benefits from the mutual coupling of scene recognition and dense captioning models. Generally, these two tasks are coupled through two steps: 1) fusing the supervision by considering the contexts between scene labels and local captions, and 2) jointly optimizing semantically symmetric LSTM models. In particular, in order to balance the bias between dense captioning and scene recognition, a scene adaptive non-maximum suppression (NMS) method is proposed to emphasize the scene related regions in the region proposal procedure, and a region-wise and category-wise weighted pooling method is proposed to avoid excessive attention on particular regions in the local to global pooling procedure. For model training and evaluation, scene labels are manually annotated for the Visual Genome database. The experimental results on Visual Genome show the effectiveness of the proposed method. Moreover, the proposed method can also improve previous CNN based works on public scene databases, such as MIT67 and SUN397.
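For readers unfamiliar with non-maximum suppression, the standard greedy version keeps the highest-scoring box and discards any remaining box whose intersection over union (IoU) with an already kept box exceeds a threshold. The sketch below shows this standard procedure only; the scene adaptive variant proposed in the paper is not reproduced, and the boxes, scores and threshold are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```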

Attention-based Densely Connected LSTM for Video Captioning

Yongqing Zhu
Shuqiang Jiang

Recurrent Neural Networks (RNNs), especially the Long Short-Term Memory (LSTM), have been widely used for video captioning, since they can cope with the temporal dependencies within both video frames and the corresponding descriptions. However, as the sequence gets longer, it becomes much harder to handle the temporal dependencies within the sequence. Moreover, in a traditional LSTM, previously generated hidden states other than the last one do not contribute directly to predicting the current word. As a result, the predicted word may be highly related to the last generated hidden state rather than to the overall context. To better capture long-range dependencies and directly leverage early generated hidden states, in this work, we propose a novel model named Attention-based Densely Connected Long Short-Term Memory (DenseLSTM). In DenseLSTM, to ensure maximum information flow, all previous cells are connected to the current cell, which makes the updating of the current state directly related to all its previous states. Furthermore, an attention mechanism is designed to model the impacts of different hidden states. Because each cell is directly connected with all its successive cells, each cell has direct access to the gradients from later ones. In this way, the long-range dependencies are more effectively captured. We perform experiments on two public video captioning datasets: the Microsoft Video Description Corpus (MSVD) and MSR-VTT, and experimental results illustrate the effectiveness of DenseLSTM.
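The idea of weighting all earlier hidden states when forming the current context can be sketched with plain softmax attention. The dot-product scoring used below is an assumption chosen for brevity, since the abstract does not describe the scoring function, and the function name and toy vectors are hypothetical.

```python
import math


def attend_over_history(hidden_states, query):
    """Softmax attention over all previous hidden states.

    Returns (weights, context), where context is the weighted sum of the
    hidden states and weights sum to 1.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context


history = [[1.0, 0.0], [0.0, 1.0]]            # two earlier hidden states
weights, context = attend_over_history(history, [2.0, 0.0])
print(weights)   # the first state receives the larger weight
print(context)
```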

Critic-based Attention Network for Event-based Video Captioning

Elaheh Barati
Xuewen Chen

In this paper, we investigate utilizing an actor-critic architecture for event-based video captioning. In this captioning task, the video can contain multiple overlapping events. We consider the words as the actions that our model takes sequentially. The architecture of our model consists of an actor network to predict the captions given temporal segments in a video and a critic network to measure the quality of the generated captions. Our model first localizes events in the video; then, using the localized events, it locates temporal segments in the video. We adopt a global network to generate a caption for each temporal segment. We propose an attention mechanism to account for the importance of each localized event in captioning a temporal segment in the video. We provide a set of experiments on utilizing our method in the task of event-based video captioning on the ActivityNet Captions and TACoS-MultiLevel datasets. Experimental results show that our method outperforms state-of-the-art video captioning methods.

Watch It Twice: Video Captioning with a Refocused Video Encoder

Xiangxi Shi
Jianfei Cai
Shafiq Joty
Jiuxiang Gu

With the rapid growth of video data and the increasing demands of various crossmodal applications such as intelligent video search and assistance towards visually-impaired people, video captioning task has received a lot of attention recently in computer vision and natural language processing fields. The state-of-the-art video captioning methods focus more on encoding the temporal information, while lacking effective ways to remove irrelevant temporal information and also neglecting the spatial details. In particular, the current unidirectional video encoder can be negatively affected by irrelevant temporal information, especially the irrelevant information at the beginning and at the end of a video. In addition, disregarding detailed spatial features may lead to incorrect word choices in decoding. In this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice with a predicted key frame to avoid irrelevant temporal information often occurring at the beginning and at the end of a video. The novel spatial features represent spatial information from different regions of a video and provide the decoder with more detailed information. Experiments on two benchmark datasets show superior performance of the proposed method.

MvsGCN: A Novel Graph Convolutional Network for Multi-video Summarization

Jiaxin Wu
Sheng-Hua Zhong
Yan Liu

Multi-video summarization, which tries to generate a single summary for a collection of videos, is an important task in dealing with ever-growing video data. In this paper, we are the first to propose a graph convolutional network for multi-video summarization. The novel network measures the importance and relevance of each video shot in its own video as well as in the whole video collection. An important node sampling method is proposed to emphasize the effective features which are more likely to be selected for the final video summary. Two strategies are integrated into the network to solve the inherent class imbalance problem in the task of video summarization. A loss regularization for diversity is used to encourage a diverse summary to be generated. Extensive experiments are carried out, and in comparison with traditional and recent graph models and the state-of-the-art video summarization methods, our proposed model is effective in generating a representative summary for multiple videos with good diversity. It also achieves state-of-the-art performance on two standard video summarization datasets.

Stacked Memory Network for Video Summarization

Junbo Wang
Wei Wang
Zhiyong Wang
Liang Wang
Dagan Feng
Tieniu Tan

In recent years, supervised video summarization has achieved promising progress with various recurrent neural network (RNN) based methods, which treat video summarization as a sequence-to-sequence learning problem to exploit temporal dependency among video frames across variable ranges. However, RNNs have limitations in modelling the long-term temporal dependency for summarizing videos with thousands of frames due to the restricted memory storage unit. Therefore, in this paper we propose a stacked memory network called SMN to explicitly model the long dependency among video frames so that redundancy can be minimized in the video summaries produced. Our proposed SMN consists of two key components: a Long Short-Term Memory (LSTM) layer and a memory layer, where each LSTM layer is augmented with an external memory layer. In particular, we stack multiple LSTM layers and memory layers hierarchically to integrate the learned representation from prior layers. By combining the hidden states of the LSTM layers and the read representations of the memory layers, our SMN is able to derive more accurate video summaries for individual video frames. Compared with the existing RNN-based methods, our SMN is particularly good at capturing long temporal dependency among frames with few additional training parameters. Experimental results on two widely used public benchmark datasets, SumMe and TVsum, demonstrate that our proposed model is able to clearly outperform a number of state-of-the-art ones under various settings.

Generative Reconstructive Hashing for Incomplete Video Analysis

Jingyi Zhang
Zhen Wei
Ionut Cosmin Duta
Fumin Shen
Li Liu
Fan Zhu
Xing Xu
Ling Shao
Heng Tao Shen

In the literature of video analysis, most research, such as that on retrieval and recognition, hypothesizes that each input video contains at least one complete semantic entity, e.g. an activity, action or event. However, this hypothesis does not hold in many realistic scenarios for two main reasons. First, complete videos whose quality is good enough for automatic analysis are not always accessible because of heavy motion blur, occlusions, interruptions, etc. Second, extracting features from complete videos often fails to meet the speed and storage requirements of large-scale use cases. To tackle these challenges, incomplete videos are more useful, but research on them is scarce. In this paper, we propose a novel and effective hashing framework specialized in large-scale incomplete video analysis called Generative Reconstructive Hashing (GRH). To begin with, an adversarial generative network is specially designed to map incomplete video features to the feature distributions of complete videos, so that features of incomplete videos become indistinguishable from those of complete videos. Then, the discriminative hashing module further fills the gap between full video features and estimated features from partial videos by projecting both features into a common binary feature space, which allows improvement in efficiency compared with real-value based methods. GRH is the first end-to-end framework for incomplete video analysis. Extensive experiments on various datasets demonstrate GRH's superior effectiveness and efficiency on retrieval and recognition tasks. GRH outperforms the recent state-of-the-art methods by 5.44/3.22/4.82 in terms of MAPs on HMDB51/UCF101/CCV datasets, respectively.
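To see why a common binary feature space helps efficiency, the following minimal sketch binarizes real-valued features and ranks database items by Hamming distance. It is an illustration only: the random `projection` stands in for a learned hashing module, and all function names are hypothetical rather than part of GRH.

```python
import numpy as np

def to_binary_codes(features, projection):
    """Project real-valued features and binarize by sign into {0, 1} codes."""
    return (features @ projection > 0).astype(np.uint8)

def hamming_distances(query_code, database_codes):
    """Count differing bits between one code and every database code."""
    return np.count_nonzero(database_codes != query_code, axis=1)

rng = np.random.default_rng(0)
projection = rng.normal(size=(16, 32))    # assumption: random projection, not a learned one
database = rng.normal(size=(100, 16))     # toy stand-in for video features
codes = to_binary_codes(database, projection)

query = to_binary_codes(database[7], projection)
dists = hamming_distances(query, codes)
ranking = np.argsort(dists)               # database items sorted by closeness to the query
```

Comparing 32-bit codes requires only bit comparisons and counts, which is the source of the speed and storage advantage over real-valued distance computations.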

You Only Recognize Once: Towards Fast Video Text Spotting

Zhanzhan Cheng
Jing Lu
Yi Niu
Shiliang Pu
Fei Wu
Shuigeng Zhou

Video text spotting is still an important research topic due to its various real-world applications. Previous approaches usually follow a four-stage pipeline: detecting text in individual images, recognizing localized text regions frame by frame, tracking text streams, and generating final results with complicated post-processing, which can suffer from huge computational cost as well as interference from low-quality text. In this paper, we propose a fast and robust video text spotting framework that recognizes the localized text only once instead of frame by frame. Specifically, we first obtain text regions in videos with a well-designed spatial-temporal detector. Then we concentrate on developing a novel text recommender for selecting the highest-quality text from text streams and only recognizing the selected ones. Here, the recommender assembles text tracking, quality scoring and recognition into an end-to-end trainable module, which not only avoids interference from low-quality text but also dramatically speeds up the video text spotting process. In addition, we collect a larger scale video text dataset (LSVTD) for promoting the video text spotting community, which contains 100 text videos from 22 different real-life scenarios. Extensive experiments on two public benchmarks show that our method speeds up the recognition process by 71 times on average compared with the frame-wise manner, and also achieves remarkable state-of-the-art performance.

Black-box Adversarial Attacks on Video Recognition Models

Linxi Jiang
Xingjun Ma
Shaoxiang Chen
James Bailey
Yu-Gang Jiang

Deep neural networks (DNNs) are known for their vulnerability to adversarial examples. These are examples that have undergone small, carefully crafted perturbations, and which can easily fool a DNN into making misclassifications at test time. Thus far, the field of adversarial research has mainly focused on image models, under either a white-box setting, where an adversary has full access to model parameters, or a black-box setting where an adversary can only query the target model for probabilities or labels. Whilst several white-box attacks have been proposed for video models, black-box video attacks are still unexplored. To close this gap, we propose the first black-box video attack framework, called V-BAD. V-BAD utilizes tentative perturbations transferred from image models and partition-based rectifications found by the NES to obtain good adversarial gradient estimates with fewer queries to the target model. V-BAD is equivalent to estimating the projection of the adversarial gradient on a selected subspace. Using three benchmark video datasets, we demonstrate that V-BAD can craft both untargeted and targeted attacks to fool two state-of-the-art deep video recognition models. For the targeted attack, it achieves a >93% success rate using only an average of $3.4 \sim 8.4 \times 10^4$ queries, a similar number of queries to state-of-the-art black-box image attacks. This is despite the fact that videos often have two orders of magnitude higher dimensionality than static images. We believe that V-BAD is a promising new tool to evaluate and improve the robustness of video recognition models to black-box adversarial attacks.
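The NES mentioned above (natural evolution strategies) estimates a gradient from function values alone, which is what makes it usable when only queries to the target model are available. The sketch below shows the generic antithetic-sampling estimator on a toy quadratic loss. It omits V-BAD's tentative perturbations and partitioning entirely, and the function name, `sigma`, and `n_pairs` values are illustrative assumptions.

```python
import numpy as np

def nes_gradient(loss_fn, x, sigma=0.1, n_pairs=50, rng=None):
    """Estimate the gradient of loss_fn at x using only function evaluations.

    Each iteration draws a Gaussian direction u and queries loss_fn at
    x + sigma*u and x - sigma*u (antithetic sampling), so the total number
    of queries is 2 * n_pairs.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_pairs):
        u = rng.standard_normal(x.shape)
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return grad / (2.0 * sigma * n_pairs)

# Toy stand-in for a black-box model: a quadratic with a known gradient.
target = np.array([1.0, -2.0, 3.0, -4.0])
loss = lambda x: float(np.sum((x - target) ** 2))
x0 = np.zeros(4)
estimate = nes_gradient(loss, x0, sigma=0.1, n_pairs=2000,
                        rng=np.random.default_rng(0))
true_grad = 2.0 * (x0 - target)   # analytic gradient, for comparison only
```

Because every sampled direction costs two queries, the query budget grows with the dimensionality of the input, which is why reducing the search space matters for video inputs.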

Ranking Video Salient Object Detection

Zheng Wang
Xinyu Yan
Yahong Han
Meijun Sun

Video salient object detection has been attracting more and more research interest recently. However, the definition of salient objects in videos has long been controversial, which has become a critical bottleneck in video salient object detection. Specifically, because of the sequential information contained in videos, objects have a relative saliency ranking with respect to each other rather than a specific saliency. This implies that simply classifying objects as salient or not salient, as is usual, cannot represent the information about saliency comprehensively. To address this issue, 1) in this paper we propose a completely new definition for the salient objects in videos: ranking salient objects, which considers relative saliency ranking assisted with eye fixation points. 2) Based on this definition, a ranking video salient object dataset (RVSOD) is built. 3) Leveraging our RVSOD, a novel neural network called Synthesized Video Saliency Network (SVSNet) is constructed to detect both traditional salient objects and human eye movements in videos. Finally, a ranking saliency module (RSM) takes the results of SVSNet as input to generate the ranking saliency maps. We hope our approach will serve as a baseline and lead to conceptually new research in the field of video saliency.

Video Retargeting: Trade-off between Content Preservation and Spatio-temporal Consistency

Donghyeon Cho
Yunjae Jung
Francois Rameau
Dahun Kim
Sanghyun Woo
In So Kweon

As new display technologies (e.g. foldable phones and modular displays) with variable aspect ratios emerge, content-aware video retargeting has attracted much attention from both academia and industry. Content-aware video retargeting aims to adjust the aspect ratio of a video sequence while preserving both its content and its spatio-temporal consistency. This is a particularly challenging task since these two properties may drastically differ and conflict depending on the video characteristics. In this paper, we explore this conflict in the context of video retargeting, then we propose an appropriate solution to alleviate this issue using a deep recurrent convolutional neural network architecture. First of all, we present a method to generate multiple ground-truth labels under various aspect ratios. Using this dataset, our network is trained to predict various retargeted video candidates from a single input sequence. The resulting candidates present different properties, some of them with more emphasis on content preservation while the others focus on spatio-temporal consistency. Among the generated candidates, the final result which satisfies the best compromise is selected. A large set of qualitative and quantitative experiments shows the ability of our method for content-aware video retargeting.

SESSION: Session 2D: 3D Visual Processing

3D Point Cloud Geometry Compression on Deep Learning

Tianxin Huang
Yong Liu

3D point cloud representation has been widely used in computer vision, automatic driving, augmented reality, smart cities and virtual reality. A 3D point cloud compression method with a higher compression ratio and tiny loss is the key to improving data transportation efficiency. In this paper, we propose a new 3D point cloud geometry compression method based on deep learning, using an auto-encoder that performs better than other networks in detail reconstruction. It can reach a much higher compression ratio than the state of the art while keeping tolerable loss. It also supports compressing multiple models in parallel on a GPU, which can improve processing efficiency greatly. The compression process is composed of two parts. First, raw data is compressed into a codeword by extracting features of the raw model with the encoder. Then, the codeword is further compressed with sparse coding. The decompression process is implemented in reverse order. The codeword is recovered and fed into the decoder to reconstruct the point cloud. Detail reconstruction ability is improved by a hierarchical structure in our decoder. Latter outputs are grown from former fuzzier outputs. In this way, details are added to the former output by latter layers step by step to make a more precise prediction. We compare our method with PCL compression and Draco compression on ShapeNet40 part dataset. Our method may be the first deep learning-based point cloud compression algorithm. The experiments demonstrate that it is superior to former common compression algorithms, with a large compression ratio, while also preserving original shapes with tiny loss.

Eye in the Sky: Drone-Based Object Tracking and 3D Localization

Haotian Zhang
Gaoang Wang
Zhichao Lei
Jenq-Neng Hwang

Drones, or general UAVs, equipped with a single camera have been widely deployed to a broad range of applications, such as aerial photography, fast goods delivery and most importantly, surveillance. Despite the great progress achieved in computer vision algorithms, these algorithms are not usually optimized for dealing with images or video sequences acquired by drones, due to various challenges such as occlusion, fast camera motion and pose variation. In this paper, a drone-based multi-object tracking and 3D localization scheme is proposed based on deep learning based object detection. We first incorporate a multi-object tracking method called TrackletNet Tracker (TNT), which utilizes temporal and appearance information to track detected objects located on the ground for UAV applications. Then, we are also able to localize the tracked ground objects based on the ground plane estimated from the Multi-View Stereo technique. The system deployed on the drone can not only detect and track the objects in a scene, but can also localize their 3D coordinates in meters with respect to the drone camera. The experiments show that our tracker can reliably handle most of the detected objects captured by drones and achieve favorable 3D localization performance when compared with the state-of-the-art methods.

MMJN: Multi-Modal Joint Networks for 3D Shape Recognition

Weizhi Nie
Qi Liang
An-An Liu
Zhendong Mao
Yangyang Li

3D shape recognition has attracted wide research attention in the field of multimedia and computer vision. With the recent advance of deep learning, various deep models with different representations have achieved state-of-the-art performance. Among them, many modalities are proposed to represent a 3D model, such as point cloud, multi-view, and PANORAMA-view. Based on these representations, many corresponding deep models have shown significant performance on 3D shape recognition. However, few works consider utilizing the fusion of multi-modal information for 3D shape recognition. Since these different modalities represent the same 3D model, they should guide each other to get a better feature representation. In this paper, we propose a novel multi-modal joint network (MMJN) for 3D shape recognition, which can consider the correlation between two different modalities to extract a robust feature vector. More specifically, we propose a novel correlation loss which can utilize the correlation between different features extracted by different modality networks to increase the robustness of the feature representation. Finally, we utilize the late fusion method to fuse the multi-modal information for 3D model representation and recognition. Here, we define the weights of the different modalities' features based on a statistical method and utilize the advantages of the different modalities to generate a more robust feature. We evaluated the proposed method on the ModelNet40 dataset for 3D shape classification and retrieval tasks. Experimental results and comparisons with the state-of-the-art methods demonstrate the superiority of our approach.

Monocular Visual Object 3D Localization in Road Scenes

Yizhou Wang
Yen-Ting Huang
Jenq-Neng Hwang

3D localization of objects in road scenes is important for autonomous driving and advanced driver-assistance systems (ADAS). However, with common monocular camera setups, 3D information is difficult to obtain. In this paper, we propose a novel and robust method for 3D localization of monocular visual objects in road scenes by joint integration of depth estimation, ground plane estimation, and multi-object tracking techniques. Firstly, an object depth estimation method with depth confidence is proposed by utilizing the monocular depthmap from a CNN. Secondly, an adaptive ground plane estimation using both dense and sparse features is proposed to localize the objects when their depth estimation is not reliable. Thirdly, temporal information is taken into consideration by a new object tracklet smoothing method. Unlike most existing methods which only consider vehicle localization, our method is applicable to common moving objects in road scenes, including pedestrians, vehicles, cyclists, etc. Moreover, the input depthmap can be replaced by equivalent depth information from other sensors, like LiDAR, depth camera and Radar, which makes our system much more competitive compared with other object localization methods. As evaluated on the KITTI dataset, our method achieves favorable performance on 3D localization of both pedestrians and vehicles when compared with the state-of-the-art vehicle localization methods, although, to the best of our knowledge, no published results on pedestrian 3D localization are available for comparison.
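Once a ground plane is known, an object's 3D position can be recovered from a single image by intersecting the viewing ray through its ground-contact pixel with that plane. The sketch below shows only this generic pinhole-camera geometry, not the adaptive plane estimation proposed in the paper. The intrinsic matrix `K`, the plane parameters, and the function name are assumed values chosen for illustration.

```python
import numpy as np

def pixel_to_ground(u, v, K, plane_normal, plane_offset):
    """Intersect the viewing ray of pixel (u, v) with a plane n.X + d = 0.

    All quantities are in camera coordinates (x right, y down, z forward).
    Returns the 3D intersection point in the same units as plane_offset.
    """
    ray = np.linalg.solve(K, np.array([u, v, 1.0]))   # direction K^-1 [u, v, 1]
    denom = plane_normal @ ray
    if abs(denom) < 1e-9:
        raise ValueError("viewing ray is parallel to the ground plane")
    scale = -plane_offset / denom
    if scale <= 0:
        raise ValueError("intersection lies behind the camera")
    return scale * ray

# Assumed setup: focal length 1000 px, principal point (640, 360),
# camera mounted 1.5 m above a flat road, so the ground is the plane y = 1.5.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
normal = np.array([0.0, 1.0, 0.0])
offset = -1.5
point = pixel_to_ground(640.0, 510.0, K, normal, offset)
```

In this setup the pixel 150 rows below the principal point maps to a point 10 m ahead of the camera, which shows how sensitive the recovered distance is to the estimated plane and motivates estimating it adaptively.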

Unsupervised Domain Adaptation for 3D Human Pose Estimation

Xiheng Zhang
Yongkang Wong
Mohan S. Kankanhalli
Weidong Geng

Training an accurate 3D human pose estimator often requires a large amount of 3D ground-truth data which is inefficient and costly to collect. Previous methods have either resorted to weakly supervised methods to reduce the demand for ground-truth data for training, or used synthetically-generated but photo-realistic samples to enlarge the training data pool. Nevertheless, the former methods mainly require either additional supervision, such as unpaired 3D ground-truth data, or the camera parameters in multiview settings. On the other hand, the latter methods require accurately textured models, illumination configurations and backgrounds which need careful engineering. To address these problems, we propose a domain adaptation framework with unsupervised knowledge transfer, which aims at leveraging the knowledge in multi-modality data of the easy-to-get synthetic depth datasets to better train a pose estimator on the real-world datasets. Specifically, the framework first trains two pose estimators on synthetically-generated depth images and human body segmentation masks with full supervision, while jointly learning a human body segmentation module from the predicted 2D poses. Subsequently, the learned pose estimator and the segmentation module are applied to the real-world dataset to learn, without supervision, a new RGB image based 2D/3D human pose estimator. Here, the knowledge encoded in the supervised learning modules is used to regularize a pose estimator without ground-truth annotations. Comprehensive experiments demonstrate significant improvements over weakly supervised methods when no ground-truth annotations are available. Further experiments with ground-truth annotations show that the proposed framework can outperform state-of-the-art fully supervised methods. In addition, we conducted ablation studies to examine the impact of each loss term, as well as of different amounts of supervision signal.

DaNet: Decompose-and-aggregate Network for 3D Human Shape and Pose Estimation

Hongwen Zhang
Jie Cao
Guo Lu
Wanli Ouyang
Zhenan Sun

Reconstructing 3D human shape and pose from a monocular image is challenging despite the promising results achieved by most recent learning based methods. The commonly occurring misalignment stems from the facts that the mapping from image to model space is highly non-linear and the rotation-based pose representation of the body model is prone to result in drift of joint positions. In this work, we present the Decompose-and-aggregate Network (DaNet) to address these issues. DaNet includes three new designs, namely UVI guided learning, decomposition for fine-grained perception, and aggregation for robust prediction. First, we adopt the UVI maps, which densely build a bridge between 2D pixels and 3D vertexes, as an intermediate representation to facilitate the learning of image-to-model mapping. Second, we decompose the prediction task into one global stream and multiple local streams so that the network not only provides global perception for the camera and shape prediction, but also has detailed perception for part pose prediction. Lastly, we aggregate the messages from local streams to enhance the robustness of part pose prediction, where a position-aided rotation feature refinement strategy is proposed to exploit the spatial relationship between body parts. Such a refinement strategy is more efficient since the correlations between position features are stronger than those in the original rotation feature space. The effectiveness of our method is validated on the Human3.6M and UP-3D datasets. Experimental results show that the proposed method significantly improves the reconstruction performance in comparison with previous state-of-the-art methods. Our code is publicly available at https://github.com/HongwenZhang/DaNet-3DHumanReconstrution .

3D Singing Head for Music VR: Learning External and Internal Articulatory Synchronicity from Lyric, Audio and Notes

Jun Yu
Chang Wen Chen
Zengfu Wang

We propose a real-time 3D singing head system to enhance the talking head on model integrity, keyframe generation and song synchronicity. The individual head appearance meshes are first obtained by matching multi-view visible images with face prior for accuracy, and then used to reconstruct entire head model by integrating with generic internal articulatory meshes for efficiency. After embedding physiology, the keyframes of each phoneme-music note correspondence are substantially synthesized from real articulation data. The song synchronicity of articulators is learned using a deep neural network to train visual co-articulation model (VCM) on parallel audio-visual data. Finally, the keyframes of adjacent phoneme-music note correspondences are blended by VCM to produce song synchronized animation. Compared to state-of-the-art baselines, our system can not only clearly distinguish phonemes and notes, but also significantly reduce the dependence on training data.

Fine-grained Fitting Experience Prediction: A 3D-slicing Attention Approach

Shan Huang
Zhi Wang
Laizhong Cui
Yong Jiang
Rui Gao

The comfortableness of fashion items (e.g., footwear) when people actually wear them has become an increasingly important factor in today's fashion experience. However, existing solutions usually only provide general metrics, e.g., the size of a pair of shoes, for people to roughly infer the fitness possibility, failing to tell the details about how much it fits or why it does not fit a person. In this paper, we propose a fine-grained fitting experience prediction framework based on 3D shapes of both fashion items and people's bodies. First, we propose a 3D-slicing sampling method, by extracting a series of parallel slices from an object, to represent the spatial details of the object with a much smaller amount of features. Second, we propose a spatial self-attention based fitness prediction model including a sub-region attention method and a sequence attention method, which can capture users' comfort preferences for fine-grained regions divided from slices. Our design can capture users' try-on preferences and landmark positions that may or may not fit (e.g., too tight or too loose). Then, we design a multi-position experience module to predict users' fitting experiences, which can help to explore the spatial differences among slices better. Finally, we use subjective experiments over 500 people trying 32 pairs of fashion shoes with detailed places' comfortableness reported in questionnaires to verify our design, which has accuracies of 77.7% and 80.9% in reporting the comfortableness of tightness and length respectively, and an overall fitness accuracy of 83.6%.

iDFusion: Globally Consistent Dense 3D Reconstruction from RGB-D and Inertial Measurements

Dawei Zhong
Lei Han
Lu Fang

We present a practical, fast, globally consistent and robust dense 3D reconstruction system, iDFusion, by exploring the joint benefit of both the visual (RGB-D) solution and the inertial measurement unit (IMU). A global optimization considering all the previous states is adopted to maintain high localization accuracy and global consistency, yet its complexity, which is linear in the number of all previous camera/IMU observations, seriously impedes real-time implementation. We show that the global optimization can be solved efficiently at a complexity linear in the number of keyframes, and further realize a real-time dense 3D reconstruction system given the estimated camera states. Meanwhile, for the sake of robustness, we propose a novel loop-validity detector based on the estimated bias of the IMU state. By checking the consistency of camera movements, a false loop closure constraint introduces manifest inconsistency between the camera movements and IMU measurements. Experiments reveal that iDFusion achieves superior reconstruction performance, running at 25 fps on the CPU of portable devices, under challenging yet practical scenarios including texture-less scenes, motion blur, and repetitive content.

Ground-Aware Point Cloud Semantic Segmentation for Autonomous Driving

Jian Wu
Jianbo Jiao
Qingxiong Yang
Zheng-Jun Zha
Xuejin Chen

Semantic understanding of 3D scenes is essential for autonomous driving. Although a number of efforts have been devoted to semantic segmentation of dense point clouds, the great sparsity of 3D LiDAR data poses significant challenges in autonomous driving. In this paper, we work on the semantic segmentation problem of extremely sparse LiDAR point clouds with specific consideration of the ground as reference. In particular, we propose a ground-aware framework that well solves the ambiguity caused by data sparsity. We employ a multi-section plane fitting approach to roughly extract ground points to assist segmentation of objects on the ground. Based on the roughly extracted ground points, our approach implicitly integrates the ground information in a weakly-supervised manner and utilizes ground-aware features with a new ground-aware attention module. The proposed ground-aware attention module captures long-range dependence between ground and objects, which significantly facilitates the segmentation of small objects that only consist of a few points in extremely sparse point clouds. Extensive experiments on two large-scale LiDAR point cloud datasets for autonomous driving demonstrate that the proposed method achieves state-of-the-art performance both quantitatively and qualitatively.
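The rough ground extraction step can be pictured as fitting one plane per range section and keeping the points close to it. The sketch below is a simplified reading of "multi-section plane fitting" and not the authors' procedure: the ring-shaped sections, the lowest-20% seed selection, the distance threshold, and all function names are assumptions.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through points: returns (unit normal, centroid)."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return vt[-1], centroid

def ground_mask(points, n_sections=3, max_range=60.0, threshold=0.2):
    """Mark points lying within `threshold` of a plane fitted per range section."""
    ranges = np.linalg.norm(points[:, :2], axis=1)        # horizontal distance to sensor
    edges = np.linspace(0.0, max_range, n_sections + 1)
    mask = np.zeros(len(points), dtype=bool)
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((ranges >= lo) & (ranges < hi))[0]
        if len(idx) < 3:
            continue
        section = points[idx]
        # Assumption: the lowest 20% of points in a section are ground seeds.
        seeds = section[section[:, 2] <= np.percentile(section[:, 2], 20)]
        if len(seeds) < 3:
            continue
        normal, centroid = fit_plane(seeds)
        mask[idx] = np.abs((section - centroid) @ normal) < threshold
    return mask

# Synthetic cloud: 300 near-flat ground points plus 50 points on raised objects.
rng = np.random.default_rng(0)
ground = np.column_stack([rng.uniform(-40, 40, (300, 2)), rng.normal(0, 0.01, 300)])
objects = np.column_stack([rng.uniform(-40, 40, (50, 2)), rng.uniform(1.0, 2.0, 50)])
cloud = np.vstack([ground, objects])
mask = ground_mask(cloud)
```

Fitting separate planes per section lets the estimate follow gentle slope changes with distance, which a single global plane cannot do.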

SRINet: Learning Strictly Rotation-Invariant Representations for Point Cloud Classification and Segmentation

Xiao Sun
Zhouhui Lian
Jianguo Xiao

Point cloud analysis has drawn broad attention due to its increasing demand in various fields. Although impressive performance has been achieved on several databases, researchers neglect the fact that the orientation of those point cloud data is aligned. Varying the orientation of a point cloud may lead to the degradation of performance, restricting the capacity to generalize to real applications where the prior of orientation is often unknown. In this paper, we propose the point projection feature, which is invariant to the rotation of the input point cloud. A novel architecture is designed to mine features of different levels. We adopt a PointNet-based backbone to extract a global feature for the point cloud, and the graph aggregation operation to perceive local shape structure. Besides, we introduce an efficient key point descriptor to assign each point a different response and help recognize the overall geometry. Mathematical analyses and experimental results demonstrate that the proposed method can extract strictly rotation-invariant representations for point cloud recognition and segmentation without data augmentation, and outperforms other state-of-the-art methods.
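Strict rotation invariance can be obtained by describing each point only through lengths and inner products with axes that are themselves derived from the cloud, because a rotation preserves both. The sketch below demonstrates this principle numerically. The two axes used here (centroid and farthest point) and the function name are illustrative assumptions and need not match the paper's point projection feature.

```python
import numpy as np

def invariant_features(points):
    """Per-point features that do not change under rotation about the origin."""
    centroid = points.mean(axis=0)
    farthest = points[np.argmax(np.linalg.norm(points, axis=1))]

    def project(axis):
        # Length of each point's projection onto a cloud-derived axis.
        return points @ axis / (np.linalg.norm(axis) + 1e-12)

    return np.stack([np.linalg.norm(points, axis=1),
                     project(centroid),
                     project(farthest)], axis=1)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(64, 3))

# Build a random rotation matrix from a QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] = -q[:, 0]          # flip one column so that det(q) = +1
rotated = cloud @ q.T

same = np.allclose(invariant_features(cloud), invariant_features(rotated))
```

Since the axes rotate together with the points, the features agree for the original and the rotated cloud even though the raw coordinates differ, so no rotation augmentation is needed for a model that consumes only such features.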

L2G Auto-encoder: Understanding Point Clouds by Local-to-Global Reconstruction with Hierarchical Self-Attention

Xinhai Liu
Zhizhong Han
Xin Wen
Yu-Shen Liu
Matthias Zwicker

The auto-encoder is an important architecture for understanding point clouds through an encoding and decoding procedure of self-reconstruction. Current auto-encoders mainly focus on learning global structure by global shape reconstruction, while ignoring the learning of local structures. To resolve this issue, we propose the Local-to-Global auto-encoder (L2G-AE) to simultaneously learn the local and global structure of point clouds by local-to-global reconstruction. Specifically, L2G-AE employs an encoder to encode the geometry information of multiple scales in a local region at the same time. In addition, we introduce a novel hierarchical self-attention mechanism to highlight the important points, scales and regions at different levels in the information aggregation of the encoder. Simultaneously, L2G-AE employs a recurrent neural network (RNN) as decoder to reconstruct a sequence of scales in a local region, based on which the global point cloud is incrementally reconstructed. Our superior results in shape classification, retrieval and upsampling show that L2G-AE can understand point clouds better than state-of-the-art methods.

Self-supervised Representation Learning Using 360° Data

Junnan Li
Jianquan Liu
Yongkang Wong
Shoji Nishimura
Mohan S. Kankanhalli

The number of 360-degree panoramas shared online has been rapidly increasing due to the availability of affordable and compact omnidirectional cameras, offering a huge amount of new information that was previously unavailable. In this paper, we present the first work to exploit unlabeled 360-degree data for image representation learning. We propose middle-out, a new self-supervised learning task, which leverages the spatial configuration of normal field-of-view images sampled from a 360-degree image as supervisory signal. We train a Siamese ConvNet model to identify the middle image among three shuffled images sampled from a panorama by perspective projection. Compared to previous self-supervised methods that train models using image patches or video frames with limited field-of-view, our method leverages the rich semantic information contained in 360-degree images and forces the model to not only learn about objects, but also develop a higher-level understanding of object relationships and scene structures. We quantitatively demonstrate that the feature representation learned using the proposed task is useful for a wide range of vision tasks including object classification, object detection, scene classification, semantic segmentation, and geometry estimation. We also qualitatively show that the proposed method can force the ConvNet to extract high-level semantic concepts, an ability which previous self-supervised learning methods have not acquired.
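As a rough illustration of how a single middle-out training example could be assembled, the sketch below takes three views ordered left to right, shuffles them, and records where the original middle view ended up. The function name, the use of yaw angles as stand-ins for image crops, and the three-way label encoding are assumptions made for illustration, not details taken from the paper.

```python
import random

def make_middle_out_sample(views, rng):
    """Shuffle three spatially ordered views; return (shuffled, label).

    `views` holds three items ordered left-to-right in the panorama.
    The label is the position of the original middle view after
    shuffling. (Hypothetical helper, not the authors' code.)
    """
    if len(views) != 3:
        raise ValueError("expected exactly three views")
    order = [0, 1, 2]
    rng.shuffle(order)
    shuffled = [views[i] for i in order]
    label = order.index(1)  # where the middle view ended up
    return shuffled, label

rng = random.Random(0)
# Yaw angles in degrees stand in for perspective crops of a panorama.
shuffled, label = make_middle_out_sample([30.0, 60.0, 90.0], rng)
```

A classifier trained on such pairs receives the shuffled views as input and predicts `label`; no manual annotation is needed because the label comes from the sampling geometry.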

360-degree Video Gaze Behaviour: A Ground-Truth Data Set and a Classification Algorithm for Eye Movements

Ioannis Agtzidis
Mikhail Startsev
Michael Dorr

Eye tracking and the analysis of gaze behaviour are established tools to produce insights into how humans observe their surroundings and consume visual multimedia content. For example, gaze recordings may be directly used to study attention allocation towards the areas and objects of interest. Furthermore, segmenting the raw gaze traces into their constituent eye movements has applications in the assessment of subjective quality and mental load, and may improve computational saliency prediction of the content as well. Currently, eye trackers are beginning to be integrated into commodity virtual and augmented reality set-ups that allow for more diverse stimuli to be presented, including 360-degree content. However, because of the more complex eye-head coordination patterns that emerge, the definitions and the well-established methods that were developed for monitor-based eye tracking are often no longer directly applicable. The main contributions of this work to the field of 360-degree content analysis are threefold: First, we collect and partially annotate a new eye tracking data set for naturalistic 360-degree videos. Second, we propose a new two-stage pipeline for reliable manual annotation of both "traditional" (fixations and saccades) and more complex eye movement types that is implemented in a flexible user interface. Lastly, we develop and test a proof-of-concept algorithm for automatic classification of all the eye movement types in our data set. The data set and the source code for both the annotation tool and the algorithm are publicly available at https://gin.g-node.org/ioannis.agtzidis/360_em_dataset.

SESSION: Demonstration I

BioTouchPass Demo: Handwritten Passwords for Touchscreen Biometrics

Ruben Tolosana
Ruben Vera-Rodriguez
Julian Fierrez
Aythami Morales

BioTouchPass enhances traditional authentication systems based on Personal Identification Numbers (PIN) and One-Time Passwords (OTP) through the incorporation of biometric information from handwriting as a second level of user authentication. In our proposed approach, users draw each digit of the password on the touchscreen of the device instead of typing them as usual. This way the security of the authentication system increases as impostors need more than the traditional password to get access to the system. BioTouchPass achieves results with Equal Error Rates (EERs) ca. 4.0% when the attacker knows the password, outperforming other authentication schemes based on touch biometrics, and providing a user-friendly interface easily adaptable to a variety of mobile devices and application scenarios.
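For readers unfamiliar with the Equal Error Rate, the sketch below shows one common way to approximate it from two lists of similarity scores. The function name, the threshold sweep, and the example scores are assumptions for illustration; they do not reproduce the BioTouchPass evaluation protocol.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from similarity scores (higher = more genuine).

    Sweeps every observed score as a threshold and returns the mean of
    the false acceptance and false rejection rates at the threshold
    where the two are closest.
    """
    if not genuine or not impostor:
        raise ValueError("need both genuine and impostor scores")
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        # False rejection: genuine attempts scoring below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance: impostor attempts scoring at or above it.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Hypothetical scores, not data from the paper.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.65])
```

A lower EER means the genuine and impostor score distributions overlap less.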

Adapting Computer Vision Algorithms for Omnidirectional Video

Hannes Fassold

Omnidirectional (360°) video has become quite popular because it provides a highly immersive viewing experience. For computer vision algorithms, it poses several challenges, such as the special (equirectangular) projection commonly employed and the huge image size. In this work, we give a high-level overview of these challenges and outline strategies for adapting computer vision algorithms to the specifics of omnidirectional video.
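The equirectangular projection mentioned above maps image columns to longitude and rows to latitude. The sketch below shows that mapping and the resulting horizontal oversampling towards the poles. The axis convention, the pixel-centre offset, and the function names are assumptions chosen for illustration rather than conventions stated in the paper.

```python
import math

def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel centre to a unit direction (x, y, z).

    Assumed convention: longitude spans [-pi, pi) left to right,
    latitude spans pi/2 (top) to -pi/2 (bottom), y points up and
    z points towards the image centre.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return x, y, z

def horizontal_stretch(v, height):
    """Horizontal oversampling factor of row v: 1 / cos(latitude)."""
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return 1.0 / math.cos(lat)
```

Rows near the top and bottom of the image are stretched far more than rows near the equator, which is one reason algorithms designed for perspective images may need adapting.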

Exquisitor: Breaking the Interaction Barrier for Exploration of 100 Million Images

Hanna Ragnarsdóttir
Þórhildur Þorleiksdóttir
Omar Shahbaz Khan
Björn Þór Jónsson
Gylfi Þór Guðmundsson
Jan Zahálka
Stevan Rudinac
Laurent Amsaleg
Marcel Worring

In this demonstration, we present Exquisitor, a media explorer capable of learning user preferences in real-time during interactions with the 99.2 million images of YFCC100M. Exquisitor owes its efficiency to innovations in data representation, compression, and indexing. Exquisitor can complete each interaction round, including learning preferences and presenting the most relevant results, in less than 30 ms using only a single CPU core and modest RAM. In short, Exquisitor can bring large-scale interactive learning to standard desktops and laptops, and even high-end mobile devices.

Documenting Physical Objects with Live Video and Object Detection

Scott Carter
Laurent Denoue
Daniel Avrahami

Responding to requests for information from an application, a remote person, or an organization that involve documenting the presence and/or state of physical objects can lead to incomplete or inaccurate documentation. We propose a system that couples information requests with a live object recognition tool to semi-automatically catalog requested items and collect evidence of their current state.

Split & Dual Screen Comparison of Classic vs Object-based Video

Maarten Wijnants
Sven Coppers
Gustavo Rovelo Ruiz
Peter Quax
Wim Lamotte

Over-the-top (OTT) streaming services like YouTube and Netflix induce massive amounts of video data, thereby putting substantial pressure on network infrastructure. This paper describes a demonstration of the object-based video (OBV) methodology that allows for the quality-variant MPEG-DASH streaming of, respectively, the background and foreground object(s) of a video scene. The OBV methodology is inspired by research into human visual attention and foveated compression, in that it allows bitrate to be adaptively and dynamically assigned to those portions of the visual scene that have the highest utility in terms of perceptual quality. Using a content corpus of interview-like video footage, the described demonstration proves the OBV methodology's potential to downsize video bitrate requirements while incurring at most marginal perceptual impact (i.e., in terms of subjective video quality). Thanks to its standards-compliant Web implementation, the OBV methodology is directly and broadly deployable without requiring capital expenditure.

CamaLeon: Smart Camera for Conferencing in the Wild

Laurent Denoue
Scott Carter
Chelhwon Kim

Despite work on smart spaces, nowadays a lot of knowledge work happens in the wild: at home, in coffee places, trains, buses, planes, and of course in crowded open office cubicles. Conducting web conferences in these settings creates privacy issues, and can also distract participants, leading to a perceived lack of professionalism from the remote peer(s). To solve this common problem, we implemented CamaLeon, a browser-based tool that uses real-time machine vision powered by deep learning to change the webcam stream sent by the remote peer. Specifically, CamaLeon dynamically changes the "wild" background into one that resembles that of the office workers. In order to detect the background in disparate settings, we designed and trained a fast UNet model on head and shoulder images. CamaLeon also uses a face detector to determine whether it should stream the person's face, depending on its location (or lack of presence). It uses face recognition to make sure it streams only a face that belongs to the user who connected to the meeting. We tested the system during a few real video conferencing calls at our company in which two workers are remote. Both parties felt a sense of enhanced co-presence, and the remote participants felt more professional with their background replaced.

Personalized Video Summarization with Idiom Adaptation

Yi Dong
Chang Liu
Zhiqi Shen
Yu Han
Zhanning Gao
Pan Wang
Changgong Zhang
Peiran Ren
Xuansong Xie

Short videos are becoming key for media consumers exploring the TV and Internet. The production of short videos however remains costly. In this paper, we present a domain specific video summarization application with idiom adaptation that leverages multimedia content analysis and insights from cinematic and persuasive domains. From the back-end, content curators can push raw materials and the pre-processing algorithms will automatically extract the features and encode them as editing idioms. Users can create personalized video summaries based on these idioms. We have validated the effectiveness of the demonstration on a TVC data-set with over 600 videos, enabling production of domain specific video summaries with combinations of editing idioms. This approach has been put into trial in the testbed at Alibaba Wood.

Tastalyzer: Audiovisual Exploration of Urban and Rural Variations in Music Taste

Christine Bauer
Markus Schedl
Vera Angerer
Stefan Wegenkittl

We present a browsing interface that allows for an audiovisual exploration of regional music taste around the world. We exploit a total of 10,758,121 geolocated tweets about music. The web-based geo-aware visualization and auralization called Tastalyzer enables exploring and analyzing music taste on a fine-grained geographical level, such as (i) comparing rural and corresponding urban music taste within an agglomeration (city) or (ii) comparing the music taste in a target region (agglomeration) to the taste of the country the region is part of and (iii) to the global music taste.

Interactive Multi-camera Soccer Video Analysis System

Yunjin Wu
Ziyuan Zhao
Shengqiang Zhang
Lulu Yao
Yan Yang
Tom Z. J. Fu
Stefan Winkler

Automatic sports video analysis is an active field of research, and accurate player & ball tracking is essential for soccer video analysis and visualization. However, the variations over frames and the scarceness of large-scale well-annotated datasets make it difficult to perform supervised learning using pre-trained models, especially for Multi-Camera Multi-Target Tracking (MCMT). In this paper, we introduce an end-to-end system for multi-camera soccer video analysis that makes heavy use of parallel processing for optimization of the processing workflow. The proposed thread-level parallelism speeds up our system by more than 15 times while maintaining the level of accuracy. The system tracks the trajectories of the ball and the players in a world coordinate system based on soccer videos captured by a set of synchronized cameras. Based on these trajectories, various player-, ball-, and team-related statistics are computed, and the resulting data and visualizations can be interactively explored by the user.

Walker's Movie Map: Route View Synthesis Using Omni-directional Videos

Naoki Sugimoto
Yuko Iinuma
Kiyoharu Aizawa

We present a new movie map for walkers that synthesizes street walking views along routes in an area. We acquired a number of omnidirectional videos from the perspectives of walkers on streets in a certain area (e.g., 1 km² around Kyoto Station), then performed SLAM to obtain camera poses of key video frames, with the coordinates adjusted to the map of the area using reference points. In order to switch from one video to another at intersections, we identify the frames of video intersection using camera locations. We refine the intersection frames using visual feature matching. Finally, we synthesize moving route views by switching omnidirectional videos with alignment of the direction of the cameras. The results show that our method can precisely identify the intersection frames and generate smooth switching of videos at intersections.
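The step of identifying intersection frames from camera locations can be pictured as a nearest-pair search between two camera tracks. The sketch below is a brute-force version under stated assumptions: the function name, the 2D map coordinates, and the example tracks are hypothetical, and the visual feature matching refinement described in the abstract is not included.

```python
import math

def closest_frame_pair(track_a, track_b):
    """Return (i, j, distance) for the closest pair of camera positions.

    Each track is a list of (x, y) map coordinates, one per key frame.
    Brute force, O(len(track_a) * len(track_b)); adequate for a sketch.
    """
    if not track_a or not track_b:
        raise ValueError("both tracks must contain at least one position")
    best = None
    for i, (ax, ay) in enumerate(track_a):
        for j, (bx, by) in enumerate(track_b):
            d = math.hypot(ax - bx, ay - by)
            if best is None or d < best[2]:
                best = (i, j, d)
    return best

# Two hypothetical walks that cross near (2, 0).
east_west = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1), (3.0, 0.0)]
north_south = [(2.0, -2.0), (2.0, -1.0), (2.1, 0.0), (2.0, 1.0)]
i, j, dist = closest_frame_pair(east_west, north_south)
```

The returned frame indices give candidate switch points between the two videos; a real system would then refine them, for example with the feature matching the authors mention.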

ACE: Art, Color and Emotion

Gjorgji Strezoski
Arumoy Shome
Riccardo Bianchi
Shruti Rao
Marcel Worring

We present ACE, the Art, Color and Emotion browser. ACE is a data driven web based platform for exploring the visual sentiment and emotion in artistic paintings over time. To that end, we train our own visual artistic sentiment extraction model by leveraging the artworks from the OmniArt dataset. With our model we are able to estimate the overall sentiment dominating in groups of artworks belonging to a specific time interval. To make the results interactive and explorable we designed an intuitive interface with a carefully considered shape, color and element placement enforcing a top-down interaction scheme. Moreover, we perform extensive control on resource utilisation to provide the smoothest possible user experience and quality of service while using ACE.

Development of an Acoustic AR Gamification System to Support Physical Exercise

Takumi Kiriu
Mohit Mittal
Panote Siriaraya
Yukiko Kawai
Shinsuke Nakajima

In recent years, running has become increasingly popular as an effective exercise activity which could help improve and maintain one's physical health. However, it is generally difficult to motivate people to adhere to their fitness regimens and persist in such activities. To help address this problem, we developed a Gamified Acoustic AR running support system where we use the previous running records of a user to project a "Virtual runner" into an augmented reality space for runners to compete against. The presence of the virtual runner is conveyed through the sound of running footsteps and breathing: users hear the sound of the virtual runner while they are running and are able to virtually compete against themselves in real time. In this paper, we describe how such a system could be implemented and discuss the results of an experimental study which highlights the effectiveness of our Gamified AR system.

Audio-Visual Variational Fusion for Multi-Person Tracking with Robots

Xavier Alameda-Pineda
Soraya Arias
Yutong Ban
Guillaume Delorme
Laurent Girin
Radu Horaud
Xiaofei Li
Bastien Morgue
Guillaume Sarrazin

Robust multi-person tracking with robots opens the door to analysing engagement and social signals in real-world environments. Multi-person scenarios are characterised by (i) a time-varying number of people, (ii) intermittent auditory (e.g., speech turns) and visual cues (e.g., a person appearing/disappearing) and (iii) the impact of the robot's actions on perception. The various sensors (cameras and microphones) available for perception provide a rich flow of information of intermittent and complementary nature. How to jointly exploit these cues to tackle the multi-person tracking problem with an autonomous system has been an intense research line of the Perception Team in the past few years. In this demo we want to present our now mature achievements in the field, and demonstrate two robotic systems able to track multiple persons using auditory and visual cues, when they are available. We will bring the two robots and the necessary computing resources with us, as well as the required presentation materials to discuss the models, methods and tools supporting this technology with the attendees.

BUDA.ART: A Multimodal Content Based Analysis and Retrieval System for Buddha Statues

Benjamin Renoust
Matheus Oliveira Franca
Jacob Chan
Van Le
Ayaka Uesaka
Yuta Nakashima
Hajime Nagahara
Jueren Wang
Yutaka Fujioka

We introduce BUDA.ART, a system designed to assist researchers in Art History in exploring and analyzing an archive of pictures of Buddha statues. The system combines different CBIR and classical retrieval techniques to assemble 2D pictures, 3D statue scans and meta-data, with a focus on the Buddha facial characteristics. We build the system from an archive of 50,000 Buddhism pictures, identify unique Buddha statues, extract contextual information, and provide specific facial embedding to first index the archive. The system allows for mobile, on-site search, and for exploring similarities of statues in the archive. In addition, we provide search visualization and 3D analysis of the statues.

Fast Video Quality Enhancement using GANs

Leonardo Galteri
Lorenzo Seidenari
Marco Bertini
Tiberio Uricchio
Alberto Del Bimbo

Video compression algorithms result in a reduction of image quality, because of their lossy approach to reducing the required bandwidth. This affects commercial streaming services such as Netflix or Amazon Prime Video, but also affects video conferencing and video surveillance systems. In all these cases it is possible to improve the video quality, both for human viewing and for automatic video analysis, without changing the compression pipeline, through post-processing that eliminates the visual artifacts created by the compression algorithms. Generative Adversarial Networks have obtained extremely high quality results in image enhancement tasks; however, to obtain such results large generators are usually employed, resulting in high computational costs and processing time. In this work we present an architecture that can be used to reduce the computational cost and that has been implemented on mobile devices. A possible application is to improve video conferencing or live streaming. In these cases there is no original uncompressed video stream available. Therefore, we report results using a no-reference video quality metric, showing high naturalness and quality even for efficient networks.

Animating Your Life: Real-Time Video-to-Animation Translation

Yang Chen
Yingwei Pan
Ting Yao
Xinmei Tian
Tao Mei

We demonstrate a video-to-animation translator, which can transform real-world video into cartoon or ink-wash animation in real-time. When users upload a video or record what they are seeing with the phone, the video-to-animation translator renders the live streaming video with cartoon or ink-wash animation style while maintaining the original contents. We formulate this task as a video-to-video translation problem in the absence of any paired training examples, since the manual labeling of such paired video-animation data is costly and even unrealistic in practice. Technically, a unified unpaired video-to-video translator is utilized to explore both appearance structure and temporal continuity in video synthesis. As such, not only the visual appearance in each frame but also the motion between consecutive frames is ensured to be realistic and consistent for video translation. Based on these technologies, our demonstration can be conducted on any videos in the wild and supports live video-to-animation translation, which engages users with the animated artistic expression of their life.

SESSION: Reproducibility

Using Mr. MAPP for Lower Limb Phantom Pain Management

Kanchan Bahirat
Yu-Yen Chung
Thiru Annaswamy
Gargi Raval
Kevin Desai
Balakrishnan Prabhakaran
Michael Riegler

Phantom pain is a chronic pain that is experienced as a vivid sensation stemming from the missing limb. From the traditional mirror box to virtual reality-based approaches, a wide spectrum of treatments using mimic feedback of the amputated limb have been developed for alleviating phantom limb pain. In our previous work, the Mixed reality-based framework for MAnaging Phantom Pain (Mr. MAPP) was presented and used to generate a virtual phantom upper limb, in real time, to manage the phantom pain. However, amputation of the lower limb is more common than that of the upper limb. Hence, in this paper, on top of demonstrating the reproducibility of the Mr. MAPP framework for the upper limb, we extend it to manage lower limb phantom pain as well. Unlike an upper limb amputee, a patient with an amputated lower limb is constrained to perform the training procedure in a sitting posture. Accordingly, virtual training games are designed for lower limb exercises in a sitting posture, such as knee flexion and extension, ankle dorsiflexion and tandem coordinated movement. Finally, the technical details of the system setup for playing the training games are introduced.

Reproducible Experiments on Adaptive Discriminative Region Discovery for Scene Recognition

Zhengyu Zhao
Zhuoran Liu
Martha Larson
Ahmet Iscen
Naoko Nitta

This companion paper supports the replication of scene image recognition experiments using Adaptive Discriminative Region Discovery (Adi-Red), an approach presented at ACM Multimedia 2018. We provide a set of artifacts that allow the replication of the experiments using a Python implementation. All the experiments are covered in a single shell script, which requires the installation of an environment, following our instructions, or using ReproZip. The data sets (images and labels) are automatically downloaded, and the train-test splits used in the experiments are created. The first experiment is from the original paper, and the second supports exploration of the resolution of the scale-specific input image, an interesting additional parameter. For both experiments, five other parameters can be adjusted: the threshold used to select the number of discriminative patches, the number of scales used, the type of patch selection (Adi-Red, dense or random), the architecture and pre-training data set of the pre-trained CNN feature extractor. The final output includes four tables (original Table 1, Table 2 and Table 4, and a table for the resolution experiment) and two plots (original Figure 3 and Figure 4).

On Reproducing Semi-dense Depth Map Reconstruction using Deep Convolutional Neural Networks with Perceptual Loss

Ilya Makarov
Dmitrii Maslov
Olga Gerasimova
Vladimir Aliev
Alisa Korinevskaya
Ujjwal Sharma
Haoliang Wang

In our recent papers, we proposed a new family of residual convolutional neural networks trained for semi-dense and sparse depth reconstruction without use of the RGB channel. The proposed models can be used in low-resolution depth sensors or SLAM methods estimating partial depth with certain distributions. We proposed using perceptual loss for training depth reconstruction in order to better preserve edge structure and reduce the over-smoothness of models trained on MSE loss alone. This paper contains a reproducibility companion guide on training, running and evaluating the suggested methods, while also presenting links to further studies in view of reviewers' comments and related problems of depth reconstruction.

Companion Paper for

Mengbai Xiao
Shuoqian Wang
Chao Zhou
Li Liu
Zhenhua Li
Yao Liu
Songqing Chen
Lucile Sassatelli
Gwendal Simon

This artifact includes source code, scripts and datasets required to reproduce the experimental figures in the evaluation of the MM'18 paper, which is entitled "MiniView Layout for Bandwidth-Efficient 360-Degree Video". The artifact reports the comparison results among the standard cube layout (CUBE), the equi-angular layout (EAC), and the MiniView layout (MVL) in terms of compressed video size, visual quality of views and decoding and rendering time.

SESSION: Best Paper Session (note: Honorable Mentions)

Multi-modal Knowledge-aware Hierarchical Attention Network for Explainable Medical Question Answering

Yingying Zhang
Shengsheng Qian
Quan Fang
Changsheng Xu

Online healthcare services can offer the public ubiquitous access to medical knowledge, especially with the emergence of medical question answering websites, where patients can get in touch with doctors without going to hospital. Explainability and accuracy are two main concerns for medical question answering. However, existing methods mainly focus on accuracy and cannot provide a good explanation for retrieved medical answers. This paper proposes a novel Multi-Modal Knowledge-aware Hierarchical Attention Network (MKHAN) to effectively exploit a multi-modal knowledge graph (MKG) for explainable medical question answering. MKHAN can generate path representations by composing the structural, linguistic, and visual information of entities, and infer the underlying rationale of question-answer interactions by leveraging the sequential dependencies within a path from the MKG. Furthermore, a novel hierarchical attention network is proposed to discriminate the salience of paths, endowing our model with explainability. We build a large-scale multi-modal medical knowledge graph and two real-world medical question answering datasets; the experimental results demonstrate the superior performance of our approach compared with the state-of-the-art methods.

Multimodal Dialog System: Generating Responses via Adaptive Decoders

Liqiang Nie
Wenjie Wang
Richang Hong
Meng Wang
Qi Tian

Building on textual dialog systems, multimodal ones have recently attracted increasing attention, especially in the retail domain. Despite the commercial value of multimodal dialog systems, they still face the following challenges: 1) automatically generating the right responses in appropriate medium forms; 2) jointly considering the visual cues and the side information while selecting product images; and 3) guiding the response generation with multi-faceted and heterogeneous knowledge. To address the aforementioned issues, we present a Multimodal diAloG system with adaptIve deCoders, MAGIC for short. In particular, MAGIC first judges the response type and the corresponding medium form by understanding the intention of the given multimodal context. Thereafter, it employs adaptive decoders to generate the desired responses: a simple recurrent neural network (RNN) is applied to generate general responses, a knowledge-aware RNN decoder is designed to encode the multiform domain knowledge to enrich the response, and the multimodal response decoder incorporates an image recommendation model which jointly considers the textual attributes and the visual images via a neural model optimized by the max-margin loss. We justify MAGIC through comparisons on a benchmark dataset. Experimental results demonstrate that MAGIC outperforms the existing methods and achieves state-of-the-art performance.

Audiovisual Zooming: What You See Is What You Hear

Arun Asokan Nair
Austin Reiter
Changxi Zheng
Shree Nayar

When capturing videos on a mobile platform, often the target of interest is contaminated by the surrounding environment. To alleviate the visual irrelevance, camera panning and zooming provide the means to isolate a desired field of view (FOV). However, the captured audio is still contaminated by signals outside the FOV. This effect is unnatural: for human perception, visual and auditory cues must go hand-in-hand. We present the concept of Audiovisual Zooming, whereby an auditory FOV is formed to match the visual. Our framework is built around the classic idea of beamforming, a computational approach to enhancing sound from a single direction using a microphone array. Yet, beamforming on its own cannot incorporate the auditory FOV, as the FOV may include an arbitrary number of directional sources. We formulate our audiovisual zooming as a generalized eigenvalue problem and propose an algorithm for efficient computation on mobile platforms. To inform the algorithmic and physical implementation, we offer a theoretical analysis of our algorithmic components as well as numerical studies for understanding various design choices of microphone arrays. Finally, we demonstrate audiovisual zooming on two different mobile platforms: a mobile smartphone and a 360° spherical imaging system for video conference settings.

Human-imperceptible Privacy Protection Against Machines

Zhiqi Shen
Shaojing Fan
Yongkang Wong
Tian-Tsong Ng
Mohan Kankanhalli

Privacy concerns with social media have recently been under the spotlight, due to a few incidents of user data leakage on social networking platforms. With the current advances in machine learning and big data, computer algorithms often act as a first-step filter for privacy breaches, by automatically selecting content with sensitive information, such as photos that contain faces or vehicle license plates. In this paper we propose a novel algorithm to protect the sensitive attributes against machines, while keeping the changes imperceptible to humans. In particular, we first conducted a series of human studies to investigate multiple factors that influence human sensitivity to the visual changes. We discover that human sensitivity is influenced by multiple factors, from low-level features such as illumination and texture to high-level attributes like object sentiment and semantics. Based on our human data, we propose for the first time the concept of a human sensitivity map. With the sensitivity map, we design a human-sensitivity-aware image perturbation model, which is able to modify the computational classification results of sensitive attributes while preserving the remaining attributes. Experiments on real world data demonstrate the superior performance of the proposed model on human-imperceptible privacy protection.

Flexible Online Multi-modal Hashing for Large-scale Multimedia Retrieval

Xu Lu
Lei Zhu
Zhiyong Cheng
Jingjing Li
Xiushan Nie
Huaxiang Zhang

Multi-modal hashing fuses multi-modal features at both the offline training and online query stages for compact binary hash learning. It has attracted extensive attention in the research field of efficient large-scale multimedia retrieval. However, existing methods adopt a batch-based learning scheme or an unsupervised learning paradigm. They cannot efficiently handle the very common online streaming multi-modal data (for batch-learning methods), or they learn hash codes that suffer from limited discriminative capability and less flexibility for varied streaming data (for existing online multi-modal hashing methods). In this paper, we develop a supervised Flexible Online Multi-modal Hashing (FOMH) method to adaptively fuse heterogeneous modalities and flexibly learn the discriminative hash code for newly arriving data, even if part of the modalities is missing. Specifically, instead of adopting fixed weights, the modality weights in FOMH are automatically learned with the proposed flexible multi-modal binary projection to capture the variations of streaming samples in a timely manner. Further, we design an efficient asymmetric online supervised hashing strategy to enhance the discriminative capability of the hash codes, while avoiding the challenging symmetric semantic matrix decomposition and storage cost. Moreover, to support fast hash updating and avoid the propagation of binary quantization errors in the online learning process, we propose to directly update the hash codes with an efficient discrete online optimization. Experiments on several public multimedia retrieval datasets validate the superiority of the proposed method from various aspects.
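To show why compact binary hash codes make large-scale retrieval cheap, the sketch below ranks stored items by Hamming distance to a query code. It covers only the lookup step common to hashing-based retrieval; it does not implement FOMH's online learning, and the function names, 8-bit codes, and item keys are hypothetical.

```python
def hamming_distance(a, b):
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query, database):
    """Return database keys sorted by Hamming distance to the query code.

    Ties are broken by key so the ordering is deterministic.
    """
    return sorted(
        database,
        key=lambda k: (hamming_distance(query, database[k]), k),
    )

# Hypothetical 8-bit codes for three stored items.
codes = {"img_a": 0b10110010, "img_b": 0b10110011, "img_c": 0b01001101}
ranking = rank_by_hamming(0b10110010, codes)
```

Because each comparison is an XOR followed by a bit count, distance computation stays inexpensive even for very large collections.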

SESSION: Multimedia Art Exhibition

Latent History

Refik Anadol

Latent History is a time and space exploration into Stockholm's past and ultimately present, through the deployment of machine learning algorithms trained on datasets from both archival and contemporary photographs. Through the exploration of photographic memories from the past 150 years, this exhibition aims to investigate and re-imagine collective memory, hidden layers of history, and the consciousness of a city that otherwise might remain unseen.

Data Stones

Peter AC Nelson

Data Stones explores the overlap of Chinese and European philosophies offered by data visualisation. A database of every message sent between two people, accumulated through regular internet usage, is transformed into a computer-generated rock. These stones are produced procedurally from the mundane dialogue accumulated by the everyday use of instant messaging. I download thousands of messages sent between two people and sort them according to length, date and content (using Latent Dirichlet Allocation). These various means of processing extract patterns and sentiments in what never had any intrinsic order. The stone is treated like a graph, where the statistical patterns of my sentiments determine its shape, and becomes an object for contemplation and speculation. Data Stones relies on the logic that if a system encoded a stone, then it can always theoretically be decoded. In contemplating these stones, we hope to crystallise our thoughts, and find ourselves, staring back.

Unresolved Sun / Soleil Irrésolu: An Art-Science Installation on the Origin of Time

Jean-Marc Chomaz
Laurent Karst
Gregory Louis

This work questions our relationship to time and cosmic phenomena that occur on scales in space and time far beyond human perception. In UNRESOLVED SUN / SOLEIL IRRÉSOLU, these phenomena are made tangible by a thin disc of green fluorescent liquid inhabited by mobile and ephemeral vibrations, in an unusual visual and sound landscape. Thanks to a complex optical device and precise adjustment, the light that passes through the layer is projected onto the adjacent wall in the form of a large orange disc deformed by storms. The space deployed in the imagination results more from music than from choreography. The small green fluorescent disc and the large projection then behave like a sound material composed of textures and rhythms. The sonic material is derived from the real and historical astrophysical observation of the Crab Pulsar, a remnant neutron star of the gigantic explosion of a supernova never reported, the first object in space observed with a radio telescope to pulse at audible frequency.

Toasters: Collective inter-connected behavioral objects and passive interaction

Olivain Porry

Toasters is a work about the collective use of behavioral objects[1] in interactive art installations. Its purpose is to set up a variable-geometry art installation made of toasters[2] and to explore how such objects can behave collectively. What kind of collective behavior does it reveal? How does the spectator react? And according to which modalities of aesthetic and practical experience?

The One: An Interactive Installation for Visualizing the Cognition of Mind State by Capturing Face Expression, Body Shape, Wearing Cloth and Talking Voice

Lyn Chao-ling Chen

In this artwork, the topic of the One is discussed. Broadly speaking, it implies all beings in the universe, and it also narrows down to individuals. The human being is considered, and can be described in a physical layer and a mental layer. People's appearance acts as protective coloration in society, and creates both their self-identity and the impressions they make on others. The artwork tries to make people aware of the influence of their minds' cognition by visualizing that cognition. The physical properties of a human being, comprising facial expression, body shape, clothing and speaking voice, are transformed into the five elements of Chinese Taoist philosophy: wood, fire, earth, metal and water. After the transformation, the interactive installation is performed as an improvisational colorful Chinese brush painting with mimicking, echoing sounds. The multimedia input comprises gray image analysis to form the black blocks of the painting, color image subtraction to compose the color blocks of the painting, and a buffer to capture and replay sounds continuously. In the indoor exhibition, the camera was set to capture audience members coming through the door. In the interactive installation, the dynamic improvisational painting with echoing sounds serves as evidence that reminds audience members of the way they exist, and prompts them to rethink their cognition of themselves through the many clones of what they used to know as "me". In a collaborative interaction, the clones from the crowd also make audience members aware of the impressions they make on others. An improvisational painting in the form of a colorful Chinese brush painting was exhibited, which reveals countenance, contour, apparel and utterance in the physical aspect, and the awareness of the cognition of the self and of others in the mental aspect.

I, You, We: Exploring Interactive Multimedia Performance

H. Cecilia Suhr

This paper explores the conceptual framework behind the "I, You, We" interactive multimedia performance. In doing so, it unpacks how interactivity in this performance occurs between the camera installation and audience members; between performer and audience; between sound and vision; and finally between electronic music and acoustic music.

MovIPrint: Move, Explore and Fabricate

Yen-Ting Cho
Yen-Ling Kuo
Yen-Ting Yeh
Yi-Chin Lee

MovIPrint is a user-friendly, interactive installation that uses software and a depth-sensing camera to capture human body movement. After inputting digital data such as images or video into the software, MovIPrint offers people innovative and user-friendly ways to explore that data by manipulating it with their body movement. We use media content and/or wireframe design to enable people to then fabricate their own moving images and 3D digital models.

Macrogroove: A Sound 3D-sculpture Interactive Player

Paul Chable
Gilles Azzaro
Jean Mélou
Yvain Quéau
Axel Carlier
Jean-Denis Durou

Macrogroove is an interactive and playful multimedia system that allows the user to play a sound coded in the form of a 3D-sonagram. The user is invited to manually move a laser sheet over this "pseudo-relief", in order to play back the original sound in real time.

SESSION: Keynote III

EU Data Protection Law: An Ally for Scientific Reproducibility?

Mireille Hildebrandt

This keynote will introduce some of the key concepts of European data protection law, and clarify how and why this is not equivalent to privacy law. Next, I will explain why and how EU data protection law could enhance the methodological integrity of machine learning applications, also in the domain of multimedia.

The question is, first, how the General Data Protection Regulation (GDPR) applies to inferences captured from multimedia data. This raises a number of questions. Does it matter whether such data has been made public by the person it relates to? Does processing personal data always require consent? What counts as valid consent? What if the inferences are mere statistics? What does the prohibition of processing 'sensitive data' (ethnicity, health) mean for multimedia analytics? This keynote will provide a crash course in the underlying 'logic' of the GDPR [3], with a focus on what is relevant for inferences based on multimedia content and metadata. I will uncover the purpose limitation principle as the guiding rationale of EU data protection law, protecting individuals against incorrect, unfair or unwarranted targeting.

In the second part of the keynote I will explain how the purpose limitation principle relates to machine learning research design, requiring keen attention to specific aspects of methodological integrity [2]. These may concern p-hacking, data dredging, or cherry picking performance metrics, and connect with the reproducibility crisis in machine learning that is on the verge of destroying the reliability of ML applications [1].

SESSION: Session 3A: Multimodal QA&Content Generation

Hierarchical Graph Semantic Pooling Network for Multi-modal Community Question Answer Matching

Jun Hu
Shengsheng Qian
Quan Fang
Changsheng Xu

Nowadays, community question answering (CQA) systems have attracted millions of users to share their valuable knowledge. Matching relevant answers for a specific question is a core function of CQA systems. Previous interaction-based matching approaches show promising performance in CQA systems. However, they typically suffer from two limitations: (1) They usually model content as word sequences, which ignores the semantics provided by non-consecutive phrases, long-distance word dependency and visual information. (2) Word-level interactions focus on the distribution of similar words in terms of position, while being agnostic to the semantic-level interactions between questions and answers. To address these limitations, we propose a Hierarchical Graph Semantic Pooling Network (HGSPN) to model the hierarchical semantic-level interactions in a unified framework for multi-modal CQA matching. Instead of viewing text content as word sequences, we convert it into graphs, which can model non-consecutive phrases and long-distance word dependency to better capture the composition of semantics. In addition, visual content is also modeled into the graphs to provide complementary semantics. A well-designed stacked graph pooling network is proposed to capture the hierarchical semantic-level interactions between questions and answers based on these graphs. A novel convolutional matching network is designed to infer the matching score by integrating the hierarchical semantic-level interaction features. Experimental results on two real-world datasets demonstrate that our model outperforms the state-of-the-art CQA matching models.

Learnable Aggregating Net with Diversity Learning for Video Question Answering

Xiangpeng Li
Lianli Gao
Xuanhan Wang
Wu Liu
Xing Xu
Heng Tao Shen
Jingkuan Song

Video visual question answering (V-VQA) remains challenging at the intersection of vision and language, as it requires joint comprehension of a video and a natural language question. The image-question co-attention mechanism, which aims at generating a spatial map highlighting image regions relevant to answering the question and vice versa, has obtained impressive results. Despite this success, simply applying co-attention to video visual question answering results in unsatisfactory performance due to the complexity and temporal nature of videos. In this paper, we propose a novel architecture, namely Learnable Aggregating Net with Diversity learning (LAD-Net), for V-VQA. In the proposed method, we address two central problems: 1) how to deploy co-attention to the V-VQA task considering the complex and diverse content of videos; and 2) how to aggregate the frame-level features without destroying the feature distributions and temporal information. To solve these problems, our LAD-Net first extends the single-path co-attention mechanism to a multi-path pyramid co-attention structure with a novel diversity learning to explicitly encourage attention diversity. For the video-level (or question-level) descriptor, instead of taking a simple temporal pooling (i.e., average pooling), we propose a new learnable aggregation method with a set of evidence gates. It automatically aggregates adaptively-weighted frame-level features (or word-level features) to extract rich video (or question) context semantic information by imitating Bags-of-Words (BoW) quantization. With evidence gates, it then further chooses the most related signals representing the evidence information to predict the answer. Extensive validations on the two challenging video visual question answering datasets TGIF-QA and TVQA show that LAD-Net achieves state-of-the-art performance under various settings and metrics. Our proposed strategies are of particular importance for improving the performance of the baseline co-attention V-VQA.

Erasing-based Attention Learning for Visual Question Answering

Fei Liu
Jing Liu
Richang Hong
Hanqing Lu

Attention learning for visual question answering remains a challenging task, where most existing methods treat the attention and the non-attention parts in isolation. In this paper, we propose to enforce the correlation between the attention and the non-attention parts as a constraint for attention learning. We first adopt an attention-guided erasing scheme to obtain the attention and the non-attention parts respectively, and then learn to separate the attention and the non-attention parts by an appropriate distance margin in a feature embedding space. Furthermore, we associate a typical classification loss with the above distance constraint to learn a more discriminative attention map for answer prediction. The proposed approach does not introduce extra model parameters or inference complexity, and can be combined with any attention-based models. Extensive ablation experiments validate the effectiveness of our method, and new state-of-the-art or competitive results on four publicly available datasets are achieved.
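The distance-margin constraint mentioned above can be pictured as a hinge-style loss. The sketch below is an illustration only: the abstract does not give the exact formulation, so the Euclidean distance, the default margin, and the name `margin_separation_loss` are assumptions.

```python
import math


def margin_separation_loss(att_feat, non_att_feat, margin=1.0):
    """Hinge-style penalty that is zero once the two embeddings are at
    least `margin` apart (Euclidean distance is an assumption here)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(att_feat, non_att_feat)))
    return max(0.0, margin - dist)


# Hypothetical embeddings of the attended and the erased (non-attended) parts.
far_apart = margin_separation_loss([0.0, 0.0], [3.0, 4.0])   # distance 5, no penalty
too_close = margin_separation_loss([0.0, 0.0], [0.0, 0.25])  # distance 0.25, penalized
```

A term of this kind adds no parameters to the model, which is consistent with the claim that the approach does not increase inference complexity; in training it would be summed with the classification loss.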

Question-Aware Tube-Switch Network for Video Question Answering

Tianhao Yang
Zheng-Jun Zha
Hongtao Xie
Meng Wang
Hanwang Zhang

Video Question & Answering (VideoQA), a task to answer questions in videos, involves rich spatio-temporal content (e.g., appearance and motion) and requires a multi-hop reasoning process. However, existing methods usually deal with appearance and motion separately and fail to synchronize the attentions on appearance and motion features, neglecting two key properties of video QA: (1) appearance and motion features are usually concomitant and complementary to each other at the time slice level, and some questions rely on joint representations of both kinds of features at some point in the video; (2) appearance and motion have different importance in multi-step reasoning. In this paper, we propose a novel Question-Aware Tube-Switch Network (TSN) for video question answering which contains (1) a Mix module to synchronously combine the appearance and motion representations at the time slice level, achieving fine-grained temporal alignment and correspondence between appearance and motion at every time slice, and (2) a Switch module to adaptively choose the appearance or motion tube as primary at each reasoning step, guiding the multi-hop reasoning process. To train TSN end-to-end, we utilize the Gumbel-Softmax strategy to account for the discrete tube-switch process. Extensive experimental results on two benchmarks, MSVD-QA and MSRVTT-QA, demonstrate that the proposed TSN consistently outperforms the state of the art on all metrics.
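For readers unfamiliar with Gumbel-Softmax, the following minimal sketch shows the generic relaxation for a two-way choice such as "appearance tube" versus "motion tube". It illustrates the standard technique only; how TSN wires it into the Switch module is not described in the abstract, and the function name and the clamping constant are choices made for this example.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw a relaxed (soft) one-hot sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to each logit and applies a temperature-scaled
    softmax; a lower temperature gives samples closer to one-hot.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Soft choice between two hypothetical tubes: index 0 = appearance, 1 = motion.
switch = gumbel_softmax([0.3, 1.2], temperature=0.5)
```

Because the output is a differentiable function of the logits, a discrete switch of this kind can be trained by backpropagation, which is the reason the relaxation is commonly used for end-to-end training.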

Multi-interaction Network with Object Relation for Video Question Answering

Weike Jin
Zhou Zhao
Mao Gu
Jun Yu
Jun Xiao
Yueting Zhuang

Video question answering is an important task for testing a machine's video understanding ability. Existing methods normally focus on the combination of recurrent and convolutional neural networks to capture the spatial and temporal information of the video. Recently, some work has also shown that using an attention mechanism can achieve better performance. In this paper, we propose a new model called Multi-interaction network for video question answering. There are two types of interactions in our model. The first type is the multi-modal interaction between the visual and textual information. The second type is the multi-level interaction inside the multi-modal interaction. Specifically, instead of using the original self-attention, we propose a new attention mechanism called multi-interaction, which can capture both element-wise and segment-wise sequence interactions simultaneously. In addition to the normal frame-level interaction, we also take object relations into consideration, in order to obtain more fine-grained information, such as motions and other potential relations among these objects. We evaluate our method on TGIF-QA and two other video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves new state-of-the-art performance.

CRA-Net: Composed Relation Attention Network for Visual Question Answering

Liang Peng
Yang Yang
Zheng Wang
Xiao Wu
Zi Huang

The task of Visual Question Answering (VQA) is to answer a natural language question tied to the content of a visual image. Most existing VQA models apply an attention mechanism to locate the relevant object regions and/or utilize off-the-shelf relation reasoning methods to detect object relations. However, they 1) mostly encode simple relations, which cannot provide sufficiently sophisticated knowledge for answering complicated visual questions; 2) seldom leverage the harmonious cooperation of the object appearance feature and the relation feature. To address these problems, we propose a novel end-to-end VQA model, termed Composed Relation Attention Network (CRA-Net). Specifically, we devise two question-adaptive relation attention modules that can extract not only the fine-grained and precise binary relations but also the more sophisticated trinary relations. Both kinds of question-related relations can reveal deeper semantics, thereby enhancing the reasoning ability in question answering. Furthermore, our CRA-Net also combines the object appearance feature with the relation feature under the guidance of the corresponding question, which can reconcile the two types of features effectively. Extensive experiments on two large benchmark datasets, VQA-1.0 and VQA-2.0, demonstrate that our proposed model outperforms state-of-the-art approaches.

Walking with MIND: Mental Imagery eNhanceD Embodied QA

Juncheng Li
Siliang Tang
Fei Wu
Yueting Zhuang

EmbodiedQA is the task of training an embodied agent to navigate intelligently in a simulated environment and gather visual information to answer questions. Existing approaches fail to explicitly model the mental imagery function of the agent, although mental imagery is crucial to embodied cognition and is closely related to many high-level meta-skills such as generalization and interpretation. In this paper, we propose a novel Mental Imagery eNhanceD (MIND) module for the embodied agent, as well as a relevant deep reinforcement framework for training. The MIND module can not only model the dynamics of the environment (e.g. 'what might happen if the agent passes through a door') but also help the agent to create a better understanding of the environment (e.g. 'The refrigerator is usually in the kitchen'). Such knowledge makes the agent a faster and better learner in locating a feasible policy with only a few trials. Furthermore, the MIND module can generate mental images that are treated as short-term subgoals by our proposed deep reinforcement framework. These mental images facilitate policy learning since short-term subgoals are easy to achieve and reusable. This yields better planning efficiency than other algorithms that learn a policy directly from primitive actions. Finally, the mental images visualize the agent's intentions in a way that humans can understand, and this endows our agent's actions with more interpretability. The experimental results and further analysis show that the agent with the MIND module is superior to its counterparts not only in EQA performance but also in many other aspects such as route planning, behavioral interpretation, and the ability to generalize from a few examples.

Finding Images by Dialoguing with Image

Lejian Ren
Si Liu
Han Huang
Jizhong Han
Shuicheng Yan
Bo Li

Image retrieval in complicated scenes is a challenging task that requires a comprehensive understanding of an image. In this paper, we propose a scene graph based image retrieval framework that combines scene graph generation with image retrieval and fine-tunes the search results via a dialogue mechanism. Specifically, we propose an image retrieval oriented scene graph generation model that takes an image and a text describing the image as inputs. The additional text input is used to control the generated scene graph. It provides information for a newly introduced attributes head to better predict the attributes and helps construct an adjacency matrix at the same time. A Graph Convolutional Network is further used to gather information among nodes for precise relation estimation. Moreover, the scene graph can be modified by changing the text. Our proposed approach achieves state-of-the-art performance in both scene graph based image retrieval and scene graph generation on the Visual Genome dataset.

Exploiting Temporal Relationships in Video Moment Localization with Natural Language

Songyang Zhang
Jinsong Su
Jiebo Luo

We address the problem of video moment localization with natural language, i.e. localizing a video segment described by a natural language sentence. While most prior work focuses on grounding the query as a whole, temporal dependencies and reasoning between events within the text are not fully considered. In this paper, we propose a novel Temporal Compositional Modular Network (TCMN) where a tree attention network first automatically decomposes a sentence into three descriptions with respect to the main event, context event and temporal signal. Two modules are then utilized to measure the visual similarity and location similarity between each segment and the decomposed descriptions. Moreover, since the main event and context event may rely on different modalities (RGB or optical flow), we use late fusion to form an ensemble of four models, where each model is independently trained by one combination of the visual input. Experiments show that our model outperforms the state-of-the-art methods on the TEMPO dataset.
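The four-model ensemble follows from two events (main and context) each using one of two visual inputs (RGB or optical flow). A minimal late-fusion sketch is given below. The abstract does not state the fusion rule, so plain score averaging, the function name `late_fusion`, and the toy scores are assumptions.

```python
from itertools import product

MODALITIES = ("rgb", "flow")

# One model per (main-event modality, context-event modality) combination.
MODEL_KEYS = list(product(MODALITIES, repeat=2))  # four combinations


def late_fusion(scores_by_model):
    """Average per-segment scores across models and return the best segment.

    scores_by_model: dict model key -> list of scores, one per candidate segment.
    """
    n_segments = len(next(iter(scores_by_model.values())))
    n_models = len(scores_by_model)
    fused = [
        sum(scores[i] for scores in scores_by_model.values()) / n_models
        for i in range(n_segments)
    ]
    best = max(range(n_segments), key=fused.__getitem__)
    return best, fused


# Hypothetical scores of three candidate segments under each of the four models.
toy_scores = {
    ("rgb", "rgb"): [0.2, 0.8, 0.1],
    ("rgb", "flow"): [0.4, 0.6, 0.2],
    ("flow", "rgb"): [0.6, 0.4, 0.3],
    ("flow", "flow"): [0.8, 0.6, 0.0],
}
best_segment, fused_scores = late_fusion(toy_scores)
```

In the toy scores no single model is decisive, yet the averaged ranking is stable, which is the usual motivation for fusing independently trained models at the score level.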

Cross-Modal Dual Learning for Sentence-to-Video Generation

Yue Liu
Xin Wang
Yitian Yuan
Wenwu Zhu

Automatic content generation has become an attractive yet challenging topic in the past decade. Generating videos from sentences poses particularly great challenges to the multimedia community due to its inherently multi-modal characteristics, e.g., difficulties in semantic alignment and the temporal dependencies in video content. Existing works resort to a Variational AutoEncoder (VAE) or a Generative Adversarial Network (GAN) for generating videos given sentences, which may suffer from either blurry generated videos or unstable training processes as well as difficulties in converging to optimal solutions. In this paper, we propose a cross-modal dual learning (CMDL) algorithm to tackle the challenges in sentence-to-video generation and address the weaknesses in existing works. The proposed CMDL model adopts a dual learning mechanism to simultaneously learn the bidirectional mappings between sentences and videos such that it is able to generate realistic videos which maintain semantic consistency with their corresponding textual descriptions. By further capturing both global and contextual structures, CMDL employs a multi-scale sentence-to-visual encoder to produce more sequentially consistent and plausible videos. Extensive experiments on various datasets validate the advantages of our proposed CMDL model against several state-of-the-art benchmarks both visually and quantitatively.

Preserving Semantic and Temporal Consistency for Unpaired Video-to-Video Translation

Kwanyong Park
Sanghyun Woo
Dahun Kim
Donghyeon Cho
In So Kweon

In this paper, we investigate the problem of unpaired video-to-video translation. Given a video in the source domain, we aim to learn the conditional distribution of the corresponding video in the target domain, without seeing any pairs of corresponding videos. While significant progress has been made in the unpaired translation of images, directly applying these methods to an input video leads to low visual quality due to the additional time dimension. In particular, previous methods suffer from semantic inconsistency (i.e., semantic label flipping) and temporal flickering artifacts. To alleviate these issues, we propose a new framework that is composed of carefully-designed generators and discriminators, coupled with two core objective functions: 1) content preserving loss and 2) temporal consistency loss. Extensive qualitative and quantitative evaluations demonstrate the superior performance of the proposed method against previous approaches. We further apply our framework to a domain adaptation task and achieve favorable results.

Referring Expression Comprehension with Semantic Visual Relationship and Word Mapping

Chao Zhang
Weiming Li
Wanli Ouyang
Qiang Wang
Woo-Shik Kim
Sunghoon Hong

Referring expression comprehension, which locates the object instance described by a natural language expression, has gained increasing interest in recent years. This paper aims at improving the task from two aspects: visual feature extraction and language feature extraction. For visual feature extraction, we observe that most previous methods utilize only relative spatial information to model the visual relationship between object pairs while discarding the rich semantic relationships between objects. This makes visual-language matching difficult when the language expression uses a semantic relationship to discriminate the referred object from other objects in the image. In this work, we propose a Semantic Visual Relationship Module (SVRM) to exploit this important information. For language feature extraction, a major problem comes from the long-tail distribution of words in the expressions. Since more than half of the words appear fewer than 20 times in the public datasets, deep models such as LSTM tend to fail to learn accurate representations for these words. To solve this problem, we propose a word2vec based word mapping method that maps these low-frequency words to high-frequency words with similar meaning. Experiments show that the proposed method outperforms existing state-of-the-art methods on three referring expression comprehension datasets.
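The word mapping step can be sketched as a nearest-neighbour lookup in an embedding space. The sketch below is an illustration under assumptions: cosine similarity as the matching rule, a tiny hand-written embedding table standing in for real word2vec vectors, and the name `build_word_map`. Only the threshold of 20 occurrences comes from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_word_map(counts, vectors, min_count=20):
    """Map each rare word to its most similar frequent word.

    counts:  dict word -> corpus frequency
    vectors: dict word -> embedding (toy stand-in for word2vec vectors)
    Frequent words, and rare words without an embedding, map to themselves.
    """
    frequent = [w for w, c in counts.items() if c >= min_count and w in vectors]
    mapping = {}
    for word, count in counts.items():
        if count >= min_count or word not in vectors or not frequent:
            mapping[word] = word
        else:
            mapping[word] = max(
                frequent, key=lambda f: cosine(vectors[word], vectors[f])
            )
    return mapping


# Hypothetical counts and 2-d embeddings.
counts = {"dog": 500, "car": 300, "puppy": 3, "sedan": 5}
vectors = {
    "dog": [1.0, 0.1],
    "puppy": [0.9, 0.2],
    "car": [0.1, 1.0],
    "sedan": [0.2, 0.9],
}
word_map = build_word_map(counts, vectors)
```

After such a mapping, the language model only needs to learn representations for the frequent vocabulary, which is the stated motivation for handling the long tail this way.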

SDIT: Scalable and Diverse Cross-domain Image Translation

Yaxing Wang
Abel Gonzalez-Garcia
Joost van de Weijer
Luis Herranz

Recently, image-to-image translation research has witnessed remarkable progress. Although current approaches successfully generate diverse outputs or perform scalable image transfer, these properties have not been combined into a single method. To address this limitation, we propose SDIT: Scalable and Diverse image-to-image translation. These properties are combined into a single generator. The diversity is determined by a latent variable which is randomly sampled from a normal distribution. The scalability is obtained by conditioning the network on the domain attributes. Additionally, we also exploit an attention mechanism that permits the generator to focus on the domain-specific attribute. We empirically demonstrate the performance of the proposed method on face mapping and other datasets beyond faces.

A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning

Pengfei Wang
Chengquan Zhang
Fei Qi
Zuming Huang
Mengyi En
Junyu Han
Jingtuo Liu
Errui Ding
Guangming Shi

Detecting scene text of arbitrary shapes has been a challenging task over the past years. In this paper, we propose a novel segmentation-based text detector, namely SAST, which employs a context attended multi-task learning framework based on a Fully Convolutional Network (FCN) to learn various geometric properties for the reconstruction of polygonal representation of text regions. Taking sequential characteristics of text into consideration, a Context Attention Block is introduced to capture long-range dependencies of pixel information to obtain a more reliable segmentation. In post-processing, a Point-to-Quad assignment method is proposed to cluster pixels into text instances by integrating both high-level object knowledge and low-level pixel information in a single shot. Moreover, the polygonal representation of arbitrarily-shaped text can be extracted with the proposed geometric properties much more effectively. Experiments on several benchmarks, including ICDAR2015, ICDAR2017-MLT, SCUT-CTW1500, and Total-Text, demonstrate that SAST achieves better or comparable performance in terms of accuracy. Furthermore, the proposed algorithm runs at 27.63 FPS on SCUT-CTW1500 with an Hmean of 81.0% on a single NVIDIA Titan Xp graphics card, surpassing most of the existing segmentation-based methods.
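The Hmean reported above is the harmonic mean of precision and recall, as commonly used in scene text detection benchmarks. A small sketch of the computation follows; the precision and recall values in the example are hypothetical and are not taken from the paper.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall (also known as F-measure).

    Returns 0.0 when both inputs are zero, to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical detector with high precision but lower recall.
score = hmean(1.0, 0.5)
```

The harmonic mean penalizes an imbalance between the two quantities, so a detector cannot reach a high Hmean by maximizing precision or recall alone.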

SESSION: Session 3B: Attention&Saliency

Aberrance-aware Gradient-sensitive Attentions for Scene Recognition with RGB-D Videos

Xinhang Song
Sixian Zhang
Yuyun Hua
Shuqiang Jiang

With the development of deep learning, previous approaches have succeeded in scene recognition with massive RGB data obtained from ideal environments. However, scene recognition in the real world may face various types of aberrant conditions caused by unavoidable factors, such as the lighting variance of the environments and the limitations of cameras, which may damage the performance of previous models. Going beyond ideal conditions, our motivation is to investigate robust scene recognition models for unconstrained environments. In this paper, we propose an aberrance-aware framework for RGB-D scene recognition, where several types of attention, such as temporal, spatial and modal attention, are integrated into spatio-temporal RGB-D CNN models to avoid the interference of RGB frame blurring, missing depth, and light variance. All the attentions are homogeneously obtained by projecting the gradient-sensitive maps of visual data into corresponding spaces. In particular, the gradient maps are captured by convolutional operations with the typically designed kernels, which can be seamlessly integrated into end-to-end CNN training. The experiments under different challenging conditions demonstrate the effectiveness of the proposed method.

An Attentional-LSTM for Improved Classification of Brain Activities Evoked by Images

Sheng-hua Zhong
Ahmed Fares
Jianmin Jiang

Multimedia stimulation of brain activities is not only becoming an emerging area of intensive research, but has also achieved significant progress towards the classification of brain activities and the interpretation of how the brain understands multimedia content. To exploit the characteristics of EEG signals in capturing human brain activities, we propose a region-dependent and attention-driven bi-directional LSTM network (RA-BiLSTM) for image evoked brain activity classification. Inspired by the hemispheric lateralization of human brains, the proposed RA-BiLSTM extracts additional information at the regional level to strengthen and emphasize the differences between the two hemispheres. In addition, we propose a new attentional-LSTM by adding an extra attention gate to: (i) measure and seize the importance of channel-based spatial information, and (ii) support the proposed RA-BiLSTM to capture the dynamic correlations hidden from both the past and the future in the current state across EEG sequences. Extensive experiments are carried out and the results demonstrate that our proposed RA-BiLSTM not only achieves effective classification of brain activities on evoked image categories, but also significantly outperforms the existing state of the art.

Multi-Level Fusion based Class-aware Attention Model for Weakly Labeled Audio Tagging

Yifang Yin
Meng-Jiun Chiou
Zhenguang Liu
Harsh Shrivastava
Rajiv Ratn Shah
Roger Zimmermann

Recognizing ongoing events based on acoustic clues has been a critical research problem for a variety of AI applications. Compared to visual inputs, acoustic cues tend to be less descriptive and less consistent in the time domain. The duration of a sound event can be quite short, which creates great difficulties for audio tagging, especially with weak labels. To address these challenges, we present a novel end-to-end multi-level attention model that first makes segment-level predictions with temporal modeling, followed by advanced aggregations along both time and feature domains. Our model adopts class-aware attention based temporal fusion to highlight the relevant segments and suppress the irrelevant segments for each class. Moreover, to improve the representation ability of acoustic inputs, a new multi-level feature fusion method is proposed to obtain more accurate segment-level predictions, as well as to perform more effective multi-layer aggregation of clip-level predictions. We additionally introduce a weight sharing strategy to reduce model complexity and overfitting. Comprehensive experiments have been conducted on the AudioSet and the DCASE17 datasets. Experimental results show that our proposed method works remarkably well and obtains state-of-the-art audio tagging results on both datasets. Furthermore, we show that our proposed multi-level fusion based model can be easily integrated with existing systems, where additional performance gain can be obtained.
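Class-aware temporal attention can be sketched as a per-class softmax over segments, followed by a weighted sum of the segment-level predictions. This is a generic illustration rather than the paper's exact layer: the function name, the input layout, and the use of a plain softmax are assumptions.

```python
import math


def class_aware_attention_pool(seg_probs, att_logits):
    """Aggregate segment-level predictions into clip-level predictions.

    seg_probs[t][c]:  predicted probability of class c in segment t
    att_logits[t][c]: unnormalized attention score of segment t for class c
    Each class gets its own softmax over time, so a short event can dominate
    the clip-level score of its class without affecting other classes.
    """
    n_segments = len(seg_probs)
    n_classes = len(seg_probs[0])
    clip_probs = []
    for c in range(n_classes):
        column = [att_logits[t][c] for t in range(n_segments)]
        peak = max(column)
        exps = [math.exp(a - peak) for a in column]
        total = sum(exps)
        clip_probs.append(
            sum(e / total * seg_probs[t][c] for t, e in enumerate(exps))
        )
    return clip_probs


# Hypothetical clip with two segments and two classes.
seg_probs = [[0.2, 0.9], [0.4, 0.1]]
uniform = class_aware_attention_pool(seg_probs, [[0.0, 0.0], [0.0, 0.0]])
peaked = class_aware_attention_pool(seg_probs, [[100.0, 0.0], [0.0, 100.0]])
```

With uniform attention the layer reduces to average pooling over time; with peaked attention each class reads its score from the segment it attends to.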

Fine-grained Cross-media Representation Learning with Deep Quantization Attention Network

Meiyu Liang
Junping Du
Wu Liu
Zhe Xue
Yue Geng
Congxian Yang

Cross-media search is useful for getting more comprehensive and richer information about social network hot topics or events. To solve the problems of feature heterogeneity and the semantic gap between different media data, existing deep cross-media quantization technology provides an efficient and effective solution for cross-media common semantic representation learning. However, because social network data is often semantically sparse, diverse, and noisy, the performance of existing cross-media search methods often degrades. To address this issue, this paper proposes a novel fine-grained cross-media representation learning model with a deep quantization attention network for social network cross-media search (CMSL). First, we construct the image-word semantic correlation graph, and perform deep random walks on the graph to realize semantic expansion and semantic embedding learning, which can discover some potential semantic correlations between images and words. Second, in order to discover more fine-grained cross-media semantic correlations, a multi-scale fine-grained cross-media semantic correlation learning method that combines global and local saliency semantic similarity is proposed. Third, the fine-grained cross-media representation, cross-media semantic correlations and binary quantization code are jointly learned by a unified deep quantization attention network, which can preserve both inter-media correlations and intra-media similarities, by minimizing both cross-media correlation loss and binary quantization loss. Experimental results demonstrate that CMSL can generate high-quality cross-media common semantic representation, which yields state-of-the-art cross-media search performance on two benchmark datasets, NUS-WIDE and MIR-Flickr 25k.

Understanding the Teaching Styles by an Attention based Multi-task Cross-media Dimensional Modeling

Suping Zhou
Jia Jia
Yufeng Yin
Xiang Li
Yang Yao
Ying Zhang
Zeyang Ye
Kehua Lei
Yan Huang
Jialie Shen

Teaching style plays an influential role in helping students to achieve academic success. In this paper, we explore a new problem of effectively understanding teachers' teaching styles. Specifically, we study 1) how to quantitatively characterize the teaching styles of various teachers and 2) how to model the subtle relationship between cross-media teaching related data (speech, facial expressions, body motions, content, etc.) and teaching styles. Using the adjectives selected from more than 10,000 feedback questionnaires provided by an educational enterprise, a novel concept called Teaching Style Semantic Space (TSSS) is developed based on the pleasure-arousal dimensional theory to describe teaching styles quantitatively and comprehensively. Then a multi-task deep learning based model, Attention-based Multi-path Multi-task Deep Neural Network (AMMDNN), is proposed to accurately and robustly capture the internal correlations between cross-media features and TSSS. Based on the benchmark dataset, we further develop a comprehensive data set including 4,541 fully annotated cross-modality teaching classes. Our experimental results demonstrate that the proposed AMMDNN outperforms baseline methods (+0.0842% in terms of the concordance correlation coefficient (CCC) on average). To further demonstrate the advantages of the proposed TSSS and our model, several interesting case studies are carried out, such as teaching styles comparison among different teachers and courses, and leveraging the proposed method for teaching quality analysis.
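For readers unfamiliar with the evaluation metric named in this abstract, the concordance correlation coefficient is commonly defined as 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below computes it with population statistics; the function name is illustrative, and the abstract does not say which variant (population or sample) the authors used.

```python
def concordance_cc(predictions, targets):
    """Concordance correlation coefficient (population form).

    Equals 1.0 only when predictions match targets exactly; unlike
    Pearson correlation, it also penalizes shifts in mean and scale.
    """
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum(
        (p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets)
    ) / n
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```

For example, predictions that are perfectly correlated with the targets but offset by a constant score below 1.0, which is why the metric is popular for dimensional (e.g., pleasure-arousal) regression.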

Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition

Weiqing Min
Linhu Liu
Zhengdong Luo
Shuqiang Jiang

Recently, food recognition has been gaining more attention in the multimedia community due to its various applications, e.g., multimodal foodlog and personalized healthcare. Most existing methods directly extract visual features of the whole image using popular deep networks for food recognition without considering its own characteristics. Compared with other types of object images, food images generally do not exhibit distinctive spatial arrangement and common semantic patterns, and thus it is very hard to capture discriminative information from them. In this work, we achieve food recognition by developing an Ingredient-Guided Cascaded Multi-Attention Network (IG-CMAN), which is capable of sequentially localizing multiple informative image regions at multiple scales, from category-level to ingredient-level guidance, in a coarse-to-fine manner. At the first level, IG-CMAN generates the initial attentional region from the category-supervised network with Spatial Transformer (ST). Taking this localized attentional region as the reference, IG-CMAN combines ST with LSTM to sequentially discover diverse attentional regions with fine-grained scales from the ingredient-guided sub-network in the following levels. Furthermore, we introduce a new dataset, ISIA Food-200, with 200 food categories from the list in Wikipedia, about 200,000 food images and 319 ingredients. We conducted extensive experiments on two popular food datasets and the newly proposed ISIA Food-200, and verified the effectiveness of our method. Qualitative results along with visualization further show that IG-CMAN can introduce explainability for localized regions, and is able to learn relevant regions for ingredients.

Pedestrian Attribute Recognition via Hierarchical Multi-task Learning and Relationship Attention

Lian Gao
Di Huang
Yuanfang Guo
Yunhong Wang

Pedestrian Attribute Recognition (PAR) is an important task in surveillance video analysis. In this paper, we propose a novel end-to-end hierarchical deep learning approach to PAR. The proposed network introduces semantic segmentation into PAR and formulates it as a multi-task learning problem, which brings in pixel-level supervision in feature learning for attribute localization. According to the spatial properties of local and global attributes, we present a two stage learning mechanism to decouple coarse attribute localization and fine attribute recognition into successive phases within a single model, which strengthens feature learning. Besides, we design an attribute relationship attention module to efficiently capture and emphasize the latent relations among different attributes, further enhancing the discriminative power of the feature. Extensive experiments are conducted and very competitive results are reached on the RAP and PETA databases, indicating the effectiveness and superiority of the proposed approach.

Small and Dense Commodity Object Detection with Multi-Scale Receptive Field Attention

Zhong Ji
Qiankun Kong
Haoran Wang
Yanwei Pang

Small and dense commodity object detection is highly valuable for applications in practical scenarios. Unlike existing approaches, which mostly focus on detecting generic objects, this paper studies the problem of specific commodity detection, which is characterized by searching for small and dense instances with similar appearances. Since there is no available dataset or benchmark specialized for exploring this issue, we release a Small and Dense Object Dataset of Milk Tea (SDOD-MT) to promote the research. Besides, our main solutions for mitigating the detection performance drop caused by the existence of small and dense objects can be summarized in two parts. First, to highlight the information of positive objects in the feature map, we propose a Multi-Scale Receptive Field (MSRF) attention to generate an attention map that weights the importance of each location of the image feature. Second, to eliminate the negative impact on detection performance brought by the issue of sample imbalance, we present a new loss function named ω-focal loss, which significantly improves the detection accuracy of the categories with few objects. Incorporating these two components into an end-to-end deep architecture, we propose a one-stage detecting framework, dubbed CommodityNet. Extensive experimental results on SDOD-MT demonstrate that the proposed approach achieves superior performance on small dense object detection.

What I See Is What You See: Joint Attention Learning for First and Third Person Video Co-analysis

Huangyue Yu
Minjie Cai
Yunfei Liu
Feng Lu

In recent years, more and more videos are captured from the first-person viewpoint by wearable cameras. Such first-person video provides additional information besides the traditional third-person video, and thus has a wide range of applications. However, techniques for analyzing the first-person video can be fundamentally different from those for the third-person video, and it is even more difficult to explore the shared information from both viewpoints. In this paper, we propose a novel method for first- and third-person video co-analysis. At the core of our method is the notion of "joint attention", indicating the learnable representation that corresponds to the shared attention regions in different viewpoints and thus links the two viewpoints. To this end, we develop a multi-branch deep network with a triplet loss to extract the joint attention from the first- and third-person videos via self-supervised learning. We evaluate our method on a public dataset with cross-viewpoint video matching tasks. Our method outperforms the state of the art both qualitatively and quantitatively. We also demonstrate how the learned joint attention can benefit various applications through a set of additional experiments.
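The triplet loss mentioned in this abstract has a standard general form, sketched below. The Euclidean distance, the margin value, and the interpretation of anchor, positive, and negative as embeddings of matching and non-matching viewpoints are assumptions for illustration; the abstract does not specify the authors' exact formulation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge-style triplet loss.

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`; otherwise it grows linearly.
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )
```

In a cross-viewpoint setting, the anchor could be a first-person embedding, the positive the third-person embedding of the same moment, and the negative an embedding from an unrelated clip.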

Impact of Saliency and Gaze Features on Visual Control: Gaze-Saliency Interest Estimator

Souad Chaabouni
Frederic Precioso

Predicting user intent from gaze presents a challenging question for developing real-time interactive systems such as interactive search engines, implicit annotation of large datasets or intelligent robot behavior. Indeed, solutions to easily annotate large sets of images while reducing the burden on annotators are a key aspect for current machine learning techniques. We propose in this paper to design an estimator of the user interest for a given visual content based on eye-tracker feature analysis. We revise existing gaze-based interest estimators, and analyze the impact of the intrinsic saliency of the content displayed for interest estimation. We first explore low-level saliency prediction and propose a new gaze and saliency interest estimator. Experimental results show the advantage of our method for the annotation task in a weakly supervised context. In particular, we extend previous evaluation criteria on a new experimental protocol displaying four images per frame as a first step towards "Google Image search-like" interfaces. Our Gaze and Saliency Interest Estimator (GSIE) reaches an overall accuracy of 83% on average for user interest prediction. If we consider the accuracy reached in a limited time, the GSIE reaches 70% on average within about 500ms and 80% on average within 1000ms. This result confirms our GSIE as an efficient real-time visual control solution.

A Unified Multiple Graph Learning and Convolutional Network Model for Co-saliency Estimation

Bo Jiang
Xingyue Jiang
Ajian Zhou
Jin Tang
Bin Luo

Co-saliency estimation, which aims to identify the common salient object regions contained in an image set, is an active problem in computer vision. The main challenge for the co-saliency estimation problem is how to exploit the salient cues of both intra-image and inter-image simultaneously. In this paper, we first represent intra-image and inter-image as intra-graph and inter-graph respectively and formulate co-saliency estimation as graph node labeling. Then, we propose a novel multiple graph learning and convolutional network (M-GLCN) for image co-saliency estimation. M-GLCN conducts graph convolutional learning and labeling on both inter-graph and intra-graph cooperatively and thus can well exploit the salient cues of both intra-image and inter-image simultaneously for co-saliency estimation. Moreover, M-GLCN employs a new graph learning mechanism to learn both inter-graph and intra-graph adaptively. Experimental results on several benchmark datasets demonstrate the effectiveness of M-GLCN on the co-saliency estimation task.

SGDNet: An End-to-End Saliency-Guided Deep Neural Network for No-Reference Image Quality Assessment

Sheng Yang
Qiuping Jiang
Weisi Lin
Yongtao Wang

We propose an end-to-end saliency-guided deep neural network (SGDNet) for no-reference image quality assessment (NR-IQA). Our SGDNet is built on an end-to-end multi-task learning framework in which two sub-tasks, visual saliency prediction and image quality prediction, are jointly optimized with a shared feature extractor. The existing multi-task CNN-based NR-IQA methods, which usually consider distortion identification as the auxiliary sub-task, cannot accurately identify the complex mixtures of distortions that exist in authentically distorted images. By contrast, our saliency prediction sub-task is more universal because visual attention always exists when viewing every image, regardless of its distortion type. More importantly, related works have reported that saliency information is highly correlated with image quality, and this property is fully utilized in our proposed SGDNet by training the model with more informative labels including saliency maps and quality scores simultaneously. In addition, the outputs of the saliency prediction sub-task are transparent to the primary quality regression sub-task by providing a kind of spatial attention mask for a more perceptually-consistent feature fusion. By training the whole network with the two sub-tasks together, more discriminant features can be learned and a more accurate mapping from feature representations to quality scores can be established. Experimental results on both authentically and synthetically distorted IQA datasets demonstrate the superiority of our SGDNet, as compared to the state-of-the-art approaches.

Co-saliency Detection Based on Hierarchical Consistency

Bo Li
Zhengxing Sun
Quan Wang
Qian Li

As an interesting and emerging topic, co-saliency detection aims at discovering common and salient objects in a group of related images, which is useful to a variety of visual media applications. Although a number of approaches have been proposed to address this problem, many of them are designed with misleading assumptions, suboptimal image representations, or heavy supervision costs, and thus still suffer from certain limitations, which reduces their capability in real-world scenarios. To alleviate these limitations, we propose a novel unsupervised co-saliency detection method, which successively explores the hierarchical consistency in the image group, including background consistency and high-level and low-level object consistency, in a unified framework. We first design a novel superpixel-wise variational autoencoder (SVAE) network to precisely distinguish the salient objects from the background collection based on the reconstruction errors. Then, we propose a two-stage clustering strategy to explore the multi-level salient object consistency by using high-level and low-level features separately. Finally, the co-saliency results are refined by applying a CRF based refinement method with the multi-level salient object consistency. Extensive experiments on three widely used datasets show that our method achieves superior or competitive performance compared to the state-of-the-art methods.

SESSION: Session 3C: Smart Applications

Inferring Mood Instability via Smartphone Sensing: A Multi-View Learning Approach

Xiao Zhang
Fuzhen Zhuang
Wenzhong Li
Haochao Ying
Hui Xiong
Sanglu Lu

A high correlation between mood instability (MI), the rapid and constant fluctuation in mood, and mental health has been demonstrated. However, conventional approaches to measure MI are limited owing to the high manpower and time cost required. In this paper, we propose a smartphone-based MI detection that can automatically and passively detect MI with minimal human involvement. The proposed method trains a multi-view learning classification model using features extracted from the smartphone sensing data of volunteers and their self-reported moods. The trained classifier is then used to detect the MI of unseen users efficiently, thereby reducing the human involvement and time cost significantly. Based on extensive experiments conducted with the dataset collected from 68 volunteers, we demonstrate that the proposed multi-view learning model outperforms the baseline classifiers.

Visual-Inertial State Estimation with Pre-integration Correction for Robust Mobile Augmented Reality

Zikang Yuan
Dongfu Zhu
Cheng Chi
Jinhui Tang
Chunyuan Liao
Xin Yang

Mobile devices equipped with a monocular camera and an inertial measurement unit (IMU) are ideal platforms for augmented reality (AR) applications. However, nontrivial noise in low-cost IMUs, which are common in consumer-level mobile devices, could lead to large errors in pose estimation and in turn significantly degrade the user experience in mobile AR apps. In this study, we propose a novel monocular visual-inertial state estimation approach for robust and accurate pose estimation even for low-cost IMUs. The core of our method is an IMU pre-integration correction approach which effectively reduces the negative impact of IMU noise using the visual constraints in a sliding window and the kinematic constraint. We seamlessly integrate the IMU pre-integration correction module into a tightly-coupled, sliding-window based optimization framework for state estimation. Experimental results on the public dataset EUROC demonstrate the superiority of our method to the state-of-the-art VINS-Mono in terms of smaller absolute trajectory errors (ATE) and relative pose errors (RPE). We further apply our method to real AR applications on two types of consumer-level mobile devices equipped with low-cost IMUs, i.e., an off-the-shelf smartphone and a pair of AR glasses. Experimental results demonstrate that our method can facilitate robust AR with little drift on the two devices.

Close the Gap between Deep Learning and Mobile Intelligence by Incorporating Training in the Loop

Cong Wang
Yanru Xiao
Xing Gao
Li Li
Jun Wang

Pre-trained deep learning models can be deployed on mobile devices to conduct inference. However, they are usually not updated thereafter. In this paper, we take a step further to incorporate training deep neural networks on battery-powered mobile devices and overcome the difficulties from the lack of labeled data. We design and implement a new framework to enlarge the sample space via data pairing and learn a deep metric under privacy, memory and computational constraints. A case study of deep behavioral authentication is conducted. Our experiments demonstrate accuracy over 95% on three public datasets, a sheer 15% gain from traditional multi-class classification with less data and robustness against brute-force attacks with 99% success. We demonstrate the training performance on various smartphone models, where training 100 epochs takes less than 10 minutes and can be boosted 3-5 times with feature transfer. We also profile memory, energy and computational overhead. Our results indicate that training consumes less energy than watching videos, so it can be scheduled intermittently on mobile devices.

Towards Automatic Face-to-Face Translation

Prajwal K R
Rudrabha Mukhopadhyay
Jerin Philip
Abhishek Jha
Vinay Namboodiri
C V Jawahar

In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models and demo video are made publicly available.

MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video

Yinwei Wei
Xiang Wang
Liqiang Nie
Xiangnan He
Richang Hong
Tat-Seng Chua

Personalized recommendation plays a central role in many online content sharing platforms. To provide quality micro-video recommendation service, it is of crucial importance to consider the interactions between users and items (i.e., micro-videos) as well as the item contents from various modalities (e.g., visual, acoustic, and textual). Existing works on multimedia recommendation largely exploit multi-modal contents to enrich item representations, while less effort is made to leverage information interchange between users and items to enhance user representations and further capture users' fine-grained preferences on different modalities. In this paper, we propose to exploit user-item interactions to guide the representation learning in each modality, and to further personalize micro-video recommendation. We design a Multi-modal Graph Convolution Network (MMGCN) framework built upon the message-passing idea of graph neural networks, which can yield modal-specific representations of users and micro-videos to better capture user preferences. Specifically, we construct a user-item bipartite graph in each modality, and enrich the representation of each node with the topological structure and features of its neighbors. Through extensive experiments on three publicly available datasets, Tiktok, Kwai, and MovieLens, we demonstrate that our proposed model is able to significantly outperform state-of-the-art multi-modal recommendation methods.

Personalized Hashtag Recommendation for Micro-videos

Yinwei Wei
Zhiyong Cheng
Xuzheng Yu
Zhou Zhao
Lei Zhu
Liqiang Nie

Personalized hashtag recommendation methods aim to suggest hashtags that users can use to annotate, categorize, and describe their posts. The hashtags that a user assigns to a post (e.g., a micro-video) are the ones that, in her mind, best describe the post content in which she is interested. This means that we should consider both users' preferences on the post contents and their personal understanding of the hashtags. Most existing methods rely on modeling either the interactions between hashtags and posts or the interactions between users and hashtags for hashtag recommendation. These methods have not well explored the complicated interactions among users, hashtags, and micro-videos. In this paper, towards personalized micro-video hashtag recommendation, we propose a Graph Convolution Network based Personalized Hashtag Recommendation (GCN-PHR) model, which leverages recently advanced GCN techniques to model the complicated interactions among

Multimodal Classification of Urban Micro-Events

Maarten Sukel
Stevan Rudinac
Marcel Worring

In this paper we seek methods to effectively detect urban micro-events. Urban micro-events are events which occur in cities, have limited geographical coverage and typically affect only a small group of citizens. Because of their scale these events are difficult to identify in most data sources. However, by using citizen sensing to gather data, detecting them becomes feasible. The data gathered by citizen sensing is often multimodal and, as a consequence, the information required to detect urban micro-events is distributed over multiple modalities. This makes it essential to have a classifier capable of combining them. In this paper we explore several methods of creating such a classifier, including early, late and hybrid fusion as well as representation learning using multimodal graphs. We evaluate performance in terms of accurate classification of urban micro-events on a real world dataset obtained from a live citizen reporting system. We show that a multimodal approach yields higher performance than unimodal alternatives. Furthermore, we demonstrate that our hybrid combination of early and late fusion with multimodal embeddings outperforms our other fusion methods.
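The difference between the early and late fusion strategies named in this abstract can be shown with a generic sketch. The function names, the plain concatenation for early fusion, and the weighted averaging for late fusion are common textbook choices assumed here for illustration; they are not taken from the paper.

```python
def early_fusion(text_features, image_features):
    """Early fusion: join per-modality feature vectors into one vector,
    which would then be passed to a single classifier."""
    return list(text_features) + list(image_features)


def late_fusion(per_modality_scores, weights=None):
    """Late fusion: combine class-score vectors produced by separate
    per-modality classifiers using a weighted average.

    per_modality_scores: list of score vectors, one per modality
    weights: optional list of non-negative weights (defaults to uniform)
    """
    count = len(per_modality_scores)
    if weights is None:
        weights = [1.0] * count
    total = sum(weights)
    num_classes = len(per_modality_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, per_modality_scores))
        / total
        for c in range(num_classes)
    ]
```

A hybrid scheme, in this simplified view, would treat the output of a classifier trained on the early-fused vector as one more entry in `per_modality_scores`.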

Routing Micro-videos via A Temporal Graph-guided Recommendation System

Yongqi Li
Meng Liu
Jianhua Yin
Chaoran Cui
Xin-Shun Xu
Liqiang Nie

In the past few years, micro-videos have become the dominant trend in the social media era. Meanwhile, as the number of micro-videos increases, users are frequently overwhelmed by videos that do not interest them. Despite the success of existing recommendation systems developed for various communities, they cannot be applied to routing micro-videos, since users in micro-video platforms have their unique characteristics: diverse and dynamic interest, multi-level interest, as well as true negative samples. To address these problems, we present a temporal graph-guided recommendation system. In particular, we first design a novel graph-based sequential network to simultaneously model users' dynamic and diverse interest. Similarly, uninterested information can be captured from users' true negative samples. Beyond that, we introduce users' multi-level interest into our recommendation model via a user matrix that is able to learn the enhanced representation of users' interest. Finally, the system can make accurate recommendations by considering the above characteristics. Experimental results on two public datasets verify the effectiveness of our proposed model.

Joint Rotation-Invariance Face Detection and Alignment with Angle-Sensitivity Cascaded Networks

Bowen Yang
Chun Yang
Qi Liu
Xu-Cheng Yin

Due to angle variations, especially in unconstrained scenarios, face detection and alignment have become challenging tasks. In existing methods, face detection and alignment are always conducted separately, which can greatly increase the computation cost. Moreover, this separation discards the inherent correlation underlying the two tasks. In this paper, we propose a simple but effective architecture, named Angle-Sensitivity Cascaded Networks (ASCN), for jointly conducting rotation-invariance face detection and alignment. ASCN mainly consists of three consecutive cascaded networks. Specifically, in the first stage, the rotation angle is predicted and candidate bounding boxes are proposed simultaneously. In the second stage, ASCN further refines the candidates and orientations. In the last stage, ASCN jointly learns the accurate bounding boxes and alignment. Besides, for accurately locating landmarks in hard examples, we introduce a pose-equitable loss to balance the faces with large poses. Extensive experiments conducted on benchmark datasets demonstrate the surprising performance of our method. Notably, our method maintains real-time efficiency for both detection and alignment tasks on an ordinary CPU platform.

See Through the Windshield from Surveillance Camera

Daiqian Ma
Yan Bai
Renjie Wan
Ce Wang
Boxin Shi
Ling-Yu Duan

This paper attempts to address the challenging task of seeing through the windshield in images captured by surveillance cameras in the wild. Such images usually have very low visibility due to heterogeneous degradations caused by blur, haze, reflection, noise, etc., which makes existing image enhancing methods inapplicable. We propose a windshield image restoration generative adversarial network (WIRE-GAN) to restore and enhance the visibility of windshield images. We adopt a weakly supervised framework based on the generative model, which effectively relaxes the requirement for paired training data for a specific type of degradation. To generate more semantically consistent results even in extreme lighting conditions, we introduce a novel content-preserving strategy into the proposed weakly-supervised framework. To make the image restoration more reliable, the WIRE-GAN network constructs a content-aware embedding space and enforces the constraint that the restored windshield images be closer to the original input in the embedding space. Moreover, we collect a large-scale windshield image dataset (WIRE dataset) to validate the advantage of our method in improving the image quality, and further evaluate the impact of windshield restoration on vehicle ReID performance.

Exploring Background-bias for Anomaly Detection in Surveillance Videos

Kun Liu
Huadong Ma

Anomaly detection in surveillance videos, as a special case of video-based action recognition, is an important topic in the multimedia community and public security. Currently, most of the state-of-the-art methods utilize deep learning to recognize the patterns of anomaly or action. However, whether deep neural networks really learn the essence of the anomaly or just remember the background is an important but often neglected problem. In this paper, we develop a series of experiments to validate the existence of the background-bias phenomenon, which makes deep networks tend to learn the background information rather than the pattern of anomalies to recognize abnormal behavior. To solve it, we first re-annotate the largest anomaly detection dataset and design a new evaluation metric to measure whether the models really learn the essence of anomalies. Then, we propose an end-to-end trainable, anomaly-area guided framework, where we design a novel region loss to explicitly drive the network to learn where the anomalous region is. Besides, given very deep networks and scarce training data for anomalies, our architecture is trained with a meta learning module to prevent severe overfitting. Extensive experiments on the benchmark show that our approach outperforms other methods on both the previous and our proposed evaluation metrics by reducing the influence of the background information.

Editing Text in the Wild

Liang Wu
Chengquan Zhang
Jiaming Liu
Junyu Han
Jingtuo Liu
Errui Ding
Xiang Bai

In this paper, we are interested in editing text in natural images, which aims to replace or modify a word in the source image with another one while maintaining its realistic look. This task is challenging, as the styles of both background and text need to be preserved so that the edited image is visually indistinguishable from the source image. Specifically, we propose an end-to-end trainable style retention network (SRNet) that consists of three modules: text conversion module, background inpainting module and fusion module. The text conversion module changes the text content of the source image into the target text while keeping the original text style. The background inpainting module erases the original text, and fills the text region with appropriate texture. The fusion module combines the information from the two former modules, and generates the edited text images. To our knowledge, this work is the first attempt to edit text in natural images at the word level. Both visual effects and quantitative results on synthetic and real-world dataset (ICDAR 2013) fully confirm the importance and necessity of modular decomposition. We also conduct extensive experiments to validate the usefulness of our method in various real-world applications such as text image synthesis, augmented reality (AR) translation, information hiding, etc.

A Novel Two-stage Separable Deep Learning Framework for Practical Blind Watermarking

Yang Liu
Mengxi Guo
Jian Zhang
Yuesheng Zhu
Xiaodong Xie

As a vital copyright protection technology, blind watermarking based on deep learning with an end-to-end encoder-decoder architecture has been recently proposed. Although the one-stage end-to-end training (OET) facilitates the joint learning of encoder and decoder, the noise attack must be simulated in a differentiable way, which is not always applicable in practice. In addition, OET often encounters the problems of converging slowly and tends to degrade the quality of watermarked images under noise attack. In order to address the above problems and improve the practicability and robustness of algorithms, this paper proposes a novel two-stage separable deep learning (TSDL) framework for practical blind watermarking. Precisely, the TSDL framework is composed of noise-free end-to-end adversary training (FEAT) and noise-aware decoder-only training (ADOT). A redundant multi-layer feature encoding network is developed in FEAT to obtain the encoder, while ADOT is used to get the decoder which is robust and practical enough to accept any type of noise. Extensive experiments demonstrate that the proposed framework not only exhibits better stability, greater performance and faster convergence speed compared with current state-of-the-art OET methods, but is also able to resist high-intensity noises that have not been tested in previous works.

Towards a Perceptual Loss: Using a Neural Network Codec Approximation as a Loss for Generative Audio Models

Ishwarya Ananthabhotla
Sebastian Ewert
Joseph A. Paradiso

Generative audio models based on neural networks have led to considerable improvements across fields including speech enhancement, source separation, and text-to-speech synthesis. These systems are typically trained in a supervised fashion using simple element-wise l1 or l2 losses. However, because they do not capture properties of the human auditory system, such losses encourage modelling perceptually meaningless aspects of the output, wasting capacity and limiting performance. Additionally, while adversarial models have been employed to encourage outputs that are statistically indistinguishable from ground truth and have resulted in improvements in this regard, such losses do not need to explicitly model perception as their task; furthermore, training adversarial networks remains an unstable and slow process. In this work, we investigate an idea fundamentally rooted in psychoacoustics. We train a neural network to emulate an MP3 codec as a differentiable function. Feeding the output of a generative model through this MP3 function, we remove signal components that are perceptually irrelevant before computing a loss. To further stabilize gradient propagation, we employ intermediate layer outputs to define our loss, as found useful in image domain methods. Our experiments using an autoencoding task show an improvement over standard losses in listening tests, indicating the potential of psychoacoustically motivated models for audio generation.

SESSION: Session 3D: Algorithms in Multimedia

User Diverse Preference Modeling by Multimodal Attentive Metric Learning

Fan Liu
Zhiyong Cheng
Changchang Sun
Yinglong Wang
Liqiang Nie
Mohan Kankanhalli

Most existing recommender systems represent a user's preference with a feature vector, which is assumed to be fixed when predicting this user's preferences for different items. However, the same vector cannot accurately capture a user's varying preferences on all items, especially when considering the diverse characteristics of various items. To tackle this problem, in this paper, we propose a novel Multimodal Attentive Metric Learning (MAML) method to model users' diverse preferences for various items. In particular, for each user-item pair, we propose an attention neural network, which exploits the item's multimodal features to estimate the user's special attention to different aspects of this item. The obtained attention is then integrated into a metric-based learning method to predict the user preference on this item. The advantage of metric learning is that it can naturally overcome the problem of dot product similarity, which is adopted by matrix factorization (MF) based recommendation models but does not satisfy the triangle inequality property. In addition, it is worth mentioning that the attention mechanism can not only help model users' diverse preferences towards different items, but also overcome the geometrically restrictive problem caused by collaborative metric learning. Extensive experiments on large-scale real-world datasets show that our model can substantially outperform the state-of-the-art baselines, demonstrating the potential of modeling users' diverse preferences for recommendation.
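As a rough illustration of folding per-pair attention into a metric-based score, the sketch below computes an attention-weighted Euclidean distance between a user vector and an item vector. The function name, the toy vectors, and the element-wise weighting are assumptions made for illustration; the abstract does not give the exact formulation used by MAML.

```python
import math


def attentive_distance(user, item, attention):
    """Euclidean distance between user and item after element-wise attention weighting.

    `attention` is a hypothetical per-dimension weight vector for this user-item pair.
    """
    if not (len(user) == len(item) == len(attention)):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a * (u - v)) ** 2
                         for u, v, a in zip(user, item, attention)))


user = [1.0, 0.0, 2.0]
item = [0.0, 0.0, 0.0]

# The same user vector scored under two different attention patterns.
d_first = attentive_distance(user, item, [1.0, 1.0, 0.0])  # attends to dimensions 0 and 1
d_last = attentive_distance(user, item, [0.0, 1.0, 1.0])   # attends to dimensions 1 and 2
```

With a fixed user vector, changing the attention weights changes the distance, which is how a single embedding can express different preferences for different items.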

Deep Hashing by Discriminating Hard Examples

Cheng Yan
Guansong Pang
Xiao Bai
Chunhua Shen
Jun Zhou
Edwin Hancock

This paper tackles a rarely explored but critical problem within learning to hash, i.e., to learn hash codes that effectively discriminate hard similar and dissimilar examples, to empower large-scale image retrieval. Hard similar examples refer to image pairs from the same semantic class that demonstrate some shared appearance but have different fine-grained appearance. Hard dissimilar examples are image pairs that come from different semantic classes but exhibit similar appearance. These hard examples generally have a small distance due to the shared appearance. Therefore, effective encoding of the hard examples can well discriminate the relevant images within a small Hamming distance, enabling more accurate retrieval in the top-ranked returned images. However, most existing hashing methods cannot capture this key information as their optimization is dominated by easy examples, i.e., distant similar/dissimilar pairs that share no or limited appearance. To address this problem, we introduce a novel Gamma distribution-enabled and symmetric Kullback-Leibler divergence-based loss, which is dubbed dual hinge loss because it works similarly to imposing two smoothed hinge losses on the respective similar and dissimilar pairs. Specifically, the loss enforces exponentially variant penalization on the hard similar (dissimilar) examples to emphasize and learn their fine-grained difference. It meanwhile imposes a bounding penalization on easy similar (dissimilar) examples to prevent the dominance of the easy examples in the optimization while preserving the high-level similarity (dissimilarity). This enables our model to well encode the key information carried by both easy and hard examples. Extensive empirical results on three widely-used image retrieval datasets show that (i) our method consistently and substantially outperforms state-of-the-art competing methods using hash codes of the same length and (ii) our method can use significantly (e.g., 50%-75%) shorter hash codes to perform substantially better than, or comparably well to, the competing methods.
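To make the retrieval setting concrete, the following sketch ranks a small database of binary codes by Hamming distance to a query code. The codes are made-up 8-bit values and the helper names are hypothetical; the sketch shows only the retrieval step, not the proposed loss.

```python
def hamming_distance(code_a, code_b):
    """Number of differing bits between two binary codes stored as integers."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query, database):
    """Return database indices ordered by increasing Hamming distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


query = 0b10110010
database = [0b10110011, 0b01001101, 0b10110010, 0b10010010]
ranking = rank_by_hamming(query, database)
```

Items 0 and 3 both sit at distance 1 from the query, so a single bit decides their order. This is the small-distance regime in which hard examples have to be told apart.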

Watch, Reason and Code: Learning to Represent Videos Using Program

Xuguang Duan
Qi Wu
Chuang Gan
Yiwei Zhang
Wenbing Huang
Anton van den Hengel
Wenwu Zhu

Humans have a surprising capacity to induce general rules that describe the specific actions portrayed in a video sequence. The rules learned through this kind of process allow us to achieve similar goals to those shown in the video but in more general circumstances. Enabling an agent to achieve the same capacity represents a significant challenge. In this paper, we propose a Watch-Reason-Code (WRC) model to synthesise programs that describe the process carried out in a set of video sequences. The 'watch' stage is simply a video encoder that encodes videos to multiple feature vectors. The 'reason' stage takes as input the features from multiple diverse videos and generates a compact feature representation via a novel deviation-pooling method. The 'code' stage is a multi-round decoder whose first step generates a draft program layout with possibly useful statements and perceptions. Further steps then take these outputs and generate a fully structured, compilable and executable program. We evaluate the effectiveness of our model in two video-to-program synthesis environments, Karel and VizDoom, showing that we can achieve state-of-the-art results under a variety of settings.

Super Resolution Using Dual Path Connections

Bin-Cheng Yang

Deep convolutional neural networks (CNNs) have recently been demonstrated to be effective for single-image super-resolution (SISR). Inspired by Chen et al., we propose a novel method for SISR that introduces dual path connections into a deep convolutional neural network, which we call SRDPN. SRDPN consists of three parts: a feature extraction block, multiple stacked dual path blocks and a reconstruction block. Each dual path block is made of one transition unit and several cascading dual path units. The dual path unit, the core component of the proposed SRDPN, is a specially designed network unit which uses both residual and dense connections for convolution layers to exploit common features and explore new features layer-wise. The transition unit in each dual path block is used to fuse the residual and dense features of the previous dual path block to keep computation and memory cost under control. Finally, we concatenate the outputs of all the dual path blocks to reconstruct a residual between the high-resolution (HR) image and the low-resolution (LR) image, both making information forward-propagation direct and alleviating the gradient vanishing/exploding problem. Experiments show the proposed SRDPN has superior performance over the state-of-the-art methods.

Supervised Discrete Hashing With Mutual Linear Regression

Xingbo Liu
Xiushan Nie
Quan Zhou
Yilong Yin

Supervised linear hashing can compress high-dimensional data into compact binary codes owing to its efficiency. Generally, the relation between label and hash codes is widely used in the existing hashing methods because of its effectiveness of improving the accuracy. The existing hashing methods always use two different projections to represent the mutual regression between hash codes and class labels. In contrast to the existing methods, we propose a novel learning-based hashing method termed supervised discrete hashing with mutual linear regression (SDHMLR) in this study, where only one stable projection is used to describe the linear correlation between hash codes and corresponding labels. To the best of our knowledge, this strategy has not been used for hashing previously. In addition, we further use a boosting strategy to improve the final performance of the proposed method without adding extra constraints and with little extra expenditure in terms of time and space. Extensive experiments conducted on three image benchmarks demonstrate the superior performance of the proposed method.

Robust Subspace Discovery by Block-diagonal Adaptive Locality-constrained Representation

Zhao Zhang
Jiahuan Ren
Sheng Li
Richang Hong
Zhengjun Zha
Meng Wang

We propose a novel and unsupervised representation learning model, i.e., Robust Block-Diagonal Adaptive Locality-constrained Latent Representation (rBDLR). rBDLR is able to recover multi-subspace structures and extract the adaptive locality-preserving salient features jointly. Leveraging on the Frobenius-norm based latent low-rank representation model, rBDLR jointly learns the coding coefficients and salient features, and improves the results by enhancing the robustness to outliers and errors in given data, preserving local information of salient features adaptively and ensuring the block-diagonal structures of the coefficients. To improve the robustness, we perform the latent representation and adaptive weighting in a recovered clean data space. To force the coefficients to be block-diagonal, we perform auto-weighting by minimizing the reconstruction error based on salient features, constrained using a block-diagonal regularizer. This ensures that a strict block-diagonal weight matrix can be obtained and salient features will possess the adaptive locality preserving ability. By minimizing the difference between the coefficient and weights matrices, we can obtain a block-diagonal coefficients matrix and it can also propagate and exchange useful information between salient features and coefficients. Extensive results demonstrate the superiority of rBDLR over other state-of-the-art methods.

Heterogeneous Domain Adaptation via Soft Transfer Network

Yuan Yao
Yu Zhang
Xutao Li
Yunming Ye

Heterogeneous domain adaptation (HDA) aims to facilitate the learning task in a target domain by borrowing knowledge from a heterogeneous source domain. In this paper, we propose a Soft Transfer Network (STN), which jointly learns a domain-shared classifier and a domain-invariant subspace in an end-to-end manner, for addressing the HDA problem. The proposed STN not only aligns the discriminative directions of domains but also matches both the marginal and conditional distributions across domains. To circumvent negative transfer, STN aligns the conditional distributions by using the soft-label strategy of unlabeled target data, which prevents the hard assignment of each unlabeled target data to only one category that may be incorrect. Further, STN introduces an adaptive coefficient to gradually increase the importance of the soft-labels since they will become more and more accurate as the number of iterations increases. We perform experiments on the transfer tasks of image-to-image, text-to-image, and text-to-text. Experimental results testify that the STN significantly outperforms several state-of-the-art approaches.

Alleviating Feature Confusion for Generative Zero-shot Learning

Jingjing Li
Mengmeng Jing
Ke Lu
Lei Zhu
Yang Yang
Zi Huang

Lately, generative adversarial networks (GANs) have been successfully applied to zero-shot learning (ZSL) and achieved state-of-the-art performance. By synthesizing virtual unseen visual features, GAN-based methods convert the challenging ZSL task into a supervised learning problem. However, since real unseen visual features are not available at the training stage, GAN-based ZSL methods have to train the GAN generator on the seen categories and further apply it to unseen instances. An inevitable issue of such a paradigm is that the synthesized unseen features are prone to seen references and incapable of reflecting the novelty and diversity of real unseen instances. In a nutshell, the synthesized features are confusing. One cannot tell unseen categories from seen ones using the synthesized features. As a result, the synthesized features are too subtle to be classified in generalized zero-shot learning (GZSL) which involves both seen and unseen categories at the test stage. In this paper, we first introduce the feature confusion issue. Then, we propose a new feature generating network, named alleviating feature confusion GAN (AFC-GAN), to challenge the issue. Specifically, we present a boundary loss which maximizes the decision boundary of seen categories and unseen ones. Furthermore, a novel metric named feature confusion score (FCS) is proposed to quantify the feature confusion. Extensive experiments on five widely used datasets verify that our method is able to outperform previous state-of-the-art methods under both ZSL and GZSL protocols.

Duet Robust Deep Subspace Clustering

Yangbangyan Jiang
Qianqian Xu
Zhiyong Yang
Xiaochun Cao
Qingming Huang

Subspace clustering has long been recognized as vulnerable toward gross corruptions -- the corruptions can easily mislead the estimation of the underlying subspace structure. Recently, deep extensions of traditional subspace clustering methods have shown their great power to boost the clustering performance. However, deep learning methods are, in themselves, more prone to be affected by data corruptions. This motivates us to design specific robust extensions for deep subspace clustering methods. More precisely, we contribute a new robust deep framework called Duet Robust Deep Subspace Clustering (DRDSC). Our main idea is to explicitly model the corrupted patterns from both the data reconstruction perspective and the latent self-expression perspective with two regularization norms. Moreover, since the two involved norms are non-smooth, we implement a smoothing technique for these norms to facilitate the back-propagation of our proposed network. Experiments carried out on real-world vision tasks with different noise settings demonstrate the effectiveness of our proposed method.

Imbalance-aware Pairwise Constraint Propagation

Hui Liu
Yuheng Jia
Junhui Hou
Qingfu Zhang

Pairwise constraint propagation (PCP) aims to propagate a limited number of initial pairwise constraints (PCs, including must-link and cannot-link constraints) from the constrained data samples to the unconstrained ones to boost subsequent PC-based applications. The existing PCP approaches always suffer from the imbalance characteristic of PCs, which limits their performance significantly. To this end, we propose a novel imbalance-aware PCP method, by comprehensively and theoretically exploring the intrinsic structures of the underlying PCs. Specifically, different from the existing methods that adopt a single representation, we propose to use two separate carriers to represent the two types of links. The propagation is driven by the structure embedded in data samples and the regularization of the local, global, and complementary structures of the two carriers. Our method is elegantly cast as a well-posed constrained optimization model, which can be efficiently solved. Experimental results demonstrate that the proposed PCP method is capable of generating more high-fidelity PCs than the recent PCP algorithms. In addition, the augmented PCs by our method produce higher accuracy than state-of-the-art semi-supervised clustering methods when applied to constrained clustering. To the best of our knowledge, this is the first PCP method taking the imbalance property of PCs into account.

Hybrid Image Enhancement With Progressive Laplacian Enhancing Unit

Jie Huang
Zhiwei Xiong
Xueyang Fu
Dong Liu
Zheng-Jun Zha

In this paper, we propose a novel hybrid network with Laplacian enhancing unit for image enhancement. We combine the merits of two representative enhancement methods, i.e., the scaling scheme and the generative scheme, by forming a hybrid enhancing module. Meanwhile, we model image enhancement in a progressive manner with a deep cascading CNN architecture, in which the previous feature maps are used to enhance subsequent features to get an improved performance. Specifically, we propose a Laplacian enhancing unit, which can adjustably enhance the detail information by adding the residual of previous feature maps. This unit is embedded across layers for progressively enhancing the features. We build our network on the U-Net architecture and name it Hybrid Progressive Enhancing U-Net. Experiments show that our method achieves superior image enhancement results compared with the state-of-the-arts, while retaining competitive implementation efficiency.

Zero-Shot Restoration of Back-lit Images Using Deep Internal Learning

Lin Zhang
Lijun Zhang
Xiao Liu
Ying Shen
Shaoming Zhang
Shengjie Zhao

How to restore back-lit images still remains a challenging task. State-of-the-art methods in this field are based on supervised learning and thus they are usually restricted to specific training data. In this paper, we propose a "zero-shot" scheme for back-lit image restoration, which exploits the power of deep learning, but does not rely on any prior image examples or prior training. Specifically, we train a small image-specific CNN, namely ExCNet (short for Exposure Correction Network) at test time, to estimate the "S-curve" that best fits the test back-lit image. Once the S-curve is estimated, the test image can be then restored straightforwardly. ExCNet can adapt itself to different settings per image. This makes our approach widely applicable to different shooting scenes and kinds of back-lighting conditions. Statistical studies performed on 1512 real back-lit images demonstrate that our approach can outperform the competitors by a large margin. To the best of our knowledge, our scheme is the first unsupervised CNN-based back-lit image restoration method. To make the results reproducible, the source code is available at https://cslinzhang.github.io/ExCNet/.

Kindling the Darkness: A Practical Low-light Image Enhancer

Yonghua Zhang
Jiawan Zhang
Xiaojie Guo

Images captured under low-light conditions often suffer from (partially) poor visibility. Besides unsatisfactory lighting, multiple types of degradations, such as noise and color distortion due to the limited quality of cameras, hide in the dark. In other words, solely turning up the brightness of dark regions will inevitably amplify hidden artifacts. This work builds a simple yet effective network for Kindling the Darkness (denoted as KinD), which, inspired by Retinex theory, decomposes images into two components. One component (illumination) is responsible for light adjustment, while the other (reflectance) for degradation removal. In such a way, the original space is decoupled into two smaller subspaces, expecting to be better regularized/learned. It is worth noting that our network is trained with paired images shot under different exposure conditions, instead of using any ground-truth reflectance and illumination information. Extensive experiments are conducted to demonstrate the efficacy of our design and its superiority over state-of-the-art alternatives. Our KinD is robust against severe visual defects, and user-friendly to arbitrarily adjust light levels. In addition, our model spends less than 50ms to process an image in VGA resolution on a 2080Ti GPU. All the above merits make our KinD attractive for practical use.
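Under Retinex theory an image is modelled as the pixel-wise product of reflectance and illumination. The sketch below illustrates that decomposition on a tiny grayscale example, adjusting only the illumination before recomposing. The function names, the clipped multiplicative brightening, and the pixel values are assumptions for illustration and are not taken from the KinD network.

```python
def recompose(reflectance, illumination):
    """Pixel-wise product of reflectance and illumination (Retinex image model)."""
    return [[r * l for r, l in zip(r_row, l_row)]
            for r_row, l_row in zip(reflectance, illumination)]


def brighten(illumination, ratio):
    """Scale the illumination map by a ratio, clipping to the valid range [0, 1]."""
    return [[min(1.0, l * ratio) for l in row] for row in illumination]


# A one-row, two-pixel grayscale example with values in [0, 1].
reflectance = [[0.8, 0.4]]
illumination = [[0.1, 0.25]]

dark_image = recompose(reflectance, illumination)
enhanced = recompose(reflectance, brighten(illumination, 4.0))
```

Only the illumination map is rescaled here; the reflectance is left untouched, so any degradation removal applied to it would stay separate from the light adjustment.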

TGG: Transferable Graph Generation for Zero-shot and Few-shot Learning

Chenrui Zhang
Xiaoqing Lyu
Zhi Tang

Zero-shot and few-shot learning aim to improve generalization to unseen concepts, which are promising in many realistic scenarios. Due to the lack of data in unseen domain, relation modeling between seen and unseen domains is vital for knowledge transfer in these tasks. Most existing methods capture seen-unseen relation implicitly via semantic embedding or feature generation, resulting in inadequate use of relation and some issues remain (e.g. domain shift). To tackle these challenges, we propose a Transferable Graph Generation (TGG) approach, in which the relation is modeled and utilized explicitly via graph generation. Specifically, our proposed TGG contains two main components: (1) Graph generation for relation modeling. An attention-based aggregate network and a relation kernel are proposed, which generate instance-level graph based on a class-level prototype graph and visual features. Proximity information aggregating is guided by a multi-head graph attention mechanism, where seen and unseen features synthesized by GAN are revised as node embeddings. The relation kernel further generates edges with GCN and graph kernel method, to capture instance-level topological structure while tackling data imbalance and noise. (2) Relation propagation for relation utilization. A dual relation propagation approach is proposed, where relations captured by the generated graph are separately propagated from the seen and unseen subgraphs. The two propagations learn from each other in a dual learning fashion, which performs as an adaptation way for mitigating domain shift. All components are jointly optimized with a meta-learning strategy, and our TGG acts as an end-to-end framework unifying conventional zero-shot, generalized zero-shot and few-shot learning. Extensive experiments demonstrate that it consistently surpasses existing methods of the above three fields by a significant margin.

SESSION: Doctoral Symposium

Cross-modal Neural Sign Language Translation

Amanda Cardoso Duarte

Sign Language is the primary means of communication for the majority of the Deaf and hard-of-hearing communities. Current computational approaches in this general research area have focused specifically on sign language recognition and the translation of sign language to text. However, the reverse problem of translating from spoken to sign language has so far not been widely explored. The goal of this doctoral research is to explore sign language translation in this generalized setting, i.e. translating from spoken language to sign language and vice versa. Towards that end, we propose a concrete methodology for tackling the problem of speech to sign language translation and introduce How2Sign, the first public, continuous American Sign Language dataset that enables such research. With a parallel corpus of almost 60 hours of sign language videos (collected with both RGB and depth sensor data) and the corresponding speech transcripts for over 2500 instructional videos, How2Sign is a public dataset of unprecedented scale that can be used to advance not only sign language translation, but also a wide range of sign language understanding tasks.

On-Camera Digital Watermarking and its Application for Law Enforcement and Public Safety

Michael Kerr

This research explores Law Enforcement Agency (LEA) applications for digital watermarking performed on the camera in real time, combining this with Distributed Ledger Technology (DLT) to suggest workable systems for data cataloguing and image integrity. Reference implementations of both technologies are developed and evaluated for their effectiveness and suitability for these purposes.

On Quantizing the Mental Image of Concepts for Visual Semantic Analyses

Marc A. Kastner

With the rise of multi-modal applications, the need for better understanding of the relationship between language and vision becomes prominent. While modern applications often consider both text and image, human perception is often only of secondary consideration. In my doctoral studies, I research the quantization of visual differences between concepts regarding human perception. Initially, I looked at local visual differences between concepts and their subordinate concepts, measuring the variety gap between images of, e.g. car and vehicle. In the following study, I applied data-mining on Web-crawled images to estimate psycholinguistics metrics like the imageability of words. In this way, the tendency of low- vs. high-imageability can be estimated on a dictionary-level, defining the gap between words like peace and car. Going forward, I want to create visualization demos to analyze psycholinguistic relationships in image datasets.

SESSION: Keynote IV

Inventing Narratives of the Anthropocene: Microclimate Machines and Arts & Sciences Installations

Jean-Marc Chomaz

Since the creation of the Laboratoire d'Hydrodynamique (LadHyX) of the CNRS and the École Polytechnique, I have been involved as a researcher and artist in "arts & sciences" projects in all disciplines (circus, theatre, design, contemporary art, music, etc.). My approach tries to give direct access to an imaginary using scientific language and concepts not to demonstrate, but to make sense.

The works I have created alone or jointly with other artists such as Ana Rewakowicz and Camille Duprat, Anaïs Tondeur, Aniara Rodado, the duos Evelina Domnitch and Dmitry Gelfand, and HeHe, or within the Labofactory collective founded with Laurent Karst and François-Eudes Chanfrault, are not intended to show or demonstrate scientific phenomena, to provide formal evidence or to reveal established facts. Rather, they suggest a different point of view, a destabilizing transgression, an uncomfortable comparison, a bodily experience, a metaphor for physics that would use the scientific imagination to reinvent our perception of the world and question the truth in its relativity and in all its fragility.

These shared adventures have led me to realize that my intention is closely linked to deeper meaning and commitment. The human species, which, on a geological scale, should have remained an ephemeral and marginal event, is confronted with a deadly threat directly linked to its own action and its casual use, without verbalization and questioning, of science and technology. The fascination that science exerts on everyone's mind, starting with scientists themselves, remains extremely powerful, as evidenced by media coverage of the likely observation of the Higgs Boson or the black hole in the centre of the giant Messier 87 galaxy. Science therefore does not need to be re-invested but to be reinvested by humans, in order to allow new stories to emerge in thought and speech and to constitute a modern "song of gesture", entirely devoted to sustainable actions on a global scale and to the emergence of ethical paths of thought, generally accepted.

SESSION: Session 4A: Cross-Modal Retrieval

Dual-level Embedding Alignment Network for 2D Image-Based 3D Object Retrieval

Heyu Zhou
An-An Liu
Weizhi Nie

Recent advances in 3D modeling software and 3D capture devices contribute to the availability of large-scale 3D objects. However, manually labelled large-scale 3D object datasets are still too expensive to build in practice. An intuitive idea is to transfer the knowledge from label-rich 2D images (source domain) to unlabelled 3D objects (target domain) to facilitate 3D big data management. In this paper, we propose an unsupervised dual-level embedding alignment (DLEA) network for a new task, 2D image-based 3D object retrieval. It mainly consists of two modules, visual feature learning and cross-domain feature adaptation, which are jointly optimized. The first module transforms each 3D object into a set of multi-view images and utilizes 2D CNNs to extract visual features of both multi-view image sets and the source 2D images. For multi-view fusion by reducing the distribution divergence between both domains, we propose a cross-domain view-wise attention mechanism to adaptively compute the weights of individual views and aggregate them into a compact descriptor to narrow the gap between source and target domains. With the visual representation of both domains, the module of cross-domain feature adaptation aims to enforce the domain-level and class-level embedding alignment of cross-domain feature spaces. For domain-level embedding alignment, we train a discriminator to align the global distribution statistics of both spaces. For class-level embedding alignment, we map the features in the same class but from different domains nearby through aligning the centroid of each class from both domains. To our knowledge, this is the first unsupervised work to jointly realize cross-domain feature learning and distribution alignment in an end-to-end manner for this new task. Moreover, we constructed two new datasets, MI3DOR and MI3DOR-2, to advocate the research on this topic. Extensive comparison experiments demonstrate the superiority of DLEA over state-of-the-art methods.
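The class-level alignment step, aligning the centroid of each class across domains, might be sketched as below. This is a simplified illustration: the function names and the squared-distance penalty are assumptions, and since the target domain is unlabelled, the target labels here are assumed to be pseudo-labels obtained by some other means.

```python
def class_centroids(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def centroid_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo_labels):
    """Mean squared distance between source and target centroids of shared classes."""
    src_c = class_centroids(src_feats, src_labels)
    tgt_c = class_centroids(tgt_feats, tgt_pseudo_labels)
    shared = sorted(set(src_c) & set(tgt_c))
    if not shared:
        return 0.0
    total = sum(sum((a - b) ** 2 for a, b in zip(src_c[c], tgt_c[c]))
                for c in shared)
    return total / len(shared)


# Toy 2-D features: three source samples (labelled) and two target samples (pseudo-labelled).
src_feats = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
src_labels = [0, 0, 1]
tgt_feats = [[1.0, 2.0], [1.0, 1.0]]
tgt_pseudo_labels = [0, 1]

loss = centroid_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo_labels)
```

In this toy case the class-1 centroids already coincide, so the whole penalty comes from class 0.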

TC-Net for iSBIR: Triplet Classification Network for Instance-level Sketch Based Image Retrieval

Hangyu Lin
Yanwei Fu
Peng Lu
Shaogang Gong
Xiangyang Xue
Yu-Gang Jiang

Sketch has been employed as an effective communication tool to express the abstract and intuitive meaning of objects. While content-based sketch recognition has been studied for several decades, the instance-level Sketch Based Image Retrieval (iSBIR) task has attracted significant research attention recently. In many previous iSBIR works, such as TripletSN and DSSA, edge maps were employed as intermediate representations in bridging the cross-domain discrepancy between photos and sketches. However, it is nontrivial to efficiently train and effectively use the edge maps in an iSBIR system. Particularly, we find that such an edge map based iSBIR system has several major limitations. First, the system has to be pre-trained on a significant amount of edge maps, either from large-scale sketch datasets, e.g., TU-Berlin, or converted from other large-scale image datasets, e.g., the ImageNet-1K dataset. Second, the performance of such an iSBIR system is very sensitive to the quality of edge maps. Third and empirically, the multi-cropping strategy is essentially very important in improving the performance of previous iSBIR systems. To address these limitations, this paper advocates an end-to-end iSBIR system without using the edge maps. Specifically, we present a Triplet Classification Network (TC-Net) for iSBIR which is composed of two major components: a triplet Siamese network and an auxiliary classification loss. Our TC-Net can break the limitations that exist in previous works. Extensive experiments on several datasets validate the efficacy of the proposed network and system.

Video-Based Cross-Modal Recipe Retrieval

Da Cao
Zhiwang Yu
Hanling Zhang
Jiansheng Fang
Liqiang Nie
Qi Tian

As a natural extension of image-based cross-modal recipe retrieval, retrieving a specific video given a recipe as the query is seldom explored. There are various temporal and spatial elements hidden in cooking videos. In addition, current image-based cross-modal recipe retrieval approaches mostly emphasize the understanding of textual and visual content independently. Such methods overlook the interaction between textual and visual content. In this work, we innovatively propose a new problem of video-based cross-modal recipe retrieval and thoroughly investigate this issue under the attention paradigm. In particular, we first exploit a parallel-attention network to independently learn the representations of videos and recipes. Next, a co-attention network is proposed to explicitly emphasize the cross-modal interactive features between videos and recipes. Meanwhile, a cross-modal fusion sub-network is proposed to learn both the independent and collaborative dynamics, which can enhance the associated representation of videos and recipes. Last but not least, the embedding vectors of videos and recipes stemming from the joint network are optimized with a pairwise ranking loss. Extensive experiments on a self-collected dataset have verified the effectiveness and rationality of our proposed solution.
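A pairwise ranking loss of the kind mentioned above is commonly written in a margin-based hinge form. The sketch below shows that common form for a single matching and non-matching pair; the margin value, the function name, and the similarity scores are assumptions, and the abstract does not state which variant the authors use.

```python
def pairwise_ranking_loss(pos_score, neg_score, margin=0.2):
    """Hinge-style ranking loss: zero once the matching pair outscores the
    non-matching pair by at least `margin`."""
    return max(0.0, margin - pos_score + neg_score)


# Hypothetical similarity scores between a recipe embedding and video embeddings.
well_separated = pairwise_ranking_loss(pos_score=0.9, neg_score=0.3)
too_close = pairwise_ranking_loss(pos_score=0.5, neg_score=0.45)
```

The first pair is already separated by more than the margin and incurs no loss, while the second is penalized for being too close.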

A Two-Step Cross-Modal Hashing by Exploiting Label Correlations and Preserving Similarity in Both Steps

Zhen-Duo Chen
Yongxin Wang
Hui-Qiong Li
Xin Luo
Liqiang Nie
Xin-Shun Xu

In this paper, we present a novel Two-stEp Cross-modal Hashing method, TECH for short, for cross-modal retrieval tasks. As a two-step method, it first learns hash codes based on semantic labels, while preserving the similarity in the original space and exploiting the label correlations in the label space. In the light of this, it is able to make better use of label information and generate better binary codes. In addition, different from other two-step methods that mainly focus on the hash codes learning, TECH adopts a new hash function learning strategy in the second step, which also preserves the similarity in the original space. Moreover, with the help of a well-designed objective function and optimization scheme, it is able to generate hash codes discretely and is scalable to large-scale data. To the best of our knowledge, it is the first cross-modal hashing method exploiting label correlations, and also the first two-step hashing model preserving the similarity while learning hash functions. Extensive experiments demonstrate that the proposed approach outperforms some state-of-the-art cross-modal hashing methods.

Learning Local Similarity with Spatial Relations for Object Retrieval

Zhenfang Chen
Zhanghui Kuang
Wayne Zhang
Kwan-Yee K. Wong

Many state-of-the-art object retrieval algorithms aggregate activations of convolutional neural networks into a holistic compact feature, and utilize global similarity for an efficient nearest neighbor search. However, holistic features are often insufficient for representing small objects of interest in gallery images, and global similarity drops most of the spatial relations in the images. In this paper, we propose an end-to-end local similarity learning framework to tackle these problems. By applying a correlation layer to the locally aggregated features, we compute a local similarity that can not only handle small objects, but also capture spatial relations between the query and gallery images. We further reduce the memory and storage footprints of our framework by quantizing local features. Our model can be trained using only synthetic data, and achieves competitive performance. Extensive experiments on challenging benchmarks demonstrate that our local similarity learning framework outperforms previous global similarity based methods.

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Weikuo Guo
Huaibo Huang
Xiangwei Kong
Ran He

Cross-modal retrieval has become a hot research topic in recent years for its theoretical and practical significance. This paper proposes a new technique for learning a deep visual-semantic embedding that is more effective and interpretable for cross-modal retrieval. The proposed method employs a two-stage strategy to fulfill the task. In the first stage, deep mutual information estimation is incorporated into the objective to maximize the mutual information between the input data and its embedding. In the second stage, an expelling branch is added to the network to disentangle the modality-exclusive information from the learned representations. This helps to reduce the impact of modality-exclusive information on the common subspace representation as well as improve the interpretability of the learned feature. Extensive experiments on two large-scale benchmark datasets demonstrate that our method can learn better visual-semantic embedding and achieve state-of-the-art cross-modal retrieval results.

Separated Variational Hashing Networks for Cross-Modal Retrieval

Peng Hu
Xu Wang
Liangli Zhen
Dezhong Peng

Cross-modal hashing, due to its low storage cost and high query speed, has been successfully used for similarity search in multimedia retrieval applications. It projects high-dimensional data into a shared isomorphic Hamming space with similar binary codes for semantically-similar data. In some applications, all modalities may not be obtained or trained simultaneously for reasons such as privacy, secrecy, storage limitations, and limited computational resources. However, most existing cross-modal hashing methods need all modalities to jointly learn the common Hamming space, which prevents them from handling these cases. In this paper, we propose a novel approach called Separated Variational Hashing Networks (SVHNs) to overcome the above challenge. Firstly, it adopts a label network (LabNet) to exploit available and nonspecific label annotations to learn a latent common Hamming space by projecting each semantic label into a common binary representation. Then, each modality-specific network can separately map the samples of the corresponding modality into their binary semantic codes learned by LabNet. We achieve this by conducting variational inference to match the aggregated posterior of the hashing code of LabNet with an arbitrary prior distribution. The effectiveness and efficiency of our SVHNs are verified by extensive experiments carried out on four widely-used multimedia databases, in comparison with 11 state-of-the-art approaches.
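
To illustrate why a shared Hamming space gives low storage cost and fast queries, the sketch below ranks items of any modality by the Hamming distance between binary codes. It is a generic illustration only: the codes are stored as Python integers, and the item names and 4-bit codes in the usage note are made up.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query: int, database: dict) -> list:
    """Return database keys sorted by Hamming distance to the query code."""
    return sorted(database, key=lambda key: hamming(query, database[key]))
```

For example, with the hypothetical database `{"img_a": 0b1010, "txt_b": 0b1011, "img_c": 0b0101}` and the query code `0b1010`, the items are ranked `img_a`, `txt_b`, `img_c`, since their distances to the query are 0, 1 and 4.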

Semi-supervised Deep Quantization for Cross-modal Search

Xin Wang
Wenwu Zhu
Chenghao Liu

The problem of cross-modal similarity search, which aims at making efficient and accurate queries across multiple domains, has become a significant and important research topic. Composite quantization, a compact coding solution superior to hashing techniques, has shown its effectiveness for similarity search. However, most existing works utilizing composite quantization to search multi-domain content only consider either pairwise similarity information or class label information across different domains, which fails to tackle the semi-supervised problem in composite quantization. In this paper, we address the semi-supervised quantization problem by considering: (i) pairwise similarity information (without class label information) across different domains, which captures the intra-document relation, (ii) cross-domain data with class label which can help capture inter-document relation, and (iii) cross-domain data with neither pairwise similarity nor class label which enables the full use of abundant unlabelled information. To the best of our knowledge, we are the first to consider both supervised information (pairwise similarity + class label) and unsupervised information (neither pairwise similarity nor class label) simultaneously in composite quantization. A challenging problem arises: how can we jointly handle these three sorts of information across multiple domains in an efficient way? To tackle this challenge, we propose a novel semi-supervised deep quantization (SSDQ) model that takes both supervised and unsupervised information into account. The proposed SSDQ model is capable of incorporating the above three kinds of information into one single framework when utilizing composite quantization for accurate and efficient queries across different domains. 
More specifically, we employ a modified deep autoencoder for better latent representation and formulate pairwise similarity loss, supervised quantization loss as well as unsupervised distribution match loss to handle all three types of information. The extensive experiments demonstrate the significant improvement of SSDQ over several state-of-the-art methods on various datasets.

A New Benchmark and Approach for Fine-grained Cross-media Retrieval

Xiangteng He
Yuxin Peng
Liu Xie

Cross-media retrieval aims to return results of various media types corresponding to a query of any media type. Existing research generally focuses on coarse-grained cross-media retrieval. When users submit an image of "Slaty-backed Gull" as a query, coarse-grained cross-media retrieval treats it as "Bird", so that users can only get the results of "Bird", which may include other bird species with similar appearance (image and video), descriptions (text) or sounds (audio), such as "Herring Gull". Such coarse-grained cross-media retrieval is not consistent with human lifestyle, where we generally have the fine-grained requirement of returning the exactly relevant results of "Slaty-backed Gull" instead of "Herring Gull". However, little research focuses on fine-grained cross-media retrieval, which is a highly challenging and practical task. Therefore, in this paper, we first construct a new benchmark for fine-grained cross-media retrieval, which consists of 200 fine-grained subcategories of "Bird", and contains 4 media types, including image, text, video and audio. To the best of our knowledge, it is the first benchmark with 4 media types for fine-grained cross-media retrieval. Then, we propose a uniform deep model, namely FGCrossNet, which simultaneously learns 4 types of media without discriminative treatments. We jointly consider three constraints for better common representation learning: the classification constraint ensures the learning of discriminative features for fine-grained subcategories, the center constraint ensures the compactness of the features of the same subcategory, and the ranking constraint ensures the sparsity of the features of different subcategories. Extensive experiments verify the usefulness of the new benchmark and the effectiveness of our FGCrossNet. The new benchmark and the source code of FGCrossNet will be made available at https://github.com/PKU-ICST-MIPL/FGCrossNet_ACMMM2019.

Cross-Modal Image-Text Retrieval with Semantic Consistency

Hui Chen
Guiguang Ding
Zijin Lin
Sicheng Zhao
Jungong Han

Cross-modal image-text retrieval has been a long-standing challenge in the multimedia community. Existing methods explore various complicated embedding spaces to assess the semantic similarity between a given image-text pair, but give little or no consideration to the consistency across them. To remedy this situation, we introduce the idea of semantic consistency for learning various embedding spaces jointly. Specifically, similar to previous works, we start by constructing two different embedding spaces, namely the image-grounded embedding space and the text-grounded embedding space. However, instead of learning these two embedding spaces separately, we incorporate a semantic consistency constraint in the common ranking objective function such that both embedding spaces can be learned simultaneously and benefit from each other to gain performance improvement. We conduct extensive experiments on three benchmark datasets, i.e., Flickr8k, Flickr30k and MS COCO. Results show that our model outperforms the state-of-the-art models on all three datasets, which demonstrates the effectiveness and superiority of introducing semantic consistency. Our source code is released at: https://github.com/HuiChen24/SemanticConsistency.

Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment

Po-Yao Huang
Guoliang Kang
Wenhe Liu
Xiaojun Chang
Alexander G. Hauptmann

Visual-semantic embeddings are central to many multimedia applications such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Since massive human annotation is expensive to obtain, there is a strong motivation to develop versatile algorithms that learn from large corpora with fewer annotations. In this paper, we propose a novel framework to leverage automatically extracted regional semantics from un-annotated images as additional weak supervision to learn visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between annotated and un-annotated portions of visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin for cross-modal retrieval tasks on the sparse Flickr30k and MS-COCO datasets. It is also worth noting that, despite using only 20% of the annotations, the proposed model can achieve competitive performance (Recall at 10 > 80.0% for 1K and > 70.0% for 5K text-to-image retrieval) compared to the benchmarks trained with the complete annotations.
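
The "Recall at 10" figures above refer to the share of queries whose correct item appears among the top 10 retrieved results. A minimal sketch of that metric follows, assuming exactly one correct item per query (as in text-to-image retrieval, where each caption has one matching image); the function and argument names are illustrative only.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose single correct item is in the top-k results.

    ranked_lists[i] is the ranked list of item ids returned for query i,
    ground_truth[i] is the id of the one correct item for query i.
    """
    hits = sum(1 for ranked, gt in zip(ranked_lists, ground_truth) if gt in ranked[:k])
    return hits / len(ground_truth)
```

Under this definition, Recall at 10 > 80.0% on a 1K test set means that for more than 80% of the text queries the matching image is ranked within the first ten of the 1,000 candidates.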

Towards Optimal CNN Descriptors for Large-Scale Image Retrieval

Yinzheng Gu
Chuanpeng Li
Yu-Gang Jiang

Instance-level image retrieval is a long-standing and challenging problem in multimedia. Recently, fine-tuning Convolutional Neural Networks (CNNs) has become a promising direction, and a number of successful strategies based on global CNN descriptors have been proposed. However, it is difficult to make direct comparisons and draw conclusions due to different settings and/or datasets. The goal of this paper is two-fold. Firstly, we present a unified implementation of modern global-CNN-based retrieval systems, break such a system into six major components, and investigate each part individually as well as globally when considering different configurations. We conduct a systematic series of experiments on a component-by-component basis and find an optimal solution in designing such a system. Secondly, we introduce a novel joint loss function with a learnable parameter for fine-tuning for retrieval tasks and show, with extensive experiments, significant improvement over previous works. On the new and challenging large-scale Google-Landmarks-Dataset, we set a baseline for future research and comparisons, while on traditional retrieval benchmarks such as Oxford5k and Paris6k, as well as their recent revised versions ROxford5k and RParis6k, we achieve state-of-the-art performance under all three (Easy, Medium, and Hard) evaluation protocols by a large margin compared to competing methods.

A Framework for Effective Known-item Search in Video

Jakub Lokoč
Gregor Kovalčik
Tomáš Souček
Jaroslav Moravec
Přemysl Čech

Searching for one particular scene in a large video collection (known-item search) represents a challenging task for video retrieval systems. According to recent results from evaluation campaigns, even respected approaches based on machine learning do not help to solve the task easily in many cases. Hence, in addition to effective automatic multimedia annotation and embedding, interactive search is recommended as well. This paper presents a comprehensive description of an interactive video retrieval framework, VIRET, that successfully participated in several recent evaluation campaigns. The video analysis, feature extraction and retrieval models used are detailed, along with several experiments evaluating the effectiveness of selected system components. The results of the prototype at the Video Browser Showdown 2019 are highlighted in connection with an analysis of collected query logs. We conclude that the framework comprises a set of effective and efficient models for most of the evaluated known-item search tasks in 1000 hours of video and could serve as a baseline reference approach. The analysis also reveals that the result presentation interface needs improvements for better performance of future VIRET prototypes.

W2VV++: Fully Deep Learning for Ad-hoc Video Search

Xirong Li
Chaoxi Xu
Gang Yang
Zhineng Chen
Jianfeng Dong

Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose a fully deep learning method for query representation learning. The proposed method requires no explicit concept modeling, matching and selection. The backbone of our method is the proposed W2VV++ model, a super version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. With these simple yet important changes, W2VV++ brings in a substantial improvement. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state-of-the-art. The performance can be further boosted by model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
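
Late average fusion, as mentioned above for the model ensemble, is generally understood as averaging the relevance scores that several independently trained models assign to the same candidates. The sketch below shows that general scheme under the assumption that all models score the same candidates in the same order; the function name and example scores are hypothetical and do not come from the paper.

```python
def late_average_fusion(score_lists):
    """Average per-candidate scores across models.

    score_lists[m][i] is the score model m assigns to candidate i;
    every inner list must cover the same candidates in the same order.
    """
    n_models = len(score_lists)
    return [sum(scores) / n_models for scores in zip(*score_lists)]
```

Candidates are then re-ranked by the fused score, so a video that is ranked consistently well by all models can overtake one that only a single model favors.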

SESSION: Session 4B: Visual Analysis & Applications

Gradual Network for Single Image De-raining

Weijiang Yu
Zhe Huang
Wayne Zhang
Litong Feng
Nong Xiao

Most advances in single image de-raining face a key challenge: removing rain streaks with different scales and shapes while preserving image details. Existing single image de-raining approaches treat rain-streak removal directly as a process of pixel-wise regression. However, they fail to balance over-de-raining (e.g. removing texture details in rain-free regions) and under-de-raining (e.g. leaving rain streaks). In this paper, we propose a coarse-to-fine network called Gradual Network (GraNet), consisting of a coarse stage and a fine stage, for single image de-raining at different granularities. Specifically, to reveal coarse-grained rain-streak characteristics (e.g. long and thick rain streaks/raindrops), we propose a coarse stage that utilizes local-global spatial dependencies via a local-global sub-network composed of region-aware blocks. Taking the residual result (the coarse de-rained result) between the rainy image sample (i.e. the input data) and the output of the coarse stage (i.e. the learnt rain mask) as input, the fine stage continues to de-rain by removing the fine-grained rain streaks (e.g. light rain streaks and water mist) to get a rain-free and well-reconstructed output image via a unified contextual merging sub-network with dense blocks and a merging block. Solid and comprehensive experiments on synthetic and real data demonstrate that our GraNet can significantly outperform the state-of-the-art methods by removing rain streaks with various densities, scales and shapes while keeping the image details of rain-free regions well-preserved.

AnoPCN: Video Anomaly Detection via Deep Predictive Coding Network

Muchao Ye
Xiaojiang Peng
Weihao Gan
Wei Wu
Yu Qiao

Video anomaly detection is a challenging problem due to the ambiguity and complexity of how anomalies are defined. Recent approaches for this task mainly utilize deep reconstruction methods and deep prediction ones, but their performances suffer when they cannot guarantee either higher reconstruction errors for abnormal events or lower prediction errors for normal events. Inspired by the predictive coding mechanism explaining how brains detect events violating regularities, we address the Anomaly detection problem with a novel deep Predictive Coding Network, termed as AnoPCN, which consists of a Predictive Coding Module (PCM) and an Error Refinement Module (ERM). Specifically, PCM is designed as a convolutional recurrent neural network with feedback connections carrying frame predictions and feedforward connections carrying prediction errors. By using motion information explicitly, PCM yields better prediction results. To further solve the problem of narrow regularity score gaps in deep reconstruction methods, we decompose reconstruction into prediction and refinement, introducing ERM to reconstruct current prediction error and refine the coarse prediction. AnoPCN unifies reconstruction and prediction methods in an end-to-end framework, and it achieves state-of-the-art performance with better prediction results and larger regularity score gaps on three benchmark datasets including ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2.

Single Image Deraining via Recurrent Hierarchy Enhancement Network

Youzhao Yang
Hong Lu

Single image deraining is an important problem in many computer vision tasks since rain streaks can severely hamper and degrade the visibility of images. In this paper, we propose a novel network named Recurrent Hierarchy Enhancement Network (ReHEN) to remove rain streaks from rainy images stage by stage. Unlike previous deep convolutional network methods, we adopt a Hierarchy Enhancement Unit (HEU) to fully extract local hierarchical features and generate effective features. Then a Recurrent Enhancement Unit (REU) is added to keep the useful information from HEU and benefit the rain removal in the later stages. To focus on different scales, shapes, and densities of rain streaks adaptively, a Squeeze-and-Excitation (SE) block is applied in both HEU and REU to assign different scale factors to high-level features. Experiments on five synthetic datasets and a real-world rainy image set show that the proposed method outperforms the state-of-the-art methods considerably. The source code is available at https://github.com/nnUyi/ReHEN.

DADNet: Dilated-Attention-Deformable ConvNet for Crowd Counting

Dan Guo
Kun Li
Zheng-Jun Zha
Meng Wang

Most existing CNN-based methods for crowd counting suffer from large scale variation in objects of interest, leading to density maps of low quality. In this paper, we propose a novel deep model called Dilated-Attention-Deformable ConvNet (DADNet), which consists of two schemes: multi-scale dilated attention and deformable convolutional DME (Density Map Estimation). The proposed model explores a scale-aware attention fusion with various dilation rates to capture different visual granularities of crowd regions of interest, and utilizes deformable convolutions to generate a high-quality density map. It has two merits: (1) varying dilation rates can effectively identify discriminative regions by enlarging the receptive fields of convolutional kernels upon surrounding region cues, and (2) deformable CNN operations promote the accuracy of object localization in the density map by augmenting the spatial object location sampling with adaptive offsets and scalars. DADNet not only excels at capturing rich spatial context of salient and tiny regions of interest simultaneously, but also remains robust to background noise, such as partially occluded objects. Extensive experiments on benchmark datasets verify that DADNet achieves state-of-the-art performance. Visualization results of the multi-scale attention maps further validate the remarkable interpretability achieved by our solution.
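
The way dilation enlarges a kernel's receptive field can be made concrete with the standard formula for the effective size of a dilated kernel, k + (k - 1)(d - 1). The helper below is only an illustration of that formula; the dilation rates in the usage note are examples, not the rates used by DADNet.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated convolution kernel.

    A dilation of 1 is an ordinary convolution; each extra unit of
    dilation inserts one gap between neighbouring kernel taps.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)
```

For a 3x3 kernel, dilation rates of 1, 2 and 3 cover 3x3, 5x5 and 7x7 regions respectively, with the same nine weights in each case.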

DTDN: Dual-task De-raining Network

Zheng Wang
Jianwu Li
Ge Song

Removing rain streaks from rainy images is necessary for many tasks in computer vision, such as object detection and recognition. It needs to address two mutually exclusive objectives: removing rain streaks and preserving realistic details. Balancing them is critical for de-raining methods. We propose an end-to-end network, called dual-task de-raining network (DTDN), consisting of two sub-networks: a generative adversarial network (GAN) and a convolutional neural network (CNN), to remove rain streaks by coordinating the two mutually exclusive objectives self-adaptively. DTDN-GAN is mainly used to remove structural rain streaks, and DTDN-CNN is designed to recover details in original images. We also design a training algorithm to train these two sub-networks of DTDN alternately, which share the same weights but use different training sets. We further enrich two existing datasets to approximate the distribution of real rain streaks. Experimental results show that our method outperforms several recent state-of-the-art methods, based on both benchmark testing datasets and real rainy images.

IntersectGAN: Learning Domain Intersection for Generating Images with Multiple Attributes

Zehui Yao
Boyan Zhang
Zhiyong Wang
Wanli Ouyang
Dong Xu
Dagan Feng

Generative adversarial networks (GANs) have demonstrated great success in generating various visual content. However, images generated by existing GANs often have attributes (e.g., smiling expression) learned from one image domain. As a result, generating images with multiple attributes requires many real samples possessing multiple attributes, which are very expensive to collect. In this paper, we propose a novel GAN, namely IntersectGAN, to learn multiple attributes from different image domains through an intersecting architecture. For example, given two image domains $X_1$ and $X_2$ with certain attributes, the intersection $X_1 \cap X_2$ denotes a new domain where images possess the attributes from both $X_1$ and $X_2$ domains. The proposed IntersectGAN consists of two discriminators $D_1$ and $D_2$ to distinguish between generated and real samples of different domains, and three generators where the intersection generator is trained against both discriminators. An overall adversarial loss function is defined over the three generators. As a result, our proposed IntersectGAN can be trained on multiple domains, each of which presents one specific attribute, and eventually eliminates the need for real sample images simultaneously possessing multiple attributes. By using the CelebFaces Attributes dataset, our proposed IntersectGAN is able to produce high quality face images possessing multiple attributes (e.g., a face with black hair and a smiling expression). Both qualitative and quantitative evaluations are conducted to compare our proposed IntersectGAN with other baseline methods. Besides, several different applications of IntersectGAN have been explored with promising results.

Weakly Supervised Fine-grained Image Classification via Correlation-guided Discriminative Learning

Zhihui Wang
Shijie Wang
Pengbo Zhang
Haojie Li
Wei Zhong
Jianjun Li

Weakly supervised fine-grained image classification (WFGIC) aims at learning to recognize hundreds of subcategories in each basic-level category with only image level labels available. It is extremely challenging, and existing methods mainly focus on localizing discriminative semantic parts or regions, as the key differences among subcategories are subtle and local. However, they localize these regions independently while neglecting the fact that regions are mutually correlated and region groups can be more discriminative. Meanwhile, most current work tends to derive features directly from the output of the CNN and rarely considers the correlation within the feature vector. To address these issues, we propose an end-to-end Correlation-guided Discriminative Learning (CDL) model to fully mine and exploit the discriminative potential of correlations for WFGIC globally and locally. From the global perspective, a discriminative region grouping (DRG) sub-network is proposed, which first establishes correlations between regions and then enhances each region by a weighted aggregation of all the correlations from other regions to it. In this way, each region's representation encodes the global image-level context and is thus more robust; meanwhile, through learning the correlation between discriminative regions, the network is guided to implicitly discover the discriminative region groups which are more powerful for WFGIC. From the local perspective, a discriminative feature strengthening (DFS) sub-network is proposed to mine and learn the internal spatial correlation among elements of each patch's feature vector, to improve its discriminative power locally by jointly emphasizing informative elements while suppressing useless ones. Extensive experiments demonstrate the effectiveness of the proposed DRG and DFS sub-networks, and show that the CDL model achieves state-of-the-art performance in both accuracy and efficiency.

Single-shot Semantic Image Inpainting with Densely Connected Generative Networks

Ling Shen
Richang Hong
Haoran Zhang
Hanwang Zhang
Meng Wang

Semantic image inpainting, the task of inferring and filling in large missing areas of a natural image, has shown exciting progress with the introduction of generative adversarial networks (GANs). However, due to insufficient understanding of semantic and spatial context, existing methods easily generate blurred boundaries and distorted structures, which are inconsistent with the surrounding area. In this paper, we propose a new end-to-end framework named Single-shot Densely Connected Generative Network (SSDCGN), which generates visually realistic and semantically distinct pixels for the missing content by a battery of symmetric encoder-decoder groups. To maximize semantic extraction and realize precise spatial context localization, we employ deeper dense skip connections in our network. Extensive experiments on Paris StreetView and ImageNet datasets show the superiority of our method.

GAIN: Gradient Augmented Inpainting Network for Irregular Holes

Jianfu Zhang
Li Niu
Dexin Yang
Liwei Kang
Yaoyi Li
Weijie Zhao
Liqing Zhang

Image inpainting, which aims to fill the missing holes of images, is a challenging task because the holes may contain complicated structures or different possible layouts. Deep learning methods have shown promising performance in image inpainting but still suffer from generating poorly structured artifacts when the holes are large and irregular. Some existing methods use edge inpainting to help image inpainting, with a binary edge map obtained from the image gradient. However, by only using the binary edge map, these methods discard the rich information in the image gradient and thus leave some critical issues (e.g., color discrepancy) unattended. In this paper, we propose Gradient Augmented Inpainting Network (GAIN), which uses image gradient information instead of edge information to facilitate image inpainting. Specifically, we formulate a multi-task learning framework which performs image inpainting and gradient inpainting simultaneously. A novel GAI-Block is designed to encourage the information fusion between the image feature map and the gradient feature map. Moreover, gradient information is also used to determine the filling priority, which can guide the network to construct more plausible semantic structures for the holes. Experimental results on public datasets CelebA-HQ and Places2 show that our proposed method outperforms state-of-the-art methods quantitatively and qualitatively.

Deep Spatial Pyramid Features Collaborative Reconstruction for Partial Person ReID

Zan Gao
Li-Shuai Gao
Hua Zhang
Zhiyong Cheng
Richang Hong

Partial person re-identification (ReID) is a hot research problem in computer vision. Accurate partial ReID is very challenging due to the common occlusion problem. To address this problem, in this paper, we propose a novel Deep spatial pyramid feature Collaborative Reconstruction approach (DCR) for partial person ReID, which can effectively and efficiently tackle occlusions of arbitrary sizes. Specifically, a fully convolutional network (FCN) is first leveraged to extract feature maps of an arbitrary-size image, and then spatial pyramid pooling (SPP) is adopted to obtain spatial pyramid features. Thereafter, our DCR method is designed to efficiently solve the matching problem between the partial person and the holistic person in the partial person ReID task where the occlusion problem often occurs. Experiments on two partial person ReID datasets demonstrate the efficiency and efficacy of the proposed method in comparison with several state-of-the-art partial person ReID approaches. Our method outperforms all the competitors by a large margin and achieves improvements of 9.07% and 5.95% over the DSR method on the Partial REID and Partial-iLIDS Person ReID datasets in terms of Rank-1 accuracy, respectively.
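
Spatial pyramid pooling is what allows a feature map of arbitrary size to be turned into a fixed-length descriptor. The sketch below shows the general technique on a single-channel 2-D map using max pooling; the pyramid levels (1 and 2) and the function name are assumptions for illustration and are not the configuration used in DCR.

```python
import math

def spatial_pyramid_pool(fmap, levels=(1, 2)):
    """Max-pool a 2-D feature map into a fixed-length vector.

    For each level n the map is split into an n x n grid and the maximum of
    each cell is kept, so the output length is sum(n * n for n in levels)
    regardless of the input height and width.
    """
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of grid cell i (cells may overlap by one row).
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return pooled
```

With levels (1, 2), a 2x3 map and a 4x4 map both yield a vector of length 5, which is what makes partial and holistic person images of different sizes directly comparable.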

DoT-GNN: Domain-Transferred Graph Neural Network for Group Re-identification

Ziling Huang
Zheng Wang
Wei Hu
Chia-Wen Lin
Shin'ichi Satoh

Most person re-identification (ReID) approaches focus on retrieving a person-of-interest from a database of collected individual images. In addition to the individual ReID task, matching a group of persons across different camera views also plays an important role in surveillance applications. This kind of Group Re-identification (G-ReID) task is very challenging since we face obstacles not only from the appearance changes of individuals, but also from the group layout and membership changes. In order to obtain a robust representation for the group image, we design a Domain-Transferred Graph Neural Network (DoT-GNN) method. Its merits lie in three aspects: 1) Transferred Style. Due to the lack of training samples, we transfer the labeled ReID dataset to the G-ReID dataset style, and feed the transferred samples to the deep learning model. Leveraging the strengths of deep learning models, we obtain a discriminative individual feature model. 2) Graph Generation. We treat a group as a graph, where each node denotes an individual feature and each edge represents the relation between a pair of individuals. We propose a graph generation strategy to create sufficient graph samples. 3) Graph Neural Network. Employing the generated graph samples, we train the GNN so as to acquire graph features which are robust to large graph variations. The key to the success of DoT-GNN is that the transferred graph addresses the challenge of the appearance change, while the graph representation in GNN overcomes the challenge of the layout and membership change. Extensive experimental results demonstrate the effectiveness of our approach, outperforming the state-of-the-art method by 1.8% CMC-1 on Road Group dataset and 6.0% CMC-1 on DukeMCMT dataset respectively.

Improving the Learning of Multi-column Convolutional Neural Network for Crowd Counting

Zhi-Qi Cheng
Jun-Xiu Li
Qi Dai
Xiao Wu
Jun-Yan He
Alexander G. Hauptmann

Tremendous variation in the scale of people/head size is a critical problem for crowd counting. To improve the scale invariance of feature representation, recent works extensively employ Convolutional Neural Networks with multi-column structures to handle different scales and resolutions. However, due to the substantial redundant parameters in columns, existing multi-column networks invariably exhibit almost the same scale features in different columns, which severely affects counting accuracy and leads to overfitting. In this paper, we attack this problem by proposing a novel Multi-column Mutual Learning (McML) strategy. It has two main innovations: 1) A statistical network is incorporated into the multi-column framework to estimate the mutual information between columns, which can approximately indicate the scale correlation between features from different columns. By minimizing the mutual information, each column is guided to learn features at different image scales. 2) We devise a mutual learning scheme that alternately optimizes each column while keeping the other columns fixed on each mini-batch of training data. With such an asynchronous parameter update process, each column is inclined to learn a feature representation different from the others, which can efficiently reduce parameter redundancy and improve generalization ability. More remarkably, McML can be applied to all existing multi-column networks and is end-to-end trainable. Extensive experiments on four challenging benchmarks show that McML can significantly improve the original multi-column networks and outperform the other state-of-the-art approaches.

Crowd Counting via Multi-layer Regression

Xin Tan
Chun Tao
Tongwei Ren
Jinhui Tang
Gangshan Wu

Crowd counting aims to estimate the number of persons in a crowd image. It remains a challenge to this day: as the congestion degree varies, people's appearances may differ. To address this problem, we propose a novel crowd counting method named Multi-layer Regression Network (MRNet), which consists of a multi-layer recognition branch and several density regressors. In practice, the recognition branch recognizes the congestion degree of the regions in a crowd image, then decomposes the image into background and several crowd regions layer by layer, with each region assigned a different congestion degree. In each layer, the recognized crowd regions with the specific congestion degree are delivered to a regressor with the corresponding density prior for crowd density estimation. The generated density maps at all layers are integrated to obtain the final density map for crowd density estimation. To date, MRNet is the first method to estimate crowd densities on crowd regions with different regressors. We conduct a comprehensive evaluation of MRNet on four typical datasets in comparison with nine state-of-the-art methods. By using multi-layer regression, MRNet achieves significant improvement in crowd counting accuracy and outperforms the state-of-the-art methods.
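A minimal sketch of the last step described above, assuming the per-layer density maps cover disjoint regions and are integrated by pixel-wise summation (the abstract does not state how the integration is done, so this is an assumption, and both function names are hypothetical). The crowd count is then the sum of the final density map, which is the usual convention for density-based counting.

```python
def integrate_density_maps(layer_maps):
    """Pixel-wise sum of equally sized 2-D density maps, one per layer (assumed scheme)."""
    height, width = len(layer_maps[0]), len(layer_maps[0][0])
    return [
        [sum(layer[r][c] for layer in layer_maps) for c in range(width)]
        for r in range(height)
    ]

def crowd_count(density_map):
    """Estimated number of people: the integral (sum) of the density map."""
    return sum(sum(row) for row in density_map)

# Toy example: a sparse-region layer and a dense-region layer over a 2 x 2 image.
sparse_layer = [[0.5, 0.0], [0.0, 0.0]]
dense_layer = [[0.0, 1.5], [0.0, 1.0]]
final_map = integrate_density_maps([sparse_layer, dense_layer])
estimated_count = crowd_count(final_map)
```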

Gesture-to-Gesture Translation in the Wild via Category-Independent Conditional Maps

Yahui Liu
Marco De Nadai
Gloria Zen
Nicu Sebe
Bruno Lepri

Recent works have shown Generative Adversarial Networks (GANs) to be particularly effective in image-to-image translations. However, in tasks such as body pose and hand gesture translation, existing methods usually require precise annotations, e.g. key-points or skeletons, which are time-consuming to draw. In this work, we propose a novel GAN architecture that decouples the required annotations into a category label, which specifies the gesture type, and a simple-to-draw category-independent conditional map, which expresses the location, rotation and size of the hand gesture. Our architecture synthesizes the target gesture while preserving the background context, thus effectively dealing with gesture translation in the wild. To this end, we use an attention module and a rolling guidance approach, which loops the generated images back into the network and produces higher quality images compared to competing works. Thus, our GAN learns to generate new images from simple annotations without requiring key-points or skeleton labels. Results on two public datasets show that our method outperforms state-of-the-art approaches both quantitatively and qualitatively. To the best of our knowledge, no work so far has addressed gesture-to-gesture translation in the wild while requiring only user-friendly annotations.

SESSION: Session 4C: Social Computing & Image Processing

Seeking Micro-influencers for Brand Promotion

Tian Gan
Shaokun Wang
Meng Liu
Xuemeng Song
Yiyang Yao
Liqiang Nie

What made you want to wear the clothes you are wearing? Where do you want to go for your next holiday? Why do you like the music you frequently listen to? If you are like most people, you probably made these decisions as a result of watching influencers on social media. Furthermore, influencer marketing is an opportunity for brands to take advantage of social media using a well-defined and well-designed social media marketing strategy. However, choosing the right influencers is not an easy task. With more people gaining an increasing number of followers on social media, finding the right influencer for an e-commerce company becomes paramount. In fact, most marketers cite it as a top challenge for their brands. To address these issues, we propose a data-driven micro-influencer ranking scheme to solve the essential question of finding the right micro-influencer. Specifically, we represent brands and influencers by fusing the visual and textual information of their historical posts. A novel k-buckets sampling strategy with a modified listwise learning-to-rank model is proposed to learn a brand-micro-influencer scoring function. In addition, we develop a new Instagram brand micro-influencer dataset, consisting of 360 brands and 3,748 micro-influencers, which can benefit future researchers in this area. Extensive evaluations demonstrate the advantage of our proposed method compared with the state-of-the-art methods.

Multi-modal Knowledge-aware Event Memory Network for Social Media Rumor Detection

Huaiwen Zhang
Quan Fang
Shengsheng Qian
Changsheng Xu

The wide dissemination and misleading effects of online rumors on social media have become a critical issue concerning the public and government. Detecting and regulating social media rumors is important for ensuring users receive truthful information and maintaining social harmony. Most existing rumor detection methods focus on inferring clues from media content and social context, and largely ignore the rich knowledge behind the highly condensed text, which is useful for rumor verification. Furthermore, existing rumor detection models underperform on unseen events because they tend to capture many event-specific features in seen data that cannot be transferred to newly emerged events. To address these issues, we propose a novel Multi-modal Knowledge-aware Event Memory Network (MKEMN), which utilizes the Multi-modal Knowledge-aware Network (MKN) and Event Memory Network (EMN) as building blocks for social media rumor detection. Specifically, the MKN learns the multi-modal representation of a social media post, retrieves external knowledge from a real-world knowledge graph to complement the semantic representation of the short text of the post, and takes conceptual knowledge as additional evidence to improve rumor detection. The EMN extracts event-invariant features of events and stores them in a global memory. Given an event representation, the EMN takes it as a query to the memory network and outputs the corresponding features shared among events. With the additional information provided by the EMN, our model can learn robust representations of events and consistently perform well on newly emerged events. Extensive experiments on two Twitter benchmark datasets demonstrate that our rumor detection method achieves much better results than state-of-the-art methods.

MOC: Measuring the Originality of Courseware in Online Education Systems

Jiawei Wang
Jiansheng Fang
Jiao Xu
Shifeng Huang
Da Cao
Ming Yang

In online education systems, courseware plays a pivotal role in helping educators present and impart knowledge to students. The originality of courseware heavily impacts the choice of educators, because the teaching content evolves and so does courseware. However, measuring the originality of courseware is a challenging task, due to the lack of labels and the difficulty of quantification. To this end, we contribute a similarity ranking-based unsupervised approach to measure the originality of courseware. In particular, we first exploit a pre-trained deep visual-text embedding to obtain the representations of images and texts in a local manner. Next, inspired by the design of capsule neural networks, a vector-based pooling network is proposed to learn multimodal representations of images and texts. Finally, we propose a Discriminator to optimize the model by maximizing the mutual information between local features and global features in an unsupervised manner. To evaluate the performance of our proposed model, we further collect a dataset for evaluating the originality of courseware by treating sequential versions of each courseware as ranking lists. The learning-to-rank scheme can therefore be utilized to evaluate the similarity-based ranking performance. Extensive experimental results demonstrate the superiority of our proposed framework compared to other state-of-the-art competitors.

Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events

Wim Boes
Hugo Van hamme

We tackle the task of environmental event classification by drawing inspiration from the transformer neural network architecture used in machine translation. We modify this attention-based feedforward structure in such a way that allows the resulting model to use audio as well as video to compute sound event predictions. We perform extensive experiments with these adapted transformers on an audiovisual data set, obtained by appending relevant visual information to an existing large-scale weakly labeled audio collection. The employed multi-label data contains clip-level annotation indicating the presence or absence of 17 classes of environmental sounds, and does not include temporal information. We show that the proposed modified transformers strongly improve upon previously introduced models and in fact achieve state-of-the-art results. We also make a compelling case for devoting more attention to research in multimodal audiovisual classification by proving the usefulness of visual information for the task at hand, namely audio event recognition. In addition, we visualize internal attention patterns of the audiovisual transformers and in doing so demonstrate their potential for performing multimodal synchronization.

User-Aware Folk Popularity Rank: User-Popularity-Based Tag Recommendation That Can Enhance Social Popularity

Xueting Wang
Yiwei Zhang
Toshihiko Yamasaki

In this paper we propose a method that can enhance the social popularity of a post (i.e., the number of views or likes) by recommending appropriate hash tags considering both content popularity and user popularity. A previous approach called FolkPopularityRank (FP-Rank) considered only the relationship among images, tags, and their popularity. However, the popularity of an image/video is strongly affected by who uploaded it. Therefore, we develop an algorithm that can incorporate user popularity and users' tag usage tendency into the FP-Rank algorithm. The experimental results using 60,000 training images with their accompanying tags and 1,000 test data, which were actually uploaded to a real social network service (SNS), show that, in ten days, our proposed algorithm can achieve 1.2 times more views than the FP-Rank algorithm. This technology would be critical to individual users and companies/brands who want to promote themselves in SNSs.

Intrinsic Image Popularity Assessment

Keyan Ding
Kede Ma
Shiqi Wang

The goal of research in automatic image popularity assessment (IPA) is to develop computational models that can accurately predict the potential of a social image to go viral on the Internet. Here, we aim to single out the contribution of visual content to image popularity, i.e., intrinsic image popularity. Specifically, we first describe a probabilistic method to generate massive popularity-discriminable image pairs, based on which the first large-scale image database for intrinsic IPA (I²PA) is established. We then develop computational models for I²PA based on deep neural networks, optimizing for ranking consistency with millions of popularity-discriminable image pairs. Experiments on Instagram and other social platforms demonstrate that the optimized model performs favorably against existing methods, exhibits reasonable generalizability on different databases, and even surpasses human-level performance on Instagram. In addition, we conduct a psychophysical experiment to analyze various aspects of human behavior in I²PA.
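Optimizing for ranking consistency on image pairs can be illustrated with a generic pairwise logistic ranking loss. This is a sketch of one common choice, not necessarily the loss used by the authors; the function name and argument names are hypothetical. The loss is small when the model scores the more popular image of a pair higher, and large when the order is reversed.

```python
import math

def pairwise_ranking_loss(score_more_popular, score_less_popular):
    """Logistic pairwise loss log(1 + exp(-(s_more - s_less))), computed stably."""
    margin = score_more_popular - score_less_popular
    # max(-margin, 0) + log1p(exp(-|margin|)) equals log(1 + exp(-margin))
    # without overflowing for large negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

correct_order = pairwise_ranking_loss(2.0, 0.0)   # pair ranked consistently
reversed_order = pairwise_ranking_loss(0.0, 2.0)  # pair ranked inconsistently
tie = pairwise_ranking_loss(1.0, 1.0)             # equal scores give log(2)
```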

Vision-based Price Suggestion for Online Second-hand Items

Liang Han
Zhaozheng Yin
Zhurong Xia
Li Guo
Mingqian Tang
Rong Jin

Different from shopping in physical stores, where people have the opportunity to closely check a product (e.g., touching the surface of a T-shirt or smelling the scent of perfume) before making a purchase decision, online shoppers rely greatly on the uploaded product images to make any purchase decision. The decision-making is challenging when selling or purchasing second-hand items online since estimating the items' prices is not trivial. In this work, we present a vision-based price suggestion system for the online second-hand item shopping platform. The goal of vision-based price suggestion is to help sellers set effective prices for their second-hand listings with the images uploaded to the online platforms. To provide effective price suggestions for second-hand items with their images, first we propose to better extract representative visual features from the images with the aid of some other image-based item information (e.g., category, brand). Then, we design a vision-based price suggestion module which takes the extracted visual features along with some statistical item features from the shopping platform as the inputs to determine whether an uploaded item image is qualified for price suggestion by a binary classification model, and provide price suggestions for items with qualified images by a regression model. According to the two demands from the platform operator, two different objective functions are proposed to jointly optimize the classification model and the regression model. For better training these two models, we also propose a warm-up training strategy for the joint optimization. Extensive experiments on a large real-world dataset demonstrate the effectiveness of our vision-based price prediction system.

Instance of Interest Detection

Fan Yu
Haonan Wang
Tongwei Ren
Jinhui Tang
Gangshan Wu

In this paper, we propose a novel task named Instance of Interest Detection (IOID) to provide instance-level user interest modeling for image semantic description. IOID focuses on extracting the instances that are beneficial for representing image content, whereas related tasks such as saliency analysis, attention modeling and instance segmentation extract regions that attract visual attention or belong to a predefined category. To this end, we propose a Cross-influential Network for IOID, which integrates both visual saliency and semantic context. Moreover, we contribute the first dataset for IOID evaluation, which consists of 45,000 images from MSCOCO with manually annotated instances of interest. Our method outperforms the state-of-the-art baselines on this dataset.

On Learning Disentangled Representation for Acoustic Event Detection

Lijian Gao
Qirong Mao
Ming Dong
Yu Jing
Ratna Chinnam

Polyphonic Acoustic Event Detection (AED) is a challenging task because the sounds are mixtures of signals from different events, and the features extracted from the mixture do not match well with features calculated from sounds in isolation, leading to suboptimal AED performance. In this paper, we propose a supervised β-VAE model for AED, which adds a novel event-specific disentangling loss to the objective function of disentangled learning. By incorporating either latent factor blocks or latent attention in disentangling, supervised β-VAE learns a set of discriminative features for each event. Extensive experiments on benchmark datasets show that our approach outperforms the current state-of-the-art methods (the top-1 performers in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 AED challenge). Supervised β-VAE performs well on challenging AED tasks with a large variety of events and imbalanced data.
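For orientation, the standard (unsupervised) β-VAE objective that the supervised variant builds on can be sketched as a reconstruction error plus β times a Gaussian KL term. The event-specific disentangling loss proposed in the paper is not described in the abstract and is deliberately not included here; the function names and the default value of `beta` are illustrative assumptions.

```python
import math

def gaussian_kl(mu, logvar):
    """KL(q || p) for q = N(mu, diag(exp(logvar))) and p = N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def beta_vae_loss(reconstruction_error, mu, logvar, beta=4.0):
    """Standard beta-VAE objective: reconstruction error + beta * KL term."""
    return reconstruction_error + beta * gaussian_kl(mu, logvar)

# A latent code that already matches the prior contributes no KL penalty.
no_penalty = beta_vae_loss(1.0, mu=[0.0, 0.0], logvar=[0.0, 0.0])
# Shifting one latent mean by 1 adds beta * 0.5 to the loss.
shifted = beta_vae_loss(1.0, mu=[1.0], logvar=[0.0], beta=4.0)
```

Setting `beta` above 1 weights the KL term more heavily than in a plain VAE, which is the usual mechanism for encouraging disentangled latent factors.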

Progressive Retinex: Mutually Reinforced Illumination-Noise Perception Network for Low-Light Image Enhancement

Yang Wang
Yang Cao
Zheng-Jun Zha
Jing Zhang
Zhiwei Xiong
Wei Zhang
Feng Wu

Contrast enhancement and noise removal are coupled problems in low-light image enhancement. Existing Retinex-based methods do not take this coupling into consideration, resulting in under- or over-smoothing of the enhanced images. To address this issue, this paper presents a novel progressive Retinex framework, in which the illumination and noise of a low-light image are perceived in a mutually reinforced manner, leading to low-light enhancement results with reduced noise. Specifically, two fully pointwise convolutional neural networks are devised to model the statistical regularities of ambient light and image noise respectively, and to leverage them as constraints to facilitate the mutual learning process. The proposed method not only suppresses the interference caused by the ambiguity between tiny textures and image noise, but also greatly improves computational efficiency. Moreover, to solve the problem of insufficient training data, we propose an image synthesis strategy based on a camera imaging model, which generates color images corrupted by illumination-dependent noise. Experimental results on both synthetic and real low-light images demonstrate the superiority of our proposed approach over state-of-the-art (SOTA) low-light enhancement methods.

Lightweight Image Super-Resolution with Information Multi-distillation Network

Zheng Hui
Xinbo Gao
Yunchu Yang
Xiumei Wang

In recent years, single image super-resolution (SISR) methods using deep convolutional neural networks (CNNs) have achieved impressive results. Thanks to the powerful representation capabilities of deep networks, numerous previous methods can learn the complex non-linear mapping between low-resolution (LR) image patches and their high-resolution (HR) versions. However, excessive convolutions limit the application of super-resolution technology on devices with low computing power. Besides, super-resolution at an arbitrary scale factor is a critical issue in practical applications that has not been well solved by previous approaches. To address these issues, we propose a lightweight information multi-distillation network (IMDN) by constructing cascaded information multi-distillation blocks (IMDB), each of which contains distillation and selective fusion parts. Specifically, the distillation module extracts hierarchical features step by step, and the fusion module aggregates them according to the importance of candidate features, which is evaluated by the proposed contrast-aware channel attention mechanism. To process real images of any size, we develop an adaptive cropping strategy (ACS) to super-resolve block-wise image patches using the same well-trained model. Extensive experiments suggest that the proposed method performs favorably against state-of-the-art SR algorithms in terms of visual quality, memory footprint, and inference time. Code is available at https://github.com/Zheng222/IMDN.

Deep Fusion Network for Image Completion

Xin Hong
Pengfei Xiong
Renhe Ji
Haoqiang Fan

Deep image completion usually fails to harmonically blend the restored image into existing content, especially in the boundary area. This paper handles this problem from a new perspective of creating a smooth transition and proposes a concise Deep Fusion Network (DFNet). Firstly, a fusion block is introduced to generate a flexible alpha composition map for combining known and unknown regions. The fusion block not only provides a smooth fusion between restored and existing content but also provides an attention map to make network focus more on the unknown pixels. In this way, it builds a bridge for structural and texture information, so that information can be naturally propagated from the known region into completion. Furthermore, fusion blocks are embedded into several decoder layers of the network. Accompanied by the adjustable loss constraints on each layer, more accurate structure information is achieved. We qualitatively and quantitatively compare our method with other state-of-the-art methods on Places2 and CelebA datasets. The results show the superior performance of DFNet, especially in the aspects of harmonious texture transition, texture detail and semantic structural consistency.
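The role of an alpha composition map can be shown with a minimal sketch of per-pixel alpha blending. The convention that alpha = 1 keeps the restored pixel and alpha = 0 keeps the existing pixel is an assumption for illustration, as are the function name and the single-channel list-of-rows image layout; in DFNet the map is predicted by the fusion block rather than given.

```python
def alpha_composite(restored, known, alpha):
    """Blend two equally sized single-channel images pixel by pixel.

    alpha values lie in [0, 1]: 1 keeps the restored pixel, 0 keeps the known pixel,
    and intermediate values give a smooth transition between the two.
    """
    return [
        [a * r + (1.0 - a) * k for r, k, a in zip(r_row, k_row, a_row)]
        for r_row, k_row, a_row in zip(restored, known, alpha)
    ]

restored = [[1.0, 1.0, 1.0]]
known = [[0.0, 0.0, 0.0]]
alpha_map = [[0.0, 0.25, 1.0]]  # soft transition across a hole boundary
blended = alpha_composite(restored, known, alpha_map)
```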

Predicting Future Instance Segmentation with Contextual Pyramid ConvLSTMs

Jiangxin Sun
Jiafeng Xie
Jian-Fang Hu
Zihang Lin
Jianhuang Lai
Wenjun Zeng
Wei-shi Zheng

Despite the remarkable progress in instance segmentation, the problem of predicting future instance segmentation remains challenging due to the unobservability of future data. Existing methods mainly address this challenge by forecasting pyramid features to represent unobserved future frames. However, they mainly predict features for each pyramid level independently, and ignore the underlying structural relationship between features of different levels.

In this paper, we propose a novel framework called Contextual Pyramid ConvLSTMs, which contains a set of ConvLSTMs to exploit intra-level spatio-temporal contexts for predicting features of each individual level. Moreover, we also add pathway connections among the ConvLSTMs to transmit information across different ConvLSTMs, which allows our system to capture more inter-level spatio-temporal contextual information. We experimentally show that the proposed method can achieve state-of-the-art performance on two video instance segmentation benchmarks for future instance segmentation prediction.

Cycle In Cycle Generative Adversarial Networks for Keypoint-Guided Image Generation

Hao Tang
Dan Xu
Gaowen Liu
Wei Wang
Nicu Sebe
Yan Yan

In this work, we propose a novel Cycle In Cycle Generative Adversarial Network (C2GAN) for the task of keypoint-guided image generation. The proposed C2GAN is a cross-modal framework that jointly exploits the keypoint and image data in an interactive manner. C2GAN contains two different types of generators, i.e., a keypoint-oriented generator and an image-oriented generator. They are mutually connected in an end-to-end learnable fashion and explicitly form three cycled sub-networks, i.e., one image generation cycle and two keypoint generation cycles. Each cycle not only aims at reconstructing the input domain but also produces useful output involved in the generation of another cycle. In this way, the cycles constrain each other implicitly, which provides complementary information from the two different modalities and brings extra supervision across cycles, thus facilitating more robust optimization of the whole network. Extensive experimental results on two publicly available datasets, i.e., Radboud Faces and Market-1501, demonstrate that our approach generates more photo-realistic images than state-of-the-art models.

SESSION: Session 4D: Embedding & Network Learning

Diachronic Cross-modal Embeddings

David Semedo
Joao Magalhaes

Understanding the semantic shifts of multimodal information is only possible with models that capture cross-modal interactions over time. Under this paradigm, a new embedding is needed that structures visual-textual interactions according to the temporal dimension, thus preserving the data's original temporal organisation. This paper introduces a novel diachronic cross-modal embedding (DCM), where cross-modal correlations are represented in embedding space throughout the temporal dimension, preserving semantic similarity at each instant t. To achieve this, we train a neural cross-modal architecture under a novel ranking loss strategy that, for each multimodal instance, enforces the temporal alignment of neighbour instances through subspace structuring constraints based on a temporal alignment window. Experimental results show that our DCM embedding successfully organises instances over time. Quantitative experiments confirm that DCM is able to preserve semantic cross-modal correlations at each instant t while also providing better alignment capabilities. Qualitative experiments unveil new ways to browse multimodal content and hint that multimodal understanding tasks can benefit from this new embedding.

Domain-Specific Embedding Network for Zero-Shot Recognition

Shaobo Min
Hantao Yao
Hongtao Xie
Zheng-Jun Zha
Yongdong Zhang

Zero-Shot Learning (ZSL) seeks to recognize a sample from either seen or unseen domain by projecting the image data and semantic labels into a joint embedding space. However, most existing methods directly adapt a well-trained projection from one domain to another, thereby ignoring the serious bias problem caused by domain differences. To address this issue, we propose a novel Domain-Specific Embedding Network (DSEN) that can apply specific projections to different domains for unbiased embedding, as well as several domain constraints. In contrast to previous methods, the DSEN decomposes the domain-shared projection function into one domain-invariant and two domain-specific sub-functions to explore the similarities and differences between two domains. To prevent the two specific projections from breaking the semantic relationship, a semantic reconstruction constraint is proposed by applying the same decoder function to them in a cycle consistency way. Furthermore, a domain division constraint is developed to directly penalize the margin between real and pseudo image features in respective seen and unseen domains, which can enlarge the inter-domain difference of visual features. Extensive experiments on four public benchmarks demonstrate the effectiveness of DSEN with an average of 9.2% improvement in terms of harmonic mean. The code is available at https://github.com/mboboGO/DSEN-for-GZSL.
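The harmonic mean reported above is, in generalized zero-shot learning, conventionally the harmonic mean of the accuracy on seen classes and the accuracy on unseen classes. The abstract does not spell this out, so treat the sketch below as the usual convention rather than the authors' exact protocol; the function name and the sample accuracies are hypothetical.

```python
def harmonic_mean(seen_accuracy, unseen_accuracy):
    """H = 2 * S * U / (S + U); defined as 0 when both accuracies are 0."""
    total = seen_accuracy + unseen_accuracy
    if total == 0:
        return 0.0
    return 2.0 * seen_accuracy * unseen_accuracy / total

balanced = harmonic_mean(0.5, 0.5)    # equal accuracies give the same value back
unbalanced = harmonic_mean(0.8, 0.2)  # same arithmetic mean, much lower H
```

Because the harmonic mean is dominated by the smaller of the two accuracies, a model biased toward seen classes is penalized even when its average accuracy looks acceptable.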

Collaborative Preference Embedding against Sparse Labels

Shilong Bao
Qianqian Xu
Ke Ma
Zhiyong Yang
Xiaochun Cao
Qingming Huang

Living in the era of the internet, we now face an explosion of online information. As a consequence, we often find ourselves struggling with hundreds or thousands of options before making a decision. As a way to improve the quality of users' online experience, recommendation systems aim to facilitate personalized online decision-making by predicting users' responses to different options. However, the vast majority of the literature in the field focuses only on datasets with a sufficient number of samples. In contrast to traditional methods, we propose a novel method named Collaborative Preference Embedding (CPE), which directly deals with sparse and insufficient user preference information. Specifically, we represent the intrinsic pattern of users/items with a high-dimensional embedding space. On top of this embedding space, we design two schemes that specifically target the limited generalization ability caused by sparse labels. On the one hand, we construct a margin function that indicates the consistency between the embedding space and the true user preference. From the margin theory point of view, we then propose a generalization enhancement scheme for sparse and insufficient labels by optimizing the margin distribution. On the other hand, regarding the embedding as a code for a user/item, we improve the generalization ability from the coding point of view. Specifically, we obtain a compact embedding space by reducing the dependency across different dimensions of a code (embedding). Finally, extensive experiments on a number of real-world datasets demonstrate the superior generalization performance of the proposed algorithm.

Learning Fragment Self-Attention Embeddings for Image-Text Matching

Yiling Wu
Shuhui Wang
Guoli Song
Qingming Huang

In the image-text matching task, the key to good matching quality is to capture the rich contextual dependencies between fragments of image and text. However, previous works either simply aggregate the similarity of all possible pairs of image regions and words, or use multi-step cross attention to attend to image regions and words with each other as context, which requires exhaustive similarity computation between all image region and word pairs. In this paper, we propose Self-Attention Embeddings (SAEM) to exploit fragment relations in images or texts via the self-attention mechanism, and to aggregate fragment information into visual and textual embeddings. Specifically, SAEM extracts salient image regions based on bottom-up attention, and takes WordPiece tokens as sentence fragments. Self-attention layers, each consisting of a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, are built to model subtle and fine-grained fragment relations in image and text respectively. Consequently, the fragment self-attention mechanism can discover the fragment relations, identify the semantically salient regions in images or words in sentences, and capture their interaction more accurately. By simultaneously exploiting the fine-grained fragment relations in both visual and textual modalities, our method produces more semantically consistent embeddings for representing images and texts, and demonstrates promising image-text matching accuracy and high efficiency on the Flickr30K and MSCOCO datasets.
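The core of a self-attention sub-layer can be sketched as single-head scaled dot-product attention over a list of fragment vectors. This is a deliberately simplified illustration: the learned query/key/value projections, multiple heads, and feed-forward sub-layer described above are omitted, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(fragments):
    """Each fragment attends to all fragments of the same modality (itself included).

    fragments: list of equal-length feature vectors (image regions or word tokens).
    Returns one context-aware vector per fragment.
    """
    dim = len(fragments[0])
    outputs = []
    for query in fragments:
        # Scaled dot-product score between this fragment and every fragment.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in fragments
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of all fragment vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, fragments))
            for d in range(dim)
        ])
    return outputs

contextual = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Note that the cost is quadratic in the number of fragments within one modality, but no similarity is computed between image regions and words at this stage, which is the efficiency argument made in the abstract.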

Adaptive Semantic-Visual Tree for Hierarchical Embeddings

Shuo Yang
Wei Yu
Ying Zheng
Hongxun Yao
Tao Mei

Merchandise categories inherently form a semantic hierarchy with different levels of concept abstraction, especially for fine-grained categories. This hierarchy encodes rich correlations among various categories across different levels, which can effectively regularize the semantic space and thus make prediction less ambiguous. However, previous studies of fine-grained image retrieval primarily focus on semantic similarities or visual similarities. In real applications, merely using visual similarity may not satisfy consumers who search for merchandise with real-life images. For example, given a red coat as the query image, retrieval based only on visual similarity might return a red suit because the two are visually similar, but the user actually wants a coat rather than a suit, even if the coat has different color or texture attributes. We introduce this new problem based on photo shopping in real practice. For this reason, semantic information is integrated to regularize the margins so that "semantic" takes priority over "visual". To solve this new problem, we propose a hierarchical adaptive semantic-visual tree (ASVT) to depict the architecture of merchandise categories, which simultaneously evaluates semantic similarities between different semantic levels and visual similarities within the same semantic class. The semantic information satisfies the demand of consumers for merchandise similar to the query, while the visual information optimizes the correlations within the semantic class. At each level, we set different margins based on the semantic hierarchy and incorporate them as prior information to learn a fine-grained feature embedding. To evaluate our framework, we propose a new dataset named JDProduct, with hierarchical labels collected from actual image queries and official merchandise images on an online shopping application. Extensive experimental results on the public CARS196 and CUB-200-2011 datasets demonstrate the superiority of our ASVT framework over the compared state-of-the-art methods.

Defending Against Adversarial Examples via Soft Decision Trees Embedding

Yingying Hua
Shiming Ge
Xindi Gao
Xin Jin
Dan Zeng

Convolutional neural networks (CNNs) have been shown to be vulnerable to adversarial examples, which contain imperceptible perturbations. In this paper, we propose an approach to defend against adversarial examples with soft decision trees embedding. First, we extract the semantic features of adversarial examples with a feature extraction network. Then, a specific soft decision tree is trained and embedded to select the key semantic features for each feature map from the convolutional layers, and the selected features are fed to a light-weight classification network. To this end, we use the probability distributions of each tree node to quantify the semantic features. In this way, some small perturbations can be effectively removed and the selected features are more discriminative in identifying adversarial examples. Moreover, the influence of adversarial perturbations on classification can be reduced by migrating the interpretability of soft decision trees into the black-box neural networks. We conduct experiments on defending against state-of-the-art adversarial attacks. The experimental results demonstrate that our proposed approach can effectively defend against these attacks and improve the robustness of deep neural networks.

Adaptive Feature Fusion via Graph Neural Network for Person Re-identification

Yaoyu Li
Hantao Yao
Lingyu Duan
Hanxing Yao
Changsheng Xu

Person Re-identification (ReID) aims to identify a probe person appearing under multiple camera views. Existing methods focus on proposing a robust model to capture the discriminative information. However, they all generate a representation by mining useful clues from a given single image, and ignore the intercommunication with other images. To address this issue, we propose a novel network named Feature-Fusing Graph Neural Network (FFGNN), which fully utilizes the relationships among the nearest neighbors of the given image, and allows message propagation to update the feature of the node during representation learning. Given an anchor image, the FFGNN first obtains its Top-K nearest images based on the features generated by the trained Feature-Extracting Network (FEN). We then construct a graph G based on the obtained K+1 images, in which each node represents the feature of an image. The edges of the graph G are obtained by combining the visual similarity and Jaccard similarity between nodes. Within the constructed graph G, FFGNN conducts message propagation and adaptive feature fusion between nodes by iteratively performing graph convolutional operations on the input features. Finally, the FFGNN outputs a robust and discriminative representation which contains the information from its similar images. Extensive experiments on three public person ReID datasets, including Market-1501, DukeMTMC-ReID, and CUHK03, demonstrate that the proposed model can achieve significant improvement over state-of-the-art methods.
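
The edge construction mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the FFGNN implementation: visual similarity is taken to be cosine similarity, Jaccard similarity is computed over each node's nearest-neighbor set (including the node itself), and the two are mixed with a hypothetical weight `alpha`. All function names are illustrative, and feature vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def top_k(index, features, k):
    """Indices of the k images most similar to features[index]."""
    scored = sorted(
        ((cosine(features[index], f), j)
         for j, f in enumerate(features) if j != index),
        reverse=True,
    )
    return {j for _, j in scored[:k]}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edge_weight(i, j, features, k=1, alpha=0.5):
    """Mix visual and neighborhood (Jaccard) similarity into one weight.

    The linear mix with `alpha` is an assumption for illustration.
    """
    visual = cosine(features[i], features[j])
    neighbors_i = top_k(i, features, k) | {i}
    neighbors_j = top_k(j, features, k) | {j}
    return alpha * visual + (1 - alpha) * jaccard(neighbors_i, neighbors_j)


# Two tight pairs of toy features: images 0 and 1, images 2 and 3.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
strong = edge_weight(0, 1, features)
weak = edge_weight(0, 2, features)
print(strong, weak)
```

Images that look alike and also share nearest neighbors receive a larger edge weight than images that are merely close in one of the two senses.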

Learning Semantics-aware Distance Map with Semantics Layering Network for Amodal Instance Segmentation

Ziheng Zhang
Anpei Chen
Ling Xie
Jingyi Yu
Shenghua Gao

In this work, we demonstrate yet another approach to tackle the amodal segmentation problem. Specifically, we first introduce a new representation, namely a semantics-aware distance map (sem-dist map), to serve as our target for amodal segmentation instead of the commonly used masks and heatmaps. The sem-dist map is a kind of level-set representation, in which the different regions of an object are placed into different levels on the map according to their visibility. It is a natural extension of masks and heatmaps, in which modal segmentation, amodal segmentation, and depth order information are all well described. We then introduce a novel convolutional neural network (CNN) architecture, which we refer to as the semantic layering network, to estimate sem-dist maps layer by layer, from the global level to the instance level, for all objects in an image. Extensive experiments on the COCOA and D2SA datasets demonstrate that our framework can predict amodal segmentation, occlusion, and depth order with state-of-the-art performance.

Open Set Deep Learning with A Bayesian Nonparametric Generative Model

Xulun Ye
Jieyu Zhao

Being a widely studied model in the machine learning and multimedia communities, the Deep Neural Network (DNN) has achieved encouraging success in various applications. However, conventional DNNs have difficulty handling the open set learning problem, in which the true number of classes is unknown and the prediction labels in the testing dataset usually include unseen classes that are not contained in the training set. In this paper, we aim to tackle this problem by unifying the deep neural network and the Dirichlet process mixture model. First, to learn the deep feature and enable the incorporation of the DNN and the Bayesian nonparametric model, we extend deep metric learning to a semi-supervised framework. Second, with the learned deep feature, we construct our open set classification method by expanding the Dirichlet process mixture model to a semi-supervised framework. To infer our semi-supervised Bayesian model, the corresponding variational inference algorithm has also been derived. Experiments on synthetic and real-world datasets validate our theoretical analysis and demonstrate state-of-the-art performance.

Fast Non-Local Neural Networks with Spectral Residual Learning

Lu Chi
Guiyu Tian
Yadong Mu
Lingxi Xie
Qi Tian

Effectively modeling long-range spatial correlation is crucial in context-sensitive visual computing tasks, such as human pose estimation and video classification. Enlarging the receptive field is a popular way of building such non-local deep networks. However, current solutions, including dilated convolution and self-attention based operators, mostly suffer from either low computational efficiency or an insufficient receptive field. This paper proposes spectral residual learning (SRL), a novel network architectural design for achieving a fully global receptive field. A neural block that implements SRL has three key components: a local-to-global transform that projects some ordinary local features into a spectral domain, compiled operations in the spectral domain, and a global-to-local transform that converts all data back to the original local format. We show its equivalence to conducting residual learning in some spectral domain and carefully re-formulate a variety of neural layers, such as ReLU or convolutions, into their spectral forms. The benefits of SRL are three-fold: first, all operations have a global receptive field, namely any update affects all image positions, which can extract richer context information in various vision tasks; second, the local-to-global / global-to-local transforms in SRL are defined by bi-linear unitary matrices, which are economical in both computation and parameters; lastly, SRL is a generic formulation, here instantiated by the Fourier transform and a real orthogonal matrix. We conduct comprehensive evaluations on two challenging tasks, namely human pose estimation from images and video classification. All experiments clearly show performance improvement by large margins in comparison with conventional non-local network designs.
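
The global receptive field property can be illustrated with a small one-dimensional sketch. It assumes the Fourier instantiation with a unitary (1/sqrt(n)) normalization and uses hypothetical helper names `dft` and `idft`; it is not the SRL block itself, only a demonstration that changing a single spectral coefficient changes every position after the inverse transform.

```python
import cmath
import math


def dft(signal):
    """Unitary discrete Fourier transform (local-to-global transform)."""
    n = len(signal)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Unitary inverse transform (global-to-local transform)."""
    n = len(spectrum)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                    for k in range(n))
        for t in range(n)
    ]


x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
spectrum = dft(x)
roundtrip = idft(spectrum)  # recovers x up to floating-point error

# Update ONE coefficient in the spectral domain ...
edited = list(spectrum)
edited[1] += 1.0
y = idft(edited)

# ... and observe that EVERY position of the signal has moved.
changes = [abs(b - a) for a, b in zip(roundtrip, y)]
print(changes)
```

A spatial convolution with a small kernel would only touch a local neighborhood per layer, whereas here one spectral update reaches all positions at once.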

Automatic Check-Out (ACO) has received increased interest in recent years. An important component of the ACO system is visual item counting, which recognizes the categories and counts of the items chosen by the customers. However, the training of such a system is challenged by the domain adaptation problem, in which the training data are images of isolated items while the testing images show collections of items. Existing methods solve this problem with data augmentation using synthesized images, but the image synthesis leads to unrealistic images that affect the training process. In this paper, we propose a new data priming method to solve the domain adaptation problem. Specifically, we first use pre-augmentation data priming, in which we remove distracting background from the training images using a coarse-to-fine strategy and select images with realistic view angles by a pose pruning method. In the post-augmentation step, we train a data priming network using detection and counting collaborative learning, and select more reliable images from the testing data to fine-tune the final visual item tallying network. Experiments on the large-scale Retail Product Checkout (RPC) dataset demonstrate the superiority of the proposed method, i.e., we achieve 80.51% checkout accuracy compared with 56.68% for the baseline methods. The source code can be found at https://isrc.iscas.ac.cn/gitlab/research/acm-mm-2019-ACO.

Monocular Depth Estimation as Regression of Classification using Piled Residual Networks

Wen Su
Haifeng Zhang
Jia Li
Wenzhen Yang
Zengfu Wang

Predicting depth from a single monocular image is a challenging task in scene understanding. Most existing work predicts depth by regression or classification with features extracted from a local neighborhood area. However, neither regression nor classification alone yields a fully satisfying solution, and local context can be insufficient to predict the depth. This paper addresses this problem as regression of class-related features on a piled residual convolutional neural network. Our framework works in two stages. First, a well-designed deep convolutional neural network model is employed to classify the depths in difference-scale invariance space. The model utilizes all scales of context through piled residual paths. The deeper layers that capture high-level semantic features with long-range context can be directly refined using fine-grained features with local context from earlier convolutions. We then apply a centered information gain loss to the model to produce intra-class compact and inter-class discriminative features. Second, to obtain depths instead of class labels, we infer depth regression with convolutional layers which model the mapping from class discriminative features to continuous depth values. Experiments on popular indoor and outdoor datasets show competitive results compared with recent state-of-the-art methods.

GroundNet: Monocular Ground Plane Normal Estimation with Geometric Consistency

Yunze Man
Xinshuo Weng
Xi Li
Kris Kitani

We focus on estimating the 3D orientation of the ground plane from a single image. We formulate the problem as an inter-mingled multi-task prediction problem by jointly optimizing for pixel-wise surface normal direction, ground plane segmentation, and depth estimates. Specifically, our proposed model, GroundNet, first estimates the depth and surface normal in two separate streams, from which two ground plane normals are then computed deterministically. To leverage the geometric correlation between depth and normal, we propose to add a consistency loss on top of the computed ground plane normals. In addition, a ground segmentation stream is used to isolate the ground regions so that we can selectively back-propagate parameter updates through only the ground regions in the image. Our method achieves the top-ranked performance on ground plane normal estimation and horizon line detection on the real-world outdoor datasets of ApolloScape and KITTI, improving the performance of previous art by up to 17.7% relatively.
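
A minimal sketch of the consistency idea described above: one ground plane normal is computed deterministically from 3D points (as would be back-projected from a depth map) and compared with a second normal estimate. The specific choices here are assumptions for illustration, not GroundNet's formulation: the plane normal comes from a cross product of three points, the loss is one minus cosine similarity, and the names `plane_normal` and `consistency_loss` are hypothetical.

```python
import math


def cross(u, v):
    """Cross product of two 3-D vectors."""
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def plane_normal(p0, p1, p2):
    """Unit normal of the plane through three non-collinear 3-D points."""
    u = [b - a for a, b in zip(p0, p1)]
    v = [b - a for a, b in zip(p0, p2)]
    return normalize(cross(u, v))


def consistency_loss(normal_a, normal_b):
    """1 - cosine similarity: 0 when the two estimates agree exactly."""
    a, b = normalize(normal_a), normalize(normal_b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


# Three ground points on the plane y = 0 (toy stand-in for depth output).
normal_from_depth = plane_normal((0, 0, 0), (1, 0, 0), (0, 0, 1))
# Toy stand-in for the normal produced by the surface-normal stream.
normal_from_stream = [0.05, -1.0, 0.0]
loss = consistency_loss(normal_from_depth, normal_from_stream)
print(normal_from_depth, loss)
```

Minimizing such a term pushes the two streams toward geometrically compatible predictions; in practice the points would be restricted to pixels selected by the ground segmentation.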

WealthAdapt: A General Network Adaptation Framework for Small Data Tasks

Bingyan Liu
Yao Guo
Xiangqun Chen

In this paper, we propose a general network adaptation framework, namely WealthAdapt, to effectively adapt a large network for small data tasks, with the assistance of a wealth of related data. While many existing algorithms have proposed network adaptation techniques for resource-constrained systems, they typically implement network adaptation based on a large dataset and do not perform well when facing small data tasks. Because small data have poor feature expression ability, it may result in incorrect filter selection and overfitting during fine-tuning in the network adaptation process. In WealthAdapt, we first expand the target small data task with the wealth of big data, before we perform network adaptation, in order to enrich the features and improve the fine-tuning performance during adaptation. We formally establish network adaptation for small data tasks as an optimization problem and solve it through two main techniques: model-based fast selection and wealth-incorporated iteration adaptation. Experimental results demonstrate that our framework is applicable to both the vanilla convolutional network VGG-16 and more complex modern architecture ResNet-50, outperforming several state-of-the-art network adaptation pipelines on multiple visual classification tasks including general object recognition, fine-grained object recognition and scene recognition.

SESSION: Demonstration II

Market2Dish: A Health-aware Food Recommendation System

Hao Jiang
Wenjie Wang
Meng Liu
Liqiang Nie
Ling-Yu Duan
Changsheng Xu

In order to help people develop healthy eating habits, we present a personalized health-aware food recommendation system called Market2Dish. Market2Dish can recognize the ingredients in micro-videos taken at the market, characterize the health conditions of users from their social media accounts, and ultimately recommend personalized healthy foods to users. Specifically, we employ a word-class interaction based text classification model to learn the fine-grained similarity between sparse health features on the social media platforms and pre-defined health concepts, and then a category-aware hierarchical memory network based recommender is introduced to learn the user-recipe interactions for better food recommendations. Moreover, we demonstrate this system as an online app for real-time interactions with users.

Remote VR Gaming on Mobile Devices

Mikko Pitkänen
Marko Viitanen
Alexandre Mercat
Jarno Vanne

This paper presents a remote 360-degree virtual reality (VR) gaming system for mobile devices. In this end-to-end scheme, execution of the VR game is off-loaded from low-power mobile devices to a remote server, where the executed game is rendered based on controller orientation and actions transmitted over the network. The server runs the Unity game engine and the Kvazaar video encoder. Kvazaar compresses the rendered views of the game into High Efficiency Video Coding (HEVC) video that is streamed to a player over a regular WiFi link in real time. The frontend of our proof-of-concept demonstrator setup is composed of the Samsung Galaxy S8 smartphone and the Google Daydream View VR headset with a controller. The backend server is a laptop equipped with an Nvidia GTX 1070 GPU and an Intel i7 7820HK CPU. The system is able to run the demonstrated 360-degree shooting VR game at 1080p resolution and 30 fps while keeping motion-to-photon latency close to 50 ms. This approach lets players enjoy an immersive gaming experience without needing to invest in all-in-one VR headsets.

Ultrasound-Based Silent Speech Interface using Sequential Convolutional Auto-encoder

Kele Xu
Yuxiang Wu
Zhifeng Gao

A "Silent Speech Interface" (SSI) is a system that uses non-audible signals recorded during speech production to perform speech recognition and synthesis tasks. Different approaches have been proposed for SSI systems. In this paper, we focus on an ultrasound-based SSI. The performance of an ultrasound-based SSI system relies heavily on the feature extraction approach. However, most previous attempts are limited to individual frame analysis, and the context information of the image sequence cannot be taken into account. Inspired by the recent success of the recurrent neural network and the convolutional auto-encoder, we explore a novel sequential feature extraction approach for SSI systems. The architecture can extract spatial and temporal features from the image sequence, which can be further deployed for the speech recognition and synthesis tasks. In a quantitative comparison between different unsupervised feature extraction approaches, the new approach outperforms other methods on the 2010 SSI challenge.

Interactive Exploration of Journalistic Video Footage through Multimodal Semantic Matching

Sarah Ibrahimi
Shuo Chen
Devanshu Arya
Arthur Câmara
Yunlu Chen
Tanja Crijns
Maurits van der Goes
Thomas Mensink
Emiel van Miltenburg
Daan Odijk
William Thong
Jiaojiao Zhao
Pascal Mettes

This demo presents a system for journalists to explore video footage for broadcasts. Daily news broadcasts contain multiple news items that consist of many video shots and searching for relevant footage is a labor intensive task. Without the need for annotated video shots, our system extracts semantics from footage and automatically matches these semantics to query terms from the journalist. The journalist can then indicate which aspects of the query term need to be emphasized, e.g. the title or its thematic meaning. The goal of this system is to support the journalists in their search process by encouraging interaction and exploration with the system.

NeuronUnityIntegration2.0. A Unity Based Application for Motion Capture and Gesture Recognition

Federico Becattini
Andrea Ferracani
Filippo Principi
Marioemanuele Ghianni
Alberto Del Bimbo

NeuronUnityIntegration2.0 (demo video is available at http://tiny.cc/u1lz6y) is a plugin for Unity which provides gesture recognition functionalities through the Perception Neuron motion capture suit. The system offers a recording mode, which guides the user through the collection of a dataset of gestures, and a recognition mode, capable of detecting the recorded actions in real time. Gestures are recognized by training Support Vector Machines directly within our plugin. We demonstrate the effectiveness of our application through an experimental evaluation on a newly collected dataset. Furthermore, external applications can exploit NeuronUnityIntegration2.0's recognition capabilities thanks to a set of exposed APIs.

Real-Time Visual Navigation in Huge Image Sets Using Similarity Graphs

Kai Uwe Barthel
Nico Hezel
Konstantin Schall
Klaus Jung

Nowadays stock photo agencies often have millions of images. Non-stop viewing of 20 million images at a speed of 10 images per second would take more than three weeks. This demonstrates the impossibility to inspect all images and the difficulty to get an overview of the entire collection. Although there has been a lot of effort to improve visual image search, there is little research and support for visual image exploration. Typically, users start "exploring" an image collection with a keyword search or an example image for a similarity search. Both searches lead to long unstructured lists of result images. In earlier publications, we introduced the idea of graph-based image navigation and proposed an efficient algorithm for building hierarchical image similarity graphs for dynamically changing image collections. In this demo we showcase real-time visual exploration of millions of images with a standard web browser. Subsets of images are successively retrieved from the graph and displayed as a visually sorted 2D image map, which can be zoomed and dragged to explore related concepts. Maintaining the positions of previously shown images creates the impression of an "endless map". This approach allows an easy visual image-based navigation, while preserving the complex image relationships of the graph.
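
The viewing-time figure quoted in the abstract can be checked with a few lines of arithmetic. The helper name `viewing_time_days` is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def viewing_time_days(num_images, images_per_second):
    """Days of non-stop viewing needed to see every image once."""
    return num_images / images_per_second / SECONDS_PER_DAY


days = viewing_time_days(20_000_000, 10)
weeks = days / 7
print(f"{days:.1f} days, {weeks:.1f} weeks")  # prints "23.1 days, 3.3 weeks"
```

Twenty million images at ten per second is two million seconds, or just over 23 days, which is consistent with the "more than three weeks" stated above.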

A Real-Time Demo for Acoustic Event Classification in Ambient Assisted Living Contexts

Arunodhayan Sampath Kumar
René Erler
Danny Kowerko

In this paper we present a real-time demo for acoustic event classification using a Convolutional Neural Network (CNN). When an acoustic event is fed as input into our system in real time, the system performs the classification task and indicates to which class the acoustic event belongs. We combined different audio datasets into a single dataset of our own consisting of 94 classes belonging to the context of Ambient Assisted Living (AAL). The so-called AAL-94 audio set is a combination of the publicly available ESC-50 [7], Audio Set [4] and Ultrasound-8k [8] datasets. We enriched these subsets with our own laboratory recordings to create a collection of 18,882 audio recordings typical for AAL. A CNN is trained on this dataset and performs the classification task. A snapshot of the best model from the training process is used for real-time audio processing in our demo. The demo visualizes the audio classification results in a real-time spectrogram and some statistical plots. Users either interact by making noises themselves from the 94 available classes shown on an auxiliary screen of the demo, or trigger sounds from a MIDI keyboard to test the system performance live. Current and overall classification results are shown on the main screen.

User-Adaptive Editing for 360 degree Video Streaming with Deep Reinforcement Learning

Lucile Sassatelli
Marco Winckler
Thomas Fisichella
Ramon Aparicio

The development through streaming of 360-degree videos is persistently hindered by how much bandwidth they require. Spatially adapting the quality of the sphere to the user's Field of View (FoV) lowers the data rate but requires keeping the playback buffer small, predicting the user's motion, or making replacements to keep the buffered qualities up to date with the moving FoV, all three being uncertain and risky. We have previously shown that opportunistically regaining control of the FoV with active attention-driving techniques provides additional levers to ease streaming and improve Quality of Experience (QoE). Deep neural networks have recently been shown to achieve the best performance for video streaming adaptation and head motion prediction. This demo presents a step forward in the investigation of deep neural network approaches to obtain user-adaptive and network-adaptive 360-degree video streaming systems. In this demo, we show how snap-changes, an attention-driving technique, can be automatically modulated by the user's motion to improve the streaming QoE. The control of snap-changes is performed by a deep neural network trained on head motion traces with the Deep Reinforcement Learning strategy A3C.

OtonoVR: Arbitrarily Angled Audio-visual VR Experience Using Selective Synthesis Sound Field Technique

Toshiharu Horiuchi
Sumaru Niida
Yasuhiro Takishima

We present an arbitrarily angled audio-visual VR experience app called OtonoVR for 360-degree panoramic videos using our selective synthesis sound field technique. This technique can synthesize two-channel stereo sound with scaled stereo width having an arbitrary angle range from 0 to 360 degrees centering on an arbitrary direction from multi-channel surround sound based on spectral modification. In the app, users can enjoy arbitrarily angled videos that they choose themselves by manipulating the touchscreen, and the stereo sound changes in terms of its spatial synchronization depending on the view. The app has been released for iOS and has been officially endorsed by Japanese idol groups.

Active Learning of Identity Agnostic Roles for Character Grounding in Videos

Jiang Gao

A popular approach for storyline grounding in videos is centered on the consistent classification of main characters in videos. This work investigates learning identity agnostic characters for video grounding. The system starts without any knowledge about the character identities and their facial images, and learns to build individual character models as the video frames stream in. It is challenging to establish character models, especially in movies where characters can go through a lifetime in a clip, and their facial appearances change dramatically between scenes. We designed an active learning algorithm on top of metric learning, with an interactive interface to query users. Different from conventional active learning algorithms, our query proposal function depends not only on appearance, but also on story context of these characters.

Ramen as You Like: Sketch-based Food Image Generation and Editing

Jaehyeong Cho
Wataru Shimoda
Keiji Yanai

In recent years, a large number of images have been posted on SNS. Users often synthesize or modify their photos before uploading them. However, synthesizing and modifying photos requires a lot of time and skill. In this demo, we demonstrate easy and fast image synthesis and modification through "sketch-based food image generation". The proposed system uses pix2pix to generate realistic food images based on sketched images, and DeepLab V3+ to obtain sketch masks from real photos. A user can create a realistic food image easily and quickly by sketching a mask image consisting of food elements. In addition, a user can also edit a mask image automatically generated from a real food photo, and generate a modified food image. For training, we have created a new ramen image dataset consisting of 555 images with 15 kinds of pixel-wise labels.

DeepPhysio: Monitored Physiotherapeutic Exercise in the Comfort of your Own Home

Gianmarco Sanesi
Andrew D. Bagdanov
Marco Bertini
Alberto Del Bimbo

This paper describes an action classification pipeline for detecting and evaluating correct execution of actions in video recorded by smartphone cameras; the use case is that of simplifying monitoring of how physiotherapeutic exercises are performed by patients in the comfort of their own home, reducing the need for the physical presence of therapists. Our approach is based on applying DensePose to every frame of acquired video and subsequent sequence analysis by an LSTM network. We validate our proposed recognition approach on a subset of the NTU RGB+D dataset in order to determine the best classification pipeline for this application. We also describe a mobile, cross-platform application called DeepPhysio that is designed to allow physiotherapy patients to obtain immediate feedback about the correctness of their physical exercises. Preliminary usability analysis shows that this type of application can be effective at monitoring physiotherapy exercises.

Using 3D Bookmarks for Desktop and Mobile DASH-3D Clients

Thomas Forgione
Axel Carlier
Géraldine Morin
Wei Tsang Ooi
Vincent Charvillat

Navigating in a 3D networked virtual environment with six degrees of freedom on a mobile device can be disorienting and challenging for users. In this technical demonstration, we show how 3D bookmarks can be used to simplify such interactions. Our system integrates the 3D bookmarks into a DASH-based networked virtual environment in a DASH-compliant manner and is available on both desktop and mobile.

Automatic Fashion Knowledge Extraction from Social Media

Yunshan Ma
Lizi Liao
Tat-Seng Chua

Fashion knowledge plays a pivotal role in helping people in their dressing. In this paper, we present a novel system to automatically harvest fashion knowledge from social media. It unifies three tasks of occasion, person and clothing discovery from multiple modalities of images, texts and metadata. A contextualized fashion concept learning model is applied to leverage the rich contextual information for improving the fashion concept learning performance. At the same time, to counter the label noise within training data, we employ a weak label modeling method to further boost the performance. We build a website to demonstrate the quality of fashion knowledge extracted by our system.

A Cooking Support System by Extracting Difficult Scenes for Cooking Operations from Recipe Short Videos

Takuya Yonezawa
Yuanyuan Wang
Yukiko Kawai
Kazutoshi Sumiya

Recently, short cooking videos such as those from Kurashiru and DELISH KITCHEN have been attracting attention. These cooking videos can help people learn the essentials of cooking in a short time. However, it is difficult to understand cooking operations by watching a video only once. Therefore, in this paper, we propose a novel cooking support system that estimates the difficulty of a cooking video by extracting cooking ingredients and operations, based on their appearance time, from the open captions of videos. We also show the effectiveness of our proposed method for extracting the difficulty of cooking videos.

AI Coach: Deep Human Pose Estimation and Analysis for Personalized Athletic Training Assistance

Jianbo Wang
Kai Qiu
Houwen Peng
Jianlong Fu
Jianke Zhu

Accurate pose analysis in sport videos helps users improve their skills. In this paper, we propose an AI coach system to provide personalized athletic training experiences for posture-wise sports activities, in which the training quality largely depends on the correctness of human poses in a video sequence. We design the system with several distinct features: (1) trajectory extraction for a single human instance by leveraging deep visual tracking, (2) human pose estimation by proposing a novel human joints relation model in the spatial and temporal domains, and (3) pose correction by abnormality detection, performance rating and exemplar-based visual suggestions. We build an online service of this AI coach system for sports enthusiasts and collect extensive feedback. Comparisons with some of the latest popular sport apps demonstrate the effectiveness of this AI coach system in improving users' skills.

Mind Band: A Crossmedia AI Music Composing Platform

Zhaolin Qiu
Yufan Ren
Canchen Li
Hongfu Liu
Yifan Huang
Yiheng Yang
Songruoyao Wu
Hanjia Zheng
Juntao Ji
Jianjia Yu
Kejun Zhang

Various media information in life can have an important impact on our understanding of music. In this paper, we present a demo, Mind Band, which is a cross-media artificial intelligence composing platform using elements of our lives such as emoji, images and humming. In practice, we base our system on the valence-arousal model. We use emotion analysis of life elements to map them to music pieces, which are generated by a Variational Autoencoder - Generative Adversarial Networks model. We provide users with an immersive experience in which they upload an emoji, image or humming and retrieve emotionally related music pieces. With this platform, everyone can be a composer.

SESSION: Panel 1

PANEL: Challenges for Multimedia/Multimodal Research in the Next Decade

Shih-Fu Chang
L.P. Morency
Alexander Hauptmann
Alberto Del Bimbo
Cathal Gurrin
Hayley Hung
Heng Ji
Alan Smeaton

The multimedia and multi-modal community has witnessed an explosive transformation in recent years, with major societal impact. With the unprecedented deployment of multimedia devices and systems, multimedia research is critical to our abilities and prospects in advancing state-of-the-art technologies and solving real-world challenges facing society and the nation. To respond to these challenges and further advance the frontiers of the field of multimedia, this panel will discuss the challenges and visions that may guide future research in the next ten years.

SESSION: Brave New Ideas

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

Shizhe Chen
Bei Liu
Jianlong Fu
Ruihua Song
Qin Jin
Pingping Lin
Xiaoyu Qi
Chunting Wang
Jin Zhou

A storyboard is a sequence of images to illustrate a story containing multiple sentences, which has been a key process to create different story products. In this paper, we tackle a new multimedia task of automatic storyboard creation to facilitate this process and inspire human artists. Inspired by the fact that our understanding of languages is based on our past experience, we propose a novel inspire-and-create framework with a story-to-image retriever that selects relevant cinematic images for inspiration and a storyboard creator that further refines and renders images to improve the relevancy and visual consistency. The proposed retriever dynamically employs contextual information in the story with hierarchical attentions and applies dense visual-semantic matching to accurately retrieve and ground images. The creator then employs three rendering steps to increase the flexibility of retrieved images, which include erasing irrelevant regions, unifying styles of images and substituting consistent characters. We carry out extensive experiments on both in-domain and out-of-domain visual story datasets. The proposed model achieves better quantitative performance than the state-of-the-art baselines for storyboard creation. Qualitative visualizations and user studies further verify that our approach can create high-quality storyboards even for stories in the wild.

HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities

Devanshu Arya
Stevan Rudinac
Marcel Worring

Multimodal datasets contain an enormous amount of relational information, which grows exponentially with the introduction of new modalities. Learning representations in such a scenario is inherently complex due to the presence of multiple heterogeneous information channels. These channels can encode both (a) inter-relations between the items of different modalities and (b) intra-relations between the items of the same modality. Encoding multimedia items into a continuous low-dimensional semantic space such that both types of relations are captured and preserved is extremely challenging, especially if the goal is a unified end-to-end learning framework. The two key challenges that need to be addressed are: 1) the framework must be able to merge complex intra- and inter-relations without losing any valuable information and 2) the learning model should be invariant to the addition of new and potentially very different modalities. In this paper, we propose a flexible framework which can scale to data streams from many modalities. To that end, we introduce a hypergraph-based model for data representation and deploy Graph Convolutional Networks to fuse relational information within and across modalities. Our approach provides an efficient solution for distributing otherwise extremely computationally expensive or even infeasible training processes across multiple GPUs, without any sacrifice in accuracy. Moreover, adding new modalities to our model requires only an additional GPU unit while keeping the computational time unchanged, which brings representation learning to truly multimodal datasets. We demonstrate the feasibility of our approach in experiments on multimedia datasets featuring second-, third- and fourth-order relations.

Moment-to-Moment Detection of Internal Thought during Video Viewing from Eye Vergence Behavior

Michael Xuelin Huang
Jiajia Li
Grace Ngai
Hong Va Leong
Andreas Bulling

Internal thought refers to the process of directing attention away from a primary visual task to internal cognitive processing. It is pervasive and closely related to primary task performance. As such, automatic detection of internal thought has significant potential for user modeling in human-computer interaction and multimedia applications. Despite the close link between the eyes and the human mind, only a few studies have investigated vergence behavior during internal thought, and none has studied moment-to-moment detection of internal thought from gaze. While prior studies relied on long-term data analysis and required a large number of gaze characteristics, we describe a novel method that is user-independent, computationally lightweight and only requires eye vergence information readily available from binocular eye trackers. We further propose a novel paradigm to obtain ground truth internal thought annotations by exploiting human blur perception. We evaluated our method during natural viewing of lecture videos and achieved a 12.1% improvement over the state of the art. These results demonstrate the effectiveness and robustness of vergence-based detection of internal thought and, as such, open new research directions for attention-aware interfaces.

Learning Subjective Attributes of Images from Auxiliary Sources

Francesco Gelli
Tiberio Uricchio
Xiangnan He
Alberto Del Bimbo
Tat-Seng Chua

Recent years have seen unprecedented research on using artificial intelligence to understand the subjective attributes of images and videos. These attributes are not objective properties of the content but are highly dependent on the perception of the viewers. Subjective attributes are extremely valuable in many applications where images are tailored to the needs of a large group, which consists of many individuals with inherently different ideas and preferences. For instance, marketing experts choose images to establish specific associations in the consumers' minds, while psychologists look for pictures with adequate emotions for therapy. Unfortunately, most of the existing frameworks either focus on objective attributes or rely on large scale datasets of annotated images, making them costly and unable to clearly measure multiple interpretations of a single input. Meanwhile, we can see that users or organizations often interact with images in a multitude of real-life applications, such as the sharing of photographs by brands on social media or the re-posting of image microblogs by users. We argue that these aggregated interactions can serve as auxiliary information to infer image interpretations. To this end, we propose a probabilistic learning framework capable of transferring such subjective information to the image-level labels based on a known aggregated distribution. We use our framework to rank images by subjective attributes from the domain knowledge of social media marketing and personality psychology. Extensive studies and visualizations show that using auxiliary information is a viable line of research for the multimedia community to perform subjective attributes prediction.

SESSION: Open Source Software Competition

daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices

Jianhao Zhang
Yingwei Pan
Ting Yao
He Zhao
Tao Mei

It is widely believed that Binary Neural Networks (BNNs) can drastically improve inference efficiency by replacing the arithmetic operations in float-valued Deep Neural Networks (DNNs) with bit-wise operations. Nevertheless, there has been no open-source implementation in support of this idea on low-end ARM devices (e.g., mobile phones and embedded devices). In this work, we propose daBNN, a super fast inference framework that implements BNNs on ARM devices. Several speed-up and memory refinement strategies for bit-packing, binarized convolution, and memory layout are uniquely devised to enhance inference efficiency. Compared to the recent open-source BNN inference framework BMXNet, daBNN is 7x to 23x faster on a single binary convolution, and about 6x faster on Bi-Real Net 18 (a BNN variant of ResNet-18). daBNN is a BSD-licensed inference framework, and its source code, sample projects and pre-trained models are available online: https://github.com/JDAI-CV/dabnn.
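The bit-wise replacement mentioned in the abstract above can be illustrated with a minimal sketch. It assumes weights and activations are restricted to +1/-1 and packed one bit per value; the function names `pack_signs` and `binary_dot` are hypothetical and are not part of daBNN's API. The sketch shows only the well-known identity that turns a dot product into an XOR followed by a population count, not daBNN's ARM-specific kernels.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int: bit i is 1 when values[i] is +1."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors given their packed bits.

    Positions where the bits differ contribute -1 and the rest contribute +1,
    so dot = n - 2 * popcount(a XOR b).
    """
    differing = bin(a_bits ^ b_bits).count("1")
    return n - 2 * differing


# Illustrative data (not from the paper).
a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))          # ordinary arithmetic
fast = binary_dot(pack_signs(a), pack_signs(b), len(a))  # bit-wise version
```

A binarized convolution applies the same identity to every filter window, which is why bit-packing and memory layout matter for speed.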

The VIA Annotation Software for Images, Audio and Video

Abhishek Dutta
Andrew Zisserman

In this paper, we introduce a simple and standalone manual annotation tool for images, audio and video: the VGG Image Annotator (VIA). This is a lightweight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser. The VIA software allows human annotators to define and describe spatial regions in images or video frames, and temporal segments in audio or video. These manual annotations can be exported to plain text data formats such as JSON and CSV and are therefore amenable to further processing by other software tools. VIA also supports collaborative annotation of a large dataset by a group of human annotators. The BSD open source license of this software allows it to be used in any academic project or commercial application.

Shooter Localization Using Social Media Videos

Junwei Liang
Jay D. Aronson
Alexander Hauptmann

Nowadays a huge number of user-generated videos are uploaded to social media every second, capturing glimpses of events all over the world. These videos provide important and useful information for reconstructing events like the Las Vegas Shooting in 2017. In this paper, we describe a system that can estimate the shooter's location based only on a few user-generated videos that capture the gunshot sound. Our system first utilizes established video analysis techniques like video synchronization and gunshot temporal localization to organize the unstructured social media videos so that users can understand the event effectively. By combining multimodal information from visual, audio and geo-location data, our system can then visualize all possible locations of the shooter on the map. Our system provides a web interface for human-in-the-loop verification to ensure accurate estimations. We present the results of estimating the shooter's location in the Las Vegas Shooting in 2017 and show that our system is able to obtain an accurate location using only the first few gunshots. The full technical report and all relevant source code, including the web interface and machine learning models, are available.

A Modern C++ Parallel Task Programming Library

Chun-Xun Lin
Tsung-Wei Huang
Guannan Guo
Martin D. F. Wong

In this paper we present Cpp-Taskflow, a C++ parallel programming library that enables users to quickly develop parallel applications using the task dependency graph model. Developers formulate their application as a task dependency graph and Cpp-Taskflow manages the task execution and concurrency control. The task graph model is expressive and composable. It can express both regular and irregular parallel patterns, and developers can quickly compose large programs from small parallel modules. Cpp-Taskflow has an intuitive and unified API set. Users only need to learn the APIs to build and dispatch a task graph, and no complex parallel programming concept is required. We have conducted experiments using both micro-benchmarks and real-world applications, and Cpp-Taskflow outperforms state-of-the-art parallel programming libraries in both runtime and coding effort. Cpp-Taskflow is open-source and has been used in both industry and academic projects. From our users' feedback, we believe Cpp-Taskflow can benefit the industry and research community greatly through its ease of programming and inspire new research directions in multimedia system/software design.
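The task dependency graph model described above can be sketched in a few lines. This is an illustration in Python (3.9 or later, for `graphlib`), not Cpp-Taskflow's C++ API; the helper `run_task_graph` and the task names are hypothetical. The user declares tasks and their predecessors, and a small scheduler dispatches each task to a thread pool as soon as all of its predecessors have finished.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_task_graph(tasks, deps, max_workers=4):
    """Run callables in `tasks` so that each starts only after its predecessors.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of names that must finish first
    Returns the task names in the order they completed.
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    pending = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose predecessors are all done.
            for name in sorter.get_ready():
                pending[pool.submit(tasks[name])] = name
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the task
                name = pending.pop(future)
                finished.append(name)
                sorter.done(name)  # may unlock successor tasks
    return finished


# A diamond-shaped graph: A precedes B and C, which both precede D.
log = []
tasks = {name: (lambda name=name: log.append(name)) for name in "ABCD"}
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
order = run_task_graph(tasks, deps)
```

B and C have no mutual dependency, so they may run concurrently; only their position relative to A and D is fixed.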

Docker-Based Evaluation Framework for Video Streaming QoE in Broadband Networks

Cise Midoglu
Anatoliy Zabrovskiy
Ozgu Alay
Daniel Hoelbling-Inzko
Carsten Griwodz
Christian Timmerer

Video streaming is one of the top traffic contributors on the Internet and a frequent research subject. It is expected that streaming traffic will grow 4-fold for video globally and 9-fold for mobile video between 2017 and 2022. In this paper, we present an automated measurement framework for evaluating video streaming QoE in operational broadband networks, using headless streaming with a Docker-based client, and a server-side implementation allowing for the use of multiple video players and adaptation algorithms. Our framework allows for integration with the MONROE testbed and Bitmovin Analytics, which makes it possible to conduct large-scale measurements in different networks, including mobility scenarios, and to monitor different parameters in the application, transport, network, and physical layers in real time.

OpenVSLAM: A Versatile Visual SLAM Framework

Shinya Sumikura
Mikiya Shibuya
Ken Sakurada

In this paper, we introduce OpenVSLAM, a visual SLAM framework with high usability and extensibility. Visual SLAM systems are essential for AR devices, autonomous control of robots and drones, etc. However, conventional open-source visual SLAM frameworks are not appropriately designed as libraries called from third-party programs. To overcome this situation, we have developed a novel visual SLAM framework. This software is designed to be easily used and extended. It incorporates several useful features and functions for research and development. OpenVSLAM is released at https://github.com/xdspacelab/openvslam under the 2-clause BSD license.

SESSION: Session 5A: Summaries & Generation

Unsupervised Video Summarization with Attentive Conditional Generative Adversarial Networks

Xufeng He
Yang Hua
Tao Song
Zongpu Zhang
Zhengui Xue
Ruhui Ma
Neil Robertson
Haibing Guan

With the rapid growth of video data, video summarization techniques play a key role in reducing people's efforts to explore the content of videos by generating concise but informative summaries. Though supervised video summarization approaches have been well studied and have achieved state-of-the-art performance, unsupervised methods are still in high demand due to the intrinsic difficulty of obtaining high-quality annotations. In this paper, we propose a novel yet simple unsupervised video summarization method with attentive conditional Generative Adversarial Networks (GANs). Firstly, we build our framework upon Generative Adversarial Networks in an unsupervised manner. Specifically, the generator produces high-level weighted frame features and predicts frame-level importance scores, while the discriminator tries to distinguish between weighted frame features and raw frame features. Furthermore, we utilize a conditional feature selector to guide the GAN model to focus on more important temporal regions of the whole video. Secondly, we are the first to introduce frame-level multi-head self-attention for video summarization, which learns long-range temporal dependencies along the whole video sequence and overcomes the local constraints of recurrent units, e.g., LSTMs. Extensive evaluations on two datasets, SumMe and TVSum, show that our proposed framework surpasses state-of-the-art unsupervised methods by a large margin, and even outperforms most of the supervised methods. Additionally, we conduct an ablation study to unveil the influence of each component and of the parameter settings in our framework.

Generating 1 Minute Summaries of Day Long Egocentric Videos

Anuj Rathore
Pravin Nagar
Chetan Arora
C.V. Jawahar

The popularity of egocentric cameras and their always-on nature has led to an abundance of day-long first-person videos. Because of their extreme shake and highly redundant nature, these videos are difficult to watch from beginning to end and often require summarization tools for efficient consumption. However, traditional summarization techniques developed for static surveillance videos, or highly curated sports videos and movies, are either not suitable or simply do not scale to such hours-long videos in the wild. On the other hand, specialized summarization techniques developed for egocentric videos limit their focus to important objects and people. In this paper, we present a novel unsupervised reinforcement learning technique to generate video summaries from day-long egocentric videos. Our approach can be adapted to generate summaries of various lengths, making it possible to view even 1-minute summaries of one's entire day. The technique can also be adapted to various rewards, such as distinctiveness and indicativeness of the summary. When using the facial saliency-based reward, we show that our approach generates summaries focusing on social interactions, similar to the current state of the art (SOTA). Quantitative comparison on the benchmark Disney dataset shows that our method achieves significant improvement in Relaxed F-Score (RFS) (32.56 vs. 19.21) and BLEU score (12.12 vs. 10.64). Finally, we show that our technique can be applied to summarizing traditional, short, hand-held videos as well, where we improve the SOTA F-score on the benchmark SumMe and TVSum datasets from 41.4 to 45.6 and 57.6 to 59.1 respectively.

Informative Visual Storytelling with Cross-modal Rules

Jiacheng Li
Haizhou Shi
Siliang Tang
Fei Wu
Yueting Zhuang

Existing methods in the visual storytelling field often generate generic descriptions, while much of the meaningful content in the images remains unnoticed. This failure to generate informative stories can be attributed to the model's inability to capture enough meaningful concepts. The categories of these concepts include entities, attributes, actions, and events, which are in some cases crucial to grounded storytelling. To solve this problem, we propose a method to mine cross-modal rules that help the model infer these informative concepts given certain visual input. We first build multimodal transactions by concatenating the CNN activations and the word indices. Then we use an association rule mining algorithm to mine the cross-modal rules, which are used for concept inference. With the help of the cross-modal rules, the generated stories are more grounded and informative. Besides, our proposed method has the advantages of interpretability, expandability, and transferability, indicating potential for wider application. Finally, we leverage these concepts in our encoder-decoder framework with the attention mechanism. We conduct several experiments on the VIsual StoryTelling (VIST) dataset, the results of which demonstrate the effectiveness of our approach in terms of both automatic metrics and human evaluation. Additional experiments show that our mined cross-modal rules, used as additional knowledge, help the model achieve better performance when trained on a small dataset.
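The rule mining step in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: each transaction is taken to be a pair of sets (active visual units, words), only rules with a single visual antecedent and a single word consequent are considered, and the names `mine_cross_modal_rules`, `unit_12` and the thresholds are hypothetical. It shows the two standard quantities of association rule mining, support and confidence.

```python
from collections import Counter
from itertools import product


def mine_cross_modal_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return rules (visual -> word) as tuples (visual, word, support, confidence).

    support    = fraction of transactions containing both visual and word
    confidence = fraction of transactions containing visual that also contain word
    """
    n = len(transactions)
    visual_count = Counter()
    pair_count = Counter()
    for visual_items, words in transactions:
        visual_items, words = set(visual_items), set(words)
        visual_count.update(visual_items)
        pair_count.update(product(visual_items, words))
    rules = []
    for (visual, word), count in pair_count.items():
        support = count / n
        confidence = count / visual_count[visual]
        if support >= min_support and confidence >= min_confidence:
            rules.append((visual, word, support, confidence))
    return sorted(rules)


# Toy transactions (illustrative only): active CNN units paired with story words.
transactions = [
    ({"unit_12", "unit_40"}, {"beach", "sand"}),
    ({"unit_12"}, {"beach", "wave"}),
    ({"unit_12", "unit_7"}, {"beach"}),
    ({"unit_7"}, {"dog"}),
]
rules = mine_cross_modal_rules(transactions)
```

Here the only surviving rule is `unit_12 -> beach`, which holds in three of four transactions and in every transaction where `unit_12` is active. Because each rule is an explicit pair with two numbers attached, the mined knowledge is easy to inspect.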

LinesToFacePhoto: Face Photo Generation From Lines With Conditional Self-Attention Generative Adversarial Networks

Yuhang Li
Xuejin Chen
Feng Wu
Zheng-Jun Zha

In this paper, we explore the task of generating photo-realistic face images from lines. Previous methods based on conditional generative adversarial networks (cGANs) have shown their power to generate visually plausible images when a conditional image and an output image share well-aligned structures. However, these models fail to synthesize face images with a whole set of well-defined structures, e.g. eyes, noses, mouths, etc., especially when the conditional line map lacks one or several parts. To address this problem, we propose a conditional self-attention generative adversarial network (CSAGAN). We introduce a conditional self-attention mechanism to cGANs to capture long-range dependencies between different regions in faces. We also build a multi-scale discriminator. The large-scale discriminator enforces the completeness of global structures and the small-scale discriminator encourages fine details, thereby enhancing the realism of generated face images. We evaluate the proposed model on the CelebA-HD dataset by two perceptual user studies and three quantitative metrics. The experiment results demonstrate that our method generates high-quality facial images while preserving facial structures. Our results outperform state-of-the-art methods both quantitatively and qualitatively.

Sentence Specified Dynamic Video Thumbnail Generation

Yitian Yuan
Lin Ma
Wenwu Zhu

With the tremendous growth of videos over the Internet, video thumbnails, which provide previews of video content, are becoming increasingly crucial in shaping users' online search experiences. Conventional video thumbnails are generated once, purely based on the visual characteristics of videos, and then displayed as requested. Hence, such video thumbnails, which do not consider the users' search intentions, cannot provide a meaningful snapshot of the video content that users care about. In this paper, we define a distinctively new task, namely sentence specified dynamic video thumbnail generation, where the generated thumbnails not only provide a concise preview of the original video contents but also dynamically relate to the users' search intentions, with semantic correspondences to the users' query sentences. To tackle such a challenging task, we propose a novel graph convolved video thumbnail pointer (GTP). Specifically, GTP leverages a sentence specified video graph convolutional network to model both the sentence-video semantic interaction and the internal video relationships incorporated with the sentence information, based on which a temporal conditioned pointer network is then introduced to sequentially generate the sentence specified video thumbnails. Moreover, we annotate a new dataset based on ActivityNet Captions for the proposed new task, which consists of 10,000+ video-sentence pairs, each accompanied by an annotated sentence specified video thumbnail. We demonstrate that our proposed GTP outperforms several baseline methods on the created dataset, and thus believe that our initial results, along with the release of the new dataset, will inspire further research on sentence specified dynamic video thumbnail generation. Dataset and code are available at https://github.com/yytzsy/GTP

Curiosity-driven Reinforcement Learning for Diverse Visual Paragraph Generation

Yadan Luo
Zi Huang
Zheng Zhang
Ziwei Wang
Jingjing Li
Yang Yang

Visual paragraph generation aims to automatically describe a given image from different perspectives and organize sentences in a coherent way. In this paper, we address three critical challenges for this task in a reinforcement learning setting: mode collapse, delayed feedback, and the time-consuming warm-up for policy networks. Specifically, we propose a novel Curiosity-driven Reinforcement Learning (CRL) framework to jointly enhance the diversity and accuracy of the generated paragraphs. First, by modeling paragraph captioning as a long-term decision-making process and measuring the prediction uncertainty of state transitions as intrinsic rewards, the model is incentivized to memorize precise but rarely spotted descriptions to context, rather than being biased towards frequent fragments and generic patterns. Second, since the extrinsic reward from evaluation is not available until the complete paragraph is generated, we estimate its expected value at each time step with temporal-difference learning, by considering the correlations between successive actions. Then the estimated extrinsic rewards are complemented by dense intrinsic rewards produced from the derived curiosity module, in order to encourage the policy to fully explore the action space and find a global optimum. Third, discounted imitation learning is integrated for learning from human demonstrations, without separately performing the time-consuming warm-up in advance. Extensive experiments conducted on the Stanford image-paragraph dataset demonstrate the effectiveness and efficiency of the proposed method, improving the performance by 38.4% compared with the state of the art.

SESSION: Session 5B: Quality of Experience & Interaction

Quality Assessment of In-the-Wild Videos

Dingquan Li
Tingting Jiang
Ming Jiang

Quality assessment of in-the-wild videos is a challenging problem because of the absence of reference videos and the presence of shooting distortions. Knowledge of the human visual system can help establish methods for objective quality assessment of in-the-wild videos. In this work, we show that two eminent effects of the human visual system, namely content-dependency and temporal-memory effects, can be used for this purpose. We propose an objective no-reference video quality assessment method by integrating both effects into a deep neural network. For content-dependency, we extract features from a pre-trained image classification neural network for its inherent content-aware property. For temporal-memory effects, long-term dependencies, especially the temporal hysteresis, are integrated into the network with a gated recurrent unit and a subjectively-inspired temporal pooling layer. To validate the performance of our method, experiments are conducted on three publicly available in-the-wild video quality assessment databases: KoNViD-1k, CVD2014, and LIVE-Qualcomm. Experimental results demonstrate that our proposed method outperforms five state-of-the-art methods by a large margin, specifically, 12.39%, 15.71%, 15.45%, and 18.09% overall performance improvements over the second-best method VBLIINDS, in terms of SROCC, KROCC, PLCC and RMSE, respectively. Moreover, the ablation study verifies the crucial role of both the content-aware features and the modeling of temporal-memory effects. The PyTorch implementation of our method is released at https://github.com/lidq92/VSFA.

Cross-Reference Stitching Quality Assessment for 360° Omnidirectional Images

Jia Li
Kaiwen Yu
Yifan Zhao
Yu Zhang
Long Xu

Along with the development of virtual reality (VR), omnidirectional images play an important role in producing multimedia content with an immersive experience. However, despite various existing approaches for omnidirectional image stitching, how to quantitatively assess the quality of stitched images is still insufficiently explored. To address this problem, we first establish a novel omnidirectional image dataset containing stitched images as well as dual-fisheye images captured from standard quarters of 0°, 90°, 180°, and 270°. In this manner, when evaluating the quality of an image stitched from a pair of fisheye images (e.g., 0° and 180°), the other pair of fisheye images (e.g., 90° and 270°) can be used as the cross-reference to provide ground-truth observations of the stitching regions. Based on this dataset, we propose a set of Omnidirectional Stitching Image Quality Assessment (OS-IQA) metrics. In these metrics, the stitching regions are assessed by exploring the local relationships between the stitched image and its cross-reference with histogram statistics, perceptual hash and sparse reconstruction, while the whole stitched images are assessed by the global indicators of color difference and fitness of blind zones. Qualitative and quantitative experiments show that our method outperforms the classic IQA metrics and is highly consistent with human subjective evaluations. To the best of our knowledge, it is the first attempt to assess the stitching quality of omnidirectional images by using cross-references.

Generalized Playback Bar for Interactive Branched Video

Eric Lindskog
Jesper Wrang
Madeleine Bäckström
Linn Hallonqvist
Niklas Carlsson

During viewing of interactive "branched video", users are asked to make viewing choices that impact the storyline of the video playback. This type of video puts the users in control of their viewing experiences and provides content creators with great flexibility in how to personalize the viewing experience of individual viewers. However, in contrast to traditional video, where the use of a playback bar is the default for most (if not all) players, there currently does not exist any generic playback bar for branched video that helps visualize the upcoming branch choices. Instead, most branched video implementations are custom-made on a per-video basis (e.g., see custom-made Netflix and BBC movies) and do not use a playback bar. As an important step towards addressing this void, we present the first branched video player with a generalized playback bar that visualizes the tree-like video structure and the buffer levels of the different branches. The player is implemented in dash.js and is made public with this publication. It is the first of its kind, and allows both the playback bar and the presentation of branch choices to be customized with regard to visual appearance, functionality, and the content itself. Furthermore, the design is generic (making it applicable to any video) and allows content creators to easily create large numbers of branched movies using a simple metafile format. Finally, and most importantly, we perform a three-phase user study in which we evaluate the playback bar, compare it with alternative designs, and assess other branch-related features. The user study highlights the value of a branched video playback bar, and provides interesting insights into how it and other design customization features may best be integrated into a player.

360° Mulsemedia: A Way to Improve Subjective QoE in 360° Videos

Alexandra Covaci
Ramona Trestian
Estêvão Bissoli Saleme
Ioan-Sorin Comsa
Gebremariam Assres
Celso A. S. Santos
Gheorghita Ghinea

Previous research has shown that adding multisensory media (mulsemedia) to traditional audiovisual content has a positive effect on user Quality of Experience (QoE). However, the QoE impact of employing mulsemedia in 360° videos has remained unexplored. Accordingly, in this paper, a QoE study for watching a 360° video, with and without multisensory effects, in a full free-viewpoint VR setting is presented. The parametric space we considered to influence the QoE consists of the encoding quality and the motion level of the transmitted media. To achieve our research aim, we propose a wearable VR system that provides multisensory enhancement of 360° videos. Then, we utilise its capabilities to systematically evaluate the effects of multisensory stimulation on perceived quality degradation for videos with different motion levels and encoding qualities. Our results make a strong case for the inclusion of multisensory effects in 360° videos, as they reveal that both user-perceived quality and enjoyment are significantly higher when mulsemedia (as opposed to traditional multimedia) is employed in this context. Moreover, these observations hold true independent of the underlying 360° video encoding quality; thus QoE can be significantly enhanced with a minimal impact on networking resources.

ViProVoQ: Towards a Vocabulary for Video Quality Assessment in the Context of Creative Video Production

Simon Wedel
Michael Koppetz
Janto Skowronek
Alexander Raake

This paper presents a method for developing a consensus vocabulary to describe and evaluate the visual experience of videos. As a first result, a vocabulary characterizing the specific look of cinema-type video is presented. Such a vocabulary can be used to relate perceptual features of professional high-end image and video quality of experience (QoE) with the underlying technical characteristics and settings of the video systems involved in the creative content production process. For the vocabulary elicitation, a combination of different survey techniques was applied in this work. As the first step, individual interviews were conducted with experts of the motion picture industry on image quality in the context of cinematography. The data obtained from the interviews was used for the subsequent Real-time Delphi survey, where an extended group of experts worked out a consensus on key aspects of the vocabulary specification. Here, 33 experts were supplied with the anonymized results of the other panelists, which they could use to revise their own assessment. Based on this expert panel, the attributes collected in the interviews were verified and further refined, resulting in the final vocabulary proposed in this paper. Besides an attribute-based sensory evaluation of high-quality image, video and film material, applications of the vocabulary are the development of dimension-based image and video quality models, and the analysis of the multivariate relationship between quality-relevant perceptual attributes and technical system parameters.

DeepQuantizedCS: Quantized Compressive Video Recovery using Deep Convolutional Networks

Saurabh Kumar
Yagnesh Badiyani
Subhasis Chaudhuri

This work proposes a deep learning based approach to sparse signal recovery from compressively sensed (temporally or spectrally collapsed) and single bit quantized measurements. We demonstrate the effectiveness and applicability of this technique with the recovery of video and hyperspectral volumes from such compressed data. The compressively sensed data is represented by single bit quantization using an ordered dithering scheme, with modifications to the whole compressive acquisition pipeline for efficiency and ease of practical implementation. All this allows us to have a compressive acquisition setup which doubles as an extremely simple encoder, without a decoder in the loop, which is very efficient in power, memory, and computation, and which is suitable for onboard compression applications. When used as a compression engine, the proposed pipeline, unlike existing methods, requires only basic elements, namely adders, multipliers and comparators, to offer a significant compression ratio without the need for costly high precision ADCs or transform coding ASICs in the workflow.
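Single bit quantization with ordered dithering, as named in the abstract above, can be sketched with the classic 2x2 Bayer threshold map. This is an assumption made for illustration: the abstract does not specify the dither matrix or the authors' modifications, and the function name `dither_to_one_bit` is hypothetical. The sketch shows why only a comparator is needed per measurement: each value is compared against a position-dependent threshold.

```python
# Classic 2x2 Bayer index matrix; thresholds are (index + 0.5) / 4.
BAYER_2X2 = [[0, 2], [3, 1]]


def dither_to_one_bit(image):
    """Quantize values in [0, 1] to 0/1 using a tiled 2x2 ordered-dither map."""
    out = []
    for r, row in enumerate(image):
        out_row = []
        for c, value in enumerate(row):
            threshold = (BAYER_2X2[r % 2][c % 2] + 0.5) / 4.0
            out_row.append(1 if value > threshold else 0)  # one comparison per sample
        out.append(out_row)
    return out


# A flat mid-gray patch: half of the output bits are set, so the local
# average of the 1-bit output preserves the original intensity.
flat_gray = [[0.5] * 4 for _ in range(4)]
bits = dither_to_one_bit(flat_gray)
```

Because the thresholds vary across positions, intensity information survives in the spatial density of ones, which is what a recovery network can later exploit.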

SESSION: Session 5C: Transport & Delivery

Towards 6DoF HTTP Adaptive Streaming Through Point Cloud Compression

Jeroen van der Hooft
Tim Wauters
Filip De Turck
Christian Timmerer
Hermann Hellwagner

The increasing popularity of head-mounted devices and 360° video cameras allows content providers to offer virtual reality video streaming over the Internet, using a relevant representation of the immersive content combined with traditional streaming techniques. While this approach allows the user to freely move her head, her location is fixed by the camera's position within the scene. Recently, increased interest has been shown in free movement within immersive scenes, referred to as six degrees of freedom. One way to realize this is by capturing objects through a number of cameras positioned at different angles, and creating a point cloud which consists of the location and RGB color of a significant number of points in the three-dimensional space. Although the concept of point clouds has been around for over two decades, it recently received increased attention from ISO/IEC MPEG, which issued a call for proposals for point cloud compression. As a result, dynamic point cloud objects can now be compressed to bit rates in the order of 3 to 55 Mb/s, allowing feasible delivery over today's mobile networks. In this paper, we propose PCC-DASH, a standards-compliant means for HTTP adaptive streaming of scenes comprising multiple, dynamic point cloud objects. We present a number of rate adaptation heuristics which use information on the user's position and focus, the available bandwidth, and the client's buffer status to decide upon the most appropriate quality representation of each object. Through an extensive evaluation, we discuss the advantages and drawbacks of each solution. We argue that the optimal solution depends on the considered scene and camera path, which opens interesting possibilities for future work.
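
To make the idea of a position-aware rate adaptation heuristic concrete, the following Python sketch upgrades the objects closest to the user first, within a bandwidth budget. It is one plausible greedy rule written for illustration only; the object fields, the function name, and the rule itself are assumptions and are not taken from PCC-DASH, and the buffer status is ignored here.

```python
import math


def allocate_qualities(objects, user_position, bandwidth_mbps):
    """Return {object name: representation index} under a bandwidth budget.

    Each object is a dict with hypothetical keys:
      "name", "position" (x, y, z), and "bitrates_mbps" (ascending list).
    """
    # Start every object at its lowest representation.
    choice = {obj["name"]: 0 for obj in objects}
    used = sum(obj["bitrates_mbps"][0] for obj in objects)

    # Visit objects from nearest to farthest and upgrade while the budget allows.
    by_distance = sorted(
        objects, key=lambda o: math.dist(o["position"], user_position)
    )
    for obj in by_distance:
        rates = obj["bitrates_mbps"]
        for idx in range(1, len(rates)):
            extra = rates[idx] - rates[choice[obj["name"]]]
            if used + extra > bandwidth_mbps:
                break
            used += extra
            choice[obj["name"]] = idx
    return choice


if __name__ == "__main__":
    scene = [
        {"name": "far", "position": (5.0, 0.0, 0.0), "bitrates_mbps": [3, 10, 20]},
        {"name": "near", "position": (1.0, 0.0, 0.0), "bitrates_mbps": [3, 10, 20]},
    ]
    print(allocate_qualities(scene, (0.0, 0.0, 0.0), 25))
```

With a 25 Mb/s budget, the nearby object receives its highest representation while the distant one stays at the lowest.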

Lossy Intermediate Deep Learning Feature Compression and Evaluation

Zhuo Chen
Kui Fan
Shiqi Wang
Ling-Yu Duan
Weisi Lin
Alex Kot

With the unprecedented success of deep learning in computer vision tasks, many cloud-based visual analysis applications are powered by deep learning models. However, deep learning models are also characterized by high computational complexity and are task-specific, which may hinder the large-scale implementation of the conventional data communication paradigms. To enable a better balance among bandwidth usage, computational load and the generalization capability for cloud-end servers, we propose to compress and transmit intermediate deep learning features instead of visual signals and ultimately utilized features. The proposed strategy also provides a promising way for the standardization of deep feature coding. As a first attempt at this problem, we present a lossy compression framework and evaluation metrics for intermediate deep feature compression. Comprehensive experimental results show the effectiveness of our proposed methods and the feasibility of the proposed data transmission strategy. It is worth mentioning that the proposed compression framework and evaluation metrics have been adopted into the ongoing AVS (Audio Video Coding Standard Workgroup) - Visual Feature Coding Standard.

Band and Quality Selection for Efficient Transmission of Hyperspectral Images

Mohammad Amin Arab
Kiana Calagari
Mohamed Hefeeda

Due to recent technological advances in capturing and processing devices, hyperspectral imaging is becoming available for many commercial and military applications such as remote sensing, surveillance, and forest fire detection. Hyperspectral cameras provide rich information, as they capture each pixel along many frequency bands in the spectrum. The large volume of hyperspectral images as well as their high dimensionality make transmitting them over limited-bandwidth channels a challenge. To address this challenge, we present a method to prioritize the transmission of various components of hyperspectral data based on the application needs, the level of detail required, and the available bandwidth. This is unlike current works that mostly assume offline processing and the availability of all data beforehand. Our method jointly and optimally selects the spectral bands and their qualities to maximize the utility of the transmitted data. It also enables progressive transmission of hyperspectral data, in which approximate results are obtained with a small amount of data and can be refined with additional data. This is a desirable feature for large-scale hyperspectral imaging applications. We have implemented the proposed method and compared it against the state-of-the-art in the literature using hyperspectral imaging datasets. Our experimental results show that the proposed method achieves high accuracy, transmits a small fraction of the hyperspectral data, and significantly outperforms the state-of-the-art, with improvements of up to 35% in accuracy.

PiTree: Practical Implementation of ABR Algorithms Using Decision Trees

Zili Meng
Jing Chen
Yaning Guo
Chen Sun
Hongxin Hu
Mingwei Xu

Major commercial client-side video players employ adaptive bitrate (ABR) algorithms to improve user quality of experience (QoE). With the evolution of ABR algorithms, increasingly complex methods such as neural networks have been adopted to pursue better performance. However, these complex methods are too heavyweight to be directly implemented in client devices, especially mobile phones with very limited resources. Existing solutions suffer from a trade-off between algorithm performance and deployment overhead. To make the implementation of sophisticated ABR algorithms practical, we propose PiTree, a general, high-performance and scalable framework that can faithfully convert sophisticated ABR algorithms into lightweight decision trees to reduce deployment overhead. We also provide a theoretical upper bound on the optimization loss during the conversion. Evaluation results on three representative ABR algorithms demonstrate that PiTree could faithfully convert ABR algorithms into decision trees with <3% average performance degradation. Moreover, compared to the original implementation solutions, PiTree could save operating expenses for large content providers.

AdaCompress: Adaptive Compression for Online Computer Vision Services

Hongshan Li
Yu Guo
Zhi Wang
Shutao Xia
Wenwu Zhu

With the growth of computer vision based applications and services, an explosive amount of images have been uploaded to cloud servers which host such computer vision algorithms, usually in the form of deep learning models. JPEG has been used as the de facto compression and encapsulation method before one uploads the images, due to its wide adoption. However, standard JPEG configuration does not always perform well for compressing images that are to be processed by a deep learning model, e.g., the standard quality level of JPEG leads to 50% of size overhead (compared with the best quality level selection) on ImageNet under the same inference accuracy in popular computer vision models including InceptionNet, ResNet, etc. Even knowing this, designing a better JPEG configuration for online computer vision services is still extremely challenging: 1) Cloud-based computer vision models are usually a black box to end-users; thus it is difficult to design JPEG configuration without knowing their model structures. 2) JPEG configuration has to change when different users use it. In this paper, we propose a reinforcement learning based JPEG configuration framework. In particular, we design an agent that adaptively chooses the compression level according to the input image's features and backend deep learning models. Then we train the agent by reinforcement learning to adapt it to different deep learning cloud services, which act as the interactive training environment and feed back a reward that jointly considers accuracy and data size. In our real-world evaluation on Amazon Rekognition, Face++ and Baidu Vision, our approach can reduce the size of images by 1/2 to 1/3 while the overall classification accuracy only decreases slightly.
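
The interaction loop described in this abstract (choose a compression level, query the service, receive a reward that balances accuracy against data size) can be sketched in Python. This sketch deliberately swaps in a much simpler epsilon-greedy bandit that ignores image features, so it is not the agent proposed in the paper. The quality levels, the `simulated_service` stand-in, and the reward weighting are all assumptions made for illustration.

```python
import random

QUALITIES = [15, 35, 55, 75, 95]   # candidate JPEG quality levels (assumed)


def simulated_service(quality):
    """Toy stand-in for a black-box cloud vision API (not a real service).

    Returns (is_correct, relative_size) for an image compressed at `quality`.
    """
    is_correct = 1.0 if quality >= 35 else 0.0   # toy accuracy model
    relative_size = quality / 100.0              # toy size model
    return is_correct, relative_size


def reward(is_correct, relative_size, size_weight=0.5):
    # The reward trades off prediction correctness against uploaded data size.
    return is_correct - size_weight * relative_size


def train_agent(steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over quality levels; returns the preferred level."""
    rng = random.Random(seed)
    totals = [0.0] * len(QUALITIES)
    counts = [0] * len(QUALITIES)
    for step in range(steps):
        if step < len(QUALITIES):
            arm = step                               # try every level once
        elif rng.random() < epsilon:
            arm = rng.randrange(len(QUALITIES))      # explore
        else:
            means = [t / n for t, n in zip(totals, counts)]
            arm = means.index(max(means))            # exploit
        totals[arm] += reward(*simulated_service(QUALITIES[arm]))
        counts[arm] += 1
    means = [t / n for t, n in zip(totals, counts)]
    return QUALITIES[means.index(max(means))]


if __name__ == "__main__":
    print(train_agent())
```

In this toy environment the agent settles on the lowest quality level that still yields a correct prediction, which mirrors the size-versus-accuracy trade-off the abstract describes.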

Talking Video Heads: Saving Streaming Bitrate by Adaptively Applying Object-based Video Principles to Interview-like Footage

Maarten Wijnants
Sven Coppers
Gustavo Rovelo Ruiz
Peter Quax
Wim Lamotte

Over-the-top (OTT) streaming services like YouTube and Netflix induce massive amounts of video traffic. To combat the resulting network load, this article empirically explores the use of the object-based video (OBV) methodology that allows for the quality-variant HTTP Adaptive Streaming of the background and the foreground object(s) of a video scene, respectively. In particular, we study two alternative video object representation methods where the first meticulously follows the object contour, while the second uses axis-aligned bounding box enclosures. We subjectively compare both techniques to traditional, frame-based video compression in the context of live action content featuring talking persons. The resulting mixed methods data shows that (i) OBV-informed users tolerate substantial background quality degradations, and (ii) at an average bitrate reduction of 14 percent, perceptual differences between contour-based OBV and traditional encoding are small or even non-existent for the non-movie content in our corpus. Although our evaluation focuses on interview-like footage, our qualitative data hints that the presented results might extrapolate to other video genres. As such, our findings inform content owners and network operators about video bitrate saving opportunities with marginal perceptual impact.

SESSION: Session 5D: Art&Culture

Recognizing the Style of Visual Arts via Adaptive Cross-layer Correlation

Liyi Chen
Jufeng Yang

Visual arts consist of various art forms, e.g., painting, sculpture, architecture, etc., which not only enrich our lives but also involve works related to aesthetics, history, and culture. Different eras have different artistic appeals, and the art also has characteristics of each era in terms of expression and spiritual pursuit, in which style is an important attribute to describe visual arts. In order to recognize the style of visual arts more effectively, we present an end-to-end trainable architecture to learn a deep style representation. The main component of our architecture, adaptive cross-layer correlation, is inspired by the Gram matrix based correlation calculation. Our proposed method can adaptively weight features in different spatial locations based on their intrinsic similarity. Extensive experiments on three datasets demonstrate the superiority of the proposed method over several state-of-the-art methods.
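
For readers unfamiliar with the Gram matrix based correlation that inspires this method, the following Python sketch computes channel-by-channel correlations between flattened feature maps, either within one layer or across two layers of equal spatial size. The adaptive spatial weighting proposed in the paper is not reproduced, since the abstract does not specify it; the function name and the normalization by the number of locations are illustrative assumptions.

```python
def gram_matrix(features_a, features_b=None):
    """Channel-by-channel correlations between flattened feature maps.

    features_a: list of C_a channels, each a list of N spatial activations.
    features_b: optional second layer with the same N; defaults to features_a,
                which gives the ordinary (single-layer) Gram matrix.
    """
    if features_b is None:
        features_b = features_a
    n = len(features_a[0])
    # Entry (i, j) is the mean product of channel i and channel j over locations.
    return [
        [sum(x * y for x, y in zip(ch_a, ch_b)) / n for ch_b in features_b]
        for ch_a in features_a
    ]


if __name__ == "__main__":
    layer = [[1, 0, 1, 0], [0, 2, 0, 2]]
    other = [[1, 1, 1, 1]]
    print(gram_matrix(layer))          # within-layer correlations
    print(gram_matrix(layer, other))   # cross-layer correlations
```

Because the sum runs over all spatial locations with equal weight, spatial layout is discarded; the abstract's adaptive weighting replaces that uniform treatment with location-dependent weights.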

Melody Slot Machine: A Controllable Holographic Virtual Performer

Masatoshi Hamanaka

This paper describes the "Melody Slot Machine," an interactive music system that enables control over virtual performers. Conventional virtual players focus on what kind of output performance is given to the input performance, and the performance output is difficult to control. The Melody Slot Machine enables the user to select the melody to be played next by the virtual player by rotating a dial. Furthermore, the performer is projected on a holographic display, and the user can feel as if a real virtual player is there. To achieve this, the system needs to change the melody and the performance video.

Generating Captions for Images of Ancient Artworks

Shurong Sheng
Marie-Francine Moens

The neural encoder-decoder framework is widely adopted for image captioning of natural images. However, few works have contributed to generating captions for cultural images using this scheme. In this paper, we propose an artwork type enriched image captioning model where the encoder represents an input artwork image as a 512-dimensional vector and the decoder generates a corresponding caption based on the input image vector. The artwork type is first predicted by a convolutional neural network classifier and then merged into the decoder. We investigate multiple approaches to integrate the artwork type into the captioning model among which is one that applies a step-wise weighted sum of the artwork type vector and the hidden representation vector of the decoder. This model outperforms three baseline image captioning models for a Chinese art image captioning dataset on all evaluation metrics. One of the baselines is a state-of-the-art approach fusing textual image attributes into the captioning model for natural images. The proposed model also obtains promising results for another Egyptian art image captioning dataset.

GP-GAN: Towards Realistic High-Resolution Image Blending

Huikai Wu
Shuai Zheng
Junge Zhang
Kaiqi Huang

It is common but challenging to address high-resolution image blending in automatic photo editing applications. In this paper, we focus on solving the problem of high-resolution image blending, where the composite images are provided. We propose a framework called Gaussian-Poisson Generative Adversarial Network (GP-GAN) to leverage the strengths of the classical gradient-based approach and Generative Adversarial Networks. To the best of our knowledge, it is the first work that explores the capability of GANs in the high-resolution image blending task. Concretely, we propose the Gaussian-Poisson Equation to formulate the high-resolution image blending problem, which is a joint optimization constrained by the gradient and color information. Inspired by prior works, we obtain gradient information by applying gradient filters. To generate the color information, we propose a Blending GAN to learn the mapping between the composite images and the well-blended ones. Compared to the alternative methods, our approach can deliver high-resolution, realistic images with fewer bleedings and unpleasant artifacts. Experiments confirm that our approach achieves state-of-the-art performance on the Transient Attributes dataset. A user study on Amazon Mechanical Turk finds that the majority of workers are in favor of the proposed method. The source code is available at https://github.com/wuhuikai/GP-GAN, and there is also an online demo at http://wuhuikai.me/DeepJS.

Progressive Image Inpainting with Full-Resolution Residual Network

Zongyu Guo
Zhibo Chen
Tao Yu
Jiale Chen
Sen Liu

Recently, learning-based algorithms for image inpainting have achieved remarkable progress in dealing with square or irregular holes. However, they fail to generate plausible textures inside the damaged area because surrounding information is lacking. A progressive inpainting approach would be advantageous for eliminating central blurriness, i.e., restoring well and then updating masks. In this paper, we propose a full-resolution residual network (FRRN) to fill irregular holes, which proves to be effective for progressive image inpainting. We show that a well-designed residual architecture facilitates feature integration and texture prediction. Additionally, to guarantee completion quality during progressive inpainting, we adopt an N Blocks, One Dilation strategy, which assigns several residual blocks to one dilation step. Correspondingly, a step loss function is applied to improve the performance of intermediate restorations. The experimental results demonstrate that the proposed FRRN framework for image inpainting is much better than previous methods both quantitatively and qualitatively.

Facial Image-to-Video Translation by a Hidden Affine Transformation

Guangyao Shen
Wenbing Huang
Chuang Gan
Mingkui Tan
Junzhou Huang
Wenwu Zhu
Boqing Gong

There has been a prominent emergence of work on video prediction, aiming to extrapolate the future video frames from the past. Existing temporal-based methods are limited to certain numbers of frames. In this paper, we study video prediction from a single still image in the facial expression domain, a.k.a. facial image-to-video translation. Our main approach, dubbed AffineGAN, associates each facial image with an expression intensity and leverages an affine transformation in the latent space. AffineGAN allows users to control the number of frames to predict as well as the expression intensity for each of them. Unlike previous intensity-based methods, we derive an inverse formulation to the affine transformation, enabling automatic inference of the facial expression intensities from videos; manual annotation is not only tedious but also ambiguous, as people express themselves in various ways and have different opinions about the intensity of a facial image. Both quantitative and qualitative results verify the superiority of AffineGAN over the state of the art. Notably, in a Turing test with web faces, more than 50% of the facial expression videos generated by AffineGAN are considered real by the Amazon Mechanical Turk workers. This work could improve users' communication experience by enabling them to conveniently and creatively produce expression GIFs, which are popular art forms in online messaging and social networks.

SESSION: Panel 2

Legal and Ethical Challenges in Multimedia Research

Vivek K. Singh
Elisabeth André
Susanne Boll
Mireille Hildebrandt
David A. Shamma
Tat-Seng Chua

Multimedia research has now moved beyond laboratory experiments and is rapidly being deployed in real-life applications including advertisements, social interaction, search, security, automated driving, and healthcare. Hence, the developed algorithms now have a direct impact on the individuals using the abovementioned services and on society as a whole. While there is a huge potential to benefit society using such technologies, there is also an urgent need to identify the checks and balances to ensure that the impact of such technologies is ethical and positive. This panel will bring together an array of experts who have experience collecting large-scale datasets, building multimedia algorithms, and deploying them in practical applications, as well as a lawyer whose eyes have been on the fundamental rights at stake. They will lead a discussion on the ethics and lawfulness of dataset creation, licensing, privacy of individuals represented in the datasets, algorithmic transparency, algorithmic bias, explainability, and the implications of application deployment. Through an interactive process engaging the audience, the panel hopes to increase the awareness of such concepts in the multimedia research community and to initiate a discussion on community guidelines, all with the aim of setting the future direction of conducting multimedia research in a lawful and ethical manner.

SESSION: Grand Challenge: iQIYI Celebrity Video Identification

iQIYI Celebrity Video Identification Challenge

Yuanliu Liu
Peipei Shi
Bo Peng
He Yan
Yong Zhou
Bing Han
Yi Zheng
Chao Lin
Jianbin Jiang
Yin Fan
Tingwei Gao
Ganwen Wang
Jian Liu
Xiangju Lu
Junhui Liu
Danming Xie

We held the iQIYI Celebrity Video Identification Challenge at ACM Multimedia 2019. The purpose was to encourage research on video-based person identification. We released the iQIYI-VID-2019 dataset, which contains 200K videos of 10K celebrities. In this paper, we introduce the organization of the challenge, the dataset, the evaluation process, and the results.

ResidualDenseNetwork: A Simple Approach for Video Person Identification

Zixuan Huang
Yuan Chang
Weizhao Chen
Qiwei Shen
Jianxin Liao

Video identification is an important task in practical applications and industry. Based on the iQIYI-VID-2019 dataset, the ACM International Conference on Multimedia and iQIYI co-hosted the celebrity video identification challenge. We took part in the competition, proposing a new feature fusion method and designing a residual dense network which can improve video identification performance in complex scenes. Using face features only, we achieve 0.9035 in mean Average Precision (mAP), which wins second place on the leaderboard. At the same time, it is the best score obtained with official features only. It is worth mentioning that our model requires only 0.5G FLOPs and that the time required to predict the entire test dataset is only 2 to 5 minutes. Our method takes both accuracy and speed into account, which gives it strong practical significance.
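
Since the results in this session are reported as mean Average Precision (mAP), a short Python sketch of the standard retrieval definition may help in reading the scores. It assumes that each identity comes with a ranked list of retrieved videos marked relevant or not; the exact evaluation protocol of the challenge (for example, any cut-off on list length) is not given in the abstract, so those details are assumptions.

```python
def average_precision(ranked_labels):
    """AP for one query: ranked_labels[i] is True if the i-th result is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at each rank where a relevant item appears.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query AP values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    rankings = [
        [True, False, True],   # AP = (1/1 + 2/3) / 2
        [False, True],         # AP = (1/2) / 1
    ]
    print(mean_average_precision(rankings))
```

This variant normalizes by the number of relevant items that appear in the ranked list; evaluations that normalize by the total number of relevant items in the dataset would penalize missing results more strongly.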

Make the Best of Face Clues in iQIYI Celebrity Video Identification Challenge 2019

Xi Fang
Ying Zou

iQIYI-VID-2019 is the largest video dataset for multi-modal person identification. It is composed of more than 200k video clips of 10,034 celebrities. The face is a critical clue for person identification when it is visible in the video. However, face quality in a video may not always be good, and it also contains a lot of noise caused by detection and feature extraction. Meanwhile, conventional multi-modal person classification methods do not fully exploit the ability of the face modality. They do not make full use of face detection confidence and quality evaluation indicators, which are key information in the face modality. To address these issues, we develop a quality-based video face feature fusion method for inference, together with a quality-based face feature denoising and augmentation method for training. Our approach is based only on the 512-dimensional face features provided by the iQIYI-VID-2019 dataset. Using our proposed method, we have achieved an mAP score of 89.83%, which ranks 4th in the iQIYI Celebrity Video Identification Challenge 2019.

DeepMEF: A Deep Model Ensemble Framework for Video Based Multi-modal Person Identification

Chuanqi Dong
Zheng Gu
Zhonghao Huang
Wen Ji
Jing Huo
Yang Gao

The goal of video based multi-modal person identification is to identify a person of interest using multi-modal video features, such as the person's face, body, audio or head features. This task is challenging due to many factors, for example, variant body or face poses, poor face image quality, low frame resolution, etc. To address these problems, we propose a deep model ensemble framework, namely DeepMEF. Specifically, the proposed framework includes three novel modules, i.e., the video feature fusion module, the multi-modal feature fusion module and the model ensemble module. The first and second modules form the basic deep model for the ensemble, with the video feature fusion module fusing facial features from different frames into one. Then the multi-modal feature fusion module further fuses the face feature and features of other modalities for identification. In this work, we adopt the scene feature extracted by ourselves as the additional input of the multi-modal module. Finally, the model ensemble module improves the overall performance by combining the predictions of multiple multi-modal learners. The proposed method achieves a competitive result of 89.86% in mAP on the iQIYI-VID-2019 dataset, which helps us win third place in the 2019 iQIYI Celebrity Video Identification Challenge.

A Novel Deep Multi-Modal Feature Fusion Method for Celebrity Video Identification

Jianrong Chen
Li Yang
Yuanyuan Xu
Jing Huo
Yinghuan Shi
Yang Gao

In this paper, we develop a novel multi-modal feature fusion method for the 2019 iQIYI Celebrity Video Identification Challenge, which is held in conjunction with ACM MM 2019. The purpose of this challenge is to retrieve all the video clips of a given identity in the testing set. In this challenge, participants are encouraged to combine the multi-modal features of a celebrity, such as face features, head features, body features, and audio features, to achieve promising performance. The features from different modalities usually have their own influences on the results. To achieve better results, a novel weighted multi-modal feature fusion method is designed to obtain the final feature representation. After extensive experimental verification, we found that using different feature fusion weights for training and testing makes the method robust for multi-modal person identification. Experiments on the iQIYI-VID-2019 dataset show that our multi-modal feature fusion strategy effectively improves the accuracy of person identification. Specifically, for the competition, we use a single model to obtain a result of 0.8952 in mAP, which ranks in the top 5 among all the competitive results.
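
As a rough illustration of weighted multi-modal feature fusion, the Python sketch below L2-normalizes each modality's feature vector, scales it by a per-modality weight, and concatenates the results. The normalization, the concatenation, and the example weights are assumptions chosen for illustration; the abstract does not state how the authors' fusion is computed or which weights they use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_features(features, weights):
    """Concatenate L2-normalized modality features scaled by per-modality weights.

    features: dict mapping a modality name to its feature vector.
    weights:  dict mapping the same names to scalar weights (hypothetical values).
    """
    fused = []
    for name, vec in features.items():
        fused.extend(weights[name] * v for v in l2_normalize(vec))
    return fused


if __name__ == "__main__":
    features = {"face": [3.0, 4.0], "audio": [0.0, 2.0]}
    train_weights = {"face": 1.0, "audio": 0.5}   # illustrative only
    test_weights = {"face": 1.0, "audio": 0.25}   # illustrative only
    print(fuse_features(features, train_weights))
    print(fuse_features(features, test_weights))
```

Keeping the weights in a separate dictionary makes it straightforward to apply one set during training and another at test time, as the abstract reports doing.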

A Hierarchical Framwork with Improved Loss for Large-scale Multi-modal Video Identification

Shichuan Zhang
Zengming Tang
Hao Pan
Xinyu Wei
Jun Huang

This paper introduces our solution for the iQIYI Celebrity Video Identification Challenge. After analyzing the iQIYI-VID-2019 dataset, we find that the distribution of the dataset is very unbalanced and that there are many unlabeled samples in the validation set and the test set. To address these challenges, we propose a hierarchical system which combines different models and fuses base classifiers. To handle the false detections and low-quality features in the dataset, we use a simple and reasonable strategy to fuse features. In order to identify videos more accurately, we choose an improved loss function for the learning of base classifiers. Experimental results show that our framework performs well, and the evaluation conducted by the organizers shows that our final result takes ninth place online with an mAP of 88.08%.

SESSION: Grand Challenge: AI Meets Beauty

Cross-domain Beauty Item Retrieval via Unsupervised Embedding Learning

Zehang Lin
Haoran Xie
Peipei Kang
Zhenguo Yang
Wenyin Liu
Qing Li

Cross-domain image retrieval often suffers from insufficient labelled data in the real world. In this paper, we propose unsupervised embedding learning (UEL) for cross-domain beauty and personal care product retrieval to finetune the convolutional neural network (CNN). More specifically, UEL utilizes the non-parametric softmax to train the CNN model as instance-level classification, which reduces the influence of some inevitable problems (e.g., shape variations). In order to obtain better performance, we integrate a few existing retrieval methods trained on different datasets. Furthermore, a query expansion strategy (i.e., diffusion) is adopted to improve the performance. Extensive experiments conducted on a dataset including half a million images of beauty and personal product items (Perfect-500K) demonstrate the effectiveness of our proposed method. Our approach achieves 2nd place on the leaderboard of the Grand Challenge of AI Meets Beauty in ACM Multimedia 2019. Our code is available at: https://github.com/RetrainIt/Perfect-Half-Million-Beauty-Product-Image-R....

The Retrieval of the Beautiful: Self-Supervised Salient Object Detection for Beauty Product Retrieval

Jiawei Wang
Shuai Zhu
Jiao Xu
Da Cao

Beauty product retrieval is a challenging task due to the severe image variation issue in real-world scenes. In this work, to mitigate the data variation problem, we contribute a background-agnostic feature extractor, which is trained by a self-supervised salient object detection method. In particular, we first propose a foreground augmentation technique to acquire the augmentation image with its foreground mask. Next, a feature extractor with an attention pooling layer is proposed to learn background-agnostic representations by performing the salient object detection in a self-supervised manner. Finally, we ensemble the background-agnostic features of multiple models to perform the beauty product retrieval. Extensive experimental results have demonstrated the superiority of our proposed framework.

Beauty Product Retrieval Based on Regional Maximum Activation of Convolutions with Generalized Attention

Jun Yu
Guochen Xie
Mengyan Li
Haonian Xie
Lingyun Yu

Beauty and personal care product retrieval has attracted more and more research attention for its value in real life. However, this task remains very challenging because of data variations and complex backgrounds. In this paper, we propose a novel Generalized-attention Regional Maximal Activation of Convolutions (GRMAC) descriptor which helps to generate image features for retrieval. This method introduces an attention mechanism to reduce the influence of cluttered backgrounds and highlight the target, and thus contributes to enhancing the effectiveness of features and boosting the retrieval performance. Different from other attention-based methods, our method supports adjusting the mask with a hyperparameter p, which is more flexible and accurate in real applications. To demonstrate its effectiveness, we conduct experiments on the dataset containing more than half a million personal care products (Perfect-500K) and obtain remarkable results. Furthermore, we fuse multiple features from different models for further improvements. Finally, our team (USTC_NELSLIP) ranked 1st in the Grand Challenge of AI Meets Beauty in ACM Multimedia 2019 with a MAP score of 0.408614. Our code is available at: https://github.com/gniknoil/Perfect500K-Beauty-and-Personal-Care-Product...

Beauty Aware Network: An Unsupervised Method for Makeup Product Retrieval

Yi Zhang
Linzi Qu
Lihuo He
Wen Lu
Xinbo Gao

Makeup product retrieval has gained more and more attention for its wide application prospects. However, a challenging problem is that the dataset crawled from the Internet does not have annotated labels. Therefore, existing methods are unable to obtain well-trained networks. To solve this problem, this paper proposes a trainable network named Beauty Aware Network (BAN) for makeup product retrieval. The core of the proposed method is the use of an unsupervised clustering method to train the beauty classification network. Then a covariance pooling layer is introduced to leverage the statistical information. Finally, a multi-layer fusion strategy is used to capture informative clues in images. The proposed method can obtain simpler but more efficient features for beauty product retrieval with less computation cost. The experiments are conducted on the Perfect-500K dataset, which has more than half a million images. The results demonstrate the effectiveness of the Beauty Aware Network through its competitive performance.

SESSION: Grand Challenge: BioMedia

ACM Multimedia BioMedia 2019 Grand Challenge Overview

Steven Hicks
Michael Riegler
Pia Smedsrud
Trine B. Haugen
Kristin Ranheim Randel
Konstantin Pogorelov
Håkon Kvale Stensland
Duc-Tien Dang-Nguyen
Mathias Lux
Andreas Petlund
Thomas de Lange
Peter Thelin Schmidt
Pål Halvorsen

The BioMedia 2019 ACM Multimedia Grand Challenge is the first in a series of competitions focusing on the use of multimedia for different medical use-cases. In this year's challenge, the participants are asked to develop efficient algorithms which automatically detect a variety of findings commonly identified in the gastrointestinal (GI) tract (a part of the human digestive system). The purpose of this task is to develop methods to aid medical doctors performing routine endoscopy inspections of the GI tract. In this paper, we give a detailed description of the four different tasks of this year's challenge, present the datasets used for training and testing, and discuss how each submission is evaluated both qualitatively and quantitatively.

Gastrointestinal Tract Diseases Detection with Deep Attention Neural Network

Yuan Chang
Zixuan Huang
Weizhao Chen
Qiwei Shen

Medical image classification and diagnosis is currently a hot topic in the field of deep learning. The ACM International Conference on Multimedia and Simula co-hosted the Multimedia Grand Challenge, which aims to use artificial intelligence to aid the detection and classification of gastrointestinal images. This competition is divided into four subtasks, including detection, efficient detection, efficient detection (the same hardware for all the participants) and report generation. We participate in the multi-label detection task and propose a new attention model, which can effectively improve the network's ability to classify different types of categories. Our approach also uses a series of different techniques including multi-epoch fusion, automatic data augmentation selection, and adaptive threshold selection. Combining these techniques, we are able to achieve good classification results on the given dataset. Finally, our F1 score is 0.907 and our MCC is 0.952, obtained at high speed.
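
For reference, the two metrics reported in this abstract have the following standard binary definitions in terms of true/false positive and negative counts (TP, FP, TN, FN). How the challenge aggregates them over multiple classes is not stated in the abstract, so the formulas below should be read as the per-class building blocks only.

```latex
F_1 = \frac{2\,TP}{2\,TP + FP + FN},
\qquad
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```

Unlike the F1 score, the Matthews correlation coefficient (MCC) also takes true negatives into account and ranges from -1 to 1.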

Automatic Disease Detection and Report Generation for Gastrointestinal Tract Examination

Philipp Harzig
Moritz Einfalt
Rainer Lienhart

In this paper, we present a method to automatically identify diseases from videos of gastrointestinal (GI) tract examinations using a Deep Convolutional Neural Network (DCNN) that processes images from digital endoscopes. Our goal is to aid domain experts by automatically detecting abnormalities and generating a report that summarizes the main findings. We have implemented a model that uses two different DCNN architectures to generate our predictions, which are also capable of running on a mobile device. Using this architecture, we are able to predict findings on individual images. Combined with class activation maps (CAM), we can also automatically generate a textual report describing a video in detail while giving hints about the spatial location of findings and anatomical landmarks. Our work shows one way to use a multi-disease detection pipeline to also generate video reports that summarize key findings.

Enhancing Endoscopic Image Classification with Symptom Localization and Data Augmentation

Trung-Hieu Hoang
Hai-Dang Nguyen
Viet-Anh Nguyen
Thanh-An Nguyen
Vinh-Tiep Nguyen
Minh-Triet Tran

Inspired by recent advances in computer vision and deep learning, we propose new enhancements to tackle problems appearing in endoscopic image analysis, especially abnormality finding and anatomical landmark detection. In detail, a Residual Neural Network and Faster R-CNN are applied jointly in order to combine their advantages and improve the overall performance. In addition, novel data augmentation is designed and adapted to the corresponding domains. Our approaches achieve competitive results in terms of both accuracy and inference time in Medico: The 2018 Multimedia for Medicine Task and The Biomedia ACM MM Grand Challenge 2019. These results show the great potential of combining deep learning models and data augmentation in medical image analysis applications. Notably, more than 4900 bounding boxes localizing the symptoms of some classes from the KVASIR dataset, which we annotated and used in this project, are shared online for future research.

Adaptive Ensemble: Solution to the Biomedia ACM MM GrandChallenge 2019

Zhipeng Luo
Xiaowei Wang
Zhenyu Xu
Xue Li
Jiadong Li

In this paper we share our solution to the Biomedia ACM MM Grand Challenge 2019, which focuses on the gastrointestinal tract with the aim of detecting and classifying abnormalities. We first identify the challenges in this task, including the scarce and imbalanced data and the subtle inter-category variances. Based on this analysis, we propose a solution which leverages 10-fold cross validation to alleviate the over-fitting problem, and design a model to adaptively ensemble all sub-models belonging to all component models. Based on extensive offline evaluations, we verify the performance of the proposed technique under various settings. In the competition, we eventually achieved an MCC (Matthews correlation coefficient) score of 0.9480.
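
The combination of k-fold cross validation and sub-model ensembling described above can be illustrated with a minimal Python sketch. The function names `k_fold_indices` and `weighted_ensemble` are hypothetical, and weighting each sub-model by its validation score is only an assumed stand-in for the authors' adaptive ensemble, whose details the abstract does not give.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k (train, validation) index pairs."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        # Every sample outside the held-out fold is used for training.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, val))
    return splits


def weighted_ensemble(probabilities, val_scores):
    """Average per-class probabilities of sub-models, weighted by validation score.

    Assumption: weights proportional to validation scores; the paper's
    adaptive weighting may differ.
    """
    total = sum(val_scores)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(val_scores, probabilities)) / total
        for c in range(n_classes)
    ]


splits = k_fold_indices(25, k=5)
combined = weighted_ensemble([[0.8, 0.2], [0.4, 0.6]], val_scores=[3.0, 1.0])
```

Each of the k sub-models would be trained on one `train` split and scored on the matching `val` split, so every sample is held out exactly once.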

Biomedia ACM MM Grand Challenge 2019: Using Data Enhancement to Solve Sample Unbalance

Wenhua Meng
Shan Zhang
Xudong Yao
Xiaoshan Yang
Changsheng Xu
Xiaowen Huang

The Biomedia ACM MM Grand Challenge focuses on medical applications with the task of detecting and classifying abnormalities within the gastrointestinal (GI) tract. This paper reports several methods we applied as part of our submission to this challenge. The data we used come from the KVASIR dataset and the NERTHUS dataset, which contain training and test data in the form of images or videos. The main challenge of this task is the insufficiency and imbalance of the data, which significantly decrease performance. To solve this problem and achieve better results, we conduct multiple data enhancement operations on the data. The method proves to be effective. Apart from the operations applied to the data, we also test several classification structures including SCNN (Shallow Convolutional Neural Network), SCNN-SVM (Support Vector Machine), ResNet32-SVM, SVM, ResNet16, ResNet32, ResNet50, ResNet101 and Residual Attention Network. We finally chose ResNet50 as the main structure because it balances accuracy and efficiency. We obtain 91.51% precision, 87.45% sensitivity, 99.48% specificity, 87.93% F1-score (the harmonic mean of precision and sensitivity) and 91.40% MCC (Matthews correlation coefficient) on the test dataset.
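
For reference, the metrics quoted above can be computed from the four confusion-matrix counts of a single class. The following is a minimal Python sketch; the function name `binary_metrics` and the example counts are hypothetical and do not come from the paper. Challenge results are aggregated over several classes, so averaged scores need not satisfy the single-class identities exactly.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return precision, sensitivity, specificity, F1 and MCC for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    pr_sum = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is defined as 0 here when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```

Unlike F1, MCC uses all four counts, which is why it is often preferred when classes are imbalanced.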

SESSION: Grand Challenge: Content-based video relevance prediction

Overview of Content-Based Click-Through Rate Prediction Challenge for Video Recommendation

Peng Wang
Yunsheng Jiang
Chunxu Xu
Xiaohui Xie

Content cold-start is a core problem in the recommendation field: by solving it, service providers can mine the potential profit from content that has not yet been discovered by most users and provide more accurate personalized service to their users. In video recommendation, video and audio features should carry enough semantic information for the purpose of recommendation, and should therefore play a non-negligible role in content cold-start. This paper summarizes the Content Based Video Relevance Prediction Challenge held by Hulu, a top online streaming video platform in the US, at the ACM Multimedia 2019 conference. The challenge is a content-based CTR prediction task for video recommendation, for which millions of user interaction records and thousands of video features are released for research on related topics.

BERT4SessRec: Content-Based Video Relevance Prediction with Bidirectional Encoder Representations from Transformer

Xusong Chen
Dong Liu
Chenyi Lei
Rui Li
Zheng-Jun Zha
Zhiwei Xiong

This paper describes our solution for the Content-Based Video Relevance Prediction (CBVRP) challenge, where the task is to predict user click-through behavior on new TV series or new movies according to the user's historical behavior. We consider the task as a session-based recommendation problem and we focus on the modeling of the session. Thus, we use the Bidirectional Encoder Representations from Transformer (BERT) methodology and propose a BERT for session-based recommendation (BERT4SessRec) method. Our method has two stages: in the pre-training stage, we use all sessions as training data and train the bidirectional session encoder with the masking trick; in the fine-tuning stage, we use the provided click-through data and train the click-through prediction network. Our method achieves session representations with the help of BERT, which effectively captures the bidirectional correlation in each session. In addition, the pre-training stage makes full use of all sessions, overcoming the positive-negative imbalance problem of the click-through data. We report the results of using different kinds of features on the test set of the challenge, which verify the effectiveness of our method.
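
The masking trick used in the pre-training stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `[MASK]` token, the function name `mask_session` and the 15% masking rate (the rate used by the original BERT) are assumptions.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token


def mask_session(session, mask_prob=0.15, seed=None):
    """Randomly hide items of a session; return (masked inputs, targets).

    targets[i] holds the original item where position i was masked and
    None elsewhere, so a loss would be computed on masked positions only.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for item in session:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(item)
        else:
            masked.append(item)
            targets.append(None)
    return masked, targets


masked, targets = mask_session(["v12", "v7", "v33", "v5"], seed=0)
```

Because the encoder sees the items on both sides of each masked position, it can learn bidirectional correlations within a session without needing any click labels.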

Exploring Content-based Video Relevance for Video Click-Through Rate Prediction

Xun Wang
Yali Du
Leimin Zhang
Xirong Li
Miao Zhang
Jianfeng Dong

This paper describes our solution for the Hulu Challenge. To answer the challenge, we introduce two content-based models, namely, Cascading Mapping Network (CMN) and Relevant-Enhanced Deep Interest Network (REDIN). CMN predicts video Click-Through Rate (CTR) by predicting content-based video relevance. REDIN improves the popular Deep Interest Network mainly by adding an explicit video relevance constraint, which provides guidance for low-level video feature learning and is thus helpful for CTR prediction. Based on the two models, our solution obtains Area Under Curve (AUC) scores of 0.6022 and 0.6155 on the TV-shows and Movie tracks, respectively. Moreover, we are one of only two teams that scored over 0.6 on both tracks. The results demonstrate the effectiveness and stability of our proposed solution.
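
As a reminder of what the AUC scores above measure, the sketch below computes AUC as the probability that a randomly chosen clicked item is scored higher than a randomly chosen non-clicked one, with ties counted as one half. The function name `auc` and the toy data are hypothetical; the challenge's official evaluation script may aggregate results differently, for example per user.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
example = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

A score of 0.5 corresponds to random ranking, which gives some context for values slightly above 0.6 on a cold-start task.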

Content-Based Video Relevance Prediction with Multi-view Multi-level Deep Interest Network

Zeyuan Chen
Kai Xu
Wei Zhang

This paper presents our solution for the Hulu Content-Based Video Relevance Prediction (CBVRP) challenge, which focuses on cold-start videos as candidates. The key to success in this prediction scenario is to learn effective user and video representations. To this end, we develop a multi-view multi-level deep interest network (MMDIN), which involves a multi-level deep interest network to learn user and video representations in a single view, and a late fusion technique to integrate their multi-view representations corresponding to different types of video features. In this way, cold-start video prediction can be handled well with user representations learned from their past interaction behaviors with videos and video representations based on their multiple types of content profiles.

Cold-Start Representation Learning: A Recommendation Approach with Bert4Movie and Movie2Vec

Xinran Zhang
Xin Yuan
Yunwei Li
Yanru Zhang

Video relevance computation is one of the most important tasks for personalized online streaming services. Given the relevance of videos and viewer feedback, the system can provide personalized recommendations, which help viewers discover more content of interest in most online services. However, the computation of a video relevance table is based on viewers' implicit feedback such as watch and search history, which performs poorly for newly added "cold-start" videos. To address the cold-start problem, we introduce Bert4Movie, a recommendation method based on Bidirectional Encoder Representations from Transformers that considers the continuity of an ordered watching plan and is trained on the sequence of the path from start to end. In addition, we propose a method named Movie2Vec to represent the videos in a different way. Our method has been used in our solutions to the Content-based Video Relevance Prediction Challenge and achieved a significant improvement in AUC.

Time-aware Session Embedding for Click-Through-Rate Prediction

Qidi Xu
Haocheng Xu
Weilong Chen
Chaojun Han
Haoyang Li
Wenxin Tan
Fumin Shen
Heng Tao Shen

TV series correlation computing is one of the most important tasks of personalized online streaming services. With the relevance of TV series and viewer feedback, we can calculate the TV series correlation table based on the viewer's implicit feedback, which does not perform well for newly added "cold start" TV series. In this paper, we aim to improve correlation computing within the cold-start phase. We propose a framework named Time-aware Session Embedding (TSE), with Item Embedding in Session and a Time Decay Factor for multimodal recommendation. We use a lower-dimensional vector as the item embedding and calculate each item's factor by taking time decay into account. The framework performed well in the Content-based Video Relevance Prediction Challenge, in which we took first place.
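
One way to picture a time decay factor applied to item embeddings is a weighted average in which older interactions count less. The sketch below is an assumption-laden illustration only: the exponential half-life form, the function name `session_embedding` and the parameter `half_life` are hypothetical, and the abstract does not specify the decay function the authors used.

```python
def session_embedding(item_vectors, timestamps, now, half_life=7.0):
    """Time-decayed average of item embeddings.

    Assumption: an interaction's weight halves every `half_life` time units.
    """
    weights = [0.5 ** ((now - t) / half_life) for t in timestamps]
    total = sum(weights)
    dim = len(item_vectors[0])
    return [
        sum(w * v[d] for w, v in zip(weights, item_vectors)) / total
        for d in range(dim)
    ]


# Two items watched 7 days apart; the recent one gets twice the weight.
emb = session_embedding([[1.0, 0.0], [0.0, 1.0]], [0.0, 7.0], now=7.0)
```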

SESSION: Grand Challenge: Live Video Streaming

The ACM Multimedia 2019 Live Video Streaming Grand Challenge

Gang Yi
Dan Yang
Abdelhak Bentaleb
Weihua Li
Yi Li
Kai Zheng
Jiangchuan Liu
Wei Tsang Ooi
Yong Cui

Live video streaming using Dynamic Adaptive Streaming over HTTP (DASH) is challenging: it requires low end-to-end latency and is more prone to stalls, and the receiver has to decide online which representation at which bitrate to download and whether to adjust the playback speed to control the latency. To encourage the research community to come together to address this challenge, we organize the Live Video Streaming Grand Challenge at ACM Multimedia 2019. This grand challenge provides a simulation platform on which the participants can implement their adaptive bitrate (ABR) logic and latency control algorithm, and then benchmark against each other using a common set of video traces and network traces. The ABR algorithms are evaluated using a common Quality-of-Experience (QoE) model that accounts for playback bitrate, latency constraint, frame-skipping penalty, and rebuffering penalty.
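
A QoE model of the kind described above is often written as a reward for bitrate minus weighted penalties. The sketch below assumes a simple linear form; the function name `qoe`, the tuple layout and all weights are hypothetical and are not the official coefficients of the challenge.

```python
def qoe(steps, w_rebuf=1.5, w_latency=0.005, w_skip=0.5):
    """Sum a linear QoE over playback steps.

    Each step is (bitrate_mbps, rebuffer_s, latency_s, skipped_s).
    Assumption: linear penalties with illustrative weights.
    """
    total = 0.0
    for bitrate_mbps, rebuffer_s, latency_s, skipped_s in steps:
        total += (
            bitrate_mbps
            - w_rebuf * rebuffer_s
            - w_latency * latency_s
            - w_skip * skipped_s
        )
    return total


# Second step plays a lower bitrate and stalls for one second.
score = qoe([(2.0, 0.0, 0.0, 0.0), (1.0, 1.0, 0.0, 0.0)])
```

The terms pull in opposite directions: a higher bitrate raises the reward but also the risk of rebuffering, while skipping frames reduces latency at its own cost.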

A Hybrid Control Scheme for Adaptive Live Streaming

Huan Peng
Yuan Zhang
Yongbei Yang
Jinyao Yan

Live streaming is more challenging than on-demand streaming, because low latency is a strong requirement in addition to the trade-off between video quality and playback jitter. To balance several inherently conflicting performance metrics and improve the overall quality of experience (QoE), many adaptation schemes have been proposed. Bitrate adaptation is one of the major solutions for video streaming under time-varying network conditions, and it works even better when combined with latency control methods such as adaptive playback rate control and frame dropping. However, designing an algorithm that combines these adaptation schemes remains a challenging problem. To tackle this problem, we propose a hybrid control scheme for adaptive live streaming, namely HYSA, based on heuristic playback rate control, latency-constrained bitrate control and QoE-oriented adaptive frame dropping. The proposed scheme utilizes Kaufman's Adaptive Moving Average (KAMA) to predict segment bitrates for better rate decisions. Extensive simulations demonstrate that HYSA outperforms most of the existing adaptation schemes on overall QoE.
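
Kaufman's Adaptive Moving Average is an exponential average whose smoothing constant follows an efficiency ratio: it reacts quickly when the series trends and slowly when it is noisy. The sketch below is a generic implementation, not HYSA's code; the window lengths 10, 2 and 30 are the indicator's conventional defaults rather than values reported by the authors, and seeding with the last sample before the first output is one common choice.

```python
def kama(values, er_window=10, fast=2, slow=30):
    """Return KAMA values for indices er_window .. len(values) - 1."""
    if len(values) <= er_window:
        raise ValueError("need more than er_window samples")
    fast_sc = 2.0 / (fast + 1)
    slow_sc = 2.0 / (slow + 1)
    smoothed = values[er_window - 1]  # seed (assumed convention)
    out = []
    for i in range(er_window, len(values)):
        # Efficiency ratio: net change divided by the sum of absolute steps.
        change = abs(values[i] - values[i - er_window])
        volatility = sum(
            abs(values[j] - values[j - 1])
            for j in range(i - er_window + 1, i + 1)
        )
        er = change / volatility if volatility else 0.0
        # Smoothing constant moves between the slow and fast EMA constants.
        sc = (er * (fast_sc - slow_sc) + slow_sc) ** 2
        smoothed = smoothed + sc * (values[i] - smoothed)
        out.append(smoothed)
    return out


trend = kama([float(v) for v in range(20)])
```

In a streaming client, the most recent smoothed value could serve as the prediction for the next segment's bitrate.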

HD3: Distributed Dueling DQN with Discrete-Continuous Hybrid Action Spaces for Live Video Streaming

Xiaolan Jiang
Yusheng Ji

Live streaming applications are becoming increasingly popular, and they expose new technical challenges compared to regular video streaming. High video quality and low latency are two main requirements in live streaming scenarios. A live streaming application needs to make bitrate and target buffer level decisions as well as set a continuous latency limit value to skip video frames. We formulate the live streaming task as a reinforcement learning problem with discrete-continuous hybrid action spaces, then propose a novel deep reinforcement learning (DRL) algorithm, HD3, which can take hybrid actions to solve it. We compare HD3 with several state-of-the-art DRL algorithms on various network environments, and the simulation results show that HD3 outperforms all the other comparison schemes. We emphasize that HD3 generates a single agent which can perform well on different network conditions and video scenes.

Continuous Bitrate & Latency Control with Deep Reinforcement Learning for Live Video Streaming

Ruying Hong
Qiwei Shen
Lei Zhang
Jing Wang

In this paper, we introduce a continuous bitrate control and latency control model for the Live Video Streaming Challenge. Our model is based on Deep Deterministic Policy Gradient, which is popular for continuous control tasks. It achieves fine-grained control through continuous actions and does not need to discretize the continuous "latency limit", a buffer threshold used to minimize end-to-end delay by frame skipping. In all considered live video scenarios, our model provides a better quality of experience, improving average QoE by 3.6% over DQN, which discretizes the "latency limit". Additionally, challenge results show the effectiveness and applicability of the proposed model, which achieved top performance in 3 different networks (high, low and oscillating throughput) and ranked second in the network with medium throughput.

BitLat: Bitrate-adaptivity and Latency-awareness Algorithm for Live Video Streaming

Chen Wang
Jianfeng Guan
Tongtong Feng
Neng Zhang
Tengfei Cao

With the growing popularity and prosperity of live streaming applications, users naturally face quality of experience (QoE) degradation, especially in dynamic environments, arising from non-negligible factors such as high latency and intermittent bitrate. In this paper, we propose an efficient adaptive bitrate (ABR) algorithm called BitLat to achieve both bitrate control and latency control. BitLat is based on reinforcement learning, which gives it strong adaptability for dealing with complex and changing network conditions. More specifically, we determine the specific value of the latency threshold with the help of a current advanced algorithm, and design the structure of the neural network used in reinforcement learning, the features used in the training process, and the corresponding reward function. Additionally, we use the Dynamic Reward Method to further enhance the performance. Comprehensive experiments demonstrate that BitLat outperforms the state-of-the-art ABR algorithms, with improvements in average QoE of 20%-62%.

Latency Aware Adaptive Video Streaming using Ensemble Deep Reinforcement Learning

Yin Zhao
Qi-Wei Shen
Wei Li
Tong Xu
Wei-Hua Niu
Si-Ran Xu

The development of live broadcasting presents many new technical challenges for adaptive bitrate (ABR) algorithms, which require not only stable and high-quality transmission but also low end-to-end latency. Reinforcement learning (RL) achieves promising results and can learn ABR algorithms automatically without using any pre-programmed control rules. However, existing methods only consider bitrate control and ignore latency control. Therefore, in order to effectively reduce the end-to-end latency, we propose an independent latency limit model to control the frame skipping. Moreover, a model ensemble algorithm is implemented to reduce performance variance and improve the user quality of experience (QoE). Experimental results show that our model outperforms baseline methods and demonstrate the effectiveness of our model.

SESSION: Grand Challenge: Relation Understanding in Videos

Relation Understanding in Videos: A Grand Challenge Overview

Xindi Shang
Junbin Xiao
Donglin Di
Tat-Seng Chua

ACM Multimedia 2019 Video Relation Understanding Challenge is the first grand challenge aiming at pushing video content analysis at the relational and structural level. This year, the challenge asks the participants to explore and develop innovative algorithms to detect object entities and their relations based on a large-scale user-generated video dataset. The tasks will advance the foundation of future visual systems that are able to perform complex inferences. This paper presents an overview of the grand challenge, including background, detailed descriptions of the three proposed tasks, the corresponding datasets for training, validation and testing, and the evaluation process.

Video Visual Relation Detection via Multi-modal Feature Fusion

Xu Sun
Tongwei Ren
Yuan Zi
Gangshan Wu

Video visual relation detection is a meaningful research problem, which aims to build a bridge between dynamic vision and language. In this paper, we propose a novel video visual relation detection method with multi-modal feature fusion. First, we densely detect objects on each frame with the state-of-the-art video object detection model, flow-guided feature aggregation (FGFA), and generate object trajectories by linking the temporally independent objects with Seq-NMS and a KCF tracker. Next, we break the relation candidates, i.e., co-occurring object trajectory pairs, into short-term segments and predict relations with spatial-temporal features and language context features. Finally, we greedily associate the short-term relation segments into complete relation instances. The experimental results show that our proposed method outperforms other methods by a large margin, which also earned us first place in the visual relation detection task of the Video Relation Understanding Challenge (VRU), ACM MM 2019.
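
The final association step can be pictured as merging short-term segments that carry the same relation triplet and touch or overlap in time. The sketch below is a deliberately simplified illustration for a single object pair: the function name `associate`, the tuple layout and the mean-score rule are assumptions, and the authors' greedy procedure also has to take trajectory overlap and prediction confidence into account.

```python
def associate(segments):
    """Merge temporally overlapping segments that share a relation triplet.

    segments: list of (triplet, start_frame, end_frame, score) for one
    object pair. Returns merged instances with the mean segment score.
    """
    merged = []
    for triplet, start, end, score in sorted(segments, key=lambda s: (s[0], s[1])):
        last = merged[-1] if merged else None
        if last and last["triplet"] == triplet and start <= last["end"]:
            # Extend the current instance instead of opening a new one.
            last["end"] = max(last["end"], end)
            last["scores"].append(score)
        else:
            merged.append(
                {"triplet": triplet, "start": start, "end": end, "scores": [score]}
            )
    return [
        (m["triplet"], m["start"], m["end"], sum(m["scores"]) / len(m["scores"]))
        for m in merged
    ]


chase = ("dog", "chase", "ball")
bite = ("dog", "bite", "ball")
instances = associate(
    [(chase, 0, 30, 0.8), (chase, 15, 45, 0.6), (chase, 60, 90, 0.9), (bite, 30, 60, 0.5)]
)
```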

Relation Understanding in Videos

Sipeng Zheng
Xiangyu Chen
Shizhe Chen
Qin Jin

In this paper, we present our solutions to the grand challenge task "Relation Understanding in Videos" in ACM Multimedia 2019. The challenge task aims to detect instances of target visual relations in a video, where a visual relation instance is represented by a relation triplet

SESSION: Grand Challenge: Social Media Prediction

SMP Challenge: An Overview of Social Media Prediction Challenge 2019

Bo Wu
Wen-Huang Cheng
Peiye Liu
Bei Liu
Zhaoyang Zeng
Jiebo Luo

"SMP Challenge" aims to discover novel prediction tasks for numerous data on social multimedia and seek excellent research teams. Making predictions via social multimedia data (e.g. photos, videos or news) is not only helps us to make better strategic decisions for the future, but also explores advanced predictive learning and analytic methods on various problems and scenarios, such as multimedia recommendation, advertising system, fashion analysis etc.

In the SMP Challenge at ACM Multimedia 2019, we introduce a novel prediction task, Temporal Popularity Prediction, which focuses on predicting the future interaction or attractiveness (in terms of clicks, views or likes etc.) of new online posts in social media feeds before uploading. We also collected and released a large-scale SMPD benchmark with over 480K posts from 69K users. In this paper, we define the challenge problem, give an overview of the dataset, present statistics on the rich information in the data and annotations, and design the accuracy and correlation evaluation metrics for temporal popularity prediction in the challenge.

Feature Construction for Posts and Users Combined with LightGBM for Social Media Popularity Prediction

Ziliang He
Zijian He
Jiahong Wu
Zhenguo Yang

In this paper, we propose to address the Social Media Prediction (SMP) Challenge by using a regression model with multiple features extracted from various aspects of posts. More specifically, we extract textual features and numeric features, and construct user-related features to this end. For textual features, the rich texts of the posts are integrated to build a corpus, on which we train a language model to learn vector representations of semantic information. For numeric features, we construct several new features, including the length and the number of words of the title. For the user-related features, we design a "user id count" based on the number of times each user posted in the entire dataset to reflect the activity of the user. Finally, the multiple features are fed into LightGBM to predict popularity scores. Extensive experiments conducted on the Social Media Prediction Dataset show the superiority of our method. Our approach achieves 3rd place in the SMP Challenge.
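
The constructed numeric and user-related features described above are simple to derive. The sketch below is a minimal Python illustration; the field names `uid` and `title` and the function name `build_features` are hypothetical and may not match the dataset's actual schema.

```python
from collections import Counter


def build_features(posts):
    """Add title length, title word count and per-user post count to each post."""
    # "User id count": how many posts each user has in the whole dataset.
    posts_per_user = Counter(post["uid"] for post in posts)
    features = []
    for post in posts:
        title = post["title"]
        features.append(
            {
                "title_length": len(title),
                "title_words": len(title.split()),
                "user_id_count": posts_per_user[post["uid"]],
            }
        )
    return features


rows = build_features(
    [
        {"uid": "u1", "title": "sunset at the beach"},
        {"uid": "u2", "title": "my cat"},
        {"uid": "u1", "title": "city lights"},
    ]
)
```

Feature rows of this kind could then be combined with the text embeddings and passed to a gradient-boosting regressor.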

Catboost-based Framework with Additional User Information for Social Media Popularity Prediction

Peipei Kang
Zehang Lin
Shaohua Teng
Guipeng Zhang
Lingni Guo
Wei Zhang

In this paper, a Catboost-based framework is proposed to predict social media popularity. The framework consists of two components: feature representation and Catboost training. In the feature representation component, numerical features are used directly, while categorical features are converted into numerical features by the ordered target statistics method in Catboost. In addition, some additional user information is tracked to enrich the feature space. In the other component, Catboost is adopted as the regression model, which is trained using post-related, user-related and additional user information. Moreover, to make full use of the dataset for model training, a dataset augmentation strategy based on pseudo labels is proposed. This strategy involves two-stage training. In the first stage, it trains a first-stage model that is used to assign pseudo labels to the test set. In the next stage, a final model is trained on the new training set, which includes the original validation set and the pseudo-labeled test set. The proposed method achieves 2nd place on the leaderboard of the Grand Challenge of Social Media Prediction.
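
The idea behind ordered target statistics is to encode each categorical value using only the targets of rows that come earlier in a random permutation, which avoids leaking a row's own label into its feature. The sketch below is a plain-Python illustration of that idea, not CatBoost's internal code; the function name `ordered_target_stats`, the smoothing weight `a` and the assumption that rows are already shuffled are all hypothetical choices.

```python
def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode categories with smoothed means of preceding targets only.

    Assumes the rows are already in random (permuted) order.
    encoded[i] = (sum of earlier targets with the same category + a * prior)
                 / (number of earlier rows with the same category + a)
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        encoded.append((s + a * prior) / (c + a))
        # Update running statistics only after encoding the current row.
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded


codes = ordered_target_stats(["x", "y", "x", "x"], [1.0, 0.0, 3.0, 5.0], prior=2.0)
```

The first occurrence of any category falls back to the prior, since no earlier rows are available for it.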

Social Media Popularity Prediction: A Multiple Feature Fusion Approach with Deep Neural Networks

Keyan Ding
Ronggang Wang
Shiqi Wang

Social media popularity prediction (SMPD) aims to predict the popularity of posts shared on online social media platforms. This task is crucial for content providers and consumers in a wide range of real-world applications, including multimedia advertising, recommendation systems and trend analysis. In this paper, we propose to fuse features from multiple sources with deep neural networks (DNNs) for popularity prediction. Specifically, high-level image and text features are extracted by advanced pretrained DNNs, and numerical features are captured from the metadata of the posts. All of the features are concatenated and fed into a regressor with multiple dense layers. Experiments demonstrate the effectiveness of the proposed model on the ACM Multimedia Challenge SMPD2019 dataset. We also verify the importance of each feature via univariate tests and an ablation study, and provide insights into feature combination for social media popularity prediction.

Popularity Prediction of Social Media based on Multi-Modal Feature Mining

Chih-Chung Hsu
Li-Wei Kang
Chia-Yen Lee
Jun-Yi Lee
Zhong-Xuan Zhang
Shao-Min Wu

Popularity prediction for social media has become an increasingly attractive issue in recent years. Social media posts consist of multiple types of data, such as images, metadata, and text information. In order to effectively predict the popularity of a specified post in a social network, fusing multiple features from heterogeneous data is required. In this paper, a popularity prediction framework for social media based on multi-modal feature mining is presented. First, we discover image semantic features by extracting image descriptions generated by image captioning. Second, text-based feature engineering is used to construct an effective word-to-vector model. The trained word-to-vector model is used to encode the text information and the semantic image features. Finally, an ensemble regression approach is proposed to aggregate these encoded features and learn the final regressor. Extensive experiments show that the proposed method significantly outperforms other state-of-the-art regression models. We also show that the multi-modal approach can effectively improve performance in the social media prediction challenge.

Social Media Popularity Prediction Based on Visual-Textual Features with XGBoost

Junhong Chen
Dayong Liang
Zhanmo Zhu
Xiaojing Zhou
Zihan Ye
Xiuyun Mo

Popularity prediction for social media is an efficient way for scientists to explore advanced predictive trends and make better strategic decisions for the future. In this paper, we propose a framework that uses visual-textual features combined with XGBoost for popularity prediction. More specifically, the framework contains three procedures: visual-textual feature extraction, feature fusion and XGBoost regression. To extract the visual-textual features, on the one hand, we first adopt a one-hot encoder to encode the metadata of the posts, and then apply a word2vec model to produce word embeddings. On the other hand, we adopt a shape descriptor called Hu moments to extract the visual features from the images. In addition, we exploit user information, e.g. users' followings and followers, to provide extra social features to the regression. After that, we fuse the multi-modal features and input them directly to XGBoost for popularity prediction. Extensive experiments conducted on the SMPD2019 dataset demonstrate the effectiveness of our system. Furthermore, our approach achieves 3rd place on the leaderboard of the Grand Challenge in ACM Multimedia 2019.

SESSION: Tutorials

Learning from 3D (Point Cloud) Data

Winston H. Hsu

Learning on (3D) point clouds is vital for a broad range of emerging applications such as autonomous driving, robot perception, augmented reality, gaming, and security. Such needs have increased recently due to the prevalence of 3D sensors such as LiDAR, 3D cameras, and RGB-D. Point clouds consist of thousands to millions of points; they contain rich information and are complementary to the traditional 2D cameras that we have been working on for years in the multimedia (or vision) community. 3D learning algorithms on point cloud data are new, and exciting, for numerous core problems such as 3D classification, detection, semantic segmentation, and face recognition. The tutorial covers the requirements of point cloud data, the background of capturing the data, 3D representations, emerging applications, core problems, state-of-the-art learning algorithms, and future research opportunities.

AutoML and Meta-learning for Multimedia

Wenwu Zhu
Xin Wang
Wenpeng Zhang

AutoML and meta-learning are exciting and fast-growing research directions for the research community in both academia and industry. This tutorial aims to disseminate and promote recent research achievements in AutoML and meta-learning as well as their potential applications for multimedia. Specifically, we will first advocate novel, high-quality research findings and innovative solutions to the challenging problems in AutoML and meta-learning. Then we will discuss scenarios of multimedia where AutoML and meta-learning serve as candidates for solutions. Finally, we will point out future research directions on AutoML and meta-learning as well as their potential new applications for multimedia.

Multimedia Forensics

Luisa Verdoliva
Paolo Bestagini

With the availability of powerful and easy-to-use media editing tools, falsifying images and videos has become widespread in the last few years. Coupled with ubiquitous social networks, this allows for the viral dissemination of fake news. This raises huge concerns on multimedia security. This scenario became even worse with the advent of deep learning. New, sophisticated methods have been proposed to accomplish manipulations that were previously unthinkable (e.g., deepfake). This tutorial will present the most reliable methods for detection of manipulated images and for source identification. These are important tools nowadays to carry out fact checking and authorship verification. Hence, this is a timely and relevant research topic in the multimedia security research community.

The tutorial will be focused on digital integrity and source attribution with reference to both images and videos.

For media authenticity the main techniques will be presented for forgery detection and localization, starting from methods that rely on camera-based and format-based artifacts. Then the most innovative solutions based on deep learning will be described, considering both supervised and unsupervised approaches. Results will be presented on challenging datasets and realistic scenarios, such as the spreading of manipulated images and videos over social networks. In addition, the robustness of such methods to adversarial attacks will be analyzed.

The problem of image and video source attribution to the device used for its acquisition will be analyzed under different viewpoints: detecting the used kind of device (e.g., scanner vs. camera); detecting the used make and model (e.g., one model of camera vs. another model); detecting the used specific device (e.g., one device of a specific model vs. another device of the same model). State-of-the-art solutions exploiting either model-based or data-driven techniques will be presented. Results will be shown considering up-to-date standard datasets used for this topic.

A Journey Towards Fully Immersive Media Access

Christian Timmerer
Ali C. Begen

Universal media access (UMA) as proposed almost two decades ago is now reality. We can generate, distribute, share, and consume any media content, anywhere, anytime, and with/on any device. A technical breakthrough was the adaptive streaming over HTTP resulting in the standardization of MPEG Dynamic Adaptive Streaming over HTTP (DASH), which is now successfully deployed in a plethora of environments. The next big thing in adaptive media streaming is virtual reality applications, and specifically, omnidirectional (360-degree) media streaming, which is currently built on top of the existing adaptive streaming ecosystems. This tutorial provides a detailed overview of adaptive streaming of both traditional and omnidirectional media. The tutorial focuses on the basic principles and paradigms for adaptive streaming as well as on already deployed content generation, distribution, and consumption workflows. Additionally, the tutorial provides insights into standards and emerging technologies in the adaptive streaming space. Finally, the tutorial includes the latest approaches for immersive media streaming enabling six Degrees of Freedom (6DoF) DASH through Point Cloud Compression (PCC) and concludes with open research issues and industry efforts in this domain.

Principle-to-program: Neural Fashion Recommendation with Multi-modal Input

Muthusamy Chelliah
Soma Biswas
Lucky Dhakad

Outfit recommendation automatically pairs user-specified reference clothing with the most suitable complement from online shops. Aesthetic appeal is a criterion for matching such fashion items. Fashion style tells a lot about one's personality and emerges from how people assemble seemingly disjoint items into a cohesive outfit. Experts share fashion tips by showcasing their compositions to the public, where each item has both an image and textual metadata. Also, retrieving products from online shopping catalogs in response to such real-world image queries is essential for outfit recommendation. Our earlier tutorial focused on style and compatibility in fashion recommendation, mostly based on metric and deep learning approaches. Herein, we cover several other aspects of fashion recommendation using visual signals (e.g., cross-scenario retrieval, attribute classification) and combine text input (e.g., interpretable embedding) as well. Each section concludes with a walk-through of programs executed on a Jupyter workstation using real-world data sets.

Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data

Gerald Friedland

This tutorial provides an actionable perspective on the experimental design for machine learning experiments on multimedia data. The tutorial consists of lectures and hands-on exercises. The lectures provide a theoretical introduction to machine learning design and signal processing. The thought framework presented is derived from the traditional experimental sciences, which require published results to be self-contained with regard to reproducibility. In the practical exercises, we will work on calculating and measuring quantities like capacity or generalization ratio for different machine learners and data sets and discuss how these quantities relate to reproducible experimental design.

Medical Multimedia Systems and Applications

Pål Halvorsen
Michael Alexander Riegler
Klaus Schoeffmann

In recent years, we have observed a rise of interest in the multimedia community towards research topics related to health. This interest goes in two directions. One is personal health, with a larger focus on well-being and everyday healthy living. The other focuses more on multimedia challenges within health-care systems, for example, how multimedia content produced in hospitals can be used efficiently, and also on the user perspective of patients and health-care personnel. Challenges and requirements in this direction are similar to those of classic multimedia research, but with some additional pitfalls. This tutorial aims to give a general introduction to the research area; to provide an overview of specific requirements, pitfalls and challenges; to discuss existing and possible future work; and to elaborate on how machine learning approaches can help in multimedia-related challenges to improve the health-care quality for patients and support medical experts in their daily work.

Multimodal Data Collection for Social Interaction Analysis In-the-Wild

Hayley Hung
Chirag Raman
Ekin Gedik
Stephanie Tan
Jose Vargas Quiros

The benefits of exploiting multi-modality in the analysis of human-human social behaviour have been demonstrated widely in the community. An important aspect of this problem is the collection of data sets that provide a rich and realistic representation of how people actually socialize with each other in real life. These subtle coordination patterns are influenced by individual beliefs, goals, and desires related to what an individual stands to lose or gain in the activities they perform in their everyday life. These conditions cannot be easily replicated in a lab setting and require a radical re-thinking of both how and what to collect. This tutorial provides a guide on how to create such multi-modal multi-sensor data sets when holistically considering the entire experimental design and data collection process.

SESSION: Workshop Summaries

AI4TV 2019: 1st International Workshop on AI for Smart TV Content Production, Access and Delivery

Raphael Troncy
Jorma Laaksonen
Hamed R. Tavakoli
Lyndon Nixon
Vasileios Mezaris

Technological developments in comprehensive video understanding - detecting and identifying visual elements of a scene, combined with audio understanding (music, speech), as well as aligned with textual information such as captions, subtitles, etc. and background knowledge - have been undergoing a significant revolution during recent years. The workshop brings together experts from academia and industry in order to discuss the latest progress in artificial intelligence research in topics related to multimodal information analysis, and in particular, semantic analysis of video, audio, and textual information for smart digital TV content production, access and delivery.

AVEC'19: Audio/Visual Emotion Challenge and Workshop

Fabien Ringeval
Björn Schuller
Michel Valstar
Nicholas Cummins
Roddy Cowie
Maja Pantic

The ninth Audio-Visual Emotion Challenge and workshop AVEC 2019 was held in conjunction with ACM Multimedia'19. This year, the AVEC series addressed major novelties with three distinct tasks: State-of-Mind Sub-challenge (SoMS), Detecting Depression with Artificial Intelligence Sub-challenge (DDS), and Cross-cultural Emotion Sub-challenge (CES). The SoMS was based on a novel dataset (USoM corpus) that includes self-reported mood (10-point Likert scale) after the narration of personal stories (two positive and two negative). The DDS was based on a large extension of the DAIC-WOZ corpus (cf. AVEC 2016) that includes new recordings of patients suffering from depression, with the virtual agent conducting the interview being, this time, wholly driven by AI, i.e., without any human intervention. The CES was based on the SEWA dataset (cf. AVEC 2018), which has been extended with new participants in order to investigate how emotion knowledge of Western European cultures (German, Hungarian) can be transferred to the Chinese culture. In this summary, we mainly describe participation and conditions of the AVEC Challenge.

HealthMedia'19: 4th International Workshop on Multimedia for Personal Health and Health Care

Susanne Boll
Jeannie S. Lee
Jochen Meyer
Nitish Nag
Noel E. O'Connor

Managing one's health is among the most personal and most important challenges. HealthMedia can be viewed as one response from the multimedia community to rise to this challenge. A growing body of research shows how core multimedia research is becoming an important enabler for solutions with applications and relevance for the societal questions of health. Within this workshop, we continue to explore the relevance, contribution and future directions of multimedia to health care and personal health. This workshop brings together researchers from diverse topics such as multimedia, tracking, lifelogging, accessibility, and HCI, but also health, medicine, and psychology, to address challenges and opportunities of multimedia in and for health.

MADiMA'19: 5th International Workshop on Multimedia Assisted Dietary Management

Stavroula G. Mougiakakou
Giovanni Maria Farinella
Keiji Yanai
Dario Allegra

This abstract provides a summary and overview of the 5th International Workshop on Multimedia Assisted Dietary Management.

SALMM'19: First International Workshop on Search as Learning with Multimedia Information

Ralph Ewerth
Stefan Dietze
Anett Hoppe
Ran Yu

The First International Workshop on "Search as Learning with Multimedia Information" (SALMM) presents interdisciplinary contributions that address multimedia aspects in search as learning scenarios. The research topic of search as learning (SAL) recently emerged in the field of information retrieval and investigates informal, web-based learning processes as they happen every day with the help of search engines. While highly active, the SAL community still shows a focus on textual documents. This conflicts with research on multimedia learning in educational psychology: humans tend to grasp and internalize new knowledge more easily and more efficiently when it is conveyed using multiple modalities. The workshop SALMM aims to bridge the gap between the communities of SAL and multimedia. Related Workshop Proceedings are available in the ACM DL at: https://dl.acm.org/citation.cfm?id=3347451

SUMAC 2019: The 1st workshop on Structuring and Understanding of Multimedia heritAge Contents

Valérie Gouet-Brunet
Margarita Khokhlova
Liming Chen
Sander Münster

SUMAC 2019 is the first workshop on Structuring and Understanding of Multimedia heritAge Contents. It is held in Nice, France on October 21, 2019 and is co-located with the 27th ACM International Conference on Multimedia. Its objective is to present and discuss the latest and most significant trends and challenges in the analysis, structuring and understanding of multimedia contents dedicated to the valorization of heritage, with the emphasis on the unlocking of and access to the big data of the past. A representative scope of Computer Science methodologies dedicated to the processing of multimedia heritage contents and their exploitation is covered by the works presented, with the ambition of advancing and raising awareness about this fully developing research field.

FAT/MM'19: 1st International Workshop on Fairness, Accountability, and Transparency in MultiMedia

Xavier Alameda-Pineda
Miriam Redi
Elisa Celis
Nicu Sebe
Shih-Fu Chang

The series of FAT* events aims at bringing together researchers and practitioners interested in fairness, accountability, and transparency of computational methods. The FAT/MM workshop focuses on addressing these issues in the Multimedia field. Multimedia computing technologies operate today at an unprecedented scale, with a growing community of scientists interested in multimedia models, tools and applications. Such continued growth has great implications not only for the scientific community, but also for society as a whole. Typical risks of large-scale computational models include model bias and algorithmic discrimination. These risks become particularly prominent in the multimedia field, which historically has been focusing on user-centered technologies. To ensure a healthy and constructive development of the best multimedia technologies, this workshop offers a space to discuss how to develop fair, unbiased, representative, and transparent multimedia models, bringing together researchers from different areas to present computational solutions to these issues.

MAHCI 2019: The 2nd Workshop on Multimedia for Accessible Human Computer Interface

Xueliang Liu
Rui Min
Troy McDaniel

Multimedia technology plays a fundamental role in increasing the usability and accessibility of computer interfaces in the development of advanced human-computer interaction devices. The 2nd workshop on Multimedia for Accessible Human Computer Interface (MAHCI) continues to provide a forum for both multimedia and HCI researchers to discuss accessible human-computer interface design, development, and evaluation with state-of-the-art multimedia technology. It also enables the multimedia community to expand its interaction with the HCI industry and broaden the scope of deploying multimedia technology in practical applications. The workshop features 6 papers, which cover a number of novel applications and new methodologies in a half-day program.

MMSports'19: 2nd ACM International Workshop on Multimedia Content Analysis in Sports

Rainer Lienhart
Thomas B. Moeslund
Hideo Saito

The second ACM International Workshop on Multimedia Content Analysis in Sports (ACM MMSports'19) is held in Nice, France on October 25th, 2019 co-located with the ACM International Conference on Multimedia 2019 (ACM Multimedia 2019). The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining, analyzing, understanding and visualizing the multimedia/multimodal data in sports. The combination of sports and modern technology offers a novel and intriguing field of research with promising approaches for visual broadcast augmentation, understanding, statistical analysis and evaluation, and sensor fusion. There is a lack of research communities focusing on the fusion of multiple modalities. We are helping to close this research gap with this workshop series on multimedia content analysis in sports.

MULEA'19: The First International Workshop on Multimodal Understanding and Learning for Embodied Applications

Jiang (John) Gao
Jia-Yu (Tim) Pan

The First International Workshop on Multimodal Understanding and Learning for Embodied Applications is held in Nice, France, in conjunction with ACM Multimedia 2019. Embodied applications require the learning and knowledge discovery process involving an agent, the environment, and actions, as well as the understanding and grounding of multiple modalities of input signals. Being one of the frontiers in AI research, it covers many of the applications in AI, such as robotics, autonomous driving, multimodal chatbots, or simulated games. The Workshop brings an exciting program with invited speeches, original research papers, and lively discussions on this new and exciting research area.

