# MM '18: 2018 ACM Multimedia Conference

## SESSION: FF-1

### Session details: FF-1

•      Max Mühlhäuser

### SCRATCH: A Scalable Discrete Matrix Factorization Hashing for Cross-Modal Retrieval

•      Chuan-Xiang Li
• Zhen-Duo Chen
• Peng-Fei Zhang
• Xin Luo
• Liqiang Nie
• Wei Zhang
• Xin-Shun Xu

In recent years, many hashing methods have been proposed for the cross-modal retrieval task. However, several issues remain to be explored. For example, some methods relax the binary constraints to generate the hash codes, which may introduce large quantization error. Although some discrete schemes have been proposed, most of them are time-consuming. In addition, most existing supervised hashing methods use an n × n similarity matrix during optimization, making them unscalable. To address these issues, in this paper, we present a novel supervised cross-modal hashing method: Scalable disCRete mATrix faCtorization Hashing, SCRATCH for short. It leverages collective matrix factorization on the kernelized features and semantic embedding with labels to find a latent semantic space that preserves the intra- and inter-modality similarities. In addition, it incorporates the label matrix instead of the similarity matrix into the loss function. Based on the proposed loss function and the iterative optimization algorithm, it can learn the hash functions and binary codes simultaneously. Moreover, the binary codes can be generated discretely, reducing the quantization error generated by the relaxation scheme. Its time complexity is linear in the size of the dataset, making it scalable to large-scale datasets. Extensive experiments on three benchmark datasets, namely, Wiki, MIRFlickr-25K, and NUS-WIDE, have verified that our proposed SCRATCH model outperforms several state-of-the-art unsupervised and supervised hashing methods for cross-modal retrieval.
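The scalability argument above (an n × n similarity matrix versus a label matrix) can be made concrete with a rough storage estimate. This is only an illustrative sketch: the helper `matrix_bytes`, the 4-byte entry size, and the values of `n` and `c` are assumptions chosen for the example, not figures from the paper.

```python
def matrix_bytes(rows, cols, bytes_per_entry=4):
    """Storage needed for a dense rows x cols matrix (hypothetical helper)."""
    return rows * cols * bytes_per_entry


# Assumed sizes for illustration only: n training samples, c label categories.
n = 100_000
c = 21

similarity_bytes = matrix_bytes(n, n)  # n x n pairwise similarity matrix
label_bytes = matrix_bytes(n, c)       # n x c label matrix

print(f"similarity matrix: {similarity_bytes / 1e9:.1f} GB")
print(f"label matrix:      {label_bytes / 1e6:.1f} MB")
```

Under these assumed sizes the similarity matrix grows quadratically with n while the label matrix grows linearly, which is the property the abstract relies on.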

### Predicting Visual Context for Unsupervised Event Segmentation in Continuous Photo-streams

•      Ana Garcia del Molino
• Joo-Hwee Lim
• Ah-Hwee Tan

Segmenting video content into events provides semantic structures for indexing, retrieval, and summarization. Since motion cues are not available in continuous photo-streams, and annotations in lifelogging are scarce and costly, the frames are usually clustered into events by comparing their visual features in an unsupervised way. However, such methodologies are ineffective at dealing with heterogeneous events, e.g., taking a walk, and temporary changes in the sight direction, e.g., at a meeting. To address these limitations, we propose Contextual Event Segmentation (CES), a novel segmentation paradigm that uses an LSTM-based generative network to model the photo-stream sequences, predict their visual context, and track their evolution. CES decides whether a frame is an event boundary by comparing the visual context generated from the frames in the past to the visual context predicted from the future. We implemented CES on a new and massive lifelogging dataset consisting of more than 1.5 million images spanning over 1,723 days. Experiments on the popular EDUB-Seg dataset show that our model outperforms the state-of-the-art by over 16% in F-measure. Furthermore, CES's performance is only 3 points below that of human annotators.
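The boundary decision described above (comparing a context vector derived from past frames with one predicted from future frames) might be sketched as follows. The abstract does not state the comparison measure or the decision rule, so the cosine distance, the fixed `threshold`, and the function names are assumptions; in the actual system the context vectors would come from the LSTM-based network, whereas here they are toy lists.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def find_boundaries(past_contexts, future_contexts, threshold=0.5):
    """Indices of frames whose past and future contexts differ by more than threshold.

    The threshold rule is an assumed stand-in for the paper's decision step.
    """
    return [
        i
        for i, (past, future) in enumerate(zip(past_contexts, future_contexts))
        if cosine_distance(past, future) > threshold
    ]


# Toy context vectors: only frame 1 shows a large past/future mismatch.
past = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
future = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
boundaries = find_boundaries(past, future)
```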

### Video-to-Video Translation with Global Temporal Consistency

•      Xingxing Wei
• Jun Zhu
• Sitong Feng
• Hang Su

Although image-to-image translation has been widely studied, video-to-video translation is rarely mentioned. In this paper, we propose a unified video-to-video translation framework to accomplish different tasks, such as video super-resolution, video colourization, and video segmentation. A consequent question within video-to-video translation lies in the flickering appearance along with the varying frames. To overcome this issue, a usual method is to incorporate the temporal loss between adjacent frames in the optimization, which is a kind of local frame-wise temporal consistency. We instead present a residual error based mechanism to ensure the video-level consistency of the same location in different frames (called global temporal consistency). The global and local consistency are simultaneously integrated into our video-to-video framework to achieve more stable videos. Our method is based on the GAN framework, where we present a two-channel discriminator. One channel encodes the video RGB space, and the other encodes the residual error of the video as a whole to meet the global consistency. Extensive experiments conducted on different video-to-video translation tasks verify the effectiveness and flexibility of the proposed method.

### Shared Linear Encoder-based Gaussian Process Latent Variable Model for Visual Classification

•      Jinxing Li
• Bob Zhang
• Guangming Lu
• David Zhang

Multi-view learning has shown powerful potential in many applications and achieved outstanding performance compared with single-view based methods. In this paper, we propose a novel multi-view learning model based on the Gaussian Process Latent Variable Model (GPLVM) to learn a shared latent variable in the manifold space with a back projection based on a linear mapping and a Gaussian process prior. Different from existing GPLVM methods, which only consider a mapping from the latent space to the observed space, the proposed method simultaneously takes a back projection from the observation to the latent variable into account. Concretely, due to the various dimensions of different views, a projection for each view is first learned to linearly map its observation to a subspace. The Gaussian process prior is then imposed on another transformation to non-linearly and efficiently map the learned subspace to a shared manifold space. In order to apply the proposed approach to classification, a discriminative regularization is also embedded to exploit the label information. Experimental results on three real-world databases substantiate the effectiveness and superiority of the proposed approach as compared with several state-of-the-art approaches.

### Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector

•      Jia-Xing Zhong
• Nannan Li
• Weijie Kong
• Tao Zhang
• Thomas H. Li
• Ge Li

Weakly supervised temporal action detection is a Herculean task in understanding untrimmed videos, since no supervisory signal except the video-level category label is available on training data. Under the supervision of category labels, weakly supervised detectors are usually built upon classifiers. However, there is an inherent contradiction between classifier and detector; i.e., a classifier in pursuit of high classification performance prefers top-level discriminative video clips that are extremely fragmentary, whereas a detector is obliged to discover the whole action instance without missing any relevant snippet. To reconcile this contradiction, we train a detector by driving a series of classifiers to find new actionness clips progressively, via step-by-step erasion from a complete video. During the test phase, all we need to do is to collect detection results from the one-by-one trained classifiers at various erasing steps. To assist in the collection process, a fully connected conditional random field is established to refine the temporal localization outputs. We evaluate our approach on two prevailing datasets, THUMOS'14 and ActivityNet. The experiments show that our detector advances state-of-the-art weakly supervised temporal action detection results, and is even comparable to quite a few strongly supervised methods.

### Multi-Human Parsing Machines

•      Jianshu Li
• Jian Zhao
• Yunpeng Chen
• Sujoy Roy
• Shuicheng Yan
• Jiashi Feng
• Terence Sim

Human parsing is an important task in human-centric analysis. Despite the remarkable progress in single-human parsing, the more realistic case of multi-human parsing remains challenging in terms of the data and the model. Compared with the considerable number of available single-human parsing datasets, the datasets for multi-human parsing are very limited in number mainly due to the huge annotation effort required. Besides the data challenge to multi-human parsing, the persons in real-world scenarios are often entangled with each other due to close interaction and body occlusion, making it difficult to distinguish body parts from different person instances. In this paper we propose the Multi-Human Parsing Machines (MHPM) system, which contains an MHP Montage model and an MHP Solver, to address both challenges in multi-human parsing. Specifically, the MHP Montage model in MHPM generates realistic images with multiple persons together with the parsing labels. It intelligently composes single persons onto background scene images while maintaining the structural information between persons and the scene. The generated images can be used to train better multi-human parsing algorithms. On the other hand, the MHP Solver in MHPM solves the bottleneck of distinguishing multiple entangled persons with close interaction. It employs a Group-Individual Push and Pull (GIPP) loss function, which can effectively separate persons with close interaction. We experimentally show that the proposed MHPM can achieve state-of-the-art performance on the multi-human parsing benchmark and the person individualization benchmark, which distinguishes closely entangled person instances.

### Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering

•      Xuanyi Dong
• Linchao Zhu
• De Zhang
• Yi Yang
• Fei Wu

Given only a few image-text pairs, humans can learn to detect semantic concepts and describe the content. Machine learning algorithms, by contrast, usually require a lot of data to train a deep neural network to solve the problem. It is therefore challenging for existing systems to generalize well to the few-shot multi-modal scenario, because the learner should understand not only images and texts but also their relationships from only a few examples. In this paper, we tackle two multi-modal problems, i.e., image captioning and visual question answering (VQA), in the few-shot setting.

We propose Fast Parameter Adaptation for Image-Text Modeling (FPAIT), which learns to learn a joint understanding of image and text data from a few examples. In practice, FPAIT has two benefits. (1) Fast learning ability. FPAIT learns proper initial parameters for the joint image-text learner from a large number of different tasks. When a new task comes, FPAIT can use a small number of gradient steps to achieve good performance. (2) Robustness to few examples. In few-shot tasks, the small training data will introduce large biases in Convolutional Neural Networks (CNN) and damage the learner's performance. FPAIT leverages dynamic linear transformations to alleviate the side effects of the small training set. In this way, FPAIT flexibly normalizes the features and thus reduces the biases during training. Quantitatively, FPAIT achieves superior performance on both few-shot image captioning and VQA benchmarks.

### Hierarchical Memory Modelling for Video Captioning

•      Junbo Wang
• Wei Wang
• Yan Huang
• Liang Wang
• Tieniu Tan

Translating videos into natural language sentences has drawn much attention recently. The framework of combining visual attention with a Long Short-Term Memory (LSTM) based text decoder has achieved much progress. However, vision-language translation still remains unsolved due to the semantic gap and misalignment between video content and the described semantic concept. In this paper, we propose a Hierarchical Memory Model (HMM), a novel deep video captioning architecture which unifies a textual memory, a visual memory and an attribute memory in a hierarchical way. These memories can guide attention for efficient video representation extraction and semantic attribute selection, in addition to modelling the long-term dependency for video sequences and sentences, respectively. Compared with the traditional vision-based text decoder, the proposed attribute-based text decoder can largely reduce the semantic discrepancy between video and sentence. To prove the effectiveness of the proposed model, we perform extensive experiments on two public benchmark datasets: MSVD and MSR-VTT. Experiments show that our model not only can discover appropriate video representations and semantic attributes but also can achieve performance comparable or superior to that of state-of-the-art methods on these datasets.

### Incremental Deep Hidden Attribute Learning

•      Zheng Wang
• Xiang Bai
• Mang Ye
• Shin'ichi Satoh

Person re-identification is a key technique to match person images captured in non-overlapping camera views. Due to the sensitivity of visual features to environmental changes, semantic attributes, such as "short-hair" or "long-hair", have begun to be investigated to represent a person's appearance and improve re-identification performance. Generally, training semantic attribute representations requires massive annotated samples, which limits applicability in large-scale practical applications. To alleviate the reliance on annotation efforts, we propose a new person representation with hidden attributes by mining latent information from visual features in an unsupervised manner. In particular, an auto-encoder model is plugged into the deep learning network to compose a Deep Hidden Attribute Network (DHA-Net). The learnt hidden attribute representation preserves the robustness of semantic attributes and simultaneously inherits the discrimination ability of visual features. Experiments conducted on public datasets have validated the effectiveness of DHA-Net. On two large-scale datasets, i.e., Market-1501 and DukeMTMC-reID, the proposed method outperforms the state-of-the-art methods.

### CropNet: Real-Time Thumbnailing

•      Huarong Chen
• Bin Wang
• Tianxiang Pan
• Liwang Zhou
• Hua Zeng

We present a deep learning-based thumbnail generation method called CropNet in this paper. Unlike previous deep learning-based methods, such as Fast-AT, which can utilize detectors introduced in object detection frameworks and generate thousands of proposals, our detector is straightforward and concise, thereby ensuring that the final cropping window is computed by its center and width, with the input aspect ratio. To achieve this goal, CropNet learns specific filters to estimate the center position and utilizes a cascade structure of filters and single neuron for width inference. In addition, CropNet optimizes the center and width jointly for optimal results. We collect a data set of more than 29,000 thumbnail annotations to train CropNet and perform cross-validation between existing data sets. Experiments show that CropNet outperforms existing techniques. Our result is achieved at a test-time speed of 17 ms per image, which is six times faster than the fastest method at present.
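Since the final cropping window is determined by a center, a width, and the input aspect ratio, the geometry can be sketched in a few lines. This is an illustrative sketch only: the function name `crop_window`, the convention that `aspect_ratio` means width divided by height, and the shifting of the window back inside the image are assumptions, not details taken from the paper.

```python
def crop_window(cx, cy, width, aspect_ratio, img_w, img_h):
    """Return (left, top, right, bottom) of a crop centered near (cx, cy).

    aspect_ratio is assumed to be width / height. The window is shifted
    (not resized) so that it stays inside the image where possible.
    """
    height = width / aspect_ratio
    left = cx - width / 2
    top = cy - height / 2
    # Assumed handling of windows that would extend past the image border.
    left = min(max(left, 0.0), max(img_w - width, 0.0))
    top = min(max(top, 0.0), max(img_h - height, 0.0))
    return (left, top, left + width, top + height)


centered = crop_window(100, 100, 80, 2.0, 200, 200)
near_corner = crop_window(10, 10, 80, 2.0, 200, 200)
```

In the first call the window fits around the predicted center; in the second it is shifted to the top-left corner because the predicted center lies too close to the border.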

### Learning to Transfer: Generalizable Attribute Learning with Multitask Neural Model Search

•      Zhi-Qi Cheng
• Xiao Wu
• Siyu Huang
• Jun-Xiu Li
• Alexander G. Hauptmann
• Qiang Peng

As attribute learning brings mid-level semantic properties for objects, it can benefit many traditional learning problems in the multimedia and computer vision communities. When facing a huge number of attributes, it is extremely challenging to automatically design a generalizable neural network for other attribute learning tasks. Even for a specific attribute domain, the exploration of the neural network architecture is always optimized by a combination of heuristics and grid search, from which there is a large space of possible choices to be searched. In this paper, the Generalizable Attribute Learning Model (GALM) is proposed to automatically design neural networks for generalizable attribute learning. The main novelty of GALM is that it fully exploits Multi-Task Learning and Reinforcement Learning to speed up the search procedure. With the help of parameter sharing, GALM is able to transfer the pre-searched architecture to different attribute domains. In experiments, we comprehensively evaluate GALM on 251 attributes from three domains: animals, objects, and scenes. Extensive experimental results demonstrate that GALM significantly outperforms the state-of-the-art attribute learning approaches and previous neural architecture search methods on two generalizable attribute learning scenarios.

### Attention-based Pyramid Aggregation Network for Visual Place Recognition

•      Yingying Zhu
• Jiong Wang
• Lingxi Xie
• Liang Zheng

Visual place recognition is challenging in the urban environment and is usually viewed as a large-scale image retrieval task. The intrinsic challenges of place recognition are that confusing objects such as cars and trees frequently occur in complex urban scenes, and that buildings with repetitive structures may cause over-counting and the burstiness problem, degrading the image representations. To address these problems, we present an Attention-based Pyramid Aggregation Network (APANet), which is trained in an end-to-end manner for place recognition. One main component of APANet, the spatial pyramid pooling, can effectively encode the multi-size buildings containing geo-information. The other one, the attention block, is adopted as a region evaluator for suppressing the confusing regional features while highlighting the discriminative ones. When testing, we further propose a simple yet effective PCA power whitening strategy, which significantly improves the widely used PCA whitening by reasonably limiting the impact of over-counting. Experimental evaluations demonstrate that the proposed APANet outperforms the state-of-the-art methods on two place recognition benchmarks, and generalizes well on standard image retrieval datasets.
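A minimal sketch of how a power-controlled whitening step could look, under stated assumptions. The abstract does not give the formula, so the parametrization below is an assumed form: each PCA-projected component is divided by its eigenvalue raised to `alpha / 2`, so that `alpha = 1` recovers standard PCA whitening and `alpha = 0` leaves the projection unscaled. The names `power_whiten`, `alpha`, and `eps` are hypothetical.

```python
def power_whiten(projected, eigenvalues, alpha=0.5, eps=1e-12):
    """Scale PCA-projected components by eigenvalue ** (-alpha / 2).

    projected:   components of one descriptor after projection onto the PCA basis
    eigenvalues: matching eigenvalues of the covariance matrix
    alpha:       assumed power parameter; 1.0 is standard whitening, 0.0 is none
    eps:         small constant guarding against division by zero
    """
    return [
        component / (eigenvalue + eps) ** (alpha / 2.0)
        for component, eigenvalue in zip(projected, eigenvalues)
    ]


descriptor = [4.0, 2.0]
eigvals = [16.0, 4.0]

full = power_whiten(descriptor, eigvals, alpha=1.0)     # standard whitening
partial = power_whiten(descriptor, eigvals, alpha=0.5)  # softened whitening
none = power_whiten(descriptor, eigvals, alpha=0.0)     # no whitening
```

With a smaller `alpha`, components with small eigenvalues are amplified less than under full whitening, which is one plausible way to temper the scaling; whether this matches the paper's exact strategy is not confirmed by the abstract.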

### Semi-supervised Deep Generative Modelling of Incomplete Multi-Modality Emotional Data

•      Changde Du
• Changying Du
• Hao Wang
• Jinpeng Li
• Wei-Long Zheng
• Bao-Liang Lu
• Huiguang He

There are three challenges in emotion recognition. First, it is difficult to recognize human emotional states by considering only a single modality. Second, it is expensive to manually annotate the emotional data. Third, emotional data often suffers from missing modalities due to unforeseeable sensor malfunction or configuration issues. In this paper, we address all these problems under a novel multi-view deep generative framework. Specifically, we propose to model the statistical relationships of multi-modality emotional data using multiple modality-specific generative networks with a shared latent space. By imposing a Gaussian mixture assumption on the posterior approximation of the shared latent variables, our framework can learn the joint deep representation from multiple modalities and evaluate the importance of each modality simultaneously. To solve the labeled-data-scarcity problem, we extend our multi-view model to the semi-supervised learning scenario by casting the semi-supervised classification problem as a specialized missing data imputation task. To address the missing-modality problem, we further extend our semi-supervised multi-view model to deal with incomplete data, where a missing view is treated as a latent variable and integrated out during inference. This way, the proposed overall framework can utilize all available (both labeled and unlabeled, as well as both complete and incomplete) data to improve its generalization ability. The experiments conducted on two real multi-modal emotion datasets demonstrate the superiority of our framework.

### Twitter Sentiment Analysis via Bi-sense Emoji Embedding and Attention-based LSTM

•      Yuxiao Chen
• Jianbo Yuan
• Quanzeng You
• Jiebo Luo

Sentiment analysis on large-scale social media data is important to bridge the gaps between social media contents and real world activities including political election prediction, individual and public emotional status monitoring and analysis, and so on. Although textual sentiment analysis has been well studied based on platforms such as Twitter and Instagram, analysis of the role of extensive emoji use in sentiment analysis remains limited. In this paper, we propose a novel scheme for Twitter sentiment analysis with extra attention on emojis. We first learn bi-sense emoji embeddings under positive and negative sentimental tweets individually, and then train a sentiment classifier by attending to these bi-sense emoji embeddings with an attention-based long short-term memory network (LSTM). Our experiments show that the bi-sense embedding is effective for extracting sentiment-aware embeddings of emojis and outperforms the state-of-the-art models. We also visualize the attentions to show that the bi-sense emoji embedding provides better guidance on the attention mechanism to obtain a more robust understanding of the semantics and sentiments.

### Facial Expression Recognition in the Wild: A Cycle-Consistent Adversarial Attention Transfer Approach

•      Feifei Zhang
• Tianzhu Zhang
• Qirong Mao
• Lingyu Duan
• Changsheng Xu

Facial expression recognition (FER) is a very challenging problem due to different expressions under arbitrary poses. Most conventional approaches perform FER in laboratory-controlled environments. Different from existing methods, in this paper, we formulate FER in the wild as a domain adaptation problem, and propose a novel auxiliary domain guided Cycle-consistent adversarial Attention Transfer model (CycleAT) for simultaneous facial image synthesis and facial expression recognition in the wild. The proposed model utilizes large-scale unlabeled web facial images as an auxiliary domain to reduce the gap between the source domain and the target domain, based on generative adversarial networks (GAN) embedded with an effective attention transfer module, which enjoys several merits. First, the GAN-based method can automatically generate labeled facial images in the wild by harnessing information from labeled facial images in the source domain and unlabeled web facial images in the auxiliary domain. Second, the class-discriminative spatial attention maps from the classifier in the source domain are leveraged to boost the performance of the classifier in the target domain. Third, it can effectively preserve the structural consistency of local pixels and global attributes in the synthesized facial images through pixel cycle-consistency and discriminative loss. Quantitative and qualitative evaluations on two challenging in-the-wild datasets demonstrate that the proposed model performs favorably against state-of-the-art methods.

### Inferring User Emotive State Changes in Realistic Human-Computer Conversational Dialogs

•      Runnan Li
• Zhiyong Wu
• Jia Jia
• Jingbei Li
• Wei Chen
• Helen Meng

Human-computer conversational interactions are increasingly pervasive in real-world applications, such as chatbots and virtual assistants. The user experience can be enhanced through affective design of such conversational dialogs, especially in enabling the computer to understand the emotive state in the user's input, and to generate an appropriate system response within the dialog turn. Such a system response may further influence the user's emotive state in the subsequent dialog turn. In this paper, we focus on the change in the user's emotive states in adjacent dialog turns, to which we refer as user emotive state change. We propose a multi-modal, multi-task deep learning framework to infer the user's emotive states and emotive state changes simultaneously. A multi-task learning convolution fusion auto-encoder is applied to fuse the acoustic and textual features to generate a robust representation of the user's input. A long short-term memory recurrent auto-encoder is employed to extract features of system responses at the sentence level to better capture factors affecting user emotive states. A multi-task learned structured output layer is adopted to model the dependency of user emotive state change, conditioned upon the user input's emotive states and the system response in the current dialog turn. Experimental results demonstrate the effectiveness of the proposed method.

### Self-boosted Gesture Interactive System with ST-Net

•      Zhengzhe Liu
• Xiaojuan Qi
• Lei Pang

In this paper, we propose a self-boosted intelligent system for joint sign language recognition and automatic education. A novel Spatial-Temporal Net (ST-Net) is designed to exploit the temporal dynamics of localized hands for sign language recognition. Features from ST-Net can be deployed by our education system to detect failure modes of the learners. Moreover, the education system can help collect a vast amount of data for training ST-Net. Our sign language recognition and education systems help improve each other step by step. On the one hand, benefiting from an accurate recognition system, the education system can detect the failure parts of the learner more precisely. On the other hand, with more training data gathered from the education system, the recognition system becomes more robust and accurate. Experiments on a Hong Kong sign language dataset containing 227 commonly used words validate the effectiveness of our joint recognition and education system.

### Slackliner - An Interactive Slackline Training Assistant

•      Felix Kosmalla
• Christian Murlowski
• Florian Daiber
• Antonio Krüger

In this paper we present Slackliner, an interactive slackline training assistant which features a life-size projection, skeleton tracking and real-time feedback. As in other sports, proper training leads to a faster buildup of skill and lessens the risk for injuries. We chose a set of exercises from slackline literature and implemented an interactive trainer which guides the user through the exercises and gives feedback if the exercise was executed correctly.

A post analysis gives the user feedback about her performance. We conducted a user study to compare the interactive slackline training system with a classic approach using a personal trainer. No significant difference was found between groups regarding balancing time, number of steps and the walking distance on the line for the left and right foot. Significant main effects for the balancing time on line, without considering the group, have been found. User feedback acquired by questionnaires and semi-structured interviews was very positive. Overall, the results indicate that the interactive slackline training system can be used as an enjoyable and effective alternative to classic training methods.

### A Unified Generative Adversarial Framework for Image Generation and Person Re-identification

•      Yaoyu Li
• Tianzhu Zhang
• Lingyu Duan
• Changsheng Xu

Person re-identification (re-id) aims to match a certain person across multiple non-overlapping cameras. It is a challenging task because the same person's appearance can be very different across camera views due to the presence of large pose variations. To overcome this issue, in this paper, we propose a novel unified person re-id framework by exploiting person poses and identities jointly for simultaneous person image synthesis under arbitrary poses and pose-invariant person re-identification. The framework is composed of a GAN-based network and two Feature Extraction Networks (FEN), and enjoys the following merits. First, it is a unified generative adversarial model for person image generation and person re-identification. Second, a pose estimator is integrated into the generator as a supervisor in the training process, which can effectively help pose transfer and guide the image generation with any desired pose. As a result, the proposed model can automatically generate a person image under an arbitrary pose. Third, the identity-sensitive representation is explicitly disentangled from pose variations through the person identity and pose embedding. Fourth, the learned re-id model can have better generalizability on a new person re-id dataset by using the synthesized images as auxiliary samples. Extensive experimental results on four standard benchmarks including Market-1501 [69], DukeMTMC-reID [40], CUHK03 [23], and CUHK01 [22] demonstrate that the proposed model can perform favorably against state-of-the-art methods.

### FoV-Aware Edge Caching for Adaptive 360° Video Streaming

•      Anahita Mahzari
• Afshin Taghavi Nasrabadi
• Aliehsan Samiei
• Ravi Prakash

In recent years, Virtual Reality (VR) has grown in popularity, enabled by technologies like 360° video streaming. Streaming 360° video is extremely challenging due to high bandwidth and low latency requirements. Some VR solutions employ adaptive 360° video streaming, which tries to reduce bandwidth consumption by only streaming high-resolution video for the user's Field of View (FoV). The FoV is the part of the video which is being viewed by the user at any given time. Although FoV-adaptive 360° video streaming has been helpful in reducing bandwidth requirements, streaming 360° video from distant content servers is still challenging due to network latency. Caching popular content close to the end users not only decreases network latency, but also alleviates network bandwidth demands by reducing the number of future requests that have to be sent all the way to remote content servers. In this paper, we propose a novel caching policy based on users' FoV, called FoV-aware caching policy. In FoV-aware caching policy, we learn a probabilistic model of common-FoV for each 360° video based on previous users' viewing histories to improve caching performance. Through experiments with a dataset of real users' head movements, we show that our proposed approach improves cache hit ratio compared to Least Frequently Used (LFU) and Least Recently Used (LRU) caching policies by at least 40% and 17%, respectively.
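For readers unfamiliar with the two baselines named above, the following sketch shows how a cache hit ratio can be measured for LRU and LFU on a request trace. It does not implement the paper's FoV-aware policy. The tile identifiers, the trace, the cache capacity, and the LFU details (frequencies counted over all requests, ties broken by key) are assumptions made for illustration.

```python
from collections import Counter, OrderedDict


def lru_hit_ratio(requests, capacity):
    """Hit ratio of a Least Recently Used cache over a request trace."""
    if not requests:
        return 0.0
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits / len(requests)


def lfu_hit_ratio(requests, capacity):
    """Hit ratio of a Least Frequently Used cache over a request trace."""
    if not requests:
        return 0.0
    cache = set()
    freq = Counter()  # request counts over the whole trace so far
    hits = 0
    for key in requests:
        freq[key] += 1
        if key in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                # Evict the cached item with the lowest count (ties by key).
                victim = min(cache, key=lambda k: (freq[k], k))
                cache.remove(victim)
            cache.add(key)
    return hits / len(requests)


# Hypothetical trace of requested video tiles; "t1" is the popular one.
trace = ["t1", "t1", "t1", "t2", "t3", "t1", "t2", "t3", "t1"]
lru = lru_hit_ratio(trace, capacity=2)
lfu = lfu_hit_ratio(trace, capacity=2)
```

On this toy trace LFU keeps the popular tile cached while LRU repeatedly evicts it, so LFU scores higher; real traces can favor either policy.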

## SESSION: Keynote 1

### Session details: Keynote 1

•      Susanne Boll

### Don't just Look -- Smell, Taste, and Feel the Interaction

•      Marianna Obrist

While our understanding of the sensory modalities for human-computer interaction (HCI) is advancing, there is still a huge gap in our understanding of 'how' to best integrate different modalities into the interaction with technology. While we have established a rich ecosystem for visual and auditory design, our sense of touch, taste, and smell are not yet supported with dedicated software tools, interaction techniques, and design tools. Hence, we are missing out on the opportunity to exploit the power of those sensory modalities, their strong link to emotions and memory, and their ability to facilitate recall and recognition in information processing and decision making. In this keynote, Dr Obrist presents an overview of scientific and technological developments in multisensory research and design with an emphasis on an experience-centered design approach that bridges and integrates knowledge on human sensory perception and advances in computing technology.

## SESSION: FF-2

### Session details: FF-2

•      Peng Cui

### Style Separation and Synthesis via Generative Adversarial Networks

•      Rui Zhang
• Sheng Tang
• Yu Li
• Junbo Guo
• Yongdong Zhang
• Jintao Li
• Shuicheng Yan

Style synthesis has attracted great interest recently, while few works focus on its dual problem, "style separation". In this paper, we propose the Style Separation and Synthesis Generative Adversarial Network (S3-GAN) to simultaneously implement style separation and style synthesis on object photographs of specific categories. Based on the assumption that the object photographs lie on a manifold, and the contents and styles are independent, we employ S3-GAN to build mappings between the manifold and a latent vector space for separating and synthesizing the contents and styles. The S3-GAN consists of an encoder network, a generator network, and an adversarial network. The encoder network performs style separation by mapping an object photograph to a latent vector. Two halves of the latent vector represent the content and style, respectively. The generator network performs style synthesis by taking a concatenated vector as input. The concatenated vector contains the style half vector of the style target image and the content half vector of the content target image. Once the images are obtained from the generator network, an adversarial network is imposed to generate more photo-realistic images. Experiments on the CelebA and UT Zappos 50K datasets demonstrate that the S3-GAN has the capacity for style separation and synthesis simultaneously, and can capture various styles in a single model.
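The latent-vector manipulation described above (two halves for content and style, recombined before being passed to the generator) can be sketched with plain lists. This is illustrative only: the function names are hypothetical, the ordering (content in the first half, style in the second) is an assumption, and the encoder and generator networks themselves are not modelled.

```python
def split_latent(z):
    """Split a latent vector into (content_half, style_half).

    Assumes the first half encodes content and the second half encodes style.
    """
    if len(z) % 2 != 0:
        raise ValueError("latent vector length must be even")
    half = len(z) // 2
    return z[:half], z[half:]


def mix_latents(z_content_image, z_style_image):
    """Build the generator input: content half of one image, style half of another."""
    content, _ = split_latent(z_content_image)
    _, style = split_latent(z_style_image)
    return content + style


# Toy latent vectors standing in for encoder outputs.
z_a = [1, 2, 3, 4]  # encoding of the content target image
z_b = [5, 6, 7, 8]  # encoding of the style target image
mixed = mix_latents(z_a, z_b)
```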

### Group Re-Identification: Leveraging and Integrating Multi-Grain Information

•      Hao Xiao
• Weiyao Lin
• Bin Sheng
• Ke Lu
• Junchi Yan
• Jingdong Wang
• Errui Ding
• Yihao Zhang
• Hongkai Xiong

This paper addresses an important yet less-studied problem: re-identifying groups of people in different camera views. Group re-identification (Re-ID) is very challenging since it is not only affected by the view-point and human pose variations found in traditional single-object Re-ID tasks, but also suffers from group layout and group member variations. To handle these issues, we propose to leverage the information of multi-grain objects: individual persons and subgroups of two and three people inside a group image. We compute multi-grain representations to characterize the appearance and spatial features of multi-grain objects and evaluate the importance weight of each object for group Re-ID, so as to handle the interferences from group dynamics. We compute the optimal group-wise matching by using a multi-order matching process based on the multi-grain representation and importance weights. Furthermore, we dynamically update the importance weights according to the current matching results and then compute a new optimal group-wise matching. The two steps are iteratively conducted, yielding the final matching results. Experimental results on various datasets demonstrate the effectiveness of our approach.
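The multi-grain objects named in this abstract (individuals, pairs, and triples inside one group image) can be enumerated as follows. This is a minimal sketch; the function name and the use of integer member ids are assumptions for illustration, and the representation and matching steps are not shown.

```python
from itertools import combinations


def multi_grain_objects(member_ids):
    """Enumerate individuals and subgroups of two and three members.

    Returns a dict keyed by grain size (1, 2, 3), each holding a list of tuples.
    """
    members = sorted(member_ids)
    return {
        1: [(m,) for m in members],
        2: list(combinations(members, 2)),
        3: list(combinations(members, 3)),
    }


if __name__ == "__main__":
    grains = multi_grain_objects([3, 1, 2, 4])  # a hypothetical group of four
    for size, objects in grains.items():
        print(size, len(objects), objects)
```

For a group of four people this yields 4 individuals, 6 pairs, and 4 triples, which shows how quickly the number of objects to weight and match grows with group size.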

### OSMO: Online Specific Models for Occlusion in Multiple Object Tracking under Surveillance Scene

•      Xu Gao
• Tingting Jiang

With the demands of intelligent monitoring, multiple object tracking (MOT) in surveillance scenes has become an essential but challenging task. Occlusion is the primary difficulty in surveillance MOT, and it can be categorized into inter-object occlusion and obstacle occlusion. Many current studies on general MOT focus on the former, but few studies have been conducted on the latter. In fact, there is useful prior knowledge in surveillance videos, because the scene structure is fixed. Hence, we propose two models for dealing with these two kinds of occlusions. The attention-based appearance model is proposed to solve the inter-object occlusion, and the scene structure model is proposed to solve the obstacle occlusion. We also design an obstacle map segmentation method for segmenting obstacles from the surveillance scene. Furthermore, to evaluate our method, we propose four new surveillance datasets that contain videos with obstacles. Experimental results show the effectiveness of our two models.

### Video Forecasting with Forward-Backward-Net: Delving Deeper into Spatiotemporal Consistency

•      Yuke Li

Video forecasting is an emerging topic in the computer vision field, and it is a pivotal step toward unsupervised video understanding. However, the predictions generated from the state-of-the-art methods might be far from ideal quality, due to a lack of guidance from the labeled data of correct predictions (e.g., the annotated future pose of a person). Hence, building a network for better predicting future sequences in an unsupervised manner has to be further pursued. To this end, we put forth a novel Forward-Backward-Net (FB-Net) architecture, which delves deeper into spatiotemporal consistency. It first derives the forward consistency from the raw historical observations. In contrast to mainstream video forecasting approaches, FB-Net then investigates the backward consistency from the future to the past to reinforce the predictions. The final predicted results are inferred by jointly taking both the forward and backward consistencies into account. Moreover, we embed the motion dynamics and the visual content into a single framework via the FB-Net architecture, which significantly differs from learning each component throughout the videos separately. We evaluate our FB-Net on the large-scale KTH and UCF101 datasets. The experiments show that it can introduce considerable margin improvements with respect to most recent leading studies.

### Feature Constrained by Pixel: Hierarchical Adversarial Deep Domain Adaptation

•      Rui Shao
• Xiangyuan Lan
• Pong C. Yuen

### Fast and Light Manifold CNN based 3D Facial Expression Recognition across Pose Variations

•      Zhixing Chen
• Di Huang
• Yunhong Wang
• Liming Chen

This paper proposes a novel approach to 3D Facial Expression Recognition (FER), and it is based on a Fast and Light Manifold CNN model, namely FLM-CNN. Different from current manifold CNNs, FLM-CNN adopts a human vision inspired pooling structure and a multi-scale encoding strategy to enhance geometry representation, which highlights shape characteristics of expressions and runs efficiently. Furthermore, a sampling tree based preprocessing method is presented, and it sharply saves memory when applied to 3D facial surfaces, without much information loss of original data. More importantly, due to the property of manifold CNN features of being rotation-invariant, the proposed method shows a high robustness to pose variations. Extensive experiments are conducted on BU-3DFE, and state-of-the-art results are achieved, indicating its effectiveness.

### Explore Multi-Step Reasoning in Video Question Answering

•      Xiaomeng Song
• Yucheng Shi
• Xin Chen
• Yahong Han

### Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling

•      Shancheng Fang
• Hongtao Xie
• Zheng-Jun Zha
• Nannan Sun
• Jianlong Tan
• Yongdong Zhang

Recent dominant approaches for scene text recognition are mainly based on convolutional neural network (CNN) and recurrent neural network (RNN), where the CNN processes images and the RNN generates character sequences. Different from these methods, we propose an attention-based architecture which is completely based on CNNs. The distinctive characteristics of our method include: (1) The method follows an encoder-decoder architecture, in which the encoder is a two-dimensional residual CNN and the decoder is a deep one-dimensional CNN. (2) An attention module that captures visual cues, and a language module that models linguistic rules are designed equally in the decoder. Therefore, the attention and language can be viewed as an ensemble to boost predictions jointly. (3) Instead of using a single loss from language aspect, multiple losses from attention and language are accumulated for training the networks in an end-to-end way. We conduct experiments on standard datasets for scene text recognition, including Street View Text, IIIT5K and ICDAR datasets. The experimental results show our CNN-based method has achieved state-of-the-art performance on several benchmark datasets, even without the use of RNN.

### Temporal Sequence Distillation: Towards Few-Frame Action Recognition in Videos

•      Zhaoyang Zhang
• Zhanghui Kuang
• Ping Luo
• Litong Feng
• Wei Zhang

Video Analytics Software as a Service (VA SaaS) has been rapidly growing in recent years. VA SaaS is typically accessed by users using a lightweight client. Because the transmission bandwidth between the client and cloud is usually limited and expensive, it brings great benefits to design cloud video analysis algorithms with a limited data transmission requirement. Although considerable research has been devoted to video analysis, to the best of our knowledge, little of it has paid attention to the transmission bandwidth limitation in SaaS. As the first attempt in this direction, this work introduces the problem of few-frame action recognition, which aims at maintaining high recognition accuracy when accessing only a few frames during both training and test. Unlike previous work that processed dense frames, we present Temporal Sequence Distillation (TSD), which distills a long video sequence into a very short one for transmission. By end-to-end training with 3D CNNs for video action recognition, TSD learns a compact and discriminative temporal and spatial representation of video frames. On the Kinetics dataset, TSD+I3D typically requires only 50% of the number of frames compared to I3D, a state-of-the-art video action recognition algorithm, to achieve almost the same accuracies. The proposed TSD has three appealing advantages. Firstly, TSD has a lightweight architecture and can be deployed in the client, e.g., mobile devices, to produce compressed representative frames to save transmission bandwidth. Secondly, TSD significantly reduces the computations to run video action recognition with compressed frames on the cloud, while maintaining high recognition accuracies. Thirdly, TSD can be plugged in as a preprocessing module of any existing 3D CNNs. Extensive experiments show the effectiveness and characteristics of TSD.

### Previewer for Multi-Scale Object Detector

•      Zhihang Fu
• Zhongming Jin
• Guo-Jun Qi
• Chen Shen
• Rongxin Jiang
• Yaowu Chen
• Xian-Sheng Hua

Most multi-scale detectors face a challenge of small-size false positives due to the inadequacy of low-level features, which have small receptive field sizes and weak semantic capabilities. This paper demonstrates that independent predictions from different feature layers on the same region are beneficial for reducing false positives. We propose a novel light-weight previewer block, which previews the objectness probability for the potential regression region of each prior box, using the stronger features with larger receptive fields and more contextual information for better predictions. This previewer block is generic and can be easily implemented in multi-scale detectors, such as SSD, RFBNet and MS-CNN. Extensive experiments are conducted on PASCAL VOC and KITTI pedestrian benchmark to show the superiority of the proposed method.

### Learning Discriminative Features with Multiple Granularities for Person Re-Identification

•      Guanshuo Wang
• Yufeng Yuan
• Xiong Chen
• Jiwei Li
• Xi Zhou

The combination of global and partial features has been an essential solution to improve discriminative performances in person re-identification (Re-ID) tasks. Previous part-based methods mainly focus on locating regions with specific pre-defined semantics to learn local representations, which increases learning difficulty but is not efficient or robust in scenarios with large variances. In this paper, we propose an end-to-end feature learning strategy integrating discriminative information with various granularities. We carefully design the Multiple Granularity Network (MGN), a multi-branch deep network architecture consisting of one branch for global feature representations and two branches for local feature representations. Instead of learning on semantic regions, we uniformly partition the images into several stripes, and vary the number of parts in different local branches to obtain local feature representations with multiple granularities. Comprehensive experiments implemented on the mainstream evaluation datasets including Market-1501, DukeMTMC-reid and CUHK03 indicate that our method robustly achieves state-of-the-art performances and outperforms any existing approaches by a large margin. For example, on Market-1501 dataset in single query mode, we obtain a top result of Rank-1/mAP=96.6%/94.2% with this method after re-ranking.
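The uniform stripe partition mentioned in this abstract can be sketched on a toy feature map. This is an illustration only: the part counts (2 and 3), the use of max pooling, and the function names are assumptions, since the abstract states only that the local branches use different numbers of parts.

```python
def stripe_bounds(height, num_parts):
    """Return (start, end) row ranges for uniform horizontal stripes."""
    if num_parts <= 0 or num_parts > height:
        raise ValueError("num_parts must be between 1 and height")
    return [
        (i * height // num_parts, (i + 1) * height // num_parts)
        for i in range(num_parts)
    ]


def stripe_max_pool(feature_map, num_parts):
    """Pool each horizontal stripe of a 2D feature map to a single value.

    feature_map is a list of rows, each row a list of numbers.
    """
    bounds = stripe_bounds(len(feature_map), num_parts)
    return [
        max(max(row) for row in feature_map[start:end])
        for start, end in bounds
    ]


if __name__ == "__main__":
    fmap = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
    # One global descriptor plus two local branches with assumed part counts.
    print(stripe_max_pool(fmap, 1))
    print(stripe_max_pool(fmap, 2))
    print(stripe_max_pool(fmap, 3))
```

Varying `num_parts` across branches is what gives the coarse-to-fine granularity: one part covers the whole body, while more parts isolate narrower horizontal regions.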

### StripNet: Towards Topology Consistent Strip Structure Segmentation

•      Guoxiang Qu
• Wenwei Zhang
• Zhe Wang
• Xing Dai
• Jianping Shi
• Junjun He
• Fei Li
• Xiulan Zhang
• Yu Qiao

In this work, we propose to study a special semantic segmentation problem where the targets are long and continuous strip patterns. Strip patterns widely exist in medical images and natural photos, such as retinal layers in OCT images and lanes on the roads, and segmentation of them has practical significance. Traditional pixel-level segmentation methods largely ignore the structure prior of strip patterns and thus easily suffer from the topological inconformity problem, such as holes and isolated islands in segmentation results. To tackle this problem, we design a novel deep framework, StripNet, that leverages the strong end-to-end learning ability of CNNs to predict the structured outputs as a sequence of boundary locations of the target strips. Specifically, StripNet decomposes the original segmentation problem into more easily solved local boundary-regression problems, and takes account of the topological constraints on the predicted boundaries. Moreover, our framework adopts a coarse-to-fine strategy and uses carefully designed heatmaps for training the boundary localization network. We examine StripNet on two challenging strip pattern segmentation tasks, retinal layer segmentation and lane detection. Extensive experiments demonstrate that StripNet achieves excellent results and outperforms state-of-the-art methods in both tasks.
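To illustrate why predicting boundary locations helps with topology, the sketch below converts per-column boundary rows into strip labels. It is not StripNet's actual mechanism, which the abstract does not detail; the ordering rule (a running maximum) and the function names are assumptions chosen for illustration.

```python
def enforce_order(boundaries):
    """Make one column's boundary rows non-decreasing from top to bottom."""
    ordered = []
    current = 0
    for b in boundaries:
        current = max(current, b)
        ordered.append(current)
    return ordered


def column_labels(boundaries, height):
    """Label each row of a column with the index of the strip it falls in.

    Row r gets label k when exactly k ordered boundaries lie at or above r,
    so labels never decrease down the column.
    """
    ordered = enforce_order(boundaries)
    return [sum(1 for b in ordered if row >= b) for row in range(height)]


if __name__ == "__main__":
    print(column_labels([2, 5], 7))
    print(column_labels([5, 2], 7))  # out-of-order prediction gets repaired
```

Because labels are derived from ordered boundaries rather than from independent per-pixel decisions, each strip is a single contiguous run in every column, so holes and isolated islands cannot appear.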

### Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

•      Samuel Albanie
• Arsha Nagrani
• Andrea Vedaldi
• Andrew Zisserman

Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
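The cross-modal distillation idea in this abstract can be sketched as a loss that pushes a speech student toward the soft predictions of a face teacher on the same video segment. This is a minimal sketch: the temperature value, the cross-entropy form of the loss, and the function names are assumptions, not details stated in the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    The teacher sees the face, the student sees the voice; no audio labels
    are needed because the teacher's output acts as the target.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]   # hypothetical face-emotion logits
    student = [-1.0, 0.5, 2.0]   # hypothetical speech-emotion logits
    print(distillation_loss(teacher, student))
    print(distillation_loss(teacher, teacher))
```

The loss is smallest when the student reproduces the teacher's distribution, which is the signal used to train the student without labelled audio.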

### Personalized Multiple Facial Action Unit Recognition through Generative Adversarial Recognition Network

•      Can Wang
• Shangfei Wang

Personalized facial action unit (AU) recognition is challenging due to subject-dependent facial behavior. This paper proposes a method to recognize personalized multiple facial AUs through a novel generative adversarial network, which adapts the distribution of source domain facial images to that of target domain facial images and detects multiple AUs by leveraging AU dependencies. Specifically, we use a generative adversarial network to generate synthetic images from source domain; the synthetic images have a similar appearance to the target subject and retain the AU patterns of the source images. We simultaneously leverage AU dependencies to train a multiple AU classifier. Experimental results on three benchmark databases demonstrate that the proposed method can successfully realize unsupervised domain adaptation for individual AU detection, and thus outperforms state-of-the-art AU detection methods.

### Investigation of Small Group Social Interactions Using Deep Visual Activity-Based Nonverbal Features

•      Cigdem Beyan
• Vittorio Murino

Understanding small group face-to-face interactions is a prominent research problem for social psychology, while its automatic realization has recently become popular in social computing. This is mainly investigated in terms of nonverbal behaviors, as they are one of the main facets of communication. Among several multi-modal nonverbal cues, visual activity is an important one, and its sufficiently good performance can be crucial, for instance, when the audio sensors are missing. The existing visual activity-based nonverbal features, which are all hand-crafted, perform well enough for some applications but not for others. Given these observations, we claim that there is a need for more robust feature representations, which can be learned from the data itself. To realize this, we propose a novel method, which is composed of optical flow computation, deep neural network based feature learning, feature encoding and classification. Additionally, a comprehensive analysis of different feature encoding techniques is also presented. The proposed method is tested on three research topics, which can be perceived during small group interactions, i.e., meetings: i) emergent leader detection, ii) emergent leadership style prediction, and iii) high/low extraversion classification. The proposed method shows (significantly) better results not only as compared to the state-of-the-art visual activity-based nonverbal features but also when the state-of-the-art visual activity-based nonverbal features are combined with other audio-based and video-based nonverbal features.

### Cross-Species Learning: A Low-Cost Approach to Learning Human Fight from Animal Fight

•      Eugene Yujun Fu
• Michael Xuelin Huang
• Hong Va Leong
• Grace Ngai

Detecting human fight behavior from videos is important in social signal processing, especially in the context of surveillance. However, the uncommon occurrence of real human fight events generally restricts the data collection for fight detection in machine learning, and thus hampers the performance of contemporary data-driven approaches. To address this challenge, we present a novel cross-species learning method with a set of low-computational cost motion features for fight detection. It effectively circumvents the problem of limited human fight data for data-demanding approaches. Our method exploits the intrinsic commonality between human and animal fights, such as the physical acceleration of moving body parts. It also leverages an ensemble learning mechanism to adapt useful knowledge from similar source subsets across species. Our evaluation results demonstrate the effectiveness of the proposed feature representation for cross-species adaptation. We believe that cross-species learning is not only a promising solution to the data constraint issue, but it also sheds light on the studies of other human mental and social behaviors in cross-disciplinary research.

### Personalized Serious Games for Cognitive Intervention with Lifelog Visual Analytics

•      Qianli Xu
• Vigneshwaran Subbaraju
• Chee How Cheong
• Aijing Wang
• Kathleen Kang
• Munirah Bashir
• Yanhong Dong
• Liyuan Li
• Joo-Hwee Lim

This paper presents a novel serious game app and a method to create and integrate personalized game content based on lifelog visual analytics. The main objective is to extract personalized content from visual lifelogs, integrate it into mobile games, and evaluate the effect of personalization on user experience. First, a suite of visual analysis methods is proposed to extract semantic information from visual lifelogs and discover the association among the lifelog entities. The outcome is a dataset that contains augmented and personal lifelog images. Next, a mobile game app is developed that makes use of the dataset as game content. Finally, an experiment is conducted to evaluate user gameplay behaviors in the wild over three months, where a mixture of generic and personalized game content is deployed. It is observed that user adherence is heightened by personalized game content as compared to generic content. Also observed is a higher enjoyment level in personalized than generic game content. The result provides the first empirical evidence of the effect of personalized games on user adherence and preference for cognitive intervention. This work paves the way for effective cognitive training with user-generated content.

### Drawing in a Virtual 3D Space - Introducing VR Drawing in Elementary School Art Education

•      Wendy Bolier
• Wolfgang Hürst
• Guido van Bommel
• Joost Bosman
• Harriët Bosman

Drawing is an important part of elementary school education, especially since it contributes to the development of spatial skills. Virtual reality enables us to draw not just on a flat 2D surface, but in 3D space. Our research aims at showing if and how this form of 3D drawing can be beneficial for art education. This paper presents first insights into potential benefits and obstacles when introducing 3D drawing at elementary schools. In an experiment with 18 children, we studied practical aspects, proficiency, and spatial ability development. Our results show improvement in the children's 3D drawing skills but not in their spatial abilities. Their drawing skills also do seem to be correlated with their mental rotation ability, although further research is needed to conclusively confirm this.

### CIRCE: Real-Time Caching for Instance Recognition on Cloud Environments and Multi-Core Architectures

•      Luca Lovagnini
• Wenxiao Zhang
• Farshid Hassani Bijarbooneh
• Pan Hui

In the smartphone era, instance recognition (IR) applications are widely used on mobile devices. However, IR applications are computationally expensive for a single mobile device, and usually they are delegated to back end servers. Nevertheless, even by exploiting the computational power offered by cloud services, IR tasks are not executed in real-time (i.e., in less than 30ms), which is crucial for interactive mobile applications. In this work, we present caching for instance recognition on cloud environments (CIRCE), a similarity caching (SC) framework designed to improve the performance of mobile IR applications in terms of execution time. Additionally, we introduce a parallel version of the Hessian-Affine detector combined with the SIFT descriptor, and a novel cache hit threshold (CHT) algorithm. Finally, we present two new public image datasets that we created to evaluate CIRCE. By using a cache size of just a few hundred entries, we obtain a hit ratio of at least 66% and a precision of at least 97%. In case of a cache hit, our system performs IR tasks in at most 14ms on three different applications.
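The general idea of similarity caching with a hit threshold can be sketched as follows. This is not CIRCE's CHT algorithm, which the abstract does not specify; it is a generic sketch that assumes cosine similarity over fixed-length descriptors, a fixed threshold, and least-recently-used eviction. The class and method names are hypothetical.

```python
import math
from collections import OrderedDict


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SimilarityCache:
    """Return a cached label when a query is similar enough to a stored one."""

    def __init__(self, capacity, hit_threshold):
        self.capacity = capacity
        self.hit_threshold = hit_threshold
        self._entries = OrderedDict()  # key -> (descriptor, label), LRU order
        self._next_key = 0

    def lookup(self, descriptor):
        """Return the label of the most similar entry, or None on a miss."""
        best_key, best_sim = None, -1.0
        for key, (cached, _) in self._entries.items():
            sim = cosine_similarity(descriptor, cached)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is not None and best_sim >= self.hit_threshold:
            self._entries.move_to_end(best_key)  # mark as recently used
            return self._entries[best_key][1]
        return None

    def insert(self, descriptor, label):
        """Store a recognition result, evicting the least recently used entry."""
        if len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)
        self._entries[self._next_key] = (list(descriptor), label)
        self._next_key += 1


if __name__ == "__main__":
    cache = SimilarityCache(capacity=2, hit_threshold=0.95)
    cache.insert([1.0, 0.0], "poster")
    print(cache.lookup([0.99, 0.05]))  # similar query: served from the cache
    print(cache.lookup([0.0, 1.0]))    # dissimilar query: would go to the server
```

On a miss, the caller would run the full recognition pipeline on the server and then call `insert` with the result. The threshold trades hit ratio against precision: a lower value serves more queries from the cache but risks returning the wrong label.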

### Jaguar: Low Latency Mobile Augmented Reality with Flexible Tracking

•      Wenxiao Zhang
• Bo Han
• Pan Hui

In this paper, we present the design, implementation and evaluation of Jaguar, a mobile Augmented Reality (AR) system that features accurate, low-latency, and large-scale object recognition and flexible, robust, and context-aware tracking. Jaguar pushes the limit of mobile AR's end-to-end latency by leveraging hardware acceleration with GPUs on edge cloud. Another distinctive aspect of Jaguar is that it seamlessly integrates marker-less object tracking offered by the recently released AR development tools (e.g., ARCore and ARKit) into its design. Indeed, some approaches used in Jaguar have been studied before in a standalone manner, e.g., it is known that cloud offloading can significantly decrease the computational latency of AR. However, the question of whether the combination of marker-less tracking, cloud offloading and GPU acceleration would satisfy the desired end-to-end latency of mobile AR (i.e., the interval of camera frames) has not been eloquently addressed yet. We demonstrate via a prototype implementation of our proposed holistic solution that Jaguar reduces the end-to-end latency to ~33 ms. It also achieves accurate six degrees of freedom tracking and 97% recognition accuracy for a dataset with 10,000 images.
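The latency target in this abstract is tied to the interval between camera frames. The arithmetic can be checked with a short sketch; the 30 frames-per-second camera rate is an assumption that is consistent with the reported figure of about 33 ms, and the function names are illustrative.

```python
def frame_interval_ms(frames_per_second):
    """Time between consecutive camera frames, in milliseconds."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return 1000.0 / frames_per_second


def fits_frame_budget(latency_ms, frames_per_second):
    """True when end-to-end latency does not exceed one frame interval."""
    return latency_ms <= frame_interval_ms(frames_per_second)


if __name__ == "__main__":
    print(frame_interval_ms(30))          # assumed 30 fps camera
    print(fits_frame_budget(33.0, 30))
    print(fits_frame_budget(33.0, 60))
```

At 30 fps the budget is roughly 33.3 ms per frame, so a 33 ms pipeline finishes before the next frame arrives; at 60 fps the same pipeline would miss the budget.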

## SESSION: Keynote 2

•      Tao Mei

### Challenges and Practices of Large Scale Visual Intelligence in the Real-World

•      Xian-Sheng Hua

Visual intelligence is one of the key aspects of Artificial Intelligence. Considerable technological progress has been made in this direction in the past few years. However, how to incubate the right technologies and convert them into real business value in the real world remains a challenge. In this talk, we will analyze current challenges of visual intelligence in the real world and try to summarize a few key points that help us successfully develop and apply technologies to solve real-world problems. In particular, we will introduce a few successful examples, including "City Brain" and "Luban (visual design)", from the problem definition/discovery, to technology development, to product design, and to realizing business value. City Brain: A city is an aggregate of a huge amount of heterogeneous data. However, extracting meaningful value from that data is nontrivial. City Brain is an end-to-end system whose goal is to glean irreplaceable value from big-city data, specifically videos, with the assistance of rapidly evolving AI technologies and fast-growing computing capacity. From cognition to optimization, to decision-making, from search to prediction and ultimately, to intervention, City Brain improves the way we manage the city, as well as the way we live in it. In this talk, we will introduce current practices of the City Brain platform, as well as what we can do to achieve the goal and make it a reality, step by step. Luban: Different from most typical visual intelligence technologies, which are more focused on analyzing, recognizing or searching visual objects, the goal of Luban (visual design) is to create visual content. In particular, we will introduce an automatic 2D banner design technique that is based on deep learning and reinforcement learning. We will detail how Luban was created and how it changed the world of 2D banner design by creating 50M banners a day.

## SESSION: Deep-1 (Image Translation)

### Session details: Deep-1 (Image Translation)

•      Nicu Sebe

### Structure Guided Photorealistic Style Transfer

•      Yuheng Zhi
• Huawei Wei
• Bingbing Ni

Recent style transfer methods based on deep networks strive to generate more content matching stylized images by adding semantic guidance in the iterative process. However, these approaches can just guarantee the transfer of integral color and texture distribution between semantically equivalent regions, but local variation within these regions cannot be accurately captured. Therefore, the resulting image lacks local plausibility. To this end, we develop a non-parametric patch based style transfer framework to synthesize more content coherent images. By designing a novel patch matching algorithm which simultaneously takes high-level category information and geometric structure information (e.g., human pose and building structure) into account, our proposed method is capable of transferring more detailed distribution and producing more photorealistic stylized images. We show that our approach achieves remarkable style transfer results on contents with geometric structure, including human body, vehicles, buildings, etc.

### Crossing-Domain Generative Adversarial Networks for Unsupervised Multi-Domain Image-to-Image Translation

•      Xuewen Yang
• Dongliang Xie
• Xin Wang

State-of-the-art techniques in Generative Adversarial Networks (GANs) have shown remarkable success in image-to-image translation from peer domain X to domain Y using paired image data. However, obtaining abundant paired data is a non-trivial and expensive process in the majority of applications. When there is a need to translate images across n domains, if the training is performed between every two domains, the complexity of the training will increase quadratically. Moreover, training with data from only two domains at a time cannot benefit from data of other domains, which prevents the extraction of more useful features and hinders the progress of this research area. In this work, we propose a general framework for unsupervised image-to-image translation across multiple domains, which can translate images from domain X to any domain without requiring direct training between the two domains involved in image translation. A byproduct of the framework is the reduction of computing time and computing resources, since it needs less time than training the domains in pairs as is done in state-of-the-art works. Our proposed framework consists of a pair of encoders along with a pair of GANs, which learn high-level features across different domains to generate diverse and realistic samples. Our framework shows competitive results on many image-to-image tasks compared with state-of-the-art techniques.
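The quadratic growth mentioned in this abstract follows from counting domain pairs. The sketch below assumes one training run per unordered pair of domains; the function name is illustrative.

```python
def pairwise_training_runs(num_domains):
    """Number of unordered domain pairs, n * (n - 1) / 2."""
    if num_domains < 0:
        raise ValueError("num_domains must be non-negative")
    return num_domains * (num_domains - 1) // 2


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, pairwise_training_runs(n))
```

Going from 5 to 10 domains raises the count from 10 to 45 pairwise runs, which is the cost a shared multi-domain framework aims to avoid.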

### Multi-View Image Generation from a Single-View

•      Bo Zhao
• Xiao Wu
• Zhi-Qi Cheng
• Hao Liu
• Zequn Jie
• Jiashi Feng

How to generate multi-view images with realistic-looking appearance from only a single view input is a challenging problem. In this paper, we attack this problem by proposing a novel image generation model termed VariGANs, which combines the merits of the variational inference and the Generative Adversarial Networks (GANs). It generates the target image in a coarse-to-fine manner instead of a single pass which suffers from severe artifacts. It first performs variational inference to model global appearance of the object (e.g., shape and color) and produces coarse images of different views. Conditioned on the generated coarse images, it then performs adversarial learning to fill details consistent with the input and generate the fine images. Extensive experiments conducted on two clothing datasets, MVC and DeepFashion, have demonstrated that the generated images with the proposed VariGANs are more plausible than those generated by existing approaches, which provide more consistent global appearance as well as richer and sharper details.

### Sparsely Grouped Multi-Task Generative Adversarial Networks for Facial Attribute Manipulation

•      Jichao Zhang
• Yezhi Shu
• Songhua Xu
• Gongze Cao
• Fan Zhong
• Meng Liu
• Xueying Qin

Recently, Image-to-Image Translation (IIT) has achieved great progress in image style transfer and semantic context manipulation for images. However, existing approaches require exhaustively labelling training data, which is labor demanding, difficult to scale up, and hard to adapt to a new domain. To overcome such a key limitation, we propose Sparsely Grouped Generative Adversarial Networks (SG-GAN) as a novel approach that can translate images in sparsely grouped datasets where only a few train samples are labelled. Using a one-input multi-output architecture, SG-GAN is well-suited for tackling multi-task learning and sparsely grouped learning tasks. The new model is able to translate images among multiple groups using only a single trained model. To experimentally validate the advantages of the new model, we apply the proposed method to tackle a series of attribute manipulation tasks for facial images as a case study. Experimental results show that SG-GAN can achieve comparable results with state-of-the-art methods on adequately labelled datasets while attaining a superior image translation quality on sparsely grouped datasets. Code is available at https://github.com/zhangqianhui/SGGAN-tensorflow.

## SESSION: Vision-1 (Machine Learning)

### Session details: Vision-1 (Machine Learning)

•      Jingkuan Song

### Visual Domain Adaptation with Manifold Embedded Distribution Alignment

•      Jindong Wang
• Wenjie Feng
• Yiqiang Chen
• Han Yu
• Meiyu Huang
• Philip S. Yu

Visual domain adaptation aims to learn robust classifiers for the target domain by leveraging knowledge from a source domain. Existing methods either attempt to align the cross-domain distributions, or perform manifold subspace learning. However, there are two significant challenges: (1) degenerated feature transformation, which means that distribution alignment is often performed in the original feature space, where feature distortions are hard to overcome. On the other hand, subspace learning is not sufficient to reduce the distribution divergence. (2) unevaluated distribution alignment, which means that existing distribution alignment methods only align the marginal and conditional distributions with equal importance, while they fail to evaluate the different importance of these two distributions in real applications. In this paper, we propose a Manifold Embedded Distribution Alignment (MEDA) approach to address these challenges. MEDA learns a domain-invariant classifier in Grassmann manifold with structural risk minimization, while performing dynamic distribution alignment to quantitatively account for the relative importance of marginal and conditional distributions. To the best of our knowledge, MEDA is the first attempt to perform dynamic distribution alignment for manifold domain adaptation. Extensive experiments demonstrate that MEDA shows significant improvements in classification accuracy compared to state-of-the-art traditional and deep methods.
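One way to read "dynamic distribution alignment" is as a weighted combination of a marginal distance and per-class conditional distances. The sketch below assumes a single balance factor `mu` in [0, 1] and precomputed distance values; the exact form used by MEDA and the way the factor is estimated are not given in the abstract, so the formula and names here are assumptions.

```python
def dynamic_alignment(marginal_distance, conditional_distances, mu):
    """Combine marginal and conditional distribution distances.

    mu = 0 uses only the marginal term, mu = 1 only the conditional terms.
    conditional_distances holds one distance per class.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return (1.0 - mu) * marginal_distance + mu * sum(conditional_distances)


if __name__ == "__main__":
    marginal = 2.0              # hypothetical source/target marginal distance
    per_class = [1.0, 3.0]      # hypothetical per-class conditional distances
    for mu in (0.0, 0.5, 1.0):
        print(mu, dynamic_alignment(marginal, per_class, mu))
```

Treating both terms with equal importance corresponds to a fixed `mu`; letting it vary per task is what distinguishes a dynamic weighting from the equal-weight alignment the abstract criticizes.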

### Causally Regularized Learning with Agnostic Data Selection Bias

•      Zheyan Shen
• Peng Cui
• Kun Kuang
• Bo Li
• Peixuan Chen

Most previous machine learning algorithms are built on the i.i.d. hypothesis. However, this ideal assumption is often violated in real applications, where selection bias may arise between the training and testing processes. Moreover, in many scenarios, the testing data is not even available during training, which makes traditional methods such as transfer learning infeasible because they need prior knowledge of the test distribution. Therefore, how to address agnostic selection bias for robust model learning is of paramount importance for both academic research and real applications. In this paper, under the assumption that causal relationships among variables are robust across domains, we incorporate causal techniques into predictive modeling and propose a novel Causally Regularized Logistic Regression (CRLR) algorithm that jointly optimizes global confounder balancing and weighted logistic regression. Global confounder balancing helps to identify causal features, whose causal effects on the outcome are stable across domains; performing logistic regression on those causal features then constructs a predictive model that is robust against the agnostic bias. To validate the effectiveness of our CRLR algorithm, we conduct comprehensive experiments on both synthetic and real-world datasets. Experimental results clearly demonstrate that our CRLR algorithm outperforms the state-of-the-art methods, and the interpretability of our method can be fully depicted by the feature visualization.

### Robust Correlation Filter Tracking with Shepherded Instance-Aware Proposals

•      Yanjie Liang
• Qiangqiang Wu
• Yi Liu
• Yan Yan
• Hanzi Wang

In recent years, convolutional neural network (CNN) based correlation filter trackers have achieved state-of-the-art results on the benchmark datasets. However, the CNN based correlation filters cannot effectively handle large scale variation and distortion (such as fast motion, background clutter, occlusion, etc.), leading to the sub-optimal performance. In this paper, we propose a novel CNN based correlation filter tracker with shepherded instance-aware proposals, namely DeepCFIAP, which automatically estimates the target scale in each frame and re-detects the target when distortion happens. DeepCFIAP is proposed to take advantage of the merits of both instance-aware proposals and CNN based correlation filters. Compared with the CNN based correlation filter trackers, DeepCFIAP can successfully solve the problems of large scale variation and distortion via the shepherded instance-aware proposals, resulting in more robust tracking performance. Specifically, we develop a novel proposal ranking algorithm based on the similarities between proposals and instances. In contrast to the detection proposal based trackers, DeepCFIAP shepherds the instance-aware proposals towards their optimal positions via the CNN based correlation filters, resulting in more accurate tracking results. Extensive experiments on two challenging benchmark datasets demonstrate that the proposed DeepCFIAP performs favorably against state-of-the-art trackers and it is especially feasible for long-term tracking.

### A Unified Framework for Multimodal Domain Adaptation

•      Fan Qi
• Xiaoshan Yang
• Changsheng Xu

Domain adaptation aims to train a model on labeled data from a source domain while minimizing test error on a target domain. Most existing domain adaptation methods focus only on reducing the domain shift of single-modal data. In this paper, we consider a new problem of multimodal domain adaptation and propose a unified framework to solve it. The proposed multimodal domain adaptation neural networks (MDANN) consist of three important modules. (1) A covariant multimodal attention is designed to learn a common feature representation for multiple modalities. (2) A fusion module adaptively fuses attended features of different modalities. (3) Hybrid domain constraints are proposed to comprehensively learn domain-invariant features by constraining single modal features, fused features, and attention scores. Through jointly attending and fusing under an adversarial objective, the most discriminative and domain-adaptive parts of the features are adaptively fused together. Extensive experimental results on two real-world cross-domain applications (emotion recognition and cross-media retrieval) demonstrate the effectiveness of the proposed method.

## SESSION: Multimedia-1 (Multimedia Recommendation & Discovery)

### Session details: Multimedia-1 (Multimedia Recommendation & Discovery)

•      Mark Liao

### What Dress Fits Me Best?: Fashion Recommendation on the Clothing Style for Personal Body Shape

•      Shintami Chusnul Hidayati
• Cheng-Chun Hsu
• Yu-Ting Chang
• Kai-Lung Hua
• Jianlong Fu
• Wen-Huang Cheng

Clothing is an integral part of life, yet deciding what to wear is rarely an easy task. An essential style tip is to dress for the body shape, i.e., knowing one's own body shape (e.g., hourglass, rectangle, round and inverted triangle) and selecting the types of clothes that will accentuate the body's good features. In the literature, although various fashion recommendation systems for clothing items have been developed, none of them has explicitly taken the user's basic body shape into consideration. In this paper, therefore, we propose the first framework for learning the compatibility of clothing styles and body shapes from social big data, with the goal of recommending to users what to wear in relation to their essential body attributes. The experimental results demonstrate the superiority of our proposed approach, leading to a new aspect for research into fashion recommendation.

### CSAN: Contextual Self-Attention Network for User Sequential Recommendation

•      Xiaowen Huang
• Shengsheng Qian
• Quan Fang
• Jitao Sang
• Changsheng Xu

Sequential recommendation is an important task for online user-oriented services, such as purchasing products, watching videos, and social media consumption. Recent work usually uses RNN-based methods to derive an overall embedding of the whole behavior sequence, which fails to discriminate the significance of individual user behaviors and thus decreases the recommendation performance. Besides, RNN-based encoding has a fixed size, which makes further recommendation applications inefficient and inflexible. The online sequential behaviors of a user are generally heterogeneous, polysemous, and dynamically context-dependent. In this paper, we propose a unified Contextual Self-Attention Network (CSAN) to address these three properties. Our model considers heterogeneous user behaviors and projects them into a common latent semantic space. The output is then fed into the feature-wise self-attention network to capture the polysemy of user behaviors. In addition, forward and backward position encoding matrices are proposed to model dynamic contextual dependency. Through extensive experiments on two real-world datasets, we demonstrate the superior performance of the proposed model compared with other state-of-the-art algorithms.

### Attentive Interactive Convolutional Matching for Community Question Answering in Social Multimedia

•      Jun Hu
• Shengsheng Qian
• Quan Fang
• Changsheng Xu

Nowadays, community-based question answering (CQA) services have accumulated millions of users to share valuable knowledge. An essential function in CQA tasks is the accurate matching of answers with respect to given questions. Existing methods usually ignore the redundant, heterogeneous, and multi-modal properties of CQA systems. In this paper, we propose a multi-modal attentive interactive convolutional matching method (MMAICM) that jointly models the multi-modal content and social context of questions and answers in a unified framework for CQA retrieval, and thereby exploits the redundant, heterogeneous, and multi-modal properties of CQA systems. A well-designed attention mechanism is proposed to focus on useful word-pair interactions and neglect meaningless and noisy ones. Moreover, a multi-modal interaction matrix method and a novel meta-path based network representation approach are proposed to consider the multi-modal content and social context, respectively. The attentive interactive convolutional matching network is proposed to infer the relevance between questions and answers, and it can capture both the lexical and the sequential information of the contents. Experimental results on two real-world datasets demonstrate the superior performance of MMAICM compared with other state-of-the-art algorithms.

### Beyond the Product: Discovering Image Posts for Brands in Social Media

•      Francesco Gelli
• Tiberio Uricchio
• Xiangnan He
• Alberto Del Bimbo
• Tat-Seng Chua

Brands and organizations are using social networks such as Instagram to share image or video posts regularly, in order to engage users and maximize their presence. Unlike the traditional advertising paradigm, these posts feature not only specific products, but also the value and philosophy of the brand, known as brand associations in marketing literature. In fact, marketers are spending considerable resources to generate their content in-house and, increasingly often, to discover and repost the content generated by users. However, choosing the right posts for a brand in social media remains an open problem. Driven by this real-life application, we define the new task of content discovery for brands, which aims to discover posts that match the marketing value and brand associations of a target brand. We identify two main challenges in this new task, high inter-brand similarity and brand-post sparsity, and propose a tailored content-based learning-to-rank system to discover content for a target brand. Specifically, our method learns fine-grained brand representation via explicit modeling of brand associations, which can be interpreted as visual words shared among brands. We collected a new large-scale Instagram dataset, consisting of more than 1.1 million image and video posts from the history of 927 brands of fourteen verticals such as food and fashion. Extensive experiments indicate that our model can effectively learn fine-grained brand representations and outperform the closest state-of-the-art solutions.

## SESSION: Vision-2 (Object & Scene Understanding)

### Session details: Vision-2 (Object & Scene Understanding)

•      Zheng-Jun Zha

### Collaborative Annotation of Semantic Objects in Images with Multi-granularity Supervisions

•      Lishi Zhang
• Chenghan Fu
• Jia Li

Per-pixel masks of semantic objects are very useful in many applications but are tedious to annotate. In this paper, we propose a collaborative annotation approach to efficiently generate per-pixel masks of semantic objects in tagged images with multi-granularity supervisions. Given a set of tagged images, a computer agent is dynamically generated to roughly localize the semantic objects described by the tag. The agent first extracts massive object proposals and then infers the tag-related ones under the weak and strong supervisions from linguistically and visually similar images as well as previously annotated objects. By representing such supervisions by over-complete dictionaries, tag-related proposals can pop out according to their sparse coding length, and are then converted to superpixels with binary labels. After that, human annotators participate in the annotation by flipping labels and dividing superpixels with clicks, which are used as click supervisions that teach the agent to recover false positives/negatives in processing images with the same tags. Experimental results show that our approach can facilitate the annotation and generate object masks that are consistent with those generated by the LabelMe toolbox.

### GraphNet: Learning Image Pseudo Annotations for Weakly-Supervised Semantic Segmentation

•      Mengyang Pu
• Yaping Huang
• Qingji Guan
• Qi Zou

Weakly-supervised semantic image segmentation suffers from a lack of accurate pixel-level annotations. In this paper, we propose a novel graph convolutional network-based method, called GraphNet, to learn pixel-wise labels from weak annotations. First, we construct a graph on the superpixels of a training image by combining the low-level spatial relation and high-level semantic content. Meanwhile, scribble or bounding box annotations are embedded into the graph, respectively. Then, GraphNet takes the graph as input and learns to predict high-confidence pseudo image masks by a convolutional network operating directly on graphs. Finally, a segmentation network is trained under the supervision of these pseudo image masks. We conduct comprehensive experiments on the PASCAL VOC 2012 and PASCAL-CONTEXT segmentation benchmarks. Experimental results demonstrate that GraphNet is effective at predicting the pixel labels with scribble or bounding box annotations. The proposed framework yields state-of-the-art results in the community.

### Boosting Scene Parsing Performance via Reliable Scale Prediction

•      Hengcan Shi
• Hongliang Li
• Qingbo Wu
• Fanman Meng
• King N. Ngan

Segmenting objects on suitable scales is a key factor in improving scene parsing performance. Existing methods either simply average multi-scale results or predict scales by weakly-supervised models, due to the lack of scale labels. In this paper, we propose a novel fully-supervised Scale Prediction Model. On one hand, the proposed Scale Prediction Model learns parsing scales by the strong scale supervision, which is automatically generated from the scene parsing ground truth without any extra manual annotation. On the other hand, we explore the relationship between scale and object class, and propose to use the object class information to further improve the reliability of the scale prediction. The proposed Scale Prediction Model improves scale prediction accuracy by 23.1%, 20.1% and 29.3% on the NYU Depth v2, PASCAL-Context and SIFT Flow datasets, respectively. Based on the Scale Prediction Model, we design a Scale Parsing Net (SPNet) for scene parsing, which segments each object on the scale predicted by the Scale Prediction Model. Moreover, SPNet leverages the intermediate result (i.e., the object class) to refine the parsing results. The experimental results show that SPNet outperforms many state-of-the-art methods on multiple scene parsing datasets.

### Learning to Synthesize 3D Indoor Scenes from Monocular Images

•      Fan Zhu
• Li Liu
• Jin Xie
• Fumin Shen
• Ling Shao
• Yi Fang

Depth images have always been playing critical roles for indoor scene understanding problems, and are particularly important for tasks in which 3D inferences are involved. However, since depth images are not universally available, abandoning them from the testing stage can significantly improve the generality of a method. In this work, we consider the scenarios where depth images are not available in the testing data, and propose to learn a convolutional long short-term memory (Conv LSTM) network and a regression convolutional neural network (regression ConvNet) using only monocular RGB images. The proposed networks benefit from 2D segmentations, object-level spatial context, object-scene dependencies and objects' geometric information, where optimization is governed by the semantic label loss, which measures the label consistencies of both objects and scenes, and the 3D geometrical loss, which measures the correctness of objects' 6Dof estimation. Conv LSTM and regression ConvNet are applied to scene/object classification, object detection and 6Dof estimation tasks respectively, where we utilize the joint inference from both networks and further provide the perspective of synthesizing fully rigged 3D scenes according to objects' arrangements in monocular images. Both quantitative and qualitative experimental results are provided on the NYU-v2 dataset, and we demonstrate that the proposed Conv LSTM can achieve state-of-the-art performance without requiring the depth information.

## SESSION: Multimodal-1 (Multimodal Reasoning)

### Session details: Multimodal-1 (Multimodal Reasoning)

•      Xian-Sheng Hua

### Visual Spatial Attention Network for Relationship Detection

•      Chaojun Han
• Fumin Shen
• Li Liu
• Yang Yang
• Heng Tao Shen

Visual relationship detection, which aims to predict a

### Object-Difference Attention: A Simple Relational Attention for Visual Question Answering

•      Chenfei Wu
• Jinlai Liu
• Xiaojie Wang
• Xuan Dong

The attention mechanism has greatly promoted the development of Visual Question Answering (VQA). The attention distribution, which weights objects (such as image regions or bounding boxes) in an image differently according to their importance for answering a question, plays a crucial role in the attention mechanism. Most existing work focuses on fusing image features and text features to calculate the attention distribution without comparisons between different image objects. As a major property of attention, selectivity depends on comparisons between different objects, and such comparisons provide more information for assigning attention. To achieve this, we propose an object-difference attention (ODA), which calculates the probability of attention by applying a difference operator between different objects in an image under the guidance of the question at hand. Experimental results on three publicly available datasets show that our ODA-based VQA model achieves state-of-the-art results. Furthermore, a general form of relational attention is proposed. Besides ODA, several other relational attentions are given. Experimental results show that those relational attentions have strengths on different types of questions.

### Life-long Cross-media Correlation Learning

•      Jinwei Qi
• Yuxin Peng
• Yunkan Zhuo

With the large and dynamically increasing amount of multimedia data, such as image and text, lying in different domains, two major challenges arise for cross-media retrieval. First, measuring the similarities for cross-media correlation between different media types is quite difficult, due to their inconsistent distributions and representations. Second, storing and retraining on such data becomes infeasible, because data of new domains arrives in sequence while the existing data is not always available. Thus, it is necessary to use only the data of the new domain for training while preserving the original correlation capabilities. To address the above issues, in this paper we propose a Cross-media Life-long Learning (CmLL) approach, which can leverage the knowledge learned from the existing data to obtain better correlation performance in a new domain. The main contributions are summarized as follows: (1) Cross-media adapting network. We construct a hierarchical network that not only shares the knowledge from different media types at a high level, but also realizes life-long learning on a new cross-media domain by expanding network capacity adaptively, which supports the adaptivity and extensibility of cross-media correlation learning. (2) Cross-media life-long learning. We propose both intra-domain distribution alignment and inter-domain knowledge distillation, which not only effectively preserve the correlation ability in old cross-media domains, but also improve the performance in the new domain by transferring knowledge among different domains. We conduct extensive experiments on multiple cross-media datasets for different domains under lifelong learning scenarios to verify the effectiveness of our proposed CmLL approach.

### Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder

•      Yue Gu
• Xinyu Li
• Kaixiang Huang
• Shiyu Fu
• Kangning Yang
• Shuhong Chen
• Moliang Zhou
• Ivan Marsic

Human conversation analysis is challenging because the meaning can be expressed through words, intonation, or even body language and facial expression. We introduce a hierarchical encoder-decoder structure with attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data that are then formulated into conversation-level features. The corresponding hierarchical decoder is able to predict different attributes at given time instances. To integrate multiple sensory inputs, we introduce a novel fusion strategy with modality attention. We evaluated our system on published emotion recognition, sentiment analysis, and speaker trait analysis datasets. Our system outperformed previous state-of-the-art approaches in both classification and regressions tasks on three datasets. We also outperformed previous approaches in generalization tests on two commonly used datasets. We achieved comparable performance in predicting co-existing labels using the proposed model instead of multiple individual models. In addition, the easily-visualized modality and temporal attention demonstrated that the proposed attention mechanism helps feature selection and improves model interpretability.

## SESSION: System-1 (Video Analysis & Streaming)

### Session details: System-1 (Video Analysis & Streaming)

•      Xin Yang

### End-to-End Blind Quality Assessment of Compressed Videos Using Deep Neural Networks

•      Wentao Liu
• Zhengfang Duanmu
• Zhou Wang

Blind video quality assessment (BVQA) algorithms are traditionally designed with a two-stage approach: a feature extraction stage that computes typically hand-crafted spatial and/or temporal features, and a regression stage working in the feature space that predicts the perceptual quality of the video. Unlike the traditional BVQA methods, we propose a Video Multi-task End-to-end Optimized neural Network (V-MEON) that merges the two stages into one, where the feature extractor and the regressor are jointly optimized. Our model uses a multi-task DNN framework that not only estimates the perceptual quality of the test video but also provides a probabilistic prediction of its codec type. This framework allows us to train the network with two complementary sets of labels, both of which can be obtained at low cost. The training process is composed of two steps. In the first step, early convolutional layers are pre-trained to extract spatiotemporal quality-related features with the codec classification subtask. In the second step, initialized with the pre-trained feature extractor, the whole network is jointly optimized with the two subtasks together. An additional critical step is the adoption of 3D convolutional layers, which creates novel spatiotemporal features that lead to a significant performance boost. Experimental results show that the proposed model clearly outperforms state-of-the-art BVQA methods. The source code of V-MEON is available at https://ece.uwaterloo.ca/~zduanmu/acmmm2018bvqa.

### FlexStream: Towards Flexible Adaptive Video Streaming on End Devices using Extreme SDN

•      Ibrahim Ben Mustafa
• Emir Halepovic

We present FlexStream, a programmable framework realized by implementing Software-Defined Networking (SDN) functionality on end devices. FlexStream exploits the benefits of both centralized and distributed components to achieve dynamic management of end devices, as required and in accordance with specified policies. We evaluate FlexStream on one example use case -- the adaptive video streaming, where bandwidth control is employed to drive selection of video bitrates, improve stability and increase robustness against background traffic. When applied to competing streaming clients, FlexStream reduces bitrate switching by 81%, stall duration by 92%, and startup delay by 44%, while improving fairness among players. In addition, we report the first implementation of SDN-based control in Android devices running in real Wi-Fi and live cellular networks.

### CLS: A Cross-user Learning based System for Improving QoE in 360-degree Video Adaptive Streaming

•      Lan Xie
• Xinggong Zhang
• Zongming Guo

Viewport adaptive streaming is emerging as a promising way to deliver high-quality 360-degree video. Predicting the user's viewpoint and delivering partial video within the viewport remains a critical issue. Current widely-used motion-based or content-saliency methods have low precision, especially for long-term prediction. In this paper, benefiting from data-driven learning, we propose a Cross-user Learning based System (CLS) to improve the precision of viewport prediction. Since users have similar regions of interest (ROI) when watching the same video, it is possible to exploit cross-user ROI behavior to predict the viewport. We use a machine learning algorithm to group users according to historical fixations, and predict the viewing probability by class. Additionally, we present a QoE-driven rate allocation to minimize the expected streaming distortion under a bandwidth constraint, and give a Multiple-Choice Knapsack solution. Experiments demonstrate that CLS provides a 2 dB quality improvement over full-image streaming and a 1.5 dB quality improvement over the linear regression (LR) method. On average, the precision of viewpoint prediction improves by 15% compared with LR.
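
A rate allocation of this kind can be sketched as a Multiple-Choice Knapsack problem: each tile offers several encodings, exactly one is chosen per tile, and the total rate must stay within the bandwidth budget. The sketch below is a generic dynamic program under assumed inputs (integer rates, and per-option expected distortion already weighted by the predicted viewing probability); `allocate_rates` and the data layout are hypothetical and are not taken from the paper.

```python
def allocate_rates(tiles, budget):
    """Pick one (rate, expected_distortion) option per tile.

    tiles: list of tiles; each tile is a list of (rate, expected_distortion)
           options. Rates are assumed to be non-negative integers.
    budget: maximum total rate.
    Returns (total_distortion, chosen_option_indices), or None if no
    combination fits within the budget.
    """
    # best maps a used rate to the (lowest distortion, choices) reaching it.
    best = {0: (0.0, [])}
    for options in tiles:
        nxt = {}
        for used, (dist, choices) in best.items():
            for k, (rate, option_dist) in enumerate(options):
                total = used + rate
                if total > budget:
                    continue  # this option would exceed the bandwidth budget
                cand = dist + option_dist
                if total not in nxt or cand < nxt[total][0]:
                    nxt[total] = (cand, choices + [k])
        best = nxt
        if not best:
            return None  # no feasible choice for this tile
    return min(best.values(), key=lambda entry: entry[0])


if __name__ == "__main__":
    # Two tiles, each with a low-rate and a high-rate encoding.
    tiles = [
        [(1, 10.0), (3, 2.0)],  # tile likely to be viewed: distortion matters more
        [(1, 4.0), (2, 1.0)],   # tile unlikely to be viewed
    ]
    print(allocate_rates(tiles, budget=4))
```

With a budget of 4, the sketch spends the extra rate on the first tile, where the expected distortion reduction is largest.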

### A Distributed Approach for Bitrate Selection in HTTP Adaptive Streaming

•      Abdelhak Bentaleb
• Ali C. Begen
• Roger Zimmermann

Past research has shown that concurrent HTTP adaptive streaming (HAS) players behave selfishly and the resulting competition for shared resources leads to underutilization or oversubscription of the network, presentation quality instability and unfairness among the players, all of which adversely impact the viewer experience. While coordination among the players, as opposed to all being selfish, has its merits and may alleviate some of these issues, a fully distributed architecture is still desirable in many deployments and better reflects the design spirit of HAS. In this study, we propose a distributed bitrate adaptation scheme for HAS that borrows ideas from consensus and game theory frameworks. Experimental results show that the proposed distributed approach provides significant improvements in terms of viewer experience, presentation quality stability, fairness and network utilization, without using any explicit communication between the players.

## SESSION: FF-3

### Session details: FF-3

•      Zhu Li

### High-Quality Exposure Correction of Underexposed Photos

•      Qing Zhang
• Ganzhao Yuan
• Chunxia Xiao
• Lei Zhu
• Wei-Shi Zheng

We address the problem of correcting the exposure of underexposed photos. Previous methods have tackled this problem from many different perspectives and achieved remarkable progress. However, they usually fail to produce natural-looking results due to visual artifacts such as color distortion, loss of detail, exposure inconsistency, etc. We find that the main reason existing methods induce these artifacts is that they break a perceptual similarity between the input and output. Based on this observation, an effective criterion, termed perceptually bidirectional similarity (PBS), is proposed. Based on this criterion and the Retinex theory, we cast the exposure correction problem as an illumination estimation optimization, where PBS is defined as three constraints for estimating illumination that can generate the desired result with even exposure, vivid color and clear textures. Qualitative and quantitative comparisons, as well as a user study, demonstrate the superiority of our method over the state-of-the-art methods.
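
As background, Retinex theory models an observed image as the product of reflectance and illumination, so an exposure-corrected image can be recovered by dividing the input by an illumination estimate. The sketch below illustrates only this decomposition, using a crude per-pixel estimate (the maximum of the color channels) as an assumption; it does not implement the PBS constraints or the optimization described in the abstract, and `correct_exposure` is a hypothetical name.

```python
def correct_exposure(pixels, eps=1e-3):
    """Divide each pixel by a crude illumination estimate (Retinex-style).

    pixels: list of (r, g, b) tuples with values in [0, 1].
    eps: lower bound on the illumination to avoid division by zero.
    Assumption: illumination is estimated per pixel as max(r, g, b); a real
    method would estimate a smoother illumination map under extra constraints.
    """
    corrected = []
    for r, g, b in pixels:
        illumination = max(r, g, b, eps)
        corrected.append((r / illumination, g / illumination, b / illumination))
    return corrected


if __name__ == "__main__":
    dark = [(0.1, 0.05, 0.2), (0.0, 0.0, 0.0)]
    print(correct_exposure(dark))
```

Because the division preserves the ratios between channels, hue is kept while brightness is raised; choosing the illumination map well is what separates natural-looking results from over-enhanced ones.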

### A Margin-based MLE for Crowdsourced Partial Ranking

•      Qianqian Xu
• Jiechao Xiong
• Xinwei Sun
• Zhiyong Yang
• Xiaochun Cao
• Qingming Huang
• Yuan Yao

A preference order or ranking aggregated from pairwise comparison data is commonly understood as a strict total order. However, in real-world scenarios, some items are intrinsically ambiguous in comparisons, which may very well be an inherent uncertainty of the data. In this case, the conventional total order ranking cannot capture such uncertainty with mere global ranking or utility scores. In this paper, motivated by the recent surge in crowdsourcing applications, we are specifically interested in predicting partial but more accurate (i.e., making fewer incorrect statements) orders rather than complete ones. To do so, we propose a novel framework to learn some probabilistic models of partial orders as a margin-based Maximum Likelihood Estimate (MLE) method. We prove that the induced MLE is a joint convex optimization problem with respect to all the parameters, including the global ranking scores and margin parameter. Moreover, three kinds of generalized linear models are studied, including the basic uniform model, Bradley-Terry model, and Thurstone-Mosteller model, equipped with some theoretical analysis on FDR and Power control for the proposed methods. The validity of these models is supported by experiments with both simulated and real-world datasets, which show that the proposed models exhibit improvements compared with traditional state-of-the-art algorithms.
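
To make the idea of a margin-induced partial order concrete, the sketch below pairs the standard Bradley-Terry win probability (a logistic function of the score difference) with a simple thresholding rule: two items are declared comparable only when their score gap exceeds a margin. The thresholding rule, the function names, and the example scores are illustrative assumptions; the abstract does not specify the exact likelihood or how the margin is learned.

```python
import math


def bt_win_probability(score_i, score_j):
    """Bradley-Terry probability that item i is preferred to item j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def partial_order(scores, margin):
    """Return sorted (i, j) pairs meaning 'i is preferred to j'.

    scores: dict mapping item name to a global ranking score.
    margin: non-negative threshold; pairs whose score gap does not exceed it
            are left incomparable (an assumed decision rule for illustration).
    """
    pairs = []
    for i in scores:
        for j in scores:
            if i != j and scores[i] - scores[j] > margin:
                pairs.append((i, j))
    return sorted(pairs)


if __name__ == "__main__":
    scores = {"a": 2.0, "b": 1.9, "c": 0.0}
    print(partial_order(scores, margin=0.5))
    print(bt_win_probability(scores["a"], scores["b"]))
```

Here items `a` and `b` stay incomparable because their scores are close, which is the kind of ambiguity a strict total order would hide.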

### PHD-GIFs: Personalized Highlight Detection for Automatic GIF Creation

•      Ana Garcia del Molino
• Michael Gygli

Highlight detection models are typically trained to identify cues that make visual content appealing or interesting for the general public, with the objective of reducing a video to such moments. However, this "interestingness" of a video segment or image is subjective. Thus, such highlight models provide results of limited relevance for the individual user. On the other hand, training one model per user is inefficient and requires large amounts of personal information which is typically not available. To overcome these limitations, we present a global ranking model which can condition on a particular user's interests. Rather than training one model per user, our model is personalized via its inputs, which allows it to effectively adapt its predictions, given only a few user-specific examples. To train this model, we create a large-scale dataset of users and the GIFs they created, giving us an accurate indication of their interests. Our experiments show that using the user history substantially improves the prediction accuracy. On a test set of 850 videos, our model improves the recall by 8% with respect to generic highlight detectors. Furthermore, our method proves more precise than the user-agnostic baselines even with only one single person-specific example.

### Cross-Domain Adversarial Feature Learning for Sketch Re-identification

•      Lu Pang
• Yaowei Wang
• Yi-Zhe Song
• Tiejun Huang
• Yonghong Tian

In person re-identification (Re-ID), a query photo of the target person is often required for retrieval. However, such a photo is not always readily available in a practical forensic setting. In this paper, we define the problem of Sketch Re-ID, which, instead of using a photo as input, initiates the query process using a professional sketch of the target person. This is akin to the traditional problem of forensic facial sketch recognition, with the major difference that our sketches are of the whole body rather than just the face. This problem is challenging because sketches and photos are in two distinct domains. Specifically, a sketch is an abstract description of a person. In addition, a person's appearance in photos varies with camera viewpoint, human pose and occlusion. We address the Sketch Re-ID problem by proposing a cross-domain adversarial feature learning approach to jointly learn the identity features and domain-invariant features. We employ adversarial feature learning to filter out low-level interfering features and retain high-level semantic information. We also contribute to the community the first Sketch Re-ID dataset with 200 persons, where each person has one sketch and two photos from different cameras. Extensive experiments have been performed on the proposed dataset and other common sketch datasets including CUFSF and QMUL-shoe. Results show that the proposed method outperforms the state of the art.

### Semantic Human Matting

•      Quan Chen
• Tiezheng Ge
• Yanyu Xu
• Zhiqiang Zhang
• Xinxin Yang
• Kun Gai

Human matting, the high-quality extraction of humans from natural images, is crucial for a wide variety of applications. Since the matting problem is severely under-constrained, most previous methods require user interaction in the form of user-designated trimaps or scribbles as constraints. This user-in-the-loop nature makes them difficult to apply to large-scale data or time-sensitive scenarios. In this paper, instead of using explicit user input constraints, we employ implicit semantic constraints learned from data and propose an automatic human matting algorithm, Semantic Human Matting (SHM). SHM is the first algorithm that learns to jointly fit both semantic information and high-quality details with deep networks. In practice, simultaneously learning both coarse semantics and fine details is challenging. We propose a novel fusion strategy which naturally gives a probabilistic estimation of the alpha matte. We also construct a very large dataset with high-quality annotations consisting of 35,513 unique foregrounds to facilitate the learning and evaluation of human matting. Extensive experiments on this dataset and many real images show that SHM achieves results comparable to state-of-the-art interactive matting methods.

### Geometry Guided Adversarial Facial Expression Synthesis

•      Lingxiao Song
• Zhihe Lu
• Ran He
• Zhenan Sun
• Tieniu Tan

Facial expression synthesis has drawn much attention in the fields of computer graphics and pattern recognition. It has been widely used in face animation and recognition. However, it is still challenging due to the high-level semantic presence of large and non-linear face geometry variations. This paper proposes a Geometry-Guided Generative Adversarial Network (G2-GAN) for continuously-adjusting and identity-preserving facial expression synthesis. We employ facial geometry (fiducial points) as a controllable condition to guide facial texture synthesis with a specific expression. A pair of generative adversarial subnetworks is jointly trained towards opposite tasks: expression removal and expression synthesis. The paired networks form a mapping cycle between the neutral expression and arbitrary expressions, with which the proposed approach can be applied to unpaired data. The proposed paired networks also facilitate other applications such as face transfer, expression interpolation and expression-invariant face recognition. Experimental results on several facial expression databases show that our method can generate compelling perceptual results on different expression editing tasks.

### Detecting Abnormality without Knowing Normality: A Two-stage Approach for Unsupervised Video Abnormal Event Detection

•      Siqi Wang
• Yijie Zeng
• Qiang Liu
• Chengzhang Zhu
• En Zhu
• Jianping Yin

Abnormal event detection in video surveillance is a valuable but challenging problem. Most methods adopt a supervised setting that requires collecting videos with only normal events for training. However, very few attempts have been made under the unsupervised setting, which detects abnormality without prior knowledge of normal events. Existing unsupervised methods detect drastic local changes as abnormality, which overlooks the global spatio-temporal context. This paper proposes a novel unsupervised approach, which not only avoids manually specifying normality for training as supervised methods do, but also takes the whole spatio-temporal context into consideration. Our approach consists of two stages. First, the normality estimation stage trains an autoencoder and estimates the normal events globally from the entire set of unlabeled videos by a self-adaptive reconstruction loss thresholding scheme. Second, the normality modeling stage feeds the estimated normal events from the previous stage into a one-class support vector machine to build a refined normality model, which can further exclude abnormal events and enhance abnormality detection performance. Experiments on various benchmark datasets reveal that our method not only outperforms existing unsupervised methods by a large margin (up to 14.2% AUC gain), but also yields comparable or even superior performance to state-of-the-art supervised methods.
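
The first stage described above can be illustrated with a small sketch. The abstract does not specify the self-adaptive thresholding scheme, so the code below substitutes a classic iterative intermeans threshold as an assumed stand-in; the function names and the sample reconstruction errors are hypothetical. The indices it returns are the samples that would then be passed to the one-class SVM in the second stage.

```python
def intermeans_threshold(errors, tol=1e-6, max_iter=100):
    """Iterative intermeans threshold (assumed stand-in, not the paper's scheme)."""
    t = sum(errors) / len(errors)  # start from the global mean
    for _ in range(max_iter):
        low = [e for e in errors if e <= t]
        high = [e for e in errors if e > t]
        if not low or not high:
            break  # all errors fall on one side; keep the current threshold
        new_t = 0.5 * (sum(low) / len(low) + sum(high) / len(high))
        converged = abs(new_t - t) < tol
        t = new_t
        if converged:
            break
    return t


def estimate_normal_indices(errors):
    """Keep the samples whose reconstruction error is at or below the threshold."""
    t = intermeans_threshold(errors)
    return [i for i, e in enumerate(errors) if e <= t]


# Hypothetical per-event autoencoder reconstruction errors.
errors = [0.10, 0.12, 0.09, 0.11, 0.95, 0.13, 0.88]
normal_idx = estimate_normal_indices(errors)
```

The two events with large reconstruction error are excluded, and the rest form the estimated normal set.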

### BeautyGAN: Instance-level Facial Makeup Transfer with Deep Generative Adversarial Network

•      Tingting Li
• Ruihe Qian
• Chao Dong
• Si Liu
• Qiong Yan
• Wenwu Zhu
• Liang Lin

Facial makeup transfer aims to translate the makeup style from a given reference makeup face image to another non-makeup one while preserving face identity. Such an instance-level transfer problem is more challenging than conventional domain-level transfer tasks, especially when paired data is unavailable. Makeup style also differs from global styles (e.g., paintings) in that it consists of several local styles/cosmetics, including eye shadow, lipstick, foundation, and so on. Extracting and transferring such local and delicate makeup information is infeasible for existing style transfer methods. We address the issue by incorporating both a global domain-level loss and a local instance-level loss in a dual input/output Generative Adversarial Network, called BeautyGAN. Specifically, the domain-level transfer is ensured by discriminators that distinguish generated images from domains' real samples. The instance-level loss is calculated by a pixel-level histogram loss on separate local facial regions. We further introduce a perceptual loss and a cycle consistency loss to generate high-quality faces and preserve identity. The overall objective function enables the network to learn instance-level translation through unsupervised adversarial learning. We also build a new makeup dataset that consists of 3834 high-resolution face images. Extensive experiments show that BeautyGAN can generate visually pleasing makeup faces and accurate transfer results. Data and code are available at http://liusi-group.com/projects/BeautyGAN.

### Trusted Guidance Pyramid Network for Human Parsing

•      Xianghui Luo
• Zhuo Su
• Jiaming Guo
• Gengwei Zhang
• Xiangjian He

Human parsing, which segments a human-centric image into pixel-wise categories, has a wide range of applications. However, none of the existing methods can effectively solve the issue of label parsing fragmentation caused by confusing and complicated annotations. In this paper, we propose a novel Trusted Guidance Pyramid Network (TGPNet) to address this limitation. Based on a pyramid architecture, we design a Pyramid Residual Pooling (PRP) module placed at the end of the bottom-up approach to capture both global and local context. In the top-down approach, we propose a Trusted Guidance Multi-scale Supervision (TGMS) that efficiently integrates and supervises multi-scale contextual information. Furthermore, we present a simple yet powerful Trusted Guidance Framework (TGF) which imposes global-level semantics on parsing results directly without extra ground truth labels in model training. Extensive experiments on two public human parsing benchmarks demonstrate that TGPNet is highly effective at solving the label parsing fragmentation problem and improves over other methods.

### I read, I saw, I tell: Texts Assisted Fine-Grained Visual Classification

•      Jingjing Li
• Lei Zhu
• Zi Huang
• Ke Lu
• Jidong Zhao

In visual classification tasks, it is hard to tell the subtle differences between one species and other, similar ones. This challenging problem is generally known as Fine-Grained Visual Classification (FGVC). In this paper, we propose a novel FGVC approach called Texts Assisted Fine-Grained Visual Classification (TA-FGVC). TA-FGVC reads from texts to gain attention, sees the images with the gained attention and then tells the subtle differences. Technically, we propose a deep neural network which learns a visual-semantic embedding model. The proposed deep architecture mainly consists of two parts: one for visual localization, and the other for visual-to-semantic projection. The model is fed with both visual features, which are extracted from raw images, and semantic information, which is learned from two sources: gleaned from unannotated texts and gathered from image attributes. At the very last layer of the model, each image is embedded into the semantic space which is related to class labels. Finally, the categorization results from both the visual stream and the visual-semantic stream are combined to reach the final decision. Extensive experiments on open standard benchmarks verify the superiority of our model over several state-of-the-art methods.

### Look Deeper See Richer: Depth-aware Image Paragraph Captioning

•      Ziwei Wang
• Yang Li
• Zi Huang
• Hongzhi Yin

With the widespread availability of image captioning at the sentence level, how to automatically generate image paragraphs is yet to be well explored. Describing an image by a full paragraph involves organising sentences in an orderly, coherent and diverse way, inevitably leading to higher complexity than describing it by a single sentence. Existing image paragraph captioning methods give a series of sentences to represent the objects and regions of interest, where the descriptions are essentially generated by feeding the image fragments containing objects and regions into conventional single-sentence image captioning models. With this strategy, it is difficult to generate descriptions that guarantee the stereoscopic hierarchy and non-overlapping objects. In this paper, we propose a Depth-aware Attention Model (DAM) to generate paragraph captions for images. The depths of image areas are first estimated in order to discriminate objects in a range of spatial locations, which can further guide the linguistic decoder to reveal spatial relationships among objects. This model completes the paragraph in a logical and coherent manner. By incorporating the attention mechanism, the learned model swiftly shifts the sentence focus during paragraph generation, whilst avoiding verbose descriptions of the same object. Extensive quantitative experiments and a user study have been conducted on the Visual Genome dataset, which demonstrate the effectiveness and the interpretability of the proposed model.

### Learning Multimodal Taxonomy via Variational Deep Graph Embedding and Clustering

•      Huaiwen Zhang
• Quan Fang
• Shengsheng Qian
• Changsheng Xu

Taxonomy learning is an important problem and facilitates various applications such as semantic understanding and information retrieval. Previous work on building semantic taxonomies has primarily relied on labor-intensive human contributions or focused on text-based extraction. In this paper, we investigate the problem of automatically learning multimodal taxonomies from multimedia data on the Web. We propose a systematic framework called Variational Deep Graph Embedding and Clustering (VDGEC), which consists of two stages: concept graph construction, and taxonomy induction via variational deep graph embedding and clustering. VDGEC discovers hierarchical concept relationships by exploiting the semantic textual-visual correspondences and contextual co-occurrences in an unsupervised manner. The unstructured semantics and noise of multimedia documents are carefully addressed by VDGEC for high-quality taxonomy induction. We conduct extensive experiments on real-world datasets. Experimental results demonstrate the effectiveness of the proposed framework, where VDGEC outperforms previous unsupervised approaches by a large margin.

### Watch, Think and Attend: End-to-End Video Classification via Dynamic Knowledge Evolution Modeling

•      Junyu Gao
• Tianzhu Zhang
• Changsheng Xu

Video classification has been achieved by automatically mining the underlying concepts (e.g., actions, events) in videos, which plays an essential role in intelligent video analysis. However, most existing algorithms only exploit the visual cues of these concepts but ignore external knowledge information for modeling their relationships during the evolution of videos. In fact, humans have a remarkable ability to utilize acquired knowledge to reason about the dynamically changing world. To narrow the knowledge gap between existing methods and humans, we propose an end-to-end video classification framework based on a structured knowledge graph, which can model the dynamic knowledge evolution in videos over time. Here, we map the concepts of videos to the nodes of the knowledge graph. To effectively leverage the knowledge graph, we adopt a graph convLSTM model to not only identify local knowledge structures in each video shot but also model dynamic patterns of knowledge evolution across these shots. Furthermore, a novel knowledge-based attention model is designed by considering the importance of each video shot and relationships between concepts. We show that by using knowledge graphs, our framework is able to improve the performance of various existing methods. Extensive experimental results on two video classification benchmarks, UCF101 and Youtube-8M, demonstrate the favorable performance of the proposed framework.

### Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection

•      Yongcheng Liu
• Lu Sheng
• Jing Shao
• Junjie Yan
• Shiming Xiang
• Chunhong Pan

Multi-label image classification is a fundamental but challenging task towards general visual understanding. Existing methods have found that region-level cues (e.g., features from RoIs) can facilitate multi-label classification. Nevertheless, such methods usually require laborious object-level annotations (i.e., object labels and bounding boxes) for effective learning of the object-level visual features. In this paper, we propose a novel and efficient deep framework to boost multi-label classification by distilling knowledge from a weakly-supervised detection task without bounding box annotations. Specifically, given the image-level annotations, (1) we first develop a weakly-supervised detection (WSD) model, and then (2) construct an end-to-end multi-label image classification framework augmented by a knowledge distillation module that guides the classification model by the WSD model according to the class-level predictions for the whole image and the object-level visual features for object RoIs. The WSD model is the teacher model and the classification model is the student model. After this cross-task knowledge distillation, the performance of the classification model is significantly improved and the efficiency is maintained since the WSD model can be safely discarded in the test phase. Extensive experiments on two large-scale datasets (MS-COCO and NUS-WIDE) show that our framework is superior to the state-of-the-art methods in both performance and efficiency.
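
The class-level part of the teacher-student guidance described above can be sketched as follows. This is a generic multi-label distillation loss written under assumptions, not the paper's exact objective: the mixing weight `alpha`, the use of binary cross-entropy for both terms, and all function names are illustrative, and the object-level feature guidance is omitted.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce(target, prob, eps=1e-12):
    """Binary cross-entropy for one class; target may be a soft value in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Mix the supervised loss with a term that pulls the student towards the teacher.

    alpha is a hypothetical weight: 0 uses image-level labels only, 1 uses the teacher only.
    """
    n = len(labels)
    hard = sum(bce(y, sigmoid(s)) for y, s in zip(labels, student_logits)) / n
    soft = sum(bce(sigmoid(t), sigmoid(s))
               for t, s in zip(teacher_logits, student_logits)) / n
    return (1.0 - alpha) * hard + alpha * soft


# Hypothetical logits for three classes.
labels = [1.0, 0.0, 1.0]
teacher = [4.0, -4.0, 3.0]
loss_agree = distillation_loss([3.5, -3.5, 2.5], teacher, labels)
loss_disagree = distillation_loss([-3.0, 3.0, -2.0], teacher, labels)
```

Because only the student is evaluated at test time, a loss of this kind adds training cost but no inference cost, which matches the efficiency argument in the abstract.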

### Unregularized Auto-Encoder with Generative Adversarial Networks for Image Generation

•      Jiayu Wang
• Wengang Zhou
• Jinhui Tang
• Zhongqian Fu
• Qi Tian
• Houqiang Li

With the development of deep neural networks, recent years have witnessed increasing research interest in generative models. Specifically, Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN) have achieved impressive results in various generative tasks. VAE is well established and theoretically elegant, but tends to generate blurry samples. In contrast, GAN has shown an advantage in the visual quality of generated images, but has difficulty translating a random vector into a desired high-dimensional sample. As a result, the training dynamics in GAN are often unstable and the generated samples can collapse to limited modes. In this paper, we propose a new Auto-Encoder Generative Adversarial Network (AEGAN), which takes advantage of both VAE and GAN. In our approach, instead of matching the encoded distribution of training samples to the prior Pz as in VAE, we map the random vector into the encoded latent space by adversarial training based on GAN. In addition, we match the decoded distribution of training samples with that from random vectors. To evaluate our approach, we compare it with other encoder-decoder based generative models on three public datasets. The experiments, with both qualitative and quantitative results, demonstrate the superiority of our algorithm over the compared generative models.

### When to Learn What: Deep Cognitive Subspace Clustering

•      Yangbangyan Jiang
• Zhiyong Yang
• Qianqian Xu
• Xiaochun Cao
• Qingming Huang

Subspace clustering aims at clustering data points drawn from a union of low-dimensional subspaces. Recently, deep neural networks have been introduced into this problem to improve both representation ability and precision for non-linear data. However, such models are sensitive to noise and outliers, since difficult and easy samples are treated equally. In contrast, in the human cognitive process, individuals tend to follow a learning paradigm from easy to hard and from less to more. In other words, human beings always start from simple concepts, then gradually absorb more complicated ones. Inspired by this learning scheme, in this paper, we propose a robust deep subspace clustering framework based on the principle of the human cognitive process. Specifically, we measure the easiness of samples dynamically so that our proposed method can gradually utilize instances from easy to more complex ones in a robust way. Meanwhile, a solution is designed to update the weights and parameters using an alternating optimization strategy, followed by a theoretical analysis to demonstrate the rationality of the proposed method. Experimental results on three popular benchmark datasets demonstrate the validity of the proposed method.
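
The easy-to-hard idea above can be illustrated with the classic hard-weighting rule from self-paced learning. This is an assumed stand-in: the abstract does not give the paper's easiness measure, and the threshold `lam`, its growth factor and the function names are hypothetical.

```python
def self_paced_weights(losses, lam):
    """Select a sample (weight 1) only if its current loss is below the pace threshold."""
    return [1.0 if loss < lam else 0.0 for loss in losses]


def curriculum(losses, lam=0.2, growth=2.0, rounds=4):
    """Raise the threshold each round so that harder samples are admitted gradually.

    In real training the losses would be recomputed after every model update;
    they are kept fixed here purely for illustration.
    """
    schedule = []
    for _ in range(rounds):
        schedule.append(self_paced_weights(losses, lam))
        lam *= growth
    return schedule


# Hypothetical per-sample losses: two easy, one medium, one hard.
losses = [0.05, 0.30, 0.15, 0.90]
schedule = curriculum(losses)
```

In an alternating scheme, the model parameters would be updated with these weights held fixed, and the weights then refreshed from the new losses.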

### Depth Structure Preserving Scene Image Generation

•      Wendong Zhang
• Feng Gao
• Bingbing Ni
• Lingyu Duan
• Yichao Yan
• Jingwei Xu
• Xiaokang Yang

The key to automatically generating natural scene images is to properly arrange the various spatial elements, especially with respect to depth. To this end, we introduce a novel depth structure preserving scene image generation network (DSP-GAN), which adopts a hierarchical architecture. The main trunk of the proposed architecture is built upon a Hawkes point process that models high-order spatial dependency between different depth layers. Within each layer, generative adversarial sub-networks are trained collaboratively to generate realistic scene components, conditioned on the layer information produced by the point process. We evaluate our model on annotated natural scene images collected from the SUN dataset and demonstrate that our models are capable of generating depth-realistic natural scene images.

### CA3Net: Contextual-Attentional Attribute-Appearance Network for Person Re-Identification

•      Jiawei Liu
• Zheng-Jun Zha
• Hongtao Xie
• Zhiwei Xiong
• Yongdong Zhang

Person re-identification aims to identify the same pedestrian across non-overlapping camera views. Deep learning techniques have recently been applied to person re-identification, towards learning representations of pedestrian appearance. This paper presents a novel Contextual-Attentional Attribute-Appearance Network (CA3Net) for person re-identification. CA3Net simultaneously exploits the complementarity between semantic attributes and visual appearance, the semantic context among attributes, visual attention on attributes as well as spatial dependencies among body parts, leading to a discriminative and robust pedestrian representation. Specifically, an attribute network within CA3Net is designed with an Attention-LSTM module. It concentrates the network on latent image regions related to each attribute and exploits the semantic context among attributes by an LSTM module. An appearance network is developed to learn appearance features from the full body and from horizontal and vertical body parts of pedestrians, with spatial dependencies among body parts. CA3Net jointly learns the attribute and appearance features in a multi-task learning manner, generating a comprehensive representation of pedestrians. Extensive experiments on two challenging benchmarks, i.e., the Market-1501 and DukeMTMC-reID datasets, have demonstrated the effectiveness of the proposed approach.

### RGCNN: Regularized Graph CNN for Point Cloud Segmentation

•      Gusi Te
• Wei Hu
• Amin Zheng
• Zongming Guo

The point cloud, an efficient 3D object representation, has become popular with the development of depth sensing and 3D laser scanning techniques. It has attracted attention in various applications such as 3D tele-presence, navigation for unmanned vehicles and heritage reconstruction. The understanding of point clouds, such as point cloud segmentation, is crucial in exploiting the informative value of point clouds for such applications. Due to the irregularity of the data format, previous deep learning works often convert point clouds to regular 3D voxel grids or collections of images before feeding them into neural networks, which leads to voluminous data and quantization artifacts. In this paper, we instead propose a regularized graph convolutional neural network (RGCNN) that directly consumes point clouds. Leveraging spectral graph theory, we treat features of points in a point cloud as signals on a graph, and define the convolution over the graph by Chebyshev polynomial approximation. In particular, we update the graph Laplacian matrix that describes the connectivity of features in each layer according to the corresponding learned features, which adaptively captures the structure of dynamic graphs. Further, we deploy a graph-signal smoothness prior in the loss function, thus regularizing the learning process. Experimental results on the ShapeNet part dataset show that the proposed approach significantly reduces the computational complexity while achieving competitive performance with the state of the art. Also, experiments show RGCNN is much more robust to both noise and point cloud density in comparison with other methods. We further apply RGCNN to point cloud classification and achieve competitive results on the ModelNet40 dataset.
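
The two graph-signal ingredients named above, Chebyshev-approximated graph convolution and a smoothness prior, can be sketched in a few lines. This is a minimal illustration under assumptions rather than the RGCNN implementation: the filter uses scalar coefficients instead of learnable weight matrices, the Laplacian is fixed rather than updated per layer, and the toy path graph, `lam_max` value and function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rescale(laplacian, lam_max):
    """Map the Laplacian spectrum from [0, lam_max] to [-1, 1]: 2L/lam_max - I."""
    n = len(laplacian)
    return [[2.0 * laplacian[i][j] / lam_max - (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def cheb_conv(l_scaled, x, thetas):
    """y = sum_k theta_k T_k(L) x, with T_k = 2 L T_{k-1} - T_{k-2}."""
    terms = [x]
    if len(thetas) > 1:
        terms.append(matmul(l_scaled, x))
    for _ in range(2, len(thetas)):
        lt = matmul(l_scaled, terms[-1])
        terms.append([[2.0 * lt[i][j] - terms[-2][i][j]
                       for j in range(len(x[0]))] for i in range(len(x))])
    return [[sum(th * t[i][j] for th, t in zip(thetas, terms))
             for j in range(len(x[0]))] for i in range(len(x))]


def smoothness(laplacian, x):
    """Graph-signal smoothness tr(x^T L x); small when neighbours have similar features."""
    lx = matmul(laplacian, x)
    return sum(x[i][j] * lx[i][j] for i in range(len(x)) for j in range(len(x[0])))


# Toy 3-node path graph (combinatorial Laplacian, largest eigenvalue 3).
laplacian = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
signal = [[1.0], [0.0], [1.0]]
l_scaled = rescale(laplacian, lam_max=3.0)
out = cheb_conv(l_scaled, signal, [0.5, 0.3, 0.2])
penalty = smoothness(laplacian, signal)
```

The recurrence avoids an explicit eigendecomposition of the Laplacian, which is what keeps spectral filtering of this kind cheap.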

### Deep Triplet Quantization

•      Bin Liu
• Yue Cao
• Mingsheng Long
• Jianmin Wang
• Jingdong Wang

Deep hashing establishes efficient and effective image retrieval by end-to-end learning of deep representations and hash codes from similarity data. We present a compact coding solution, focusing on the deep learning-to-quantization approach, which has shown superior performance over hashing solutions for similarity retrieval. We propose Deep Triplet Quantization (DTQ), a novel approach to learning deep quantization models from similarity triplets. To enable more effective triplet training, we design a new triplet selection approach, Group Hard, that randomly selects hard triplets in each image group. To generate compact binary codes, we further apply a triplet quantization with weak orthogonality during triplet training. The quantization loss reduces the codebook redundancy and enhances the quantizability of deep representations through back-propagation. Extensive experiments demonstrate that DTQ can generate high-quality and compact binary codes, which yield state-of-the-art image retrieval performance on three benchmark datasets, NUS-WIDE, CIFAR-10, and MS-COCO.
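
The idea of randomly selecting hard triplets within a group can be sketched as follows. The abstract does not define "hard", so this sketch assumes a triplet is hard when it has non-zero margin-based triplet loss; the margin value, the squared Euclidean distance and the function names are illustrative assumptions, and only a single group is handled.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss on squared distances."""
    return max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))


def sample_hard_triplets(triplets, k, margin=1.0, seed=0):
    """Randomly pick up to k triplets that still violate the margin (assumed 'hard')."""
    hard = [t for t in triplets if triplet_loss(*t, margin=margin) > 0.0]
    rng = random.Random(seed)
    return rng.sample(hard, min(k, len(hard)))


# Hypothetical 2-D embeddings: (anchor, positive, negative).
easy = ((0.0, 0.0), (0.0, 0.1), (3.0, 0.0))   # negative is far away: zero loss
hard = ((0.0, 0.0), (1.0, 0.0), (1.0, 0.5))   # negative is almost as close as the positive
selected = sample_hard_triplets([easy, hard], k=5)
```

Triplets that already satisfy the margin contribute no gradient, so filtering them out before sampling concentrates training on informative examples.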

## SESSION: Keynote 3

### Session details: Keynote 3

•      Jiebo Luo

### What has Art Got to do With It?

•      Ernest A. Edmonds

What can multi-media systems design learn from art? How can the research agenda be advanced by looking at art? How can we improve creativity support and the amplification of that important human capability? Interactive art has become a common part of life as a result of the many ways in which the computer and the Internet have facilitated it. Multi-media computing is as important to interactive art as mixing the colors of paint is to painting. This talk reviews recent work that looks at these issues through art research. In interactive digital art, the artist is concerned with how the artwork behaves, how the audience interacts with it, and, ultimately, how participants experience art as well as their degree of engagement. The talk examines these issues and brings together a collection of research results from art practice that illuminates this significant new and expanding area. In particular, this work points towards a much-needed critical language that can be used to describe, compare and frame research into the support of creativity.

## SESSION: Best Paper Session

### Session details: Best Paper Session

•      Rainer Lienhart
• Tao Mei

### GestureGAN for Hand Gesture-to-Gesture Translation in the Wild

•      Hao Tang
• Wei Wang
• Dan Xu
• Yan Yan
• Nicu Sebe

Hand gesture-to-gesture translation in the wild is a challenging task since hand gestures can have arbitrary poses, sizes, locations and self-occlusions. Therefore, this task requires a high-level understanding of the mapping between the input source gesture and the output target gesture. To tackle this problem, we propose a novel hand Gesture Generative Adversarial Network (GestureGAN). GestureGAN consists of a single generator G and a discriminator D, which takes as input a conditional hand image and a target hand skeleton image. GestureGAN utilizes the hand skeleton information explicitly, and learns the gesture-to-gesture mapping through two novel losses, the color loss and the cycle-consistency loss. The proposed color loss handles the issue of "channel pollution" while back-propagating the gradients. In addition, we present the Fréchet ResNet Distance (FRD) to evaluate the quality of generated images. Extensive experiments on two widely used benchmark datasets demonstrate that the proposed GestureGAN achieves state-of-the-art performance on the unconstrained hand gesture-to-gesture translation task. Meanwhile, the generated images are of high quality and are photo-realistic, allowing them to be used as data augmentation to improve the performance of a hand gesture classifier. Our model and code are available at https://github.com/Ha0Tang/GestureGAN.

### Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training

•      Bei Liu
• Jianlong Fu
• Makoto P. Kato
• Masatoshi Yoshikawa

Automatic generation of natural language from images has attracted extensive attention. In this paper, we take one step further to investigate the generation of poetic language (with multiple lines) for an image for automatic poetry creation. This task involves multiple challenges, including discovering poetic clues from the image (e.g., hope from green), and generating poems that satisfy both relevance to the image and poeticness at the language level. To solve the above challenges, we formulate the task of poem generation as two correlated sub-tasks solved by multi-adversarial training via policy gradient, through which the cross-modal relevance and poetic language style can be ensured. To extract poetic clues from images, we propose to learn a deep coupled visual-poetic embedding, in which the poetic representation from objects, sentiments (we consider both adjectives and verbs that can express emotions and feelings as sentiment words in this research) and scenes in an image can be jointly learned. Two discriminative networks are further introduced to guide the poem generation, including a multi-modal discriminator and a poem-style discriminator. To facilitate the research, we have released two poem datasets by human annotators with two distinct properties: 1) the first human-annotated image-to-poem pair dataset (with 8,292 pairs in total), and 2) the largest public English poem corpus dataset to date (with 92,265 different poems in total). Extensive experiments are conducted with 8K images, among which 1.5K images are randomly picked for evaluation. Both objective and subjective evaluations show superior performance against the state-of-the-art methods for poem generation from images. A Turing test carried out with over 500 human subjects, among which 30 evaluators are poetry experts, demonstrates the effectiveness of our approach.

### Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing

•      Jian Zhao
• Jianshu Li
• Yu Cheng
• Terence Sim
• Shuicheng Yan
• Jiashi Feng

Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes, as needed for group behavior analysis, person re-identification and autonomous driving. To this end, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which has recently been defined as the multi-human parsing task. In this paper, we present a new large-scale database, "Multi-Human Parsing (MHP)", for algorithm development and evaluation, and advance the state of the art in understanding humans in crowded scenes. MHP contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels, involving 2-26 persons per image and captured in real-world scenes from various viewpoints, poses, occlusion, interactions and background. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network (GAN)-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, and serves as a strong baseline to drive future research on multi-human parsing.

### Knowledge-aware Multimodal Dialogue Systems

•      Lizi Liao
• Yunshan Ma
• Xiangnan He
• Richang Hong
• Tat-Seng Chua

By offering a natural way for information seeking, multimodal dialogue systems are attracting increasing attention in several domains such as retail and travel. However, most existing dialogue systems are limited to the textual modality, and cannot be easily extended to capture the rich semantics in the visual modality, such as product images. For example, in the fashion domain, the visual appearance of clothes and matching styles play a crucial role in understanding the user's intention. Without considering these, the dialogue agent may fail to generate desirable responses for users. In this paper, we present a Knowledge-aware Multimodal Dialogue (KMD) model to address the limitation of text-based dialogue systems. It gives special consideration to the semantics and domain knowledge revealed in visual content, and features three key components. First, we build a taxonomy-based learning module to capture the fine-grained semantics in images (e.g., the category and attributes of a product). Second, we propose an end-to-end neural conversational model to generate responses based on the conversation history, visual semantics, and domain knowledge. Lastly, to avoid inconsistent dialogues, we adopt a deep reinforcement learning method which accounts for future rewards to optimize the neural conversational model. We perform extensive evaluation on a multi-turn task-oriented dialogue dataset in the fashion domain. Experimental results show that our method significantly outperforms state-of-the-art methods, demonstrating the efficacy of modeling the visual modality and domain knowledge for dialogue systems.

## SESSION: Doctoral Symposium

### Session details: Doctoral Symposium

•      Meng Wang

### End2End Semantic Segmentation for 3D Indoor Scenes

•      Na Zhao

This research is concerned with semantic segmentation of 3D point clouds arising from videos of 3D indoor scenes. It is an important building block of 3D scene understanding and has promising applications such as augmented reality and robotics. Although various deep learning based approaches have been proposed to replicate the success of 2D semantic segmentation in the 3D domain, they either result in severe information loss or fail to model the geometric structures well. In this paper, we aim to model the local and global geometric structures of 3D scenes by designing an end-to-end 3D semantic segmentation framework. It captures the local geometries through point-level feature learning and voxel-level aggregation, models the global structures via a 3D CNN, and enforces label consistency with a high-order CRF. Through preliminary experiments conducted on two indoor datasets, we describe our insights on the proposed approach, and present some directions to be pursued in the future.

### On Reducing Effort in Evaluating Laparoscopic Skills

•      Sabrina Kletz

Training and evaluation of laparoscopic skills have become an important aspect of young surgeons' education. The evaluation process is currently performed manually by experienced surgeons, who review video recordings of laparoscopic procedures to detect technical errors using conventional video players and specific pen-and-paper rating schemes. The problem is that the manual review process is time-consuming and exhausting, but nevertheless necessary to support young surgeons in their educational training. Motivated by the need to reduce the effort in evaluating laparoscopic skills, this PhD project aims at investigating state-of-the-art content analysis approaches for finding error-prone video sections in surgery videos. In this proposal, the focus specifically lies on performance assessment in gynecologic laparoscopy using the Generic Error Rating Tool (GERT).

### Decode Human Life from Social Media

•      Tianran Hu

In this big data era, people leave clues of their life consciously or unconsciously on many social media platforms in various forms. By mining data from social media, researchers can uncover the patterns of human life at both individual and group levels. Social media is one of the major data sources for such studies for two main reasons: 1) the huge volume and open access of data on these platforms, and 2) the diversity of data on different platforms, such as multimedia data on Twitter and Facebook, geolocation data on Foursquare and Yelp, as well as career data on LinkedIn. In this paper, we introduce our work on studying human life based on social media data, and report the plan for our subsequent studies. Our work is intended to decode human life from two perspectives. From a linguistic perspective, we study the language patterns of different social groups of people. The learned language patterns can reveal the specific characteristics of these groups, and provide novel angles for understanding people. From a mobility perspective, we extract the mobility patterns of individual persons and of groups of people such as residents of certain regions. Using the detected mobility patterns, we mine knowledge of human life including the lifestyles and shopping patterns of cities and regions. We intend to combine these two perspectives in our ongoing work, and introduce a novel framework for studying human life.

## SESSION: FF-4

### Session details: FF-4

•      Wen-Huang Cheng

### Learning Semantic Structure-preserved Embeddings for Cross-modal Retrieval

• Yiling Wu
• Shuhui Wang
• Qingming Huang

This paper learns semantic embeddings for multi-label cross-modal retrieval. Our method exploits the structure in semantics represented by label vectors to guide the learning of embeddings. First, we construct a semantic graph based on label vectors which incorporates data from both modalities, and enforce the embeddings to preserve the local structure of this semantic graph. Second, we enforce the embeddings to well reconstruct the labels, i.e., the global semantic structure. In addition, we encourage the embeddings to preserve local geometric structure of each modality. Accordingly, the local and global semantic structure consistencies as well as the local geometric structure consistency are enforced, simultaneously. The mappings between inputs and embeddings are designed to be nonlinear neural network with larger capacity and more flexibility. The overall objective function is optimized by stochastic gradient descent to gain the scalability on large datasets. Experiments conducted on three real world datasets clearly demonstrate the superiority of our proposed approach over the state-of-the-art methods.

### Post Tuned Hashing: A New Approach to Indexing High-dimensional Data

•      Zhendong Mao
• Quan Wang
• Yongdong Zhang
• Bin Wang

Learning to hash has proven to be an effective solution for indexing high-dimensional data by projecting them to similarity-preserving binary codes. However, most existing methods end the learning scheme with a binarization stage, i.e., binary quantization, which inevitably destroys the neighborhood structure of the original data. As a result, those methods still suffer from great similarity loss and result in unsatisfactory indexing performance. In this paper we propose a novel hashing model, namely Post Tuned Hashing (PTH), which includes a new post-tuning stage to refine the binary codes after binarization. The post-tuning seeks to rebuild the destroyed neighborhood structure, and hence significantly improves the indexing performance. We cast the post-tuning into a binary quadratic optimization framework and, despite its NP-hardness, give a practical algorithm to efficiently obtain a high-quality solution. Experimental results on five noted image benchmarks show that our PTH improves previous state-of-the-art methods by 13-58% in mean average precision.

### Cross-modal Moment Localization in Videos

•      Meng Liu
• Xiang Wang
• Liqiang Nie
• Qi Tian
• Baoquan Chen
• Tat-Seng Chua

In this paper, we address the temporal moment localization issue, namely, localizing a video moment described by a natural language query in an untrimmed video. This is a general yet challenging vision-language task since it requires not only the localization of moments, but also the multimodal comprehension of textual-temporal information (e.g., "first" and "leaving") that helps to distinguish the desired moment from the others, especially those with similar visual content. While existing studies treat a given language query as a single unit, we propose to decompose it into two components: the relevant cue related to the desired moment localization and the irrelevant one meaningless to the localization. This allows us to flexibly adapt to arbitrary queries in an end-to-end framework. In our proposed model, a language-temporal attention network is utilized to learn the word attention based on the temporal context information in the video. Therefore, our model can automatically select "what words to listen to" for localizing the desired moment. We evaluate the proposed model on two public benchmark datasets: DiDeMo and Charades-STA. The experimental results verify its superiority over several state-of-the-art methods.

### Multi-Scale Correlation for Sequential Cross-modal Hashing Learning

•      Zhaoda Ye
• Yuxin Peng

Cross-modal hashing aims to learn hash functions, which map heterogeneous multimedia data into a common Hamming space for fast and flexible cross-modal retrieval. Recently, several cross-modal hashing methods learn the hash functions by mining the correlation among multimedia data. However, they ignore two properties of cross-modal data: 1) The features of different scales in a single modality contain different information, such as texture, object and scene features in the image, which can provide multi-scale information for the retrieval task. 2) The correlation among the features of different modalities and scales can provide multi-scale relationships for better cross-modal hashing learning. In this paper, we propose the Multi-scale Correlation Sequential Cross-modal Hashing Learning (MCSCH) approach. The main contributions of MCSCH can be summarized as follows: 1) We propose a multi-scale feature guided sequential hashing learning method which sequentially generates the hash code guided by different scale features through an RNN-based network. The multi-scale feature guided sequential hashing learning method utilizes the scale information, which enhances the diversity of the hash codes and reduces the error caused by extreme situations in specific features. 2) We propose a multi-scale correlation mining strategy during the multi-scale feature guided sequential hashing learning, which can simultaneously mine the correlation among the features of different modalities and scales. Through this strategy, we can mine any pair of scale features in different modalities and obtain abundant scale correlation for better cross-modal retrieval. Experiments on two widely-used datasets demonstrate the effectiveness of our proposed MCSCH approach.

### Generative Adversarial Product Quantisation

•      Litao Yu
• Yongsheng Gao
• Jun Zhou

Product Quantisation (PQ) has been recognised as an effective encoding technique for scalable multimedia content analysis. In this paper, we propose a novel learning framework that enables an end-to-end encoding strategy from raw images to compact PQ codes. The system aims to learn both PQ encoding functions and codewords for content-based image retrieval. In detail, we first design a trainable encoding layer that is pluggable into neural networks, so the codewords can be trained in back-forward propagation. Then we integrate it into a Deep Convolutional Generative Adversarial Network (DC-GAN). In our proposed encoding framework, the raw images are directly encoded by passing through the convolutional and encoding layers, and the generator aims to use the codewords as constrained inputs to generate full image representations that are visually similar to the original images. By taking the advantages of the generative adversarial model, our proposed system can produce high-quality PQ codewords and encoding functions for scalable multimedia retrieval tasks. Experiments show that the proposed architecture GA-PQ outperforms the state-of-the-art encoding techniques on three public image datasets.

### Aesthetic-Driven Image Enhancement by Adversarial Learning

•      Yubin Deng
• Chen Change Loy
• Xiaoou Tang

We introduce EnhanceGAN, an adversarial learning based model that performs automatic image enhancement. Traditional image enhancement frameworks typically involve training models in a fully-supervised manner, which require expensive annotations in the form of aligned image pairs. In contrast to these approaches, our proposed EnhanceGAN only requires weak supervision (binary labels on image aesthetic quality) and is able to learn enhancement operators for the task of aesthetic-based image enhancement. In particular, we show the effectiveness of a piecewise color enhancement module trained with weak supervision, and extend the proposed EnhanceGAN framework to learning a deep filtering-based aesthetic enhancer. The full differentiability of our image enhancement operators enables the training of EnhanceGAN in an end-to-end manner. We further demonstrate the capability of EnhanceGAN in learning aesthetic-based image cropping without any groundtruth cropping pairs. Our weakly-supervised EnhanceGAN reports competitive quantitative results on aesthetic-based color enhancement as well as automatic image cropping, and a user study confirms that our image enhancement results are on par with or even preferred over professional enhancement.

### Attention-based Multi-Patch Aggregation for Image Aesthetic Assessment

•      Kekai Sheng
• Weiming Dong
• Chongyang Ma
• Xing Mei
• Feiyue Huang
• Bao-Gang Hu

Aggregation structures with explicit information, such as image attributes and scene semantics, are effective and popular for intelligent systems for assessing aesthetics of visual data. However, useful information may not be available due to the high cost of manual annotation and expert design. In this paper, we present a novel multi-patch (MP) aggregation method for image aesthetic assessment. Different from state-of-the-art methods, which augment an MP aggregation network with various visual attributes, we train the model in an end-to-end manner with aesthetic labels only (i.e., aesthetically positive or negative). We achieve the goal by resorting to an attention-based mechanism that adaptively adjusts the weight of each patch during the training process to improve learning efficiency. In addition, we propose a set of objectives with three typical attention mechanisms (i.e., average, minimum, and adaptive) and evaluate their effectiveness on the Aesthetic Visual Analysis (AVA) benchmark. Numerical results show that our approach outperforms existing methods by a large margin. We further verify the effectiveness of the proposed attention-based objectives via ablation studies and shed light on the design of aesthetic assessment systems.

### An End-to-End Quadrilateral Regression Network for Comic Panel Extraction

•      Zheqi He
• Yafeng Zhou
• Yongtao Wang
• Siwei Wang
• Xiaoqing Lu
• Zhi Tang
• Ling Cai

Comic panel extraction, i.e., decomposing a comic page image into panels, has become a fundamental technique for meeting many practical needs of mobile comic reading such as comic content adaptation and comic animating. Most existing approaches are based on handcrafted low-level visual patterns and heuristic rules, and thus have limited ability to deal with irregular comic panels. Only one existing method is based on deep learning and achieves better experimental results, but its architecture is redundant and its time efficiency is not good. To address these problems, we propose an end-to-end, two-stage quadrilateral regression network architecture for comic panel detection, which inherits the architecture of Faster R-CNN. At the first stage, we propose a quadrilateral region proposal network for generating panel proposals, based on a newly proposed quadrilateral regression method. At the second stage, we classify the proposals and refine their shapes with the proposed quadrilateral regression method again. Extensive experimental results demonstrate that the proposed method significantly outperforms the existing comic panel detection methods on multiple datasets by F1-score and page accuracy.

### Monocular Camera Based Real-Time Dense Mapping Using Generative Adversarial Network

•      Xin Yang
• Jinyu Chen
• Zhiwei Wang
• Qiaozhe Zhang
• Wenyu Liu
• Chunyuan Liao
• Kwang-Ting Cheng

Monocular simultaneous localization and mapping (SLAM) is a key enabling technique for many computer vision and robotics applications. However, existing methods either can obtain only sparse or semi-dense maps in highly-textured image areas or fail to achieve a satisfactory reconstruction accuracy. In this paper, we present a new method based on a generative adversarial network, named DM-GAN, for real-time dense mapping based on a monocular camera. Specifically, our depth generator network takes a semi-dense map obtained from motion stereo matching as guidance to supervise dense depth prediction of a single RGB image. The depth generator is trained based on a combination of two loss functions, i.e., an adversarial loss for enforcing the generated depth maps to reside on the manifold of the true depth maps and a pixel-wise mean square error (MSE) for ensuring the correct absolute depth values. Extensive experiments on three public datasets demonstrate that our DM-GAN significantly outperforms the state-of-the-art methods in terms of greater reconstruction accuracy and higher depth completeness.

### JPEG Decompression in the Homomorphic Encryption Domain

•      Xiaojing Ma
• Changming Liu
• Sixing Cao
• Bin B. Zhu

Privacy-preserving processing is desirable for cloud computing to relieve users' concern of loss of control of their uploaded data. This may be fulfilled with homomorphic encryption. With widely used JPEG, it is desirable to enable JPEG decompression in the homomorphic encryption domain. This is a great challenge since JPEG decoding needs to determine a matched codeword, which then extracts a codeword-dependent number of coefficients. With no access to the information of encrypted content, a decoder does not know which codeword is matched, and thus cannot tell how many coefficients to extract, not to mention to compute their values. In this paper, we propose a novel scheme that enables JPEG decompression in the homomorphic encryption domain. The scheme applies a statically controlled iterative procedure to decode one coefficient per iteration. In one iteration, each codeword is compared with the bitstream to compute an encrypted Boolean that represents if the codeword is a match or not. Each codeword would produce an output coefficient and generate a new bitstream by dropping consumed bits as if it were a match. If a codeword is associated with more than one coefficient, the codeword is replaced with the codeword representing the remaining undecoded coefficients for the next decoding iteration. The summation of each codeword's output multiplied by its matching Boolean is the output of the current iteration. This is equivalent to selecting the output of a matched codeword. A side benefit of our statically controlled decoding procedure is that paralleled Single-Instruction Multiple-Data (SIMD) is fully supported, wherein multiple plaintexts are encrypted into a single plaintext, and decoding a ciphertext block corresponds to decoding all corresponding plaintext blocks. SIMD also reduces the total size of ciphertexts of an image. Experimental results are reported to show the performance of our proposed scheme.
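
The selection-by-summation step described above can be illustrated in the clear. The following is a minimal sketch, not the authors' scheme: plain Python integers stand in for ciphertexts, the three-entry `CODEBOOK` is a hypothetical prefix code rather than a real JPEG Huffman table, and the names `match_flag`, `shift_left` and `decode_step` are invented for illustration. Every codeword is evaluated on every iteration, and the matching one is picked purely by multiplying with a 0/1 flag and summing, so the control flow never depends on the data.

```python
def match_flag(bits, codeword):
    # Arithmetic equality test: each factor 1 - (a - b)^2 is 1 when the bits
    # agree and 0 otherwise, so the product is 1 only for a full match.
    flag = 1
    for a, b in zip(bits, codeword):
        flag *= 1 - (a - b) ** 2
    return flag


def shift_left(bits, n):
    # Drop the n consumed bits and pad with zeros to keep a fixed length.
    return bits[n:] + [0] * n


def decode_step(bits, codebook):
    # Evaluate every codeword as if it matched, then select by summation.
    coeff = 0
    next_bits = [0] * len(bits)
    for codeword, value in codebook:
        m = match_flag(bits, codeword)
        coeff += m * value
        shifted = shift_left(bits, len(codeword))
        next_bits = [acc + m * s for acc, s in zip(next_bits, shifted)]
    return coeff, next_bits


# Hypothetical prefix code: (codeword bits, decoded coefficient).
CODEBOOK = [([0], 5), ([1, 0], -3), ([1, 1], 7)]

bits = [1, 0, 0, 1, 1, 0, 0, 0]
coeffs = []
for _ in range(3):  # fixed iteration count, independent of the data
    c, bits = decode_step(bits, CODEBOOK)
    coeffs.append(c)
```

Because the codebook is a prefix code, exactly one flag is 1 in each iteration, so the sums reduce to the matched codeword's coefficient and its shifted bitstream.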

### MiniView Layout for Bandwidth-Efficient 360-Degree Video

•      Mengbai Xiao
• Shuoqian Wang
• Chao Zhou
• Li Liu
• Zhenhua Li
• Yao Liu
• Songqing Chen

With the recent increase in popularity of VR devices, 360-degree video has become increasingly popular. As more users experience this new medium, it will likely see further increases in popularity as users experience its greater immersiveness compared to traditional video streams. 360-degree video streams must encode the omnidirectional view, and, with current encoding techniques, these views require significantly higher bandwidth than traditional video streams. These larger bandwidth requirements comprise the main barrier toward wider adoption by video streaming services.

To reduce bandwidth requirements of 360-degree streaming, we propose the MiniView Layout. Compared to the standard cube layout, with equal pixel densities, 360-degree videos encoded in the MiniView Layout can save 16% of the encoded video size while delivering similar visual qualities. In conjunction with the MiniView Layout, we make the following contributions toward improving the 360-degree video ecosystem: i) We create a "projection efficiency" metric that quantifies the efficiencies of sphere-to-2D projections. ii) We introduce the ffmpeg360 tool. ffmpeg360 transcodes 360-degree videos and measures comparative 360-degree video quality given user head movement traces. The tool performs these tasks efficiently, using OpenGL for GPU acceleration.

### Real-time 3D Face-Eye Performance Capture of a Person Wearing VR Headset

•      Guoxian Song
• Jianfei Cai
• Tat-Jen Cham
• Jianmin Zheng
• Juyong Zhang
• Henry Fuchs

Teleconference or telepresence based on virtual reality (VR) head-mount display (HMD) device is a very interesting and promising application since HMD can provide immersive feelings for users. However, in order to facilitate face-to-face communications for HMD users, real-time 3D facial performance capture of a person wearing HMD is needed, which is a very challenging task due to the large occlusion caused by HMD. The existing limited solutions are very complex either in setting or in approach as well as lacking the performance capture of 3D eye gaze movement. In this paper, we propose a convolutional neural network (CNN) based solution for real-time 3D face-eye performance capture of HMD users without complex modification to devices. To address the issue of lacking training data, we generate massive pairs of HMD face-label dataset by data synthesis as well as collecting VR-IR eye dataset from multiple subjects. Then, we train a dense-fitting network for facial region and an eye gaze network to regress 3D eye model parameters. Extensive experimental results demonstrate that our system can efficiently and effectively produce in real time a vivid personalized 3D avatar with the correct identity, pose, expression and eye motion corresponding to the HMD user.

### Bridge the Gap Between VQA and Human Behavior on Omnidirectional Video: A Large-Scale Dataset and a Deep Learning Model

•      Chen Li
• Mai Xu
• Xinzhe Du
• Zulin Wang

Omnidirectional video enables spherical stimuli with the $360^\circ \times 180^\circ$ viewing range. Meanwhile, only the viewport region of omnidirectional video can be seen by the observer through head movement (HM), and an even smaller region within the viewport can be clearly perceived through eye movement (EM). Thus, the subjective quality of omnidirectional video may be correlated with HM and EM of human behavior. To fill in the gap between subjective quality and human behavior, this paper proposes a large-scale visual quality assessment (VQA) dataset of omnidirectional video, called VQA-OV, which collects 60 reference sequences and 540 impaired sequences. Our VQA-OV dataset provides not only the subjective quality scores of sequences but also the HM and EM data of subjects. By mining our dataset, we find that the subjective quality of omnidirectional video is indeed related to HM and EM. Hence, we develop a deep learning model, which embeds HM and EM, for objective VQA on omnidirectional video. Experimental results show that our model significantly improves the state-of-the-art performance of VQA on omnidirectional video.

### Tracking-assisted Weakly Supervised Online Visual Object Segmentation in Unconstrained Videos

•      Zongpu Zhang
• Yang Hua
• Tao Song
• Zhengui Xue
• Ruhui Ma
• Neil Robertson
• Haibing Guan

This paper tackles the task of online video object segmentation with weak supervision, i.e., labeling the target object and background with pixel-level accuracy in unconstrained videos, given only a single bounding box in the first frame. We present a novel tracking-assisted visual object segmentation framework to achieve this. On the one hand, initialized with a given bounding box in the first frame, the auxiliary object tracking module guides the segmentation module frame by frame by providing motion and region information, which is usually missing in semi-supervised methods. Moreover, compared with the unsupervised approach, our approach with such minimum supervision can focus on the target object without bringing unrelated objects into the final results. On the other hand, the video object segmentation module also improves the robustness of the visual object tracking module by pixel-level localization and objectness information. Thus, segmentation and tracking in our framework can mutually help each other in an online manner. To verify the generality and effectiveness of the proposed framework, we evaluate our weakly supervised method on two cross-domain datasets, i.e., the DAVIS and VOT2016 datasets, with the same configuration and parameter setting. Experimental results show the top performance of our method, which is even better than the leading semi-supervised methods. Furthermore, we conduct an extensive ablation study on our approach to investigate the influence of each component and the main parameters.

### ThoughtViz: Visualizing Human Thoughts Using Generative Adversarial Network

•      Praveen Tirupattur
• Yogesh Singh Rawat
• Concetto Spampinato
• Mubarak Shah

Studying human brain signals has always gathered great attention from the scientific community. In Brain Computer Interface (BCI) research, for example, changes of brain signals in relation to specific tasks (e.g., thinking of something) are detected and used to control machines. While extracting spatio-temporal cues from brain signals for classifying the state of the human mind is an explored path, decoding and visualizing brain states is new and futuristic. Following this latter direction, in this paper, we propose an approach that is able not only to read the mind, but also to decode and visualize human thoughts. More specifically, we analyze brain activity, recorded by an ElectroEncephaloGram (EEG), of a subject while thinking about a digit, character or an object and visually synthesize the thought item. To accomplish this, we leverage the recent progress of adversarial learning by devising a conditional Generative Adversarial Network (GAN), which takes, as input, encoded EEG signals and generates corresponding images. In addition, since collecting large amounts of EEG signals is not trivial, our GAN model allows for learning distributions with limited training data. Performance analysis carried out on three different datasets -- brain signals of multiple subjects thinking of digits, characters, and objects -- shows that our approach is able to effectively generate images from the thoughts of a person. It also demonstrates that EEG signals explicitly encode cues from thoughts which can be effectively used for generating semantically relevant visualizations.

### A Feature-Adaptive Semi-Supervised Framework for Co-saliency Detection

•      Xiaoju Zheng
• Zheng-Jun Zha
• Liansheng Zhuang

Co-saliency detection, which refers to the discovery of common salient foreground regions in a group of relevant images, has attracted increasing attention due to its widespread applications in many vision tasks. Existing methods assemble features from multiple views toward a comprehensive representation, but overlook the efficacy disparity among various features in detecting co-saliency. This paper proposes a novel feature-adaptive semi-supervised (FASS) framework for co-saliency detection, which seamlessly integrates multi-view feature learning, graph structure optimization and co-saliency prediction in a unified solution. In particular, the FASS exploits the efficacy disparity of multi-view features at both view and element levels by a joint formulation of view-wise feature weighting and element-wise feature selection, leading to an effective representation robust to feature noise and redundancy as well as adaptive to the task at hand. It predicts the co-saliency map by optimizing co-saliency label propagation over a graph of both labeled and unlabeled image regions. The graph structure is optimized jointly with feature learning and co-saliency prediction to precisely characterize the underlying correlation among regions. The FASS is thus able to produce a satisfactory co-saliency map based on the effective exploration of multi-view features as well as inter-region correlation. Extensive experiments on three benchmark datasets, i.e., iCoseg, Cosal2015 and MSRC, have demonstrated that the proposed FASS outperforms the state-of-the-art methods.

### iSPA-Net: Iterative Semantic Pose Alignment Network

•      Jogendra Nath Kundu
• Rahul M. V.
• Venkatesh Babu R.

Understanding and extracting 3D information of objects from monocular 2D images is a fundamental problem in computer vision. In the task of 3D object pose estimation, recent data-driven deep neural network based approaches suffer from scarcity of real images with 3D keypoint and pose annotations. Drawing inspiration from human cognition, where the annotators use a 3D CAD model as structural reference to acquire ground-truth viewpoints for real images, we propose an iterative Semantic Pose Alignment Network, called iSPA-Net. Our approach focuses on exploiting semantic 3D structural regularity to solve the task of fine-grained pose estimation by predicting the viewpoint difference between a given pair of images. Such an image comparison based approach also alleviates the problem of data scarcity and hence enhances scalability of the proposed approach for novel object categories with minimal annotation. The fine-grained object pose estimator is also aided by correspondence of learned spatial descriptors of the input image pair. The proposed pose alignment framework is able to refine its initial pose estimation in consecutive iterations by utilizing an online rendering setup along with a non-uniform bin classification of pose difference. This enables iSPA-Net to achieve state-of-the-art performance on various real image viewpoint estimation datasets. Further, we demonstrate the effectiveness of the approach for multiple applications. First, we show results for active object viewpoint localization to capture images from a similar pose considering only a single image as pose reference. Second, we demonstrate the ability of the learned semantic correspondence to perform unsupervised part-segmentation transfer using only a single part-annotated 3D template model per object class. To encourage reproducible research, we have released the code for our proposed algorithm.

### Extractive Video Summarizer with Memory Augmented Neural Networks

•      Litong Feng
• Ziyin Li
• Zhanghui Kuang
• Wei Zhang

Online videos have been growing explosively in recent years. How to help human users efficiently browse videos becomes more and more important. Video summarization can automatically shorten a video by extracting key-shots from the raw video, which is helpful for digesting video data. State-of-the-art supervised video summarization algorithms directly learn from manually-created summaries to mimic the key-frame/key-shot selection criterion of humans. Humans usually create a summary after viewing and understanding the whole video, and the global attention mechanism capturing information from all video frames plays a key role in the summarization process. However, previous supervised approaches ignored the temporal relations or simply modeled local inter-dependency across frames. Motivated by this observation, we propose a memory augmented extractive video summarizer, which utilizes an external memory to record visual information of the whole video with high capacity. With the external memory, the video summarizer simply predicts the importance score of a video shot based on the global understanding of the video frames. The proposed method outperforms previous state-of-the-art algorithms on the public SumMe and TVSum datasets. More importantly, we demonstrate that the global attention modeling has two advantages: good transferring ability across datasets and high robustness to noisy videos.

### Fully Point-wise Convolutional Neural Network for Modeling Statistical Regularities in Natural Images

•      Jing Zhang
• Yang Cao
• Yang Wang
• Chenglin Wen
• Chang Wen Chen

Modeling statistical regularity plays an essential role in ill-posed image processing problems. Recently, deep learning based methods have been presented to implicitly learn statistical representation of pixel distributions in natural images and leverage it as a constraint to facilitate subsequent tasks, such as color constancy and image dehazing. However, the existing CNN architecture is prone to variability and diversity of pixel intensity within and between local regions, which may result in inaccurate statistical representation. To address this problem, this paper presents a novel fully point-wise CNN architecture for modeling statistical regularities in natural images. Specifically, we propose to randomly shuffle the pixels in the original images and leverage the shuffled image as input to make the CNN more concerned with the statistical properties. Moreover, since the pixels in the shuffled image are independent and identically distributed, we can replace all the large convolution kernels in the CNN with point-wise (1×1) convolution kernels while maintaining the representation ability. Experimental results on two applications, color constancy and image dehazing, demonstrate the superiority of our proposed network over the existing architectures, i.e., using 1/10~1/100 of the network parameters and computational cost while achieving comparable performance.
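
One property that makes point-wise kernels a natural fit for shuffled input can be checked directly: a 1×1 convolution processes each pixel independently of its neighbours, so permuting the pixels merely permutes its outputs. The sketch below is an illustration under assumed toy values (a flattened 4×4 RGB image, hand-picked `weights` and `bias`, and a hypothetical helper `pointwise_conv`); it is not the architecture from the paper.

```python
import random


def pointwise_conv(pixels, weights, bias):
    # A 1x1 convolution: the same linear map is applied to each pixel's
    # channel vector, with no dependence on neighbouring pixels.
    return [
        [sum(w * x for w, x in zip(row, px)) + b for row, b in zip(weights, bias)]
        for px in pixels
    ]


rng = random.Random(0)

# A flattened 4x4 image with 3 channels per pixel (toy data).
pixels = [[rng.random() for _ in range(3)] for _ in range(16)]

# Assumed toy parameters mapping 3 input channels to 2 output channels.
weights = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]]
bias = [0.05, -0.1]

# Random pixel shuffle.
perm = list(range(len(pixels)))
rng.shuffle(perm)
shuffled = [pixels[i] for i in perm]

# Convolving then shuffling equals shuffling then convolving.
conv_then_shuffle = [pointwise_conv(pixels, weights, bias)[i] for i in perm]
shuffle_then_conv = pointwise_conv(shuffled, weights, bias)
```

A kernel larger than 1×1 would not have this property, since its output at a pixel depends on which pixels happen to be adjacent.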

### Online Action Tube Detection via Resolving the Spatio-temporal Context Pattern

•      Jingjia Huang
• Nannan Li
• Jiaxing Zhong
• Thomas H. Li
• Ge Li

At present, spatio-temporal action detection in video is still a challenging problem, considering the complexity of the background, the variety of the action or the change of the viewpoint in the unconstrained environment. Most current approaches solve the problem via two-step processing: first detecting actions at each frame, then linking them, which neglects the continuity of the action and operates in an offline and batch processing manner. In this paper, we attempt to build an online action detection model that introduces the spatio-temporal coherence existing among action regions when performing action category inference and position localization. Specifically, we seek to represent the spatio-temporal context pattern by establishing an encoder-decoder model based on the convolutional recurrent network. The model accepts a video snippet as input and encodes the dynamic information of the action in the forward pass. During the backward pass, it resolves such information at each time instant for action detection by fusing the current static or motion cue. Additionally, we propose an incremental action tube generation algorithm, which accomplishes action bounding-box association, action label determination and temporal trimming in a single pass. Our model takes in the appearance, motion or fused signals as input and is tested on two prevailing datasets, UCF-Sports and UCF-101. The experiment results demonstrate the effectiveness of our method, which achieves performance superior or comparable to existing approaches.

### Enhancing Visual Question Answering Using Dropout

•      Zhiwei Fang
• Jing Liu
• Yanyuan Qiao
• Qu Tang
• Yong Li
• Hanqing Lu

Using dropout in Visual Question Answering (VQA) is a common practice to prevent overfitting. However, in multi-path networks, the current way of using dropout may cause two problems: the co-adaptation of neurons and the explosion of output variance. In this paper, we propose the coherent dropout and the siamese dropout to solve the two problems, respectively. Specifically, in coherent dropout, all relevant dropout layers in multiple paths are forced to work coherently to maximize the ability to prevent neuron co-adaptation. We show that the coherent dropout is simple to implement but very effective in overcoming overfitting. As for the explosion of output variance, we develop a siamese dropout mechanism to explicitly minimize the difference between the two output vectors produced from the same input data during the training phase. Such a mechanism can reduce the gap between the training and inference phases and make the VQA model more robust. Extensive experiments are conducted to verify the effectiveness of coherent dropout and siamese dropout. The results also show that our methods can bring additional improvements to state-of-the-art VQA models.
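The siamese idea of penalizing the gap between two stochastic passes over the same input can be sketched as follows. This is only an illustration under stated assumptions: the single linear layer, the mean squared difference as the penalty, and the names `dropout`, `linear`, and `siamese_consistency_loss` are hypothetical and not the authors' implementation.

```python
import random

def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1/(1-p)."""
    if p <= 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def linear(values, weights):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def siamese_consistency_loss(features, weights, p, rng):
    """Pass the same input through two independent dropout masks and return
    the mean squared difference between the two output vectors."""
    out_a = linear(dropout(features, p, rng), weights)
    out_b = linear(dropout(features, p, rng), weights)
    return sum((a - b) ** 2 for a, b in zip(out_a, out_b)) / len(out_a)

rng = random.Random(0)
features = [0.2, -1.0, 0.7, 1.5]           # hypothetical fused feature vector
weights = [[0.5, -0.3, 0.8, 0.1],
           [0.2, 0.4, -0.6, 0.9]]          # hypothetical 4 -> 2 output layer

loss_no_dropout = siamese_consistency_loss(features, weights, 0.0, rng)
loss_with_dropout = siamese_consistency_loss(features, weights, 0.5, rng)
print(loss_no_dropout, loss_with_dropout)
```

In training, such a term would be added to the task loss; at inference no dropout is applied, so the two passes coincide and the penalty is zero.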

### Face-Voice Matching using Cross-modal Embeddings

•      Shota Horiguchi
• Naoyuki Kanda
• Kenji Nagamatsu

Face-voice matching is the task of finding correspondences between faces and voices. Many studies in cognitive science have confirmed the human ability to perform face-voice matching tasks. Such an ability is useful for creating natural human-machine interaction systems and in many other applications. In this paper, we propose a face-voice matching model that learns cross-modal embeddings between face images and voice characteristics. We constructed a novel FVCeleb dataset, which consists of face images and utterances from 1,078 persons. These persons were selected from the MS-Celeb-1M face image dataset and the VoxCeleb audio dataset. In a two-alternative forced-choice matching task with an audio input and two face-image candidates of the same gender, our model achieved 62.2% and 56.5% accuracy on FVCeleb and a subset of the GRID corpus, respectively. These results are very similar to human performance reported in cognitive science studies.
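To make the evaluation protocol concrete, here is a small sketch of scoring a two-alternative forced-choice task over embeddings. Cosine similarity as the matching score, the function names, and the toy 2-D embeddings are assumptions made for illustration; the abstract does not specify the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def two_afc_accuracy(trials):
    """Fraction of trials where the true face scores higher than the distractor.

    trials: list of (voice_emb, true_face_emb, distractor_face_emb) tuples.
    """
    correct = sum(
        1
        for voice, face_true, face_other in trials
        if cosine(voice, face_true) > cosine(voice, face_other)
    )
    return correct / len(trials)

# Hypothetical 2-D embeddings: the first two trials are matched correctly,
# while in the third the distractor lies closer to the voice embedding.
trials = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [0.2, 0.8], [1.0, 0.0]),
    ([1.0, 1.0], [1.0, 0.0], [0.7, 0.7]),
]
accuracy = two_afc_accuracy(trials)
print(accuracy)
```

Because each trial has exactly two candidates, random guessing yields 50% accuracy, which is the baseline against which the reported figures should be read.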

### Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval

•      Jing-Jing Chen
• Chong-Wah Ngo
• Fu-Li Feng
• Tat-Seng Chua

Finding the right recipe that describes the cooking procedure for a dish from just one picture is inherently a difficult problem. Food preparation undergoes a complex process involving raw ingredients, utensils, cutting and cooking operations. This process gives clues to the multimedia presentation of a dish (e.g., taste, colour, shape). However, the description of the process is implicit, implying only the *cause* of dish presentation rather than the visual *effect* that can be vividly observed on a picture. Therefore, different from other cross-modal retrieval problems in the literature, recipe search requires the understanding of textually described procedure to predict its possible consequence on visual appearance. In this paper, we approach this problem from the perspective of attention modeling. Specifically, we model the attention of words and sentences in a recipe and align them with its image feature such that both text and visual features share high similarity in multi-dimensional space. Through a large food dataset, Recipe1M, we empirically demonstrate that understanding the cooking procedure can lead to improvement by a large margin compared to the existing methods which mostly consider only ingredient information. Furthermore, with attention modeling, we show that language-specific named-entity extraction based on domain knowledge becomes optional. The result sheds light on the feasibility of performing cross-lingual cross-modal recipe retrieval with off-the-shelf machine translation engines.

### Decoupled Novel Object Captioner

•      Yu Wu
• Linchao Zhu
• Lu Jiang
• Yi Yang

Image captioning is a challenging task where the machine automatically describes an image by sentences or phrases. It often requires a large number of paired image-sentence annotations for training. However, a pre-trained captioning model can hardly be applied to a new domain in which some novel object categories exist, i.e., the objects and their description words are unseen during model training. To correctly caption the novel object, it requires professional human workers to annotate the images by sentences with the novel words. It is labor expensive and thus limits its usage in real-world applications. In this paper, we introduce the zero-shot novel object captioning task where the machine generates descriptions without extra training sentences about the novel object. To tackle the challenging problem, we propose a Decoupled Novel Object Captioner (DNOC) framework that can fully decouple the language sequence model from the object descriptions. DNOC has two components. 1) A Sequence Model with the Placeholder (SM-P) generates a sentence containing placeholders. The placeholder represents an unseen novel object. Thus, the sequence model can be decoupled from the novel object descriptions. 2) A key-value object memory built upon the freely available detection model, contains the visual information and the corresponding word for each object. A query generated from the SM-P is used to retrieve the words from the object memory. The placeholder will further be filled with the correct word, resulting in a caption with novel object descriptions. The experimental results on the held-out MSCOCO dataset demonstrate the ability of DNOC in describing novel concepts.

### Temporal Cross-Media Retrieval with Soft-Smoothing

•      David Semedo
• Joao Magalhaes

Multimedia information has strong temporal correlations that shape the way modalities co-occur over time. In this paper we study the dynamic nature of multimedia and social-media information, where the temporal dimension emerges as a strong source of evidence for learning the temporal correlations across visual and textual modalities. So far, cross-media retrieval models have explored the correlations between different modalities (e.g., text and image) to learn a common subspace in which semantically similar instances lie in the same neighbourhood. Building on such knowledge, we propose a novel temporal cross-media neural architecture that departs from standard cross-media methods by explicitly accounting for the temporal dimension through temporal subspace learning. The model is softly constrained with temporal and inter-modality constraints that guide the new subspace learning task by favouring temporal correlations between semantically similar and temporally close instances. Experiments on three distinct datasets show that accounting for time turns out to be important for cross-media retrieval. Namely, the proposed method outperforms a set of baselines on the task of temporal cross-media retrieval, demonstrating its effectiveness for performing temporal subspace learning.

### Photo Squarization by Deep Multi-Operator Retargeting

•      Yu Song
• Fan Tang
• Weiming Dong
• Xiaopeng Zhang
• Oliver Deussen
• Tong-Yee Lee

Squared forms of photos are widely used in social media as album covers or thumbnails of image streams. In this study, we realize photo squarization by modeling Retargeting Visual Perception Issues, which reflect human perception preference toward image retargeting. General image retargeting techniques deal with three common issues, namely, salient content, object shape, and scene composition, to preserve the important information of the original image. We propose a new approach based on multi-operator techniques to investigate human behavior in balancing the three issues. We establish a new dataset and observe human behavior by inviting investigators to manually retarget images to a square. We propose a data-driven approach composed of perception and distillation modules that uses deep learning techniques to predict human perception preference. The perception part learns the relations among the three issues, and the distillation part transfers the learned relations to a simple but effective network. Our study contributes to deep learning literature by optimizing a network index and lightening its running burden. Experimental results show that photo squarization results generated by the proposed model are consistent with human visual perception results.

### Non-locally Enhanced Encoder-Decoder Network for Single Image De-raining

•      Guanbin Li
• Xiang He
• Wei Zhang
• Huiyou Chang
• Le Dong
• Liang Lin

Single image rain streaks removal has recently witnessed substantial progress due to the development of deep convolutional neural networks. However, existing deep learning based methods either focus on the entrance and exit of the network by decomposing the input image into high and low frequency information and employing residual learning to reduce the mapping range, or focus on the introduction of cascaded learning scheme to decompose the task of rain streaks removal into multi-stages. These methods treat the convolutional neural network as an encapsulated end-to-end mapping module without deepening into the rationality and superiority of neural network design. In this paper, we delve into an effective end-to-end neural network structure for stronger feature expression and spatial correlation learning. Specifically, we propose a non-locally enhanced encoder-decoder network framework, which consists of a pooling indices embedded encoder-decoder network to efficiently learn increasingly abstract feature representation for more accurate rain streaks modeling while perfectly preserving the image detail. The proposed encoder-decoder framework is composed of a series of non-locally enhanced dense blocks that are designed to not only fully exploit hierarchical features from all the convolutional layers but also well capture the long-distance dependencies and structural information. Extensive experiments on synthetic and real datasets demonstrate that the proposed method can effectively remove rain-streaks on rainy image of various densities while well preserving the image details, which achieves significant improvements over the recent state-of-the-art methods.

### An ADMM-Based Universal Framework for Adversarial Attacks on Deep Neural Networks

•      Pu Zhao
• Sijia Liu
• Yanzhi Wang
• Xue Lin

Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. That is, adversarial examples, obtained by adding delicately crafted distortions onto original legal inputs, can mislead a DNN to classify them as any target labels. In a successful adversarial attack, the targeted mis-classification should be achieved with the minimal distortion added. In the literature, the added distortions are usually measured by the $L_0$, $L_1$, $L_2$, and $L_\infty$ norms, giving rise to $L_0$, $L_1$, $L_2$, and $L_\infty$ attacks, respectively. However, a versatile framework for all types of adversarial attacks is lacking. This work for the first time unifies the methods of generating adversarial examples by leveraging ADMM (Alternating Direction Method of Multipliers), an operator splitting optimization approach, such that $L_0$, $L_1$, $L_2$, and $L_\infty$ attacks can be effectively implemented by this general framework with few modifications. Compared with the state-of-the-art attacks in each category, our ADMM-based attacks are so far the strongest, achieving both a 100% attack success rate and the minimal distortion.
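For reference, the four distortion measures named above can be computed as in the short sketch below. It only illustrates the norms themselves, not the ADMM attack; the function name `distortion_norms` and the toy inputs are assumptions made for this example.

```python
import math

def distortion_norms(original, adversarial):
    """Measure the perturbation delta = adversarial - original under four norms."""
    delta = [a - o for o, a in zip(original, adversarial)]
    return {
        "L0": sum(1 for d in delta if d != 0.0),      # number of changed entries
        "L1": sum(abs(d) for d in delta),             # total absolute change
        "L2": math.sqrt(sum(d * d for d in delta)),   # Euclidean length
        "Linf": max(abs(d) for d in delta),           # largest single change
    }

# Hypothetical flattened input and its perturbed version.
original = [0.0, 0.5, 1.0, 0.25]
adversarial = [0.0, 1.0, 0.75, 0.25]

norms = distortion_norms(original, adversarial)
print(norms)
```

Each norm favours a different kind of perturbation: $L_0$ rewards changing few entries, $L_\infty$ rewards keeping every individual change small, and $L_1$ and $L_2$ sit in between.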

### Local Convolutional Neural Networks for Person Re-Identification

•      Jiwei Yang
• Xu Shen
• Xinmei Tian
• Houqiang Li
• Jianqiang Huang
• Xian-Sheng Hua

Recent works have shown that person re-identification can be substantially improved by introducing attention mechanisms, which allow learning both global and local representations. However, all these works learn global and local features in separate branches. As a consequence, the interaction/boosting of global and local information are not allowed, except in the final feature embedding layer. In this paper, we propose local operations as a generic family of building blocks for synthesizing global and local information in any layer. This building block can be inserted into any convolutional networks with only a small amount of prior knowledge about the approximate locations of local parts. For the task of person re-identification, even with only one local block inserted, our local convolutional neural networks (Local CNN) can outperform state-of-the-art methods consistently on three large-scale benchmarks, including Market-1501, CUHK03, and DukeMTMC-ReID.

### Conditional Expression Synthesis with Face Parsing Transformation

•      Zhihe Lu
• Tanhao Hu
• Lingxiao Song
• Zhaoxiang Zhang
• Ran He

Facial expression synthesis with various intensities is a challenging synthesis task due to large identity appearance variations and a paucity of efficient means for intensity measurement. This paper advances the expression synthesis domain by introducing a Couple-Agent Face Parsing based Generative Adversarial Network (CAFP-GAN) that unites the knowledge of facial semantic regions and controllable expression signals. Specifically, we employ a face parsing map, which provides a semantic representation of every pixel of the facial regions, as a controllable condition to guide facial texture generation with a specific expression. Our method consists of two sub-networks: a face parsing prediction network (FPPN) uses controllable labels (expression and intensity) to generate, from the input neutral face, a face parsing map transformation that corresponds to the labels, and a facial expression synthesis network (FESN) incorporates the pretrained FPPN to provide the face parsing map as guidance for expression synthesis. To enhance the realism of the results, couple-agent discriminators are used to distinguish fake-real pairs in both sub-nets. Moreover, we only need the neutral face and the labels to synthesize the unknown expression with different intensities. Experimental results on three popular facial expression databases show that our method has a compelling ability for continuous expression synthesis.

### Attentive Recurrent Neural Network for Weak-supervised Multi-label Image Classification

•      Liang Li
• Shuhui Wang
• Shuqiang Jiang
• Qingming Huang

Multi-label image classification is a fundamental and challenging task in computer vision, and has recently achieved significant progress by exploiting semantic relations among labels. However, the spatial positions of labels for multi-label images are usually not provided in real scenarios, which poses an insurmountable barrier to conventional models. In this paper, we propose an end-to-end attentive recurrent neural network for multi-label image classification under only image-level supervision, which learns the discriminative feature representations and models the label relations simultaneously. First, inspired by the attention mechanism, we propose a recurrent highlight network (RHN) which focuses on the most related regions in the image to learn the discriminative feature representations for different objects in an iterative manner. Second, we develop a gated recurrent relation extractor (GRRE) to model the label relations using multiplicative gates in a recurrent fashion, which learns to decide how multiple labels of the image influence the relation extraction. Extensive experiments on three benchmark datasets show that our model outperforms the state of the art, and performs better on small-object categories and in scenarios with a large number of labels.

### Deep Cross Modal Learning for Caricature Verification and Identification (CaVINet)

•      Jatin Garg
• Skand Vishwanath Peri
• Himanshu Tolani
• Narayanan C. Krishnan

Learning from different modalities is a challenging task. In this paper, we look at the challenging problem of cross modal face verification and recognition between caricature and visual image modalities. Caricatures exaggerate the facial features of a person. Due to the significant variations in caricatures, building vision models for recognizing and verifying data from this modality is an extremely challenging task. Visual images, which have significantly fewer distortions, can act as a bridge for the analysis of the caricature modality. We introduce a publicly available large Caricature-VIsual dataset [CaVI] with images from both modalities that captures the rich variations in the caricatures of an identity. This paper presents the first cross modal architecture that handles extreme distortions of caricatures using a deep learning network that learns similar representations across the modalities. We use two convolutional networks along with transformations that are subjected to orthogonality constraints to capture the shared and modality specific representations. In contrast to prior research, our approach depends neither on manually extracted facial landmarks for learning the representations, nor on the identities of the persons for performing verification. The learned shared representation achieves 91% accuracy for verifying unseen images and 75% accuracy on unseen identities. Further, recognizing the identity in the image by knowledge transfer using a combination of shared and modality specific representations resulted in an unprecedented performance of 85% rank-1 accuracy for caricatures and 95% rank-1 accuracy for visual images.

### Few-Shot Adaptation for Multimedia Semantic Indexing

•      Nakamasa Inoue
• Koichi Shinoda

We propose a few-shot adaptation framework, which bridges zero-shot learning and supervised many-shot learning, for semantic indexing of image and video data. Few-shot adaptation provides robust parameter estimation with few training examples, by optimizing the parameters of zero-shot learning and supervised many-shot learning simultaneously. In this method, first we build a zero-shot detector, and then update it by using the few examples. Our experiments show the effectiveness of the proposed framework on three datasets: TRECVID Semantic Indexing 2010, 2014, and ImageNET. On the ImageNET dataset, we show that our method outperforms recent few-shot learning methods. On the TRECVID 2014 dataset, we achieve 15.19% and 35.98% in Mean Average Precision under the zero-shot condition and the supervised condition, respectively. To the best of our knowledge, these are the best results on this dataset.

### Fashion Sensitive Clothing Recommendation Using Hierarchical Collocation Model

•      Zhengzhong Zhou
• Xiu Di
• Wei Zhou
• Liqing Zhang

Automatic clothing recommendation grows dramatically due to the booming of apparel e-commerce. In this paper, we propose a novel clothing recommendation approach which is sensitive to the fashion trend. The proposed approach incorporates the expert knowledge into multiple dimensional information including purchase behaviors, image contents and product descriptions so as to provide recommendation of clothing in line with the forefront of fashion. Meanwhile, to meet with human visual aesthetics and user's collocation experience, we propose the integration of the convolutional neural network and the hierarchical collocation model (HCM) into our framework. The former is to extract effective visual features and attribute descriptors from the clothing items, while the latter embeds them into the concept of style topics which interpret the collocation pattern from a higher level of semantic knowledge. Such a data driven recommendation approach is able to learn clothing collocation metric from multi-dimensional clothing information. Experimental results show that our HCM method achieves better performance than other state-of-the-art baselines. Besides, it also ensures the fashion sensitivity of the recommended outfits.

### Multi-Scale Context Attention Network for Image Retrieval

•      Yihang Lou
• Yan Bai
• Shiqi Wang
• Ling-Yu Duan

Recent attempts on the Convolutional Neural Network (CNN) based image retrieval usually adopt the output of a specific convolutional or fully connected layer as feature representation. Though superior representation capability has yielded better retrieval performance, the scale variation and clutter distracting remain to be two challenging problems in CNN based image retrieval. In this work, we propose a Multi-Scale Context Attention Network (MSCAN) to generate global descriptors, which is able to selectively focus on the informative regions with the assistance of multi-scale context information. We model the multi-scale context information by an improved Long Short-Term Memory (LSTM) network across different layers. As such, the proposed global descriptor is equipped with the scale aware attention capability. Experimental results show that our proposed method can effectively capture the informative regions in images and retain reliable attention responses when encountering scale variation and clutter distracting. Moreover, we compare the performance of the proposed scheme with the state-of-the-art global descriptors, and extensive results verify that the proposed MSCAN can achieve superior performance on several image retrieval benchmarks.

### Comprehensive Distance-Preserving Autoencoders for Cross-Modal Retrieval

•      Yibing Zhan
• Jun Yu
• Zhou Yu
• Rong Zhang
• Dacheng Tao
• Qi Tian

In this paper, we propose a novel method with comprehensive distance-preserving autoencoders (CDPAE) to address the problem of unsupervised cross-modal retrieval. Previous unsupervised methods rely primarily on pairwise distances of representations extracted from cross media spaces that co-occur and belong to the same objects. However, besides pairwise distances, the CDPAE also considers heterogeneous distances of representations extracted from cross media spaces as well as homogeneous distances of representations extracted from single media spaces that belong to different objects. The CDPAE consists of four components. First, denoising autoencoders are used to retain the information from the representations and to reduce the negative influence of redundant noises. Second, a comprehensive distance-preserving common space is proposed to explore the correlations among different representations. This aims to preserve the respective distances between the representations within the common space so that they are consistent with the distances in their original media spaces. Third, a novel joint loss function is defined to simultaneously calculate the reconstruction loss of the denoising autoencoders and the correlation loss of the comprehensive distance-preserving common space. Finally, an unsupervised cross-modal similarity measurement is proposed to further improve the retrieval performance. This is carried out by calculating the marginal probability of two media objects based on a kNN classifier. The CDPAE is tested on four public datasets with two cross-modal retrieval tasks: "query images by texts" and "query texts by images". Compared with eight state-of-the-art cross-modal retrieval methods, the experimental results demonstrate that the CDPAE outperforms all the unsupervised methods and performs competitively with the supervised methods.

### Temporal Hierarchical Attention at Category- and Item-Level for Micro-Video Click-Through Prediction

•      Xusong Chen
• Dong Liu
• Zheng-Jun Zha
• Wengang Zhou
• Zhiwei Xiong
• Yan Li

Micro-video sharing has gained great popularity in recent years, which calls for effective recommendation algorithms to help users find micro-videos of interest. Compared with traditional online videos (e.g., YouTube), micro-videos contributed by grass-roots users and taken with smartphones are much shorter (tens of seconds) and have fewer tags and less descriptive text, making the recommendation of micro-videos a challenging task. In this paper, we investigate how to model a user's historical behaviors so as to predict the user's click-through of micro-videos. Inspired by recent deep network-based methods, we propose a Temporal Hierarchical Attention at Category- and Item-Level (THACIL) network for user behavior modeling. First, we use temporal windows to capture the short-term dynamics of user interests. Second, we leverage a category-level attention mechanism to characterize a user's diverse interests, as well as an item-level attention mechanism for fine-grained profiling of user interests. Third, we adopt forward multi-head self-attention to capture the long-term correlation within user behaviors. Our proposed THACIL network was tested on MicroVideo-1.7M, a new dataset of 1.7 million micro-videos drawn from real data of a micro-video sharing service in China. Experimental results demonstrate the effectiveness of the proposed method in comparison with state-of-the-art solutions.

### Historical Context-based Style Classification of Painting Images via Label Distribution Learning

•      Jufeng Yang
• Liyi Chen
• Le Zhang
• Xiaoxiao Sun
• Dongyu She
• Shao-Ping Lu
• Ming-Ming Cheng

Analyzing and categorizing the style of visual art images, especially paintings, is gaining popularity owing to its importance in understanding and appreciating the art. The evolution of painting style is both continuous, in a sense that new styles may inherit, develop or even mutate from their predecessors and multi-modal because of various issues such as the visual appearance, the birthplace, the origin time and the art movement. Motivated by this peculiarity, we introduce a novel knowledge distilling strategy to assist visual feature learning in the convolutional neural network for painting style classification. More specifically, a multi-factor distribution is employed as soft-labels to distill complementary information with visual input, which extracts from different historical context via label distribution learning. The proposed method is well-encapsulated in a multi-task learning framework which allows end-to-end training. We demonstrate the superiority of the proposed method over the state-of-the-art approaches on Painting91, OilPainting, and Pandora datasets.

### Direction-aware Neural Style Transfer

•      Hao Wu
• Zhengxing Sun
• Weihang Yuan

Neural learning methods have been shown to be effective in style transfer. These methods, called neural style transfer (NST), aim to synthesize a new image that retains the high-level structure of a content image while keeping the low-level features of a style image. However, these models, which use convolutional structures, only extract local statistical features of style images and semantic features of content images. Because the low-level features of the content image are absent, these methods synthesize images that look unnatural and full of machine-made traces. In this paper, we find that direction, that is, the orientation of each painting stroke, better captures the soul of an image style and thus yields much more natural and vivid stylizations. Based on this observation, we propose a Direction-aware Neural Style Transfer (DaNST) with two major innovations. First, a novel direction field loss is proposed to steer the direction of strokes in the synthesized image. To build this loss function, we propose novel direction field loss networks to generate and compare the direction fields of the content image and the synthesized image. By incorporating the direction field loss in neural style transfer, we obtain a new optimization objective. By minimizing this objective, we can produce synthesized images that better follow the direction field of the content image. Second, our method provides a simple interaction mechanism to control the generated direction fields, and thereby the texture direction in synthesized images. Experiments show that our method outperforms the state of the art in most styles, such as oil painting and mosaic.

### ChipGAN: A Generative Adversarial Network for Chinese Ink Wash Painting Style Transfer

•      Bin He
• Feng Gao
• Daiqian Ma
• Boxin Shi
• Ling-Yu Duan

Style transfer has been successfully applied on photos to generate realistic western paintings. However, because of the inherently different painting techniques adopted by Chinese and western paintings, directly applying existing methods cannot generate satisfactory results for Chinese ink wash painting style transfer. This paper proposes ChipGAN, an end-to-end Generative Adversarial Network based architecture for photo to Chinese ink wash painting style transfer. The core modules of ChipGAN enforce three constraints -- voids, brush strokes, and ink wash tone and diffusion -- to address three key techniques commonly adopted in Chinese ink wash painting. We conduct stylization perceptual study to score the similarity of generated paintings to real paintings by consulting with professional artists based on the newly built Chinese ink wash photo and image dataset. The advantages in visual quality compared with state-of-the-art networks and high stylization perceptual study scores show the effectiveness of the proposed method.

### CloudVR: Cloud Accelerated Interactive Mobile Virtual Reality

•      Teemu Kämäräinen
• Matti Siekkinen
• Jukka Eerikäinen
• Antti Ylä-Jääski

High quality immersive Virtual Reality experience currently requires a PC setup with cable connected head mounted display, which is expensive and restricts user mobility. This paper presents CloudVR which is a system for cloud accelerated interactive mobile VR. It is designed to provide short rotation and interaction latencies through panoramic rendering and dynamic object placement. CloudVR also includes rendering optimizations to reduce server-side computational load and bandwidth requirements between the server and client. Performance measurements with a CloudVR prototype suggest that the optimizations make it possible to double the server's framerate and halve the amount of bandwidth required and that small objects can be quickly moved at run time to client device for rendering to provide shorter interaction latency. A small-scale user study indicates that CloudVR users do not notice small network latencies (20ms) and even much longer ones (100-200ms) become non-trivial to detect when they do not affect the interaction with objects. Finally, we present a design of CloudVR extension to multi-user scenarios.

### Your Attention is Unique: Detecting 360-Degree Video Saliency in Head-Mounted Display for Head Movement Prediction

•      Anh Nguyen
• Zhisheng Yan
• Klara Nahrstedt

### Hybrid Point Cloud Attribute Compression Using Slice-based Layered Structure and Block-based Intra Prediction

•      Yiting Shao
• Qi Zhang
• Ge Li
• Zhu Li
• Li Li

Point cloud compression is a key enabler for emerging applications such as immersive visual communication, autonomous driving, and smart cities. In this paper, we propose a hybrid point cloud attribute compression scheme built on an original layered data structure. First, a slice partition scheme and a geometry-adaptive k-dimensional tree (k-d tree) method are devised to generate layer structures. Second, we introduce an efficient block-based intra prediction scheme to exploit spatial correlations among adjacent points. Third, an adaptive transform scheme based on the Graph Fourier Transform (GFT) is Lagrangian-optimized to achieve better transform efficiency. The Lagrange multiplier is derived offline from the statistics of attribute coding. Last but not least, multiple scan modes are designed to improve the efficiency of entropy coding. Experimental results demonstrate that our method performs better than the state-of-the-art region-adaptive hierarchical transform (RAHT) system, achieving a 37.21% BD-rate gain on average. Compared with the test model for category 1 (TMC1) anchors, which were recently published by the MPEG-3DG group at the 121st MPEG meeting, an 8.81% BD-rate gain is obtained.

### QARC: Video Quality Aware Rate Control for Real-Time Video Streaming based on Deep Reinforcement Learning

•      Tianchi Huang
• Rui-Xiao Zhang
• Chao Zhou
• Lifeng Sun

Real-time video streaming is now one of the main applications in all network environments. Because throughput fluctuates under varying network conditions, how to adaptively choose a proper bitrate has become an emerging and interesting issue. To tackle this problem, most proposed rate control methods aim to provide high video bitrates rather than high video quality. Nevertheless, we notice that there exists a trade-off between sending bitrate and video quality, which motivates us to focus on how to reach a balance between them.

In this paper, we propose QARC (video Quality Aware Rate Control), a rate control algorithm that aims to obtain higher perceptual video quality with a possibly lower sending rate and transmission latency. Starting from scratch, QARC uses a deep reinforcement learning (DRL) algorithm to train a neural network to select future bitrates based on previously observed network status and past video frames. To overcome the "state explosion problem", we design a neural network that predicts future perceptual video quality as a vector, which takes the place of the raw picture in the DRL inputs.

We evaluate QARC via trace-driven simulation, where it outperforms existing approaches with improvements in average video quality of 18% to 25% and reductions in average latency of 23% to 45%. Meanwhile, comparing QARC with an offline optimal high-bitrate method under various network conditions, we find that QARC also yields solid results.

### Optimizing Personalized Interaction Experience in Crowd-Interactive Livecast: A Cloud-Edge Approach

•      Haitian Pang
• Cong Zhang
• Fangxin Wang
• Han Hu
• Zhi Wang
• Jiangchuan Liu
• Lifeng Sun

By enabling users to interact with broadcasters and other audience members, crowd-interactive livecast greatly improves viewers' quality of experience (QoE) and has recently attracted millions of daily active users. In addition to striking the balance between resource utilization and viewers' QoE found in traditional video streaming services, this novel service needs to make extra efforts to improve the interaction QoE, which reflects the viewer interaction experience. To tackle this issue, we conduct measurement studies over a large-scale dataset crawled from a representative livecast service provider. We observe that individuals' interaction patterns are quite heterogeneous: only 10% of viewers proactively participate in the interaction, and the rest usually watch passively. Incorporating this insight into the emerging cloud-edge architecture, we propose PIECE, a framework that optimizes the Personalized Interaction Experience with a Cloud-Edge architecture for intelligent user access control and livecast distribution. In particular, we first devise a novel deep neural network based algorithm to predict users' interaction intensity using the historical viewer pattern. We then design an algorithm to maximize the individual's QoE by strategically matching viewer sessions and transcoding-delivery paths over cloud-edge infrastructure. Finally, we use trace-driven experiments to verify the effectiveness of PIECE. Our results show that our prediction algorithm outperforms the state-of-the-art algorithms with a much smaller mean absolute error (40% reduction). Furthermore, in comparison with the cloud-based video delivery strategy, the proposed framework can simultaneously improve the average viewers' QoE (26% improvement) and interaction QoE (21% improvement), while maintaining a high streaming bitrate.

## SESSION: Demo + Video + Makers' Program

### Session details: Demo + Video + Makers' Program

•      Kwanghoon Sohn
• Yong Man Ro

### Give Me One Portrait Image, I Will Tell You Your Emotion and Personality

•      Songyou Peng
• Le Zhang
• Stefan Winkler
• Marianne Winslett

Personality and emotion are both central to affective computing. Existing works address them individually. In this demo we investigate whether such high-level affect traits and their relationship can be jointly learned from face images in the wild. To this end, we introduce an end-to-end trainable and deep Siamese-like network. At inference time, our system can take one portrait photo as input and predict one's Big-Five apparent personality as well as emotion attributes. With such a system, we also demonstrate the feasibility of inferring the apparent personality directly from emotion.

### Demo: Phase-based Acoustic Localization and Motion Tracking for Mobile Interaction

•      Yang Liu
• Yang Yang
• Weidong Fang
• Wuxiong Zhang

Motion tracking, as a mechanism of mobile interaction, allows devices to get fine-grained user input by locating the real-time position of target devices (e.g., smartphones, smart watches) in the air. With the proliferation of mobile devices and smart multimedia devices (e.g., smart TVs, home audio systems), the ubiquitous speakers and microphones in these devices provide more diverse ways of acoustic-based mobile interaction. In this demonstration, we propose a fine-grained motion tracking system, which can be deployed on commercial mobile devices and track the devices with millimeter-level (mm-level) accuracy. We first compensate for the phase offset between receiver and audio source at each frequency. We then use the acoustic phase change at the receiver to achieve accurate distance measurement. Finally, we implement our system on off-the-shelf devices and achieve fine-grained motion tracking in two-dimensional space. Our experiments show that our system achieves high accuracy as well as high sensitivity: for example, it can detect the slight and slow movement caused by human breathing.
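
The phase-to-distance step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single pure tone, a one-way path from speaker to microphone, a sound speed of 343 m/s, and a sign convention in which a longer path lowers the received phase. The function names are hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly 20 degrees Celsius


def unwrap_phases(phases):
    """Remove 2*pi jumps so that consecutive samples differ by at most pi."""
    if not phases:
        return []
    unwrapped = [phases[0]]
    for current in phases[1:]:
        delta = current - unwrapped[-1]
        # Fold the step into the range [-pi, pi] so a wrap-around
        # is not mistaken for a large movement.
        delta -= 2.0 * math.pi * round(delta / (2.0 * math.pi))
        unwrapped.append(unwrapped[-1] + delta)
    return unwrapped


def displacement_from_phase(phases, frequency_hz, speed=SPEED_OF_SOUND_M_S):
    """Return the path-length change in metres relative to the first sample.

    Assumed convention: each extra wavelength of path length lowers the
    received phase by 2*pi.
    """
    wavelength = speed / frequency_hz
    unwrapped = unwrap_phases(phases)
    return [-(p - unwrapped[0]) / (2.0 * math.pi) * wavelength for p in unwrapped]
```

At 20 kHz the wavelength is about 17 mm, so a phase resolution of a few degrees already corresponds to sub-millimeter path changes. The sketch only stays unambiguous while the path changes by less than half a wavelength between consecutive samples.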

### AI Painting: An Aesthetic Painting Generation System

•      Cunjun Zhang
• Kehua Lei
• Jia Jia
• Yihui Ma
• Zhiyuan Hu

Much great work has been done in image generation. However, how to generate a painting that meets the aesthetic rules of a specific style is still an open problem. Therefore, in this paper, we propose a demonstration that generates a specific painting based on users' input. In the system, called AI Painting, we generate an original image from content text, transfer the image into a specific aesthetic effect, render the image in a specific artistic genre, and illustrate the painting process.

### SoMin.ai: Social Multimedia Influencer Discovery Marketplace

•      Aleksandr Farseev
• Kirill Lepikhin
• Hendrik Schwartz
• Eu Khoon Ang
• Kenny Powar

In this technical demonstration, we showcase the first AI-driven social multimedia influencer discovery marketplace, called SoMin. The platform combines advanced data analytics and behavioral science to help marketers find and understand their audience and engage the most relevant social media micro-influencers at a large scale. SoMin harvests brand-specific live social multimedia streams in a specified market domain, followed by rich analytics and semantic-based influencer search. The Individual User Profiling models extrapolate the key personal characteristics of the brand audience, while the influencer retrieval engine reveals the semantically matching social media influencers to the platform users. The influencers are matched in terms of both their posted content and their social media audiences, and the evaluation results demonstrate excellent performance of the proposed recommender framework. By leveraging influencers at a large scale, marketers will be able to execute more effective marketing campaigns of higher trust and at a lower cost.

### AniDance: Real-Time Dance Motion Synthesize to the Song

•      Taoran Tang
• Hanyang Mao
• Jia Jia

In this paper, we present a demo named AniDance that can synthesize dance motions to a melody in real time. When users sing a song or play one on their phone to AniDance, the melody drives a 3D-space character to dance, creating a lively dance animation. In practice, we construct a music-oriented 3D-space dance motion dataset by capturing real dance performances, and use an LSTM-autoencoder to identify the relation between music and dance. Based on these technologies, users can create valid choreographies that are capable of musical expression, which can promote their learning ability and interest in dance and music.

### ArtSight: An Artistic Data Exploration Engine

•      Gjorgji Strezoski
• Inske Groenen
• Jurriaan Besenbruch
• Marcel Worring

This technical demo presents ArtSight, a comprehensive query-by-color explorative interface built on top of the large scale artistic dataset OmniArt. Color is of paramount importance in the artistic realm and querying such large data collections by colors that appear in their palette allows for intuitive exploration. This demo allows users to browse the 3 million artwork items in the OmniArt collection by color, and hierarchically filter each result-set by multiple attributes existing in the collection itself. Colors are extracted from the digital photographic reproductions in an unsupervised fashion in palettes of twelve and matched with their meta-data seamlessly to exploit both modalities in our filtering module. The user interaction quality is moderated by a responsive framework with touch capability and an unfoldable interactive 3D sphere visualization offering two exploration options - CompactExplore or GridExplore.
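
Query-by-color over extracted palettes, as described above, can be sketched as a nearest-color ranking. The sketch below is an illustration only and makes several assumptions not stated in the abstract: palettes are given as hex strings, similarity is plain Euclidean distance in RGB, and an item is scored by its single closest palette color. All function names are hypothetical.

```python
from math import sqrt


def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(a, b):
    """Euclidean distance in RGB (an assumption; a perceptual space may suit better)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_color(query_hex, palettes, top_k=3):
    """Rank items by the distance from the query to the closest color in each palette."""
    query = hex_to_rgb(query_hex)
    scored = []
    for item_id, palette in palettes.items():
        best = min(color_distance(query, hex_to_rgb(c)) for c in palette)
        scored.append((best, item_id))
    scored.sort()  # smallest distance first; ties broken by item id
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical toy collection: item id -> extracted palette.
toy_palettes = {
    "artwork_a": ["#ff0000", "#000000"],
    "artwork_b": ["#0000ff"],
    "artwork_c": ["#fe0101", "#00ff00"],
}
closest = rank_by_color("#ff0000", toy_palettes, top_k=2)
```

A linear scan like this is only practical for small collections; at the scale of millions of items mentioned in the abstract, some form of indexing would presumably be needed.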

### Meet AR-bot: Meeting Anywhere, Anytime with Movable Spatial AR Robot

•      Yoon Jung Park
• Yoonsik Yang
• Hyocheol Ro
• JungHyun Byun
• Seungho Chae
• Tack Don Han

Meetings require many kinds of preparation, for example a projector, a laptop, and cables. We have therefore built Meet AR-bot, presented in this video, which helps users keep meetings running smoothly through projection-based Augmented Reality (AR). Our system can easily provide a meeting room environment thanks to its movable wheel-based stand. Users do not need to carry a personal laptop and connect it to the projector. The robot reconstructs 3D geometry information with a pan-tilt system and computes projection areas for projecting information into the space. Users can also control the system through mobile devices. We offer presentation, table interaction, file sharing, and virtual object registration via mobile device.

### Magical Rice Bowl: A Real-time Food Category Changer

•      Ryosuke Tanno
• Daichi Horita
• Wataru Shimoda
• Keiji Yanai

In this demo, we demonstrate "Real-time Food Category Change" based on a Conditional Cycle GAN (cCycle GAN) with large-scale food image data collected from the Twitter Stream. Conditional Cycle GAN is an extension of CycleGAN, which enables "Food Category Change" among ten kinds of typical foods served in bowl-type dishes such as beef rice bowl and ramen noodles. The proposed system enables us to change the appearance of a given food photo according to the given category, keeping the shape of the given food but exchanging its textures. For training, we used two hundred and thirty thousand food images, which yielded very natural food category changes among ten kinds of typical Japanese foods: ramen noodle, curry rice, fried rice, beef rice bowl, chilled noodle, spaghetti with meat sauce, white rice, eel bowl, and fried noodle.

### Exploring Temporal Communities in Mass Media Archives

•      Haolin Ren
• Benjamin Renoust
• Guy Melançon
• Marie-Luce Viaud
• Shin'ichi Satoh

A key task in analyzing large multimedia archives over time is to dynamically monitor the activity of concepts and entities along with their interactions. This is helpful for analyzing threads of topics over news archives (how stories unfold), or for monitoring the evolution and development of social groups. Dynamic graph modeling is a powerful tool to capture these interactions over time, but visualization and community finding remain difficult, especially with a high density of links. We propose to extract the backbone of dynamic graphs in order to ease community detection and guide the exploration of how trends evolve. Through the graph structure, we interactively coordinate node-link diagrams, Sankey diagrams, time series, and animations in order to extract patterns and follow community behavior. We illustrate our system with the exploration of the role of soccer in 6 years of TV/radio magazines in France, and the role of North Korea in about 10 years of Japanese news.

### SoniControl - A Mobile Ultrasonic Firewall

•      Matthias Zeppelzauer
• Alexis Ringot
• Florian Taurer

The exchange of data between mobile devices in the near-ultrasonic frequency band is a promising new technology for near field communication (NFC) but also raises a number of privacy concerns. We present the first ultrasonic firewall that reliably detects ultrasonic communication and provides the user with effective means to prevent hidden data exchange. This demonstration showcases a new media-based communication technology ("data over audio") together with its related privacy concerns. It (i) enables users to interactively test out and experience ultrasonic information exchange and (ii) shows how to protect oneself against unwanted tracking.

### MusicMapp: A Deep Learning Based Solution for Music Exploration and Visual Interaction

•      Mohammed Habibullah Baig
• Jibin Rajan Varghese
• Zhangyang Wang

We present MusicMapp, the world's first large-scale interactive visualization of full-length songs as a point-cloud map, based on high-level features extracted using a customized deep convolutional recurrent neural network (Deep CRNN). MusicMapp will provide the audience with a novel way of experiencing music, opening up new horizons for research and exploration in musicology, regarding how music is perceived, consumed, and interacted with. The demo of MusicMapp will highlight a series of features, including but not limited to: 1) a cloud-based Android App visualizing songs as a point cloud; 2) personalized music exploration and recommendation; and 3) a social-network sharing mechanism built among the users exploring songs.

### Demonstration of an Open Source Framework for Qualitative Evaluation of CBIR Systems

•      Paula Gómez Duran
• Eva Mohedano
• Kevin McGuinness
• Xavier Giró-i-Nieto
• Noel E. O'Connor

Evaluating image retrieval systems in a quantitative way, for example by computing measures like mean average precision, allows for objective comparisons with a ground-truth. However, in cases where ground-truth is not available, the only alternative is to collect feedback from a user. Thus, qualitative assessments become important to better understand how the system works. Visualizing the results could be, in some scenarios, the only way to evaluate the results obtained and also the only opportunity to identify that a system is failing. This necessitates developing a User Interface (UI) for a Content Based Image Retrieval (CBIR) system that allows visualization of results and improvement via capturing user relevance feedback. A well-designed UI facilitates understanding of the performance of the system, both in cases where it works well and perhaps more importantly those which highlight the need for improvement. Our open-source system implements three components to facilitate researchers to quickly develop these capabilities for their retrieval engine. We present: a web-based user interface to visualize retrieval results and collect user annotations; a server that simplifies connection with any underlying CBIR system; and a server that manages the search engine data. The software itself is described in a separate submission to the ACM MM Open Source Software Competition.

### A Demonstration of an Intelligent Storytelling System

•      Yun-Gyung Cheong
• Woo-Hyun Park
• Hye-Yeon Yu

Automated story generation has received much attention in the last few decades, as storytelling enhances the user experience in the game, education, and training domains. In this paper, we demonstrate an intelligent story generation system which plans a story using an AI planner. It then displays the story as a 3D animation using the Unity Game engine. The central idea underlying our system lies in the use of a database as a communication channel among sub-modules. The demonstration showcases the complete pipeline process of story generation and visualization where the story changes as the user interacts with the system.

### IcooBook: When the Picture Book for Children Encounters Aesthetics of Interaction

•      Yaohua Bu
• Jia Jia
• Xiang Li
• Suping Zhou
• Xiaobo Lu

In this work, we propose a novel PCA (Perception & Cognition & Affection) model from the perspective of aesthetics in interaction. Based on PCA, we establish a new electronic interactive picture book for children, named IcooBook. At the first level of perception, the proposed IcooBook provides interfaces for multi-sensory interaction; at the second level of cognition, IcooBook builds immersive interactive scenes; at the third level of affection, IcooBook creates high-level interaction modes based on automatic emotion recognition. Our user study demonstrated the effectiveness of IcooBook in helping children focus on reading, gain a better understanding of the context, and appreciate the beauty of deep affective interaction.

### An Implementation of a DASH Client for Browsing Networked Virtual Environment

•      Thomas Forgione
• Axel Carlier
• Géraldine Morin
• Wei Tsang Ooi
• Vincent Charvillat
• Praveen Kumar Yadav

We demonstrate the use of DASH, a widely-deployed standard for streaming video content, for streaming 3D content in an NVE (Networked Virtual Environment) consisting of 3D geometry and associated textures. We have developed a DASH client for NVE to show how NVE benefits from the advantages of DASH: it offers a scalable, easy-to-deploy 3D streaming framework. In our system, the 3D content is first statically partitioned into DASH-compliant data, and metadata is provided so that the client can manage which data to download. Based on a proposed utility metric for geometry and texture at different resolutions, the client can choose the content to request depending on its viewpoint. We provide a Web-based client to navigate through our sample 3D scene, which derives the streaming requests from its own computation of the necessary online parameters, in a receiver-driven manner.
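
The receiver-driven selection described above can be sketched as a greedy choice under a download budget. This is an illustration under stated assumptions, not the paper's method: the abstract does not define the utility metric, so the `utility` function, the `Segment` fields, and the utility-per-byte greedy rule below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    size_bytes: int        # assumed to be positive
    base_utility: float    # hypothetical, e.g. surface area or texture detail
    distance: float        # distance from the current viewpoint to the content


def utility(segment):
    """Hypothetical utility: content nearer to the viewpoint is worth more."""
    return segment.base_utility / (1.0 + segment.distance) ** 2


def choose_segments(segments, budget_bytes):
    """Greedily pick segments by utility per byte until the budget is spent."""
    ranked = sorted(segments, key=lambda s: utility(s) / s.size_bytes, reverse=True)
    chosen, used = [], 0
    for seg in ranked:
        if used + seg.size_bytes <= budget_bytes:
            chosen.append(seg)
            used += seg.size_bytes
    return chosen


# Hypothetical scene: two small meshes and one large texture.
scene = [
    Segment("near_mesh", 100, 10.0, 0.0),
    Segment("far_mesh", 100, 10.0, 9.0),
    Segment("big_texture", 1000, 50.0, 0.0),
]
picked = [s.name for s in choose_segments(scene, budget_bytes=200)]
```

Because all decisions are made from metadata on the client, the server can remain a plain HTTP file server, which is the usual appeal of a receiver-driven DASH design.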

### Knowledge-aware Multimodal Fashion Chatbot

•      Lizi Liao
• You Zhou
• Yunshan Ma
• Richang Hong
• Tat-Seng Chua

A multimodal fashion chatbot provides a natural and informative way to fulfill customers' fashion needs. However, making it 'smart' in generating substantive responses remains a challenging problem. In this paper, we present a multimodal fashion chatbot enriched with domain knowledge. It forms a taxonomy-based learning module to capture the fine-grained semantics in images and leverages an end-to-end neural conversational model to generate responses based on the conversation history, visual semantics, and domain knowledge. To avoid inconsistent dialogues, a deep reinforcement learning method is used to further optimize the model.

### SVIAS: Scene-segmented Video Information Annotation System

•      Alex Lee
• Chang-Uk Kwak
• Jeong-Woo Son
• Sun-Joong Kim

We have designed a scene-segmented video information annotation system that combines video segmentation and information annotation. For video segmentation, the proposed system adopts a multiview deep convolutional neural network. Segmented scenes are annotated using an unsupervised sentence embedding model applied to closed captions. Both functionalities work together effectively through a web interface designed to tie in not only our functionalities but also external content providers.

### Interactive Story Maker: Tagged Video Retrieval System for Video Re-creation Service

•      Chang-Uk Kwak
• Min-Ho Han
• Sun-Joong Kim
• Gyeong-June Hahm

Users who want to reuse existing videos and re-create new videos need a video retrieval system that searches for relevant video clips within a vast collection. In addition, they need a re-creation tool that allows them to arrange the retrieved videos according to a user-defined story line. Because current video retrieval services are based on simple tags, they are limited in searching for videos relevant to natural language queries. That is, if you enter a scenario that describes a scene as a query, it is difficult to get appropriate retrieval results. To address these problems, this paper introduces a system that performs query preprocessing and expansion and provides users with retrieval results for a natural language query. In addition, we introduce a web-based re-creation tool that can construct a story line using retrieved videos and play the re-created video.

### HeterStyle: A Heterogeneous Video Style Transfer Application

•      Xingyu Liu
• Jingfan Guo
• Tongwei Ren
• Yahong Han
• Lei Huang
• Gangshan Wu

Video style transfer aims to synthesize a stylized video that preserves the content of a given video and is rendered in the style of a reference image. A key issue in video style transfer is how to balance video content preservation and reference style rendering, in order to avoid over-stylization with serious video content loss or under-stylization with unrecognized reference style. In this demonstration, we illustrate a novel video style transfer application, named HeterStyle, which can stylize different regions in the video with adaptive intensities. The core algorithm of the HeterStyle application is our proposed heterogeneous video style transfer method, which minimizes a heterogeneous style transfer loss function considering content, style and temporal consistency in a Convolutional Neural Network based optimization framework. With the HeterStyle application, a user can easily generate stylized videos with good video content preservation and reference style rendering.

### PAMI: Projection Augmented Meeting Interface for Video Conferencing

•      Hyocheol Ro
• Inhwan Kim
• JungHyun Byun
• Yoonsik Yang
• Yoon Jung Park
• Seungho Chae
• Tackdon Han

Video conferencing, which helps gather opinions and make decisions quickly among employees who are not in the same location, is now a very important communication tool in the workplace. Our research is one of these video conferencing solutions, specifically proposed to address the difficulties of analog materials sharing and feedback, and has added some useful features for the smooth use of conference participants. We conducted a comparative experiment on our proposed method of file sharing and the method that we had previously used in video conferencing. As a result, the proposed system yielded better results in terms of time and usability during a full-scale collaborative situation in which feedback was provided.

### ChildAR-bot: Educational Playing Projection-based AR Robot for Children

•      Yoon Jung Park
• Yoonsik Yang
• Hyocheol Ro
• Jinwon Cha
• Kyuri Kim
• Tack Don Han

Children encounter a variety of experiences through play, which can improve their ability to form ideas and undergo multi-faceted development. Using Augmented Reality (AR) technology to integrate various digital learning elements with real environments can lead to increased learning ability. This study proposes a 360° rotatable and portable system specialized for education and development through projection-based AR play. This system allows existing projection-based AR technology, which once could only be experienced at large-scale exhibitions and experience centers, to be used in individual and small-scale spaces. It also promotes the development of multi-sensory abilities through a multi-modality which provides various intuitive and sensory interactions. By experiencing the various educational play applications provided by the proposed system, children can increase their physical, perceptive, and emotional abilities and thinking skills.

## SESSION: Deep-2 (Recognition)

•      Qin Jin

### Mining Semantics-Preserving Attention for Group Activity Recognition

•      Yansong Tang
• Zian Wang
• Peiyang Li
• Jiwen Lu
• Ming Yang
• Jie Zhou

In this paper, we propose a Semantics-Preserving Teacher-Student (SPTS) model for group activity recognition in videos, which aims to mine semantics-preserving attention to automatically seek out the key people and discard the misleading ones. Conventional methods usually aggregate the features extracted from individual persons by pooling operations, which cannot fully explore the contextual information for group activity recognition. To address this, our SPTS networks first learn a Teacher Network in the semantic domain, which classifies the word of the group activity based on the words of individual actions. Then we carefully design a Student Network in the vision domain, which recognizes the group activity according to the input videos, and enforce the Student Network to mimic the Teacher Network during the learning process. In this way, we allocate semantics-preserving attention to different people, which adequately explores the contextual information of different people and requires no extra labelled data. Experimental results on two widely used benchmarks for group activity recognition clearly show the superior performance of our method in comparison with state-of-the-art methods.

### Participation-Contributed Temporal Dynamic Model for Group Activity Recognition

•      Rui Yan
• Jinhui Tang
• Xiangbo Shu
• Zechao Li
• Qi Tian

Group activity recognition, a challenging task in which many individuals appear in the scene while only a small subset of them participate in the activity, has received increasing attention. However, most previous methods model all individuals' actions equally, ignoring the fact that not all of them contribute to discriminating the group activity. That is to say, only a small number of key actors (participants) play important roles in the whole group activity. Inspired by this, we explore a new "One to Key" idea to progressively aggregate temporal dynamics of key actors with different participation degrees over time from each person. Here, we focus on two types of key actors in the whole activity: those who move steadily throughout the whole process (long moving time) and those who move intensely (but in close relation to the group activity) at a significant moment. Based on this, we propose a novel Participation-Contributed Temporal Dynamic Model (PC-TDM) to recognize group activity, which mainly consists of a "One" network and a "One to Key" network. Specifically, the "One" network aims at modeling the individual dynamics of each person. The "One to Key" network feeds the outputs from the "One" network into a Bidirectional LSTM (Bi-LSTM) according to the order of the individuals' moving time. Subsequently, each output state of the Bi-LSTM, weighted by a trainable time-varying attention factor, is aggregated by passing through an LSTM one by one. Experimental results on two benchmarks demonstrate that the proposed method improves group activity recognition performance compared to state-of-the-art methods.

### WildFish: A Large Benchmark for Fish Recognition in the Wild

•      Peiqin Zhuang
• Yali Wang
• Yu Qiao

Fish recognition is an important task for understanding the marine ecosystem and biodiversity. It is often challenging to identify fish species in the wild, due to the following difficulties. First, most fish benchmarks are small-scale, which may limit the representation power of machine learning models. Second, the number of fish species is huge, and there may still exist unknown categories on our planet. Traditional classifiers often fail to deal with this open-set scenario. Third, certain fish species are easily confused. It is often hard to figure out the subtle differences from unconstrained images alone. Motivated by these facts, we introduce a large-scale WildFish benchmark for fish recognition in the wild. Specifically, we make three contributions in this paper. First, WildFish is, to the best of our knowledge, the largest image data set for wild fish recognition. It consists of 1000 fish categories with 54,459 unconstrained images, allowing the training of high-capacity models for automatic fish classification. Second, we propose a novel open-set fish classification task for realistic scenarios, and investigate the open-set deep learning framework with a number of practical designs. Third, we propose a novel fine-grained recognition task, with the guidance of pairwise textual descriptions. By leveraging the comparison knowledge in the sentence, we design a multi-modal fish net to effectively distinguish two confused categories in a pair. Finally, we release WildFish (https://github.com/PeiqinZhuang/WildFish), in order to benefit more research studies in multimedia and beyond.

### PVNet: A Joint Convolutional Network of Point Cloud and Multi-View for 3D Shape Recognition

•      Haoxuan You
• Yifan Feng
• Rongrong Ji
• Yue Gao

3D object recognition has attracted wide research attention in the field of multimedia and computer vision. With the recent proliferation of deep learning, various deep models with different representations have achieved state-of-the-art performance. Among them, point cloud and multi-view based 3D shape representations have recently shown promise, and their corresponding deep models have shown significant performance on 3D shape recognition. However, little effort has been devoted to combining point cloud data and multi-view data for 3D shape representation, although, in our view, the two are beneficial and complementary to each other. In this paper, we propose the Point-View Network (PVNet), the first framework integrating both the point cloud and the multi-view data towards joint 3D shape recognition. More specifically, an embedding attention fusion scheme is proposed that can employ high-level features from the multi-view data to model the intrinsic correlation and discriminability of different structure features from the point cloud data. In particular, the discriminative descriptions are quantified and leveraged as a soft attention mask to further refine the structure feature of the 3D shape. We have evaluated the proposed method on the ModelNet40 dataset for 3D shape classification and retrieval tasks. Experimental results and comparisons with state-of-the-art methods demonstrate that our framework can achieve superior performance.

## SESSION: Multimedia-2 (Social & Emotional Multimedia)

• Rongrong Ji

### EmotionGAN: Unsupervised Domain Adaptation for Learning Discrete Probability Distributions of Image Emotions

•      Sicheng Zhao
• Xin Zhao
• Guiguang Ding
• Kurt Keutzer

Deep neural networks have performed well on various benchmark vision tasks with large-scale labeled training data; however, such training data is expensive and time-consuming to obtain. Due to domain shift or dataset bias, directly transferring models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain often results in poor performance. In this paper, we consider the domain adaptation problem in image emotion recognition. Specifically, we study how to adapt the discrete probability distributions of image emotions from a source domain to a target domain in an unsupervised manner. We develop a novel adversarial model for emotion distribution learning, termed EmotionGAN, which alternately optimizes the Generative Adversarial Network (GAN) loss, semantic consistency loss, and regression loss. The EmotionGAN model can adapt source domain images such that they appear as if they were drawn from the target domain, while preserving the annotation information. Extensive experiments are conducted on the FlickrLDL and TwitterLDL datasets, and the results demonstrate the superiority of the proposed method as compared to state-of-the-art approaches.

### USAR: An Interactive User-specific Aesthetic Ranking Framework for Images

•      Pei Lv
• Meng Wang
• Yongbo Xu
• Ze Peng
• Junyi Sun
• Shimei Su
• Bing Zhou
• Mingliang Xu

When assessing whether an image is of high or low quality, it is indispensable to take personal preference into account. Existing aesthetic models lay emphasis on hand-crafted features or deep features commonly shared by high quality images, but with limited or no consideration for personal preference and user interaction. To that end, we propose a novel and user-friendly aesthetic ranking framework via a powerful deep neural network and a small amount of user interaction, which can automatically estimate and rank the aesthetic characteristics of images in accordance with users' preference. Our framework takes as input a series of photos that users prefer, and produces as output a reliable, user-specific aesthetic ranking model matching users' preference. Considering the subjectivity of personal preference and the uncertainty of a user's single selection, a unique and exclusive dataset is constructed interactively to describe the preference of one individual by retrieving the images most similar to those specified by the user. Based on this unique user-specific dataset and sufficient well-designed aesthetic attributes, a customized aesthetic distribution model can be learned, which combines both personalized preference and aesthetic rules. We conduct extensive experiments and user studies on two large-scale public datasets, and demonstrate that our framework outperforms prior work based on conventional aesthetic assessment or ranking models.

### Deep Multimodal Image-Repurposing Detection

•      Ekraam Sabir
• Wael AbdAlmageed
• Yue Wu
• Prem Natarajan

Nefarious actors on social media and other platforms often spread rumors and falsehoods through images whose metadata (e.g., captions) have been modified to provide visual substantiation of the rumor/falsehood. This type of modification is referred to as image repurposing, in which an unmanipulated image is often published along with incorrect or manipulated metadata to serve the actor's ulterior motives. We present the Multimodal Entity Image Repurposing (MEIR) dataset, a substantially more challenging dataset than those previously available to support research into image repurposing detection. The new dataset includes location, person, and organization manipulations on real-world data sourced from Flickr. We also present a novel, end-to-end, deep multimodal learning model for assessing the integrity of an image by combining information extracted from the image with related information from a knowledge base. The proposed method is compared against state-of-the-art techniques on existing datasets as well as MEIR, where it outperforms existing methods across the board, with AUC improvement up to 0.23.

### Facial Expression Recognition Enhanced by Thermal Images through Adversarial Learning

•      Bowen Pan
• Shangfei Wang

Currently, fusing visible and thermal images for facial expression recognition requires two modalities during both training and testing. Visible cameras are commonly used in real-life applications, and thermal cameras are typically only available in lab situations due to their high price. Thermal imaging for facial expression recognition is not frequently used in real-world situations. To address this, we propose a novel thermally enhanced facial expression recognition method which uses thermal images as privileged information to construct better visible feature representation and improved classifiers by incorporating adversarial learning and similarity constraints during training. Specifically, we train two deep neural networks from visible images and thermal images. We impose adversarial loss to enforce statistical similarity between the learned representations of two modalities, and a similarity constraint to regulate the mapping functions from visible and thermal representation to expressions. Thus, thermal images are leveraged to simultaneously improve visible feature representation and classification during training. To mimic real-world scenarios, only visible images are available during testing. We further extend the proposed expression recognition method for partially unpaired data to explore thermal images' supplementary role in visible facial expression recognition when visible images and thermal images are not synchronously recorded. Experimental results on the MAHNOB Laughter database demonstrate that our proposed method can effectively regularize visible representation and expression classifiers with the help of thermal images, achieving state-of-the-art recognition performance.

## SESSION: Panel-1

### Session details: Panel-1

•      Jun Yu
• Jitao Sang

### Deep Learning for Multimedia: Science or Technology?

•      Jitao Sang
• Jun Yu
• Ramesh Jain
• Rainer Lienhart
• Peng Cui
• Jiashi Feng

Deep learning has been successfully explored in addressing different multimedia topics in recent years, ranging from object detection, semantic classification, and entity annotation to multimedia captioning, multimedia question answering, and storytelling. Open source libraries and platforms such as TensorFlow, Caffe, and MXNet significantly help promote the wide deployment of deep learning in solving real-world applications. On one hand, deep learning practitioners, without necessarily understanding the math involved, are able to set up and make use of a complex deep network. One recent deep learning tool based on Keras even provides a graphical interface to enable straightforward 'drag and drop' operation for deep learning programming. On the other hand, however, some general theoretical problems of learning, such as interpretation and generalization, have seen only limited progress. Most deep learning papers published these days follow the pipeline of designing/modifying network structures - tuning parameters - reporting performance improvement in specific applications. We have even seen many deep learning application papers without one single equation. Theoretical interpretation and the science behind the study are largely ignored. While excited about the successful application of deep learning in classical and novel problems, we multimedia researchers have a responsibility to think about and solve the fundamental topics in deep learning science. Prof. Guanrong Chen recently wrote an editorial note titled 'Science and Technology, not SciTech' [1]. This panel falls into a similar discussion and aims to invite prestigious multimedia researchers and active deep learning practitioners to discuss the positioning of deep learning research now and in the future.
Specifically, each panelist is asked to present their opinions on the following five questions: 1) What do you think of the current phenomenon that deep learning applications are growing explosively, while progress on the general theoretical problems remains slow? 2) Do you agree that deployment of deep learning techniques is getting easy (with a low barrier), while deep learning research is difficult (with a high barrier)? 3) What do you think are the core problems for deep learning techniques? 4) What do you think are the core problems for deep learning science? 5) What's your suggestion on the multimedia research in the post-deep learning era?

## SESSION: Open Source Software Competition

### Session details: Open Source Software Competition

•      Min-Chun Hu

### VIVID: Virtual Environment for Visual Deep Learning

•      Kuan-Ting Lai
• Chia-Chih Lin
• Chun-Yao Kang
• Mei-Enn Liao
• Ming-Syan Chen

Due to the advances in deep reinforcement learning and the demand for large training data, virtual-to-real learning has recently gained a lot of attention from the computer vision community. As state-of-the-art 3D engines can generate photo-realistic images suitable for training deep neural networks, researchers have gradually applied 3D virtual environments to learn different tasks including autonomous driving, collision avoidance, and image segmentation, to name a few. Although there are already many open-source simulation environments readily available, most of them either provide small scenes or have limited interactions with objects in the environment. To facilitate visual recognition learning, we present a new Virtual Environment for Visual Deep Learning (VIVID), which offers large-scale diversified indoor and outdoor scenes. Moreover, VIVID leverages the advanced human skeleton system, which enables us to simulate numerous complex human actions. VIVID has a wide range of applications and can be used for learning indoor navigation, action recognition, event detection, etc. We also release several deep learning examples in Python to demonstrate the capabilities and advantages of our system.

### A General-purpose Distributed Programming System using Data-parallel Streams

•      Tsung-Wei Huang
• Chun-Xun Lin
• Guannan Guo
• Martin D. F. Wong

In this paper we present DtCraft, a distributed execution engine that enables a new powerful programming model to streamline cluster computing. Applications are described in a set of data-parallel streams, leaving difficult execution details and concurrency controls handled by our system kernel transparently. Compared with existing systems, DtCraft is unique in (1) an efficient stream-oriented programming paradigm using modern C++17, (2) an in-context resource controller and task executor based on Linux container technology, and (3) ease of development from prototyping machines to production cloud environments. These capabilities power industry applications and create new research directions in machine learning, stream processing, and distributed multimedia systems.

### cilantro: A Lean, Versatile, and Efficient Library for Point Cloud Data Processing

•      Konstantinos Zampogiannis
• Cornelia Fermuller
• Yiannis Aloimonos

We introduce Cilantro, an open-source C++ library for geometric and general-purpose point cloud data processing. The library provides functionality that covers low-level point cloud operations, spatial reasoning, various methods for point cloud segmentation and generic data clustering, flexible algorithms for robust or local geometric alignment, model fitting, as well as powerful visualization tools. To accommodate all kinds of workflows, Cilantro is almost fully templated, and most of its generic algorithms operate in arbitrary data dimension. At the same time, the library is easy to use and highly expressive, promoting a clean and concise coding style. Cilantro is highly optimized, has a minimal set of external dependencies, and supports rapid development of performant point cloud processing software in a wide variety of contexts.

### Web-Based Configurable Image Annotations

•      Matthieu Pizenberg
• Axel Carlier
• Emmanuel Faure
• Vincent Charvillat

We introduce a new application for annotating images, with the purpose of constituting training datasets for machine learning algorithms. Our open-source software is meant to be easily used and deployed, configured to meet the annotation needs of any use case, and embeddable in crowdsourcing campaigns using the Amazon Mechanical Turk service.

## SESSION: Vision-3 (Applications in Multimedia)

### Session details: Vision-3 (Applications in Multimedia)

•      Zheng-Jun Zha

### Only Learn One Sample: Fine-Grained Visual Categorization with One Sample Training

•      Xiangteng He
• Yuxin Peng

The progress of fine-grained visual categorization (FGVC) benefits from the application of deep neural networks, especially convolutional neural networks (CNNs), which heavily rely on large amounts of labeled data for training. However, it is hard to obtain accurate labels for similar fine-grained subcategories because labeling requires professional knowledge and is labor-intensive and time-consuming. Therefore, it is appealing and significant to recognize these similar fine-grained subcategories with a few labeled samples or even only one for training, which is a highly challenging task. In this paper, we propose OLOS (Only Learn One Sample), a new data augmentation approach for fine-grained visual categorization with only one training sample, and its main novelties are: (1) A 4-stage data augmentation approach is proposed to increase both the volume and variety of the one training image, which provides more visual information with multiple views and scales. It consists of a 2-stage data generation and a 2-stage data selection. (2) The 2-stage data generation approach is proposed to produce image patches relevant to the object and its parts for the one training image, as well as produce new images conditioned on the textual descriptions of the training image. (3) The 2-stage data selection approach is proposed to screen the generated images so that useful information is retained and noisy information is eliminated. Experimental results and analyses on a fine-grained visual categorization benchmark demonstrate that our proposed OLOS approach can be applied on top of existing methods, and improves their categorization performance.

### LA-Net: Layout-Aware Dense Network for Monocular Depth Estimation

•      Kecheng Zheng
• Zheng-Jun Zha
• Yang Cao
• Xuejin Chen
• Feng Wu

Depth estimation from monocular images is an ill-posed and inherently ambiguous problem. Recently, deep learning techniques have been applied to monocular depth estimation in search of data-driven solutions. However, most existing methods focus on minimizing the average depth regression error at the pixel level and neglect to encode the global layout of the scene, resulting in layout-inconsistent depth maps. This paper proposes a novel Layout-Aware Convolutional Neural Network (LA-Net) for accurate monocular depth estimation by simultaneously perceiving scene layout and local depth details. Specifically, a Spatial Layout Network (SL-Net) is proposed to learn a layout map representing the depth ordering between local patches. A Layout-Aware Depth Estimation Network (LDE-Net) is proposed to estimate pixel-level depth details using multi-scale layout maps as structural guidance, leading to layout-consistent depth maps. A dense network module is used as the base network to learn effective visual details through dense feed-forward connections. Moreover, we formulate an order-sensitive softmax loss to better constrain the ill-posed depth inference problem. Extensive experiments on both indoor scene (NYUD-v2) and outdoor scene (Make3D) datasets have demonstrated that the proposed LA-Net outperforms the state-of-the-art methods and leads to faithful 3D projections.
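
To make the notion of "depth ordering between local patches" concrete, the sketch below averages a toy depth map over non-overlapping patches and derives a pairwise ordinal relation between them. The patch size, the tolerance `tol`, and the three-way labels (-1, 0, +1) are assumptions chosen for illustration and are not the layout map definition used by LA-Net.

```python
def patch_means(depth, patch):
    """Average depth over non-overlapping patch x patch blocks, row-major."""
    rows, cols = len(depth), len(depth[0])
    means = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            block = [depth[r + i][c + j] for i in range(patch) for j in range(patch)]
            means.append(sum(block) / len(block))
    return means


def ordering_matrix(means, tol=0.1):
    """Pairwise ordinal relation between patches.

    Entry [i][j] is +1 if patch i is farther than patch j, -1 if it is
    closer, and 0 if the two mean depths differ by at most tol.
    """
    n = len(means)
    order = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            diff = means[i] - means[j]
            if diff > tol:
                order[i][j] = 1
            elif diff < -tol:
                order[i][j] = -1
    return order


# Toy 4x4 depth map split into four 2x2 patches.
depth_map = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [2.0, 2.0, 3.05, 3.05],
    [2.0, 2.0, 3.05, 3.05],
]
means = patch_means(depth_map, patch=2)
layout = ordering_matrix(means)
```

An ordinal target of this kind depends only on which patch is nearer, not on absolute depth values, which is why ordering-based supervision is often considered for scale-ambiguous monocular settings.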

### Robustness and Discrimination Oriented Hashing Combining Texture and Invariant Vector Distance

•      Ziqing Huang
• Shiguang Liu

Image hashing is a novel technology of multimedia processing with wide applications. Robustness and discrimination are two of the most important objectives of image hashing. Existing hashing methods do not strike a good balance between robustness and discrimination, which largely restricts their application in image retrieval and copy detection, i.e., it seriously reduces the retrieval accuracy for similar images. In contrast, we propose a new hashing method which preserves two kinds of complementary features (a global feature via texture and a local feature via DCT coefficients) to achieve a good balance between robustness and discrimination. Specifically, the statistical characteristics of the gray-level co-occurrence matrix (GLCM) are extracted to reveal the texture changes of an image, which greatly benefits perceptual robustness. Then, the normalized image is divided into image blocks, and the dominant DCT coefficients in the first row/column are selected to form a feature matrix. The Euclidean distance between vectors of the feature matrix is invariant to commonly-used digital operations, which helps make the hash more compact. Various experiments show that our approach achieves a better balance between robustness and discrimination than the state-of-the-art algorithms.
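
For readers unfamiliar with the gray-level co-occurrence matrix, the following minimal sketch builds a normalized GLCM for a single pixel offset and computes two standard texture statistics from it. The abstract does not say which statistics or offsets the authors use, so the choice of contrast and energy, the horizontal offset, and the toy 4-level image are assumptions.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for the offset (dx, dy).

    image: 2D list of integer gray levels in range(levels).
    Entry [i][j] is the fraction of pixel pairs whose first pixel has
    level i and whose offset neighbor has level j.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Sum of (i - j)^2 * p[i][j]; larger for images with sharp level changes."""
    n = len(p)
    return sum(((i - j) ** 2) * p[i][j] for i in range(n) for j in range(n))


def energy(p):
    """Sum of squared entries; larger for uniform or repetitive textures."""
    return sum(v * v for row in p for v in row)


# Toy image with four gray levels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, levels=4)
features = (contrast(p), energy(p))
```

In a hashing pipeline, statistics like these would typically be computed for several offsets and then quantized into hash bits; that step is omitted here because the abstract does not describe it.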

### Joint Global and Co-Attentive Representation Learning for Image-Sentence Retrieval

•      Shuhui Wang
• Yangyu Chen
• Junbao Zhuo
• Qingming Huang
• Qi Tian

In the image-sentence retrieval task, correlated images and sentences involve different levels of semantic relevance. However, existing multi-modal representation learning paradigms fail to capture the meaningful component relations at the word and phrase level, while the attention-based methods still suffer from component-level mismatching and a huge computational burden. We propose a Joint Global and Co-Attentive Representation learning method (JGCAR) for image-sentence retrieval. We formulate a global representation learning task which utilizes both intra-modal and inter-modal relative similarity to optimize the semantic consistency of the visual/textual component representations. We further develop a co-attention learning procedure to fully exploit different levels of visual-linguistic relations. We design a novel softmax-like bi-directional ranking loss to learn the co-attentive representation for image-sentence similarity computation. It is capable of discovering the correlative components and rectifying inappropriate component-level correlation to produce more accurate sentence-level ranking results. By joint global and co-attentive representation learning, the latter benefits from the former by producing more semantically consistent component representation, and the former also benefits from the latter by back-propagating the contextual information. Image-sentence retrieval is performed as a two-step process in the testing stage, inheriting advantages in both effectiveness and efficiency. Experiments show that JGCAR outperforms existing methods on MSCOCO and Flickr30K image-sentence retrieval tasks.

## SESSION: Multimodal-2 (Cross-Modal Translation)

### Session details: Multimodal-2 (Cross-Modal Translation)

•      Xian-Sheng Hua

### Text-to-image Synthesis via Symmetrical Distillation Networks

•      Mingkuan Yuan
• Yuxin Peng

Text-to-image synthesis aims to automatically generate images according to text descriptions given by users, which is a highly challenging task. The main issues of text-to-image synthesis lie in two gaps: the heterogeneous and homogeneous gaps. The heterogeneous gap is between the high-level concepts of text descriptions and the pixel-level contents of images, while the homogeneous gap exists between synthetic image distributions and real image distributions. For addressing these problems, we exploit the excellent capability of generic discriminative models (e.g. VGG19), which can guide the training process of a new generative model on multiple levels to bridge the two gaps. The high-level representations can teach the generative model to extract necessary visual information from text descriptions, which can bridge the heterogeneous gap. The mid-level and low-level representations can lead it to learn structures and details of images respectively, which relieves the homogeneous gap. Therefore, we propose Symmetrical Distillation Networks (SDN) composed of a source discriminative model as "teacher" and a target generative model as "student". The target generative model has a symmetrical structure with the source discriminative model, in order to transfer hierarchical knowledge accessibly. Moreover, we decompose the training process into two stages with different distillation paradigms for promoting the performance of the target generative model. Experiments on two widely-used datasets are conducted to verify the effectiveness of our proposed SDN.

### Context-Aware Visual Policy Network for Sequence-Level Image Captioning

•      Daqing Liu
• Zheng-Jun Zha
• Hanwang Zhang
• Yongdong Zhang
• Feng Wu

Many vision-language tasks can be reduced to the problem of sequence prediction for natural language output. In particular, recent advances in image captioning use deep reinforcement learning (RL) to alleviate the "exposure bias" during training: the ground-truth subsequence is exposed at every prediction step, which introduces bias at test time when only the predicted subsequence is seen. However, existing RL-based image captioning methods focus only on the language policy and not on the visual policy (e.g., visual attention), and thus fail to capture the visual context that is crucial for compositional reasoning such as visual relationships (e.g., "man riding horse") and comparisons (e.g., "smaller cat"). To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning. At every time step, CAVP explicitly accounts for the previous visual attentions as the context, and then decides whether the context is helpful for the current word generation given the current visual attention. Compared against traditional visual attention that only fixes a single image region at every step, CAVP can attend to complex visual compositions over time. The whole image captioning model, comprising CAVP and its subsequent language policy network, can be efficiently optimized end-to-end by using an actor-critic policy gradient method with respect to any caption evaluation metric. We demonstrate the effectiveness of CAVP by state-of-the-art performances on MS-COCO offline split and online server, using various metrics and sensible visualizations of qualitative visual context. The code is available at https://github.com/daqingliu/CAVP

### SibNet: Sibling Convolutional Encoder for Video Captioning

•      Sheng Liu
• Zhou Ren
• Junsong Yuan

Video captioning is a challenging task owing to the complexity of understanding the copious visual information in videos and describing it using natural language. Different from previous work that encodes video information using a single flow, in this work, we introduce a novel Sibling Convolutional Encoder (SibNet) for video captioning, which utilizes a two-branch architecture to collaboratively encode videos. The first branch, the content branch, encodes the visual content information of the video via an autoencoder, and the second branch, the semantic branch, encodes the semantic information by visual-semantic joint embedding. Then both branches are effectively combined with a soft-attention mechanism and finally fed into an RNN decoder to generate captions. With our SibNet explicitly capturing both content and semantic information, the proposed method can better represent the rich information in videos. Extensive experiments on YouTube2Text and MSR-VTT datasets validate that the proposed architecture outperforms existing methods by a large margin across different evaluation metrics.

### Paragraph Generation Network with Visual Relationship Detection

•      Wenbin Che
• Xiaopeng Fan
• Ruiqin Xiong
• Debin Zhao

Paragraph generation of images is a new concept, aiming to produce multiple sentences to describe a given image. In this paper, we propose a paragraph generation network that introduces visual relationship detection. We first detect regions which may contain important visual objects and then predict their relationships. Paragraphs are produced based on object regions which have valid relationships with others. Compared with previous works which generate sentences based on region features, we explicitly explore and utilize visual relationships in order to improve final captions. The experimental results show that such a strategy could improve paragraph generation performance in two aspects: more details about object relations are detected and more accurate sentences are obtained. Furthermore, our model is more robust to region detection fluctuation.

## SESSION: Panel-2

### Session details: Panel-2

•      Jiaying Liu
• Wen-Huang Cheng

### AI + Multimedia Make Better Life?

•      Wen-Huang Cheng
• Jiaying Liu
• Mohan S. Kankanhalli
• Abdulmotaleb El Saddik
• Benoit Huet

## SESSION: FF-5

•      Zhu Li

### Online Inter-Camera Trajectory Association Exploiting Person Re-Identification and Camera Topology

•      Na Jiang
• SiChen Bai
• Yue Xu
• Chang Xing
• Zhong Zhou
• Wei Wu

Online inter-camera trajectory association is a promising topic in intelligent video surveillance, which concentrates on associating trajectories belonging to the same individual across different cameras according to time. It remains challenging due to the inconsistent appearance of a person in different cameras and the lack of spatio-temporal constraints between cameras. Besides, orientation variations and partial occlusions significantly increase the difficulty of inter-camera trajectory association. To solve these problems, this work proposes an orientation-driven person re-identification (ODPR) and an effective camera topology estimation based on appearance features for online inter-camera trajectory association. ODPR explicitly leverages the orientation cues and stable torso features to learn discriminative feature representations for identifying trajectories across cameras, which alleviates the pedestrian orientation variations by the designed orientation-driven loss function and orientation aware weights. The effective camera topology estimation introduces appearance features to generate the correct spatio-temporal constraints for narrowing the retrieval range, which improves the time efficiency and provides the possibility for intelligent inter-camera trajectory association in large-scale surveillance environments. Extensive experimental results demonstrate that our proposed approach significantly outperforms most state-of-the-art methods on the popular person re-identification datasets and the public multi-target, multi-camera tracking benchmark.

### Learning Local Descriptors with Adversarial Enhancer from Volumetric Geometry Patches

•      Jing Zhu
• Yi Fang

Local matching problems (e.g. keypoint matching, geometry registration) are significant but challenging tasks in the computer vision field. In this paper, we propose to learn a robust local 3D descriptor from volumetric point patches to tackle the local matching tasks. Intuitively, given two inputs, it would be easy for a network to map the inputs to a space with similar characteristics (e.g. similar outputs for similar inputs, far different outputs for far different inputs), but the difficult case for a network would be to map the inputs into a space with opposite characteristics (e.g. far different outputs for very similar inputs but very similar outputs for far different inputs). Inspired by this intuition, in our proposed method, we design a siamese-network-based local descriptor generator to learn a local descriptor with small distances between match pairs and large distances between non-match pairs. Specifically, an adversarial enhancer is introduced to map the outputs of the local descriptor generator into an opposite space in which match pairs have the maximum differences and non-match pairs have the minimum differences. The local descriptor generator and the adversarial enhancer are trained in an adversarial manner. By competing with the adversarial enhancer, the local descriptor generator learns to generate a much stronger descriptor for given volumetric point patches. The experiments conducted on real-world scan datasets, including 7-scenes and SUN3D, and the synthetic scan augmented ICL-NUIM dataset show that our method can achieve superior performance over other state-of-the-art approaches on both keypoint matching and geometry registration, such as fragment alignment and scene reconstruction.

### Context-Dependent Diffusion Network for Visual Relationship Detection

•      Zhen Cui
• Chunyan Xu
• Wenming Zheng
• Jian Yang

Visual relationship detection can bridge the gap between computer vision and natural language for scene understanding of images. Different from pure object recognition tasks, the relation triplets of subject-predicate-object lie in an extremely diverse space, such as person-behind-person and car-behind-building, while suffering from the problem of combinatorial explosion. In this paper, we propose a context-dependent diffusion network (CDDN) framework to deal with visual relationship detection. To capture the interactions of different object instances, two types of graphs, word semantic graph and visual scene graph, are constructed to encode global context interdependency. The semantic graph is built through language priors to model semantic correlations across objects, whilst the visual scene graph defines the connections of scene objects so as to utilize the surrounding scene information. For the graph-structured data, we design a diffusion network to adaptively aggregate information from contexts, which can effectively learn latent representations of visual relationships and well cater to visual relationship detection in view of its isomorphic invariance to graphs. Experiments on two widely-used datasets demonstrate that our proposed method is more effective and achieves the state-of-the-art performance.

### Connectionist Temporal Fusion for Sign Language Translation

•      Shuo Wang
• Dan Guo
• Wen-gang Zhou
• Zheng-Jun Zha
• Meng Wang

Continuous sign language translation (CSLT) is a weakly supervised problem aiming at translating vision-based videos into natural languages under complicated sign linguistics, where the ordered words in a sentence label have no exact boundary of each sign action in the video. This paper proposes a hybrid deep architecture which consists of a temporal convolution module (TCOV), a bidirectional gated recurrent unit module (BGRU), and a fusion layer module (FL) to address the CSLT problem. TCOV captures short-term temporal transition on adjacent clip features (local pattern), while BGRU keeps the long-term context transition across the temporal dimension (global pattern). FL concatenates the feature embeddings of TCOV and BGRU to learn their complementary relationship (mutual pattern). Thus we propose a joint connectionist temporal fusion (CTF) mechanism to utilize the merit of each module. The proposed joint CTC loss optimization and deep classification score-based decoding fusion strategy are designed to boost performance. With only a single round of training, our model under the CTC constraints achieves performance comparable to existing methods that require multiple EM iterations. Experiments on a benchmark, i.e. the RWTH-PHOENIX-Weather dataset, demonstrate the effectiveness of our proposed method.
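
Because the abstract relies on CTC to handle the missing boundaries between sign actions, a small sketch of standard best-path (greedy) CTC decoding may help: take the highest-scoring label per frame, merge consecutive repeats, then drop blanks. This is the generic textbook procedure, not the score-based decoding fusion proposed in the paper; the blank index and the toy scores are assumptions.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Best-path CTC decoding.

    frame_scores: list of per-frame score lists, one score per label.
    blank: index of the CTC blank label (assumed to be 0 here).
    Returns the label sequence after merging repeats and removing blanks.
    """
    # Pick the highest-scoring label independently for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    decoded = []
    previous = None
    for label in best_path:
        # Emit a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Seven frames over a toy vocabulary {0: blank, 1: word A, 2: word B}.
scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
    [0.8, 0.1, 0.1],
]
words = ctc_greedy_decode(scores)
```

The blank frame between the two runs of label 1 is what allows the same word to be emitted twice in a row; without it the repeats would be merged into one.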

### Support Neighbor Loss for Person Re-Identification

•      Kai Li
• Zhengming Ding
• Kunpeng Li
• Yulun Zhang
• Yun Fu

Person re-identification (re-ID) has recently been tremendously boosted due to the advancement of deep convolutional neural networks (CNN). The majority of deep re-ID methods focus on designing new CNN architectures, while less attention is paid to investigating the loss functions. Verification loss and identification loss are two types of losses widely used to train various deep re-ID models, both of which however have limitations. Verification loss guides the networks to generate feature embeddings in which the intra-class variance is decreased while the inter-class variance is enlarged. However, training networks with verification loss tends to converge slowly and perform unstably when the number of training samples is large. On the other hand, identification loss has good separating and scalable properties, but its failure to explicitly reduce the intra-class variance limits its performance on re-ID, because the same person may have significant appearance disparity across different camera views. To avoid the limitations of the two types of losses, we propose a new loss, called support neighbor (SN) loss. Rather than being derived from data sample pairs or triplets, SN loss is calculated based on the positive and negative support neighbor sets of each anchor sample, which contain more valuable contextual information and neighborhood structure that are beneficial for more stable performance. To ensure scalability and separability, a softmax-like function is formulated to push apart the positive and negative support sets. To reduce intra-class variance, the distance between the anchor's nearest positive neighbor and furthest positive sample is penalized. Integrating SN loss on top of ResNet50, superior re-ID results to the state-of-the-art ones are obtained on several widely used datasets.
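
The two ingredients described here, a softmax-like term separating positive from negative neighbors and a penalty on the spread of positive distances, can be sketched as follows. This is only one possible reading of the abstract: the exact SN loss, its neighbor selection, and its weighting are not given in the text, so the function name, the `exp(-distance)` weighting, and the `weight` parameter are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def support_neighbor_style_loss(anchor, positives, negatives, weight=1.0):
    """Illustrative loss with a separation term and a compactness term.

    separation: negative log of the share of exp(-distance) mass that
        falls on positive neighbors (a softmax-like ratio).
    compactness: gap between the furthest and the nearest positive.
    """
    pos_d = [euclidean(anchor, p) for p in positives]
    neg_d = [euclidean(anchor, n) for n in negatives]
    pos_mass = sum(math.exp(-d) for d in pos_d)
    neg_mass = sum(math.exp(-d) for d in neg_d)
    separation = -math.log(pos_mass / (pos_mass + neg_mass))
    compactness = max(pos_d) - min(pos_d)
    return separation + weight * compactness


anchor = (0.0, 0.0)
spread_positives = [(0.0, 1.0), (0.0, 2.0)]
tight_positives = [(0.0, 1.0), (1.0, 0.0)]

# Far negatives should be penalized less than near negatives.
loss_far = support_neighbor_style_loss(anchor, spread_positives, [(5.0, 0.0)])
loss_near = support_neighbor_style_loss(anchor, spread_positives, [(1.0, 0.0)])
# Equidistant positives make the compactness term vanish.
loss_tight = support_neighbor_style_loss(anchor, tight_positives, [(5.0, 0.0)])
```

In a real training setup the distances would be computed on learned embeddings within a mini-batch and the loss differentiated by an autograd framework; the plain-Python version only shows how the two terms respond to the geometry.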

### Perceptual Temporal Incoherence Aware Stereo Video Retargeting

•      Bing Li
• Chia-Wen Lin
• Shan Liu
• Tiejun Huang
• Wen Gao
• C.-C. Jay Kuo

Stereo video retargeting aims to avoid shape and depth distortions and to maintain temporal coherence of shape and depth when resizing a stereo video to a desired size. Existing methods resort to extending stereo image retargeting schemes to stereo video retargeting by imposing temporal constraints to consistently resize all corresponding regions so as to maintain temporal coherence. However, such a direct extension often incurs conflicts among the requirements for preserving shape information and depth information and maintaining their temporal coherence, thereby failing to meet one or more of these requirements. We find that properly relaxing temporal constraints for non-paired regions at frame boundaries can effectively mitigate conflicts among depth, shape, and temporal constraints without severely degrading temporal coherence perceptually. Based on this new finding, we derive effective temporal constraints to improve the viewing experience of a 3D scene for stereo video retargeting. Accordingly, we propose an efficient grid-based implementation for our method. Experimental results show that our method achieves superior visual quality over existing methods.

### A Large-scale RGB-D Database for Arbitrary-view Human Action Recognition

•      Yanli Ji
• Feixiang Xu
• Yang Yang
• Fumin Shen
• Heng Tao Shen
• Wei-Shi Zheng

Current research mainly focuses on single-view and multiview human action recognition, which can hardly satisfy the requirements of human-robot interaction (HRI) applications to recognize actions from arbitrary views. The lack of databases also sets up barriers. In this paper, we collect a new large-scale RGB-D action database for arbitrary-view action analysis, including RGB videos, depth and skeleton sequences. The database includes action samples captured from 8 fixed viewpoints and varying-view sequences which cover the entire 360° range of view angles. In total, 118 persons were invited to perform 40 action categories, and 25,600 video samples were collected. Our database involves more participants, more viewpoints and a large number of samples. More importantly, it is the first database containing the entire 360° varying-view sequences. The database provides sufficient data for cross-view and arbitrary-view action analysis. In addition, we propose a View-guided Skeleton CNN (VS-CNN) to tackle the problem of arbitrary-view action recognition. Experimental results show that the VS-CNN achieves superior performance.

### Spotting and Aggregating Salient Regions for Video Captioning

•      Huiyun Wang
• Youjiang Xu
• Yahong Han

Towards an interpretable video captioning process, we aim to locate salient regions of video objects along with the sequentially uttered words. This paper proposes a new framework to automatically spot salient regions in each video frame and simultaneously learn a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn the saliency value of each location to separate salient regions from video content as the foreground and the rest as background by two operations of 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, to aggregate the foreground/background descriptors into a discriminative spatio-temporal representation, we devise a trainable video VLAD process to learn the aggregation parameters. Finally, we utilize the attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most of the state-of-the-art methods in terms of BLEU@4, METEOR and CIDEr metrics for the task of video captioning. Examples also demonstrate that our method can successfully utter words for sequentially salient regions of video objects.

### Adaptive Temporal Encoding Network for Video Instance-level Human Parsing

•      Qixian Zhou
• Xiaodan Liang
• Ke Gong
• Liang Lin

Beyond the existing single-person and multiple-person human parsing tasks in static images, this paper makes the first attempt to investigate a more realistic video instance-level human parsing that simultaneously segments out each person instance and parses each instance into more fine-grained parts (e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding Network (ATEN) that alternately performs temporal encoding among key frames and flow-guided feature propagation from other consecutive frames between two key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the instance-level parsing result for each key frame, which integrates both the global human parsing and instance-level human segmentation into a unified model. To balance accuracy and efficiency, the flow-guided feature propagation is used to directly parse consecutive frames according to their identified temporal consistency with key frames. On the other hand, ATEN leverages convolutional gated recurrent units (convGRU) to exploit temporal changes over a series of key frames, which are further used to facilitate the frame-level instance-level parsing. By alternately performing direct feature propagation between consistent frames and temporal encoding among key frames, our ATEN achieves a good balance between frame-level accuracy and time efficiency, which is a common and crucial problem in video object segmentation research. To demonstrate the superiority of our ATEN, extensive experiments are conducted on the most popular video segmentation benchmark (DAVIS) and a newly collected Video Instance-level Parsing (VIP) dataset, which is the first video instance-level human parsing dataset, comprising 404 sequences and over 20k frames with instance-level and pixel-wise annotations.

### User-Guided Deep Anime Line Art Colorization with Conditional Adversarial Networks

•      Yuanzheng Ci
• Xinzhu Ma
• Zhihui Wang
• Haojie Li
• Zhongxuan Luo

Scribble colors based line art colorization is a challenging computer vision problem since neither greyscale values nor semantic information is presented in line arts, and the lack of authentic illustration-line art training pairs also increases difficulty of model generalization. Recently, several Generative Adversarial Nets (GANs) based methods have achieved great success. They can generate colorized illustrations conditioned on given line art and color hints. However, these methods fail to capture the authentic illustration distributions and are hence perceptually unsatisfying in the sense that they often lack accurate shading. To address these challenges, we propose a novel deep conditional adversarial architecture for scribble based anime line art colorization. Specifically, we integrate the conditional framework with WGAN-GP criteria as well as the perceptual loss to enable us to robustly train a deep network that makes the synthesized images more natural and real. We also introduce a local features network that is independent of synthetic data. With GANs conditioned on features from such network, we notably increase the generalization capability over "in the wild" line arts. Furthermore, we collect two datasets that provide high-quality colorful illustrations and authentic line arts for training and benchmarking. With the proposed model trained on our illustration dataset, we demonstrate that images synthesized by the presented approach are considerably more realistic and precise than alternative approaches.

### BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs

•      Tianli Zhao
• Xiangyu He
• Jian Cheng
• Jing Hu

Convolutional Neural Networks (CNN) have been widely used in many multimedia applications such as image classification, speech recognition, and natural language processing. Nevertheless, the high performance of deep CNN models comes with high consumption of computation and storage resources, making it difficult to run CNN models in real-time applications on mobile devices, where computational ability, memory resources and power are largely constrained. The binary network is a recently proposed technique to reduce the computational and memory complexity of CNNs, in which the expensive floating-point operations can be replaced by much cheaper bit-wise operations. Despite its obvious advantages, only a few works have explored the efficient implementation of Binary Neural Networks (BNN). In this work, we present a general architecture for efficient binary convolution, referred to as BitStream, with the latest computation flow for BNNs instead of the traditional row-major im2col based one. We mainly optimize memory access during the computation of BNNs; the proposed calculation flow is cache friendly and can largely reduce the memory overhead of BNNs, leading to memory efficiency and, in turn, computational efficiency. Extensive evaluations on various networks demonstrate the efficiency of the proposed method. For instance, the memory consumption of BitStream is 18-32× lower than that of the original networks and 3× lower than that of existing implementations of BNNs during inference. Moreover, our implemented binary AlexNet achieves 8.07× and 2.84× speedups over the floating-point precision model and conventional implementations of BNNs, respectively, on 8 × Cortex A53 CPUs. With 4 × Intel CORE i7 6700 CPUs, the binary VGG-like convolutional network on CIFAR-10 runs even 1.69× faster than the floating-point precision version on a TitanX GPU.
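
To make the replacement of floating-point operations by bit-wise ones concrete, here is a small sketch of the well-known XOR/popcount identity for the dot product of two {-1, +1} vectors. It illustrates the general BNN arithmetic only, not the BitStream memory layout or computation flow; the helper names `pack_bits` and `binary_dot` are hypothetical.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into one integer, one bit per value (+1 -> 1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    Matching bits contribute +1 and mismatching bits contribute -1,
    so the result is n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_word ^ b_word).count("1")  # popcount of the XOR
    return n - 2 * mismatches


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
packed_result = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))  # ordinary multiply-accumulate
print(packed_result, reference)  # 2 2
```

On real hardware the same identity is applied to machine words, so a single XOR and popcount stands in for dozens of multiply-accumulate operations.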

### Attentive Crowd Flow Machines

•      Lingbo Liu
• Ruimao Zhang
• Jiefeng Peng
• Guanbin Li
• Bowen Du
• Liang Lin

Traffic flow prediction is crucial for urban traffic management and public safety. Its key challenges lie in how to adaptively integrate the various factors that affect the flow changes. In this paper, we propose a unified neural network module to address this problem, called Attentive Crowd Flow Machine (ACFM), which is able to infer the evolution of the crowd flow by learning dynamic representations of temporally-varying data with an attention mechanism. Specifically, the ACFM is composed of two progressive ConvLSTM units connected with a convolutional layer for spatial weight prediction. The first LSTM takes the sequential flow density representation as input and generates a hidden state at each time-step for attention map inference, while the second LSTM aims at learning the effective spatial-temporal feature expression from attentionally weighted crowd flow features. Based on the ACFM, we further build a deep architecture with the application to citywide crowd flow prediction, which naturally incorporates the sequential and periodic data as well as other external influences. Extensive experiments on two standard benchmarks (i.e., crowd flow in Beijing and New York City) show that the proposed method achieves significant improvements over the state-of-the-art methods.

### Video-based Person Re-identification via Self-Paced Learning and Deep Reinforcement Learning Framework

•      Deqiang Ouyang
• Jie Shao
• Yonghui Zhang
• Yang Yang
• Heng Tao Shen

Person re-identification is an important task in video surveillance, focusing on finding the same person across different cameras. However, most existing methods of video-based person re-identification still have some limitations (e.g., the lack of effective deep learning framework, the robustness of the model, and the same treatment for all video frames) which make them unable to achieve better recognition performance. In this paper, we propose a novel self-paced learning algorithm for video-based person re-identification, which could gradually learn from simple to complex samples for a mature and stable model. Self-paced learning is employed to enhance video-based person re-identification based on deep neural network, so that deep neural network and self-paced learning are unified into one frame. Then, based on the trained self-paced learning, we propose to employ deep reinforcement learning to discard misleading and confounding frames and find the most representative frames from video pairs. With the advantage of deep reinforcement learning, our method can learn strategies to select the optimal frame groups. Experiments show that the proposed framework outperforms the existing methods on the iLIDS-VID, PRID-2011 and MARS datasets.

### Interpretable Multimodal Retrieval for Fashion Products

•      Lizi Liao
• Xiangnan He
• Bo Zhao
• Chong-Wah Ngo
• Tat-Seng Chua

Deep learning methods have been successfully applied to fashion retrieval. However, the latent meaning of learned feature vectors hinders the explanation of retrieval results and the integration of user feedback. Fortunately, there are many online shopping websites organizing fashion items into hierarchical structures based on product taxonomy and domain knowledge. Such structures help to reveal how humans perceive the relatedness among fashion products. Nevertheless, incorporating structural knowledge for deep learning remains a challenging problem. This paper presents techniques for organizing and utilizing the fashion hierarchies in deep learning to facilitate the reasoning of search results and user intent. The novelty of our work originates from the development of an EI (Exclusive & Independent) tree that can cooperate with deep models for end-to-end multimodal learning. The EI tree organizes the fashion concepts into multiple semantic levels and augments the tree structure with exclusive as well as independent constraints. It describes the different relationships among sibling concepts and guides the end-to-end learning of multi-level fashion semantics. From the EI tree, we learn an explicit hierarchical similarity function to characterize the semantic similarities among fashion products. It facilitates an interpretable retrieval scheme that can integrate concept-level feedback. Experimental results on two large fashion datasets show that the proposed approach can characterize the semantic similarities among fashion items accurately and capture the user's search intent precisely, leading to more accurate search results as compared to the state-of-the-art methods.

### Generating Defensive Plays in Basketball Games

•      Chieh-Yu Chen
• Wenze Lai
• Hsin-Ying Hsieh
• Wen-Hao Zheng
• Yu-Shuen Wang
• Jung-Hong Chuang

In this paper, we present a method to generate realistic defensive plays in a basketball game based on the ball and the offensive team's movements. Our system allows players and coaches to simulate how the opposing team will react to a newly developed offensive strategy in order to evaluate its effectiveness. To achieve this aim, we train a conditional generative adversarial network on the NBA dataset that learns spatio-temporal interactions between players' movements. The network consists of two components: a generator that takes a latent noise vector and the offensive team's trajectories as input to generate the defensive team's trajectories; and a discriminator that evaluates how realistic the generated results are. Since a basketball game can be easily identified as fake if the ball handler, who is not defended, does not shoot the ball or cut into the restricted area, we add a wide open penalty to the objective function to assist model training. To evaluate the results, we compared the similarity of the real and the generated defensive plays in terms of the players' movement speed and acceleration, distance to defend ball handlers and non-ball handlers, and the frequency of wide open occurrences. In addition, we conducted a user study with 59 participants for subjective tests. Experimental results show the high fidelity of the generated defensive plays to real data and demonstrate the feasibility of our algorithm.

### Dense Auto-Encoder Hashing for Robust Cross-Modality Retrieval

•      Hong Liu
• Mingbao Lin
• Shengchuan Zhang
• Yongjian Wu
• Feiyue Huang
• Rongrong Ji

Cross-modality retrieval, which aims to search images in response to text queries or vice versa, has been widely studied. When faced with large-scale datasets, cross-modality hashing serves as an efficient and effective solution, which learns binary codes to approximate the cross-modality similarity in the Hamming space. Most recent cross-modality hashing schemes focus on learning the hash functions from data instances with full modalities. However, how to learn robust binary codes when facing incomplete modality (i.e., with one modality missing or partially observed) is left unexplored, even though this situation occurs widely in real-world applications. In this paper, we propose a novel cross-modality hashing method, termed Dense Auto-encoder Hashing (DAH), which can explicitly impute the missing modality and produce robust binary codes by leveraging the relatedness among different modalities. To that effect, we propose a novel Dense Auto-encoder Network (DAN) to impute the missing modalities, which densely connects each layer to every other layer in a feed-forward fashion. For each layer, a noisy auto-encoder block is designed to calculate the residue between the current prediction and original data. Finally, a hash layer is added to the end of DAN, which serves as a special binary encoder model to deal with the incomplete modality input. Quantitative experiments on three cross-modality visual search benchmarks, i.e., Wiki, NUS-WIDE, and FLICKR-25K, have shown that the proposed DAH has superior performance over the state-of-the-art approaches.

### Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis

•      Taoran Tang
• Jia Jia
• Hanyang Mao

Dance is greatly influenced by music. Studies on how to synthesize music-oriented dance choreography can promote research in many fields, such as dance teaching and human behavior research. Although considerable effort has been directed toward investigating the relationship between music and dance, the synthesis of appropriate dance choreography based on music remains an open problem. There are two main challenges: 1) how to choose appropriate dance figures, i.e., groups of steps that are named and specified in technical dance manuals, in accordance with music and 2) how to artistically enhance choreography in accordance with music. To solve these problems, in this paper, we propose a music-oriented dance choreography synthesis method using a long short-term memory (LSTM)-autoencoder model to extract a mapping between acoustic and motion features. Moreover, we improve our model with temporal indexes and a masking method to achieve better performance. Because of the lack of data available for model training, we constructed a music-dance dataset containing choreographies for four types of dance, totaling 907,200 frames of 3D dance motions and accompanying music, and extracted multidimensional features for model training. We employed this dataset to train and optimize the proposed models and conducted several qualitative and quantitative experiments to select the best-fitted model. Finally, our model proved to be effective and efficient in synthesizing valid choreographies that are also capable of musical expression.

### Musicality-Novelty Generative Adversarial Nets for Algorithmic Composition

•      Gong Chen
• Yan Liu
• Sheng-hua Zhong
• Xiang Zhang

Algorithmic composition, which enables computers to generate music like human composers, has lasting charm because it intends to approximate artistic creation, the most mysterious part of human intelligence. To deliver both melodious and refreshing music, this paper proposes the Musicality-Novelty Generative Adversarial Nets for algorithmic composition. With the same generator, two adversarial nets alternately optimize the musicality and novelty of the machine-composed music. A new model called the novelty game is presented to maximize the minimal distance between the machine-composed music sample and any human-composed music sample in the novelty space, where all well-known human-composed music products are far from each other. We implement the proposed framework using three supervised CNNs, one for the generator, one for the musicality critic and one for the novelty critic, on the time-pitch feature space. Specifically, the novelty critic is implemented by Siamese neural networks with temporal alignment using dynamic time warping. We provide empirical validations by generating music samples under various scenarios.
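
For readers unfamiliar with the temporal alignment step, the following is a minimal sketch of classic dynamic time warping on two one-dimensional sequences. It shows only the textbook recurrence; how the paper combines DTW with the Siamese networks is not described in the abstract, and the pitch sequences below are made-up illustrative values.

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best alignment cost of seq_a[:i] and seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            # extend the cheapest of: insertion, deletion, or match
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


melody = [60, 62, 64, 65]              # hypothetical pitch sequence
stretched = [60, 60, 62, 64, 64, 65]   # same melody with held notes
print(dtw_distance(melody, stretched))  # 0.0
```

Because held or repeated notes can be absorbed by the warping path, two renditions of the same melody at different tempi compare as identical, which a frame-by-frame distance would not capture.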

### Improving QoE of ABR Streaming Sessions through QUIC Retransmissions

•      Divyashri Bhat
• Rajvardhan Deshmukh
• Michael Zink

While adaptive bitrate (ABR) streaming has contributed significantly to the reduction of video playout stalling, ABR clients continue to suffer from the variation of bit rate qualities over the duration of a streaming session. Similar to stalling, these variations in bit rate quality have a negative impact on the users' Quality of Experience (QoE). In this paper, we use a trace from a large-scale CDN to show that such quality changes occur in a significant number of streaming sessions and investigate an ABR video segment retransmission approach to reduce the number of such quality changes. As the new HTTP/2 standard is becoming increasingly popular, we also see an increase in the usage of QUIC as an alternative protocol for the transmission of web traffic including video streaming. Using various network conditions, we conduct a systematic comparison of existing transport layer approaches for HTTP/2 to determine which is best suited for ABR segment retransmissions. Since it is well known that both protocols provide a series of improvements over HTTP/1.1, we perform experiments both in controlled environments and over transcontinental links in the Internet and find that these benefits also "trickle up" into the application layer when it comes to ABR video streaming, where QUIC retransmissions can significantly improve the average quality bitrate while simultaneously minimizing bit rate variations over the duration of a streaming session.

### From Data to Knowledge: Deep Learning Model Compression, Transmission and Communication

•      Ziqian Chen
• Shiqi Wang
• Dapeng Oliver Wu
• Tiejun Huang
• Ling-Yu Duan

With the advances of artificial intelligence, recent years have witnessed a gradual transition from big data to big knowledge. Based on knowledge-powered deep learning models, big data such as vast collections of text, images and videos can be efficiently analyzed. As such, in addition to data, the communication of knowledge implied in the deep learning models is also strongly desired. As a specific example regarding the concept of knowledge creation and communication in the context of Knowledge Centric Networking (KCN), we investigate deep learning model compression and demonstrate its promising use through a set of experiments. In particular, towards future KCN, we introduce efficient transmission of deep learning models in terms of both single model compression and multiple model prediction. The necessity, importance and open problems regarding the standardization of deep learning models, which enables interoperability with the standardized compact model representation bitstream syntax, are also discussed.

## SESSION: Keynote 4

### Session details: Keynote 4

•      Kyoung Mu Lee

### Living with AI in Connected Devices for valuable Experience

•      Gary Geunbae Lee

This talk describes how AI technology can be used in multiple connected devices around us such as smartphone, TV, refrigerator and several consumer electronic devices, thus giving new and exciting customer experiences and values. This talk starts with Samsung's AI vision as a device company, and introduces 5 strategic principles along with industrial usages of AI technology including speech and natural language, visual understanding, data intelligence and autonomous driving all with deep learning techniques heavily involved. These applications naturally form a platform for both on device/edge, cloud and machine learning services for various current and future Samsung devices.

## SESSION: Multimedia -3 (Multimedia Search)

### Session details: Multimedia -3 (Multimedia Search)

•      Jitao Sang

### Supervised Online Hashing via Hadamard Codebook Learning

•      Mingbao Lin
• Rongrong Ji
• Hong Liu
• Yongjian Wu

In recent years, binary code learning, a.k.a. hashing, has received extensive attention in large-scale multimedia retrieval. It aims to encode high-dimensional data points into binary codes, so that the original high-dimensional metric space can be efficiently approximated via Hamming space. However, most existing hashing methods adopt offline batch learning, which is not suitable for handling incremental datasets with streaming data or new instances. In contrast, the robustness of existing online hashing remains an open problem, while the embedding of supervised/semantic information hardly boosts the performance of online hashing, mainly due to the defect of unknown category numbers in supervised learning. In this paper, we propose an online hashing scheme, termed Hadamard Codebook based Online Hashing (HCOH), which aims to solve the above problems towards robust and supervised online hashing. In particular, we first assign an appropriate high-dimensional binary code to each class label, which is generated randomly by Hadamard codes. Subsequently, LSH is adopted to reduce the length of such Hadamard codes in accordance with the hash bits, which can adapt the predefined binary codes online and theoretically guarantee the semantic similarity. Finally, we consider the setting of stochastic data acquisition, which enables our method to efficiently learn the corresponding hashing functions via stochastic gradient descent (SGD) online. Notably, the proposed HCOH can be embedded with supervised labels and is not limited to a predefined category number. Extensive experiments on three widely-used benchmarks demonstrate the merits of the proposed scheme over the state-of-the-art methods.
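
The codebook idea described above can be sketched as follows, under stated assumptions: a Hadamard matrix is built with the Sylvester construction, one row is assigned to each class, and sign-of-random-projection LSH shortens each row to the number of hash bits. The sizes, the fixed seed, and the choice of taking consecutive rows (rather than sampling them randomly) are simplifications made here, not details from the paper.

```python
import random


def hadamard(order):
    """Sylvester construction of a Hadamard matrix; order must be a power of two."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        # H_2k = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def lsh_project(code, projections):
    """Sign of random projections: maps a long +1/-1 code to len(projections) bits."""
    return [1 if sum(w * c for w, c in zip(row, code)) >= 0 else -1
            for row in projections]


rng = random.Random(0)                       # fixed seed, only for reproducibility
num_classes, code_len, hash_bits = 3, 8, 4   # illustrative sizes (assumed)
# One mutually orthogonal row per class; skipping the all-ones row is a choice made here.
codebook = hadamard(code_len)[1:num_classes + 1]
projections = [[rng.gauss(0, 1) for _ in range(code_len)] for _ in range(hash_bits)]
targets = [lsh_project(code, projections) for code in codebook]
for label, target in enumerate(targets):
    print(label, target)
```

Because Hadamard rows are mutually orthogonal, every pair of class codes differs in exactly half of its positions before projection, which gives well-separated targets without knowing the number of classes in advance (as long as enough rows remain).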

### Cascaded Feature Augmentation with Diffusion for Image Retrieval

•      Yuanqiang Fang
• Wengang Zhou
• Yijuan Lu
• Jinhui Tang
• Qi Tian
• Houqiang Li

Recently, as an effective re-ranking technique, diffusion has attracted considerable attention in research on image retrieval. It derives from the random surfer model and is effective at deeply exploring the manifold structure of data. However, as a common practice, diffusion is performed at query time, which relies heavily on initial retrieval shortlists and suffers from the bottleneck of online time-efficiency. To this end, in this paper, we present a more generalized method named CFA (cascaded feature augmentation) based on diffusion. First of all, we transfer the diffusion process from the online stage to the offline stage and utilize the output of diffusion to augment database features in a cascaded mode, which eliminates the iteration process at query time. Second, to scale the diffusion method to large image databases, we propose a cascaded cluster diffusion technique for feature augmentation which largely reduces computational cost. Third, we extend our cascaded feature augmentation scheme to cases with multiple features without incurring extra memory and time cost. Our CFA is compatible with other re-ranking methods. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed algorithm.
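
As a reminder of what query-time diffusion involves, here is a minimal sketch of a random walk with restart on an affinity graph, the generic random-surfer iteration that diffusion-based re-ranking builds on. It is not the CFA method itself; the function name `diffuse`, the damping factor, and the toy chain graph are assumptions for illustration.

```python
def diffuse(affinity, query, alpha=0.85, iterations=50):
    """Random walk with restart: s <- alpha * P^T s + (1 - alpha) * q.

    affinity: square list-of-lists of non-negative edge weights.
    query:    restart distribution over nodes (should sum to 1).
    """
    n = len(affinity)
    # Row-normalise the affinities into transition probabilities P.
    transition = []
    for row in affinity:
        total = sum(row)
        transition.append([w / total if total else 0.0 for w in row])
    scores = list(query)
    for _ in range(iterations):
        scores = [
            alpha * sum(transition[j][i] * scores[j] for j in range(n))
            + (1 - alpha) * query[i]
            for i in range(n)
        ]
    return scores


# Toy chain graph 0 - 1 - 2 - 3, with the query attached to node 0.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
ranking_scores = diffuse(chain, [1.0, 0.0, 0.0, 0.0])
print(ranking_scores)
```

The loop is the iteration that has to run for every query in the common practice described above, which is the online cost that moving diffusion offline is meant to remove.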

### Deep Priority Hashing

•      Zhangjie Cao
• Ziping Sun
• Mingsheng Long
• Jianmin Wang
• Philip S. Yu

Deep hashing enables image retrieval by end-to-end learning of deep representations and hash codes from training data with pairwise similarity information. Subject to the distribution skewness underlying the similarity information, most existing deep hashing methods may underperform for imbalanced data due to misspecified loss functions. This paper presents Deep Priority Hashing (DPH), an end-to-end architecture that generates compact and balanced hash codes in a Bayesian learning framework. The main idea is to reshape the standard cross-entropy loss for similarity-preserving learning such that it down-weighs the loss associated to highly-confident pairs. This idea leads to a novel priority cross-entropy loss, which prioritizes the training on uncertain pairs over confident pairs. Also, we propose another priority quantization loss, which prioritizes hard-to-quantize examples for generation of nearly lossless hash codes. Extensive experiments demonstrate that DPH can generate high-quality hash codes and yield state-of-the-art image retrieval results on three datasets, ImageNet, NUS-WIDE, and MS-COCO.
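
One familiar way to down-weigh the loss of highly confident pairs is a focal-style modulating factor on the cross-entropy. The sketch below uses that form as an assumption; the abstract does not give the DPH formula, and the function name `weighted_pair_loss` and the exponent `gamma` are hypothetical.

```python
import math


def weighted_pair_loss(p_similar, is_similar, gamma=2.0):
    """Cross-entropy on one pair, scaled by (1 - confidence) ** gamma.

    p_similar:  predicted probability that the pair is similar.
    is_similar: ground-truth similarity label of the pair.
    gamma:      assumed focusing exponent; gamma = 0 recovers plain cross-entropy.
    """
    p_true = p_similar if is_similar else 1.0 - p_similar
    p_true = min(max(p_true, 1e-12), 1.0)  # guard against log(0)
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


confident = weighted_pair_loss(0.9, True)   # easy pair: weight 0.1 ** 2
uncertain = weighted_pair_loss(0.6, True)   # hard pair: weight 0.4 ** 2
print(confident < uncertain)  # True
```

With this weighting, a skewed majority of easy pairs contributes little to the total loss, so training is driven mainly by the uncertain pairs.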

### Fast Discrete Cross-modal Hashing With Regressing From Semantic Labels

•      Xingbo Liu
• Xiushan Nie
• Wenjun Zeng
• Chaoran Cui
• Lei Zhu
• Yilong Yin

Hashing has recently received great attention in cross-modal retrieval. Cross-modal retrieval aims at retrieving information across heterogeneous modalities (e.g., texts vs. images). Cross-modal hashing compresses heterogeneous high-dimensional data into compact binary codes with similarity preserving, which provides efficiency and facility in both retrieval and storage. In this study, we propose a novel fast discrete cross-modal hashing (FDCH) method with regressing from semantic labels to take advantage of supervised labels to improve retrieval performance. In contrast to existing methods that learn the projection from hash codes to semantic labels, the proposed FDCH regresses the semantic labels of training examples to the corresponding hash codes with a drift. It not only accelerates the hash learning process, but also helps generate stable hash codes. Furthermore, the drift can adjust the regression and enhance the discriminative capability of hash codes. Especially in the case of training efficiency, FDCH is much faster than existing methods. Comparisons with several state-of-the-art techniques on three benchmark datasets have demonstrated the superiority of FDCH under various cross-modal retrieval scenarios.

## SESSION: Experience-1 (Multimedia Entertainment and Experience)

### Session details: Experience-1 (Multimedia Entertainment and Experience)

•      Zhisheng Yan

### ModaNet: A Large-scale Street Fashion Dataset with Polygon Annotations

•      Shuai Zheng
• Fan Yang
• M. Hadi Kiapour
• Robinson Piramuthu

Understanding clothes from a single image would have huge commercial and cultural impacts on modern societies. However, this task remains a challenging computer vision problem due to wide variations in the appearance, style, brand and layering of clothing items. We present a new database called ModaNet, a large-scale collection of images based on Paperdoll dataset. Our dataset provides 55,176 street images, fully annotated with polygons on top of the 1 million weakly annotated street images in Paperdoll. ModaNet aims to provide a technical benchmark to fairly evaluate the progress of applying the latest computer vision techniques that rely on large data for fashion understanding. The rich annotation of the dataset allows to measure the performance of state-of-the-art algorithms for object detection, semantic segmentation and polygon prediction on street fashion images in detail.

### SLIONS: A Karaoke Application to Enhance Foreign Language Learning

•      Riwu Wang
• Douglas Turnbull
• Ye Wang

Singing songs can be an engaging and effective activity when learning a foreign language. In this paper, we describe a multi-language karaoke application called SLIONS: Singing and Listening to Improve Our Natural Speaking. When developing this application, we followed a user-centered design process which was informed by conducting interviews with domain experts, extensive usability testing, and reviewing existing gamified karaoke and language learning applications. The key feature of SLIONS is that we used automatic speech recognition (ASR) to provide students with personalized, granular feedback based on their singing pronunciation. We also provided multi-modal instruction: audio of music and singing tracks, video of a professional singer and translated text of lyrics to help students learn and master each song in the foreign language. To test the efficacy of SLIONS, we conducted a one-week pilot study with English and Chinese language learning students (N=15). The initial quantitative results show that our application can improve pronunciation and may improve vocabulary. In addition, the qualitative feedback from the students suggests that SLIONS is both fun to use and motivates students to practice speaking and singing in a foreign language.

### Context-Aware Unsupervised Text Stylization

•      Shuai Yang
• Jiaying Liu
• Wenhan Yang
• Zongming Guo

In this work, we present a novel algorithm to stylize text without supervision, which provides a flexible and convenient way to create fantastic text effects. Rather than employing a fixed pair of target text and source style images, our unsupervised framework establishes an implicit mapping between them by using abstract imagery of the style image as a bridge. Based on the mapping, we progressively narrow the visual discrepancy between text and style images by the proposed legibility-preserving structure transfer and texture transfer algorithms, which effectively balance text legibility and style consistency. Furthermore, we explore a seamless composition of the stylized text and a background image, in which the optimal text layout is determined by a context-aware layout design algorithm utilizing cues for both seamlessness and aesthetics. Given the layout, the text can be seamlessly embedded into the background by texture synthesis under a context-aware boundary constraint. Experimental results demonstrate the effectiveness of the proposed method in automatic artistic typography creation and visual-textual presentation synthesis.

### Songle Sync: A Large-Scale Web-based Platform for Controlling Various Devices in Synchronization with Music

•      Jun Kato
• Masa Ogata
• Takahiro Inoue
• Masataka Goto

This paper presents Songle Sync, a web-based platform on which hundreds of Internet-connected devices, including smartphones, computers, and other physical computing devices, can be controlled to synchronize with music playback. It uses music-understanding technologies to dynamically synthesize music-driven multimedia performances from a musical piece of choice. To simultaneously control hundreds of devices, a conventional architecture keeps always-on connections between them. However, it does not scale and suffers from latency and jitter issues when there are various devices with potentially unstable networks. We address this with a novel autonomous control architecture in which each device is notified of forthcoming musical events (e.g., beats and chorus sections) to automatically drive various changes in multimedia performances. Moreover, we provide a development kit of an event-driven multimedia framework for JavaScript, example programs, and an interactive tutorial. To evaluate the platform, we compared latency, jitter, and the amount of network traffic between our architecture and the conventional one. To examine use cases in the wild, we deployed the platform to drive over a hundred devices of various kinds. We also developed a web browser-based application for a multimedia performance with music playback. It provided audiences of hundreds with a bring-your-own-device experience of synchronized animations on smartphones. In addition, the development kit was used in a two-day hackathon. We report lessons learned from these studies and discuss the future of the Internet of Musical Things.
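The idea of notifying each device of forthcoming musical events, so that it can act on its own clock instead of waiting for a per-event message, can be illustrated with a minimal Python sketch. This is not the Songle Sync API: `MusicEvent`, `schedule_locally`, and all parameters are hypothetical names chosen for the illustration, and the sketch assumes the device already knows when playback started on its local clock.

```python
from dataclasses import dataclass


@dataclass
class MusicEvent:
    kind: str          # hypothetical label, e.g. "beat" or "chorus"
    position_ms: int   # position within the song at which the event occurs


def schedule_locally(events, playback_start_ms, now_ms):
    """Return sorted (delay_ms, kind) pairs for events that are still ahead.

    The device receives the event list in advance and computes its own
    timers, so no network round trip is needed when an event fires.
    """
    schedule = []
    for event in events:
        fire_at_ms = playback_start_ms + event.position_ms
        delay_ms = fire_at_ms - now_ms
        if delay_ms >= 0:
            schedule.append((delay_ms, event.kind))
    return sorted(schedule)
```

In this sketch a late-joining device simply drops events that have already passed and schedules the rest, which is one plausible way such an architecture could tolerate unstable connections.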

## SESSION: System-2 (Smart Multimedia Systems)

### Session details: System-2 (Smart Multimedia Systems)

•      Yijuan Lu

### Fine-Grained Grocery Product Recognition by One-Shot Learning

•      Weidong Geng
• Feilin Han
• Jiangke Lin
• Liuyi Zhu
• Jieming Bai
• Suzhen Wang
• Lin He
• Qiang Xiao
• Zhangjiong Lai

Fine-grained grocery product recognition via camera is a challenging task: visually similar products with subtle differences must be identified using single-shot training examples. To address this issue, we present a novel hybrid classification approach that combines feature-based matching and one-shot deep learning with a coarse-to-fine strategy. The candidate regions of product instances are first detected and coarsely labeled by recurring features in product images without any training. Then, attention maps are generated to guide the classifier to focus on fine discriminative details by magnifying the influence of the features in the candidate regions of interest (ROI) and suppressing the interference of the features outside, effectively improving the accuracy of fine-grained grocery product recognition. Our framework also adapts well, allowing an existing classifier to be refined without retraining when new product classes arrive. As an additional contribution, we collect a new grocery product database with 102 classes from 2 stores. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods.

### Reconfigurable Inverted Index

•      Yusuke Matsui
• Ryota Hinami
• Shin'ichi Satoh

Existing approximate nearest neighbor search systems suffer from two fundamental problems that are of practical importance but have not received sufficient attention from the research community. First, although existing systems perform well for the whole database, it is difficult to run a search over a subset of the database. Second, there has been no discussion concerning the performance degradation after many items have been newly added to a system. We develop a reconfigurable inverted index (Rii) to resolve these two issues. Based on the standard IVFADC system, we design a data layout such that items are stored linearly. This enables us to efficiently run a subset search by switching the search method to a linear PQ scan if the size of a subset is small. Owing to the linear layout, the data structure can be dynamically adjusted after new items are added, maintaining the fast speed of the system. Extensive comparisons show that Rii achieves performance comparable to state-of-the-art systems such as Faiss.
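A linear PQ scan restricted to a subset can be sketched in a few lines of plain Python. This is an illustrative sketch of product-quantization scanning with asymmetric distance computation, not the Rii implementation: the function names are hypothetical, and the fixed `threshold` in `choose_method` is an assumption standing in for whatever switching criterion the system actually uses.

```python
def adc_table(query, codebooks):
    """Squared distances from each query sub-vector to every codeword."""
    sub_dim = len(query) // len(codebooks)
    table = []
    for m, codebook in enumerate(codebooks):
        sub_query = query[m * sub_dim:(m + 1) * sub_dim]
        table.append([
            sum((q - c) ** 2 for q, c in zip(sub_query, codeword))
            for codeword in codebook
        ])
    return table


def linear_pq_scan(query, codebooks, codes, subset_ids, top_k):
    """Scan only the items in subset_ids, scoring each by table lookups."""
    table = adc_table(query, codebooks)
    scored = [
        (sum(table[m][code] for m, code in enumerate(codes[item_id])), item_id)
        for item_id in subset_ids
    ]
    return [item_id for _, item_id in sorted(scored)[:top_k]]


def choose_method(subset_size, threshold):
    """Hypothetical rule: small subsets are scanned linearly."""
    return "linear_pq_scan" if subset_size < threshold else "inverted_index"
```

Because the codes are stored in one linear array indexed by item identifier, the scan touches only the requested identifiers and its cost grows with the subset size rather than with the size of the database.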

### Robust Billboard-based, Free-viewpoint Video Synthesis Algorithm to Overcome Occlusions under Challenging Outdoor Sport Scenes

•      Hiroshi Sankoh
• Sei Naito
• Keisuke Nonaka
• Houari Sabirin
• Jun Chen

The paper proposes an algorithm to robustly reconstruct an accurate billboard model of an individual object, including an occluded one, in each camera. Each billboard model is utilized to synthesize high-quality, free-viewpoint video, especially for outdoor sport scenes in which roughly calibrated cameras are sparsely placed. The two main contributions of the proposed algorithm are (1) robustness to occlusions caused by overlaps of multiple objects in every camera, which is one of the biggest issues for billboard-based methods, and (2) applicability to challenging shooting conditions in which an accurate 3D model cannot be reconstructed because of calibration errors, a small number of cameras and so on. In order to achieve the contributions above, the algorithm does not try to reproduce an accurate 3D model of each object but utilizes a "rough 3D model". The algorithm precisely extracts an individual object region in every camera by reconstructing a "rough 3D model" of each object and back-projecting it to every camera. The 3D coordinate at which each billboard is located is calculated based on the position of a rough 3D model. Experimental results compare the visual quality of free-viewpoint videos synthesized with our proposed method and conventional methods and show the effectiveness of our proposed method in terms of the naturalness of positional relationships and the fineness of the surface textures of all the objects.

### iHuman3D: Intelligent Human Body 3D Reconstruction using a Single Flying Camera

•      Wei Cheng
• Lan Xu
• Lei Han
• Yuanfang Guo
• Lu Fang

Aiming at an autonomous, adaptive and real-time human body reconstruction technique, this paper presents iHuman3D: an intelligent human body 3D reconstruction system using a single aerial robot integrated with an RGB-D camera. Specifically, we propose a real-time and active view planning strategy based on a highly efficient ray casting algorithm on the GPU and a novel information gain formulation directly in TSDF. We also propose the human body reconstruction module by revising the traditional volumetric fusion pipeline with a compactly-designed non-rigid deformation for slight motion of the human target. We unify both the active view planning and human body reconstruction in the same TSDF volume-based representation. Quantitative and qualitative experiments are conducted to validate that the proposed iHuman3D system effectively removes the constraint of extra manual labor, enabling real-time and autonomous reconstruction of the human body.

## SESSION: FF-6

### Session details: FF-6

•      Benoit Huet

### Examine before You Answer: Multi-task Learning with Adaptive-attentions for Multiple-choice VQA

•      Lianli Gao
• Pengpeng Zeng
• Jingkuan Song
• Xianglong Liu
• Heng Tao Shen

### Residual-Guide Network for Single Image Deraining

•      Zhiwen Fan
• Huafeng Wu
• Xueyang Fu
• Yue Huang
• Xinghao Ding

Single image rain streak removal is extremely important since rainy conditions adversely affect many computer vision systems. Deep learning based methods have had great success in image deraining tasks. In this paper, we propose a novel residual-guide feature fusion network, called ResGuideNet, for single image deraining that progressively predicts high-quality reconstruction while using fewer parameters than previous methods. Specifically, we propose a cascaded network and adopt residuals from shallower blocks to guide deeper blocks. With this strategy, we can obtain a coarse-to-fine estimation of the negative residual as the blocks go deeper. The outputs of different blocks are merged into the final reconstruction. We adopt recursive convolution to build each block and apply supervision to intermediate de-rained results. ResGuideNet is detachable to meet different rainy conditions. For images with light rain streaks and limited computational resources at test time, we can obtain decent performance even with several building blocks. Experiments validate that ResGuideNet can benefit other low- and high-level vision tasks.

### From Volcano to Toyshop: Adaptive Discriminative Region Discovery for Scene Recognition

•      Zhengyu Zhao
• Martha Larson

As deep learning approaches to scene recognition emerge, they have continued to leverage discriminative regions at multiple scales, building on practices established by conventional image classification research. However, approaches remain largely generic, and do not carefully consider the special properties of scenes. In this paper, inspired by the intuitive differences between scenes and objects, we propose Adi-Red, an adaptive approach to discriminative region discovery for scene recognition. Adi-Red uses a CNN classifier, which was pre-trained using only image-level scene labels, to discover discriminative image regions directly. These regions are then used as a source of features to perform scene recognition. The use of the CNN classifier makes it possible to adapt the number of discriminative regions per image using a simple, yet elegant, threshold, at relatively low computational cost. Experimental results on the scene recognition benchmark dataset SUN397 demonstrate the ability of Adi-Red to outperform the state of the art. Additional experimental analysis on the Places dataset reveals the advantages of Adi-Red, and highlights how they are specific to scenes. We attribute the effectiveness of Adi-Red to the ability of adaptive region discovery to avoid introducing noise, while also not missing out on important information.

### The Effect of Foveation on High Dynamic Range Video Perception

•      Joshua Sowerby
• Yang Zhang
• Dimitris Agrafiotis

When watching a video, the viewer's eyes will fixate on a certain point within each frame. Areas far from the viewer's gaze location are perceived with much lower visual acuity than those around the fixation point. This effect is known as foveation. In this paper, the effect of foveation on High Dynamic Range (HDR) video perception is investigated. Using eye tracking data recorded from six different HDR sequences, the bit depth of individual frames is variably encoded, with the pixels with the highest bit depth corresponding to areas around the most likely fixation point for the frame. The bit depth of pixels within the modified frame then gradually reduces, depending on how far the pixel is located from the fixation point. To lower the bit depth of the HDR content, a tone mapping operator (TMO) is used. The particular TMO that is used generates an optimal tone mapping curve for every frame, which is used both for tone mapping to reduce the bit depth and for inverse tone mapping for display purposes. However, this procedure can often cause large amounts of flickering, as well as banding artefacts, which reduce the perceptual quality of the video. Methods to mitigate these effects are proposed and implemented in this paper. Subjective performance evaluations were carried out involving 17 participants in order to evaluate the proposed methodology. Results show that when the lowest bit depth is 8 bits, the modified video is indistinguishable from the original. However, when 6-bit regions are introduced, a significant difference is noticed. Dithering and increasing the foveation region significantly improve the perceptual quality of the modified sequence.
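The distance-dependent bit depth described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the step-wise falloff, the default `max_bits=10`, and all function and parameter names are hypothetical, and plain bit truncation is used here as a stand-in for the per-frame tone mapping operator that the paper actually applies.

```python
import math


def bit_depth_at(x, y, fixation, fovea_radius, falloff_px,
                 max_bits=10, min_bits=8):
    """Bit depth for pixel (x, y): full depth near the fixation point,
    one bit less for every falloff_px pixels beyond the foveal region."""
    distance = math.hypot(x - fixation[0], y - fixation[1])
    if distance <= fovea_radius:
        return max_bits
    steps = int((distance - fovea_radius) // falloff_px) + 1
    return max(min_bits, max_bits - steps)


def requantize(value, from_bits, to_bits):
    """Drop the least significant bits of an integer sample
    (a crude substitute for tone mapping, used only for illustration)."""
    shift = from_bits - to_bits
    return (value >> shift) << shift
```

Raising `min_bits` or enlarging `fovea_radius` in this sketch corresponds to the two levers the abstract reports on: the lowest bit depth used in the periphery and the size of the foveation region.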

### An Efficient Deep Quantized Compressed Sensing Coding Framework of Natural Images

•      Wenxue Cui
• Feng Jiang
• Xinwei Gao
• Shengping Zhang
• Debin Zhao

Traditional image compressed sensing (CS) coding frameworks solve an inverse problem that is based on the measurement coding tools (prediction, quantization, entropy coding, etc.) and the optimization based image reconstruction method. These CS coding frameworks face the challenge of improving the coding efficiency at the encoder, while also suffering from high computational complexity at the decoder. In this paper, we move a step forward and propose a novel deep network based CS coding framework for natural images, which consists of three sub-networks: a sampling sub-network, an offset sub-network and a reconstruction sub-network, which are responsible for sampling, quantization and reconstruction, respectively. By cooperatively utilizing these sub-networks, it can be trained in the form of an end-to-end metric with a proposed rate-distortion optimization loss function. The proposed framework not only improves the coding performance, but also reduces the computational cost of the image reconstruction dramatically. Experimental results on benchmark datasets demonstrate that the proposed method is capable of achieving superior rate-distortion performance against state-of-the-art methods.

### PoB: Toward Reasoning Patterns of Beauty in Image Data

•      Diep Thi Ngoc Nguyen
• Hideki Nakayama
• Naoaki Okazaki
• Tatsuya Sakaeda

Aiming to develop a computational grammar system for visual information, we design a 4-tier framework that consists of four levels of 'visual grammar of images.' As a first step toward realization, we propose a new dataset, named the PoB dataset, in which each image is annotated with multiple labels of armature patterns that compose the pictorial scene. The PoB dataset includes a 10,000-painting dataset for art and a 4,959-image dataset for photography. In this paper, we discuss the consistency analysis of our dataset and its applicability. We also demonstrate how the armature patterns in the PoB dataset are useful in assessing the aesthetic quality of images, and how well a deep learning algorithm can recognize these patterns. This paper seeks to set a new direction in image understanding with a more holistic approach beyond discrete objects and in aesthetic reasoning in a more interpretative way.

### Partial Multi-view Subspace Clustering

•      Nan Xu
• Yanqing Guo
• Xin Zheng
• Qianyu Wang
• Xiangyang Luo

For many real-world multimedia applications, data are often described by multiple views. Therefore, research on multi-view learning is of great significance. Traditional multi-view clustering methods assume that each view has complete data. However, missing data or partial data are more common in real tasks, which results in partial multi-view learning. Therefore, we propose a novel multi-view clustering method, called Partial Multi-view Subspace Clustering (PMSC), to address the partial multi-view problem. Unlike most existing partial multi-view clustering methods that only learn a new representation of the original data, our method seeks the latent space and performs data reconstruction simultaneously to learn the subspace representation. The learned subspace representation can reveal the underlying subspace structure embedded in the original data, leading to a more comprehensive data description. In addition, we enforce the subspace representation to be non-negative, yielding an intuitive weight interpretation among different data. The proposed method can be optimized by the Augmented Lagrange Multiplier (ALM) algorithm. Experiments on one synthetic dataset and four benchmark datasets validate the effectiveness of PMSC under the partial multi-view scenario.

### Pseudo Transfer with Marginalized Corrupted Attribute for Zero-shot Learning

•      Teng Long
• Xing Xu
• Youyou Li
• Fumin Shen
• Jingkuan Song
• Heng Tao Shen

Zero-shot learning (ZSL) aims to recognize unseen classes that are excluded from training classes. ZSL suffers from 1) Zero-shot bias (Z-Bias): the model is biased towards seen classes because unseen data is inaccessible for training; and 2) Zero-shot variance (Z-Variance): associating different images with the same semantic embedding yields a large association error. To reduce Z-Bias, we propose a pseudo transfer mechanism, where we first synthesize the distribution of unseen data using semantic embeddings, then minimize the mismatch between the seen distribution and the synthesized unseen distribution. To reduce Z-Variance, we implicitly corrupt one semantic embedding multiple times to generate image-wise semantic vectors, with which our model learns robust classifiers. Lastly, we integrate our Z-Bias and Z-Variance reduction techniques with a linear ZSL model to show their usefulness. Our proposed model successfully overcomes the Z-Bias and Z-Variance problems. Extensive experiments on five benchmark datasets including ImageNet-1K demonstrate that our model outperforms the state-of-the-art methods with fast training.

### Semi-Supervised DFF: Decoupling Detection and Feature Flow for Video Object Detectors

•      Guangxing Han
• Xuan Zhang
• Chongrong Li

For efficient video object detection, our detector consists of a spatial module and a temporal module. The spatial module aims to detect objects in static frames using convolutional networks, and the temporal module propagates high-level CNN features to nearby frames via light-weight feature flow. Alternating between the spatial and temporal modules at a proper interval makes our detector fast and accurate. We then propose a two-stage semi-supervised learning framework to train our detector, which fully exploits unlabeled videos by decoupling the spatial and temporal modules. In the first stage, the spatial module is learned by traditional supervised learning. In the second stage, we employ both feature regression loss and feature semantic loss to learn our temporal module via unsupervised learning. Unlike traditional methods, our method can largely exploit unlabeled videos and bridges the gap between object detectors in the image and video domains. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of our method. Code will be made publicly available.

### Unsupervised Learning of 3D Model Reconstruction from Hand-Drawn Sketches

•      Lingjing Wang
• Cheng Qian
• Jifei Wang
• Yi Fang

3D object modeling has gained considerable attention in the visual computing community. We propose a low-cost unsupervised learning model for 3D object reconstruction from hand-drawn sketches. Recent advancements in deep learning have opened new opportunities to learn high-quality 3D objects from 2D sketches via supervised networks. However, the limited availability of labeled 2D hand-drawn sketch data (i.e., sketches and their corresponding 3D ground truth models) hinders the training process of supervised methods. In this paper, driven by a novel design that combines retrieval and reconstruction processes, we develop a learning paradigm to reconstruct 3D objects from hand-drawn sketches, without the use of well-labeled hand-drawn sketch data during the entire training process. Specifically, the paradigm begins with the training of an adaptation network via an autoencoder with adversarial loss, embedding the unpaired 2D rendered image domain and the hand-drawn sketch domain into a shared latent vector space. Then, from the embedding latent space, for each testing sketch image, we retrieve a few (e.g., five) nearest neighbors from the training 3D data set as prior knowledge for a 3D Generative Adversarial Network. Our experiments verify our network's robust and superior performance in handling 3D volumetric object generation from a single hand-drawn sketch without requiring any 3D ground truth labels.

### Deep Adaptive Temporal Pooling for Activity Recognition

•      Sibo Song
• Ngai-Man Cheung
• Vijay Chandrasekhar

Deep neural networks have recently achieved competitive accuracy for human activity recognition. However, there is room for improvement, especially in modeling long-term temporal importance and determining the activity relevance of different temporal segments in a video. To address this problem, we propose a learnable and differentiable module: Deep Adaptive Temporal Pooling (DATP). DATP applies a self-attention mechanism to adaptively pool the classification scores of different video segments. Specifically, using frame-level features, DATP regresses the importance of different temporal segments and generates weights for them. Remarkably, DATP is trained using only the video-level activity class label, with no additional supervision. We conduct extensive experiments to investigate various input features and different weight models. Experimental results show that DATP can learn to assign large weights to key video segments. More importantly, DATP can improve the training of the frame-level feature extractor. This is because relevant temporal segments are assigned large weights during back-propagation. Overall, we achieve state-of-the-art performance on the UCF101, HMDB51 and Kinetics datasets.
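The pooling step, in which per-segment classification scores are combined using learned importance weights, can be sketched as follows. This is an illustrative sketch rather than the DATP architecture: the importance values are taken as given inputs instead of being regressed from frame-level features, and normalizing them with a softmax is an assumption of this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_temporal_pool(segment_scores, importance_logits):
    """Weighted sum of per-segment class scores.

    segment_scores[t][c] is the score of class c for segment t;
    importance_logits[t] is the (hypothetical) importance of segment t.
    """
    weights = softmax(importance_logits)
    num_classes = len(segment_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, segment_scores))
        for c in range(num_classes)
    ]
```

With equal importance values the result reduces to average pooling, while a single dominant segment makes the pooled scores approach that segment's scores. Since the weighted sum is differentiable, a loss on the pooled scores alone can in principle train the weights, which is consistent with needing only a video-level label.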

### Person Re-identification with Hierarchical Deep Learning Feature and efficient XQDA Metric

•      Mingyong Zeng
• Chang Tian
• Zemin Wu

Feature learning and metric learning are two important components in person re-identification (re-id). In this paper, we utilize both aspects to advance the current state of the art (SOTA). Our solution is based on a classification network with label smoothing regularization (LSR) and a multi-branch tree structure. The insight is that some middle network layers are found to be surprisingly better than the last layers on the re-id task. A Hierarchical Deep Learning Feature (HDLF) is thus proposed by combining such useful middle layers. To learn the best metric for the high-dimensional HDLF, an efficient eXQDA metric is proposed to deal with large-scale big-data scenarios. The proposed HDLF and eXQDA are evaluated against current SOTA methods on five benchmark datasets. Our methods achieve very high re-id results, which are far beyond state-of-the-art solutions. For example, our approach reaches 81.6%, 96.1% and 95.6% Rank-1 accuracies on the ILIDS-VID, PRID2011 and Market-1501 datasets. In addition, the code and related materials (lists of over 1800 re-id papers and 170 top conference re-id papers) are released for research purposes.
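Label smoothing regularization, mentioned above, is commonly defined as mixing the one-hot target with a uniform distribution over classes. The sketch below shows that standard formulation only; the value `epsilon=0.1` and the function name are assumptions of this example, not settings reported in the abstract.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return the smoothed target distribution for one training sample.

    The true class receives (1 - epsilon) + epsilon / num_classes and
    every other class receives epsilon / num_classes.
    """
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == true_class else uniform
        for k in range(num_classes)
    ]
```

The smoothed targets still sum to one, so they can replace one-hot labels in a cross-entropy loss without other changes.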

### Cumulative Nets for Edge Detection

•      Jingkuan Song
• Zhilong Zhou
• Lianli Gao
• Xing Xu
• Heng Tao Shen

Much recent progress has been made by using Convolutional Neural Networks (CNN) for edge detection. Due to the nature of hierarchical representations learned in CNN, it is intuitive to design side networks utilizing the richer convolutional features to improve the edge detection. However, different side networks are isolated, and the final results are usually a weighted sum of the side outputs with uneven qualities. To tackle these issues, we propose a Cumulative Network (C-Net), which learns the side network cumulatively based on current visual features and low-level side outputs, to gradually remove detailed or sharp boundaries to enable high-resolution and accurate edge detection. Therefore, the lower-level edge information is cumulatively inherited while the superfluous details are progressively abandoned. In fact, recursively learning where to remove superfluous details from the current edge map with the supervision of a higher-level visual feature is challenging. Furthermore, we employ atrous convolution (AC) and atrous spatial pyramid pooling (ASPP) to robustly detect object boundaries at multiple scales and aspect ratios. Also, cumulatively refining edges using high-level visual information and lower-level edge maps is achieved by our designed cumulative residual attention (CRA) block. Experimental results show that our C-Net sets new records for edge detection on two benchmark datasets: BSDS500 (i.e., .819 ODS, .835 OIS and .862 AP) and NYUDV2 (i.e., .762 ODS, .781 OIS, .797 AP). C-Net has great potential to be applied to other deep learning based applications, e.g., image classification and segmentation.

### Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

•      Niluthpol Chowdhury Mithun
• Rameswar Panda
• Evangelos E. Papalexakis
• Amit K. Roy-Chowdhury

Cross-modal retrieval between visual data and natural language description remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across modalities, most of these methods are plagued by the issue of training with small-scale datasets covering a limited number of images with ground-truth sentences. Moreover, creating a larger dataset by annotating millions of images with sentences is extremely expensive and may lead to a biased model. Inspired by the recent success of webly supervised learning in deep neural networks, we capitalize on readily-available web images with noisy annotations to learn a robust image-text joint representation. Specifically, our main idea is to leverage web images and corresponding tags, along with fully annotated datasets, in training for learning the visual-semantic joint embedding. We propose a two-stage approach for the task that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a more robust visual-semantic embedding. Experiments on two standard benchmark datasets demonstrate that our method achieves a significant performance gain in image-text retrieval compared to state-of-the-art approaches.
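A typical supervised pair-wise ranking loss for image-text embedding can be sketched as a bidirectional hinge loss over a batch similarity matrix. This is a generic formulation given for illustration, not necessarily the exact loss used in the paper; the `margin=0.2` default and the function name are assumptions.

```python
def pairwise_ranking_loss(similarity, margin=0.2):
    """Bidirectional hinge ranking loss.

    similarity[i][j] is the score between image i and sentence j;
    matching pairs lie on the diagonal.
    """
    n = len(similarity)
    loss = 0.0
    for i in range(n):
        positive = similarity[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score its own sentence above sentence j.
            loss += max(0.0, margin - positive + similarity[i][j])
            # Sentence i should score its own image above image j.
            loss += max(0.0, margin - positive + similarity[j][i])
    return loss
```

The loss is zero once every matching pair beats all mismatched pairs by at least the margin, so only violating pairs contribute to the gradient in a trained model.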

### Multi-modal Preference Modeling for Product Search

•      Yangyang Guo
• Zhiyong Cheng
• Liqiang Nie
• Xin-Shun Xu
• Mohan Kankanhalli

The visual preference of users for products has been largely ignored by the existing product search methods. In this work, we propose a multi-modal personalized product search method, which aims to search for products that are not only relevant to the submitted textual query, but also match the user preferences from both textual and visual modalities. To achieve the goal, we first leverage the also_view and buy_after_viewing products to construct the visual and textual latent spaces, which are expected to preserve the visual similarity and semantic similarity of products, respectively. We then propose a translation-based search model (TranSearch) to 1) learn a multi-modal latent space based on the pre-trained visual and textual latent spaces; and 2) map the users, queries and products into this space for direct matching. The TranSearch model is trained based on a comparative learning strategy, such that the multi-modal latent space is oriented to personalized ranking in the training stage. Experiments have been conducted on real-world datasets to validate the effectiveness of our method. The results demonstrate that our method outperforms the state-of-the-art method by a large margin.

### Learning Joint Multimodal Representation with Adversarial Attention Networks

•      Feiran Huang
• Xiaoming Zhang
• Zhoujun Li

Recently, learning a joint representation for multimodal data (e.g., containing both visual content and text description) has attracted extensive research interest. Usually, the features of different modalities are correlational and compositive, and thus a joint representation capturing the correlation is more effective than a subset of the features. Most existing multimodal representation learning methods suffer from a lack of additional constraints to enhance the robustness of the learned representations. In this paper, novel Adversarial Attention Networks (AAN) are proposed to incorporate both the attention mechanism and the adversarial networks for effective and robust multimodal representation learning. Specifically, a visual-semantic attention model with a siamese learning strategy is proposed to encode the fine-grained correlation between visual and textual modalities. Meanwhile, the adversarial learning model is employed to regularize the generated representation by matching the posterior distribution of the representation to the given priors. Then, the two modules are incorporated into an integrated learning framework to learn the joint multimodal representation. Experimental results in two tasks, i.e., multi-label classification and tag recommendation, show that the proposed model outperforms state-of-the-art representation learning methods.

### Dest-ResNet: A Deep Spatiotemporal Residual Network for Hotspot Traffic Speed Prediction

•      Binbing Liao
• Jingqing Zhang
• Ming Cai
• Siliang Tang
• Yifan Gao
• Chao Wu
• Shengwen Yang
• Wenwu Zhu
• Yike Guo
• Fei Wu

With ever-increasing urbanization, traffic jams have become a common problem in metropolises around the world, making traffic speed prediction a crucial and fundamental task. This task is difficult due to the dynamic and intrinsic complexity of the traffic environment in urban cities, yet the emergence of crowd map query data sheds new light on it. In general, a burst of crowd map queries for the same destination in a short duration (called a "hotspot") could lead to traffic congestion. For example, bursts of queries for the Capital Gym on weekend evenings lead to traffic jams around the gym. However, unleashing the power of crowd map queries is challenging due to the innate spatiotemporal characteristics of the crowd queries. To bridge the gap, this paper first discovers hotspots underlying crowd map queries. These discovered hotspots address the spatiotemporal variations. Then Dest-ResNet (Deep spatiotemporal Residual Network) is proposed for hotspot traffic speed prediction. Dest-ResNet is a sequence learning framework that jointly deals with two sequences in different modalities, i.e., the traffic speed sequence and the query sequence. The main idea of Dest-ResNet is to learn to explain and amend the errors caused when the unimodal information is applied individually. In this way, Dest-ResNet addresses the temporal causal correlation between queries and the traffic speed. As a result, Dest-ResNet shows a 30% relative boost over the state-of-the-art methods on real-world datasets from Baidu Map.

### Learning and Fusing Multimodal Deep Features for Acoustic Scene Categorization

•      Yifang Yin
• Rajiv Ratn Shah
• Roger Zimmermann

Convolutional Neural Networks (CNNs) have recently been widely applied to audio classification, where promising results have been obtained. Previous CNN-based systems mostly learn from two-dimensional time-frequency representations such as MFCC and spectrograms, which may tend to emphasize the background noise of the scene. To learn the key acoustic events, we introduce a three-dimensional CNN to emphasize the different spectral characteristics from neighboring regions in the spatial-temporal domain. A novel acoustic scene classification system based on multimodal deep feature fusion is proposed in this paper, where three CNNs are presented to perform 1D raw waveform modeling, 2D time-frequency image modeling, and 3D spatial-temporal dynamics modeling, respectively. The learnt features are shown to be highly complementary to each other, and are then combined in a feature fusion network to obtain significantly improved classification predictions. Comprehensive experiments have been conducted on two large-scale acoustic scene datasets, namely the DCASE16 dataset and the LITIS Rouen dataset. Experimental results demonstrate the effectiveness of our proposed approach, as our solution achieves state-of-the-art classification rates and improves the average classification accuracy by 1.5% - 8.2% compared to the top ranked systems in the DCASE16 challenge.

### Dynamic Sound Field Synthesis for Speech and Music Optimization

•      Zhenyu Tang
• Nicolas Morales
• Dinesh Manocha

We present a novel acoustic optimization algorithm to synthesize dynamic sound fields in a static scene. Our approach places new active loudspeakers or virtual sources in the scene so that the dynamic sound field in a region satisfies optimization criteria to improve speech and music perception. We use a frequency domain formulation of sound propagation and reduce the computation of dynamic sound field synthesis to solving a linear least squares problem, and we do not impose any constraints on the environment, loudspeaker type, or loudspeaker placement. We highlight the performance on complex indoor scenes in terms of speech and music improvements. We evaluate the performance with a user study and highlight the perceptual benefits for virtual reality and multimedia applications.

### DASH for 3D Networked Virtual Environment

•      Thomas Forgione
• Axel Carlier
• Géraldine Morin
• Wei Tsang Ooi
• Vincent Charvillat
• Praveen Kumar Yadav

DASH is now a widely deployed standard for streaming video content due to its simplicity, scalability, and ease of deployment. In this paper, we explore the use of DASH for a different type of media content -- networked virtual environment (NVE), with different properties and requirements. We organize a polygon soup with textures into a structure that is compatible with DASH MPD (Media Presentation Description), with a minimal set of view-independent metadata for the client to make intelligent decisions about what data to download at which resolution. We also present a DASH-based NVE client that uses a view-dependent and network dependent utility metric to decide what to download, based only on the information in the MPD file. We show that DASH can be used on NVE for 3D content streaming. Our work opens up the possibility of using DASH for highly interactive applications, beyond its current use in video streaming.

## SESSION: Keynote 5

### Session details: Keynote 5

•      Wenwu Zhu

### Transforming Retailing Experiences with Artificial Intelligence

•      Bowen Zhou

Artificial Intelligence (AI) is having a big impact on our daily life. In this talk, we will show how AI is transforming the retail industry. In particular, we propose the brand-new concept of Retail as a Service (RaaS), where retail is redefined as the natural combination of content and interaction. With the capability of knowing more about consumers, products and retail scenarios integrating online and offline, AI is providing more personalized and comprehensive multimodal content and enabling more natural interactions between consumers and services, through the innovative technologies we invented at JD.com. We will show 1) how computer vision techniques can better understand consumers, help consumers easily discover products, and support multimodal content generation, 2) how natural language processing techniques can be used to support intelligent customer services through emotion computing, and 3) how AI is building the very fundamental technology infrastructure for RaaS.

## SESSION: Deep-3 (Image Processing-Inpainting, Super-Resolution, Deblurring)

### Session details: Deep-3 (Image Processing-Inpainting, Super-Resolution, Deblurring)

•      Shuqiang Jiang

### Learning Collaborative Generation Correction Modules for Blind Image Deblurring and Beyond

•      Risheng Liu
• Yi He
• Shichao Cheng
• Xin Fan
• Zhongxuan Luo

Blind image deblurring plays a very important role in many vision and multimedia applications. Most existing works tend to introduce complex priors to estimate the sharp image structures for blur kernel estimation. However, it has been verified that directly optimizing these models is challenging and easy to fall into degenerate solutions. Although several experience-based heuristic inference strategies, including trained networks and designed iterations, have been developed, it is still hard to obtain theoretically guaranteed accurate solutions. In this work, a collaborative learning framework is established to address the above issues. Specifically, we first design two modules, named Generator and Corrector, to extract the intrinsic image structures from the data-driven and knowledge-based perspectives, respectively. By introducing a collaborative methodology to cascade these modules, we can strictly prove the convergence of our image propagations to a deblurring-related optimal solution. As a nontrivial byproduct, we also apply the proposed method to address other related tasks, such as image interpolation and edge-preserved smoothing. Plenty of experiments demonstrate that our method can outperform the state-of-the-art approaches on both synthetic and real datasets.

### When Deep Fool Meets Deep Prior: Adversarial Attack on Super-Resolution Network

•      Minghao Yin
• Yongbing Zhang
• Xiu Li
• Shiqi Wang

This paper investigates the vulnerability of the deep prior used in deep learning based image restoration. In particular, image super-resolution, which relies on strong prior information to regularize the solution space and plays an important role in image pre-processing for future viewing and analysis, is shown to be vulnerable to well-designed adversarial examples. We formulate the adversarial example generation process as an optimization problem, and, given a super-resolution model, three different types of attack are designed based on the subsequent tasks: (i) style transfer attack; (ii) classification attack; (iii) caption attack. Another interesting property of our design is that the attack is hidden behind the super-resolution process, such that the utilization of low resolution images is not significantly influenced. We show that the vulnerability to adversarial examples could bring risks to pre-processing modules such as a super-resolution deep neural network, which is also of paramount significance for the security of the whole system. Our results also shed light on the potential security issues of the pre-processing modules, and raise concerns regarding the corresponding countermeasures for adversarial examples.

### Semantic Image Inpainting with Progressive Generative Networks

•      Haoran Zhang
• Zhenzhen Hu
• Changzhi Luo
• Wangmeng Zuo
• Meng Wang

Recently, the image inpainting task has been revived with the help of deep learning techniques. Deep neural networks, especially generative adversarial networks (GANs), make it possible to recover the missing details in images. Due to the lack of sufficient context information, most existing methods fail to get satisfactory inpainting results. This work investigates a more challenging problem, namely the newly emerging semantic image inpainting, a task to fill in large holes in natural images. In this paper, we propose an end-to-end framework named progressive generative networks (PGN), which regards the semantic image inpainting task as a curriculum learning problem. Specifically, we divide the hole filling process into several different phases, and each phase aims to finish a course of the entire curriculum. After that, an LSTM framework is used to string all the phases together. By introducing this learning strategy, our approach is able to progressively shrink the large corrupted regions in natural images and yields promising inpainting results. Moreover, the proposed approach is quite fast to evaluate as the entire hole filling is performed in a single forward pass. Extensive experiments on the Paris Street View and ImageNet datasets clearly demonstrate the superiority of our approach. Code for our models is available at https://github.com/crashmoon/Progressive-Generative-Networks.

### Structural inpainting

•      Huy V. Vo
• Ngoc Q. K. Duong
• Patrick Pérez

Scene-agnostic visual inpainting remains very challenging despite progress in patch-based methods. Recently, Pathak et al. [26] have introduced convolutional "context encoders" (CEs) for unsupervised feature learning through image completion tasks. With the additional help of adversarial training, CEs turned out to be a promising tool to complete complex structures in real inpainting problems. In the present paper we propose to push this key ability further by relying on perceptual reconstruction losses at training time. We show on a wide variety of visual scenes the merit of the approach for structural inpainting, and confirm it through a user study. Combined with the optimization-based refinement of [32] with neural patches, our context encoder opens up new opportunities for prior-free visual inpainting.

## SESSION: Brand New Ideas

### Session details: Brand New Ideas

•      Kiyoharu Aizawa

### Fluid Annotation: A Human-Machine Collaboration Interface for Full Image Annotation

• Mykhaylo Andriluka
• Jasper R. R. Uijlings
• Vittorio Ferrari

We introduce Fluid Annotation, an intuitive human-machine collaboration interface for annotating the class label and outline of every object and background region in an image. Fluid Annotation is based on three principles: (I) Strong machine-learning aid. We start from the output of a strong neural network model, which the annotator can edit by correcting the labels of existing regions, adding new regions to cover missing objects, and removing incorrect regions. The edit operations are also assisted by the model. (II) Full image annotation in a single pass. As opposed to performing a series of small annotation tasks in isolation [51,68], we propose a unified interface for full image annotation in a single pass. (III) Empower the annotator. We empower the annotator to choose what to annotate and in which order. This enables concentrating on what the machine does not already know, i.e. putting human effort only on the errors it made. This helps to use the annotation budget effectively.

Through extensive experiments on the COCO+Stuff dataset [11,51], we demonstrate that Fluid Annotation leads to accurate annotations very efficiently, taking 3x less annotation time than the popular LabelMe interface [70].

### Images2Poem: Generating Chinese Poetry from Image Streams

• Lixin Liu
• Xiaojun Wan
• Zongming Guo

Natural language generation from visual inputs has attracted extensive research attention recently. Generating poetry from visual content is an interesting but very challenging task. We propose and address the new multimedia task of generating classical Chinese poetry from image streams. In this paper, we propose an Images2Poem model with a selection mechanism and an adaptive self-attention mechanism for the problem. The model first selects representative images to summarize the image stream. During decoding, it adaptively pays attention to the information from either source-side image stream or target-side previously generated characters. It jointly summarizes the images and generates relevant, high-quality poetry from image streams. Experimental results demonstrate the effectiveness of the proposed approach. Our model outperforms baselines in different human evaluation metrics.

### Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

•      Yaman Kumar
• Mayank Aggarwal
• Pratham Nawal
• Shin'ichi Satoh
• Rajiv Ratn Shah
• Roger Zimmermann

Speechreading or lipreading is the technique of understanding and getting phonetic features from a speaker's visual features such as movement of lips, face, teeth and tongue. It has a wide range of multimedia applications such as in surveillance, Internet telephony, and as an aid to a person with hearing impairments. However, most of the work in speechreading has been limited to text generation from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds may be available for the speech of a user, these feeds have not yet been used to deal with the different poses. To this end, this paper presents the world's first ever multi-view speech reading and reconstruction system. This work extends the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speech reading and reconstruction system. The work further shows the optimal placement of cameras which would lead to the maximum intelligibility of speech. Next, it lays out various innovative applications for the proposed system, focusing on its potential prodigious impact not just in the security arena but also in many other multimedia analytics problems.

### ALERT: Adding a Secure Layer in Decision Support for Advanced Driver Assistance System (ADAS)

•      Kanchan Bahirat
• Umang Shah
• Alvaro A. Cardenas
• Balakrishnan Prabhakaran

### Cross-Modal Health State Estimation

•      Nitish Nag
• Vaibhav Pandey
• Preston J. Putzel
• Hari Bhimaraju
• Srikanth Krishnan
• Ramesh Jain

Individuals create and consume more diverse data about themselves today than any time in history. Sources of this data include wearable devices, images, social media, geo-spatial information and more. A tremendous opportunity rests within cross-modal data analysis that leverages existing domain knowledge methods to understand and guide human health. Especially in chronic diseases, current medical practice uses a combination of sparse hospital based biological metrics (blood tests, expensive imaging, etc.) to understand the evolving health status of an individual. Future health systems must integrate data created at the individual level to better understand health status perpetually, especially in a cybernetic framework. In this work we fuse multiple user created and open source data streams along with established biomedical domain knowledge to give two types of quantitative state estimates of cardiovascular health. First, we use wearable devices to calculate cardiorespiratory fitness (CRF), a known quantitative leading predictor of heart disease which is not routinely collected in clinical settings. Second, we estimate inherent genetic traits, living environmental risks, circadian rhythm, and biological metrics from a diverse dataset. Our experimental results on 24 subjects demonstrate how multi-modal data can provide personalized health insight. Understanding the dynamic nature of health status will pave the way for better health based recommendation engines, better clinical decision making and positive lifestyle changes.

## SESSION: Grand Challenge-1

### Session details: Grand Challenge-1

•      Shuqiang Jiang

### An Effective Text-based Characterization Combined with Numerical Features for Social Media Headline Prediction

•      Liuwu Li
• Sihong Huang
• Ziliang He
• Wenyin Liu

In this paper, a text-based characterization combined with numerical features for Social Media Headline Prediction (SMHP) is proposed. Descriptions of images, users' emotions and opinions are all expressed in text, and our text-based characterization learns these important features by training a Doc2vec model. Numerical features of social cues contain general characteristics of social media headlines, so we build an effective method to extract numerical features. Experiments conducted on a real-world SMHP dataset demonstrate the effectiveness of the proposed approach, which achieves the following performance: Spearman's Rho: 0.4559, MAE: 1.9797.

### An Iterative Refinement Approach for Social Media Headline Prediction

•      Chih-Chung Hsu
• Chia-Yen Lee
• Ting-Xuan Liao
• Jun-Yi Lee
• Tsai-Yne Hou
• Ying-Chu Kuo
• Jing-Wen Lin
• Ching-Yi Hsueh
• Zhong-Xuan Zhang
• Hsiang-Chin Chien

In this study, we propose a novel iterative refinement approach to predict the popularity score of the social media meta-data effectively. With the rapid growth of the social media on the Internet, how to adequately forecast the view count or popularity becomes more important. Conventionally, the ensemble approach such as random forest regression achieves high and stable performance on various prediction tasks. However, most of the regression methods may not precisely predict the extreme high or low values. To address this issue, we first predict the initial popularity score and retrieve their residues. In order to correctly compensate those extreme values, we adopt an ensemble regressor to compensate the residues to further improve the prediction performance. Comprehensive experiments are conducted to demonstrate the proposed iterative refinement approach outperforms the state-of-the-art regression approach.
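
The two-stage idea described above (predict an initial score, then fit a second regressor to the leftover errors) can be sketched as follows. This is only an illustration under loud assumptions: the abstract mentions ensemble regressors such as random forests, whereas this sketch swaps in a constant predictor and a least-squares line so that it stays self-contained. All function names and the toy data are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence

Model = Callable[[float], float]


def fit_constant(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Stage-one stand-in: always predict the mean target (not the paper's model)."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Stage-two stand-in: ordinary least-squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def fit_refined(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Fit the initial model, then a second model on its residuals."""
    initial = fit_constant(xs, ys)
    residuals = [y - initial(x) for x, y in zip(xs, ys)]
    correction = fit_line(xs, residuals)
    # Final prediction = initial score + predicted residual.
    return lambda x: initial(x) + correction(x)


def mae(model: Model, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean absolute error of a model on a dataset."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


# Toy data with one extreme value at the end (hypothetical, not from the paper).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 15.0]

base = fit_constant(xs, ys)
refined = fit_refined(xs, ys)
print(mae(base, xs, ys), mae(refined, xs, ys))
```

On this toy data the residual stage lowers the training error, which is the effect the refinement step aims for; how well it transfers to real popularity data depends on the regressors actually used.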

### Random Forest Exploiting Post-related and User-related Features for Social Media Popularity Prediction

•      Feitao Huang
• Junhong Chen
• Zehang Lin
• Peipei Kang
• Zhenguo Yang

Social media headline prediction (SMHP) is a thriving application scenario, which aims to predict the popularity of the post data shared on social media. In this paper, we propose to use multi-aspect features combined with the random forest (RF) model for popularity predictions. Firstly, we extract features by combining both metadata of the posts and users' features. More specifically, we adopt the binary coding strategy for dimensionality reduction and deal with the missing values by using some strategies, i.e., estimating the missing geographic information according to the information of users, and filling the missing features with median. Furthermore, regression models can be used directly to make predictions. In particular, a random forest (RF) model is adopted since it does not require much effort in tuning hyper-parameters and performs effectively. Extensive experiments conducted on the SMHP dataset consisting of 340K image posts shared by 80K users manifest the effectiveness of our method. Our approach achieves the 4nd place in the leader board of the Grand Challenge of SMHP in ACM Multimedia 2018.

### Content-Based Video Relevance Prediction with Second-Order Relevance and Attention Modeling

•      Xusong Chen
• Rui Zhao
• Shengjie Ma
• Dong Liu
• Zheng-Jun Zha

This paper describes our proposed method for the Content-Based Video Relevance Prediction (CBVRP) challenge. Our method is based on deep learning, i.e. we train a deep network to predict the relevance between two video sequences from their features. We explore the usage of second-order relevance, both in preparing training data, and in extending the deep network. Second-order relevance refers to e.g. the relevance between x and z if x is relevant to y and y is relevant to z. In our proposed method, we use second-order relevance to increase positive samples and decrease negative samples, when preparing training data. We further extend the deep network with an attention module, where the attention mechanism is designed for second-order relevant video sequences. We verify the effectiveness of our method on the validation set of the CBVRP challenge.
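
As a rough illustration of the second-order relevance definition above (x is related to z when x is relevant to y and y is relevant to z), the sketch below derives such pairs from a list of first-order pairs and uses them to enlarge the positive set and prune candidate negatives. The function name, the symmetry assumption, and the toy video identifiers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def second_order_pairs(first_order: Iterable[Pair]) -> Set[Pair]:
    """Return pairs (x, z) linked through some y, excluding first-order pairs."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in first_order:
        neighbours[a].add(b)
        neighbours[b].add(a)  # assumption: relevance is treated as symmetric

    derived: Set[Pair] = set()
    for linked in list(neighbours.values()):
        for x in linked:
            for z in linked:
                # Keep each unordered pair once (x < z) and skip direct links.
                if x < z and z not in neighbours.get(x, set()):
                    derived.add((x, z))
    return derived


# Toy first-order relevance chain: a-b, b-c, c-d.
relevant = [("a", "b"), ("b", "c"), ("c", "d")]
extra_positives = second_order_pairs(relevant)

# Second-order pairs are added as positives and removed from candidate negatives.
candidate_negatives = {("a", "c"), ("a", "d"), ("b", "d")}
negatives = candidate_negatives - extra_positives
print(sorted(extra_positives), sorted(negatives))
```

Here the pair ("a", "d") stays a negative because it is only connected through two intermediate videos, which would be third-order rather than second-order relevance.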

## SESSION: Vision-4 (Representation Learning)

### Session details: Vision-4 (Representation Learning)

•      Marcel Worring

### Fine-Grained Representation Learning and Recognition by Exploiting Hierarchical Semantic Embedding

•      Tianshui Chen
• Wenxi Wu
• Yuefang Gao
• Le Dong
• Xiaonan Luo
• Liang Lin

Object categories inherently form a hierarchy with different levels of concept abstraction, especially for fine-grained categories. For example, birds (Aves) can be categorized according to a four-level hierarchy of order, family, genus, and species. This hierarchy encodes rich correlations among various categories across different levels, which can effectively regularize the semantic space and thus make prediction less ambiguous. However, previous studies of fine-grained image recognition primarily focus on categories of one certain level and usually overlook this correlation information. In this work, we investigate simultaneously predicting categories of different levels in the hierarchy and integrating this structured correlation information into the deep neural network by developing a novel Hierarchical Semantic Embedding (HSE) framework. Specifically, the HSE framework sequentially predicts the category score vector of each level in the hierarchy, from highest to lowest. At each level, it incorporates the predicted score vector of the higher level as prior knowledge to learn finer-grained feature representation. During training, the predicted score vector of the higher level is also employed to regularize label prediction by using it as soft targets of corresponding sub-categories. To evaluate the proposed framework, we organize the 200 bird species of the Caltech-UCSD birds dataset with the four-level category hierarchy and construct a large-scale butterfly dataset that also covers four level categories. Extensive experiments on these two and the newly-released VegFru datasets demonstrate the superiority of our HSE framework over the baseline methods and existing competitors.

### Dissimilarity Representation Learning for Generalized Zero-Shot Recognition

•      Gang Yang
• Jinlu Liu
• Jieping Xu
• Xirong Li

Generalized zero-shot learning (GZSL) aims to recognize any test instance coming either from a known class or from a novel class that has no training instance. To synthesize training instances for novel classes and thus resolve GZSL as a common classification problem, we propose a Dissimilarity Representation Learning (DSS) method. A dissimilarity representation describes a specific instance in terms of its (dis)similarity to other instances in a visual or attribute based feature space. In the dissimilarity space, instances of the novel classes are synthesized by an end-to-end optimized neural network. The neural network realizes two-level feature mappings and domain adaptations in the dissimilarity space and the attribute based feature space. Experimental results on five benchmark datasets, i.e., AWA, AWA$_2$, SUN, CUB, and aPY, show that the proposed method improves on the state of the art by a large margin, with approximately 10% gain in terms of the harmonic mean of the top-1 accuracy. Consequently, this paper establishes a new baseline for GZSL.
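
To make the notion of a dissimilarity representation concrete, the following minimal sketch describes an instance by its distances to a set of reference instances. The choice of Euclidean distance, the function names, and the toy vectors are assumptions made for illustration; the abstract does not specify the measure or the reference set used by the authors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_representation(instance: Vector,
                                 references: Sequence[Vector]) -> List[float]:
    """Represent `instance` by its distance to every reference instance."""
    return [euclidean(instance, ref) for ref in references]


# Hypothetical reference instances in a 2-D feature space.
references = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
rep = dissimilarity_representation((3.0, 4.0), references)
print(rep)
```

The resulting vector has one entry per reference instance, so its dimensionality is set by the reference set rather than by the original feature space.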

### Attribute-Aware Attention Model for Fine-grained Representation Learning

•      Kai Han
• Jianyuan Guo
• Chao Zhang
• Mingjian Zhu

How to learn a discriminative fine-grained representation is a key point in many computer vision applications, such as person re-identification, fine-grained classification, fine-grained image retrieval, etc. Most of the previous methods focus on learning metrics or ensembles to derive a better global representation, and they usually lack local information. Based on the considerations above, we propose a novel Attribute-Aware Attention Model ($A^3M$), which can learn local attribute representation and global category representation simultaneously in an end-to-end manner. The proposed model contains two attention modules: the attribute-guided attention module uses attribute information to help select category features in different regions, while the category-guided attention module selects local features of different attributes with the help of category cues. Through this attribute-category reciprocal process, local and global features benefit from each other. Finally, the resulting feature contains more intrinsic information for image recognition instead of noisy and irrelevant features. Extensive experiments conducted on Market-1501, CompCars, CUB-200-2011 and CARS196 demonstrate the effectiveness of our $A^3M$.

### GNAS: A Greedy Neural Architecture Search Method for Multi-Attribute Learning

•      Siyu Huang
• Xi Li
• Zhi-Qi Cheng
• Zhongfei Zhang
• Alexander Hauptmann

A key problem in deep multi-attribute learning is to effectively discover the inter-attribute correlation structures. Typically, conventional deep multi-attribute learning approaches follow the pipeline of manually designing the network architectures based on task-specific expertise, prior knowledge, and careful network tuning, leading to inflexibility in various complicated scenarios in practice. To address this problem, we propose an efficient greedy neural architecture search approach (GNAS) to automatically discover the optimal tree-like deep architecture for multi-attribute learning. In a greedy manner, GNAS divides the optimization of the global architecture into the optimizations of individual connections step by step. By iteratively updating the local architectures, the global tree-like architecture converges to a structure in which the bottom layers are shared across relevant attributes and the branches in the top layers encode more attribute-specific features. Experiments on three benchmark multi-attribute datasets show the effectiveness and compactness of the neural architectures derived by GNAS, and also demonstrate the efficiency of GNAS in searching neural architectures.

## SESSION: Grand Challenge-2

### Session details: Grand Challenge-2

•      Shuqiang Jiang

### Feature Re-Learning with Data Augmentation for Content-based Video Recommendation

•      Jianfeng Dong
• Xirong Li
• Chaoxi Xu
• Gang Yang
• Xun Wang

This paper describes our solution for the Hulu Content-based Video Relevance Prediction Challenge. Noting the deficiency of the original features, we propose feature re-learning to improve video relevance prediction. To generate more training instances for supervised learning, we develop two data augmentation strategies, one for frame-level features and the other for video-level features. In addition, late fusion of multiple models is employed to further boost the performance. Evaluation conducted by the organizers shows that our best run outperforms the Hulu baseline, obtaining relative improvements of 26.2% and 30.2% on the TV-shows track and the Movies track, respectively, in terms of [email protected] The results clearly justify the effectiveness of the proposed solution.

### Beauty Product Image Retrieval Based on Multi-Feature Fusion and Feature Aggregation

•      Qi Wang
• Jingxiang Lai
• Kai Xu
• Wenyin Liu
• Liang Lei

We propose a beauty product image retrieval method based on multi-feature fusion and feature aggregation. The key idea is representing the image with the feature vector obtained by multi-feature fusion and feature aggregation. VGG16 and ResNet50 are chosen to extract image features, and Crow is adopted to perform deep feature aggregation. Benefiting from the idea of transfer learning, we fine-tune VGG16 on the Perfect-500K data set to improve the performance of image retrieval. The proposed method won the third prize in the Perfect Corp. Challenge 2018 with a best result of 0.270676 mAP. We released our code on GitHub: https://github.com/wangqi12332155/ACMMM-beauty-AI-challenge.

### Unprecedented Usage of Pre-trained CNNs on Beauty Product

•      Jian Han Lim
• Nurul Japar
• Chun Chet Ng
• Chee Seng Chan

How does a pre-trained Convolutional Neural Network (CNN) model perform on beauty and personal care items (i.e., Perfect-500K)? This is the question we attempt to answer in this paper by adopting several well-known deep learning models pre-trained on ImageNet and evaluating their performance using different distance metrics. In the Perfect Corp Challenge, we managed to secure fourth position by using only the pre-trained model.

### Regional Maximum Activations of Convolutions with Attention for Cross-domain Beauty and Personal Care Product Retrieval

•      Zehang Lin
• Zhenguo Yang
• Feitao Huang
• Junhong Chen

Cross-domain beauty and personal care product image retrieval is a challenging problem due to data variations (e.g., brightness, viewpoint, and scale), and the rich types of items. In this paper, we present a regional maximum activations of convolutions with attention (RA-MAC) descriptor to extract image features for retrieval. RA-MAC improves the regional maximum activations of convolutions (R-MAC) descriptor considering the influence of background in cross-domain images (i.e., shopper domain and seller domain). More specifically, RA-MAC utilizes the characteristics of the convolutional layer to find the attention of an image, and reduces the influence of the unimportant regions in an unsupervised manner. Furthermore, a few strategies have been exploited to improve the performance, such as multiple features fusion, query expansion, and database augmentation. Extensive experiments conducted on a dataset consisting of half a million images of beauty care products (Perfect-500K) manifest the effectiveness of RA-MAC. Our approach achieves the 2nd place in the leader board of the Grand Challenge of AI Meets Beauty in ACM Multimedia 2018. Our code is available at: https://github.com/RetrainIt/Perfect-Half-Million-Beauty-Product-Image-R....

## SESSION: Interactive Art

### Session details: Interactive Art

•      Hyunjung Shim

### Shadow Calligraphy of Dance: An Image-Based Interactive Installation for Capturing Flowing Human Figures

•      Lyn Chao-ling Chen
• He-lin Luo

This artwork explores the topic of flowing human figures. People pass through familiar places day by day, and in doing so they create connections between themselves and the city. Their impressions, memories and experiences turn a space in the city into a place, which is meaningful and creates a virtual layer upon the physical world. The artwork tries to make people aware of the connection between themselves and the environment by revealing these invisible traces. The interactive installation was set up in an outdoor exhibition: a camera was placed along the road, and a projector displayed images on the wall of a nearby building. Object detection technology was used to capture the movements of people. GMM modeling was adopted to capture frames with vivid facial features, and its parameters were set to generate an afterimage effect. The picture projected on the wall combined 25 frames with different update time settings to produce a delayed view, and only one region in the center of the image played the current frame in real time, prompting the audience to notice the connection between their movements and the projected picture. In addition, some of the frames were flipped horizontally to create a dynamic Chinese brush painting with an aesthetic composition. The figures remaining on the wall, like marks or prints, remind people of their traces in the city and create a connection between the city and the people who have been to the place at the same time. In the interactive installation, the improvisational painting of body calligraphy was exhibited in a collaborative way, revealing the facial features or human shapes of the crowd in the physical sense, and the collaborative experiences or memories in the mental sense.

### Cellular Music: An Interactive Game of Life Sequencer

•      Anis Haron
• Yong Soon Xuan
• Wong Chee Onn

Cellular Music is an algorithmic sequencer based on the rules of Conway's game of life presented as an interactive audio based installation. The installation uses a camera to detect motion in space and uses this data as seeds for the game of life algorithm. The algorithm runs on a 20x20 grid with 20 iterations of the algorithm visualised in individual layers, creating a cubical grid of 20x20x20 cells. Cells are mapped to musical pitches and scanned to generate music.
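
The layered grid described above can be sketched in a few lines of Python. This is a minimal illustration, not the installation's actual code: it assumes that cells beyond the grid edge count as dead, that the seed itself is the first of the 20 layers, and that each row maps to one semitone above a hypothetical base MIDI note. A fixed "blinker" pattern stands in for the camera motion data.

```python
from typing import List

Grid = List[List[int]]

SIZE = 20         # 20x20 grid, as in the abstract
GENERATIONS = 20  # 20 iterations stacked as layers (20x20x20 cells)


def step(grid: Grid) -> Grid:
    """Apply one generation of Conway's rules; off-grid cells count as dead."""
    n = len(grid)
    nxt = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            neighbours = sum(
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(n, r + 2))
                for cc in range(max(0, c - 1), min(n, c + 2))
                if (rr, cc) != (r, c)
            )
            if grid[r][c] == 1:
                nxt[r][c] = 1 if neighbours in (2, 3) else 0  # survival
            else:
                nxt[r][c] = 1 if neighbours == 3 else 0       # birth
    return nxt


def build_layers(seed: Grid, generations: int = GENERATIONS) -> List[Grid]:
    """Return `generations` layers: the seed followed by successive generations."""
    layers = [seed]
    for _ in range(generations - 1):
        layers.append(step(layers[-1]))
    return layers


def pitches_for_layer(layer: Grid, base_midi: int = 48) -> List[int]:
    """Hypothetical mapping: every row with a live cell sounds base_midi + row."""
    return sorted({base_midi + r for r, row in enumerate(layer) if any(row)})


# Seed a horizontal blinker in place of camera-detected motion.
seed = [[0] * SIZE for _ in range(SIZE)]
for col in (9, 10, 11):
    seed[10][col] = 1

layers = build_layers(seed)
print(len(layers), pitches_for_layer(layers[0]), pitches_for_layer(layers[1]))
```

Scanning the layers in order then yields a pitch sequence; with the blinker seed the output alternates between a single pitch and a three-note cluster, since the pattern oscillates with period two.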

### TAGapp Visualization: An Application Based Visual Art Installation

•      Yong Soon Xuan
• Wong Chee Onn
• Tan Kong Cheng
• Anis Haron

"TAGapp Visualization" is an interactive visual art installation that can be set up easily at any TAGapp-related venue, such as events, galleries and museums. Interaction with TAGapp Visualization takes place through a mobile application, which audience members can download to a smart device. The application features information retrieval, indoor navigation and positioning. Audience members obtain information with a very natural gesture, and TAGapp Visualization displays the journey of every individual audience member. After downloading the application, audience members create their own account and pick a color, and the visualization displays a sphere in the color they picked. Each sphere represents one audience member. TAGapp tracks the visiting journey of every audience member and stores it in a database, and all these actions determine the visual components of the artwork. In addition, the art installation can help the management of events, galleries and museums. This real-time, participatory, interactive visual art makes it possible to connect with the audience's behavior.

## SESSION: Tutorials

### Similarity-Based Processing of Motion Capture Data

•      Jan Sedmidubsky
• Pavel Zezula

Motion capture technologies digitize human movements by tracking 3D positions of specific skeleton joints in time. Such spatio-temporal data have enormous application potential in many fields, ranging from computer animation, through security and sports, to medicine, but their computerized processing is a difficult problem. The recorded data can be imprecise and voluminous, and the same movement action can be performed by various subjects in a number of alternatives that can vary in speed, timing or position in space. This requires employing completely different data-processing paradigms compared to traditional domains such as attributes, text or images. The objective of this tutorial is to explain fundamental principles and technologies designed for similarity comparison, searching, subsequence matching, classification and action detection in motion capture data. Specifically, we emphasize the importance of similarity needed to express the degree of accordance between pairs of motion sequences and also discuss the machine-learning approaches able to automatically acquire content-descriptive movement features. We explain how the concept of similarity together with the learned features can be employed for searching similar occurrences of actions of interest within a long motion sequence. Assuming a user-provided categorization of example motions, we discuss techniques able to recognize types of specific movement actions and detect such kinds of actions within continuous motion sequences. Selected operations will be demonstrated by on-line web applications.

### Structured Deep Learning for Pixel-level Understanding

•      Yunchao Wei
• Xiaodan Liang
• Si Liu
• Liang Lin

This article summarizes the corresponding half-day tutorial at ACM Multimedia 2018. This tutorial reviews recent progress in pixel-level understanding with structured deep learning, including 1) human-centric analysis: human parsing and pose estimation; 2) part-based analysis: object part and face parsing; 3) weakly-supervised analysis: object localization and semantic segmentation; 4) depth estimation: stereo matching.

### Social and Political Event Analysis based on Rich Media

•      Jungseock Joo
• Zachary C. Steinert-Threlkeld
• Jiebo Luo

This tutorial aims to provide a comprehensive overview of the applications of rich social media data for real-world social and political event analysis, which is an emerging topic in multimedia research. We will discuss the recent evolution of social media as venues for social and political interaction and their impacts on real-world events using specific examples. We will introduce large-scale datasets drawn from social media sources and review concrete research projects that build on computer vision and deep learning based methods. Existing research in social media has examined various patterns of information diffusion and contagion, user activities and networking, and social media-based predictions of real-world events. Most existing works, however, rely on non-content or text-based features and do not fully leverage rich multiple modalities -- visuals and acoustics -- which are prevalent in most online social media. Such approaches underutilize the vibrant and integrated characteristics of social media, especially because current audiences are increasingly attracted to visual information centric media. This proposal highlights the impacts of rich multimodal data on real-world events and elaborates on relevant recent research projects -- the concrete development, data governance, technical details, and their implications for politics and society -- on the following topics: 1) Decoding non-verbal content to identify intent and impact in political messages in mass and social media, such as political advertisements, debates, or news footage; 2) Recognition of emotion, expressions, and viewer perception from communicative gestures, gazes, and facial expressions; 3) Geo-coded Twitter image analysis for protest and social movement analysis; 4) Election outcome prediction and voter understanding using social media posts; and 5) Detection of misinformation, rumors, and fake news and analysis of their impacts in major political events such as the U.S. presidential election.

### To Recognize Families In the Wild: A Machine Vision Tutorial

•      Joseph P. Robinson
• Ming Shao
• Yun Fu

Automatic kinship recognition has relevance in an abundance of applications. For starters, it can aid forensic investigations, as kinship is a powerful cue that could narrow the search space (e.g., knowledge that the 'Boston Bombers' were brothers could have helped identify the suspects sooner). In short, many could benefit from such technologies: the consumer (e.g., automatic photo library management), the scholar (e.g., historic lineage and genealogical studies), the data analyzer (e.g., social-media-based analysis), the investigator (e.g., cases of missing children and human trafficking; for instance, it is unlikely that a missing child found online would be in any database, but more than likely a family member would be), or even refugees. Besides application-based problems, and as already hinted, kinship is a powerful cue that could serve as a face attribute capable of greatly reducing the search space in more general face-recognition problems. In this tutorial, we will introduce the background information, the progress leading up to this point, and several current state-of-the-art algorithms spanning various views of the kinship recognition problem (e.g., verification, classification, tri-subject). We will then cover our large-scale Families In the Wild (FIW) image collection and several challenge competitions it has been used in, along with the top-performing deep learning approaches. The tutorial will end with a discussion about future research directions and practical use-cases.

### Deep Learning Interpretation

•      Jitao Sang

Deep learning has been successfully exploited in addressing different multimedia problems in recent years. Academic researchers are now shifting their attention from identifying what problems deep learning CAN address to exploring what problems deep learning CAN NOT address. This tutorial starts with a summary of six 'CAN NOT' problems that deep learning fails to solve at the current stage, i.e., low stability, debugging difficulty, poor parameter transparency, poor incrementality, poor reasoning ability, and machine bias. These problems share a common origin in the lack of deep learning interpretation. This tutorial attempts to map the six 'CAN NOT' problems to three levels of deep learning interpretation: (1) Locating - accurately and efficiently locating which features contribute most to the output. (2) Understanding - bidirectional semantic access between human knowledge and deep learning algorithms. (3) Expandability - properly storing, accumulating and reusing the models learned by deep learning. Existing studies falling into these three levels will be reviewed in detail, and a discussion of interesting future directions will be provided at the end.

### Interactive Video Search: Where is the User in the Age of Deep Learning?

•      Klaus Schoeffmann
• Werner Bailer
• Cathal Gurrin
• Jakub Lokoč

In this tutorial we discuss interactive video search tools and methods, review their need in the age of deep learning, and explore video and multimedia search challenges and their role as evaluation benchmarks in the field of multimedia information retrieval. We cover three different campaigns (TRECVID, Video Browser Showdown, and the Lifelog Search Challenge), discuss their goals and rules, and present their achieved findings over the last half-decade. Moreover, we talk about datasets, tasks, evaluation procedures, and examples of interactive video search tools, as well as how they evolved over the years. Participants of this tutorial will be able to gain collective insights from all three challenges and use them for focusing their research efforts on outstanding problems that still remain unsolved in this area.

### Human Behavior Understanding: From Action Recognition to Complex Event Detection

•      Ting Yao
• Jingen Liu

Analyzing human behaviour in videos is one of the fundamental problems of computer vision and multimedia understanding. The task is very challenging, as video is an information-intensive medium with large variations and complexities in content. With the development of deep learning techniques, researchers have strived to push the limits of human behaviour understanding in a wide variety of applications from action recognition to event detection. This tutorial will present recent advances under the umbrella of human behaviour understanding, which range from the fundamental problem of how to learn "good" video representations, to the challenges of categorizing video content into human action classes, and finally to multimedia event detection and surveillance event detection in complex scenarios.

### The Importance of Medical Multimedia

•      Michael Riegler
• Pål Halvorsen
• Bernd Münzer
• Klaus Schoeffmann

Multimedia research is becoming more and more important for the medical domain, where an increasing number of videos and images are integrated in the daily routine of surgical and diagnostic work. While the collection of medical multimedia data is not an issue, appropriate tools for efficient use of this data are missing. This includes management and inspection of the data, visual analytics, as well as learning relevant semantics and using recognition results for optimizing surgical and diagnostic processes. The characteristics and requirements in this interesting but challenging field are different from those in classic multimedia domains. Therefore, this tutorial gives a general introduction to the field, provides a broad overview of specific requirements and challenges, discusses existing work and open challenges, and elaborates in detail how machine learning approaches can help in multimedia-related fields to improve the performance of surgeons/clinicians.

## SESSION: Workshop Summaries

### AltMM 2018 - 3rd International Workshop on Multimedia Alternate Realities

•      Teresa Chambel
• Francesca De Simone
• Rene Kaiser
• Nimesha Ranasinghe
• Wendy Van den Broeck

AltMM 2018 is the 3rd edition of the International Workshop on Multimedia Alternate Realities at ACM Multimedia. Our ambition remains to engage researchers and practitioners in discussions on how we can successfully create meaningful multimedia 'alternate realities' experiences. One of the main strengths of this workshop is that we combine different perspectives to explore how the synergy between multimedia technologies can foster and shape the creation of alternate realities and make their access an enriching and valuable experience.

### Summary for AVEC 2018: Bipolar Disorder and Cross-Cultural Affect Recognition

•      Fabien Ringeval
• Björn Schuller
• Michel Valstar
• Roddy Cowie
• Maja Pantic

The eighth Audio-Visual Emotion Challenge and workshop AVEC 2018 was held in conjunction with ACM Multimedia'18. This year, the AVEC series addressed major novelties with three distinct sub-challenges: bipolar disorder classification, cross-cultural dimensional emotion recognition, and emotional label generation from individual ratings. The Bipolar Disorder Sub-challenge was based on a novel dataset of structured interviews of patients suffering from bipolar disorder (BD corpus), the Cross-cultural Emotion Sub-challenge relied on an extension of the SEWA dataset, which includes human-human interactions recorded 'in-the-wild' for the German and the Hungarian cultures, and the Gold-standard Emotion Sub-challenge was based on the RECOLA dataset, which was previously used in the AVEC series for emotion recognition. In this summary, we mainly describe participation and conditions of the AVEC Challenge.

### CoVieW'18: The 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild

•      Kwanghoon Sohn
• Ming-Hsuan Yang
• Hyeran Byun
• Jongwoo Lim
• Jison Hsu
• Stephen Lin
• Euntai Kim
• Seungryong Kim

The 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild, dubbed CoVieW'18, is held in Seoul, Korea on October 22, 2018, in conjunction with ACM Multimedia 2018. The workshop aims to solve the joint and comprehensive understanding problem in untrimmed videos with a particular emphasis on joint action and scene recognition. The workshop encourages researchers to participate in the joint action and scene recognition challenge on untrimmed videos and to report their results. The workshop program includes 1 keynote speech, 2 invited speakers, and 6 regular and challenge papers. The developments made in the workshop will deliver a step change in a variety of video applications.

### HealthMedia 2018: Third International Workshop on Multimedia for Personal Health and Health Care

•      Jochen Meyer
• Susanne Boll
• Noel E. O'Connor
• Ramesh Jain
• Troy McDaniel

Research in multimedia and health is driven by the current technological advancements in sensors and personalized healthcare. There is an increasing amount of work that shows how core multimedia research is becoming an important enabler for solutions with applications and relevance for the societal questions of health. This workshop brings together researchers from diverse topics such as multimedia, pervasive health, lifelogging, accessibility, HCI, but also health, medicine, and psychology to address challenges and opportunities of multimedia in and for health.

### MAHCI 2018: The 1st Workshop on Multimedia for Accessible Human Computer Interface

•      Xueliang Liu
• Rui Min
• Benoit Huet
• Jia Jia

In the development of advanced human-computer interaction, multimedia technology plays a fundamental role in increasing the usability and accessibility of computer interfaces. The first workshop on Multimedia for Accessible Human Computer Interface (MAHCI) provides a forum for both multimedia and HCI researchers to discuss accessible human-computer interface design, development, and evaluation with state-of-the-art multimedia technology. It also enables the multimedia community to expand its interaction with the HCI industry and broaden the scope of deploying multimedia technology in practical applications. The workshop features 5 papers, which cover a number of novel applications and new methodologies in a half-day program.

### ASMMC-MMAC 2018: The Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and the First Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop

•      Dong-Yan Huang
• Sicheng Zhao
• Björn W. Schuller
• Hongxun Yao
• Jianhua Tao
• Min Xu
• Lei Xie
• Qingming Huang
• Jie Yang

Affective social multimedia computing is an emergent research topic for both the affective computing and multimedia research communities. Social multimedia is fundamentally changing how we communicate, interact, and collaborate with other people in our daily lives. Social multimedia contains much affective information. Effective extraction of affective information from social multimedia can greatly help social multimedia computing (e.g., processing, indexing, retrieval, and understanding). Besides, with the rapid development of digital photography and social networks, people have become used to sharing their lives and expressing their opinions online. As a result, user-generated social media data, including text, images, audio, and videos, grow rapidly, which urgently demands advanced techniques for the management, retrieval, and understanding of these data.

### AVSU: Workshop on Audio-Visual Scene Understanding for Immersive Multimedia

• Hong-Goo Kang
• Hansung Kim
• Kwanghoon Sohn

This workshop aims to provide a forum to exchange ideas in scene understanding techniques researched in the audio and visual communities, and to ultimately unlock the creative potential of joint audio-visual signal processing to deliver a step change in various multimedia applications. Papers and talks presented in this workshop will contribute to the emerging technology for audio and visual information that can improve traditional approaches for multimedia content production and reproduction. The goals of this workshop are to (1) present and discuss the latest trends in the audio and computer vision fields for the common research goals, (2) understand state-of-the-art techniques and bottlenecks in the other's discipline for the common topics, and (3) investigate research opportunities for joint audio-visual scene understanding in multimedia content production. This workshop will be a good opportunity to bring together leading experts in audio processing and computer vision, and will bridge the gap between the two research fields in multimedia content production and reproduction.

### 1st ACM International Workshop on Multimedia Content Analysis in Sports

•      Rainer Lienhart
• Thomas B. Moeslund
• Hideo Saito

The first ACM International Workshop on Multimedia Content Analysis in Sports (ACM MMSports'18) is held in Seoul, South Korea on October 26th, 2018 and is co-located with the ACM International Conference on Multimedia 2018 (ACM Multimedia 2018). The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining and content analysis of multimedia/multimodal data in sports. The combination of sports and modern technology offers a novel and intriguing field of research with promising approaches for visual broadcast augmentation, understanding, statistical analysis and evaluation, and sensor fusion. There is a lack of research communities focusing on the fusion of multiple modalities. We are helping to close this research gap with this first workshop in a series of workshops on multimedia content analysis in sports.

### EE-USAD: ACM MM 2018 Workshop on Understanding Subjective Attributes of Data focus on Evoked Emotions

•      Xavier Alameda-Pineda
• Miriam Redi
• Nicu Sebe
• Shih-Fu Chang
• Jiebo Luo

The series of events devoted to the computational Understanding of Subjective Attributes (e.g. beauty, sentiment) of Data (USAD) provide a complementary perspective to the analysis of tangible properties (objects, scenes), which overwhelmingly covered the spectra of applications in multimedia. Partly fostered by the wide-spread usage of social media, the analysis of subjective attributes has attracted lots of attention in the recent years, and many research teams at the crossroads of multimedia, computer vision and social sciences, devoted time and effort to this topic. Among the subjective attributes there are those assessed by individuals (e.g. safety, interestingness, evoked emotions [2], memorability [3]) as well as aggregated emergent properties (such as popularity or virality [1]). This edition of the workshop (see below for the workshop's history) is devoted to the multimodal recognition of evoked emotions (EE).