ICMR '18: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval


SESSION: Keynote 1

  •      Kiyoharu Aizawa

The Ongoing Evolution of Broadcast Technology

  •      Kohji Mitani

The media environment of program production, content delivery, and viewing has been changing because of progress in broadcasting and communication technologies and other technologies like IoT, cloud computing, and artificial intelligence (AI). In December 2018, 8K and 4K UHDTV satellite broadcasting will start in Japan, which means that viewers will soon be able to enjoy 8K and 4K programs featuring a wide color gamut and high dynamic range characteristics together with 22.2 multi-channel audio at home. Meanwhile, distribution services for sending content to PCs and smartphones through the Internet have rapidly been spreading and the introduction of the next generation of mobile networks (5G) will accelerate their spread. The coming of such advanced broadcast and broadband technologies and consequent changes in lifestyle will provide broadcasters with a great opportunity for a new stage of development. At NHK Science & Technology Research Laboratories (NHK STRL), we are pursuing a wide range of research with the aim of creating new broadcast services that can provide viewing experiences never before imagined and user experiences more attuned to daily life. To enhance the convenience of television and the value of TV programming, we are developing technology for connecting the TV experience with various activities in everyday life. Extensions to "Hybridcast Connect" will drive applications that link TVs, smartphones, and IoT. They will enable spontaneous consumption of content during everyday activities through various devices around the user. Establishing a new program production workflow with AI, which we call "Smart Production", is one of our most important research topics. We are developing speech and face recognition technologies for making closed captions and metadata efficiently, as well as technologies for automatically converting content into computer-generated sign language, audio descriptions, and simplified Japanese. This presentation introduces these research achievements targeting 2020 and beyond, as well as other broadcasting technology trends including 4K8K UHDTV broadcasting in Japan, 3D imaging, and VR/AR.

SESSION: Keynote 2

  •      Shin'ichi Satoh

Prototyping for Envisioning the Future

  •      Yamanaka Shunji

As an industrial designer, I have worked in collaboration with various researchers and scientists since the beginning of this century. I have made many prototypes that show the potential of their leading-edge technologies and have exhibited them over the years. As archives of academic documents and papers have become open, and the internet has given the public access to recordings of experiments conducted throughout the world, laboratory technologies are now constantly exposed to the public. In this context, prototypes are becoming more important as the medium that bridges advanced technology and society. A prototype is no longer merely an experimental machine. It is a device created to present the user experience in advance, to share the benefits of the technology with many others. The role of a prototype is not limited to sharing values within the development team, but goes beyond that: it is a medium used to voice the significance of research and development to society, an inspiration to stimulate future markets, and also a tool to secure development budgets. A prototype is the physical embodiment of a speculative story that connects people to technology that has yet to be brought to society. I would like to introduce some of the prototypes we developed and share the future vision they evoke.

SESSION: Industrial Talks

  •      Go Irie
  • Tao Mei

Orion: An Integrated Multimedia Content Moderation System for Web Services

  •      Yusuke Fujisaka

Social Networking Services (SNS) depend on user-generated content (UGC). A fraction of UGC is considered spam, such as adult, scam, and abusive content. In order to maintain service reliability and avoid criminal activity, content moderation is employed to eliminate spam from SNS. Content moderation consists of manual content-monitoring operations and/or automatic spam filtering. Detecting a small portion of spam among a large amount of UGC mostly relies on manual operation, so it requires a large number of human operators and sometimes suffers from human error. In contrast, automatic spam filtering can be performed at lower cost; however, it is difficult to keep up with continuously changing spam trends, and false positives may degrade the service experience. This presentation introduces an integrated content moderation platform called "Orion'', which aims to minimize manual processing and maximize the detection of spam in UGC data. Orion preserves post history by user and service, which enables calculating the risk level of each user and deciding whether monitoring is required. Orion also has a scalable API that can run a number of machine-learning-based filtering processes, such as DNN (deep neural network) and SVM models for the text and images posted in many SNS systems. We show that Orion improves the efficiency of content moderation compared to a fully manual operation.

Industrial Applications of Image Recognition and Retrieval Technologies for Public Safety and IT Services

  •      Tomokazu Murakami

Hitachi has a wide variety of technologies ranging from infrastructure systems to IT platforms, such as railway management systems, water supply operation systems, manufacturing management systems for factories, surveillance cameras and monitoring systems, rolling stock, power plants, servers, storage, data centers, and various IT systems for governments and companies. Hitachi's research and development group is developing video analytics and other media processing techniques and, together with the business divisions, applying them to various products and solutions for public safety, productivity improvement in factories, and other IT applications. In this talk, I would like to introduce some of the products, solutions, and research topics in Hitachi to which video analytics and image retrieval techniques are applied. These include an image search system for retrieving publicly registered design graphics, a person detection and tracking function for video surveillance systems, and our activities and results in TRECVID 2017. In each case, we integrated our original high-speed image search database with deep-learning-based image recognition techniques. Through these use cases, I would like to present how image recognition and retrieval technologies are practically applied to industrial products and solutions and contribute to the improvement of social welfare.

NEC's Object Recognition Technologies and their Industrial Applications

  •      Kota Iwamoto

Recent advancements in image recognition technologies have enabled image recognition-based systems to be widely used in real-world applications. In this talk, I will introduce NEC's image-based object recognition technologies targeted at recognizing various manufactured goods and retail products from a camera, and talk about the industrial applications we have developed and commercialized around them. These image-based object recognition technologies enable highly efficient and cost-effective management of goods and products throughout their life-cycle (manufacturing, distribution, retail, and consumption), which otherwise cannot be achieved by human labor or by the use of ID tags. First, I will talk about a technology to recognize multiple objects in a single image using feature matching of compact local descriptors, combined with more recent deep-learning-based recognition. It enables a large number of objects to be recognized at once, which greatly reduces the human labor and time required for various product inspection and checking tasks. Using this technology, we have developed and commercialized a product inspection system for warehouses, a planogram recognition system for retail shop shelves, and a self-service POS system for easy-to-use and fast checkout in retail stores. Second, I will talk about the "Fingerprint of Things'' technology. It enables individual identification of tiny manufactured parts (e.g., bolts and nuts) by identifying images of their unique surface patterns, just like human fingerprints. We have built a prototype of a mass-produced parts traceability system, which enables users to easily track down individual parts using a mobile device. In the talk, I will explain the key issues in realizing these industrial applications of image-based object recognition technologies.

Promoting Open Innovations in Real Estate Tech: Provision of the LIFULL HOME'S Data Set and Collaborative Studies

  •      Yoji Kiyota

The LIFULL HOME'S Data Set, which has been provided for academic use since November 2015, is being used for research in a variety of fields such as economics, architecture, and urban science. In particular, since it contains 83 million property images and 5.1 million floor plan images, its use in the computer vision and multimedia fields is thriving, and papers using the dataset have also been accepted at top venues in the image processing field such as ICCV 2017. This presentation summarizes the results that have been obtained through the provision of the dataset and presents plans to promote open innovation in the field of real estate technology.

SESSION: Tutorials

Objects, Relationships, and Context in Visual Data

  •      Hanwang Zhang
  • Qianru Sun

For decades, we have been interested in detecting objects and classifying them into a fixed vocabulary. With the maturity of these low-level vision solutions, we hunger for a higher-level representation of visual data, so as to extract visual knowledge rather than merely bags of visual entities, allowing machines to reason about human-level decision-making and even manipulate visual data at the pixel level. In this tutorial, we will introduce a variety of machine learning techniques for modeling visual relationships (e.g., subject-predicate-object triplet detection) and contextual generative models (e.g., generating photo-realistic images using conditional generative adversarial networks). In particular, we plan to start from fundamental theories of object detection, relationship detection, and generative adversarial networks, and then move to more advanced topics such as referring expression visual grounding, pose-guided person image generation, and context-based image inpainting.

Recommendation Technologies for Multimedia Content

  •      Xiangnan He
  • Hanwang Zhang
  • Tat-Seng Chua

Recommendation systems play a vital role in online information systems and have become a major monetization tool for user-oriented platforms. In recent years, there has been increasing research interest in recommendation technologies in the information retrieval and data mining communities, and significant progress has been made owing to the fast development of deep learning. However, in the multimedia community, relatively little attention has been paid to the development of multimedia recommendation technologies. In this tutorial, we summarize existing research efforts on multimedia recommendation. We first provide an overview of fundamental techniques and recent advances in personalized recommendation for general items. We then summarize existing developments in recommendation technologies for multimedia content. Lastly, we present insights into the challenges and future directions in this emerging and promising area.

Multimedia Content Understanding by Learning from Very Few Examples: Recent Progress on Unsupervised, Semi-Supervised and Supervised Deep Learning Approaches

  •      Guo-Jun Qi

In this tutorial, the speaker will present several parallel efforts on building deep learning models with very little supervision, with or without unlabeled data available. In particular, we will discuss in detail: (1) Generative Adversarial Nets (GANs) and their applications to unsupervised feature extraction and to semi-supervised learning with few labeled examples and a large amount of unlabeled data. We will discuss the state-of-the-art results that have been achieved by semi-supervised GANs. (2) Low-shot learning algorithms that train and test models on disjoint sets of tasks. We will discuss how to efficiently adapt models to tasks with very few examples; in particular, we will cover several paradigms of learning-to-learn approaches. (3) How to transfer models across modalities by leveraging abundant labels from one modality to train a model for other modalities with few labels. We will discuss the cross-modal label transfer approach in detail.

SESSION: Best Paper Session

  •      Benoit Huet

Ranking News-Quality Multimedia

  •      Gonçalo Marcelino
  • Ricardo Pinto
  • João Magalhães

News editors need to find the photos that best illustrate a news piece and fulfill news-media quality standards, while being pressed to also find the most recent photos of live events. Recently, it has become common to use social-media content in the context of news media for its unique value in terms of immediacy and quality. Consequently, the number of images to be considered and filtered through is now too large to be handled by a person. To aid the news editor in this process, we propose a framework designed to deliver high-quality, news-press-type photos to the user. The framework is composed of two parts: a ranking algorithm tuned to rank professional media highly, and a visual spam detection module designed to filter out low-quality media. The core ranking algorithm leverages aesthetic, social, and deep-learning semantic features. Evaluation showed that the proposed framework is effective at finding high-quality photos (true-positive rate), achieving a retrieval MAP of 64.5% and a classification precision of 70%.

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

  •      Niluthpol Chowdhury Mithun
  • Juncheng Li
  • Florian Metze
  • Amit K. Roy-Chowdhury

Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications. While there are a number of recent successes in developing effective image-text retrieval methods by learning joint representations, the video-text retrieval task has not yet been explored to its fullest extent. In this paper, we study how to effectively utilize available multimodal cues from videos for the cross-modal video-text retrieval task. Based on our analysis, we propose a novel framework that simultaneously utilizes multi-modal features (different visual characteristics, audio inputs, and text) through a fusion strategy for efficient retrieval. Furthermore, we explore several loss functions for training the embedding and propose a modified pairwise ranking loss for the task. Experiments on the MSVD and MSR-VTT datasets demonstrate that our method achieves significant performance gains compared to state-of-the-art approaches.
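The abstract does not spell out the modified loss; as background, the following is a minimal numpy sketch of the standard max-margin pairwise ranking objective that joint video-text embedding methods typically start from. The margin value and the random stand-in embeddings are assumptions for illustration only.

```python
# Minimal sketch of a max-margin pairwise ranking loss for a joint video-text
# embedding space (illustrative only; the paper's modified loss and its fusion
# strategy are not reproduced here).
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def pairwise_ranking_loss(video_emb, text_emb, margin=0.2):
    """video_emb, text_emb: (batch, dim) embeddings; row i of each is a matching pair."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    sim = v @ t.T                      # cosine similarity matrix, (batch, batch)
    pos = np.diag(sim)                 # similarities of the matching pairs
    # hinge terms: a non-matching text ranked too close to the video, and vice versa
    cost_t = np.maximum(0.0, margin + sim - pos[:, None])   # video -> wrong text
    cost_v = np.maximum(0.0, margin + sim - pos[None, :])   # text  -> wrong video
    np.fill_diagonal(cost_t, 0.0)
    np.fill_diagonal(cost_v, 0.0)
    return (cost_t.sum() + cost_v.sum()) / v.shape[0]

# toy usage with random features standing in for fused video and text embeddings
rng = np.random.default_rng(0)
print(pairwise_ranking_loss(rng.normal(size=(8, 256)), rng.normal(size=(8, 256))))
```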

Class-aware Self-Attention for Audio Event Recognition

  •      Shizhe Chen
  • Jia Chen
  • Qin Jin
  • Alexander Hauptmann

Audio event recognition (AER) has been an important research problem with a wide range of applications. However, it is very challenging to develop large-scale audio event recognition models. On the one hand, usually only "weakly" labeled audio training data are available, which contain labels of audio events without temporal boundaries. On the other hand, the distribution of audio events is generally long-tailed, with only a few positive samples for a large number of audio events. These two issues make it hard to learn discriminative acoustic features to recognize audio events, especially long-tailed events. In this paper, we propose a novel class-aware self-attention mechanism with attention factor sharing to generate discriminative clip-level features for audio event recognition. Since a target audio event only occurs in part of an entire audio clip and its corresponding temporal interval varies, the proposed class-aware self-attention approach learns to highlight relevant temporal intervals and suppress irrelevant noise at the same time. In order to learn attention patterns effectively for long-tailed events, we combine domain knowledge and data-driven strategies to share attention factors in the proposed attention mechanism, which transfers common knowledge learned from other similar events to the rare events. The proposed attention mechanism is a pluggable component and can be trained end-to-end in the overall AER model. We evaluate our model on the large-scale audio event corpus "Audio Set" with both short-term and long-term acoustic features. The experimental results demonstrate the effectiveness of our model, which improves overall audio event recognition performance with different acoustic features, especially for events with low resources. Moreover, the experiments also show that our proposed model is able to learn new audio events with a few training examples effectively and efficiently without disturbing previously learned audio events.
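As a rough illustration of class-wise attention pooling over frame-level features (not the paper's attention-factor-sharing scheme), the sketch below gives each class its own attention vector that scores temporal segments and produces one attention-pooled clip feature per class. The dimensions and random inputs are assumptions.

```python
# Minimal sketch of class-wise attention pooling over frame-level features:
# each class has its own attention vector that scores temporal frames, and the
# clip-level feature for that class is the attention-weighted temporal average.
# (Illustrative only; the paper's attention-factor sharing is not modeled.)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def class_attention_pool(frames, attn_vectors):
    """frames: (T, D) frame-level acoustic features.
    attn_vectors: (C, D) one attention vector per audio-event class.
    Returns (C, D): one attention-pooled clip feature per class."""
    scores = attn_vectors @ frames.T          # (C, T) relevance of each frame to each class
    weights = softmax(scores, axis=1)         # normalize over time
    return weights @ frames                   # (C, D) weighted temporal average

rng = np.random.default_rng(1)
clip = class_attention_pool(rng.normal(size=(100, 128)), rng.normal(size=(10, 128)))
print(clip.shape)  # (10, 128)
```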

Mining Exoticism from Visual Content with Fusion-based Deep Neural Networks

  •      Andrea Ceroni
  • Chenyang Ma
  • Ralph Ewerth

Exoticism is the charm of the unfamiliar: it often connotes the unusual and the mysterious, and it can evoke the atmosphere of remote lands. Although it has received interest in different arts, such as painting and music, no study has been conducted on understanding exoticism from a computational perspective. To the best of our knowledge, this work is the first to explore the problem of exoticism-aware image classification, aiming at automatically measuring the amount of exoticism in images and investigating the significant aspects of the task. The estimation of image exoticism could be applied in fields like advertising and travel suggestion, as well as to increase the serendipity and diversity of recommendations and search results. We propose a Fusion-based Deep Neural Network (FDNN) for this task, which combines image representations learned by deep neural networks with visual and semantic hand-crafted features. Comparisons with other machine learning models show that our proposed architecture is the best performing one, reaching accuracy over 83% and 91% on two different datasets. Moreover, experiments with classifiers exploiting both visual and semantic features allow us to analyze which aspects are most important for identifying exotic content. Ground truth has been gathered by retrieving exotic and non-exotic images through a web search engine by posing queries with exotic and non-exotic semantics, and then assessing the exoticism of the retrieved images via a crowdsourcing evaluation. The dataset is publicly released to promote advances in this novel field.

SESSION: Oral Session 1: Multimedia Retrieval

  •      Qi Tan

Modal-adversarial Semantic Learning Network for Extendable Cross-modal Retrieval

  •      Xing Xu
  • Jingkuan Song
  • Huimin Lu
  • Yang Yang
  • Fumin Shen
  • Zi Huang

Cross-modal retrieval, e.g., using an image query to search related text and vice versa, has become a highlighted research topic that provides a flexible retrieval experience across multi-modal data. Existing approaches usually consider the so-called non-extendable cross-modal retrieval task. In this task, they learn a common latent subspace from a source set containing labeled instances of image-text pairs and then generate common representations for the instances in a target set to perform cross-modal matching. However, these methods may not generalize well when the target set contains instances of unseen classes, since the instances of both the source and target sets are assumed to share the same range of classes in the non-extendable cross-modal retrieval task. In this paper, we consider the more practical extendable cross-modal retrieval task, where instances in the source and target sets have disjoint classes. We propose a novel framework, termed Modal-adversarial Semantic Learning Network (MASLN), to tackle the limitations of existing methods on this practical task. Specifically, the proposed MASLN consists of two subnetworks for cross-modal reconstruction and modal-adversarial semantic learning. The former minimizes the cross-modal distribution discrepancy by reconstructing each modality's data mutually, with class embeddings as side information guiding the reconstruction procedure. The latter generates a semantic representation that is indiscriminative across modalities, while an adversarial learning mechanism tries to distinguish the modalities from the common representation. The two subnetworks are jointly trained to enhance the cross-modal semantic consistency in the learned common subspace and the knowledge transfer to instances in the target set. Comprehensive experiments on three widely used multi-modal datasets show its effectiveness and robustness on both the non-extendable and extendable cross-modal retrieval tasks.

Cross-Modal Retrieval Using Deep De-correlated Subspace Ranking Hashing

  •      Kevin Joslyn
  • Kai Li
  • Kien A. Hua

Cross-modal hashing has become a popular research topic in recent years due to the efficiency of storing and retrieving high-dimensional multimodal data represented by compact binary codes. While most cross-modal hash functions use binary space partitioning functions (e.g., the sign function), our method uses ranking-based hashing, which is based on numerically stable and scale-invariant rank correlation measures. In this paper, we propose a novel deep learning architecture called Deep De-correlated Subspace Ranking Hashing (DDSRH) that uses feature-ranking methods to determine the hash codes for the image and text modalities in a common Hamming space. Specifically, DDSRH learns a set of de-correlated nonlinear subspaces onto which to project the original features, so that the hash code can be determined by the relative ordering of projected feature values in a given optimized subspace. The network relies upon a pre-trained deep feature learning network for each modality, and a hashing network responsible for optimizing the hash codes based on the known similarity of the training image-text pairs. Our proposed method includes both architectural and mathematical techniques designed specifically for ranking-based hashing in order to achieve de-correlation between the bits, bit balancing, and quantization. Finally, through extensive experimental studies on two widely used multimodal datasets, we show that the combination of these techniques can achieve state-of-the-art performance on several benchmarks.
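To illustrate the general idea of determining a code digit from the relative ordering of projected feature values (not DDSRH's learned, de-correlated subspaces), here is a minimal sketch in which random projections stand in for the learned subspaces and each digit is the index of the largest projection.

```python
# Minimal sketch of ranking-based hashing: for each subspace, project the
# feature and emit the index of the largest projection as the code digit.
# (Illustrative only; DDSRH's de-correlation, bit balancing, and training are
# omitted, and random projections stand in for learned subspaces.)
import numpy as np

def ranking_hash(x, subspaces):
    """x: (D,) feature vector.
    subspaces: list of (k, D) projection matrices, one per code digit.
    Returns a list of integers in [0, k): the rank-based code."""
    return [int(np.argmax(W @ x)) for W in subspaces]

rng = np.random.default_rng(2)
feature = rng.normal(size=512)                               # e.g. a CNN image descriptor
subspaces = [rng.normal(size=(4, 512)) for _ in range(16)]   # 16 digits, 4-dim subspaces
print(ranking_hash(feature, subspaces))
```

Because only the relative ordering of the projections matters, the resulting code is invariant to rescaling of the input feature, which is the scale-invariance property the abstract highlights.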

Learning Multilevel Semantic Similarity for Large-Scale Multi-Label Image Retrieval

  •      Ge Song
  • Xiaoyang Tan

We present a novel Deep Supervised Hashing with code operation (DSOH) method for large-scale multi-label image retrieval. This approach contrasts with existing methods in that we respect both the intention gap and the intrinsic multilevel similarity of multi-labels. In particular, our method allows a user to simultaneously present multiple query images rather than a single one to better express their intention, and correspondingly a separate sub-network in our architecture is specifically designed to fuse the query intention represented by each single query. Furthermore, since in the training stage each image is annotated with multiple labels to enrich its semantic representation, we propose a new margin-adaptive triplet loss to learn the fine-grained similarity structure of multi-labels, which is known to be hard to capture. The whole system is trained in an end-to-end manner, and our experimental results demonstrate that the proposed method is not only able to learn useful multilevel semantic similarity-preserving binary codes but also achieves state-of-the-art retrieval performance on three popular datasets.
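The exact margin-adaptation rule is not given in the abstract; the following is a plausible minimal sketch in which the triplet margin grows with the gap in label overlap between the positive and the negative. All names, the Jaccard overlap choice, and the base margin are assumptions for illustration.

```python
# Minimal sketch of a triplet loss whose margin adapts to multi-label similarity:
# the larger the gap in label overlap between positive and negative, the larger
# the required margin. (A plausible illustration only; the paper's exact
# margin-adaptation rule is not reproduced.)
import numpy as np

def label_overlap(a, b):
    """Jaccard overlap of two binary multi-label vectors."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def adaptive_triplet_loss(f_a, f_p, f_n, y_a, y_p, y_n, base_margin=0.5):
    margin = base_margin * (label_overlap(y_a, y_p) - label_overlap(y_a, y_n))
    d_pos = np.sum((f_a - f_p) ** 2)   # anchor-positive distance
    d_neg = np.sum((f_a - f_n) ** 2)   # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(3)
y = lambda: rng.integers(0, 2, size=20)   # random 20-dim multi-label vectors
f = lambda: rng.normal(size=64)           # random 64-dim embeddings
print(adaptive_triplet_loss(f(), f(), f(), y(), y(), y()))
```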

Multi-view Collective Tensor Decomposition for Cross-modal Hashing

  •      Limeng Cui
  • Zhensong Chen
  • Jiawei Zhang
  • Lifang He
  • Yong Shi
  • Philip S. Yu

Multimedia data available in various disciplines are usually heterogeneous, containing representations in multiple views, so cross-modal search techniques become necessary and useful. This is a challenging problem due to the heterogeneity of data with multiple modalities, multiple views in each modality, and diverse data categories. In this paper, we propose a novel multi-view cross-modal hashing method named Multi-view Collective Tensor Decomposition (MCTD) to fuse these data effectively, which can exploit the complementary features extracted from multi-modality, multi-view data while simultaneously discovering multiple separated subspaces by leveraging the data categories as supervision information. Our contributions are summarized as follows: 1) we exploit tensor modeling to obtain a better representation of the complementary features and redefine a latent representation space; 2) a block-diagonal loss is proposed to explicitly pursue a more discriminative latent tensor space by exploring supervision information; and 3) we propose a new feature projection method to characterize the data and to generate the latent representation for incoming new queries. An optimization algorithm is proposed to solve the objective function designed for MCTD, which works under an iterative updating procedure. Experimental results demonstrate the state-of-the-art precision of MCTD compared with competing methods.

Binary Coding by Matrix Classifier for Efficient Subspace Retrieval

  •      Lei Zhou
  • Xiao Bai
  • Xianglong Liu
  • Jun Zhou

Fast retrieval in large-scale databases with high-dimensional subspaces is an important task in many applications, such as image retrieval, video retrieval, and visual recognition. This can be facilitated by approximate nearest subspace (ANS) retrieval, which requires an effective subspace representation. Most existing methods for this problem represent a subspace as a point in Euclidean space or Grassmannian space before applying approximate nearest neighbor (ANN) search. However, the efficiency of these methods cannot be guaranteed because the subspace representation step can be very time-consuming when coping with high-dimensional data. Moreover, transforming a subspace into a point causes loss of subspace structural information, which affects retrieval accuracy. In this paper, we present a new approach for hashing-based ANS retrieval. The proposed method learns binary codes for a given subspace set following a similarity-preserving criterion. It simultaneously leverages the learned binary codes to train matrix classifiers as hash functions. This method can directly binarize a subspace without transforming it into a vector. Therefore, it can efficiently solve the large-scale, high-dimensional multimedia data retrieval problem. Experiments on face recognition and video retrieval show that our method outperforms several state-of-the-art methods in both efficiency and accuracy.

Instance Image Retrieval by Aggregating Sample-based Discriminative Characteristics

  •      Zhongyan Zhang
  • Lei Wang
  • Yang Wang
  • Luping Zhou
  • Jianjia Zhang
  • Fang Chen

Identifying the discriminative characteristic of a query is important for image retrieval. For retrieval without human interaction, such characteristic is usually obtained by average query expansion (AQE) or its discriminative variant (DQE) learned from pseudo-examples online, among others. In this paper, we propose a new query expansion method to further improve the above ones. The key idea is to learn a "unique'' discriminative characteristic for each database image, in an offline manner. During retrieval, the characteristic of a query is obtained by aggregating the unique characteristics of the query-relevant images collected from an initial retrieval result. Compared with AQE which works in the original feature space, our method works in the space of the unique characteristics of database images, significantly enhancing the discriminative power of the characteristic identified for a query. Compared with DQE, our method needs neither pseudo-labeled negatives nor the online learning process, leading to more efficient retrieval and even better performance. The experimental study conducted on seven benchmark datasets verifies the considerable improvement achieved by the proposed method, and also demonstrates its application to the state-of-the-art diffusion-based image retrieval.

SESSION: Oral Session 2: Multimedia Content Analysis

  •      Wei-Ta Chu

Deep Extreme Multi-label Learning

  •      Wenjie Zhang
  • Junchi Yan
  • Xiangfeng Wang
  • Hongyuan Zha

Extreme multi-label learning (XML) or classification has been a practical and important problem since the boom of big data. The main challenge lies in the exponential label space, which involves 2^L possible label sets, especially when the label dimension L is huge, e.g., in the millions for Wikipedia labels. This paper is motivated to better explore the label space by establishing an explicit label graph. Meanwhile, deep learning has been widely studied and used in various classification problems, including multi-label classification; however, it has not been properly introduced to XML, where the label space can be as large as millions. In this paper, we propose a practical deep embedding method for extreme multi-label classification, which harvests the ideas of non-linear embedding and graph-prior-based label space modeling simultaneously. Extensive experiments on public datasets for XML show that our method performs competitively against state-of-the-art results.

Multimodal Network Embedding via Attention based Multi-view Variational Autoencoder

  •      Feiran Huang
  • Xiaoming Zhang
  • Chaozhuo Li
  • Zhoujun Li
  • Yueying He
  • Zhonghua Zhao

Learning embeddings for social media data has attracted extensive research interest and has enabled many applications, such as classification and link prediction. In this paper, we examine the scenario of a multimodal network with nodes containing multimodal content and connected by heterogeneous relationships, such as social images containing multimodal content (e.g., visual content and text descriptions) and linked in various ways (e.g., in the same album or with the same tag). However, given such a multimodal network, simply learning the embedding from the network structure or a subset of the content results in sub-optimal representations. In this paper, we propose a novel deep embedding method, i.e., the Attention-based Multi-view Variational Auto-Encoder (AMVAE), to incorporate both the link information and the multimodal content for more effective and efficient embedding. Specifically, we adopt an LSTM with an attention model to learn the correlation between different data modalities, such as the correlation between visual regions and specific words, to obtain the semantic embedding of the multimodal content. Then, the link information and the semantic embedding are considered as two correlated views. A multi-view correlation learning based Variational Auto-Encoder (VAE) is proposed to learn the representation of each node, in which the embeddings of link information and multimodal content are integrated and mutually reinforced. Experiments on three real-world datasets demonstrate the superiority of the proposed model in two applications, i.e., multi-label classification and link prediction.

Exploiting Relational Information in Social Networks using Geometric Deep Learning on Hypergraphs

  •      Devanshu Arya
  • Marcel Worring

Online social networks are constituted by a diverse set of entities, including users, images, and posts, which makes the task of predicting interdependencies between entities challenging. We need a model that transfers information from a given type of relations between entities to predict other types of relations, irrespective of the type of entity. In order to devise a generic framework, one needs to capture the relational information between entities without any entity-dependent information. However, there are two challenges: (a) a social network has an intrinsic community structure, and in these communities some relations are much more complicated than pairwise relations and thus cannot simply be modeled by a graph; (b) there are different types of entities and relations in a social network, and taking all of them into account makes it difficult to formulate a model. In this paper, we claim that representing social networks using hypergraphs improves the task of predicting missing information about an entity by capturing higher-order relations. We study the behavior of our method by performing experiments on the CLEF dataset consisting of images from Flickr, an online photo-sharing social network.

Automatic Prediction of Building Age from Photographs

  •      Matthias Zeppelzauer
  • Miroslav Despotovic
  • Muntaha Sakeena
  • David Koch
  • Mario Döller

We present a first method for the automated age estimation of buildings from unconstrained photographs. To this end, we propose a two-stage approach that first learns characteristic visual patterns for different building epochs at the patch level and then globally aggregates patch-level age estimates over the building. We compile evaluation datasets from different sources and perform a detailed evaluation of our approach, its sensitivity to parameters, and the capability of the employed deep networks to learn characteristic visual age-related patterns. Results show that our approach is able to estimate building age at a surprisingly high level that even outperforms human evaluators, thereby setting a new performance baseline. This work represents a first step towards the automated assessment of building parameters for automated price prediction.

The PMEmo Dataset for Music Emotion Recognition

  •      Kejun Zhang
  • Hui Zhang
  • Simeng Li
  • Changyuan Yang
  • Lingyun Sun

Music Emotion Recognition (MER) has recently received considerable attention. To support MER research, which requires large music content libraries, we present the PMEmo dataset, containing emotion annotations for 794 songs as well as simultaneously recorded electrodermal activity (EDA) signals. A carefully designed music emotion experiment involving 457 subjects was conducted to collect this high-quality, affect-annotated music corpus.

The dataset is publicly available to the research community and is foremost intended for benchmarking in music emotion retrieval and recognition. To allow straightforward evaluation of methodologies for music affective analysis, it also includes pre-computed audio feature sets. In addition, manually selected chorus excerpts (compressed in MP3) of the songs are provided to facilitate chorus-related research.

In this article, we describe in detail the resource acquisition, subject selection, experiment design, and annotation collection procedures, as well as the dataset content and data reliability analysis. We also illustrate its usage in some simple music emotion recognition tasks, which attests to the PMEmo dataset's suitability for MER work. Compared to other homogeneous datasets, PMEmo is novel in the organization and management of the recruited annotators, and it is also characterized by its large amount of music with simultaneous physiological signals.

SESSION: Oral Session 3: Multimedia Applications

  •      Wolfgang Hürst

Interpretable Partitioned Embedding for Customized Multi-item Fashion Outfit Composition

  •      Zunlei Feng
  • Zhenyun Yu
  • Yezhou Yang
  • Yongcheng Jing
  • Junxiao Jiang
  • Mingli Song

Intelligent fashion outfit composition has become more and more popular in recent years, and some deep learning based approaches have recently shown competitive composition ability. However, their lack of interpretability means that such approaches cannot meet the urge of designers, businesses, and consumers to comprehend the importance of different attributes in an outfit composition. To realize interpretable and customized multi-item fashion outfit compositions, we propose a partitioned embedding network to learn interpretable embeddings from clothing items. The network consists of two vital components: an attribute partition module and a partition adversarial module. In the attribute partition module, multiple attribute labels are adopted to ensure that different parts of the overall embedding correspond to different attributes. In the partition adversarial module, adversarial operations are adopted to achieve independence between the different parts. With the interpretable and partitioned embedding, we then construct an outfit composition graph and an attribute matching map. Extensive experiments demonstrate that 1) the partitioned embedding has unmingled parts corresponding to different attributes and 2) outfits recommended by our model are more desirable in comparison with those of existing methods.

A Multi-Oriented Scene Text Detector with Position-Sensitive Segmentation

  •      Peirui Cheng
  • Weiqiang Wang

Scene text detection has been studied for a long time, and many approaches have achieved promising performance. Most approaches regard text as a specific object and utilize popular object detection frameworks to detect scene text. However, scene text differs from general objects in terms of orientations, sizes, and aspect ratios. In this paper, we present an end-to-end multi-oriented scene text detection approach, which combines an object detection framework with position-sensitive segmentation. For a given image, features are extracted through a fully convolutional network. They are then fed into the text detection branch and the position-sensitive segmentation branch simultaneously, where the text detection branch generates candidates and the position-sensitive segmentation branch generates segmentation maps. Finally, the candidates generated by the text detection branch are projected onto the position-sensitive segmentation maps for filtering. The proposed approach utilizes the merits of position-sensitive segmentation to improve the expressiveness of the proposed network. Additionally, the approach uses the position-sensitive segmentation maps to further filter the candidates so as to greatly improve the precision rate. Experiments on the ICDAR2015 and COCO-Text datasets demonstrate that the proposed method outperforms previous state-of-the-art methods. On the ICDAR2015 dataset, the proposed method achieves an F-score of 0.83 and a precision rate of 0.87.

Scene Text Detection and Tracking in Video with Background Cues

  •      Lan Wang
  • Yang Wang
  • Susu Shan
  • Feng Su

Detecting scene text in video is valuable for many content-based video applications. In this paper, we present a novel scene text detection and tracking method for videos, which effectively exploits cues from the background regions of the text. Specifically, we first extract text candidates and potential background regions of text from the video frame. Then, we exploit the spatial, shape, and motion correlations between the text and its background region with a bipartite graph model and the random walk algorithm to refine the text candidates for improved accuracy. We also present an effective tracking framework for text in video, making use of the temporal correlation of text cues across successive frames, which contributes to enhancing both the precision and the recall of the final text detection result. Experiments on public scene text video datasets demonstrate the state-of-the-art performance of the proposed method.

SESSION: Oral Session 4: Video Analysis

  •      Koichi Shinoda

Recognizing Actions in Wearable-Camera Videos by Training Classifiers on Fixed-Camera Videos

  •      Yang Mi
  • Kang Zheng
  • Song Wang

Recognizing human actions in wearable camera videos, such as videos taken by GoPro or Google Glass, can benefit many multimedia applications. Because the complex, non-stop motion of the camera is mixed in, motion features extracted from videos of the same action may show very large variation and inconsistency. It is very difficult to collect sufficient videos to cover all such variations and use them to train action classifiers with good generalization ability. In this paper, we develop a new approach that trains action classifiers on a relatively smaller set of fixed-camera videos with different views, and then applies them to recognize actions in wearable-camera videos. In this approach, we temporally divide the input video into many shorter video segments and transform the motion features in each segment into stable ones, in terms of a fixed view defined by an anchor frame in the segment. Finally, we use sparse coding to estimate the action likelihood in each segment, and then combine the likelihoods from all the video segments for action recognition. We conduct experiments by training on a set of fixed-camera videos and testing on a set of wearable-camera videos, with very promising results.

Annotating, Understanding, and Predicting Long-term Video Memorability

  •      Romain Cohendet
  • Karthik Yadati
  • Ngoc Q. K. Duong
  • Claire-Hélène Demarty

Memorability can be regarded as a useful metric of video importance to help make a choice between competing videos. Research on computational understanding of video memorability is however in its early stages. There is no available dataset for modelling purposes, and the few previous attempts provided protocols to collect video memorability data that would be difficult to generalize. Furthermore, the computational features needed to build a robust memorability predictor remain largely undiscovered. In this article, we propose a new protocol to collect long-term video memorability annotations. We measure the memory performances of 104 participants from weeks to years after memorization to build a dataset of 660 videos for video memorability prediction. This dataset is made available for the research community. We then analyze the collected data in order to better understand video memorability, in particular the effects of response time, duration of memory retention and repetition of visualization on video memorability. We finally investigate the use of various types of audio and visual features and build a computational model for video memorability prediction. We conclude that high level visual semantics help better predict the memorability of videos.

Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection

  •      Daniel Rotman
  • Dror Porat
  • Gal Ashour
  • Udi Barzelay

Video scene detection is the task of temporally dividing a video into its semantic sections. This is an important preliminary step for effective analysis of heterogeneous video content. We present a unique formulation of this task as a generic optimization problem with a novel normalized cost function, aimed at optimal grouping of consecutive shots into scenes. The mathematical properties of the proposed normalized cost function enable robust scene detection, also in challenging real-world scenarios. We present a novel dynamic programming formulation for efficiently optimizing the proposed cost function despite an inherent dependency between subproblems. We use deep neural network models for visual and audio analysis to encode the semantic elements in the video scene, enabling effective and more accurate video scene detection. The proposed method has two key advantages compared to other approaches: it inherently provides a temporally consistent division of the video into scenes, and is also parameter-free, eliminating the need for fine-tuning for different types of content. While our method can adaptively estimate the number of scenes from the video content, we also present a new non-greedy procedure for creating a hierarchical consensus-based division tree spanning multiple levels of granularity. We provide comprehensive experimental results showing the benefits of the normalized cost function, and demonstrating that the proposed method outperforms the current state of the art in video scene detection.
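For readers unfamiliar with this family of methods, the sketch below shows the textbook dynamic program for optimally grouping consecutive shots into K scenes under a simple additive within-scene cost. It does not reproduce the paper's normalized cost function or its handling of the dependency between subproblems; the cost choice, the synthetic shot features, and K are assumptions for illustration.

```python
# Minimal sketch of optimal grouping of consecutive shots into K scenes by
# dynamic programming with a simple additive within-scene cost. (Illustrative
# only: the paper's normalized cost function and its inter-subproblem
# dependency are not reproduced here.)
import numpy as np

def within_cost(feats, i, j):
    """Sum of squared distances of shots i..j (inclusive) to their mean."""
    seg = feats[i:j + 1]
    return float(((seg - seg.mean(axis=0)) ** 2).sum())

def group_shots(feats, K):
    """feats: (N, D) shot descriptors. Returns scene boundaries as a list of
    (start, end) index pairs for the optimal K-way consecutive grouping."""
    N = len(feats)
    dp = np.full((K + 1, N + 1), np.inf)    # dp[k][n]: best cost of first n shots in k scenes
    back = np.zeros((K + 1, N + 1), dtype=int)
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for n in range(k, N + 1):
            for m in range(k - 1, n):       # last scene covers shots m..n-1
                c = dp[k - 1][m] + within_cost(feats, m, n - 1)
                if c < dp[k][n]:
                    dp[k][n], back[k][n] = c, m
    bounds, n = [], N                       # recover boundaries by backtracking
    for k in range(K, 0, -1):
        m = back[k][n]
        bounds.append((m, n - 1))
        n = m
    return bounds[::-1]

rng = np.random.default_rng(4)
shots = np.concatenate([rng.normal(c, 0.1, size=(5, 8)) for c in (0, 3, 6)])  # 3 synthetic scenes
print(group_shots(shots, 3))   # with well-separated clusters: [(0, 4), (5, 9), (10, 14)]
```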

SESSION: Poster Paper Session

  •      Keiji Yanai

Transductive Zero-Shot Hashing via Coarse-to-Fine Similarity Mining

  •      Hanjiang Lai

Zero-shot hashing (ZSH) learns hashing models for novel/target classes without training data, which is an important and challenging problem. Most existing ZSH approaches exploit transfer learning via an intermediate shared semantic representation between the seen/source classes and the novel/target classes. However, hash functions learned from the source dataset may perform poorly when directly applied to the target classes due to dataset bias. In this paper, we study transductive ZSH, i.e., we have unlabeled data for the novel classes. We put forward a simple yet efficient joint learning approach via coarse-to-fine similarity mining, which transfers knowledge from the source data to the target data. It mainly consists of two building blocks in the proposed deep architecture: 1) a shared two-stream network to learn effective common image representations, where the first stream operates on the source data and the second stream operates on the unlabeled data; and 2) a coarse-to-fine module to transfer the similarities of the source data to the target data in a greedy fashion. It begins with a coarse search over the unlabeled data to find the images that are most dissimilar to the source data, and then detects the similarities among the found images via the fine module. Extensive evaluation results on several benchmark datasets demonstrate that the proposed hashing method achieves significant improvements over state-of-the-art methods.

Asymmetric Discrete Cross-Modal Hashing

  •      Xin Luo
  • Peng-Fei Zhang
  • Ye Wu
  • Zhen-Duo Chen
  • Hua-Junjie Huang
  • Xin-Shun Xu

Recently, cross-modal hashing (CMH) methods have attracted much attention. Many methods have been explored; however, there are still some issues that need to be further considered: 1) how to efficiently construct the correlations among heterogeneous modalities; 2) how to solve the NP-hard optimization problem and avoid the large quantization errors generated by relaxation; and 3) how to handle the complex and difficult problem, present in most CMH methods, of simultaneously learning the hash codes and hash functions. To address these challenges, we present a novel cross-modal hashing algorithm, named Asymmetric Discrete Cross-Modal Hashing (ADCH). Specifically, it leverages the collective matrix factorization technique to learn common latent representations while preserving not only the cross-correlation between different modalities but also the semantic similarity. Instead of relaxing the binary constraints, it generates the hash codes directly using an iterative optimization algorithm proposed in this work. Based on the learnt hash codes, ADCH further learns a series of binary classifiers as hash functions, which is flexible and effective. Extensive experiments are conducted on three real-world datasets. The results demonstrate that ADCH outperforms several state-of-the-art cross-modal hashing baselines.
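As background on the collective matrix factorization backbone that many such methods build on (not ADCH's discrete optimization, semantic terms, or classifier-based hash functions), here is a minimal alternating-least-squares sketch in which both modalities share one latent representation that is then thresholded into binary codes. The dimensions, regularization weight, and iteration count are assumptions.

```python
# Minimal sketch of collective matrix factorization for cross-modal hashing:
# image and text features share one latent representation V, and binary codes
# are obtained by thresholding it. (Illustrative only; ADCH's discrete
# optimization and asymmetric learning are omitted.)
import numpy as np

def collective_mf(X_img, X_txt, dim=16, iters=30, lam=1e-2):
    """X_img: (N, D1), X_txt: (N, D2) paired features. Returns (N, dim) binary codes."""
    N = X_img.shape[0]
    rng = np.random.default_rng(7)
    V = rng.normal(size=(N, dim))
    I = lam * np.eye(dim)
    for _ in range(iters):
        # modality-specific bases given the shared latent factors
        U1 = np.linalg.solve(V.T @ V + I, V.T @ X_img).T     # (D1, dim)
        U2 = np.linalg.solve(V.T @ V + I, V.T @ X_txt).T     # (D2, dim)
        # shared latent factors given both bases
        A = U1.T @ U1 + U2.T @ U2 + I
        rhs = X_img @ U1 + X_txt @ U2
        V = np.linalg.solve(A, rhs.T).T                      # (N, dim)
    return np.sign(V)

rng = np.random.default_rng(8)
codes = collective_mf(rng.normal(size=(100, 64)), rng.normal(size=(100, 32)))
print(codes.shape, np.unique(codes))
```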

Collaborative Subspace Graph Hashing for Cross-modal Retrieval

  •      Xiang Zhang
  • Guohua Dong
  • Yimo Du
  • Chengkun Wu
  • Zhigang Luo
  • Canqun Yang

Current hashing methods for cross-modal retrieval generally attempt to learn separate modality-specific transformation matrices to embed multi-modality data into a latent common subspace, and usually ignore the fact that respecting the diversity of multi-modality features in the latent subspace could be beneficial for retrieval improvements. To this end, we propose a collaborative subspace graph hashing method (CSGH), a two-stage collaborative learning framework for cross-modal retrieval. Particularly, CSGH first embeds multi-modality data into separate latent subspaces through individual modality-specific transformation matrices, and then connects these latent subspaces to a common Hamming space through a shared transformation matrix. In this framework, CSGH considers the modality-specific neighborhood structure and the cross-modal correlation within multi-modality data through Laplacian regularization and a graph-based correlation constraint, respectively. To solve CSGH, we develop an alternating procedure to optimize it, and fortunately, each sub-problem of CSGH has an elegant analytical solution. Experiments on cross-modal retrieval on the Wiki, NUS-WIDE, Flickr25K, and Flickr1M datasets show the effectiveness of CSGH compared with state-of-the-art cross-modal hashing methods.

Dictionary Learning based Supervised Discrete Hashing for Cross-Media Retrieval

  •      Ye Wu
  • Xin Luo
  • Xin-Shun Xu
  • Shanqing Guo
  • Yuliang Shi

Hashing techniques have attracted considerable attention for large-scale multimedia retrieval due to their low storage cost and fast query speed, and many hashing models have been proposed for the cross-modal retrieval task. However, there are still some problems that need to be further considered. For example, a majority of them directly use a linear projection matrix to project heterogeneous data into a common space, which may lead to large errors, as some semantically similar heterogeneous data are hard to bring close together in the latent space when linear projection is used. Besides, most existing cross-modal hashing methods use a simple pairwise similarity matrix to preserve the label information during learning. This kind of pairwise similarity cannot fully utilize the discriminative property of the label information. Furthermore, most existing supervised methods try to solve a relaxed continuous optimization problem by dropping the discrete constraints, which may lead to large quantization errors. To overcome these limitations, in this paper, we propose a novel cross-modal hashing method, called Dictionary Learning based Supervised Discrete Hashing (DLSDH). Specifically, it learns dictionaries and generates sparse representations for every instance, which are more suitable to be projected into a latent space. To make full use of the label information, it uses cosine similarity to construct a new pairwise similarity matrix that contains more information. Moreover, it directly learns the discrete hash codes instead of relaxing the discrete constraints. Extensive experiments are conducted on three benchmark datasets, and the results demonstrate that it outperforms several state-of-the-art methods on the cross-modal retrieval task.

Feature Reconstruction by Laplacian Eigenmaps for Efficient Instance Search

  •      Bingqing Ke
  • Jie Shao
  • Zi Huang
  • Heng Tao Shen

Instance search aims at retrieving images containing a particular query instance. Recently, image features derived from pre-trained convolutional neural networks (CNNs) have been shown to provide promising performance for image retrieval. However, the robustness of these features is still limited by hard positives and hard negatives. To address this issue, this work focuses on reconstructing a new representation based on conventional CNN features to capture the intrinsic image manifold in the original feature space. After the feature reconstruction, the Euclidean distance can be applied in the new space to measure the pairwise distance among feature points. The proposed method is highly efficient, which benefits from the linear search complexity and a further optimization for speedup. Experiments demonstrate that our method achieves promising efficiency with highly competitive accuracy. This work succeeds in capturing implicit embedding information in images as well as reducing the computational complexity significantly.
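The exact reconstruction and speedup steps are not detailed in the abstract; as a rough illustration of the general Laplacian-eigenmaps idea of re-embedding CNN features so that Euclidean distance follows the data manifold, here is a minimal numpy sketch. The k-NN graph construction, kernel bandwidth, and output dimensionality are assumptions.

```python
# Minimal sketch of re-embedding CNN features via Laplacian eigenmaps: build a
# k-NN affinity graph over the image features, then use the smallest non-trivial
# eigenvectors of the normalized graph Laplacian as the new representation.
# (Illustrative only; the paper's exact construction and speedups are omitted.)
import numpy as np

def laplacian_eigenmaps(X, n_components=32, k=10, sigma=1.0):
    """X: (N, D) CNN descriptors. Returns (N, n_components) manifold embedding."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros_like(d2)
    for i in range(len(X)):                                # keep only k nearest neighbours
        nn = np.argsort(d2[i])[1:k + 1]
        W[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                                 # symmetrize the affinity graph
    deg = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-12))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized graph Laplacian
    vals, vecs = np.linalg.eigh(L_sym)
    return vecs[:, 1:n_components + 1]                     # drop the trivial eigenvector

rng = np.random.default_rng(5)
feats = rng.normal(size=(100, 128))          # stand-in for pooled CNN features
print(laplacian_eigenmaps(feats).shape)      # (100, 32)
```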

Image Annotation Retrieval with Text-Domain Label Denoising

  •      Zachary Seymour
  • Zhongfei (Mark) Zhang

This work explores the problem of making user-generated text data, in the form of noisy tags, usable for tasks such as automatic image annotation and image retrieval by denoising the data. Earlier work in this area has focused on filtering out noisy, sparse, or incorrect tags by representing an image by the accumulation of the tags of its nearest neighbors in the visual space. However, this imposes an expensive preprocessing step that must be performed for each new set of images and tags and relies on assumptions about the way the images have been labelled that we find do not always hold. We instead propose a technique for calculating a set of probabilities for the relevance of each tag for a given image, relying solely on information in the text domain, namely through widely available pretrained continuous word embeddings. By first clustering the word embeddings of the tags, we calculate a set of weights representing the probability that each tag is meaningful to the image content. Given the set of tags denoised in this way, we use kernel canonical correlation analysis (KCCA) to learn a semantic space into which we can project to retrieve relevant tags for unseen images or to retrieve images for unseen tags. This work also explores the deficiencies of the use of continuous word embeddings for automatic image annotation in the existing KCCA literature and introduces a new method for constructing textual kernel matrices using these word vectors that improves tag retrieval results for both user-generated tags and expert labels.
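One plausible reading of the clustering-based weighting step (not the paper's exact scheme) is sketched below: cluster an image's tag embeddings and weight each tag by how close it lies to its cluster centroid, so off-topic or idiosyncratic tags receive low weight. The k-means routine, the exponential weighting, and the random stand-in vectors are assumptions.

```python
# Minimal sketch of denoising an image's tag set in the text domain: cluster
# the tags' word embeddings and weight each tag by its proximity to the cluster
# centroid. (One plausible reading of the approach; the paper's exact weighting
# scheme is not reproduced.)
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

def tag_weights(tag_vectors, k=2):
    """tag_vectors: (T, D) pretrained word embeddings of an image's tags.
    Returns (T,) weights in (0, 1] that decay with distance to the cluster centroid."""
    centers, assign = kmeans(tag_vectors, min(k, len(tag_vectors)))
    dists = np.linalg.norm(tag_vectors - centers[assign], axis=1)
    return np.exp(-dists / (dists.mean() + 1e-8))

rng = np.random.default_rng(6)
tags = rng.normal(size=(8, 300))     # stand-in for e.g. GloVe vectors of 8 user tags
print(np.round(tag_weights(tags), 3))
```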

Multi-label Triplet Embeddings for Image Annotation from User-Generated Tags

  •      Zachary Seymour
  • Zhongfei (Mark) Zhang

This work studies the representational embedding of images and their corresponding annotations--in the form of tag metadata--such that, given a piece of the raw data in one modality, the corresponding semantic description can be retrieved in terms of the raw data in another. While convolutional neural networks (CNNs) have been widely and successfully applied in this domain with regards to detecting semantically simple scenes or categories (even though many such objects may be simultaneously present in an image), this work approaches the task of dealing with image annotations in the context of noisy, user-generated, and semantically complex multi-labels, widely available from social media sites. In this case, the labels for an image are diverse, noisy, and often not specifically related to an object, but rather descriptive or user-specific. Furthermore, the existing deep image annotation literature using this type of data typically utilizes the so-called CNN-RNN framework, combining convolutional and recurrent neural networks. We offer a discussion of why RNNs may not be the best choice in this case, though they have been shown to perform well on the similar captioning tasks. Our model exploits the latent image-text space through the use of a triplet loss framework to learn a joint embedding space for the images and their tags, in the presence of multiple, potentially positive exemplar classes. We present state-of-the-art results of the representational properties of these embeddings on several image annotation datasets to show the promise of this approach.

Linguistic Patterns and Cross Modality-based Image Retrieval for Complex Queries

  •      Chandramani Chaudhary
  • Poonam Goyal
  • Joel Ruben Antony Moniz
  • Navneet Goyal
  • Yi-Ping Phoebe Chen

With the rising prevalence of social media, coupled with the ease of sharing images, people with specific needs and applications, such as known-item search and multimedia question answering, have started searching for visual content expressed in terms of complex queries. A complex query consists of multiple concepts, and their attributes are arranged to convey semantics. It is less effective to answer such queries by simply appending the search results gathered from individual concepts or subsets of the concepts present in the query. In this paper, we propose to exploit the query constituents and the relationships among them. The proposed approach determines image-query relevance by integrating three models: the linguistic pattern-based textual model, the visual model, and the cross-modality model. We extract linguistic patterns from complex queries, gather their related crawled images, and assign relevance scores to images in the corpus. The relevance scores are then used to rank the images. We experiment on more than 140k images and compare the [email protected] scores with state-of-the-art image ranking methods for complex queries. The ranking of images obtained by our approach also outperforms that obtained by a popular search engine.

A Context-Aware Late-Fusion Approach for Disaster Image Retrieval from Social Media

  •      Minh-Son Dao
  • Pham Quang Nhat Minh
  • Asem Kasem
  • Mohamed Saleem Haja Nazmudeen

Natural disasters, especially those related to flooding, are global issues that attract a lot of attention in many parts of the world. A series of research ideas focusing on combining heterogeneous data sources to monitor natural disasters have been proposed, including multi-modal image retrieval. Among these data sources, social media streams are considered of high importance due to the fast and localized updates on disaster situations. Unfortunately, social media content itself has several characteristics that limit the accuracy of this process, such as noisy data, unsynchronized image and collateral text, and untrusted information, to name a few. In this research work, we introduce a context-aware late-fusion approach for disaster image retrieval from social media. Several known techniques based on context-aware criteria are integrated, namely late fusion, tuning, ensemble learning, object detection and scene classification using deep learning. We have developed a method for image-text content synchronization and spatial-temporal-context event confirmation, and evaluated the role of using different types of features extracted from internal and external data sources. We evaluated our approach using the dataset and evaluation tool offered by the MediaEval2017 Emergency Response for Flooding Events Task. We have also compared our approach with other methods introduced by MediaEval2017's participants. The experimental results show that our approach performs best when image-text content synchronization and spatial-temporal-context event confirmation are taken into account.
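
The abstract describes a context-aware late-fusion pipeline without giving formulas. A minimal sketch of the late-fusion step, assuming each module (e.g., object detection, scene classification, text analysis) has already produced a relevance score per image and that per-modality weights are tuned on a development set, could look as follows; all names are illustrative.

    import numpy as np

    def late_fuse(scores, weights=None):
        """Weighted late fusion of per-modality relevance scores.

        scores  : dict mapping modality name -> score array over the same images,
                  e.g. {'visual': ..., 'text': ..., 'object_detector': ...}
        weights : optional dict of modality weights (e.g. tuned on a dev set);
                  defaults to a uniform average
        """
        names = sorted(scores)
        if weights is None:
            weights = {n: 1.0 / len(names) for n in names}
        stacked = np.stack([weights[n] * np.asarray(scores[n]) for n in names])
        return stacked.sum(axis=0)

Images whose visual and textual evidence disagree are naturally pulled down by the averaging, which mirrors the image-text synchronization idea at a very coarse level.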

Face Retrieval Framework Relying on User's Visual Memory

  •      Yugo Sato
  • Tsukasa Fukusato
  • Shigeo Morishima

This paper presents an interactive face retrieval framework for clarifying an image representation envisioned by a user. Our system is designed for a situation in which the user wishes to find a person but has only visual memory of the person. We address a critical challenge of image retrieval across the user's inputs. Instead of target-specific information, the user can select several images (or a single image) that are similar to an impression of the target person the user wishes to search for. Based on the user's selection, our proposed system automatically updates a deep convolutional neural network. By interactively repeating this process (human-in-the-loop optimization), the system can reduce the gap between human-based similarities and computer-based similarities and estimate the target image representation. We ran user studies with 10 subjects on a public database and confirmed that the proposed framework is effective for clarifying the image representation envisioned by the user easily and quickly.

Facial Expression Synthesis by U-Net Conditional Generative Adversarial Networks

  • Xueping Wang
  • Weixin Li
  • Guodong Mu
  • Di Huang
  • Yunhong Wang

High-level manipulation of facial expressions in images, such as expression synthesis, is challenging because facial expression changes are highly non-linear and vary depending on the facial appearance. The identity of the person should also be well preserved in the synthesized face. In this paper, we propose a novel U-Net Conditioned Generative Adversarial Network (UC-GAN) for facial expression generation. U-Net helps retain the properties of the input face, including the identity information and facial details. We also propose an identity-preserving loss, which further improves the performance of our model. Both qualitative and quantitative experiments are conducted on the Oulu-CASIA and KDEF datasets, and the results show that our method can generate faces with natural and realistic expressions while preserving the identity information. Comparison with state-of-the-art approaches also demonstrates the competency of our method.
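
The abstract names an identity-preserving loss but not its exact form. One common way to realize such a loss, shown here purely as an illustrative sketch rather than the paper's definition, is to penalize the distance between identity features of the input face and of the synthesized face under a fixed face-feature extractor; the extractor `face_encoder` and the L1 distance are assumptions.

    import torch
    import torch.nn.functional as F

    def identity_preserving_loss(face_encoder, real_face, generated_face):
        """Penalize identity drift between input and synthesized faces.

        face_encoder   : frozen network mapping a face image to an identity
                         feature vector (assumed to be available)
        real_face      : (B, 3, H, W) input face images
        generated_face : (B, 3, H, W) faces produced by the generator
        """
        with torch.no_grad():
            target = face_encoder(real_face)        # identity features, no grad
        synthesized = face_encoder(generated_face)  # gradients reach the generator
        return F.l1_loss(synthesized, target)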

PatternNet: Visual Pattern Mining with Deep Neural Network

  •      Hongzhi Li
  • Joseph G. Ellis
  • Lei Zhang
  • Shih-Fu Chang

Visual patterns represent the discernible regularity in the visual world. They capture the essential nature of visual objects or scenes. Understanding and modeling visual patterns is a fundamental problem in visual recognition that has wide-ranging applications. In this paper, we study the problem of visual pattern mining and propose a novel deep neural network architecture called PatternNet for discovering patterns that are both discriminative and representative. The proposed PatternNet leverages the filters in the last convolution layer of a convolutional neural network to find locally consistent visual patches, and by combining these filters we can effectively discover unique visual patterns. In addition, PatternNet can discover visual patterns efficiently without performing expensive image patch sampling, and this advantage provides an order of magnitude speedup compared to most other approaches. We evaluate the proposed PatternNet subjectively by showing randomly selected visual patterns which are discovered by our method and quantitatively by performing image classification with the identified visual patterns and comparing our performance with the current state-of-the-art. We also directly evaluate the quality of the discovered visual patterns by leveraging the identified patterns as proposed objects in an image and compare with other relevant methods. Our proposed network and procedure, PatternNet, is able to outperform competing methods for the tasks described.

Steganographer Detection based on Multiclass Dilated Residual Networks

  •      Mingjie Zheng
  • Sheng-hua Zhong
  • Songtao Wu
  • Jianmin Jiang

The steganographer detection task is to identify criminal users, who attempt to conceal confidential information via steganography, among a large number of innocent users. The significant challenge of the task is how to collect evidence to identify guilty users from suspicious images that are embedded with secret messages generated by unknown steganography algorithms and payloads. Unfortunately, existing steganalysis methods were designed for binary classification. This makes it harder for them to classify images with different kinds of payloads, especially when the payloads of images in the test dataset have not been provided in advance. In this paper, we propose a novel steganographer detection method based on multiclass deep neural networks. In the training stage, the networks are trained to classify images with six types of payloads. The networks can preserve and even strengthen the weak stego signals from secret messages over a much larger receptive field by virtue of residual and dilated residual learning. In the inference stage, the learnt model is used to extract discriminative features that can capture the difference between guilty users and innocent users. A series of empirical experimental results demonstrates that the proposed method achieves good performance in the spatial and frequency domains even when the embedding payload is low. The proposed method achieves a higher level of robustness across steganographic algorithms and can provide a possible solution to the payload mismatch problem.

An Entropy Model for Loiterer Retrieval across Multiple Surveillance Cameras

  •      Maguell L. T. L. Sandifort
  • Jianquan Liu
  • Shoji Nishimura
  • Wolfgang Hürst

Loitering is a suspicious behavior that often leads to criminal actions, such as pickpocketing and illegal entry. Tracking methods can determine suspicious behavior based on trajectory, but require continuous appearance and are difficult to scale up to multi-camera systems. Using the duration of appearance of features works on multiple cameras, but does not consider major aspects of loitering behavior, such as repeated appearance and the trajectory of candidates. We introduce an entropy model that maps the locations of a person's features onto a heatmap. It can be used as an abstraction of trajectory tracking across multiple surveillance cameras. We evaluate our method over several datasets and compare it to other loitering detection methods. The results show that our approach achieves results comparable to the state of the art, but can provide additional interesting candidates.
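
The entropy model itself is only named in the abstract. As one illustrative reading (not necessarily how the paper scores candidates), the sketch below histograms a person's re-identified positions over a spatial grid pooled from all cameras and computes the Shannon entropy of that heatmap; the grid size and normalization are assumptions.

    import numpy as np

    def appearance_entropy(detections, grid=(8, 8)):
        """Entropy of a person's appearance locations over a spatial grid.

        detections : array of (x, y) positions normalized to [0, 1], pooled
                     from all cameras in which the person was re-identified
        grid       : heatmap resolution (assumed)
        """
        xy = np.asarray(detections, dtype=float)
        heatmap, _, _ = np.histogram2d(xy[:, 0], xy[:, 1],
                                       bins=grid, range=[[0, 1], [0, 1]])
        p = heatmap.ravel() / heatmap.sum()
        p = p[p > 0]
        # A concentrated heatmap (low entropy) means the person keeps
        # re-appearing in the same few cells, one possible loitering cue.
        return -(p * np.log2(p)).sum()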

Visual Question Answering With a Hybrid Convolution Recurrent Model

  •      Philipp Harzig
  • Christian Eggert
  • Rainer Lienhart

Visual Question Answering (VQA) is a relatively new task, which tries to infer answer sentences for an input image coupled with a corresponding question. Instead of dynamically generating answers, they are usually inferred by finding the most probable answer from a fixed set of possible answers. Previous work did not address the problem of finding all possible answers, but only modeled the answering part of VQA as a classification task. To tackle this problem, we infer answer sentences by using a Long Short-Term Memory (LSTM) network that allows us to dynamically generate answers for (image, question) pairs. In a series of experiments, we discover an end-to-end Deep Neural Network structure that allows us to dynamically answer questions referring to a given input image by using an LSTM decoder network. With this approach, we are able to generate both less common answers, which are not considered by classification models, and more complex answers, such as those appearing in datasets whose answers consist of more than three words.

Searching and Matching Texture-free 3D Shapes in Images

  •      Shuai Liao
  • Efstratios Gavves
  • Cees G. M. Snoek

The goal of this paper is to search for and match the best rendered view of a texture-free 3D shape to an object of interest in a 2D query image. Matching rendered views of 3D shapes to RGB images is challenging because 1) 3D shapes are not always a perfect match for the image queries, 2) there is a great domain difference between rendered and RGB images, and 3) estimating the object scale versus distance is inherently ambiguous in images from uncalibrated cameras. In this work we propose a deeply learned matching function that addresses these challenges and can be used in a search engine that finds the appropriate 3D shape and matches it to objects in 2D query images. We evaluate the proposed matching function and search engine with a series of controlled experiments on the 24 most populated vehicle categories in PASCAL3D+. We test the capability of the learned matching function to transfer to unseen 3D shapes and study the overall search engine sensitivity w.r.t. available 3D shapes and object localization accuracy, showing promising results in retrieving 3D shapes given 2D image queries.

Challenges and Opportunities within Personal Life Archives

  •      Duc-Tien Dang-Nguyen
  • Michael Riegler
  • Liting Zhou
  • Cathal Gurrin

Nowadays, almost everyone holds some form or other of a personal life archive. Automatically maintaining such an archive is an activity that is becoming increasingly common; however, without automatic support, users will quickly be overwhelmed by the volume of data and will miss out on the potential benefits that lifelogs provide. In this paper we give an overview of the current status of lifelog research and propose a concept for exploring these archives. We motivate the need for new methodologies for indexing data, organizing content and supporting information access. Finally, we describe the challenges to be addressed and give an overview of the initial steps that have to be taken to organise and search personal life archives.

Object Trajectory Proposal via Hierarchical Volume Grouping

  •      Xu Sun
  • Yuantian Wang
  • Tongwei Ren
  • Zhi Liu
  • Zheng-Jun Zha
  • Gangshan Wu

Object trajectory proposal aims to locate category-independent object candidates in videos with a limited number of trajectories, i.e., bounding box sequences. Most existing methods, which derive from combining object proposal with tracking, cannot handle object trajectory proposal effectively due to the lack of comprehensive objectness measurement through analyzing spatio-temporal characteristics over a whole video. In this paper, we propose a novel object trajectory proposal method using hierarchical volume grouping. Specifically, we first represent a given video with hierarchical volumes by mapping hierarchical regions with optical flow. Then, we filter the short volumes and background volumes, and combinatorially group the retained volumes into object candidates. Finally, we rank the object candidates using a multi-modal fusion scoring mechanism, which incorporates both appearance objectness and motion objectness, and generate the bounding boxes of the object candidates with the highest scores as the trajectory proposals. We validated the proposed method on a dataset consisting of 200 videos from ILSVRC2016-VID. The experimental results show that our method is superior to the state-of-the-art object trajectory proposal methods.

CBVMR: Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint

  •      Sungeun Hong
  • Woobin Im
  • Hyun S. Yang

Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video, or vice versa. Moreover, much of the existing research relies on metadata such as keywords, tags, or descriptions that must be individually produced and attached afterwards. This paper introduces a new content-based, cross-modal retrieval method for video and music that is implemented through deep neural networks. We train the network via an inter-modal ranking loss such that videos and music with similar semantics end up close together in the embedding space. However, if only the inter-modal ranking constraint is used for embedding, modality-specific characteristics can be lost. To address this problem, we propose a novel soft intra-modal structure loss that leverages the relative distance relationships between intra-modal samples before embedding. We also introduce reasonable quantitative and qualitative experimental protocols to address the lack of standard protocols for these less mature video-music tasks. All the datasets and source code can be found in our online repository (https://github.com/csehong/VM-NET).
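
The abstract does not spell out the soft intra-modal structure loss. One simple way to encode the stated intuition, preserving the relative intra-modal distances observed before embedding, is sketched below; the actual loss in the paper may differ, and the normalization choice here is an assumption.

    import torch
    import torch.nn.functional as F

    def soft_intra_modal_structure_loss(features_before, features_after):
        """Keep relative intra-modal distances after embedding.

        features_before : (B, d1) raw features of one modality (e.g. music)
        features_after  : (B, d2) the same samples after the embedding network
        """
        d_before = torch.cdist(features_before, features_before)
        d_after = torch.cdist(features_after, features_after)
        # Normalize both distance matrices so only the relative structure
        # (which samples are close to which) matters, not the absolute scale.
        d_before = d_before / (d_before.mean() + 1e-8)
        d_after = d_after / (d_after.mean() + 1e-8)
        return F.mse_loss(d_after, d_before)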

Multi-Scale Spatiotemporal Conv-LSTM Network for Video Saliency Detection

  •      Yi Tang
  • Wenbin Zou
  • Zhi Jin
  • Xia Li

Recently, deep neural networks have become crucial techniques for image saliency detection. However, two difficulties hinder the development of deep learning in video saliency detection. The first is that traditional static networks cannot conduct robust motion estimation in videos. The other is that data-driven deep learning lacks sufficient manually annotated pixel-wise ground truths for video saliency network training. In this paper, we propose a multi-scale spatiotemporal convolutional LSTM network (MSST-ConvLSTM) to incorporate spatial and temporal cues for video salient object detection. Furthermore, as manual pixel-wise labeling is very time-consuming, we annotate a large number of coarse labels, which are mixed with fine labels to train a robust saliency prediction model. Experiments on widely used and challenging benchmark datasets (e.g., FBMS and DAVIS) demonstrate that the proposed approach has competitive video saliency detection performance compared with state-of-the-art saliency models.

Supervised Nonparametric Multimodal Topic Modeling Methods for Multi-class Video Classification

  •      Jianfei Xue
  • Koji Eguchi

Nonparametric topic models such as hierarchical Dirichlet processes (HDP) have been attracting more and more attention for multimedia data analysis. However, the existing models for multimedia data are unsupervised ones that purely cluster semantically or characteristically related features into a specific latent topic without considering side information such as class information. In this paper, we present a novel supervised sequential symmetric correspondence HDP (Sup-SSC-HDP) model for multi-class video classification, where the empirical topic frequencies learned from multimodal video data are modeled as a predictor of video class. Qualitative and quantitative assessments demonstrate the effectiveness of Sup-SSC-HDP.

Dense Dilated Network for Few Shot Action Recognition

  •      Baohan Xu
  • Hao Ye
  • Yingbin Zheng
  • Heng Wang
  • Tianyu Luwang
  • Yu-Gang Jiang

Recently, video action recognition has been widely studied. Training deep neural networks requires a large amount of well-labeled videos. On the other hand, videos in the same class share high-level semantic similarity. In this paper, we introduce a novel neural network architecture to simultaneously capture local and long-term spatio-temporal information. The dilated dense network is proposed, with blocks composed of densely-connected dilated convolution layers. The proposed framework is capable of fusing each layer's outputs to learn high-level representations, and the representations are robust even with only a few training snippets. Aggregations of dilated dense blocks are also explored. We conduct extensive experiments on UCF101 and demonstrate the effectiveness of our proposed method, especially with few training examples.
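
As the architecture is only described verbally, the following PyTorch sketch shows one plausible form of a densely connected block of dilated convolutions with exponentially growing dilation rates; the channel counts, growth rate, and layer count are assumptions for illustration.

    import torch
    import torch.nn as nn

    class DilatedDenseBlock(nn.Module):
        """Densely connected dilated convolutions (illustrative only)."""

        def __init__(self, in_channels, growth=32, n_layers=4):
            super().__init__()
            self.layers = nn.ModuleList()
            channels = in_channels
            for i in range(n_layers):
                dilation = 2 ** i                 # 1, 2, 4, 8: growing context
                self.layers.append(nn.Sequential(
                    nn.BatchNorm2d(channels),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(channels, growth, kernel_size=3,
                              padding=dilation, dilation=dilation)))
                channels += growth                # dense connectivity

        def forward(self, x):
            features = [x]
            for layer in self.layers:
                out = layer(torch.cat(features, dim=1))
                features.append(out)
            # Fuse every layer's output, as described in the abstract.
            return torch.cat(features, dim=1)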

Precise Temporal Action Localization by Evolving Temporal Proposals

  •      Haonan Qiu
  • Yingbin Zheng
  • Hao Ye
  • Yao Lu
  • Feng Wang
  • Liang He

Locating actions in long untrimmed videos has been a challenging problem in video content analysis. The performance of existing action localization approaches remains unsatisfactory in precisely determining the beginning and the end of an action. Imitating the human perception procedure of observation and refinement, we propose a novel three-phase action localization framework. Our framework is embedded with an Actionness Network to generate initial proposals through frame-wise similarity grouping, and then a Refinement Network to conduct boundary adjustment on these proposals. Finally, the refined proposals are sent to a Localization Network for further fine-grained location regression. The whole process can be deemed multi-stage refinement using a novel non-local pyramid feature under various temporal granularities. We evaluate our framework on the THUMOS14 benchmark and obtain a significant improvement over state-of-the-art approaches. Specifically, the performance gain is remarkable for precise localization with high IoU thresholds: our proposed framework achieves an mAP of 34.2% at an IoU threshold of 0.5.

SESSION: Special Session 1: Predicting User Perceptions of Multimedia Content

  •      Claire-Hélène Demarty

Image Selection in Photo Albums

  •      Dmitry Kuzovkin
  • Tania Pouli
  • Rémi Cozot
  • Olivier Le Meur
  • Jonathan Kervec
  • Kadi Bouatouch

The selection of the best photos in personal albums is a task that is often faced by photographers. This task can become laborious when the photo collection is large and contains multiple similar photos. Recent advances in image aesthetics and photo importance evaluation have led to the creation of different metrics for automatically assessing a given image. However, these metrics are intended for the independent assessment of an image, without considering the context implicitly present within photo albums. In this work, we perform a user study assessing how users select photos when provided with a complete photo album---a task that better reflects how users may review their personal photos and collections. Using the data provided by our study, we evaluate how existing state-of-the-art photo assessment methods perform relative to user selection, focusing in particular on deep learning based approaches. Finally, we explore a recent framework for adapting independent image scores to collections and evaluate in which scenarios such an adaptation can prove beneficial.

Feature Selection and Multimodal Fusion for Estimating Emotions Evoked by Movie Clips

  •      Yasemin Timar
  • Nihan Karslioglu
  • Heysem Kaya
  • Albert Ali Salah

Perceptual understanding of media content has many applications, including content-based retrieval, marketing, content optimization, psychological assessment, and affect-based learning. In this paper, we model audio-visual features extracted from videos via machine learning approaches to estimate the affective responses of the viewers. We use the LIRIS-ACCEDE dataset and the MediaEval 2017 Challenge setting to evaluate the proposed methods. This dataset is composed of movies of professional or amateur origin, annotated with viewers' arousal, valence, and fear scores. We extract a number of audio features, such as Mel-frequency cepstral coefficients, and visual features, such as dense SIFT, hue-saturation histograms, and features from a deep neural network trained for object recognition. We contrast two different approaches in the paper and report experiments with different fusion and smoothing strategies. We demonstrate the benefit of feature selection and multimodal fusion for estimating affective responses to movie segments.

Multimodal Continuous Prediction of Emotions in Movies using Long Short-Term Memory Networks

  •      Sarath Sivaprasad
  • Tanmayee Joshi
  • Rishabh Agrawal
  • Niranjan Pedanekar

Predicting the emotions that movies are designed to evoke can be useful in entertainment applications such as content personalization, video summarization and ad placement. Multimodal input, primarily audio and video, helps in building the emotional content of a movie. Since the emotion is built over time by audio and video, the temporal context of these modalities is an important aspect in modeling it. In this paper, we use Long Short-Term Memory networks (LSTMs) to model the temporal context in audio-video features of movies. We present continuous emotion prediction results using a multimodal fusion scheme on an annotated dataset of Academy Award winning movies. We report a significant improvement over the state-of-the-art results, wherein the correlation between predicted and annotated values is improved from 0.62 to 0.84 for arousal, and from 0.29 to 0.50 for valence.

Learning Perceptual Embeddings with Two Related Tasks for Joint Predictions of Media Interestingness and Emotions

  •      Yang Liu
  • Zhonglei Gu
  • Tobey H. Ko
  • Kien A. Hua

Integrating media elements of various types, multimedia is capable of expressing complex information in a neat and compact way. Early studies have linked different sensory presentations in multimedia with the perception of human-like concepts. Yet the richness of information in multimedia makes understanding and predicting user perceptions of multimedia content a challenging task for both the machine and the human mind. This paper presents a novel multi-task feature extraction method for accurate prediction of user perceptions of multimedia content. Differentiating itself from conventional feature extraction algorithms that focus on perfecting a single task, the proposed model recognizes the commonality between different perceptions (e.g., interestingness and emotional impact) and attempts to jointly optimize the performance of all tasks through uncovered commonality features. Using both a media interestingness dataset and a media emotion dataset, the proposed model simultaneously characterizes the individualities of each task and captures the commonalities shared by both, achieving better prediction accuracy than competing algorithms on two related real-world tasks: the MediaEval 2017 Predicting Media Interestingness Task and the MediaEval 2017 Emotional Impact of Movies Task.

Deep Pairwise Classification and Ranking for Predicting Media Interestingness

  •      Jayneel Parekh
  • Harshvardhan Tibrewal
  • Sanjeel Parekh

With the explosive increase in the consumption of multimedia content in recent years, the field of media interestingness analysis has gained a lot of attention. This paper tackles the problem of image interestingness in videos and proposes a novel algorithm based on pairwise-comparisons of frames to rank all frames in a video. Experiments performed on the Predicting Media Interestingness dataset, affirm its effectiveness over existing solutions. In terms of the official metric i.e. Mean Average Precision at 10, it outperforms the previous state-of-the-art (to the best of our knowledge) on this dataset. Additional results on video interestingness substantiate the flexibility and performance reliability of our approach.

Perceptually-guided Understanding of Egocentric Video Content: Recognition of Objects to Grasp

  •      Iván González-Díaz
  • Jenny Benois-Pineau
  • Jean-Philippe Domenger
  • Aymar de Rugy

Incorporating user perception into visual content search and understanding tasks has become one of the major trends in multimedia retrieval. We tackle the problem of object recognition guided by user perception, as indicated by his gaze during visual exploration, in the application domain of assistance to upper-limb amputees. Although selecting the object to be grasped represents a task-driven visual search, human gaze recordings are noisy due to several physiological factors. Hence, since gaze does not always point to the object of interest, we use video-level weak annotations indicating the object to be grasped, and propose a video-level weak loss in classification with Deep CNNs. Our results show that the method achieves notably better performance than other approaches over a complex real-life dataset specifically recorded, with optimal performance for fixation times around 400-800ms, producing a minimal impact on subjects' behavior.

Towards Better Understanding of Player's Game Experience

  •      Wenlu Yang
  • Maria Rifqi
  • Christophe Marsala
  • Andrea Pinna 

Improving the player's game experience has always been a common goal of video game practitioners. In order to get a better understanding of players' perception of game experience, we carry out an experimental study for data collection and present a game experience prediction model based on machine learning. The model is trained on the proposed multi-modal database, which contains a physiological modality, a behavioral modality and meta-information, to predict the player's game experience in terms of difficulty, immersion and amusement. By investigating models trained on separate and fused feature sets, we show that the physiological modality is effective. Moreover, better understanding is achieved with further analysis of the most relevant features in the behavioral and meta-information feature sets. We argue that combining the physiological modality with behavioral and meta-information can provide better performance on game experience prediction.

SESSION: Special Session 2: Social-Media Visual Summarization / Large-Scale 3D Multimedia Analysis and Applications

  •      Joao Magalhaes
  • Rongrong Ji

Multimodal Filtering of Social Media for Temporal Monitoring and Event Analysis

  •      Po-Yao Huang
  • Junwei Liang
  • Jean-Baptiste Lamare
  • Alexander G. Hauptmann

Developing an efficient and effective social media monitoring system has become one of the important steps towards improved public safety. With the explosive availability of user-generated content documenting most conflicts and human rights abuses around the world, analysts and first-responders increasingly find themselves overwhelmed with massive amounts of noisy data from social media. In this paper, we construct a large-scale public safety event dataset with retrospective automatic labeling for 4.2 million multimodal tweets from 7 public safety events that occurred between 2013 and 2017. We propose a new multimodal social media filtering system composed of encoding, classification, and correlation networks to jointly learn shared and complementary visual and textual information and to filter the most relevant and useful items out of the noisy social media influx. The proposed model is verified and achieves significant improvement over competitive baselines under retrospective and real-time experimental protocols.

A LiDAR Point Cloud Generator: from a Virtual World to Autonomous Driving

  •      Xiangyu Yue
  • Bichen Wu
  • Sanjit A. Seshia
  • Kurt Keutzer
  • Alberto L. Sangiovanni-Vincentelli

3D LiDAR scanners are playing an increasingly important role in autonomous driving as they can generate depth information of the environment. However, creating large 3D LiDAR point cloud datasets with point-level labels requires a significant amount of manual annotation. This jeopardizes the efficient development of supervised deep learning algorithms, which are often data-hungry. We present a framework to rapidly create point clouds with accurate point-level labels from a computer game. To the best of our knowledge, this is the first publication on a LiDAR point cloud simulation framework for autonomous driving. The framework supports data collection from both auto-driving scenes and user-configured scenes. Point clouds from auto-driving scenes can be used as training data for deep learning algorithms, while point clouds from user-configured scenes can be used to systematically test the vulnerability of a neural network and to use the falsifying examples to make the neural network more robust through retraining. In addition, scene images can be captured simultaneously for sensor fusion tasks, with a method proposed to perform automatic registration between the point clouds and captured scene images. We show a significant improvement in point cloud segmentation accuracy (+9%) by augmenting the training dataset with the generated synthesized data. Our experiments also show that by testing and retraining the network using point clouds from user-configured scenes, the weaknesses/blind spots of the neural network can be fixed.

3D Image-based Indoor Localization Joint With WiFi Positioning

  •      Guoyu Lu
  • Jingkuan Song

We realize a system that utilizes WiFi to facilitate image-based localization, which avoids the confusion caused by similar decoration inside buildings. While the WiFi-based localization thread obtains rough location information, the image-based localization thread retrieves the best matching images and clusters the camera poses associated with those images into different location candidates. The image cluster closest to the WiFi localization outcome is selected for exact camera pose estimation. The use of WiFi significantly reduces the search scope, avoiding an extensive search over millions of descriptors in a 3D model. In the image-based localization stage, we also propose a novel 2D-to-2D-to-3D localization framework that follows a coarse-to-fine strategy to quickly locate the query image among several location candidates and performs local feature matching and camera pose estimation after choosing the correct image location by WiFi positioning. The entire system demonstrates significant benefits in combining both images and WiFi signals in localization tasks and great potential to be deployed in real applications.
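
The gating of the image-based pipeline by the coarse WiFi fix can be illustrated with a very small sketch: after the retrieved images' camera poses have been clustered, the cluster whose centre lies closest to the WiFi estimate is kept for exact pose estimation. Variable names are illustrative only.

    import numpy as np

    def select_pose_cluster(wifi_position, cluster_centers):
        """Pick the camera-pose cluster closest to the coarse WiFi estimate.

        wifi_position   : (dim,) rough location from WiFi positioning
        cluster_centers : (K, dim) mean positions of the retrieved-image clusters
        Returns the index of the nearest cluster; only its images are passed
        on to local feature matching and exact camera pose estimation.
        """
        distances = np.linalg.norm(cluster_centers - np.asarray(wifi_position),
                                   axis=1)
        return int(np.argmin(distances))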

Compare Stereo Patches Using Atrous Convolutional Neural Networks

  •      Zhiwei Li
  • Lei Yu

In this work, we address the task of dense stereo matching with Convolutional Neural Networks (CNNs). In particular, we focus on improving matching cost computation by better aggregating contextual information. Towards this goal, we advocate the use of atrous convolution, a powerful tool for dense prediction tasks that allows us to control the resolution at which feature responses are computed within CNNs and to enlarge the receptive field of the network without losing image resolution or requiring extra learned parameters. Aiming to improve the performance of atrous convolution, we propose different frameworks for further boosting performance. We evaluate our models on the KITTI 2015 benchmark, and the results show that we achieve on-par performance with fewer post-processing methods applied.

SESSION: Doctoral Symposium Session

  •      Martha Larson
  • Takahiro Ogawa

Personal Basketball Coach: Tactic Training through Wireless Virtual Reality

  •      Wan-Lun Tsai

In this paper, we present a basketball tactic training framework that uses virtual reality (VR) technology to improve the effectiveness and experience of tactic learning. Our proposal is composed of 1) a wireless VR interaction system with motion capture devices that is applicable to the fast-moving basketball running scenario; and 2) a computing server that generates three-dimensional virtual players, defenders, and advantageous tactic guidance. With the assistance of our VR training system, the user can vividly experience how tactics are executed by viewing them from a specific player's viewing direction. Moreover, basketball tactic movement guidance and virtual defenders are rendered in our VR system to make users feel as if they are playing in a real basketball game, which improves the efficiency and effectiveness of tactic training.

Extracting and Using Medical Expert Knowledge to Advance in Video Processing for Gynecologic Endoscopy

  •      Andreas Leibetseder
  • Klaus Schoeffmann

Modern day endoscopic technology enables medical staff to conveniently document surgeries via recording raw treatment footage, which can be utilized for planning further proceedings, future case revisitations or even educational purposes. However, the prospect of manually perusing recorded media files constitutes a tedious additional workload on physicians' already packed timetables and therefore ultimately represents a burden rather than a benefit. The aim of this PhD project is to improve upon this situation by closely collaborating with medical experts in order to devise datasets and systems to facilitate semi-automatic post-surgical media processing.

Temporal Aggregation of Visual Features for Large-Scale Image-to-Video Retrieval

  •      Noa Garcia

In this research we study the specific task of image-to-video retrieval, in which static pictures are used to find a specific timestamp or frame within a collection of videos. The inner temporal structure of video data consists of a sequence of highly correlated images or frames, commonly reproduced at rates of 24 to 30 frames per second. To perform large-scale retrieval, it is necessary to reduce the amount of data to be processed by exploiting the redundancy between these highly correlated images. In this work, we explore several techniques to aggregate visual temporal information from video data based on both standard local features and deep learning representations with the focus on the image-to-video retrieval task.

Tourism Category Classification on Image Sharing Services Through Estimation of Existence of Reliable Results

  •      Naoki Saito
  • Takahiro Ogawa
  • Satoshi Asamizu
  • Miki Haseyama

A new tourism category classification method based on estimating the existence of reliable classification results is presented in this paper. The proposed method obtains two kinds of classification results by applying a convolutional neural network to tourism images and applying a fuzzy k-nearest neighbor algorithm to geotags attached to the tourism images. Then the proposed method estimates whether a reliable classification result exists among these two results. If a reliable result is included, it is selected as the final classification result. If no reliable result is included, the final result is obtained by another approach based on a multiple annotator logistic regression model. Consequently, the proposed method enables accurate classification based on the new estimation scheme.

Considering Documents in Lifelog Information Retrieval

  •      Rashmi Gupta

Lifelogging is a research topic that is receiving increasing attention, and although lifelog research has progressed in recent years, the concept of what represents a document in lifelog retrieval has not yet been sufficiently explored. Hence, the generation of multimodal lifelog documents is a fundamental concept that must be addressed. In this paper, I introduce my general perspective on generating documents in lifelogging and reflect on lessons learned from collecting multimodal lifelog data from a number of participants in a study on lifelog data organization. In addition, the main motivation behind document generation is presented, and the challenges faced while collecting data and generating documents are discussed in detail. Finally, a process for organizing the documents in lifelog data retrieval is proposed, which I intend to follow in my PhD research.

SESSION: Demonstration Session

  •      Koichi Shinoda
  • Zhipeng Wu

VP-ReID: Vehicle and Person Re-Identification System

  •      Longhui Wei
  • Xiaobin Liu
  • Jianing Li
  • Shiliang Zhang

With the capability of locating and tracking specific suspects or vehicles in a large camera network, person re-identification (ReID) and vehicle ReID show potential to be key technologies in smart surveillance systems. They have been drawing a lot of attention from both academia and industry. To demonstrate our recent research progress on those two tasks, we have developed a robust and efficient person and vehicle ReID system named VP-ReID. This system is built on our recent works, including deep convolutional neural network design for discriminative feature extraction, efficient off-line indexing, and distance metric optimization for deep feature learning. Constructed upon those algorithms, VP-ReID identifies query vehicles and persons efficiently and accurately from a large gallery set.

VisLoiter+: An Entropy Model-Based Loiterer Retrieval System with User-Friendly Interfaces

  •      Maguell L.T.L. Sandifort
  • Jianquan Liu
  • Shoji Nishimura
  • Wolfgang Hürst

It is very difficult to fully automate the detection of loitering behavior in video surveillance; therefore, humans are often required for monitoring. Alternatively, we could provide a list of potential loiterer candidates for a final yes/no judgment by a human operator. Our system, VisLoiter+, realizes this idea with a unique, user-friendly interface and by employing an entropy model for improved loitering analysis. Rather than using only frequency of appearance, we expand the loitering analysis with new methods that measure the amount of person movement across multiple camera views. The interface gives an overview of loiterer candidates to show their behavior at a glance, complemented by a lightweight video playback for further details about why a candidate was selected. We demonstrate that our system outperforms state-of-the-art solutions using real-life data sets.

Automated Scanning and Individual Identification System for Parts without Marking or Tagging

  •      Kengo Makino
  • Wenjie Duan
  • Rui Ishiyama
  • Toru Takahashi
  • Yuta Kudo
  • Pieter Jonker

This paper presents a fully automated system for detecting, classifying, microscopic imaging, and individually identifying multiple parts without ID-marking or tagging. The system is beneficial for product assemblers, who handle multiple types of parts simultaneously. They can ensure traceability quite easily by only placing the parts freely on the system platform. The system captures microscopic images of parts as their "fingerprints," which are matched with pre-registered images in a database to identify an individual part's information such as its serial number. We demonstrate a working prototype and interaction scenario.

Dynamic Construction and Manipulation of Hierarchical Quartic Image Graphs

  •      Nico Hezel
  • Kai Uwe Barthel

Over the last years, we have published papers about intuitive image graph navigation and showed how to build static hierarchical image graphs efficiently. In this paper, we showcase new results and present techniques to dynamically construct and manipulate these kinds of graphs. They connect similar images and perform well in retrieval tasks regardless of the number of nodes. By applying an improved fast self-sorting map algorithm, entire image collections (structured in a graph) can be explored with a user interface resembling common navigation services.

WTPlant (What's That Plant?): A Deep Learning System for Identifying Plants in Natural Images

  •      Jonas Krause
  • Gavin Sugita
  • Kyungim Baek
  • Lipyeow Lim

Despite the availability of dozens of plant identification mobile applications, identifying plants from a natural image remains a challenging problem - most of the existing applications do not address the complexity of natural images, the large number of plant species, and the multi-scale nature of natural images. In this technical demonstration, we present the WTPlant system for identifying plants in natural images. WTPlant is based on deep learning approaches. Specifically, it uses stacked Convolutional Neural Networks for image segmentation, a novel preprocessing stage for multi-scale analyses, and deep convolutional networks to extract the most discriminative features. WTPlant employs different classification architectures for plants and flowers, thus enabling plant identification throughout all the seasons. The user interface also shows, in an interactive way, the most representative areas in the image that are used to predict each plant species. The first version of WTPlant is trained to classify 100 different plant species present in the campus of the University of Hawai'i at Manoa. First experiments support the hypothesis that an initial segmentation process helps guide the extraction of representative samples and, consequently, enables Convolutional Neural Networks to better recognize objects of different scales in natural images. Future versions aim to extend the recognizable species to cover the land-based flora of the Hawaiian Islands.

MOOCex: Exploring Educational Video via Recommendation

  •      Matthew Cooper
  • Jian Zhao
  • Chidansh Bhatt
  • David A. Shamma

Massive Open Online Course (MOOC) platforms have scaled online education to unprecedented enrollments, but remain limited by their predetermined curricula. Increasingly, professionals consume this content to augment or update specific skills rather than complete degree or certification programs. To better address the needs of this emergent user population, we describe a visual recommender system called MOOCex. The system recommends lecture videos across multiple courses and content platforms to provide a choice of perspectives on topics of interest. The recommendation engine considers both video content and sequential inter-topic relationships mined from course syllabi. Furthermore, it allows for interactive visual exploration of the semantic space of recommendations within a learner's current context.

Who to Ask: An Intelligent Fashion Consultant

  •      Yangbangyan Jiang
  • Qianqian Xu
  • Xiaochun Cao
  • Qingming Huang

Humankind has always been in pursuit of fashion. Nevertheless, people are often troubled by collocating clothes, e.g., tops, bottoms, shoes, and accessories, from numerous fashion items in their closets. Moreover, it may be expensive and inconvenient to employ a fashion stylist. In this paper, we present Stile, an end-to-end intelligent fashion consultant system, to generate stylish outfits for given items. Unlike previous systems, our framework considers the global compatibility of fashion items in the outfit and models the dependencies among items in a fixed order via a bidirectional LSTM. Therefore, it can guarantee that items in the same outfit should share a similar style and neither redundant nor missing items exist in the resulting outfit for essential categories. The demonstration shows that our proposed system provides people with a practical and convenient solution to find natural and proper fashion outfits.
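
The abstract describes scoring outfit compatibility with a bidirectional LSTM over items in a fixed category order. A minimal sketch of such a scorer is shown below; the item embedding dimensionality, pooling, and scoring head are assumptions, not the system's actual architecture.

    import torch
    import torch.nn as nn

    class OutfitCompatibilityScorer(nn.Module):
        """Score an ordered outfit (top, bottom, shoes, ...) with a BiLSTM."""

        def __init__(self, embed_dim=512, hidden=256):
            super().__init__()
            self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                                bidirectional=True)
            self.score = nn.Linear(2 * hidden, 1)

        def forward(self, item_embeddings):           # (B, n_items, embed_dim)
            outputs, _ = self.lstm(item_embeddings)   # (B, n_items, 2 * hidden)
            summary = outputs.mean(dim=1)             # pool over the outfit
            return self.score(summary).squeeze(-1)    # (B,) compatibility scores

Item embeddings would come from a separate visual encoder; the bidirectional reading lets every item's score depend on all the other items in the outfit.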

A Simple Score Following System for Music Ensembles Using Chroma and Dynamic Time Warping

  •      Po-Wen Chou
  • Fu-Neng Lin
  • Keh-Ning Chang
  • Herng-Yow Chen

It is disruptive for instrumentalists to turn the pages of sheet music while they are playing their instruments. The purpose of this study is to investigate how real-time music score alignment can serve as a tool for a computer-assisted page turner. We propose a simple system that can be set up easily and quickly to solve the problem. The framework of the system has two parts: an off-line preprocessing stage and an online alignment stage. In the first stage, the system extracts chroma feature vectors from the reference recording. In the second stage, the system receives audio signals of the live performance and extracts chroma feature vectors from them. Finally, the system uses Dynamic Time Warping (DTW) to find an alignment between those two sets of chroma feature vectors and marks the current measure of the score. The prototype system was evaluated by musicians in different music ensembles such as a string quartet and an orchestra. Most musicians agreed that the system is helpful and can indicate the current measure of a live performance correctly. Some musicians, however, disagreed that the system turned the page at the right time. The user survey showed that the best timing for page turning is user-dependent because it depends heavily on musicians' sight-reading skills and speed.
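
Chroma extraction and DTW alignment of the kind described are readily available in librosa. The sketch below aligns a buffered chunk of the live performance against the reference recording offline; a real system would use an online/streaming variant of DTW, and the file names are placeholders.

    import librosa

    # Off-line stage: chroma features of the reference recording.
    ref_audio, sr = librosa.load('reference_recording.wav')          # path assumed
    ref_chroma = librosa.feature.chroma_stft(y=ref_audio, sr=sr, hop_length=512)

    # On-line stage (simplified to one buffered chunk): chroma of the live
    # performance so far, then DTW against the reference.
    live_audio, _ = librosa.load('live_buffer.wav', sr=sr)           # path assumed
    live_chroma = librosa.feature.chroma_stft(y=live_audio, sr=sr, hop_length=512)

    D, wp = librosa.sequence.dtw(X=ref_chroma, Y=live_chroma, metric='cosine')
    # The warping path pairs reference frames with live frames; the reference
    # frame matched to the most recent live frame gives the current position.
    current_ref_frame = wp[0, 0]
    current_time = librosa.frames_to_time(current_ref_frame, sr=sr, hop_length=512)
    print(f'Estimated score position: {current_time:.2f} s into the reference')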