ICMR '20: Proceedings of the 2020 International Conference on Multimedia Retrieval





SESSION: Keynote Talks

What Should I Do?

  • Ramesh Jain

I find myself asking, "What should I do?" in many situations: when I want to go out to
eat, when I plan my vacation, when I decide how to spend some free time on the weekend,
and in numerous other cases. All these decisions are really personalized
contextual decisions that may be addressed by a contextual recommendation engine that
knows me. To know me well, the engine should build my model based on all events
in my life. By retrieving and mining events of various types computed using different
multimodal data streams, such a personal model may be prepared and then used to help
in making decisions ranging from trivial to critical. We discuss important challenges
in organizing life events that may be used for building personal models and for accessing
characteristics of such events as may be needed in various applications. We will demonstrate
our ideas using some applications related to lifestyle and health.

Medical Image Retrieval: Applications and Resources

  • Henning Müller

Motivation: Medical imaging is one of the largest data producers in the world and
over the last 30 years this production increased exponentially via a larger number
of images and a higher resolution, plus totally new types of images. Most images are
used only in the context of a single patient and a single time point, besides a few
images that are used for publications or in teaching. Data are usually scattered across
many institutions and cannot be combined even for the treatment of a single patient.
Much knowledge is stored in these medical archives of images and other clinical information
and content-based medical image retrieval has from the start aimed at making such
knowledge accessible using visual information in combination with text or structured
data. With the digitization of radiology that started in the mid-1990s, the foundation
for broader use was laid. Problem statement: This keynote presentation aims at
giving a historical perspective of how medical image retrieval has evolved from a
few prototypes using first only text, then global visual features to the current multimodal
systems that can index many types of images in large quantities and use deep learning
as a basis for the tools [1,2,3,4]. It also aims at looking at what the place of image
retrieval is in medicine, where it is currently still only sparsely used in clinical
practice. It seems that it is mainly a tool for teaching and research. Certified medical
tools for decision support instead make use of specific approaches for detection and
classification. Approach: The presentation follows a systematic review of the domain
that includes many examples of systems and approaches that changed over time when
better performing tools became available. Medical image retrieval has evolved strongly,
and many tools linked to image retrieval are now employed as clinical decision support,
but mainly for detection and classification. Retrieval remains useful but is often
integrated with other tools and thus has become almost invisible. A second aspect of the
presentation covers existing data sets and other resources that
were difficult to obtain even ten years ago, but that have been shared via repositories
such as TCGA (The Cancer Genome Atlas, https://www.cancer.gov/about-nci/organization/ccg/
research/structural-genomics/tcga), TCIA (The Cancer Imaging Archive, https://www.cancerimagingarchive.net),
or via scientific challenges such as ImageCLEF [5], or listed on the Grand Challenges
web page (https://grand-challenge.org). Medical data are now easily accessible in
many fields and often even in large quantities. Discussion: Medical retrieval has
gone from single text or image retrieval to multimodal approaches [6], really aiming
to use all data available for a case, similar to what a physician would do by looking
at a patient holistically. The limiting factor in terms of data access is now rather
linked to limited manual annotations, as the time of clinicians for annotations is
expensive. Global labels for images usually exist with the associated text reports
that describe images and outcomes. Still, these weak labels need to be made usable
with deep learning approaches that possibly require large amounts of data to generalize
well. Conclusions: Medical image retrieval has evolved strongly over the past 30 years
and can be integrated with several tools. For real clinical decision support, it is
still rarely used, partly because the certification process is tedious and commercial
benefit is not as easy to show as with detection or classification in a clear and
limited scenario. In terms of research, many resources are available that allow advances
also in the future. Still, certification and ethical aspects also need to be taken
into account to limit risks for individuals.

Beyond Relevance Feedback for Searching and Exploring Large Multimedia Collections

  • Marcel Worring

Relevance feedback was introduced over twenty years ago as a powerful tool for interactive
retrieval and still is the dominant mode of interaction in multimedia retrieval systems.
Over the years methods have improved and recently relevance feedback has become feasible
on even the largest collections available in the multimedia community. Yet, relevance
feedback typically targets the optimization of linear lists of search results and
thus focuses on only one of the many tasks on the search-explore axis. Truly interactive
retrieval systems have to consider the whole axis, and interactive categorization is
an overarching framework for many of those tasks. The multimedia analytics system
MediaTable exploits this to support users in gaining insight into large image collections.
Categorization as a representation of the collection and user tasks does not capture
the relations between items in the collection the way graphs do. Hypergraphs combine
categories and relations in one model and, as they are founded in set theory, are in
fact closely related to categorization. They therefore provide an elegant framework
to move forward. In this talk we highlight the progress that has been made in the
field of interactive retrieval and in the direction of multimedia analytics. We will
further consider the promises that new results in deep learning, especially in the
context of graph convolutional networks, and hypergraphs might bring to go beyond
relevance feedback.

SESSION: Tutorials

Automation of Deep Learning - Theory and Practice

  • Martin Wistuba
  • Ambrish Rawat
  • Tejaswini Pedapati

The growing interest in both the automation of machine learning and deep learning
has inevitably led to the development of a wide variety of methods to automate deep
learning. The choice of network architecture has proven critical, and many improvements
in deep learning are due to novel network structures. However, deep learning techniques
are computationally intensive and their use requires a high level of domain knowledge.
Even a partial automation of this process therefore helps to make deep learning more
accessible for everyone. In this tutorial we present a uniform formalism that enables
different methods to be categorized and compare the different approaches in terms
of their performance. We achieve this through a comprehensive discussion of the commonly
used architecture search spaces and architecture optimization algorithms based on
reinforcement learning and evolutionary algorithms as well as approaches that include
surrogate and one-shot models. In addition, we discuss approaches to accelerate the
search for neural architectures based on early termination and transfer learning and
address the new research directions, which include constrained and multi-objective
architecture search as well as the automated search for data augmentation, optimizers,
and activation functions.

One Perceptron to Rule Them All: Language, Vision, Audio and Speech

  • Xavier Giro-i-Nieto

Deep neural networks have boosted the convergence of multimedia data analytics in
a unified framework shared by practitioners in natural language, vision and speech.
Image captioning, lip reading or video sonorization are some of the first applications
of a new and exciting field of research exploiting the generalization properties of
deep neural representations. This tutorial will first review the basic neural architectures
to encode and decode vision, text, and audio, and then review those models that
have successfully translated information across modalities.

SESSION: Best Paper Session

Visual Relations Augmented Cross-modal Retrieval

  • Yutian Guo
  • Jingjing Chen
  • Hao Zhang
  • Yu-Gang Jiang

Retrieving relevant samples across multiple modalities is a primary topic that receives
consistent research interest in the multimedia community and has benefited various
real-world multimedia applications (e.g., text-based image search). Current models
mainly focus on learning a unified visual-semantic embedding space to bridge visual
contents and text queries, aiming to align relevant samples from different modalities
as neighbors in the embedding space. However, these models do not consider relations
between visual components when learning visual representations, which makes them unable
to distinguish images with the same visual components but different relations (see
Figure 1). To precisely model visual contents, we introduce a novel framework that
enhances visual representations with relations between components. Specifically, visual
relations are represented by the scene graph extracted from an image and then encoded
by graph convolutional neural networks to learn visual relational features. We combine
the relational and compositional representations for image-text retrieval. Empirical
results on the challenging MS-COCO and Flickr30K datasets demonstrate the effectiveness
of our proposed model for the cross-modal retrieval task.
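
As a rough illustration of this relational encoding (a hypothetical sketch, not the
authors' implementation; dimensions and the mean-aggregation rule are assumed),
scene-graph nodes can be passed through a single graph-convolution layer and fused with
a global compositional feature:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleGCNLayer(nn.Module):
        """One mean-aggregation graph convolution over scene-graph nodes."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, node_feats, adj):
            # node_feats: (N, in_dim), adj: (N, N) scene-graph adjacency
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
            agg = adj @ node_feats / deg          # average over neighbours
            return F.relu(self.linear(agg))

    nodes = torch.randn(5, 256)                   # 5 detected objects (hypothetical)
    adj = (torch.rand(5, 5) > 0.5).float()        # scene-graph edges (hypothetical)
    relational = SimpleGCNLayer(256, 256)(nodes, adj).mean(dim=0)  # pooled relation feature
    compositional = torch.randn(256)              # e.g. a global CNN feature
    image_embedding = torch.cat([relational, compositional])       # joint representation
    print(image_embedding.shape)                  # torch.Size([512])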

Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency

  • Eric Müller-Budack
  • Jonas Theiner
  • Sebastian Diering
  • Maximilian Idahl
  • Ralph Ewerth

The World Wide Web has become a popular source for gathering information and news.
Multimodal information, e.g., enriching text with photos, is typically used to convey
the news more effectively or to attract attention. The photos can be decorative, depict
additional details, or even contain misleading information. Quantifying the cross-modal
consistency of entity representations can assist human assessors in evaluating the
overall multimodal message. In some cases such measures might give hints to detect
fake news, which is an increasingly important topic in today's society. In this paper,
we present a multimodal approach to quantify the entity coherence between image and
text in real-world news. Named entity linking is applied to extract persons, locations,
and events from news texts. Several measures are suggested to calculate the cross-modal
similarity of these entities with the news photo, using state-of-the-art computer
vision approaches. In contrast to previous work, our system automatically gathers
example data from the Web and is applicable to real-world news. The feasibility is
demonstrated on two novel datasets that cover different languages, topics, and domains.

Human Object Interaction Detection via Multi-level Conditioned Network

  • Xu Sun
  • Xinwen Hu
  • Tongwei Ren
  • Gangshan Wu

As one of the essential problems in scene understanding, human object interaction
detection (HOID) aims to recognize fine-grained object-specific human actions, which
demands the capabilities of both visual perception and reasoning. Existing methods
based on convolutional neural network (CNN) utilize diverse visual features for HOID,
which are insufficient for complex human object interaction understanding. To enhance
the reasoning capability of CNNs, we propose a novel multi-level conditioned network
that fuses extra spatial-semantic knowledge with visual features. Specifically, we
construct a multi-branch CNN as backbone for multi-level visual representation. We
then encode extra knowledge including human body structure and object context as condition
to dynamically influence the feature extraction of CNN by affine transformation and
attention mechanism. Finally, we fuse the modulated multimodal features to distinguish
the interactions. The proposed method is evaluated on the two most frequently used
benchmarks, HICO-DET and V-COCO. The experimental results show that our method is
superior to the state-of-the-art.

Explaining with Counter Visual Attributes and Examples

  • Sadaf Gulshad
  • Arnold Smeulders

In this paper, we aim to explain the decisions of neural networks by utilizing multimodal
information, namely counter-intuitive attributes and counter visual examples, which
appear when perturbed samples are introduced. Different from previous work on interpreting
decisions using saliency maps, text, or visual patches, we propose to use attributes
and counter-attributes, and examples and counter-examples, as part of the visual
explanations. When humans explain visual decisions, they tend to do so by providing
attributes and examples. Hence, inspired by the way humans explain, in this paper we
provide attribute-based and example-based explanations. Moreover, humans also tend to
explain their visual decisions by adding counter-attributes and counter-examples to
explain what is not seen. We introduce directed perturbations in the examples to observe which
attribute values change when classifying the examples into the counter classes. This
delivers intuitive counter-attributes and counter-examples. Our experiments with both
coarse and fine-grained datasets show that attributes provide discriminating and human-understandable
intuitive and counter-intuitive explanations.

SESSION: Oral Session 1: Cross-Modal Analysis

Deep Semantic-Alignment Hashing for Unsupervised Cross-Modal Retrieval

  • Dejie Yang
  • Dayan Wu
  • Wanqian Zhang
  • Haisu Zhang
  • Bo Li
  • Weiping Wang

Deep hashing methods have achieved tremendous success in cross-modal retrieval due
to their low storage consumption and fast retrieval speed. In real cross-modal retrieval
applications, it is hard to obtain label information. Recently, increasing attention
has been paid to unsupervised cross-modal hashing. However, existing methods fail
to exploit the intrinsic connections between images and their corresponding descriptions
or tags (text modality). In this paper, we propose a novel Deep Semantic-Alignment
Hashing (DSAH) for unsupervised cross-modal retrieval, which sufficiently utilizes
the co-occurred image-text pairs. DSAH explores the similarity information of different
modalities and we elaborately design a semantic-alignment loss function, which elegantly
aligns the similarities between features with those between hash codes. Moreover,
to further bridge the modality gap, we innovatively propose to reconstruct features
of one modality with hash codes of the other one. Extensive experiments on three cross-modal
retrieval datasets demonstrate that DSAH achieves the state-of-the-art performance.
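
A minimal sketch of the semantic-alignment idea (our illustration, not the DSAH
implementation; batch size, feature dimension, and code length are assumed) aligns the
pairwise similarities of relaxed hash codes with those of the original features:

    import torch
    import torch.nn.functional as F

    def semantic_alignment_loss(img_feat, txt_feat, img_code, txt_code):
        # Similarity matrices over the co-occurring image-text pairs in a batch.
        s_feat = F.normalize(img_feat) @ F.normalize(txt_feat).t()   # (B, B)
        s_code = F.normalize(img_code) @ F.normalize(txt_code).t()   # (B, B)
        # Align hash-code similarities with feature similarities.
        return F.mse_loss(s_code, s_feat)

    B, d, k = 8, 512, 64                       # batch, feature dim, code length (assumed)
    img_feat, txt_feat = torch.randn(B, d), torch.randn(B, d)
    img_code = torch.tanh(torch.randn(B, k))   # relaxed binary codes
    txt_code = torch.tanh(torch.randn(B, k))
    print(semantic_alignment_loss(img_feat, txt_feat, img_code, txt_code))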

Forward and Backward Multimodal NMT for Improved Monolingual and Multilingual Cross-Modal
Retrieval

  • Po-Yao Huang
  • Xiaojun Chang
  • Alexander Hauptmann
  • Eduard Hovy

We explore methods to enrich the diversity of captions associated with pictures for
learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the
spirit of "A picture is worth a thousand words", it would take dozens of sentences
to capture each picture's content adequately. But in fact, real-world multimodal
datasets tend to provide only a few (typically, five) descriptions per image. For
cross-modal retrieval, the resulting lack of diversity and coverage prevents systems
from capturing the fine-grained inter-modal dependencies and intra-modal diversities
in the shared VSE space. Using the fact that the encoder-decoder architectures in
neural machine translation (NMT) have the capacity to enrich both monolingual and
multilingual textual diversity, we propose a novel framework leveraging multimodal
neural machine translation (MMT) to perform forward and backward translations based
on salient visual objects, generating additional text-image pairs that enable training
improved monolingual cross-modal retrieval (English-Image) and multilingual cross-modal
retrieval (English-Image and German-Image) models. Experimental results show that
the proposed framework can substantially and consistently improve the performance
of state-of-the-art models on multiple datasets. The results also suggest that the
models with multilingual VSE outperform the models with monolingual VSE.

Heterogeneous Non-Local Fusion for Multimodal Activity Recognition

  • Petr Byvshev
  • Pascal Mettes
  • Yu Xiao

In this work, we investigate activity recognition using multimodal inputs from heterogeneous
sensors. Activity recognition is commonly tackled from a single-modal perspective
using videos. In case multiple signals are used, they come from the same homogeneous
modality, e.g. in the case of color and optical flow. Here, we propose an activity
network that fuses multimodal inputs coming from completely different and heterogeneous
sensors. We frame such a heterogeneous fusion as a non-local operation. The observation
is that in a non-local operation, only the channel dimensions need to match. In the
network, heterogeneous inputs are fused, while maintaining the shapes and dimensionalities
that fit each input. We outline both asymmetric fusion, where one modality serves
to reinforce the other, and symmetric fusion variants. To further promote research into
multimodal activity recognition, we introduce GloVid, a first-person activity dataset
captured with video recordings and smart glove sensor readings. Experiments on GloVid
show the potential of heterogeneous non-local fusion for activity recognition, outperforming
individual modalities and standard fusion techniques.
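
A hypothetical sketch of such a non-local fusion (ours, with assumed shapes; not the
paper's exact network) shows why only the channel dimension has to match: attention is
computed across the flattened positions of each modality, so the video stream keeps its
own shape while being modulated by the glove signal:

    import torch
    import torch.nn as nn

    class NonLocalFusion(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.q = nn.Linear(channels, channels)
            self.k = nn.Linear(channels, channels)
            self.v = nn.Linear(channels, channels)

        def forward(self, x, y):
            # x: (B, Nx, C) video positions, y: (B, Ny, C) sensor positions
            attn = torch.softmax(self.q(x) @ self.k(y).transpose(1, 2)
                                 / x.shape[-1] ** 0.5, dim=-1)       # (B, Nx, Ny)
            return x + attn @ self.v(y)        # y modulates x; x keeps its shape

    video = torch.randn(2, 49, 128)            # e.g. a 7x7 spatial grid, 128 channels
    glove = torch.randn(2, 20, 128)            # e.g. 20 time steps of glove readings
    print(NonLocalFusion(128)(video, glove).shape)   # torch.Size([2, 49, 128])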

Trajectory Prediction Network for Future Anticipation of Ships

  • Pim Dijt
  • Pascal Mettes

This work investigates the anticipation of future ship locations based on multimodal
sensors. Predicting future trajectories of ships is an important component for the
development of safe autonomous sailing ships on water. A core challenge towards future
trajectory prediction is making sense of multiple modalities from vastly different
sensors, including GPS coordinates, radar images, and charts specifying water and
land regions. To that end, we propose a Trajectory Prediction Network, an end-to-end
approach for trajectory anticipation based on multimodal sensors. Our approach is
framed as a multi-task sequence-to-sequence network, with network components for coordinate
sequences and radar images. In the network, water/land segmentations from charts are
integrated as an auxiliary training objective. Since future anticipation of ships
has not previously been studied from such a multimodal perspective, we introduce the
Inland Shipping Dataset (ISD), a novel dataset for future anticipation of ships. Experimental
evaluation on ISD shows the potential of our approach, outperforming single-modal
variants and baselines from related anticipation tasks.

SESSION: Oral Session 2: Applications

Knowledge Enhanced Neural Fashion Trend Forecasting

  • Yunshan Ma
  • Yujuan Ding
  • Xun Yang
  • Lizi Liao
  • Wai Keung Wong
  • Tat-Seng Chua

Fashion trend forecasting is a crucial task for both academia and industry. Although
some efforts have been devoted to tackling this challenging task, they only studied
limited fashion elements with highly seasonal or simple patterns, which could hardly
reveal the real fashion trends. Towards insightful fashion trend forecasting, this work
focuses on investigating fine-grained fashion element trends for specific user groups.
We first contribute a large-scale fashion trend dataset (FIT) collected from Instagram
with extracted time series fashion element records and user information. Furthermore,
to effectively model the time series data of fashion elements with rather complex
patterns, we propose a Knowledge Enhanced Recurrent Network model (KERN) which takes
advantage of the capability of deep recurrent neural networks in modeling time series
data. Moreover, it leverages internal and external knowledge in the fashion domain that
affects the time-series patterns of fashion element trends. Such incorporation of
domain knowledge further enhances the deep learning model in capturing the patterns
of specific fashion elements and predicting the future trends. Extensive experiments
demonstrate that the proposed KERN model can effectively capture the complicated patterns
of objective fashion elements, thus producing preferable fashion trend forecasts.

Learning to Select Elements for Graphic Design

  • Guolong Wang
  • Zheng Qin
  • Junchi Yan
  • Liu Jiang

Selecting elements for graphic design is essential for ensuring a correct understanding
of clients' requirements as well as improving the efficiency of designers before a
fine-designed process. Some semi-automatic design tools provide layout templates, where
designers select elements according to the rectangular boxes that specify
how elements are placed. In practice, layout and element selection are complementary.
Compared to the layout which can be readily obtained from pre-designed templates,
it is generally time-consuming to mindfully pick out suitable elements, which calls
for automating element selection. To address this, we formulate element selection
as a sequential decision-making process and develop a deep element selection network
(DESN). Given a layout file with annotated elements, new graphical elements are selected
to form graphic designs based on aesthetics and consistency criteria. To train our
DESN, we propose an end-to-end, reinforcement learning based framework, where we design
a novel reward function that jointly accounts for visual aesthetics and consistency.
Based on this, visually readable and aesthetic drafts can be efficiently generated.
We further contribute a layout-poster dataset with exhaustively labeled attributes
of poster key elements. Qualitative and quantitative results indicate the efficacy
of our approach.

Actor-Critic Sequence Generation for Relative Difference Captioning

  • Zhengcong Fei

This paper investigates a new task named relative difference captioning, which aims to
generate a sentence to tell the difference between the given image pair. Difference
description is a crucial task for developing intelligent machines that can understand
and handle changeable visual scenes and applications. Towards that end, we propose
a reinforcement learning-based model, which utilizes a policy network and a value
network in a decision procedure to collaboratively produce a difference caption. Specifically,
the policy network works as an actor to estimate the probability of next word based
on the current state and the value network serves as a critic to predict all possible
extension values according to current action and state. To encourage generating correct
and meaningful descriptions, we leverage a visual-linguistic similarity-based reward
function as feedback. Empirical results on the recently released dataset demonstrate
the effectiveness of our method in comparison with various baselines and model variants.

Interactivity Proposals for Surveillance Videos

  • Shuo Chen
  • Pascal Mettes
  • Tao Hu
  • Cees G.M. Snoek

This paper introduces spatio-temporal interactivity proposals for video surveillance.
Rather than focusing solely on actions performed by subjects, we explicitly include
the objects that the subjects interact with. To enable interactivity proposals, we
introduce the notion of interactivityness, a score that reflects the likelihood that
a subject and object have an interplay. For its estimation, we propose a network containing
an interactivity block and geometric encoding between subjects and objects. The network
computes local interactivity likelihoods from subject and object trajectories, which
we use to link intervals of high scores into spatio-temporal proposals. Experiments
on an interactivity dataset with new evaluation metrics show the general benefit of
interactivity proposals as well as its favorable performance compared to traditional
temporal and spatio-temporal action proposals.

SESSION: Oral Session 3: Retrieval

Sentence-based and Noise-robust Cross-modal Retrieval on Cooking Recipes and Food
Images

  • Zichen Zan
  • Lin Li
  • Jianquan Liu
  • Dong Zhou

In recent years, people have been faced with billions of food images, videos, and recipes
on social media. An appropriate technology, such as a cross-modal retrieval framework,
is highly desired to retrieve accurate contents across food images and cooking recipes.
Based on our observations, the order of sequential sentences in recipes and the noise
in food images affect retrieval results. We take into account the sentence-level
sequential order of instructions and ingredients in recipes, and the noise portion in
food images, to propose a new framework for cross-modal retrieval. In our framework, we propose
three new strategies to improve the retrieval accuracy. (1) We encode recipe titles,
ingredients, instructions in sentence level, and adopt three attention networks on
multi-layer hidden state features separately to capture more semantic information.
(2) We apply attention mechanism to select effective features from food images incorporating
with recipe embeddings, and adopt an adversarial learning strategy to enhance modality
alignment. (3) We design a new triplet loss scheme with an effective sampling strategy
to reduce the noise impact on retrieval results. The experimental results show that
our framework clearly outperforms the state-of-the-art methods in terms of median rank
and recall rate at top k on the Recipe1M dataset.
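
For illustration, a generic bidirectional triplet loss with hardest in-batch negatives
(a common starting point for such objectives, not the paper's exact sampling scheme) can
be sketched as follows:

    import torch
    import torch.nn.functional as F

    def bidirectional_triplet_loss(recipe_emb, image_emb, margin=0.3):
        r, v = F.normalize(recipe_emb), F.normalize(image_emb)
        sim = r @ v.t()                                  # (B, B); diagonal = positives
        pos = sim.diag()
        mask = torch.eye(sim.size(0), dtype=torch.bool)
        neg_r2v = sim.masked_fill(mask, -1).max(dim=1).values   # hardest image negative
        neg_v2r = sim.masked_fill(mask, -1).max(dim=0).values   # hardest recipe negative
        return (F.relu(margin + neg_r2v - pos) + F.relu(margin + neg_v2r - pos)).mean()

    print(bidirectional_triplet_loss(torch.randn(16, 1024), torch.randn(16, 1024)))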

QIK: A System for Large-Scale Image Retrieval on Everyday Scenes With Common Objects

  • Arun Zachariah
  • Mohamed Gharibi
  • Praveen Rao

In this paper, we propose a system for large-scale image retrieval on everyday scenes
with common objects by leveraging advances in deep learning and natural language processing
(NLP). Unlike recent state-of-the-art approaches that extract image features from
a convolutional neural network (CNN), our system exploits the predictions made by
deep neural networks for image understanding tasks. Our system aims to capture the
relationships between objects in an everyday scene rather than just the individual
objects in the scene. It works as follows: for each image in the database, it generates
the most probable captions and detects objects in the image using state-of-the-art deep
learning models. The captions are parsed and represented by tree structures using
NLP techniques. These are stored and indexed in a database system. When a user poses
a query image, its caption is generated using deep learning and parsed into its corresponding
tree structures. Then an optimized tree-pattern query is constructed and executed
on the database to retrieve a set of candidate images. Finally, these candidate images
are ranked using the tree-edit distance metric computed on the tree structures. A
query based on only objects detected in the query image can also be formulated and
executed. In this case, the ranking scheme uses the probabilities of the detected
objects. We evaluated the performance of our system on the Microsoft COCO dataset
containing everyday scenes (with common objects) and observed that our system can
outperform state-of-the-art techniques in terms of mean average precision for large-scale
image retrieval.

Deep Discrete Attention Guided Hashing for Face Image Retrieval

  • Zhi Xiong
  • Dayan Wu
  • Wen Gu
  • Haisu Zhang
  • Bo Li
  • Weiping Wang

Recently, face image hashing has been proposed in large-scale face image retrieval
due to its storage and computational efficiency. However, owing to the large intra-identity
variation (same identity with different poses, illuminations, and facial expressions)
and the small inter-identity separability (different identities look similar) of face
images, existing face image hashing methods have limited power to generate discriminative
hash codes. In this work, we propose a deep hashing method specially designed for
face image retrieval named deep Discrete Attention Guided Hashing (DAGH). In DAGH,
the discriminative power of hash codes is enhanced by a well-designed discrete identity
loss, where not only the separability of the learned hash codes for different identities
is encouraged, but also the intra-identity variation of the hash codes for the same
identities is compacted. Besides, to obtain the fine-grained face features, DAGH employs
a multi-attention cascade network structure to highlight discriminative face features.
Moreover, we introduce a discrete hash layer into the network; together with the proposed
modified backpropagation algorithm, our model can be optimized under the discrete constraint.
Experiments on two widely used face image retrieval datasets demonstrate the inspiring
performance of DAGH over the state-of-the-art face image hashing methods.

Image Synthesis from Locally Related Texts

  • Tianrui Niu
  • Fangxiang Feng
  • Lingxuan Li
  • Xiaojie Wang

Text-to-image synthesis refers to generating photo-realistic images from text descriptions.
Recent works focus on generating images with complex scenes and multiple objects.
However, the text inputs to these models are only captions, which tend to describe
the most apparent object or feature of the image, while detailed information (e.g., visual
attributes) about regions and objects is often missing. Quantitative evaluation of
generation performances is still an unsolved problem, where traditional image classification-
or retrieval-based metrics fail at evaluating complex images. To address these problems,
we propose to generate images conditioned on locally-related texts, i.e., descriptions
of local image regions or objects instead of the whole image. Specifically, questions
and answers (QAs) are chosen as locally-related texts, which makes it possible to
use VQA accuracy as a new evaluation metric. The intuition is simple: higher image
quality and image-text consistency (both globally and locally) can help a VQA model
answer questions more correctly. We propose the VQA-GAN model with three key modules:
hierarchical QA encoder, QA-conditional GAN and external VQA loss. These modules help
leverage the new inputs effectively. Thorough experiments on two public VQA datasets
demonstrate the effectiveness of the model and the newly proposed metric.

SESSION: Oral Session 4: Semantic Enrichment

Automatic Color Scheme Extraction from Movies

  • Suzi Kim
  • Sunghee Choi

A color scheme is an association of colors, i.e., a subset of all possible colors,
that represents a visual identity. We propose an automated method to extract a color
scheme from a movie. Since a movie is a carefully edited video with different objects
and heterogeneous content embodying the director's messages and values, it is a challenging
task to extract a color scheme from a movie as opposed to a general video filmed at
once without distinction of shots or scenes. Despite such challenges, color scheme
extraction plays a very important role in film production and application. The color
scheme is an interpretation of the scenario by the cinematographer and it can convey
a mood or feeling that stays with the viewer after the movie has ended. It also acts
as a contributing factor to describe a film, like the metadata fields of a film such
as a genre, director, and casting. Moreover, it can be automatically tagged unlike
metadata, so it can be directly applied to the existing movie database without much
effort. Our method produces a color scheme from a movie in a bottom-up manner from
segmented shots. We formulate the color extraction as a selection problem where perceptually
important colors are selected using saliency. We introduce a semi-master-shot, an
alternative unit defined as a combination of contiguous shots taken in the same place
with similar colors. Using real movie videos, we demonstrate and validate the plausibility
of the proposed technique.

Compact Network Training for Person ReID

  • Hussam Lawen
  • Avi Ben-Cohen
  • Matan Protter
  • Itamar Friedman
  • Lihi Zelnik-Manor

The task of person re-identification (ReID) has attracted growing attention in recent
years leading to improved performance, albeit with little focus on real-world applications.
Most SotA methods are based on heavy pre-trained models, e.g., ResNet50 (~25M parameters),
which makes them less practical and architecture modifications more tedious to explore.
In this study, we focus on a small-sized randomly initialized model that enables us
to easily introduce architecture and training modifications suitable for person ReID.
The outcomes of our study are a compact network and a fitting training regime. We
show the robustness of the network by outperforming the SotA on both Market1501 and
DukeMTMC. Furthermore, we show the representation power of our ReID network via SotA
results on a different task of multi-object tracking.

Google Helps YouTube: Learning Few-Shot Video Classification from Historic Tasks and
Cross-Domain Sample Transfer

  • Xinzhe Zhou
  • Yadong Mu

The fact that video annotation is labor-intensive has inspired recent research on few-shot
video classification. The core motivation of our work is to mitigate the
supervision scarcity issue in this few-shot setting via cross-domain meta-learning.
Particularly, we aim to harness large-scale richly-annotated image data (i.e., source
domain) for few-shot video classification (i.e., target domain). The source data is
heterogeneous (image vs. video) and has noisy labels, so it is not directly usable in the
target domain. This work proposes meta-learning input-transformer (MLIT), a novel
deep network that tames the noisy source data such that they are more amenable for
being used in the target domain. It has two key traits. First, to bridge the data
distribution gap between source / target domains, MLIT includes learnable neural layers
to reweigh and transform the source data, effectively suppressing corrupted or noisy
source data. Secondly, MLIT is designed to learn from historic video classification
tasks in the target domain, which significantly elevates accuracy on unseen video
categories. Comprehensive empirical evaluations on two large-scale video datasets,
ActivityNet and Kinetics-400, have strongly shown the superiority of our proposed
method.

iSparse: Output Informed Sparsification of Neural Network

  • Yash Garg
  • K. Selçuk Candan

Deep neural networks have demonstrated unprecedented success in various multimedia
applications. However, the networks created are often very complex, with large numbers
of trainable edges that require extensive computational resources. We note that many
successful networks nevertheless often contain large numbers of redundant edges. Moreover,
many of these edges may have negligible contributions towards the overall network
performance. In this paper, we propose a novel iSparse framework and experimentally
show that we can sparsify the network without impacting its performance.
iSparse leverages a novel edge significance score, E, to determine the importance
of an edge with respect to the final network output. Furthermore, iSparse can be applied
either while training a model or on top of a pre-trained model, making it a retraining-free
approach with minimal computational overhead. Comparisons of iSparse against
Dropout, L1, DropConnect, Retraining-Free, and Lottery-Ticket Hypothesis on benchmark
datasets show that iSparse leads to effective network sparsifications.
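
As a simplified stand-in for such output-informed pruning (not iSparse's actual
significance score E), one can rank each edge of a layer by the product of its weight
magnitude and the mean input activation it sees, and zero out the lowest-scoring edges
without retraining:

    import torch
    import torch.nn as nn

    def sparsify_linear(layer: nn.Linear, inputs: torch.Tensor, keep: float = 0.5):
        act = inputs.abs().mean(dim=0)                 # mean |activation| per input unit
        score = layer.weight.abs() * act               # proxy for edge significance
        k = int(score.numel() * keep)
        thresh = score.flatten().kthvalue(score.numel() - k + 1).values
        mask = (score >= thresh).float()
        with torch.no_grad():
            layer.weight.mul_(mask)                    # drop low-significance edges
        return mask.mean()

    layer = nn.Linear(128, 64)
    kept = sparsify_linear(layer, torch.randn(256, 128), keep=0.5)
    print(f"kept {kept.item():.2f} of the edges")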

SESSION: Session: Posters (Full Length)

Super-Resolution Coding Defense Against Adversarial Examples

  • Yanjie Chen
  • Likun Cai
  • Wei Cheng
  • Hao Wang

Deep neural networks have achieved state-of-the-art performance in many fields including
image classification. However, recent studies show these models are vulnerable to
adversarial examples formed by adding small but intentional perturbations to clean
examples. In this paper, we introduce a significant defense method against adversarial
examples. The key idea is to leverage a super-resolution coding (SR-coding) network
to eliminate noise from adversarial examples. Furthermore, to boost the effect of
defending noise, we propose a novel hybrid approach that incorporates SR-coding and
adversarial training to train robust neural networks. Experiments on benchmark datasets
demonstrate the effectiveness of our method against both the state-of-the-art white-box
attacks and black-box attacks. The proposed approach significantly improves defense
performance and achieves up to a 41.26% accuracy improvement with ResNet18 under the
PGD white-box attack.

Continuous ODE-defined Image Features for Adaptive Retrieval

  • Fabio Carrara
  • Giuseppe Amato
  • Fabrizio Falchi
  • Claudio Gennaro

In the last years, content-based image retrieval has largely benefited from representations
extracted from deeper and more complex convolutional neural networks, which became
more effective but also more computationally demanding. Despite existing hardware
acceleration, query processing times may be easily saturated by deep feature extraction
in high-throughput or real-time embedded scenarios, and usually, a trade-off between
efficiency and effectiveness has to be accepted. In this work, we experiment with
the recently proposed continuous neural networks defined by parametric ordinary differential
equations, dubbed ODE-Nets, for adaptive extraction of image representations. Given
the continuous evolution of the network hidden state, we propose to approximate the
exact feature extraction by taking a previous "near-in-time" hidden state as features
with a reduced computational cost. To understand the potential and the limits of this
approach, we also evaluate an ODE-only architecture in which we minimize the number
of classical layers in order to delegate most of the representation learning process
--- and thus the feature extraction process --- to the continuous part of the model.
Preliminary experiments on standard benchmarks show that we are able to dynamically
control the trade-off between efficiency and effectiveness of feature extraction at
inference-time by controlling the evolution of the continuous hidden state. Although
ODE-only networks provide the best fine-grained control on the effectiveness-efficiency
trade-off, we observed that mixed architectures perform better or comparably to standard
residual nets in both the image classification and retrieval setups while using fewer
parameters and retaining the controllability of the trade-off.
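
The idea of taking an earlier "near-in-time" hidden state can be illustrated with a
fixed-step Euler integration of an ODE block (a sketch of ours, not the paper's solver
or architecture): stopping the integration after fewer steps yields a cheaper
approximation of the final features:

    import torch
    import torch.nn as nn

    class ODEBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

        def forward(self, h, t_end=1.0, steps=20):
            dt = t_end / steps
            for _ in range(steps):
                h = h + dt * self.f(h)         # Euler step of dh/dt = f(h)
            return h

    block = ODEBlock(256)
    x = torch.randn(4, 256)                    # features from the initial layers (assumed)
    full = block(x, steps=20)                  # finely integrated ("exact") features
    cheap = block(x, steps=5)                  # earlier, cheaper approximation
    print(torch.norm(full - cheap, dim=1))     # gap sets the quality/cost trade-off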

Search Result Clustering in Collaborative Sound Collections

  • Xavier Favory
  • Frederic Font
  • Xavier Serra

The large size of today's online multimedia databases makes retrieving their content
a difficult and time-consuming task. Users of online sound collections typically submit
search queries that express a broad intent, often making the system return large and
unmanageable result sets. Search Result Clustering is a technique that organises search-result
content into coherent groups, which allows users to identify useful subsets in their
results. Obtaining coherent and distinctive clusters that can be explored with a suitable
interface is crucial for making this technique a useful complement of traditional
search engines. In our work, we propose a graph-based approach using audio features
for clustering diverse sound collections obtained when querying large online databases.
We propose an approach to assess the performance of different features at scale, by
taking advantage of the metadata associated with each sound. This analysis is complemented
with an evaluation using ground-truth labels from manually annotated datasets. We
show that using a confidence measure for discarding inconsistent clusters improves
the quality of the partitions. After identifying the most appropriate features for
clustering, we conduct an experiment with users performing a sound design task, in
order to evaluate our approach and its user interface. A qualitative analysis is carried
out including usability questionnaires and semi-structured interviews. This provides
us with valuable new insights regarding the features that promote efficient interaction
with the clusters.

EfficientFAN: Deep Knowledge Transfer for Face Alignment

  • Pengcheng Gao
  • Ke Lu
  • Jian Xue

Face alignment plays an important role in many applications that process facial images.
At present, deep learning-based methods have achieved excellent results in face alignment.
However, these models usually have a large number of parameters, resulting in high
computational complexity and execution time. In this paper, a lightweight, efficient,
and effective model is proposed and named Efficient Face Alignment Network (EfficientFAN).
EfficientFAN adopts an encoder-decoder structure, using the simple backbone EfficientNet-B0
as the encoder and three deconvolutional layers as the decoder. Compared with state-of-the-art
models, it achieves equivalent performance with fewer model parameters, lower computation
cost, and higher speed. Moreover, the accuracy of EfficientFAN is further improved
by transferring deep knowledge of a complex teacher network through feature-aligned
distillation and patch similarity distillation. Extensive experimental results on
public data sets demonstrate the superiority of EfficientFAN over state-of-the-art
methods.

DAGC: Employing Dual Attention and Graph Convolution for Point Cloud based Place Recognition

  • Qi Sun
  • Hongyan Liu
  • Jun He
  • Zhaoxin Fan
  • Xiaoyong Du

Point cloud based retrieval for place recognition remains a challenging problem due to
the difficulty of efficiently encoding local features into an adequate global descriptor
of a scene. Existing studies solve this problem by generating a global descriptor for
each point cloud, which is used to retrieve the matched point cloud in a database.
However, existing studies do not make effective use of the relationships between points
and neglect different features' discrimination power. In this paper,
we propose to employ Dual Attention and Graph Convolution for point cloud based place
recognition (DAGC) to solve these issues. Specifically, we employ two modules to help
extract discriminative and generalizable features to describe a point cloud. We introduce
a Dual Attention module to help distinguish task-relevant features and to utilize
other points' different contributions to a point to generate representation. Meanwhile,
we introduce a Residual Graph Convolution Network (ResGCN) module to aggregate local
features of each point's multi-level neighbor points to further improve the representation.
In this way, we improve the descriptor generation by considering the importance of
both point and feature and leveraging point relationship. Experiments conducted on
different datasets show that our work outperforms current approaches on all evaluation
metrics.

PredNet and Predictive Coding: A Critical Review

  • Roshan Prakash Rane
  • Edit Szügyi
  • Vageesh Saxena
  • André Ofner
  • Sebastian Stober

PredNet, a deep predictive coding network developed by Lotter et al., combines a biologically
inspired architecture based on the propagation of prediction error with self-supervised
representation learning in video. While the architecture has drawn a lot of attention
and various extensions of the model exist, there is a lack of a critical analysis.
We fill in the gap by evaluating PredNet both as an implementation of the predictive
coding theory and as a self-supervised video prediction model using a challenging
video action classification dataset. We design an extended model to test if conditioning
future frame predictions on the action class of the video improves the model performance.
We show that PredNet does not yet completely follow the principles of predictive coding.
The proposed top-down conditioning leads to a performance gain on synthetic data,
but does not scale up to the more complex real-world action classification dataset.
Our analysis is aimed at guiding future research on similar architectures based on
the predictive coding theory.

Query-controllable Video Summarization

  • Jia-Hong Huang
  • Marcel Worring

When video collections become huge, how to explore both within and across videos efficiently
is challenging. Video summarization is one of the ways to tackle this issue. Traditional
summarization approaches limit the effectiveness of video exploration because they
only generate one fixed video summary for a given input video independent of the information
need of the user. In this work, we introduce a method which takes a text-based query
as input and generates a video summary corresponding to it. We do so by modeling video
summarization as a supervised learning problem and propose an end-to-end deep learning
based method for query-controllable video summarization to generate a query-dependent
video summary. Our proposed method consists of a video summary controller, video summary
generator, and video summary output module. To foster the research of query-controllable
video summarization and conduct our experiments, we introduce a dataset that contains
frame-based relevance score labels. Our experimental results show that the text-based
query helps control the video summary and also improves our model's performance. Our
code and dataset: https://github.com/Jhhuangkay/Query-controllable-Video-Summarization.

SESSION: Session: Posters (Short)

Semantic Gated Network for Efficient News Representation

  • Xuxiao Bu
  • Bingfeng Li
  • Yaxiong Wang
  • Jihua Zhu
  • Xueming Qian
  • Marco Zhao

Learning an efficient news representation is a fundamental yet important problem for
many tasks. Most existing news-related methods only use the textual information
while ignoring the visual cues from the illustrations. We argue that the textual
title and tags together with the visual illustrations form the main force of a piece
of news and are more efficient to express the news content. In this paper, we develop
a novel framework, namely Semantic Gated Network (SGN), to integrate the news title,
tags and visual illustrations to obtain an efficient joint textual-visual feature
for the news, by which we can directly measure the relevance between two pieces of
news. Particularly, we first harvest the tag embeddings by the proposed self-supervised
classification model. Besides, the news title is fed into a sentence encoder pretrained
on pairs of semantically relevant news items to learn efficient contextualized word vectors.
Then the feature of the news title is extracted based on the learned vectors and we
combine it with features of tags to obtain textual feature. Finally, we design a novel
mechanism named semantic gate to adaptively fuse the textual feature and the image
feature. Extensive experiments on benchmark dataset demonstrate the effectiveness
of our approach.

System Fusion with Deep Ensembles

  • Liviu-Daniel Ştefan
  • Mihai Gabriel Constantin
  • Bogdan Ionescu

Deep neural networks (DNNs) are universal estimators that have achieved state-of-the-art
performance in a broad spectrum of classification tasks, opening new perspectives
for many applications. One of them is addressing ensemble learning. In this paper,
we introduce a set of deep learning techniques for ensemble learning with dense, attention,
and convolutional neural network layers. Our approach automatically discovers patterns
and correlations between the decisions of individual classifiers, therefore, alleviating
the difficulty of building such architectures. To assess its robustness, we evaluate
our approach on two complex data sets that target different perspectives of predicting
the user perception of multimedia data, i.e., interestingness and violence. The proposed
approach outperforms the existing state-of-the-art algorithms by a large margin.

Reducing Response Time for Multimedia Event Processing using Domain Adaptation

  • Asra Aslam
  • Edward Curry

The Internet of Multimedia Things (IoMT) is an emerging concept due to the large amount
of multimedia data produced by sensing devices. Existing event-based systems mainly
focus on scalar data, and multimedia event-based solutions are domain-specific. Multiple
applications may require handling of numerous known/unknown concepts which may belong
to the same/different domains with an unbounded vocabulary. Although deep neural network-based
techniques are effective for image recognition, the limitation of having to train
classifiers for unseen concepts will lead to an increase in the overall response-time
for users. Since it is not practical to have all trained classifiers available, it
is necessary to address the problem of training classifiers on demand for an unbounded
vocabulary. By exploiting transfer learning based techniques, evaluations showed that
the proposed framework can answer within ~0.01 min to ~30 min of response-time with
accuracy ranging from 95.14% to 98.53%, even when all subscriptions are new/unknown.

Are You Watching Closely? Content-based Retrieval of Hand Gestures

  • Mahnaz Amiri Parian
  • Luca Rossetto
  • Heiko Schuldt
  • Stéphane Dupont

Gestures play an important role in our daily communications. However, recognizing
and retrieving gestures in the wild is a challenging task that has not been explored
thoroughly in the literature. In this paper, we explore the problem of identifying and retrieving
gestures in a large-scale video dataset provided by the computer vision community
and based on queries recorded in-the-wild. Our proposed pipeline, I3DEF, is based
on the extraction of spatio-temporal features from intermediate layers of an I3D network,
a state-of-the-art network for action recognition, and the fusion of the output of
feature maps from RGB and optical flow input. The obtained embeddings are used to
train a triplet network to capture the similarity between gestures. We further explore
the effect of a person and body part masking step for improving both retrieval performance
and recognition rate. Our experiments show the ability of I3DEF to recognize and retrieve
gestures which are similar to the queries independently of the depth modality. This
performance holds both for queries taken from the test data, and for queries using
recordings from different people performing relevant gestures in a different setting.

Efficient Base Class Selection Algorithms for Few-Shot Classification

  • Takumi Ohkuma
  • Hideki Nakayama

Few-shot classification is a task to learn a classifier for novel classes with a limited
number of examples on top of the known base classes which have a sufficient number
of examples. In recent years, significant progress has been achieved on this task.
However, despite the importance of selecting the base classes themselves for better
knowledge transfer, few works have paid attention to this point. In this paper, we
propose two types of base class selection algorithms that are suitable for few-shot
classification tasks. One is based on the thesaurus-tree structure of class names,
and the other is based on word embeddings. In our experiments using representative
few-shot learning methods on the ILSVRC dataset, we show that these two algorithms
can significantly improve the performance compared to a naive class selection method.
Moreover, they do not require high computational and memory costs, which is an important
advantage to scale to a very large number of base classes.
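
As a rough illustration only (not the paper's algorithms; it assumes precomputed,
L2-normalized word embeddings), base classes could be scored by their embedding
similarity to the classes of interest and the top-scoring ones selected:

    import numpy as np

    def select_base_classes(base_emb, novel_emb, n_select):
        # base_emb: (B, d), novel_emb: (N, d); rows are L2-normalized word embeddings
        sim = base_emb @ novel_emb.T               # cosine similarities, (B, N)
        score = sim.max(axis=1)                    # closeness to the nearest target class
        return np.argsort(-score)[:n_select]       # indices of the selected base classes

    rng = np.random.default_rng(0)
    base = rng.normal(size=(1000, 300))
    base /= np.linalg.norm(base, axis=1, keepdims=True)
    novel = rng.normal(size=(20, 300))
    novel /= np.linalg.norm(novel, axis=1, keepdims=True)
    print(select_base_classes(base, novel, n_select=50))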

A Crowd Analysis Framework for Detecting Violence Scenes

  • Konstantinos Gkountakos
  • Konstantinos Ioannidis
  • Theodora Tsikrika
  • Stefanos Vrochidis
  • Ioannis Kompatsiaris

This work examines violence detection in video scenes of crowds and proposes a crowd
violence detection framework based on a 3D convolutional deep learning architecture,
the 3D-ResNet model with 50 layers. The proposed framework is evaluated on the Violent
Flows dataset against several state-of-the-art approaches and achieves higher accuracy
values in almost all cases, while also performing the violence detection activities
in (near) real-time.

Towards Evaluating and Simulating Keyword Queries for Development of Interactive Known-item
Search Systems

  • Ladislav Peška
  • František Mejzlík
  • Tomáš Souček
  • Jakub Lokoč

Searching for memorized images in large datasets (known-item search) is a challenging
task due to a limited effectiveness of retrieval models as well as limited ability
of users to formulate suitable queries and choose an appropriate search strategy.
A popular option to approach the task is to automatically detect semantic concepts
and rely on interactive specification of keywords during the search session. Nonetheless,
employed instances of such search models are often set arbitrarily in existing KIS
systems, as comprehensive evaluations with real users are time-demanding. This paper
envisions and investigates an option to simulate keyword queries in a selected "toy"
(yet competitive) keyword search model relying on a deep image classification network.
Specifically, two properties of such keyword-based model are experimentally investigated
with our known-item search benchmark dataset: which output transformation and ranking
models are effective for the utilized classification model and whether there are some
options for simulations of keyword queries. In addition to the main objective, the
paper also inspects the effect of interactive query reformulations for the considered
keyword search model.

Itinerary Planning via Deep Reinforcement Learning

  • Shengxin Chen
  • Bo-Hao Chen
  • Zhaojiong Chen
  • YunBing Wu

Itinerary planning that provides tailor-made tours for each traveler is a fundamental
yet inefficient task in route recommendation. In this paper, we propose an automatic
route recommendation approach with deep reinforcement learning to solve the itinerary
planning problem. We formulate automatic generation of route recommendation as Markov
Decision Process (MDP) and then solve it by our variational agent optimized through
deep Q-learning algorithm. We train our agent using open data over various cities
and show that the agent accomplishes notable improvement in comparison with other
state-of-the-art methods.

Confidence-based Weighted Loss for Multi-label Classification with Missing Labels

  • Karim M. Ibrahim
  • Elena V. Epure
  • Geoffroy Peeters
  • Gaël Richard

The problem of multi-label classification with missing labels (MLML) is a common challenge
that is prevalent in several domains, e.g. image annotation and auto-tagging. In multi-label
classification, each instance may belong to multiple class labels simultaneously.
Due to the nature of the dataset collection and labelling procedure, it is common
to have incomplete annotations in the dataset, i.e. not all samples are labelled with
all the corresponding labels. However, the incomplete data labelling hinders the training
of classification models. MLML has received much attention from the research community.
However, in cases where a pre-trained model is fine-tuned on an MLML dataset, there
has been no straightforward approach to tackle the missing labels, specifically when
there is no information about which are the missing ones. In this paper, we propose
a weighted loss function to account for the confidence in each label/sample pair that
can easily be incorporated to fine-tune a pre-trained model on an incomplete dataset.
Our experimental results show that using the proposed loss function improves the performance
of the model as the ratio of missing labels increases.
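
A minimal sketch of such a confidence-weighted loss (an assumed form for illustration,
not necessarily the paper's exact weighting) down-weights label/sample pairs whose
annotation is uncertain, e.g., unobserved negatives in an incomplete dataset:

    import torch
    import torch.nn.functional as F

    def confidence_weighted_bce(logits, targets, confidence):
        # logits, targets, confidence: (batch, num_labels); confidence in [0, 1]
        loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        return (confidence * loss).sum() / confidence.sum().clamp(min=1e-8)

    logits = torch.randn(4, 10)
    targets = (torch.rand(4, 10) > 0.8).float()
    confidence = torch.where(targets == 1,
                             torch.ones_like(targets),        # observed positives
                             torch.full_like(targets, 0.3))   # unverified negatives
    print(confidence_weighted_bce(logits, targets, confidence))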

Learning Fine-Grained Similarity Matching Networks for Visual Tracking

  • Dawei Zhang
  • Zhonglong Zheng
  • Xiaowei He
  • Liu Su
  • Liyuan Chen

Recently, Siamese trackers have become increasingly popular in the visual tracking community.
Despite great success, it is still difficult to perform robust tracking in various
challenging scenarios. In this paper, we propose a novel similarity matching network,
that effectively extracts fine-grained semantic features by adding a Classification
branch and a Category-Aware module into the classical Siamese framework (CCASiam).
More specifically, the supervision module can fully utilize the class information
to obtain a loss for classification and the whole network performs tracking loss,
so that the network can extract more discriminative features for each specific target.
During online tracking, the classification branch is removed and the category-aware
module is designed to guide the selection of target-active features using a ridge
regression network, which avoids unnecessary calculations and over-fitting. Furthermore,
we introduce different types of attention mechanisms to selectively emphasize important
semantic information. Due to the fine-grained and category-aware features, CCASiam
can perform high performance tracking efficiently. Extensive experimental results
on several tracking benchmarks, show that the proposed tracker obtains the state-of-the-art
performance with a real-time speed.

At the Speed of Sound: Efficient Audio Scene Classification

  • Bo Dong
  • Cristian Lumezanu
  • Yuncong Chen
  • Dongjin Song
  • Takehiko Mizoguchi
  • Haifeng Chen
  • Latifur Khan

Efficient audio scene classification is essential for smart sensing platforms such
as robots, medical monitoring, surveillance, or autonomous vehicles. We propose a
retrieval-based scene classification architecture that combines recurrent neural networks
and attention to compute embeddings for short audio segments. We train our framework
using a custom audio loss function that captures both the relevance of audio segments
within a scene and that of sound events within a segment. Using experiments on real
audio scenes, we show that we can discriminate audio scenes with high accuracy after
listening in for less than a second. This preserves 93% of the detection accuracy
obtained after hearing the entire scene.
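
The paper's architecture details are not given in the abstract; a minimal sketch of the general idea of attention-pooling recurrent outputs into a segment embedding (the query-vector parametrization below is an assumption) could look like this.

```python
import numpy as np

def attention_pool(frame_feats, w):
    """Collapse per-frame features (T x D) into one segment embedding
    using a simple dot-product attention with a learned query vector w
    (hypothetical parametrization; the paper's attention may differ)."""
    scores = frame_feats @ w                        # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over time
    return weights @ frame_feats                    # (D,) weighted average

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 128))   # e.g. RNN outputs for a short segment
query = rng.normal(size=128)
embedding = attention_pool(frames, query)
print(embedding.shape)  # (128,)
```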

Imageability Estimation using Visual and Language Features

  • Chihaya Matsuhira
  • Marc A. Kastner
  • Ichiro Ide
  • Yasutomo Kawanishi
  • Takatsugu Hirayama
  • Keisuke Doman
  • Daisuke Deguchi
  • Hiroshi Murase

Imageability is a concept from psycholinguistics quantifying the human perception of
words. However, existing datasets are created through subjective experiments and are
thus very small. Therefore, methods to automatically estimate imageability can
be helpful. For an accurate automatic imageability estimation, we extend the idea
of a psychological hypothesis called the Dual-Coding Theory, which discusses the connection
between our perception of visual information and language information, and also focus
on the relationship between the pronunciation of a word and its imageability. In this
research, we propose a method to estimate the imageability of words using both visual
and language features extracted from corresponding data. For the estimation, we use
visual features extracted from low- and high-level image features, and language features
extracted from textual features and phonetic features of words. Evaluations show that
our proposed method can estimate imageability more accurately than comparative methods,
implying the contribution of each feature to the imageability.

Image Retrieval using Multi-scale CNN Features Pooling

  • Federico Vaccaro
  • Marco Bertini
  • Tiberio Uricchio
  • Alberto Del Bimbo

In this paper, we address the problem of image retrieval by learning image representations
based on the activations of a Convolutional Neural Network. We present an end-to-end
trainable network architecture that exploits a novel multi-scale local pooling based
on NetVLAD and a triplet mining procedure based on sample difficulty to obtain an
effective image representation. Extensive experiments show that our approach is able
to reach state-of-the-art results on three standard datasets.
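
The paper's exact mining rule is not described in the abstract; as a hedged illustration of difficulty-based triplet mining, a common semi-hard selection heuristic looks like this.

```python
import numpy as np

def mine_semi_hard_negative(anchor, positive, negatives, margin=0.2):
    """Pick the 'most useful' negative for a triplet: the hardest negative
    that is still farther from the anchor than the positive (semi-hard).
    This is one common difficulty-based mining heuristic; the paper's
    exact mining rule may differ."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(negatives - anchor, axis=1)
    semi_hard = np.where(d_an > d_ap)[0]
    if len(semi_hard) == 0:                 # fall back to hardest negative
        return int(np.argmin(d_an))
    return int(semi_hard[np.argmin(d_an[semi_hard])])

rng = np.random.default_rng(1)
a, p = rng.normal(size=64), rng.normal(size=64)
negs = rng.normal(size=(10, 64))
print(mine_semi_hard_negative(a, p, negs))
```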

Analysis of the Effect of Dataset Construction Methodology on Transferability of Music
Emotion Recognition Models

  • Sabina Hult
  • Line Bay Kreiberg
  • Sami Sebastian Brandt
  • Björn Þór Jónsson

Indexing and retrieving music based on emotion is a powerful retrieval paradigm with
many applications. Traditionally, studies in the field of music emotion recognition
have focused on training and testing supervised machine learning models using a single
music dataset. To be useful for today's vast music libraries, however, such machine
learning models must be widely applicable beyond the dataset for which they were created.
In this work, we analyze to what extent models trained on one music dataset can predict
emotion in another dataset constructed using a different methodology, by conducting
cross-dataset experiments with three publicly available datasets. Our results suggest
that training a prediction model on a homogeneous dataset with carefully collected
emotion annotations yields a better foundation than prediction models learned on a
larger, more varied dataset with less reliable annotations.

One Shot Logo Recognition Based on Siamese Neural Networks

  • Camilo Vargas
  • Qianni Zhang
  • Ebroul Izquierdo

This work presents an approach for one-shot logo recognition that relies on a Siamese
neural network (SNN) embedded with a pre-trained model that is fine-tuned on a challenging
logo dataset. Although the model is fine-tuned using logo images, the training and
testing datasets do not have overlapping categories, meaning that all the classes
used for testing the one-shot recognition framework remain unseen during the fine-tuning
process. The recognition process follows the standard SNN approach in which a pair
of input images are encoded by each sister network. The encoded outputs for each image
are afterwards compared using a trained metric and thresholded to define matches and
mismatches. The proposed approach achieves an accuracy of 77.07% under the one-shot
constraints in the QMUL-OpenLogo dataset. Code is available at https://github.com/cjvargasc/oneshot_siamese/.
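
As a hedged sketch of the standard SNN comparison step described above (the weighted-L1 metric and the 0.5 threshold are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def snn_match(emb_a, emb_b, metric_w, threshold=0.5):
    """Compare two sister-network embeddings with a learned weighted-L1
    metric followed by a sigmoid, then threshold into match / mismatch.
    The weighting and threshold here are illustrative assumptions."""
    score = 1.0 / (1.0 + np.exp(-np.dot(metric_w, np.abs(emb_a - emb_b))))
    return score, score >= threshold

rng = np.random.default_rng(2)
e1, e2 = rng.normal(size=128), rng.normal(size=128)
w = rng.normal(size=128)
print(snn_match(e1, e2, w))
```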

Visual Story Ordering with a Bidirectional Writer

  • Wei-Rou Lin
  • Hen-Hsen Huang
  • Hsin-Hsi Chen

This paper introduces visual story ordering, a challenging task in which images and
text are ordered in a visual story jointly. We propose a neural network model based
on the reader-processor-writer architecture with a self-attention mechanism. A novel
bidirectional decoder is further proposed with bidirectional beam search. Experimental
results show the effectiveness of the approach. The information gained from multimodal
learning is presented and discussed. We also find that the proposed embedding narrows
the distance between images and their corresponding story sentences, even though we
do not align the two modalities explicitly. As it addresses a general issue in generative
models, the proposed bidirectional inference mechanism applies to a variety of applications.

Salienteye: Maximizing Engagement While Maintaining Artistic Style on Instagram Using
Deep Neural Networks

  • Lili Wang
  • Ruibo Liu
  • Soroush Vosoughi

Instagram has become a great venue for amateur and professional photographers alike
to showcase their work. It has, in other words, democratized photography. Generally,
photographers take thousands of photos in a session, from which they pick a few to
showcase their work on Instagram. Photographers trying to build a reputation on Instagram
have to strike a balance between maximizing their followers' engagement with their
photos, while also maintaining their artistic style. We used transfer learning to
adapt Xception, which is a model for object recognition trained on the ImageNet dataset,
to the task of engagement prediction and utilized Gram matrices generated from VGG19,
another object recognition model trained on ImageNet, for the task of style similarity
measurement on photos posted on Instagram. Our models can be trained on individual
Instagram accounts to create personalized engagement prediction and style similarity
models. Once trained on their accounts, users can have new photos sorted based on
predicted engagement and style similarity to their previous work, thus enabling them
to upload photos that not only have the potential to maximize engagement from their
followers but also maintain their style of photography. We trained and validated our
models on several Instagram accounts, showing them to be adept at both tasks and outperforming
several baseline models and human annotators.
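
Style similarity via Gram matrices of VGG19 activations is a well-known technique; a minimal single-layer sketch follows (real usage typically combines several VGG19 layers, and the exact layers used by the paper are not stated in the abstract).

```python
import numpy as np

def gram_matrix(feature_map):
    """Gram matrix of a CNN feature map (C x H x W), as used in neural
    style transfer; here it would come from VGG19 activations."""
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_distance(fm_new, fm_reference):
    """Smaller distance between Gram matrices = closer artistic style.
    A single layer is used here for brevity."""
    return float(np.mean((gram_matrix(fm_new) - gram_matrix(fm_reference)) ** 2))

rng = np.random.default_rng(3)
print(style_distance(rng.normal(size=(64, 28, 28)), rng.normal(size=(64, 28, 28))))
```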

Attention Mechanisms, Signal Encodings and Fusion Strategies for Improved Ad-hoc Video
Search with Dual Encoding Networks

  • Damianos Galanopoulos
  • Vasileios Mezaris

In this paper, the problem of unlabeled video retrieval using textual queries is addressed.
We present an extended dual encoding network which makes use of more than one encoding
of the visual and textual content, as well as two different attention mechanisms.
The latter serve the purpose of highlighting temporal locations in every modality
that can contribute more to effective retrieval. The different encodings of the visual
and textual inputs, along with early/late fusion strategies, are examined for further
improving performance. Experimental evaluations and comparisons with state-of-the-art
methods document the merit of the proposed network.

Emotion Recognition from Galvanic Skin Response Signal Based on Deep Hybrid Neural
Networks

  • Imam Yogie Susanto
  • Tse-Yu Pan
  • Chien-Wen Chen
  • Min-Chun Hu
  • Wen-Huang Cheng

Emotion reflects human beings' physiological and psychological status. Galvanic Skin
Response (GSR) can reveal the electrical characteristics of human skin and is widely
used to recognize the presence of emotion. In this work, we propose an emotion recognition
framework based on deep hybrid neural networks, in which a 1D CNN and a Residual Bidirectional
GRU are employed for time-series data analysis. The experimental results show that
the proposed method can outperform other state-of-the-art methods. In addition, we
port the proposed emotion recognition model onto a Raspberry Pi and design a real-time
emotion interaction robot to verify the efficiency of this work.

SESSION: Session: Brave New Ideas

Automatic Evaluation of Iconic Image Retrieval based on Colour, Shape, and Texture

  • Riku Togashi
  • Sumio Fujita
  • Tetsuya Sakai

Product image search is required to deal with large target image datasets which are
frequently updated, and therefore it is not always practical to maintain exhaustive
and up-to-date relevance assessments for tuning and evaluating the search engine.
Moreover, in similar product image search where the query is also an image, it is
difficult to identify the possible search intents behind it and thereby verbalise
the relevance criteria for the assessors, especially if graded relevance assessments
are required. In this study, we focus on similar product image search within a given
product category (e.g., shoes), wherein each image is iconic (i.e., the image clearly
shows what the product looks like and basically nothing else), and propose an initial
approach to evaluating the task without relying on manual relevance assessments. More
specifically, we build a simple probabilistic model that assumes that an image is
generated from latent intents representing shape, texture, and colour, which enables
us to estimate the relevance score of each image and thereby compute graded relevance
measures for any image search engine result page. Through large-scale crowdsourcing
experiments, we demonstrate that our proposed measures, InDCG (which is based on per-intent
binary relevance) and D-InDCG (which is based on per-intent graded relevance), align
reasonably well with human SERP preferences and with human image preferences. Hence,
our automatic measures may be useful at least for rough tuning and evaluation of similar
product image search.

HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do

  • Keith Curtis
  • George Awad
  • Shahzad Rajput
  • Ian Soboroff

In this paper we propose a new evaluation challenge and direction in the area of High-level
Video Understanding. The challenge we are proposing is designed to test automatic
video analysis and understanding, and how accurately systems can comprehend a movie
in terms of actors, entities, events and their relationship to each other. A pilot
High-Level Video Understanding (HLVU) dataset of open-source movies was collected
for human assessors to build a knowledge graph representing each of them. A set of
queries will be derived from the knowledge graph to test systems on retrieving relationships
among actors, as well as reasoning and retrieving non-visual concepts. The objective
is to benchmark if a computer system can "understand" non-explicit but obvious relationships
the same way humans do when they watch the same movies. This is a long-standing problem
that is being addressed in the text domain, and this project moves similar research
to the video domain. Work of this nature is foundational to future video analytics
and video understanding technologies. This work can be of interest to streaming services
and broadcasters hoping to provide more intuitive ways for their customers to interact
with and consume video content.

On Visualizations in the Role of Universal Data Representation

  • Tomáš Skopal

The deep learning revolution changed the world of machine learning and boosted the
AI industry as such. In particular, the most effective models for image retrieval
are based on deep convolutional neural networks (DCNN), outperforming the traditional
"hand-engineered" models by far. However, this tremendous success was redeemed by
a high cost in the form of an exhaustive gathering of labeled data, followed by designing
and training the DCNN models. In this paper, we outline a vision of a framework for
instant transfer learning, where a generic pre-trained DCNN model is used as a universal
feature extraction method for visualized unstructured data in many (non-visual) domains.
The deep feature descriptors are then usable in similarity search tasks (database
queries, joins) and in other parts of the data processing pipeline. The envisioned
framework should enable practitioners to instantly use DCNN-based data representations
in their new domains without the need for the costly training step. Moreover, by use
of the framework the information visualization community could acquire a versatile
metric for measuring the quality of data visualizations, which is generally a difficult
task.

SESSION: Session: Doctoral Symposium

An Interactive Learning System for Large-Scale Multimedia Analytics

  • Omar Shahbaz Khan

Analyzing multimedia collections in order to gain insight is a common desire amongst
industry and society. Recent research has shown that while machines are getting better
at analyzing multimedia data, they still lack the understanding and flexibility of
humans. A central conjecture in Multimedia Analytics is that interactive learning
is a key method to bridge the gap between human and machine. We investigate the requirements
and design of the Exquisitor system, a very large-scale interactive learning system
that aims to verify the validity of this conjecture. We describe the architecture
and initial scalability results for Exquisitor, and propose research directions related
to both performance and result quality.

Object Detection for Unseen Domains while Reducing Response Time using Knowledge Transfer
in Multimedia Event Processing

  • Asra Aslam

Event recognition is one of the popular areas of smart cities that has attracted
great attention from researchers. Since the Internet of Things (IoT) is mainly focused
on scalar data events, research is shifting towards the Internet of Multimedia Things
(IoMT) and is still in its infancy. Presently, multimedia event-based solutions provide
low response time, but they are domain-specific and can handle only familiar classes
(bounded vocabulary). However, multiple applications within smart cities may require
processing of numerous familiar as well as unseen concepts (unbounded vocabulary)
in the form of subscriptions. Deep-neural-network-based techniques are popular for
image recognition, but are limited by the need to train classifiers for unseen concepts
and by the requirement of annotated bounding boxes with images. In this work,
we explore the problem of training classifiers for unseen/unknown classes while
reducing the response time of multimedia event processing (specifically object detection).
We propose two domain-adaptation-based models leveraging Transfer Learning
(TL) and Large Scale Detection through Adaptation (LSDA). The preliminary results
show that the proposed framework can achieve 0.5 mAP (mean Average Precision) within 30
min of response time for unseen concepts. We expect to improve it further using modified
LSDA while applying fast classification (MobileNet) and detection (YOLOv3) networks,
along with eliminating the requirement of annotated bounding boxes.

Enabling Relevance-Based Exploration of Cataract Videos

  • Negin Ghamsarian

Training new surgeons is one of the major duties of experienced expert surgeons and demands
a considerable supervisory investment from them. To expedite the training process and
subsequently reduce the extra workload on their tight schedules, surgeons are seeking
a surgical video retrieval system. Automatic workflow analysis approaches can optimize
the training procedure by indexing the surgical video segments to be used for online
video exploration. The aim of the doctoral project described in this paper is to provide
the basis for a cataract video exploration system that is able to (i) automatically
analyze and extract the relevant segments of videos from cataract surgery, and (ii)
provide interactive exploration means for browsing archives of cataract surgery videos.
In particular, we apply deep-learning-based classification and segmentation approaches
to cataract surgery videos to enable automatic phase and action recognition and similarity
detection.

SESSION: Session: Demonstrations

Automatic Reminiscence Therapy for Dementia

  • Mariona Carós
  • Maite Garolera
  • Petia Radeva
  • Xavier Giro-i-Nieto

With people living longer than ever, the number of cases with dementia such as Alzheimer's
disease increases steadily. It affects more than 46 million people worldwide, and
it is estimated that in 2050 more than 100 million will be affected. While there are
no effective treatments for these terminal diseases, therapies such as reminiscence,
which stimulate memories from the past, are recommended. Currently, reminiscence therapy
takes place in care homes and is guided by a therapist or a carer. In this work, we
present an AI-based solution to automate reminiscence therapy. It consists of
a dialogue system that uses photos of the users as input to generate questions about
their life. Overall, this paper presents how reminiscence therapy can be automated
by using deep learning, and deployed to smartphones and laptops, making the therapy
more accessible to every person affected by dementia.

Music Tower Blocks: Multi-Faceted Exploration Interface for Web-Scale Music Access

  • Markus Schedl
  • Michael Mayr
  • Peter Knees

We present Music Tower Blocks, a novel browsing interface for interactive music visualization,
capable of dealing with web-scale music collections nowadays offered by major music
streaming services. Based on a clustering created from fused metadata and acoustic
features, a block-based skyline landscape is constructed. It can be navigated by the
user in several ways (zooming, panning, changing angle of slope). User-adjustable
color coding is used for highlighting various facets, e.g., visualizing the distributions
of genres and acoustic features. Furthermore, several search and filtering capabilities
are provided (e.g., search for artists and tracks; filtering with respect to track
popularity to focus on top hits or discovering unknown gems). In addition, Music Tower
Blocks allows users to connect their personal music streaming profiles and highlight
their favorite or recently-listened-to music on the landscape, supporting the exploration
of parts of the landscape near to (or far away from) their own taste.

A Framework for Paper Submission Recommendation System

  • Dinh V. Cuong
  • Dac H. Nguyen
  • Son Huynh
  • Phong Huynh
  • Cathal Gurrin
  • Minh-Son Dao
  • Duc-Tien Dang-Nguyen
  • Binh T. Nguyen

Nowadays, recommendation systems play an indispensable role in many fields, including
e-commerce, finance, economy, and gaming. There is emerging research on publication
venue recommendation systems to support researchers when submitting their scientific
work. Several publishers such as IEEE, Springer, and Elsevier have implemented their
own submission recommendation systems to help researchers choose appropriate conferences
or journals for submission. In this work, we present a demo framework to construct
an effective recommendation system for paper submission. With the input data (the
title, the abstract, and the list of possible keywords) of a given manuscript, the
system recommends the list of top relevant journals or conferences to authors. By
using state-of-the-art techniques in natural language understanding, we combine the
features extracted with other useful handcrafted features. We utilize deep learning
models to build an efficient recommendation engine for the proposed system. Finally,
we present the User Interface (UI) and the architecture of our paper submission recommendation
system for later use by researchers.

surgXplore: Interactive Video Exploration for Endoscopy

  • Andreas Leibetseder
  • Klaus Schoeffmann

Accumulating recordings of daily conducted surgical interventions such as endoscopic
procedures over the long term generates very large video archives that are difficult
to both search and explore. Since physicians utilize this kind of media routinely for documentation,
treatment planning, or education and training, it can be considered a crucial task
to make such archives manageable with regard to discovering or retrieving relevant
content. We present an interactive tool including a multitude of modalities for browsing,
searching and filtering medical content, demonstrating its usefulness on over 140
hours of pre-processed laparoscopic surgery videos.

Detection of Semantic Risk Situations in Lifelog Data for Improving Life of Frail
People

  • Thinhinane Yebda
  • Jenny Benois-Pineau
  • Marion Pech
  • Hélène Amièva
  • Cathal Gurrin

The automatic recognition of risk situations for frail people is an urgent research
topic for the interdisciplinary artificial intelligence and multimedia community.
Risky situations can be recognized from lifelog data recorded with wearable devices.
In this paper, we present a new approach for the detection of semantic risk situations
for frail people in lifelog data. Concept matching between general lifelog and risk
taxonomies was realized, and a tuned AlexNet was deployed to detect two semantic
risk situations, risk of domestic accident and risk of fraud, with promising
results.

SenseMood: Depression Detection on Social Media

  • Chenhao Lin
  • Pengwei Hu
  • Hui Su
  • Shaochun Li
  • Jing Mei
  • Jie Zhou
  • Henry Leung

More than 300 million people have been affected by depression all over the world.
Due to limitations in medical equipment and knowledge, most of them are not diagnosed
at the early stages. Recent work attempts to use social media to detect depression,
since the patterns of opinion and thought expressed in posted text and images
can reflect users' mental states to some extent. In this work, we design a system dubbed
SenseMood to demonstrate that users with depression can be efficiently detected
and analyzed using the proposed system. A deep visual-textual multimodal learning approach
is proposed to reveal the psychological state of users on social networks.
Posted images and tweet data from users with and without depression on Twitter have
been collected and used for depression detection. A CNN-based classifier and BERT are
applied to extract deep features from the pictures and text posted by users, respectively.
Visual and textual features are then combined to reflect the emotional expression
of users. Finally, our system classifies users with depression versus normal users
through a neural network, and an analysis report is generated automatically.

An Active Learning Framework for Duplicate Detection in SaaS Platforms

  • Quy H. Nguyen
  • Dac Nguyen
  • Minh-Son Dao
  • Duc-Tien Dang-Nguyen
  • Cathal Gurrin
  • Binh T. Nguyen

With the rapid growth of users' data in SaaS (Software-as-a-service) platforms using
micro-services, it becomes essential to detect duplicated entities for ensuring the
integrity and consistency of data in many companies and businesses (primarily multinational
corporations). Due to the large volume of today's databases, duplicate
detection algorithms need to be not only accurate but also practical, which means
that they can return detection results as quickly as possible for a given request.
Among existing algorithms for the duplicate detection problem, using Siamese neural
networks with the triplet loss has become one of the robust ways to measure the similarity
of two entities (texts, paragraphs, or documents) for identifying all possible duplicated
items. In this paper, we first propose a practical framework for building a duplicate
detection system in a SaaS platform. Second, we present a new active learning schema
for training and updating duplicate detection algorithms. In this schema, we not only
allow the crowd to provide more annotated data for enhancing the chosen learning model
but also use the Siamese neural networks as well as the triplet loss to construct
an efficient model for the problem. Finally, we design a user interface for our proposed
duplicate detection system, which can easily be applied in empirical applications in
different companies.
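
As a minimal illustration of the triplet loss mentioned above, applied to embeddings produced by the Siamese encoder (the margin value is an assumption):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the anchor towards the duplicate
    candidate (positive) and push it away from a non-duplicate (negative)
    by at least `margin`. Embeddings come from the Siamese encoder."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(4)
a, p, n = (rng.normal(size=256) for _ in range(3))
print(triplet_loss(a, p, n))
```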

An Interactive Multimodal Retrieval System for Memory Assistant and Life Organized
Support

  • Van-Luon Tran
  • Anh-Vu Mai-Nguyen
  • Trong-Dat Phan
  • Anh-Khoa Vo
  • Minh-Son Dao
  • Koji Zettsu

Lifelogging is the new trend of keeping a diary digitally, where the surrounding
environment, personal physiological data, and cognition are collected at the same
time from a first-person perspective. Exploring and exploiting these lifelogs (i.e., data
created by lifelogging) can provide useful insights for human beings, including healthcare,
work, entertainment, and family, to name a few. Unfortunately, a valuable tool
that works on lifelogs to discover these insights is still a tough challenge to build. To meet
this requirement, we introduce an interactive multimodal retrieval system that aims
to provide people with two functions, memory assistant and life organized support,
with a friendly and easy-to-use web UI. The output of the former function is a video
with footage showing all instances of the events people want to recall. The latter
function generates a statistical report of each event so that people can have more
information to balance their lifestyle. The system relies on two major algorithms
that try to match keywords/phrases to images and to run a cluster-based query using
a watershed-based approach.

SESSION: Special Session 1: Human-Centric Cross-Modal Retrieval

Visible-infrared Person Re-identification via Colorization-based Siamese Generative
Adversarial Network

  • Xian Zhong
  • Tianyou Lu
  • Wenxin Huang
  • Jingling Yuan
  • Wenxuan Liu
  • Chia-Wen Lin

With explosive surveillance data during day and night, visible-infrared person re-identification
(VI-ReID) is an emerging challenge due to the apparent cross-modality discrepancy
between visible and infrared images. Existing VI-ReID work mainly focuses on learning
a robust feature to represent a person in both modalities, although the modality gap
cannot be effectively eliminated. Recent research works have proposed various generative
adversarial network (GAN) models to transfer the visible modality to another unified
modality, aiming to bridge the cross-modality gap. However, they neglect the information
loss caused by transferring the domain of visible images, which is significant for
identification. To effectively address these problems, we observe that key information
such as textures and semantics in an infrared image can help to color the image itself,
and that the colored infrared image maintains rich information from the infrared image while
reducing the discrepancy with the visible image. We therefore propose a colorization-based
Siamese generative adversarial network (CoSiGAN) for VI-ReID to bridge the cross-modality
gap, by retaining the identity of the colored infrared image. Furthermore, we also
propose a feature-level fusion model to supplement the transfer loss of colorization.
The experiments conducted on two cross-modality person re-identification datasets
demonstrate the superiority of the proposed method compared with state-of-the-art methods.

iCap: Interactive Image Captioning with Predictive Text

  • Zhengxiong Jia
  • Xirong Li

In this paper we study the brand-new topic of interactive image captioning with a human
in the loop. Different from automated image captioning, where a given test image is
the sole input in the inference stage, we have access to both the test image and a
sequence of (incomplete) user-input sentences in the interactive scenario. We formulate
the problem as Visually Conditioned Sentence Completion (VCSC). For VCSC, we propose
ABD-Cap, asynchronous bidirectional decoding for image caption completion. With ABD-Cap
as the core module, we build iCap, a web-based interactive image captioning system
capable of predicting new text with respect to live input from a user. A number of
experiments covering both automated evaluations and real user studies show the viability
of our proposals.

Multi-Attention Multimodal Sentiment Analysis

  • Taeyong Kim
  • Bowon Lee

Sentiment analysis plays an important role in natural-language processing. It has
been performed on multimodal data including text, audio, and video. Previously conducted
research does not make full use of such heterogeneous data. In this study,
we propose a model of Multi-Attention Recurrent Neural Network (MA-RNN) for performing
sentiment analysis on multimodal data. The proposed network consists of two attention
layers and a Bidirectional Gated Recurrent Neural Network (BiGRU). The first attention
layer is used for data fusion and dimensionality reduction, and the second attention
layer is used to augment the BiGRU to capture key parts of the contextual
information among utterances. Experiments on multimodal sentiment analysis indicate
that our proposed model achieves the state-of-the-art performance of 84.31% accuracy
on the Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI)
dataset. Furthermore, an ablation study is conducted to evaluate the contributions
of different components of the network. We believe that the findings of this study
may also offer helpful insights into the design of models using multimodal data.
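
A hedged sketch of the first attention layer's role, fusing per-modality features into one vector (the parametrization below is an illustrative assumption, not the MA-RNN's exact formulation):

```python
import numpy as np

def attention_fuse(modalities, w):
    """Score each modality vector with a shared query vector w, softmax
    the scores, and return the weighted sum as a fused representation.
    The actual MA-RNN parametrization may differ."""
    feats = np.stack(modalities)                 # (M, D)
    scores = feats @ w                           # (M,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ feats                       # (D,) fused feature

rng = np.random.default_rng(8)
text, audio, video = (rng.normal(size=100) for _ in range(3))
print(attention_fuse([text, audio, video], rng.normal(size=100)).shape)
```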

MAENet: Boosting Feature Representation for Cross-Modal Person Re-Identification with
Pairwise Supervision

  • Yongbiao Chen
  • Sheng Zhang
  • Zhengwei Qi

Person re-identification aims at successfully retrieving the images of a specific
person in the gallery dataset given a probe image. Among all the existing research
areas related to person re-identification, visible to thermal person re-identification
(VT-REID) has gained proliferating momentum. VT-REID is deemed to be a rather challenging
task owing to the large cross-modality gap [25], cross-modality variation and intra-modality
variation. Existing techniques generally tackle this problem by embedding cross-modality
data with convolutional neural networks into a shared feature space to bridge the cross-modality
discrepancy, and subsequently devise hinge losses for similarity learning to alleviate
the variation. However, feature extraction methods based simply on convolutional neural
networks may fail to capture distinctive and modality-invariant features, resulting
in noise for further re-identification techniques. In this work, we present a novel
modality and appearance invariant embedding learning framework equipped with maximum
likelihood learning to perform cross-modal person re-identification. Extensive and
comprehensive experiments are conducted to test the effectiveness of our framework.
Results demonstrate that the proposed framework yields state-of-the-art Re-ID accuracy
on RegDB and SYSU-MM01 datasets.

SESSION: Special Session 2: Activities of Daily Living

Incorporating Semantic Knowledge for Visual Lifelog Activity Recognition

  • Min-Huan Fu
  • An-Zi Yen
  • Hen-Hsen Huang
  • Hsin-Hsi Chen

The advance in wearable technology has made lifelogging more feasible and more popular.
Visual lifelogs collected by wearable cameras capture every single detail of an individual's
life experience, offering a promising data source for deeper lifestyle analysis and
better memory recall assistance. However, building a system for organizing and accessing
visual lifelogs is a challenging task due to the semantic gap between visual data
and semantic descriptions of life events. In this paper, we introduce semantic knowledge
to reduce such a semantic gap for daily activity recognition and lifestyle understanding.
We incorporate the semantic knowledge derived from external resources to enrich the
training data for the proposed supervised learning model. Experimental results show
that incorporating external semantic knowledge is beneficial for improving the performance
of recognizing life events.

Anomaly Detection in Traffic Surveillance Videos with GAN-based Future Frame Prediction

  • Khac-Tuan Nguyen
  • Dat-Thanh Dinh
  • Minh N. Do
  • Minh-Triet Tran

It is essential to develop efficient methods to detect abnormal events, such as car-crashes
or stalled vehicles, from surveillance cameras to provide timely help. This motivates
us to propose a novel method to detect traffic accidents in traffic videos. To tackle
the problem where anomalies only occupy a small amount of data, we propose a semi-supervised
method using Generative Adversarial Network trained on regular sequences to predict
future frames. Our key idea is to model the ordinary world with a generative model,
then compare a predicted frame with the real next frame to determine if an abnormal
event occurs. We also propose a new idea of encoding motion descriptors and scaled
intensity loss function to optimize GAN for fast-moving objects. Experiments on the
Traffic Anomaly Detection dataset of AI City Challenge 2019 show that our method achieves
a top-3 result with an F1 score of 0.9412, an RMSE of 4.8088, and an S3 score of 0.9261. Our method
can be applied to different related applications of anomaly and outlier detection
in videos.
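
A minimal, hedged sketch of the scoring step common to future-frame-prediction approaches: compare the generator's predicted frame with the real next frame and flag large errors (PSNR-based scoring is a common choice in this line of work; the paper's exact score may differ).

```python
import numpy as np

def anomaly_score(predicted_frame, real_frame):
    """Score a frame by how badly the generator predicted it: low PSNR
    (poor prediction) is treated as higher anomaly likelihood."""
    mse = np.mean((predicted_frame.astype(float) - real_frame.astype(float)) ** 2)
    psnr = 10.0 * np.log10(255.0 ** 2 / (mse + 1e-8))
    return -psnr   # lower PSNR (worse prediction) => higher anomaly score

rng = np.random.default_rng(5)
pred = rng.integers(0, 256, size=(128, 128, 3))
real = rng.integers(0, 256, size=(128, 128, 3))
print(anomaly_score(pred, real))
```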

Multi-level Recognition on Falls from Activities of Daily Living

  • Jiawei Li
  • Shu-Tao Xia
  • Qianggang Ding

Falling accidents are one of the largest threats to human health, leading to
broken bones, head injuries, or even death. Therefore, automatic human fall recognition
is vital for the Activities of Daily Living (ADL). In this paper, we try to define
multi-level computer vision tasks for the visually observed fall recognition problem
and study the methods and pipeline. We make frame-level labels for the fall action
on several ADL datasets to test the methods and support the analysis. While current
deep-learning fall recognition methods usually work on the sequence-level input, we
propose a novel Dynamic Pose Motion (DPM) representation to go a step further, which
can be captured by a flexible motion extraction module. Besides, a sequence-level
fall recognition pipeline is proposed, which has an explicit two-branch structure
for appearance and motion features, and uses a canonical LSTM for temporal modeling
and fall prediction. Finally, while current research only performs binary classification
between falls and ADL, we further study how to detect the start time and the end time
of a fall action in a video-level task. We conduct analysis experiments and ablation
studies on both the simulated and real-life fall datasets. The relabelled datasets
and extensive experiments form a new baseline on the recognition of falls and ADL.

Intelligent Task Recognition: Towards Enabling Productivity Assistance in Daily Life

  • Jonathan Liono
  • Mohammad Saiedur Rahaman
  • Flora D. Salim
  • Yongli Ren
  • Damiano Spina
  • Falk Scholer
  • Johanne R. Trippas
  • Mark Sanderson
  • Paul N. Bennett
  • Ryen W. White

We introduce the novel research problem of task recognition in daily life. We recognize
tasks such as project management, planning, meal-breaks, communication, documentation,
and family care. We capture Cyber, Physical, and Social (CPS) activities of 17 participants
over four weeks using device-based sensing, app activity logging, and an experience
sampling methodology. Our cohort includes students, casual workers, and professionals,
forming the first real-world context-rich task behaviour dataset. We model CPS activities
across different task categories; the results highlight the importance of considering
the CPS feature sets in modelling, especially for work-related tasks.

Flood Level Prediction via Human Pose Estimation from Social Media Images

  • Khanh-An C. Quan
  • Vinh-Tiep Nguyen
  • Tan-Cong Nguyen
  • Tam V. Nguyen
  • Minh-Triet Tran

Floods are the most common natural disasters and among the most dangerous in the world.
It is important to get up-to-date information about flooding and the flood level for
flood preparation and prevention. In this paper, we propose an efficient method to
determine the flood level from daily activity photos on social media. Our method is
based on the idea of matching the water level with human pose to determine the level
of severity of flooding. Extensive experiments conducted on the dataset of Multimodal
Flood Level Estimation show the superiority of our proposed method. We achieved
first rank in MediaEval 2019, which demonstrates the potential applications of our
method for analyzing flood information.

Continuous Health Interface Event Retrieval

  • Vaibhav Pandey
  • Nitish Nag
  • Ramesh Jain

Knowing the state of our health at every moment in time is critical for advances in
health science. Using data obtained outside an episodic clinical setting is the first
step towards building a continuous health estimation system. In this paper, we explore
a system that allows users to combine events and data streams from different sources
and retrieve complex biological events, such as cardiovascular volume overload, using
measured lifestyle events. These complex events, which have been explored in biomedical
literature and which we call interface events, have a direct causal impact on the
relevant biological systems; they are the interface through which the lifestyle events
influence our health. We retrieve the interface events from existing events and data
streams by encoding domain knowledge using the event operator language. The interface
events can then be utilized to provide a continuous estimate of the biological variables
relevant to the user's health state. The event-based framework also makes it easier
to estimate which event is causally responsible for a particular change in the individual's
health state.

SESSION: Special Session 3: Multimedia Information Retrieval for Urban Data

Detecting, Classifying, and Mapping Retail Storefronts Using Street-level Imagery

  • Shahin Sharifi Noorian
  • Sihang Qiu
  • Achilleas Psyllidis
  • Alessandro Bozzon
  • Geert-Jan Houben

Up-to-date listings of retail stores and related building functions are challenging
and costly to maintain. We introduce a novel method for automatically detecting, geo-locating,
and classifying retail stores and related commercial functions, on the basis of storefronts
extracted from street-level imagery. Specifically, we present a deep learning approach
that takes storefronts from street-level imagery as input, and directly provides the
geo-location and type of commercial function as output. Our method showed a recall
of 89.05% and a precision of 88.22% on a real-world dataset of street-level images,
which experimentally demonstrated that our approach achieves human-level accuracy
while having a remarkable run-time efficiency compared to methods such as Faster Region-Convolutional
Neural Networks (Faster R-CNN) and Single Shot Detector (SSD).

Urban Movie Map for Walkers: Route View Synthesis using 360° Videos

  • Naoki Sugimoto
  • Toru Okubo
  • Kiyoharu Aizawa

We propose a movie map for walkers based on synthesized street walking views along
routes in a particular area. From the perspectives of walkers, we captured a number
of omnidirectional videos along streets in the target area (1 km² around Kyoto Station).
We captured a separate video for each street. We then performed simultaneous localization
and mapping to obtain camera poses from key video frames in all of the videos and
adjusted the coordinates based on a map of the area using reference points. To join
one video to another smoothly at intersections, we identified frames of video intersection
based on camera locations and visual feature matching. Finally, we generated moving
route views by connecting the omnidirectional videos based on the alignment of the
cameras. To improve smoothness at intersections, we generated rotational views by
mixing video intersection frames from two videos. The results demonstrate that our
method can precisely identify intersection frames and generate smooth connections
between videos at intersections.

Urban Object Detection Kit: A System for Collection and Analysis of Street-Level Imagery

  • Maarten Sukel
  • Stevan Rudinac
  • Marcel Worring

In this paper, we propose Urban Object Detection Kit, a system for the real-time collection
and analysis of street-level imagery. The system is affordable and portable and allows
local government agencies to receive actionable intelligence about the objects on
the streets. This system can be attached to service vehicles, such as garbage trucks,
parking scanners and maintenance cars, thus allowing for large-scale deployment. This
will, in turn, result in street-level imagery captured at a high collection frequency,
while covering a large geographical region. Unlike more traditional panoramic street-level
imagery, the data collected by this system has a higher frequency, making it suitable
for the highly dynamic nature of city streets. For example, the proposed system allows
for real-time detection of urban objects and potential issues that require the attention
of city services. It paves the way for easy deployment and testing of multimedia information
retrieval algorithms in a dynamic real-world setting. We showcase the usefulness of
object detection for identifying issues in public spaces that occur within a limited
time span. Finally, we make the kit, as well as the data collected using it, openly
available for the research community.

SESSION: Special Session 4: Knowledge-Driven Analysis and Retrieval on Multimedia

YOLO-mini-tiger: Amur Tiger Detection

  • Runchen Wei
  • Ning He
  • Ke Lu

In this paper, we present our solution for tiger detection in the 2019 Computer Vision
for Wildlife Conservation Challenge (CVWC2019). We introduce an efficient deep tiger
detector, which consists of the convnet channel adaptation method and an improved
tiger detection method based on You Only Look Once version 3 (YOLOv3). Considering
the limited memory and computing power of tiny embedded devices, we have used EfficientNet-B0
and Darknet-53 as backbone networks for detection and adapted them to balance their
depth and width, inspired by channel pruning and knowledge distillation
methods. Our results show that after an architecture adjustment of Darknet-53, the
floating-point computation decreases by 93%, its model size decreases by 97%, and
its accuracy only decreases by 1%; after an architecture adjustment of EfficientNet-B0,
the floating-point computation decreases by 66%, its model size decreases by 70%, and
its accuracy decreases by only 1%. We also compare the GIoU loss and the MSE loss in the training
stage. The GIoU loss has the advantage that it increases the average AP for IoU from
0.5 to 0.95 without affecting training speed and inference speed, so it is experimentally
reasonable for tiger detection in the wild. This proposed method outperforms previous
Amur tiger detection methods presented at CVWC2019.
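
For reference, the GIoU compared against MSE above is defined as IoU minus the fraction of the smallest enclosing box not covered by the union; a minimal sketch for axis-aligned boxes follows (used as a regression loss via 1 - GIoU).

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes given as (x1, y1, x2, y2):
    GIoU = IoU - |C minus (A union B)| / |C|, where C is the smallest
    box enclosing both A and B."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose

print(giou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~ -0.079
```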

Deep Adversarial Discrete Hashing for Cross-Modal Retrieval

  • Cong Bai
  • Chao Zeng
  • Qing Ma
  • Jinglin Zhang
  • Shengyong Chen

Cross-modal hashing has received widespread attention for the cross-modal retrieval task
due to its superior retrieval efficiency and low storage cost. However, most existing
cross-modal hashing methods learn binary codes directly from multimedia data, which
cannot fully utilize the semantic knowledge of the data. Furthermore, they cannot
learn the ranking-based similarity relevance of data points with multiple labels. Moreover,
they usually use a relaxed constraint on the hash codes, which causes non-negligible quantization
loss in the optimization. In this paper, a hashing method called Deep Adversarial
Discrete Hashing (DADH) is proposed to address these issues for cross-modal retrieval.
The proposed method uses adversarial training to learn features across modalities
and ensure the distribution consistency of feature representations across modalities.
We also introduce a weighted cosine triplet constraint which can make full use of
semantic knowledge from the multi-label to ensure the precise ranking relevance of
item pairs. In addition, we use a discrete hashing strategy to learn the discrete
binary codes without relaxation, by which the semantic knowledge from labels in the
hash codes can be preserved while the quantization loss can be minimized. Ablation
experiments and comparison experiments on two cross-modal databases show that the
proposed DADH improves the performance and outperforms several state-of-the-art hashing
methods for cross-modal retrieval.

A Lightweight Gated Global Module for Global Context Modeling in Neural Networks

  • Li Hao
  • Liping Hou
  • Yuantao Song
  • Ke Lu
  • Jian Xue

Global context modeling has been used to achieve better performance in various computer-vision-related
tasks, such as classification, detection, segmentation and multimedia retrieval applications.
However, most of the existing global mechanisms display problems regarding convergence
during training. In this paper, we propose a novel gated global module (GGM) that
is lightweight and yet effective in terms of achieving better integration of global
information in relation to feature representation. Regarding the original structure
of the network as a local block, our module infers global information in parallel
with local information, and then a gate function is applied to generate global guidance
which is applied to the output of the local module to capture representative information.
The proposed GGM can be easily integrated with common CNN architectures and is
training-friendly. We used a classification task as an example to verify the effectiveness
of the proposed GGM, and extensive experiments on ImageNet and CIFAR demonstrated
that our method can be widely applied and is conducive to integrating global information
into common networks.
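
As a rough, hedged sketch of the gating idea described above (global context pooled from the local block output, passed through a learned gate, and used to rescale the local features; the real GGM design may differ in detail):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_global_block(local_out, w_gate, b_gate):
    """Global-average-pool the local output, project it through a small
    learned layer with a sigmoid gate, then rescale the local features
    channel-wise with the resulting global guidance."""
    c = local_out.shape[0]
    global_ctx = local_out.reshape(c, -1).mean(axis=1)      # (C,) global pooling
    gate = sigmoid(w_gate @ global_ctx + b_gate)             # (C,) channel gate
    return local_out * gate[:, None, None]                   # broadcast over H, W

rng = np.random.default_rng(6)
feat = rng.normal(size=(64, 14, 14))          # local block output (C x H x W)
w, b = rng.normal(size=(64, 64)) * 0.1, np.zeros(64)
print(gated_global_block(feat, w, b).shape)   # (64, 14, 14)
```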

Fake News Detection via Knowledge-driven Multimodal Graph Convolutional Networks

  • Youze Wang
  • Shengsheng Qian
  • Jun Hu
  • Quan Fang
  • Changsheng Xu

Nowadays, with the rapid development of social media, there is a great deal of news
produced every day. How to detect fake news automatically from a large number of multimedia
posts has become very important for people, the government and news recommendation
sites. However, most of the existing approaches either extract features from the text
of the post which is a single modality or simply concatenate the visual features and
textual features of a post to get a multimodal feature and detect fake news. Most
of them ignore the background knowledge hidden in the text content of the post which
facilitates fake news detection. To address these issues, we propose a novel Knowledge-driven
Multimodal Graph Convolutional Network (KMGCN) to model the semantic representations
by jointly modeling the textual information, knowledge concepts and visual information
into a unified framework for fake news detection. Instead of viewing text content
as word sequences, as is normally done, we convert it into a graph, which can model non-consecutive
phrases to better capture the composition of semantics. Besides, we not only convert
visual information as nodes of graphs but also retrieve external knowledge from real-world
knowledge graph as nodes of graphs to provide complementary semantics information
to improve fake news detection. We utilize a well-designed graph convolutional network
to extract the semantic representation of these graphs. Extensive experiments on two
public real-world datasets demonstrate the validity of our approach.

Optimizing Queries over Video via Lightweight Keypoint-based Object Detection

  • Jiansheng Dong
  • Jingling Yuan
  • Lin Li
  • Xian Zhong
  • Weiru Liu

Recent advancements in convolutional-neural-network-based object detection have enabled
analyzing the mounting volume of video data with high accuracy. However, inference speed is
a major drawback of these video analysis systems because of their heavy object detectors.
To address the computational and practicability challenges of video analysis, we propose
FastQ, a system for efficient querying over video at scale. Given a target video,
FastQ can automatically label the category and number of objects for each frame. We
introduce a novel lightweight object detector named FDet to improve the efficiency
of the query system. First, a difference detector filters out frames whose difference
is less than a threshold. Second, FDet is employed to efficiently label the remaining
frames. To reduce inference time, FDet detects a center keypoint and a pair of corners
from the feature map generated by a lightweight backbone to predict the bounding boxes.
FDet completely avoids the complicated computation related to anchor boxes. Compared
with state-of-the-art real-time detectors, FDet achieves superior performance with
29.1% AP on COCO benchmark at 25.3ms. Experiments show that FastQ achieves 150 times
to 300 times speed-ups while maintaining more than 90% accuracy in video queries.
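
A minimal sketch of the difference-detector stage (the mean-absolute-difference measure and the threshold value are illustrative assumptions; FDet itself is not sketched here):

```python
import numpy as np

def changed_frames(frames, threshold=10.0):
    """Cheap difference detector: keep only frames whose mean absolute
    pixel difference from the previously kept frame exceeds a threshold,
    so the heavier detector runs on far fewer frames."""
    kept, reference = [], None
    for idx, frame in enumerate(frames):
        if reference is None or np.mean(np.abs(frame - reference)) > threshold:
            kept.append(idx)
            reference = frame
    return kept

rng = np.random.default_rng(7)
video = rng.integers(0, 256, size=(30, 64, 64)).astype(float)
print(changed_frames(video))   # indices of frames passed on to the detector
```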

Multi-Graph Group Collaborative Filtering

  • Bo Jiang

The task of recommending an item or an event to a user group attracts wide attention.
Most existing works obtain group preference by aggregating personalized preferences
in the same group. However, the groups, users, and items are connected in a more complex
structure, e.g., the users in the same group may have different preferences. Thus, it
is important to introduce correlations among groups, users, and items into embedding
learning. To address this problem, we propose Multi-Graph Group Collaborative Filtering
(MGGCF), which refines the group, user and item representations according to three
bipartite graphs. Moreover, since MGGCF refines the group, user and item embeddings
simultaneously, it would benefit both the group recommendation tasks and the individual
recommendation tasks. Extensive experiments are conducted on one real-world dataset
and two synthetic datasets. Empirical results demonstrate that MGGCF significantly
improves not only the group recommendation but also the item recommendation. Further
analysis verifies the importance of embedding propagation for learning better user,
group, and item representations, which reveals the rationality and effectiveness of MGGCF.

Rank-embedded Hashing for Large-scale Image Retrieval

  • Haiyan Fu
  • Ying Li
  • Hengheng Zhang
  • Jinfeng Liu
  • Tao Yao

With the growth of images on the Internet, plenty of hashing methods have been developed
to handle the large-scale image retrieval task. Hashing methods map data from high
dimension to compact codes, so that they can effectively cope with complicated image
features. However, the quantization process of hashing results in inescapable information
loss. As a consequence, it is a challenge to measure the similarity between images
with generated binary codes. The latest works usually focus on learning deep features
and hashing functions simultaneously to preserve the similarity between images, while
the similarity metric is fixed. In this paper, we propose a Rank-embedded Hashing
(ReHash) algorithm where the ranking list is automatically optimized together with
the feedback of the supervised hashing. Specifically, ReHash jointly trains the metric
learning and the hashing codes in an end-to-end model. In this way, the similarity
between images is enhanced by the ranking process. Meanwhile, the ranking results
serve as additional supervision for the hashing function learning as well. Extensive
experiments show that our ReHash outperforms the state-of-the-art hashing methods
for large-scale image retrieval.

A Coordinated Representation Learning Enhanced Multimodal Machine Translation Approach
with Multi-Attention

  • Yifeng Han
  • Lin Li
  • Jianwei Zhang

In recent years, the application of machine translation has become more and more widespread.
Currently, neural multimodal translation models, which combine images with deep learning
networks such as the Transformer and RNNs, have made attractive progress. When
considering images in translation models, they directly apply gate structures or image
attention to introduce image features and enhance the translation effect. We argue that
this may mismatch the text and image features since they lie in different semantic spaces.
In this paper, we propose a coordinated representation learning enhanced multimodal
machine translation approach with multimodal attention. Our approach accepts the text
data and its relevant image data as the input. The image features are fed into the
decoder side of the basic Transformer model. Moreover, the Coordinated Representation
Learning is utilized to map the different text and image modal features into their
semantic representations. The mapped representations are linearly related in a shared
semantic space. Finally, the sum of the image and text representations, called Coordinated
Visual-Semantic Representation (CVSR), will be sent to a Multimodal Attention Layer
(MAL) in our Transformer based translation approach. Experimental results show that
our approach achieves state-of-the-art performance on the public Multi30k dataset.

SESSION: Workshop Summaries

CEA'20: The 12th Workshop on Multimedia for Cooking and Eating Activities

  • Ichiro Ide
  • Yoko Yamakata
  • Atsushi Hashimoto

This overview introduces the aim of the 12th Workshop on Multimedia for Cooking and
Eating Activities (CEA'20) and the list of papers presented in the workshop.

ICDAR'20: Intelligent Cross-Data Analysis and Retrieval

  • Minh-Son Dao
  • Morten Fjeld
  • Filip Biljecki
  • Uraz Yavanoglu
  • Mianxiong Dong

The First International Workshop on "Intelligent Cross-Data Analytics and Retrieval"
(ICDAR'20) welcomes theoretical and practical work on intelligent cross-data
analytics and retrieval that brings a smart, sustainable society to human beings. We
have witnessed the era of big data, where almost any event that happens is recorded
and stored either in a distributed fashion or centrally. The utmost requirement here is that
data coming from different sources and various domains must be harmoniously analyzed
to obtain their insights immediately and make them thoroughly retrievable.
These emerging requirements lead to the need for interdisciplinary and multidisciplinary
contributions that address different aspects of the problem, such as data collection,
storage, protection, processing, and transmission, as well as knowledge discovery,
retrieval, and security and privacy. Hence, the goal of the workshop is to attract
researchers and experts in the areas of multimedia information retrieval, machine
learning, AI, data science, event-based processing and analysis, multimodal multimedia
content analysis, lifelog data analysis, urban computing, environmental science, atmospheric
science, and security and privacy to tackle the aforementioned issues.

MMArt-ACM'20: International Joint Workshop on Multimedia Artworks Analysis and Attractiveness
Computing in Multimedia 2020

  • Wei-Ta Chu
  • Ichiro Ide
  • Naoko Nitta
  • Norimichi Tsumura
  • Toshihiko Yamasaki

The International Joint Workshop on Multimedia Artworks Analysis and Attractiveness
Computing in Multimedia (MMArt-ACM) solicits contributions on methodology advancement
and novel applications of multimedia artworks and attractiveness computing that emerge
in the era of big data and deep learning. Despite the impact of the Covid-19 pandemic,
this workshop attracted submissions on diverse topics in these two fields, and the
workshop program finally consists of five presented papers. The topics cover image
retrieval, image transformation and generation, recommendation system, and image/video
summarization. The actual MMArt-ACM'20 Proceedings are available in the ACM DL at:
https://dl.acm.org/citation.cfm?id=3379173

Introduction to the Third Annual Lifelog Search Challenge (LSC'20)

  • Cathal Gurrin
  • Tu-Khiem Le
  • Van-Tu Ninh
  • Duc-Tien Dang-Nguyen
  • Björn Þór Jónsson
  • Jakub Lokoč
  • Wolfgang Hürst
  • Minh-Triet Tran
  • Klaus Schöffmann

The Lifelog Search Challenge (LSC) is an annual comparative benchmarking activity
for comparing approaches to interactive retrieval from multi-modal lifelogs. LSC'20,
the third such challenge, attracts fourteen participants with their interactive lifelog
retrieval systems. These systems are comparatively evaluated in front of a live-audience
at the LSC workshop at ACM ICMR'20 in Dublin, Ireland. This overview motivates the
challenge, presents the dataset and system configuration used in the challenge, and
briefly presents the participating teams.