MULEA '19: 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications

Full Citation in the ACM Digital Library

SESSION: Keynote Session 1

Connecting Language and Vision: From Captioning towards Embodied Learning

Subhashini Venugopalan

For most humans, understanding multimedia content is easy, and in many cases images and videos are a preferred means of augmenting and enhancing human interaction and communication. Given a video, humans can discern a great deal from this rich information source and can interpret and describe the content to varying degrees of detail. For computers, however, interpreting content from image and video pixels and associating it with language is very challenging. Recent research has made tremendous progress on this problem of visual language grounding, i.e., interpreting visual content from images and videos and associating it with language. This progress has been made possible not only by advances in object recognition, activity recognition, and language generation, but also by the development of versatile and elegant ways of combining them. However, to realize the long-term goal of enabling fluent interaction between humans and computers or robots, it is also essential to ground language in action in addition to vision. In this respect, the embodied, task-oriented aspect of language grounding has emerged as a research direction that is garnering much attention. Current research focuses on developing new datasets and techniques for linking language to action in the real world, such as agents that follow instructions for navigation or manipulation tasks. Following the exciting progress in this space, we expect research in connecting language and vision to continue to accelerate in the coming years towards the development of embodied agents that learn to navigate the real world through human interaction.

SESSION: Oral Session 1: Vision and Language

Geometry-aware Relational Exemplar Attention for Dense Captioning

Tzu-Jui Julius Wang
Hamed R. Tavakoli
Mats Sjöberg
Jorma Laaksonen

Dense captioning (DC), which provides a comprehensive contextual understanding of images by describing all salient visual groundings in an image, facilitates multimodal understanding and learning. As an extension of image captioning, DC is developed to discover richer sets of visual content and to generate captions of wider diversity and greater detail. State-of-the-art DC models consist of three stages: (1) region proposals, (2) region classification, and (3) caption generation for each proposal. They are typically built upon the following ideas: (a) guiding the caption generation with image-level features as context cues along with regional features and (b) refining the locations of region proposals with caption information. In this work, we propose (a) a joint visual-textual criterion exploited by the region classifier that further improves both region detection and caption accuracy, and (b) a Geometry-aware Relational Exemplar attention (GREatt) mechanism to relate region proposals. The former helps the model learn a region classifier by effectively exploiting both visual groundings and caption descriptions. Rather than treating each region proposal in isolation, the latter relates regions through complementary relations, i.e., contextually dependent, visually supported, and geometric relations, to enrich the context information in regional representations. We conduct an extensive set of experiments and demonstrate that our proposed model improves on the state of the art by at least +5.3% in terms of mean average precision on the Visual Genome dataset.

Deep Reinforcement Learning Visual-Text Attention for Multimodal Video Classification

Mengyi Liu
Zhu Liu

Multimedia content, including text, images, and videos, is now produced and shared ubiquitously in daily life, which has encouraged researchers to develop algorithms for multimedia search and analysis in various applications. As web data becomes increasingly multimodal, the task of multimodal classification becomes ever more popular and pertinent. In this paper, we focus mainly on videos because of their intrinsically multimodal nature, and resort to attention learning among different modalities for classification. Specifically, we formulate multimodal attention learning as a sequential decision-making process, and propose an end-to-end, deep reinforcement learning based framework to determine the selection of modality at each time step for the final feature aggregation model. To train our policy networks, we design a supervised reward that considers the multi-label classification loss, and two unsupervised rewards that simultaneously consider inter-modality correlation for consistency and intra-modality reconstruction for representativeness. Extensive experiments have been conducted on two large-scale multimodal video datasets to evaluate the whole framework and several key components, including the parameters of the policy network, the effects of different rewards, and the rationality of the learned visual-text attention. Promising results demonstrate that our approach outperforms other state-of-the-art attention and multimodal fusion methods on the video classification task.

SESSION: Keynote Session 2

On the Multisensory Nature of Objects and Language: A Robotics Perspective

Jivko Sinapov

Infants use exploratory behaviors to learn about the objects around them. Psychologists have theorized that behaviors such as grasping, touching, pressing, and lifting, coupled with the visual, tactile, haptic, and auditory sensory modalities, enable infants to form grounded object representations. For example, scratching an object can provide information about its roughness, while lifting it can provide information about its weight. In a sense, the exploratory behavior acts as a "question" to the object, which is subsequently "answered" by the sensory stimuli produced during the execution of the behavior. In contrast, most object representations used by robots today rely solely on computer vision or laser scan data, gathered through passive observation. Such disembodied approaches to robotic perception may be useful for recognizing an object using a 3D model database, but they will nevertheless fail to infer object properties that cannot be detected using vision alone. To bridge this gap, our research has pursued a developmental framework for object perception and exploration in which the robot's representation of objects is grounded in its own sensorimotor experience with them [sinapov2014grounding]. In this framework, an object is represented by sensorimotor contingencies that span a diverse set of exploratory behaviors and sensory modalities. In this talk, I will highlight results from several large-scale experimental studies which show that the behavior-grounded object representation enables a robot to solve a wide variety of perceptual and cognitive tasks relevant to object learning [sinapov2014learning, sinapov2011interactive]. I will discuss recent work on how robots can ground language in multisensory experience with objects [thomason2016learning] and will conclude with a discussion of open problems in multisensory symbol grounding, which, if solved, could result in the large-scale deployment of robotic systems in real-world domains.

SESSION: Keynote Session 3

Learning to Navigate

Piotr Mirowski

Navigation is an important cognitive task that enables humans and animals to traverse, with or without maps, long distances in the complex world. Such long-range navigation can simultaneously support self-localisation ("I am here") and a representation of the goal ("I am going there"). For this reason, studying navigation is fundamental to the study and development of artificial intelligence, and trying to replicate navigation in artificial agents can also help neuroscientists understand its biological underpinnings. This talk will cover our own journey to understand navigation by building deep reinforcement learning agents, starting from learning to control a simple agent that can explore and memorise large 3D mazes to designing agents with a read-write memory that can generalise to unseen mazes from one traversal. I will show how these artificial agents relate to navigation in the real world, both through the study of the emergence of grid cell representations in neural networks and by demonstrating that these agents can navigate in Street View-based real-world photographic environments. Finally, I will present two approaches from our ongoing work on leveraging multimodal information to generalise navigation policies to unseen environments in Street View: the first follows language instructions, and the second transfers navigation policies by training on aerial views.

SESSION: Oral Session 2: Language and Robotics

Visually Grounded Language Learning for Robot Navigation

Emre Ünal
Ozan Arkan Can
Yücel Yemez

We present an end-to-end deep learning model for robot navigation from raw visual pixel input and natural text instructions. The proposed model is an LSTM-based sequence-to-sequence neural network architecture with attention, which is trained on instruction-perception data samples collected in a synthetic environment. We conduct experiments on the SAIL dataset, which we reconstruct in 3D so as to generate the 2D images associated with the data. Our experiments show that the performance of our model is on par with the state of the art, despite the fact that it learns navigational language with end-to-end training from raw visual data.

Clustering Optimization for Abnormality Detection in Semi-Autonomous Systems

Hafsa Iqbal
Damian Campo
Mohamad Baydoun
Lucio Marcenaro
David Martin Gomez
Carlo Regazzoni

The use of machine learning techniques is fundamental for developing autonomous systems that can assist humans in everyday tasks. This paper focuses on selecting an appropriate network size for detecting abnormalities in multisensory data coming from a semi-autonomous vehicle. We use an extension of Growing Neural Gas with the utility measurement (GNG-U) to segment multisensory data into an optimal set of clusters that facilitate a semantic interpretation of the data and define local linear models used for prediction purposes. A functional that favors precise linear dynamical models in large state space regions is considered for optimization purposes. The proposed method is tested with synchronized multi-sensor dynamic data related to different maneuvering tasks performed by a semi-autonomous vehicle that interacts with pedestrians in a closed environment. Comparisons with previous work on abnormality detection are provided.

SESSION: Poster Session

An Improvement on Audio-to-MIDI Alignment Using Triplet Pair

Yifan Wang
Shuchang Liu
Li Guo

In this paper, we employ a neural network based cross-modality model for the audio-to-MIDI alignment task. A novel loss function based on hinge loss is proposed to optimize the model so that it learns a Euclidean embedding space, where the distance between embedding vectors can be used directly as a measure of similarity in alignment. In the previous alignment system, which is also based on a cross-modality model, the loss function contains positive and negative pairs, which represent aligned and misaligned pairs. In this paper, we introduce an extra pair, named overlapping, to capture musical onset information. We evaluate our system on the MAPS dataset and compare it to previous methods. The results reveal that the alignment accuracy of the proposed system beats the transcription based method by a significant margin, e.g., 81.61% to 86.41%, when the alignment error threshold is set to 10 ms. The proposed loss also improves the statistics of absolute onset errors in comparison to the loss function implemented in another audio-to-MIDI alignment system. We also conduct experiments on the dimension of the embedding vectors, and the results show that the proposed system can still maintain its alignment performance with a lower dimension.
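
The abstract does not give the formula of the proposed loss. As a rough illustration of how a hinge loss can shape a Euclidean embedding space so that distance serves as a similarity measure, the sketch below uses the standard triplet hinge loss, which is not the paper's loss and does not model the overlapping pair. The function names, the margin value, and the toy vectors are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet hinge loss (illustrative, not the paper's loss).

    The loss is zero once the misaligned (negative) embedding is at least
    `margin` farther from the anchor than the aligned (positive) one.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


# Hypothetical 2-D embeddings: one audio frame and two MIDI frames.
audio = [0.0, 0.0]
aligned_midi = [0.0, 0.0]
misaligned_midi = [3.0, 4.0]

well_separated = triplet_hinge_loss(audio, aligned_midi, misaligned_midi)
swapped = triplet_hinge_loss(audio, misaligned_midi, aligned_midi)
```

With these toy vectors the well-separated triplet incurs no loss, while swapping the aligned and misaligned frames incurs a positive loss, which is the behavior that pushes aligned pairs together and misaligned pairs apart.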

Video Object Linguistic Grounding

Alba Herrera-Palacio
Carles Ventura
Xavier Giro-i-Nieto

The goal of this work is to segment, in a video sequence, the objects that are mentioned in a linguistic description of the scene. We have adapted an existing deep neural network that achieves state-of-the-art performance in semi-supervised video object segmentation by adding a linguistic branch that generates an attention map over the video frames, making the segmentation of the objects temporally consistent along the sequence.

MultiLock: Mobile Active Authentication based on Multiple Biometric and Behavioral Patterns

Alejandro Acien
Aythami Morales
Ruben Vera-Rodriguez
Julian Fierrez
Ruben Tolosana

In this paper we evaluate how discriminative the behavior-based signals obtained from smartphone sensors are. The main aim is to evaluate these signals for person recognition. Recognition based on these signals increases the security of devices, but also raises privacy concerns. We consider seven different data channels and their combinations. Touch dynamics (touch gestures and keystroking), accelerometer, gyroscope, WiFi, GPS location, and app usage are all collected during human-mobile interaction to authenticate the users. We evaluate two approaches: one-time authentication and active authentication. In one-time authentication, we employ the information of all channels available during one session. For active authentication, we take advantage of mobile user behavior across multiple sessions by updating a confidence value of the authentication score. Our experiments are conducted on the semi-uncontrolled UMDAA-02 database. This database comprises smartphone sensor signals acquired during natural human-mobile interaction. Our results show that different traits can be complementary and that multimodal systems clearly increase performance, with accuracies ranging from 82.2% to 97.1% depending on the authentication scenario. These results confirm the discriminative power of these signals.
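
The abstract does not specify how the confidence value is updated across sessions. One simple way to picture active authentication is an exponentially weighted running confidence that is compared with a threshold after every session. The sketch below assumes that scheme; the update rule, the weight, the threshold, and the scores are hypothetical and are not taken from the paper.

```python
def update_confidence(confidence, session_score, weight=0.5):
    """Blend the previous confidence with the newest session score.

    Assumed update rule (exponential moving average), for illustration only.
    """
    return (1.0 - weight) * confidence + weight * session_score


def run_active_authentication(session_scores, initial=0.5, weight=0.5,
                              threshold=0.4):
    """Return (confidence, still_authenticated) after each session."""
    confidence = initial
    history = []
    for score in session_scores:
        confidence = update_confidence(confidence, score, weight)
        history.append((confidence, confidence >= threshold))
    return history


# Hypothetical per-session scores in [0, 1]: two sessions that match the
# owner, followed by two sessions that do not.
history = run_active_authentication([1.0, 1.0, 0.0, 0.0])
```

Under these assumed settings the confidence rises while the scores match the owner and decays once they stop matching, so a single poor session does not immediately lock the device but a run of them does.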
