AVSU'18: Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia


SESSION: Keynote Talk 1

Session details: Keynote Talk 1

  •      Adrian Hilton

Multimodal Fusion Strategies: Human vs. Machine

  •      Hanseok Ko

A two-hour movie, or a short clip taken from one, is intended to capture and present a meaningful (or significant) story in video that a human audience can recognize and understand. What if we substitute the human audience with an intelligent machine or robot capable of capturing and processing the semantic information carried by the audio and video cues contained in the video? Using both auditory and visual means, the human brain processes the audio (sound, speech) and video (background scene, moving objects, written characters) modalities to extract spatial and temporal semantic information that is contextually complementary and robust. Smart machines equipped with audio-visual multisensors (e.g., CCTV systems equipped with cameras and microphones) should be capable of achieving the same task. An appropriate fusion strategy that combines the audio and visual information would be a key component in developing such artificial general intelligence (AGI) systems. This talk reviews the challenges of current video analytics schemes and explores various sensor fusion techniques [1, 2, 3, 4] for combining audio-visual information cues in the video content analytics task.
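
The talk surveys fusion strategies rather than a single algorithm. As a minimal illustration only (not one of the techniques in [1, 2, 3, 4]), the sketch below shows a common baseline, decision-level (late) fusion, in which per-modality classifier scores are combined by a weighted sum; the function name, class layout, and weight are assumptions made for this example.

```python
import numpy as np

def late_fusion(audio_scores: np.ndarray,
                video_scores: np.ndarray,
                audio_weight: float = 0.5) -> np.ndarray:
    """Decision-level (late) fusion of per-class scores from two modalities.

    audio_scores, video_scores: arrays of shape (num_classes,) holding
    posterior probabilities from independent audio and video classifiers.
    The fixed weighting is a simplifying assumption; feature-level fusion,
    attention, or probabilistic combination are alternatives such a system
    could use instead.
    """
    w = np.clip(audio_weight, 0.0, 1.0)
    fused = w * audio_scores + (1.0 - w) * video_scores
    return fused / fused.sum()  # renormalize to a probability distribution

# Example: audio strongly suggests "speech", video is less certain.
audio = np.array([0.7, 0.2, 0.1])   # e.g. [speech, music, noise]
video = np.array([0.3, 0.3, 0.4])
print(late_fusion(audio, video, audio_weight=0.6))
```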

SESSION: Oral Session 1

Session details: Oral Session 1

  •      Bumsub Ham

An Audio-Visual Method for Room Boundary Estimation and Material Recognition

  •      Luca Remaggi
  • Hansung Kim
  • Philip J. B. Jackson
  • Adrian Hilton

In applications such as virtual and augmented reality, a plausible and coherent audio-visual reproduction can be achieved by deeply understanding the reference scene acoustics. This requires knowledge of the scene geometry and related materials. In this paper, we present an audio-visual approach for acoustic scene understanding. We propose a novel material recognition algorithm that exploits information carried by acoustic signals, selecting the acoustic absorption coefficients as features. The training dataset was constructed by combining information available in the literature with additional labeled data that we recorded in a small room having a short reverberation time (RT60). Classic machine learning methods are used to validate the model, employing data recorded in five rooms with different sizes and RT60s. The estimated materials are utilized to label room boundaries reconstructed by a vision-based method. Results show 89% and 80% agreement between estimated and reference room volumes and materials, respectively.
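
The paper uses frequency-dependent absorption coefficients as features and validates the model with classic machine learning methods. The sketch below shows what such a classifier could look like with scikit-learn; the octave-band feature layout, the example coefficient values, and the choice of an SVM are assumptions for illustration, not the authors' exact configuration or data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical training data: each row holds absorption coefficients in
# octave bands (e.g. 125 Hz .. 4 kHz), each label a surface material.
X_train = np.array([
    [0.14, 0.10, 0.06, 0.08, 0.10, 0.10],   # e.g. plasterboard
    [0.01, 0.01, 0.02, 0.02, 0.02, 0.02],   # e.g. concrete
    [0.15, 0.11, 0.10, 0.07, 0.06, 0.07],   # e.g. wood panel
])
y_train = ["plasterboard", "concrete", "wood"]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)

# A measured absorption profile for one estimated room boundary.
boundary_features = np.array([[0.13, 0.10, 0.07, 0.08, 0.09, 0.09]])
print(clf.predict(boundary_features))  # predicted material label for that wall
```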

A Deep Learning-based Stress Detection Algorithm with Speech Signal

  •      Hyewon Han
  • Kyunggeun Byun
  • Hong-Goo Kang

In this paper, we propose a deep learning-based psychological stress detection algorithm using speech signals. With increasing demand for communication between humans and intelligent systems, automatic stress detection is becoming an interesting research topic. Stress can be reliably detected by measuring the level of specific hormones (e.g., cortisol), but this is not a convenient method for detecting stress in human-machine interactions. The proposed algorithm first extracts mel-filterbank coefficients from pre-processed speech data and then predicts the stress status as a binary decision (i.e., stressed or unstressed) using long short-term memory (LSTM) and feed-forward networks. To evaluate the performance of the proposed algorithm, speech, video, and bio-signal data were collected in a well-controlled environment. In the decision process, we utilized only speech signals from subjects whose salivary cortisol level varied by more than 10%. Using the proposed algorithm, we achieved 66.4% accuracy in detecting the stress state of 25 subjects, demonstrating the possibility of using speech signals for automatic stress detection.
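
A minimal sketch of the decision stage described above, assuming PyTorch, a 40-band mel-filterbank front end, and placeholder layer sizes; this illustrates the LSTM-plus-feed-forward structure, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class StressDetector(nn.Module):
    """Binary stressed/unstressed classifier over mel-filterbank frames."""
    def __init__(self, n_mels: int = 40, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                            batch_first=True)
        self.classifier = nn.Sequential(          # feed-forward head
            nn.Linear(hidden, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        # mel_frames: (batch, time, n_mels) sequence of filterbank features
        _, (h_last, _) = self.lstm(mel_frames)     # summarize the utterance
        logits = self.classifier(h_last[-1])       # one logit per utterance
        return torch.sigmoid(logits)               # P(stressed)

model = StressDetector()
utterance = torch.randn(1, 300, 40)   # dummy mel-filterbank frames
print(model(utterance))               # probability of the "stressed" class
```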

SESSION: Keynote Talk 2

Session details: Keynote Talk 2

  •      Hansung Kim

Spatial Audio on the Web - Create, Compress, and Render

  •      Jan Skoglund

The recent surge of VR and AR has spawned interest in spatial audio beyond its traditional delivery over loudspeakers (e.g., in home theater environments) toward headphone delivery on, e.g., mobile devices. In this talk we'll discuss a web-based approach to spatial audio. It will cover creating real-time spatial audio directly in the browser, data compression of the audio for immersive media, and efficient binaural rendering.
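
The browser-side tooling the talk concerns is JavaScript (Web Audio), but the ambisonic representation underlying such rendering can be sketched in a few lines. The block below encodes a mono source into first-order B-format as an illustration only; the ACN channel order and SN3D normalization are assumptions of this sketch, not the speaker's implementation.

```python
import numpy as np

def encode_foa(mono: np.ndarray, azimuth_deg: float, elevation_deg: float):
    """Encode a mono signal into first-order ambisonics (ACN order, SN3D).

    Returns four channels [W, Y, Z, X]; a binaural renderer would then
    decode these to two ears, e.g. via virtual loudspeakers and HRTFs.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    w = mono * 1.0
    y = mono * np.sin(az) * np.cos(el)
    z = mono * np.sin(el)
    x = mono * np.cos(az) * np.cos(el)
    return np.stack([w, y, z, x])

# Example: 1 s of noise placed 90 degrees to the left at ear height.
fs = 48000
src = np.random.randn(fs) * 0.1
foa = encode_foa(src, azimuth_deg=90.0, elevation_deg=0.0)
print(foa.shape)  # (4, 48000)
```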

SESSION: Oral Session 2

Session details: Oral Session 2

  •      Hong-Goo Kang

Generation Method for Immersive Bullet-Time Video Using an Omnidirectional Camera in VR Platform

  •      Oto Takeuchi
  • Hidehiko Shishido
  • Yoshinari Kameda
  • Hansung Kim
  • Itaru Kitahara

This paper proposes a method for generating immersive bullet-time video that continuously switches among images captured by multi-viewpoint omnidirectional cameras arranged around the subject. In ordinary bullet-time processing, a point of interest (POI) can be observed at the same screen position by applying a projective transformation to the captured multi-viewpoint images. However, the observable area is limited by the field of view of the capturing cameras, so a blank region is added to the displayed image depending on the spatial relationship between the POI and the capturing camera. This seriously harms image quality (i.e., immersiveness). We solve this problem by applying omnidirectional cameras to bullet-time video production. Furthermore, by using a virtual reality platform for the calibration of the multi-viewpoint omnidirectional cameras and the display of the bullet-time video, fast and simple processing can be realised.
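
As context for the "ordinary bullet-time processing" the paper improves on, the sketch below applies a projective transformation with OpenCV so that reference points shared across viewpoints (and hence the POI) land at the same screen coordinates in every warped frame. The function, point sets, and output size are hypothetical; the paper's contribution of warping omnidirectional images instead is not shown here.

```python
import cv2
import numpy as np

def align_view_to_reference(frame: np.ndarray,
                            pts_in_view: np.ndarray,
                            pts_in_reference: np.ndarray,
                            out_size=(1920, 1080)) -> np.ndarray:
    """Warp one camera's frame so shared reference points (and the POI)
    appear at fixed screen positions, as in conventional bullet-time.

    pts_in_view / pts_in_reference: (N, 2) arrays of corresponding pixel
    coordinates, N >= 4. Both point sets here are hypothetical inputs.
    """
    H, _ = cv2.findHomography(pts_in_view.astype(np.float32),
                              pts_in_reference.astype(np.float32),
                              cv2.RANSAC)
    # Regions outside the camera's field of view come out blank (black),
    # which is exactly the image-quality problem the paper addresses by
    # switching to omnidirectional cameras.
    return cv2.warpPerspective(frame, H, out_size)
```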

Audio-Visual Attention Networks for Emotion Recognition

  •      Jiyoung Lee
  • Sunok Kim
  • Seungryong Kim
  • Kwanghoon Sohn

We present spatiotemporal attention-based multimodal deep neural networks for dimensional emotion recognition in multimodal audio-visual video sequences. To learn temporal attention that discriminatively focuses on emotionally salient parts of speech audio, we formulate the temporal attention network using deep neural networks (DNNs). In addition, to learn spatiotemporal attention that selectively focuses on emotionally salient parts of facial videos, a spatiotemporal encoder-decoder network is formulated using Convolutional LSTM (ConvLSTM) modules and learned implicitly without any pixel-level annotations. By leveraging the spatiotemporal attention, a 3D convolutional neural network (3D-CNN) is also formulated to robustly recognize dimensional emotion in facial videos. Furthermore, to exploit multimodal information, we fuse the audio and video features in an emotion regression model. The experimental results show that our method achieves state-of-the-art results in dimensional emotion recognition, with the highest concordance correlation coefficient (CCC) on the AV+EC 2017 dataset.
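
The evaluation metric mentioned above, the concordance correlation coefficient (CCC), has a simple closed form; the minimal NumPy implementation below follows the standard definition and is an illustration, not the authors' evaluation code.

```python
import numpy as np

def concordance_cc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Concordance correlation coefficient between predicted and reference
    emotion traces (e.g. per-frame arousal or valence values)."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var, gold_var = pred.var(), gold.var()
    covariance = np.mean((pred - pred_mean) * (gold - gold_mean))
    return 2.0 * covariance / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)

# Identical traces give CCC = 1; scaled or biased predictions score lower.
t = np.linspace(0.0, 1.0, 100)
print(concordance_cc(t, t))              # 1.0
print(concordance_cc(t, 0.5 * t + 0.2))  # noticeably below 1.0
```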

Towards Realistic Immersive Audiovisual Simulations for Hearing Research: Capture, Virtual Scenes and Reproduction

  •      Gerard Llorach
  • Giso Grimm
  • Maartje M.E. Hendrikse
  • Volker Hohmann

Most current hearing research laboratories and hearing aid evaluation setups are not sufficient to simulate real-life situations or to evaluate future generations of hearing aids that might include gaze information and brain signals. Thus, new methodologies and technologies might need to be implemented in hearing laboratories and clinics in order to generate realistic audiovisual testing environments. The aim of this work is to provide a comprehensive review of the currently available approaches and future directions for creating realistic, immersive audiovisual simulations for hearing research. Additionally, we present the technologies and use cases of our laboratory, as well as the pros and cons of these technologies: from creating 3D virtual simulations with computer graphics and virtual acoustic simulations, to 360° videos and Ambisonic recordings.