MMSys '22: Proceedings of the 13th ACM Multimedia Systems Conference

A questionnaire-based and physiology-inspired quality of experience evaluation of an immersive multisensory wheelchair simulator

  • Débora Pereira Salgado
  • Ronan Flynn
  • Eduardo Lázaro Martins Naves
  • Niall Murray

Immersive multimedia technologies such as virtual reality (VR) are now finding potential applications in domains outside of entertainment and gaming in areas such as health, education, and tourism, to name a few. This article presents a Quality of Experience (QoE) evaluation of an immersive haptic-based VR wheelchair simulator. The paper presents the results of an explicit and implicit (physiology-based) QoE evaluation of the Immersive Simulator in three different configurations: (a) desktop group (non-immersive); (b) headset 1 group (immersive with a high rate of motion acceleration); and (c) headset 2 group (immersive with a lower rate of motion acceleration). As part of the user evaluations, participants in each of the groups completed several questionnaires, including: an emotion questionnaire (SAM), cognitive task load (NASA-TLX), user expectations usability (SUS), and presence (IPQ). In addition, during the experience, physiological responses such as electrodermal activity (EDA) and heart rate variability (HRV) were recorded. The self-reported findings suggest that both headset groups had higher usability and presence levels than the desktop group. The two headset groups also reported more pleasant and exciting emotions than the desktop group. The NASA-TLX findings indicate that the headset 1 group experienced the highest cognitive task load. The performance evaluation shows that both headset groups had better results than the desktop group in terms of task completion.

Deep variational learning for multiple trajectory prediction of 360° head movements

  • Quentin Guimard
  • Lucile Sassatelli
  • Francesco Marchetti
  • Federico Becattini
  • Lorenzo Seidenari
  • Alberto Del Bimbo

Prediction of head movements in immersive media is key to designing efficient streaming systems able to focus the bandwidth budget on visible areas of the content. Numerous proposals have therefore been made in recent years to predict head motion in 360° images and videos. However, the performance of these models is limited by a main characteristic of the head motion data: its intrinsic uncertainty. In this article, we present an approach to generate multiple plausible futures of head motion in 360° videos, given a common past trajectory. Our method provides likelihood estimates of every predicted trajectory, enabling direct integration in streaming optimization. To the best of our knowledge, this is the first work that considers the problem of multiple head motion prediction for 360° video streaming. We first quantify this uncertainty from the data. We then introduce our discrete variational multiple sequence (DVMS) learning framework, which builds on deep latent variable models. We design a training procedure to obtain a flexible and lightweight stochastic prediction model compatible with sequence-to-sequence recurrent neural architectures. Experimental results on 3 different datasets show that our method DVMS outperforms competitors adapted from the self-driving domain by up to 37% on prediction horizons up to 5 sec., at lower computational and memory costs. Finally, we design a method to estimate the respective likelihoods of the multiple predicted trajectories, by exploiting the stationarity of the distribution of the prediction error over the latent space. Experimental results on 3 datasets show the quality of these estimates, and how they depend on the video category.

Context-aware image compression optimization for visual analytics offloading

  • Bo Chen
  • Zhisheng Yan
  • Klara Nahrstedt

Convolutional Neural Networks (CNN) have given rise to numerous visual analytics applications at the edge of the Internet. The image is typically captured by cameras and then live-streamed to edge servers for analytics due to the prohibitive cost of running CNN on computation-constrained end devices. A critical component to ensure low-latency and accurate visual analytics offloading over low bandwidth networks is image compression that minimizes the amount of visual data to offload and maximizes the decoding quality of salient pixels for analytics. Despite their wide adoption, the JPEG standard and traditional image compression do not address the accuracy of analytics tasks, leading to ineffective compression for visual analytics offloading. Although recent machine-centric image compression techniques leverage sophisticated neural network models or hardware architecture to support the accuracy-bandwidth trade-off, they introduce excessive latency in the visual analytics offloading pipeline. This paper presents CICO, a Context-aware Image Compression Optimization framework to achieve low-bandwidth and low-latency visual analytics offloading. CICO contextualizes image compression for offloading by employing easily-computable low-level image features to understand the importance of different image regions for a visual analytics task. Accordingly, CICO can optimize the trade-off between compression size and analytics accuracy. Extensive real-world experiments demonstrate that CICO reduces the bandwidth consumption of existing compression methods by up to 40% under a comparable analytics accuracy. In terms of low-latency support, CICO achieves up to a 2x speedup over state-of-the-art compression techniques.

Spatial audio in 360° videos: does it influence visual attention?

  • Amit Hirway
  • Yuansong Qiao
  • Niall Murray

Immersive technologies are rapidly gaining traction across a variety of application domains. 360° video is one such technology, which can be captured with an omnidirectional multi-camera arrangement. With a Virtual Reality (VR) Head Mounted Display (HMD), users have the freedom to look in any direction they wish within the scene. While there is a plethora of work focused on modeling visual attention (VA) in VR, little research has considered the influence of the audio modality on VA in VR. It is well known that audio has an important role in VR experiences. Listeners can experience sound in all directions with high quality spatial audio. One such technique, Ambisonics or 3D audio, provides a full 360° soundscape.

In this paper, the results of an empirical study that examined how (if at all) spatial audio influences visual attention in 360° videos are presented. The same videos were accompanied with either non-spatial (stereo) or spatial (third order Ambisonics) sound. Pose and gaze fixations, pupil diameter, audio energy maps (AEMs) and associated analysis for 20 users watching ten 360° videos across various categories (in Indoor and Outdoor conditions) are presented. The findings reveal that users have different viewing patterns and physiological responses for the different sound conditions. With the increasing use of spatial audio in VR and 360° videos, this knowledge can help develop effective techniques for optimizations in terms of processing, encoding, distributing, and rendering content.

3DeformR: freehand 3D model editing in virtual environments considering head movements on mobile headsets

  • Kit Yung Lam
  • Lik-Hang Lee
  • Pan Hui

3D objects are the primary media in virtual reality environments in immersive cyberspace, also known as the Metaverse. Users, through editing such objects, can communicate with other individuals on mobile headsets. Tangible controllers burden users with carrying additional devices, whereas body-centric interaction techniques, such as hand gestures, remove that burden. However, object editing with hand gestures is usually overlooked. Accordingly, we propose and implement a palm-based virtual embodiment for hand gestural model editing, namely 3DeformR. We employ three optimized hand gestures on bi-harmonic deformation algorithms that enable selecting and editing 3D models in fine granularity. Our evaluation with nine participants considers three interaction techniques: a two-handed tangible controller (OMC), a naive implementation of hand gestures (SH), and 3DeformR. Two experimental tasks on planar and spherical objects imply that 3DeformR outperforms SH in terms of task completion time (~51%) and required actions (~17%). Also, participants using 3DeformR perform significantly better than with the commercial standard (OMC), saving task time (~43%) and actions (~3%). Remarkably, the objects edited with 3DeformR show no discernible difference from those edited with tangible controllers, which are characterised by accurate and responsive detection.

CrispSearch: low-latency on-device language-based image retrieval

  • Zhiming Hu
  • Lan Xiao
  • Mete Kemertas
  • Caleb Phillips
  • Iqbal Mohomed
  • Afsaneh Fazly

Advances in deep learning have enabled accurate language-based search and retrieval, e.g., over user photos, in the cloud. Many users prefer to store their photos in the home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy through learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of having a prohibitively high retrieval latency. Alternatively, there is a new class of methods that exhibits good performance with low latency, but requires a lot more computational resources, and an order of magnitude more training data (i.e. large web-scraped datasets consisting of millions of image-caption pairs) making them infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods are suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces the retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a light-weight and runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model effectively filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates will be selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage, on two standard benchmark datasets (namely, MSCOCO and Flickr30k) show that CrispSearch results in a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution to on-device search and retrieval.
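The coarse-then-fine cascade described in the abstract above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the CrispSearch implementation: the tag-overlap `coarse_score`, the Jaccard `fine_score`, the `cascaded_search` helper, and the toy photo records are all hypothetical stand-ins for the learned coarse and fine models.

```python
from typing import Callable, List

# Hypothetical toy data: each "image" is a dict with a name and a tag list.
photos = [
    {"name": "a", "tags": ["dog", "beach", "sunset"]},
    {"name": "b", "tags": ["dog", "beach"]},
    {"name": "c", "tags": ["cat", "sofa"]},
    {"name": "d", "tags": ["dog", "park", "ball", "tree"]},
]


def coarse_score(query: List[str], image: dict) -> float:
    """Cheap stand-in for a coarse model: number of shared tags."""
    return float(len(set(query) & set(image["tags"])))


def fine_score(query: List[str], image: dict) -> float:
    """Costlier stand-in for a fine model: Jaccard similarity of tag sets."""
    q, t = set(query), set(image["tags"])
    return len(q & t) / len(q | t)


def cascaded_search(
    query: List[str],
    images: List[dict],
    coarse: Callable[[List[str], dict], float],
    fine: Callable[[List[str], dict], float],
    shortlist_size: int,
    top_k: int,
) -> List[dict]:
    # Stage 1: score every candidate with the cheap model and keep a shortlist.
    shortlist = sorted(images, key=lambda im: coarse(query, im), reverse=True)
    shortlist = shortlist[:shortlist_size]
    # Stage 2: re-rank only the shortlist with the expensive model.
    reranked = sorted(shortlist, key=lambda im: fine(query, im), reverse=True)
    return reranked[:top_k]


results = cascaded_search(
    ["dog", "beach"], photos, coarse_score, fine_score, shortlist_size=2, top_k=2
)
print([p["name"] for p in results])  # ['b', 'a']
```

In this toy run the fine scorer is called on two candidates instead of four; the latency benefit of a cascade comes from keeping the shortlist much smaller than the full collection.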

Automatic thumbnail selection for soccer videos using machine learning

  • Andreas Husa
  • Cise Midoglu
  • Malek Hammou
  • Steven A. Hicks
  • Dag Johansen
  • Tomas Kupka
  • Michael A. Riegler
  • Pål Halvorsen

Thumbnail selection is a very important aspect of online sport video presentation, as thumbnails capture the essence of important events, engage viewers, and make video clips attractive to watch. Traditional solutions in the soccer domain for presenting highlight clips of important events such as goals, substitutions, and cards rely on the manual or static selection of thumbnails. However, such approaches can result in the selection of sub-optimal video frames as snapshots, which degrades the overall quality of the video clip as perceived by viewers, and consequently decreases viewership, not to mention that manual processes are expensive and time consuming. In this paper, we present an automatic thumbnail selection system for soccer videos which uses machine learning to deliver representative thumbnails with high relevance to video content and high visual quality in near real-time. Our proposed system combines a software framework which integrates logo detection, close-up shot detection, face detection, and image quality analysis into a modular and customizable pipeline, and a subjective evaluation framework for the evaluation of results. We evaluate our proposed pipeline quantitatively using various soccer datasets, in terms of complexity, runtime, and adherence to a pre-defined rule-set, as well as qualitatively through a user study, in terms of the perception of output thumbnails by end-users. Our results show that an automatic end-to-end system for the selection of thumbnails based on contextual relevance and visual quality can yield attractive highlight clips, and can be used in conjunction with existing soccer broadcast pipelines which require real-time operation.

Less annoying: quality of experience of commonly used mobile applications

  • Alexandre De Masi
  • Katarzyna Wac

In recent years, research on the Quality of Experience (QoE) of smartphone applications has received attention from both industry and academia due to the complexity of quantifying and managing it. This paper proposes a smartphone-embedded system able to quantify and notify smartphone users of the expected QoE level (high or low) during their interaction with their devices. We conducted two in-the-wild studies of four weeks each with Android smartphone users. The first study enabled the collection of the QoE levels of popular smartphone applications' usage rated by 38 users. We aimed to derive an understanding of users' QoE level. From this dataset, we also built our own model that predicts the QoE level for each application category. Existing QoE models lack contextual features, such as duration of the user interaction with an application and the user's current physical activity. Subsequently, we implemented our model in an Android application (called expectQoE) for a second study involving 30 users to maximize high QoE level, and we replicated a previous study (2012) on the factors influencing the QoE of commonly used applications. Through emoji-based notifications, expectQoE presents the expected QoE level of the application category. This information enables users to make a conscious choice about the application to launch. We then investigated whether expectQoE improved the users' perceived QoE level and affected their application usage. The results showed no conclusive user-reported improvement of their perceived QoE due to expectQoE. Although the participants always had high QoE application usage expectations, the variation in their expectations was minimal and not significant. However, based on a time series analysis of the quantitative data, we observed that expectQoE decreased the application usage duration. Finally, the factors influencing the QoE on smartphone applications were similar to the 2012 findings. However, we observed the emergence of digital wellbeing features as facets of the users' lifestyle choices.

RL-AFEC: adaptive forward error correction for real-time video communication based on reinforcement learning

  • Ke Chen
  • Han Wang
  • Shuwen Fang
  • Xiaotian Li
  • Minghao Ye
  • H. Jonathan Chao

Real-time video communication is profoundly changing people's lives, especially in today's pandemic situation. However, packet loss during video transmission degrades reconstructed video quality, thus impairing users' Quality of Experience (QoE). Forward Error Correction (FEC) techniques are commonly employed in today's audio and video conferencing applications, such as Skype and Zoom, to mitigate the impact of packet loss. FEC helps recover the lost packets during transmissions at the receiver side, but the additional bandwidth consumption is also a concern. Since network conditions are highly dynamic, it is not trivial for FEC to maintain video quality with a fixed bandwidth overhead. In this paper, we propose RL-AFEC, an adaptive FEC scheme based on Reinforcement Learning (RL) to improve reconstructed video quality with an aim to mitigate bandwidth consumption for different network conditions. RL-AFEC learns to select a proper redundancy rate for each video frame, and then adds redundant packets based on the frame-level Reed-Solomon (RS) code. We also implement a novel packet-level Video Quality Assessment (VQA) method based on Video Multimethod Assessment Fusion (VMAF), which leverages Supervised Learning (SL) to generate video quality scores in real time by only extracting information from the packet stream without the need of visual contents. Extensive evaluations demonstrate the superiority of our scheme over other baseline FEC methods.
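The trade-off between redundancy rate, bandwidth overhead, and recoverability mentioned in the abstract above can be made concrete with a short sketch. It relies on a general property of Reed-Solomon erasure codes (any k of the n transmitted packets suffice to rebuild a frame of k source packets). The function names and the round-up rule are assumptions for illustration; the abstract does not say how RL-AFEC rounds or selects the rate.

```python
import math


def redundant_packets(source_packets: int, redundancy_rate: float) -> int:
    """Parity packets added for one frame (rounding up is an assumption)."""
    return math.ceil(source_packets * redundancy_rate)


def frame_recoverable(source_packets: int, parity_packets: int, lost_packets: int) -> bool:
    """With an RS erasure code, a frame is recoverable when at least
    `source_packets` of the transmitted packets arrive."""
    received = source_packets + parity_packets - lost_packets
    return received >= source_packets


k = 10                               # source packets in a frame (toy value)
parity = redundant_packets(k, 0.25)  # 25% redundancy rate (toy value)
print(parity)                           # 3
print(frame_recoverable(k, parity, 3))  # True: losses do not exceed parity
print(frame_recoverable(k, parity, 4))  # False: one loss too many
```

A fixed rate either wastes bandwidth when losses are rare or fails when they spike, which is the motivation the abstract gives for choosing the rate per frame.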

C2: consumption context cognizant ABR streaming for improved QoE and resource usage tradeoffs

  • Cheonjin Park
  • Chinmaey Shende
  • Subhabrata Sen
  • Bing Wang

Smartphones have emerged as ubiquitous platforms for people to consume content in a wide range of consumption contexts (C2), e.g., over cellular or WiFi, playing back audio and video directly on phone or through peripheral devices such as external screens or speakers, etc. In this paper, we argue that a user's specific C2 is an important factor to consider in Adaptive Bitrate (ABR) streaming. We examine the current practice of using C2 in four popular ABR players, and identify various limitations in existing treatments that have a detrimental impact on network resource usage and user experience. We then develop practical best-practice guidelines for C2-cognizant ABR streaming. Instantiating these guidelines, we develop a proof-of-concept implementation in the widely used state-of-the-art ExoPlayer platform and demonstrate that it leads to significantly better tradeoffs in terms of user experience and resource usage.

Swipe along: a measurement study of short video services

  • Shangyue Zhu
  • Theo Karagioules
  • Emir Halepovic
  • Alamin Mohammed
  • Aaron D. Striegel

Short videos have recently emerged as a popular form of short-duration User Generated Content (UGC) within modern social media. Short video content is generally less than a minute long and predominantly produced in vertical orientation on smartphones. While still fundamentally being streaming, short video delivery is distinctly characterized by the deployment of a mechanism that pre-loads ahead of user request. Background pre-loading aims to eliminate start-up time, which is now prioritized higher in Quality of Experience (QoE) objectives, given that the application design facilitates instant 'swiping' to the next video in a recommended sequence. In this work, we provide a comprehensive comparison of four popular short video services. In particular, we explore content characteristics and evaluate the video quality across resolutions for each service. We next characterize the pre-loading policy adopted by each service. Last, we conduct an experimental study to investigate data consumption and evaluate achieved QoE under different network scenarios and application configurations.

Unsupervised method for video action segmentation through spatio-temporal and positional-encoded embeddings

  • Guilherme de A. P. Marques
  • Antonio José G. Busson
  • Alan Lívio V. Guedes
  • Julio Cesar Duarte
  • Sérgio Colcher

Action segmentation consists of temporally segmenting a video and labeling each segmented interval with a specific action label. In this work, we propose a novel action segmentation method that requires no prior video analysis and no annotated data. Our method involves extracting spatio-temporal features from videos using a pre-trained deep network. Data is then transformed using a positional encoder, and finally a clustering algorithm is applied, where each produced cluster presumably corresponds to a different single and distinguishable action. In experiments, we show that our method produces competitive results on the Breakfast and Inria Instructional Videos dataset benchmarks.
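The three steps named in the abstract above (feature extraction, positional encoding, clustering) can be sketched on toy data. Everything specific here is an assumption: the one-dimensional "features" stand in for the output of a pre-trained network, the positional encoding is simply a normalized frame index appended to each vector, and the clustering is a small deterministic k-means. The abstract does not state which encoder or clustering algorithm the authors use.

```python
from typing import List, Tuple


def add_position(features: List[List[float]], weight: float = 1.0) -> List[List[float]]:
    """Append a normalized frame index so temporally close frames cluster together."""
    n = len(features)
    return [list(f) + [weight * i / max(n - 1, 1)] for i, f in enumerate(features)]


def sqdist(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, iterations: int = 20) -> List[int]:
    """Tiny k-means with deterministic, evenly spaced initial centroids."""
    n = len(points)
    centroids = [list(points[(2 * j + 1) * n // (2 * k)]) for j in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels


def to_segments(labels: List[int]) -> List[Tuple[int, int, int]]:
    """Collapse per-frame labels into (first_frame, last_frame, label) runs."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i - 1, labels[start]))
            start = i
    return segments


# Toy per-frame features: frames 0-2 and 6-8 look alike but occur at different times.
frame_features = [[0.0], [0.1], [0.0], [1.0], [0.9], [1.0], [0.0], [0.1], [0.0]]
labels = kmeans(add_position(frame_features), k=3)
print(to_segments(labels))  # [(0, 2, 0), (3, 5, 1), (6, 8, 2)]
```

In this toy case the appended position is what lets the visually similar first and last runs land in separate clusters.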

GreenABR: energy-aware adaptive bitrate streaming with deep reinforcement learning

  • Bekir Oguzhan Turkkan
  • Ting Dai
  • Adithya Raman
  • Tevfik Kosar
  • Changyou Chen
  • Muhammed Fatih Bulut
  • Jaroslaw Zola
  • Daby Sow

Adaptive bitrate (ABR) algorithms aim to make optimal bitrate decisions in dynamically changing network conditions to ensure a high quality of experience (QoE) for the users during video streaming. However, most of the existing ABRs share the limitations of predefined rules and incorrect assumptions about streaming parameters. They also fall short of considering the perceived quality in their QoE model, target higher bitrates regardless, and ignore the corresponding energy consumption. This joint approach results in additional energy consumption and becomes a burden, especially for mobile device users. This paper proposes GreenABR, a new deep reinforcement learning-based ABR scheme that optimizes the energy consumption during video streaming without sacrificing the user QoE. GreenABR employs a standard perceived quality metric, VMAF, and real power measurements collected through a streaming application. GreenABR's deep reinforcement learning model makes no assumptions about the streaming environment and learns how to adapt to the dynamically changing conditions in a wide range of real network scenarios. GreenABR outperforms the existing state-of-the-art ABR algorithms by saving up to 57% in streaming energy consumption and 60% in data consumption while achieving up to 22% more perceptual QoE due to up to 84% less rebuffering time and near-zero capacity violations.

Visual privacy protection in mobile image recognition using protective perturbation

  • Mengmei Ye
  • Zhongze Tang
  • Huy Phan
  • Yi Xie
  • Bo Yuan
  • Sheng Wei

Deep neural networks (DNNs) have been widely adopted in mobile image recognition applications. Considering intellectual property and computation resources, the image recognition model is often deployed at the service provider end, which takes input images from the user's mobile device and accomplishes the recognition task. However, from the user's perspective, the input images could contain sensitive information that is subject to visual privacy concerns, and the user must protect their privacy while offloading the images to the service provider. To address the visual privacy issue, we develop a protective perturbation generator at the user end, which adds perturbations to the input images to prevent privacy leakage. Meanwhile, the image recognition model still runs at the service provider end to recognize the protected images without needing to be re-trained. Our evaluations using the CIFAR-10 dataset and 8 image recognition models demonstrate effective visual privacy protection while maintaining high recognition accuracy. Also, the protective perturbation generator achieves premium timing performance suitable for real-time image recognition applications.

Encrypted video search: scalable, modular, and content-similar

  • Yu Zheng
  • Heng Tian
  • Minxin Du
  • Chong Fu

Video-based services have become popular. Clients often outsource their videos to the cloud to relieve local maintenance. However, privacy has emerged as a major concern since many videos contain sensitive information. While retrieving (unencrypted) videos has been widely studied, encrypted multimedia retrieval has received little attention, at best in a limited form of similarity searches on images.

We initiate the study of scalable encrypted video search in which a client can search videos similar to an image query. Our modular framework abstracts intrinsic attributes of videos in semantics and visuals to capture their contents. We advocate two-step searches by incorporating lightweight searchable encryption techniques for pre-screening and an interactive approach for fine-grained search.

We provide two instantiations. The basic one searches over semantic keywords, feature extraction, and locality-sensitive hash-based visual representations. The advanced one employs forward and backward private searchable encryption [CCS 2017] over deep hashing [CVPR 2020]. Our experimental results illustrate their practical performance over multiple real-world datasets.

AQP: an open modular Python platform for objective speech and audio quality metrics

  • Jack Geraghty
  • Jiazheng Li
  • Alessandro Ragano
  • Andrew Hines

Audio quality assessment has been widely researched in the signal processing area. Full-reference objective metrics (e.g., POLQA, ViSQOL) have been developed to estimate the audio quality relying only on human rating experiments. To evaluate the audio quality of novel audio processing techniques, researchers constantly need to compare objective quality metrics. Testing different implementations of the same metric and evaluating new datasets are fundamental and ongoing iterative activities. In this paper, we present AQP - an open-source, node-based, light-weight Python pipeline for audio quality assessment. AQP allows researchers to test and compare objective quality metrics helping to improve robustness, reproducibility and development speed. We introduce the platform, explain the motivations, and illustrate with examples how, using AQP, objective quality metrics can be (i) compared and benchmarked; (ii) prototyped and adapted in a modular fashion; (iii) visualised and checked for errors. The code has been shared on GitHub to encourage adoption and contributions from the community.

Njord: a fishing trawler dataset

  • Tor-Arne Schmidt Nordmo
  • Aril Bernhard Ovesen
  • Bjørn Aslak Juliussen
  • Steven Alexander Hicks
  • Vajira Thambawita
  • Håvard Dagenborg Johansen
  • Pål Halvorsen
  • Michael Alexander Riegler
  • Dag Johansen

Fish is one of the main sources of food worldwide. The commercial fishing industry has a lot of different aspects to consider, ranging from sustainability to reporting. The complexity of the domain also attracts a lot of research from different fields like marine biology, fishery sciences, cybernetics, and computer science. In computer science, detection of fishing vessels via for example remote sensing and classification of fish from images or videos using machine learning or other analysis methods attracts growing attention. Surprisingly, little work has been done that considers what is happening on board the fishing vessels. On the deck of the boats, a lot of data and important information are generated with potential applications, such as automatic detection of accidents or automatic reporting of fish caught. This paper presents Njord, a fishing trawler dataset consisting of surveillance videos from a modern off-shore fishing trawler at sea. The main goal of this dataset is to show the potential and possibilities that analysis of such data can provide. In addition to the data, we provide a baseline analysis and discuss several possible research questions this dataset could help answer.

Huldra: a framework for collecting crowdsourced feedback on multimedia assets

  • Malek Hammou
  • Cise Midoglu
  • Steven A. Hicks
  • Andrea Storås
  • Saeed Shafiee Sabet
  • Inga Strümke
  • Michael A. Riegler
  • Pål Halvorsen

Collecting crowdsourced feedback to evaluate, rank, or score multimedia content can be cumbersome and time-consuming. Most of the existing survey tools are complicated, hard to customize, or tailored for a specific asset type. In this paper, we present an open source framework called Huldra, designed explicitly to address the challenges associated with user studies involving crowdsourced feedback collection. The web-based framework is built in a modular and configurable fashion to allow for the easy adjustment of the user interface (UI) and the multimedia content, while providing integrations with reliable and stable backend solutions to facilitate the collection and analysis of responses. Our proposed framework can be used as an online survey tool by researchers who work on topics such as Machine Learning (ML), audio, image, and video quality assessment, and Quality of Experience (QoE), and who require user studies for the benchmarking of various types of multimedia content.

Nagare media ingest: a server for live CMAF ingest workflows

  • Matthias Neugebauer

New media ingest protocols have been presented recently. SRT and RIST compete with old protocols such as RTMP while the DASH-IF specified an HTTP-based ingest protocol for CMAF formatted media that lends itself towards delivery protocols such as DASH and HLS. Additionally, use cases of media ingest workflows can vary widely. This makes implementing generic and flexible tools for ingest workflows a hard challenge. A monolithic approach limits adoption if the implementation does not fit the use case completely. We propose a new design for ingest servers that splits responsibilities into multiple components running concurrently. This design enables flexible ingest deployments as is discussed for various use cases. We have implemented this design in the open source software Nagare Media Ingest for the new DASH-IF ingest protocol.

Multi-codec ultra high definition 8K MPEG-DASH dataset

  • Babak Taraghi
  • Hadi Amirpour
  • Christian Timmerer

Many applications and online services produce and deliver multimedia traffic over the Internet. Video streaming services, with a rapidly growing desire for more resources to provide better quality, such as Ultra High Definition (UHD) 8K content, are among them. The HTTP Adaptive Streaming (HAS) technique defines standard baselines for audio-visual content streaming to balance the delivered media quality and minimize defects in streaming sessions. On the other hand, video codec development and standardization help the progress toward improving such services by introducing efficient algorithms and technologies. Versatile Video Coding (VVC) is one of the latest advancements in video encoding technology that is still not fully optimized and not supported on all available platforms. Optimizing video codecs and supporting more platforms requires years of research and development. This paper provides multiple test assets in the form of a dataset that facilitates the research and development of the stated technologies. Our open-source dataset comprises Dynamic Adaptive Streaming over HTTP (MPEG-DASH) packaged multimedia assets, encoded with Advanced Video Coding (AVC), High Efficiency Video Coding (HEVC), AOMedia Video 1 (AV1), and VVC. We provide our dataset with resolutions of up to 7680x4320 or 8K. Our dataset has a maximum media duration of 322 seconds, and we offer our MPEG-DASH packaged content with two segment lengths, 4 and 8 seconds.
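As a quick worked example of what the two segment lengths in the abstract above imply, the sketch below counts the media segments of the longest (322-second) asset. It assumes the final, partial segment is kept as a shorter segment, which the abstract does not state; `segment_count` is a hypothetical helper, not part of the dataset tooling.

```python
import math


def segment_count(duration_s: float, segment_s: float) -> int:
    """Number of media segments, counting a shorter final segment (assumption)."""
    return math.ceil(duration_s / segment_s)


for length in (4, 8):
    print(length, segment_count(322, length))
# 4 81
# 8 41
```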

SILVR: a synthetic immersive large-volume plenoptic dataset

  • Martijn Courteaux
  • Julie Artois
  • Stijn De Pauw
  • Peter Lambert
  • Glenn Van Wallendael

In six-degrees-of-freedom light-field (LF) experiences, the viewer's freedom is limited by the extent to which the plenoptic function was sampled. Existing LF datasets represent only small portions of the plenoptic function, such that they either cover a small volume, or they have limited field of view. Therefore, we propose a new LF image dataset "SILVR" that allows for six-degrees-of-freedom navigation in much larger volumes while maintaining full panoramic field of view. We rendered three different virtual scenes in various configurations, where the number of views ranges from 642 to 2226. One of these scenes (called Zen Garden) is a novel scene, and is made publicly available. We chose to position the virtual cameras closely together in large cuboid and spherical organisations (2.2 m³ to 48 m³), equipped with 180° fish-eye lenses. Every view is rendered to a color image and depth map of 2048px × 2048px. Additionally, we present the software used to automate the multiview rendering process, as well as a lens-reprojection tool that converts images with panoramic or fish-eye projection to a standard rectilinear (i.e., perspective) projection. Finally, we demonstrate how the proposed dataset and software can be used to evaluate LF coding/rendering techniques (in this case for training NeRFs with instant-ngp). As such, we provide the first publicly-available LF dataset for large volumes of light with full panoramic field of view.

NewsImages: addressing the depiction gap with an online news dataset for text-image rematching

  • Andreas Lommatzsch
  • Benjamin Kille
  • Özlem Özgöbek
  • Yuxiao Zhou
  • Jelena Tešić
  • Cláudio Bartolomeu
  • David Semedo
  • Lidia Pivovarova
  • Mingliang Liang
  • Martha Larson

We present NewsImages, a dataset of online news items, and the related NewsImages rematching task. The goal of NewsImages is to provide researchers with a means of studying the depiction gap, which we define to be the difference between what an image literally depicts and the way in which it is connected to the text that it accompanies. Online news is a domain in which the image-text connection is known to be indirect: the news article does not describe what is literally depicted in the image. We validate NewsImages with experiments that show the usefulness of the dataset and the task for studying the connections that occur between image and text, as well as for addressing the challenges of the depiction gap, which include sparse data, diversity of content, and importance of background knowledge.

VCD: video complexity dataset

  • Hadi Amirpour
  • Vignesh V Menon
  • Samira Afzal
  • Mohammad Ghanbari
  • Christian Timmerer

This paper provides an overview of the open Video Complexity Dataset (VCD) which comprises 500 Ultra High Definition (UHD) resolution test video sequences. These sequences are provided at 24 frames per second (fps) and stored online in losslessly encoded 8-bit 4:2:0 format. In this paper, all sequences are characterized by spatial and temporal complexities, rate-distortion complexity, and encoding complexity with the x264 AVC/H.264 and x265 HEVC/H.265 video encoders. The dataset is tailor-made for cutting-edge multimedia applications such as video streaming, two-pass encoding, per-title encoding, scene-cut detection, etc. Evaluations show that the dataset includes diversity in video complexities. Hence, using this dataset is recommended for training and testing video coding applications. All data have been made publicly available as part of the dataset, which can be used for various applications.

Enabling scalable emulation of differentiated services in mininet

  • Darijo Raca
  • Meghana Salian
  • Ahmed H. Zahran

Evolving Internet applications, such as immersive multimedia and Industry 4.0, exhibit stringent delay, loss, and rate requirements. Realizing these requirements would be difficult without advanced dynamic traffic management solutions that leverage state-of-the-art technologies, such as Software-Defined Networking (SDN). Mininet represents a common choice for evaluating SDN solutions on a single machine. However, Mininet lacks the ability to emulate links that have multiple queues to enable differentiated service for different traffic streams. Additionally, performing a scalable emulation in Mininet would not be possible without light-weight application emulators. In this paper, we introduce two tools: QLink and SPEED. QLink extends the Mininet API to enable emulating links with multiple queues to differentiate between different traffic streams. SPEED is a light-weight web traffic emulation tool that enables scalable HTTP traffic emulation in Mininet. Our performance evaluation shows that SPEED enables scalable emulation of HTTP traffic in Mininet. Additionally, we demonstrate the benefits of using QLink to isolate three different applications (voice, web, and video) at a network bottleneck for numerous users.

Realistic video sequences for subjective QoE analysis

  • Kerim Hodzic
  • Mirsad Cosovic
  • Sasa Mrdovic
  • Jason J. Quinlan
  • Darijo Raca

Multimedia streaming over the Internet (live and on demand) is the cornerstone of the modern Internet, carrying more than 60% of all traffic. With such high demand, delivering an outstanding user experience is a crucial and challenging task. To evaluate user Quality of Experience (QoE), many researchers deploy subjective quality assessments where participants watch and rate videos artificially infused with various temporal and spatial impairments. To aid current efforts in bridging the gap between objective video QoE metrics and user experience, we developed DashReStreamer, an open-source framework for re-creating adaptively streamed video in real networks. DashReStreamer utilises a log created by an HTTP adaptive streaming (HAS) algorithm run in an uncontrolled environment (i.e., wired or wireless networks), encoding visual changes and stall events in one video file. These videos are applicable for subjective QoE evaluation mimicking realistic network conditions.

To supplement DashReStreamer, we re-create 234 realistic video clips, based on video logs collected from real mobile and wireless networks. In addition, our dataset contains both video logs with all decisions made by the HAS algorithm and network bandwidth profiles illustrating throughput distribution. We believe this dataset and framework will support other researchers in their pursuit of the final frontier in understanding the impact of video QoE dynamics.

PEM360: a dataset of 360° videos with continuous physiological measurements, subjective emotional ratings and motion traces

  • Quentin Guimard
  • Florent Robert
  • Camille Bauce
  • Aldric Ducreux
  • Lucile Sassatelli
  • Hui-Yin Wu
  • Marco Winckler
  • Auriane Gros

From a user perspective, immersive content can elicit more intense emotions than flat-screen presentations. From a system perspective, efficient storage and distribution remain challenging, and must consider user attention. Understanding the connection between user attention, user emotions and immersive content is therefore key. In this article, we present a new dataset, PEM360, of user head movements and gaze recordings in 360° videos, along with self-reported emotional ratings of valence and arousal, and continuous physiological measurements of electrodermal activity and heart rate. The stimuli are selected to enable the spatiotemporal analysis of the connection between content, user motion and emotion. We describe and provide a set of software tools to process the various data modalities, and introduce a joint instantaneous visualization of user attention and emotion that we name Emotional maps. We exemplify new types of analyses the PEM360 dataset can enable. The entire data and code are made available in a reproducible framework.

VCA: video complexity analyzer

  • Vignesh V Menon
  • Christian Feldmann
  • Hadi Amirpour
  • Mohammad Ghanbari
  • Christian Timmerer

For online analysis of the video content complexity in live streaming applications, selecting low-complexity features is critical to ensure low-latency video streaming without disruptions. To this end, for each video (segment), two features, i.e., the average texture energy and the average gradient of the texture energy, are determined. A DCT-based energy function is introduced to determine the block-wise texture of each frame. The spatial and temporal features of the video (segment) are derived from this DCT-based energy function. The Video Complexity Analyzer (VCA) project aims to provide an efficient spatial and temporal complexity analysis of each video (segment) which can be used in various applications to find the optimal encoding decisions. VCA leverages some of the x86 Single Instruction Multiple Data (SIMD) optimizations for Intel CPUs and multi-threading optimizations to achieve increased performance. VCA is an open-source library published under the GNU GPLv3 license.
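The idea of deriving a spatial and a temporal feature from a block-wise DCT energy can be sketched as follows. This is a simplified illustration and not VCA's implementation: it assumes an unweighted sum of absolute AC coefficients as the block energy, 4 x 4 blocks, and frame sizes that are multiples of the block size. All function names are hypothetical, and the project's actual coefficient weighting and normalisation are not reproduced.

```python
import math

def dct2(block):
    """Unnormalised 2-D DCT-II of a square block (list of equal-length rows)."""
    n = len(block)
    # cos_table[k][x] is the k-th DCT-II basis function sampled at position x.
    cos_table = [[math.cos(math.pi * (2 * x + 1) * k / (2 * n)) for x in range(n)]
                 for k in range(n)]
    return [[sum(block[y][x] * cos_table[u][y] * cos_table[v][x]
                 for y in range(n) for x in range(n))
             for v in range(n)]
            for u in range(n)]

def block_energy(block):
    """Texture energy of one block: sum of absolute AC coefficients (DC skipped)."""
    coeffs = dct2(block)
    return sum(abs(c) for u, row in enumerate(coeffs)
               for v, c in enumerate(row) if (u, v) != (0, 0))

def block_energies(frame, n=4):
    """Energies of all n x n blocks; frame dimensions are assumed multiples of n."""
    return [block_energy([row[bx:bx + n] for row in frame[by:by + n]])
            for by in range(0, len(frame), n)
            for bx in range(0, len(frame[0]), n)]

def spatial_complexity(frame, n=4):
    """Average block texture energy of one frame."""
    energies = block_energies(frame, n)
    return sum(energies) / len(energies)

def temporal_complexity(prev_frame, frame, n=4):
    """Average absolute change in block texture energy between two frames."""
    prev_e, cur_e = block_energies(prev_frame, n), block_energies(frame, n)
    return sum(abs(c - p) for p, c in zip(prev_e, cur_e)) / len(cur_e)

# Toy luma frames: a flat grey frame and a pixel-level checkerboard.
flat = [[128] * 8 for _ in range(8)]
checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
print(spatial_complexity(flat), spatial_complexity(checker),
      temporal_complexity(flat, checker))
```

In this toy case the flat frame has essentially zero texture energy, the checkerboard has a large one, and the temporal feature captures the jump between them.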


A new free viewpoint video dataset and DIBR benchmark

  • Shuai Guo
  • Kai Zhou
  • Jingchuan Hu
  • Jionghao Wang
  • Jun Xu
  • Li Song

Free viewpoint video (FVV), which provides viewers with a strongly interactive and immersive experience, has drawn great attention in recent years. Despite the developments made, further progress of FVV research is limited by existing datasets, most of which have too few camera views or only static scenes. To overcome these limitations, in this paper we present a new dynamic RGB-D video dataset with up to 12 views. Our dataset consists of 13 groups of dynamic video sequences taken in the same scene, and a group of video sequences of the empty scene. Each group has 12 HD video sequences taken by synchronized cameras and 12 correspondingly estimated depth video sequences. Moreover, we introduce an FVV synthesis benchmark based on depth image based rendering (DIBR) to help researchers validate their data-driven methods. We hope our work will inspire more FVV synthesis methods with enhanced robustness, improved performance and deeper understanding.

CGD: a cloud gaming dataset with gameplay video and network recordings

  • Ivan Slivar
  • Kresimir Bacic
  • Irena Orsolic
  • Lea Skorin-Kapov
  • Mirko Suznjevic

With advances in network capabilities, the gaming industry is increasingly turning towards offering "gaming on demand" solutions, with cloud gaming services such as Sony PlayStation Now, Google Stadia, and NVIDIA GeForce NOW expanding their market offerings. Similar to adaptive video streaming services, cloud gaming services typically adapt the quality of game streams (e.g., bitrate, resolution, frame rate) in accordance with current network conditions. To select the most appropriate video encoding parameters given certain conditions, it is important to understand their impact on Quality of Experience (QoE). On the other hand, network operators are interested in understanding the relationships between parameters measurable in the network and cloud gaming QoE, to be able to invoke QoE-aware network management mechanisms. To encourage developments in these areas, comprehensive datasets are crucial, including both network and application layer data. This paper presents CGD, a dataset consisting of 600 game streaming sessions corresponding to 10 games of different genres being played and streamed using the following encoding parameters: bitrate (5, 10, 20 Mbps), resolution (720p, 1080p), and frame rate (30, 60 fps). For every combination repeated five times for each game, the dataset includes: 1) gameplay video recordings, 2) network traffic traces, 3) user input logs (mouse and keyboard), and 4) streaming performance logs.
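The session count follows from the parameter grid given in the abstract. A small sketch that enumerates the combinations (the variable names are illustrative and not taken from the dataset):

```python
from itertools import product

BITRATES_MBPS = (5, 10, 20)
RESOLUTIONS = ("720p", "1080p")
FRAME_RATES_FPS = (30, 60)
NUM_GAMES = 10
REPETITIONS = 5  # each combination is repeated five times per game

# Every (bitrate, resolution, frame rate) encoding configuration.
configs = list(product(BITRATES_MBPS, RESOLUTIONS, FRAME_RATES_FPS))
total_sessions = len(configs) * REPETITIONS * NUM_GAMES

print(f"{len(configs)} configurations x {REPETITIONS} repetitions "
      f"x {NUM_GAMES} games = {total_sessions} sessions")
```

The 3 x 2 x 2 grid gives 12 configurations, and 12 x 5 x 10 matches the 600 sessions stated above.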

Enhancing situational awareness with adaptive firefighting drones: leveraging diverse media types and classifiers

  • Tzu-Yi Fan
  • Fangqi Liu
  • Jia-Wei Fang
  • Nalini Venkatasubramanian
  • Cheng-Hsin Hsu

High-rise fires are among the largest threats to safety in modern cities, and autonomous drones with multi-modal sensors can be employed to enhance situational awareness in such unfortunate disasters. In this paper, we study the fine-grained measurement selection problem for drones being dispatched to perform situation monitoring tasks in high-rise fires. Our problem considers multiple sensor/media types, classifier designs, and measurement locations, which were overlooked in prior waypoint scheduling studies. For concrete discussion, we adopt window openness as the target situation, while other situations can be readily supported by our solution as well. More specifically, we: (i) develop diverse window openness classifiers, (ii) mathematically formulate the fine-grained measurement selection problem and solve it using two algorithms, and (iii) create a photo-realistic simulator and an event-driven simulator to evaluate our algorithms. The evaluation results demonstrate that our proposed algorithms achieve higher classification accuracy (up to 50% improvement), deliver more feasible solutions (up to 100% improvement), and reduce energy consumption (up to 6.78 times reduction), compared to the current practices.

A QoE evaluation of procedural and example instruction formats for procedure training in augmented reality

  • Eoghan Hynes
  • Ronan Flynn
  • Brian Lee
  • Niall Murray

Augmented reality (AR) has significant potential as a training platform. The pedagogical purpose of training is learning or transfer. Learning is the acquisition of an ability to perform a procedure as taught while transfer involves generalising that knowledge to similar procedures in the same domain. Quality of experience (QoE) concerns the fulfilment of the application, system or service user's pragmatic and hedonic needs and expectations. Learning or transfer fulfil the AR trainee's pragmatic needs. Training instructions can be presented in procedural and example formats. Procedural instructions tell the trainee what to do while examples show the trainee how to do it. These two different instruction formats can influence learning, transfer, and hardware resource availability differently. The AR trainee's hedonic needs and expectations may be influenced by the impact of instruction format resource consumption on system performance. Efficient training efficacy is a design concern for mobile AR training applications. This work aims to inform AR training application design by evaluating the influence of procedural and example instruction formats on AR trainee QoE.

In this demo, an AR GoCube™ solver training application will be exhibited on the state-of-the-art Hololens 2 (HL2) mixed reality (MR) headset. This AR training app is part of a test framework that will be used in a between-groups study to evaluate the influence of text-based and animated 3D model instruction formats on AR trainee QoE. This framework will record the trainee's physiological ratings, eye gaze features and facial expressions. Learning will be evaluated in a post-training recall phase while transfer will be evaluated using a pre- and post-training comparison of mental rotation skills. Application profiling code will monitor AR headset resource consumption.

Context-aware video encoding as a network-based media processing (NBMP) workflow

  • Christoph Mueller
  • Louay Bassbouss
  • Stefan Pham
  • Stephan Steglich
  • Sven Wischnowsky
  • Peter Pogrzeba
  • Thomas Buchholz

Leveraging processing capabilities and resources in a network is a trending approach in accomplishing complex media processing tasks. At the same time, efficiently utilizing available resources while ensuring the potential for scalability and distribution is key. However, deploying, operating and maintaining such complex media service workflows on different cloud services, at the edge or on-premise can be a very complex and time-consuming task. In this paper, we will present an approach that addresses these challenges by utilizing state-of-the-art technologies and standards for advanced multimedia services such as the MPEG Network-based Media Processing (NBMP) standard. We will apply the presented approach for implementing bandwidth reduction and optimization strategies by using context aware video encoding. Implemented as an automated NBMP workflow, the context aware encoding method with the support of machine learning models avoids computationally heavy test encodes. The models are trained on complex datasets composed of 40+ video attributes and generate an optimal encoding ladder as an output (bitrate/resolution pairs). In comparison to the conventional per-title encoding method, we observed significant savings in terms of storage and delivery costs, while maintaining the same visual quality.

Teleoperation of the industrial robot: augmented reality application

  • Chung Xue Er (Shamaine)
  • Yuansong Qiao
  • Vladimir Kuts
  • John Henry
  • Ken McNevin
  • Niall Murray

Industry 4.0 aims at the full automation and digitalization of the manufacturing domain. Humans and robots working together is widely discussed in both the academic and industrial sectors. This discussion points to a need for more novel methods of interaction between humans and robots. This demonstration paper presents the technical advancement and prototype of remote control and re-programming of an industrial robot. The development is safe and efficient and allows the operator to focus on the final task rather than on the programming process.

The design and evaluation of a wearable-based system for targeted tremor assessment in Parkinson's disease

  • Samantha O'Sullivan
  • Niall Murray
  • Thiago Braga Rodrigues

Wearable sensors are worn by subjects to allow for continuous physiological monitoring. The use of wearable sensors for the quantification of movement within research communities has increased in recent years, with the purpose of objectively assessing and diagnosing the progression of Parkinson's Disease (PD). Most studies taking this approach for PD have stated that there is a need for a long-term solution, due to individuals having varying symptoms at different stages of the disease. Furthermore, the preference for home-based care has increased in recent times due to COVID-19, with clinical care being highly affected by cancellations, delayed appointments, or a reduction of time spent with patients. The need for such a system for patients with Parkinson's is therefore significant. There is no clinically available long-term assessment for tremors, and how these systems can be used to assess and aid in a clinical environment is still underdeveloped. The proposed system, which includes wireless sensors and results based on the clinical scale currently used for tremor assessment, may allow for constant, real-time, and accurate monitoring of a subject with tremors. This will provide more detailed medical data to enable long-term assessment and diagnosis, as well as person-centered physical therapy.

Virtual visits: life-size immersive communication

  • Sylvie Dijkstra-Soudarissanane
  • Simon N. B. Gunkel
  • Véronne Reinders

Elderly people in care homes face a great lack of contact with their families and loved ones. Social isolation and loneliness are detrimental factors for older people in their health condition, cognitive impairment and quality of life. We previously presented a communication system that facilitated high-quality mediated social contact through an Augmented Reality tool with 3D user capture and rendering on an iPad. We have further enhanced this system and propose a new demonstrator that provides a working communication system between the elderly and family members on a life-size display. A complete end-to-end chain architecture is presented, where capture, transmission, and rendering are enhanced to provide a tool for natural interaction and high social presence. Our system is technically ready and well-adapted to meet the needs of elderly users.

Towards low latency live streaming: challenges in a real-world deployment

  • Reza Shokri Kalan
  • Reza Farahani
  • Emre Karsli
  • Christian Timmerer
  • Hermann Hellwagner

Over-the-Top (OTT) service providers need faster, cheaper, and Digital Rights Management (DRM)-capable video streaming solutions. Recently, HTTP Adaptive Streaming (HAS) has become the dominant video delivery technology over the Internet. In HAS, videos are split into short intervals called segments, and each segment is encoded at various qualities/bitrates (i.e., representations) to adapt to the available bandwidth. Utilizing different HAS-based technologies with various segment formats imposes extra cost, complexity, and latency on the video delivery system. Enabling a unified format for transmitting and storing segments at Content Delivery Network (CDN) servers can alleviate the aforementioned issues. To this end, MPEG Common Media Application Format (CMAF) is presented as a standard format for cost-effective and low latency streaming. However, CMAF has not been adopted by video streaming providers yet and it is incompatible with most legacy end-user players. This paper reveals some useful steps for achieving low latency live video streaming that can be implemented for non-DRM-sensitive content before jumping to CMAF technology. We first design and instantiate our testbed in a real OTT provider environment, and then investigate the impact of changing format, segment duration, and Digital Video Recording (DVR) window length on a real live event. The results illustrate that replacing the transport stream (.ts) format with fragmented MP4 (.fMP4) and shortening the segment duration reduce live latency significantly.

Continuous-time feedback device to enhance situation awareness during take-over requests in automated driving conditions

  • Guilherme Daniel Gomes
  • Ronan Flynn
  • Niall Murray

Conditional automation may require drivers of Autonomous Vehicles (AVs) to turn their attention away from the road by taking part, for instance, in Non-Driving Related Tasks (NDRTs). This can leave them lacking Situation Awareness (SA) when resuming manual control. Consequently, Take-Over Requests (TORs) are system events intended to inform drivers when the vehicle is unable to handle an upcoming situation that is outside its Operational Design Domain (ODD) and the automated system might get disengaged, requiring the driver's attention. The short time frame of time-critical TOR events affects the performance of the machine-to-human transition, especially when the driver is engaged with NDRTs. This can lead to dangerous driving conditions. In that context, this work proposes the demonstration of a device called Adaptive Tactile Device (ATD), capable of continuously adjusting itself according to the driving conditions and smoothing the control transition by constantly informing the driver about road and system state using haptic force feedback. The Continuous-Time (CT) nature of the proposed device is intended to provide adaptive feedback during automated vehicle conditions, promoting a feeling of control and possibly improving the driver's SA during TOR events. Future work will consider implementing this device on the steering wheel or driver's seat and collecting the user's Quality of Experience (QoE) when using it in Virtual Reality (VR) simulations, to be compared with the user's objective and subjective metrics when receiving the Discrete-Time (DT) feedback warnings previously applied in TOR research.

MobileCodec: neural inter-frame video compression on mobile devices

  • Hoang Le
  • Liang Zhang
  • Amir Said
  • Guillaume Sautiere
  • Yang Yang
  • Pranav Shrestha
  • Fei Yin
  • Reza Pourreza
  • Auke Wiggers

Realizing the potential of neural codecs on real-world mobile devices is a big technological challenge due to the inherent conflict between the computational complexity of deep networks and the power-constrained mobile hardware performance. We demonstrate practical feasibility by leveraging Qualcomm's innovation and technology, bridging the gap from neural network-based model simulations to operation on a mobile device powered by Snapdragon® technology. We show the first-ever inter-frame neural video decoder running on a commercial mobile phone, decompressing high-definition videos in real-time while maintaining a low bitrate and high visual quality, comparable to conventional codecs.

A virtual reality volumetric music video: featuring new pagans

  • Gareth W. Young
  • Néill O'Dwyer
  • Aljosa Smolic

Music videos are short films that integrate songs and imagery produced for artistic and promotional purposes. Modern music videos apply various media capture techniques and creative postproduction technologies to provide a myriad of stimulating and artistic approaches to audience entertainment and engagement for viewing across multiple devices. Within this domain, volumetric video capture technologies (Figure 2) have become an emerging means of recording and reproducing musical performances for new audiences to access via traditional 2D screens and emergent extended reality platforms, such as augmented and virtual reality. These 3D digital reproductions of musical performances are captured live and are enhanced to deliver cutting-edge audiovisual entertainment (Figure 1). However, the precise impact of volumetric video in music video entertainment is still in a state of flux.

HOST-ATS: automatic thumbnail selection with dashboard-controlled ML pipeline and dynamic user survey

  • Andreas Husa
  • Cise Midoglu
  • Malek Hammou
  • Pål Halvorsen
  • Michael A. Riegler

We present HOST-ATS, a holistic system for the automatic selection and evaluation of soccer video thumbnails, which is composed of a dashboard-controlled machine learning (ML) pipeline, and a dynamic user survey. The ML pipeline uses logo detection, close-up shot detection, face detection, image quality prediction, and blur detection to automatically select thumbnails from soccer videos in near real-time, and can be configured via a graphical user interface. The web-based dynamic user survey can be employed to qualitatively evaluate the thumbnails selected by the pipeline. The survey is fully configurable and easy to update via continuous integration, allowing for the dynamic aggregation of participant responses to different sets of multimedia assets. We demonstrate the configuration and execution of the ML pipeline via the custom dashboard, and the agile (re-)deployment of the user survey via Firebase and Heroku cloud service integrations, where the audience can interact with configuration parameter updates in real-time. Our experience with HOST-ATS shows that an automatic thumbnail selection system can yield highly attractive highlight clips, and can be used in conjunction with existing soccer broadcast practices in real-time.

XREcho: a unity plug-in to record and visualize user behavior during XR sessions

  • Sophie Villenave
  • Jonathan Cabezas
  • Patrick Baert
  • Florent Dupont
  • Guillaume Lavoué

With the ever-growing use of extended reality (XR) technologies, researchers seek to understand the factors leading to a higher quality of experience. Measurements of the quality of experience (e.g., presence, immersion, flow) are usually assessed through questionnaires completed after the experience. To cope with the shortcomings and limitations of this kind of assessment, some recent studies use physiological measures and interaction traces generated in the virtual world in place of or in addition to questionnaires. Those physiological and behavioral measurements are still complex to implement, and existing studies difficult to replicate, because of a lack of easy and efficient tools to collect and visualize such data produced during XR experiments. In this paper, we present XREcho, a Unity package that allows for the recording, replaying and visualization of user behavior and interactions during XR sessions; the recorded data allow the whole experience to be replayed, as they include movements of the XR device, controllers and interacted objects, as well as eye tracking data (e.g., gaze position, pupil diameter). The capabilities of this tool are illustrated in a user study, where 12 participants' data have been collected and visualized with XREcho. Source code for XREcho is publicly available on GitHub.

Explainability methods for machine learning systems for multimodal medical datasets: research proposal

  • Andrea M. Storås
  • Inga Strümke
  • Michael A. Riegler
  • Pål Halvorsen

This paper contains the research proposal of Andrea M. Storås that was presented at the MMSys 2022 doctoral symposium. Machine learning models have the ability to solve medical tasks with a high level of performance, e.g., classifying medical videos and detecting anomalies using different sources of data. However, many of these models are highly complex and difficult to understand. Lack of interpretability can limit the use of machine learning systems in the medical domain. Explainable artificial intelligence provides explanations regarding the models and their predictions. In this PhD project, we develop machine learning models for automatic analysis of medical data and explain the results using established techniques from the field of explainable artificial intelligence. Current research indicates that there are still open issues to be solved in order for end users to understand multimedia systems powered by machine learning. Consequently, new explanation techniques will also be developed. Different types of medical data are applied in order to investigate the generalizability of the methods.

Real-time anxiety prediction in virtual reality therapy: research proposal

  • Deniz Mevlevioğlu
  • Sabin Tabirca
  • David Murphy

This paper contains the research proposal of Deniz Mevlevioglu that was presented at the MMSys 2022 doctoral symposium. The benefits of real-time anxiety prediction in Virtual Reality are vast, including uses from therapy to entertainment. These anxiety predictions can be made using biosensors by tracking physiological measurements such as heart rate, electrical brain activity and skin conductivity. However, there are multiple challenges when trying to achieve accurate predictions. First of all, defining anxiety in a useful context and getting objective measurements to predict it can be difficult due to different interpretations of the word. Secondly, personal differences can make it difficult to fit everyone into a generalisable model. Lastly, Virtual Reality strives for immersion, and many systems that use objective measures such as on-body sensors to detect anxiety can make it hard for the user to immerse themselves into the virtual world. Our research aims to come up with a system that will address these problems and manage to get accurate and objective predictions of anxiety in real-time while still allowing the users to be immersed in the experience. To this end, we aim to use fast-performing classification models with multi-modal on-body sensor data to maximise comfort and minimise noise and inaccuracies.

Light field image quality assessment method based on deep graph convolutional neural network: research proposal

  • Sana Alamgeer
  • Muhammad Irshad
  • Mylène C. Q. Farias

This paper contains the research proposal of Sana Alamgeer that was presented at the MMSys 2022 doctoral symposium. Unlike regular images that represent only light intensities, Light Field (LF) contents carry information about the intensity of light in a scene, including the direction light rays are traveling in space. This allows for a richer representation of our world, but requires large amounts of data that need to be processed and compressed before being transmitted to the viewer. Since these techniques may introduce distortions, the design of Light Field Image Quality Assessment (LF-IQA) methods is important. The majority of LF-IQA methods based on traditional Convolutional Neural Network (CNN) have limitations, i.e. they are unable to increase the receptive field of a neuron-pixel to model non-local image features. In this work, we propose a novel no-reference LF-IQA method that is based on Deep Graph Convolutional Neural Network (GCNN). Our method not only takes into account both LF angular and spatial information, but also learns the order of pixel information. Specifically, the method is composed of one input layer that takes a pair of graphs and their corresponding subjective quality scores as labels, 4 GCNN layers, fully connected layers, and a regression block for quality prediction. Our aim is to develop the quality prediction method with maximum accuracy for distorted LF content.

Analyzing and understanding embodied interactions in virtual reality systems: research proposal

  • Florent Robert
  • Marco Winckler
  • Hui-Yin Wu
  • Lucile Sassatelli

Virtual reality (VR) offers opportunities in human-computer interaction research to embody users in immersive environments and observe how they interact with 3D scenarios under well-controlled conditions. VR content has a stronger influence on users' physical and emotional states than traditional 2D media. However, a fuller understanding of this kind of embodied interaction is currently limited by the extent to which attention and behavior can be observed in a VR environment, and by the accuracy with which these observations can be interpreted as, and mapped to, real-world interactions and intentions. This thesis aims to create a system that helps designers analyze the entire user experience in a VR environment: how users feel and what their intentions are when interacting with a certain object, and that provides users with guidance based on their needs and attention. A controlled environment in which the user is guided will help to establish better intersubjectivity between the designers who created the experience and the users who lived it, and will lead to a more efficient analysis of user behavior in VR systems for the design of better experiences.

Design, development and evaluation of adaptative and interactive solutions for high-quality viewport-aware VR360 video processing and delivery: research proposal

  • Miguel Fernández-Dasí
  • Mario Montagud
  • Josep Paradells Aspas

This paper contains the research proposal of Miguel Fernández Dasí that was presented at the MMSys 2022 doctoral symposium. The production and consumption of multimedia content is continuously increasing, and this particularly affects immersive formats, like VR360 video. Even though significant advances have been witnessed with regard to the processing, delivery and consumption of interactive VR360 video, key challenges and research questions still need to be addressed to efficiently provide multi-camera and multi-user VR360 video services over distributed and heterogeneous environments. This research work aims at providing novel and efficient contributions to overcome existing limitations in this topic. First, it will develop an end-to-end modular VR360 video platform, including the measurement of Quality of Service (QoS) and activity metrics, to be used as a research testbed. Second, it will provide web-compliant viewport-aware video processing and delivery strategies to dynamically concentrate the video resolution on the users' viewport, with a single stream and decoding process through the web browser. Third, it will propose innovative encoding, signaling and synchronization solutions to enable effective support for multi-camera VR360 services, with personalized, fast and in-sync switching features. Fourth, it will explore how to effectively provide social viewing scenarios between remote users while watching the same or related VR360 videos. The work also plans to contribute low-latency delivery pipelines as well as tools and algorithms to assess and model the Quality of Experience (QoE).

AI-assisted affective computing and spatial audio for interactive multimodal virtual environments: research proposal

  • Juan Antonio De Rus
  • Mario Montagud
  • Maximo Cobos

This paper contains the research proposal of Juan Antonio De Rus that was presented at the MMSys 2022 doctoral symposium. The use of virtual reality (VR) is growing every year. With the normalization of remote work, it is to be expected that the use of immersive virtual environments to support tasks such as online meetings and education will grow even more. VR environments typically include multimodal content formats (synthetic content, video, audio, text) and even multi-sensory stimuli to provide an enriched user experience. In this context, Affective Computing (AC) techniques assisted by Artificial Intelligence (AI) become a powerful means to determine the user's perceived Quality of Experience (QoE). In the field of AC, we investigate a variety of tools to obtain accurate emotional analysis by using AI techniques applied to physiological data. In this doctoral study we have formulated a set of open research questions and objectives on which we plan to generate valuable contributions and knowledge in the field of AC, spatial audio, and multimodal interactive virtual environments. One of these is the creation of tools to automatically evaluate the QoE, even in real time, which can provide valuable benefits both to service providers and consumers. For data acquisition we use sensors of different quality to study the scalability, reliability and replicability of our solutions, as clinical-grade sensors are not always within the reach of the average user.

Perception of video quality at a local spatio-temporal horizon: research proposal

  • Andréas Pastor
  • Patrick Le Callet

This paper contains the research proposal of Andréas Pastor that was presented at the MMSys 2022 doctoral symposium. Encoding video for streaming over the Internet has become a major topic, with the goal of reducing bandwidth consumption and latency. At the same time, the human perception of distortions has been explored in multiple research projects, especially for distortions generated by Coder-DECoder (CODEC) algorithms. These algorithms operate in a rate-distortion optimization paradigm to efficiently compress video content. This optimization can be driven by metrics that are most of the time not based on human perception and, more importantly, not tuned to reflect the local perception of distortions by human eyes.

In this doctoral study, we propose to work on the perception of localized distortion at a small temporal and spatial horizon. We present here the fundamental research questions and challenges in the domain, with a focus on methods to collect perceptual judgments in subjective studies and on metrics that can help us derive an estimate of the perception of distortions by humans.

A telehealth and sensor-based system for user-centered physical therapy in Parkinson's disease: research proposal

  • Samantha O'Sullivan
  • Niall Murray
  • Thiago Braga Rodrigues

This paper contains the research proposal of Samantha O'Sullivan that was presented at the MMSys 2022 doctoral symposium. The use of wearable sensors for the understanding and quantification of movement within research communities working on Parkinson's Disease (PD) has increased significantly in recent years, with a motivation to objectively diagnose, assess and then understand the progression of the disease. Most studies taking this approach for PD have stated that there is a need for a long-term solution, due to varying symptoms at different stages of the disease. COVID-19 has brought further limitations in the delivery of clinical care, reducing time with therapists and doctors whilst increasing the preference for at-home care. The need for such a system for patients with PD is therefore significant. There is no clinically available long-term assessment for tremors, which is an issue highlighted in the literature. By using wireless sensors to track tremor severity continuously, and telehealth to create communication between patient and clinician, this proposed system will allow for better targeted therapy, accurate statistics, and constant accessible data. In this context, this work will design, build, and evaluate a novel system that would allow for constant monitoring of a patient with tremors. By using wireless sensors and telehealth, it will provide more detailed data that may enable directed and informed physical therapy. It will also improve communication by creating a constant data flow between clinician and patient to improve person-centered feedback, and aid the diagnosis and assessment of disease progression. A mobile/cloud-based application is incorporated to support this, motivated by the current heightened preference for home-based healthcare, long-term evaluation of tremors and personalized physical therapy. The primary focus of the PhD will be on capturing tremor activity and progression through a telehealth-based system. This proposed system will obtain real-time readings of tremors using wireless sensors and an application that will communicate consistently with healthcare professionals. The aim will be to provide better home-based care and person-centered physical therapy, and to improve quality of life.

Adaptability between ABR algorithms in DASH video streaming and HTTP/3 over QUIC: research proposal

  • Sindhu Chellappa
  • Radim Bartos

This paper contains the research proposal of Sindhu Chellappa that was presented at the MMSys 2022 doctoral symposium. With the ever-growing demand for media consumption, HTTP Adaptive Streaming (HAS) has become the de facto standard for adaptive bit rate streaming. Several efforts have been made to increase the Quality of Experience (QoE) for users by proposing Adaptive Bit Rate (ABR) streaming algorithms that choose the best bitrate for the current network condition. With HAS, the segments are downloaded through the Hypertext Transfer Protocol (HTTP), which operates over the Transmission Control Protocol (TCP). HTTP/3 is the recently standardized application protocol that operates over Quick UDP Internet Connections (QUIC). When the protocol changes, it is necessary to revisit the QoE of the ABR algorithms in order to select the ABR algorithm best suited to the protocol. We present four research questions to address the adaptability between the ABR algorithms and the underlying protocols.
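
To make the idea of an ABR algorithm "choosing the best bitrate for the current network condition" concrete, here is a minimal sketch of one simple family of rules, a throughput-based selector. It is not one of the algorithms studied in the proposal: the harmonic-mean estimator, the safety factor of 0.9, the bitrate ladder, and the function name `select_bitrate` are all assumptions chosen for illustration.

```python
from statistics import harmonic_mean

def select_bitrate(ladder_kbps, throughput_samples_kbps, safety=0.9):
    """Pick the highest ladder bitrate that fits within safety * estimated throughput.

    The estimate is the harmonic mean of recent per-segment throughput samples,
    which is less sensitive to short bursts than the arithmetic mean.
    """
    if not throughput_samples_kbps:
        return min(ladder_kbps)             # no measurements yet: start at the lowest rung
    budget = safety * harmonic_mean(throughput_samples_kbps)
    candidates = [b for b in ladder_kbps if b <= budget]
    return max(candidates) if candidates else min(ladder_kbps)

# Hypothetical bitrate ladder and throughput measurements, in kbit/s.
ladder = [300, 750, 1500, 3000, 6000]
chosen = select_bitrate(ladder, [4000, 2000, 4000])
```

Because the throughput samples such a rule relies on are shaped by the transport underneath (TCP versus QUIC), the same selector can behave differently when the protocol changes, which is the interaction the proposal sets out to examine.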

Just noticeable difference (JND) and satisfied user ratio (SUR) prediction for compressed video: research proposal

  • Jingwen Zhu
  • Patrick Le Callet

This paper contains the research proposal of Jingwen Zhu that was presented at the MMSys 2022 doctoral symposium. Just noticeable difference (JND) is the minimum amount of distortion at which the human eye can perceive a difference between the original stimulus and the distorted stimulus. With the rapid rise of multimedia demand, it is crucial to apply JND in visual communication systems in order to use the least resources (e.g., bandwidth and storage) without damaging the Quality of Experience (QoE) of end-users. In this thesis, we focus on JND prediction for compressed video to guide the choice of optimal encoding parameters for video streaming services. In this paper, we analyse the limitations of the current JND prediction models and present five main research questions to address these challenges.
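
The relation between per-subject JND points and the satisfied user ratio (SUR) named in the title can be sketched as follows. This assumes the common convention that a subject is "satisfied" at an encoding level if that level lies below the subject's JND point, so the SUR is the fraction of subjects who cannot yet see a difference. The sample JND values, the use of a quantization-parameter-like scale, and the function names are hypothetical.

```python
def satisfied_user_ratio(jnd_points, level):
    """Fraction of subjects whose JND point lies above `level` (no visible difference yet)."""
    return sum(1 for j in jnd_points if j > level) / len(jnd_points)

def highest_level_meeting_sur(jnd_points, levels, target=0.75):
    """Largest distortion level whose SUR is still at least `target`, or None."""
    ok = [q for q in levels if satisfied_user_ratio(jnd_points, q) >= target]
    return max(ok) if ok else None

# Hypothetical JND points of eight subjects on a QP-like scale (higher = more distortion).
jnd_qp = [30, 32, 33, 35, 36, 36, 38, 40]
sur_at_32 = satisfied_user_ratio(jnd_qp, 32)
best_level = highest_level_meeting_sur(jnd_qp, range(20, 45))
```

An encoder guided this way would compress as aggressively as the target SUR allows, which is how JND prediction can save bandwidth and storage without harming QoE for most viewers.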

Machine learning-based strategies for streaming and experiencing 3DoF virtual reality: research proposal

  • Quentin Guimard
  • Lucile Sassatelli

This paper contains the research proposal of Quentin Guimard that was presented at the MMSys 2022 doctoral symposium.

The development of 360° videos experienced in virtual reality (VR) is hindered by network, cybersickness, and content perception challenges. Many levers have already been proposed to address these challenges, but separately. This PhD thesis intends to jointly address these issues by dynamically controlling levers and making quality decisions, with a view to improving the VR streaming experience.

This paper describes the steps necessary to build such an approach, separating the work that has already been completed over the course of this PhD from the tasks that remain. First results are also presented.

AI-derived quality of experience prediction based on physiological signals for immersive multimedia experiences: research proposal

  • Sowmya Vijayakumar
  • Peter Corcoran
  • Ronan Flynn
  • Niall Murray

This paper contains the research proposal of Sowmya Vijayakumar that was presented at the MMSys 2022 doctoral symposium. Multimedia applications can now be found across many application domains including but not limited to entertainment, communication, health, business, and education. It is becoming more and more important to understand the factors that influence user perceptual quality, and hence monitoring user quality of experience (QoE) is essential for improving multimedia interaction and services. In this PhD work, we propose advanced machine learning techniques to predict QoE from physiological signals for immersive multimedia experiences. The aim of this doctoral study is to investigate the utility of physiological responses for QoE assessment for different multimedia technologies. Here, the research questions and solutions proposed to address this challenge are presented. A multimodal QoE prediction model is being developed that integrates several physiological measurements to improve QoE prediction performance.

Federated learning to understand human emotions via smart clothing: research proposal

  • Mary Pidgeon
  • Nadia Kanwal
  • Niall Murray

This paper contains the research proposal of Mary Pidgeon that was presented at the MMSys 2022 doctoral symposium. Emotion recognition from physiological signals has seen huge growth in recent decades. Wearables such as smart watches now have sensors to accurately measure physiological signals such as electrocardiography (ECG), blood volume pressure (BVP), galvanic skin response (GSR), and skin temperature (ST). These sensors have also been embedded in textiles. Collaborative body sensor networks (CBSN) have been used to analyse emotional reactions in a social setting from heart rate sensors. Federated learning, a recently proposed machine learning paradigm, protects users' private information while using information from several users to train a global machine learning model. Federated learning has several categorisations based on data partitioning, privacy mechanisms, machine learning models and methods for solving heterogeneity. In this doctoral thesis, we propose using a smart clothing body sensor network to collect peripheral physiological data while protecting the user's privacy using federated machine learning. We present three primary research questions to address the challenges in emotion prediction, data collection from e-textile sensors and federated learning (FL).
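
The way federated learning trains a global model from several users without collecting their raw data can be illustrated with the aggregation step of federated averaging (FedAvg), in which a server combines locally trained parameters weighted by each client's sample count. The abstract does not state which aggregation scheme the thesis will use, so FedAvg, the flat parameter lists, and the function name are assumptions made for this sketch.

```python
def federated_average(client_weights, client_sizes):
    """Combine client parameter vectors into one global vector.

    Each client contributes in proportion to its number of local samples;
    only parameters leave the device, never the physiological recordings.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical wearers: one with 1 local sample, one with 3.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a full training loop the server would send `global_weights` back to the clients, each client would continue training on its own sensor data, and the cycle would repeat.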