MMSys '23: Proceedings of the 14th Conference on ACM Multimedia Systems

FleXR: A System Enabling Flexibly Distributed Extended Reality

  • Jin Heo
  • Ketan Bhardwaj
  • Ada Gavrilovska

Extended reality (XR) applications require computationally demanding functionalities with low end-to-end latency and high throughput. To enable XR on commodity devices, a number of distributed systems solutions enable offloading of XR workloads on remote servers. However, they make a priori decisions regarding the offloaded functionalities based on assumptions about operating factors, and their benefits are restricted to specific deployment contexts. To realize the benefits of offloading in various distributed environments, we present a distributed stream processing system, FleXR, which is specialized for real-time and interactive workloads and enables flexible distributions of XR functionalities. In building FleXR, we identified and resolved several issues of presenting XR functionalities as distributed pipelines. FleXR provides a framework for flexible distribution of XR pipelines while streamlining development and deployment phases. We evaluate FleXR with three XR use cases in four different distribution scenarios. In the results, the best-case distribution scenario shows up to 50% less end-to-end latency and 3.9x pipeline throughput compared to alternatives.

The World is Too Big to Download: 3D Model Retrieval for World-Scale Augmented Reality

  • Yi-Zhen Tsai
  • James Luo
  • Yunshu Wang
  • Jiasi Chen

World-scale augmented reality (AR) is a form of AR where users move around the real world, viewing and interacting with 3D models at specific locations. However, given the geographical scale of world-scale AR, pre-fetching and storing numerous high-quality 3D models locally on the device is infeasible. For example, it would be impossible to download and store 3D ads from all the storefronts in a city onto a single device. A key challenge is thus deciding which remotely-stored 3D models should be fetched onto the AR device from an edge server, in order to render them in a timely fashion - yet with high visual quality - on the display. In this work, we propose a 3D model retrieval framework that makes intelligent decisions of which quality of 3D models to fetch, and when. The optimization decision is based on quality-compression tradeoffs, network bandwidth, and predictions of which 3D models the AR user is likely to view next. To support our framework, we collect real-world traces of AR users playing a world-scale AR game, and use this to drive our simulation and prediction modules. Our results show that the proposed framework can achieve higher visual quality of the 3D models while missing fewer display deadlines (by 20%) and wasting fewer bytes (by 10x), compared to a baseline approach of pre-fetching models within a fixed distance of the user.
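
The baseline mentioned above, pre-fetching models within a fixed distance of the user, can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the `Model3D` class, the `prefetch_within_radius` function, the flat metre-based coordinates, and the sample data are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Model3D:
    """A remotely stored 3D model anchored at a location (hypothetical type)."""
    name: str
    x: float  # position in metres, in an assumed flat local coordinate frame
    y: float


def prefetch_within_radius(user_xy, models, radius_m):
    """Return the models whose anchor lies within radius_m of the user."""
    ux, uy = user_xy
    return [m for m in models if math.hypot(m.x - ux, m.y - uy) <= radius_m]


# Hypothetical sample data: three models at different distances from the user.
models = [
    Model3D("cafe_ad", 10.0, 0.0),
    Model3D("museum_statue", 30.0, 40.0),
    Model3D("bus_stop_sign", 3.0, 4.0),
]
nearby = prefetch_within_radius((0.0, 0.0), models, radius_m=20.0)
print([m.name for m in nearby])
```

A fixed radius ignores bandwidth, model quality levels, and where the user is likely to look next, which are the factors the proposed framework is described as taking into account.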

Learning to Predict Head Pose in Remotely-Rendered Virtual Reality

  • Gazi Karam Illahi
  • Ashutosh Vaishnav
  • Teemu Kämäräinen
  • Matti Siekkinen
  • Mario Di Francesco

Accurate characterization of Head Mounted Display (HMD) pose in a virtual scene is essential for rendering immersive graphics in Extended Reality (XR). Remote rendering employs servers in the cloud or at the edge of the network to overcome the computational limitations of either standalone or tethered HMDs. Unfortunately, it increases the latency experienced by the user; for this reason, predicting HMD pose in advance is highly beneficial, as long as it achieves high accuracy. This work provides a thorough characterization of solutions that forecast HMD pose in remotely-rendered virtual reality (VR) by considering six degrees of freedom. Specifically, it provides an extensive evaluation of pose representations, forecasting methods, machine learning models, and the use of multiple modalities along with joint and separate training. In particular, a novel three-point representation of pose is introduced together with a data fusion scheme for long short-term memory (LSTM) neural networks. Our findings show that machine learning models benefit from using multiple modalities, even though simple statistical models perform surprisingly well. Moreover, joint training is comparable to separate training with carefully chosen pose representation and data fusion strategies.

Extending 3-DoF Metrics to Model User Behaviour Similarity in 6-DoF Immersive Applications

  • Silvia Rossi
  • Irene Viola
  • Laura Toni
  • Pablo Cesar

Immersive reality technologies, such as Virtual and Augmented Reality, have ushered in a new era of user-centric systems, in which every aspect of the coding-delivery-rendering chain is tailored to the interaction of the users. Understanding the actual interactivity and behaviour of the users is still an open challenge and a key step to enabling such a user-centric system. Our main goal is to extend the applicability of existing behavioural methodologies for studying user navigation to the case of 6 Degrees-of-Freedom (DoF). Specifically, we first compare navigation in 6-DoF with its 3-DoF counterpart, highlighting the main differences and novelties. Then, we define new metrics aimed at better modelling behavioural similarities between users in a 6-DoF system. We validate and test our solutions on real navigation paths of users interacting with dynamic volumetric media in 6-DoF Virtual Reality conditions. Our results show that metrics that consider both user position and viewing direction perform better in detecting user similarity during navigation in a 6-DoF system. Having easy-to-use but robust metrics that underpin multiple tools and answer the question "how do we detect if two users look at the same content?" opens the gate to new solutions for a user-centric system.

The Impact of Latency on Target Selection in First-Person Shooter Games

  • Shengmei Liu
  • Mark Claypool

While target selection in a 2D space is fairly well-studied, target selection in a 3D space, such as shooting in first-person shooter (FPS) games, is not, nor are the benefits to players of many latency compensation techniques. This paper presents results from a user study that evaluates the impact of latency and latency compensation techniques on 3D target selection via a bespoke FPS game. Analysis of the results shows latency degrades player performance (time to select/shoot a target), with subjective opinions on Quality of Experience (QoE) following suit. Individual latency compensation techniques cannot fully overcome the effects of latency but combined techniques can, letting players perform and feel as if there is no network latency. We derive a basic analytic model for the distribution of the player selection times which can be part of a simulation of a full range of FPS games.

Performance and User Experience Studies of HILLES: Home-based Immersive Lower Limb Exergame System

  • Yu-Yen Chung
  • Thiru M. Annaswamy
  • Balakrishnan Prabhakaran

Head-Mounted Devices (HMDs) have become popular for home-based immersive gaming. However, using lower limb motion in the immersive virtual environment is still restricted. This work introduces an RGB-D camera-based motion capture system alongside a standalone HMD for a Home-based Immersive Lower Limb Exergame System (HILLES) in a seated pose. With the advance of neural network models, camera-based 3D body tracking accuracy is increasing. Nevertheless, the high demand for computing resources on model inference may compromise the game engine's performance. Accordingly, HILLES applies a distributed architecture to leverage the resources effectively. The system's performance, in terms of frames per second and latency, is compared with that of a centralized system. For an immersive exergame, a pet walking around could raise safety issues. Hence, we also showcase that the camera system can provide an additional safety feature by incorporating an object detection model. In addition, another challenge in games focusing on lower limb interactions is the safe reachability of different virtual objects from a seated pose. Accordingly, in the user study, a stomping game with two reachability enhancements, leg extension and seated navigation, is implemented based on HILLES to evaluate and explore the gaming experience. The results show that the system motivates leg exercise, and the added enhancements may adjust the game difficulty. However, the enhancements may also distract users from focusing on leg exertion. The derived insight could benefit lower limb exergame design in the future.

An Asynchronous Intensity Representation for Framed and Event Video Sources

  • Andrew C. Freeman
  • Montek Singh
  • Ketan Mayer-Patel

Neuromorphic "event" cameras, designed to mimic the human vision system with asynchronous sensing, unlock a new realm of high-speed and high-dynamic-range applications. However, researchers often either revert to a framed representation of event data for applications, or build bespoke applications for a particular camera's event data type. To usher in the next era of video systems, accommodate new event camera designs, and explore the benefits of asynchronous video in classical applications, we argue that there is a need for an asynchronous, source-agnostic video representation. In this paper, we introduce a novel, asynchronous intensity representation for both framed and non-framed data sources. We show that our representation can increase intensity precision and greatly reduce the number of samples per pixel compared to grid-based representations. With framed sources, we demonstrate that by permitting a small amount of loss through the temporal averaging of stable pixel values, we can reduce our representational sample rate by more than half, while incurring a drop in VMAF quality score of only 4.5. We also demonstrate lower latency than the state-of-the-art method for fusing and transcoding framed and event camera data to an intensity representation, while maintaining 2000X the temporal resolution. We argue that our method provides the computational efficiency and temporal granularity necessary to build real-time intensity-based applications for event video.

Deep Feature Compression with Rate-Distortion Optimization for Networked Camera Systems

  • Ademola Ikusan
  • Rui Dai

Deep-learning-based video analysis solutions have become indispensable components in today's intelligent sensing applications. In a networked camera system, an efficient way to analyze the captured videos is to extract the features for deep learning at local cameras or edge devices and then transmit the features to powerful processing hubs for further analysis. As there exists substantial redundancy among different feature maps from the same video frame, the feature maps could be compressed before transmission to save bandwidth. This paper introduces a new rate-distortion optimized framework for compressing the intermediate deep features from the key frames of a video. First, to reduce the redundancy among different features, a feature selection strategy is designed based on hierarchical clustering. The selected features are then quantized, repacked as videos, and further compressed using a standardized video encoder. Furthermore, the proposed framework incorporates rate-distortion models that are built for three representative computer vision tasks: image classification, image segmentation, and image retrieval. A corresponding rate-distortion optimization module is designed to enhance the performance of common computer vision tasks under rate constraints. Experimental results show that the proposed deep feature compression framework can boost the compression performance over the standard HEVC video encoder.

RABBIT: Live Transcoding of V-PCC Point Cloud Streams

  • Michael Rudolph
  • Stefan Schneegass
  • Amr Rizk

Point clouds are a mature representation format for volumetric objects in 6 degrees-of-freedom multimedia streaming. To handle the massive size of point cloud data for visually satisfying immersive media, MPEG standardized Video-based Point Cloud Compression (V-PCC), leveraging existing video codecs to achieve high compression ratios. A major challenge of V-PCC is the high encoding latency, which results in fallback solutions that exchange the compression ratio for faster point cloud codecs. This encoding effort rises significantly in adaptive streaming systems, where heterogeneous user requirements translate into a set of quality representations of the media.

In this paper, we show that given one high quality media representation we can achieve live transcoding of video-based compressed point clouds to serve heterogeneous user quality requirements in real time. This stands in contrast to the slow, baseline transcoding that reconstructs and re-encodes the raw point cloud at a new quality setting. To address the high latency when employing the decoder-encoder stack of V-PCC during transcoding, we propose RABBIT, a novel technique that only re-encodes the underlying video sub-streams. This eliminates the overhead of the baseline decoding-encoding approach and decreases the latency further by applying optimized video codecs. We perform extensive evaluation of RABBIT in combination with different video codecs, showing on-par quality with the baseline V-PCC transcoding. Using a hardware-accelerated video codec we demonstrate live transcoding performance of RABBIT and finally present a trade-off between rate, distortion and transcoding latency.

Enabling Low Bit-Rate MPEG V-PCC-encoded Volumetric Video Streaming with 3D Sub-sampling

  • Yuang Shi
  • Pranav Venkatram
  • Yifan Ding
  • Wei Tsang Ooi

MPEG's Video-based Point Cloud Compression (V-PCC) is a recent standard for volumetric video compression. By mapping a 3D dynamic point cloud to a 2D image sequence, V-PCC can rely on state-of-the-art video codecs to achieve a high compression rate while maintaining the visual fidelity of the point cloud sequence. The quality of a compressed point cloud degrades steeply, however, below the operational bit-rate range of the video codec. In this work, we show that redundant information inherent in a 3D point cloud can be exploited to further extend the bit-rate range of the V-PCC codec, enabling it to operate in a low bit-rate scenario that is important in the context of volumetric video streaming. By simplifying the 3D point clouds through down-sampling and down-scaling during the encoding phase, and reversing the process during the decoding phase, we show that V-PCC could achieve up to 2.1 dB improvement in peak signal-to-noise ratio (PSNR), 7.1% improvement in structural similarity index (SSIM) and 14.8 improvement in video multimethod assessment fusion (VMAF) of the rendered point clouds at the same bit-rate and correspondingly up to 48.5% lower bit-rate at the same image quality.
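
For readers less familiar with the PSNR figure quoted above, the standard definition is reproduced here as a reminder. The symbols are generic notation, not taken from the paper: $x_i$ and $\hat{x}_i$ are the reference and reconstructed pixel values of a rendered image with $N$ pixels, and $\mathrm{MAX}$ is the peak pixel value (255 for 8-bit content).

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2,
\qquad
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)\ \text{dB}
```

Under this definition, a gain of 2.1 dB corresponds to the mean squared error shrinking by a factor of $10^{0.21} \approx 1.6$.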

patchVVC: A Real-time Compression Framework for Streaming Volumetric Videos

  • Ruopeng Chen
  • Mengbai Xiao
  • Dongxiao Yu
  • Guanghui Zhang
  • Yao Liu

Nowadays, volumetric video has emerged as an attractive multimedia application, which provides highly immersive watching experiences. However, streaming volumetric video demands prohibitively high bandwidth. Thus, effectively compressing its underlying point cloud frames is essential to deploying volumetric videos. The existing compression techniques are either 3D-based or 2D-based, but they still have drawbacks when deployed in practice. The 2D-based methods compress the videos in an effective but slow manner, while the 3D-based methods feature high coding speeds but low compression ratios. In this paper, we propose patchVVC, a 3D-based compression framework that reaches both a high compression ratio and a real-time decoding speed. More importantly, patchVVC is designed based on point cloud patches, which makes it friendly to a field-of-view (FoV) adaptive streaming system that further reduces the bandwidth demands. The evaluation shows patchVVC achieves real-time decoding speed and compression ratios comparable to those of the representative 2D-based scheme, V-PCC, in an FoV-adaptive streaming scenario.

FBDT: Forward and Backward Data Transmission Across RATs for High Quality Mobile 360-Degree Video VR Streaming

  • Suresh Srinivasan
  • Sam Shippey
  • Ehsan Aryafar
  • Jacob Chakareski

The metaverse encompasses many virtual universes and relies on streaming high-quality 360° videos to VR/AR headsets. This type of video transmission requires very high data rates to meet the desired Quality of Experience (QoE) for all clients. Simultaneous data transmission across multiple Radio Access Technologies (RATs) such as WiFi and WiGig is a key solution to meet this required capacity demand. However, existing transport layer multi-RAT traffic aggregation schemes suffer from Head-of-Line (HoL) blocking and sub-optimal traffic splitting across the RATs, particularly when there is a high fluctuation in their channel conditions. As a result, state-of-the-art multi-path TCP (MPTCP) solutions can achieve aggregate transmission data rates that are lower than that of using only a single WiFi RAT in many practical settings, e.g., when the client is mobile. We make two key contributions to enable high quality mobile 360° video VR streaming using multiple RATs. First, we propose the design of FBDT, a novel multi-path transport layer solution that can achieve the sum of individual transmission rates across the RATs despite their system dynamics. We implemented FBDT in the Linux kernel and showed substantial improvement in transmission throughput relative to state-of-the-art schemes, e.g., 2.5x gain in a dual-RAT scenario (WiFi and WiGig) when the VR client is mobile. Second, we formulate an optimization problem to maximize a mobile VR client's viewport quality by taking into account statistical models of how clients explore the 360° look-around panorama and the transmission data rate of each RAT. We explore an iterative method to solve this problem and evaluate its performance through measurement-driven simulations leveraging our testbed. We show up to 12 dB increase in viewport quality when our optimization framework is employed.

EVASR: Edge-Based Video Delivery with Salience-Aware Super-Resolution

  • Na Li
  • Yao Liu

With the rapid growth of video content consumption, it is important to deliver high-quality streaming videos to users even under limited available network bandwidth. In this paper, we propose EVASR, a system that performs edge-based video delivery to clients with salience-aware super-resolution. We select patches with higher saliency score to perform super-resolution while applying the simple yet efficient bicubic interpolation for the remaining patches in the same video frame. To efficiently use the computation resources available at the edge server, we introduce a new metric called "saliency visual quality" and formulate patch selection as an optimization problem to achieve the best performance when an edge server is serving multiple users. We implement EVASR based on the FFmpeg framework and conduct extensive experiments for evaluation. Results show that EVASR outperforms baseline approaches in both resource efficiency and visual quality metrics including PSNR, saliency visual quality (SVQ), and VMAF.
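
The patch-selection idea described above can be illustrated with a deliberately simplified sketch. It is not the EVASR optimization, which the abstract says is formulated across multiple users with a dedicated quality metric; here a plain top-k ranking stands in for it. The function name `select_patches_for_sr`, the `budget` parameter, and the sample scores are hypothetical.

```python
def select_patches_for_sr(saliency_scores, budget):
    """Split patch indices into (super-resolution, bicubic) groups.

    The `budget` most salient patches are assigned to super-resolution;
    every other patch falls back to bicubic interpolation.
    This greedy top-k rule is an assumption made for illustration only.
    """
    budget = max(0, budget)  # guard against negative budgets
    indices = range(len(saliency_scores))
    ranked = sorted(indices, key=lambda i: saliency_scores[i], reverse=True)
    sr = sorted(ranked[:budget])
    chosen = set(sr)
    bicubic = [i for i in indices if i not in chosen]
    return sr, bicubic


# Hypothetical saliency scores for four patches of one frame.
scores = [0.1, 0.9, 0.4, 0.7]
sr_patches, bicubic_patches = select_patches_for_sr(scores, budget=2)
print(sr_patches, bicubic_patches)
```

In a multi-user setting the budget would presumably be shared across streams, which is where a formulation like the one the abstract describes becomes necessary.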

Latency Target based Analysis of the DASH.js Player

  • Piers O'Hanlon
  • Adil Aslam

We analyse the low latency performance of the three Adaptive Bitrate (ABR) algorithms in the dash.js Dynamic Adaptive Streaming over HTTP (DASH) player with respect to a range of latency targets and configuration options. We perform experiments on our DASH Testbed which allows for testing with a range of real world derived network profiles. Our experiments enable a better understanding of how latency targets affect quality of experience (QoE), and how well the different algorithms adhere to their targets. We find that with dash.js v4.5.0 the default Dynamic algorithm achieves the best overall QoE. We show that whilst the other algorithms can achieve higher video quality at lower latencies, they do so only at the expense of increased stalling. We analyse the poor performance of L2A-LL in our tests and develop modifications which demonstrate significant improvements. We also highlight how some low latency configuration settings can be detrimental to performance.

A First Look at Adaptive Video Streaming over Multipath QUIC with Shared Bottleneck Detection

  • Thomas William do Prado Paiva
  • Simone Ferlin
  • Anna Brunstrom
  • Ozgu Alay
  • Bruno Yuji Lino Kimura

The promise of multipath transport is to aggregate bandwidth and improve resource utilisation and reliability. We demonstrate in this paper that the way multipath coupled congestion control is defined today in RFC 6356 leads to sub-optimal resource utilisation when network paths are mainly disjoint, i.e., they do not share a bottleneck. With growing interest in standardising Multipath QUIC (MPQUIC), we implement the practical shared bottleneck detection (SBD) algorithm from RFC 8382 in MPQUIC, namely MPQUIC-SBD. We evaluate MPQUIC-SBD through extensive emulation experiments in the context of video streaming. We show that MPQUIC-SBD is able to correctly detect shared bottlenecks over 90% of the time as the video segment size increases, depending on the Adaptive Bitrate (ABR) algorithm. In non-shared bottleneck scenarios, MPQUIC-SBD results in video throughput gains of more than 13% compared to MPQUIC, which directly translates into better video quality metrics.

QUTY: Towards Better Understanding and Optimization of Short Video Quality

  • Haodan Zhang
  • Yixuan Ban
  • Zongming Guo
  • Zhimin Xu
  • Qian Ma
  • Yue Wang
  • Xinggong Zhang

Short video applications such as TikTok and Instagram have attracted tremendous attention recently. However, industry and academia have only a limited understanding of the user's Quality of Experience (QoE) on short video, let alone how to improve the QoE in short video streaming.

In this paper, we dig into the factors that affect the user's QoE and then propose a system that models and optimizes the user's QoE. We unveil the QoE formulation of short video by studying users' viewing behavior and analyzing a large dataset (more than 10 million records) from Douyin (a short video application). We find that: (a) increases in rebuffering duration, rebuffering times, and starting delay decrease the user retention ratio, whereas the video bitrate has little effect; (b) users exhibit different viewing behavior patterns, such as scrolling through videos quickly or slowly, which can be utilized to improve QoE. Based on these findings, we propose QUTY, a QoE-driven short video streaming system, which utilizes a data-driven approach to quantify QoE of short video and optimizes it with a Hierarchical Reinforcement Learning (HRL) method. Our evaluations show that QUTY can reduce the rebuffering ratio by up to 49.9%, reduce the rebuffering times by up to 55.8%, reduce the startup delay by up to 81.9%, and improve the QoE by up to 8.5% compared with the existing short video streaming approaches.

Cross-layer Network Bandwidth Estimation for Low-latency Live ABR Streaming

  • Chinmaey Shende
  • Cheonjin Park
  • Subhabrata Sen
  • Bing Wang

Low-latency live (LLL) adaptive bitrate (ABR) streaming relies critically on accurate bandwidth estimation to react to dynamic network conditions. While existing studies have proposed bandwidth estimation techniques for LLL streaming, these approaches are at the application level, and their accuracy is limited by the distorted timing information observed at the application level. In this paper, we propose a novel cross-layer approach that uses coarse-grained application-level semantics and fine-grained kernel-level packet capture to obtain accurate bandwidth estimation. We incorporate this technique in three popular open-source ABR players and show that it provides significantly more accurate bandwidth estimation than the state-of-the-art application-level approaches. In addition, the more accurate bandwidth estimation leads to better bandwidth prediction, which we show can lead to significantly better quality of experience (QoE) for end users.

TASQ: Temporal Adaptive Streaming over QUIC

  • Akram Ansari
  • Yang Liu
  • Mea Wang
  • Emir Halepovic

Traditional Adaptive BitRate (ABR) streaming faces the challenge of providing a smooth experience under highly variable network conditions, especially when low latency is required. Effective adaptation techniques exist for deep-buffer scenarios, such as streaming long-form Video-on-Demand content, but remain elusive for short-form or low-latency cases, when even a short segment may be delivered too late and cause a stall. Recently proposed temporal adaptation aims to mitigate this problem by being robust to losing a part of the video segment, essentially dropping the tail of the segment intentionally to avoid the stall. In this paper, we analyze this approach in the context of the recently adopted AV1 codec and find that it does not always provide the promised benefits. We investigate the root causes and find that a combination of codec efficiency and TCP behavior can defeat the benefits of temporal adaptation. We develop a solution based on QUIC and present results showing that the benefits of temporal adaptation still apply to AV1, including a reduction in stall time of up to 65% compared to the original TCP-based approach. In addition, we present a novel way to use the stream management features of QUIC to benefit Quality-of-Experience (QoE) and reduce wasted data in video streaming.

Color-aware Deep Temporal Backdrop Duplex Matting System

  • Hendrik Hachmann
  • Bodo Rosenhahn

Deep learning-based alpha matting has shown tremendous improvements in recent years, yet feature film production studios still rely on classical chroma keying including costly post-production steps. This perceived discrepancy can be explained by some missing links necessary for production which are currently not adequately addressed in the alpha matting community, in particular foreground color estimation or color spill compensation. We propose a neural network-based temporal multi-backdrop production system that combines beneficial features from chroma keying and alpha matting. Given two consecutive frames with different background colors, our one-encoder-dual-decoder network predicts foreground colors and alpha values using a patch-based overlap-blend approach. The system is able to handle imprecise backdrops, dynamic cameras, and dynamic foregrounds and has no restrictions on foreground colors. We compare our method to state-of-the-art algorithms using benchmark datasets and a video sequence captured by a demonstrator setup. We verify that a dual backdrop input is superior to the usually applied trimap-based approach. In addition, the proposed studio set is actor friendly, and produces high-quality, temporally consistent alpha and color estimations that include a superior color spill compensation.

SketchBuddy: Context-Aware Sketch Enrichment and Enhancement

  • Aishwarya Agarwal
  • Anuj Srivastava
  • Inderjeet Nair
  • Swasti Shreya Mishra
  • Vineeth Dorna
  • Sharmila Reddy Nangi
  • Balaji Vasan Srinivasan

Sketching is a visual thinking tool that has long been available to humans. With the advent of modern sketching technologies, artists use sketches to express and iterate on their ideas. To accelerate sketch-based ideation and illustration workflows, we propose a novel framework, SketchBuddy, which retrieves diverse fine-grained object suggestions to enrich a sketch and coherently inserts them into the scene. SketchBuddy detects objects in the input sketch to estimate the scene context, which is then utilized for the recommendation and insertion. We propose a novel multi-modal transformer-based framework for obtaining context-aware fine-grained object recommendations. We train a CNN-based bounding box classifier to extract information from the input scene and the recommended objects to infer plausible locations for insertion. While prior works focus on sketches at object-level only, SketchBuddy is the first work in the direction of scene-level sketching assistance. Our extensive evaluations comparing SketchBuddy against competing baselines across several metrics and agreements with human preferences demonstrate its value on several aspects.

Style-based film grain analysis and synthesis

  • Zoubida Ameur
  • Claire-Hélène Demarty
  • Olivier Le Meur
  • Daniel Ménard
  • Edouard François

Film grain, which used to be a by-product of the chemical processing in analog film stock, is a desirable feature in the era of digital cameras. Besides contributing to the artistic intent during content creation, film grain also has interesting properties in the video compression chain, such as its ability to mask compression artifacts. In this paper, we use a deep learning-based framework for film grain analysis, generation and synthesis. Our framework consists of three modules: a style encoder performing film grain style analysis, a mapping network responsible for film grain style generation, and a synthesis network that generates and blends a specific grain style to a given content in a content-adaptive manner. All modules are trained jointly, thanks to dedicated loss functions, on a new large and diverse dataset of pairs of grain-free and grainy images that we made publicly available to the community. Quantitative and qualitative evaluations show that fidelity to the reference grain, diversity of grain styles as well as a perceptually pleasant grain synthesis are achieved, demonstrating that each module outperforms the state-of-the-art in the task it was designed for.

Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems

  • Luis Carvalho
  • Tobias Washüttl
  • Gerhard Widmer

Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches toward this task is to learn a cross-modal embedding space via deep neural networks that is able to connect short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content affects the capability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning, by exposing a network to a large amount of real music data as a pre-training step, by contrasting randomly augmented views of snippets of both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pretrained models are able to retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct more experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We then conclude by arguing for the potential of self-supervised contrastive learning for alleviating the annotated data scarcity in multi-modal music retrieval models. Code and trained models are accessible at

Semi-automatic mulsemedia authoring analysis from the user's perspective

  • Raphael Abreu
  • Douglas Mattos
  • Joel Santos
  • George Guinea
  • Débora C. Muchaluat-Saade

Mulsemedia (Multiple Sensorial Media) authoring is a complex task that requires the author to scan the media content to identify the moments to activate sensory effects. A novel proposal is to integrate content recognition algorithms into authoring tools to alleviate the authoring effort. Such algorithms could potentially replace the work of the human author when analyzing audiovisual content, by performing automatic extraction of sensory effects. Besides that, the semi-automatic method proposes to maintain the author subjectivity, allowing the author to define which sensory effects should be automatically extracted. This paper presents an evaluation of the proposed semi-automatic authoring considering the point of view of users. Experiments were done with the STEVE 2.0 mulsemedia authoring tool. Our work uses the GQM (Goal Question Metric) methodology, a questionnaire for collecting users' feedback, and analyzes the results. We conclude that users believe that the semi-automatic authoring is a positive addition to the authoring method.

Multimodal Cascaded Framework with Metric Learning Robust to Missing Modalities for Person Classification

  • Vijay John
  • Yasutomo Kawanishi

This paper addresses the missing modality problem in multimodal person classification, where an incomplete multimodal input with one modality missing is classified into predefined person classes. A multimodal cascaded framework with three deep learning models is proposed, where model parameters, outputs, and latent space learnt at a given step are transferred to the model in a subsequent step. The cascaded framework addresses the missing modality problem by, firstly, generating the complete multimodal data from the incomplete multimodal data in the feature space via a latent space. Subsequently, the generated and original multimodal features are effectively merged and embedded into a final latent space to estimate the person label. During the learning phase, the cascaded framework uses two novel latent loss functions, the missing modality joint loss, and latent prior loss to learn the different latent spaces. The missing modality joint loss ensures that the similar class latent data are close to each other, even if a modality is missing. In the cascaded framework, the latent prior loss learns the final latent space using a previously learnt latent space as a prior. The proposed framework is validated on the audio-visible RAVDESS and the visible-thermal Speaking Faces datasets. A detailed comparative analysis and an ablation analysis are performed, which demonstrate that the proposed framework enhances the robustness of person classification even under conditions of missing modalities, reporting an average of 21.75% increase and 25.73% increase over the baseline algorithms on the RAVDESS and Speaking Faces datasets.

Security-Preserving Live 3D Video Surveillance

  • Zhongze Tang
  • Huy Phan
  • Xianglong Feng
  • Bo Yuan
  • Yao Liu
  • Sheng Wei

3D video surveillance has become the new trend in security monitoring with the popularity of 3D depth cameras in the consumer market. While enabling more fruitful surveillance features, the finer-grained 3D videos being captured would raise new security concerns that have not been addressed by existing research. This paper explores the security implications of live 3D surveillance videos in triggering biometrics-related attacks, such as face ID spoofing. We demonstrate that the state-of-the-art face authentication systems can be effectively compromised by the 3D face models presented in the surveillance video. Then, to defend against such face spoofing attacks, we propose to proactively and benignly inject adversarial perturbations to the surveillance video in real time, prior to the exposure to potential adversaries. Such dynamically generated perturbations can prevent the face models from being exploited to bypass deep learning-based face authentications while maintaining the required quality and functionality of the 3D video surveillance. We evaluate the proposed perturbation generation approach on both an RGB-D dataset and a 3D video dataset, which justifies its effective security protection, low quality degradation, and real-time performance.

MOSAIC: Spatially-Multiplexed Edge AI Optimization over Multiple Concurrent Video Sensing Streams

  • Ila Gokarn
  • Hemanth Sabbella
  • Yigong Hu
  • Tarek Abdelzaher
  • Archan Misra

Sustaining high fidelity and high throughput of perception tasks over vision sensor streams on edge devices remains a formidable challenge, especially given the continuing increase in image sizes (e.g., generated by 4K cameras) and complexity of DNN models. One promising approach involves criticality-aware processing, where the computation is directed selectively to "critical" portions of individual image frames. We introduce MOSAIC, a novel system for such criticality-aware concurrent processing of multiple vision sensing streams that provides a multiplicative increase in the achievable throughput with negligible loss in perception fidelity. MOSAIC determines critical regions from images received from multiple vision sensors and spatially bin-packs these regions using a novel multi-scale Mosaic Across Scales (MoS) tiling strategy into a single "canvas frame", sized such that the edge device can retain sufficiently high processing throughput. Experimental studies using benchmark datasets for two tasks, Automatic License Plate Recognition and Drone-based Pedestrian Detection, show that MOSAIC, executing on a Jetson TX2 edge device, can provide dramatic gains in the throughput vs. fidelity tradeoff. For instance, for drone-based pedestrian detection, for a batch size of 4, MOSAIC can pack input frames from 6 cameras to achieve 4.75X (475%) higher throughput (23 FPS per camera, cumulatively 138 FPS) with ≤ 1% accuracy loss, compared to a First Come First Serve (FCFS) processing paradigm.

Video-based Contrastive Learning on Decision Trees: from Action Recognition to Autism Diagnosis

  • Mindi Ruan
  • Xiangxu Yu
  • Na Zhang
  • Chuanbo Hu
  • Shuo Wang
  • Xin Li

How can we teach a computer to recognize 10,000 different actions? Deep learning has evolved from supervised and unsupervised to self-supervised approaches. In this paper, we present a new contrastive learning-based framework for decision tree-based classification of actions, including human-human interactions (HHI) and human-object interactions (HOI). The key idea is to translate the original multi-class action recognition into a series of binary classification tasks on a pre-constructed decision tree. Under the new framework of contrastive learning, we present the design of an interaction adjacent matrix (IAM) with skeleton graphs as the backbone for modeling various action-related attributes such as periodicity and symmetry. Through the construction of various pretext tasks, we obtain a series of binary classification nodes on the decision tree that can be combined to support higher-level recognition tasks. Experimental justification for the potential of our approach in real-world applications ranges from interaction recognition to symmetry detection. In particular, we have demonstrated the promising performance of video-based autism spectrum disorder (ASD) diagnosis on the CalTech interview video database.

Media over QUIC: Initial Testing, Findings and Results

  • Zafer Gurel
  • Tugce Erkilic Civelek
  • Atakan Bodur
  • Senem Bilgin
  • Deniz Yeniceri
  • Ali C. Begen

With its advantages over TCP, QUIC created a new field for developing media-aware low-latency delivery solutions. The problem space is being examined by the new Media over QUIC (moq) working group in the IETF. In this paper, we study one of the initial proposals in detail, do a gap analysis and create an open-source testbed by introducing new essential features.

"You AR' right in front of me": RGBD-based capture and rendering for remote training

  • Simon N.B. Gunkel
  • Sylvie Dijkstra-Soudarissanane
  • Omar Niamut

Immersive technologies such as virtual reality have enabled novel forms of education and training, where students can learn new skills in simulated environments. But some specialized training procedures, e.g. ESA-certified soldering, still involve real-world physical processes with physical lab equipment. Such training sessions require students to travel to teaching labs and may interrupt everyday commitments for a longer period of time. There is a desire to make such training procedures more accessible remotely while keeping any student-to-teacher interaction natural, personal, and engaging. This paper presents a prototype for a remote teaching use case by rendering 3D photorealistic representations into the Augmented Reality (AR) glasses of a student. The teacher is captured with a modular RGBD capture application integrated into a web-based immersive communication platform. The integration offers multiple real-time capture calibration and rendering configurations. Our modular platform allows for an easy evaluation of different technical constraints as well as easy testing of the use case itself. Such evaluation may include a direct comparison of different 3D point-cloud and mesh rendering techniques. Additionally, the overall system allows immersive interaction between the student and the teacher, including augmented text messages for non-intrusive notifications. Our platform offers an ideal testbed for both technical and user-centered immersive communication studies.

Open-Source Toolkit for Live End-to-End 4K VVC Intra Coding

  • Marko Viitanen
  • Joose Sainio
  • Alexandre Mercat
  • Guillaume Gautier
  • Jarno Vanne
  • Ibrahim Farhat
  • Pierre-Loup Cabarat
  • Wassim Hamidouche
  • Daniel Menard

Versatile Video Coding (VVC/H.266) takes video coding to the next level by doubling the coding efficiency over its predecessors for the same subjective quality, but at the cost of immense coding complexity. Therefore, VVC calls for aggressively optimized codecs to make it feasible for live streaming media applications. This paper introduces the first public end-to-end (E2E) pipeline for live 4K30p VVC intra coding and streaming. The pipeline is made up of three open-source components: 1) uvg266 for VVC encoding; 2) uvgRTP for VVC streaming; and 3) OpenVVC for VVC decoding. The proposed setup is demonstrated with a proof-of-concept prototype that implements the encoder end on AMD ThreadRipper 2990WX and the decoder end on Nvidia Jetson AGX Orin. Our prototype is almost 34 000 times as fast as the corresponding E2E pipeline built around the VTM codec. Respectively, it achieves 3.3 times speedup without any significant coding overhead over the pipeline that utilizes the fastest possible configuration of the well-known VVenC/VVdeC codec. These results indicate that our prototype is currently the only viable open-source solution for live 4K VVC intra coding and streaming.

Remote Expert Assistance System for Mixed-HMD Clients over 5G Infrastructure

  • Frank ter Haar
  • Sylvie Dijkstra-Soudarissanane
  • Piotr Zuraniewski
  • Rick Hindriks
  • Karim El Assal
  • Simon Gunkel
  • Galit Rahim
  • Omar Niamut

When operating under adverse conditions or at distant locations, it is not always feasible to obtain the assistance of an expert on site. In such cases, remote expert assistance may provide a solution. Current remote assistance systems employ immersive data visualization or video-based communication. By introducing an XR-based multi-user communication system to remote expert assistance solutions, we expect to improve the effectiveness of both the operator and the supporting experts. Our multi-user XR collaboration demo integrates XR and cloud/network technologies. It enables three users to collaborate in a way that makes them feel that they are solving a challenging task together.

Video Decoding Performance and Requirements for XR Applications

  • Emmanouil Potetsianakis
  • Emmanuel Thomas

Designing XR applications creates challenges regarding the performance and the scaling of media decoding operations, composition and synchronization of the various assets. Going beyond the single decoder paradigm of conventional video applications, XR applications tend to compose more and more visual streams such as 2D video assets but also textures and 2D/3D graphics encoded in video streams. All this demands robust and predictable decoder management and a dynamic buffer organization. However, the behaviour of multiple decoder instances running in parallel is yet to be well understood on mobile platforms. To this end, we present in this paper VidBench - a parallel video decoding performance measurement tool for mobile Android devices. With VidBench, we quantify the challenges for applications using parallel video decoding pipelines with objective measurements and, subjectively, illustrate the current state of decoding multiple media streams and the possible visual artefacts resulting from unmanaged parallel video pipelines. Test results provide hints on the feasibility and the potential performance gain of using technologies like the MPEG-I Part 13 - Video Decoding Interface for immersive media (VDI) to alleviate those problems. We briefly present the main goals of VDI, standardised by the SC29 WG3 Moving Picture Experts Group (MPEG) Systems, which introduces functions and related constraints for optimizing such decoding instances, as well as relevant video decoding APIs on which VDI builds, such as the Khronos Vulkan Video extension.

Machine-learning based VMAF prediction for HDR video content

  • Christoph Müller
  • Stephan Steglich
  • Sandra Groß
  • Paul Kremer

This paper presents a methodology for predicting VMAF video quality scores for high dynamic range (HDR) video content using machine learning. To train the ML model, we are collecting a dataset of HDR and converted SDR video clips, as well as their corresponding objective video quality scores, specifically the Video Multimethod Assessment Fusion (VMAF) values. A 3D convolutional neural network (3D-CNN) model is being trained on the collected dataset. Finally, a hands-on demonstrator is developed to showcase the newly predicted HDR-VMAF metric in comparison to VMAF and other metric values for SDR content, and to conduct further validation with user testing.

A Holistic Approach to Understand HTTP Adaptive Streaming

  • Mike Vandersanden

HTTP adaptive streaming is a demanding application requiring high throughput and low latency, with consumers expecting an ever-increasing quality of experience. This doctoral study proposes a novel methodology to analyze and guarantee these requirements through the establishment of a holistic cross-layer application view. By significantly increasing the amount of data points and sources available as well as combining them into a unified representation, the proposed holistic point of view facilitates root cause analysis. At the same time, it can support or even automate analysis of various (sub)processes in the end-to-end chain. Potential cross-layer optimizations in the workflow are also more easily identified and analyzed.

Factors Influencing Video Quality of Experience in Ecologically Valid Experiments: Measurements and a Theoretical Model

  • Kamil Koniuch

Users' perception of multimedia quality and satisfaction with multimedia services are the subject of various studies in the field of Quality of Experience (QoE). In this respect, subjective studies of quality represent an important part of the multimedia optimization process. However, researchers who measure QoE have to face its multidimensional character and address the fact that quality perception is influenced by numerous factors. To address this issue, experiments measuring QoE often limit the scope of factors influencing subjective judgments by administering laboratory protocols. However, the generalizability of the results gathered with such protocols is limited. The proposed PhD dissertation aims to address this challenge. In order to increase the generalizability of QoE studies we started with an identification of factors influencing user multimedia experience in a natural context. We proposed a new theoretical model of video QoE based on both original research and a literature review. This new theoretical framework allowed us to propose new experimental designs introducing influencing factors one by one in an additive manner. Thanks to the model, we can also propose comparable experiments which could differ in ecological validity. The proposed theoretical framework can be adjusted to other multimedia in the future.

The ADΔER Framework: Tools for Event Video Representations

  • Andrew C. Freeman

The concept of "video" is synonymous with frame-sequence image representations. However, neuromorphic "event" cameras, which are rapidly gaining adoption for computer vision tasks, record frameless video. We believe that these different paradigms of video capture can each benefit from the lessons of the other. To usher in the next era of video systems and accommodate new event camera designs, we argue that we will need an asynchronous, source-agnostic processing pipeline. In this paper, we propose an end-to-end framework for frameless video, and we describe its modularity and amenability to compression and both existing and future applications.

QoE- and Energy-aware Content Consumption For HTTP Adaptive Streaming

  • Daniele Lorenzi

Video streaming services account for the majority of today's traffic on the Internet, and according to recent studies, this share is expected to continue growing. Given this broad utilization, research in video streaming has recently been moving towards energy-aware approaches, which aim at reducing the energy consumption of the devices involved in the streaming process. On the other hand, the perception of quality delivered to the user plays an important role, and the advent of HTTP Adaptive Streaming (HAS) changed the way quality is perceived. The focus is no longer exclusively on the Quality of Service (QoS) but rather oriented towards the Quality of Experience (QoE) of the user taking part in the streaming session. Therefore, video streaming services need to develop Adaptive BitRate (ABR) techniques to deal with different network conditions on the client side or appropriate end-to-end strategies to provide high QoE to the users. The scope of this doctoral study is within the end-to-end environment with a focus on the end-users domain, referred to as the player environment, including video content consumption and interactivity. This thesis aims to investigate and develop different techniques to increase the delivered QoE to the users and minimize the energy consumption of the end devices in the HAS context. We present four main research questions to target the related challenges in the domain of content consumption for HAS systems.

Everybody Compose: Deep Beats To Music

  • Conghao Shen
  • Violet Z. Yao
  • Yixin Liu

This project presents a deep learning approach to generate monophonic melodies based on input beats, allowing even amateurs to create their own music compositions. Three effective methods - LSTM with Full Attention, LSTM with Local Attention, and Transformer with Relative Position Representation - are proposed for this novel task, providing great variation, harmony, and structure in the generated music. This project allows anyone to compose their own music by tapping their keyboards or "recoloring" beat sequences from existing works.

VOLVQAD: An MPEG V-PCC Volumetric Video Quality Assessment Dataset

  • Samuel Rhys Cox
  • May Lim
  • Wei Tsang Ooi

We present VOLVQAD, a volumetric video quality assessment dataset consisting of 7,680 ratings on 376 video sequences from 120 participants. The volumetric video sequences are first encoded with MPEG V-PCC using 4 different avatar models and 16 quality variations, and then rendered into test videos for quality assessment using 2 different background colors and 16 different quality switching patterns. The dataset is useful for researchers who wish to understand the impact of volumetric video compression on subjective quality. Analysis of the collected data is also presented in this paper.

Modeling Illumination Data with Flying Light Specks

  • Hamed Alimohammadzadeh
  • Daryon Mehraban
  • Shahram Ghandeharizadeh

A Flying Light Speck, FLS, is a miniature sized drone configured with light sources. Swarms of FLSs will illuminate an object in a 3D volume, an FLS display. These illuminations and their data models are the novel contributions of this paper. We introduce a conceptual model of drone flight paths to render static, slide, and motion illuminations. We describe a physical implementation of the conceptual model using bag files. We evaluate this implementation using different lossless compression techniques. A key finding is that our bag file implementation is very compact when compared with the original point clouds. While compression reduces the size of a bag file, a combination that includes the use of both internal bag file compression (lz4 with chunks) and Gzip is not necessarily the most compact representation. We open source our software and its point cloud sequence data for use by the scientific community, see

TotalDefMeme: A Multi-Attribute Meme dataset on Total Defence in Singapore

  • Nirmalendu Prakash
  • Ming Shan Hee
  • Roy Ka-Wei Lee

Total Defence is a defence policy combining and extending the concept of military defence and civil defence. While several countries have adopted total defence as their defence policy, very few studies have investigated its effectiveness. With the rapid proliferation of social media and digitalisation, many social studies have focused on investigating policy effectiveness through specially curated surveys and questionnaires, either through digital media or traditional forms. However, such references may not truly reflect the underlying sentiments about the target policies or initiatives of interest. People are more likely to express their sentiment using communication mediums such as starting topic threads on forums or sharing memes on social media. Using Singapore as a case reference, this study aims to address this research gap by proposing TotalDefMeme, a large-scale multi-modal and multi-attribute meme dataset that captures public sentiments toward Singapore's Total Defence policy. Besides supporting social informatics and public policy analysis of the Total Defence policy, TotalDefMeme can also support many downstream multi-modal machine learning tasks, such as aspect-based stance classification and multi-modal meme clustering. We perform baseline machine learning experiments on TotalDefMeme and evaluate its technical validity, and present possible future interdisciplinary research directions and application scenarios using the dataset as a baseline.

A Dynamic 3D Point Cloud Dataset for Immersive Applications

  • Yuan-Chun Sun
  • I-Chun Huang
  • Yuang Shi
  • Wei Tsang Ooi
  • Chun-Ying Huang
  • Cheng-Hsin Hsu

Motion estimation in a 3D point cloud sequence is a fundamental operation with many applications, including compression, error concealment, and temporal upscaling. While there have been multiple research contributions toward estimating the motion vector of points between frames, there is a lack of a dynamic 3D point cloud dataset with motion ground truth to benchmark against. In this paper, we present an open dynamic 3D point cloud dataset to fill this gap. Our dataset consists of synthetically generated objects with pre-determined motion patterns, allowing us to generate the motion vectors for the points. Our dataset contains nine objects in three categories (shape, avatar, and textile) with different animation patterns. We also provide semantic segmentation of each avatar object in the dataset. Our dataset can be used by researchers who need temporal information across frames. As an example, we present an evaluation of two motion estimation methods using our dataset.

SMART360: Simulating Motion prediction and Adaptive bitRate sTrategies for 360° video streaming

  • Quentin Guimard
  • Lucile Sassatelli

Adaptive bitrate (ABR) algorithms are used in streaming media to adjust video or audio quality based on the viewer's network conditions to provide a smooth playback experience. With the rise of virtual reality (VR) headsets, 360° video streaming is growing rapidly and requires efficient ABR strategies to also adapt the video quality to the user's head position. However, research in this field is often difficult to compare due to a lack of reproducible simulations. To address this problem, we provide SMART360, a 360° streaming simulation environment to compare motion prediction and adaptive bitrate strategies. We provide sample inputs and baseline algorithms along with the simulator, as well as examples of results and visualizations that can be obtained with SMART360. The code and data are made publicly available.

360 Video DASH Dataset

  • Darijo Raca
  • Yogita Jadhav
  • Jason J. Quinlan
  • Ahmed H. Zahran

Different industries are observing the positive impact of 360 video on the user experience. However, the performance of VR systems continues to fall short of customer expectations. Therefore, more research into various design elements for VR streaming systems is required. This study introduces a software tool that offers straightforward encoding platforms to simplify the encoding of DASH VR videos. In addition, we developed a dataset composed of 9 VR videos encoded with seven tiling configurations, four segment durations, and up to four different bitrates. A corresponding tile size dataset is also provided, which can be utilised to power network simulations or trace-driven emulations. We analysed the traffic load of various films and encoding setups using the dataset that was presented. Our research indicates that, while smaller tile sizes reduce traffic load, video decoding may require more computational power.

Web3DP: A Crowdsourcing Platform for 3D Models Based on Web3 Infrastructure

  • Lehao Lin
  • Haihan Duan
  • Wei Cai

Recently, the concept of the metaverse has been rapidly emerging, greatly expanding the human living space. Specifically, 3D models are at the heart of building a vast metaverse space, so a massive number of 3D models are needed. Existing 3D model libraries and platforms have achieved great results. However, most of them are unscalable, insufficiently open, inefficient to collect, and at risk of service disruption and data corruption. Therefore, we propose and implement Web3DP, a crowdsourcing platform for 3D models based on Web3 (a.k.a. Web 3.0) infrastructure. By using decentralized blockchain technology, Web3DP has the advantages of transparency, auditability, traceability, tamper-proof data, high file transfer efficiency, and service stability. Experiments are conducted to validate the performance of the proposed platform. The results illustrate that Web3DP shows better file transmission capabilities with an acceptable transaction fee to facilitate 3D model collecting and managing for metaverse, games, cultural heritage, etc.

Vegvisir: A testing framework for HTTP/3 media streaming

  • Joris Herbots
  • Mike Vandersanden
  • Peter Quax
  • Wim Lamotte

Assessing media streaming performance traditionally requires the presence of reproducible network conditions and a heterogeneous dataset of media materials. Setting up such experiments represents a complex challenge in itself. This challenge becomes even more complex when we consider the new QUIC transport protocol, which has many tunable features, yet is difficult to analyze due to its inherent encrypted nature. In this paper, we introduce Vegvisir, which aims to solve these aforementioned challenges by providing an open-source automated testing framework for orchestrating media streaming experiments over HTTP/3. We describe how users can steer the behavior of Vegvisir through its configuration system. We provide a high-level overview of its inner workings and its broad applicability by describing two use cases: one covering sizeable experiments spanning multiple days and another covering HAS evaluation scenarios.

FSVVD: A Dataset of Full Scene Volumetric Video

  • Kaiyuan Hu
  • Yili Jin
  • Haowen Yang
  • Junhua Liu
  • Fangxin Wang

Recent years have witnessed a rapid development of immersive multimedia which bridges the gap between the real world and virtual space. Volumetric videos, as an emerging representative 3D video paradigm that empowers extended reality, stand out to provide unprecedented immersive and interactive video watching experience. Despite the tremendous potential, the research towards 3D volumetric video is still in its infancy, relying on sufficient and complete datasets for further exploration. However, existing related volumetric video datasets mostly only include a single object, lacking details about the scene and the interaction between them. In this paper, we focus on the current most widely used data format, point cloud, and for the first time release a full-scene volumetric video dataset that includes multiple people and their daily activities interacting with the external environments. Comprehensive dataset description and analysis are conducted, with potential usage of this dataset. The dataset and additional tools can be accessed via the following website:

A Dataset of Food Intake Activities Using Sensors with Heterogeneous Privacy Sensitivity Levels

  • Yi-Hung Wu
  • Hsin-Che Chiang
  • Shervin Shirmohammadi
  • Cheng-Hsin Hsu

Human activity recognition, which involves recognizing human activities from sensor data, has drawn a lot of interest from researchers and practitioners as a result of the advent of smart homes, smart cities, and smart systems. Existing studies on activity recognition mostly concentrate on coarse-grained activities like walking and jumping, while fine-grained activities like eating and drinking are understudied because it is more difficult to recognize fine-grained activities than coarse-grained ones. As such, food intake activity recognition in particular is under-investigated in the literature despite its importance for human health and well-being, including telehealth and diet management. In order to determine sensors' practical recognition accuracy, preferably with the least amount of privacy intrusion, a dataset of food intake activities utilizing sensors with varying degrees of privacy sensitivity is required. In this study, we built such a dataset by collecting fine-grained food intake activities using sensors of heterogeneous privacy sensitivity levels, namely a mmWave radar, an RGB camera, and a depth camera. Solutions to recognize food intake activities can be developed using this dataset, which may provide a more comprehensive picture of the accuracy and privacy trade-offs involved with heterogeneous sensors.

VAST: A Decentralized Open-Source Publish/Subscribe Architecture

  • Victory Opeolu
  • Herman Engelbrecht
  • Shun-Yun Hu
  • Charl Marais

Publish/Subscribe (pub/sub) systems have been widely adopted in highly scalable environments. We see this especially with IoT/IIoT applications, an environment where low bandwidth and high latency is ideal. The number of IoT/IIoT network nodes is projected to grow into the billions in the next few years and, as such, there is a need for network communication standards that can adapt to the ever-growing nature of this industry. While current pub/sub standards have produced positive results so far, they all adopt a "topic" based pub/sub approach. They do not leverage the spatial information available on modern devices. Current open-source standards also focus heavily on centralized brokering of information. This makes the broker a potential bottleneck: if that broker goes down, the entire network goes down. We have developed a new and innovative open-source pub/sub standard called VAST that leverages spatial information of modern network devices to perform message communication. It uses a concept called Spatial Publish/Subscribe (SPS). It is built on a peer-to-peer network to enable high scalability. In addition to this, it provides a Voronoi Overlay to efficiently distribute the messages, ensuring that network brokers are not overloaded with requests and that the network self-organizes if one or more brokers break down. It also has a forwarding algorithm to eliminate redundancies in the network. We will demonstrate this concept with a simulator we developed. We will show how the simulator works and how to use it. We believe that this simulator will encourage researchers to adopt this technology for their spatial applications. An example of such an application is Massively Multi-user Virtual Environments (MMVEs), where there is a need for a high number of spatial network nodes in virtual environments.

Adaptive streaming of 3D content for web-based virtual reality: an open-source prototype including several metrics and strategies

  • Jean-Philippe Farrugia
  • Luc Billaud
  • Guillaume Lavoue

Virtual reality is a technology that has developed considerably during the last decade. With autonomous head-mounted displays appearing on the market, new uses and needs have been created. The 3D content displayed by those devices can now be stored on distant servers rather than directly in the device's memory. In such networked immersive experiences, the 3D environment has to be streamed in real-time to the headset. In that context, several recent papers proposed utility metrics and selection strategies to schedule the streaming of the different objects composing the 3D environment, in order to minimize the latency and to optimize the quality of what is being visualized by the user at each moment. However, these proposed frameworks are hardly comparable since they operate on different systems and data. Therefore, we propose an open-source DASH-based web framework for adaptive streaming of 3D content in a 6 Degrees of Freedom (DoF) scenario. Our framework integrates several strategies and utility metrics from the state of the art, as well as several relevant features: 3D graphics compression, levels of detail, and the use of a visual quality index. We used our software to demonstrate the relevance of those tools and to provide useful hints to the community for the further improvement of 3D streaming systems.
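
As an illustration of what a utility-driven selection strategy can look like, here is a minimal sketch under stated assumptions; it is not taken from the framework itself. It assumes each candidate (an object at some level of detail) has a scalar utility and a known size, and it greedily picks candidates by utility per byte within a per-step byte budget. The names `Candidate` and `select_for_budget` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical streamable unit: one object at one level of detail."""
    name: str
    utility: float    # assumed scalar score, e.g. from a visual quality index
    size_bytes: int   # must be > 0


def select_for_budget(candidates: list[Candidate], budget_bytes: int) -> list[str]:
    """Greedy selection: highest utility per byte first, skipping what does not fit."""
    chosen = []
    remaining = budget_bytes
    ranked = sorted(candidates, key=lambda c: c.utility / c.size_bytes, reverse=True)
    for cand in ranked:
        if cand.size_bytes <= remaining:
            chosen.append(cand.name)
            remaining -= cand.size_bytes
    return chosen


candidates = [
    Candidate("statue_lod1", utility=8.0, size_bytes=400),
    Candidate("tree_lod0", utility=3.0, size_bytes=100),
    Candidate("wall_lod2", utility=9.0, size_bytes=900),
]
print(select_for_budget(candidates, budget_bytes=600))
```

The greedy ratio rule is only one possible strategy; the point of a common framework is that different utility metrics and selection rules can be swapped in and compared on the same data.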

A Dataset for User Visual Behaviour with Multi-View Video Content

  • Tiago Soares da Costa
  • Maria Teresa Andrade
  • Paula Viana
  • Nuno Castro Silva

Immersive video applications impose impractical bandwidth requirements for best-effort networks. With Multi-View (MV) streaming, these can be minimized by resorting to view prediction techniques. SmoothMV is a multi-view system that uses a non-intrusive head tracking mechanism to detect the viewer's interest and select appropriate views. Coupling it with Neural Networks (NNs) that anticipate the viewer's interest is likely to reduce view-switching latency. The objective of this paper is twofold: 1) Present a solution for acquisition of gaze data from users when viewing MV content; 2) Describe a dataset, collected with a large-scale testbed, capable of being used to train NNs to predict the user's viewing interest. Tracking data from head movements was obtained from 45 participants using an Intel RealSense F200 camera, with 7 video playlists, each being viewed a minimum of 17 times. This dataset is publicly available to the research community and constitutes an important contribution to reducing the current scarcity of such data. Tools to obtain saliency/heat maps and generate complementary plots are also provided as an open-source software package.

A 6DoF VR Dataset of 3D Virtual World for Privacy-Preserving Approach and Utility-Privacy Tradeoff

  • Yu-Szu Wei
  • Xing Wei
  • Shin-Yi Zheng
  • Cheng-Hsin Hsu
  • Chenyang Yang

Virtual Reality (VR) applications offer an immersive user experience at the expense of privacy leakage caused by inevitably streaming various new types of user data. While some privacy-preserving approaches have been proposed for protecting one type of data, how to design and evaluate approaches for multiple types of user data is still an open question. On the other hand, preserving privacy will degrade the quality of experience of VR applications, that is, the utility of user data. How to achieve an efficient utility-privacy tradeoff with multiple types of data is also open. Both questions call for a dataset that contains multiple types of user data and personal attributes of users as ground-truth values. In this paper, we collect a 6 degree-of-freedom VR dataset of 3D virtual worlds for the investigation of privacy-preserving approaches and the utility-privacy tradeoff.

IDCIA: Immunocytochemistry Dataset for Cellular Image Analysis

  • Abdurahman Ali Mohammed
  • Catherine Fonder
  • Donald S. Sakaguchi
  • Wallapak Tavanapong
  • Surya K. Mallapragada
  • Azeez Idris

We present a new annotated microscopic cellular image dataset to improve the effectiveness of machine learning methods for cellular image analysis. Cell counting is an important step in cell analysis. Typically, domain experts manually count cells in a microscopic image. Automated cell counting can potentially eliminate this tedious, time-consuming process. However, a good, labeled dataset is required for training an accurate machine learning model. Our dataset includes microscopic images of cells, and for each image, the cell count and the location of individual cells. The data were collected as part of an ongoing study investigating the potential of electrical stimulation to modulate stem cell differentiation and possible applications for neural repair. Compared to existing publicly available datasets, our dataset has more images of cells stained with a wider variety of antibodies (protein components of immune responses against invaders) typically used for cell analysis. The experimental results on this dataset indicate that none of the five existing models in this study is able to achieve sufficiently accurate counts to replace the manual methods. The dataset is available at

Perceptual annotation of local distortions in videos: tools and datasets

  • Andréas Pastor
  • Patrick Le Callet

To assess the quality of multimedia content, create datasets, and train objective quality metrics, one needs to collect subjective opinions from annotators. Different subjective methodologies exist, from direct rating with single or double stimuli to indirect rating with pairwise comparisons. Triplet and quadruplet-based comparisons are a type of indirect rating. From these comparisons and preferences on stimuli, we can place the assessed stimuli on a perceptual scale (e.g., from low to high quality). The Maximum Likelihood Difference Scaling (MLDS) solver is one of the algorithms that work with triplets and quadruplets. A participant is asked to compare intervals inside pairs of stimuli: (a,b) and (c,d), where a,b,c,d are stimuli forming a quadruplet. However, one limitation is that the perceptual scales retrieved from stimuli of different contents are usually not comparable. We previously offered a solution to measure the inter-content scale of multiple contents. This paper presents an open-source Python implementation of the method and demonstrates its use on three datasets collected in an in-lab environment. We compared the accuracy and effectiveness of the method using pairwise, triplet, and quadruplet comparisons for intra-content annotations. The code is available here:

SEPE Dataset: 8K Video Sequences and Images for Analysis and Development

  • Tariq Al Shoura
  • Ali Mollaahmadi Dehaghi
  • Reza Razavi
  • Behrouz Far
  • Mohammad Moshirpour

This paper provides an overview of our open SEPE (Software Engineering Practice and Education) 8K dataset, which is made up of 40 different 8K (8192 x 4320) video sequences and 40 different 8K (8192 x 5464) images. The video sequences were captured at a framerate of 29.97 frames per second (FPS) and were encoded into videos using AVC/H.264, HEVC/H.265, and AV1 codecs at resolutions from 8K to 480p. The images, video sequences, encoded videos, and various other statistics related to the media that make up the dataset are stored online, published, and maintained in a repository on GitHub for non-commercial use. In this paper, the dataset components are described and analyzed using various methods. The proposed dataset is - as far as we know - the first to publish true 8K natural sequences; thus, it is important for the next level of applications dealing with multimedia such as video quality assessment, super-resolution, video coding, video compression, and many more.


Semi-coupled Congestion Control for Multi-site Parallel Downloading

  • Chenfei Tian
  • Shaorui Ren
  • Yixuan Zhang
  • Mingwei Xu

Multi-site Parallel Downloading (MPD) is a technique that uses multiple low-cost edge nodes in the Internet to transfer short video content. Traditional multi-path congestion control fails to achieve fast convergence and high bandwidth utilization in MPD scenarios due to the over-coupling of subflows. In this paper, we propose a semi-coupled congestion control design for the MPD scenario by reallocating traffic between independent subflows. In simulation experiments, our design outperforms baseline models of traditional MPTCP.

Modified CUBIC Congestion Avoidance for Multi-side Parallel Downloading over Lossy Networks

  • Yu-Yen Chung

With the rapid growth of online video viewing, the quality of experience for users has become a critical factor for video streaming services in attracting and retaining users. Multi-side parallel downloading, which requests video segments from various low-cost data nodes simultaneously, could be a strategy to reduce latency and improve the experience. However, the communication between such data nodes might not be as reliable as with a conventional dedicated server. In such a network, random loss events may bias loss-based congestion control. Accordingly, this work incorporates a reevaluation mechanism into the CUBIC congestion avoidance state to correct the underestimation of the congestion window. In the experiment, we analyzed the round-trip time pattern and the transmission speed of CUBIC against our modification via network simulation with various loss rates. Our results show that reevaluation helps to recover the congestion window and to improve the transmission speed in extremely high-loss networks. The derived insight may benefit the future improvement of loss-based congestion control. This paper presents one of the winning teams' strategies in the MMSys23 Grand Challenge. The submitted code can be found on GitHub.
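
For readers unfamiliar with why random loss makes CUBIC underestimate the congestion window, the sketch below shows the standard CUBIC window function as specified in the CUBIC RFCs, not the paper's reevaluation mechanism. The constants C = 0.4 and beta = 0.7 are the RFC defaults; the function names and the example values are illustrative assumptions.

```python
def cubic_k(w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """Time (seconds) CUBIC needs to grow back to w_max after a loss."""
    return ((w_max * (1.0 - beta)) / c) ** (1.0 / 3.0)


def cubic_window(t: float, w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """Standard CUBIC window W(t) = C * (t - K)^3 + W_max, t seconds after a loss."""
    k = cubic_k(w_max, c, beta)
    return c * (t - k) ** 3 + w_max


def window_after_losses(w: float, losses: int, beta: float = 0.7) -> float:
    """Window after back-to-back loss events with no growth in between."""
    return w * beta ** losses


# Right after one loss the window is beta * w_max, then it climbs back to w_max at t = K.
print(cubic_window(0.0, w_max=100.0))
print(cubic_window(cubic_k(100.0), w_max=100.0))
# Several random (non-congestion) losses in a row shrink the window multiplicatively.
print(window_after_losses(100.0, losses=3))
```

Because every loss is treated as a congestion signal, a burst of random losses on an unreliable path pulls the window far below what the path can carry, which is the underestimation the abstract refers to.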