MMSys '21: Proceedings of the 12th ACM Multimedia Systems Conference

Motion segmentation and tracking for integrating event cameras

  • Andrew C. Freeman
  • Chris Burgess
  • Ketan Mayer-Patel

Integrating event cameras are asynchronous sensors wherein incident light values may
be measured directly through continuous integration, with individual pixels' light
sensitivity being adjustable in real time, allowing for extremely high frame rate
and high dynamic range video capture. This paper builds on lessons learned with previous
attempts to compress event data and presents a new scheme for event compression that
has many analogues to traditional framed video compression techniques. We show how
traditional video can be transcoded to an event-based representation, and describe
the direct encoding of motion data in our event-based representation. Finally, we
present experimental results showing that our simple scheme already approaches
state-of-the-art compression performance for slow-motion object tracking. This system
introduces an application "in the loop" framework, where the application dynamically
informs the camera how sensitive each pixel should be, based on the efficacy of the
most recent data received.

Open3DGen: open-source software for reconstructing textured 3D models from RGB-D images

  • Teo T. Niemirepo
  • Marko Viitanen
  • Jarno Vanne

This paper presents the first entirely open-source and cross-platform software called
Open3DGen for reconstructing photorealistic textured 3D models from RGB-D images.
The proposed software pipeline consists of nine main stages: 1) RGB-D acquisition;
2) 2D feature extraction; 3) camera pose estimation; 4) point cloud generation; 5)
coarse mesh reconstruction; 6) optional loop closure; 7) fine mesh reconstruction;
8) UV unwrapping; and 9) texture projection. This end-to-end scheme combines multiple
state-of-the-art techniques and provides an easy-to-use software package for real-time
3D model reconstruction and offline texture mapping. The main innovation lies in various
Structure-from-Motion (SfM) techniques that are used with additional depth data to
yield high-quality 3D models in real time and at low cost. The functionality of Open3DGen
has been validated on an AMD Ryzen 3900X CPU and an Nvidia GTX1080 GPU. This proof-of-concept
setup attains an average processing speed of 15 fps for 720p (1280x720) RGB-D input
without the offline backend. Our solution is shown to provide competitive 3D mesh
quality and execution performance with the state-of-the-art commercial and academic

Enabling hyperspectral imaging in diverse illumination conditions for indoor applications

  • Puria Azadi Moghadam
  • Neha Sharma
  • Mohamed Hefeeda

Hyperspectral imaging provides rich information across many wavelengths of the captured
scene, which is useful for many potential applications such as food quality inspection,
medical diagnosis, material identification, artwork authentication, and crime scene
analysis. However, hyperspectral imaging has not been widely deployed for such indoor
applications. In this paper, we address one of the main challenges stifling this wide
adoption, which is the strict illumination requirements for hyperspectral cameras.
Hyperspectral cameras require a light source that radiates power across a wide range
of the electromagnetic spectrum. Such light sources are expensive to set up and operate,
and in some cases, they are not possible to use because they could damage important
objects in the scene. We propose a data-driven method that enables indoor hyperspectral
imaging using cost-effective and widely available lighting sources such as LED and
fluorescent. These common sources, however, introduce significant noise in the hyperspectral
bands in the invisible range, which are the most important for the applications. Our
proposed method restores the damaged bands using a carefully-designed supervised deep-learning
model. We conduct an extensive experimental study to analyze the performance of the
proposed method and compare it against the state-of-the-art using real hyperspectral
datasets that we have collected. Our results show that the proposed method outperforms
the state-of-the-art across all considered objective and subjective metrics, and it
produces hyperspectral bands that are close to the ground truth bands captured under
ideal illumination conditions.

Livelyzer: analyzing the first-mile ingest performance of live video streaming

  • Xiao Zhu
  • Subhabrata Sen
  • Z. Morley Mao

Over-the-top (OTT) live video traffic has grown significantly, fueled by fundamental
shifts in how users consume video content (e.g., increased cord-cutting) and by improvements
in camera technologies, computing power, and wireless resources. A key determining
factor for the end-to-end live streaming QoE is the design of the first-mile upstream
ingest path that captures and transmits the live content in real-time, from the broadcaster
to the remote video server. This path often involves either a Wi-Fi or cellular component,
and is likely to be bandwidth-constrained with time-varying capacity, making the task
of high-quality video delivery challenging. Today, there is little understanding of
the state of the art in the design of this critical path, with existing research focused
mainly on the downstream distribution path, from the video server to end viewers.

To shed more light on the first-mile ingest aspect of live streaming, we propose Livelyzer,
a generalized active measurement and black-box testing framework for analyzing the
performance of this component in popular live streaming software and services under
controlled settings. We use Livelyzer to characterize the ingest behavior and performance
of several live streaming platforms, identify design deficiencies that lead to poor
performance, and propose best-practice design recommendations to improve it.

Playing chunk-transferred DASH segments at low latency with QLive

  • Praveen Kumar Yadav
  • Abdelhak Bentaleb
  • May Lim
  • Junyi Huang
  • Wei Tsang Ooi
  • Roger Zimmermann

Users are increasingly interested in low-latency over-the-top (OTT) applications
such as online video gaming, video chat, online casinos, sports betting, and live auctions.
OTT applications face challenges in delivering low-latency live streams using Dynamic
Adaptive Streaming over HTTP (DASH) due to large playback buffers and long video segment
durations. A potential solution to this issue is the use of HTTP chunked transfer encoding
(CTE) with the common media application format (CMAF). This combination allows the
delivery of each segment in several chunks to the client, starting before the segment
is fully available in real-time. However, CTE and CMAF alone are not sufficient as
they do not address other limitations and challenges at the client-side, including
inaccurate bandwidth measurement, latency control, and bitrate selection.

In this paper, we leverage a simple and intuitive method to resolve the fundamental
problem of bandwidth estimation for low latency live streaming through the use of
a hybrid of an existing chunk parser and proposed filtering of downloaded chunk data.
Next, we model the playback buffer as an M/D/1/K queue to limit the playback delay.
The combination of these techniques is collectively
called QLive. QLive uses the relationship between the estimated bandwidth, total buffer
capacity, instantaneous playback speed, and buffer occupancy to decide the playback
speed and the bitrate of the representation to download. We evaluated QLive under
a diverse set of scenarios and found that it controls the latency to meet the given
latency requirement, with an average latency up to 21 times lower than the compared
methods. The average playback speed of QLive ranges between 1.01X and 1.26X, and it plays
back at 1X speed up to 97% longer than the compared algorithms, without sacrificing
the quality of the video. Moreover, the proposed bandwidth estimator has a 94% accuracy
and is unaffected by a spike in instantaneous playback latency, unlike the compared
state-of-the-art counterparts.
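
To make the queueing model concrete, here is a minimal Python sketch of an M/D/1/K queue: Poisson arrivals, a single server with a fixed service time, and room for at most K items in the system. It is not QLive's implementation. The function name `simulate_md1k`, its parameters, and the reading of chunk arrivals as "customers" and playback as "service" are illustrative assumptions.

```python
import random

def simulate_md1k(arrival_rate, service_time, capacity, n_arrivals, seed=0):
    """Simulate an M/D/1/K queue (hypothetical helper, not from the paper).

    Returns (fraction of arrivals blocked, maximum number of items in system).
    """
    rng = random.Random(seed)
    now = 0.0
    departures = []  # departure times of items still in the system, ascending
    blocked = 0
    max_occupancy = 0
    for _ in range(n_arrivals):
        # Exponential inter-arrival times give a Poisson arrival process (the "M").
        now += rng.expovariate(arrival_rate)
        # Forget items that have already left the system.
        departures = [d for d in departures if d > now]
        if len(departures) >= capacity:
            blocked += 1  # system full (the "K"): this arrival is dropped
            continue
        # Single FIFO server (the "1") with constant service time (the "D").
        start = departures[-1] if departures else now
        departures.append(start + service_time)
        max_occupancy = max(max_occupancy, len(departures))
    return blocked / n_arrivals, max_occupancy
```

The finite capacity K is what bounds the waiting time: an item that is admitted never waits behind more than K - 1 others, which is the property a latency-limited playback buffer relies on.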

VRComm: an end-to-end web system for real-time photorealistic social VR communication

  • Simon N. B. Gunkel
  • Rick Hindriks
  • Karim M. El Assal
  • Hans M. Stokking
  • Sylvie Dijkstra-Soudarissanane
  • Frank ter Haar
  • Omar Niamut

Tools and platforms that enable remote communication and collaboration provide a strong
contribution to societal challenges. Virtual meetings and conferencing, in particular,
can help to reduce commutes and lower our ecological footprint, and can alleviate
physical distancing measures in case of global pandemics. In this paper, we outline
how to bridge the gap between common video conferencing systems and emerging social
VR platforms to allow immersive communication in Virtual Reality (VR). We present
a novel VR communication framework that enables remote communication in virtual environments
with real-time photorealistic user representation based on colour-and-depth (RGBD)
cameras and web browser clients, deployed on common off-the-shelf hardware devices.
The paper's main contribution is threefold: (a) a new VR communication framework,
(b) a novel approach for transmitting real-time depth data as 2D grayscale for 3D
user representation, including a central MCU-based approach for this new format, and
(c) a technical evaluation of the system with respect to processing delay, CPU and
GPU usage.
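
As a rough illustration of carrying depth in a grayscale channel, the Python sketch below linearly quantizes a depth sample to an 8-bit gray level and back. The abstract does not specify VRComm's actual mapping; the linear scheme, the function names, and the 500 mm to 4000 mm working range are assumptions made for this example.

```python
def depth_to_gray(depth_mm, near_mm=500, far_mm=4000):
    """Quantize one depth sample (millimetres) to an 8-bit gray level.

    near_mm and far_mm are assumed sensor limits, not values from the paper.
    """
    clamped = min(max(depth_mm, near_mm), far_mm)
    return round((clamped - near_mm) * 255 / (far_mm - near_mm))

def gray_to_depth(gray, near_mm=500, far_mm=4000):
    """Invert depth_to_gray, up to the quantization error."""
    return near_mm + gray * (far_mm - near_mm) / 255
```

With these assumed limits, one gray level spans about 13.7 mm, so the round trip loses at most half of that per sample. A narrower range or a non-linear mapping would trade coverage for precision.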

Towards cloud-edge collaborative online video analytics with fine-grained serverless

  • Miao Zhang
  • Fangxin Wang
  • Yifei Zhu
  • Jiangchuan Liu
  • Zhi Wang

The ever-growing deployment scale of surveillance cameras and users' increasing
appetite for real-time queries have made online video analytics a pressing need. Synergizing the
virtually unlimited cloud resources with agile edge processing would deliver an ideal
online video analytics system; yet, given the complex interaction and dependency within
and across video query pipelines, it is easier said than done. This paper starts with
a measurement study to acquire a deep understanding of video query pipelines on real-world
camera streams. We identify the potential and practical challenges of cloud-edge
collaborative video analytics. We then argue that the newly emerged serverless computing
paradigm is the key to achieving fine-grained resource partitioning with minimum dependency.
We accordingly propose CEVAS, a Cloud-Edge collaborative Video Analytics system empowered
by fine-grained Serverless. It builds flexible serverless-based infrastructures to
facilitate fine-grained and
adaptive partitioning of cloud-edge workloads for multiple concurrent query pipelines.
With the optimized design of individual modules and their integration, CEVAS achieves
real-time responses to highly dynamic input workloads. We have developed a prototype
of CEVAS over Amazon Web Services (AWS) and conducted extensive experiments with real-world
video streams and queries. The results show that by judiciously coordinating the fine-grained
serverless resources in the cloud and at the edge, CEVAS reduces the cloud expenditure
of a pure cloud scheme by 86.9% and its data transfer overhead by 74.4%, and improves the
analysis throughput of a pure edge scheme by up to 20.6%. Thanks to the fine-grained video
content-aware forecasting, CEVAS is also more adaptive than the state-of-the-art cloud-edge
collaborative scheme.

DataPlanner: data-budget driven approach to resource-efficient ABR streaming

  • Yanyuan Qin
  • Chinmaey Shende
  • Cheonjin Park
  • Subhabrata Sen
  • Bing Wang

Over-the-top (OTT) video streaming accounts for the majority of traffic on cellular
networks, and also places a heavy demand on users' limited monthly cellular data budgets.
In contrast to much traditional research that focuses on improving the quality,
we explore a different direction: using data budget information to better manage
the data usage of mobile video streaming, while minimizing the impact on users' quality
of experience (QoE). Specifically, we propose a novel framework for quality-aware
Adaptive Bitrate (ABR) streaming involving a per-session data budget constraint. Under
the framework, we develop two planning based strategies, one for the case where fine-grained
perceptual quality information is known to the planning scheme, and another for the
case where such information is not available. Evaluations for a wide range of network
conditions, using different videos covering a variety of content types and encodings,
demonstrate that both these strategies use much less data compared to state-of-the-art
ABR schemes, while still providing comparable QoE. Our proposed approach is designed
to work in conjunction with existing ABR streaming workflows, enabling ease of adoption.
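
The idea of a per-session data budget can be sketched with a simple greedy planner in Python. This is not either of the paper's planning strategies: it ignores perceptual quality and network conditions, and the function name `plan_bitrates`, the round-robin upgrade rule, and the units are assumptions chosen for illustration.

```python
def plan_bitrates(n_segments, segment_seconds, ladder_kbps, budget_kbit):
    """Pick one bitrate per segment so the session stays within a data budget.

    Hypothetical greedy sketch: start every segment at the lowest bitrate,
    then upgrade segments one step at a time, round-robin, while the
    budget allows.
    """
    ladder = sorted(ladder_kbps)
    choice = [0] * n_segments
    used = n_segments * ladder[0] * segment_seconds
    if used > budget_kbit:
        raise ValueError("budget is below the lowest-bitrate session size")
    upgraded = True
    while upgraded:
        upgraded = False
        for i in range(n_segments):
            if choice[i] + 1 < len(ladder):
                step = ladder[choice[i] + 1] - ladder[choice[i]]
                extra = step * segment_seconds
                if used + extra <= budget_kbit:
                    choice[i] += 1
                    used += extra
                    upgraded = True
    return [ladder[c] for c in choice], used
```

For example, four 2-second segments, a ladder of 500, 1000, and 2000 kbps, and a 10000 kbit budget yield one segment at 2000 kbps and three at 1000 kbps, using the budget exactly. A quality-aware planner would instead spend the spare budget on the segments where a higher bitrate improves perceived quality the most.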

AMP: authentication of media via provenance

  • Paul England
  • Henrique S. Malvar
  • Eric Horvitz
  • Jack W. Stokes
  • Cédric Fournet
  • Rebecca Burke-Aguero
  • Amaury Chamayou
  • Sylvan Clebsch
  • Manuel Costa
  • John Deutscher
  • Shabnam Erfani
  • Matt Gaylor
  • Andrew Jenks
  • Kevin Kane
  • Elissa M. Redmiles
  • Alex Shamis
  • Isha Sharma
  • John C. Simmons
  • Sam Wenker
  • Anika Zaman

Advances in graphics and machine learning have led to the general availability of
easy-to-use tools for modifying and synthesizing media. The proliferation of these
tools threatens to cast doubt on the veracity of all media. One approach to thwarting
the flow of fake media is to detect modified or synthesized media through machine
learning methods. While detection may help in the short term, we believe that it is
destined to fail as the quality of fake media generation continues to improve. Soon,
neither humans nor algorithms will be able to reliably distinguish fake versus real
content. Thus, pipelines for assuring the source and integrity of media will be required, and
increasingly relied upon. We present AMP, a system that ensures the authentication
of media via certifying provenance. AMP creates one or more publisher-signed manifests
for a media instance uploaded by a content provider. These manifests are stored in
a database allowing fast lookup from applications such as browsers. For reference,
the manifests are also registered and signed by a permissioned ledger, implemented
using the Confidential Consortium Framework (CCF). CCF employs both software and hardware
techniques to ensure the integrity and transparency of all registered manifests. AMP,
through its use of CCF, enables a consortium of media providers to govern the service
while making all its operations auditable. The authenticity of the media can be communicated
to the user via visual elements in the browser, indicating that an AMP manifest has
been successfully located and verified.
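
The manifest check described above can be illustrated with a small Python sketch that binds a media hash to a publisher name and verifies both. It is a simplification under stated assumptions: an HMAC with a shared key stands in for the publisher's signature (a real deployment would use public-key signatures), the ledger and database are omitted, and the manifest field names are hypothetical.

```python
import hashlib
import hmac
import json

def make_manifest(media_bytes, publisher, key):
    """Build a toy manifest; 'body' and 'tag' are hypothetical field names."""
    body = {
        "publisher": publisher,
        "sha256": hashlib.sha256(media_bytes).hexdigest(),
    }
    payload = json.dumps(body, sort_keys=True).encode()
    # HMAC is a stand-in for a publisher signature in this sketch.
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}

def verify(media_bytes, manifest, key):
    """Check that the manifest is authentic and matches the media bytes."""
    payload = json.dumps(manifest["body"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["tag"]):
        return False  # manifest was altered or made with another key
    actual = hashlib.sha256(media_bytes).hexdigest()
    return actual == manifest["body"]["sha256"]
```

Verification fails if either the media bytes or the manifest change, which is the property a browser indicator would rely on before telling the user that a manifest was located and verified.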

User-assisted video reflection removal

  • Amgad Ahmed
  • Suhong Kim
  • Mohamed Elgharib
  • Mohamed Hefeeda

Reflections in videos are obstructions that often occur when videos are taken behind
reflective surfaces like glass. These reflections reduce the quality of such videos,
lead to information loss and degrade the accuracy of many computer vision algorithms.
A video containing reflections is a combination of background and reflection layers.
Thus, reflection removal is equivalent to decomposing the video into two layers. This,
however, is a challenging and ill-posed problem as there is an infinite number of
valid decompositions. To address this problem, we propose a user-assisted method for
video reflection removal. We rely on both spatial and temporal information and utilize
sparse user hints to help improve separation. The proposed method removes complex
reflections in videos by including the user in the loop. The method is flexible and
can accept various levels of user annotations, within each frame and in the number
of frames being annotated. The user provides some strokes in some of the frames in
the video, and our method propagates these strokes within the frame using a random
walk computation as well as across frames using a point-based motion tracking method.
We implement and evaluate the proposed method through quantitative and qualitative
results on real and synthetic videos. Our experiments show that the proposed method
successfully removes reflection from video sequences, does not introduce visual distortions,
and significantly outperforms the state-of-the-art reflection removal methods in the

LiveROI: region of interest analysis for viewport prediction in live mobile virtual reality

  • Xianglong Feng
  • Weitian Li
  • Sheng Wei

Virtual reality (VR) streaming can provide an immersive video viewing experience to
end users, but at the cost of huge bandwidth consumption. Recent research has adopted selective
streaming to address the bandwidth challenge, which predicts and streams the user's
viewport of interest with high quality and the other portions of the video with low
quality. However, the existing viewport prediction mechanisms mainly target the video-on-demand
(VOD) scenario relying on historical video and user trace data to build the prediction
model. The community still lacks an effective viewport prediction approach to support
live VR streaming, the most engaging and popular VR streaming experience. We develop a
region of interest (ROI)-based viewport prediction approach, namely LiveROI, for live
VR streaming. LiveROI employs an action recognition algorithm to analyze the video
content and uses the analysis results as the basis of viewport prediction. To eliminate
the need for historical video/user data, LiveROI employs adaptive user preference
modeling and word embedding to dynamically select the video viewport at runtime based
on the user head orientation. We evaluate LiveROI with 12 VR videos viewed by 48 users
obtained from a public VR head movement dataset. The results show that LiveROI achieves
high prediction accuracy and significant bandwidth savings with real-time
processing to support live VR streaming.

EScALation: a framework for efficient and scalable spatio-temporal action localization

  • Bo Chen
  • Klara Nahrstedt

Spatio-temporal action localization aims to detect the spatial location and the start/end
time of the action in a video. The state-of-the-art approach uses convolutional neural
networks to extract possible bounding boxes for the action in each frame and then
link bounding boxes into action tubes based on the location and the class-specific
score of each bounding box. Though this approach has been successful at achieving
good localization accuracy, it is computation-intensive. High-end GPUs are usually
required for it to achieve real-time performance. In addition, this approach does
not scale well to a large number of action classes. In this work, we present a framework,
EScALation, for making spatio-temporal action localization efficient and scalable.
Our framework involves two main strategies. One is the frame sampling technique that
utilizes the temporal correlation between frames and selects key frame(s) from a temporally
correlated set of frames to perform bounding box detection. The other is the class
filtering technique that exploits bounding box information to predict the action class
prior to linking bounding boxes. We compare EScALation with the state-of-the-art approach
on UCF101-24 and J-HMDB-21 datasets. One of our experiments shows EScALation is able
to save 72.2% of the time with only 6.1% loss of mAP. In addition, we show that EScALation
scales better to a large number of action classes than the state-of-the-art approach.
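
The linking step mentioned above, which joins per-frame bounding boxes into action tubes using location and class-specific score, can be sketched in Python as follows. The scoring rule (sum of the two class scores plus a weighted overlap) is a common formulation assumed here for illustration; the abstract does not give the exact rule, and the names `iou`, `best_link`, and `overlap_weight` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def best_link(prev_box, prev_score, candidates, overlap_weight=1.0):
    """Return the index of the next-frame (box, score) pair that links best.

    Assumed link score: prev_score + score + overlap_weight * IoU.
    """
    scores = [
        prev_score + score + overlap_weight * iou(prev_box, box)
        for box, score in candidates
    ]
    return max(range(len(candidates)), key=scores.__getitem__)
```

Because this score is computed per class, the linking cost grows with the number of action classes, which is the scaling problem that predicting the class before linking is meant to avoid.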

A distributed, decoupled system for losslessly streaming dynamic light probes to thin

  • Michael Stengel
  • Zander Majercik
  • Benjamin Boudaoud
  • Morgan McGuire

We present a networked, high-performance graphics system that combines dynamic, high-quality,
ray traced global illumination computed on a server with direct illumination and primary
visibility computed on a client. This approach provides many of the image quality
benefits of real-time ray tracing on low-power and legacy hardware, while maintaining
a low latency response and mobile form factor.

As opposed to streaming full frames from rendering servers to end clients, our system
distributes the graphics pipeline over a network by computing diffuse global illumination
on a remote machine. Diffuse global illumination is computed using a recent irradiance
volume representation combined with a new lossless, HEVC-based, hardware-accelerated
encoding, and a perceptually-motivated update scheme.

Our experimental implementation streams thousands of irradiance probes per second
and requires less than 50 Mbps of throughput, reducing the consumed bandwidth by 99.4%
when streaming at 60 Hz compared to traditional lossless texture compression.

The bandwidth reduction achieved with our approach allows higher quality and lower
latency graphics than state-of-the-art remote rendering via video streaming. In addition,
our split-rendering solution decouples remote computation from local rendering and
so does not limit local display update rate or display resolution.

MPEG NBMP testbed for evaluation of real-time distributed media processing workflows at scale

  • Roberto Ramos-Chavez
  • Rufael Mekuria
  • Theo Karagkioules
  • Dirk Griffioen
  • Arjen Wagenaar
  • Mark Ogle

Real-time Distributed Media Processing Workflows (DMPW) are popular for online media
delivery. Combining distributed media sources and processing can reduce storage costs
and increase flexibility. However, high request rates may result in unacceptable latency
or even failures under incorrect configurations. Thus, testing DMPW deployments at scale
is key, particularly for real-time cases. We propose using the new MPEG Network Based Media
Processing (NBMP) standard for this purpose and present a testbed implementation that includes
all the reference components. In addition, the testbed includes a set of configurable
functions for load generation, monitoring, data-collection and visualization. The
testbed is used to test Dynamic Adaptive HTTP streaming functions under different
workloads in a standardized and reproducible manner. A total of 327 tests with different
loads and Real-Time DMPW configurations were completed. The results provide insights
into the performance, reliability and time-consistency of each configuration. Based
on these tests, we selected the preferred cloud instance type, considering hypervisor
options and different function implementation configurations. Further, we analyzed
different processing tasks and options for distributed deployments on edge and centralized
clouds. Last, a classifier was developed to detect whether failures occur under a certain
workload. Results also show that the normalized inter-experiment standard deviation of
the metric means can be an indicator of unstable or incorrect configurations.
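
One plausible reading of the stability indicator named above is a coefficient of variation over per-experiment means, sketched below in Python. The abstract does not define the statistic precisely, so the use of the population standard deviation, the normalization by the overall mean, and the function name are assumptions.

```python
from statistics import mean, pstdev

def normalized_std_of_means(experiments):
    """Spread of per-experiment metric means, relative to their overall mean.

    `experiments` is a list of runs, each a list of samples of one metric
    (for example, request latency). Hypothetical helper for illustration.
    """
    means = [mean(run) for run in experiments]
    overall = mean(means)
    if overall == 0:
        raise ValueError("overall mean is zero; cannot normalize")
    return pstdev(means) / abs(overall)
```

A value near zero means repeated experiments agree with each other. A large value means the same configuration produces very different averages from run to run, which is the kind of instability the abstract associates with unstable or incorrect configurations.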

CrossRoI: cross-camera region of interest optimization for efficient real time video analytics at scale

  • Hongpeng Guo
  • Shuochao Yao
  • Zhe Yang
  • Qian Zhou
  • Klara Nahrstedt

Video cameras are pervasively deployed at city scale for public good or community
safety (e.g., traffic monitoring or suspected person tracking). However, analyzing
large scale video feeds in real time is data intensive and poses severe challenges
to today's network and computation systems. We present CrossRoI, a resource-efficient
system that enables real time video analytics at scale by harnessing the video content
associations and redundancy across a fleet of cameras. CrossRoI exploits the intrinsic
physical correlations of cross-camera viewing fields to drastically reduce the communication
and computation costs. CrossRoI removes the repeated appearances of the same objects
in multiple cameras without harming comprehensive coverage of the scene. CrossRoI
operates in two phases: an offline phase to establish cross-camera correlations,
and an efficient online phase for real time video inference. Experiments on real-world
video feeds show that CrossRoI achieves a 42% to 65% reduction in network overhead and
a 25% to 34% reduction in response delay in real time video analytics applications with
more than 99% query accuracy, when compared to baseline methods. If integrated with
state-of-the-art (SotA) frame filtering systems, the performance gains of CrossRoI reach
50% to 80% (network overhead) and 33% to 61% (end-to-end delay).

Tightrope walking in low-latency live streaming: optimal joint adaptation of video rate and playback speed

  • Liyang Sun
  • Tongyu Zong
  • Siquan Wang
  • Yong Liu
  • Yao Wang

It is highly challenging to simultaneously achieve high rate and low latency in live
video streaming. Chunk-based streaming and playback speed adaptation are two promising
new trends to achieve high user Quality-of-Experience (QoE). To thoroughly understand
their potential, we develop a detailed chunk-level dynamic model that characterizes
how video rate and playback speed jointly control the evolution of a live streaming
session. Leveraging the model, we first study the optimal joint video rate-playback
speed adaptation as a non-linear optimal control problem. We further develop model-free
joint adaptation strategies using deep reinforcement learning. Through extensive experiments,
we demonstrate that our proposed joint adaptation algorithms significantly outperform
rate-only adaptation algorithms and the recently proposed low-latency video streaming
algorithms that separately adapt video rate and playback speed without joint optimization.
Across a wide range of network conditions, the model-based and model-free algorithms can
achieve close-to-optimal trade-offs tailored for users with different QoE preferences.

Foveated streaming of real-time graphics

  • Gazi Karam Illahi
  • Matti Siekkinen
  • Teemu Kämäräinen
  • Antti Ylä-Jääski

Remote rendering systems comprise powerful servers that render graphics on behalf
of low-end client devices and stream the graphics as compressed video, enabling high-end
gaming and Virtual Reality on those devices. One key challenge with such systems is the
amount of bandwidth required for streaming high quality video. Humans have spatially
non-uniform visual acuity: we have sharp central vision, but our ability to discern
details rapidly decreases with angular distance from the point of gaze. This phenomenon,
called foveation, can be exploited to reduce the bandwidth requirement. In this paper, we study
three different methods to produce a foveated video stream of real-time rendered graphics
in a remote rendered system: 1) foveated shading as part of the rendering pipeline,
2) foveation as a post-processing step after rendering and before video encoding, and 3)
foveated video encoding. We report results from a number of experiments with these
methods. They suggest that foveated rendering alone does not help save bandwidth.
Instead, the two other methods decrease the resulting video bitrate significantly
but they also have different quality per bit and latency profiles, which makes them
desirable solutions in slightly different situations.
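
As a loose illustration of foveated video encoding, the Python sketch below maps a block's angular distance from the gaze point to a quantization-parameter offset, so that peripheral blocks are compressed more heavily. None of the constants come from the paper: the 5-degree full-quality region, the linear slope, the offset cap, and the flat pixels-per-degree approximation are all assumptions.

```python
import math

def eccentricity_deg(px, py, gaze_x, gaze_y, pixels_per_degree):
    """Approximate angular distance from the gaze point, in degrees.

    Uses a constant pixels-per-degree factor, which is only a reasonable
    approximation for small angles near the screen centre.
    """
    return math.hypot(px - gaze_x, py - gaze_y) / pixels_per_degree

def qp_offset(ecc_deg, fovea_deg=5.0, slope=0.5, max_offset=15):
    """Extra quantization for a block at the given eccentricity (assumed rule)."""
    if ecc_deg <= fovea_deg:
        return 0  # full quality around the point of gaze
    return min(max_offset, round((ecc_deg - fovea_deg) * slope))
```

An encoder that accepts per-block quality hints could add this offset to its base quantization parameter. How steep the falloff can be before viewers notice depends on gaze-tracking latency and display geometry, which this sketch does not model.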

Foresight: planning for spatial and temporal variations in bandwidth for streaming services on mobile devices

  • Manasvini Sethuraman
  • Anirudh Sarma
  • Ashutosh Dhekne
  • Umakishore Ramachandran

Spatiotemporal variation in cellular bandwidth availability is well-known and could
affect a mobile user's quality of experience (QoE), especially while using
bandwidth-intensive streaming applications such as movies, podcasts, and music videos during
a commute. If such variations are made available to a streaming service in advance, it
could perhaps plan better to avoid sub-optimal performance while the user travels
through regions of low bandwidth availability. The intuition is that such future knowledge
could be used to buffer additional content in regions of higher bandwidth availability
to tide over the deficits in regions of low bandwidth availability. Foresight is a service
designed to provide this future knowledge for client apps running on
a mobile device. It comprises three components: (a) a crowd-sourced bandwidth estimate
reporting facility, (b) an on-cloud bandwidth service that records the spatiotemporal
variations in bandwidth and serves queries for bandwidth availability from mobile
users, and (c) an on-device bandwidth manager that caters to the bandwidth requirements
from client apps by providing them with bandwidth allocation schedules. Foresight
is implemented in the Android framework. As a proof of concept for using this service,
we have modified an open-source video player, ExoPlayer, to use the results of Foresight
in its video buffer management. Our performance evaluation shows Foresight's scalability.
We also showcase the opportunity that Foresight offers to ExoPlayer to enhance video
quality of experience (QoE) despite spatiotemporal bandwidth variations for metrics
such as overall higher bitrate of playback, reduction in number of bitrate switches,
and reduction in the number of stalls during video playback.
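
The buffering intuition behind Foresight can be illustrated with a toy Python simulation: given a per-second bandwidth forecast along a route, a player that is allowed to buffer further ahead rides out a coverage gap that stalls a player with a small buffer. This is not Foresight's scheduler or ExoPlayer's buffer logic; the function name, the one-second time step, and the fixed playback bitrate are assumptions.

```python
def simulate_stalls(bandwidth_kbps, bitrate_kbps, max_buffer_s):
    """Count stalled seconds for a fixed-bitrate stream (hypothetical model).

    bandwidth_kbps holds one forecast value per second of travel.
    """
    buffer_s = 0.0
    stalls = 0
    for bw in bandwidth_kbps:
        # Download for one second, limited by the buffer cap.
        buffer_s = min(max_buffer_s, buffer_s + bw / bitrate_kbps)
        if buffer_s >= 1.0:
            buffer_s -= 1.0  # play one second of content
        else:
            stalls += 1  # not enough content buffered: playback stalls
    return stalls
```

With five seconds at 4000 kbps followed by five seconds of no coverage and a 1000 kbps stream, a 2-second buffer cap stalls for four seconds while a 10-second cap does not stall at all. Knowing the gap in advance is what justifies raising the cap before entering it.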