MMSys '20: Proceedings of the 11th ACM Multimedia Systems Conference

The bits of silence: redundant traffic in VoIP

Mohammad A. Hoque
Petteri Nurmi
Matti Siekkinen
Pan Hui
Sasu Tarkoma

Human conversation is characterized by brief pauses and so-called turn-taking behavior between the speakers.
In the context of VoIP, this means that there are frequent periods where the microphone
captures only background noise - or even silence whenever the microphone is muted.
The bits transmitted from such silence periods introduce overhead in terms of data
usage, energy consumption, and network infrastructure costs. In this paper, we contribute
by shedding light on these costs for VoIP applications. We systematically measure
the performance of six popular mobile VoIP applications with controlled human conversation
and acoustic setup. Our analysis demonstrates that significant savings can indeed
be achieved - with the best performing silence suppression technique being effective
on 75% of silent pauses in the conversation in a quiet place. This results in 2-5
times data savings, and 50-90% lower energy consumption compared to the next best
alternative. Even then, the effectiveness of silence suppression can be sensitive
to the amount of background noise, underlying speech codec, and the device being used.
The codec characteristics and performance do not depend on the network type. However,
silence suppression makes VoIP traffic as network friendly as VoLTE traffic.
Our results provide new insights into VoIP performance and offer a motivation for
further enhancements to a wide variety of voice assisted applications, such as home
assistants and other IoT devices.
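
To make the notion of a "silent" frame concrete, here is a minimal sketch of an energy-based silence detector. It is a generic illustration only: the function name `silent_frames`, the RMS threshold, and the frame length are assumptions, and the measured applications rely on their own codec-specific detection, which the abstract does not describe.

```python
import math

def silent_frames(samples, frame_len, rms_threshold):
    """Flag each full frame as silent (True) when its RMS energy is below a threshold.

    samples: list of floats in [-1.0, 1.0]; frame_len: samples per frame.
    Both the threshold and the framing are hypothetical illustration values.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        flags.append(rms < rms_threshold)
    return flags
```

A sender applying silence suppression would skip, or heavily reduce, transmission for frames flagged as silent. With a low threshold, background noise keeps frames above it, which is consistent with the sensitivity to noise reported above.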

A latency compensation technique based on game characteristics to mitigate the influence of delay on cloud gaming quality of experience

Saeed Shafiee Sabet
Steven Schmidt
Saman Zadtootaghaj
Babak Naderi
Carsten Griwodz
Sebastian Möller

Cloud Gaming (CG) is an immersive multimedia service that promises many benefits.
In CG, the games are rendered in a cloud server, and the resulting scenes are streamed
as a video sequence to the client. Using CG, users are not forced to update their gaming
hardware frequently, and available games can be played on any operating system or
suitable device. However, cloud gaming requires a reliable and low-latency network,
which makes it a very challenging service. Transmission latency strongly affects the
playability of a cloud game and consequently reduces the users' Quality of Experience
(QoE). In this paper, we propose a latency compensation technique using game adaptation
that mitigates the influence of delay on QoE. This technique uses five game characteristics
for the adaptation. These characteristics, in addition to an Aim-assistance technique,
were implemented in four games for evaluation. A subjective study with 194 participants
was conducted using a crowdsourcing approach. The results showed that the majority
of the proposed adaptation techniques lead to significant improvements in the cloud
gaming QoE.

Flocking-based live streaming of 360-degree video

Liyang Sun
Yixiang Mao
Tongyu Zong
Yong Liu
Yao Wang

Streaming of live 360-degree video allows users to follow a live event from any viewpoint
and has already been deployed on some commercial platforms. However, the current
systems can only stream the video at relatively low quality because the entire 360-degree
video is delivered to the users under limited bandwidth. In this paper, we propose
to use the idea of "flocking" to improve the performance of both prediction of field
of view (FoV) and caching on the edge servers for live 360-degree video streaming.
By assigning variable playback latencies to all the users in a streaming session,
a "streaming flock" is formed and led by low latency users in the front of the flock.
We propose a collaborative FoV prediction scheme where the actual FoV information
of users in the front of the flock is utilized to predict the FoV of users behind them. We
further propose a network condition aware flocking strategy to reduce the video freeze
and increase the chance for collaborative FoV prediction on all users. Flocking also
facilitates caching as video tiles downloaded by the front users can be cached by
an edge server to serve the users at the back of the flock, thereby reducing the traffic
in the core network. We propose a latency-FoV based caching strategy and investigate
the potential gain of applying transcoding on the edge server. We conduct experiments
using real-world user FoV traces and WiGig network bandwidth traces to evaluate the
gains of the proposed strategies over benchmarks. Our experimental results demonstrate
that the proposed streaming system can roughly double the effective video rate, which
is the video rate inside a user's actual FoV, compared to the prediction only based
on the user's own past FoV trajectory, while reducing video freeze. Furthermore, edge
caching can reduce the traffic in the core network by about 80%, which can be increased
to 90% with transcoding on the edge server.
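
As a rough illustration of the collaborative prediction idea, the sketch below predicts the FoV center of a user at the back of the flock from the FoV centers already observed for front users on the same video segment. The function name, the (yaw, pitch) representation, and the plain averaging are assumptions made for illustration; the abstract does not specify the authors' actual predictor.

```python
import math

def predict_fov_center(front_fovs):
    """Predict a (yaw, pitch) FoV center in degrees from front users' actual FoVs.

    front_fovs: list of (yaw_deg, pitch_deg) observed for the same video segment.
    Yaw is averaged on the circle so that 350 and 10 degrees average to about 0
    rather than 180. This averaging rule is a hypothetical choice.
    """
    if not front_fovs:
        raise ValueError("need at least one front-user FoV")
    sin_sum = sum(math.sin(math.radians(yaw)) for yaw, _ in front_fovs)
    cos_sum = sum(math.cos(math.radians(yaw)) for yaw, _ in front_fovs)
    yaw = math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0
    pitch = sum(pitch for _, pitch in front_fovs) / len(front_fovs)
    return yaw, pitch
```

The predicted center could then drive which tiles are fetched in high quality for the back user, and the same tiles are natural candidates for edge caching.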

Comparing fixed and variable segment durations for adaptive video streaming: a holistic analysis

Susanna Schwarzmann
Nick Hainke
Thomas Zinner
Christian Sieber
Werner Robitza
Alexander Raake

HTTP Adaptive Streaming (HAS) is the de facto standard for video delivery over the
Internet. It enables dynamic adaptation of video quality by splitting a video into
small segments and providing multiple quality levels per segment. So far, HAS services
typically utilize a fixed segment duration. This reduces the encoding and streaming
variability and thus allows a faster encoding of the video content and a reduced prediction
complexity for adaptive bit rate algorithms. Due to the content-agnostic placement
of I-frames at the beginning of each segment, additional encoding overhead is introduced.
In order to mitigate this overhead, variable segment durations, which take encoder-placed
I-frames into account, have been proposed recently. Hence, a lower number of
I-frames is needed, thus achieving a lower video bitrate without quality degradation.
While several proposals exploiting variable segment durations exist, no comparative
study highlighting the impact of this technique on coding efficiency and adaptive
streaming performance has been conducted yet. This paper conducts such a holistic
comparison within the adaptive video streaming ecosystem. Firstly, it provides a
broad investigation of video encoding efficiency for variable segment durations. Secondly,
a measurement study evaluates the impact of segment duration variability on the performance
of HAS using three adaptation heuristics and the dash.js reference implementation.
Our results show that variable segment durations increased the Quality of Experience
for 54% of the evaluated streaming sessions, while reducing the overall bitrate by
7% on average.
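
The difference between fixed and variable segmentation can be sketched as follows. Instead of forcing an I-frame at every multiple of a fixed duration, segment boundaries are placed on I-frames the encoder already chose. The function name and the `min_duration` rule are assumptions for illustration, and a maximum-duration bound is omitted for brevity.

```python
def variable_segments(iframe_times, video_end, min_duration):
    """Return (start, end) segments whose boundaries fall on encoder-placed I-frames.

    iframe_times: sorted I-frame timestamps in seconds; the first must be 0.0.
    A new segment starts at the first I-frame that lies at least `min_duration`
    seconds after the previous boundary, so no extra I-frames are inserted.
    """
    if not iframe_times or iframe_times[0] != 0.0:
        raise ValueError("video must start with an I-frame at t=0")
    boundaries = [0.0]
    for t in iframe_times[1:]:
        if t - boundaries[-1] >= min_duration:
            boundaries.append(t)
    boundaries.append(video_end)
    return list(zip(boundaries[:-1], boundaries[1:]))
```

For I-frames at 0.0, 1.5, 4.2, 5.0, and 9.1 seconds in a 12-second video with `min_duration=4.0`, this yields segments of 4.2, 4.9, and 2.9 seconds, which is the kind of duration variability an ABR algorithm then has to cope with.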

Enabling adaptive bitrate algorithms in hybrid CDN/P2P networks

Hiba Yousef
Jean Le Feuvre
Paul-Louis Ageneau
Alexandre Storelli

As video traffic becomes the dominant part of the global Internet traffic, keeping
a good quality of experience (QoE) becomes more challenging. To improve QoE, HTTP
adaptive streaming with various adaptive bitrate (ABR) algorithms has been massively
deployed for video delivery. Based on their required input information, these algorithms
can be classified into buffer-based, throughput-based or hybrid buffer-throughput
algorithms. Nowadays, due to their low cost and high scalability, peer-to-peer (P2P)
networks have become an efficient alternative for video delivery over the Internet,
and many attempts at merging HTTP adaptive streaming and P2P networks have surfaced.
However, the impact of merging these two approaches is still not clear enough, and
interestingly, the existing HTTP adaptive streaming algorithms lack testing in a P2P
environment. In this paper, we address and analyze the main problems raised by the
use of the existing HTTP adaptive streaming algorithms in the context of P2P networks.
We propose two methodologies to make these algorithms more efficient in P2P networks
regardless of the ABR algorithm used, one favoring overall QoE and one favoring P2P
efficiency. Additionally, we propose two new metrics to quantify the P2P efficiency
for ABR delivery over P2P.
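
The classification of ABR algorithms by input can be illustrated with two toy rules, one driven only by a throughput estimate and one driven only by buffer occupancy. The function names, the safety factor, and the reservoir and cushion thresholds are hypothetical values chosen for the sketch; they are not taken from the paper.

```python
def throughput_based(bitrates_kbps, est_throughput_kbps, safety=0.9):
    """Pick the highest bitrate not exceeding a safety fraction of the throughput estimate."""
    budget = safety * est_throughput_kbps
    feasible = [b for b in sorted(bitrates_kbps) if b <= budget]
    return feasible[-1] if feasible else min(bitrates_kbps)

def buffer_based(bitrates_kbps, buffer_s, reservoir_s=5.0, cushion_s=20.0):
    """Map buffer occupancy (seconds) linearly onto the bitrate ladder."""
    ladder = sorted(bitrates_kbps)
    if buffer_s <= reservoir_s:
        return ladder[0]
    if buffer_s >= reservoir_s + cushion_s:
        return ladder[-1]
    frac = (buffer_s - reservoir_s) / cushion_s
    return ladder[int(frac * (len(ladder) - 1))]
```

In a hybrid CDN/P2P setting, a segment may arrive partly from peers, so the throughput estimate and the buffer dynamics that feed such rules can differ noticeably from the pure CDN case.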

3CPS: a novel supercompression for the delivery of 3D object textures

Hristina Hristova
Gwendal Simon
Viswanathan Swaminathan
Stefano Petrangeli

The growing popularity of applications based on 3D rendering, such as visual effects,
gaming, augmented and virtual reality, calls for the development of new solutions
for the delivery of 3D objects, in particular textures. The format of texture images,
which capture the characteristics of materials, has to address two constraints. First,
the delivery on the Internet imposes a reduction of the image size. Second, because
of memory limitations, the processing by the rendering engine is done in the GPU by
extracting small areas of the image only. The format of texture images should thus
enable the random-access feature for independent processing of small blocks of the
images, called texels, which negatively affects the texture compression performance
and, therefore, the network delivery.

We propose 3CPS, a novel solution for the compression and delivery of texture images.
In 3CPS, the texture image is compressed three times: first, at the authoring side,
by a traditional texture compression technique; second, still at the authoring side,
by a state-of-the-art image compression technique for better network delivery; third,
at the client side, the received image is re-compressed by a texture image technique
for better GPU processing. Our original idea leverages the fact that the image at
the client side has already been converted into a format that can be easily transformed
by the client for GPU processing. The last compression is thus expected to be much
faster than usual. In this paper, we introduce this concept, propose a fast and efficient
algorithm for the last re-compression, and demonstrate the advantages of our solution
by performing an extensive evaluation on real data sets. In particular, we show that
3CPS competes well with Basis supercompression, which is evaluated for the first time.

Hyperspectral reconstruction from RGB images for vein visualization

Neha Sharma
Mohamed Hefeeda

A hyperspectral camera captures a scene in many frequency bands across the spectrum,
providing rich information and facilitating numerous applications. The potential of
hyperspectral imaging has been established for decades. However, to date hyperspectral
imaging has only seen success in specialized and large-scale industrial and military
applications. This is mainly due to the high cost of hyperspectral cameras (upwards
of $20K) and the complexity of the acquisition system which makes the technology out
of reach for many commercial and end-user applications. In this paper, we propose
a deep learning based approach to convert RGB image sequences taken by regular cameras
to (partial) hyperspectral images. This can enable, for example, low-cost mobile phones
to leverage the characteristics of hyperspectral images in implementing novel applications.
We show the benefits of the conversion model by designing a vein localization and
visualization application that traditionally uses hyperspectral images. Our application
uses only RGB images and produces accurate results. Vein visualization is important
for point-of-care medical applications. We collected hyperspectral data to validate
the proposed conversion model. Experimental results demonstrate that the proposed
method is promising and can bring some of the benefits of expensive hyperspectral
cameras to the low-cost and pervasive RGB cameras, enabling many new applications
and enhancing the performance of others. We also evaluate the vein visualization application
and show its accuracy.

Dense LIDAR point clouds from room-scale scans

Henry Haugsten Hansen
Sayed Muchallil
Carsten Griwodz
Vetle Sillerud
Fredrik Johanssen

LiDARs can capture distances with high accuracy and should be very useful to create
point clouds that provide highly detailed representations of an environment. If these
reconstructions are meant as baseline or ground truth for other algorithms, they must
have a high density and accuracy.

Currently available LiDARs still face some limitations. Either they have a limited
range, or they have a rather limited resolution in one or more dimensions. As a consequence,
all of them have to undergo motion to capture a larger environment. While some systems
follow extremely well-predictable motion paths such as satellite trajectories or robotic
arms, others require more spontaneous and flexible motion. These systems use either
visual simultaneous localization and mapping (vSLAM), GPS or IMU to achieve this,
but they are generally designed in such a way that human intervention is required
during the creation of high-quality point clouds.

In this paper, we make use of a rotating LiDAR with an attached IMU to create dense
point clouds of room-scale environments with the base accuracy of the LiDAR by compensating
for the various inaccuracies that are introduced by the LiDAR's motion. The resulting
dense scans are suitable as ground truths for other techniques because we retain the
error distribution of the LiDAR itself through the densification.

In contrast to other works, we do not aim at a visually pleasing or easily meshable
result and we can therefore avoid potentially inaccurate assumptions about the flatness
of surfaces. We take a two-step approach. First, we densify from a stationary position
changing only the LiDAR's pitch. Second, we add free motion to expose obstructed views.
We show that motion paths determined by repeated Iterative Closest Point (ICP) as
well as image matching on height maps can be used to create feasible priors for densification
using ICP.

QuRate: power-efficient mobile immersive video streaming

Nan Jiang
Yao Liu
Tian Guo
Wenyao Xu
Viswanathan Swaminathan
Lisong Xu
Sheng Wei

Smartphones have recently become a popular platform for deploying computation-intensive
virtual reality (VR) applications, such as immersive video streaming (a.k.a. 360-degree
video streaming). One specific challenge involving the smartphone-based head mounted
display (HMD) is to reduce the potentially huge power consumption caused by the immersive
video. To address this challenge, we first conduct an empirical power measurement
study on a typical smartphone immersive streaming system, which identifies the major
power consumption sources. Then, we develop QuRate, a quality-aware and user-centric
frame rate adaptation mechanism to tackle the power consumption issue in immersive
video streaming. QuRate optimizes the immersive video power consumption by modeling
the correlation between the perceivable video quality and the user behavior. Specifically,
QuRate builds on top of the user's reduced level of concentration on the video frames
during view switching and dynamically adjusts the frame rate without impacting the
perceivable video quality. We evaluate QuRate with a comprehensive set of experiments
involving 5 smartphones, 21 users, and 6 immersive videos using empirical user head
movement traces. Our experimental results demonstrate that QuRate is capable of extending
the smartphone battery life by up to 1.24X while maintaining the perceivable video
quality during immersive video streaming. Also, we conduct an Institutional Review
Board (IRB)-approved subjective user study to further validate the minimal video
quality impact caused by QuRate.
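
The core intuition, lowering the frame rate while the user is switching views, can be sketched as a simple threshold rule on head rotation speed. The function names, the two frame rates, and the speed threshold are assumptions for illustration; QuRate itself relies on a quality model whose details are not given in the abstract.

```python
def head_speed(yaw_prev_deg, yaw_curr_deg, dt_s):
    """Angular speed in degrees per second, using the shortest rotation between two yaw samples."""
    delta = (yaw_curr_deg - yaw_prev_deg + 180.0) % 360.0 - 180.0
    return abs(delta) / dt_s

def select_frame_rate(speed_deg_s, full_fps=60, reduced_fps=30, switch_threshold_deg_s=20.0):
    """Use the reduced frame rate while the head moves fast (view switching), else full rate."""
    return reduced_fps if speed_deg_s >= switch_threshold_deg_s else full_fps
```

For example, a yaw change from 350 to 10 degrees within 0.5 seconds is a 20-degree rotation, i.e. 40 degrees per second, which exceeds the hypothetical threshold and selects the reduced frame rate.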

MANTIS: time-shifted prefetching of YouTube videos to reduce peak-time cellular data usage

Shruti Lall
Uma Parthavi Moravapalle
Raghupathy Sivakumar

The load on wireless cellular networks is not uniformly distributed through the day,
and is significantly higher during peak periods. In this context, we present MANTIS,
a time-shifted prefetching solution that prefetches content during off-peak periods
of network connectivity. We specifically focus on YouTube given that it represents
a significant portion of overall wireless data usage. We make the following contributions:
first, we collect and analyze a real-life dataset of YouTube watch history from 206
users comprising over 1.8 million videos spanning a 1-year period and present
insights on a typical user's viewing behavior; second, we develop an accurate prediction
algorithm using a K-nearest neighbor classifier approach; third, we evaluate the prefetching
algorithm on two different datasets and show that MANTIS is able to reduce the traffic
during peak periods by 34%; and finally, we develop a proof-of-concept prototype for
MANTIS and perform a user study.
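
A K-nearest neighbor classifier of the kind mentioned above can be sketched in a few lines. The feature vectors and the "watch"/"skip" labels in the usage note are hypothetical; the abstract does not state which features MANTIS uses.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to `query`.

    train: list of (feature_vector, label) pairs; distances are Euclidean.
    """
    if k <= 0 or k > len(train):
        raise ValueError("k must be between 1 and the number of training points")
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A prefetcher could, for instance, label each candidate video as "watch" or "skip" from features such as channel affinity and recency, then download the "watch" candidates during off-peak hours.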

Using informed access network selection to improve HTTP adaptive streaming performance

Theresa Enghardt
Thomas Zinner
Anja Feldmann

As end-user devices often have multiple access networks available, choosing the most
suitable network can help to improve application performance and user experience.
However, selecting the best access network for HTTP Adaptive Streaming (HAS) is non-trivial,
e.g., due to complex interactions between network conditions and the Adaptive Bit-Rate
algorithm (ABR), which adapts to network conditions by selecting which video representation
to load. In this paper, we propose to use an application-informed approach, Informed
Access Network Selection (IANS), to select the most suitable access network for each
video segment. We evaluate the impact of IANS on HAS performance in a testbed under
a variety of network conditions and using different workloads. We find that IANS improves
HAS performance substantially, in particular in cases where the available downstream
capacity is low. In the Capacity Decrease scenario, where capacity decreases drastically
during the video load, IANS can improve the estimated Mean Opinion Score (MOS) compared
to using a single network from 2.1 to 2.8. We compare IANS to MPTCP using the Lowest-RTT-first
scheduler, which continues to use a low downstream capacity network, resulting in
lower performance. This confirms that IANS can improve video streaming performance.
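
Per-segment network selection can be sketched as picking the access network with the lowest estimated load time for the next segment. The cost model (one round trip plus transfer time) and the function name are assumptions for illustration, not the policy defined in the paper.

```python
def pick_network(networks, segment_bytes):
    """Return the name of the network with the lowest estimated segment load time.

    networks: dict mapping name -> (rtt_s, downstream_bps).
    Estimated load time = one RTT + segment size in bits / downstream capacity.
    """
    def load_time(stats):
        rtt_s, downstream_bps = stats
        return rtt_s + segment_bytes * 8 / downstream_bps
    return min(networks, key=lambda name: load_time(networks[name]))
```

With a low-RTT but low-capacity network and a higher-RTT but high-capacity network, this rule prefers the high-capacity network for large segments, whereas a lowest-RTT-first choice would stay on the low-capacity one.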

SALI360: design and implementation of saliency based video compression for 360° video streaming

Duin Baek
Hangil Kang
Jihoon Ryoo

In accordance with the recent enhancement of display technology, users demand a higher
quality of streaming service, which escalates the bandwidth requirement. Considering
the recent advent of high FPS (frames per second) 4K and 8K resolution 360° videos,
this bandwidth concern intensifies at an even larger scale in 360° Virtual Reality
(VR) content streaming. However, the currently available bandwidth in most developed
countries can hardly support the bandwidth required to stream content at such a scale.
To address the mismatch between the demand for higher quality of streaming
service and the saturated network improvement, we propose SALI360, which practically
solves the mismatch by utilizing the characteristics of the human visual system (HVS).
By pre-rendering a set of regions - where viewers are expected to fixate - on 360°
VR content in higher quality than the other regions, SALI360 improves viewers' quality
of perception (QoP) while reducing content size with geometry-based 360° content encoding.
In our user experiment, we compare the performance of SALI360 to the existing 360°
content-encoding techniques based on 20 viewers' head movement and eye gaze traces.
To evaluate viewers' QoP, we propose FoL (field of look), which captures viewers' quality
perception area in the visual focal field (8°) rather than a wide (around 90°) field
of view (FoV). Results of our experimental 360° VR video streaming show that SALI360
achieves a 53.3% PSNR improvement in FoL, while gaining a 9.3% PSNR improvement in
FoV. In addition, our subjective study with 93 participants verifies that SALI360 improves
viewers' QoP in the 360° VR streaming service.

Energy considerations for ABR video streaming to smartphones: measurements, models and insights

Chaoqun Yue
Subhabrata Sen
Bing Wang
Yanyuan Qin
Feng Qian

Adaptive Bitrate (ABR) streaming is widely used in commercial video services. In this
paper, we profile energy consumption of ABR streaming on mobile devices. This profiling
is important, since the insights can help develop more energy-efficient ABR streaming
pipelines and techniques. We first develop component power models that provide online
estimation of the power draw for each component involved in ABR streaming. Using these
models, we then quantify the power breakdown in ABR streaming for both regular videos
and the emerging 360° panoramic videos. Our measurements validate the accuracy of
the power models and provide a number of insights. We discuss use cases of the developed
power models, and explore two energy reduction strategies for ABR streaming. Evaluation
demonstrates that these simple strategies can provide up to 30% energy savings, with
little degradation in viewing quality.

Resource optimization through hierarchical SDN-enabled inter data center network for cloud gaming

Maryam Amiri
Hussein Al Osman
Shervin Shirmohammadi

Gaming on demand is an emerging service that combines techniques from Cloud Computing
and Online Gaming. This new paradigm is garnering prominence in the gaming industry
and leading to a new "anywhere and anytime" online gaming model. Despite its advantages,
cloud gaming's Quality of Experience (QoE) is challenged by high and varying end-to-end
communication delay. Since a significant part of the computational processing, including
game rendering and video compression, is performed on the cloud, properly allocating
game requests to the geographically distributed data centers (DCs) can lead to QoE
improvements resulting from lower delays. In this paper, we propose a hierarchical
Software Defined Network (SDN) controller architecture to near-optimally allocate
a gaming session to a DC while minimizing network delay and maximizing bandwidth utilization.
To do so, we formulate an optimization problem, and propose Online Convex Optimization
(OCO) as a practical solution. Simulation results indicate that the proposed method
can provide close-to-optimal solutions, and outperforms classic offline techniques
such as Lagrangian relaxation. In addition, the proposed model improves the bandwidth
utilization of DCs, and reduces the end-to-end delay and delay variation experienced by gamers. As
a byproduct, our proposed method also achieves better fairness among multiple competing
players in comparison with existing methods.

Application of machine learning techniques for real-time sign language detection using wearable sensors

Nazmus Saquib
Ashikur Rahman

Sign language is a method of communication primarily used by hearing-impaired
and mute persons. In this method, letters and words are expressed by hand gestures.
In fingerspelling, meaningful words are constructed by signaling multiple letters
in a sequence. In this paper, a system has been developed to detect fingerspelling
in American Sign Language (ASL) and Bengali Sign Language (BdSL) using (data) gloves
containing some suitably positioned sensors. The methodologies employed can be used
even in resource-constrained
environments. The system is capable of accurately detecting both static and dynamic
symbols in the alphabets. The system shows a promising accuracy of (up to) 96%. Furthermore,
this work presents a novel approach to perform a continuous assessment of symbols
from a stream of run-time data.

UbiPoint: towards non-intrusive mid-air interaction for hardware constrained smart glasses

Lik Hang Lee
Tristan Braud
Farshid Hassani Bijarbooneh
Pan Hui

Throughout the past decade, numerous interaction techniques have been designed for
mobile and wearable devices. Among these devices, smartglasses mostly rely on hardware
interfaces such as touchpads and buttons, which are often cumbersome and counterintuitive
to use. Furthermore, smartglasses feature cheap and low-power hardware preventing
the use of advanced pointing techniques. To overcome these issues, we introduce UbiPoint,
a freehand mid-air interaction technique. UbiPoint uses the monocular camera embedded
in smartglasses to detect the user's hand without relying on gloves, markers, or sensors,
enabling intuitive and non-intrusive interaction. We introduce a computationally fast
and light-weight algorithm for fingertip detection, which is especially suited for
the limited hardware specifications and the short battery life of smartglasses. UbiPoint
processes pictures at a rate of 20 frames per second with high detection accuracy
- no more than 6 pixels deviation. Our evaluation shows that UbiPoint, as a mid-air
non-intrusive interface, delivers a better experience for user interaction with
smartglasses, with users completing typical tasks 1.82 times faster than when using
the original hardware.

Software-based versatile video coding decoder parallelization

Srinivas Gudumasu
Saurav Bandyopadhyay
Yong He

The Versatile Video Coding (VVC) standard is currently being prepared as the latest video
coding standard of the ITU-T and ISO/IEC. The primary goal of VVC, expected to
be finalized in 2020, is to further improve compression performance compared to its
predecessor HEVC. The frame-level, slice-level, or wavefront parallel processing (WPP)
existing in VTM (VVC Test Model) does not fully utilize the CPU capabilities available
in today's multicore systems. Moreover, the VTM decoder processes the decoding
tasks sequentially. This design is not parallelization friendly. This paper proposes re-designed
decoding tasks that parallelize the decoder using (1) load-balanced task parallelization
and (2) CTU (Coding Tree Unit) based data parallelization. The design overcomes the
limitations of the existing parallelization techniques by fully utilizing the available
CPU computation resource without compromising on the coding efficiency and the memory
bandwidth. The parallelization of CABAC and the slice decoding tasks is based on a
load sharing scheme, while parallelization of each sub-module of the slice decoding
task uses CTU-level data parallelization. The parallelization scheme may either remain
restricted within an individual decoding task or utilize between-task parallelization.
Such parallelization techniques achieve real-time VVC decoding on multi-core CPUs,
for bitstreams generated with VTM5.0 in the Random-Access configuration. An overall
average decoding time reduction of 88.97% (w.r.t. VTM5.0 decoder) is achieved for
4K sequences on a 10-core processor.

Quality estimation models for gaming video streaming services using perceptual video quality dimensions

Saman Zadtootaghaj
Steven Schmidt
Saeed Shafiee Sabet
Sebastian Möller
Carsten Griwodz

The gaming industry has been one of the largest digital markets for decades and is steadily
developing, as evidenced by newly emerging gaming services such as gaming video streaming,
online gaming, and cloud gaming. While the market is rapidly growing, the quality
of these services depends strongly on network characteristics as well as resource
management. With the advancement of encoding technologies such as hardware accelerated
engines, fast encoding is possible for delay sensitive applications such as cloud
gaming. Therefore, existing video quality models do not offer good performance
for cloud gaming applications. Thus, in this paper, we provide a gaming video quality
dataset that considers hardware accelerated engines for video compression using the
H.264 standard. In addition, we investigate the performance of signal-based and parametric
video quality models on the new gaming video dataset. Finally, we build two novel
parametric-based models, a planning and a monitoring model, for gaming quality estimation.
Both models are based on perceptual video quality dimensions and can be used to optimize
the resource allocation of gaming video streaming services.

An open software for bitstream-based quality prediction in adaptive video streaming

Huyen T. T. Tran
Duc Nguyen
Truong Cong Thang

HTTP Adaptive Streaming (HAS) has become a popular solution for multimedia delivery
nowadays. However, because of throughput fluctuations, video quality may vary
dramatically. Also, stalling events may occur during a streaming session, negatively
affecting user experience. Therefore, a main challenge in HAS is how to evaluate
the overall quality of a session taking into account the impacts of quality variations
and stalling events. In this paper, we present open software, called BiQPS, that uses
a Long Short-Term Memory (LSTM) network to predict the overall quality of
HAS sessions. The prediction is based on bitstream-level parameters, so it can be
directly applied in practice. Experimental results show that BiQPS outperforms
four existing models. Our software has been made available to the public
at https://github.com/TranHuyen1191/BiQPS.

PMData: a sports logging dataset

Vajira Thambawita
Steven Alexander Hicks
Hanna Borgli
Håkon Kvale Stensland
Debesh Jha
Martin Kristoffer Svensen
Svein-Arne Pettersen
Dag Johansen
Håvard Dagenborg Johansen
Susann Dahl Pettersen
Simon Nordvang
Sigurd Pedersen
Anders Gjerdrum
Tor-Morten Grønli
Per Morten Fredriksen
Ragnhild Eg
Kjeld Hansen
Siri Fagernes
Christine Claudi
Andreas Biørn-Hansen
Duc Tien Dang Nguyen
Tomas Kupka
Hugo Lewi Hammer
Ramesh Jain
Michael Alexander Riegler
Pål Halvorsen

In this paper, we present PMData: a dataset that combines traditional lifelogging
data with sports-activity data. Our dataset enables the development of novel data
analysis and machine-learning applications where, for instance, additional sports
data is used to predict and analyze everyday developments, like a person's weight
and sleep patterns; and applications where traditional lifelog data is used in a sports
context to predict athletes' performance. PMData combines input from Fitbit Versa
2 smartwatch wristbands, the PMSys sports logging smartphone application, and Google
Forms. Logging data has been collected from 16 persons for five months. Our initial
experiments show that novel analyses are possible, but there is still room for improvement.

Kvazaar 2.0: fast and efficient open-source HEVC inter encoder

Ari Lemmetti
Marko Viitanen
Alexandre Mercat
Jarno Vanne

High Efficiency Video Coding (HEVC) is the key to economical video transmission and
storage in current multimedia applications, but tackling its inherent computational
complexity requires powerful video codec implementations. This paper presents the Kvazaar
2.0 HEVC encoder, the new release of our academic open-source software
(github.com/ultravideo/kvazaar). Kvazaar 2.0 introduces novel inter coding functionality
that is built on an advanced rate-distortion optimization (RDO) scheme and sped up
with several early termination mechanisms, SIMD-optimized coding tools, and
parallelization strategies. Our experimental
results show that the proposed coding scheme makes Kvazaar 125 times as fast as the
HEVC reference software HM on the Intel Xeon E5-2699 v4 22-core processor at the additional
coding cost of only 2.4% on average. In constant quantization parameter (QP) coding,
Kvazaar is also 3 times as fast as the respective preset of the well-known practical
x265 HEVC encoder and is still able to attain 10.7% lower average bit rate than x265
for the same objective visual quality. These results indicate that Kvazaar has become
one of the leading open-source HEVC encoders in practical high-efficiency video coding.

A dataset for exploring gaze behaviors in text summarization

Kun Yi
Yu Guo
Weifeng Jiang
Zhi Wang
Lifeng Sun

Automatic text summarization has been a hot research topic for years. Though most
of the existing studies only use the content itself to generate the summaries, researchers
believe that an individual's reading behaviors have much to do with the summaries s/he generates, usually regarded as the ground
truth. However, such research is limited by the lack of a dataset that provides the
connection between people's reading behaviors and the summaries provided by them.
This paper fills in this gap by providing a dataset covering 50 individuals' gaze
behaviors collected by a highly accurate eye tracking device (that generates 100 gaze
points per second) when they are reading 100 articles (from 10 popular categories)
and composing the corresponding summaries for each article. Collected in a controlled
environment, our dataset, with 157 million gaze points in total, provides not only
the basic gaze behaviors when different people read an article and compose its corresponding
summary, but also the connections between different behavior patterns and the summaries
they will provide. We believe such a dataset will be valuable for a wide range of
studies, and we also provide sample use cases of the dataset.

GPAC filters

Jean Le Feuvre

Modern multimedia frameworks mix a variety of functionalities, such as network inputs
and outputs, multiplexing stacks, compression, uncompressed domain effects and scripting,
and require realtime processing for live services. They usually end up becoming very
difficult for end users and/or third-party developers to grasp, with complex testing
and maintenance. The GPAC open-source media framework is no exception here. After
15 years of development and experience in interactive media content, the possibilities
offered by the framework were heavily restrained by a fixed media pipeline approach,
despite the large number of tools available in its code base. In this paper, we discuss
the major re-architecture undergone by GPAC to offer developers and end users a completely
configurable media pipeline in a simple way, review the core concepts of this new
design, their reasoning and the new features they unlock. We show how various complex
use cases can now simply be achieved and how the re-architecture improved GPAC stability,
making it a first-class candidate for research, commercial and educational projects
involving multimedia processing.

A scalable load generation framework for evaluation of video streaming workflows in
the cloud

Roberto Ramos-Chavez
Theo Karagkioules
Rufael Mekuria

HTTP Adaptive Streaming (HAS) is increasingly deployed at large, gradually replacing
traditional broadcast. However, testing large-scale deployments remains challenging,
costly and error-prone. In particular, testing with realistic streaming loads from massive
numbers of users is challenging and costly. To improve this, we introduce an open-source
load testing tool that can be deployed in the cloud or on-premise in a distributed
manner, for load generation.

Our presented tool is an extension of an existing open-source web-application load-testing
tool. In particular, we have added functionality that includes streaming load generation
for a multitude of protocols (i.e. Dynamic Adaptive Streaming over HTTP (DASH) and
HTTP-Live-Streaming (HLS)) and use-case implementations (e.g. live streaming, Video
on Demand (VoD), bit-rate switching). The extension facilitates testing streaming
back-ends at scale in a resource-efficient manner. We illustrate our tool's capabilities
via a series of use-cases, designed to test, among others, how streaming deployments
perform under different load scenarios, i.e. steep or gradual user ramp-up and stability
testing over long periods.

Open-source software tools for measuring resources consumption and DASH metrics

Mario Montagud
Juan Antonio De Rus
Rafael Fayos-Jordan
Miguel Garcia-Pineda
Jaume Segura-Garcia

When designing and deploying multimedia systems, it is essential to accurately know
about the necessary requirements and the Quality of Service (QoS) offered to the customers.
This paper presents two open-source software tools that contribute to these key needs.
The first tool is able to measure and register resource consumption metrics for any
Windows program (i.e. process id), like the CPU, GPU and RAM usage. Unlike the Task
Manager, which requires manual visual inspection for just a subset of these metrics,
the developed tool runs on top of PowerShell to periodically measure these metrics,
calculate statistics, and register them in log files. The second tool is able to measure
QoS metrics from DASH streaming sessions by running on top of TShark, if a non-secure
HTTP connection is used. For each DASH chunk, the tool registers: the round-trip time
from request to download, the number of TCP segments and bytes, the effective bandwidth,
the selected DASH representation, and the associated parameters in the MPD (e.g.,
resolution, bitrate). It also registers the MPD and the total amount of downloaded
frames and bytes. The advantage of this second tool is that these metrics can be registered
regardless of the player used, even from a device connected to the same network as
the DASH player.
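
As a rough illustration of the per-chunk metrics listed above, the following sketch derives an effective bandwidth from the bytes downloaded and the request-to-download time. The `ChunkRecord` fields and the bits-over-time formula are assumptions made for illustration; the abstract does not specify how the tool computes this value.

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """Hypothetical per-chunk record; the field names are illustrative only."""
    request_time_s: float   # time the HTTP request was sent
    download_end_s: float   # time the last byte of the chunk arrived
    size_bytes: int         # bytes downloaded for this chunk


def effective_bandwidth_bps(chunk: ChunkRecord) -> float:
    """Assumed definition: bits downloaded divided by request-to-download time."""
    duration = chunk.download_end_s - chunk.request_time_s
    if duration <= 0:
        raise ValueError("download must end after the request was sent")
    return chunk.size_bytes * 8 / duration


chunks = [ChunkRecord(0.0, 0.5, 250_000), ChunkRecord(2.0, 3.0, 250_000)]
rates = [effective_bandwidth_bps(c) for c in chunks]
print(rates)
```

The same chunk size yields half the effective bandwidth when the download takes twice as long, which is the kind of variation such a log would expose.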

LRRo: a lip reading data set for the under-resourced Romanian language

Andrei Cosmin Jitaru
Şeila Abdulamit
Bogdan Ionescu

Automatic lip reading is a challenging and important research topic as it allows
transcribing visual-only recordings of a speaker into editable text. There are many
useful applications of such technology, ranging from aiding hearing-impaired
people to improving general automatic speech recognition. In this paper, we introduce
and publicly release lip reading resources for the Romanian language. Two distinct collections
are proposed: (i) wild LRRo data is designed for an Internet in-the-wild, ad-hoc scenario,
coming with more than 35 different speakers, 1.1k words, a vocabulary of 21 words,
and more than 20 hours; (ii) lab LRRo data, addresses a lab controlled scenario for
more accurate data, coming with 19 different speakers, 6.4k words, a vocabulary of
48 words, and more than 5 hours. This is the first resource available for Romanian
lip reading and would serve as a pioneering foundation for this under-resourced language.
Nevertheless, given the fact that word-level models are not strongly language dependent,
these resources will also contribute to the general lip-reading task via transfer
learning. To provide a validation and reference for future developments, we propose
two strong baselines via VGG-M and Inception-V4 state-of-the-art deep network architectures.

Tools for live CMAF ingest

Rufael Mekuria
Dirk Griffioen
Arjen Wagenaar

An open source implementation of the CMAF live media ingest specification developed
by the DASH Industry Forum is presented. CMAF live ingest provides a protocol for
live uplink streaming of content using the Common Media Application Track Format (CMAF)
and HTTP POST. An example source, an example receiver and additional tooling for CMAF
metadata conversion are included. The tools can be used to perform live ingest of
CMAF track files to selected destinations in real-time. The reference receiver can
store and process CMAF tracks into a streaming media presentation. The additional
tools for conversion enable creation of timed metadata tracks. The distribution also
includes sample CMAF track files with aligned audio, video, timed text and metadata
content. The tools can be used for emulating live sources and receivers when investigating
workflows based on the CMAF live media ingest protocol. We demonstrate usage of the
tools in relevant streaming use cases showing performance benefits compared to ingest
using a reverse proxy cache.

A unified evaluation framework for head motion prediction methods in 360° videos

Miguel Fabián Romero Rondón
Lucile Sassatelli
Ramón Aparicio-Pardo
Frédéric Precioso

The streaming transmission of 360° videos is a major challenge for the development
of Virtual Reality, and requires a reliable head motion predictor to identify which
region of the sphere to send in high quality and save data rate. Different head motion
predictors have been proposed recently. Some of these works have similar evaluation
metrics or even share the same dataset; however, none of them compare with each other.
In this article we introduce open software that enables the evaluation of heterogeneous
head motion prediction methods on various common grounds. The goal is to ease the
development of new head/eye motion prediction methods. We first propose an algorithm
to create a uniform data structure from each of the datasets. We also provide the
description of the algorithms used to compute the saliency maps either estimated from
the raw video content or from the users' statistics. We exemplify how to run existing
approaches on customizable settings, and finally present the targeted usage of our
open framework: how to train and evaluate a new prediction method, and compare it
with existing approaches and baselines in common settings. The entire material (code,
datasets, neural network weights and documentation) is publicly available.

Quality of experience measurements of multipath TCP applications on iOS mobile devices

Katharina Keller
Patrick Felka
Jan Fornoff
Oliver Hinz
Amr Rizk

Multipath TCP (MPTCP) promises improvements in Quality of Service through connection
bundling. This leads to the belief that it will inevitably improve the Quality of
Experience (QoE), especially, for mobile applications running on top. The networking
and transport layer improvements stem from bundling multiple paths, e.g., WiFi and
LTE, as well as increasing the connection reliability through redundancy. For example,
a smartphone running an application over WiFi may switch to the cellular network without
service interruption upon user movement that gets the device out of the WiFi range,
thus avoiding outage events. However, the impact of MPTCP on QoE for different applications
has not yet been fully understood.

In this work, we present a dataset corresponding to a user study including some preliminary
results on the impact of MPTCP on QoE for mobile applications, specifically, (i) interactive mobile games, (ii) HTTP adaptive video streaming, and (iii) infinite scroll web browsing known from social networking applications. To this end,
we build a specifically designed testbed and develop a custom iOS application for
evaluation purposes. Despite the fact that the participating users show willingness
to tolerate additional cellular plan costs for better QoE, our preliminary results
show that MPTCP, utilizing WiFi and LTE, is not able to outperform single path TCP
with respect to QoE for most of the investigated use cases.

Low latency streaming and multi DRM with dash.js

Daniel Silhavy
Stefan Pham
Martin Lasak
Anita Chen
Stefan Arbanowski

Video streaming applications account for 60% of today's global internet traffic. The
trend to consume videos over the internet has led to a high demand for sophisticated
and robust video players. dash.js is an open source DASH player of the DASH-Industry-Forum
written in JavaScript utilizing the native browser APIs Media Source Extensions (MSE)
and Encrypted Media Extensions (EME). This paper gives a general overview of the player
and presents two specific features namely low-latency streaming and multi DRM playback.
For that purpose, we illustrate how CMAF chunks in combination with the corresponding
dash.js APIs and additional manifest parameters enable low latency streaming in the
browser. For DRM support we focus on the interaction between dash.js, the EME and
the underlying Content Decryption Module (CDM) of the browser.

UVG dataset: 50/120fps 4K sequences for video codec analysis and development

Alexandre Mercat
Marko Viitanen
Jarno Vanne

This paper provides an overview of our open Ultra Video Group (UVG) dataset that is
composed of 16 versatile 4K (3840×2160) test video sequences. These natural sequences
were captured either at 50 or 120 frames per second (fps) and stored online in raw
8-bit and 10-bit 4:2:0 YUV formats. The dataset is published on our website (ultravideo.cs.tut.fi)
under a non-commercial Creative Commons BY-NC license. In this paper, all UVG sequences
are described in detail and characterized by their spatial and temporal perceptual
information, rate-distortion behavior, and coding complexity with the latest HEVC/H.265
and VVC/H.266 reference video codecs. The proposed dataset is the first to provide
complementary 4K sequences up to 120 fps and is therefore particularly valuable for
cutting-edge multimedia applications. Our evaluations also show that it comprehensively
complements the existing 4K test set in VVC standardization, so we recommend including
it in subjective and objective quality assessments of next-generation VVC codecs.

Beyond throughput, the next generation: a 5G dataset with channel and context metrics

Darijo Raca
Dylan Leahy
Cormac J. Sreenan
Jason J. Quinlan

In this paper, we present a 5G trace dataset collected from a major Irish mobile operator.
The dataset is generated from two mobility patterns (static and car), and across two
application patterns (video streaming and file download). The dataset is composed
of client-side cellular key performance indicators (KPIs) comprised of channel-related
metrics, context-related metrics, cell-related metrics and throughput information.
These metrics are generated from a well-known non-rooted Android network monitoring
application, G-NetTrack Pro. To the best of our knowledge, this is the first publicly
available dataset that contains throughput, channel and context information for 5G
networks. To supplement our real-time 5G production network dataset, we also provide
a 5G large scale multi-cell ns-3 simulation framework. The availability of the 5G/mmwave
module for the ns-3 mmwave network simulator provides an opportunity to improve our
understanding of the dynamic reasoning for adaptive clients in 5G multi-cell wireless
scenarios. The purpose of our framework is to provide additional information (such
as competing metrics for users connected to the same cell), thus providing otherwise
unavailable information about the base station (eNodeB or eNB) environment and scheduling
principle, to the end user. Our framework permits other researchers to investigate this
interaction through the generation of their own synthetic datasets.

Toadstool: a dataset for training emotional intelligent machines playing Super Mario Bros

Henrik Svoren
Vajira Thambawita
Pål Halvorsen
Petter Jakobsen
Enrique Garcia-Ceja
Farzan Majeed Noori
Hugo L. Hammer
Mathias Lux
Michael Alexander Riegler
Steven Alexander Hicks

Games are often defined as engines of experience, and they rely heavily on the
emotions they arouse in players. In this paper, we present a dataset called Toadstool as well as a reproducible methodology to extend the dataset. The dataset consists
of video, sensor, and demographic data collected from ten participants playing Super Mario Bros, an iconic and famous video game. The sensor data is collected through an Empatica
E4 wristband, which provides high-quality measurements and is graded as a medical
device. In addition to the dataset and the methodology for data collection, we present
a set of baseline experiments which show that we can use video game frames together
with the facial expressions to predict the blood volume pulse of the person playing
Super Mario Bros. With the dataset and the collection methodology we aim to contribute
to research on emotionally aware machine learning algorithms, focusing on reinforcement
learning and multimodal data fusion. We believe that the presented dataset can be
of interest to a wide range of researchers exploring exciting new interdisciplinary
questions.

Online learning for low-latency adaptive streaming

Theo Karagkioules
Rufael Mekuria
Dirk Griffioen
Arjen Wagenaar

Achieving low latency is paramount for live streaming scenarios, which are nowadays
becoming increasingly popular. In this paper, we propose a novel algorithm for bitrate
adaptation in HTTP Adaptive Streaming (HAS), based on Online Convex Optimization (OCO).
The proposed algorithm, named Learn2Adapt-LowLatency (L2A-LL), is shown to provide a robust adaptation strategy which, unlike most of the state-of-the-art techniques, does not
require parameter tuning, channel model assumptions, throughput estimation or application-specific
adjustments. These properties make it very suitable for users who typically experience
fast variations in channel characteristics. The proposed algorithm has been implemented
in DASH-IF's reference video player (dash.js) and has been made publicly available
for research purposes at [22]. Real experiments show that L2A-LL reduces latency significantly, while providing a high average streaming bit-rate,
without impairing the overall Quality of Experience (QoE); a result that is independent
of the channel and application scenarios. The presented optimization framework is
robust due to its design principle and its ability to learn, and it allows for modular QoE
prioritization, while facilitating easy adjustments to consider applications beyond
live streaming and/or multiple user classes.

When they go high, we go low: low-latency live streaming in dash.js with LoL

May Lim
Mehmet N. Akcay
Abdelhak Bentaleb
Ali C. Begen
Roger Zimmermann

Live streaming remains a challenge in the adaptive streaming space due to the stringent
requirements for not just quality and rebuffering, but also latency. Many solutions
have been proposed to tackle streaming in general, but only a few have looked into better
catering to the more challenging low-latency live streaming scenarios. In this paper,
we re-visit and extend several important components (collectively called Low-on-Latency,
LoL) in adaptive streaming systems to enhance the low-latency performance. LoL includes
bitrate adaptation (both heuristic and learning-based), playback control and throughput
measurement modules.

Stallion: video adaptation algorithm for low-latency video streaming

Craig Gutterman
Brayn Fridman
Trey Gilliland
Yusheng Hu
Gil Zussman

As video traffic continues to dominate the Internet, interest in near-second low-latency
streaming has increased. Existing low-latency streaming platforms rely on using tens
of seconds of video in the buffer to offer a seamless experience. Striving for near-second
latency requires the receiver to make quick decisions regarding the download bitrate
and the playback speed. To cope with the challenges, we design a new adaptive bitrate
(ABR) scheme, Stallion, for STAndard Low-LAtency vIdeo cONtrol. Stallion uses a sliding window to measure the mean and standard deviation of both the bandwidth
and latency. We evaluate Stallion and compare it to the standard DASH DYNAMIC algorithm
over a variety of networking conditions. Stallion shows a 1.8x increase in bitrate and a 4.3x reduction in the number of stalls.
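
The sliding-window statistics mentioned above can be sketched as follows. This is a minimal illustration, not Stallion's actual code: the `SlidingWindow` class, the "mean minus k standard deviations" budget, and the bitrate ladder are all assumptions introduced here, since the abstract only states that the mean and standard deviation of bandwidth and latency are measured over a sliding window.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingWindow:
    """Keeps the most recent `size` samples and reports their mean and std."""

    def __init__(self, size: int):
        self.samples = deque(maxlen=size)  # old samples fall out automatically

    def add(self, value: float) -> None:
        self.samples.append(value)

    def mean(self) -> float:
        return mean(self.samples)

    def std(self) -> float:
        return pstdev(self.samples)  # population standard deviation


def conservative_estimate(window: SlidingWindow, k: float = 1.0) -> float:
    """Assumed rule: discount the mean by k standard deviations."""
    return window.mean() - k * window.std()


def pick_bitrate(ladder_kbps, budget_kbps):
    """Highest ladder entry within the budget, else the lowest entry."""
    candidates = [b for b in sorted(ladder_kbps) if b <= budget_kbps]
    return candidates[-1] if candidates else min(ladder_kbps)


bandwidth = SlidingWindow(size=3)
for sample_kbps in [4000.0, 6000.0, 8000.0, 10000.0]:
    bandwidth.add(sample_kbps)  # the first sample is evicted by the fourth

budget = conservative_estimate(bandwidth, k=1.0)
choice = pick_bitrate([1000, 3000, 6000, 9000], budget)
print(round(budget), choice)
```

A second `SlidingWindow` instance could track latency samples in the same way; how the two sets of statistics are combined into a decision is not described in the abstract and is left out here.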

Immersive media experience with MPEG OMAF multi-viewpoints and overlays

Kashyap Kammachi Sreedhar
Igor D. D. Curcio
Ari Hourunranta
Mikael Lepistö

The second edition of the Omnidirectional MediA Format (OMAF) standard developed by
the Moving Picture Experts Group (MPEG) defines two major features: overlays and multi-viewpoints.
Overlays help in enhancing the immersive experience by providing additional information
about the omnidirectional background video content. The multi-viewpoint feature enables
the content to be captured/experienced from multiple spatial locations. These two
powerful features, along with interactivity, provide the content provider with new
possibilities of storytelling (for example, non-linear) using immersive media.

In this demo, we show multi-viewpoints and overlays with an adaptive bit rate viewport-dependent
streaming framework. The framework uses tiles for multi-viewpoints which, along with
overlays, are encoded at multiple qualities using the HEVC Main 10 profile. The encoded
tiles of multi-viewpoints and overlay videos are encapsulated in ISO Base Media File
Format (ISOBMFF) and fragmented as Dynamic Adaptive Streaming over HTTP (MPEG-DASH)
segments. The DASH segments are then fetched by the OMAF player based on the user's
viewing conditions and rendered on the user device. Additionally, the framework allows
for user interaction, such as switching between viewpoints and enabling/disabling
of the overlays.

VVC bitstream extraction and merging operations for multi-stream media applications

Emmanuel Thomas
Alexandre Gabriel
Karim El Assal

In traditional video decoding applications, the number of elementary streams that
a hardware decoding platform of an end device can decode is determined at runtime
by the. Upon request by the application, the decoding platform verifies whether a
new decoding instance with an associated requirement in terms of data rate can fit
under the current workload. Conversely, if a device can decode one 4K elementary stream
in hardware, it may not be able to simultaneously decode four HD elementary streams
that would each correspond to requirements in terms of data rate of 1/4 of the 4K
elementary stream. Current video decoding platforms are thus designed with the assumption
that each elementary stream requires the instantiation of a dedicated video decoder
instance. At the same time, it has been increasingly common in new media applications
such as immersive media applications to simultaneously consume several elementary
streams in a synchronised fashion. The demo presents a new paradigm for media applications
for which elementary streams may be consumed in such synchronised manner where the
same decoder instance can be used. The demonstrator leverages new features of the
Versatile Video Coding (VVC) standard and interfaces being defined in the ongoing
standardisation of MPEG-I part 13: Video Decoding Interface for Immersive Media. Stitching
and cropping videos in the compressed domain can be achieved by an application via
such defined interfaces. Without those interfaces, the same tasks are possible with
the High Efficiency Video Coding (HEVC) standard to some extent but are tedious. In
this demonstrator, we thus show how the new VVC codec can enable the decoupling of
the number of elementary streams consumed by the application and the number of running
video decoder instances. In addition, memory usage and CPU performance are also collected
and compared with a traditional multiple decoding instance approach.

A pipeline for multiparty volumetric video conferencing: transmission of point clouds over low latency DASH

Jack Jansen
Shishir Subramanyam
Romain Bouqueau
Gianluca Cernigliaro
Marc Martos Cabré
Fernando Pérez
Pablo Cesar

The advent of affordable 3D capture and display hardware is making volumetric videoconferencing
feasible. This technology increases the immersion of the participants, breaking the
flat restriction of 2D screens, by allowing them to collaborate and interact in shared
virtual reality spaces. In this paper we introduce the design and development of an
architecture intended for volumetric videoconferencing that provides a highly realistic
3D representation of the participants, based on pointclouds. A pointcloud representation
is suitable for real-time applications like video conferencing, due to its low-complexity
and because it does not need a time consuming reconstruction process. As transport
protocol we selected low latency DASH, due to its popularity and client-based adaptation
mechanisms for tiling. This paper presents the architectural design, details the implementation,
and provides some referential results. The demo will showcase the system in action,
enabling volumetric videoconferencing using pointclouds.

Content adaptive live encoding with open source codecs

Pradeep Ramachandran
Shushuang Yang
Praveen Tiwari
Gopi Satykrishna Akisetty

The landscape of video codecs used by broadcasters for their streaming solutions is
constantly evolving. This evolution is resulting in increasingly complex codecs forcing
broadcasters to consider solutions based on software codecs, as opposed to those based
on hardware codecs, owing to the increased flexibility. Additionally, open-source
software encoders for the popular AVC (x264), and emerging HEVC (x265) standards are
now sufficiently robust for broadcasters to seriously consider using them for deployment.
However, both these encoders, in their stock implementation, use a single set of encoder
settings for a given stream without adapting the encoder dynamically to the incoming
content. Without this ability, the encoder would have to be configured with conservative
settings to sustain the target frame rate, resulting in inefficient encodes that either
consume more bits than intended for the given quality, or achieve lower quality than
possible for the given bitrate.

This paper, and the associated demo, present an architecture that uses a PID (Proportional
Integral Derivative) module that can be leveraged for content adaptive live encoding
and discuss its implementation in the popular open-source codecs x264 and x265. The
PID controller monitors the dynamic framerate achieved by the encoder and reconfigures
various parameters of the encoder to ensure that the frame rate is maintained at the
expected level while maximizing quality. Our results show that with this implementation,
we can provide a broadcast-ready solution with open source AVC and HEVC codecs.
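
The control loop described above can be sketched as a textbook discrete PID controller driving an encoder setting. This is an illustration under stated assumptions, not the authors' implementation: the gains, the deadband, and the `adjust_preset` mapping from control output to a preset are hypothetical. Only the preset names themselves are taken from x264/x265.

```python
class PID:
    """Textbook discrete PID controller (illustrative gains, not from the paper)."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error: float, dt: float) -> float:
        self.integral += error * dt
        # No derivative term on the first sample, since there is no history yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# x264/x265 preset names, ordered from fastest to slowest.
PRESETS = ["ultrafast", "superfast", "veryfast", "faster", "fast", "medium"]


def adjust_preset(index: int, control: float, deadband: float = 1.0) -> int:
    """Hypothetical mapping: step one preset per update, outside a deadband."""
    if control > deadband:       # encoder is too slow: pick a faster preset
        index -= 1
    elif control < -deadband:    # encoder has headroom: pick a slower preset
        index += 1
    return max(0, min(len(PRESETS) - 1, index))


target_fps = 30.0
pid = PID(kp=0.5, ki=0.1, kd=0.0)
index = PRESETS.index("medium")
for achieved_fps in [24.0, 26.0, 29.0, 30.0]:
    control = pid.update(target_fps - achieved_fps, dt=1.0)
    index = adjust_preset(index, control)
    print(achieved_fps, PRESETS[index])
```

The error is defined as target minus achieved frame rate, so a positive control output means the encoder is falling behind and should trade quality for speed. A real deployment would reconfigure several encoder parameters rather than a single preset index.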

CAdViSE: cloud-based adaptive video streaming evaluation framework for the automated testing
of media players

Babak Taraghi
Anatoliy Zabrovskiy
Christian Timmerer
Hermann Hellwagner

Attempting to cope with fluctuations of network conditions in terms of available bandwidth,
latency and packet loss, and to deliver the highest quality of video (and audio) content
to users, research on adaptive video streaming has attracted intense efforts from
the research community and huge investments from technology giants. How successful
these efforts and investments are, is a question that needs precise measurements of
the results of those technological advancements. HTTP-based Adaptive Streaming (HAS)
algorithms, which seek to improve video streaming over the Internet, introduce video
bitrate adaptivity in a way that is scalable and efficient. However, how each HAS
implementation takes into account the wide spectrum of variables and configuration
options, brings a high complexity to the task of measuring the results and visualizing
the statistics of the performance and quality of experience. In this paper, we introduce
CAdViSE, our Cloud-based Adaptive Video Streaming Evaluation framework for the automated testing of adaptive media players. The paper
aims to demonstrate a test environment which can be instantiated in a cloud infrastructure,
examines multiple media players with different network attributes at defined points
of the experiment time, and finally concludes the evaluation with visualized statistics
and insights into the results.

Low latency DASH - more than just spec: DASH-IF test tools

Ece Öztürk
Daniel Silhavy
Torbjörn Einarsson
Thomas Stockhammer

DASH has become one of the most widely adopted and contributed-to streaming formats for internet
streaming services. The DASH Industry Forum has been the core catalyst in this with interoperability
discussions supported by publicly available Interoperability Guidelines and testing
tools. This paper introduces DASH-IF activities and its tools and proposes a demonstration
based on the recent study of low latency DASH service support.

Cloud rendering-based volumetric video streaming system for mixed reality services

Serhan Gül
Dimitri Podborski
Jangwoo Son
Gurdeep Singh Bhullar
Thomas Buchholz
Thomas Schierl
Cornelius Hellge

Volumetric video is an emerging technology for immersive representation of 3D spaces
that captures objects from all directions using multiple cameras and creates a dynamic
3D model of the scene. However, processing volumetric content requires high amounts
of processing power and is still a very demanding task for today's mobile devices.
To mitigate this, we propose a volumetric video streaming system that offloads the
rendering to a powerful cloud/edge server and only sends the rendered 2D view to the
client instead of the full volumetric content. We use 6DoF head movement prediction
techniques, the WebRTC protocol and hardware video encoding to ensure low latency in different
parts of the processing chain. We demonstrate our system using both a browser-based
client and a Microsoft HoloLens client. Our application contains generic interfaces
that allow for easy deployment of various augmented/mixed reality clients using the
same server implementation.

A cloud-based end-to-end server-side dynamic ad insertion platform for live content

Tankut Akgul
Samet Ozcan
Alihan Iplik

In this paper, we present a cloud-based live video streaming and advertising platform
solution that enables internet-based live broadcasts for TV channels. The platform
supports server-side dynamic ad insertion with automated ad detection and personalized
ad placement. A unique feature of our solution is an interactive personalized single
ad that can be inserted at desired locations in the live stream independent of the broadcaster's
commercial break period, which increases ad viewability up to 95% and completion rate
up to 97% on average. The platform also provides management interfaces both for the
broadcaster as well as for the advertisement agencies enabling fully automated programmatic
TV ads.

Multi-modal video forensic platform for investigating post-terrorist attack scenarios

Alexander Schindler
Andrew Lindley
Anahid Jalali
Martin Boyer
Sergiu Gordea
Ross King

The forensic investigation of a terrorist attack poses a significant challenge to
the investigative authorities, as often several thousand hours of video footage must
be viewed. Large scale Video Analytic Platforms (VAP) assist law enforcement agencies
(LEA) in identifying suspects and securing evidence. Current platforms focus primarily
on the integration of different computer vision methods and thus are restricted to
a single modality. We present a video analytic platform that integrates visual and
audio analytic modules and fuses information from surveillance cameras and video uploads
from eyewitnesses. Videos are analyzed according to their acoustic and visual content.
Specifically, Audio Event Detection is applied to index the content according to attack-specific
acoustic concepts. Audio similarity search is utilized to identify similar video sequences
recorded from different perspectives. Visual object detection and tracking are used
to index the content according to relevant concepts. Innovative user-interface concepts
are introduced to harness the full potential of the heterogeneous results of the analytical
modules, allowing investigators to more quickly follow-up on leads and eyewitness
reports.

Fixed viewport applications for omnidirectional video content: combining traditional and 360 video for immersive experiences

Emmanouil Potetsianakis
Emmanuel Thomas
Karim El Assal
Oskar van Deventer

With omnidirectional videos, the viewer is able to direct her Field-of-View (FoV)
to any part of the scene while watching the content. This is achieved by rendering
the 360 video content on the inside of a (conceptual) sphere in which the viewer is
typically placed at the center. This is in contrast with traditional video that is
rendered on a 2D plane, where the viewer always watches through a viewport directed
by the content creator. These two approaches create a conflict between user experience
and creativity, since omnidirectional video provides the user with viewing freedom,
while traditional video allows for greater artistic expression by controlling the
viewport. In order to combine these two approaches we propose an immersive setup in
which the content changes between free-form viewing of omnidirectional video (360 video
mode) and directed viewing of traditional videos (director's mode). In this demo paper
we present the benefits and reasoning behind this proposal and the means to implement
it using the OMAF (MPEG-I - Part 2) standard.

Toward HTTP adaptive streaming: implementation and challenges

Emre Karslı
Reza Shokri Kalan

With the increasing popularity of smart devices and better bandwidth allocations for end
users, online video streaming demand is rising. As an Over-the-Top (OTT) service provider,
we implemented HTTP Adaptive Streaming (HAS) technology to address these concerns by
considering the limitations of legacy streaming protocols. This study demonstrates some
of the essential implementation challenges during the transition from RTMP-based streaming
to HAS technology in DIGITURK OTT platforms.

HTTP adaptive streaming over multiple network interfaces

Burak Kara
Sarp Ozturk
Ali C. Begen

Enhancing user experience in streaming applications is an important problem. Delivering
the best quality possible for the given network conditions is not an easy task. In
the case of a streaming client running in a multi-homed network, this problem becomes
more complicated. In the simplest form, one network can be picked randomly or based
on some criteria, and the streaming client solely uses that network. In another form,
multiple networks can be used simultaneously by the streaming client and, thanks to
aggregation, doing so may deliver better and more stable quality at a lower latency
than using either of the networks individually. However, using multiple networks simultaneously
is not trivial in certain scenarios. In this demo, we present a gateway-based solution
where the gateway is connected to two different networks and dynamically decides which
network(s) to use for streaming while hiding all the decision complexity from the
streaming clients behind it. In other words, our solution is transparent to the streaming
clients, which means any existing client can benefit from this solution without any
changes.
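
As a rough illustration of the kind of decision such a gateway has to make, the sketch below picks one network when it alone can carry the stream and falls back to aggregation otherwise. This is not the authors' algorithm; the abstract does not describe it. The names `Link` and `choose_links`, the throughput inputs, and the `headroom` factor are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One upstream network of the gateway (hypothetical model)."""
    name: str
    throughput_kbps: float  # assumed recent throughput estimate


def choose_links(links, required_kbps, headroom=1.2):
    """Return the names of the network(s) to use for the next request.

    Assumed policy: use a single link if it covers the required bitrate
    plus a safety margin; otherwise aggregate every link with capacity.
    """
    target = required_kbps * headroom
    # Links that can carry the stream on their own.
    sufficient = [link for link in links if link.throughput_kbps >= target]
    if sufficient:
        best = max(sufficient, key=lambda link: link.throughput_kbps)
        return [best.name]
    # No single link is enough: use all links that have any capacity.
    return [link.name for link in links if link.throughput_kbps > 0]


if __name__ == "__main__":
    links = [Link("wifi", 3000.0), Link("lte", 2500.0)]
    print(choose_links(links, 2000.0))  # single link suffices
    print(choose_links(links, 4000.0))  # aggregation needed
```

Because the choice is made inside the gateway, a client behind it would see a single connection regardless of which branch is taken, which matches the transparency property described above.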

Metadata-based user interface design for enhanced content access and viewing

Adem A. Karmis
Alper Derya
Ali C. Begen

The nature of viewing is changing due to the huge volumes of content being produced,
including user content generated by amateurs, and the proliferation of personalized
services. The type of content being produced is not only for entertainment (movies/TV)
purposes, but also can be instructional and directive (classroom, documentary and
adult content). This leads to a type of viewing that is non-linear and requires increased
random access into the content to be viewed effectively. Those who produce the content
(or aggregate a number of existing ones) may not necessarily index it sufficiently
for a variety of reasons. Indexing in advance would not work anyway if the indexing
relied on time-varying factors such as popularity, viewing frequency or duration. In this
demo, we tackle this problem and present a new seekbar design for the dash.js player
that allows the users to navigate the content more effectively and find the points
of interest within the content faster. This new seekbar uses auxiliary metadata to
show informative icons or color the parts of the media timeline differently to inform
the users.
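
To make the idea of coloring parts of the timeline concrete, the sketch below converts time-ranged metadata into seekbar segments expressed as percentages of the media duration. It is only an assumed illustration: the marker format `(start_s, end_s, label)` and the function `timeline_segments` are hypothetical, and the code does not use or represent the dash.js API.

```python
def timeline_segments(markers, duration_s):
    """Map (start_s, end_s, label) markers to seekbar segments in percent.

    Hypothetical helper: a UI layer could use left_pct and width_pct to
    position a colored block or an icon over the seekbar.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    segments = []
    for start_s, end_s, label in sorted(markers, key=lambda m: m[0]):
        # Clamp each marker to the media timeline.
        start = max(0.0, min(float(start_s), duration_s))
        end = max(start, min(float(end_s), duration_s))
        if end == start:
            continue  # nothing visible to draw
        segments.append({
            "label": label,
            "left_pct": 100.0 * start / duration_s,
            "width_pct": 100.0 * (end - start) / duration_s,
        })
    return segments


if __name__ == "__main__":
    markers = [(90, 150, "popular"), (30, 60, "highlight")]
    for segment in timeline_segments(markers, 120):
        print(segment)
```

Computing positions as percentages keeps the result independent of the seekbar's pixel width, so the same metadata can be rendered at any player size.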

Universal access for object-based media experiences

Juliette Carter
Rajiv Ramdhany
Mark Lomas
Tim Pearce
James Shephard
Michael Sparks

Changes in content consumption patterns of audiences have revealed an ever-increasing
appetite for richer and more engaging experiences. From live-streaming to interactive
branching stories, there is a growing trend of users seeking personalised experiences.
Object-based media experiences provide more meaningful encounters for the young and
old, tailoring these to their interests and making them part of the story. However,
creating and distributing these experiences at scale to audiences is challenging for
content providers because device and experience heterogeneity introduces complexity
in the delivery chain. In this demo, we demonstrate how cross-platform approaches,
game-engine-like runtimes, and cloud-based rendering can be combined to provide
universal access to rich immersive experiences. A cross-platform runtime and build process
is used to develop experience presentation capabilities and target multiple device platforms
from a single code-base. Where limited computing power prohibits the execution
of such experiences, cloud-based rendering is harnessed to still achieve playback
on low-end devices without significant loss in quality of experience.
