MHV '22: Proceedings of the 1st Conference on Mile-High Video

Using CMAF to deliver high resolution immersive video with ultra-low end to end latency for live streaming

  • Andrew Zhang
  • Xiaomin Chen
  • Ying Luo
  • Anna Qingfeng Li
  • William Cheung

Immersive video with 8K or higher resolution uses viewport-dependent, tile-based video with multiple resolutions (i.e., a low-resolution background video combined with high-resolution tiles). OMAF defines how to deliver tiled immersive video through MPEG DASH, but end-to-end latency remains a persistent problem for the MPEG DASH solution. Using short segments with 1-second duration reduces latency, but even in those cases, without CDNs, the end-to-end latency is still 5 seconds or more. And in most cases, the massive number of segment files generated every second burdens the CDN, leading to much longer latencies, such as 20 seconds or more. In this paper, we introduce a solution using the Common Media Application Format (CMAF) to deliver tile-based immersive video and reduce the end-to-end latency to under 3 seconds. Based on CMAF, we enable long-duration CMAF segments with shorter end-to-end latency; long-duration segmentation reduces CDN pressure because it reduces the number of segment files generated. In addition, we re-fetch the relevant CMAF chunks of high-resolution segments using our own adaptive viewport-prediction algorithm, and we use a decoder catch-up mechanism for prediction-missed tiles to reduce the motion-to-high-quality (M2HQ) latency when the viewport changes within a chunk. As we will show, this leads to an overall end-to-end latency below 3 seconds, with roughly 1-second packager-to-display latency and an average M2HQ latency of 300 ms, using 5-second segments in a non-CDN environment.
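As an illustration of the chunk re-fetching step described above, the following minimal sketch (the names and the tile/chunk addressing are assumptions, not the authors' implementation) selects which high-resolution tile chunks to request once the viewport prediction for an upcoming CMAF chunk is available; everything outside the prediction keeps relying on the low-resolution background video.

    from typing import Callable, Iterable, Set

    def chunks_to_refetch(predicted_viewport_tiles: Set[int],
                          already_fetched_tiles: Set[int],
                          chunk_index: int,
                          chunk_url: Callable[[int, int], str]) -> Iterable[str]:
        """Yield URLs of high-resolution tile chunks that cover the predicted
        viewport for the given CMAF chunk but have not been fetched yet.
        The low-resolution background stream is always available, so a
        prediction miss only delays the switch to high quality (M2HQ)."""
        for tile_id in sorted(predicted_viewport_tiles - already_fetched_tiles):
            yield chunk_url(tile_id, chunk_index)

    # Hypothetical usage: tiles 4, 5 and 9 are predicted to enter the viewport
    # in chunk 37 of the current 5-second segment.
    urls = list(chunks_to_refetch({4, 5, 9}, {4}, 37,
                lambda t, c: f"https://example.com/hi/tile{t}/seg12_chunk{c}.m4s"))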

Take the red pill for H3 and see how deep the rabbit hole goes

  • Minh Nguyen
  • Christian Timmerer
  • Stefan Pham
  • Daniel Silhavy
  • Ali C. Begen

With the introduction of HTTP/3 (H3) and QUIC at its core, there is an expectation of significant improvements in Web-based secure object delivery. As HTTP is a central protocol to the current adaptive streaming methods in all major streaming services, an important question is what H3 will bring to the table for such services. To answer this question, we present the new features of H3 and QUIC, and compare them to those of HTTP/1.1, HTTP/2 and TCP. We also share the latest research findings in this domain.

Improving streaming quality and bitrate efficiency with dynamic resolution selection

  • Xavier Ducloux
  • Patrick Gendron
  • Thierry Fautier

Dynamic Resolution Selection is a technology that has been deployed by Netflix with its per-scene encoding mechanism applied to VOD assets. The technology is based on an a posteriori VMAF analysis of all the encoded resolutions to determine the best resolution for a given scene in terms of quality and bandwidth used. It cannot be applied to live content, as it would require too much processing power to run in real time.

The method proposed in this paper is based on a machine learning (ML) mechanism that learns how to pick the best resolution to be encoded in a supervised learning environment. At run time, using the already existing pre-processing stage, the live encoder can decide on the best resolution to encode, without adding any processing complexity or delay. This results in higher quality of experience (QoE) or lower bitrate, as well as lower CPU footprint vs. a classical fixed ladder approach. This paper will present the results obtained for live HD or 4K content delivery across different networks, including classical TS (DVB), native IP (ATSC 3.0) and ABR (DASH/HLS). In addition, the paper will report on the interoperability results of tested devices.

VVC in the cloud and browser playback: it works

  • Adam Wieckowski
  • Gabriel Hege
  • Christian Lehmann
  • Benjamin Bross
  • Detlev Marpe
  • Christian Feldmann
  • Martin Smole

The most recent international video coding standard that was developed jointly by the ITU-T and ISO/IEC is Versatile Video Coding (VVC). While VVC contains a feature set for a very wide range of applications, it also allows for significant bit-rate reductions of around 50% for the same subjective video quality compared to its predecessor, High Efficiency Video Coding (HEVC). After the standardization was finished in July 2020, many activities were started to enable the integration of VVC into practical applications.

This paper shows how a practical workflow using VVC for a streaming application is already possible today. For this, we showcase how the Fraunhofer VVenC VVC encoder was integrated into the cloud-based encoding solution from Bitmovin. It is further detailed how VVC changes practical decisions like selecting the optimal bitrate ladder and how the cost and performance compare to other codecs. Finally, it is demonstrated how the Fraunhofer VVdeC decoder can be used with WebAssembly to make real-time VVC playback in the browser possible.

HDR video coding with MPEG-5 LCEVC

  • Amaya Jiménez-Moreno
  • Lorenzo Ciccarelli
  • Rick Clucas
  • Simone Ferrara

High Dynamic Range (HDR) video content continues to gain market relevance for both streaming and broadcasting services, providing video with improved contrast and colour depth. However, the predominance of 8-bit based codecs and the wide availability of Standard Dynamic Range (SDR) devices still pose challenges regarding the effective deployment of HDR content.

MPEG-5 LCEVC is a new video coding standard that works in combination with a separate video standard (e.g., AVC, HEVC, VVC, AV1) to enhance the quality of a video. The enhanced quality is provided by adding details, coded in an enhancement layer, to a lower-resolution version of the same video coded in a base layer. This mechanism can be used to add an HDR enhancement layer on top of any underlying codec, even 8-bit based codecs, which helps achieve more efficient encoding and solves backward-compatibility issues.

In this paper, we describe how LCEVC enables the encoding of HDR video, explaining some of the main tools to provide higher efficiency for this content. Moreover, we provide a series of test results and comparisons of encoding HDR using LCEVC to enhance different video codecs.

Latest advances in the development of the open-source player dash.js

  • Daniel Silhavy
  • Stefan Pham
  • Stefan Arbanowski
  • Stephan Steglich
  • Björn Harrer

The trend to consume high-quality videos over the internet has led to a high demand for sophisticated and robust video player implementations. dash.js is a prominent option for implementing production-grade DASH-based applications and products, and is also widely used for academic research purposes. In this paper, we introduce the latest additions and improvements to dash.js. We focus on various features and use cases such as player performance and robustness, low-latency streaming, metric reporting and digital rights management. The features and improvements introduced in this paper provide great benefits not only for media streaming clients, but also for the server-side components involved in the media streaming process.

EpochSegmentSync: inter-encoder synchronisation for adaptive bit-rate streaming head-ends

  • Rufael Mekuria
  • Roberto Ramos
  • Jamie Fletcher
  • Mark Ogle
  • Arjen
  • Dirk Griffioen
  • Boy van Dijk

This paper presents EpochSegmentSync, an approach for inter-encoder synchronization. It is a robust way to achieve synchronization between distributed live adaptive streaming encoders. The approach does not require a direct communication path between adaptive streaming encoders and supports seamless failover and recovery. Each encoder uses a common time anchor and generates segments with constant durations. By estimating the number of segments K since the anchor, aligned segment boundaries are calculated at distributed encoders. EpochSegmentSync supports the case when encoders join or leave a session at any time, or fail and start again at any time. It works in setups without strict system clock synchronization, such as the common case when up to 100 ms of clock skew exists. EpochSegmentSync can be applied to different use cases: it can be used to synchronize the output of distributed encoders generating parts of a bit-rate ladder, or to synchronize tracks generated by different processes on the same encoder. The approach is implemented in an open-source implementation using the popular FFmpeg encoding tool library and the DASH-IF Live Media Ingest Protocol. The distributed workflow is deployed with multiple encoders and a streaming server using Docker containers. These can be stopped and re-started to emulate failovers. The encoders use a synthetically generated test signal with color bars and a clock overlay, allowing one to inspect visual and timeline discontinuities perceptually. In a realistic setup, an additional timeline conversion step between the input timing (e.g., MPEG-TS) and the adaptive streaming timeline would need to be implemented; the paper provides some details on how this can be achieved. In the use case of redundant dual ingest with encoders failing and restarting, playback is observed to be seamless. In the case when distributed encoders produce parts of the bit-rate ladder, switching is also seamless. In both cases, the live archive stored at the streaming server is observed to be continuous. Last, we detail how more complex cases such as non-integer frame rates or splice-point insertions can be handled.
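As a minimal sketch of the segment-alignment rule described above (function and variable names are ours, not the paper's), each encoder derives the current segment index K and its boundary purely from the shared anchor and the constant segment duration:

    import time
    from typing import Optional, Tuple

    def aligned_segment_boundary(epoch_anchor: float, segment_duration: float,
                                 now: Optional[float] = None) -> Tuple[int, float]:
        """Compute the index K of the current segment since the common time
        anchor and the time at which that segment starts. Any encoder that
        evaluates this with the same anchor and duration produces identically
        numbered, aligned segments, even after a restart or failover."""
        if now is None:
            now = time.time()
        k = int((now - epoch_anchor) // segment_duration)  # segments since the anchor
        start = epoch_anchor + k * segment_duration        # boundary of segment K
        return k, start

    # Example: 2-second segments anchored at the Unix epoch. With up to ~100 ms of
    # clock skew, two encoders normally still agree on K except very close to a boundary.
    k, start = aligned_segment_boundary(epoch_anchor=0.0, segment_duration=2.0)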

Enabling the immersive display: a new paradigm for content distribution

  • Arianne T. Hinds

Immersive displays, including holographic displays, dense multiview displays, VR headsets, and AR glasses, have begun to emerge as a new class of display technologies that not only have the potential to create visual experiences that engage the viewer in ways unprecedented for existing 2D screen technologies, but also create a "forcing function" for industry to recognize that such displays are heterogeneous in terms of the types of input media signals required for these displays to create their optimal visual experiences. That is, until now, standards bodies and industry have successfully addressed the needs of new display technologies by extending the densities, frame rates, and color gamuts of rectilinear video formats. But what if the best of these immersive displays cannot be driven off of rectilinear video formats? Is it impossible then for such displays to be adopted and supported at scale? This paper will briefly summarize historical events motivating the demand for immersive displays and content. Next, we will characterize immersive displays and the types of input signals that afford such displays the best opportunity to create their optimal visual experiences. Starting with VR headsets, which are a relatively straightforward extension of stereoscopic televisions, the various types of immersive displays and their optimal input media signals will be presented, leading to the realization that new paradigms for content distribution are needed in order to prepare industry for the availability of display technologies that were previously thought to exist only in science fiction.

The benefits of server hinting when DASHing or HLSing

  • May Lim
  • Mehmet N. Akcay
  • Abdelhak Bentaleb
  • Ali C. Begen
  • Roger Zimmermann

Streaming clients almost always compete for the available bandwidth and server capacity. Not every client's playback buffer conditions will be the same, though, nor should the priority with which the server processes the individual requests coming from these clients. In an earlier work, we demonstrated that if clients conveyed their buffer statuses to the server using a Common Media Client Data (CMCD) query argument, the server could allocate its output capacity among all the requests more wisely, which could significantly reduce the rebufferings experienced by the clients.

In this paper, we address the same problem using the Common Media Server Data (CMSD) standard that is work-in-progress at the Consumer Technology Association (CTA). In this case, the incoming requests are scheduled based on their CMCD information. For example, the response to a request indicating a healthy buffer status is held/delayed until more urgent requests are handled. When the delayed response is eventually transmitted, the server attaches a new CMSD parameter to indicate how long the delay was. This parameter avoids misinterpretations and subsequent miscalculations by the client's rate-adaptation logic.

We implemented the server-side processing of CMCD and the client-side processing of CMSD. Our experiments show that the proposed CMSD parameter effectively eliminates unnecessary downshifting while reducing both the rebuffering rate and duration.
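A minimal sketch of the scheduling idea, assuming an asynchronous HTTP server, a standard CMCD query argument carrying the buffer length ('bl', in ms), and a purely illustrative CMSD key name for the applied delay (the actual key and thresholds are not taken from the paper):

    import asyncio
    from urllib.parse import parse_qs

    LOW_BUFFER_MS = 2000  # assumed urgency threshold, not from the paper

    async def schedule_response(query_string: str, send_segment) -> None:
        """Hold responses for clients that report a healthy buffer so that more
        urgent requests can be served first, then echo the applied delay back
        in a CMSD-style response header so the client's rate adaptation can
        discount it when estimating throughput."""
        cmcd = parse_qs(query_string).get("CMCD", [""])[0]
        fields = dict(kv.split("=", 1) for kv in cmcd.split(",") if "=" in kv)
        buffer_ms = int(fields.get("bl", "0"))

        held_ms = 0
        if buffer_ms > LOW_BUFFER_MS:  # healthy buffer: this request can wait
            held_ms = min(buffer_ms - LOW_BUFFER_MS, 500)
            await asyncio.sleep(held_ms / 1000)

        # 'delay-ms' is an illustrative key, not the standardized CMSD parameter.
        await send_segment(headers={"CMSD-Static": f"delay-ms={held_ms}"})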

On multiple media representations and CDN performance

  • Yuriy Reznik
  • Thiago Teixeira
  • Robert Peck

This paper proposes a mathematical model describing the effects of using multiple media representations on CDN performance in HTTP-based streaming systems. Specifically, we look at cases of using multiple versions of the same content packaged differently and derive an asymptotic formula for CDN cache-miss probability considering parameters of the content's distribution and the distribution of formats used for packaging and delivery. We then study the validity of this proposed formula by considering statistics collected for several streaming deployments using mixed HLS and DASH packaging and show that it predicts the experimentally observed data reasonably well. We further discuss several possible extensions and applications of this proposed model.
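The effect itself is easy to reproduce with a toy simulation (this is not the paper's analytical model; the cache size, catalogue size and Zipf-like popularity below are arbitrary): requesting the same titles under several packagings inflates the effective catalogue seen by the cache, which raises the miss probability for a fixed cache size.

    import random
    from bisect import bisect
    from collections import OrderedDict
    from itertools import accumulate

    def simulate_miss_rate(num_titles=2000, cache_size=500, requests=50000,
                           formats=("hls", "dash"), zipf_s=0.8, seed=7):
        """Toy LRU cache fed with Zipf-distributed title requests, where each
        title may be requested in any of the given packaging formats."""
        rng = random.Random(seed)
        cum = list(accumulate(1.0 / (r + 1) ** zipf_s for r in range(num_titles)))
        cache, misses = OrderedDict(), 0
        for _ in range(requests):
            title = bisect(cum, rng.random() * cum[-1])   # sample a title by popularity
            key = (title, rng.choice(formats))            # same content, different packaging
            if key in cache:
                cache.move_to_end(key)
            else:
                misses += 1
                cache[key] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)             # evict least recently used
        return misses / requests

    print(f"miss rate, two formats: {simulate_miss_rate():.3f}")
    print(f"miss rate, one format:  {simulate_miss_rate(formats=('hls',)):.3f}")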

Multimedia streaming analytics: quo vadis?

  • Cise Midoglu
  • Mariana Avelino
  • Shri Hari Gopalakrishnan
  • Stefan Pham
  • Pål Halvorsen

In today's complex OTT multimedia streaming ecosystem, the task of ensuring the best streaming experience to end-users requires extensive monitoring, and such monitoring information is relevant to various stakeholders including content providers, CDN providers, network operators, device vendors, developers, and researchers. Streaming analytics solutions address this need by aggregating performance information across streaming sessions, to be presented in ways that help improve the end-to-end delivery. In this paper, we provide an analysis of the state of the art in commercial streaming analytics solutions. We consider five products as representatives, and identify potential improvements with respect to terminology, QoE representation, standardization and interoperability, and collaboration with academia and the developer community.

Super-resolution based bitrate adaptation for HTTP adaptive streaming for mobile devices

  • Minh Nguyen
  • Ekrem Çetinkaya
  • Hermann Hellwagner
  • Christian Timmerer

The advancement of hardware capabilities in recent years made it possible to apply deep neural network (DNN) based approaches on mobile devices. This paper introduces a lightweight super-resolution (SR) network, namely SR-ABR Net, deployed at mobile devices to upgrade low-resolution/low-quality videos and a novel adaptive bitrate (ABR) algorithm, namely WISH-SR, that leverages SR networks at the client to improve the video quality depending on the client's context. WISH-SR takes into account mobile device properties, video characteristics, and user preferences. Experimental results show that the proposed SR-ABR Net can improve the video quality compared to traditional SR approaches while running in real time. Moreover, the proposed WISH-SR can significantly boost the visual quality of the delivered content while reducing both bandwidth consumption and number of stalling events.
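The following is a deliberately simplified sketch of an SR-aware bitrate decision (the ladder, quality numbers and per-resolution SR gains are invented for illustration and are not WISH-SR itself): low resolutions receive a device-dependent quality boost when the client can upscale them in real time, which can shift the decision toward cheaper representations.

    def pick_representation(ladder, throughput_kbps, sr_gain):
        """Pick the representation with the best perceived quality that fits the
        estimated throughput, crediting low resolutions with the quality gain an
        on-device super-resolution network is expected to deliver."""
        best = None
        for rep in ladder:  # rep: {"height", "bitrate_kbps", "quality"}
            if rep["bitrate_kbps"] > throughput_kbps:
                continue
            quality = rep["quality"] + sr_gain.get(rep["height"], 0.0)
            if best is None or quality > best[0]:
                best = (quality, rep)
        return best[1] if best else min(ladder, key=lambda r: r["bitrate_kbps"])

    ladder = [
        {"height": 360,  "bitrate_kbps": 800,  "quality": 60},
        {"height": 720,  "bitrate_kbps": 2500, "quality": 80},
        {"height": 1080, "bitrate_kbps": 5000, "quality": 90},
    ]
    sr_gain = {360: 15, 720: 5}   # hypothetical gains from client-side SR
    chosen = pick_representation(ladder, throughput_kbps=3000, sr_gain=sr_gain)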

Fast and effective AI approaches for video quality improvement

  • Marco Bertini
  • Leonardo Galteri
  • Lorenzo Seidenari
  • Tiberio Uricchio
  • Alberto Del Bimbo

In this work we present solutions based on AI techniques to the problem of real-time video quality improvement, addressing both video super resolution and compression artefact removal. These solutions can be used to revamp video archive materials allowing their reuse in modern video production and to improve the end user experience playing streaming videos in higher quality while requiring less bandwidth for their transmission. The proposed approaches can be used on a variety of devices as a post-processing step, without requiring any change in existing video encoding and transmission pipelines. Experiments on standard video datasets have shown that the proposed approaches improve video quality metrics considering either fixed bandwidth budgets or fixed quality goals.

Live OTT services delivery with Ad-insertion using VVC, CMAF-LL and ROUTE: an end-to-end chain

  • Thibaud Biatek
  • Mohsen Abdoli
  • Christophe Burdinat
  • Eric Toullec
  • Lucas Gregory
  • Mickael Raulet

During the past decade, broadcasting has been challenged by OTT services, offering a more personalized and flexible way of experiencing content. The on-demand features paved the way for new media delivery paradigms, replacing the traditional MPEG-TS stream-based approach by file-based IP protocols. With those new protocols came new ways of monetizing content. While traditional TV leveraged advertisements to increase broadcasters' revenues for decades, it never approached the level of personalization offered by OTT, until the release of ATSC 3.0 [8] and its IP stack in 2016. However, OTT services have some drawbacks compared to broadcasting: video quality, latency, and scaling capabilities when millions of viewers want to access content at the same time. In this paper, we describe how several recent technologies can be exploited to address these drawbacks. An end-to-end chain is also demonstrated, bringing significant improvements over existing approaches in terms of bandwidth, latency and experience.

The Versatile Video Coding (VVC) [1] standard was released in mid-2020 by ISO/IEC MPEG and ITU-T VCEG. VVC has been designed to address a wide range of applications and video formats, while providing a substantial bandwidth saving (around 50%) compared to its predecessor, High Efficiency Video Coding (HEVC) [10], at an equivalent perceived video quality [12]. In this paper, VVC is used to reduce the bandwidth occupied by video over the network, which is a key issue in 2021, given that more than 80% of internet traffic is used to deliver video content [2]. A live software implementation of VVC provided by ATEME is used in the headend before packaging and distribution, producing ISO/IEC Base Media File Format (ISOBMFF) output files [6].

The Common Media Application Format in its Low-Latency profile (CMAF-LL) is then used to deliver the video [7], leveraging HTTP chunked transfer encoding to deliver Low-Latency DASH (LL-DASH) [3]. To further reduce the bandwidth, the DASH-ROUTE server is leveraged to deliver multicast instead of redundant unicast sessions [11]. The ATEME packager, origin server and DASH-ROUTE multicast server are used for that purpose. Combined with VVC's efficient source coding, this provides a bandwidth-efficient and low-latency way of delivering OTT services at scale.

This paper adopts multi-period DASH manifests for ad insertion, using XLink to signal the ad-server URL. Typical streams coming from broadcasting studios, embedding SCTE-35 splicing events, are used as input and interpreted by the ATEME pre-processing engine to trigger the ad-insertion and multi-period events in the DASH manifest. The ad server is then provisioned with VVC-encoded ad clips to reduce the CDN cost of those files. To demonstrate such advanced features, the GPAC framework [9] is used with a real-time VVC software decoder [4, 5]. The multicast gateway and player from GPAC embed the real-time decoder libraries to demonstrate both the ROUTE demuxing and ad replacement within the player.
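Conceptually, the manifest manipulation triggered by a splice event can be sketched as follows (a simplified data model with invented identifiers and URLs, not the ATEME implementation): the current content period is closed at the splice time, an ad period referenced through XLink is inserted, and the content resumes in a new period after the break.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Period:
        period_id: str
        start_s: float
        duration_s: Optional[float] = None
        xlink_href: Optional[str] = None  # remote period resolved via the ad server

    def split_on_splice(periods: List[Period], splice_time_s: float,
                        ad_duration_s: float, ad_server_url: str) -> List[Period]:
        """Close the running content period at the SCTE-35 splice time, insert an
        XLink-signalled ad period, and open a new content period after the break."""
        periods[-1].duration_s = splice_time_s - periods[-1].start_s
        ad = Period("ad-break-1", splice_time_s, ad_duration_s,
                    xlink_href=f"{ad_server_url}?start={splice_time_s}")
        resume = Period("content-2", splice_time_s + ad_duration_s)
        return periods + [ad, resume]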

The benefits of using VVC and ROUTE have been measured. It is verified that the proposed solution enables latency similar to typical terrestrial broadcast services with a high level of quality, providing around 3 seconds of glass-to-glass latency leveraging CMAF with HTTP chunked transfer encoding. An accurate description of the encoding, packaging and delivery settings will be presented in the poster. Finally, interoperability is demonstrated by integration with other CDN providers and devices.

In summary, this paper describes the implementation and demonstration of a state-of-the-art, complete end-to-end transmission chain for OTT that aims to address well-known OTT drawbacks, such as latency and scaling capability, by combining VVC, CMAF-LL and DASH-ROUTE. The proposed solution demonstrates significant benefits in terms of latency reduction, bandwidth saving and network traffic reduction, while enabling flexible and customized ad-insertion solutions for live OTT services. The interoperability of the proposed components is also demonstrated by integrating the chain with third-party CDN providers and within a broadcast environment. As a result, this paper shows the degree of maturity of emerging technologies such as VVC when combined with CMAF and ROUTE.

Sustainable OTT video distribution powered by 5G-multicast/unicast delivery and versatile video coding

  • Thibaud Biatek
  • Eduard François
  • Cédric Thienot
  • Wassim Hamidouche

Video over the internet has grown drastically in recent years, currently representing more than 80% of internet bandwidth [3]. The massive usage of unicast delivery leads to network congestion that can result in poor quality of experience for the viewer, high delivery cost for operators and increased energy consumption. Current methods for adaptive video streaming focus more on maximizing the video quality for a given bandwidth than on minimizing the end-to-end (from video server to end-user display) energy consumption for a given level of quality. This paper aims at leveraging recently standardized delivery and coding technologies to maintain the video quality while monitoring and reducing the video energy footprint and delivery costs.

The Versatile Video Coding (VVC) standard [2], recently issued by ISO/IEC and ITU-T, is used to further reduce the bandwidth of video services compared with previous video coding solutions, in particular its predecessor HEVC [10]. Compared to HEVC, VVC enables around 50% bandwidth saving at an equivalent video quality [12]. This performance is achieved by extending existing HEVC coding tools and by introducing new ones. VVC also brings new high-level features to better address new use cases and applications (e.g., variable resolution, scalability, 360° and screen-content coding). This paper uses the ATEME Titan encoder, while decoding is performed by OpenVVC, a real-time software decoder. The latter is a cross-platform library that provides real-time decoding capability on different operating systems, including macOS, Windows, Linux and Android, on Intel x86 and ARM platforms.

Dynamic video format selection is proposed to limit the distribution bandwidth and energy cost, based on quality/complexity tradeoffs sent to the player. The carriage of such metadata is achieved using the Green-MPEG [6] and DASH standards. Content mapping and video artefact masking are implemented to counterbalance video degradation with post-processing. These are based on SEI messages such as CTI and Grain Synthesis, recently standardized in the VSEI specification [5]. An energy-reporting solution is proposed so that the end viewer is informed of the energy impact and can manually adjust the energy/quality tradeoff.

The Common Media Application Format (CMAF) is used to deliver the video segments over the network [5]. The carriage of video, audio and green metadata is based on the ISO/IEC Base Media File Format (ISOBMFF) [7]. To save network traffic, the video segments can be sent either in unicast for low-audience services or in multicast for highly popular ones. The DASH/FLUTE stack is implemented as the multicast protocol. The OTT services are delivered in the context of 3GPP Release 17 [1] and DVB-MABR Phase 2 networks [4], including delivery over managed networks (IPTV) and mobile networks (4G LTE and 5G).

The estimated energy savings for the proposed delivery infrastructure are as follows. First, the bitrate saving brought by improved compression is estimated at 0.05-0.1 Wh per bit of transmitted video [11], representing a 37.5-75 kWh saving for a 5 Mbps HEVC HD service. Usage of SL-HDR is estimated to save 15-20% of display energy [9], and even more for a receiver fully leveraging Green metadata. For an average TV set [8], this represents a 21 kWh saving. Finally, radio saving is more complex to estimate, but preliminary results show that savings from multicast start from 1-8 User Equipments (UEs) in a cell.

This paper proposes an OTT delivery solution enabling operators and end-users to monitor and control their energy impact. At the headend side, metadata are generated by a pre-processing block and embedded in SEI messages. The encoder leverages VVC and CMAF to encapsulate video services together with the metadata. The delivery network is optimized by using the DASH/FLUTE profile of DVB-MABR to save bandwidth when OTT content is delivered over IPTV or mobile networks. On the end-user side, real-time software VVC decoding and post-decoding processing are demonstrated. An end-to-end demonstrator is provided and evaluated to assess the relevance of the proposed method in a real environment.

Delivering universal TV services in a multi-network and multi-device world with DVB-I

  • Thibaud Biatek
  • Mickael Raulet
  • Patrice Angot
  • Philippe Gonon
  • Cédric Thienot
  • Wassim Hamidouche
  • Pascal Perrot
  • Julien Lemotheux

The TV landscape went through significant changes during the past decades. Traditional broadcast switched from analog to digital thanks to new modulation, transport and coding systems. In the meantime, the internet brought new ways of experiencing TV services with more customization, creating the need for on-demand features and paving the way to the streaming world we know today. These ecosystems grew separately, with MPEG-TS-centric broadcast applications developed by DVB/ATSC and broadband streaming applications developed by the GAFAM companies based on IP protocols from the IETF. This created fragmentation in the audience in terms of usage (linear, on-demand), access network (IPTV, broadcast, OTT and 4G-LTE/5G) and devices (TVs, set-top boxes, mobiles). Broadcasters and operators addressed this fragmentation by offering services in many variants, leveraging various, non-homogeneous technologies, which led to a complex video delivery infrastructure with a lot of redundancy. This increases the delivery cost significantly and represents an energy waste in networks and datacenters. In this paper, a solution for universal TV service delivery is proposed, based on the recently standardized DVB-I and addressing OTT, IPTV and 4G-LTE/5G mobile networks.

Recently, Versatile Video Coding (VVC) [1] has been added to the DVB toolbox as an enabler for new applications in the DVB ecosystem [2]. Besides, 3GPP SA4 started the characterization of video codecs for 5G applications, including VVC as a relevant compression technology [6]. VVC was issued in mid-2020 and has been developed by JVET, a joint group of ITU-T and ISO/IEC. VVC has been designed to address various kinds of applications and formats and provides around 50% bandwidth saving compared to its predecessor HEVC [5] for a similar visual quality [5]. Thus, VVC is a relevant technology to address new use cases including 8K, VR-360, gaming and augmented reality. Beside this multiplicity of codecs, the multiplicity of delivery networks and devices brings new challenges. To address this fragmentation while maintaining the audience, DVB developed a new paradigm for media consumption in order to harmonize TV services and make them universal: DVB-I [7]. DVB-I enables, through a centralized service list, access to TV services in a network- and device-agnostic manner. The service list describes, in a universal way, the access networks and decoding capabilities, including prioritization aspects.

This paper proposes a delivery architecture based on DVB-I enabling video services to reach any kind of device (set-top boxes, smartphones, TVs) on various networks, from the broadcast (DVB) to the broadband (3GPP) world. The headend produces video bitstreams (HEVC/VVC) and packages the streams using the Common Media Application Format (CMAF) [3], producing DVB-DASH-compliant streams for delivery over IPTV, 4G-LTE, 5G and OTT. The DVB-MABR standard is leveraged, as well as 5GMS, in order to reach end devices. A service list URL substitution mechanism is proposed to address the interoperability of DVB-I with DVB-MABR and 3GPP networks. Clients on smartphones and Android set-top boxes are demonstrated, embedding a DVB-MABR client, a DVB-I client and a live VVC decoding open-source library [4]. To demonstrate the proposed approach and architecture, an end-to-end live demonstrator is provided and further detailed in the poster.

The solution provides operators with a cost-effective way of deploying universal services over multiple networks and devices. The bandwidth is optimized by leveraging the latest coding technologies (HEVC, VVC), while CMAF is used to unify packaging and enable low-latency delivery. DVB-MABR is implemented to optimize bandwidth over operators' networks. These components are tied together by DVB-I in order to signal TV services in a network- and device-agnostic manner. Finally, a DVB/3GPP-compliant player is proposed for any device, providing a consistent and high quality of experience.

Porting BQM perceptual video quality measure to hardware

  • Sergey Pyko
  • Boris Filippov
  • Tamar Shoham
  • Dror Gill
  • Nikolay Terterov
  • Alexander Ivanov
  • Vadim Demidov

With the explosive growth in video, fast and reliable perceptual video quality assessment using objective quality measures is needed, as recognized by Netflix [1, 2], SSIMWAVE [3], Beamr [4, 5] and Facebook [6, 7], to name a few. The Beamr Quality Measure (BQM) is a full-reference image and video quality measure that can reliably assess the perceived quality of reconstructed image or video content that was compressed using block-based methodologies. Due to the demand for metrics that can be applied in real time or on large amounts of video, an efficient and low-cost implementation is needed, making a hardware, or on-chip, solution desirable. This paper details the process of porting BQM to hardware. The adaptations required for the hardware implementation are described, the optimizations applied are reviewed and some results are presented.

RICHTER: hybrid P2P-CDN architecture for low latency live video streaming

  • Reza Farahani
  • Hadi Amirpour
  • Farzad Tashtarian
  • Abdelhak Bentaleb
  • Christian Timmerer
  • Hermann Hellwagner
  • Roger Zimmermann

Content Distribution Networks (CDNs) and HTTP Adaptive Streaming (HAS) are considered the principal video delivery technologies over the Internet. Despite the wide usage of these technologies, designing cost-effective, scalable, and flexible architectures that support low-latency and high-quality live video streaming is still a challenge. To address this issue, we leverage existing works that have combined the characteristics of Peer-to-Peer (P2P) networks and CDN-based systems and introduce a hybrid CDN-P2P live streaming architecture. When dealing with the technical complexity of managing hundreds or thousands of concurrent streams, such hybrid systems can provide low-latency and high-quality streams by enabling the delivery architecture to switch between the CDN and the P2P modes. However, modern networking paradigms such as Edge Computing, Network Function Virtualization (NFV), and distributed video transcoding have not been extensively employed to design hybrid P2P-CDN streaming systems. To bridge the aforementioned gaps, we introduce a hybRId P2P-CDN arcHiTecture for low LatEncy live video stReaming (RICHTER), discuss the details of its design, and finally outline a few directions for future work.

Microservices for multimedia: video encoding

  • Frank San Miguel
  • Naveen Mareddy
  • Anush Krishna Moorthy
  • Xiaomei Liu

Netflix has been one of the pioneers that have driven industry adoption of a new paradigm of system architecture referred to as "microservices". Microservices, or more accurately, microservice architecture, refers to an architecture where applications are modeled as a collection of services which are: highly maintainable and independently testable, loosely coupled, independently deployable and organized around business capabilities. Typically, each microservice is owned by a small team of developers that is responsible for its development, testing and deployment, i.e., its end-to-end lifecycle. Traditional microservices, such as those used outside of multimedia processing at Netflix, typically consist of an API with stateless business logic which is autoscaled based on request load. These APIs provide strong contracts and separate the application data and binary dependencies from systems.

As useful as traditional microservices are, several peculiarities of multimedia applications render such stateless services non-ideal for media processing. Specifically, media processing (which includes video/audio processing, media encoding, timed-text processing, computer vision analysis, etc.) relies on data that is embedded in files, where the files themselves are the contracts, as opposed to the fully visible data models that are common in non-media applications. At Netflix, media processing is resource intensive and bursty in nature. It is also highly parallelizable and re-triable, and so, even though work is generally a continuous stream with deadlines and priorities, the system can balance resources by evicting jobs as needed and retrying them at a later time.

In this talk, we will summarize Cosmos, a project that we've developed in order to enable workflow-driven media processing using a microservice architecture.

Cosmos is a computing platform that combines the best aspects of microservices with asynchronous workflows and serverless functions. It is designed specifically for resource-intensive algorithms which are coordinated via complex hierarchical workflows. Cosmos supports both high-throughput and low-latency workloads. The Cosmos platform offers: observability through built-in logging, tracing, monitoring, alerting and error classification; modularity (both compile-time and run-time) through an opinionated framework for structuring a service; productivity through tooling such as code generators, containers, and command-line interfaces; and delivery through managed continuous-delivery pipelines.

The Cosmos platform allows media developers to build and run domain-specific, scale-agnostic components which are built atop three scale-aware subsystems that handle distributing the work. Each component can thus be independently developed, tested and deployed with clear abstraction from the underlying platform, thereby providing a logical separation between the application and the platform so that the details of distributed computing are hidden from media developers.

Cosmos enables our media developers to take a service from commit to deployment in a matter of hours. To ensure the success of the large-scale overall system with independent fast-moving microservice development, innovative testing strategies are applied with various testing tools and quick rollback capability in production.

In the talk, using the Netflix Video Encoder Service as an example, we will describe the Cosmos architecture and our migration to microservices-based media processing. The talk will also cover our learnings around managing a large-scale migration and the mindset required in order to plan and execute a multi-year goal.

Multi-buffer AVX-512 accelerated parallelization of CBCS common encryption mode

  • Marcel Cornu
  • Mark Jewett
  • Sumit Mohan
  • Tomasz Kantecki
  • Gordon Kelly
  • Romain Bouqueau
  • Jean Le Feuvre
  • Alex Giladi

The Intel Multi-Buffer Crypto for IPsec Library provides highly optimized software implementations of the core cryptographic processing for TLS, Wireless (RAN), Cable and MPEG DRM. The library offers industry-leading performance on a range of Intel(R) Processors by utilizing the latest CPU instructions up to AVX-512 and various software optimization techniques to maximize Intel CPU utilization.

MPEG Common Encryption (MPEG-CENC) is an encryption format for ISO Base Media files. This encryption format sees wide usage today to protect media content, with its capability to reduce the overhead of encrypting large amounts of data while providing copyright protection. MPEG-CENC uses CBCS, which is based on AES in CBC (cipher block chaining) mode. Due to the inherent dependency between blocks in CBC encryption, CBCS does not lend itself well to the SIMD (Single Instruction Multiple Data) instructions on the CPU, and a serial implementation limits performance on cores. This is where multi-buffer processing can be used to speed up CBCS: it uses SIMD instructions and heavily leverages the AES-NI extensions to parallelize data processing for CBCS and improve the performance of CENC.

In this paper we will discuss Intel's latest Xeon® processors (codenamed Ice Lake) with the Sunny Cove core, which brings improved vector AES performance, giving 2x throughput compared to previous-generation cores, and the new AVX-512 VAES extension, which enables AES operations to be performed on full 64-byte ZMM registers (up to 4 AES blocks) with a single instruction.

Additionally, we will walk through the Intel Multi-Buffer Crypto for IPsec Library's CENC cbcs implementation, which leverages these new features and enhancements to dramatically improve crypto performance compared to previous-generation Intel® processors, and hence significantly reduces the crypto overhead in multimedia packager stacks such as GPAC when doing MPEG DRM encryption and decryption, by up to 10x versus the default implementation (using OpenSSL).
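For readers unfamiliar with the mode, the following sketch shows the cbcs pattern itself in plain Python using the cryptography package (a serial reference, not Intel's multi-buffer AVX-512 implementation): within each stripe of ten 16-byte blocks, only the first block is AES-CBC encrypted, and the CBC chain is carried across the encrypted blocks.

    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    BLOCK = 16

    def cbcs_pattern_encrypt(data: bytes, key: bytes, iv: bytes,
                             crypt_blocks: int = 1, skip_blocks: int = 9) -> bytes:
        """Simplified 'cbcs' (1:9 pattern) encryption of one subsample: encrypt
        the first crypt_blocks of every stripe, leave the rest and any trailing
        partial block in the clear. A single CBC encryptor is reused so the
        chain continues across the encrypted portions."""
        out = bytearray(data)
        stripe = (crypt_blocks + skip_blocks) * BLOCK
        enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
        pos = 0
        while pos + crypt_blocks * BLOCK <= len(data):
            out[pos:pos + crypt_blocks * BLOCK] = enc.update(
                bytes(out[pos:pos + crypt_blocks * BLOCK]))
            pos += stripe
        return bytes(out)

Because each CBC chain is inherently serial, the multi-buffer approach regains SIMD parallelism by processing many such independent buffers in parallel rather than parallelizing within a single chain.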

Perceptual modelling for banding detection

  • Kai Zeng
  • Hojatollah Yeaganeh
  • Zhou Wang

Banding is an annoying visual artifact that frequently appears at various stages along the chain of video acquisition, production, distribution and display. With the thriving popularity of ultra-high definition, high dynamic range, and wide color gamut content, and the increasing user expectations that follow, the banding effect has been attracting growing attention for its strong negative impact on viewer experience in visual content that could otherwise have nearly perfect quality. What often frustrates many industrial practitioners is that simply increasing the bit-depth or bitrate of a video does not necessarily lead to removal or even reduction of banding. Indeed, with the recent accelerated growth of ultra-high definition (UHD), high dynamic range (HDR) and wide color gamut (WCG) in content production, distribution services, and consumer display devices, severe banding occurs even more frequently than before, and the visual effect is often much stronger. This is because UHD/HDR/WCG content typically covers a wider range of luminance levels and color variations than traditional standard dynamic range (SDR) content. This, together with the limited and varying capabilities of display devices, creates major challenges in maintaining smooth visual transitions simultaneously across all luminance levels and color variations.

Automatic or objective image/video quality assessment (IQA/VQA) has been a highly active topic in the past two decades. However, popular IQA/VQA methods such as PSNR, SSIM, and MS-SSIM are insensitive to banding impairments. Therefore, the industry is in urgent need of innovative approaches that are able to detect, control, and remove/reduce banding in an automated fashion.

Here we present two different types of technologies of great promise for banding detection. The first is based on domain knowledge gained through deep and thorough understanding of the human visual system (HVS) and the video acquisition, production, distribution and display processes. Computational models derived from such domain knowledge are then combined to construct an overall banding detection and assessment model. In contrast to the first type of domain-knowledge-driven methods, the second type of approach is data-driven, with little or no domain knowledge assumed. Instead, a large number of images/videos and their ground-truth labels (with or without banding) are collected, and machine learning methods are used to train black-box models such as deep neural networks (DNNs) on the image/video dataset, so that the learned model can make good banding predictions on unseen image/video content.

ABR-aware prefetching methods in P2P

  • Hiba Yousef
  • Jean Le Feuvre
  • Alexandre Storelli

Adaptive BitRate (ABR) streaming has become one of the lead delivery techniques that adapts the video quality to the viewing conditions, with embedded client-side logic. In parallel, Peer-to-Peer (P2P) networks improve the system scalability by leveraging the resources of the participating peers, each peer downloading then sharing video segments with other peers. Those networks rely on P2P prefetching techniques to get as much data as possible from other peers, then keep it in a local memory either to serve the peer itself or other peers in the network.

P2P environments face a compatibility challenge with the client-side ABR logic, as the P2P stack and the ABR logic are typically unaware of each other. When the ABR receives a segment directly from the P2P cache, the download time is very small and implies an almost infinite bandwidth, and therefore an incorrect next-bitrate decision. In a previous work [2], we proposed Response-Delay, an initial solution that mocks the CDN response time when delivering locally cached P2P segments to the video player.

Interestingly, adjusting the download time has paved the way for further improvements. It leads the ABR to take different decisions, and hence it controls the player's ABR one way or another. We have shown in another previous work [1] that, despite the ABR logic being a complete black box from the perspective of the P2P stack, it can be modelled with machine learning (ML) techniques by monitoring its inputs and final decisions. The present work binds together the ideas proposed in [2] and [1] to investigate the possibility of enhancing the prefetching technique. The main idea is to anticipate the quality switches using the ABR ML model and, from there, investigate two possible actions.

The first strategy (MLQF) consists of downloading the anticipated quality from other peers prior to the player's request. We show that this greatly improves the efficiency of P2P prefetching during ABR track switches.

The second strategy (MLQC) is to estimate the optimal response delay that will force the ABR logic to stay at the already prefetched quality, so that the same content is not downloaded again in another quality. We apply this strategy only when the prefetched quality is higher than what the ABR would request, so that we do not degrade the average quality.
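A minimal sketch of the response-delay idea underlying MLQC (the constants and names are illustrative, not the authors' parameters): when a segment is served instantly from the local P2P cache, it is handed to the player only after a delay chosen so that the implied throughput stays close to the bitrate we want the ABR to keep selecting.

    def mocked_download_time_s(segment_bytes: int, target_bitrate_kbps: float,
                               headroom: float = 1.2) -> float:
        """Return the artificial delay before delivering a cached segment so the
        ABR's throughput estimate lands slightly above target_bitrate_kbps,
        instead of the near-infinite value an instant response would imply."""
        implied_throughput_kbps = target_bitrate_kbps * headroom
        return (segment_bytes * 8 / 1000) / implied_throughput_kbps

    # Example: a 1.5 MB segment delivered as if the throughput were ~3.6 Mbps.
    delay = mocked_download_time_s(segment_bytes=1_500_000, target_bitrate_kbps=3000)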

CAdViSE or how to find the sweet spots of ABR systems

  • Babak Taraghi
  • Abdelhak Bentaleb
  • Christian Timmerer
  • Roger Zimmermann
  • Hermann Hellwagner

With the recent surge in Internet multimedia traffic, the enhancement and improvement of media players, specifically Dynamic Adaptive Streaming over HTTP (DASH) media players, has happened at an incredible rate. DASH media players adapt a media stream to network fluctuations by continuously monitoring the network and making decisions in near real time. The performance of the algorithms in charge of making such decisions has often been difficult to evaluate and objectively assess from an end-to-end or holistic perspective [1].

CAdViSE provides a Cloud-based Adaptive Video Streaming Evaluation framework for the automated testing of adaptive media players [4]. We will introduce the CAdViSE framework and its application, and present the benefits and advantages that it can bring to every web-based media player development pipeline. To demonstrate the power of CAdViSE in evaluating Adaptive Bitrate (ABR) algorithms, we will exhibit its capabilities when combined with objective Quality of Experience (QoE) models. Our team at Bitmovin Inc. and the ATHENA laboratory selected the ITU-T P.1203 (mode 1) quality evaluation model in order to assess the experiments, calculate the Mean Opinion Score (MOS), and better understand the behavior of a set of well-known ABR algorithms in a real-life setting [2]. We will show how we tested and deployed our framework into a cloud infrastructure using a modular architecture. This approach massively increases the number of concurrent experiments and the number of media players that can be evaluated and compared at the same time, thus enabling maximum scalability. In our team's most recent experiments, we used Amazon Web Services (AWS) for demonstration purposes. Another notable feature of CAdViSE that will be discussed here is the ability to shape the test network with arbitrary network profiles. To do so, we used a fluctuating network profile and a real LTE network trace based on the recorded internet usage of a bicycle commuter in Belgium.

CAdViSE produces comprehensive logs for each experimental session. These logs can then be used for different goals, such as objective evaluation, or to stitch media segments back together and conduct subjective evaluations. In addition, startup delays, stall events, and other media streaming defects can be reproduced exactly as they happened during the experimental streaming sessions [3].

A novel approach to testing seamless audio & video playback in CTA WAVE

  • Bob Campbell
  • Yan Jiang

CTA's Web Application Video Ecosystem (WAVE) project aims to improve how internet-delivered video and audio is handled on consumer electronics devices. This paper presents in further detail the mechanisms proposed to automatically verify requirements in the Device Playback Capabilities Specification [1]. Specifically, those requirements include observations that audio and video playback is seamless. A test approach using artefacts applied to the source video and audio will be described, which has proved successful. For video, QR codes are used; for audio, white noise is added and a cross-correlation algorithm applied. The test media with these artefacts applied, and the novel processing applied in a software "observation framework" component, form part of the test environment provided by WAVE to the ecosystem. These open-source tools, together with off-the-shelf hardware, afford the tester a means to compare the original content with a recorded capture from the device under test, which may include mobile devices, smart TVs and other media playback devices, and automatically assert whether the WAVE requirements are met.
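For the audio part, the offset estimation can be illustrated with a short numpy sketch (the WAVE observation framework's actual processing may differ): the white-noise track recorded from the device is cross-correlated against the reference, and a jump in the estimated offset between successive analysis windows indicates a gap or repeat in playback.

    import numpy as np

    def audio_offset_samples(reference: np.ndarray, capture: np.ndarray) -> int:
        """Estimate the lag of the captured audio relative to the reference from
        the peak of their normalized cross-correlation."""
        ref = (reference - reference.mean()) / (reference.std() + 1e-12)
        cap = (capture - capture.mean()) / (capture.std() + 1e-12)
        corr = np.correlate(cap, ref, mode="full")
        return int(np.argmax(corr)) - (len(ref) - 1)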

Video streaming using light-weight transcoding and in-network intelligence

  • Alireza Erfanian
  • Hadi Amirpour
  • Farzad Tashtarian
  • Christian Timmerer
  • Hermann Hellwagner

In this paper, we introduce a novel approach, LwTE, which reduces streaming costs in HTTP Adaptive Streaming (HAS) by enabling light-weight transcoding at the edge. In LwTE, during the encoding of a video segment in the origin server, metadata is generated that stores the optimal encoding decisions. LwTE enables us to store only the highest bitrate plus the corresponding metadata (of very small size) for unpopular video segments/bitrates. Since the metadata is of very small size, replacing unpopular video segments/bitrates with their metadata results in considerable savings in storage costs. The metadata is reused at the edge servers to reduce the time and computational resources required for on-the-fly transcoding.

Marrying WebRTC and DASH for interactive streaming

  • Julia Kenyon
  • Thomas Stockhammer
  • Ali C. Begen
  • Ofer Shem Tov
  • Louay Bassbouss
  • Daniel Silhavy

WebRTC is a set of W3C and IETF standards that allows the delivery of real-time content to users, with an end-to-end latency of under half a second. Support for WebRTC is built into all modern browsers across desktop and mobile devices, and it allows for streaming of video, audio and data. While the original focus of WebRTC has been on videoconferencing, it is increasingly being used today for real-time streaming of premium content because its ultra-low latency features enable several new user experiences, especially those that involve user interactivity, that are not easy to deliver or even possible with the traditional broadcast or streaming delivery protocols. Because of this increasing usage for premium content, the integration of WebRTC with the de facto adaptive streaming protocols such as MPEG's Dynamic Adaptive Streaming over HTTP (DASH) is essential. This paper gives information about the DASH Industry Forum's exploration activity on this very subject.

Efficient bitrate ladder construction for live video streaming

  • Vignesh V Menon
  • Hadi Amirpour
  • Mohammad Ghanbari
  • Christian Timmerer

In live streaming applications, service providers generally use a bitrate ladder with fixed bitrate-resolution pairs instead of optimizing it per title, to avoid the additional latency caused by finding optimum bitrate-resolution pairs for every video content. This paper introduces an online bitrate ladder construction scheme for live video streaming applications. In this scheme, the optimized resolution for each target bitrate is determined from a pre-defined set of resolutions using Discrete Cosine Transform (DCT)-energy-based low-complexity spatial and temporal features for each video segment. Experimental results show that, on average, the proposed scheme yields significant bitrate savings while maintaining the same quality, compared to the HLS fixed bitrate ladder scheme, without any noticeable additional latency in streaming.
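A rough sketch of what a DCT-energy-based complexity feature can look like (the exact block size, weighting and feature definitions used in the paper may differ; this only illustrates why such features are cheap enough for live use):

    import numpy as np
    from scipy.fft import dctn

    def spatial_energy(luma: np.ndarray, block: int = 32) -> float:
        """Average absolute AC energy of block-wise 2-D DCTs of the luma plane,
        a low-complexity proxy for spatial texture/complexity."""
        h, w = luma.shape
        h, w = h - h % block, w - w % block
        energies = []
        for y in range(0, h, block):
            for x in range(0, w, block):
                coeffs = dctn(luma[y:y + block, x:x + block].astype(np.float64))
                coeffs[0, 0] = 0.0  # ignore the DC term
                energies.append(np.abs(coeffs).mean())
        return float(np.mean(energies))

    def temporal_energy(curr: np.ndarray, prev: np.ndarray) -> float:
        """Change in block DCT energy between consecutive frames, a cheap proxy
        for temporal complexity."""
        return abs(spatial_energy(curr) - spatial_energy(prev))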

Standards based end-to-end metadata transport for live production workflows

  • Kent Terry

One of the factors that has driven the rise to prominence of OTT services that deliver content directly to consumers via IP distribution is the increase in the audio and visual quality of the content that they provide. The ability to deliver immersive and personalized audio enabled by next generation audio (NGA) codecs, and 4K/8K high dynamic range video, is one reason consumers recognize these services as delivering the highest quality content. A common requirement to fully enable these advanced audio and video capabilities is the use of rich, dynamic, time-accurate metadata. This type of metadata is also key to enabling new emerging technology, such as VR, and future, not yet defined, technologies that will continue to drive content innovation.

While file-based workflows for scripted and non-live content have added capabilities to utilize rich audio and video metadata in the production and distribution process, support for this type of metadata in live production and distribution has lagged, partly due to the prevalence of legacy audio and video technology with limited metadata capabilities. The move to IP-transport-based methods for live content production provides the opportunity to remove these limitations. Work is in progress to define new standards for metadata transport that not only meet the requirements of current use cases but are also flexible and extensible for future applications.

Work to define metadata transport standards for SMPTE ST 2110 systems, as well as audio metadata standards for AES67 systems, is described. Interoperation with legacy systems, and with file-based formats and workflows, is also considered, and emerging standards in this area are discussed. How these emerging standards fit into a larger vision of "microphone to speaker" audio metadata and "camera to display" video metadata is also described. Particular focus will be given to enabling rich audio metadata in the latest NGA audio codecs such as AC-4.

CMCD at work with real-time, real-world data

  • William Law
  • Sean McCarthy

This study examines some of the first production data obtained from deploying Common Media Client Data (CMCD) into production environments within a global content delivery network (CDN) and a global content distributor. It covers player integrations into Shaka, hls.js and dash.js, details of CDN support, handling of CMCD in a multi-CDN environment by a content distributor and the analysis and interpretation of the returned data.
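For reference, a CMCD query argument as sent by such player integrations can be assembled like this (only a few standardized keys are shown; real integrations send more keys and may use a request header instead of a query argument):

    from urllib.parse import quote

    def cmcd_query(buffer_ms: int, bitrate_kbps: int, throughput_kbps: int,
                   session_id: str) -> str:
        """Build a CMCD query argument with a few standardized keys: bl (buffer
        length, ms), br (encoded bitrate, kbps), mtp (measured throughput, kbps)
        and sid (session id)."""
        payload = (f'bl={buffer_ms},br={bitrate_kbps},'
                   f'mtp={throughput_kbps},sid="{session_id}"')
        return "CMCD=" + quote(payload, safe="")

    # Example request URL for a media segment (host and path are made up):
    url = ("https://cdn.example.com/video/seg_42.m4s?" +
           cmcd_query(buffer_ms=12000, bitrate_kbps=3000,
                      throughput_kbps=8000, session_id="abc-123"))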

Optimizing real-time video encoders with ML

  • Nelson Francisco
  • Julien Le Tanou

The main goal when designing video compression systems is to maximize video quality for a given bitrate (or achieve a target video quality at the lowest possible bitrate), all within well-defined processing resources. Since economic and environmental aspects often place strict constraints on those resources, defining the optimal encoder toolset to maximize compression efficiency within the available computational footprint becomes crucial.

CAMBI: introduction and latest advances

  • Mariana Afonso
  • Joel Sole
  • Lukáš Krasula
  • Zhi Li
  • Pulkit Tandon

Banding is manifested as false contours in otherwise smooth regions of an image or a video. There are many different reasons for banding artifacts to occur; however, one of the most prominent causes is quantization inside a video encoder. Compared to other types of artifacts common in video processing, e.g., blur, ringing, or blockiness, only a relatively small change in the original pixel values can produce an easily noticeable and visually very annoying case of banding. This property makes it very difficult for banding to be captured by generic objective quality metrics such as PSNR or VMAF [10], which brings a need for a distortion-specific detector targeted directly at banding artifacts.

Most of the previous attempts to solve this problem tried to tackle it as false segments or false edges detection. Both block-based [7, 14] and pixel-based [2, 3, 15] segmentation methods have been tried in the first category, while the edge-based methods exploited different local statistics such as gradients, contrast, or entropy [4, 6, 9, 13]. The main difficulty for all of these approaches is distinguishing between the real and false edges or segments. Recently, banding detection has also been addressed by deep neural networks [8].

The above-mentioned approaches have been developed for 8-bit content and are mostly tuned towards the banding artifacts occurring in user-generated images and videos. Moreover, they do not address the potential presence of dithering, an intentionally inserted noise used to randomize the error caused by quantization. Dithering is commonly used during bit-depth conversion and is often enabled by default in popular image and video processing tools, such as ffmpeg [12]. Despite being highly effective in reducing perceived banding, dithering does not suppress the false contours completely and thus needs to be factored into a reliable banding detector.

Our goal was, therefore, to develop an algorithm capable of evaluating perceived banding in professionally generated videos processed in ways relevant to the adaptive streaming scenario (i.e., video compression and scaling). The requirements also included the ability to capture the effect of dithering and to work on both 8-bit and 10-bit content.

We hereby present CAMBI, a Contrast Aware Multiscale Banding Index. CAMBI is a white-box solution to the above-described problem, derived from basic principles of human vision, with just a few perceptually motivated parameters. The first version was introduced at PCS'2021 [11]. Here, we also present several improvements made since then.

There are three main steps in CAMBI: input preprocessing, multiscale banding confidence calculation, and spatio-temporal pooling. Although it has been shown that chromatic banding exists [5], like most past works, we assume that most of the banding can be captured in the luma channel. The preprocessing step, therefore, consists of luma channel extraction followed by filtering to account for dithering and a spatial mask computation to exclude regions with textures. Banding confidence is calculated for 4 brightness level differences at 5 scales, taking into account the contrast perception of the human visual system. This creates 20 banding confidence maps per frame, which are pooled spatially considering only a certain percentage of the highest banding confidence values. This mechanism ensures that even banding appearing in a relatively small area of the frame is captured proportionally to its perceptual importance. Finally, the scores from different video frames are pooled into a single banding index.
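The pooling stage can be sketched as follows (the fraction kept and the max-over-maps combination are our assumptions for illustration, not CAMBI's exact parameters, and the per-scale maps are assumed to have been brought to a common resolution):

    import numpy as np

    def pool_banding_index(confidence_maps_per_frame, top_fraction=0.2):
        """Per frame, combine the per-level/per-scale confidence maps, keep only
        the highest-confidence pixels so that banding in a small region still
        counts proportionally to its visibility, then average over frames."""
        frame_scores = []
        for maps in confidence_maps_per_frame:          # list of 2-D arrays per frame
            combined = np.max(np.stack(maps), axis=0)   # strongest evidence per pixel
            values = np.sort(combined.ravel())[::-1]
            keep = max(1, int(len(values) * top_fraction))
            frame_scores.append(values[:keep].mean())
        return float(np.mean(frame_scores))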

To test the accuracy of CAMBI, we conducted a subjective test on 86 video clips created from 9 different sources from the Netflix catalog using different levels of compression and scaling, with and without dithering. The ground-truth mean opinion scores (MOS) were obtained from 26 observers who were asked to rate the annoyance of the banding in the scene on the continuous impairment scale annotated with 5 equidistant labels (imperceptible, perceptible but not annoying, slightly annoying, annoying, very annoying) [1]. CAMBI achieved a correlation exceeding 0.94 in terms of both the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank Order Correlation Coefficient (SROCC), significantly outperforming state-of-the-art banding detectors for our use case.

CAMBI is currently used alongside VMAF in our production pipeline to improve the quality of encodes prone to banding. In the future, we plan to integrate it into VMAF as one of its features, making VMAF capable of accurately evaluating video quality in the presence of banding as well as other artifacts.

SCTE-224 and channel variants for a streamlined content delivery

  • Walid Hamri
  • Jean Macher

Like most video consumed today, live sports is available as a streaming app that runs on any device and offers a modern user experience. Sports rights still have to be enforced in this "watch anywhere" environment and require more sophisticated rules than the dreaded blackouts of legacy cable television.

ANSI/SCTE 224 [1], or ESNI (Event Scheduling and Notification Interface), defines a web interface for programmers to distribute rich schedules and content policies to affiliates or distribution platforms. Applied to regional sports, SCTE-224 can be used to define linear channel variants intended for specific audiences. A streaming video platform can then exploit the SCTE-224 data to create and deliver the correct linear content to viewers.

This paper will explain how SCTE-224 and the creation of channel variants were implemented for the Bally Sports services using a single programming workflow for IP, satellite and OTT delivery. We will also cover how server-side ad insertion is combined with channel-variant delivery to provide regionalization plus monetization at scale for one of the biggest sports streaming platforms in production today.

Evaluation of MPEG-5 part 2 (LCEVC) for live gaming video streaming applications

  • Nabajeet Barman
  • Steven Schmidt
  • Saman Zadtootaghaj
  • Maria G Martini

This paper presents an evaluation of the latest MPEG-5 Part 2 Low Complexity Enhancement Video Coding (LCEVC) for live gaming video streaming applications. The results are presented in terms of both objective and subjective quality measures. Our results indicate that LCEVC outperforms both x264 and x265 codecs in terms of bitrate savings using VMAF. Using subjective results, it is found that LCEVC outperforms the respective base codecs, especially for low bitrates. This effect is much more dominant for x264 as compared to x265, with marginal absolute improvement of quality scores for x265.

Session-based DASH streaming: a new MPEG standard for customizing DASH streaming per session

  • Iraj Sodagar
  • Alex Giladi

The MPEG DASH standard is widely deployed in over-the-top streaming services. The standard defines two key components: a manifest format to describe the presentation and a set of segment formats to describe the media segments. While DASH's manifest format, the Media Presentation Description (MPD), provides an extensive set of tools to describe the presentation timeline, this document is usually created for a large set of DASH clients and can therefore be cached in CDNs for a large population. If the MPD needed to be customized per client, the cache efficiency of storing a single MPD for all clients would be lost. Recently the MPEG Systems Working Group (ISO/IEC/SC29/WG3) developed a new standard that allows an MPD to be customized at each client using an external document and a set of processing rules. The first version of the Session-Based DASH standard (ISO/IEC 23009-8) was recently finalized and will be published in the upcoming months.

ISO/IEC 23009-8 defines three components: 1) a Session-Based Document (SBD), which defines the MPD customization rules for a client for a given session, 2) a method of referencing the external SBD in the DASH MPD, and 3) a processing model for the client-side processing of SBDs. The SBD defines a post-processing procedure to customize each URL generated by the DASH client from an MPD. Before the DASH client requests to download a resource using that URL, a process described in the SBD document is applied to the URL. The process can customize different parts of the URL, i.e. the host, path, port, and query parts, using a template matching technique. The customization can be time dependent, i.e. based on the point in the media timeline to which the URL corresponds, or order dependent, i.e. based on the position of the URL in the request order. The result is a customized URL per client/session/URL, produced from the URL originally generated by the DASH client from the MPD.
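
As a loose illustration of this post-processing step, here is a minimal sketch of template-like URL customization applied between MPD parsing and the HTTP request; the rule dictionary, function name, and field names are illustrative assumptions and do not follow the normative SBD syntax of ISO/IEC 23009-8.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def apply_sbd_rule(url, rule, media_time=None):
    """Rewrite one URL generated by the DASH client according to a
    simplified session-based rule (illustrative format, not the SBD schema):

      {"active": (0, 60),                       # media-time window in seconds
       "set_host": "edge-7.cdn.example.com",    # replace the host part
       "set_query": {"session": "abc123"}}      # add/replace query parameters
    """
    start, end = rule.get("active", (float("-inf"), float("inf")))
    if media_time is not None and not (start <= media_time < end):
        return url  # rule not active at this point on the media timeline

    scheme, host, path, query, frag = urlsplit(url)
    if "set_host" in rule:
        host = rule["set_host"]
    params = dict(parse_qsl(query))
    params.update(rule.get("set_query", {}))
    return urlunsplit((scheme, host, path, urlencode(params), frag))

# A segment URL from the MPD, customized for one client/session:
print(apply_sbd_rule("https://cdn.example.com/video/seg_0012.m4s",
                     {"active": (0, 60),
                      "set_host": "edge-7.cdn.example.com",
                      "set_query": {"session": "abc123"}},
                     media_time=24.0))
```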

The ISO/IEC 23009-8 standard defines an architecture for session-based DASH streaming that has several benefits. 1) On the content creation side, it separates client-based and session-based customization from the MPD and therefore maintains MPD caching efficiency while allowing customization. It also enables producing the customization after the MPD has been packaged, which means it can be added to current workflows as a post-processing step. 2) On the client side, it allows the SBD client to be implemented as a separate and independent process from the DASH client, so it can be added to current clients as a separate process. Furthermore, the SBD processing can occur on the device or at a different network entity such as an application server. 3) On the content distribution side, the SBD creation or customization can occur at different nodes of the network: at the origin server, at distribution centers and CDNs, or even at home network gateways. The standard also allows multiple SBDs to be applied to the URLs of an MPD, enabling the customization to be requested by one or multiple entities in the ingest or distribution chain.

In this paper, we first describe the session-based DASH streaming architecture. Then the features of the SBD standard are outlined, including template-based customization, key-pair replacement, and the possibility of replacing various parts of a URL. Next, the SBD client processing model is described, along with how the SBD client can be implemented on the device or as a separate network entity. Finally, we demonstrate a forensic watermarking application built on SBD and compare the efficiency of watermarking using the SBD standard with per-client/session MPD customization.

Network congestion control and its impact on video streaming QoE

  • Ravid Hadar
  • Michael Schapira

Congestion control plays a crucial role in Internet-based content delivery. Congestion control brings order to the Internet's crowded traffic system by sharing the scarce network bandwidth between competing services and users. Congestion control algorithms continuously modulate the rate at which data packets are injected into the network by traffic sources in response to network conditions.

Congestion control immensely impacts quality of experience (QoE) for services like video streaming, video conferencing, and cloud gaming; sending packets too slowly prevents supporting high video quality (HD/UHD); sending too fast can overwhelm the network, resulting in data being lost or delayed, leading to phenomena such as video rebuffering.

While congestion control has been a key focus for both academic and industrial research for decades, the exact correlation between the performance of the congestion control algorithms employed by video servers and the QoE experienced by video clients remains poorly understood. We will report on our experimental results along these lines.

We evaluated and contrasted three dominant congestion control schemes: TCP Cubic [3], which is the default for many operating systems, and two recently proposed congestion control schemes, namely, Google's Bottleneck-Bandwidth-and-RTT (BBR) [1] protocol, and Performance-oriented Congestion Control (PCC) [2].

Our experimental setup consisted of a video cache that sends HTTP-based video traffic across an emulated network environment towards a video client. We considered both MPEG-DASH and HLS-based video streaming over both wired and wireless networks. We ran multiple experiments for varying network conditions (e.g., available bandwidth, non-congestion-related packet loss, network latency, and depth of in-network buffers).

By monitoring the behavior of the congestion controller and examining the QoE data from the video player (e.g., video start-time, average bitrate, rebuffering ratio, etc.), we have been able to draw meaningful conclusions. Specifically, our results shed light on the features of network-level performance that most impact user-perceived QoE, quantify the benefits for performance of employing modern congestion control protocols, and provide insights into the interplay between congestion control, the network environment, and the video player.
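
For concreteness, the following is a minimal sketch of how the player-side QoE indicators mentioned above (start-time, average bitrate, rebuffering ratio) could be derived from a hypothetical player event log; the event format and field names are assumptions, not the actual instrumentation used in the experiments.

```python
def summarize_qoe(events, session_length_s):
    """Derive simple QoE indicators from a hypothetical player event log.

    events: list of dicts such as
      {"t": 0.9,  "type": "startup_complete"}
      {"t": 12.3, "type": "rebuffer", "duration": 1.8}
      {"t": 4.0,  "type": "segment", "bitrate_kbps": 4500, "duration": 4.0}
    """
    start_time = next((e["t"] for e in events if e["type"] == "startup_complete"), None)
    rebuffer_s = sum(e["duration"] for e in events if e["type"] == "rebuffer")
    segments = [e for e in events if e["type"] == "segment"]
    played_s = sum(e["duration"] for e in segments)
    avg_bitrate = (sum(e["bitrate_kbps"] * e["duration"] for e in segments) / played_s
                   if played_s else 0.0)
    return {"start_time_s": start_time,
            "avg_bitrate_kbps": avg_bitrate,
            "rebuffering_ratio": rebuffer_s / session_length_s}
```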

A standards-based framework for real-time media in immersive scenes

  • Imed Bouazizi
  • Thomas Stockhammer

Immersive media experiences are anticipated to become the norm in entertainment and communication in the near future, enabled by advances in computer graphics, capture and display systems, and networking technology. Immersive experiences are based on a rich 3D scene that enables immersion, fusion with the real world, and rich interactivity. However, 3D scenes are large, rich and complex, and hence are stored and processed not only on devices but also on cloud systems. MPEG is currently working on specifying a set of functionalities that address different aspects of immersive media, including formats, access and delivery, and compression of these emerging media types.

The scene description standard, as defined in part 14 of the MPEG immersive standard family [1], provides the entry point and glue for such immersive experiences. The key design principle of the architecture behind it was to separate media access from rendering. The scene description standard achieves this by defining a separate Media Access Function (MAF) and the API to access it. The MPEG-I scene description reference architecture is depicted in Figure 1.

The MAF receives instructions from the presentation engine about the media referenced in the scene. It uses this information to establish the proper media pipelines to fetch the media and pass it in the desired format to the presentation engine for rendering. The request for media also includes information about the current viewer's position as well as the scene camera position and intrinsic parameters. This enables the MAF to implement a wide range of optimization techniques, such as adapting the retrieved media to the network conditions based on the viewer's position and orientation with regard to the object to be fetched. These adaptations may include partial retrieval, access at different levels of detail, and adjustment of quality. In this paper, we describe the architecture for immersive media and the functionality performed by the MAF to optimize the streaming of immersive media. We discuss the different adaptation options based on a selected set of MPEG formats for 3D content (i.e. video textures, dynamic meshes, and point clouds). We describe possible designs of such adaptation algorithms for real-time media delivery using the example of immersive conferencing.
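
To make the adaptation idea more tangible, here is a minimal sketch of how a MAF-like component might pick a level of detail for one scene object from the viewer pose and an estimate of the available throughput; the data layout, thresholds, and function name are illustrative assumptions, not the MPEG-I scene description API.

```python
import math

def select_lod(viewer_pos, object_pos, lods, throughput_kbps):
    """Pick a level of detail (LOD) for one scene object.

    lods: representations sorted from coarse to fine, e.g.
      [{"name": "low",  "bitrate_kbps": 800,  "max_distance": None},
       {"name": "mid",  "bitrate_kbps": 3000, "max_distance": 10.0},
       {"name": "high", "bitrate_kbps": 9000, "max_distance": 3.0}]
    A finer LOD is chosen only if the viewer is close enough to notice it
    and the network can sustain its bitrate.
    """
    distance = math.dist(viewer_pos, object_pos)
    choice = lods[0]  # always fall back to the coarsest representation
    for lod in lods:
        close_enough = lod["max_distance"] is None or distance <= lod["max_distance"]
        fits_network = lod["bitrate_kbps"] <= throughput_kbps
        if close_enough and fits_network:
            choice = lod
    return choice
```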

Revisiting Bjontegaard delta bitrate (BD-BR) computation for codec compression efficiency comparison

  • Nabajeet Barman
  • Maria G Martini
  • Yuriy Reznik

Bjontegaard Delta Bitrate (BD-BR), proposed in 2001, remains one of the most widely used, and also most misunderstood, tools for computing and comparing the compression efficiency of two or more video codecs. This paper presents three different studies evaluating different open-source implementations, extensions and alternatives on two different datasets, considering three objective quality metrics (PSNR, SSIM and VMAF) as well as subjective quality ratings.
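
For reference, a minimal sketch of the classic cubic-fit BD-BR computation is given below; the open-source implementations compared in the paper differ in details such as piecewise cubic interpolation versus a global polynomial fit, which is exactly the kind of discrepancy the studies examine, and the rate/quality points in the example are made up.

```python
import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Bjontegaard delta bitrate (in percent) using the classic third-order
    polynomial fit of log-rate against quality. Negative values mean the
    test codec needs fewer bits for the same quality."""
    p_anchor = np.polyfit(quality_anchor, np.log(rate_anchor), 3)
    p_test = np.polyfit(quality_test, np.log(rate_test), 3)
    lo = max(min(quality_anchor), min(quality_test))   # overlapping quality range
    hi = min(max(quality_anchor), max(quality_test))
    area_anchor = np.polyval(np.polyint(p_anchor), hi) - np.polyval(np.polyint(p_anchor), lo)
    area_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (np.exp((area_test - area_anchor) / (hi - lo)) - 1.0) * 100.0

# Example with made-up rate (kbps) / PSNR (dB) operating points:
print(bd_rate([1000, 2000, 4000, 8000], [34.0, 36.5, 38.8, 40.6],
              [ 900, 1800, 3600, 7200], [34.2, 36.8, 39.1, 40.9]))
```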

Low encoding overhead ultra-low latency streaming via HESP through sparse initialization streams

  • Pieter-Jan Speelmans

HESP, the High Efficiency Streaming Protocol [4], realizes ultra-low latencies and ultra-short start-up times by combining two feeds, the keyframe-only Initialization Stream and the ultra-low latency CMAF-CTE Continuation Stream. HESP uses a keyframe from the Initialization Stream to start playback (via keyframe injection) of the Continuation Stream extremely close to the live edge. In previous research [5], the impact of the HESP keyframe injection on the video quality has been shown to be very low or even negligible. In contrast to trivially double-encoding each quality in the bitrate ladder, in this paper we show that the overhead of generating the keyframe-only Initialization Streams can be reduced. We designed an approach in which the frequency of keyframes in the Initialization Streams is defined by a trade-off between the encoding overhead and two viewing QoE metrics: the start-up time and the time it takes to switch to the highest feasible video quality of the ABR ladder. More specifically, for each quality Qi, fi is defined such that (i) switching to Qi, either at start-up or when switching up to Qi as a higher quality, takes [EQUATION] additional delay, and (ii) there is always a Qi lower than Qcurrent (unless Qcurrent is the lowest quality) to which the player can switch down instantly, which is needed in case of network problems. The resulting impact on viewer QoE is characterized by occasional (whenever an ABR switch to a higher quality is needed) short intervals [EQUATION] during which playback is potentially done at a lower than feasible video quality. Based on measurements, the proposed approach results in an Initialization Stream encoding overhead of only 15 to 20%. Compared to "standard" HESP, the reduction in viewer QoE is hardly noticeable.

Extend CMAF usage for large scale video delivery

  • Lucas Gregory
  • Khaled Jerbi
  • Mickael Raulet
  • Eric Toullec

During the last decade, immense momentum drove HTTP adaptive streaming to become the key protocol for video streaming. This keen interest can be explained by the increasing bandwidth of internet connections and the tremendous progress of mobile networks with the emergence of LTE technologies. Moreover, using a TCP-based protocol is very attractive for both developers and end users, as it is supported by most connected devices and browsers and it significantly reduces the handling of network-related issues.

Among several protocols, the market has been dominated by two: MPEG-DASH and HTTP Live Streaming. This situation presented a major drawback, as content providers needed to package the same content twice to cover most users. The obvious solution was to unify the packaging, and that is what happened with the advent of the playlist-agnostic container format called the Common Media Application Format (CMAF) [4]. In addition to reducing the number of stored segments for OTT support, CMAF improved versatility and interoperability with modern technologies. It has also provided solutions for metadata carriage and for significantly reducing OTT latency, which used to be the bottleneck of the technology.

Things will not stop here for CMAF: recently, in March 2020, an ingest protocol based on CMAF was published by the CMAFIF and the DASH-IF, and it was revised during 2021 [2]. This protocol initially aimed at defining a push-based communication between an encoder and a receiving entity such as a just-in-time packager or a Content Delivery Network, but it turned out to be a game changer for putting CMAF everywhere. Indeed, today's first-mile delivery mainly uses MPEG-2 TS and protocols like SRT or ZIXI for B2B headends, and this form of encapsulation involves a loss of information, as specific sub-protocols are needed to carry metadata such as advertising, thumbnails, and timed text over MPEG-2 TS. This loss can be avoided by carrying the complete content in ISO BMFF [3]. Moreover, the ingest protocol allows going beyond the classic flow and makes CMAF a robust communication format at large scale between any entities exchanging video streams, whether between encoders and transcoders, between encoders and storage entities, between storage and a transcoder, or even between a playout and a transcoder.
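
As a rough illustration of the push-based flow, here is a minimal sketch of an encoder-side sender that pushes a CMAF track (init segment, then chunks) to a receiving entity over HTTP; the URL, file layout, and single-POST structure are simplifying assumptions and do not reproduce the exact message exchange defined by the ingest specification.

```python
import requests

def push_cmaf_track(ingest_url, init_path, chunk_paths):
    """Push one CMAF track to a receiving entity (packager, origin, ...):
    the init segment first, then the CMAF chunks as they become available,
    using HTTP chunked transfer encoding for the long-running upload."""
    with open(init_path, "rb") as f:
        requests.post(ingest_url, data=f.read(),
                      headers={"Content-Type": "video/mp4"}).raise_for_status()

    def chunk_stream():
        for path in chunk_paths:          # in a live encoder this generator
            with open(path, "rb") as f:   # would be fed by the encoding loop
                yield f.read()

    # Passing a generator makes the request use chunked transfer encoding.
    requests.post(ingest_url, data=chunk_stream(),
                  headers={"Content-Type": "video/mp4"}).raise_for_status()
```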

In this paper we present the CMAF ingest protocol and its key features. We also detail the benefits of using this technology compared to other existing formats in terms of redundancy, metadata carriage and support of recent codecs and timed text. Then, we present the evolution of CMAF, thanks to the ingest protocol, to handle video transmission at large scale in a fully standardized fashion. Finally, we show a concrete implementation architecture of communication between an encoder and a smart packager using timed text and timed metadata tracks for SCTE-35 carriage [1, 5].

Performance-under-privacy: delivering commercial streaming content in a privacy-first world

  • William Law

The advent of MASQUE-based double-proxy solutions, along with Virtual Private Networks (VPNs) and Oblivious DNS over HTTPS (ODoH), offers a welcome improvement in consumer privacy. This presentation examines how these solutions work in the context of adaptive segmented streaming, how they impact commercial content distribution, and how operators can optimize for performance-under-privacy to ensure that additional user privacy does not come at the cost of degraded Quality of Experience (QoE).

Novel temporal masking framework for perceptually optimized video coding

  • Dan Grois
  • Alex Giladi
  • Praveen Kumar Karadugattu
  • Niranjankumar Balasubramanian

The development of the 1st edition of HEVC by the Joint Collaborative Team on Video Coding (JCT-VC) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) was officially finalized in January 2013 [11], achieving a significant bitrate reduction of roughly 50% for substantially the same visual quality when compared to its predecessor [6, 7, 14]. The development process of HEVC was driven by the most recent scientific and technological achievements in the video coding field.

In turn, video applications continue to gain traction and are in enormous demand [9]. A very significant increase in bandwidth requirements is expected by 2023, particularly due to the increase in the resolution supported by devices. It is expected that 66% of connected flat-panel TV sets will support the Ultra-High Definition (UltraHD) resolution, compared to only 33% in 2018 (note that "UltraHD" in this paper refers to the 3840×2160 resolution, also known as 4K or 2160p). The typical bitrate for 60fps 4K HDR10 video is between 15 and 24 Mbps [9], which is nearly four times the typical High-Definition (HD) video bitrate. In addition, overall IP video traffic [8] is expected to grow to 82% of overall Internet traffic by 2022, and about 21% of this traffic is expected to be UltraHD.

As a result, there is a continuous strong need to further decrease video transmission bitrate, especially for the UltraHD content, substantially without reducing the perceptual visual quality.

One promising approach for increasing video coding gain is applying "visual masking", which is based on a very interesting phenomenon observed in the human visual system (HVS) [1, 16]. According to this phenomenon, when two or more stimuli are presented sequentially to a viewer, one stimulus acts as a target that has to be detected and described, while the other stimuli mask the visibility of that target [1]. In this regard, a good amount of research has been carried out in the video compression field, for example [2], which exploits the above-mentioned phenomenon by providing a psycho-visual algorithm that has been implemented in the x264 encoder [4].

In turn, more advanced studies building on this approach are further presented and discussed in [2] and [3].

In addition, in the most recent work, such as [15], it is proposed to mask temporal activities that are unnoticeable by the human visual system by using a masking coefficient. Further, [12] presents a video just noticeable difference (JND) scheme employing compound spatial and structure-based temporal masking, further measuring a JND threshold for each transform coefficient of a color video. Also, [17] proposes an improved transform-based JND estimation model considering multiple masking effects. However, all surveyed existing visual masking approaches, the most interesting of which are indicated above, lead to relatively low bitrate savings. As a result, these approaches have not been adopted in the video streaming/coding industry to date. In addition, the computational complexity of existing visual masking schemes is relatively high due to the utilization of relatively complex quantization models [2].

The above-mentioned drawbacks are overcome in this work by providing a novel joint backward and forward temporal masking framework, which considers temporal distances between video frames along with closest scenecuts, and dynamically adjusts a plurality of masking parameters, such as masking window size and duration as well as corresponding quantization offsets for each video frame.
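
As a simplified illustration of the idea (not the tuned parameters or the actual x265 integration), the sketch below derives a per-frame quantizer offset from the temporal distance to the nearest scenecut, using separate backward and forward masking windows; window lengths and the maximum offset are assumptions.

```python
def masking_qp_offsets(frame_count, scenecut_frames, fps,
                       backward_window_s=0.2, forward_window_s=0.2,
                       max_offset=6):
    """Return a QP offset per frame: frames temporally close to a scenecut
    (inside the backward or forward masking window) are quantized more
    coarsely, since temporal masking makes distortions there less visible."""
    offsets = [0] * frame_count
    for i in range(frame_count):
        t = i / fps
        for sc in scenecut_frames:
            dt = t - sc / fps             # negative: before the cut, positive: after
            if -backward_window_s <= dt < 0:
                weight = 1.0 - abs(dt) / backward_window_s
            elif 0 <= dt <= forward_window_s:
                weight = 1.0 - dt / forward_window_s
            else:
                continue
            offsets[i] = max(offsets[i], round(max_offset * weight))
    return offsets
```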

The proposed framework has been implemented in the popular x265 open-source HEVC encoder [5]. With that said, the framework is codec-independent and can be applied to other encoders and video coding standards. Different backward and forward masking time periods and quantizer behaviors are investigated to determine exact time periods for which temporal masking substantially does not impact video quality, as perceived by the human visual system (HVS).

The Double Stimulus Impairment Scale (DSIS) Variant II methodology [10, 13] was selected for conducting extensive subjective quality assessments to evaluate the benefits and advantages of the proposed scenecut-aware quantization framework.

The subjective results showed significant bitrate savings of up to about 26%, while achieving substantially the same perceived visual quality and maintaining relatively low computational complexity. In addition, one of the important findings is that the proposed joint backward and forward temporal masking framework tends to perform better for higher bitrates and frame rates, as well as for content that includes textures, such as water and snow.

Deploying the ITU-T P.1203 QoE model in the wild and retraining for new codecs

  • Werner Robitza
  • Rakesh Rao Ramachandra-Rao
  • Steve Göring
  • Alexander Dethof
  • Alexander Raake

This paper presents two challenges associated with using the ITU-T P.1203 standard for video quality monitoring in practice. We discuss the issue of unavailable data on certain browsers/platforms and the lack of information within newly developed data formats like Common Media Client Data. We also re-trained the coefficients of the P.1203.1 video model for newer codecs, and published a completely new model derived from the P.1204.3 bitstream model.

Machine learning assisted real-time DASH video QoE estimation technique for encrypted traffic

  • Raza Ul Mustafa
  • Christian Esteve Rothenberg

With the recent rise of video traffic, it is imperative to ensure Quality of Experience (QoE). The increasing adoption of end-to-end encryption hampers any payload inspection method for QoE assessment. This poses an additional challenge for network operators that want to monitor the DASH video QoE of a user, which by itself is tricky due to the adaptive behaviour of HTTP Adaptive Streaming (HAS) mechanisms. To tackle these issues, we present a time-slot (window) QoE detection method based on network-level Quality of Service (QoS) features for encrypted traffic. The proposed method continuously extracts relevant QoE features for HAS from the encrypted stream in a real-time fashion, basically packet size and arrival time in time-slots of (1,2,3,4,5) seconds. We then derive Inter Packet Gap (IPG) metrics from the arrival times, resulting in three recursive flow features (EMA, DEMA, CUSUM), to estimate the objective QoE following the ITU-T P.1203 standard. Finally, we compute (packet size, throughput) distributions at the (10-90)-percentiles within each time-slot, along with other QoS features such as throughput and total packets. The proposed QoS features are lightweight and do not require any chunk-detection approach to estimate QoE, significantly reducing the complexity of the monitoring approach and potentially improving generalization to different HAS algorithms. We feed the QoS features into different Machine Learning (ML) classifiers to yield a QoE category (Less QoE, Good, Excellent) based on bitrate, resolution and stalls. We achieve an accuracy of 79% in predicting QoE across all ABS algorithms. Our experimental evaluation framework is based on the Mininet-WiFi wireless network emulator replaying real 5G traces. The obtained results validate the proposed methods and show high accuracy of QoE estimation for encrypted DASH traffic.
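
To illustrate the kind of per-window feature extraction described above, here is a minimal sketch that computes IPG-based EMA/DEMA/CUSUM features and packet-size percentiles from (timestamp, size) packet traces; the smoothing constants, the CUSUM reference, and the feature names are assumptions for this sketch, and the resulting feature rows would then be fed to an off-the-shelf classifier.

```python
import numpy as np

def window_features(timestamps, sizes, window_s=1.0, alpha=0.3):
    """Per-time-slot features from an encrypted packet trace: inter-packet-gap
    (IPG) EMA/DEMA/CUSUM plus packet-size percentiles, total packets and
    throughput. One dict is produced per time slot."""
    timestamps = np.asarray(timestamps, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    rows, t0 = [], timestamps[0]
    while t0 < timestamps[-1]:
        mask = (timestamps >= t0) & (timestamps < t0 + window_s)
        ts, sz = timestamps[mask], sizes[mask]
        t0 += window_s
        if len(ts) < 2:
            continue
        ipg = np.diff(ts)
        ema = dema = ipg[0]
        cusum = 0.0
        for gap in ipg[1:]:
            ema = alpha * gap + (1 - alpha) * ema       # exponential moving average
            dema = alpha * ema + (1 - alpha) * dema     # double EMA
            cusum = max(0.0, cusum + gap - ipg.mean())  # one-sided cumulative sum
        percentiles = np.percentile(sz, range(10, 100, 10))
        rows.append({"ema": ema, "dema": dema, "cusum": cusum,
                     "total_packets": int(len(ts)),
                     "throughput_kbps": 8 * sz.sum() / window_s / 1000,
                     **{f"size_p{p}": v
                        for p, v in zip(range(10, 100, 10), percentiles)}})
    return rows
```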

Update on the emerging versatile video coding (VVC) standard and its applications

  • Benjamin Bross
  • Mathias Wien
  • Jens-Rainer Ohm
  • Gary J. Sullivan
  • Yan Ye

Finalized in July 2020, the Versatile Video Coding (VVC) standard has begun moving beyond the abstract world of standardization into a diverse range of systems and products [5--7]. This presentation will provide updated information about this important standard, including new information about recent developments since completion of this major standardization project.

The standard was developed by an ITU-T/ISO/IEC Joint Video Experts Team (JVET) of the ISO/IEC MPEG and ITU-T VCEG working groups and has been formally approved and published as Recommendation H.266 by ITU-T and as International Standard ISO/IEC 23090-3 by ISO and IEC. Verification Testing of the capabilities of VVC has confirmed its major benefit in compression capability over previous standards for several key types of video content, including emerging applications such as ultra-high resolution and high dynamic range usage, screen content sharing and 360° immersive VR/AR/XR applications [8--10].

The presentation will include the discussion of these developments and also information on

• Open-source and other software availability and its uses

• Recent and upcoming deployments of products and services using VVC

• Incorporation of VVC into system environments and related standards

• A new second edition of VVC including an extension for high bit rate and high bit depth applications

• Metadata support in VVC using the new VSEI standard

• Explorations in JVET for potential future video coding technology beyond VVC

Additional resources for further information will also be provided (e.g., [1--4]).

VideoAI: realizing the potential of media analytics

  • Faisal Ishtiaq
  • Kerry Zinger

In this work, we describe a unique system and solution, VideoAI, that allows Comcast to rapidly create Artificial Intelligence (AI) and Machine Learning (ML) based media analytics experiences. The unprecedented growth of Artificial Intelligence, Machine Learning and Deep Learning (DL) has enabled a new level of insight into video content, from detecting faces and objects to understanding sentiment and emotions. However, many of today's solutions are highly customized to a particular use case and do not easily scale when applied to others.

At Comcast we realized that fully harnessing the power of the latest media analytics techniques, from content creation to consumption, requires a bottom-up approach. Rather than highly customized solutions, we have developed a growing suite of AI/ML capabilities that can be rapidly reused, repurposed, and deployed into a unified solution we call VideoAI. This can be used to quickly develop new media-analytics-powered solutions across our media ecosystem.

VideoAI analyzes live/linear streams and files to generate temporal metadata that describes, moment by moment, what is happening in the video. It scans video, audio, and closed captions, along with other signals, to generate time-indexed, enriched tags describing what is happening in the video. Using this approach, we are able to create solutions that change the way we do Dynamic Ad Insertion (DAI) and that enable binge watching, automatic chaptering, metadata enhancement, and much more. Leveraging the power of this framework, the machine learning algorithms have processed millions of hours of content, which has improved the accuracy and robustness of the detectors and ensemble approaches.

In the proposed presentation we will describe in greater detail the algorithmic and systematic approach used to harness the power of AI/ML/DL in a reconfigurable and reusable way. The technical benefits of this approach are minimized training cycles, more efficient use of compute cycles, and algorithmic improvements across use cases. We will also describe a set of applications enabled by VideoAI, including linear and VOD segmentation and metadata enrichment for advertising.

Overview of the DASH-HLS interoperability specification: 2021 edition

  • Zachary Cava

While CMAF has provided the foundation for the interoperable packaging of streaming media, today it is still common practice to produce media specific to the delivery formats utilized by a service provider. As DASH and HLS are the delivery formats the industry has converged towards, a survey of deployments for DASH and HLS revealed two leading reasons for divergent packaging: media packaging requirements that were misaligned across formats and a non-trivial amount of tribal knowledge required to address media for common deployment use-cases in each format.

To address the divergence of CMAF-packaged media in DASH and HLS, the CTA WAVE project created a working group, the DASH-HLS Interoperability group, responsible for researching and transcribing the additional packaging and delivery format requirements necessary to achieve interoperability. Using industry guidance, the group defined a set of common streaming use-cases and published the interoperability details for the first four use-cases in the 2021 Edition of the DASH-HLS Interoperability Specification (CTA-5005) [1]. The use-cases in this edition are: Basic On-Demand and Live Streaming, Low Latency Live Streaming, Encrypted Media Presentations, and Presentation Splicing.

This talk will provide an overview of the specification outputs for these initial use-cases including the defined packaging and addressing requirements and any identified missing interoperability points that represent opportunities for further research. Beyond the current specification, this talk will highlight the new use-cases and work currently being prioritized for the next edition and how interested entities can get involved with the development.

Behind the scene: delivering 4K olympics games

  • Derik Yarnell
  • Grant McGilvray

The Tokyo Olympic Games offered NBCUniversal and Comcast an opportunity to make a substantial leap forward in live, localized UHD broadcasts. The result was a radically different workflow, delivering Olympics content from 50+ local NBC affiliate stations in stunning 4K HDR with next generation audio. The solution shined again, a mere 6 months later, when Beijing hosted the 2022 Winter Olympic games.

Built on a new software-defined IP workflow, NBC was able to dynamically switch UHD simulcast content into locally up-converted affiliate signals. The result was nationwide UHD programming live from venue, while at the same time protecting local news and advertising inventory from each of the NBC affiliate stations.

When combined with the Comcast X1 platform, and delivered over the Comcast network, Comcast X1 UHD viewers were able to seamlessly transition into the UHD version of their local NBC station with Dolby Vision and Dolby Atmos.

Hybrid linear encoding: also known as remote worker

  • Davide Gandino

Lately, linear content encoding and packaging have been moving to the public cloud in order to maximize flexibility and speed of deployment. On the other hand, companies usually do have spare capacity in their data centers. The solution described here allows using that capacity while relying on the public cloud for resiliency / high availability, or simply to temporarily absorb the linear encoding workload when needed. Hybrid clusters let you use your data-center resources and your public cloud resources in a shared way, seamlessly orchestrating the workload where it is most appropriate based on business and technical decisions.

Incapable capabilities - less is more: improving user experiences and interoperability for streaming media services

  • Thomas Stockhammer
  • Cyril Concolato

This document reviews existing functionalities for media capability mechanisms in the streaming media space. It shows the multitude of existing functionalities and provides an overview of what is used in practice. We provide recommendations to implementers and industry fora on how media capability signalling can be improved, including a focused (extensible, yet compact) model for signalling relevant capabilities, a simple mapping to device APIs, and support for the correct implementation of capabilities on devices.

Quality assessment of video with film grain

  • Kai Zeng
  • Hojatollah Yeaganeh
  • Zhou Wang

Film grain noise originally arises from small metallic silver particles on processed photographic celluloid. Although modern digital video acquisition systems are capable of largely reducing noise, sometimes to nearly invisible levels, the look of cinematic film grain has not gone away. Instead, content creators often purposely introduce simulated film grain in post-production to emulate dust in the environment, enrich texture details, and develop a certain visual tone. Despite the artistic benefits, film grain has posed significant challenges to video delivery systems. Compressing and transmitting videos containing film grain noise is extremely costly due to the large number of bits required to encode noisy pixels of much higher entropy than the typical visual content of the scene. Heavy compression may remove film grain, but it may also remove meaningful texture content in the visual scene or deteriorate the artistic effect of the creator's intent. Film grain also poses major challenges to quality control of video delivery systems, for which film-grain-aware fidelity measures are highly desirable for measurement and optimization purposes. Here, after describing the characteristics of film grain and its impact on video quality, we present a novel framework that unifies natural video quality assessment and creative-intent-friendly video quality assessment. We also demonstrate an instantiation of the framework in the context of film-grained content in terms of predicting the perception of different groups of subjects.

Improving content discovery and viewer engagement with AI

  • Martin Prins

We are quickly heading towards the golden age of video. Consumers have a wealth of video services to choose from, ranging from broad video offerings to content tailored to a specific audience or genre. Many services come with premium content and competitive pricing. This, however, brings a dilemma: people will only subscribe to a few services, and are more likely to replace one service with another if the service they use does not meet their needs. Churn rates of video services are already increasing, and with new services being launched in 2022, the fight for viewer attention will further intensify.

The battle for eyeballs will not be won by the services that provide the most content, but by the ones that offer the most engaging experience, help consumers quickly find the most relevant content, and are able to do that in a cost-effective manner.

In this presentation the author will discuss three applications where AI can help viewer engagement and ensure viewers can find, discover and play the content they are interested in as quickly as possible:

• Automatic content chaptering for news, talk shows and sports programs, such that viewers can quickly jump to the items they are interested in. This allows video services to make use of long-form content but address the viewer needs of short form playback. The concept has been popularized by YouTube which offers a tool to assist content creators to chapter their content, but still requires manual curation [2]. We explore what can be achieved in a fully automated way.

  • Identifying meaningful topics in non-scripted live content, such that viewers can discover and follow content based on their topics of interest. These programs typically lack traditional metadata about their contents, due to their live nature and the costs involved in creating metadata by hand.

  • Generating high-quality, appealing and personalized episode-specific thumbnails from live (broadcast) video, to help the viewer pick the right content more quickly. This allows video services to replace meaningless stock images and better engage viewers with (near-)live programming. Personalized imagery for Video on Demand was earlier popularized by Netflix [1], but generating images from broadcast video poses additional challenges that need to be resolved to produce appealing images.

The cases are from real implementations that are being tested with different video services and their viewers. As such, the talk also dives into the practical challenges, how these were resolved, and presents results and takeaways from both technical and user perspectives.

Coding tool research for next generation AOM coding standard

  • Zhijun Lei

AV1 is an open, royalty-free video coding format designed by the Alliance for Open Media (AOMedia). Since it was finalized in mid-2018, AV1 has been supported by major content providers, such as YouTube and Netflix, and achieved great compression efficiency gain over previous generations of codecs.

Since the middle of 2019, AOM member companies have been carrying out research and exploration work for the next generation of the coding standard after AV1. The actual development work started at the beginning of 2021 in the Codec Working Group, which is the main forum for discussing and reviewing coding tool proposals from AOM member companies. Meanwhile, the testing sub-group also started the work of defining the Common Test Conditions that are used to evaluate the compression efficiency gain and implementation complexity of the proposed coding tools.

In this talk, I will first present a high-level overview of the various AOM working groups and the coding tool evaluation process. Then some details about the Common Test Condition design will be provided, especially for a few unique configurations that are close to production usage but were never supported in any previous coding standard development process. In the second part of the talk, I will present a preview of a few proposed coding tools, including their high-level ideas and achieved coding gains.

How innovations in ASIC architectures and novel design approaches power tomorrow's video networks

  • Avinash Ramachandran

This paper describes a hardware AV1 encoder pipeline design and evaluates its performance relative to Advanced Video Coding with x264.