MHV '23: Proceedings of the 2nd Mile-High Video Conference

Content-aware convex hull prediction

  • Jayasingam Adhuran
  • Gosala Kulupana

Efficient bitrate ladders can enhance on-demand streaming services when bandwidth is constrained, whereas the content-agnostic bitrate-resolution pairs of fixed bitrate ladders can lead to inefficient compression. The convex hull enclosing the Rate-Distortion (RD) curves generated for multiple resolutions can be used to determine the optimal bitrate-resolution pairs for a given video sequence. While the literature abounds with methods for determining the Quantization Parameters (QPs) that correspond to the cross-over points between adjacent RD curves, rate-controlled encodings require prior knowledge of the target bitrate rather than the target QP. To this end, this research predicts the target bitrates for a set of picture quality values, thereby generating a content-gnostic optimal convex hull for a given video sequence. To support this, a machine-learning-based two-stage prediction framework is applied to predict a pool of bitrates from the uncompressed source videos at their native resolutions. The first stage of the framework predicts spatio-temporal features in the compressed domain, which are subsequently employed to predict bitrates in the second stage. Exploiting the predicted bitrates, multiple strategies are proposed to select the bitrate-quality pairs that generate the convex hulls. A ground-truth convex hull and a state-of-the-art fixed bitrate ladder are used to evaluate the performance of the proposed convex hull prediction. Extensive experiments with x265-based encoding report minimal variance from the ground-truth convex hull and significant benefits over the fixed bitrate ladder, which incurs compression losses of over 30%.
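As an illustrative sketch (not the paper's ML prediction framework), the convex hull over bitrate-quality points pooled from per-resolution RD curves can be computed with a standard monotone-chain pass; the sample points below are hypothetical:

```python
def upper_convex_hull(points):
    """Return the upper convex hull of pooled (bitrate, quality) RD points.

    `points` pools samples from encodes at several resolutions; the hull
    is the envelope giving the best achievable quality at each bitrate.
    """
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull = []
    for p in sorted(set(points)):  # ascending bitrate
        # pop points that fall on or below the chord to p (not convex)
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull
```

Points from a lower resolution that are dominated at a given bitrate are discarded, leaving the diminishing-returns envelope from which bitrate-resolution pairs can be selected.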

Improving the Performance of Web-Streaming by Super-Resolution Upscaling

  • Yuriy Reznik
  • Nabajeet Barman
  • Patrick Wagstrom

In recent years, we have seen significant progress in advanced image and video upscaling techniques, sometimes called super-resolution or AI-based upscaling. Such algorithms are now broadly available in the form of software SDKs, as well as functions natively supported by modern graphics cards. However, to take advantage of such technologies in video streaming applications, one needs to (a) add support for super-resolution upscaling in the video rendering chain, (b) develop means for quantifying the effects of different upscaling techniques on perceived quality, and (c) modify streaming clients to use such advanced scaling techniques in a way that leads to improvements in quality, efficiency, or both.

In this paper, we discuss several techniques addressing these challenges. We first present an overview of super-resolution technology and review available SDKs and libraries for adding super-resolution functionality to streaming players. We next propose a parametric quality model suitable for modeling the effects of different upscaling techniques and validate it using an existing, widely used dataset with subjective scores. Finally, we present an improved adaptation logic for streaming clients, allowing them to save bandwidth while maintaining quality at the level achievable by standard scaling techniques. Our experiments show that this logic can reduce streaming bitrates by up to 38.9%.
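A minimal sketch of such an SR-aware adaptation decision, assuming a hypothetical constant quality uplift from super-resolution (the paper's actual parametric quality model is not reproduced here):

```python
def pick_rendition(renditions, bandwidth_kbps, sr_quality_gain):
    """Pick a rendition for a client that can upscale with super-resolution.

    renditions: (bitrate_kbps, quality) pairs in ascending bitrate order,
    where quality is the score after *standard* upscaling to display size.
    sr_quality_gain: assumed quality uplift from SR upscaling (a hypothetical
    stand-in for a learned or parametric quality model).
    """
    feasible = [r for r in renditions if r[0] <= bandwidth_kbps]
    if not feasible:
        return renditions[0]
    target_quality = feasible[-1][1]  # what standard adaptation would deliver
    # choose the cheapest rendition whose SR-upscaled quality matches it
    for bitrate, quality in feasible:
        if quality + sr_quality_gain >= target_quality:
            return (bitrate, quality)
    return feasible[-1]
```

The bandwidth saving comes from downloading a lower rendition whenever SR upscaling closes the perceptual gap to the rendition standard logic would have chosen.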

Perceptual quality evaluation of variable resolution VVC video coding using RPR

  • Thomas Guionnet
  • Kenneth Andersson
  • Nikolay Tverdokhleb
  • Thomas Burnichon

In this paper, we analyze the perceptual performance of encoding 4K content at reduced resolution at challenging bitrates by means of the reference picture resampling (RPR) tool in VVC, in comparison with encoding at the source resolution. Subjective testing shows that adaptive reduced-resolution encoding based on the GOP-based reference picture resampling encoder control present in the VVC reference software is always better than or equal to the anchor within a 95% confidence interval. The objective metrics based on PSNR and MS-SSIM both indicate a gain for these sequences as well.

Live Streaming using SRT with QUIC Datagrams

  • Maria Sharabayko
  • Maxim Sharabayko

The recently standardized QUIC transport protocol with its DATAGRAM extension provides a mechanism for sending data unreliably while leveraging the TLS-level security, connection migration, congestion control, and other features of QUIC transport. QUIC datagrams could be considered as an alternative to UDP transport, and are of particular interest for media streaming, gaming, and other real-time network applications. Given that multimedia traffic is highly sensitive to delay, jitter, packet losses, and bandwidth variations, there is a need for an application protocol on top of unreliable QUIC datagrams to compensate for those variations and to recover from packet loss wherever it makes sense. In this article, we examine the use of the Secure Reliable Transport (SRT) protocol with QUIC datagrams and evaluate its features and applicability for low latency live streaming use cases.

Robust SCTE 35 in the OTT workflow

  • Rufael Mekuria
  • Yasser Syed
  • Gary Hughes

Dynamic ad insertion and substitution are increasingly popular ways to monetize OTT services. A key enabling technique is the signaling of timeslots for insertion or substitution. In practice, SCTE 35 has been used for this, but OTT implementations have been somewhat inconsistent. This paper highlights advances in ad insertion from both practical and standardization perspectives to realize consistent and robust ad slot signaling in end-to-end OTT workflows. Guidelines for server- or client-based ad substitution using SCTE 35 in DASH and ISO base media file formatted files are presented, addressing the relevant SCTE 35, ISO BMFF and DASH fields and their relationships. The guidelines have recently been developed through joint discussions by DVB and SCTE, resulting in publications as DVB-TA part 3 and SCTE 214 (2022) (and its upcoming guidance annex), respectively. Additionally, this paper provides examples and illustrates specific use cases, such as the early termination of ad breaks, insertion of ads, and MPEG DASH period splitting. To showcase usage in a DASH manifest manipulator, a cloud function implementation for real-time period splitting of large dynamic presentations is described. The paper also details emerging techniques to carry SCTE 35 upstream in ISO BMFF and CMAF timed metadata tracks based on the recently published event message track specification, ISO/IEC 23001-18. This metadata track format mainly targets storage and upstream use cases. An implementation is described that generates example files with ad slot signaling based on this format; it is used to implement a distributed live CMAF uplink.
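The timeline arithmetic behind period splitting can be sketched as follows; this covers only the (start, duration) bookkeeping, not the full MPD rewriting (Period@start, presentationTimeOffset, segment alignment) that a real manifest manipulator must perform:

```python
def split_period(period_start_s, period_duration_s,
                 splice_offset_s, break_duration_s):
    """Split one DASH Period at an ad break signaled (e.g., via SCTE 35).

    Returns (content-before, ad-break, content-after) as (start, duration)
    tuples in seconds on the presentation timeline. Illustrative sketch only.
    """
    if not 0 <= splice_offset_s <= period_duration_s:
        raise ValueError("splice point outside the period")
    before = (period_start_s, splice_offset_s)
    ad = (period_start_s + splice_offset_s, break_duration_s)
    # remaining content resumes after the inserted break
    after = (period_start_s + splice_offset_s + break_duration_s,
             period_duration_s - splice_offset_s)
    return before, ad, after
```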

Open and optimized VVC Implementations on ARM Architectures

  • Benjamin Bross
  • Christian Lehmann
  • Gabriel Hege
  • Adam Wieckowski
  • Detlev Marpe

The international video coding standard Versatile Video Coding (VVC) was finalized in July 2020 by ITU-T and ISO/IEC. In September 2020, the open and optimized VVC encoder (VVenC) and decoder (VVdeC) software packages were released, and they have been continuously improved in terms of speed and efficiency. Originally developed for x86 architectures, the software has recently received vectorization optimizations targeting ARM platforms. VVenC and VVdeC runtimes have been measured on various systems to evaluate their performance. Results show that the three fastest presets of VVenC enable encoding HD video at 4 to 25 fps on high-performance laptops (Apple M1 Max and Intel i9-12900H) and UHD video at 2 to 10 fps on workstations (Apple M1 Ultra and Intel Xeon Gold 6348). The VVdeC decoder enables live decoding of HD video on a tablet with an octa-core ARM CPU, as well as live decoding of UHD video on the aforementioned laptops.

RMTS: A Real-time Media Transport Stack Based on Commercial Off-the-shelf Hardware

  • Ping Yu
  • Yi A Wang
  • Ming Li
  • Jianxin Du
  • Raul Diaz

The broadcast production industry is undergoing a transformation from the Serial Digital Interface (SDI) [18] to Internet Protocol (IP) [1] networks for media transport. Specialized equipment and FPGA implementations for IP-based raw media transport are currently dominant due to strict low-latency and reliability requirements. These custom hardware solutions inevitably introduce operational complexity and scalability challenges in production environments. To enable more flexible and modular media production environments, this paper proposes RMTS, a software stack for real-time media transport based on commercial off-the-shelf (COTS) hardware to improve broadcast efficiency and scalability. RMTS provides end-to-end media transport compatible with the Society of Motion Picture and Television Engineers (SMPTE) [22] ST 2110 standard [13]. RMTS offers a time-sensitive scheduling algorithm that implements two timing-related models by leveraging the rate-limiting and time-synchronization functionality of Network Interface Cards (NICs): (1) an accurate traffic shaping model, and (2) an on-time delivery model, both compatible with ST 2110-21 "Professional Media Over Managed IP Networks: Traffic Shaping and Delivery Timing for Video" [16]. The high accuracy of the traffic shaping model has been validated through third-party ST 2110 testing tools. As a result, RMTS enables standards-based, real-time media transport on COTS systems for high-bandwidth, low-latency media applications.

Energy-aware images: Quality of Experience vs Energy Reduction

  • Olivier Le Meur
  • Claire-Hélène Demarty
  • Franck Aumont
  • Laurent Blondé
  • Erik Reinhard

The production, transmission and display of video content requires significant amounts of energy. Whether broadcasting or streaming a video, its display on modern televisions is responsible for a significant proportion of the energy consumption. This paper proposes a framework for analysing and processing video frames that allows a modern screen to use less energy when displaying these frames. The content is analyzed prior to encoding and transmission, generating metadata that is attached to the content. After reception by a display, the metadata is then used along with display parameters and user settings to adapt the content prior to display. Two use cases are discerned: one whereby the energy is reduced as much as possible under the constraint that visual quality is maintained, and another where the highest visual quality is sought under the constraint of a fixed reduction of energy use. These use cases would well serve broadcasting and streaming services, respectively.

VVC for Immersive Video Streaming

  • Miska M. Hannuksela
  • Sachin Deshpande

Immersive video, ranging from 360° video to volumetric three-dimensional video, is expected to gain importance in the future. The recently finalized Versatile Video Coding (VVC) standard was designed to be suitable for a broad range of video content types and services, specifically including 360° video. In the same spirit, the format to encapsulate VVC in the ISO Base Media File Format (ISOBMFF) was tailor-made to support storage and streaming of immersive video. This paper reviews the features of VVC and its ISOBMFF storage for immersive video carriage. Furthermore, VVC-based profiles of the Omnidirectional Media Format standard are described as examples of using VVC and its ISOBMFF storage for immersive video streaming.

Framework for Authoring and Delivery of Targeted Media Presentations using a Smart Edge Proxy Cache

  • Roberto Ramos-Chavez
  • Rufael Mekuria
  • Espen Braastad
  • Arjen Wagenaar

We propose a framework for authoring and delivery of targeted media streaming presentations using existing online media content. The framework includes an authoring component and a delivery component. The authoring component stitches different existing media sources into a continuous streaming presentation that is suitable for playback on a broad range of devices. It includes an offline transcoding subcomponent and a just-in-time packaging subcomponent to generate media presentations on the fly. The generated presentations are continuous and have excellent device playback compatibility. However, even when content is identical, caching and distribution at scale become challenging because each media segment may use a different URI. To address this issue, we introduce a delivery component that uses a segment naming scheme to enable deduplication of media segments in the edge cache proxy. In addition, a more generic caching scheme based on a Fowler-Noll-Vo version 1a (FNV-1a) content hash was implemented in the edge cache proxy. Both mechanisms can be used in combination with different adaptive streaming protocols. To illustrate the concept in practice, the smart edge cache proxy was implemented in a popular HTTP proxy accelerator. Evaluation of the framework using different targeted presentations shows that the edge cache avoids duplicate media caching and responds to requests without significant performance penalties. Further, the request overhead in the network is lower for the naming scheme than for the content-hashing approach.
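The FNV-1a hash mentioned above is a simple, well-specified byte-wise hash; a minimal 64-bit implementation, as such a cache might use to key identical segment payloads that arrive under differing URIs, looks like this (the edge-proxy usage is the paper's; the code itself is just the standard FNV-1a definition):

```python
FNV64_OFFSET = 0xcbf29ce484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001b3        # standard FNV-1a 64-bit prime

def fnv1a_64(data: bytes) -> int:
    """64-bit Fowler-Noll-Vo 1a hash of a byte string."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte                                   # xor first (the "1a" order)
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # multiply, wrap to 64 bits
    return h
```

Two segments with identical payloads then hash to the same cache key regardless of the URI they were requested under.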

Cloud-based Workflow for AVC Film Grain Synthesis

  • Vijayakumar Gayathri Ramakrishna
  • Kaustubh Shripad Patankar
  • Mukund Srinivasan

This paper describes and demonstrates a cloud-based workflow for bit-accurate film grain synthesis to provide a better viewing experience in streaming video applications. Film grain technology can also be used with video sources that have no original film grain, as a compression-aid tool that improves subjective quality by subtly masking coding artifacts in low-bitrate encodings. This paper describes the key components that enable streaming service providers and client applications to build workflows and video playback applications using film grain technology. Some of the component software is made available for public use on popular open-source platforms such as GitHub.

Coding Techniques in JPEG XS for efficient Video Production and Contribution

  • Siegfried Foessel
  • Thomas Richter

JPEG XS is an image codec tailored for Video-over-IP, targeting visually lossless quality. The paper describes the requirements of professional use cases as well as the tools and constraints behind an efficient coding design. The first section gives an overview of the requirements, followed by the design concept of JPEG XS. The following sections discuss the JPEG XS design in detail and how it enables high-speed, low-latency coding of video streams on CPUs and GPUs. We conclude by providing insights into which additional use cases are addressed by the ongoing JPEG XS standardization activities.


  • Marwa Tarchouli
  • Marc Rivière
  • Thomas Guionnet
  • Mickael Raulet
  • Meriem Outtas
  • Olivier Deforges
  • Wassim Hamidouche

This paper presents a software framework for end-to-end learned video coding. The main goal is to provide the research community with a platform adapted to practical applications. To accomplish this purpose, several features have been integrated, including basic utilities such as color conversion and advanced functions such as overlapping patch-coding to enable flexible memory consumption and adaptation to various hardware parallelization capabilities. The MS-SSIM function has been altered to fit this approach, and the rationale behind this modification is explained. This framework is meant to serve as a basis for validation and comparison of state-of-the-art end-to-end learned video codecs. Therefore, the codec part is a module that can be replaced while keeping the framework structure, and performance metrics are included. Experimental results are provided to illustrate the effectiveness of the framework.

3GPP Rel-17 5G Media Streaming and 5G Broadcast powered by 5G-MAG Reference Tools

  • Daniel Silhavy
  • David Waring
  • Dev Audsin
  • Richard Bradbury
  • Johann Mika
  • Klaus Kuehnhammer
  • Kurt Krauss
  • Jordi J. Gimenez

The expansion of connectivity has opened the door to new ways of creating, distributing and consuming multimedia content. According to the Ericsson Mobility Report [1], by 2028, 5G networks will carry 69 percent of the world's smartphone traffic, with the delivery of video expected to account for 80 percent of mobile data traffic. Engaging in the development of global technologies, such as 5G, is an opportunity to shape their capabilities towards the creation of new services and applications. 5G Media Streaming is a set of specifications, functionalities and APIs for service providers to manage 5G System capabilities such as content provisioning, quality of service, metrics and consumption reporting, network assistance, and dynamic network policies, among other functionalities. Furthermore, it also allows managing the different delivery mechanisms available in 5G, including unicast, multicast and broadcast, edge processing, etc.

The 5G Media Action Group (5G-MAG) has undertaken the task to drive the implementation of 5G multimedia services and applications. This is done by implementing the open-source 5G-MAG Reference Tools, which also provide feedback to relevant standards developing organizations while strengthening collaboration between service providers, network operators, systems integrators, technology vendors, app developers and users.

This paper presents an overview of the technologies and features currently being developed as part of the 5G-MAG Reference Tools and the first services and applications they support. This currently covers several building blocks for 5G Media Streaming, including baseline features such as content hosting and media session handling, and an end-to-end toolbox for LTE-based 5G Broadcast. The Reference Tools now form a framework in which contributors can add more advanced functionality as further specifications and features become available.

Elastic Video Content Delivery Networks at the Edge

  • Tuan Tran
  • Christoph Neumann
  • Guillaume Bichot

Edge platforms are increasingly used by telecom operators to deploy functions and services as virtual appliances within their networks, in geographically distributed cloudlets. Edge platforms provide a great amount of flexibility, as they allow operators to dynamically scale the number of virtual resources allocated to a service. They also allow virtual appliances to be deployed close to end-users, potentially increasing the QoE of the delivered services.

In this paper, we discuss how to make a video Content Delivery Network (CDN) "edge-aware". We design a virtualized Edge CDN (vCDN) system that captures the dynamics and topology of the edge platform, handles end-users' location and mobility and takes into account application-specific metrics. The proposed vCDN system deploys virtual cache instances accordingly and routes the end-users' traffic to the most appropriate cache instance.

Performance of Low-Latency HTTP-based Streaming Players

  • Bo Zhang
  • Nabajeet Barman
  • Yuriy Reznik

Reducing end-to-end streaming latency is critical to HTTP-based live video streaming. There are currently two main technologies in this domain: Low-Latency HTTP Live Streaming (LL-HLS) and Low-Latency Dynamic Adaptive Streaming over HTTP (LL-DASH). These protocols are now supported by many popular streaming players, including HLS.js, DASH.js, Video.js, Shaka player, THEO player, and others. Some players, such as DASH.js and HLS.js, also offer several different rate-adaptation algorithms that may be deployed. This paper evaluates the performance of such low-latency players and their adaptation methods. The evaluation is based on a series of live streaming experiments, repeated using identical video content, encoders, encoding profiles, and network conditions emulated using traces of real-world networks. A variety of system performance metrics, such as average stream bitrate, the amount of downloaded data, streaming latency, and buffering and switching statistics, have been captured and reported. These results are used to describe the observed differences in the performance of low-latency players and systems.

Transcoding Quality Prediction for Adaptive Video Streaming

  • Vignesh V Menon
  • Reza Farahani
  • Prajit T Rajendran
  • Mohammed Ghanbari
  • Hermann Hellwagner
  • Christian Timmerer

In recent years, video streaming applications have amplified the demand for Video Quality Assessment (VQA). Reduced-reference video quality assessment (RR-VQA) is a category of VQA in which certain features (e.g., texture, edges) of the original video are provided for quality assessment. It is a popular research area for applications such as social media, online games, and video streaming. This paper introduces a reduced-reference Transcoding Quality Prediction Model (TQPM) to determine the visual quality score of a video that may have been transcoded in multiple stages. The quality is predicted using Discrete Cosine Transform (DCT)-energy-based features of the video (i.e., the video's brightness, spatial texture information, and temporal activity) and the target bitrate representation of each transcoding stage. To this end, the problem is formulated and a Long Short-Term Memory (LSTM)-based quality prediction model is presented. Experimental results illustrate that, on average, TQPM yields PSNR, SSIM, and VMAF predictions with an R2 score of 0.83, 0.85, and 0.87, respectively, and a Mean Absolute Error (MAE) of 1.31 dB, 1.19 dB, and 3.01, respectively, for single-stage transcoding. Furthermore, an R2 score of 0.84, 0.86, and 0.91, respectively, and an MAE of 1.32 dB, 1.33 dB, and 3.25, respectively, are observed for a two-stage transcoding scenario. Moreover, the average processing time of TQPM for 4s segments is 0.328s, making it a practical VQA method for online streaming applications.

Determining Video Complexity to optimise Video Quality Assessment

  • Ivan Damnjanovic
  • Ian Trow

In this paper, we present the results of our experiments with open-source encoders to understand the complexity of video content. We show how the bitrate profile over time of an asset encoded in constant-quality mode can be used as a robust estimate of content complexity. We test three hypotheses: whether both Content Quality (CQ) and Constant Rate Factor (CRF) rate-control modes can be used interchangeably for complexity estimation, whether a complexity estimate depends on the CQ/CRF parameter used, and finally, whether complexity depends on the upstream pre-processing applied to source content. A set of recommendations and parameters is proposed that gives a good estimate of content complexity. We show how we used this technique to analyse the video quality of statistical multiplexing systems, allowing us to efficiently process a large number of live sources and quickly pinpoint potential deficiencies of the target statistical multiplexing implementation.
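The core idea, deriving a complexity profile from the bits a constant-quality encode spends over time, can be sketched as follows; the windowing choice is an assumption, not the paper's exact procedure:

```python
def bitrate_profile_kbps(frame_sizes_bytes, fps, window_s=1.0):
    """Per-window bitrate profile of a constant-quality (CQ/CRF) encode.

    With the rate constraint removed, the bits spent per window track the
    content's spatio-temporal complexity, so this profile can serve as a
    per-window complexity estimate for the asset.
    """
    step = max(1, int(fps * window_s))
    return [sum(frame_sizes_bytes[i:i + step]) * 8 / 1000 / window_s
            for i in range(0, len(frame_sizes_bytes), step)]
```

Peaks in the resulting profile flag the segments where a statistical multiplexer is most likely to be stressed.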

Greening of Streaming: The LESS Accord: Low Energy Sustainable Streaming

  • Dom Robinson

Greening of Streaming is an industry body that comprises many of the most significant operators, service providers, and technology vendors in the streaming industry. It acts to improve energy efficiency and sustainability efforts relating to streaming services architecture and design, promoting power as an equally important design consideration to price and performance in system development.

With significant international traction and many active working groups spanning all aspects of the streaming workflow, Greening of Streaming is practical and pioneering. It is not an accreditation or offsetting group: It is focused on real world engineering improvements to streaming systems. Neither is it a standards development organization (SDO). It is a user group (UG) that seeks to work with SDOs to encourage consideration of energy efficiency as a 'first class' key performance indicator (KPI) in the development of technical standards that relate to streaming.

The Low Energy Sustainable Streaming (LESS) Accord is a movement that Greening of Streaming is driving across the industry, inviting participation from diverse stakeholders in the development of streaming. The LESS Accord aims to dig deep into the heart of the broadcast and streaming industry and ask a taboo question of an historically quality-obsessed industry:

"What if the default streaming encoding profile was energy optimized with 'acceptable' quality for general viewing rather than, as it is today, quality optimized (and typically overprovisioned) with no energy consideration?"

The fundamental idea is that, in many cases, consumers cannot tell the difference between various streaming and broadcast service qualities, and increasingly the industry relies on computer-aided techniques to differentiate quality that humans cannot perceive.

One motivation behind the LESS Accord is to "give permission" for stakeholders to ask out loud what many engineers in the industry already instinctively, privately think, and to explore how we might deliver services that fulfill consumers' expectations without overselling imperceptible quality/value propositions and creating inappropriate, expensive, unsustainable, and unnecessary energy demands that bring no benefit to the viewer.

These energy demands may have environmental and economic impacts.

The LESS Accord seeks to reduce those impacts.

DVB Toolbox for Internet Television

  • Rufael Mekuria
  • Guillaume Bichot

This abstract provides an overview of four technical specifications for Internet Television services developed by the DVB consortium.

Bandwidth Prediction in Low-Latency Media Transport

  • Abdelhak Bentaleb
  • Mehmet N. Akcay
  • May Lim
  • Ali C. Begen
  • Roger Zimmermann

Designing a robust bandwidth prediction algorithm for low-latency media transport that can quickly adapt to varying network conditions is challenging. In this paper, we present the working principles of a hybrid bandwidth predictor (termed BoB, Bang-on-Bandwidth) we developed recently for real-time communications and discuss its use with the new Media-over-QUIC (MOQ) protocol proposals.
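The hybrid idea, blending a smooth steady-state estimator with a conservative fallback after congestion signals, can be illustrated with a toy predictor; this is an assumption-laden sketch for exposition, not the published BoB algorithm:

```python
class HybridBandwidthPredictor:
    """Toy hybrid predictor: EWMA estimate in steady state, conservative
    harmonic mean of recent samples while recovering from packet loss.
    Illustrative only; parameters and mode-switching rule are hypothetical."""

    def __init__(self, alpha=0.15, window=8):
        self.alpha = alpha          # EWMA smoothing factor
        self.window = window        # samples kept for the conservative mode
        self.samples = []
        self.ewma = None
        self.loss_cooldown = 0      # >0 means "recently saw loss"

    def on_sample(self, throughput_kbps, packet_loss=False):
        self.samples = (self.samples + [throughput_kbps])[-self.window:]
        if self.ewma is None:
            self.ewma = throughput_kbps
        else:
            self.ewma += self.alpha * (throughput_kbps - self.ewma)
        if packet_loss:
            self.loss_cooldown = self.window
        elif self.loss_cooldown:
            self.loss_cooldown -= 1

    def predict(self):
        if self.loss_cooldown:  # conservative: harmonic mean under-weights spikes
            return len(self.samples) / sum(1 / s for s in self.samples)
        return self.ewma
```

The harmonic mean is deliberately pessimistic after loss, while the EWMA tracks sustained throughput changes quickly in steady state.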

Enabling immersive experiences in challenging network conditions

  • Michael Luby

Immersive experiences, such as remote collaboration and augmented and virtual reality, require the delivery of large volumes of data with consistently ultra-low latency across wireless networks under fluctuating network conditions. We describe the high-level design of a data delivery solution that meets these requirements, provide synthetic simulations of the design, and discuss the performance of a software implementation. Performance results of the implementation, delivering data with consistent ultra-low latency under network conditions based on real-world measurements, demonstrate the efficacy of the solution.

CDN Performance Evaluation with Edge-Embedded Watermarking

  • Gwendal Simon
  • Gwenael Doerr

Piracy of copyrighted video content is an everlasting concern for copyright holders, such as major studios who want to protect their latest movies and series, or video distributors who want to safeguard their investments in exclusive rights. To deter piracy, video technology providers have designed solutions based on watermarking, i.e., the video frames carry marks that can be reliably extracted by a watermark extractor while remaining invisible to the human eye. Yet, distributing watermarked streams is hard to scale in practice: to prepare and deliver a unique watermarked stream to each user, service providers must implement a specific system in their Content Delivery Network (CDN) [3].

Novel Histogram-Based Scene Change Detection Scheme for x265 Open-Source Video Encoder

  • Santhoshini Sekar
  • Ashok Kumar Mishra
  • Alex Giladi
  • Dan Grois

Bandwidth requirements keep increasing, in particular due to the growth in device resolutions [1]. As a result, there is a strong demand to decrease the video transmission bitrate without reducing visual quality. In 2010, a Joint Collaborative Team on Video Coding (JCT-VC) was established by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) to work on a new video coding standard, called the High Efficiency Video Coding (HEVC) standard. The H.265/MPEG-HEVC video coding standard [2] was approved in 2013 (see Figure 1), and it provides approximately 50% coding gain compared to its predecessor, H.264/MPEG-AVC [3],[4].

The x265 encoder [5]-[7] is a popular open-source encoder that generates bitstreams compliant with the HEVC video coding standard. Built on top of x264 [8], the x265 encoder is integrated into several popular open-source frameworks, such as ffmpeg [9], GStreamer [10], and HandBrake [11]. x265 is used by a variety of broadcast and streaming service providers who leverage the benefits of HEVC for streaming live and over-the-top (OTT) content. In addition to implementing nearly all the tools defined in HEVC, it implements many algorithmic optimizations that enable trading off encoder performance for quality [5]-[7]. Performance-critical kernels are implemented in hand-coded assembly using AVX2 and AVX-512 single-instruction, multiple-data (SIMD) instructions to improve performance on x86 CPUs. This flexible architecture makes x265 a popular choice for HEVC encoding both on premises and in the cloud. Recent x265 development efforts have focused on further increasing coding gain. In particular, there is a continuous need to automatically extract key information from videos for indexing and scene analysis, and also to improve the coding efficiency of video encoders. To support this vision, reliable scene change detection algorithms must be developed.

In this work, we present a novel algorithm for gradual and abrupt scene change detection using picture histograms, variance and pixel intensity. Extensive experimental results show that the proposed algorithm can detect scene changes with high reliability and reduced computational complexity. According to the proposed approach, the novel scenecut algorithm detects scene changes in a video and decides how aggressively each I-frame has to be placed within the video. For that, each picture/frame is divided into a number of regions, and histograms and pixel intensities are computed for these regions separately; the higher the number of regions, the higher the reliability of the algorithm. In this work, the number of regions was fixed at nine to optimize the above-mentioned scenarios. Figure 2 below presents a flow chart of the proposed scene change detection algorithm.

Note that the intensity contrast characterizes the intensity difference between an object and the background. Both variance and intensity contrast are used for thresholding, with the threshold obtained by computing their weighted sum. Further, a sliding window of three adjacent frames is employed to determine whether there is an abrupt scenecut or a gradual transition; the number of frames in the sliding window is a compromise between reliability and delay (if more frames are considered, reliability improves, but the delay increases as well).
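The regional-histogram comparison at the core of the detector can be sketched as follows; the grid size, bin count, and fixed threshold are illustrative assumptions, and the variance/intensity-contrast weighted thresholding described above is deliberately omitted for brevity:

```python
def region_histograms(frame, regions=3, bins=8):
    """Per-region luma histograms. `frame` is a 2-D list of 8-bit values,
    split into a regions x regions grid (3x3 = nine regions, as above)."""
    h, w = len(frame), len(frame[0])
    hists = []
    for ry in range(regions):
        for rx in range(regions):
            hist = [0] * bins
            for y in range(ry * h // regions, (ry + 1) * h // regions):
                for x in range(rx * w // regions, (rx + 1) * w // regions):
                    hist[frame[y][x] * bins // 256] += 1
            hists.append(hist)
    return hists

def hist_distance(frame_a, frame_b, regions=3, bins=8):
    """Sum of absolute histogram differences over all regions."""
    return sum(abs(a - b)
               for ha, hb in zip(region_histograms(frame_a, regions, bins),
                                 region_histograms(frame_b, regions, bins))
               for a, b in zip(ha, hb))

def is_abrupt_scenecut(prev_frame, frame, threshold):
    """Flag an abrupt cut when the regional histogram distance exceeds a
    threshold (here a fixed number; the paper derives it adaptively)."""
    return hist_distance(prev_frame, frame) > threshold
```

Comparing histograms per region rather than globally lets the detector notice localized changes that a single full-frame histogram would average away.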

Extensive experimental results show significant bitrate savings in terms of BD-BR [12], as well as computational complexity reduction in terms of encoded frames per second.

Sustainable TV Distribution by Delivering Universal DVB-I TV Services

  • Christophe Burdinat
  • Mickael Raulet
  • Pascal Perrot
  • Julien Lemotheux
  • Patrice Angot
  • Richard Lhermitte
  • Pierre-Loup Cabarat
  • Benoit Bui Do

TV consumption patterns and delivery methods have undergone major changes in the last decades. This has led to two heterogeneous ecosystems: on one side, traditional broadcast making use of MPEG-TS; on the other, streaming applications over the broadband internet, addressing enriched use cases such as customization and VoD. To reach their fragmented audience across a plurality of access networks, devices and usages, service providers and operators have to deliver their services in many technically heterogeneous flavors. In parallel, the growing awareness of the streaming industry's impact on the climate has become a strong incentive to streamline the entire TV service delivery ecosystem. [1] proposed an end-to-end sustainable solution based on DVB-I addressing OTT, IPTV, and 5G mobile networks. DVB-I [2] is an emerging standard developed by DVB to harmonize discovery and consumption of TV services over the multiplicity of access networks. As a discovery mechanism, DVB-I can help prioritize the selection of network and service flavors based on energy-saving criteria. This paper considers the set of key features from [1] aimed at reducing power consumption for video streaming, covering codec, packaging and transport aspects.

Versatile Video Coding (VVC) [3], issued in mid-2021, provides around 50% bandwidth saving compared to its predecessor HEVC [4]. VVC is more complex and requires more power during the encoding and decoding phases but can provide large savings in transmission power arising from the reduced bandwidth demand. As indicated in [5], a large portion of the energy consumption for video streaming occurs on end devices. First measurements on end devices for VVC were reported in [6] and are complemented in this paper by a set of additional measurements. Subjective tests are performed to identify the best energy-saving configuration (definition/codec) for mobile devices and TV sets: when the user experience is practically identical, the usage of smaller resolutions, in particular on small screens, is preferable. This paper investigates the impact of the codec choice (AVC/HEVC/VVC) on the energy consumption of the end-to-end streaming delivery system, from the head end to the end devices.

The paper then analyzes the end-to-end impact of using the Common Media Application Format (CMAF [7]) for packaging the video streams. CMAF has been specified to be used with both the DASH [8] and HLS [9] protocols, consequently reducing the amount of media content to be cached or broadcast: the media can be packaged once and stored once, reducing its footprint in the head end and the CDN. The evolution of the smartphone fleet and the increasing portion of devices supporting CMAF confirm the relevance of this format.

Finally, the paper considers the gains brought by using multicast for the distribution of live OTT services. First, at the CDN level, delivery can be scaled by leveraging DVB-MABR [10]. This can have a very significant impact on CDN dimensioning, as it absorbs the consumption peaks during popular events. Second, at the access network level, gains come from point-to-multipoint transmissions over 5G using the new 5G Multicast Broadcast Services (5G MBS) feature [11].

Novel Motion-Compensated Spatio-Temporal Filtering Scheme for x265 Open-Source Video Encoder

  • Santhoshini Sekar
  • Ashok Kumar Mishra
  • Alex Giladi
  • Dan Grois

There is a strong demand to decrease the video transmission bitrate without reducing visual quality [1]. The x265 encoder [2]-[4] is a popular open-source encoder, which generates bitstreams compliant with the H.265/MPEG-HEVC video coding standard [5]. Built on top of x264 [6], the x265 encoder is integrated into several popular open-source frameworks, such as ffmpeg [7], GStreamer [8], and HandBrake [9]. In addition, x265 is used by a variety of broadcast and streaming service providers who leverage the benefits of HEVC for streaming live and over-the-top (OTT) content. In addition to implementing nearly all the tools defined in HEVC, it implements many algorithmic optimizations that enable trading off encoder performance for quality [2]-[4]. The performance-critical kernels are implemented with hand-coded assembly that uses AVX2 and AVX-512 single instruction, multiple data instructions to improve performance on x86 CPUs. This flexible architecture makes x265 a popular choice for HEVC encoding both on-premises and in cloud services.

Recent x265 development efforts have been focused on further improving the coding gains. Specifically, the motion-compensated spatio-temporal filtering (MCSTF) employed within the coding loop is especially useful for pictures that contain a high level of noise. It utilizes previously generated motion vectors across different video content resolutions to find the best temporal correspondence for low-pass filtering, while the temporal filtering is applied to the I- and P-frames. Figure 1 schematically illustrates the motion estimation process for temporal filtering in a temporal window, which consists of 5 adjacent pictures: two past, two future and one central picture used for producing a single filtered picture. Motion estimation is applied between the central picture and each future or past picture, thereby generating multiple motion-compensated predictions, which are then combined by using adaptive filtering to produce a final noise-reduced picture. To this end, a hierarchical motion estimation scheme is employed (layers L0, L1 and L2 are illustrated in Figure 2). Subsampled pictures are generated for all reference pictures and for the original picture (i.e., L1), while L2 is derived from L1 by using the same subsampling method. First, motion estimation is done for each 16x16 block in L2. The selected motion vector is then used as an initial value for estimating the motion in L1, and the same is subsequently performed for estimating the motion in L0.

As a final step, the subpixel motion is estimated for each 8x8 block by using an interpolation filter on L0. Particularly, the motion of the reference pictures before and after the original picture is estimated per 8x8 picture block. In turn, motion compensation is applied to the pictures before and after the original picture according to the best matching motion for each block, i.e., such that pixel coordinates of the original picture in each block have the best matching coordinates within the referenced pictures. The filter is then applied to the current pixels, and after that, the filtered picture is encoded. Note that the pixels are processed one by one for the luma and chroma channels. The new sample value is calculated by using the following equation:

    In = (Io + Σi wr(i, a) · Ir(i)) / (1 + Σi wr(i, a))

where Io is the original pixel, Ir(i) is the intensity of the corresponding pixel within motion compensated picture i, and wr(i, a) is the weight of motion compensated picture i, with a being the number of available motion compensated pictures. Extensive experimental results show significant bit-rate savings in terms of BD-BR [10].
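The per-pixel combination can be sketched as follows; this is a minimal illustration of the weighted average defined above, assuming the weights wr(i, a) are already computed (in practice they are derived from the motion-compensation error and encoder parameters, which are omitted here).

```python
def filtered_sample(i_o, i_r, w_r):
    """Weighted temporal average of one pixel: i_o is the original
    sample, i_r[i] the co-located motion-compensated sample from
    picture i, and w_r[i] its weight (a = len(i_r) available pictures).
    A weight of 0 for every picture leaves the original sample intact."""
    num = i_o + sum(w * s for w, s in zip(w_r, i_r))
    den = 1.0 + sum(w_r)
    return num / den
```

With uniform weights this reduces to a plain average over the temporal window, which is why noisy content benefits most: uncorrelated noise averages out while motion-compensated structure is preserved.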

Elevating Your Streaming Experience with Just Noticeable Difference (JND)-based Encoding

  • Jingwen Zhu
  • Hadi Amirpour
  • Vignesh V Menon
  • Raimund Schatz
  • Patrick Le Callet

With the growing number of video streaming applications, intelligent video streaming has become increasingly important. By considering JND, video delivery can be improved by avoiding the selection of encodings with similar quality for a bitrate ladder. In this paper, we present an overview of the existing methods for modeling and predicting JND in the context of video streaming applications.

Dynamic CDN Switching - DASH-IF Content Steering in dash.js

  • Daniel Silhavy
  • Will Law
  • Stefan Pham
  • Ali C. Begen
  • Alex Giladi
  • Alex Balk

This paper overviews the content steering specification currently being developed in DASH Industry Forum and first implemented in the dash.js reference player.

MC-IF VVC technical guidelines

  • Lukasz Litwic
  • Justin Ridge
  • Alan Stein

Versatile Video Coding (VVC/H.266) is the latest in the line of successful video coding standards jointly developed by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) [1]. With its best-in-class video compression performance and new versatility features, VVC has the capability to enhance existing applications and enable new services. VVC is based on the same architecture as its predecessors, AVC/H.264 and HEVC/H.265, which enables seamless integration into existing workflows.

As the first VVC implementations enter the market, several application-oriented standards developing organizations (SDOs) and industry fora are defining VVC-based profiles and corresponding receiver capabilities. However, these specifications typically do not prescribe how a service is realized or how the codec's operational parameters affect the delivered compression performance.

To this end, the Media Coding Industry Forum (MC-IF) [2] has initiated the development of VVC technical guidelines. These guidelines will serve as a reference for VVC configuration choices that address operational, interoperability, and regulatory needs while achieving optimal compression performance. The initial focus of the guidelines is on common media service deployments in the broadcast and streaming verticals.

This talk will provide an overview of the guidelines' scope and selected use cases in the first release and beyond. It will continue with a presentation of selected VVC configuration aspects, focusing on new features most relevant to broadcast and streaming applications. The talk will conclude with a presentation of the community review and guidelines development process for parties interested in contributing.

Redundant Encoding and Packaging for Segmented Live Media

  • Rufael Mekuria
  • Mohamad Raad
  • Ali C. Begen

We present the MPEG standardization activity on redundant encoding and packaging for live segmented media. The standardization includes profiling the Dynamic Adaptive Streaming over HTTP (DASH) Media Presentation Description for ingest, storage and redundant packaging applications. Further, a Common Media Application Format (CMAF) segment and track format is defined to support redundant encoding and packaging using a common timeline relative to the Unix epoch. The standardization is still ongoing and we solicit feedback from academic and industry practitioners.
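To illustrate the idea of a common timeline relative to the Unix epoch, the sketch below derives segment numbers directly from wall-clock time, so that independently started redundant encoders and packagers agree on segment boundaries and names. The function and naming scheme are hypothetical illustrations, not the normative rules of the standard under development.

```python
import time

def epoch_segment_number(seg_dur_s, now=None):
    """Derive a segment sequence number from Unix time. Two encoders
    started at different wall-clock times still compute identical
    numbers for the same instant, which is the property redundant
    packaging needs (illustrative; the MPEG profile defines the
    exact anchoring rules)."""
    now = time.time() if now is None else now
    return int(now // seg_dur_s)

def segment_name(track_id, seg_dur_s, now=None):
    """Hypothetical CMAF segment naming from the epoch-based number."""
    return f"{track_id}-{epoch_segment_number(seg_dur_s, now)}.cmfv"
```

Because the number is a pure function of time and segment duration, a failed packager can be replaced mid-stream and the new instance produces byte-range-compatible segment names without any handover state.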

Which CDN to Download From? A Client and Server Strategies

  • Abdelhak Bentaleb
  • Reza Farahani
  • Farzad Tashtarian
  • Hermann Hellwagner
  • Roger Zimmermann

Content Delivery Networks (CDNs) have evolved to enable video streaming services to deliver media content over the Internet with less latency and improved quality. However, relying on a single CDN is highly vulnerable to outages and crashes, resulting in a poor viewer experience. Regardless of viewers, traffic, or media content, a single CDN will never be sufficient to satisfy viewers' quality of experience (QoE) requirements. To avoid single-CDN issues, leveraging multiple CDNs from multiple providers, referred to as multi-CDN, helps improve performance, increase geographic coverage, and alleviate outages. An essential part of multi-CDN solutions is the decision to select the best-performing CDN in real time, based on periodic measurements of CDNs and video players. While the multi-CDN architecture provides tremendous benefits, it has not been well investigated and integrated in the industry. This paper highlights various decision strategies for real-time CDN selection that help content providers select the right solution aligned with their goals and business.
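To make the idea of measurement-driven selection concrete, here is a hedged sketch of one possible strategy; the scoring weights, measurement fields and hysteresis margin are invented for illustration and are not from the paper.

```python
def score(m, w_tput=1.0, w_rtt=0.05, w_err=50.0):
    """Higher is better: reward measured throughput (Mbps), penalize
    RTT (ms) and error rate. Weights are illustrative, not tuned."""
    return w_tput * m["tput_mbps"] - w_rtt * m["rtt_ms"] - w_err * m["err_rate"]

def select_cdn(measurements, current=None, hysteresis=1.1):
    """Pick the best-scoring CDN, but only switch away from the current
    one when a challenger beats it by the hysteresis margin, to avoid
    oscillating between CDNs of similar performance (assumes positive
    scores for the margin comparison to be meaningful)."""
    best = max(measurements, key=lambda c: score(measurements[c]))
    if current in measurements and \
            score(measurements[best]) < hysteresis * score(measurements[current]):
        return current
    return best
```

Real deployments would fold in business inputs (cost per GB, contractual commitments) and per-region measurements; the hysteresis term is the key mechanism that keeps players from thrashing between near-equal CDNs.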

Common Media Server Data (CMSD) – Update on Implementations and Validation of Key Use Cases

  • Stefan Pham
  • Will Law
  • Ali C. Begen
  • Daniel Silhavy
  • Bertrand Berthelot
  • Stefan Arbanowski
  • Stephan Steglich

The CTA-5006 (Common Media Server Data, CMSD) specification establishes a uniform method for media servers to exchange data with each media object response. The aim is to enhance distribution efficiency, performance, and ultimately, the user experience. We provide an overview of CMSD implementations and focus on integrating CMSD into the dash.js reference player. Three use cases are evaluated to demonstrate the advantages of CMSD, including leveraging edge server throughput estimates to improve initial bitrate selection and low-latency live streaming, prefetching manifests and segments to improve startup delay, and allowing an edge server to suggest a playback bitrate to improve the collective experience. The outcomes from the initial implementations confirm the benefits of using CMSD.

Need for Low Latency: Media over QUIC

  • Zafer Gurel
  • Tugce Erkilic Civelek
  • Ali C. Begen

This paper overviews the development of a low-latency solution for media ingest and distribution, the work undertaken by the IETF's new Media over QUIC (moq) working group. It summarizes the motivation, goals, current work and potential improvements.

Improving Netflix video quality with neural networks

  • Christos George Bampis
  • Li-Heng Chen
  • Zhi Li

Video downscaling is an important component of adaptive video streaming, which tailors streaming to screen resolutions of different devices and optimizes picture quality under varying network conditions. With video downscaling, a high-resolution input video is downscaled into multiple lower-resolution videos. This is typically done by a conventional resampling filter like Lanczos. In this work, we describe how we improved Netflix video quality by developing neural networks for video downscaling and deploying them at scale.
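As a point of reference for the conventional baseline mentioned above, a textbook Lanczos resampler can be sketched as follows. This is an illustrative 1-D implementation, not Netflix's production code; 2-D image resizing applies the same filter separably along rows and then columns.

```python
import math

def lanczos_kernel(x, a=3):
    """Lanczos-a windowed sinc: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def resample_1d(samples, out_len, a=3):
    """Resample a 1-D signal to out_len samples with Lanczos weights,
    normalizing by the weight sum and clamping taps at the borders."""
    scale = len(samples) / out_len
    out = []
    for j in range(out_len):
        center = (j + 0.5) * scale - 0.5   # source position of output j
        lo, hi = math.floor(center) - a + 1, math.floor(center) + a
        w, acc = 0.0, 0.0
        for i in range(lo, hi + 1):
            k = lanczos_kernel(center - i, a)
            acc += k * samples[min(max(i, 0), len(samples) - 1)]
            w += k
        out.append(acc / w)
    return out
```

The negative lobes of the kernel are what give Lanczos its sharpness, and also its ringing near edges; that fixed trade-off is precisely what a learned downscaler can adapt per content.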

Cloud based AI-driven Video Super-Resolution Solution

  • Nelson Francisco
  • Julien Le Tanou

High production costs are delaying the widespread deployment of UHD broadcast offerings. Only a few special events tend to be produced and broadcast in UHD, with most 4K content coming from streaming providers such as Netflix, Amazon Video or Disney+, and even in those cases, availability is still significantly more limited than for their non-4K libraries. As a result, the potential of UHD displays is not fully exploited, as the picture representation relies on the viewing device's upscaling capabilities, usually highly constrained by computing and power consumption limitations. High-quality up-conversion can be a viable solution to accelerate 4K availability, since costs can be reduced not only by complementing UHD offerings with high-quality content upscaled from existing HD libraries, but also by leveraging existing production pipelines all the way up to a final up-conversion stage, while retaining control over how content is rendered on 4K screens. We propose a cloud-based, AI-driven super-resolution upscaling solution that vastly outperforms traditional methods while retaining low and scalable operational costs.

Context-Aware HTTP Adaptive Video Streaming Utilizing QUIC's Stream Priority

  • Sindhu Chellappa
  • Reza Farahani
  • Radim Bartos
  • Hermann Hellwagner

In recent years, HTTP Adaptive Streaming (HAS) has been the predominant video delivery technology over the Internet. Existing HAS-based techniques ignore the context and the affective content of the video; rather, their prime focus is determining the video quality of the upcoming data based on the current network conditions [4, 5]. Highlighting the video traffic based on context (e.g., the goal in a soccer match or the climax of a movie) and allocating more network resources (e.g., prioritizing the important segments) to the highlighted segments will lead to end-user satisfaction with a pleasant Quality of Experience (QoE). Quick UDP Internet Connections (QUIC), a recently standardized transport protocol, has gained popularity due to its promising features, e.g., reduced connection establishment latency, stream multiplexing and stream priority. This paper leverages QUIC's stream priority to support context-based streaming by introducing a novel video delivery approach.

Performance Assessment of AV1, x265 and VVenC Open-Source Encoder Implementations Compared to VVC and HEVC Reference Software Models

  • Dan Grois
  • Alex Giladi

Video applications continue to gain traction and are in enormous demand. As a result, there is a strong demand to decrease the video transmission bitrate without reducing visual quality [1],[2]. The efforts to achieve bitrate savings, especially for high-definition video content, started in 2010, when the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) established the Joint Collaborative Team on Video Coding (JCT-VC) to work on the High-Efficiency Video Coding (HEVC) standard. Then, in 2013, the 1st version of the HEVC specification was approved by ITU-T as Recommendation H.265 and by ISO/IEC as MPEG-H, Part 2 [3]. When developing the H.265/MPEG-HEVC standard, high-resolution and high frame-rate video coding was considered one of its main potential application scenarios, while keeping it applicable to almost all existing use cases already targeted by H.264/MPEG-AVC.

However, more efficient video compression techniques were still desired, especially for streaming Ultra HD video content as well as panorama (so-called 360°) video content from concerts, shows, sport events, etc. Therefore, in order to fulfill this demand, the exploration phase for future video coding technologies beyond HEVC (ITU-T H.265 | ISO/IEC 23008-2) started in October 2015 with the establishment of the Joint Video Exploration Team (JVET) on Future Video Coding of ITU-T VCEG and ISO/IEC MPEG [4]. The development of the video coding standard beyond HEVC was driven by the most recent scientific and technological achievements in the video coding field, and under the JVET development it was titled "Versatile Video Coding" [5], or in short, VVC [6]. The 1st version of the VVC standard (i.e. VVC v1) was officially finalized during the 19th JVET meeting, which took place between June 22 and July 1, 2020, and it was approved by ITU-T as Recommendation H.266 and by ISO/IEC as MPEG-I, Part 3 [6].

While the joint video coding standardization activities of the ITU-T and ISO/IEC organizations rely on an open and collaborative process driven by their active members, several companies have individually developed their own video coding formats [7], [8]. For example, in 2015, the Alliance for Open Media (AOM) was formed with the objective to work towards next-generation media formats in general, with a particular short-term focus on the development of a video coding scheme [9],[10],[11]. In April 2016, AOM released a baseline version of the developed video coding scheme, which was named AV1. In turn, the final version of the 1st edition of AV1 was released in 2018, claiming to provide a significant coding-efficiency gain over the then state-of-the-art video codecs [10]. In May 2020, the AV1 codec had a major update and its 2nd edition, i.e. AV1 2.0, was released [12],[13]. In turn, AV1 3.0, with better compression performance, was released in 2021 [12],[13]. However, experimental results related to the coding-efficiency comparison of AV1 versus standardized codecs such as HEVC and VVC, as reported in the literature [14], are not consistent and are sometimes even contradictory. As a result, there is a lot of confusion about the ability of future versions of AV1 to compete with HEVC/VVC-based encoder implementations.

In order to put things into perspective and to provide relevant information in a reproducible and reliable form, this work presents detailed experimental results of a coding-efficiency comparison of AV1 (aomenc) v3.6 [12] versus VVC-based codecs, i.e. the open-source VVenC codec v1.7 [15] and the VVC reference software model (VTM) v17.0, and versus HEVC-based codecs, i.e. the open-source x265 codec v3.5 [16],[17] and the HEVC reference software model (HM) v17.0 [18], along with a detailed discussion of the selected software implementations, the choice of coding parameters, and the corresponding evaluation setup. All tested encoders have been set to their best quality operation modes (i.e. the slowest encoding options) with a Group of Pictures (GOP) size equal to 16. Specifically, the authors focus on 4K and 1080p video content (selected from the JVET CTC [19]), since it is considered the most popular nowadays, especially within consumer multimedia applications, and encoding such content is typically the most challenging due to the significantly larger computational complexity in terms of encoding times compared to lower video resolutions. For the rate-distortion (R-D) performance assessment, the authors used the Bjøntegaard-Delta bit-rate (BD-BR) measurement method for calculating average bit-rate differences between R-D curves for the same objective quality (e.g., for the same PSNRYUV) [20].
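The BD-BR method referenced above can be sketched as follows, assuming four rate-distortion points per curve as is common practice; this is a simplified re-implementation of the standard Bjøntegaard computation, not the authors' evaluation scripts.

```python
import math

def _solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def _cubic_through(xs, ys):
    """Coefficients c0..c3 of the cubic interpolating four (x, y) points."""
    return _solve([[1.0, x, x * x, x ** 3] for x in xs], list(ys))

def _integral(c, lo, hi):
    f = lambda x: c[0]*x + c[1]*x**2/2 + c[2]*x**3/3 + c[3]*x**4/4
    return f(hi) - f(lo)

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average bitrate difference (%) of the test curve vs. the reference
    at equal quality: log10(rate) is interpolated as a cubic in PSNR and
    both curves are integrated over the overlapping PSNR range."""
    shift = min(psnr_ref + psnr_test)   # re-center PSNR for conditioning
    c_ref = _cubic_through([p - shift for p in psnr_ref],
                           [math.log10(r) for r in rates_ref])
    c_test = _cubic_through([p - shift for p in psnr_test],
                            [math.log10(r) for r in rates_test])
    lo = max(min(psnr_ref), min(psnr_test)) - shift
    hi = min(max(psnr_ref), max(psnr_test)) - shift
    avg = (_integral(c_test, lo, hi) - _integral(c_ref, lo, hi)) / (hi - lo)
    return (10 ** avg - 1) * 100
```

A positive BD-BR means the test codec needs more bits for the same quality, which matches the sign convention of the overhead percentages reported below.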

According to the experimental results, the coding efficiency of both AV1 and x265 was found to be significantly inferior to VVenC, with bitrate overheads of 31.2% and 118%, respectively. In terms of encoding speed, the typical encoding time of AV1 is ~5 times slower than HM and ~2 times slower than VVenC. On the other hand, the encoding time of x265 for the given best quality mode configuration is similar to that of HM, while HM provides bitrate savings of more than 22%. When compared to VTM, HM runs more than 12 times faster, but with a bitrate overhead of ~52%.

Bitrate and Adaptive Streaming: What are We Measuring and Why?

  • Alex Giladi
  • Dan Grois
  • Kirithika Kalirathnam
  • Robert Dandrea

A significant increase in bandwidth requirements is expected in the next couple of years, particularly due to the increase in device resolution [1]: a typical bit-rate for 4K video is between 15 and 18 Mbps, more than twice the High-Definition (HD) video bit-rate and about nine times the Standard-Definition (SD) video bit-rate. As a result, there is currently a strong demand to decrease the video transmission bit-rate without substantially reducing the visual presentation quality [1].

Bitrate is one of the most well-known and intuitively understood concepts in video transmission. With that said, there are several different definitions of how bitrate is measured, and different methods of measurement are applicable in different situations. Video compression standards, such as H.265/MPEG-HEVC [2], provide a normative definition of bitrate which applies to all Network Abstraction Layer (NAL) units or their subset, and which video encoders follow. However, we are not transmitting "bare" NAL units, and not necessarily doing so in a significantly constrained pipe. In addition, a ubiquitously used definition of bitrate in the MPEG-2 TS is provided within the measurement technical report for Digital Video Broadcasting (DVB) systems, ETSI TR 101 290 [3].

Furthermore, the situation is different in adaptive video streaming: for example, a 1-sec sliding window over a stream is less relevant when the unit of transmission is a segment. MPEG DASH [4] and Apple HLS [5] have their own definitions of bitrate, based on maxima of segment bitrates, and extended signaling for content-adaptive encoding. The concept of a constant-rate "pipe" is also often irrelevant: video traffic is de facto multiplexed with all other traffic within, e.g., an LTE cell or a service group in a DOCSIS broadband network, and CDN storage and egress become the limiting per-stream factors. With that said, segment-level measurement is important for streaming clients reasoning about the sustainability of a given variant. Lastly, ISPs and CDNs often use the so-called 95/5 burstable model, where the 95th percentile of 5-min averages is used in lieu of second- or segment-based calculation.
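The measurement models above can be contrasted in a small sketch; the units, window sizes and function names are illustrative, not definitions from any of the cited specifications.

```python
def segment_bitrates(seg_bits, seg_dur_s):
    """Per-segment bitrate (bits/s): the quantity DASH/HLS-style
    signaling takes maxima over."""
    return [b / seg_dur_s for b in seg_bits]

def peak_sliding_window(frame_bits, fps, window_s=1.0):
    """Peak bitrate over a sliding window of window_s seconds, measured
    on per-frame sizes: the classic transport-style measure."""
    n = max(1, int(round(window_s * fps)))
    peaks = [sum(frame_bits[i:i + n]) / window_s
             for i in range(len(frame_bits) - n + 1)]
    return max(peaks)

def burstable_95_5(interval_bits, interval_s=300):
    """ISP/CDN-style 95/5 measure: 95th percentile of 5-minute averages."""
    avgs = sorted(b / interval_s for b in interval_bits)
    idx = min(len(avgs) - 1, int(0.95 * len(avgs)))
    return avgs[idx]
```

Applied to the same stream, the three functions can report widely different numbers, which is exactly the ambiguity the paper sets out to illustrate.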

In this work, we first review various bitrate measurement models and illustrate the difference between them by using different measurement methods, which are applied to a set of constant-rate and content-adaptive streams generated by a variety of commercial encoders.

Then, we show the benefits of using the segment rate, rather than a sliding window, as the target rate. This approach showed statistically significant improvements in compression efficiency when using the open-source x265 encoder [6],[7], which generates bitstreams compliant with the HEVC video coding standard [2].

CP-Steering: CDN- and Protocol-Aware Content Steering Solution for HTTP Adaptive Video Streaming

  • Reza Farahani
  • Abdelhak Bentaleb
  • Mohammad Shojafar
  • Hermann Hellwagner

In recent years, HTTP Adaptive Streaming (HAS)-based technologies, such as Dynamic Adaptive Streaming over HTTP (DASH), have become the predominant video delivery paradigm over the Internet. HAS-based content providers frequently employ multiple Content Delivery Networks (CDNs) to distribute their content to end users. Recently, Apple and the DASH-IF introduced the content steering technique to enable content providers to switch the content source that a player utilizes at start-up or midstream. Due to diverse adaptive video streaming demands for latency-sensitive and/or bandwidth-sensitive streams, satisfying end users with a pleasant Quality of Experience (QoE) poses a new challenge for redesigning the current content steering strategies. This paper leverages the recently popular Quick UDP Internet Connections (QUIC) transport protocol and introduces a CDN- and Protocol-aware content Steering solution for HTTP Adaptive Streaming (CP-Steering). We discuss the details of the CP-Steering strategy and then give directions for future work.

Improving ABR Encoding by Adaptation to "True Resolution" of the Content

  • Yuriy Reznik
  • Karl Lillevold
  • Abhijith Jagannath
  • Nabajeet Barman

As is well known, when the input video is upscaled, the effectiveness of its transcoding and delivery may suffer. The encoded stream may not look sharp yet use more bits than necessary. With adaptive streaming, extra streams may be added to reach such a maximum resolution and bitrate. The result is a significant waste of storage, bandwidth, and compute resources. In this paper, we explain the origins of this problem, survey existing methods for addressing it, and then propose our solution. Our proposed design incorporates a novel "true resolution" detection technique and a traditional CAE (context-aware encoding) ladder generator. The CAE generator receives the detected "true resolution" of the content as a limit for the resolutions to include in the ladder. Such a limit enables all subsequent savings. We describe the details of our proposed resolution detection method, give examples explaining how it works, and then study the performance of our proposed system in practice. Our study, performed using 500 video assets representing 120 hours of real-world production material, confirms the effectiveness of this technique. It shows that in many practical cases, the incoming content is, in fact, upscaled and that adding a "true resolution" detector to CAE brings very appreciable savings in bandwidth, storage, and compute costs.
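The paper's detector is only summarized in the abstract; as an illustration of the general idea, content upscaled from a lower resolution exhibits a collapsed high-frequency band, which a spectral-energy check can expose. The 1-D sketch below (direct DFT, hand-picked cutoff) is illustrative only and is not the proposed detection method.

```python
import cmath

def high_freq_energy_ratio(samples, cutoff=0.5):
    """Fraction of (non-DC) spectral energy above cutoff * Nyquist,
    via a direct DFT. Genuinely detailed content keeps energy near
    Nyquist, while content upscaled from a lower resolution shows a
    collapsed high band, so a low ratio hints at upscaling."""
    n = len(samples)
    mean = sum(samples) / n
    spectrum = []
    for k in range(1, n // 2 + 1):  # positive frequencies up to Nyquist
        coeff = sum((s - mean) * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, s in enumerate(samples))
        spectrum.append(abs(coeff) ** 2)
    total = sum(spectrum)
    high = sum(spectrum[int(cutoff * len(spectrum)):])
    return high / total if total else 0.0
```

In a 2-D setting the same check would run on image rows/columns (or a 2-D FFT), and the detected cutoff frequency maps back to an estimate of the original capture resolution.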

Live Low-Latency Cloud-based Adaptive Video Streaming Evaluation Framework

  • Babak Taraghi

We present an open-source cloud-based adaptive video streaming evaluation framework; a testbed that can be instantiated in a cloud infrastructure, examine multiple media players with different network profiles, and conclude the evaluation with statistics and insights into the result [3]. It has been used in multiple projects to evaluate the performance of media players' Adaptive Bitrate (ABR) algorithms and to conduct subjective evaluations to compare with known QoE models like ITU-T P.1203. We used this framework to address unanswered questions such as (i) the minimum noticeable duration for stall events in HAS; (ii) the correlation between the media quality and the impact of stall events on QoE; (iii) the end-user preference regarding multiple shorter stall events versus a single longer stall event; and (iv) the end-user preference of media quality switches over stall events [2].

In the Live Low-Latency extension, LLL-CAdViSE [1], we generate the audiovisual content on the origin server. Unlike the previous version, no VOD content is required, and the generation, encoding, and packaging of the audio and video streams happen while the experiment is executing. We use FFmpeg to generate CMAF chunks and PUT them through an HTTP connection into the delivery server. The delivery server then transfers the chunks on request by the clients, as shown in Figure 1. The network characteristics can be shaped based on real-life network traces. A comprehensive log of the events (e.g., latency, stalls, quality switches, etc.) is stored in the database to calculate objective Mean Opinion Scores (MOS) automatically.

In this talk, we will demonstrate a low-latency live streaming video workflow, including experimental results, and discuss the key components of this open-source software. The source code of this testbed is available in the following repository:

Video Quality Measurement & Control for Live Encoding

  • Jan De Cock

Video Quality (VQ) Measurement is essential in a variety of applications, e.g. to compare video codecs, to select bitrates or configurations, or to optimize encoder behavior. Many different VQ metrics have been developed over the past decades, both full-reference (comparing source and encoded videos) and no-reference (without access to the source) [3]. For most offline encoding use cases, these metrics typically suffice, and give a fairly accurate view of the subjective quality of compressed video streams. In live video distribution, however, VQ measurement becomes a lot more challenging, in particular when computational and cost constraints come into play. While several accurate metrics exist (such as MS-SSIM and VMAF), these are often too complex for real-time video compression and decision making. Faster metrics exist, but these typically lack accuracy.

Fortunately, we have seen initiatives towards more efficient metrics. As an example, simplifications of SSIM have been introduced in [1]. In recent years, we have seen efforts to reduce the computational complexity of VMAF [2]. By using vectorization, fixed-point approximations, and multi-threading, the runtime of VMAF has already been reduced. Still, its overall complexity remains high. In particular, the complexity relative to that of the encoding process is an important driver for its usage. Translated into business terms, the cost per channel of the VQ metric becomes an important criterion for adoption, and can be prohibitively high.
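To make the cost discussion concrete: the SSIM formula itself is cheap when evaluated once over a global window; the expensive part of real implementations is the sliding local window (Gaussian or 8x8) and the pooling over all positions. The sketch below is an illustrative global-window variant, not a drop-in replacement for windowed SSIM.

```python
def global_ssim(x, y, dynamic_range=255):
    """SSIM evaluated over a single window covering the whole image
    (inputs are flattened pixel lists). Real implementations slide a
    local window and average the per-window scores, which dominates
    the runtime; the formula per window is exactly this."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilizing constants from the
    c2 = (0.03 * dynamic_range) ** 2   # original SSIM definition
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))
```

Counting operations per pixel in this kernel, and multiplying by the number of window positions, gives a quick back-of-the-envelope estimate of why windowed SSIM and, even more so, VMAF strain a real-time encoding budget.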

In this talk, we focus on the computational complexity of commonly used VQ metrics, and discuss approximations for further complexity reduction, including approaches based on machine learning. We address the difficulties around real-time VQ measurement from a complexity/cost standpoint, especially when these metrics have to be embedded inside live encoders.

Furthermore, we discuss which metrics can be used inside encoders for real-time decision making and active video quality control. Based on the fast quality metrics discussed above, advanced rate-quality control mechanisms can be embedded deep inside the encoders. Quality control is beneficial in multiple scenarios: (i) CBR quality control; (ii) ABR quality control and profile optimization; and (iii) intelligent bit distribution within "traditional" statmux bundles. Quality control leads to more efficient bitrate allocation and more constant quality throughout the video stream. We introduce novel algorithms for quality control for each of these applications, along with the compression savings we measured in the field.

Fast and Robust Video Deduplication

  • Chris Henry
  • Rijun Liao
  • Ruiyuan Lin
  • Zhebin Zhang
  • Hongyu Sun
  • Zhu Li

The popularity of social media networks and mobile devices has skyrocketed in recent years, leading to a rapid increase in video content being recorded and uploaded to online platforms such as TikTok and YouTube. As a result, the amount of illegally pirated video has risen as well. These pirated videos have the same content as the originals, but with minor editing effects and variations in coding. The task of finding such duplicate videos is known as video deduplication. Storing this huge amount of video data is, moreover, a challenging issue in itself.

Generally, video deduplication systems rely on feature extraction followed by computation of a similarity score. The ideal goal is a discriminative feature that can differentiate among similar videos. In this work, we present a robust and lightweight video deduplication system that can find duplicate videos in a large video repository within a few milliseconds. We propose a robust and highly discriminative representation combining a Fisher vector and a thumbnail feature. The Fisher vector is generated by applying Fisher vector aggregation to Scale-Invariant Feature Transform (SIFT) keypoints, with a Gaussian Mixture Model (GMM) as the generative model; the GMM is trained on SIFT keypoints extracted from frames sampled uniformly from videos. For the thumbnail feature, each frame is resized to a 12x12 resolution. We reduce the dimensionality of both the Fisher vector and the thumbnail feature to 32-d using principal component analysis (PCA), yielding our proposed deduplication features.
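The Fisher vector stage above can be sketched as follows. This is a simplified version keeping only the gradient with respect to the GMM means, with random descriptors standing in for real SIFT keypoints (which are 128-d in practice); all dimensions and parameter values here are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

def fisher_vector(descriptors, gmm):
    # Simplified Fisher vector: soft-assignment-weighted, whitened residuals
    # against each GMM mean (power/L2 normalisation omitted for brevity).
    q = gmm.predict_proba(descriptors)                      # (N, K) soft assignments
    diff = descriptors[:, None, :] - gmm.means_[None, :, :]  # (N, K, D) residuals
    diff /= np.sqrt(gmm.covariances_)[None, :, :]            # whiten (diag covariance)
    fv = (q[:, :, None] * diff).sum(axis=0)                  # (K, D) aggregate
    fv /= descriptors.shape[0] * np.sqrt(gmm.weights_)[:, None]
    return fv.ravel()                                        # K*D-dim vector

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 16))   # stand-in for SIFT descriptors
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(train)

# One Fisher vector per frame for a small batch of frames
fvs = np.stack([fisher_vector(rng.normal(size=(100, 16)), gmm)
                for _ in range(50)])

# PCA down to a compact 32-d deduplication feature, as described above
features = PCA(n_components=32).fit_transform(fvs)
print(features.shape)   # (50, 32)
```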

A large repository of 1 million frames was generated by extracting frames from videos at uniform intervals. Each frame in the repository was assigned a unique global timestamp. For fast retrieval from this large repository, a multiple k-d tree setup was designed that builds a separate k-d tree for the Fisher vector and for the thumbnail feature. K-nearest-neighbor (KNN) search over a k-d tree returns the samples nearest to the query; in our case, we combine the candidates retrieved from both k-d trees for better retrieval results.
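The two-tree retrieval step can be sketched as below, with random 32-d vectors standing in for the PCA-reduced Fisher and thumbnail features (the function name and `k` value are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(1)
n_frames = 10_000                                  # toy stand-in for the 1M repository
fisher_feats = rng.normal(size=(n_frames, 32))     # 32-d PCA-reduced Fisher vectors
thumb_feats  = rng.normal(size=(n_frames, 32))     # 32-d PCA-reduced thumbnail features

# One k-d tree per feature type, as described above
tree_fisher = KDTree(fisher_feats)
tree_thumb  = KDTree(thumb_feats)

def retrieve(q_fisher, q_thumb, k=5):
    # KNN search on each tree, then union the retrieved frame indices
    _, idx_f = tree_fisher.query(q_fisher[None, :], k=k)
    _, idx_t = tree_thumb.query(q_thumb[None, :], k=k)
    return np.union1d(idx_f[0], idx_t[0])

# Querying with frame 42's own features must return frame 42 among the candidates
cands = retrieve(fisher_feats[42], thumb_feats[42])
print(42 in cands)   # True
```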

The k-d tree retrieval provides us with a frame-level retrieval system. For video retrieval, we develop a data pruning strategy that uses the sequence ID and timestamp information to accurately identify duplicate videos. The CDVS dataset [1] was used for training the GMM; the 1-million-frame repository and the test data were generated from the large-scale FIVR-200K dataset [2]. Experimental results show that retrieval is highly accurate and that the system can process a query video within a few milliseconds.
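One plausible shape for such a pruning step, assuming (this is a sketch of the general idea, not the paper's exact algorithm) that a true duplicate yields many frame matches to the same video with a mutually consistent timestamp offset:

```python
from collections import defaultdict

def videos_from_frame_matches(matches, min_votes=3, max_offset_spread=2.0):
    """Prune frame-level matches to video-level duplicates (illustrative only).

    matches: (video_id, repo_timestamp, query_timestamp) tuples from the
    k-d tree retrieval. A video is kept if enough of its frames match AND
    the (repo - query) time offsets agree, i.e. the matches line up as a
    contiguous duplicated segment rather than scattered coincidences.
    """
    offsets = defaultdict(list)
    for vid, t_repo, t_query in matches:
        offsets[vid].append(t_repo - t_query)
    return [vid for vid, offs in offsets.items()
            if len(offs) >= min_votes
            and max(offs) - min(offs) <= max_offset_spread]

# Toy example: video "A" matches consistently (~10 s offset); "B" is noise
matches = [
    ("A", 10.0, 0.0), ("A", 12.0, 2.1), ("A", 14.0, 4.0),
    ("B", 3.0, 0.0), ("B", 40.0, 2.0),
]
print(videos_from_frame_matches(matches))   # ['A']
```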