AI4TV '19- Proceedings of the 1st International Workshop on AI for Smart TV Content Production, Access and Delivery

Full Citation in the ACM Digital Library

SESSION: Keynote Talks

AI Gets Creative

  •      Marta Mrak

Numerous breakthroughs in multimedia signal processing are being enabled by applications of machine learning to tasks such as multimedia creation, enhancement, classification and compression [1]. Notably, in the context of the production and distribution of television programmes, it has been successfully demonstrated how Artificial Intelligence (AI) can support innovation in the creative sector. For the delivery of TV programmes of stunning visual quality, deep learning has enabled significant advances when the original content is of poor quality or resolution, or when delivery channels are very limited. Examples where enhancement of originally poor-quality material is needed include new content forms (e.g. user-generated content) and historical content (e.g. archives), while the limitations of delivery channels can, first of all, be addressed by improving content compression. As a state-of-the-art example, the benefits of deep-learning solutions have recently been demonstrated within an end-to-end platform for the management of user-generated content [2], where deep learning is applied to increase video resolution, evaluate video quality and enrich the video with automatic metadata. Within this application space, where large amounts of user-generated content are available, progress has also been made in automatically addressing visual story editing using social media data, making it faster to create programmes from large amounts of content [3]. Broadcasters are also interested in restoring historical content more cheaply. For example, adding colour to "black and white" content has until now been an expensive and time-consuming task; recently, however, new algorithms have been developed to perform it more efficiently. Generative Adversarial Networks (GANs) have become the baseline for many image-to-image translation tasks, including image colourisation. Aiming at the generation of more naturally coloured images from "black and white" sources, the newest algorithms are capable of generalising the colour of natural images, producing realistic and plausible results [4]. In the context of content delivery, new generations of compression standards enable a significant reduction of the required bandwidth [5], however at the cost of increased computational complexity. This is another area where AI can be utilised for better efficiency, either in simple forms such as decision trees [6,7] or as more advanced deep convolutional neural networks [8]. Looking forward, this penetration of AI opens new challenges, such as the interpretability of deep learning (to enable the use of AI in an accountable way, as well as to enable AI-inspired low-complexity algorithms) and its applicability in systems which require low-complexity solutions and/or do not have enough training data. Overall, however, the further benefits of these new approaches include the automation of many traditional production tasks, which has the potential to transform the way content providers make their programmes, making them cheaper and more effective.

SESSION: Full Papers

Gender Representation in French Broadcast Corpora and Its Impact on ASR Performance

  •      Mahault Garnerin
  • Solange Rossato
  • Laurent Besacier

This paper analyzes gender representation in four major corpora of French broadcast speech. As these corpora are widely used within the speech processing community, they are primary material for training automatic speech recognition (ASR) systems. Since gender bias has been highlighted in numerous natural language processing (NLP) applications, we study the impact of the gender imbalance in TV and radio broadcasts on the performance of an ASR system. This analysis shows that women are under-represented in our data in terms of both speakers and speech turns. We introduce the notion of speaker role to refine our analysis and find that women are even less represented within the Anchor category, which corresponds to prominent speakers. The disparity in the data available for each gender causes performance to decrease for women. However, this global trend seems to be counterbalanced when a sufficient amount of data per speaker is available.
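
As a rough illustration of this kind of analysis (the utterances, transcripts and group labels below are hypothetical, not the paper's data), word error rate (WER) can be aggregated per speaker group to surface such a gap:

```python
# Illustrative sketch: per-group word error rate.
# The corpus entries below are invented for demonstration only.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def wer_by_group(utterances):
    """Aggregate WER per group key (e.g. speaker gender or role)."""
    totals = {}  # group -> [total word errors, total reference words]
    for group, ref, hyp in utterances:
        n_words = len(ref.split())
        t = totals.setdefault(group, [0.0, 0])
        t[0] += wer(ref, hyp) * n_words   # raw edit-distance errors
        t[1] += n_words
    return {g: errors / n for g, (errors, n) in totals.items()}

corpus = [
    ("F", "the weather is nice today", "the weather is nice today"),
    ("M", "broadcast news at nine", "broadcast news at nine"),
    ("F", "a report from brussels", "the report from brussel"),
]
print(wer_by_group(corpus))
```

Grouping by speaker role instead of gender (e.g. Anchor vs. other speakers) only requires changing the group key.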

Data-driven Summarization and Synchronized Second-screen Enrichment of Cycling Races: Using Live and Historical Sports Data to Reinvent Traditional Reporting

  •      Steven Verstockt
  • Erik Mannens
  • Jelle De Bock

Traditional broadcasters of cycling races are experiencing hard times, as the number of spectators is decreasing each year. Other ways of reporting are needed to keep the viewer interested. In this paper, two possible solutions are proposed that were evaluated during the Grand Départ of the Tour de France 2019 in Brussels. The first innovation focuses on data-driven summarization and allows end-users to query for personalized stories of a race, tailored to their wishes (such as the length of the clip and the teams and/or riders they are interested in). The second innovation follows the second-screen trend and synchronizes cycling heritage multimedia data with the riders' live location during the race. Both rich, interactive TV experiences are based on a combination of data mining and computer vision techniques which can also be applied to other sports with similar characteristics. Evaluation by a test audience showed that there is certainly potential in both formats.

A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization

  •      Evlampios Apostolidis
  • Alexandros I. Metsai
  • Eleni Adamantidou
  • Vasileios Mezaris
  • Ioannis Patras

In this paper we present our work on improving the efficiency of adversarial training for unsupervised video summarization. Our starting point is the SUM-GAN model, which creates a representative summary based on the intuition that such a summary should make it possible to reconstruct a video that is indistinguishable from the original one. We build on a publicly available implementation of a variation of this model, which includes a linear compression layer to reduce the number of learned parameters and applies an incremental approach for training the different components of the architecture. After assessing the impact of these changes on the model's performance, we propose a stepwise, label-based learning process to improve the training efficiency of the adversarial part of the model. Before evaluating our model's efficiency, we perform a thorough study of the evaluation protocols used and examine the achievable performance on two benchmark datasets, namely SumMe and TVSum. Experimental evaluations and comparisons with the state of the art highlight the competitiveness of the proposed method. An ablation study indicates the benefit of each applied change to the model's performance, and points out the advantageous role of the introduced stepwise, label-based training strategy for the learning efficiency of the adversarial part of the architecture.
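
As a rough sketch of what stepwise, label-based discriminator training looks like in general (a toy logistic discriminator on synthetic features, not the SUM-GAN architecture or the authors' code), each input type is paired with its target label in a separate update step:

```python
# Toy sketch of stepwise, label-based adversarial updates: the
# discriminator is trained in separate labelled steps, one per input
# type, instead of with a single combined loss. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyDiscriminator:
    """Logistic-regression stand-in for a GAN discriminator."""
    def __init__(self, dim, lr=0.5):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def step(self, x, label):
        """One labelled update on a batch x with target `label`."""
        p = sigmoid(x @ self.w + self.b)           # P("original")
        grad = p - label                           # dBCE/dlogit
        self.w -= self.lr * (x.T @ grad) / len(x)  # gradient descent
        self.b -= self.lr * grad.mean()

# Toy features: "original" videos cluster at +1, reconstructions at -1.
originals = rng.normal(+1.0, 0.3, size=(64, 4))
reconstructed = rng.normal(-1.0, 0.3, size=(64, 4))

disc = TinyDiscriminator(dim=4)
for _ in range(50):
    disc.step(originals, label=1.0)      # step 1: originals, label 1
    disc.step(reconstructed, label=0.0)  # step 2: reconstructions, label 0

acc = (
    (sigmoid(originals @ disc.w + disc.b) > 0.5).mean()
    + (sigmoid(reconstructed @ disc.w + disc.b) < 0.5).mean()
) / 2
```

In the full architecture the generator (summarizer) would be updated in a further step, with the discriminator's labels inverted.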

On the Robustness of Deep Learning Based Face Recognition

  •      Werner Bailer
  • Martin Winter

Identifying persons using face recognition is an important task in applications such as media production, archiving and monitoring. Like those for other tasks, face recognition pipelines have recently shifted to approaches based on Deep Convolutional Neural Networks (DCNNs). While these show impressive performance on standard benchmark datasets, the same performance is not always reached on real data from media applications. In this paper we address robustness issues in a face detection and recognition pipeline. First, we analyze the impact of image impairments (in particular compression) on face detection, and how to conceal them in order to improve face detection performance. This is studied on face samples originating from both still image and video data. Second, we propose approaches to improve open-set face recognition, i.e., the handling of "unknown" persons, in particular to reduce false positive recognitions. We provide experimental results on image and video data and draw conclusions that help to improve performance in practical applications.

L-STAP: Learned Spatio-Temporal Adaptive Pooling for Video Captioning

  •      Danny Francis
  • Benoit Huet

Automatic video captioning can be used to enrich TV programs with textual information on scenes. This information can be useful for visually impaired people, but can also be used to enhance the indexing and searching of TV records. Video captioning can be seen as more challenging than image captioning. In both cases, a visual object has to be analyzed and translated into a textual description in natural language. However, analyzing videos requires not only parsing still images, but also drawing correspondences through time. Recent works in video captioning have attempted to deal with these issues by separating the spatial and temporal analysis of videos. In this paper, we propose a Learned Spatio-Temporal Adaptive Pooling (L-STAP) method that combines spatial and temporal analysis. More specifically, we first process a video frame-by-frame through a Convolutional Neural Network. Then, instead of applying an average pooling operation to reduce dimensionality, we apply our L-STAP, which attends to specific regions in a given frame based on what appeared in previous frames. Experiments on the MSVD and MSR-VTT datasets show that our method outperforms state-of-the-art methods on the video captioning task in terms of several evaluation metrics.
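
The general pooling idea can be sketched as follows (shapes, parameter names and the state-update rule are illustrative assumptions, not the authors' implementation): region features in each frame are weighted by attention scores conditioned on a state carried over from previous frames, rather than averaged uniformly.

```python
# Sketch of temporally conditioned attention pooling over CNN region
# features. Dimensions and the exponential-average state are assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_pool(frame_features, w_att, w_bias):
    """Pool per-frame region features with attention driven by past frames.

    frame_features: (T, R, D) array -- T frames, R spatial regions,
                    D-dimensional CNN features per region.
    w_att:          (D, D) attention weight matrix.
    w_bias:         (D,) content-based scoring vector.
    Returns:        (T, D) pooled features, one vector per frame.
    """
    T, R, D = frame_features.shape
    state = np.zeros(D)                     # summary of previous frames
    pooled = np.empty((T, D))
    for t in range(T):
        regions = frame_features[t]         # (R, D)
        # Score each region against the running temporal state.
        scores = regions @ (w_att @ state) + regions @ w_bias  # (R,)
        weights = softmax(scores)
        pooled[t] = weights @ regions       # attention-weighted pooling
        state = 0.9 * state + 0.1 * pooled[t]  # carry context forward
    return pooled

rng = np.random.default_rng(1)
frame_feats = rng.normal(size=(8, 49, 16))  # 8 frames, 7x7 regions, 16-d
pooled = adaptive_pool(frame_feats,
                       0.1 * rng.normal(size=(16, 16)),
                       rng.normal(size=(16,)))
```

In contrast, average pooling would simply return `frame_features.mean(axis=1)`, discarding which regions mattered in the temporal context.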

AI for Audience Prediction and Profiling to Power Innovative TV Content Recommendation Services

  •      Lyndon Nixon
  • Krzysztof Ciesielski
  • Basil Philipp

In contemporary TV audience prediction, outliers are considered mere anomalies in the otherwise cyclical trend and seasonality components that are used to make predictions. In the ReTV project, we want to provide more accurate audience predictions in order to enable innovative services for TV content recommendation. This paper presents a concept for identifying the source of outliers and factoring in TV content categories and the occurrence of events as additional features for training TV audience prediction models. We show how this can improve the accuracy of audience prediction. Finally, we outline how this work could be combined with AI-enabled audience profiling to power new content recommendation services.
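
A minimal sketch of the feature idea on simulated data (the category and event variables are hypothetical, not ReTV's actual features or model): encoding known outlier causes as regression features reduces prediction error compared with a purely seasonal model.

```python
# Sketch: augmenting a seasonal audience model with content-category and
# event-occurrence dummies, so known "outlier" causes become features.
# All data below is simulated for illustration.
import numpy as np

rng = np.random.default_rng(42)
n_days = 200
day_of_week = np.arange(n_days) % 7

category = rng.integers(0, 3, size=n_days)          # e.g. news/sport/film
event = (rng.random(n_days) < 0.05).astype(float)   # rare special events

# Simulated audience: weekly seasonality + category effect + event spikes.
audience = (
    100 + 20 * np.sin(2 * np.pi * day_of_week / 7)
    + np.array([0.0, 15.0, 5.0])[category]
    + 60.0 * event
    + rng.normal(0, 2, n_days)
)

def design(day_of_week, category, event):
    """One-hot day-of-week and category dummies plus the event flag."""
    dow = np.eye(7)[day_of_week]
    cat = np.eye(3)[category]
    return np.column_stack([dow, cat, event])

X = design(day_of_week, category, event)
coef, *_ = np.linalg.lstsq(X, audience, rcond=None)
rmse_with_events = np.sqrt(np.mean((audience - X @ coef) ** 2))

# Baseline without the event feature: unexplained spikes inflate the error.
Xb = X[:, :-1]
coefb, *_ = np.linalg.lstsq(Xb, audience, rcond=None)
rmse_without = np.sqrt(np.mean((audience - Xb @ coefb) ** 2))
```

On this toy data, the model that knows about events fits the spikes instead of treating them as noise, so its error is markedly lower.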

SESSION: Demonstrations

Examples of Uses of Artificial Intelligence in Video Archives

  •      Antoine Mercier
  • Sébastien Ducret
  • Charlotte Bürki
  • Léonard Bouchet

This demo paper presents several applications of artificial intelligence techniques to video archives, such as face recognition, visual search, and classification based on faces, objects or landmarks. All algorithms are based on features that are extracted once at the beginning of the pipeline; this shared first stage allows flexibility and reduces computing time. The algorithms were put into production early in the development process, allowing early feedback and resulting in a user-focused development.

Automatically Adapting and Publishing TV Content for Increased Effectiveness and Efficiency

  •      Basil Philipp
  • Krzysztof Ciesielski
  • Lyndon Nixon

By automatically adapting TV content for publication on secondary channels, we aim to increase user engagement and at the same time reduce the manual effort for broadcasters. We present a system architecture for automatic content adaptation and two distinct use cases built on top of it.

A Workstation for Real-Time Processing of Multi-Channel TV

  •      Mathieu Delalandre

This paper presents the architecture of a workstation for the real-time processing of multi-channel TV. A first application, for the real-time detection of video copies, is discussed.