NarSUM '22: Proceedings of the 1st Workshop on User-centric Narrative Summarization of Long Videos

SESSION: NarSUM Invited Talks

Video Summarization in the Deep Learning Era: Current Landscape and Future Directions

  • Ioannis Patras

In this talk we will provide an overview of the field of video summarization, with a focus on the developments, trends, and open challenges in the era of Deep Learning and Big Data. After a brief introduction to the problem, we will provide a broad taxonomy of the works in the area and the recent trends from multiple perspectives, including types of methodologies/architectures, supervision signals, and modalities. We will then present current datasets and evaluation protocols, highlighting their limitations and the challenges they pose. Finally, we will close by giving our perspective on the open challenges in the field and on interesting future directions.

Learning, Understanding and Interaction in Videos

  • Manmohan Chandraker

Advances in mobile phone camera technologies and internet connectivity have made videos one of the most intuitive ways to communicate and share experiences. Millions of cameras deployed in our homes, offices, and public spaces record videos for purposes ranging across safety, assistance, entertainment, and many others. This talk describes some of our recent progress in learning, understanding, and interaction with such digital media. It will introduce methods in unsupervised and self-supervised representation learning that allow video solutions to be deployed efficiently with minimal data curation. It will discuss how physical priors and human knowledge are leveraged to extract insights from videos, ranging from three-dimensional scene properties to language-based descriptions. It will also illustrate how these insights allow us to augment or interact with digital media with unprecedented photorealism and ease.


Panel Discussion: Emerging Topics on Video Summarization

  • Mohan Kankanhalli
  • Jianquan Liu
  • Yongkang Wong
  • Karen Stephen

With video capture devices becoming widely popular, the amount of video data generated per day has increased rapidly over the past few years. Browsing through hours of video to retrieve useful information is a tedious and boring task. Video summarization technology has played a crucial role in addressing this issue and is a well-researched topic in the multimedia community. This panel aims to bring together researchers with relevant backgrounds to discuss emerging topics in video summarization, including recent developments, future directions, challenges, solutions, potential applications, and other open problems.

SESSION: NarSUM Session: Dataset, Recognition, and Summarization

Narrative Dataset: Towards Goal-Driven Narrative Generation

  • Karen Stephen
  • Rishabh Sheoran
  • Satoshi Yamazaki

In this paper, we propose a new dataset, the Narrative dataset, a work in progress towards generating video and text narratives of complex daily events from long videos captured by multiple cameras. As most existing datasets are collected from publicly available videos such as YouTube videos, no dataset targets the task of narrative summarization of complex videos that contain multiple narratives. Hence, we create story plots and conduct video shoots with hired actors to produce complex video sets in which 3 to 4 narratives unfold in each video. In the story plot, a narrative is composed of multiple events corresponding to video clips of key human activities. On top of the shot video sets and the story plots, the Narrative dataset contains dense annotations of actors, objects, and their relationships for each frame as the facts of the narratives. The dataset therefore captures a rich, holistic, and hierarchical structure of facts, events, and narratives. Moreover, we introduce the Narrative Graph, a collection of scene graphs of narrative events together with their causal relationships, to bridge the gap between the collection of facts and the generation of summary sentences for a narrative. Beyond related subtasks such as scene graph generation, the Narrative dataset potentially provides challenging subtasks for bridging human event clips to narratives.

Soccer Game Summarization using Audio Commentary, Metadata, and Captions

  • Sushant Gautam
  • Cise Midoglu
  • Saeed Shafiee Sabet
  • Dinesh Baniya Kshatri
  • Pål Halvorsen

Soccer is one of the most popular sports globally, and the amount of soccer-related content worldwide, including video footage, audio commentary, team/player statistics, scores, and rankings, is enormous and rapidly growing. Consequently, the generation of multimodal summaries is of tremendous interest for broadcasters and fans alike, as a large percentage of audiences prefer to follow only the main highlights of a game. However, annotating important events and producing summaries often requires expensive equipment and a lot of tedious, cumbersome, manual labour. In this context, recent developments in Artificial Intelligence (AI) have shown great potential. The goal of this work is to create an automated soccer game summarization pipeline using AI. In particular, our focus is on the generation of complete game summaries in continuous text format with length constraints, based on raw game multimedia, as well as readily available game metadata and captions where applicable, using Natural Language Processing (NLP) tools along with heuristics. We curate and extend a number of soccer datasets, implement an end-to-end pipeline for the automatic generation of text summaries, present our preliminary results from the comparative analysis of various summarization methods within this pipeline using different input modalities, and provide a discussion of open challenges in the field of automated game summarization.

Contrastive Representation Learning for Expression Recognition from Masked Face Images

  • Fanxing Luo
  • Longjiao Zhao
  • Yu Wang
  • Jien Kato

With the worldwide spread of COVID-19, people are trying different ways to prevent the spread of the virus. One of the most useful and popular is wearing a face mask. Most people wear a face mask when they go out, which makes facial expression recognition harder. Thus, improving the performance of facial expression recognition models on masked faces is becoming an important issue. However, there is no public dataset of facial expressions with masks. We therefore built two datasets: a real-world masked facial expression database (VIP-DB) and a synthetically masked facial expression database (M-RAF-DB). To reduce the influence of masks, we utilize contrastive representation learning and propose a two-branch network. We study the influence of contrastive learning on our two datasets. Results show that using contrastive representation learning improves the performance of expression recognition from masked face images.
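The abstract does not specify the exact contrastive objective used; a common choice for contrastive representation learning is the SimCLR-style NT-Xent loss, in which two views of the same face (e.g., masked and unmasked crops) form a positive pair and all other samples in the batch act as negatives. The sketch below is a minimal NumPy implementation under that assumption, not the paper's actual formulation; the function name and temperature value are illustrative.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) contrastive loss.

    z1, z2 : (N, d) embeddings of two views of the same N samples
             (e.g., masked / unmasked versions of the same face).
    Positive pair for sample i in z1 is sample i in z2; all other
    2N - 2 embeddings in the batch serve as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = (z @ z.T) / temperature                     # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = z1.shape[0]
    # index of each embedding's positive partner: i <-> i + n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # log softmax over each row, then pick out the positive entry
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimizing this loss pulls the two views of each face together in embedding space while pushing apart embeddings of different faces, which is the mechanism by which contrastive pre-training can reduce the representation gap introduced by masks.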