MMSports '19: Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports

Full Citation in the ACM Digital Library

SESSION: Session 1

Video-based Analysis of Soccer Matches

  • Maximilian T. Fischer
  • Daniel A. Keim
  • Manuel Stein

With the increasingly detailed investigation of game play and tactics in invasive team sports such as soccer, it becomes ever more important to present causes, actions and findings in a meaningful manner. Visualizations, especially when augmenting relevant information directly inside a video recording of a match, can significantly improve and simplify soccer match preparation and tactic planning. However, while many visualization techniques for soccer have been developed in recent years, few have been directly applied to the video-based analysis of soccer matches. This paper provides a comprehensive overview and categorization of the methods developed for the video-based visual analysis of soccer matches. While identifying the advantages and disadvantages of the individual approaches, we identify and discuss open research questions whose solutions will enable analysts to develop winning strategies more efficiently, perform rapid failure analysis, or identify weaknesses in opposing teams.

Retrieval of Similar Scenes Based on Multimodal Distance Metric Learning in Soccer Videos

  • Tomoki Haruyama
  • Sho Takahashi
  • Takahiro Ogawa
  • Miki Haseyama

This paper presents a new method for retrieval of similar scenes based on multimodal distance metric learning in far-view soccer videos that broadly capture soccer fields and are not edited. We extract visual features and audio features from soccer video clips, and we extract text features from text data corresponding to these soccer video clips. In addition, distance metric learning based on Laplacian Regularized Metric Learning is performed to calculate the distances for each kind of feature. Finally, by determining the final rank by integrating these distances, we realize successful multimodal retrieval of similar scenes from query scenes of soccer video clips. Experimental results show the effectiveness of our retrieval method.
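
The final-rank integration described in this abstract can be sketched as a simple late fusion: normalize each modality's distances and sum them before ranking. This is an illustrative fusion rule under assumed names; the paper's actual distances come from Laplacian Regularized Metric Learning, and its integration scheme may differ.

```python
def normalize(distances):
    """Min-max normalize a {clip_id: distance} mapping to [0, 1]."""
    lo, hi = min(distances.values()), max(distances.values())
    span = hi - lo or 1.0
    return {cid: (d - lo) / span for cid, d in distances.items()}

def fuse_rankings(modality_distances):
    """Sum normalized distances across modalities and rank ascending
    (smallest fused distance = most similar clip)."""
    fused = {}
    for distances in modality_distances:
        for cid, d in normalize(distances).items():
            fused[cid] = fused.get(cid, 0.0) + d
    return sorted(fused, key=fused.get)

# Toy per-modality distances from a query scene to three candidate clips.
visual = {"clip_a": 0.2, "clip_b": 0.9, "clip_c": 0.5}
audio  = {"clip_a": 0.4, "clip_b": 0.1, "clip_c": 0.8}
text   = {"clip_a": 0.1, "clip_b": 0.7, "clip_c": 0.6}
print(fuse_rankings([visual, audio, text]))  # clip_a ranked first
```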

A Deep Architecture for Multimodal Summarization of Soccer Games

  • Melissa Sanabria
  • Sherly
  • Frédéric Precioso
  • Thomas Menguy

The massive growth of sports videos, especially in soccer, has resulted in a need for the automatic generation of summaries that not only show the most important actions of the match but also elicit as much emotion as summaries produced by human editors. State-of-the-art methods on video summarization mostly rely on video processing; however, this is not an optimal approach for long videos such as soccer matches. In this paper we propose a multimodal approach to automatically generate summaries of soccer match videos that considers both event and audio features. The event features give a shorter and better representation of the match, and the audio helps detect the excitement generated by the game. Our method consists of three consecutive stages: Proposals, Summarization and Content Refinement. The first stage generates summary proposals, using Multiple Instance Learning to deal with the similarity between the events inside the summary and the rest of the match. The Summarization stage uses event and audio features as input to a hierarchical Recurrent Neural Network to decide which proposals should indeed be in the summary. The last stage takes advantage of the visual content to create the final summary. The results show that our approach outperforms by a large margin not only video processing methods but also methods that use event and audio features.

"Does 4-4-2 exist?": An Analytics Approach to Understand and Classify Football Team Formations in Single Match Situations

  • Eric Müller-Budack
  • Jonas Theiner
  • Robert Rein
  • Ralph Ewerth

The chance to win a football match can be significantly increased if the right tactic is chosen and the behavior of the opposing team is well anticipated. For this reason, every professional football club employs a team of game analysts. However, at present game performance analysis is done manually and is therefore highly time-consuming. Consequently, automated tools to support the analysis process are required. In this context, one of the main tasks is to summarize team formations by patterns such as 4-4-2 that can give insights into tactical instructions. In this paper, we introduce an analytics approach that automatically classifies and visualizes the team formation based on the players' position data. We focus on single match situations instead of complete halftimes or matches to provide a more detailed analysis. The novel classification approach calculates the similarity based on pre-defined templates for different tactical formations. A detailed analysis of individual match situations depending on ball possession and match segment length is provided. For this purpose, a visual summary is utilized that captures the team formation in a match segment. An expert annotation study is conducted that demonstrates 1) the complexity of the task and 2) the usefulness of the visualization of single situations to understand team formations. The suggested classification approach outperforms existing methods for formation classification. In particular, our approach gives insights into the shortcomings of using patterns like 4-4-2 to describe team formations.
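
The idea of summarizing a formation by a pattern such as 4-4-2 can be illustrated with a deliberately simple heuristic: bin the ten outfield players into defense, midfield and attack bands by their position along the pitch. This is a toy sketch under assumed coordinates, not the paper's template-similarity classifier.

```python
def formation_pattern(positions, bands=(1 / 3, 2 / 3)):
    """Bin outfield players into defense / midfield / attack by their
    normalized distance from the own goal line (x in [0, 1]) and report
    the counts as a pattern string such as '4-4-2'."""
    counts = [0, 0, 0]
    for x, _y in positions:
        if x < bands[0]:
            counts[0] += 1
        elif x < bands[1]:
            counts[1] += 1
        else:
            counts[2] += 1
    return "-".join(str(c) for c in counts)

# Ten outfield players arranged as a classic back four, midfield four
# and two strikers (coordinates are illustrative).
back_four = [(0.2, y) for y in (0.2, 0.4, 0.6, 0.8)]
mid_four  = [(0.5, y) for y in (0.2, 0.4, 0.6, 0.8)]
front_two = [(0.8, y) for y in (0.35, 0.65)]
print(formation_pattern(back_four + mid_four + front_two))  # 4-4-2
```

The paper's point, of course, is that real position data rarely falls so neatly into such bands, which is exactly why a similarity-based comparison against formation templates is needed.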

SESSION: Keynote & Session 2

Combining Qualitative and Quantitative Analysis in Football with SportSense

  • Philipp Seidenschwarz
  • Adalsteinn Jonsson
  • Fabian Rauschenbach
  • Martin Rumo
  • Lukas Probst
  • Heiko Schuldt

The task of performance analysts and coaches in football (and other team sports) is manifold: they need to assess the performance of individual players of their team, they need to monitor the interaction between players of their team and their tactical compliance, and they need to analyze other teams. For this, they usually have to consider various sources of information: video footage, tracking data, event data, and aggregated statistics. On the basis of this information, analysts have to generate quantitative summaries of events including their spatial and temporal distribution, and the qualitative assessment of individual events by considering the associated video footage. In this paper, we present SportSense, a system for sports video retrieval that seamlessly combines quantitative and qualitative analysis. For this, SportSense provides dedicated filters that help analysts in selecting the events they are interested in. Moreover, it supports the comparative analysis of stored queries with respect to specific parameters. Essentially, SportSense allows users to easily switch between qualitative and quantitative analyses to support coaches and analysts in the best possible way in their task. Based on a user study, we show the effectiveness of the proposed approach.

Frame-Level Event Detection in Athletics Videos with Pose-Based Convolutional Sequence Networks

  • Moritz Einfalt
  • Charles Dampeyrou
  • Dan Zecha
  • Rainer Lienhart

In this paper we address the problem of automatic event detection in athlete motion for automated performance analysis in athletics. We specifically consider the detection of stride-, jump- and landing-related events from monocular recordings in long and triple jump. Existing work on event detection in sports often uses manually designed features on body and pose configurations of the athlete to infer the occurrence of events. We present a two-step approach, where temporal 2D pose sequences extracted from the videos form the basis for learning an event detection model. We formulate the detection of discrete events as a sequence translation task and propose a convolutional sequence network that can accurately predict the timing of event occurrences. Our best performing architecture achieves a precision/recall of 92.3%/89.0% in detecting start and end of ground contact during the run-up and jump of an athlete at a temporal precision of +/- 1 frame at 200 Hz. The results show that 2D pose sequences are a suitable motion representation for learning event detection in a sequence-to-sequence framework.
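
The reported precision/recall at a temporal tolerance of +/- 1 frame can be computed, for example, by greedily matching predicted event frames to ground-truth frames within the tolerance. The matching rule below is an assumption for illustration, not necessarily the paper's evaluation code.

```python
def match_events(predicted, ground_truth, tolerance=1):
    """Greedily match predicted event frames to ground-truth frames
    within +/- `tolerance` frames; return (precision, recall)."""
    unmatched = list(ground_truth)
    tp = 0
    for p in sorted(predicted):
        for g in unmatched:
            if abs(p - g) <= tolerance:
                unmatched.remove(g)  # each ground-truth event matches once
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Ground-contact events at 200 Hz: one prediction off by one frame
# (still counted), one spurious detection, one contact missed.
p, r = match_events([40, 81, 120, 200], [40, 80, 120, 160])
print(p, r)  # 0.75 0.75
```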

SESSION: Session 3

Real-time CNN-based Segmentation Architecture for Ball Detection in a Single View Setup

  • Gabriel Van Zandycke
  • Christophe De Vleeschouwer

This paper considers the task of detecting the ball from a single viewpoint in the challenging but common case where the ball interacts frequently with players while being poorly contrasted with respect to the background. We propose a novel approach by formulating the problem as a segmentation task solved by an efficient CNN architecture. To take advantage of the ball dynamics, the network is fed with a pair of consecutive images. Our inference model can run in real time without the delay induced by a temporal analysis. We also show that test-time data augmentation allows for a significant increase in detection accuracy. As an additional contribution, we publicly release the dataset on which this work is based.
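
Test-time data augmentation of the kind mentioned here can be sketched as averaging the segmentation map over an input and its horizontal flip. This is a toy stand-in: the real model is a CNN on consecutive image pairs, and the paper's augmentation set may differ.

```python
def hflip(img):
    """Horizontally flip a 2D image given as nested lists."""
    return [row[::-1] for row in img]

def tta_segment(model, img):
    """Average the model's segmentation map over the original image and
    its horizontal flip, un-flipping the second prediction so both maps
    align. `model` is any callable mapping an image to a score map."""
    straight = model(img)
    flipped = hflip(model(hflip(img)))
    return [[(a + b) / 2 for a, b in zip(r1, r2)]
            for r1, r2 in zip(straight, flipped)]

# A dummy "model" that echoes pixel intensities as ball scores; with an
# equivariant model the averaged map equals the straight prediction.
identity_model = lambda img: [row[:] for row in img]
img = [[0.0, 1.0], [0.2, 0.8]]
print(tta_segment(identity_model, img))  # [[0.0, 1.0], [0.2, 0.8]]
```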

Prediction of Future Shot Direction using Pose and Position of Tennis Player

  • Tomohiro Shimizu
  • Ryo Hachiuma
  • Hideo Saito
  • Takashi Yoshikawa
  • Chonho Lee

In this paper, we propose a method to predict the future shot direction in a tennis match using pose information and player position. As far as we know, there is no work that deals with such a predictive task, so there is no shot direction dataset as yet. Therefore, using a YouTube tennis match video, we construct a time-of-impact and shot-direction dataset. To reduce annotation costs, we propose a method to automatically label the shot direction. Moreover, we propose a method to predict the future shot direction using the constructed dataset. The shot direction is predicted using an LSTM (long short-term memory) network, from sequential pose information up to the time of impact and the player position. We employ OpenPose to extract the position of skeleton joints. In the experiment, we evaluate the accuracy of shot direction prediction and verify the effectiveness of the proposed method. Since there are no studies that predict future shot direction, we set four baseline methods to evaluate the effectiveness of our proposed method.
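
Automatic labeling of shot direction could, for instance, be derived from the ball's lateral landing position on the court. The three-way left/center/right split and the coordinate convention below are hypothetical assumptions for illustration, not the paper's actual labeling rule.

```python
def label_shot_direction(landing_x, court_width=10.97):
    """Label a shot from the ball's lateral landing coordinate (metres
    from the left sideline of a doubles court, width 10.97 m) as the
    'left', 'center', or 'right' third of the court."""
    third = court_width / 3
    if landing_x < third:
        return "left"
    if landing_x < 2 * third:
        return "center"
    return "right"

print([label_shot_direction(x) for x in (1.0, 5.5, 9.8)])
# ['left', 'center', 'right']
```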

Tracking Jockeys in a Cluttered Environment with Group Dynamics

  • Mohammad Hedayati
  • Michael J. Cree
  • Jonathan B. Scott

This project aims to detect and track jockeys at the turning point of horse races. Detecting and tracking objects is very challenging in a crowded environment such as horse racing due to occlusion. However, in a horse race the jockeys follow each other's paths and move as a slowly changing group. This group dynamic gives an important cue for approximating the location of obscured jockeys. This paper proposes a novel approach to handle occlusion by integrating the group dynamic into a jockey tracking framework. The experimental results show the effect of group dynamics on tracking performance under partial and full occlusions.
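
The group-dynamics cue can be sketched as shifting an occluded jockey's last known position by the mean displacement of the visible group between frames. This is a minimal illustration of the idea, not the paper's tracking framework.

```python
def predict_occluded(prev_positions, curr_positions, occluded_prev):
    """Approximate an occluded jockey's current position by shifting
    the last known position with the mean displacement of the visible
    jockeys (positions are (x, y) pairs matched across two frames)."""
    n = len(prev_positions)
    dx = sum(c[0] - p[0] for p, c in zip(prev_positions, curr_positions)) / n
    dy = sum(c[1] - p[1] for p, c in zip(prev_positions, curr_positions)) / n
    return (occluded_prev[0] + dx, occluded_prev[1] + dy)

# Three visible jockeys move roughly together; the fourth, last seen at
# (6, 2), is assumed to follow the group's mean motion.
prev = [(0, 0), (2, 0), (4, 1)]
curr = [(1, 0), (3, 1), (5, 1)]
print(predict_occluded(prev, curr, (6, 2)))  # shifted by (1, 1/3)
```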

Empirical Analysis of Pacing in Road Cycling

  • Dietmar Saupe
  • Alexander Artiga Gonzalez
  • Ramona Burger
  • Chris Abbiss

The pacing profile adopted throughout a competitive time trial may be decisive in the overall outcome of the event. Riders distribute their energy resources based on a range of factors including prior experience, perception of effort, knowledge of the distance to cover and potential motivation. Some athletes and professional cycling teams may also quantify individual pacing strategies derived from computational scientific methods. In this work we collect and analyze data of self-selected individual pacing profiles from approximately 12,000 competitive riders on a well-known hill climbing road segment in the Adelaide Hills, South Australia. We found that riders chose from a variety of very different pacing profiles, including some opposing profiles. For the classification of pacing, this paper describes the pipeline of collecting GPS-based and time-stamped performance data, data filtering, augmentation with road gradient and power values, and the classification procedure.
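
One simple way to classify a pacing profile, in the spirit of the analysis described here, is to compare first- and second-half mean power over the segment. The three-class scheme and tolerance below are illustrative assumptions, not the paper's classification procedure.

```python
def classify_pacing(power, tolerance=0.05):
    """Classify a per-interval power series as 'positive' (fading),
    'negative' (finishing faster), or 'even', by comparing the mean
    power of the first and second halves of the segment."""
    half = len(power) // 2
    first = sum(power[:half]) / half
    second = sum(power[half:]) / (len(power) - half)
    if second < first * (1 - tolerance):
        return "positive"
    if second > first * (1 + tolerance):
        return "negative"
    return "even"

# A rider who fades on the climb vs. one who finishes faster (watts).
print(classify_pacing([300, 295, 290, 250, 240, 230]))  # positive
print(classify_pacing([250, 250, 260, 265, 280, 290]))  # negative
```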

SESSION: Session 4

Running Event Visualization using Videos from Multiple Cameras

  • Yeshwanth Napolean
  • Priadi T. Wibowo
  • Jan C. van Gemert

Visualizing the trajectory of multiple runners with videos collected at different points in a race could be useful for sports performance analysis. The videos and the trajectories can also aid in athlete health monitoring. While each runner's unique ID and appearance are distinct, the task is not straightforward because the video data does not contain explicit information as to which runners appear in each of the videos. There is no direct supervision of the model in tracking athletes, only filtering steps to remove irrelevant detections. Other factors of concern include occlusion of runners and harsh illumination. To this end, we identify two methods for runner identification at different points of the event, for determining their trajectory. One is scene text detection, which recognizes runners by detecting the unique 'bib number' attached to their clothes, and the other is person re-identification, which detects runners based on their appearance. We train our method without ground truth, but to evaluate the proposed methods we create a ground truth database which consists of video and frame interval information where the runners appear. The videos in the dataset were recorded by nine cameras at different locations during a marathon event. This data is annotated with the bib numbers of runners appearing in each video. The bib numbers of runners known to occur in the frame are used to filter irrelevant text and numbers detected. Except for this filtering step, no supervisory signal is used. The experimental evidence shows that the scene text recognition method achieves an F1-score of 74. Combining the two methods, that is, using samples collected by the text spotter to train the re-identification model, yields a higher F1-score of 85.8. Re-training the person re-identification model with identified inliers yields a slight improvement in performance (F1-score of 87.8). This combination of text recognition and person re-identification can be used in conjunction with video metadata to visualize running events.
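
The bib-number filtering step described above amounts to keeping only text detections that match a list of bib numbers known to appear in the video. A minimal sketch with hypothetical field names:

```python
def filter_bib_detections(detections, known_bibs):
    """Keep only OCR detections whose text matches a bib number known
    to appear in the video; this is the weak filtering signal used in
    place of frame-level supervision."""
    known = set(known_bibs)
    return [d for d in detections if d["text"] in known]

detections = [
    {"text": "1042", "frame": 17},
    {"text": "EXIT", "frame": 17},  # scene text, not a bib number
    {"text": "2217", "frame": 18},
    {"text": "9999", "frame": 18},  # bib not annotated for this video
]
print(filter_bib_detections(detections, ["1042", "2217", "3108"]))
# keeps only the two annotated bib numbers
```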

Detection of Tennis Events from Acoustic Data

  • Aaron Baughman
  • Eduardo Morales
  • Gary Reiss
  • Nancy Greco
  • Stephen Hammer
  • Shiqiang Wang

Professional tennis is a fast-paced sport with serves and hits that can reach speeds of over 100 mph, and with long match durations. For example, in 13 years of Grand Slam data, there were 454 matches with an average of 3 sets that lasted 40 minutes. The fast pace and long duration of tennis matches make tracking the time boundaries of each tennis point in a match challenging. The visual aspect of a tennis match is highly diverse because of its variety in angles, occlusions, resolutions, contrast and colors, but the sound component is relatively stable and consistent. In this paper, we present a system that detects events such as ball hits and point boundaries in a tennis match from sound data recorded in the match. We first describe the sound processing pipeline that includes preprocessing, feature extraction, basic (atomic) event detection, and point boundary detection. Then, we describe the overall cloud-based system architecture. Afterwards, we describe the user interface that includes a tool for data labeling to efficiently generate the training dataset, and a workbench for sound and model management. The performance of our system is evaluated in experiments with real-world tennis sound data. Our proposed pipeline can detect atomic tennis events with an F1-score of 92.39% and point boundaries with average precision and recall values of around 80%. This system can be very useful for tennis coaches and players to find and extract game highlights with specific characteristics, so that they can analyze these highlights and establish their play strategy.
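
The atomic event detection stage can be illustrated with short-time energy thresholding over the audio signal: a ball hit produces a brief, high-energy burst against a quieter background. Frame length and threshold below are toy assumptions; the paper's pipeline uses feature extraction and learned models rather than a fixed threshold.

```python
def detect_hits(samples, frame_len=4, threshold=0.5):
    """Flag candidate ball-hit frames where the short-time energy
    (mean squared amplitude) exceeds a threshold; returns the indices
    of the non-overlapping frames flagged as hits."""
    hits = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            hits.append(i // frame_len)
    return hits

# Two loud bursts (hits) embedded in quiet crowd noise.
quiet = [0.1, -0.1, 0.05, -0.05]
hit = [0.9, -1.0, 0.8, -0.9]
print(detect_hits(quiet + hit + quiet + hit))  # [1, 3]
```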

Spectator Excitement Detection in Small-scale Sports Events

  • Kazuhiro Abe
  • Chikara Nakamura
  • Yosuke Otsubo
  • Tetsuya Koike
  • Naoto Yokoya

Detection of the excitement of spectators in sports is useful for various applications such as automatic highlight generation and automatic video editing. Therefore, spectator analysis has been widely studied, with two main approaches: holistic and object-based. Holistic approaches have been applied in most previous works; however, they do not work in small-scale games, where there are fewer spectators than in large-scale games. In this work, we propose a method for detecting the state of excitement of spectators in small-scale games using an object-based approach. To evaluate our method, we build our own datasets consisting of both spectator and player videos. Experimental results show that our method outperforms a holistic baseline method and allows excitement detection of individual spectators.

Flexible Automatic Football Filming and Summarization

  • Francesco Turchini
  • Lorenzo Seidenari
  • Leonardo Galteri
  • Andrea Ferracani
  • Giuseppe Becchi
  • Alberto Del Bimbo

We propose a method aimed at reducing human intervention in football video shooting and highlights editing, allowing automatic highlight detection together with panning and zooming on salient areas of the playing field. Our recognition subsystem exploits computer vision algorithms to perform automatic detection, pan and zoom and extraction of salient segments of a recorded match. Matches are elaborated offline, extracting and analyzing motion and visual features of the elements in salient zones of the scene, i.e. the midfield circle and penalty areas. Automatic summarization is performed by classifying subsequences of a match with machine learning algorithms, which are pretrained on previously acquired and annotated videos of other matches. Among salient actions, special attention is given to goal events, but other generic highlights are also identified. The only assumption for our method to work is the use of a pair of cameras framing the football pitch so that the field is split into two halves. We demonstrate the functioning of our approach using two ultra high definition cameras, building a system which is also able to collect various metadata of the matches to extrapolate other salient information.