MMSports'21: Proceedings of the 4th International Workshop on Multimedia Content Analysis in Sports

SESSION: Session 1: Analyses in Team Sports

Session details: Session 1: Analyses in Team Sports

  • Rainer Lienhart

A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games

  • Henrik Biermann
  • Jonas Theiner
  • Manuel Bassek
  • Dominik Raabe
  • Daniel Memmert
  • Ralph Ewerth

The automatic detection of events in complex sports games like soccer and handball
using positional or video data is of large interest in research and industry. One
requirement is a fundamental understanding of underlying concepts, i.e., events that
occur on the pitch. Previous work often deals only with so-called low-level events
based on well-defined rules such as free kicks, free throws, or goals. High-level
events, such as passes, are less frequently approached due to a lack of consistent
definitions. This introduces a level of ambiguity that necessitates careful validation
of event annotations. Yet, this validation step is usually neglected, as
the majority of studies adopt annotations from commercial providers on private datasets
of unknown quality and focus on soccer only. To address these issues, we (1) present
a universal taxonomy that covers a wide range of low- and high-level events for
invasion games and is exemplarily refined to soccer and handball, and (2) release
two multi-modal datasets comprising video and positional data with gold-standard annotations
to foster research in fine-grained and ball-centered event spotting. Experiments on
human performance demonstrate the robustness of the proposed taxonomy, and that disagreements
and ambiguities in the annotation increase with the complexity of the event. Datasets
are available at

Multi-task Learning for Jersey Number Recognition in Ice Hockey

  • Kanav Vats
  • Mehrnaz Fani
  • David A. Clausi
  • John Zelek

Identifying players in sports videos by recognizing their jersey numbers is a challenging
task in computer vision. We have designed and implemented a multi-task learning network
for jersey number recognition. In order to train a network to recognize jersey numbers,
two output label representations are used: (1) holistic, which considers the entire jersey
number as one class, and (2) digit-wise, which considers the two digits in a jersey number
as two separate classes. The proposed network learns both holistic and digit-wise
representations through a multi-task loss function. We determine the optimal weights
to be assigned to holistic and digit-wise losses through an ablation study. Experimental
results demonstrate that the proposed multi-task learning network performs better
than the constituent holistic and digit-wise single-task learning networks.
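
A minimal sketch of how a multi-task loss of this kind could be combined, assuming a plain weighted sum of cross-entropy terms. The function names, the default weights, and the equal treatment of the two digit heads are illustrative assumptions and are not taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample, computed from raw logits in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def multitask_loss(holistic_logits, digit1_logits, digit2_logits,
                   holistic_target, digit1_target, digit2_target,
                   w_holistic=1.0, w_digit=1.0):
    """Weighted sum of a holistic loss and a digit-wise loss (assumed form).

    holistic_*: the whole jersey number treated as a single class.
    digit1_*/digit2_*: each digit treated as its own classification problem.
    w_holistic, w_digit: hypothetical weights, the kind an ablation would tune.
    """
    loss_holistic = cross_entropy(holistic_logits, holistic_target)
    loss_digits = (cross_entropy(digit1_logits, digit1_target)
                   + cross_entropy(digit2_logits, digit2_target))
    return w_holistic * loss_holistic + w_digit * loss_digits


if __name__ == "__main__":
    # Toy logits: 4 holistic classes and 2 classes per digit head.
    print(multitask_loss([0.0] * 4, [0.0] * 2, [0.0] * 2, 0, 1, 0))
```

Setting either weight to zero reduces the sketch to the corresponding single-task loss, which is the comparison the abstract reports.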

Automated Offside Detection by Spatio-Temporal Analysis of Football Videos

  • Ikuma Uchida
  • Atom Scott
  • Hidehiko Shishido
  • Yoshinari Kameda

In this paper, we propose a new automated method to detect offsides from football
match videos. The advantage of our method is that it can strictly follow the official
offside rules in which the dynamics of play actions are spatio-temporally investigated.
Furthermore, to overcome the difficult task of tracking the two-dimensional locations
of the players and the ball, we utilized geometric characteristics on the perspective
projection coupled with a Kalman filter to estimate information necessary for offside
detection. Based on these methods, our prototype system can recognize whether an attacking
player who crossed the offside line receives a pass from their teammate or not. To
the best of our knowledge, our proposed method is the first method that can automatically
determine offsides from video. Furthermore, this method is designed to enable online
processing in the future.
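
As a rough illustration of the positional test that sits at the core of any offside check, the sketch below compares one-dimensional pitch coordinates at the moment of a pass. It is a deliberately simplified reading of the offside-position rule and is not the authors' pipeline: the function name, the coordinate convention (the attacking team plays toward larger x), and the assumption that positions are already estimated are all hypothetical.

```python
def in_offside_position(attacker_x, ball_x, defender_xs, halfway_x=0.0):
    """Return True if the attacker is in an offside position (simplified 1D check).

    attacker_x:  attacker's position along the pitch length.
    ball_x:      ball position at the moment the pass is played.
    defender_xs: positions of all defending players, goalkeeper included.
    halfway_x:   coordinate of the halfway line (assumed 0.0 here).
    The attacking team is assumed to play toward larger x values.
    """
    if len(defender_xs) < 2:
        raise ValueError("need at least two defending players")
    # The offside line is set by the second-last defender.
    second_last_defender_x = sorted(defender_xs)[-2]
    return (attacker_x > halfway_x
            and attacker_x > ball_x
            and attacker_x > second_last_defender_x)


if __name__ == "__main__":
    # Goalkeeper at 50, last outfield defender at 35, ball at 30.
    print(in_offside_position(40.0, 30.0, [50.0, 35.0, 20.0]))
    print(in_offside_position(35.0, 30.0, [50.0, 35.0, 20.0]))
```

A player level with the second-last defender is not treated as offside here, which is why the comparisons are strict.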

SESSION: Session 2: Novel MM Analysis Approaches in Sports

Session details: Session 2: Novel MM Analysis Approaches in Sports

  • Moritz Einfalt

STAR: Noisy Semi-Supervised Transfer Learning for Visual Classification

  • Hasib Zunair
  • Yan Gobeil
  • Samuel Mercier
  • Abdessamad Ben Hamza

Semi-supervised learning (SSL) has proven to be effective at leveraging large-scale
unlabeled data to mitigate the dependency on labeled data in order to learn better
models for visual recognition and classification tasks. However, recent SSL methods
rely on unlabeled image data at a scale of billions to work well. This becomes infeasible
for tasks with relatively little unlabeled data in terms of runtime, memory and data
acquisition. To address this issue, we propose noisy semi-supervised transfer learning,
an efficient SSL approach that integrates transfer learning and self-training with
noisy student into a single framework, which is tailored for tasks that can leverage
unlabeled image data on a scale of thousands. We evaluate our method on both binary
and multi-class classification tasks, where the objective is to identify whether an
image displays people practicing sports or the type of sport, as well as to identify
the pose from a pool of popular yoga poses. Extensive experiments and ablation studies
demonstrate that by leveraging unlabeled data, our proposed framework significantly
improves visual classification, especially in multi-class classification settings
compared to state-of-the-art methods. Moreover, incorporating transfer learning not
only improves classification performance, but also requires 6x less compute time and
5x less memory. We also show that our method boosts robustness of visual classification
models, even without specifically optimizing for adversarial robustness.

Three-Stream 3D/1D CNN for Fine-Grained Action Classification and Segmentation in
Table Tennis

  • Pierre-Etienne Martin
  • Jenny Benois-Pineau
  • Renaud Péteri
  • Julien Morlier

This paper proposes a fusion method of modalities extracted from video through a three-stream
network with spatio-temporal and temporal convolutions for fine-grained action classification
in sport. It is applied to the TTStroke-21 dataset, which consists of untrimmed videos
of table tennis games. The goal is to detect and classify table tennis strokes in
the videos, the first step of a bigger scheme aiming at giving feedback to the players
for improving their performance. The three modalities are raw RGB data, the computed
optical flow and the estimated pose of the player. The network consists of three branches
with attention blocks. Features are fused at the last stage of the network using
bilinear layers. Compared to previous approaches, the use of three modalities allows
faster convergence and better performance on both tasks: classification of strokes
with known temporal boundaries and joint segmentation and classification. The pose
is also further investigated in order to offer richer feedback to the athletes.

SPEED21: Speed Climbing Motion Dataset

  • Petr Elias
  • Veronika Skvarlova
  • Pavel Zezula

With the recent advances in computer vision and deep learning, the research interest
in video-based and skeleton-based sports analysis is growing. Also, speed climbing
as a sport is on the rise, being included as an Olympic sport in Tokyo 2020. This
work aims to connect both of these worlds. First, a dataset of 362 speed climbing
performances is provided for the community of domain experts and practitioners in
human motion understanding and sports analysis. The dataset annotates pre-segmented
performances of 55 world elite athletes in the form of 2D skeleton sequences extracted
from world competition event videos. Secondly, the high descriptiveness and usability
of 2D skeleton data are demonstrated in a search scenario that matches climbers by
the similarities in their climbing style with high accuracy. A k-NN search
precision above 90% is achieved by a synergistic combination of a suitable representation
with a semi-dependent variant of Dynamic Time Warping (DTW). The proposed DTW variant
computes distances separately across individual semantic body parts (e.g., hands and
feet) whose atoms (joints or angles) are wired together for the temporal alignment.
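
The idea of aligning each body part separately while keeping its joints tied together can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a Euclidean frame distance, sums the per-part DTW costs, and uses a hypothetical `body_parts` mapping from part names to feature indices.

```python
import math


def dtw(seq_a, seq_b):
    """Classic DTW cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def semi_dependent_dtw(motion_a, motion_b, body_parts):
    """Sum of per-body-part DTW costs (assumed aggregation).

    motion_a, motion_b: lists of frames, each frame a flat list of features.
    body_parts: hypothetical mapping of part name -> feature indices. Features
    within a part share one warping path; parts are aligned independently.
    """
    total = 0.0
    for indices in body_parts.values():
        part_a = [[frame[k] for k in indices] for frame in motion_a]
        part_b = [[frame[k] for k in indices] for frame in motion_b]
        total += dtw(part_a, part_b)
    return total


if __name__ == "__main__":
    a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    b = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]  # same motion, one frame held
    print(semi_dependent_dtw(a, b, {"hands": [0], "feet": [1]}))
```

Putting all features into a single part gives fully dependent DTW, and giving every feature its own part gives fully independent DTW, so the grouping controls where the variant sits between the two.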