MMSports '22: Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports

SESSION: Keynote Talk

Session details: Keynote Talk

  • Rainer Lienhart

DeepSportradar-v1: Computer Vision Dataset for Sports Understanding with High Quality Annotations

  • Gabriel Van Zandycke
  • Vladimir Somers
  • Maxime Istasse
  • Carlo Del Don
  • Davide Zambrano

With the recent development of Deep Learning applied to Computer Vision, sport video understanding has gained a lot of attention, providing much richer information for both sport consumers and leagues. This paper introduces DeepSportradar-v1, a suite of computer vision tasks, datasets and benchmarks for automated sport understanding. The main purpose of this framework is to close the gap between academic research and real-world settings. To this end, the datasets provide high-resolution raw images, camera parameters and high-quality annotations. DeepSportradar currently supports four challenging tasks related to basketball: ball 3D localization, camera calibration, player instance segmentation and player re-identification. For each of the four tasks, a detailed description of the dataset, objective, performance metrics, and proposed baseline method is provided. To encourage further research on advanced methods for sport understanding, a competition is organized as part of the MMSports workshop from the ACM Multimedia 2022 conference, where participants have to develop state-of-the-art methods to solve the above tasks. The four datasets, development kits and baselines are publicly available.

SESSION: Session 1: Novel MM Analysis Approaches in Sports

Towards Automated Key-Point Detection in Images with Partial Pool View

  • Tim Woinoski
  • Ivan V. Bajic

Sports analytics has been an up-and-coming field of research among professional sporting organizations and academic institutions alike. With the surge in the collection of athlete data, the primary goal of such analysis is to improve athletes' performance in a measurable and quantifiable manner. This work is aimed at alleviating some of the challenges encountered in the collection of adequate swimming data. Past works on this subject have shown that the detection and tracking of swimmers is feasible, but not without challenges. Among these challenges are pool localization and determining the positions of the swimmers relative to the pool. This work presents two contributions towards solving these challenges. First, we present a pool model with invariant key-points relevant for swimming analytics. Second, we study the detectability of such key-points in images with partial pool view, which are challenging but also quite common in swimming race videos.

Improving Exertion and Wellbeing Prediction in Outdoor Running Conditions using Audio-based Surface Recognition

  • Alexander Gebhard
  • Andreas Triantafyllopoulos
  • Shahin Amiriparian
  • Sandra Ottl
  • Valerie Dieter
  • Maurice Gerczuk
  • Mirko Jaumann
  • David Hildner
  • Patrick Schneeweiß
  • Inka Rösel
  • Inga Krauß
  • Björn W. Schuller

Timely detection of runner exertion is crucial for preventing overuse injuries and conditioning training. Similarly, maintaining high levels of wellbeing while running can improve retention rates for onboarders to the sport, with the associated benefits to public health that this entails. Thus, predicting exertion and wellbeing is a promising avenue of research for biomedical sports research. Previous work has shown that exertion and wellbeing can be predicted using biomechanical data collected from wearables attached to the runner's body. However, a particular challenge in outdoor running conditions is the mediating effect of running surface. We experimentally model this mediating effect by using surface-adapted models, which improve prediction rates for both variables. To that end, we investigate the feasibility of using audio-based surface classification to distinguish three main surface categories: gravel, asphalt, and dirt. Our best models achieve an unweighted average recall (UAR) of .619 and a UAR of .690 on our session-independent and session-dependent test sets, respectively, which is an improvement over the .363 UAR achieved by a GPS-based approximation.
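
For readers unfamiliar with the metric, the following minimal sketch shows how unweighted average recall (UAR) can be computed: it is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The function name and the toy labels are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (invented data): a classifier that always says "gravel".
y_true = ["gravel"] * 6 + ["asphalt", "dirt"]
y_pred = ["gravel"] * 8

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
# accuracy is 0.75 here, but UAR is only 1/3 (recalls are 1.0, 0.0 and 0.0).
```

With three classes, a constant or random guesser sits at a UAR of about one third, which is why UAR is a common choice when class sizes are unbalanced.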

Video- and Location-based Analysis of Cycling Routes for Safety Measures and Fan Engagement

  • Pirlouit Dumez
  • Guillaume Prevost
  • Maarten Slembrouck
  • Jelle De Bock
  • Julien Marbaix
  • Steven Verstockt

Video-based analysis of cycling races can provide a lot of information that can be used to keep cycling interesting for the fans and improve cyclists' safety. In this paper, we propose a solution to collect and process the metadata of cycling races. The idea is to use edge computing, by collecting data from a car in front of the race and processing this data using a tailor-made setup. Our solution consists of a camera to record video, and a GPS module to map the corresponding locations. Both data streams are offered to a single board computer. The video frames are used for crowd size classification to roughly estimate the number of spectators present along the race route. Moreover, we use the same footage to recognize cyclists' names on the road's surface to determine the location of fans of specific cyclists to create metadata around fan engagement. The tailor-made system performs the processing of the video frames and the results are sent to a web server using a cellular network connection. A web application was created to visualize the crowd size and the location of cyclists' names on the road's surface.

SESSION: Session 2: Analyses in Team Sports and Individual Sports

Action Recognition using Time-series Heat Maps of Joint Positions from Volleyball Match Videos

  • Akimasa Kondo
  • Hideo Saito
  • Shoji Yachida
  • Ryo Fujiwara

Data analysis in sports is becoming increasingly important, and one of the sports in which sports analysts play an active role is volleyball. Volleyball analysts have the task of annotating match videos, a time-consuming and technically challenging task that makes use of data difficult. In this paper, we propose a method for recognizing players' actions from volleyball game videos using time-series heat maps of joint positions to automate the analysis of volleyball match videos. In experiments to verify the effectiveness of the proposed method, we confirmed that the use of time-series heat maps of joint positions improves both the accuracy and F1 score compared to the baseline method using only RGB images as an input. We also confirmed the effectiveness of the proposed method in recognizing players' actions from volleyball match videos, which were not included in the dataset.

BadmintonDB: A Badminton Dataset for Player-specific Match Analysis and Prediction

  • Kar-Weng Ban
  • John See
  • Junaidi Abdullah
  • Yuen Peng Loh

This paper introduces BadmintonDB, a new badminton dataset for training models for player-specific match analysis and prediction tasks. The dataset features rally, stroke, and outcome annotations of 9 real-world badminton matches between two top players. We discuss our methodology and process for selecting and annotating the matches. We also propose player-independent and player-dependent Naive Bayes baselines for rally outcome prediction. The paper concludes with an analysis of the experiments to study the effect of the player-dependent model on prediction performance. We released our dataset at

Video-Based Detection of Combat Positions and Automatic Scoring in Jiu-jitsu

  • Valter Hudovernik
  • Danijel Skocaj

Due to the increasing capabilities of computer vision methods, it is now possible to apply them even to the most difficult scenarios, such as vision-based analysis of a jiu-jitsu match. One of the biggest challenges in such scenarios is heavy occlusion. Jiu-jitsu is a grappling martial art in which athletes are interlocked in complex positions most of the time. This produces significant challenges for computer vision methods. We propose a method to track the athletes' poses even in such scenarios. The advantage of our method is that it combines positional, structural, and visual cues to overcome this problem and is able to cope with severe occlusions. We use this data to automatically predict combat positions with high accuracy. Finally, we propose a novel approach for automatic scoring of a jiu-jitsu match from video using these predictions.

Pass Evaluation in Women's Olympic Ice Hockey

  • Robyn Ritchie
  • Alon Harell
  • Phillip Shreeves

Much of modern sports analytics is based on player and ball tracking data. Such data are mostly collected using wearable devices or an array of carefully located cameras and detectors. Many teams do not have such a luxury, especially in undervalued sports such as women's ice hockey; of those that do, the data are not typically publicly available. Recent developments in computer vision have allowed for the collection of tracking data directly from widely available broadcast video. Using event and tracking data collected directly from broadcast video during the elimination round games of the 2022 Winter Olympics, we create a framework for evaluating passing in women's ice hockey. We begin with physics-based motion models for both players and the puck, which we use to develop a model for probabilistic passing. Next, we model the rink control for each team and the scoring probability of the offensive team. These models are then combined into novel metrics for quantifying the various aspects of any single pass. By looking at the entire corpus of plays, we create several summary metrics describing players' risk-reward tendencies and overall passing ability. All of our metrics can be presented graphically, allowing for easy adaptation by coaches, players, scouts, and other front-office personnel.

SESSION: Session 3: Analyses in Soccer

SoccerNet 2022 Challenges Results

  • Silvio Giancola
  • Anthony Cioppa
  • Adrien Deliège
  • Floriane Magera
  • Vladimir Somers
  • Le Kang
  • Xin Zhou
  • Olivier Barnich
  • Christophe De Vleeschouwer
  • Alexandre Alahi
  • Bernard Ghanem
  • Marc Van Droogenbroeck
  • Abdulrahman Darwish
  • Adrien Maglo
  • Albert Clapés
  • Andreas Luyts
  • Andrei Boiarov
  • Artur Xarles
  • Astrid Orcesi
  • Avijit Shah
  • Baoyu Fan
  • Bharath Comandur
  • Chen Chen
  • Chen Zhang
  • Chen Zhao
  • Chengzhi Lin
  • Cheuk-Yiu Chan
  • Chun Chuen Hui
  • Dengjie Li
  • Fan Yang
  • Fan Liang
  • Fang Da
  • Feng Yan
  • Fufu Yu
  • Guanshuo Wang
  • H. Anthony Chan
  • He Zhu
  • Hongwei Kan
  • Jiaming Chu
  • Jianming Hu
  • Jianyang Gu
  • Jin Chen
  • João V. B. Soares
  • Jonas Theiner
  • Jorge De Corte
  • José Henrique Brito
  • Jun Zhang
  • Junjie Li
  • Junwei Liang
  • Leqi Shen
  • Lin Ma
  • Lingchi Chen
  • Miguel Santos Marques
  • Mike Azatov
  • Nikita Kasatkin
  • Ning Wang
  • Qiong Jia
  • Quoc Cuong Pham
  • Ralph Ewerth
  • Ran Song
  • Rengang Li
  • Rikke Gade
  • Ruben Debien
  • Runze Zhang
  • Sangrok Lee
  • Sergio Escalera
  • Shan Jiang
  • Shigeyuki Odashima
  • Shimin Chen
  • Shoichi Masui
  • Shouhong Ding
  • Sin-wai Chan
  • Siyu Chen
  • Tallal El-Shabrawy
  • Tao He
  • Thomas B. Moeslund
  • Wan-Chi Siu
  • Wei Zhang
  • Wei Li
  • Xiangwei Wang
  • Xiao Tan
  • Xiaochuan Li
  • Xiaolin Wei
  • Xiaoqing Ye
  • Xing Liu
  • Xinying Wang
  • Yandong Guo
  • Yaqian Zhao
  • Yi Yu
  • Yingying Li
  • Yue He
  • Yujie Zhong
  • Zhenhua Guo
  • Zhiheng Li

The SoccerNet 2022 challenges were the second annual video understanding challenges organized by the SoccerNet team. In 2022, the challenges were composed of 6 vision-based tasks: (1) action spotting, focusing on retrieving action timestamps in long untrimmed videos, (2) replay grounding, focusing on retrieving the live moment of an action shown in a replay, (3) pitch localization, focusing on detecting line and goal part elements, (4) camera calibration, dedicated to retrieving the intrinsic and extrinsic camera parameters, (5) player re-identification, focusing on retrieving the same players across multiple views, and (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams. Compared to last year's challenges, tasks (1-2) had their evaluation metrics redefined to consider tighter temporal accuracies, and tasks (3-6) were novel, including their underlying data and annotations. More information on the tasks, challenges and leaderboards are available on Baselines and development kits are available on

STE: Spatio-Temporal Encoder for Action Spotting in Soccer Videos

  • Abdulrahman Darwish
  • Tallal El-Shabrawy

The task of spotting events in videos has recently gained considerable attention, as it helps in better video understanding and is a key step in automating highlight generation. In this work, we tackle the problem of action spotting in soccer videos by introducing a simple, lightweight Spatio-Temporal Encoder (STE) based on 1D convolutions and fully connected layers. The model consists of 3 blocks: a spatial encoder that captures the spatial semantics of each frame, a temporal encoder that fetches features across frames, and a prediction block that maps the learnt spatial and temporal features to the predicted action class in the window. On the SoccerNet-v2 test dataset, the STE model scored 74.09% on the loose average-mAP evaluation metric. The SoccerNet 2022 challenge for action spotting restricted the upper bound for temporal tolerance on the predicted actions: the former loose metric allowed a tolerance of up to 60 seconds, which was reduced to 5 seconds in the new tight metric. The STE model obtained 40.1% on the tight a-mAP for the test data. To improve the temporal precision, the prediction block was modified to predict two outputs: the class and the frame index of the event. The resulting model, STE-v2, improved the tight a-mAP to 58.71% on the challenge split and 58.48% on the test split. The code is publicly available for reproducibility at

A Graph-Based Method for Soccer Action Spotting Using Unsupervised Player Classification

  • Alejandro Cartas
  • Coloma Ballester
  • Gloria Haro

Action spotting in soccer videos is the task of identifying the specific time when a certain key action of the game occurs. Lately, it has received a large amount of attention and powerful methods have been introduced. Action spotting involves understanding the dynamics of the game, the complexity of events, and the variation of video sequences. Most approaches have focused on the latter, given that their models exploit the global visual features of the sequences. In this work, we focus on the former by (a) identifying and representing the players, referees, and goalkeepers as nodes in a graph, and by (b) modeling their temporal interactions as sequences of graphs. For the player identification, or player classification task, we obtain an accuracy of 97.72% in our annotated benchmark. For the action spotting task, our method obtains an overall performance of 57.83% average-mAP by combining it with other audiovisual modalities. This performance surpasses similar graph-based methods and is competitive with computationally heavy methods. Code and data are available at

A Transformer-based System for Action Spotting in Soccer Videos

  • He Zhu
  • Junwei Liang
  • Chengzhi Lin
  • Jun Zhang
  • Jianming Hu

Action spotting in broadcast soccer games is important for understanding salient actions and for video summarization applications. In this paper, we propose an efficient transformer-based system for action spotting in soccer videos. We first use a multi-scale vision transformer to extract features from the videos. Then we adopt a sliding window strategy to further utilize temporal features and enhance temporal understanding. Finally, the features are fed to a NetVLAD++ model to obtain the final results. Our model can learn a hierarchy of robust representations and performs well in the Action Spotting task of the SoccerNet Challenge 2022. Our method outperforms the baseline and previously published work.

SESSION: Session 4: Competitions

KaliCalib: A Framework for Basketball Court Registration

  • Adrien Maglo
  • Astrid Orcesi
  • Quoc-Cuong Pham

Tracking the players and the ball in team sports is key to analyse the performance or to enhance the game watching experience with augmented reality. When the only sources for this data are broadcast videos, sports-field registration systems are required to estimate the homography and re-project the ball or the players from the image space to the field space. This paper describes a new basketball court registration framework in the context of the MMSports 2022 camera calibration challenge. The method is based on the estimation by an encoder-decoder network of the positions of keypoints sampled with perspective-aware constraints. The regression of the basket positions and heavy data augmentation techniques make the model robust to different arenas. Ablation studies show the positive effects of our contributions on the challenge test set. Our method divides the mean squared error by 4.7 compared to the challenge baseline.
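
As a minimal sketch of the re-projection step described above, a 3x3 homography maps a pixel that lies on the court plane to field coordinates through homogeneous coordinates. The helper name `apply_homography` and the matrix values are hypothetical and invented for illustration; they are not taken from the paper or the challenge data.

```python
def apply_homography(H, point):
    """Map an image point (x, y) to field coordinates using a 3x3 homography H."""
    x, y = point
    # Multiply H by the homogeneous point (x, y, 1).
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the third coordinate to return to 2D.
    return (u / w, v / w)


# Invented homography from pixels to court coordinates (illustration only).
H = [
    [0.02, 0.0, -5.0],
    [0.0, 0.05, -10.0],
    [0.0, 0.001, 1.0],
]
court_xy = apply_homography(H, (640.0, 360.0))
```

Note that this mapping is only valid for points on the court plane, such as a player's feet; a ball in the air needs additional 3D information.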

Dual Data Augmentation Method for Data-Deficient and Occluded Instance Segmentation

  • Bo Yan
  • Yadong Li
  • Xingran Zhao
  • Hongbin Wang

Instance segmentation is widely applied in image editing, image analysis, autonomous driving, and other areas. However, insufficient data and occlusion are common problems in practical applications. The DeepSportRadar Instance Segmentation challenge focuses on these problems. Its goal is to segment individual humans, including players, coaches and referees, on a basketball court. The main characteristics of this challenge are a high level of occlusion between players and a quite limited amount of data. To address these problems, we designed a Dual Data Augmentation (DDA) method comprising an offline data augmentation (ODA) strategy to tackle the data deficiency and an online specific copy-paste (OS-CP) strategy to address the occlusion issue. We demonstrate the applicability of the proposed method on the DeepSportRadar Instance Segmentation challenge. The segmentation model is a Hybrid Task Cascade based detector on a Swin-Large-based CBNetV2 backbone. Experimental results demonstrate that the proposed method achieves a competitive result on the DeepSportRadar challenge, with 0.757 AP@0.50:0.95 on the challenge set.

A Person Re-identification Approach Focusing on the Occlusion Problem and Ranking Optimization

  • Wenkai Zheng
  • Mei Yuan

Person re-identification (Re-ID) aims to re-identify people across multiple video frames captured at various time instants. With the advancement of deep neural networks and the increasing demand for intelligent video surveillance, recent works have proposed many deep learning based approaches to address the task. However, many state-of-the-art Re-ID methods are not robust enough and do not perform well on the Synergy Re-Identification dataset provided by the DeepSportRadar Player Re-Identification Challenge due to the severe occlusion problem. To better re-identify basketball players, we propose a person Re-ID approach that focuses on the occlusion problem and ranking optimization. Specifically, our proposed approach consists of two stages. In the first stage, we extract the global and local features of the input image with two separate branches. In the second stage, we propose a ranking optimization method that consists of three steps: k-reciprocal re-ranking, metric fusion, and distance mapping. We conduct extensive experiments to show that our proposed approach can achieve superior performance compared with the state-of-the-art methods, and we also experimentally evaluate the training tricks and ranking optimization methods of existing Re-ID methods. Our proposed method achieves 98.38% mAP and 99.57% rank-1 on the challenge set of the Synergy Re-Identification dataset, and with this method our team, Fiery Tyrannosaurus Warrior, won second place in the DeepSportRadar Player Re-Identification Challenge. Furthermore, we provide some ideas for potentially achieving better performance on the Synergy Re-Identification dataset.
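
The two retrieval metrics quoted above can be illustrated with a minimal sketch. It assumes a precomputed query-to-gallery distance matrix and identity labels; the function name, the toy distances, and the convention of scoring a query with no gallery match as zero are assumptions made for illustration and do not reproduce the challenge's official evaluation code.

```python
def rank1_and_map(distances, query_ids, gallery_ids):
    """Return (rank-1 accuracy, mean average precision) for a retrieval task.

    distances[i][j] is the distance between query i and gallery item j.
    """
    rank1_hits = 0
    ap_sum = 0.0
    for dist_row, query_id in zip(distances, query_ids):
        # Sort gallery indices from closest to farthest.
        order = sorted(range(len(gallery_ids)), key=lambda j: dist_row[j])
        matches = [gallery_ids[j] == query_id for j in order]
        rank1_hits += int(matches[0])
        # Average precision: mean of the precision values at each correct match.
        hits = 0
        precision_sum = 0.0
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precision_sum += hits / rank
        ap_sum += precision_sum / hits if hits else 0.0  # assumption: no match scores 0
    n = len(query_ids)
    return rank1_hits / n, ap_sum / n


# Toy example with invented identities and distances.
gallery_ids = ["A", "B", "A", "C"]
query_ids = ["A", "B"]
distances = [
    [0.1, 0.5, 0.9, 0.3],  # query "A": matches ranked 1st and 4th, AP = 0.75
    [0.4, 0.6, 0.2, 0.8],  # query "B": single match ranked 3rd, AP = 1/3
]
rank1, mean_ap = rank1_and_map(distances, query_ids, gallery_ids)
```

Re-ranking steps such as k-reciprocal re-ranking change the distance matrix before this evaluation, which is why they can raise both metrics without retraining the feature extractor.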

CLIP-ReIdent: Contrastive Training for Player Re-Identification

  • Konrad Habel
  • Fabian Deuser
  • Norbert Oswald

Sports analytics benefits from recent advances in machine learning, providing a competitive advantage for teams or individuals. One important task in this context is the performance measurement of individual players to provide reports and log files for subsequent analysis. During sport events like basketball, this involves the re-identification of players during a match either from multiple camera viewpoints or from a single camera viewpoint at different times. In this work, we investigate whether it is possible to transfer the outstanding zero-shot performance of pre-trained CLIP models to the domain of player re-identification. For this purpose, we reformulate the contrastive language-to-image pre-training approach from CLIP as a contrastive image-to-image training approach using the InfoNCE loss as the training objective. Unlike previous work, our approach is entirely class-agnostic and benefits from large-scale pre-training. With a fine-tuned CLIP ViT-L/14 model we achieve 98.44% mAP on the MMSports 2022 Player Re-Identification challenge. Furthermore, we show that the CLIP Vision Transformers already have strong OCR capabilities to identify useful player features like shirt numbers in a zero-shot manner without any fine-tuning on the dataset. By applying the Score-CAM algorithm, we visualise the most important image regions that our fine-tuned model identifies when calculating the similarity score between two images of a player.
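
To make the image-to-image contrastive objective more concrete, here is a minimal sketch of a one-directional InfoNCE loss over a batch of embedding pairs, where `queries[i]` and `keys[i]` are assumed to come from two images of the same player and all other keys act as negatives. The function name, the temperature of 0.07, and the toy vectors are assumptions for illustration; the authors' actual training setup (for example, whether the loss is symmetrised) is not specified in the abstract.

```python
import math


def info_nce(queries, keys, temperature=0.07):
    """Average InfoNCE loss; queries[i] and keys[i] form the positive pair."""

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    loss = 0.0
    for i, q_i in enumerate(q):
        # Cosine similarity of query i to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        # Numerically stable log-sum-exp over all keys.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching key as the correct "class".
        loss += log_sum_exp - logits[i]
    return loss / len(q)


# Toy 2D embeddings (invented): matched pairs should give a lower loss than swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys_matched = [[0.9, 0.1], [0.1, 0.9]]
keys_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = info_nce(queries, keys_matched)
loss_swapped = info_nce(queries, keys_swapped)
```

Because the loss only needs pairs of images known to show the same player, it does not require a fixed set of identity classes, which is consistent with the class-agnostic framing in the abstract.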

Attention-Aware Multiple Granularities Network for Player Re-Identification

  • Qi An
  • Kuilong Cui
  • Rongshuai Liu
  • Chuanming Wang
  • Mengshi Qi
  • Huadong Ma

With the development of deep learning technologies, the performance of person re-identification (ReID) has been greatly improved. However, player ReID, a subdomain of person ReID that is important for the sports field, has received little attention so far. Player ReID aims to retrieve a specified player from a gallery of players' images captured by different cameras at various time steps. Compared with traditional person ReID, player ReID suffers from various difficult problems, e.g., the high similarity of players' appearance, limited-scale datasets, variable and low image resolution, and severe occlusion. To solve such a challenging task, we propose a method named Attention-Aware Multiple Granularities Network (A²MGN), which consists of multiple branches to capture discriminative features of players at different granularities. Trained with triplet loss and cross-entropy loss, the model can localize different parts of the player and make a comprehensive comparison between each pair of images. Extensive experiments demonstrate the effectiveness and superiority of our method, and our team (MM2022-bupt) ranks in the top 3 of the ACM MMSports 2022 challenge with an mAP of 0.96, Rank-1 of 0.99, and Rank-5 of 1.00.
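
As a brief illustration of one of the two training criteria named above, the following sketch computes a standard triplet loss on Euclidean distances: the anchor should be closer to a positive (same player) than to a negative (different player) by at least a margin. The margin value of 0.3 and the toy embeddings are assumptions for illustration, not values reported by the authors.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss that is zero once the negative is farther than the positive by `margin`."""
    d_ap = math.dist(anchor, positive)  # distance to an image of the same player
    d_an = math.dist(anchor, negative)  # distance to an image of a different player
    return max(0.0, d_ap - d_an + margin)


# Toy 2D embeddings (invented for illustration).
anchor = [0.0, 0.0]
same_player = [0.0, 1.0]
other_player = [3.0, 4.0]

easy = triplet_loss(anchor, same_player, other_player)  # negative already far away
hard = triplet_loss(anchor, other_player, same_player)  # roles swapped: large penalty
```

In practice such a loss is usually combined with a cross-entropy identity classification loss, as the abstract describes, so that embeddings are both separable by identity and well ordered by distance.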