MMSports '23: Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports

SESSION: Keynote

AI for Youth Sports: Democratizing Professional Sport Analytics Tools

  • Mehrsan Javan

Sports analytics is about observing, understanding and describing the game in an intelligent manner. In practice, this requires a fully automated, robust end-to-end pipeline: from visual input, to player and group activities, to player and team evaluation, to planning. Despite major advancements in computer vision and machine learning, sports analytics is still in its infancy and relies heavily on simpler descriptive statistics. In addition, current sports analytics solutions are limited to top leagues and are not widely available for downmarket leagues and youth sports. In this talk, we explain how we have developed scalable and robust computer vision solutions to democratize sport analytics and offer pro-league-level insights to leagues with modest resources, including youth leagues. We highlight key challenges, such as the requirement for low-cost, low-latency processing and the need for robustness despite variations in venues, and how we solved some of those problems.

SESSION: Session 1: Novel MM Analysis Approaches in Sports

SkiTech: An Alpine Skiing and Snowboarding Dataset of 3D Body Pose, Sole Pressure, and Electromyography

  • Erwin Wu
  • Takashi Matsumoto
  • Chen-Chieh Liao
  • Ruofan Liu
  • Hidetaka Katsuyama
  • Yuki Inaba
  • Noriko Hakamada
  • Yusuke Yamamoto
  • Yusuke Ishige
  • Hideki Koike

Effective analysis of skills requires high-quality, multi-modal datasets, especially in the field of artificial intelligence. However, creating such datasets for extreme sports, such as alpine skiing, can be challenging due to environmental constraints. Optical and wearable sensors may not perform optimally under diverse lighting, weather, and terrain conditions. To address these challenges, we present a comprehensive skiing/snowboarding dataset collected using a professional motor-based simulator. The realistic simulator makes it easy to obtain different types of data with only a small domain gap relative to real-world data. Common skill-analysis data, including camera images, 3D body pose, sole pressure, and leg electromyography, are collected from athletes of different levels. Another key aspect is the comparison of cross-modal baselines, highlighting the versatility of the data across modalities. In addition, a real-world pilot test is conducted to assess the practical applicability and data robustness.

Personalised Speech-Based Heart Rate Categorisation Using Weighted-Instance Learning

  • Alexander Kathan
  • Shahin Amiriparian
  • Alexander Gebhard
  • Andreas Triantafyllopoulos
  • Maurice Gerczuk
  • Björn W. Schuller

Running, one of the most popular sports, comes with many positive effects, but also with risks. Most injuries are caused by overexertion. To optimise training and prevent injuries, approaches are needed to easily monitor training behaviour. Previous research has shown that heart rate (HR) can be automatically classified using speech data. Real-world applications pose challenges due to the heterogeneity of individuals, which is why we introduce a personalised HR classification in this work. In particular, we first determine the runners in the training set whose acoustic patterns (x-vectors) are most similar to those of a runner in the test set. Further, we extract deep representations and hand-crafted features from the input data. Subsequently, using the computed similarity values, we adapt a Support Vector Machine (SVM) for each individual. In this context, we choose the runners with the lowest Euclidean distances and weight their training samples more heavily during the training process of the SVM. Our personalised approach yields a best relative improvement of 20.8% compared to a non-personalised model in a 5-class HR classification task. The obtained results demonstrate the effectiveness of our approach, paving the way for real-world, personalised applications.
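The similarity-weighting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are toy stand-ins for x-vectors, and the weight values are arbitrary.

```python
import numpy as np

# Toy stand-ins for x-vector embeddings of four training runners and one
# test runner (all values illustrative, not from the paper).
train_X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
test_x = np.array([0.5, 0.0])

# Step 1: Euclidean distance from the test runner to each training runner.
dists = np.linalg.norm(train_X - test_x, axis=1)

# Step 2: weight the k most similar runners' samples more heavily when
# fitting the per-individual SVM (e.g. via a sample-weight argument).
k = 2
weights = np.ones(len(train_X))
weights[np.argsort(dists)[:k]] = 3.0  # illustrative up-weighting factor
```

In practice these weights would be passed to the SVM training routine so that acoustically similar runners dominate the fit.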

Generating Factually Consistent Sport Highlights Narrations

  • Noah Sarfati
  • Ido Yerushalmy
  • Michael Chertok
  • Yosi Keller

Sports highlights are an important form of media for fans worldwide, as they provide short videos that capture key moments from games, often accompanied by the original commentaries of the game's announcers. However, traditional forms of presenting sports highlights have limitations in conveying the complexity and nuance of the game. In recent years, the use of Large Language Models (LLMs) for natural language generation has emerged as a promising approach for generating narratives that can provide a more compelling and accessible viewing experience. In this paper, we propose an end-to-end solution to enhance the experience of watching sports highlights by automatically generating factually consistent narrations using LLMs and crowd noise extraction. Our solution involves several steps, including extracting the source of information from the live broadcast using a transcription model, prompt engineering, and comparing out-of-the-box models for consistency evaluation. We also propose a new dataset annotated on generated narratives from 143 Premier League plays and fine-tune a Natural Language Inference (NLI) model on it, achieving 92% precision. Furthermore, we extract crowd noise from the original video to create a more immersive and realistic viewing experience for sports fans by adapting state-of-the-art speech enhancement models on a brand new dataset created from 155 Ligue 1 games.

DeepSportradar-v2: A Multi-Sport Computer Vision Dataset for Sport Understandings

  • Maxime Istasse
  • Vladimir Somers
  • Pratheeban Elancheliyan
  • Jaydeep De
  • Davide Zambrano

Advanced data collection technologies, computational tools, and sophisticated algorithms have had a revolutionary impact on many aspects of sports analytics, from athlete performance to fan engagement. Computer Vision (CV) and Deep Learning (DL) technologies play a crucial role in detecting players and predicting game states from videos, but their effectiveness depends on the quantity and quality of training data, especially in sports with unique dynamics and camera angles. Each sport comes with its own set of challenges.

This paper introduces DeepSportradar-v2, a multi-sport suite of CV tasks that addresses the need for high-quality datasets across different sports. Supporting multiple sports allows academic researchers to better understand the dynamics of each sport and its specific challenges. In this paper, we first report the results of the 2022 competition and provide all resources needed to replicate each result. Then, we present a newly released Cricket dataset and task, given the global popularity of this sport and its relevance for automated analysis and video understanding.

Similarly to the first edition, a competition has been organized as part of the MMSports workshop, where participants are invited to develop state-of-the-art methods for solving the proposed tasks using the publicly available datasets, development kits, and baselines.

SESSION: Session 2: Analyses in Individual and Team Sports

Video-based Skill Assessment for Golf: Estimating Golf Handicap

  • Christian Keilstrup Ingwersen
  • Artur Xarles
  • Albert Clapés
  • Meysam Madadi
  • Janus Nørtoft Jensen
  • Morten Rieger Hannemose
  • Anders Bjorholm Dahl
  • Sergio Escalera

Automated skill assessment in sports using video-based analysis holds great potential for revolutionizing coaching methodologies. This paper focuses on the problem of skill determination in golfers by leveraging deep learning models applied to a large database of video recordings of golf swings. We investigate different regression, ranking and classification based methods and compare to a simple baseline approach. The performance is evaluated using mean squared error (MSE) as well as computing the percentages of correctly ranked pairs based on the Kendall correlation. Our results demonstrate an improvement over the baseline, with a 35% lower mean squared error and 68% correctly ranked pairs. However, achieving fine-grained skill assessment remains challenging. This work contributes to the development of AI-driven coaching systems and advances the understanding of video-based skill determination in the context of golf.

Automatic Edge Error Judgment in Figure Skating Using 3D Pose Estimation from a Monocular Camera and IMUs

  • Ryota Tanaka
  • Tomohiro Suzuki
  • Kazuya Takeda
  • Keisuke Fujii

Automated evaluation systems are a fundamental topic in sports technology. In many sports, such as figure skating, automated evaluation methods based on pose estimation have been proposed. However, previous studies have evaluated skaters' skills using 2D analysis. In this paper, we propose an automatic edge error judgment system with a monocular smartphone camera and inertial sensors, which enables us to analyze 3D motions. Edge error is one of the most significant scoring items and is challenging to judge automatically due to its 3D motion. The results show that the model using 3D joint position coordinates estimated from the monocular camera as the input feature had the highest accuracy, at 83%, on unknown skaters' data. We also conducted a detailed motion analysis for edge error judgment. These results indicate that the monocular camera can be used to judge edge errors automatically. We will provide the figure skating single Lutz jump dataset, including pre-processed videos and labels, at

Context-Aware 3D Object Localization from Single Calibrated Images: A Study of Basketballs

  • Marcello Davide Caio
  • Gabriel Van Zandycke
  • Christophe De Vleeschouwer

Accurately localizing objects in three dimensions (3D) is crucial for various computer vision applications, such as robotics, autonomous driving, and augmented reality. This task finds another important application in sports analytics and, in this work, we present a novel method for 3D basketball localization from a single calibrated image. Our approach predicts the object's height in pixels in image space by estimating its projection onto the ground plane within the image, leveraging the image itself and the object's location as inputs. The 3D coordinates of the ball are then reconstructed by exploiting the known projection matrix. Extensive experiments on the public DeepSport dataset, which provides ground truth annotations for 3D ball location alongside camera calibration information for each image, demonstrate the effectiveness of our method, offering substantial accuracy improvements compared to recent work. Our work opens up new possibilities for enhanced ball tracking and understanding, advancing computer vision in diverse domains. The source code of this work is made publicly available at
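The reconstruction step described above, recovering world coordinates from a calibrated image once the ball's ground projection is known, can be sketched as follows. The camera parameters here are invented for illustration; only the geometric principle (restricting the projection matrix to the ground plane yields an invertible homography) reflects the abstract.

```python
import numpy as np

# Toy calibrated camera: P = K [R | t] with R = I and the camera 10 m from
# the ground plane (all values illustrative, not from the DeepSport dataset).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), [[0.0], [0.0], [10.0]]])
P = K @ Rt

# For points on the ground plane (Z = 0), P reduces to a 3x3 homography H
# built from the X, Y, and translation columns of P.
H = P[:, [0, 1, 3]]

def ground_point(u, v):
    """Invert the ground-plane homography to recover world (X, Y) at Z = 0."""
    w = np.linalg.solve(H, np.array([u, v, 1.0]))
    return w[:2] / w[2]

# Pixel where a model would predict the ball's projection onto the ground;
# combined with a predicted height, this fixes the ball's 3D position.
X, Y = ground_point(480.0, 400.0)
```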

Event-based High-speed Ball Detection in Sports Video

  • Takuya Nakabayashi
  • Akimasa Kondo
  • Kyota Higa
  • Andreu Girbau
  • Shin'ichi Satoh
  • Hideo Saito

Ball detection in sports, particularly in fast-paced games like volleyball, where the ball is constantly in high motion, presents a significant challenge for game analysis and automated sports broadcasting. Conventional camera-based ball detection faces issues, such as motion blur, in high-speed ball movement scenes. To address these challenges, we propose a deep learning-based method for detecting balls using event cameras. Event cameras, also known as dynamic vision sensors, operate differently from traditional cameras. Instead of capturing frames at fixed intervals, they record individual pixel-level luminance changes, referred to as events. This unique feature enables event cameras to provide precise temporal information with low latency. Our proposed method transforms sparse events into an image format, enabling the use of current deep-learning architectures for object detection. Given the limited amount of events available for training an object detector, we generate synthetic events from RGB frames. This approach reduces the need for extensive annotation and ensures sufficient data availability. Experimental results confirm that our proposed method can detect balls that are undetectable in RGB frames and outperform existing methods that utilize event-based ball detection. Moreover, we conducted tests to verify our method's ability to detect balls in real events, not just synthetic ones. These results demonstrate that our proposed method opens up new possibilities in sports ball detection.
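The event-to-image transformation mentioned above can be sketched very simply: per-pixel polarity changes within a time window are accumulated into a dense frame. The event list below is a toy stand-in; a real sensor emits millions of events per second.

```python
import numpy as np

# Toy event stream as (x, y, timestamp, polarity) tuples (illustrative).
events = [(2, 1, 0.001, +1), (2, 1, 0.002, +1), (0, 3, 0.004, -1)]

# Accumulate the polarity changes of one time window into a dense frame,
# which a standard object detector can then consume like an ordinary image.
H, W = 4, 4
frame = np.zeros((H, W), dtype=np.int32)
for x, y, _t, p in events:
    frame[y, x] += p
```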

Mitigating Motion Blur for Robust 3D Baseball Player Pose Modeling for Pitch Analysis

  • Jerrin Bright
  • Yuhao Chen
  • John Zelek

Using videos to analyze pitchers in baseball can play a vital role in strategizing and injury prevention. Computer vision-based pose analysis offers a time-efficient and cost-effective approach. However, the use of accessible broadcast videos, with a 30fps framerate, often results in partial body motion blur during fast actions, limiting the performance of existing pose keypoint estimation models. Previous works have primarily relied on fixed backgrounds, assuming minimal motion differences between frames, or utilized multiview data to address this problem. To this end, we propose a synthetic data augmentation pipeline to enhance the model's capability to deal with the pitcher's blurry actions. In addition, we leverage in-the-wild videos to make our model robust under different real-world conditions and camera positions. By carefully optimizing the augmentation parameters, we observed a notable reduction in the loss by 54.2% and 36.2% on the test dataset for 2D and 3D pose estimation respectively. By applying our approach to existing state-of-the-art pose estimators, we demonstrate an average improvement of 29.2%. The findings highlight the effectiveness of our method in mitigating the challenges posed by motion blur, thereby enhancing the overall quality of pose estimation.
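One common way to synthesize the kind of 30 fps motion blur discussed above is to average a short window of consecutive frames. This sketch shows that idea on a toy clip; the abstract does not specify the authors' exact augmentation, which would in practice blur only the fast-moving limbs.

```python
import numpy as np

# Toy clip of three grayscale frames at increasing brightness (illustrative).
frames = np.stack([np.full((2, 2), v) for v in (0.0, 60.0, 120.0)])

# Simulate the partial motion blur of a low-framerate broadcast by
# averaging a sliding window of consecutive frames.
blurred = frames.mean(axis=0)
```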

Rink-Agnostic Hockey Rink Registration

  • Jia Cheng Shang
  • Yuhao Chen
  • Mohammad Javad Shafiee
  • David A. Clausi

Hockey rink registration is a useful tool for aiding and automating sports analysis. When combined with player tracking, it can provide location information of players on the rink by estimating a homography matrix that can warp broadcast video frames onto an overhead template of the rink, or vice versa. However, most existing techniques require accurate ground truth information, which can take many hours to annotate, and only work on the trained rink types. In this paper, we propose a generalized rink registration pipeline that, once trained, can be applied to both seen and unseen rink types with only an overhead rink template and the video frame as inputs. Our pipeline uses domain adaptation techniques, semi-supervised learning, and synthetic data during training to achieve this ability and overcome the lack of non-NHL training data. The proposed method is evaluated on both NHL (source) and non-NHL (target) rink data and the results demonstrate that our approach can generalize to non-NHL rinks, while maintaining competitive performance on NHL rinks.
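The homography warping at the core of rink registration can be sketched as follows. The matrix here is a pure scaling chosen for illustration; in the pipeline above, the homography would come from the trained registration model.

```python
import numpy as np

# Toy homography from broadcast-frame pixels to overhead rink-template
# coordinates (a pure scaling here; values illustrative).
Hmat = np.array([[0.5, 0.0, 0.0],
                 [0.0, 0.5, 0.0],
                 [0.0, 0.0, 1.0]])

def warp(points, H):
    """Apply a 3x3 homography to an (N, 2) array of pixel coordinates."""
    pts = np.hstack([points, np.ones((len(points), 1))])
    out = pts @ H.T
    return out[:, :2] / out[:, 2:3]

# Map a tracked player's frame position onto the rink template.
rink_xy = warp(np.array([[100.0, 40.0]]), Hmat)
```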

SESSION: Session 3: Analyses in Team Sports

Expected Goals Prediction in Professional Handball using Synchronized Event and Positional Data

  • Michael Adams
  • Alexander David
  • Marc Hesse
  • Ulrich Rückert

In this study, we employ an extensive single-season dataset of event and positional data, as well as machine learning techniques, to build an Expected Goals (xG) model for handball. The selected features of the data include distances, angles, and game context. Several algorithms were considered, and after a five-fold cross-validation and hyperparameter optimization, CatBoost emerged as the most suitable, achieving a predictive accuracy of about 70%. Our methodology integrates a data synchronization and throw detection approach, along with a comprehensive exploration of feature importance using the SHapley Additive exPlanations (SHAP) method. The findings provide insights into handball strategy and player performance, unlocking new potential for game analysis and tactical planning. Looking forward, the practical applications of this research extend to enhancing player training, refining team strategies, and offering a deeper understanding of handball dynamics.
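Two of the geometric features named above, distance and angle to goal, can be computed as in this sketch. The coordinates are invented for illustration (a handball goal is 3 m wide); the paper's full feature set and the CatBoost model are not reproduced.

```python
import math

# Illustrative geometry: a 3 m wide goal centred at the origin and a throw
# taken 7 m straight out (coordinates in metres, not from the paper).
post_left, post_right = (-1.5, 0.0), (1.5, 0.0)
throw = (0.0, 7.0)

# Distance to the goal centre.
dist = math.hypot(*throw)

# Opening angle subtended by the two posts, via the law of cosines.
a = math.dist(throw, post_left)
b = math.dist(throw, post_right)
goal_width = 3.0
angle = math.acos((a * a + b * b - goal_width ** 2) / (2 * a * b))
```

Features like these, together with game context, would then be fed to a gradient-boosting classifier to predict the goal probability.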

ASTRA: An Action Spotting TRAnsformer for Soccer Videos

  • Artur Xarles
  • Sergio Escalera
  • Thomas B. Moeslund
  • Albert Clapés

In this paper, we introduce ASTRA, a Transformer-based model designed for the task of Action Spotting in soccer matches. ASTRA addresses several challenges inherent in the task and dataset, including the requirement for precise action localization, the presence of a long-tail data distribution, non-visibility in certain actions, and inherent label noise. To do so, ASTRA incorporates (a) a Transformer encoder-decoder architecture to achieve the desired output temporal resolution and to produce precise predictions, (b) a balanced mixup strategy to handle the long-tail distribution of the data, (c) an uncertainty-aware displacement head to capture the label variability, and (d) input audio signal to enhance detection of non-visible actions. Results demonstrate the effectiveness of ASTRA, achieving a tight Average-mAP of 66.82 on the test set. Moreover, in the SoccerNet 2023 Action Spotting challenge, we secure the 3rd position with an Average-mAP of 70.21 on the challenge set.
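The balanced mixup idea in point (b) can be sketched as follows: each sample's mixing partner is drawn from a class-balanced distribution rather than uniformly, so rare classes appear in more mixed pairs. All data here are toy values; ASTRA's actual formulation is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features and one-hot labels where class 1 is rare (illustrative).
X = np.array([[0.0], [1.0], [2.0], [9.0]])
y = np.eye(2)[[0, 0, 0, 1]]

# Balanced mixup: draw each sample's partner from a class-balanced
# distribution, then interpolate inputs and labels with lam ~ Beta(a, a).
lam = rng.beta(0.2, 0.2)
per_class = [np.flatnonzero(y[:, c] == 1) for c in range(2)]
partners = np.array([rng.choice(per_class[rng.integers(2)]) for _ in range(len(X))])
X_mix = lam * X + (1 - lam) * X[partners]
y_mix = lam * y + (1 - lam) * y[partners]
```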

Multi-task Learning for Joint Re-identification, Team Affiliation, and Role Classification for Sports Visual Tracking

  • Amir M. Mansourian
  • Vladimir Somers
  • Christophe De Vleeschouwer
  • Shohreh Kasaei

Effective tracking and re-identification of players is essential for analyzing soccer videos. However, it is a challenging task due to the non-linear motion of players, the similarity in appearance of players from the same team, and frequent occlusions. Therefore, the ability to extract meaningful embeddings to represent players is crucial in developing an effective tracking and re-identification system. In this paper, a multi-purpose part-based person representation method, called PRTreID, is proposed that performs three tasks simultaneously: role classification, team affiliation, and re-identification. In contrast to the available literature, a single network is trained with multi-task supervision to solve all three tasks jointly. The proposed joint method is computationally efficient due to the shared backbone. Also, the multi-task learning leads to richer and more discriminative representations, as demonstrated by both quantitative and qualitative results. To demonstrate the effectiveness of PRTreID, it is integrated with a state-of-the-art tracking method, using a part-based post-processing module to handle long-term tracking. The proposed tracking method outperforms all existing tracking methods on the challenging SoccerNet tracking dataset.

Dynamic NeRFs for Soccer Scenes

  • Sacha Lewin
  • Maxime Vandegar
  • Thomas Hoyoux
  • Olivier Barnich
  • Gilles Louppe

The long-standing problem of novel view synthesis has many applications, notably in sports broadcasting. Photorealistic novel view synthesis of soccer actions, in particular, is of enormous interest to the broadcast industry. Yet only a few industrial solutions have been proposed, and even fewer achieve near-broadcast quality of the synthetic replays. Except for their setup of multiple static cameras around the playfield, the best proprietary systems disclose close to no information about their inner workings. Leveraging multiple static cameras for such a task indeed presents a challenge rarely tackled in the literature, for a lack of public datasets: the reconstruction of a large-scale, mostly static environment, with small, fast-moving elements. Recently, the emergence of neural radiance fields has induced stunning progress in many novel view synthesis applications, leveraging deep learning principles to produce photorealistic results in the most challenging settings. In this work, we investigate the feasibility of basing a solution to the task on dynamic NeRFs, i.e., neural models purposed to reconstruct general dynamic content. We compose synthetic soccer environments and conduct multiple experiments using them, identifying key components that help reconstruct soccer scenes with dynamic NeRFs. We show that, although this approach cannot fully meet the quality requirements for the target application, it suggests promising avenues toward a cost-efficient, automatic solution. We also make our dataset and code publicly available, with the goal of encouraging further efforts from the research community on the task of novel view synthesis for dynamic soccer scenes. For code, data, and video results, please see

Jersey Number Recognition using Keyframe Identification from Low-Resolution Broadcast Videos

  • Bavesh Balaji
  • Jerrin Bright
  • Harish Prakash
  • Yuhao Chen
  • David A. Clausi
  • John Zelek

Player identification is a crucial component in vision-driven soccer analytics, enabling various downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatically detecting jersey numbers from player tracklets in videos presents challenges due to motion blur, low resolution, distortions, and occlusions. Existing methods, utilizing Spatial Transformer Networks, CNNs, and Vision Transformers, have shown success on image data but struggle with real-world video data, where jersey numbers are not visible in most of the frames. Hence, identifying frames that contain the jersey number is a key sub-problem to tackle. To address these issues, we propose a robust keyframe identification module that extracts frames containing essential high-level information about the jersey number. A spatio-temporal network is then employed to model spatial and temporal context and predict the probabilities of jersey numbers in the video. Additionally, we adopt a multi-task loss function to predict the probability distribution of each digit separately. Extensive evaluations on the SoccerNet dataset demonstrate that incorporating our proposed keyframe identification module results in significant 37.81% and 37.70% increases in accuracy on two different test sets with domain gaps. These results highlight the effectiveness and importance of our approach in tackling the challenges of automatic jersey number detection in sports videos.
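The per-digit multi-task loss mentioned above can be sketched as two independent classification heads whose cross-entropies are summed. The head sizes and logits below are illustrative, not the paper's architecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy logits from two digit heads for a jersey number "23": one head for
# the tens digit, one for the units digit (values illustrative).
tens_logits = np.zeros(10)
tens_logits[2] = 4.0
units_logits = np.zeros(10)
units_logits[3] = 4.0

# Multi-task loss: the sum of per-digit cross-entropies, so each head is
# supervised on its own digit's probability distribution.
loss = -np.log(softmax(tens_logits)[2]) - np.log(softmax(units_logits)[3])
```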

SESSION: Session 4: Competitions

A Sparse Attention Pipeline for DeepSportRadar Basketball Player Instance Segmentation Challenge

  • Fang Gao
  • Wenjie Wu
  • Yan Jin
  • Lei Shi
  • Shengheng Ma

The ACM MMSports2023 DeepSportRadar Basketball Player Instance Segmentation Challenge was focused on addressing the issue of occlusion. The dataset's primary characteristics include vast background areas, a high degree of occlusion between athletes, and limited data volume. To tackle the challenge of severe occlusion among athletes, we developed a sparse attention pipeline. Firstly, we introduced the InternImage backbone network and Sparse Multi-Head Self-Attention Module with a sparse Transformer. This allowed the segmentation pipeline to prioritize critical regions, effectively dealing with occlusion and class imbalance issues. Secondly, we adopted a multi-scale processing strategy using the Simple Dual Refinement Feature Pyramid Networks (SDRFPN) to fuse features of different scales. This approach improved the ability to handle athletes' features with different scales and fine details. Lastly, during the training phase, we employed random flipping data augmentation, which assisted the segmentation pipeline in recognizing targets from various angles and orientations. On the DeepSportRadar Basketball Player Instance Segmentation Challenge dataset, the pipeline achieved an impressive Occlusion Metric (OM) score of 0.316.

Image- and Instance-Level Data Augmentation for Occluded Instance Segmentation

  • Jun Yu
  • Shenshen Du
  • Ruiqiang Yang
  • Lei Wang
  • Minchuan Chen
  • Qingying Zhu
  • Shaojun Wang
  • Jing Xiao

Instance segmentation is a fundamental computer vision task with widespread applications. Numerous novel methods have been proposed to address this task. However, limited data and occlusion are common issues that hinder the practical application of instance segmentation. In this paper, we address the limited-data issue by employing image-level data augmentation. Additionally, to address the occlusion issue, we propose Balanced Occlusion Aware Copy-Paste (BOACP), a method that not only increases the number of instances in images but also balances occluded instances at the image level. This method enhances the performance of the model on occluded instances. For the model, we utilize the Hybrid Task Cascade (HTC) based on CBSwin-Base and CBFPN. Moreover, we conduct additional experiments to explore the Occlusion Metric (OM). Experimental results demonstrate the effectiveness of our proposed approach, and we achieve first place in the first phase of the DeepSportRadar Instance Segmentation Challenge in the ACM MMSports 2023 Workshop.

Exploring Loss Function and Rank Fusion for Enhanced Person Re-identification

  • Jun Yu
  • Renda Li
  • Renjie Lu
  • Leilei Wang
  • Shuoping Yang
  • Lei Wang
  • Minchuan Chen
  • Qingying Zhu
  • Shaojun Wang
  • Jing Xiao

Person Re-Identification (Re-ID) emerges as an important technique in sports analytics, enabling the accurate matching and recognition of players throughout a game. The fundamental objective of the person Re-ID task is to identify the same player across diverse camera views, thus establishing their identity association over time. Generally speaking, the difficulty of the task lies in the perspective changes, occlusions, and posture changes caused by different camera placements and angles. In particular, for the Synergy re-identification dataset, the overlapping occlusions between players and the low resolution and motion blur of the images make the Re-ID task challenging. In this paper, we analyze the impact of different data augmentations on this dataset and identify effective augmentation methods. Meanwhile, we adopt a contrastive image-to-image training method and achieve better results with the class-independent InfoNCE loss. We also quantitatively compare it with the class-related ID loss. Finally, we employ the k-reciprocal re-ranking method to reorganize and optimize the distance matrix output by the model. The above process enables a single model to achieve good retrieval performance on the Synergy re-identification dataset. We then summarize the key factors affecting retrieval performance on this dataset. To further improve retrieval performance, we propose an efficient model/rank fusion method that fuses the retrieval results of different models from the two perspectives of similarity and dissimilarity. Our proposed method achieves 98.81% mAP on the challenge set of the Synergy re-identification dataset, with which our team achieved 1st place in the DeepSportRadar player re-identification challenge 2023.
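A minimal instance of fusing retrieval results from the "similarity" perspective, as the abstract describes, is to normalize each model's query-gallery distance matrix to a common scale, average them, and re-rank. The matrices below are toy values; the paper's exact fusion scheme is not reproduced.

```python
import numpy as np

# Toy query-gallery distance matrices from two Re-ID models (illustrative).
d1 = np.array([[0.1, 0.9], [0.8, 0.2]])
d2 = np.array([[0.2, 0.7], [0.9, 0.1]])

def minmax(d):
    """Min-max normalize a distance matrix to [0, 1]."""
    return (d - d.min()) / (d.max() - d.min())

# Fuse: average the normalized distances, then re-rank the gallery.
fused = 0.5 * (minmax(d1) + minmax(d2))
ranking = fused.argsort(axis=1)  # per-query gallery indices, best match first
```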

Relative Boundary Modeling: A High-Resolution Cricket Bowl Release Detection Framework with I3D Features

  • Jun Yu
  • Leilei Wang
  • Renjie Lu
  • Shuoping Yang
  • Renda Li
  • Lei Wang
  • Minchuan Chen
  • Qingying Zhu
  • Shaojun Wang
  • Jing Xiao

Cricket Bowl Release Detection aims to segment specific portions of bowl release actions occurring in multiple videos, with a focus on detecting the entire time window of this action. Unlike traditional detection tasks that identify action categories at a specific moment, this task involves identifying events that typically span around 100 frames, and requires recognizing all instances of the bowl release action in the video. Strictly speaking, this task falls under a branch of temporal action detection. With the advancement of deep neural networks, recent works have proposed deep learning-based approaches to address this task. However, due to the challenge of unclear action boundaries in videos, many existing methods perform poorly on the DeepSportradar Cricket Bowl Release Dataset. To more accurately identify specific portions of the bowl release action in videos, we adopt a one-stage architecture based on Relative Boundary Modeling. Specifically, our method consists of three stages. In the first stage, we use the Inflated 3D ConvNet (I3D) model to extract spatio-temporal features from the input videos. In the second stage, we utilize Temporal Action Detection with Relative Boundary Modeling (TriDet) to model the boundaries of the bowl release action's specific portions based on the relative relationships between different time moments, thereby predicting the action's time window. Lastly, as the target events typically span around 100 frames and the predicted time windows may exhibit overlapping regions based on confidence scores, we implement a post-processing step to merge and filter these outputs, resulting in the final submission results. We conducted extensive experiments to demonstrate that our proposed method achieves superior performance. Additionally, we evaluated the training techniques of existing approaches.
Our proposed method achieves a PQ score of 0.519, an SQ score of 0.822, and an RQ score of 0.632 on the challenge set of the DeepSportradar Cricket Bowl Release Dataset. With this approach, our team, USTC_IAT_United, won third place in the first phase of the DeepSportradar Cricket Bowl Release Challenge.
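The post-processing step described above, merging overlapping predicted time windows by confidence, can be sketched as a greedy merge. The windows below are toy values, not outputs of the actual system.

```python
# Toy predicted (start_frame, end_frame, confidence) windows (illustrative).
windows = [(100, 200, 0.9), (150, 240, 0.6), (500, 600, 0.8)]

def merge(windows):
    """Greedily keep high-confidence windows, folding overlapping ones in."""
    kept = []
    for s, e, c in sorted(windows, key=lambda w: -w[2]):
        for i, (s2, e2, c2) in enumerate(kept):
            if s <= e2 and s2 <= e:  # overlap: extend the kept window
                kept[i] = (min(s, s2), max(e, e2), c2)
                break
        else:
            kept.append((s, e, c))
    return sorted(kept)

merged = merge(windows)
```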

STAN: Spatial-Temporal Awareness Network for Temporal Action Detection

  • Minghao Liu
  • Haiyi Liu
  • Sirui Zhao
  • Fei Ma
  • Minglei Li
  • Zonghong Dai
  • Hao Wang
  • Tong Xu
  • Enhong Chen

In recent years, there have been significant advancements in the field of temporal action detection. However, few studies have focused on detecting actions in sporting events. In this context, the MMSports 2023 cricket bowl release challenge aims to identify the bowl release action by segmenting untrimmed videos. To achieve this, we propose a novel cricket bowl release detection framework based on Spatial-Temporal Awareness Network (STAN) which mainly consists of three modules: the spatial feature extraction module (SFEM), the temporal feature extraction module (TFEM), and the classification module (CM). Specifically, we first adopt ResNet to extract the spatial features from videos in SFEM. Then, the TFEM is designed to aggregate temporal features using Bi-LSTM to obtain spatial-temporal features. Afterward, the CM converts the spatial-temporal features into action category probabilities to localize the action segments. Besides, we introduce the weighted binary cross entropy loss to solve the data imbalance problem in cricket bowl release detection. Finally, the experiments show that our proposed STAN achieves competitive performance in 1st place with a PQ score of 0.643 on the cricket bowl release challenge. The code is also publicly available at
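The weighted binary cross entropy mentioned above can be sketched as follows: frames of the rare positive class (bowl release) receive a larger weight so they are not drowned out by the many negative frames. All values here are illustrative.

```python
import numpy as np

# Toy per-frame probabilities and labels; bowl-release frames are rare, so
# positives receive a larger weight (all values illustrative).
p = np.array([0.9, 0.1, 0.2, 0.8])
y = np.array([1.0, 0.0, 0.0, 1.0])
w_pos, w_neg = 10.0, 1.0

# Weighted binary cross entropy: up-weight the scarce positive class.
weights = np.where(y == 1, w_pos, w_neg)
loss = -(weights * (y * np.log(p) + (1 - y) * np.log(1 - p))).mean()
```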