MMSports'18- Proceedings of the 1st International Workshop on Multimedia Content Analysis in Sports

SESSION: Keynote & Session 1

Session details: Keynote & Session 1

  •      Hideo Saito

Understanding and Shaping the Athlete's Brain Using Body-Mind Reading and Feedback

  •      Makio Kashino

In sports where athletes play against an opponent, such as ball games and martial arts, a variety of cognitive functions hold the key to winning: grasping the situation, strategizing against one's opponent, making appropriate decisions instantaneously, and keeping an ideal mental state under intense pressure. Most of these functions, however, are "implicit" brain functions that athletes themselves are not even aware of and cannot control at will. This is thought to be one of the factors behind the difficulty of acquiring, exhibiting, and coaching a specific skill. The NTT Sports Brain Science project was established in January 2017 to conduct research with the aim of understanding the superior implicit brain functions of top athletes, identifying the factors in winning, and improving athletes' performance based on the research findings. The technical core of the project is "body-mind reading and feedback", which comprises technologies such as wearable sensing, computer vision, biometric analysis, virtual reality, and sonification. The basic concepts and some research topics of the project will be introduced, with a focus on baseball and softball.

Snooker Video Event Detection Using Multimodal Features

  •      Junqing Yu
  • Yixin Huang
  • Yunfeng He

The key to content-based video retrieval is the automatic detection and annotation of semantic events. Since existing research on snooker video analysis falls short of event detection needs, we propose an event detection system for snooker game video based on a fusion of multimodal information. Our main contributions are as follows. A full-table view detection method combining color and geometric features is proposed that can precisely locate the score bar and turn indicator. A new text segmentation and recognition solution for the score bar and turn indicator is implemented, using an official player database to compensate for the deficiencies of optical character recognition (OCR). An audio classification approach based on hidden Markov models is proposed to recognize applause, laughter, sighs, and the sounds of shots in replays. In the visual modality, we detect replays by optical flow block matching based on dynamic programming. Through multimodal fusion of visual, audio, and text information with domain knowledge of snooker, we realize detection algorithms for nine semantic events: frames, high breaks, defensive counterattacks, long considerations, fouls, nice shots, nice safeties, faults, and funny events. The experimental results show that the proposed method achieves high performance for most semantic events, and a comparison with other methods demonstrates its superiority.
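
As a rough illustration of the HMM-based audio classification the abstract describes, the following sketch classifies a quantized audio-energy sequence by comparing forward-algorithm likelihoods under two toy discrete-emission models (all model parameters, class names, and symbols are hypothetical, not the authors'):

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm."""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # predict, then weight by emission
        s = alpha.sum()
        log_lik += np.log(s)
        alpha /= s
    return log_lik

# Hypothetical 2-state HMMs over 3 quantized energy symbols (0 = low, 2 = high):
# an "applause"-like model favouring high energy, and a quiet model.
pi = np.array([0.5, 0.5])
A_applause = np.array([[0.5, 0.5], [0.5, 0.5]])
B_applause = np.array([[0.1, 0.3, 0.6], [0.2, 0.4, 0.4]])
A_quiet = np.array([[0.9, 0.1], [0.2, 0.8]])
B_quiet = np.array([[0.8, 0.15, 0.05], [0.6, 0.3, 0.1]])

def classify(obs):
    """Pick the model under which the observed sequence is more likely."""
    ll_a = forward_log_likelihood(obs, pi, A_applause, B_applause)
    ll_q = forward_log_likelihood(obs, pi, A_quiet, B_quiet)
    return "applause" if ll_a > ll_q else "quiet"

print(classify([2, 2, 1, 2, 2, 2]))  # high-energy symbols
print(classify([0, 0, 1, 0, 0, 0]))  # low-energy symbols
```

A real system would use Gaussian emissions over spectral features rather than hand-quantized energy, but the maximum-likelihood decision rule is the same.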

A Convolutional Sequence to Sequence Model for Multimodal Dynamics Prediction in Ski Jumps

  •      Dan Zecha
  • Christian Eggert
  • Moritz Einfalt
  • Stephan Brehm
  • Rainer Lienhart

A convolutional sequence to sequence model for predicting the jump forces of ski jumpers directly from pose estimates is presented. We collect footage from multiple, unregistered cameras together with the output of force measurement plates, and present a spatiotemporal calibration procedure for all modalities based solely on the athlete's pose estimates. The synchronized data is used to train a fully convolutional sequence to sequence network for predicting jump forces directly from the human pose. We demonstrate that the best performing networks produce a mean squared error of 0.062 on normalized force time series while identifying the moment of maximal force in the original video at 55% recall within ±2 frames of the ground truth.
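
The "recall within ±2 frames" evaluation can be sketched as follows: take the frame of the predicted force maximum per jump and count it as a hit if it lands within the tolerance window around the ground-truth frame (the force series and frame indices below are made-up toy data, not the paper's):

```python
import numpy as np

def peak_recall(pred_series, gt_frames, tolerance=2):
    """Fraction of jumps whose predicted maximal-force frame lies
    within +/- `tolerance` frames of the ground-truth frame."""
    hits = 0
    for series, gt in zip(pred_series, gt_frames):
        pred_peak = int(np.argmax(series))
        if abs(pred_peak - gt) <= tolerance:
            hits += 1
    return hits / len(gt_frames)

# Toy normalized force predictions for three jumps (hypothetical data).
preds = [np.array([0.1, 0.4, 0.9, 0.5]),   # predicted peak at frame 2
         np.array([0.2, 0.8, 0.3, 0.1]),   # predicted peak at frame 1
         np.array([0.9, 0.2, 0.1, 0.1])]   # predicted peak at frame 0
gt = [2, 3, 3]  # ground-truth frames of maximal force
print(peak_recall(preds, gt))  # 2 of 3 peaks fall inside the window
```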

SESSION: Session 2

Session details: Session 2

  •      Dan Zecha

Stillness Moves: Exploring Body Weight-Transfer Learning in Physical Training for Tai-Chi Exercise

  •      Han Hong Lin
  • Ping Hsuan Han
  • Kuan Yin Lu
  • Chia Hung Sun
  • Pei Yi Lee
  • Yao Fu Jan
  • Amy Ming Sui Lee
  • Wei Zen Sun
  • Yi Ping Hung

Body weight-transfer plays an important role in many exercises. Body posture, movement, and weight-transfer are correlated and jointly affect how well a trainee performs in exercises such as Tai-Chi. Following the traditional way of learning Tai-Chi, we propose Stillness Moves, a physical training system for Tai-Chi that captures and records users' skeleton movement and weight-transfer information to offer real-time and summary visual feedback. On this basis, we provide a gradual learning program for physical training that combines body movement and weight-transfer learning. We evaluated our system in a user study comparing performance with and without weight-transfer guidance. The results demonstrate that weight-transfer guidance is beneficial for trainees learning Tai-Chi moves. For difficult moves, the trainee should learn the weight-transfer first and then the body movement.
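
A minimal sketch of weight-transfer feedback of the kind such a system could give, assuming a per-foot force reading (e.g. from balance boards); the ratio, tolerance, and feedback strings are illustrative inventions, not the authors' design:

```python
def weight_transfer_ratio(left_force, right_force):
    """Share of body weight on the left foot, in [0, 1]."""
    total = left_force + right_force
    return left_force / total if total > 0 else 0.5

def transfer_feedback(trainee_ratio, reference_ratio, tolerance=0.1):
    """Real-time cue comparing the trainee's weight distribution
    with a reference posture's distribution."""
    diff = trainee_ratio - reference_ratio
    if abs(diff) <= tolerance:
        return "ok"
    return "shift right" if diff > 0 else "shift left"

# Trainee carries 70% of their weight on the left foot, while the
# reference posture calls for 40% (hypothetical numbers).
r = weight_transfer_ratio(70.0, 30.0)
print(transfer_feedback(r, 0.4))  # too much weight on the left foot
```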

Computational UAV Cinematography for Intelligent A/V Shooting Based on Semantic Visual Analysis

  •      Fotini Patrona
  • Ioannis Mademlis
  • Anastasios Tefas
  • Ioannis Pitas

As audiovisual coverage of sports events using Unmanned Aerial Vehicles (UAVs) is becoming increasingly popular, intelligent audiovisual (A/V) shooting tools are needed to assist cameramen and directors. Several challenges arise in employing autonomous UAVs, including accurately identifying the 2D region of cinematographic attention (RoCA) that depicts rapidly moving target ensembles (e.g., athletes), and automatically controlling the UAVs so as to take informative and aesthetically pleasing A/V shots through automatic or semiautomatic visual content analysis with minimal or no human intervention. In this work, a novel method implementing computational UAV cinematography for assisting sports coverage, based on semantic, human-centered visual analysis, is proposed. Athlete detection and tracking, as well as the spatial athlete distribution on the image plane, are the semantic features extracted from an aerial video feed captured by a UAV; they are exploited to extract the RoCA based solely on present and past athlete detections and their regions of interest (ROIs). A PID controller that visually steers a real or virtual camera in order to track the sports RoCA and produce aesthetically pleasing shots, without using 3D location-related information, is subsequently employed. The proposed method is evaluated on actual UAV A/V footage from soccer matches, and promising results are obtained.
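
To illustrate the visual-servoing idea, here is a generic discrete PID controller driving a camera's pan position toward the RoCA centre; the gains, one-dimensional camera model, and target value are hypothetical and not taken from the paper:

```python
class PID:
    """Discrete PID controller (hypothetical gains, dt = 1 frame)."""
    def __init__(self, kp=0.5, ki=0.05, kd=0.1):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        self.integral += error
        deriv = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Keep the RoCA centre at the image centre by panning the camera:
# the control error is the horizontal offset between the two.
pid = PID()
camera_x = 0.0
roca_x = 100.0  # horizontal position of the RoCA centre (pixels)
for _ in range(50):
    camera_x += pid.step(roca_x - camera_x)
print(round(camera_x, 1))  # converges toward the RoCA centre
```

In a 2D setting the same loop would run per axis (pan/tilt, and optionally zoom on the RoCA size), with the error taken directly from image-plane detections, which is what makes 3D localization unnecessary.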

An On-site Visual Feedback Method Using Bullet-Time Video

  •      Takasuke Nagai
  • Hidehiko Shishido
  • Yoshinari Kameda
  • Itaru Kitahara

This paper describes an on-site visual feedback method that executes all processes, from capturing multi-view videos to generating and displaying bullet-time videos, in real time. To realize on-site visual feedback in a dynamic scene where the subject moves around, such as a sports scene, the target point to which an observer pays attention must be set automatically. We combine an RGB-D camera that detects the position of the subject with our previously developed real-time bullet-time video generation method, and achieve automatic setting of the target point based on the measured 3D position. Furthermore, we incorporate a function that detects keyframes and automatically switches the viewpoint, enabling easier and more intuitive observation.
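
The core geometric step of aiming every virtual viewpoint at a measured 3D target point can be sketched with a standard look-at rotation; the coordinates and up-vector convention below are illustrative assumptions, not the paper's calibration:

```python
import numpy as np

def look_at(camera_pos, target_pos, up=(0.0, 1.0, 0.0)):
    """Rotation matrix whose rows are the right/up/forward axes of a
    camera at `camera_pos` aimed at `target_pos`."""
    forward = np.asarray(target_pos, float) - np.asarray(camera_pos, float)
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, forward])

# Re-aim a viewpoint at the subject position measured by the RGB-D
# camera (hypothetical coordinates, metres).
subject = np.array([0.0, 1.0, 5.0])
cam = np.array([2.0, 1.0, 0.0])
R = look_at(cam, subject)
# The rotated view axis places the subject straight ahead of the camera.
d = subject - cam
print(np.round(R @ (d / np.linalg.norm(d)), 6))
```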

Using Virtual Reality and Head-Mounted Displays to Increase Performance in Rowing Workouts

  •      Sebastian Arndt
  • Andrew Perkis
  • Jan-Niklas Voigt-Antons

Technology is advancing rapidly in the domain of virtual reality, as well as in using sensors to gather feedback from our bodies and the environment we interact in. Combining these two technologies gives us the opportunity to create personalized and reactive immersive environments. These environments can be used for training for dangerous situations (e.g., fires, crashes) or to improve skills with less distraction than regular natural environments would offer. The pilot study described in this paper puts an athlete rowing on a stationary rowing machine into a virtual environment. The virtual environment receives movement data from several sensors on the rowing machine and displays it in the head-mounted display. In addition, metrics on technique are derived from the sensor data as well as physiological data. All this is used to investigate whether and to what extent VR improves the technical skills of an athlete during the complex sport of rowing. Furthermore, athletes give subjective feedback about their performance, comparing the standard rowing workout with the workout using VR. First results indicate improved workout performance and an enhanced experience for the athlete in the VR condition.

ORSNet: A Hybrid Neural Network for Official Sports Referee Signal Recognition

  •      Tse-Yu Pan
  • Chen-Yuan Chang
  • Wan-Lun Tsai
  • Min-Chun Hu

In this work, we propose a novel sports referee training system based on wearable sensors and a real-time Official Referee Signal (ORS) segmentation/recognition method, which can recognize 65 kinds of basketball ORSs with an accuracy of 95.3%. A hybrid neural network named ORSNet is designed for recognizing gestures based on IMU signals. The proposed ORSNet combines convolutional layers and recurrent layers to learn more representative features and temporal correlations, respectively. A novel loss function and a weight sharing strategy are proposed to learn a more robust ORS recognition model. Moreover, we investigate the influence of applying a semi-supervised network in the proposed ORSNet.
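
Before recognition, a gesture stream must be segmented from the continuous IMU signal. A common baseline (not necessarily the authors' method) is energy-based segmentation: keep runs where the acceleration magnitude stays above a threshold; the data and threshold below are hypothetical:

```python
import numpy as np

def segment_gestures(accel, threshold=1.5, min_len=3):
    """Return (start, end) sample-index pairs where the acceleration
    magnitude exceeds `threshold` for at least `min_len` samples."""
    active = np.linalg.norm(accel, axis=1) > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

# Toy IMU stream: rest, one gesture burst, rest (hypothetical values, g).
accel = np.array([[0.1, 0.0, 0.0]] * 5
                 + [[2.0, 1.0, 0.5]] * 6
                 + [[0.1, 0.0, 0.0]] * 4)
print(segment_gestures(accel))  # one segment covering the burst
```

Each segment would then be fed to the classifier (here, the convolutional-recurrent network) for recognition.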

Development of a Virtual Environment for Motion Analysis of Tennis Service Returns

  •      Kei Saito
  • Katsutoshi Masai
  • Yuta Sugiura
  • Toshitaka Kimura
  • Maki Sugimoto

In sports performance analysis, it is important to understand the differences between experts and novices in order to train novices efficiently. To understand these differences within the game of tennis, we developed a virtual environment to analyze the responses of experts and novices to services. By capturing the actual service motions of an expert, it is possible to reproduce virtualized services in the environment. We conducted experiments on the types and courses of services. As a result, we found differences between experts and novices in preparation, leg movement, take-back of the return, and degree of spine twist.

Swimming Pool Occupancy Analysis using Deep Learning on Low Quality Video

  •      Morten B. Jensen
  • Rikke Gade
  • Thomas B. Moeslund

Automatically creating spatio-temporal occupancy analyses of public swimming pools is of great interest, both for administrators to optimize the use of these expensive facilities and for users to schedule their activities outside peak hours. In this paper we apply current state-of-the-art deep learning methods for human detection to low quality swimming pool video. Furthermore, we propose a method for analyzing the spatio-temporal occupancy of a swimming pool. We show that it is possible to precisely detect swimmers in very challenging conditions, obtaining an AUC of 93.48% with YOLOv2. An acceptable AUC of 79.29% was obtained with Tiny-YOLO, which can be implemented on a low-cost embedded system capable of producing results in real time on site. We expect that the performance of both networks can be improved with more training data.
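
The spatio-temporal occupancy analysis can be sketched as accumulating per-frame detections into a spatial grid over the whole video; the pool dimensions, grid resolution, and detections below are hypothetical, not the paper's setup:

```python
import numpy as np

def occupancy_heatmap(detections, pool_size=(25.0, 12.5), grid=(10, 5)):
    """Accumulate per-frame swimmer positions (x, y in metres) into a
    spatial occupancy grid summed over all frames."""
    heat = np.zeros(grid, dtype=int)
    gx, gy = grid
    w, h = pool_size
    for frame in detections:
        for x, y in frame:
            i = min(int(x / w * gx), gx - 1)  # clip to the last cell
            j = min(int(y / h * gy), gy - 1)
            heat[i, j] += 1
    return heat

# Two frames of hypothetical detections; swimmers cluster near one end.
frames = [[(1.0, 2.0), (2.0, 3.0)], [(1.5, 2.5), (24.0, 11.0)]]
heat = occupancy_heatmap(frames)
print(heat.sum())   # total detections accumulated
print(heat[0, 0])   # occupancy of one corner cell
```

Slicing the accumulation by time of day instead of summing over all frames yields the temporal side of the analysis (e.g. identifying peak hours).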

SESSION: Keynote & Session 3

Session details: Keynote & Session 3

  •      Rainer Lienhart

Practical Sports Video Analysis at Qoncept: A Few Case Studies

  •      Tuukka Karvonen

Analyzing sports scenarios through video processing is an enticing approach to sports analytics, especially due to the lack of need to attach sensors to the measured target. However, it has many challenges inherent to inverse problems such as multidimensionality of the data and the noise in the measurements. There are also many practical challenges to providing a working computer vision -based solution such as equipment installation constraints and including these in the system design, and dealing with outside factors such as weather changes. We take a look at these various challenges through a few case studies. We go through both the algorithmic and practical issues faced when working on these problems, and as a result, hope to identify common problems when working in the field and through them construct a useful framework for building computer vision -based solutions to sports analysis problems.

Sports Video Captioning by Attentive Motion Representation based Hierarchical Recurrent Neural Networks

  •      Mengshi Qi
  • Yunhong Wang
  • Annan Li
  • Jiebo Luo

Sports video captioning is the task of automatically generating a textual description of sports events (e.g., football, basketball, or volleyball games). Although previous works have shown promising performance in producing coarse, general descriptions of a video, it remains quite challenging to caption a sports video containing multiple fine-grained player actions and complex group relationships among players. In this paper, we present a novel hierarchical recurrent neural network (RNN) based framework with an attention mechanism for sports video captioning. A motion representation module is proposed to extract individual pose attributes and group-level trajectory cluster information. Moreover, we introduce a new dataset, called Sports Video Captioning Dataset-Volleyball, for evaluation. We evaluate our proposed model on two public datasets and our new dataset, and the experimental results demonstrate that our method outperforms the state-of-the-art methods.
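
The attention mechanism at the heart of such caption decoders can be sketched generically as dot-product attention: score each per-frame feature against the decoder's query, normalize with a softmax, and pool (the feature vectors and query below are toy values, not the paper's learned representations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attend(features, query):
    """Dot-product attention: weight per-frame feature vectors by
    their similarity to the decoder query, then pool them."""
    scores = features @ query        # one score per frame
    weights = softmax(scores)        # normalized attention weights
    context = weights @ features     # weighted sum of frame features
    return context, weights

# Three hypothetical frame features; the query matches frame 1 best.
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]])
query = np.array([0.0, 5.0])
context, weights = attend(features, query)
print(int(np.argmax(weights)))  # frame the decoder focuses on
```

At each decoding step the RNN would emit a new query, so the focused frame shifts as the caption is generated word by word.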

Estimation of Runners' Number of Steps, Stride Length and Speed Transition from Video of a 100-Meter Race

  •      Kentaro Yagi
  • Kunihiro Hasegawa
  • Yuta Sugiura
  • Hideo Saito

The purpose of this study is to sense the movements of 100-m runners from publicly available video, for example, Internet broadcasts. Normally, the information that can be obtained from a video is limited to the number of steps and the average stride length. However, our proposed method makes it possible to measure not only this information but also time-scale information, such as every stride length and the speed transition, from the same input. Our proposed method can be divided into three steps. First, we generate a panoramic image of the 100-m track; this lets us estimate where in a frame the runners are at the 100-meter scale. Second, we detect whether the runner steps in the frame, using the detected track lines and the runners' leg joint positions. Finally, we project every step onto the overview image of the 100-m track to estimate the stride length at the 100-m scale. In the experiments, we apply our method to various race videos and evaluate its accuracy by comparison with data measured using typical methods. In addition, we evaluate the accuracy of the estimated number of steps and visualize runners' steps and speed transitions.
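
Once each foot strike has a position along the track and a video frame, stride lengths and speed transitions follow from simple differences; the positions, frame indices, and frame rate below are invented toy values:

```python
def stride_and_speed(step_positions, step_frames, fps=30.0):
    """Per-stride lengths (m) and average speed (m/s) between
    consecutive foot strikes, given each strike's position along the
    100-m track and the video frame in which it occurs."""
    strides, speeds = [], []
    for (p0, f0), (p1, f1) in zip(zip(step_positions, step_frames),
                                  zip(step_positions[1:], step_frames[1:])):
        strides.append(p1 - p0)                     # distance covered
        speeds.append((p1 - p0) / ((f1 - f0) / fps))  # distance / time
    return strides, speeds

# Hypothetical foot-strike data: track position (m) and video frame.
positions = [0.0, 2.0, 4.2, 6.6]
frames = [0, 9, 17, 24]
strides, speeds = stride_and_speed(positions, frames)
print([round(s, 2) for s in strides])
print([round(s, 2) for s in speeds])  # the runner accelerates
```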

Fast and Accurate Object Detection Using Image Cropping/Resizing in Multi-View 4K Sports Videos

  •      Jianfeng Xu
  • Lertniphonphan Kanokphan
  • Kazuyuki Tasaka

Recently, fast and accurate DNN object detectors such as YOLO and SSD have attracted considerable attention. However, processing a 4K video still takes far longer than real time, and the problem becomes even more challenging with multi-view 4K sports videos due to the massive amount of data and the small objects. This paper presents a novel approach to significantly accelerate object detection. Observing that object regions (including players and balls) are very sparse in sports videos, we crop the images to greatly reduce the processing area, using temporal or cross-view correlations. However, simple image cropping may worsen accuracy if an object is too small to be detected. A further observation is that detection can be improved by properly resizing small objects. Our experimental results on two soccer matches of the J1 League demonstrate that we can make detection much faster while maintaining high accuracy for small objects.
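
The cropping/resizing bookkeeping can be sketched as follows: pad the union of the previous frame's detection boxes to get the region worth processing, run the detector on an upscaled crop, and map its boxes back to 4K coordinates (the margin, boxes, and scale below are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

def crop_with_margin(frame_shape, prev_boxes, margin=32):
    """Union of last-frame detection boxes (x0, y0, x1, y1), padded by
    `margin` pixels and clipped to the frame: the only region the
    detector needs to process in the current frame."""
    h, w = frame_shape
    boxes = np.asarray(prev_boxes)
    x0 = max(int(boxes[:, 0].min()) - margin, 0)
    y0 = max(int(boxes[:, 1].min()) - margin, 0)
    x1 = min(int(boxes[:, 2].max()) + margin, w)
    y1 = min(int(boxes[:, 3].max()) + margin, h)
    return x0, y0, x1, y1

def map_box_back(box, crop_origin, scale):
    """Map a detection from the resized crop back to 4K coordinates."""
    x0, y0 = crop_origin
    return tuple(v / scale + o for v, o in zip(box, (x0, y0, x0, y0)))

# Players detected in the previous 2160x3840 frame (hypothetical boxes).
prev = [(1000, 800, 1100, 1000), (1500, 900, 1560, 1050)]
crop = crop_with_margin((2160, 3840), prev)
print(crop)
# A detection at (100, 50, 140, 130) inside a 2x-upscaled crop maps back to:
print(map_box_back((100, 50, 140, 130), crop[:2], scale=2.0))
```

Running the detector on a small upscaled crop instead of the full 4K frame is what yields the speedup while keeping small objects large enough to detect.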