MMAsia '23 Workshops: Proceedings of the 5th ACM International Conference on Multimedia in Asia Workshops


SESSION: Workshop Papers

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

  • Qiwei Li
  • Zuchao Li
  • Xiantao Cai
  • Bo Du
  • Hai Zhao

In recent years, the use of multi-modal pre-trained Transformers has led to significant advancements in visually-rich document understanding. However, existing models have mainly focused on features such as text and vision while neglecting the importance of layout relationships between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that models the layout structure graph to inject document layout knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model enables the understanding of the spatial arrangement of text elements, improving document comprehension. We evaluate our model on various benchmarks, including FUNSD, XFUND, and CORD, and it achieves state-of-the-art results on these datasets. Our experimental results demonstrate that the proposed method provides a significant improvement over existing approaches and showcases the importance of incorporating layout information into document understanding models. We also conduct an ablation study to investigate the contribution of each component of our model. The results show that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.
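
To make the layout-aware attention layer concrete, here is a minimal PyTorch sketch, assuming a simplified scheme in which each pair of text nodes carries a discrete layout-relation id that indexes a learned per-head attention bias; the actual GraphLayoutLM layer may differ in detail.

```python
import torch
import torch.nn as nn

class LayoutAwareSelfAttention(nn.Module):
    """Minimal sketch: standard self-attention plus a learned bias derived
    from pairwise layout relations (a hypothetical simplification of the
    paper's layout-aware multi-head self-attention)."""

    def __init__(self, dim, num_heads, num_relations):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        # One scalar bias per (layout-relation type, head) pair.
        self.rel_bias = nn.Embedding(num_relations, num_heads)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, rel_ids):
        # x: (B, N, dim); rel_ids: (B, N, N) integer layout-relation ids
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B, H, N, N)
        # Add the layout bias before softmax so spatial relations shape attention.
        scores = scores + self.rel_bias(rel_ids).permute(0, 3, 1, 2)
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```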

PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition

  • Chengxi Lei
  • Satwinder Singh
  • Feng Hou
  • Xiaoyun Jia
  • Ruili Wang

Most current speech data augmentation methods operate on either the raw waveform or the amplitude spectrum of speech. In this paper, we propose a novel speech data augmentation method called PhasePerturbation that operates dynamically on the phase spectrum of speech. Instead of statically rotating the phase by a constant degree, PhasePerturbation utilizes three dynamic phase spectrum operations, i.e., a randomization operation, a frequency masking operation, and a temporal masking operation, to enhance the diversity of speech data. We conduct experiments on wav2vec2.0 pre-trained ASR models by fine-tuning them with the PhasePerturbation-augmented TIMIT corpus. The experimental results demonstrate a 10.9% relative reduction in word error rate (WER) compared with the baseline model fine-tuned without any augmentation. Furthermore, the proposed method achieves additional WER improvements (12.9% and 15.9%) by complementing Vocal Tract Length Perturbation (VTLP) and SpecAug, which are both amplitude-spectrum-based augmentation methods. The results highlight the capability of PhasePerturbation to improve upon current amplitude-spectrum-based augmentation methods.
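
The three dynamic phase operations can be pictured with a short NumPy/librosa sketch such as the one below; the function name, parameter names, and ranges are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
import librosa

def phase_perturb(wav, n_fft=512, hop=128,
                  max_rot=np.pi / 4, f_mask=10, t_mask=20):
    """Minimal sketch of dynamic phase-spectrum augmentation:
    randomization plus frequency and temporal masking of the phase."""
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)

    # 1) Randomization: rotate every bin by a random (not constant) angle.
    phase += np.random.uniform(-max_rot, max_rot, size=phase.shape)

    # 2) Frequency masking: scramble the phase in a random frequency band.
    f0 = np.random.randint(0, phase.shape[0] - f_mask)
    phase[f0:f0 + f_mask, :] = np.random.uniform(
        -np.pi, np.pi, size=(f_mask, phase.shape[1]))

    # 3) Temporal masking: scramble the phase in a random time span.
    t0 = np.random.randint(0, phase.shape[1] - t_mask)
    phase[:, t0:t0 + t_mask] = np.random.uniform(
        -np.pi, np.pi, size=(phase.shape[0], t_mask))

    # Recombine the untouched magnitude with the perturbed phase.
    return librosa.istft(mag * np.exp(1j * phase), hop_length=hop,
                         length=len(wav))
```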

Speech Emotion Recognition using Threshold Fusion for Enhancing Audio Sensitivity

  • Zhaojie Luo
  • Stefan Christiansson
  • Bence Ladóczki
  • Kazunori Komatani

Speech Emotion Recognition (SER) has found applications in various fields. However, most SER studies exhibit a bias towards the text modality, which can lead to incorrect recognition when non-verbal audio features convey the primary emotional information. To address this issue, we propose a two-step solution to enhance the audio emotion sensitivity of SER models. First, we use a parallel emotional speech dataset (ESD), which contains identical speech content pronounced with different emotions, to pretrain a speech-content-independent emotion recognition model, named the Audio Sensitive Network (ASN). Second, we propose a novel threshold fusion technique utilizing the Tree-structured Parzen Estimator (TPE) to optimize a separate threshold for each predicted label, integrating the ASN with baseline SER classifiers. To demonstrate the efficacy of our approach, we conduct experiments on the IEMOCAP and ESD datasets. The results reveal that our method enhances audio sensitivity and improves the performance of existing SER classifiers.
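
A minimal sketch of per-label threshold fusion optimized with the TPE implementation in hyperopt follows; the override-style fusion rule and the variable names (asn_probs, base_preds) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from hyperopt import fmin, hp, tpe
from sklearn.metrics import f1_score

def fuse(asn_probs, base_preds, thresholds):
    """Take the ASN's label whenever its confidence clears that
    label's threshold; otherwise keep the baseline SER prediction."""
    fused = base_preds.copy()
    asn_labels = asn_probs.argmax(axis=1)
    asn_conf = asn_probs.max(axis=1)
    override = asn_conf > thresholds[asn_labels]
    fused[override] = asn_labels[override]
    return fused

def objective(params, asn_probs, base_preds, y_true, num_labels):
    th = np.array([params[f"t{k}"] for k in range(num_labels)])
    fused = fuse(asn_probs, base_preds, th)
    return -f1_score(y_true, fused, average="macro")  # hyperopt minimizes

num_labels = 4  # e.g. a four-class IEMOCAP setup
space = {f"t{k}": hp.uniform(f"t{k}", 0.3, 1.0) for k in range(num_labels)}
# With asn_probs, base_preds, y_true prepared on a dev set:
# best = fmin(lambda p: objective(p, asn_probs, base_preds, y_true, num_labels),
#             space, algo=tpe.suggest, max_evals=200)
```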

Automatic Labeling of Tibetan Prosodic Boundary Based on Speech Synthesis Tasks

  • Zom Yang
  • Kuntharrgyal Khysru
  • Yi Zhu
  • Long Daijicuo
  • Jianguo Wei

Prosody is the highest-level expression of speech dynamics, mainly reflected in the pauses, tone intensity, accent, and rhythm of natural pronunciation. Prosodic labeling is an important factor in improving the naturalness of synthesized speech and enhancing semantic understanding, and extracting prosodic information can bring the effect of speech synthesis closer to natural speech. In this paper, drawing on Tibetan grammatical theory and the characteristics of Tibetan speech, we design a method for the automatic labeling of prosodic boundaries that incorporates Tibetan text, acoustic features, and other characteristics of Tibetan speech, built around the task of Tibetan speech synthesis. We validate the designed automatic labeling method on a Tibetan speech synthesis corpus of 20,975 utterances. The F1 values for prosodic words, prosodic phrases, and intonation phrases obtained with the prosodic labeling rules are 95%, 93.4%, and 90.4%, respectively, demonstrating the feasibility and soundness of the method for mining regular pronunciation features of Tibetan in the speech synthesis task.
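
For reference, boundary-labeling quality of this kind is conventionally scored with F1 over predicted boundary positions, computed per level (prosodic word, prosodic phrase, intonation phrase); a minimal sketch, assuming boundaries are represented as sets of positions:

```python
def boundary_f1(pred, ref):
    """pred, ref: sets of (utterance_id, syllable_index) boundary positions.
    Returns the harmonic mean of boundary precision and recall."""
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# An F1 of 95% for prosodic words means the predicted word boundaries
# almost exactly match the reference annotation.
```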

A Large Vocabulary End-to-End Myanmar Automatic Speech Recognition

  • Hay Mar Soe Naing
  • Win Pa Pa

In recent years, sequence-to-sequence technology has become popular in the automatic speech recognition area. This model family replaces the classic complex pipeline with a single neural network architecture. This paper proposes the use of Transformer- and Conformer-based models in a Myanmar automatic speech recognition system (UCSY-Myan-ASR). Classical hybrid long short-term memory (LSTM) models and end-to-end models are presented and evaluated to improve error rates. The experiments were carried out on the UCSY 82-hour speech corpus and evaluated in terms of syllable error rate (SER) and character error rate (CER). Using the Transformer approach, the best performance in the daily-conversation domain reaches an SER of 9.6% and a CER of 7.3%. Using the Conformer model, the best performance in the news domain is an SER of 10.6% and a CER of 6.9%.
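
For clarity, both metrics reported here are the Levenshtein (edit) distance normalized by reference length, differing only in the tokenization unit (syllables for SER, characters for CER); a minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (or match)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (r != h))
    return d[-1]

def error_rate(ref_tokens, hyp_tokens):
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# CER: error_rate(list(ref_text), list(hyp_text))
# SER: error_rate(ref_syllables, hyp_syllables)  # after syllable segmentation
```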

MMRec: Simplifying Multimodal Recommendation

  • Xin Zhou

This paper presents MMRec, an open-source toolbox for multimodal recommendation. MMRec simplifies and canonicalizes the process of implementing and comparing multimodal recommendation models. Its objective is to provide a unified and configurable arena that minimizes the effort of implementing and testing such models. It supports multimodal models ranging from traditional matrix factorization to modern graph-based algorithms, all capable of fusing information from multiple modalities simultaneously. Our documentation, examples, and source code are available at https://github.com/enoche/MMRec.

KyotoMOS: An Automatic MOS Scoring System for Speech Synthesis

  • Wangjin Zhou
  • Zhengdong Yang
  • Sheng Li
  • Chenhui Chu

The Mean Opinion Score (MOS) serves as a subjective measure for assessing the quality of synthesized speech. Nevertheless, the conventional approach to MOS evaluation can be resource-intensive in terms of both time and cost. This article unveils an automatic MOS scoring toolkit that builds upon our success in securing the top position on some metrics of the VoiceMOS 2022 Challenge and emerging as champions in some tracks of the VoiceMOS 2023 Challenge. We offer a pre-trained MOS scoring tool for English and provide training code for other languages. Our documentation, examples, and source code are available at https://github.com/superphysics/KyotoMOS.

An Overview of the ICASSP Special Session on AI Security and Privacy in Speech and Audio Processing

  • Zhao Ren
  • Kun Qian
  • Tanja Schultz
  • Björn W. Schuller

Perceiving and producing speech and audio signals are the basic ways for humans to communicate with each other and learn about the world. Benefiting from the advancement of Big Data, signal processing, and Artificial Intelligence (AI), intelligent machines have been rapidly developed to process speech and audio signals to assist human life. Deep learning has been demonstrated to achieve excellent performance given large amounts of data. Meanwhile, problems of security vulnerability and privacy leakage have appeared alongside the booming technologies. Systems with security and privacy problems can expose users' personal information to danger and cause users' distrust. To facilitate technology development in tackling these issues, the special session on "AI security and privacy in speech and audio processing" was organised at ICASSP 2023. In this study, we provide a comprehensive overview of the invited high-quality contributions to the special session. We further discuss the current research challenges and point out potential avenues for future work. This work is expected to summarise the research advancements and inspire more innovative studies in this area.

RL-NMT: Reinforcement Learning Fine-tuning for Improved Neural Machine Translation of Burmese Dialects

  • Ye Kyaw Thu
  • Thazin Myint Oo
  • Thepchai Supnithi

In this study, we investigate the use of Reinforcement Learning (RL) for fine-tuning neural machine translation models for Burmese dialects. We perform experiments using two extremely low-resource Burmese dialect datasets, Burmese-Beik and Burmese-Rakhine, employing two different deep learning modeling techniques: a bi-LSTM (Seq2Seq) model and the Transformer. Our training procedure involved initially training models for varying numbers of epochs (30, 40, 50, 60, and 70) and then fine-tuning them with RL for additional epochs, such that the total number of epochs for each model equaled 100. For instance, a model initially trained for 30 epochs was fine-tuned for an additional 70 epochs using RL. The results show that better-quality machine translation was attained with RL across all initial models. Moreover, RL yields significant improvements in Bilingual Evaluation Understudy (BLEU) scores (+4.73 BLEU for Burmese-to-Beik translation with Seq2Seq RL, +2.58 for Rakhine-to-Burmese translation with Transformer RL) over some of the baselines not utilizing RL training.
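
The RL fine-tuning stage can be pictured as REINFORCE-style policy-gradient training with a sentence-BLEU reward; the sketch below is a generic illustration of that recipe, in which model.sample() is a hypothetical interface rather than the authors' actual code.

```python
import sacrebleu

def rl_step(model, optimizer, src_batch, ref_batch):
    """One REINFORCE-style update: reward sampled translations by how much
    their sentence BLEU exceeds a greedy-decoding baseline (variance reduction).
    Assumes a hypothetical model.sample(src) -> (hypothesis_str, log_prob_tensor)."""
    model.train()
    optimizer.zero_grad()
    loss = 0.0
    for src, ref in zip(src_batch, ref_batch):
        hyp, log_prob = model.sample(src)          # stochastic sample + its log-prob
        base, _ = model.sample(src, greedy=True)   # greedy baseline translation
        reward = (sacrebleu.sentence_bleu(hyp, [ref]).score
                  - sacrebleu.sentence_bleu(base, [ref]).score)
        loss = loss - reward * log_prob            # policy gradient: maximize reward
    (loss / len(src_batch)).backward()
    optimizer.step()
```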

Research on the classification method of knowledge question intention for Tibetan language curriculum

  • Qie Yangzhuoma
  • Kuntharrgyal Khysru
  • Wan Maji
  • Jianguo Wei

As an important part of intelligent question-answering systems, intention classification has attracted more and more researchers' attention. Tibetan corpora differ from other common corpora: their grammatical structure is relatively complex, and there is no large amount of labeled data, which makes research on intention classification in this area scarce. Given these problems, this paper takes Tibetan curriculum knowledge as an example and introduces the ZangBert-BiLSTM model for the question intention classification task. First, ZangBert encodes the semantic information in the text sequence; during this encoding, function-word information is introduced by inserting special flag bits into the sequence. Then, a BiLSTM models the long-distance dependencies in the resulting sequence of semantic features. Finally, these text features are fed to a downstream classifier for intention classification. To verify the validity of the question intention classification model for primary-school Tibetan curriculum knowledge, we label a corresponding question intention classification dataset; the model achieves good intention classification performance on this dataset, verifying the validity of the work presented in this paper.
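
The described pipeline (ZangBert encoder, then BiLSTM, then a classifier head) can be sketched in PyTorch as follows; the checkpoint name is a placeholder, since ZangBert itself is not publicly specified here.

```python
import torch.nn as nn
from transformers import AutoModel

class BertBiLSTMClassifier(nn.Module):
    """Minimal sketch of a ZangBert-BiLSTM-style pipeline: a BERT-style
    encoder feeds a BiLSTM, whose output drives a linear intent classifier.
    The checkpoint name is a placeholder, not an actual ZangBert release."""

    def __init__(self, num_intents, bert_name="bert-base-multilingual-cased",
                 lstm_hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_intents)

    def forward(self, input_ids, attention_mask):
        # Function-word flag tokens are assumed to be inserted into
        # input_ids upstream, as the paper describes.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        return self.classifier(lstm_out[:, 0])  # logits from the first position
```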

Guided Image Filtering: A Survey and Evaluation Study

  • Weimin Yuan
  • Yinuo Wang
  • Cai Meng
  • Xiangzhi Bai

In the past decade, guided image filtering (GIF) has seen increasing success. By leveraging the guidance image as a prior and transferring its structural details to the target image, GIF has demonstrated its ability to faithfully preserve image edges while maintaining low computational complexity. Additionally, GIF exhibits a good capability for extracting and characterizing images from various domains. Researchers have proposed a large number of GIF-like variants; nevertheless, limited effort has been devoted to a systematic review and evaluation of these methods. To fill this gap, this paper provides a comprehensive survey of existing GIF-like methods, including model-based and deep learning-based approaches. Moreover, extensive experiments are conducted to compare the performance of 18 representative methods. Analysis of the qualitative and quantitative results reveals several observations concerning the current state of this area.
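
As background for the surveyed families, a minimal NumPy/OpenCV sketch of the classic model-based guided filter (He et al.'s local linear model q = a·I + b, solved per window) is shown below; the radius and eps values are illustrative.

```python
import cv2
import numpy as np

def guided_filter(guide, src, radius=8, eps=1e-2):
    """Minimal sketch of the classic guided image filter.
    guide, src: float32 single-channel images in [0, 1]."""
    ksize = (2 * radius + 1, 2 * radius + 1)
    mean = lambda x: cv2.blur(x, ksize)  # box-filter (window) mean

    mean_I, mean_p = mean(guide), mean(src)
    corr_Ip, corr_II = mean(guide * src), mean(guide * guide)
    cov_Ip = corr_Ip - mean_I * mean_p   # covariance of guide and source
    var_I = corr_II - mean_I * mean_I    # variance of the guide

    a = cov_Ip / (var_I + eps)           # per-window linear coefficients
    b = mean_p - a * mean_I
    return mean(a) * guide + mean(b)     # output q = mean(a) * I + mean(b)
```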

Preprocessing Variations for Classification in Smart Manufacturing

  • Brian Jing Hong Nge
  • Caleb En Hong Tan
  • Yi Zhen Nicholas Wong
  • Christine Chia Yi Chiong
  • Chun Yong Chong
  • Mei Kuan Lim
  • Weng Kin Lai

Despite its contribution to Malaysia’s Gross National Income, research into automating Edible Bird’s Nest (EBN) classification is still preliminary, even though the task is complex, taking into account multiple specific characteristics including colour, shape, size, and level of impurities. Furthermore, most smart-manufacturing automation research is conducted under strictly controlled environments, not accounting for the possibility of low-quality visual data produced in real-life deployment settings. Thus, this paper addresses the need to automate the EBN classification process for the purpose of advancing smart manufacturing within its industry. To replicate the challenges posed by the low-quality visual data commonly encountered in industrial environments, we employ a range of preprocessing techniques focused on brightness and blurriness. Subsequently, our chosen object detection algorithm, YOLOv8, was trained and evaluated on a manually collected dataset whose samples were provided by an EBN manufacturer. Variations of this dataset were created by modifying brightness and blurriness through Roboflow’s preprocessing settings. By combining the original and preprocessed images to form the training dataset, we expose the model first to samples under desired conditions and then to conditions that lower their quality. The test results were tabulated using the metrics recall, precision, F1-score, mAP50, and mAP50-95. The results show that the overall performance of the deep learning model degrades when trained with preprocessed datasets compared with the dataset without preprocessing. Performance also increases up to a certain range of brightness and blurriness before decreasing. This trend justifies the need for computer vision research to investigate the optimum preprocessing configurations that allow deep learning models to perform at their best.
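
The brightness and blurriness degradations can be reproduced outside Roboflow with simple OpenCV operations; a minimal sketch, with parameter ranges as assumptions:

```python
import cv2
import numpy as np

def adjust_brightness(img, delta):
    """delta in [-1, 1]: fraction of full scale to add or subtract.
    Widened dtype avoids uint8 overflow before clipping."""
    return np.clip(img.astype(np.int16) + int(delta * 255),
                   0, 255).astype(np.uint8)

def blur(img, ksize):
    """ksize: odd Gaussian kernel size; larger means blurrier."""
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

# e.g. build a degraded training variant of each image:
# degraded = blur(adjust_brightness(img, +0.25), 5)
```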

3D Sports Field Registration via Parametric Learning

  • TsungHsun Tsai
  • Calvin Ku
  • Li Xin Ng
  • Min-Chun Hu
  • Chih-Yuan Yao
  • Hung-Kuo Chu

This paper addresses the challenge of registering a 3D sports field from a baseball pitcher scene image. Some recent works have proposed calibrating a 2D homography matrix and using it to project keypoints onto the court field. However, this 2D registration approach has inherent limitations. By using 3D registration instead, such systems can provide more precise data analysis and better visual effects. Furthermore, we introduce parametric model regression to predict the 3D spatial information of the sports field and, building on it, employ domain generalization to improve generalizability. Experiments show that our approach significantly outperforms other 2D registration methods.
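
For context, the 2D baseline amounts to estimating a homography from known field keypoints and projecting points through it; a minimal OpenCV sketch with illustrative coordinates:

```python
import cv2
import numpy as np

# Four field keypoints in field coordinates (metres) and their detected
# image positions (pixels); all values here are illustrative.
field_pts = np.array([[0, 0], [0, 20], [30, 20], [30, 0]], np.float32)
image_pts = np.array([[120, 410], [230, 95], [1050, 110], [1180, 430]],
                     np.float32)

# Estimate the 2D homography H mapping field coordinates to the image.
H, _ = cv2.findHomography(field_pts, image_pts, cv2.RANSAC)

# Project another field point (e.g. the field centre) into the image.
centre = cv2.perspectiveTransform(np.array([[[15.0, 10.0]]], np.float32), H)
```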

MAAIG: Motion Analysis And Instruction Generation

  • Wei-Hsin Yeh
  • Pei Hsin Lin
  • Yu-An Su
  • Wen Hsiang Cheng
  • Lun-Wei Ku

Many people engage in self-directed sports training at home but lack the real-time guidance of professional coaches, making them susceptible to injuries or the development of incorrect habits. In this paper, we propose a novel application framework called MAAIG (Motion Analysis And Instruction Generation). It can generate embedding vectors for each frame based on user-provided sports action videos. These embedding vectors are associated with the 3D skeleton of each frame and are further input into a pretrained T5 model. Ultimately, our model utilizes this information to generate specific sports instructions. It has the capability to identify potential issues and provide real-time guidance in a manner akin to that of professional coaches, helping users improve their sports skills and avoid injuries.
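
The embeddings-into-T5 step can be sketched with the Hugging Face Transformers inputs_embeds interface; the motion-embedding dimensionality and the linear projection below are illustrative assumptions, not the paper's exact design.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Per-frame motion embeddings (random placeholders here) are projected to
# T5's hidden size and fed to the encoder in place of token embeddings.
frames = torch.randn(1, 120, 256)               # (batch, frames, motion dim)
project = torch.nn.Linear(256, model.config.d_model)
inputs_embeds = project(frames)

# Decode instruction text conditioned on the projected motion sequence.
ids = model.generate(inputs_embeds=inputs_embeds, max_length=40)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```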

Basketball Flow: Learning to Synthesize Realistic and Diverse Basketball GamePlays based on Strategy Sketches

  • Ming-Feng Kuo
  • Yu-Shuen Wang

In this study, we present BasketballFlow, a system designed to generate diverse basketball gameplays based on a pre-determined strategy sketch. A strategy sketch is a graphical representation that coaches use to outline their planned tactics, encompassing the projected routes of the ball and the offensive players. Despite the visual depiction of the offensive strategy, less experienced players might find it challenging to fully understand these tactics and often falter in their implementation due to interference from defensive players. Our system aims to remedy this by simulating different game scenarios that illustrate potential defensive maneuvers, thereby helping these less experienced players improve their success rate of tactical execution. BasketballFlow is composed of a variational autoencoder generative adversarial network (VAEGAN) and a normalizing flow. The VAEGAN is tasked with producing highly accurate game scenarios, while the normalizing flow ensures a wide diversity in the simulated outcomes. Compared to other existing methods, BasketballFlow demonstrates superior proficiency in simulating a broad spectrum of gameplays while maintaining a lower Fréchet distance to real gameplays. The effectiveness of our BasketballFlow system is validated through our experimental results.
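
The normalizing-flow component can be pictured as a stack of invertible layers; below is a minimal RealNVP-style affine coupling layer as a generic illustration, not BasketballFlow's exact architecture.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal sketch of one normalizing-flow building block: an invertible
    affine coupling layer. A flow stacks several of these, tracking the
    log-determinant so the exact likelihood can be optimized. dim must be even."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        # Conditioner network: maps the first half to scale and shift
        # parameters for the second half.
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t            # invertible affine transform
        log_det = s.sum(dim=-1)               # log|det J| for the flow loss
        return torch.cat([x1, y2], dim=-1), log_det
```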