MMAsia '22: Proceedings of the 4th ACM International Conference on Multimedia in Asia

SESSION: Full Papers

TFM a Dataset for Detection and Recognition of Masked Faces in the Wild

  • Gibran Benitez-Garcia
  • Hiroki Takahashi
  • Miguel Jimenez-Martinez
  • Jesus Olivares-Mercado

Droplet transmission is one of the leading causes of the spread of respiratory infections, such as coronavirus disease (COVID-19). The proper use of face masks is an effective way to prevent the transmission of such diseases. Nonetheless, different types of masks provide different degrees of protection. Hence, automatic recognition of face mask types may benefit access control at facilities where a specific degree of protection is required. In the last two years, several deep learning models have been proposed for face mask detection and for recognizing whether a mask is worn properly. However, the current publicly available datasets do not consider the different mask types and occasionally lack real-world elements needed to train robust models. In this paper, we introduce a new dataset named TFM with sufficient size and variety to train and evaluate deep learning models for face mask detection and recognition. This dataset contains more than 135,000 annotated faces from about 100,000 photographs taken in the wild. We consider four mask types (cloth, respirators, surgical and valved) as well as unmasked faces, of which up to six can appear in a single image. The photographs were mined from Twitter over the two years following the beginning of the COVID-19 pandemic. Thus, they include diverse scenes with real-world variations in background and illumination. With our dataset, the performance of four state-of-the-art object detection models is evaluated. The experimental results show that YOLOv5 can achieve about 90% mAP@0.5, demonstrating that the TFM dataset can be used to train robust models and may help the community step forward in detecting and recognizing masked faces in the wild. Our dataset and pre-trained models used in the evaluation will be available upon the publication of this paper.
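
For readers unfamiliar with the mAP@0.5 figure quoted above, the following minimal Python sketch shows the intersection-over-union (IoU) test that such a metric relies on: a predicted box is matched to a ground-truth box only if the class agrees and the IoU is at least 0.5. The function names, the (x1, y1, x2, y2) box layout, and the label strings are illustrative assumptions, not part of the TFM release.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_label, gt_box, gt_label, threshold=0.5):
    """A detection counts as correct when the class matches and IoU >= threshold."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= threshold
```

Average precision is then computed per class from the ranked list of such matches, and mAP is the mean over classes; that aggregation is omitted here.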

Deep Image and Kernel Prior Learning for Blind Super-Resolution

  • Kazuhiro Yamawaki
  • Xian-Hua Han

Recently, single image super-resolution (SR) has witnessed significant progress due to the powerful modeling capability of deep learning networks. However, conventional deep learning-based super-resolution methods predict high-resolution (HR) images under the assumption of an ideal degradation model, such as simulated bicubic down-sampling, and their SR performance unavoidably deteriorates under uncontrolled imaging conditions, such as real-world LR images. This study proposes a universal blind SR framework for adaptively and simultaneously predicting the underlying HR image and the corresponding blurring kernel from the observed LR image only. Specifically, we employ an encoder-decoder-based generative network to learn the inherent statistical prior of the HR image from a noise input, while adopting a shallow convolution subnet with several stacked layers to estimate the blurring kernel from the observed LR image. Then, a convolution-based degradation module that uses the estimated blurring kernel as its weights is incorporated to obtain an approximated version of the LR image for formulating the loss function. In addition, a pre-trained discriminator is adopted to integrate a perceptual loss for recovering more accurate and natural HR images. We demonstrate the effectiveness of the proposed deep image and kernel prior learning framework using extensive experiments on both synthetic and real images, showing superiority over state-of-the-art blind SR methods.
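
The degradation step described above can be illustrated with a small pure-Python sketch: the estimated HR image is filtered with the estimated kernel and subsampled, and the result is compared with the observed LR image. The strided "valid" correlation, the mean-squared-error loss, and all function names are assumptions made for illustration; the paper's actual padding, loss terms, and network outputs are not specified in the abstract.

```python
def degrade(hr, kernel, scale):
    """Blur `hr` with `kernel` and subsample by `scale` (strided valid correlation)."""
    k = len(kernel)
    out_h = (len(hr) - k) // scale + 1
    out_w = (len(hr[0]) - k) // scale + 1
    lr = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Weighted sum of the HR patch whose top-left corner is (i*scale, j*scale).
            for u in range(k):
                for v in range(k):
                    acc += kernel[u][v] * hr[i * scale + u][j * scale + v]
            row.append(acc)
        lr.append(row)
    return lr


def reconstruction_loss(hr_estimate, kernel_estimate, lr_observed, scale):
    """Mean squared error between the re-degraded HR estimate and the observed LR image."""
    lr_approx = degrade(hr_estimate, kernel_estimate, scale)
    total, count = 0.0, 0
    for row_a, row_b in zip(lr_approx, lr_observed):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count
```

In a learned setting, both `hr_estimate` and `kernel_estimate` would be network outputs and this loss would be minimized by gradient descent; the sketch only shows the forward computation.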

Asymmetric Label Propagation for Video Object Segmentation

  • Zhen Chen
  • Ming Yang
  • Shiliang Zhang

Semi-supervised video object segmentation aims to segment foreground objects across a video sequence based on their masks given at the first frame. The motion in adjacent frames tends to be smooth, yet object appearances could change substantially in subsequent frames due to clutter or occlusion. Most existing works segment a video frame by referring equally to the segmentation masks of its previous frame and the first frame, and are prone to unreliable matching and accumulated segmentation errors. To alleviate this issue, this paper proposes to treat the first and previous frames differently to leverage the motion and appearance cues reliably, and presents an Asymmetric Label Propagation (ALP) method. ALP consists of a Confidence-guided Local Propagation (CLP) module and a Global Label Matching (GLM) module. CLP propagates labels from the previous frame to the current frame based on local affinity and appearance matching uncertainty. To further recover potentially missing objects and alleviate error accumulation, GLM matches the current frame to both the foreground and background of the first frame, and adaptively fuses their matching results. The CLP and GLM outputs are fused to generate object-specific feature maps to perform multi-object segmentation. Extensive experiments on the DAVIS and YouTube-VOS datasets demonstrate the effectiveness of the proposed method.

Informative Sample-Aware Proxy for Deep Metric Learning

  • Aoyu Li
  • Ikuro Sato
  • Kohta Ishikawa
  • Rei Kawakami
  • Rio Yokota

Among various supervised deep metric learning methods, proxy-based approaches have achieved high retrieval accuracies. Proxies, which are class-representative points in an embedding space, receive updates based on proxy-sample similarities in a similar manner to sample representations. In existing methods, a relatively small number of samples can produce large gradient magnitudes (i.e., hard samples), and a relatively large number of samples can produce small gradient magnitudes (i.e., easy samples); these can play a major part in updates. Assuming that acquiring too much sensitivity to such extreme sets of samples would deteriorate the generalizability of a method, we propose a novel proxy-based method called Informative Sample-Aware Proxy (Proxy-ISA), which directly modifies a gradient weighting factor for each sample using a scheduled threshold function, so that the model is more sensitive to the informative samples. Extensive experiments on the CUB-200-2011, Cars-196, Stanford Online Products and In-shop Clothes Retrieval datasets demonstrate the superiority of Proxy-ISA compared with the state-of-the-art methods.

Federated Knowledge Transfer for Heterogeneous Visual Models

  • Wenzhe Li
  • Zirui Zhu
  • Tianchi Huang
  • Lifeng Sun
  • Chun Yuan

Federated learning (FL) is a privacy-preserving distributed learning paradigm that enables collaborative training of machine learning models among multiple participants. However, despite recent progress, existing federated learning systems still cannot handle heterogeneous models. For instance, candidate clients with heterogeneous models cannot join an established federated system. Moreover, within the federated system, local models cannot be updated into heterogeneous models, even if the updated models would work better.

Considering the reality of heterogeneous models, we study two practical scenarios: the Local Model Update Scenario and the Hetero-Model Enrollment Scenario. We then propose a novel method to tackle these problems, which we refer to as Federated learning with deep-layer Feature Alignment (FedDFA). FedDFA uses deep-layer knowledge distillation to align the feature representations and solve the knowledge transfer problem of heterogeneous models. We constructed a federated learning system in which convolutional neural networks (CNNs) serve as local models and vision transformers (ViTs) as heterogeneous models. We trained these models on three datasets (CIFAR-10, CelebA, and ImageNet-1k) and their non-I.I.D. variants. As a result, our approach facilitates FL with wide applicability to various models and better generalization performance than the state-of-the-art methods.

Affective Embedding Framework with Semantic Representations from Tweets for Zero-Shot Visual Sentiment Prediction

  • Yingrui Ye
  • Yuya Moroto
  • Keisuke Maeda
  • Takahiro Ogawa
  • Miki Haseyama

This paper presents a zero-shot visual sentiment prediction method using semantic representation features of texts from tweets as non-visual auxiliary data. Previous studies show that visual sentiment prediction methods can only predict sentiment labels that belong to the sentiment theory used in the training dataset, which means that they cannot predict new sentiment labels used in different sentiment theories. To solve the problem of predicting new labels, zero-shot learning has been proposed. The previous zero-shot visual sentiment prediction method uses Word2vec features and adjective-noun pair features to obtain the semantic relationship between images and sentiment words to predict unseen sentiments. However, many adjective-noun pairs are not related to sentiments, which makes it difficult to compensate for the affective gap between low-level visual features and high-level sentiment semantics. Thus, to better compensate for the affective gap, we consider introducing new non-visual auxiliary data. As people tend to share their feelings with both images and texts on social networking services, the texts from tweets are effective as side information for the images in visual sentiment prediction. We therefore introduce semantic representations from tweets as new non-visual auxiliary data to construct an affective embedding space, which yields a more effective zero-shot visual sentiment prediction model. Moreover, we propose a cross-dataset zero-shot task for visual sentiment prediction, which is more consistent with the real situation in which the testing and training images may be in different domains. The contributions of this paper are the combination of several semantic representation features for zero-shot visual sentiment prediction and the proposal of the cross-dataset zero-shot task for visual sentiment prediction. The experiments on several open datasets show the effectiveness of the proposed method.

SPEAKER VGG CCT: Cross-Corpus Speech Emotion Recognition with Speaker Embedding and Vision Transformers

  • Alessandro Arezzo
  • Stefano Berretti

In recent years, Speech Emotion Recognition (SER) has been investigated mainly by transforming the speech signal into spectrograms that are then classified using Convolutional Neural Networks pre-trained on generic images and fine-tuned with spectrograms. In this paper, we start from the general idea above and develop a new learning solution for SER, which is based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding. With CCTs, the learning power of Vision Transformers (ViT) is combined with a diminished need for large volumes of data, as made possible by the convolution. This is important in SER, where large corpora of data are usually not available. The speaker embedding allows the network to extract an identity representation of the speaker, which is then integrated by means of a self-attention mechanism with the features that the CCT extracts from the spectrogram. Overall, the solution is capable of operating in real time, showing promising results in a cross-corpus scenario, where training and test datasets are kept separate. Experiments have been performed on several benchmarks in a cross-corpus setting, which is rarely used in the literature, with results that are comparable or superior to those obtained with state-of-the-art network architectures. Our code is available at

Self-Attentive CLIP Hashing for Unsupervised Cross-Modal Retrieval

  • Heng Yu
  • Shuyan Ding
  • Lunbo Li
  • Jiexin Wu

With the explosive growth of multi-modal data such as video, images, and text on the Internet, cross-modal retrieval has received extensive attention, especially deep hashing methods. Compared with real-valued methods, deep hashing has shown promising prospects due to its low memory consumption and high search efficiency. However, most existing studies have difficulty in effectively utilizing the raw image-text pairs to generate discriminative feature representations. Moreover, these methods ignore the latent relationship between different modalities and fail to construct a robust similarity matrix, resulting in suboptimal retrieval performance. In this paper, we focus on unsupervised cross-modal hashing tasks and propose a Self-Attentive CLIP Hashing (SACH) model. Specifically, we construct the feature extraction network by employing the pre-trained CLIP model, which has shown excellent performance in zero-shot tasks. In addition, to fully exploit the semantic relationships, an attention module is introduced to reduce the disturbance of redundant information and focus on important information. On this basis, we construct a semantic fusion similarity matrix that is capable of preserving the original semantic relationships from different modalities. Extensive experiments show the superiority of SACH compared with recent state-of-the-art unsupervised hashing methods.

An End-to-End Scene Text Detector with Dynamic Attention

  • Jingyu Lin
  • Yan Yan
  • Hanzi Wang

Detecting arbitrarily oriented text in natural images is a challenging task in multimedia due to variations in text curvatures, orientations, and aspect ratios in natural scenes. Most previous scene text detectors fail to precisely locate text instances that have a peculiar shape (e.g., an extreme aspect ratio). In this paper, we propose a dynamic end-to-end framework (DEF) which includes a convolution-based dynamic encoder (CDE) with various attention types to generate a deformable and dynamic view for multi-oriented and curved text instances. Different from previous methods that apply time-consuming post-processing steps like NMS, our method uses a Transformer-based decoder (TD) with a bipartite matching loss to model the relationship between corresponding queries and ground truths. As a result, by leveraging such a well-designed architecture, the receptive field is not limited to a fixed shape, and a combination of global attention and local features provides a better representation for text in natural scenes. We conduct extensive qualitative and quantitative experiments on several popular datasets. Experimental results show that the proposed method achieves superior performance compared with several state-of-the-art scene text detectors.

Human-Avatar Interaction in Metaverse: Framework for Full-Body Interaction

  • Kit Yung Lam
  • Liang Yang
  • Ahmad Alhilal
  • Lik-Hang Lee
  • Gareth Tyson
  • Pan Hui

The metaverse is a network of shared virtual environments where people can interact synchronously through their avatars. To enable this, it is necessary to accurately capture and recreate (physical) human motion. This is used to render avatars correctly, reflecting the motion of their corresponding users. In large-scale environments this must be done in real time. This paper proposes a human-avatar framework with full-body motion capture. Its goal is to deliver high-accuracy capture with low computational and network overheads. It relies on a lightweight Octree data structure to record and transmit motion to other users. We conduct a user study with 22 participants and perform a preliminary evaluation of its scalability. Our user study shows that Octree with Inverse Kinematics achieves the best trade-off, achieving low delay and high accuracy. Our proposed solution delivers the lowest delay, with an average of 67 ms in an environment of 8 concurrent users. It attains a 55.7% improvement over the prior techniques.

Parallel Queries for Human-Object Interaction Detection

  • Junwen Chen
  • Keiji Yanai

Human-Object Interaction (HOI) detection requires localizing pairs of humans and objects. Recent transformer-based methods leverage query embeddings to represent entire HOI instances. The target embeddings after decoding are used to represent the object and human characteristics at the same time. However, it is ambiguous to use such highly integrated embeddings to localize the human and object simultaneously. To address this problem, we split the detection decoding process into subject decoding and object decoding to detect the humans and objects in parallel. Our proposed method, Parallel Query Network (PQNet), uses two transformer decoders to decode the subject embeddings and object embeddings in parallel, and a novel verb decoder is used to fuse the representations from the detection decoding and predict the interaction. The attention mechanisms in the verb decoder consist of the attention between human and object embeddings and the attention between the fused embeddings and global semantic features. As the transformer architecture maintains the permutation of the input query embeddings, the paired boxes of humans and objects are directly predicted by feed-forward networks. With the full usage of the object detection part, our proposed architecture outperforms the state-of-the-art baseline method with half of the training epochs.

Sequential Frame-Interpolation and DCT-based Video Compression Framework

  • Yeganeh Jalalpour
  • Wu-chi Feng
  • Feng Liu

Video data is ubiquitous; capturing, transferring, and storing even compressed video data is challenging because it requires substantial resources. With the large amount of video traffic being transmitted on the internet, any improvement in compressing such data, even small, can drastically impact resource consumption. In this paper, we present a hybrid video compression framework that unites the advantages of both DCT-based and interpolation-based video compression methods in a single framework. We show that our work can deliver the same visual quality or, in some cases, improve visual quality while reducing the bandwidth by 10-20%.

360BroadView: Viewer Management for Viewport Prediction in 360-Degree Video Live Broadcast

  • Qian Zhou
  • Zhe Yang
  • Hongpeng Guo
  • Beitong Tian
  • Klara Nahrstedt

360-degree video is becoming an integral part of our content consumption through both video on demand and live broadcast services. However, live broadcast is still challenging due to the huge network bandwidth cost if all 360-degree views are delivered to a large viewer population over diverse networks. In this paper, we present 360BroadView, a viewer management approach to viewport prediction in 360-degree video live broadcast. We designate some viewers on high-bandwidth networks as leading viewers to help the others (lagging viewers) predict viewports during 360-degree video viewing and save bandwidth. Our viewer management maintains the leading viewer population despite viewer churn during live broadcast, so that the system keeps functioning properly. Our evaluation shows that 360BroadView maintains the leading viewer population at a minimal yet necessary level for 97 percent of the time.

Two-Layer Learning-Based P-Frame Coding with Super-Resolution and Content-Adaptive Conditional ANF

  • David Alexandre
  • Hsueh-Ming Hang
  • Wen-Hsiao Peng

Deep-learning-based video compression techniques have been developing rapidly in recent years. This paper adopts the Conditional Augmented Normalizing Flow video codec (CANF-VC) [8] as our basic system. To improve the quality of the condition signal (image) for CANF, we propose a two-layer structure learning-based video codec. At a low cost in extra bit rate, the low-resolution base layer provides side information to improve the quality of the motion-compensated reference frame through a super-resolution module with a merge-net. In addition, the base layer also provides information to the skip-mask generator. The skip-mask guides the coding mechanism to reduce the transmitted samples for the high-resolution enhancement layer. The experimental results indicate that the proposed two-layer coding scheme can provide 22.19% PSNR BD-Rate saving and 49.59% MS-SSIM BD-Rate saving over H.265 (HM 16.20) on the UVG test sequences.

Learned Bi-Directional Motion Prediction for Video Compression

  • Yunhui Shi
  • Shaopei An
  • Jin Wang
  • Baocai Yin

Motion estimation is a key component for removing temporal redundancy in video compression. It is well known that bi-directional motion estimation outperforms sequential motion estimation because of its capability to use both forward and backward reference frames. Previous approaches perform a coarse motion prediction operation to further remove motion spatial redundancy, which relies heavily on the regularity of the motion. However, most motions in natural video sequences are extremely complicated and objects usually move irregularly. To solve this problem, we propose a fine motion prediction network that learns an importance map between the bi-directional references. Our designed network can generate a more accurate prediction of the motion, yielding a smaller residual. It is also applicable to videos with complex and irregular motions. Both objective and subjective quality results validate the effectiveness of our approach.

Deep Enhancement-Object Features Fusion for Low-Light Object Detection

  • Wan Teng Lim
  • Kelvin Ang
  • Yuen Peng Loh

With the robust development of deep learning, object detection has gained much attention for practical use cases such as autonomous driving and surveillance. However, the task is still challenging for state-of-the-art methods in low light. Consequently, image enhancement has become a common pre-processing step in the pipeline for object detection in low-light environments. Nonetheless, such a two-step approach hinges on the reconstruction of the enhanced image, which could introduce unseen artifacts and distortion that deteriorate the detection performance instead. Thus, this work proposes a deep enhancement-object features fusion approach to alleviate the problem by fusing deep features extracted from low-light image enhancement with the deep object features of a detection model. It is postulated that features learned by enhancement models emphasize visual details that are otherwise disregarded by detection models, which focus on the abstract appearance of objects. Hence, the fusion of such complementary features would compensate for the details lost due to low visibility as well as circumvent the reconstruction error for better detection. Specifically, this work performs a study on fusing deep enhancement features from the state-of-the-art Deep Lightening Network (DLN) with the YOLOv5 object detection model at various stages. Experiments on the ExDARK dataset showed that such fusion can improve the precision of object detection in various low-light image conditions and outperforms the conventional two-step pre-process-then-detect approach.

Image Compression for Machines Using Boundary-Enhanced Saliency

  • Yuanyuan Xu
  • Haolun Lan

With the rapid development of deep learning, more and more images and videos are used for machine analysis. The amount of image and video content consumed by machines has exceeded that consumed by humans. However, traditional image and video coding schemes are designed for the human visual system, so information that is vital to machine vision, e.g., the boundary of a salient object, may not be preserved during compression. In this paper, based on high efficiency video coding (HEVC) intra coding, we propose an image compression scheme for machines using boundary-enhanced saliency. Using image classification as an example task, Grad-CAM, a deep learning visualization method, is used to interpret classification results to generate a pixel-level saliency map for each image. Object segmentation and edge detection are then performed to generate a boundary map of the salient object. With the boundary-enhanced saliency map, we derive a coding tree unit (CTU)-level QP adjustment scheme, where more bits are allocated to the regions of the image that are salient for machine vision. Experimental results show that, compared with HEVC, our proposed scheme could achieve up to 29.94% and 31.53% bitrate saving with the same Top-1 and Top-5 accuracy in image classification, respectively.

Deep Weighted Guided Upsampling Network for Depth of Field Image Upsampling

  • Lanling Zeng
  • Lianxiong Wu
  • Yang Yang
  • Xiangjun Shen
  • Yongzhao Zhan

Depth-of-field (DoF) rendering is an important technique in computational photography that simulates the human visual attention system. Existing DoF rendering methods usually suffer from a high computational cost. The task of DoF rendering can be accelerated by guided upsampling methods. However, the state-of-the-art guided upsampling methods fail to distinguish the focus and defocus areas, resulting in unsatisfying DoF effects. In this paper, we propose a novel deep weighted guided upsampling network (DWGUN) based on an encoder-decoder framework to jointly upsample the low-resolution DoF image under the guidance of the corresponding high-resolution all-in-focus image. Because its weights are designed heuristically, traditional weighted image upsampling is not tailored to DoF image upsampling. We propose a deep refocus-defocus edge-aware module (DREAM) to learn the spatially-varying weights and embed them in the deep weighted guided upsampling block (DWGUB). We have conducted comprehensive experiments to evaluate the proposed method. Rigorous ablation studies are also conducted to validate the rationality of the proposed components.

Multispectral Image Denoising via Structural Tensor Sparsity Promoting Model

  • Longlu Huang
  • Na Qi
  • Qing Zhu

Multispectral images (MSIs) contain more spectral information than traditional 2D images, which can provide a more accurate representation of objects. MSIs are easily affected by various kinds of noise when captured by sensors. In recent years, many MSI denoising methods, especially the Kronecker-basis-representation (KBR) method, have achieved great success. KBR uses tensor representation and decomposition to achieve good MSI denoising performance. However, each full band patch (FBP) group is decomposed in this method, so too many dictionary atoms are generated. In this paper, we propose a structural tensor sparsity promoting (STSP) model for MSI denoising. In order to decrease the number of dictionary atoms, we cluster FBP groups and learn orthogonal dictionaries for each class rather than each FBP group. To improve the denoising performance, the structural similarity among FBP groups is utilized in the STSP model by enforcing a nonlocal centralized sparse constraint, where the compromise parameter is statistically and adaptively determined. Experimental results on the CAVE dataset demonstrate that our model outperforms the state-of-the-art methods in terms of both objective and subjective quality.

Multi-Scale Channel Transformer Network for Single Image Deraining

  • Yuto Namba
  • Xian-Hua Han

Single image deraining is a very challenging task, as it requires not only restoring the spatial details and high-level contextual structures of the images, but also removing multiple layers of rain with varying degrees of blurring and resolutions. Recently, due to their powerful capability for modeling long-range dependencies, transformer-based models have shown superior performance on high-level vision tasks and have begun to be applied to low-level vision tasks such as various image restoration applications. However, their computational complexity increases quadratically with spatial resolution, making them impractical to apply to high-resolution images. In this study, we propose a novel Channel Transformer, which performs self-attention in the channel direction instead of the spatial direction. Specifically, we first incorporate multiple channel transformer blocks into a multi-scale architecture to extract multi-scale contexts and exploit long-range channel dependencies, and then learn a coarse estimation of the rain-free image. Finally, an original-resolution CNN-based module is employed to refine the coarse estimation by leveraging the previously learned multi-scale contexts. Experiments on several benchmark datasets demonstrate its superiority over the state-of-the-art methods.
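
To make the contrast between spatial and channel attention concrete, here is a minimal pure-Python sketch of self-attention taken along the channel axis. With C channels and N = H*W pixels, the attention map is C x C rather than N x N, so its cost grows linearly with the number of pixels. The learned query/key/value projections are omitted, and the 1/sqrt(N) scaling and the function names are illustrative assumptions rather than the paper's design.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_self_attention(x):
    """x: list of C channels, each a flat list of N = H*W values.

    Returns (attn, out), where attn is C x C and out is C x N.
    """
    c, n = len(x), len(x[0])
    scale = 1.0 / math.sqrt(n)
    # Similarity between every pair of channels: a C x C matrix.
    scores = [[scale * sum(a * b for a, b in zip(x[i], x[j])) for j in range(c)]
              for i in range(c)]
    attn = [softmax(row) for row in scores]
    # Each output channel is an attention-weighted mix of the input channels.
    out = [[sum(attn[i][j] * x[j][p] for j in range(c)) for p in range(n)]
           for i in range(c)]
    return attn, out
```

For a 3-channel, 1024 x 1024 feature map, this builds a 3 x 3 attention map, whereas spatial self-attention would need a map with roughly 10^12 entries.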

Remote Sensing Image Colorization Based on Joint Stream Deep Convolutional Generative Adversarial Networks

  • Jingyu Wang
  • Jie Nie
  • Hao Chen
  • Huaxin Xie
  • Chengyu Zheng
  • Min Ye
  • Zhiqiang Wei

With the development of deep neural networks, especially generative networks, grayscale image colorization has made great progress. Within this field, remote sensing image colorization is a problem that urgently needs to be solved, because clear color remote sensing images often cannot be obtained due to the limitations of shooting and transmission equipment. Compared with ordinary images, remote sensing images are characterized by an uneven spatial distribution of objects; therefore, it is a great challenge to ensure the spatial consistency of the colorization. To address this challenge, we propose a new joint stream DCGAN including a micro stream and a macro stream, in which the latter is set as a prior to constrain the former for colorization. In addition, the Low-level Correlation Feature Extraction (LCFE) module is proposed to obtain salient shallow detail features with global correlation, which are used to enhance the global constraints as well as supplement the low-level information to the micro stream. Furthermore, we propose the Gated Selection Module (GSM), which selects useful information using a gated scheme to fuse features from the two streams appropriately. Comprehensive comparison and ablation experiments verify that the proposed method surpasses other methods in both qualitative and quantitative metrics.

On the Robustness of 3D Object Detectors

  • Fatima Albreiki
  • Sultan Abu Ghazal
  • Jean Lahoud
  • Rao Anwer
  • Hisham Cholakkal
  • Fahad Khan

In recent years, significant progress has been achieved for 3D object detection on point clouds thanks to the advances in 3D data collection and deep learning techniques. Nevertheless, 3D scenes exhibit a lot of variations and are prone to sensor inaccuracies as well as information loss during pre-processing. Thus, it is crucial to design techniques that are robust against these variations. This requires a detailed analysis and understanding of the effect of such variations. This work aims to analyze and benchmark popular point-based 3D object detectors against several data corruptions. To the best of our knowledge, we are the first to investigate the robustness of point-based 3D object detectors. To this end, we design and evaluate corruptions that involve data addition, reduction, and alteration. We further study the robustness of different modules against local and global variations. Our experimental results reveal several intriguing findings. For instance, we show that methods that integrate Transformers at a patch or object level lead to increased robustness, compared to using Transformers at the point level. The code is available at

Robust Learning with Adversarial Perturbations and Label Noise: A Two-Pronged Defense Approach

  • Peng-Fei Zhang
  • Zi Huang
  • Xin Luo
  • Pengfei Zhao

Despite their great success, deep learning methods are vulnerable to noise in the training dataset, including adversarial perturbations and annotation noise. These harmful factors significantly influence the learning process of deep models, leading to less confident models. However, existing methods have not yet studied this practical and challenging issue.

In this paper, we propose a novel robust learning method, i.e., Two-Pronged Defense (TPD), which is capable of eliminating the negative effects of both data perturbations and label noise during the learning process. On the one hand, to defend against delusive adversarial examples, the proposed method designs an asymmetric adversarial contrastive learning strategy to craft worst-case noisy examples for the original training data, and trains the model to align the semantics between the perturbed data and the original data. In this way, TPD is able to improve the generalization ability of the model on potential adversarial examples. On the other hand, to combat noisy labels, TPD applies semi-supervised learning by identifying and discarding noisy labels via a newly designed identification method. Extensive experiments on benchmarks demonstrate the limitations of existing methods and the effectiveness of the proposed method when facing both data and label noise. This work is the very first attempt at learning with both data and label noise, and we hope it can pave the way for future studies in related fields.

Enhancing the Robustness of Deep Learning Based Fingerprinting to Improve Deepfake Attribution

  • Chieh-Yin Liao
  • Chen-Hsiu Huang
  • Jun-Cheng Chen
  • Ja-Ling Wu

Artificial Fingerprinting (AF or the so-called digital watermarking) is a technique that can be used to conduct Deepfake attribution by ensuring media authenticity. However, AF does not prioritize its robustness to certain kinds of distortions, making the embedded watermarks vulnerable to some standard image processing operations. Insufficient robustness reduces the practicality of digital watermarking techniques. To address this issue, we propose an enhanced distortion agnostic artificial fingerprinting (EDA-AF) framework which introduces a novel noise layer consisting of an attack booster followed by a convolutional network-based attacker. The attacker simulates various distortions by exploiting adversarial learning with AF for distortion agnostic robustness. Meanwhile, due to the modeling limitation of the convolutional network, we also employ the attack booster to apply a set of differentiable image distortions which cannot be well simulated by the attacker. Extensive experimental results show that the proposed approach improves the quality of the extracted fingerprints. EDA-AF can improve the bitwise accuracy by up to 36%, which takes another step forward on the road of Deepfake attribution.

Disentangled Image Attribute Editing in Latent Space via Mask-Based Retention Loss

  • Shunya Ohaga
  • Ren Togo
  • Takahiro Ogawa
  • Miki Haseyama

We propose an image attribute editing method with the mask-based retention loss. Although conventional image attribute editing methods can edit a particular attribute, they cannot retain non-editing attributes including unknown attributes before and after editing, which causes unexpected changes in the edited images. We solve this problem by dividing the pre- and post-edited images into the editing and non-editing regions and increasing the image similarity in the non-editing regions. In this paper, we introduce the novel mask-based retention loss to retain the non-editing regions. To compute the mask-based retention loss, we divide the images into the editing and non-editing regions by using a binary mask generated from the difference between the pre- and post-edited images. Experimental results show that our proposed method is qualitatively and quantitatively superior to state-of-the-art methods.
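The mask construction and the retention term described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: images are grayscale nested lists, the mask threshold (`threshold=0.1`) and the L1 distance are hypothetical choices, and the function names are invented for this example.

```python
def binary_edit_mask(pre, post, threshold=0.1):
    """Return 1 where a pixel changed by more than `threshold`, else 0.

    The threshold value is an assumption made for this sketch.
    """
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_pre, row_post)]
        for row_pre, row_post in zip(pre, post)
    ]


def retention_loss(pre, post, mask):
    """Mean absolute difference over non-editing pixels (mask == 0)."""
    total, count = 0.0, 0
    for row_pre, row_post, row_mask in zip(pre, post, mask):
        for a, b, m in zip(row_pre, row_post, row_mask):
            if m == 0:  # only pixels outside the editing region contribute
                total += abs(a - b)
                count += 1
    return total / count if count else 0.0


# Toy 2x2 "images": only the top-right pixel changes substantially.
pre_image = [[0.0, 0.5], [1.0, 0.2]]
post_image = [[0.05, 0.9], [1.0, 0.25]]
edit_mask = binary_edit_mask(pre_image, post_image)
loss_value = retention_loss(pre_image, post_image, edit_mask)
```

Minimizing a term of this kind penalizes changes outside the editing region while leaving the edited region unconstrained.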

ObjectMix: Data Augmentation by Copy-Pasting Objects in Videos for Action Recognition

  • Jun Kimata
  • Tomoya Nitta
  • Toru Tamaki

In this paper, we propose a data augmentation method for action recognition using instance segmentation. Although many data augmentation methods have been proposed for image recognition, few of them are tailored for action recognition. Our proposed method, ObjectMix, extracts each object region from two videos using instance segmentation and combines them to create new videos. Experiments on two action recognition datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed method and show its superiority over VideoMix, a prior work.
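The copy-paste step can be illustrated with a toy example. This is a hedged sketch only: it assumes the per-frame object masks have already been produced by an instance segmentation model (not run here), represents each video as a list of 2D frames, and pastes in one direction; `paste_objects` is a hypothetical name, and how the action labels of the two videos are combined is not covered.

```python
def paste_objects(video_a, video_b, masks_b):
    """Paste the masked pixels of video_b onto video_a, frame by frame.

    video_a, video_b: lists of frames, each frame a 2D list of pixel values.
    masks_b: per-frame 2D lists, truthy where video_b contains an object.
    """
    mixed = []
    for frame_a, frame_b, mask in zip(video_a, video_b, masks_b):
        mixed.append([
            [pb if m else pa for pa, pb, m in zip(row_a, row_b, row_m)]
            for row_a, row_b, row_m in zip(frame_a, frame_b, mask)
        ])
    return mixed


# One-frame toy videos: the object in video_b occupies the anti-diagonal.
video_a = [[[1, 1], [1, 1]]]
video_b = [[[9, 8], [7, 6]]]
masks_b = [[[0, 1], [1, 0]]]
mixed_video = paste_objects(video_a, video_b, masks_b)
```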

CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection

  • Dhanalaxmi Gaddam
  • Jean Lahoud
  • Fahad Shahbaz Khan
  • Rao Muhammad Anwer
  • Hisham Cholakkal

Existing deep learning-based 3D object detectors typically rely on the appearance of individual objects and do not explicitly pay attention to the rich contextual information of the scene. In this work, we propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework, which takes a 3D scene as an input and strives to explicitly integrate useful contextual information of the scene at multiple levels to predict a set of object bounding-boxes along with their corresponding semantic labels. To this end, we propose to utilize a context enhancement network that captures the contextual information at different levels of granularity, followed by a multi-stage refinement module to progressively refine the box positions and class predictions. Extensive experiments on the large-scale ScanNetV2 benchmark reveal the benefits of our proposed method, leading to an absolute improvement of 2.0% over the baseline. In addition to 3D object detection, we investigate the effectiveness of our CMR3D framework for the problem of 3D object counting. Our source code is available at

SESSION: Short Papers

A Multimodal Sensor Fusion Framework Robust to Missing Modalities for Person Recognition

  • Vijay John
  • Yasutomo Kawanishi

Utilizing the sensor characteristics of the audio, visible camera, and thermal camera, the robustness of person recognition can be enhanced. Existing multimodal person recognition frameworks are primarily formulated assuming that multimodal data is always available. In this paper, we propose a novel trimodal sensor fusion framework using the audio, visible, and thermal camera, which addresses the missing modality problem. In the framework, a novel deep latent embedding framework, termed the AVTNet, is proposed to learn multiple latent embeddings. Also, a novel loss function, termed missing modality loss, accounts for possible missing modalities based on the triplet loss calculation while learning the individual latent embeddings. Additionally, a joint latent embedding utilizing the trimodal data is learnt using the multi-head attention transformer, which assigns attention weights to the different modalities. The different latent embeddings are subsequently used to train a deep neural network. The proposed framework is validated on the Speaking Faces dataset. A comparative analysis with baseline algorithms shows that the proposed framework significantly increases the person recognition accuracy while accounting for missing modalities.

SLGAN: Style- and Latent-Guided Generative Adversarial Network for Desirable Makeup Transfer and Removal

  • Daichi Horita
  • Kiyoharu Aizawa

There are five features to consider when using generative adversarial networks to apply makeup to photos of the human face. These features include (1) facial components, (2) interactive color adjustments, (3) makeup variations, (4) robustness to poses and expressions, and (5) the use of multiple reference images. To tackle these key features, we propose a novel style- and latent-guided makeup generative adversarial network for makeup transfer and removal. We provide a novel perceptual makeup loss and a style-invariant decoder that can transfer makeup styles based on histogram matching to avoid the identity-shift problem. In our experiments, we show that our SLGAN is better than or comparable to state-of-the-art methods. Furthermore, we show that our proposal can interpolate facial makeup images to determine the unique features, compare existing methods, and help users find desirable makeup configurations.

Popularity-Aware Graph Social Recommendation for Fully Non-Interaction Users

  • Nozomu Onodera
  • Keisuke Maeda
  • Takahiro Ogawa
  • Miki Haseyama

In this paper, we address a novel social recommendation task for users who have no interactions with items (unobserved users). This task enables many applications, such as recommendations for cold-start users after the first sign-up and targeted advertising, and is therefore highly meaningful. However, existing social recommendation methods are unsuitable for this task, since they either assume that all users have interactions with items or cannot recommend more effectively than MostPopular recommendation. To this end, we propose Unobserved user-oriented Graph Social Recommendation (UGSR), which learns the preferences of unobserved users and provides richer recommendations than MostPopular recommendation. The popularity-aware graph convolutional network, which is carefully designed for this task, simultaneously considers user-item interactions, social relations, and item popularity for effective user and item modeling.

Multimodal Fusion with Cross-Modal Attention for Action Recognition in Still Images

  • Jia-Hua Tsai
  • Wei-Ta Chu

We propose a cross-modal attention module to combine information from different cues and different modalities, to achieve action recognition in still images. Feature maps are extracted from the entire image, the detected human bounding box, and the detected human skeleton, respectively. Inspired by the transformer structure, we design the processing between the query vector from one cue/modality, and the key vector from another cue/modality. Feature maps from different cues/modalities are cross-referred so that better representations can be obtained to yield better performance. We show that the proposed framework outperforms the state-of-the-art systems without the requirement of an extra training dataset. We also conduct ablation studies to investigate how different settings impact the final results.

Zero-Shot Font Style Transfer with a Differentiable Renderer

  • Kota Izumi
  • Keiji Yanai

Recently, a large-scale language-image multi-modal model, CLIP, has been used to realize language-based image translation in a zero-shot manner without training. In this study, we attempted to generate language-based decorative fonts for font images using CLIP. With existing image style transfer methods using CLIP, stylized font images are usually only surrounded by decorations, and the characters themselves do not change significantly. In contrast, in this study, we use CLIP and a vector graphics image representation with a differentiable renderer to achieve a style transfer of text images that matches the input text. The experimental results show that the proposed method transfers the style of font images to match the given texts. In addition to text images, we confirmed that the proposed method was also able to transform the style of simple logo patterns based on the given texts.

Wearable Camera Based Food Logging System

  • Kenshiro Sato
  • Yoko Yamakata
  • Sosuke Amano
  • Kiyoharu Aizawa

Recently, meal management apps have allowed people to record food items and calories from photos automatically. These technologies include extracting food regions from photos of served meals, identifying the name of the food in each region, and calculating nutritional data. However, what you eat is not the only indicator that should be kept in the food record. How fast you eat and the order in which you eat are also significant information for dietary management. Therefore, we aim to construct a system that automatically generates a meal log from first-person videos that users capture of their eating behavior with a wearable camera. To tackle the complex problems posed by the data this system is expected to handle, we constructed an eating behavior record dataset: 9.9 hours of first-person video reflecting the natural diet of a user. To investigate the feasibility of our proposed system, we evaluated whether the first step, the detection of the meal area in the video during the meal, could be achieved with sufficient accuracy using this dataset. Using only the limited number of frames that a user can be expected to annotate as training data (30 frames for user-specific model training and four frames for online adaptation), we obtained a detection accuracy of 72% for food regions. Our next goal is to create a multi-user dataset and offer the application as a service.

Graph Neural Network Based Living Comfort Prediction Using Real Estate Floor Plan Images

  • Ryota Kitabayashi
  • Taro Narahara
  • Toshihiko Yamasaki

In recent years, machine learning has been widely used in the real estate field. However, most of these previous studies have been limited to analysis based on objective perspectives, such as analysis of the structure of the floor plan and rent estimation. On the other hand, we focus on the subjective "living comfort" of real estate properties and aim to predict people's impressions of properties based on information obtained from floor plan images. Specifically, by using deep learning to analyze floor plan images and graph structures reflecting the floor plans, it becomes possible to predict the attractiveness of each property in terms of spaciousness, modernity, privacy, and so on. As a result of the experiments, the effectiveness of using both the floor plan image and the corresponding graph structure for prediction was confirmed.

Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices

  • Lam Pham
  • Khoa Tran
  • Dat Ngo
  • Hieu Tang
  • Son Phan
  • Alexander Schindler

In this paper, we present a robust and low-complexity model for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording. We first construct an ASC model in which a novel inception-residual-based network architecture is proposed to deal with the issue of mismatched recording devices. To further improve the model performance while still keeping a low footprint, we apply two techniques, an ensemble of multiple spectrograms and model compression, to the proposed ASC model. In extensive experiments on the benchmark DCASE 2020 Task 1A Development dataset, our best model achieves an accuracy of 71.3% with a low complexity of 0.5 million (M) trainable parameters, which is highly competitive with state-of-the-art systems and shows potential for real-life applications on edge devices.

A Reality Check of Positioning in Multiuser Mobile Augmented Reality: Measurement and Analysis

  • Na Wang
  • Haoliang Wang
  • Stefano Petrangeli
  • Viswanathan Swaminathan
  • Fei Li
  • Songqing Chen

Multiuser Augmented Reality (MuAR) is essential to implementing the vision of the Metaverse. With pervasive mobile devices, MuAR enables multiple devices to share a common AR experience. In such experiences, the peer positions are critical to understanding peers' intentions and actions so as to achieve smooth interaction in AR. Such a spatial awareness requirement poses new challenges to MuAR. Traditionally, in AR experiences designed for a single user, the SLAM algorithm is adopted to compute self positions. However, the computed positions cannot be directly used to compute the relative positions of peer devices in MuAR, because they are computed with respect to independent coordinate systems associated with the participating devices. To fill this gap, the industry has recently proposed to implement peer tracking with the help of a built-in Ultra Wideband (UWB) chip. In this work, we aim to perform a reality check on the proposed support, with the Nearby Interaction (NI) framework developed for iOS mobile devices as an example. The goal of our study is to gain an in-depth understanding of the reliability of the proposed support and identify potential issues. Through extensive measurements, we discover that the peer tracking solution is sometimes unreliable in terms of availability and accuracy. Furthermore, with regard to erroneous position reports, we present a quantitative analysis, summarizing the error types (e.g., transient errors and permanent errors) and revealing their underlying reasons. We believe the preliminary findings could help to improve spatial awareness and enhance user experiences in MuAR.

Towards High Performance One-Stage Human Pose Estimation

  • Ling Li
  • Lin Zhao
  • Linhao Xu
  • Jie Xu

Achieving both good performance and high efficiency in top-down human pose estimation is appealing. Mask RCNN can largely improve the efficiency by conducting person detection and pose estimation in a single framework, as the features provided by the backbone can be shared by the two tasks. However, its performance is not as good as that of traditional two-stage methods. In this paper, we aim to substantially advance the human pose estimation results of Mask RCNN while preserving its efficiency. Specifically, we make improvements to the whole process of pose estimation, which comprises feature extraction and keypoint detection. The feature extraction part is designed to capture sufficient and valuable pose information. Then, we introduce a Global Context Module into the keypoint detection branch to enlarge the receptive field, as it is crucial to successful human pose estimation. On the COCO val2017 set, our model using the ResNet-50 backbone achieves an AP of 68.1, which is 2.6 higher than Mask RCNN (AP of 65.5). Compared to the classic two-stage top-down method SimpleBaseline, our model largely narrows the performance gap (68.1 APkp vs. 68.9 APkp) with a much faster inference speed (77 ms vs. 168 ms), demonstrating the effectiveness of the proposed method. Code is available at:

Singing Voice Detection via Similarity-Based Semi-Supervised Learning

  • Xi Chen
  • Yongwei Gao
  • Wei Li

Data-driven methods play an important role in Singing Voice Detection (SVD). However, datasets with precise annotations are scarce. In this paper, we propose an SVD method via similarity-based semi-supervised learning (SSSL_SVD). First, we propose to enrich the diversity of training data using the self-training semi-supervised method (SSL). In SSL, pseudo labels of the unlabeled data are first generated by a pre-trained teacher model and are then used to train a student model. Second, we propose to measure the audio frame from a similarity-based perspective. Taking this into consideration, we can provide more appropriate learning targets. Finally, experimental results indicate that the proposed method achieves results comparable with state-of-the-art (SOTA) algorithms.
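The teacher-student self-training loop described above can be sketched with a toy classifier. This is an illustration under stated assumptions, not the authors' model: each audio frame is reduced to a single scalar feature, the classifier is a nearest-centroid rule, all function names are hypothetical, and the similarity-based learning targets of the paper are not covered.

```python
def fit_centroids(features, labels):
    """Train a toy nearest-centroid classifier: one mean feature per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def self_train(labeled_x, labeled_y, unlabeled_x):
    """Teacher labels the unlabeled data; student trains on both sets."""
    teacher = fit_centroids(labeled_x, labeled_y)
    pseudo_y = [predict(teacher, x) for x in unlabeled_x]
    student = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
    return teacher, pseudo_y, student


# Label 0 = non-vocal frame, label 1 = vocal frame (toy scalar features).
teacher, pseudo_labels, student = self_train([0.0, 1.0], [0, 1], [0.2, 0.9])
```

In this toy run, the student's class centroids shift toward the unlabeled frames, which is the sense in which pseudo labels enrich the training data.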


A Music Loop Sequencer with User-Adaptive Music Loop Selection

  • Yuki Iwamoto
  • Tetsuro Kitahara

A loop sequencer enables non-musicians to compose high-quality musical pieces by concatenating short musical audio materials called music loops, but if the system has many music loops, selecting favorite ones is not easy. Smart Loop Sequencer with automatic loop selection has been proposed, but no attempts to adapt the selection behavior to each user's musical preference have been made. We introduce a user adaptation method based on the users' manual music loop replacement data to our Smart Loop Sequencer. The experimental results show that most participants felt that the automatically selected music loops became closer to those they selected manually.

Action Detection System Based on Pose Information

  • Ryo Kawai
  • Noboru Yoshida
  • Jianquan Liu

This paper introduces an action detection system based on pose information. The system utilizes a view-invariant pose feature that does not rely on machine learning techniques, and it can detect human actions regardless of camera settings, so it is easy to apply to any target action. System users only need to register sample images including the target actions beforehand. In the detection phase, the system receives an image from a live camera at short intervals and computes the similarity between the captured image and each pre-registered sample image. If the similarity is higher than the specified threshold, the system judges that the target action is detected. We evaluated the detection performance on a common action, a "phone-call" using a cellphone, and confirmed its effectiveness.
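The detection phase described above amounts to a thresholded similarity search over registered samples. The sketch below assumes that pose features have already been extracted as fixed-length numeric vectors and uses cosine similarity with a threshold of 0.9; the similarity measure, the threshold, and the names `cosine_similarity` and `detect_action` are all assumptions of this example, since the abstract does not specify them.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect_action(frame_feature, registered, threshold=0.9):
    """Return the best-matching action name above `threshold`, else None.

    registered: dict mapping an action name to a list of sample features.
    """
    best_name, best_score = None, threshold
    for name, samples in registered.items():
        for sample in samples:
            score = cosine_similarity(frame_feature, sample)
            if score > best_score:
                best_name, best_score = name, score
    return best_name


# Hypothetical pre-registered sample and two captured frames.
registered_samples = {"phone-call": [[1.0, 0.0, 1.0]]}
similar_frame = detect_action([0.9, 0.1, 1.0], registered_samples)
different_frame = detect_action([0.0, 1.0, 0.0], registered_samples)
```

Because detection only compares against registered samples, adding a new target action requires registering new sample features rather than retraining a model.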

DeepHair: A DeepFake-Based Hairstyle Preview System

  • Yu-Hsuan Lo
  • Shih-Wei Sun

In this paper, a deepfake-based hairstyle preview system is proposed. Most existing hairstyle preview systems are limited to hairstyle transfer from still images. To provide a reliable hairstyle preview experience from different viewing angles in a video, we propose a prototype system based on deepfake. A user's face can be appropriately transplanted onto the face region of a hairstyle target model. Additionally, users can preview various hairstyles from different viewing angles in the generated hairstyle preview video. In this prototype, we provide 8 preview hairstyles for female and male users: central parting, ponytail, wavy bob, and curls for female users; man bun, middle part, bangs, and side part for male users.

Emotional Talking Faces: Making Videos More Expressive and Realistic

  • Sahil Goyal
  • Shagun Uppal
  • Sarthak Bhagat
  • Dhroov Goel
  • Sakshat Mali
  • Yi Yu
  • Yifang Yin
  • Rajiv Ratn Shah

Lip synchronization and talking face generation have gained particular interest from the research community with the advent of, and need for, digital communication in different fields. Prior works propose several elegant solutions to this problem. However, they often fail to create realistic-looking videos that account for people's expressions and emotions. To mitigate this, we build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions, making them more real-looking and convincing. With a broad range of six emotions, i.e., anger, disgust, fear, happiness, neutral, and sadness, we show that our model generalizes across identities, emotions, and languages.

FoodLog Athl: Multimedia Food Recording Platform for Dietary Guidance and Food Monitoring

  • Kei Nakamoto
  • Kohei Kumazawa
  • Hiroaki Karasawa
  • Sosuke Amano
  • Yoko Yamakata
  • Kiyoharu Aizawa

This paper presents a new food recording tool, FoodLog Athl, for the healthcare or physical enhancement of its users. Unlike existing food recording tools, we designed the system for dietitians or third parties who monitor the users. The tool not only supports the users with functions such as food image recognition, but also helps the dietitians watch and communicate with users. Furthermore, it calculates nutritional values from food records, which reduces the workload of dietitians and lets them focus on nutrition guidance.

Rubber Material Retrieval System using Electron Microscope Images for Rubber Material Development

  • Rintaro Yanagi
  • Ren Togo
  • Takahiro Ogawa
  • Miki Haseyama

For developing valuable rubber materials, machine learning-based computer-aided analysis systems have been attracting a lot of attention. However, these systems mainly focus on analyzing tabular and textual data, and electron microscope images, which contain rich material information, have not been sufficiently analyzed. By effectively using these electron microscope images, further support for material discovery can be realized. In this paper, we present a material information retrieval system via the electron microscope image space. Our system aims to support a visual and comprehensive grasp of the relationships between various rubber materials and their properties. By effectively using the electron microscope image space for material information retrieval, advances in material development are expected to be further accelerated.

JamSketch Deep α: A CNN-Based Improvisation System in Accordance with User's Melodic Outline Drawing

  • Tetsuro Kitahara
  • Akio Yonamine

We present an improvisation system called JamSketch Deep α, which generates melodies using a convolutional neural network (CNN) according to melodic outlines given by the user. The original version of JamSketch used a genetic algorithm for generating melodies, so generating melodies took time. We therefore show a prototype of our CNN-based version of JamSketch, which immediately generates a Blues-style melody once the user draws an outline.

GSTH266enc: A GStreamer Plugin for VVC Encoder

  • Advaiit Rajjvaed
  • Saurabh Puri
  • Gurdeep Bhullar
  • Gaëlle Martin-Cocher

Innovation in video systems remains essential today, particularly to support the need for low-latency video applications such as XR/VR video streaming and cloud gaming. This paper presents a new GStreamer plugin (GSTH266enc) for Versatile Video Coding (VVC) encoders. VVC is the most recent ISO international video coding standard, finalized in July 2020. As of the writing of this paper, GSTH266enc is the only known implementation which supports encoding using a VVC codec in GStreamer. It is also the first plugin providing agnostic support for various VVC encoder implementations. The goal of the plugin is to provide developers with a higher level of abstraction over VVC encoder implementations. This gives a multimedia application easy access to a VVC encoder without dealing with the intricacies of the video encoding library. The work focuses on achieving flexibility, dynamicity, and fast development in multimedia processing by providing a tailored solution with a multi-layered API. The plugin will be made publicly available on the InterDigital GitHub repository.

Intelligent Video Surveillance Platform Based on FFmpeg and Yolov5

  • Chuanxu Jiang
  • Yanfang Wang
  • Qian Huang
  • Yiming Wang
  • Yuhan Dai

With the development of multimedia, video surveillance systems are becoming more popular. However, current video surveillance systems offer only general-purpose functions and are unable to provide intelligent perception.

We design an intelligent video surveillance platform, which uses Qt, FFmpeg, and streaming technology. It supports multi-protocol (RTSP, RTMP, HTTP, HLS), multi-window video playback, and adopts the YOLOv5 framework for real-time intelligent analysis with different YOLOv5 models (e.g., mask detection, helmet detection, and safety vest detection). The overall architecture is created using demand-driven design, and the system is divided into three main modules: a video playback module, a video surveillance module, and an intelligent detection module.