# ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval

## SESSION: Full Research Papers

### Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection

• Evlampios Apostolidis
• Vasileios Mezaris
• Ioannis Patras

This paper presents a new method for unsupervised video thumbnail selection. The developed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on a combination of adversarial and reinforcement learning. The former is used to train a discriminator, whose goal is to distinguish the original from a reconstructed version of the video based on a small set of candidate thumbnails. The discriminator's feedback is a measure of the representativeness of the selected thumbnails. This measure is combined with estimates about the aesthetic quality of the thumbnails (made using a SoA Fully Convolutional Network) to form a reward and train the thumbnail selector via reinforcement learning. Experiments on two datasets (OVP and Youtube) show the competitiveness of the proposed method against other SoA approaches. An ablation study with respect to the adopted thumbnail selection criteria documents the importance of considering the aesthetics, and the contribution of this information when used in combination with measures about the representativeness of the visual content.

### Efficient Indexing of 3D Human Motions

• Petra Budikova
• Jan Sedmidubsky
• Pavel Zezula

Digitization of human motion using 2D or 3D skeleton representations offers exciting possibilities for many applications but, at the same time, requires scalable content-based retrieval techniques to make such data reusable. Although a lot of research effort focuses on extracting content-preserving motion features, there is a lack of techniques that support efficient similarity search on a large scale. In this paper, we introduce a new indexing scheme for organizing large collections of spatio-temporal skeleton sequences. Specifically, we apply the motion-word concept to transform skeleton sequences into structured text-like motion documents, and index such documents using an extended inverted-file approach. Over this index, we design a new similarity search algorithm that exploits the properties of the motion-word representation and provides efficient retrieval with a variable level of approximation, possibly reaching constant search costs disregarding the collection size. Experimental results confirm the usefulness of the proposed approach.
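
As a rough illustration of the inverted-file idea mentioned above, the following minimal sketch indexes motion documents by the motion words they contain and ranks candidates by the number of shared words. The motion-word ids, the document names, and the overlap-count scoring are assumptions made for this example; the paper's extended index and its approximate search algorithm are not reproduced here.

```python
from collections import Counter, defaultdict

def build_inverted_index(documents):
    """Map each motion word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index

def search(index, query_words, top_k=3):
    """Rank documents by how many distinct query words they share (assumed scoring)."""
    scores = Counter()
    for word in set(query_words):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Sort by descending score, then by id so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]

# Hypothetical motion documents: each motion is a sequence of motion-word ids.
motions = {
    "walk_01": [3, 7, 7, 12],
    "walk_02": [3, 7, 12, 15],
    "jump_01": [21, 22, 7],
}
index = build_inverted_index(motions)
results = search(index, [3, 7, 12])
```

Only documents that share at least one word with the query are touched, which is what keeps lookup cost tied to posting-list sizes rather than to the whole collection.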

### Global Relation-Aware Attention Network for Image-Text Retrieval

• Jie Cao
• Shengsheng Qian
• Huaiwen Zhang
• Quan Fang
• Changsheng Xu

Cross-modal image-text retrieval has attracted extensive attention in recent years and contributes to the development of search engines. Fine-grained features and cross-attention have been widely used in past research to achieve cross-modal image-text matching. Although cross-related methods have achieved remarkable results, the features must be encoded again in the evaluation phase due to the interaction of the two modalities, which is unsuitable for practical search engine scenarios. In addition, the aggregated feature does not contain sufficient semantics since it is merely obtained by simple mean pooling. Furthermore, the connecting weights of self-attention blocks are target-position invariant and thus lack the expected adaptability. To tackle these limitations, in this paper we propose a novel Global Relation-aware Attention Network (GRAN) for image-text retrieval by designing a Global Attention Module (GAM) and a Relation-aware Attention Module (RAM), which play an important role in modeling the global feature and the relationships of local fragments. Firstly, we propose the Global Attention Module (GAM), which follows the fine-grained features, to obtain a meaningful global feature. Secondly, we use several stacked transformer encoders to further encode the features separately. Finally, we propose the Relation-aware Attention Module (RAM) to generate a vector that represents the relation information and is used to infer the attention intensity of pairwise fragments. The local features, the global feature, and their relations are considered jointly to conduct efficient image-text retrieval. Extensive experiments are conducted on the benchmark datasets Flickr30K and MSCOCO, demonstrating the superiority of our method. On Flickr30K, compared to the state-of-the-art method TERAN, we improve the Recall@K (K=1) metric by 5.8% and 4.0% on the image and text retrieval tasks, respectively.
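
For readers unfamiliar with the Recall@K metric quoted above, the following minimal sketch shows one common way to compute it: the fraction of queries for which at least one relevant item appears among the top K results. The query ids, item ids, and rankings are hypothetical and are not taken from the paper.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in the top-k results."""
    hits = 0
    for query, ranking in rankings.items():
        if any(item in relevant[query] for item in ranking[:k]):
            hits += 1
    return hits / len(rankings)

# Hypothetical ranked retrieval results for four queries.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["b", "a", "c"],
    "q3": ["c", "b", "a"],
    "q4": ["a", "c", "b"],
}
# Hypothetical ground truth: the set of relevant items for each query.
relevant = {"q1": {"a"}, "q2": {"a"}, "q3": {"a"}, "q4": {"b"}}

r1 = recall_at_k(rankings, relevant, 1)
r2 = recall_at_k(rankings, relevant, 2)
r3 = recall_at_k(rankings, relevant, 3)
```

Using a set of relevant items per query covers both retrieval directions: sentence queries typically have one matching image, while image queries may have several matching captions.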

### MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-scale SincNet and ResNet for Music Genre Classification

• Pei-Chun Chang
• Yong-Sheng Chen
• Chang-Hsing Lee

In this study, we propose a new end-to-end convolutional neural network, called MS-SincResNet, for music genre classification. MS-SincResNet appends a 1D multi-scale SincNet (MS-SincNet) to a 2D ResNet as the first convolutional layer in an attempt to jointly learn 1D kernels and 2D kernels during the training stage. First, an input music signal is divided into a number of fixed-duration (3 seconds in this study) music clips, and the raw waveform of each music clip is fed into the 1D MS-SincNet filter learning module to obtain three-channel 2D representations. The learned representations carry rich timbral, harmonic, and percussive characteristics compared with spectrograms, harmonic spectrograms, percussive spectrograms and Mel-spectrograms. ResNet is then used to extract discriminative embeddings from these 2D representations. The spatial pyramid pooling (SPP) module is further used to enhance the feature discriminability, in terms of both time and frequency aspects, to obtain the classification label of each music clip. Finally, a voting strategy is applied to summarize the classification results from all 3-second music clips. In our experimental results, we demonstrate that the proposed MS-SincResNet outperforms the baseline SincNet and many well-known hand-crafted features. Considering individual 2D representations, MS-SincResNet also yields competitive results with the state-of-the-art methods on the GTZAN dataset and the ISMIR2004 dataset. The code is available at https://github.com/PeiChunChang/MS-SincResNet.
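
The clip-splitting and voting steps described above can be sketched as follows. This is a minimal illustration, assuming non-overlapping 3-second clips and simple majority voting; the sample rate and the per-clip labels are hypothetical, and the authors' repository remains the reference for the actual procedure.

```python
from collections import Counter

def split_into_clips(num_samples, sample_rate, clip_seconds=3):
    """Return (start, end) sample ranges of non-overlapping fixed-duration clips."""
    clip_len = sample_rate * clip_seconds
    return [(start, start + clip_len)
            for start in range(0, num_samples - clip_len + 1, clip_len)]

def vote(clip_labels):
    """Majority vote over per-clip predictions (ties go to the label seen first)."""
    return Counter(clip_labels).most_common(1)[0][0]

# Hypothetical 10-second track sampled at 22050 Hz.
clips = split_into_clips(num_samples=220500, sample_rate=22050)

# Hypothetical per-clip predictions from a classifier.
track_label = vote(["rock", "jazz", "rock"])
```

The trailing second of audio that does not fill a whole clip is dropped in this sketch; padding or overlapping windows would be equally plausible choices.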

### MLFont: Few-Shot Chinese Font Generation via Deep Meta-Learning

• Xu Chen
• Lei Wu
• Minggang He
• Lei Meng
• Xiangxu Meng

The automatic generation of Chinese fonts is challenging due to the large quantity and complex structure of Chinese characters. When there are insufficient reference samples for the target font, existing deep learning-based methods cannot avoid overfitting caused by too few samples, resulting in blurred glyphs and incomplete strokes. To address these problems, this paper proposes a novel deep meta-learning-based font generation method (MLFont) for few-shot Chinese font generation, which leverages existing fonts to improve the generalization capability of the model for new fonts. Existing deep meta-learning methods mainly focus on few-shot image classification. To apply meta-learning to font generation, we present a meta-training strategy based on Model-Agnostic Meta-Learning (MAML) and a task organization method for font generation. The meta-training makes the font generator easy to fine-tune for new font generation tasks. Through random font generation tasks and extraction of glyph content and style separately, the font generator learns the prior knowledge of character structure in the meta-training stage, and then quickly adapts to the generation of new fonts with a few samples by fine-tuning of adversarial training. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods with more complete strokes and less noise in the generated character images.

### Facial Structure Guided GAN for Identity-preserved Face Image De-occlusion

• Yiu-Ming Cheung
• Mengke Li
• Rong Zou

In some practical scenarios, such as video surveillance and personal identification, we often have to address the recognition problem of occluded faces, where content replacement by serious occlusion with non-face objects always produces partial appearance and ambiguous representation. Under the circumstances, the performance of face recognition algorithms will often deteriorate to a certain degree. In this paper, we therefore address this problem by removing occlusions on face images and present a new two-stage Facial Structure Guided Generative Adversarial Network (FSG-GAN). In Stage I of the FSG-GAN, the variational auto-encoder is used to predict the facial structure. In Stage II, the predicted facial structure and the occluded image are concatenated and fed into a generative adversarial network (GAN) based model to synthesize the de-occlusion face image. In this way, the facial structure knowledge can be transferred to the synthesis network. Especially, in order to enable the occluded face image to be perceived well, the generator in the GAN based synthesis network utilizes the hybrid dilated convolution modules to extend the receptive field. Furthermore, aiming at further eliminating the appearance ambiguity as well as unnatural texture, a multi-receptive fields discriminator is proposed to utilize the features from different levels. Experiments on the benchmark datasets show the efficacy of the proposed FSG-GAN.

### Heterogeneous Side Information-based Iterative Guidance Model for Recommendation

• Feifei Dai
• Xiaoyan Gu
• Zhuo Wang
• Mingda Qian
• Bo Li
• Weiping Wang

Heterogeneous side information has been widely used in recommender systems to alleviate the data sparsity problem. However, the heterogeneous side information in existing methods provides insufficient guidance for predicting user preferences as its effect is inevitably weakened during utilization. Furthermore, most existing methods cannot effectively utilize the heterogeneous side information to understand users and items. They often neglect the interrelation among various types of heterogeneous side information of a user or an item. As a result, it is difficult for existing methods to comprehensively understand users and items so that the recommender system recommends inappropriate items to users. To overcome the above drawbacks, we propose an interrelation learning-based recommendation method with iterative heterogeneous side information guidance (ILIG). ILIG includes two modules: 1) Iterative Heterogeneous Side Information Guidance Module. It uses heterogeneous side information to iteratively guide the prediction of user preferences, which effectively enhances the effect of the heterogeneous side information. 2) Interrelation Learning-based Portrait Construction Module. It captures the interrelation among various types of heterogeneous side information to comprehensively learn the representations of users and items. To demonstrate the effectiveness of ILIG, we conduct extensive experiments on Movielens-100K, Movielens-1M, and BookCrossing datasets. The experimental results show that ILIG outperforms the state-of-the-art recommender systems.

### Dense Scale Network for Crowd Counting

• Feng Dai
• Hao Liu
• Yike Ma
• Xi Zhang
• Qiang Zhao

Crowd counting has been widely studied by the computer vision community in recent years. Due to the large scale variation, it remains a challenging task. Previous methods adopt either a multi-column CNN or a single-column CNN with multiple branches to deal with this problem. However, restricted by the number of columns or branches, these methods can only capture a few different scales and have limited capability. In this paper, we propose a simple but effective network called DSNet for crowd counting, which can be easily trained in an end-to-end fashion. The key component of our network is the dense dilated convolution block, in which each dilation layer is densely connected with the others to preserve information from continuously varied scales. The dilation rates in the dilation layers are carefully selected to prevent gridding artifacts in the block. To further enlarge the range of scales covered by the network, we cascade three blocks and link them with dense residual connections. We also introduce a novel multi-scale density level consistency loss for performance improvement. To evaluate our method, we compare it with state-of-the-art algorithms on five crowd counting datasets (ShanghaiTech, UCF-QNRF, UCF_CC_50, UCSD and WorldExpo'10). Experimental results demonstrate that DSNet achieves the best overall performance and makes significant improvements.
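
To make the remark about dilation rates and gridding artifacts more concrete, the following minimal sketch computes, for a plain chain of stride-1 dilated convolutions in one dimension, the receptive field and the set of input offsets that can actually reach one output position. The kernel size and the dilation rates are assumptions chosen for illustration, and the dense connections of the actual block are not modeled.

```python
from itertools import product

def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions (1D)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def covered_offsets(kernel_size, dilations):
    """Input offsets that contribute to one output position (odd kernel size assumed)."""
    half = kernel_size // 2
    taps = range(-half, half + 1)
    return {sum(t * d for t, d in zip(combo, dilations))
            for combo in product(taps, repeat=len(dilations))}

def has_gridding(kernel_size, dilations):
    """True if some positions inside the receptive field are never sampled."""
    half = receptive_field(kernel_size, dilations) // 2
    return covered_offsets(kernel_size, dilations) != set(range(-half, half + 1))

# Hypothetical rate choices for three stacked 3-tap layers.
rf_mixed = receptive_field(3, (1, 2, 3))
mixed_has_holes = has_gridding(3, (1, 2, 3))
equal_has_holes = has_gridding(3, (2, 2, 2))
```

Both rate choices span the same receptive field, but the equal rates only ever sample even offsets, which is the kind of hole pattern usually called gridding.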

### Leveraging Two Types of Global Graph for Sequential Fashion Recommendation

• Yujuan Ding
• Yunshan Ma
• Wai Keung Wong
• Tat-Seng Chua

Sequential fashion recommendation is of great significance in online fashion shopping, which accounts for an increasing portion of either fashion retailing or online e-commerce. The key to building an effective sequential fashion recommendation model lies in capturing two types of patterns: the personal fashion preference of users and the transitional relationships between adjacent items. The two types of patterns are usually related to user-item interaction and item-item transition modeling respectively. However, due to the large sets of users and items as well as the sparse historical interactions, it is difficult to train an effective and efficient sequential fashion recommendation model. To tackle these problems, we propose to leverage two types of global graph, i.e., the user-item interaction graph and item-item transition graph, to obtain enhanced user and item representations by incorporating higher-order connections over the graphs. In addition, we adopt the graph kernel of LightGCN [9] for the information propagation in both graphs and propose a new design for item-item transition graph. Extensive experiments on two established sequential fashion recommendation datasets validate the effectiveness and efficiency of our approach.

### HSGMP: Heterogeneous Scene Graph Message Passing for Cross-modal Retrieval

• Yu Duan
• Yun Xiong
• Yao Zhang
• Yuwei Fu
• Yangyong Zhu

Semantic relationship information is important to the image-text retrieval task. Existing work usually extracts relationship information by calculating pairwise relationship values, which rarely uncovers a meaningful semantic relationship. A more reasonable method is to convert each modality to a scene graph, thereby explicitly modeling the relationships. A scene graph is a graph data structure that models the scene of a modality. There are two concepts in a scene graph: object and relationship. In the image modality, an object indicates an image region and a relationship represents the predicate between image regions. In the text modality, an object indicates an entity and a relationship represents the association between entities, also known as a semantic relationship. In the image-text retrieval task, both objects and relationships are important, and a key challenge is to obtain semantic information. In this paper, image and text are represented as two kinds of scene graphs, a visual scene graph and a textual scene graph, which are then combined into a Heterogeneous Scene Graph (HSG). By explicitly modeling relationships using a directed graph, the information can be passed edge-wise. To further extract semantic information, we introduce the metapath, which can extract specific semantic information along a specified path. Moreover, we propose Heterogeneous Message Passing (HMP) to communicate information on the metapath. After the message passing, the similarity of two modalities can be represented as the similarity of the graphs. Experiments show that the model achieves competitive results on Flickr30K and MSCOCO, which indicates that our approach has advantages in image-text retrieval.

### GCNBoost: Artwork Classification by Label Propagation through a Knowledge Graph

• Cheikh Brahim El Vaigh
• Noa Garcia
• Benjamin Renoust
• Chenhui Chu
• Yuta Nakashima
• Hajime Nagahara

The rise of digitization of cultural documents offers large-scale content, opening the road for the development of AI systems to preserve, search, and deliver cultural heritage. Organizing such cultural content also means classifying it, a task that is very familiar to modern computer science. Contextual information is often the key to structuring such real-world data, and we propose to use it in the form of a knowledge graph. Such a knowledge graph, combined with content analysis, enhances the notion of proximity between artworks and thereby improves performance in classification tasks. In this paper, we propose a novel use of a knowledge graph that is constructed on annotated data and pseudo-labeled data. With label propagation, we boost artwork classification by training a model using a graph convolutional network, relying on the relationships between entities of the knowledge graph. Following a transductive learning framework, our experiments show that relying on a knowledge graph modeling the relations between labeled data and unlabeled data allows us to achieve state-of-the-art results on multiple classification tasks on a dataset of paintings and on a dataset of Buddha statues. Additionally, we show state-of-the-art results for the difficult case of dealing with unbalanced data, with the limitation of disregarding classes with extremely low degrees in the knowledge graph.

### Can Action be Imitated? Learn to Reconstruct and Transfer Human Dynamics from Videos

• Yuqian Fu
• Yanwei Fu
• Yu-Gang Jiang

Given a video demonstration, can we imitate the action contained in this video? In this paper, we introduce a novel task, dubbed mesh-based action imitation. The goal of this task is to enable an arbitrary target human mesh to perform the same action shown in the video demonstration. To achieve this, we propose a novel Mesh-based Video Action Imitation (M-VAI) method. M-VAI first learns to reconstruct the meshes from the given source image frames; the initial recovered mesh sequence is then fed into mesh2mesh, a mesh sequence smoothing module that we propose, to improve the temporal consistency. Finally, we imitate the actions by transferring the pose from the constructed human body to our target identity mesh. High-quality and detailed human body meshes can be generated using our M-VAI. Extensive experiments demonstrate the feasibility of our task and the effectiveness of our proposed method.

### SAGN: Semantic Adaptive Graph Network for Skeleton-Based Human Action Recognition

• Ziwang Fu
• Feng Liu
• Jiahao Zhang
• Hanyang Wang
• Chengyi Yang
• Qing Xu
• Jiayin Qi
• Xiangling Fu
• Aimin Zhou

With the continuous development and popularity of depth cameras, skeleton-based human action recognition has attracted wide attention. Graph Convolutional Networks (GCNs) have achieved remarkable performance. However, existing methods do not adequately consider semantic characteristics, which can help to express the current concept and scene information. Semantic information can also help with finer-grained classification. In addition, most existing models require a lot of computation. Moreover, an adaptive GCN can automatically learn the graph structure and consider the connections between joints. In this paper, we propose a relatively less computationally intensive model, which combines a semantic and adaptive graph network (SAGN) for skeleton-based human action recognition. Specifically, we mainly combine the dynamic characteristics and bone information to extract the data, taking the correlation between semantics into the model. In the training process, SAGN includes an adaptive network so that the attention mechanism becomes more flexible. We design a Convolutional Neural Network (CNN) for feature extraction along the time dimension. The experimental results show that SAGN achieves state-of-the-art performance on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets. SAGN can promote the study of skeleton-based human action recognition. The source code is available at https://github.com/skeletonNN/SAGN.

### Text-Guided Visual Feature Refinement for Text-Based Person Search

• Liying Gao
• Kai Niu
• Zehong Ma
• Bingliang Jiao
• Tonghao Tan
• Peng Wang

Text-based person search is the task of retrieving the corresponding person in a large-scale image database given a textual description, which has important value in various fields such as video surveillance. In the inference phase, language descriptions, serving as queries, guide the search for the corresponding person images. Most existing methods apply cross-modal signals to guide feature refinement. However, they employ visual features from the gallery to refine textual features, which may cause high similarity between unmatched pairs. Besides, similarity-based cross-modal attention could disturb the choice of areas of interest for descriptions. In this paper, we analyze the deficiency of previous methods and carefully design a Text-guided Visual Feature Refinement network (TVFR), which utilizes text as a reference to refine visual representations. Firstly, we divide each visual feature into several horizontal stripes for fine-grained refinement. After that, we employ a text-based filter generation module to generate description-customized filters, which are used to indicate the corresponding stripes mentioned in the textual input. Thereafter, we employ a text-guided visual feature refinement module to fuse part-level visual features adaptively for each description. We validate our TVFR through extensive experiments on CUHK-PEDES, which is the only available dataset for text-based person search. To the best of our knowledge, the TVFR outperforms other state-of-the-art methods.

### RGB-D Scene Recognition based on Object-Scene Relation and Semantics-Preserving Attention

• Yuhui Guo
• Xun Liang

Scene recognition is challenging due to intra-class diversity and inter-class similarity. Previous works recognize scenes either with global representations or with intermediate representations of objects. By contrast, we investigate more discriminative sequential representation of object-to-scene relations (SOSRs) for scene recognition. Particularly, we develop an Attention-Preserving Memory-Learning (APML) model, which enforces the Memory Network of the semantic domain to guide the Learning Network of the appearance domain in the learning procedure. Accordingly, we allocate semantics-preserving attention to different objects, which is more effective to seek the key encoded SOSR and discard the misleading encoded SOSR between objects and scene without requiring extra labeled data. Based on the proposed APML networks, we obtain the state-of-the-art results of RGB-D scene recognition on SUN RGB-D and NYUD2 datasets.

### Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval

• Xiaoshuai Hao
• Yucan Zhou
• Dayan Wu
• Wanqian Zhang
• Bo Li
• Weiping Wang

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid growth of user-generated videos on the web. To solve this problem, most approaches try to learn a joint embedding space to measure the cross-modal similarities, while paying little attention to the representation of each modality. Video is more complicated than the commonly used visual feature, since the audio and on-screen captions also contain rich information. Recently, aggregating multiple features in videos has boosted the benchmark performance of video-text retrieval systems. However, existing approaches usually handle each feature independently, which ignores the interchange of high-level semantic relations among these multiple features. Moreover, besides the inter-modal ranking constraint where semantically similar texts and videos should stay closer, the modality-specific requirement, i.e. two similar videos/texts should have similar representations, is also significant. In this paper, we propose a novel Multi-Feature Graph ATtention Network (MFGATN) for cross-modal video-text retrieval. Specifically, we introduce a multi-feature graph attention module, which enriches the representation of each feature in videos with the interchange of high-level semantic information among them. Moreover, we elaborately design a novel Dual Constraint Ranking Loss (DCRL), which simultaneously considers the inter-modal ranking constraint and the intra-modal structure constraint to preserve both the cross-modal semantic similarity and the modality-specific consistency in the embedding space. Experiments on two datasets, i.e. MSR-VTT and MSVD, demonstrate that our method achieves a significant performance gain compared with state-of-the-art methods.

### HPOF: 3D Human Pose Recovery from Monocular Video with Optical Flow

• Bin Ji
• Chen Yang
• Yao Shunyu
• Ye Pan

This paper introduces HPOF, a novel deep neural network to reconstruct 3D human motion from a monocular video. Recently, model-based methods have been proposed to simplify the reconstruction task by estimating several parameters that control a deformable surface model to fit the person in the image. However, learning the parameters from a single image is a highly ill-posed problem, and the process is ultimately data-hungry. Existing 3D datasets are not sufficient, and the usage of 2D in-the-wild datasets is often susceptible to the inadequate precision of manual annotations. To address the above issues, our method yields substantial improvements in two domains. First, we leverage optical flow to supervise the 2D rendered images of predicted SMPL models to learn short-term temporal features. Second, taking long-term temporal consistency into account, we define a novel temporal encoder based on a dilated convolutional network. The encoder decomposes the learning process of human shape and pose: it first guarantees the invariance of the body shape, and then simulates a more reasonable forward kinematics process on this basis to achieve more accurate pose estimation. In addition, an adversarial learning framework is applied to supervise the reconstruction process in a coarse-grained way. We show that HPOF not only improves the accuracy of 3D poses but also ensures a realistic body structure throughout the video. We perform extensive experiments to demonstrate the superiority of our method over other state-of-the-art methods and to analyze the effectiveness of our model.

### Leveraging EfficientNet and Contrastive Learning for Accurate Global-scale Location Estimation

• Giorgos Kordopatis-Zilos
• Panagiotis Galopoulos
• Ioannis Kompatsiaris

In this paper, we address the problem of global-scale image geolocation, proposing a mixed classification-retrieval scheme. Unlike other methods that strictly tackle the problem as a classification or retrieval task, we combine the two practices in a unified solution leveraging the advantages of each approach with two different modules. The first leverages the EfficientNet architecture to assign images to a specific geographic cell in a robust way. The second introduces a new residual architecture that is trained with contrastive learning to map input images to an embedding space that minimizes the pairwise geodesic distance of same-location images. For the final location estimation, the two modules are combined with a search-within-cell scheme, where the locations of most similar images from the predicted geographic cell are aggregated based on a spatial clustering scheme. Our approach demonstrates very competitive performance on four public datasets, achieving new state-of-the-art performance in fine granularity scales, i.e., 15.0% at 1km range on Im2GPS3k.
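
The distance-based evaluation mentioned above (for example, accuracy at the 1 km range) can be illustrated with the following minimal sketch. It uses the haversine great-circle formula as a stand-in for geodesic distance, which is an assumption of this example; the coordinates are hypothetical and the paper's exact evaluation protocol is not reproduced.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_within(predictions, targets, radius_km):
    """Fraction of predictions that fall within radius_km of the true location."""
    hits = sum(haversine_km(*p, *t) <= radius_km for p, t in zip(predictions, targets))
    return hits / len(targets)

# Hypothetical predicted and true (lat, lon) pairs.
predicted = [(0.0, 0.0), (0.0, 0.0)]
actual = [(0.0, 0.005), (0.0, 1.0)]

one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
acc_1km = accuracy_within(predicted, actual, radius_km=1.0)
```

The first prediction is roughly half a kilometre off and counts as a hit at the 1 km range, while the second is about one degree of longitude away and does not.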

### Relation-aware Hierarchical Attention Framework for Video Question Answering

• Fangtao Li
• Ting Bai
• Chenyu Cao
• Zihe Liu
• Chenghao Yan
• Bin Wu

Video Question Answering (VideoQA) is a challenging video understanding task since it requires a deep understanding of both question and video. Previous studies mainly focus on extracting sophisticated visual and language embeddings and fusing them with delicate hand-crafted networks. However, the relevance of different frames, objects, and modalities to the question varies over time, which is ignored in most existing methods. Lacking an understanding of the dynamic relationships and interactions among objects poses a great challenge to the VideoQA task. To address this problem, we propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos. In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features. Then a graph-based relation encoder is utilized to extract the static relationships between visual objects. To capture the dynamic changes of multimodal objects in different video frames, we consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer. We conduct extensive experiments on a large-scale VideoQA dataset, and the experimental results demonstrate that our RHA outperforms the state-of-the-art methods.

### Cross-Modal Image-Recipe Retrieval via Intra- and Inter-Modality Hybrid Fusion

• Jiao Li
• Jialiang Sun
• Xing Xu
• Wei Yu
• Fumin Shen

In recent years, the Internet has stimulated the explosion of multimedia data. Food-related cooking videos, images, and recipes promote the rapid development of food computing. Image-recipe retrieval is an important sub-task in the field of cross-modal retrieval, which focuses on measuring the association between a food image and a recipe (title, ingredients, instructions). Although existing methods have proposed some feasible solutions for image-recipe retrieval, the following issues remain: 1) complex model structures and time-consuming training processes; 2) the lack of information interaction within modalities and of information integration between images and recipes. To this end, we propose a novel lightweight framework named Intra- and Inter-Modality Hybrid Fusion (IMHF). Our IMHF model abandons a separate deep vision encoder and utilizes the transformer module to unify the visual and text features. In this way, valuable information from images and recipes can be condensed and the direct information interaction between the two modalities can be promoted. Both intra- and inter-modality fusion can be realized. Extensive experimental results on the large-scale benchmark dataset Recipe1M demonstrate that our IMHF model, with its lightweight architecture, is superior to the state-of-the-art approaches.

### Unsupervised Deep Cross-Modal Hashing by Knowledge Distillation for Large-scale Cross-modal Retrieval

• Mingyong Li
• Hongya Wang

Cross-modal hashing (CMH) maps heterogeneous multi-modal data into compact binary codes to achieve fast and flexible retrieval across different modalities, especially in large-scale retrieval. As the data do not need extensive manual annotation, unsupervised cross-modal hashing has wider application prospects than supervised methods. However, existing unsupervised methods struggle to achieve satisfactory performance due to the lack of credible supervisory information. To solve this problem, inspired by knowledge distillation, we propose a novel unsupervised Knowledge Distillation Cross-Modal Hashing method (KDCMH), which uses similarity information distilled from an unsupervised method to guide a supervised method. Specifically, the teacher model first adopts an unsupervised distribution-based similarity hashing method, which can construct a modal fusion similarity matrix. Secondly, under the supervision of the teacher model's distilled information, the student model can generate more discriminative hash codes. Extensive experiments on two public datasets, NUS-WIDE and MIRFLICKR-25K, demonstrate the significant improvement of KDCMH over several representative unsupervised cross-modal hashing methods.
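
To illustrate why compact binary codes make cross-modal retrieval fast, the following minimal sketch ranks database items by Hamming distance to a query code. The 8-bit codes and item names are hypothetical, and the learning of the hash functions (the subject of the paper) is not shown.

```python
def hamming(a, b):
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

def retrieve(query_code, database, top_k=2):
    """Return the ids of the top_k items closest to the query in Hamming distance."""
    ranked = sorted(database, key=lambda item: (hamming(query_code, database[item]), item))
    return ranked[:top_k]

# Hypothetical 8-bit hash codes for three images.
image_codes = {
    "img_a": 0b10110010,
    "img_b": 0b10110011,
    "img_c": 0b01001101,
}
# Hypothetical hash code of a text query mapped into the same Hamming space.
text_query = 0b10110110

nearest = retrieve(text_query, image_codes)
```

Each comparison is a single XOR followed by a bit count, which is what allows such codes to scale to large collections.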

### A Unified-Model via Block Coordinate Descent for Learning the Importance of Filter

• Qinghua Li
• Xue Zhang
• Cuiping Li
• Hong Chen

Deep Convolutional Neural Networks (CNNs) are increasingly used in multimedia retrieval, and accelerating Deep CNNs has recently received an ever-increasing research focus. Among the various approaches proposed in the literature, filter pruning has been regarded as a promising solution because it yields significant speedup and memory reduction for both the network model and the intermediate feature maps. Many works have been proposed to find unimportant filters and then prune them to accelerate Deep CNNs. However, they mainly use heuristic methods to evaluate the importance of filters, such as statistical information about the filters (e.g., pruning filters with a small $\ell_2$-norm), which may be suboptimal. In this paper, we propose a novel filter pruning method, namely A Unified-Model via Block Coordinate Descent for Learning the Importance of Filter (U-BCD). The importance of the filters in our U-BCD is learned by optimization: we simultaneously learn the filter parameters and the importance of filters with a block coordinate descent method. When applied to two image classification benchmarks, the effectiveness of our U-BCD is validated. Notably, on CIFAR-10, our U-BCD reduces more than 57% of FLOPs on ResNet-110 with even a 0.08% relative accuracy improvement, and also achieves state-of-the-art results on ILSVRC-2012.
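
For context, the $\ell_2$-norm heuristic that the abstract contrasts with can be sketched as follows. This illustrates only the heuristic baseline, not U-BCD; the flattened filter representation, the `prune_ratio` parameter, and the function names are assumptions made for the example.

```python
import math


def filter_l2_norms(filters: list[list[float]]) -> list[float]:
    """L2 norm of each filter, with every filter given as a flat list of weights."""
    return [math.sqrt(sum(w * w for w in f)) for f in filters]


def select_filters_to_prune(filters: list[list[float]], prune_ratio: float) -> list[int]:
    """Indices of the filters with the smallest L2 norms (heuristic baseline)."""
    norms = filter_l2_norms(filters)
    num_to_prune = int(len(filters) * prune_ratio)
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    return sorted(order[:num_to_prune])


# Four hypothetical flattened filters from one convolutional layer.
layer = [
    [0.5, -0.5, 0.5, -0.5],   # norm 1.0
    [0.01, 0.02, 0.0, 0.01],  # very small norm
    [3.0, 4.0, 0.0, 0.0],     # norm 5.0
    [0.1, 0.0, 0.0, 0.0],     # norm 0.1
]
pruned = select_filters_to_prune(layer, prune_ratio=0.5)
```

The heuristic ranks filters by a fixed statistic computed from their weights; the abstract's point is that importance can instead be learned jointly with the filter parameters.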

### Local-enhanced Interaction for Temporal Moment Localization

• Guoqiang Liang
• Shiyu Ji
• Yanning Zhang

Temporal moment localization via language aims to localize the span in an untrimmed video that best matches a given natural language query. Most previous works try to match the whole query feature with multiple moment proposals, or match a global video embedding with phrase- or word-level query features. However, these coarse interaction models become insufficient when the query and video have a more complex relationship. To address this issue, we propose a multi-branch interaction model for temporal moment localization. Specifically, the query sentence and video are encoded into multiple feature embeddings over several semantic sub-spaces. Then, each phrase embedding filters on a video feature to generate an attention sequence, which is used to re-weight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which can prevent our model from falling into a local optimum. To validate the proposed method, we have conducted extensive experiments on two popular benchmark datasets, Charade-STA and TACoS. The experimental performance surpasses other state-of-the-art methods, which demonstrates the effectiveness of our proposed model.

### Reading Scene Text by Fusing Visual Attention with Semantic Representations

• Zhiguang Liu
• Liangwei Wang
• Jian Qiao

Recognizing text in an unconstrained environment is a challenging task in computer vision. Many prevalent approaches to it employ a recurrent neural network that is difficult to train or rely heavily on sophisticated model designs for sequence modeling. In contrast to these methods, we propose a unified lexicon-free framework to enhance the accuracy of text recognition using only attention and convolution. We use a relational attention module to leverage visual patterns and word representations. To ensure that the predicted sequence captures the contextual dependencies within a word, we embed linguistic dependencies from a language model into the optimization framework. The proposed mutual attention model is an ensemble of visual cues and linguistic contexts that together improve performance. The results of experiments show that our system achieves state-of-the-art performance on datasets of texts from regular and irregular scenes. It also significantly enhances recognition performance on noisy scanned documents.

### Generative Adversarial Networks with Bi-directional Normalization for Semantic Image Synthesis

• Jia Long
• Hongtao Lu

Semantic image synthesis aims at translating semantic label maps to photo-realistic images. However, most previous methods easily generate blurred regions and artifacts, and the quality of these images is far from realistic. Two problems remain unresolved. First, these methods directly feed the semantic label as input to the deep network and produce the normalization parameters γ and β through convolution operations; we find that semantic labels, unlike real scene images, cannot provide detailed structural information, making it difficult to synthesize local details and structures. Second, there is no bi-directional information flow between the semantic labels and the real scene images, which leads to inefficient use of the semantic information and makes it hard to maintain the semantic constraints that preserve semantic information during synthesis. We propose Bi-directional Normalization (BDN) in our generative adversarial networks to solve these problems; it allows semantic label information and real scene image feature representations to be effectively utilized in a bi-directional way for generating high-quality images. Extensive experiments on several challenging datasets demonstrate significantly better results than existing methods in both visual fidelity and quantitative metrics.

### A Smart Adversarial Attack on Deep Hashing Based Image Retrieval

• Junda Lu
• Mingyang Chen
• Yifang Sun
• Wei Wang
• Yi Wang
• Xiaochun Yang

Deep hashing based retrieval models have been widely used in large-scale image retrieval systems. Recently, there has been a surging interest in studying the adversarial attack problem in deep hashing based retrieval models. However, the effectiveness of existing adversarial attacks is limited by their poor perturbation management, unawareness of ranking weight, and exclusive focus on the attack image. These shortcomings lead to high perturbation costs yet low reductions in average precision (AP). To overcome these shortcomings, we propose a novel adversarial attack framework to improve the effectiveness of adversarial attacks. Our attack designs a dimension-wise surrogate Hamming distance function to help with wiser perturbation management. Further, in generating adversarial examples, instead of focusing on a single image, we propose to collectively incorporate relevant images combined with an AP-oriented weight function. In addition, our attack can deal with both untargeted and targeted adversarial attacks in a flexible manner. Extensive experiments demonstrate that, with the same attack performance, our model significantly outperforms state-of-the-art models in perturbation cost on both untargeted and targeted attack tasks.
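
To make the notion of an AP reduction concrete, the sketch below computes average precision for a ranked list of 0/1 relevance flags before and after a hypothetical attack. The two rankings are made-up examples, and the function uses the common definition of AP normalized by the number of relevant items in the list, which is an assumption about the metric rather than a detail taken from the paper.

```python
def average_precision(relevance: list[int]) -> float:
    """AP of a ranked list, given 0/1 relevance flags in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


# Hypothetical rankings: relevant items are pushed down after the attack.
before_attack = [1, 0, 1, 0]
after_attack = [0, 0, 1, 1]
ap_drop = average_precision(before_attack) - average_precision(after_attack)
```

Items near the top of the list contribute more to AP, which is why a weight function that accounts for rank can matter when choosing where to spend the perturbation budget.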

### Image-to-Image Transfer Makes Chaos to Order

• Sanbi Luo
• Tao Guo

GAN-based image-to-image transfer tools have achieved remarkable results in image generation. However, most research efforts focus on changing style features, e.g., color and texture. The spatial features, e.g., the locations of objects in input and output images, always remain consistent. If the above tools are employed to translate locations, such as organizing objects from a chaotic scene into an orderly scene in images (i.e., chaos to order), can these tools work well? We investigate the issue of image-to-image location transfer and reach a preliminary conclusion that it is hard to manipulate spatial features of objects in raw images automatically. In this paper, we propose a novel framework called LT-GAN to address the above issue. Specifically, a multi-stage generation structure is designed, where the location translation is performed based on semantic labels as a bridge to enhance the effect of automatically manipulating the spatial features of raw images. Experimental results demonstrate the effectiveness of the proposed multi-stage generation strategy. Meanwhile, a Color Histogram Loss is explored to evaluate the similarity of color distribution between a chaotic scene image and the corresponding orderly scene image. The quality of orderly scene images generated by the final stage is improved significantly in LT-GAN by using the combination of feature extraction and the Color Histogram Loss. Moreover, to break through the limitation of public datasets in image-to-image transfer tasks, a new dataset named M2C is constructed for this new application scenario of location transfer, including more than 15,000 paired images and the corresponding semantic labels in total. The dataset is available at https://drive.google.com/open?id=1amr9ga9wvhnIzeZ48OHbLapHGqOb4-Up

### Summary of the 2021 Embedded Deep Learning Object Detection Model Compression Competition for Traffic in Asian Countries

• Yu-Shu Ni
• Chia-Chi Tsai
• Jiun-In Guo
• Jenq-Neng Hwang
• Bo-Xun Wu
• Po-Chi Hu
• Ted T. Kuo
• Po-Yu Chen
• Hsien-Kai Kuo

The 2021 embedded deep learning object detection model compression competition for traffic in Asian countries, held as part of the IEEE ICMR2021 Grand Challenges, focuses on object detection technologies in autonomous driving scenarios. The competition aims to detect objects in traffic with low complexity and small model size in Asian countries (e.g., Taiwan), which contain several harsh driving environments. The target objects include vehicles, pedestrians, bicycles and crowded scooters. There are 89,002 annotated images provided for model training and 1,000 images for validation. An additional 5,400 testing images are used in the contest evaluation process, of which 2,700 are used in the qualification stage and the rest in the final stage. In total, 308 registered teams joined the competition this year, and the top 15 teams with the highest detection accuracy entered the final stage, from which 9 teams submitted final results. The overall best model belongs to team "as798792", followed by team "Deep Learner" and team "UCBH." Two special awards, for best accuracy and best bicycle detection, go to the same team "as798792," and the other special award, for scooter detection, goes to team "abcda."

### Nested Dense Attention Network for Single Image Super-Resolution

• Cheng Qiu
• Yirong Yao
• Yuntao Du

Recently, deep convolutional neural networks (CNNs) have been widely used in single image super-resolution (SISR) and have achieved impressive performance. However, most existing CNN architectures cannot fully utilize the correlation of feature maps in the middle layers, and abundant features of different levels are lost. Furthermore, the convolution operation is limited to processing one local neighborhood at a time and thus lacks global information. To address these issues, we propose the nested dense attention network (NDAN) for generating more refined and structured high-resolution images. Specifically, we propose a nested dense structure (NDS) to better integrate features of different levels extracted from different layers. In addition, to capture inter-channel dependencies more efficiently, we propose the adaptive channel attention module (ACAM), which adaptively rescales channel-wise features by automatically adjusting the weights of different receptive fields. Furthermore, to better explore global-level context information, we design a hybrid non-local module (HNLM) and a hybrid non-local up-sampler (HNLU) to upscale the images by capturing spatial-wise long-distance dependencies and channel-wise long-distance correlation. Numerous experiments demonstrate the effectiveness of our model, which achieves higher PSNR and SSIM scores and generates images with better structures than state-of-the-art methods.

### Multi-scale Dynamic Network for Temporal Action Detection

• Yifan Ren
• Xing Xu
• Fumin Shen
• Zheng Wang
• Yang Yang
• Heng Tao Shen

In recent years, as a fundamental task in video understanding, Temporal Action Detection has attracted extensive attention. Most existing approaches use the same model parameters to process all input videos and therefore do not adapt to the input video during the inference stage. In this paper, we propose a novel model termed Multi-scale Dynamic Network (MDN) to tackle this problem. The proposed MDN model incorporates multiple Multi-scale Dynamic Modules (MDMs). Each MDM can generate video-specific and segment-specific convolution kernels based on video content from different scales and adaptively capture rich semantic information for the prediction. In addition, we design a new Edge Suppression Loss (ESL) function for MDN to pay more attention to hard examples. Extensive experiments conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS-14, show that the proposed MDN model achieves state-of-the-art performance.

### Distractor-Aware Tracker with a Domain-Special Optimized Benchmark for Soccer Player Tracking

• Zikai Song
• Zhiwen Wan
• Wei Yuan
• Ying Tang
• Junqing Yu
• Yi-Ping Phoebe Chen

Player tracking in broadcast soccer videos has received widespread attention in the field of sports video analysis. However, we note that there is no suitable tracking algorithm designed specifically for soccer video, and the existing benchmarks used for soccer player tracking cover few scenarios of low difficulty. Observing that interference and occlusion are knotty problems in soccer scenes because the distractors are extremely similar to the targets, we present a distractor-aware player tracking algorithm and a high-quality benchmark for soccer player tracking (BSPT). The distractor-aware player tracking algorithm perceives semantic information about distracting players in the background by similarity judgment; this semantic distractor-aware information is encoded into a context vector and is constantly updated as the objects move through a video sequence. Distractor-aware information is then appended to the tracking result of the baseline tracker to improve the intra-class discriminative power. BSPT contains a total of 120 sequences with rich annotations. Each sequence covers 8 specialized frame-level attributes from soccer scenarios, and the player occlusion situations are finely divided into 4 categories for a more comprehensive comparison. In the experimental section, the performance of our algorithm and 14 other trackers is evaluated on BSPT with detailed analysis. Experimental results reveal the effectiveness of the proposed distractor-aware model, especially under the attribute of occlusion. The BSPT benchmark and raw experimental results are available on the project page at http://media.hust.edu.cn/BSPT.htm.

### Efficient Nearest Neighbor Search by Removing Anti-hub

• Kimihiro Tanaka
• Yusuke Matsui
• Shin'ichi Satoh

The central research question of the nearest neighbor search is how to reduce the memory cost while maintaining its accuracy. Instead of compressing each vector as is done in the existing methods, we propose a way to subsample unnecessary vectors to save memory. We empirically found that such unnecessary vectors have low hubness scores and thus can be easily identified beforehand. Such points are called anti-hubs in the data mining community. By removing anti-hubs, we achieved a memory-efficient search while preserving accuracy. In million-scale experiments, we showed that any vector compression method improves search accuracy by partial replacement with anti-hub removal under the same memory usage. A billion-scale benchmark showed that our data reduction combined with the best search method achieves higher accuracy under the assumption of fixed memory consumption. For example, our method had a much higher recall@100 (0.53) compared with the existing method (0.23) for the same memory consumption (6GB).
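
A minimal sketch of the idea follows, assuming the hubness score of a point is its k-occurrence count, i.e. how many other points list it among their k nearest neighbors (a common definition in the hubness literature). The brute-force search, the toy 2-D points, and the `keep` parameter are illustrative assumptions, and the paper's actual selection procedure may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_occurrence(points, k):
    """Hubness score: how often each point appears in other points' k-NN lists."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort neighbors by distance; ties are broken by index for determinism.
        others.sort(key=lambda j: (squared_distance(p, points[j]), j))
        for j in others[:k]:
            counts[j] += 1
    return counts


def remove_anti_hubs(points, k, keep):
    """Keep the `keep` points with the highest hubness scores, in original order."""
    scores = k_occurrence(points, k)
    order = sorted(range(len(points)), key=lambda i: (-scores[i], i))
    kept = sorted(order[:keep])
    return [points[i] for i in kept]


# Toy data: four clustered points and one isolated point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
scores = k_occurrence(points, k=2)
reduced = remove_anti_hubs(points, k=2, keep=4)
```

In this toy example the isolated point is never anyone's nearest neighbor, so it is the one that gets dropped; removing it cannot change the top-2 results for queries that fall near the cluster.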

### A Denoising Convolutional Neural Network for Self-Supervised Rank Effectiveness Estimation on Image Retrieval

• Lucas Pascotti Valem
• Daniel Carlos Guimarães Pedronette

Image and multimedia retrieval has become established as a prominent task in an increasingly digital and visual world. Supported mainly by decades of development on hand-crafted features and the success of deep learning techniques, many different feature extraction and retrieval approaches are currently available. However, the frequent requirement for large training sets remains a fundamental bottleneck, especially in real-world and large-scale scenarios. When labeled data is scarce or absent, choosing which retrieval approach to use has become a central challenge. A promising strategy consists of estimating the effectiveness of ranked lists without requiring any ground-truth data. Most of the existing measures exploit statistical analysis of the ranked lists and measure the reciprocity among lists of images in the top positions. This work innovates by proposing a new self-supervised method for this task, the Deep Rank Noise Estimator (DRNE). An algorithm is presented for generating synthetic ranked list data, which is modeled as images and used to train a Convolutional Neural Network that we propose for effectiveness estimation. The proposed model is a variant of the DnCNN (Denoiser CNN), which interprets the incorrectness of a ranked list as noise that is learned by the network. Our approach was evaluated on 5 public image datasets and different tasks, including general image retrieval and person re-ID. We also exploited and evaluated the complementarity between the proposed approach and related rank-based approaches through fusion strategies. The experimental results showed that the proposed method is capable of achieving a Pearson correlation with the MAP measure of up to 0.88 in general retrieval scenarios and 0.74 in person re-ID scenarios.

### Know Yourself and Know Others: Efficient Common Representation Learning for Few-shot Cross-modal Retrieval

• Shaoying Wang
• Hanjiang Lai
• Zhenyu Shi

Learning common representations for various modalities of data is the key component in cross-modal retrieval. Most existing deep approaches learn multiple networks to independently project each sample into a common representation. However, each representation is extracted only from the corresponding data, which completely ignores the relationships with other data. Thus it is challenging to learn efficient common representations when sufficient supervised multi-modal data for training is lacking, e.g., in few-shot cross-modal retrieval. How to efficiently exploit the information contained in other examples is underexplored. In this work, we present the Self-Others Net, a few-shot cross-modal retrieval model that fully exploits information contained both in its own and other samples. First, we propose a self-network to fully exploit the correlations that lurk in the data itself. It integrates the features at different layers and extracts the multi-level information in the self-network. Second, an others-network is further proposed to model the relationships among all samples, which learns the Mahalanobis tensor and mixes the prototypes of all data to capture the non-linear dependencies for common representation learning. Extensive experiments are conducted on three benchmark datasets, which demonstrate clear improvements of the proposed method over the state of the art.

### Neural Symbolic Representation Learning for Image Captioning

• Xiaomei Wang
• Lin Ma
• Yanwei Fu
• Xiangyang Xue

Traditional image captioning models mainly rely on one encoder-decoder architecture to generate one natural sentence for a given image. Such an architecture mostly uses deep neural networks to extract the neural representations of the image while ignoring the information of abstractive concepts as well as their intertwined relationships conveyed in the image. To this end, to comprehensively characterize the image content and bridge the gap between neural representations and high-level abstractive concepts, we make the first attempt to investigate the ability of neural symbolic representation of the image for the image captioning task. We first parse and convert a given image to neural symbolic representation in the form of an attributed relational graph, with the nodes denoting the abstractive concepts and the branches indicating the relationships between connected nodes, respectively. By performing computations over the attributed relational graph, the neural symbolic representation evolves step by step, with the node and branch representations as well as their corresponding importance weights transiting step by step. Empirically, extensive experiments validate the effectiveness of the proposed method. It enables a more comprehensive understanding of the given image by integrating the neural representation and neural symbolic representation, with the state-of-the-art results being achieved on both the MSCOCO and Flickr30k datasets. Besides, the proposed neural symbolic representation is demonstrated to better generalize to other domains with significant performance improvements compared with existing methods on the cross domain image captioning task.

### G-CAM: Graph Convolution Network Based Class Activation Mapping for Multi-label Image Recognition

• Yangtao Wang
• Yanzhao Xie
• Yu Liu
• Lisheng Fan

In most multi-label image recognition tasks, human visual perception remains consistent across different spatial transforms of the same image. Existing approaches either learn the perceptual consistency with only image-level supervision or preserve the middle-level feature consistency of attention regions but neglect the (global) label dependencies between different objects over the dataset. To address this issue, we integrate a graph convolution network (GCN) and propose G-CAM, which learns visual attention consistency via GCN based class activation mapping (CAM) for multi-label image recognition. G-CAM consists of an image feature extraction module to generate the feature maps of the original image and its transformed one and a GCN module to learn weighted classifiers that capture the label dependencies between different objects. Different from previous works which use a fully-connected classification layer, G-CAM first fuses weighted classifiers with the feature vector to generate the predicted labels for each input image, then combines weighted classifiers with the feature maps to respectively obtain the transformed attention heatmaps of the original image and the attention heatmaps of its transformed one. We can compute the attention consistency loss according to the distance between these two attention heatmaps. Finally, this loss is combined with the multi-label classification loss to update the whole network in an end-to-end manner. We conduct extensive experiments on three multi-label image datasets including FLICKR25K, MS-COCO and NUS-WIDE. Experimental results demonstrate G-CAM can achieve better performance compared with the state-of-the-art multi-label image recognition methods.

### NASTER: Non-local Attentional Scene Text Recognizer

• Lei Wu
• Xueliang Liu
• Yanbin Hao
• Yunjie Ma
• Richang Hong

Scene text recognition has been widely investigated in computer vision. In the literature, the encoder-decoder based framework, which first encodes an image into a feature map and then decodes it into the corresponding text sequence, has achieved great success. However, this solution fails on low-quality images, as the local visual features extracted from curved or blurred images are difficult to decode into the corresponding text. To address this issue, we propose a new framework for Scene Text Recognition (STR), named Non-Local Attentional Scene Text Recognizer (NASTER). We use ResNet with Global Context Block (GC block) to extract global visual features. The global context information is then captured in parallel using the self-attention module and finally decoded by a multi-layer attention decoder with an intermediate supervision module. The proposed method achieves state-of-the-art performance on seven benchmark datasets, demonstrating the effectiveness of our approach.

### Few-Shot Action Localization without Knowing Boundaries

• Ting-Ting Xie
• Christos Tzelepis
• Fan Fu
• Ioannis Patras

Learning to localize actions in long, cluttered, and untrimmed videos is a hard task that in the literature has typically been addressed assuming the availability of large amounts of annotated training samples for each class -- either in a fully-supervised setting, where action boundaries are known, or in a weakly-supervised setting, where only class labels are known for each video. In this paper, we go a step further and show that it is possible to learn to localize actions in untrimmed videos when a) only one/few trimmed examples of the target action are available at test time, and b) a large collection of videos with only class label annotation (some trimmed and some weakly annotated untrimmed ones) is available for training, with no overlap between the classes used during training and testing. To do so, we propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos (trimmed or untrimmed), and uses them to generate Temporal Class Activation Maps (TCAMs) for seen or unseen classes. The TCAMs serve as temporal attention mechanisms to extract video-level representations of untrimmed videos, and to temporally localize actions at test time. To the best of our knowledge, we are the first to propose a weakly-supervised, one/few-shot action localization network that can be trained in an end-to-end fashion. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that our method achieves performance comparable to or better than state-of-the-art fully-supervised, few-shot learning methods.

### Learning Hierarchical Visual-Semantic Representation with Phrase Alignment

• Baoming Yan
• Qingheng Zhang
• Liyu Chen
• Lin Wang
• Leihao Pei
• Jiang Yang
• Enyun Yu
• Xiaobo Li
• Binqiang Zhao

Effective visual-semantic representation is critical to the image-text matching task. Various methods are proposed to develop image representation with more semantic concepts and a lot of progress has been achieved. However, the internal hierarchical structure in both image and text, which could effectively enhance the semantic representation, is rarely explored in the image-text matching task. In this work, we propose a Hierarchical Visual-Semantic Network (HVSN) with fine-grained semantic alignment to exploit the hierarchical structure. Specifically, we first model the spatial or semantic relationship between objects and aggregate them into visual semantic concepts by the Local Relational Attention (LRA) module. Then we employ Gated Recurrent Unit (GRU) to learn relationships between visual semantic concepts and generate the global image representation. For the text part, we develop phrase features from related words, then generate text representation by learning relationships between these phrases. Besides, the model is trained with joint optimization of image-text retrieval and phrase alignment task to capture the fine-grained interplay between vision and language. Our approach achieves state-of-the-art performance on Flickr30K and MS-COCO datasets. On Flickr30K, our approach outperforms the current state-of-the-art method by 3.9% relatively in text retrieval with image query and 1.3% relatively for image retrieval with text query (based on Recall@1). On MS-COCO, our HVSN improves image retrieval by 2.3% relatively and text retrieval by 1.2% relatively. Both quantitative and visual ablation studies are provided to verify the effectiveness of the proposed modules.

### Social Relation Analysis from Videos via Multi-entity Reasoning

• Chenghao Yan
• Zihe Liu
• Fangtao Li
• Chenyu Cao
• Zheng Wang
• Bin Wu

Videos contain rich semantic information. Analyzing social relations in video semantics can help machines interpret the behavior of human beings. However, most of the work related to social relationship recognition is based on still images, while video-based social relationship analysis has received less attention. Here we propose a Multi-entity Relation Reasoning (MRR) framework that can be used for recognizing or predicting social relations in videos. To capture temporal features and contextual cues in videos, and use richer information to represent the person in the video, we track each person's appearance timeline and design a multi-entity representation method to build a social relationship knowledge graph. Then we use graph attention networks to gather information from the entity's neighborhood. In addition, because situation information is helpful for identifying relationships, we design a situation information extraction module to generate a situation embedding from the video clip. Finally, a decoder is adopted to predict relationships between character entities. We evaluate the model on the MovieGraphs dataset and verify the effectiveness of the proposed framework.

### Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning

• Kun Yan
• Zied Bouraoui
• Ping Wang
• Shoaib Jameel
• Steven Schockaert

Few-shot learning (FSL) is the task of learning to recognize previously unseen categories of images from a small number of training examples. This is a challenging task, as the available examples may not be enough to unambiguously determine which visual features are most characteristic of the considered categories. To alleviate this issue, we propose a method that additionally takes into account the names of the image classes. While the use of class names has already been explored in previous work, our approach differs in two key aspects. First, while previous work has aimed to directly predict visual prototypes from word embeddings, we found that better results can be obtained by treating visual and text-based prototypes separately. Second, we propose a simple strategy for learning class name embeddings using the BERT language model, which we found to substantially outperform the GloVe vectors that were used in previous work. We furthermore propose a strategy for dealing with the high dimensionality of these vectors, inspired by models for aligning cross-lingual word embeddings. We provide experiments on miniImageNet, CUB and tieredImageNet, showing that our approach consistently improves the state-of-the-art in metric-based FSL.

### TEACH: Attention-Aware Deep Cross-Modal Hashing

• Hong-Lei Yao
• Yu-Wei Zhan
• Zhen-Duo Chen
• Xin Luo
• Xin-Shun Xu

Hashing methods for cross-modal retrieval have recently been widely investigated due to the explosive growth of multimedia data. Generally, real-world data is imperfect and contains some degree of redundancy, making the cross-modal retrieval task challenging. However, most existing cross-modal hashing methods fail to deal with the redundancy, leading to unsatisfactory performance on such data. In this paper, to address this issue, we propose a novel cross-modal hashing method, namely aTtEntion-Aware deep Cross-modal Hashing (TEACH). It can perform feature learning and hash-code learning simultaneously. Moreover, with attention modules designed for the different modalities, one for each, TEACH can effectively highlight the useful information in the data while suppressing the redundant information. Extensive experiments on benchmark datasets demonstrate that our method outperforms some state-of-the-art hashing methods in cross-modal retrieval tasks.

### Scene Text Recognition with Cascade Attention Network

• Min Zhang
• Meng Ma
• Ping Wang

Scene text recognition (STR) has experienced increasing popularity both in academia and in industry. Regarding STR as a sequence prediction task, most state-of-the-art (SOTA) approaches employ the attention-based encoder-decoder architecture to recognize texts. However, these methods still struggle to localize the precise alignment center associated with the current character, which is known as the attention drift phenomenon. One major reason is that directly converting low-quality or distorted word images to sequential features may introduce confusing information and thus mislead the network. To address the problem, this paper proposes a cascade attention network. The model is composed of three novel attention modules: a vanilla attention module that attends to sequential features from the horizontal direction, a cross-network attention module to take advantage of both one-dimension contextual information and two-dimension visual distributions, and an aspects fusion attention module to fuse spatial and channel-wise information. Accordingly, the network manages to yield distinguished and refined representations correlated to the target sequence. Compared to SOTA methods, experimental results on seven benchmarks demonstrate the superiority of our framework in recognizing scene texts under various conditions.

### Multi-Attention Audio-Visual Fusion Network for Audio Spatialization

• Wen Zhang
• Jie Shao

In our daily life, we are exposed to a large number of video files. Compared with video containing only mono audio, video with stereo can provide us with better audio-visual experience. However, a large number of ordinary users do not have professional equipment to record videos with high-quality stereo. In order to make it more convenient for users to obtain videos with stereo, we propose an effective method to convert mono audio in the video into stereo. One of the keys to this task is how to effectively inject visual information extracted from video frames into the audio signal. We design a novel multi-attention fusion network (MAFNet) based on the self-attention mechanism to extract the spatial features related to the sound source in the video frames and fuse them into audio features well. Furthermore, in order to obtain stereo with higher quality, we design an additional iterative structure which can refine and optimize the generated stereo sound by several iterations. Our proposed approach is validated on two challenging video datasets (FAIR-Play and YT-MUSIC), and achieves new state-of-the-art performance.

• Feng Zhao
• Donglin Wang
• Xintao Xiang

• Xinzhe Zhou

### Joint Hand-Object Pose Estimation with Differentiably-Learned Physical Contact Point Analysis

• Nan Zhuang

Hand-object pose estimation aims to jointly estimate 3D poses of hands and the held objects. During the interaction between hands and objects, the position and motion of keypoints in hands and objects are tightly related and there naturally exist some physical restrictions, which are usually ignored by most previous methods. To address this issue, we propose a learnable physical affinity loss to regularize the joint estimation of hand and object poses. The physical constraints mainly focus on enhancing the stability of grasping, which is the most common interaction manner between hands and objects. Together with the physical affinity loss, a context-aware graph network is also proposed to jointly learn independent geometry prior and interaction messages. The whole pipeline consists of two components. First, an image encoder is used to predict 2D keypoints from an RGB image, and then a contextual graph module is designed to convert the 2D keypoints into 3D estimations. Our graph module treats the keypoints of hands and objects as two sub-graphs and estimates initial 3D coordinates according to their topology structure separately. Then the two sub-graphs are merged into a whole graph to capture the interaction information and further refine the 3D estimation results. Experimental results show that both our physical affinity loss and our context-aware graph network can effectively capture the relationship and improve the accuracy of 3D pose estimation.

### HINFShot: A Challenge Dataset for Few-Shot Node Classification in Heterogeneous Information Network

• Zifeng Zhuang
• Xintao Xiang
• Siteng Huang
• Donglin Wang

Few-shot learning aims to generalize to novel classes. It has achieved great success in image and text classification tasks. Inspired by such success, few-shot node classification in homogeneous graphs has attracted much attention, but few works have studied this problem in Heterogeneous Information Networks (HIN) so far. We consider few-shot learning in HIN and study a pioneering problem, HIN Few-Shot Node Classification (HIN-FSNC), that aims to generalize from node types with sufficient labeled samples to unseen node types with only a few labeled samples. However, existing HIN datasets contain just one labeled node type, which means they cannot meet the setting of unseen node types. To facilitate the investigation of HIN-FSNC, we propose a large-scale academic HIN dataset called HINFShot. It contains 1,235,031 nodes with four node types (author, paper, venue, institution), and all the nodes, regardless of node type, are divided into 80 classes. Finally, we conduct extensive experiments on HINFShot, and the results indicate that identifying novel classes of unseen node types in HIN-FSNC is a significant challenge.

## SESSION: Short Research Papers

### Learning to Select: A Fully Attentive Approach for Novel Object Captioning

• Marco Cagrandi
• Marcella Cornia
• Matteo Stefanini
• Lorenzo Baraldi
• Rita Cucchiara

Image captioning models have lately shown impressive results when applied to standard datasets. Switching to real-life scenarios, however, constitutes a challenge due to the larger variety of visual concepts which are not covered in existing training sets. For this reason, novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during the training phase. In this paper, we present a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly. Our architecture is fully-attentive and end-to-end trainable, also when incorporating constraints. We perform experiments on the held-out COCO dataset, where we demonstrate improvements over the state of the art, both in terms of adaptability to novel objects and caption quality.

### Semi-supervised Many-to-many Music Timbre Transfer

• Yu-Chen Chang
• Wen-Cheng Chen
• Min-Chun Hu

This work presents a music timbre transfer model that aims to transfer the style of a music clip while preserving the semantic content. Compared to the existing music timbre transfer models, our model can achieve many-to-many timbre transfer between different instruments. The proposed method is based on an autoencoder framework, which comprises two pretrained encoders trained in a supervised manner and one decoder trained in an unsupervised manner. To learn more representative features for the encoders, we produced a parallel dataset, called MI-Para, which is synthesized from MIDI files and digital audio workstations (DAW). Both the objective and the subjective evaluation results showed the effectiveness of the proposed framework. To scale up the application scenario, we also demonstrate that our model can achieve style transfer by training in a semi-supervised manner with a smaller parallel dataset.

### Text-Enhanced Attribute-Based Attention for Generalized Zero-Shot Fine-Grained Image Classification

• Yan-He Chen
• Mei-Chen Yeh

We address the generalized zero-shot fine-grained image classification problem, in which classes are visually similar and training images for some classes are not available. We leverage auxiliary information in the form of textual descriptions to facilitate the task. Specifically, we propose a text-enhanced attribute-based attention mechanism to compute features from the most relevant image regions, guided by the most relevant attributes. Experiments on two popular datasets, CUB and AWA2, show the effectiveness of the proposed method.

### Spatio-Temporal Activity Detection and Recognition in Untrimmed Surveillance Videos

• Konstantinos Gkountakos
• Despoina Touska
• Konstantinos Ioannidis
• Theodora Tsikrika
• Stefanos Vrochidis
• Ioannis Kompatsiaris

This work presents a spatio-temporal activity detection and recognition framework for untrimmed surveillance videos consisting of a three-step pipeline: object detection, tracking, and activity recognition. The framework relies on the YOLO v4 architecture for object detection and on Euclidean distance for tracking, while the activity recognizer uses a 3D convolutional deep learning architecture that employs spatio-temporal boundaries and treats the task as multi-label classification. The evaluation experiments on the VIRAT dataset achieve accurate detections of the temporal boundaries and recognitions of activities in untrimmed videos, with better performance for the multi-label compared to the multi-class activity recognition.
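As an illustration of the tracking step, one simple way to associate detections across frames by Euclidean distance is a greedy nearest-centroid match. The sketch below is only an assumption about how such a step could look; the function name `match_by_euclidean_distance`, the centroid representation, and the `max_dist` gate are hypothetical and not taken from the paper.

```python
import numpy as np

def match_by_euclidean_distance(prev_centroids, new_centroids, max_dist=50.0):
    """Greedily match each new detection to the nearest unused previous track.

    Returns a list where entry j is the index of the previous track assigned
    to new detection j, or -1 when no free track lies within max_dist.
    (Hypothetical helper; the gate max_dist is an illustrative assumption.)
    """
    prev = np.asarray(prev_centroids, dtype=float).reshape(-1, 2)
    new = np.asarray(new_centroids, dtype=float).reshape(-1, 2)
    assignment = [-1] * len(new)
    if len(prev) == 0 or len(new) == 0:
        return assignment

    # Pairwise distances: dist[i, j] = ||prev[i] - new[j]||
    dist = np.linalg.norm(prev[:, None, :] - new[None, :, :], axis=2)

    used_prev = set()
    # Visit candidate pairs from closest to farthest.
    for flat_index in np.argsort(dist, axis=None):
        i, j = (int(k) for k in np.unravel_index(flat_index, dist.shape))
        if dist[i, j] > max_dist:
            break  # all remaining pairs are even farther apart
        if i in used_prev or assignment[j] != -1:
            continue  # track or detection already matched
        assignment[j] = i
        used_prev.add(i)
    return assignment
```

Detections left at `-1` would start new tracks in a full tracker; an optimal assignment (for example the Hungarian algorithm) is a common alternative to the greedy pass used here.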

• Haifan Gong
• Guanqi Chen
• Sishuo Liu
• Yizhou Yu
• Guanbin Li

Due to the severe lack of labeled data, existing methods of medical visual question answering usually rely on transfer learning to obtain effective image feature representation and use cross-modal fusion of visual and linguistic features to achieve question-related answer prediction. These two phases are performed independently and without considering the compatibility and applicability of the pre-trained features for cross-modal fusion. Thus, we reformulate image feature pre-training as a multi-task learning paradigm and witness its extraordinary superiority, forcing it to take into account the applicability of features for the specific image comprehension task. Furthermore, we introduce a cross-modal self-attention (CMSA) module to selectively capture the long-range contextual relevance for more effective fusion of visual and linguistic features. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods. Our code and models are available at https://github.com/haifangong/CMSA-MTPT-4-MedicalVQA.

### Body Shape Calculator: Understanding the Type of Body Shapes from Anthropometric Measurements

• Shintami Chusnul Hidayati
• Yeni Anistyasari

Human body shape, which describes the contours of the body figure as well as the distribution of muscles and fat, contains a rich source of information, from health issues to aesthetic presentation of fashion styles. However, most of the existing methods for estimating body types are derived from subjective measures, which are susceptible to multiple biases. Determining the type of body shapes is still a challenging analytical task, for which open questions remain regarding good feature representation and classification methods, given noisy and imbalanced real-world data. In this work, we propose a novel body type recognition framework based on anthropometric measurements, which integrates label filtering and pseudo-feature synthesis modules. Label filtering is proposed to identify and filter out potentially noisy labels during classifier training, while pseudo-feature is generated to improve feature representation. Experimental results on the collected dataset from online feeds demonstrate the effectiveness of the approach compared to the state-of-the-art baselines.

### Unsupervised Video Summarization via Multi-source Features

• Hussain Kanafani
• Junaid Ahmed Ghauri
• Sherzod Hakimov
• Ralph Ewerth

Video summarization aims at generating a compact yet representative visual summary that conveys the essence of the original video. The advantage of unsupervised approaches is that they do not require human annotations to learn the summarization capability and generalize to a wider range of domains. Previous work relies on the same type of deep features, typically based on a model pre-trained on ImageNet data. Therefore, we propose to incorporate multiple feature sources with chunk and stride fusion to provide more information about the visual content. For a comprehensive evaluation on the two benchmarks TVSum and SumMe, we compare our method with four state-of-the-art approaches. Two of these approaches were implemented by ourselves to reproduce the reported results. Our evaluation shows that we obtain state-of-the-art results on both datasets while also highlighting the shortcomings of previous work with regard to the evaluation methodology. Finally, we perform error analysis on videos for the two benchmark datasets to summarize and spot the factors that lead to misclassifications.

### Evaluating Contrastive Models for Instance-based Image Retrieval

• Tarun Krishna
• Kevin McGuinness
• Noel O'Connor

In this work, we evaluate contrastive models for the task of image retrieval. We hypothesise that models that learn to encode semantic similarity among instances via discriminative learning should perform well on the task of image retrieval, where relevancy is defined in terms of instances of the same object. Through our extensive evaluation, we find that representations from models trained using contrastive methods perform on par with (and in some cases outperform) a pre-trained supervised baseline trained on the ImageNet labels in retrieval tasks under various configurations. This is remarkable given that the contrastive models require no explicit supervision. Thus, we conclude that these models can be used to bootstrap base models to build more robust image retrieval engines.

• Xiaocheng Lu
• Yuan Yuan
• Qi Wang

For license plate detection (LPD), most of the existing work is based on images as input. If these algorithms can be applied to multiple frames or videos, they can be adapted to more complex unconstrained scenes. In this paper, we propose an LPD framework for detecting license plates in multiple frames or videos, called AWFA-LPD, which effectively integrates the features of nearby frames. Compared with image-based detection models, our network integrates an optical flow extraction module, which can propagate the features of local frames and fuse them with the reference frame. Moreover, we concatenate a non-link suppression module after the detection results to post-process the bounding boxes. Extensive experiments demonstrate the effectiveness and efficiency of our framework.

### NMS-Loss: Learning with Non-Maximum Suppression for Crowded Pedestrian Detection

• Zekun Luo
• Zheng Fang
• Sixiao Zheng
• Yabiao Wang
• Yanwei Fu

Non-Maximum Suppression (NMS) is essential for object detection and affects the evaluation results by incorporating False Positives (FP) and False Negatives (FN), especially in crowd occlusion scenes. In this paper, we raise the problem of the weak connection between the training targets and the evaluation metrics caused by NMS and propose a novel NMS-Loss that makes the NMS procedure trainable end-to-end without any additional network parameters. Our NMS-Loss punishes two cases: when an FP is not suppressed and when an FN is wrongly eliminated by NMS. Specifically, we propose a pull loss to pull predictions with the same target close to each other, and a push loss to push predictions with different targets away from each other. Experimental results show that with the help of NMS-Loss, our detector, namely NMS-Ped, achieves impressive results with a Miss Rate of 5.92% on the Caltech dataset and 10.08% on the CityPersons dataset, which are both better than state-of-the-art competitors.
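For readers unfamiliar with the post-processing step this abstract targets, the following is a minimal sketch of generic greedy NMS. It is the standard procedure, not the paper's NMS-Loss; the function names, the `(x1, y1, x2, y2)` box layout, and the default threshold are illustrative assumptions.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)  # candidate indices, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the kept box too much.
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```

Because suppression depends only on overlap, a correctly detected pedestrian standing close to a higher-scoring neighbour can be removed, which is the kind of NMS-induced false negative the abstract refers to.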

### Image Retrieval by Hierarchy-aware Deep Hashing Based on Multi-task Learning

• Bowen Wang
• Liangzhi Li
• Yuta Nakashima
• Takehiro Yamamoto
• Hiroaki Ohshima
• Yoshiyuki Shoji
• Kenro Aihara
• Noriko Kando

Deep hashing has been widely used to approximate nearest-neighbor search for image retrieval tasks. Most such methods are trained with image-label pairs without any inter-label relationship, which may not make full use of the real-world data. This paper presents a deep hashing method, named HA2SH, that leverages multiple types of labels with hierarchical structures that an ethnological museum assigns to its artifacts. We experimentally show that HA2SH can learn to generate hashes that give a better retrieval performance. Our code is available at https://github.com/wbw520/minpaku.
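To make the retrieval side of hashing concrete, the sketch below ranks database items by Hamming distance to a query code. It shows only the generic lookup step that any binary hashing method relies on; how HA2SH produces its codes is not shown, and the function name and the 0/1 array layout are assumptions.

```python
import numpy as np

def hamming_rank(query_code, database_codes):
    """Rank database items by Hamming distance to the query.

    query_code: 1-D array of 0/1 bits.
    database_codes: 2-D array of 0/1 bits, one row per item.
    Returns (indices sorted by increasing distance, the sorted distances).
    """
    query = np.asarray(query_code, dtype=np.uint8)
    database = np.asarray(database_codes, dtype=np.uint8)
    # Hamming distance = number of bit positions that differ.
    distances = np.count_nonzero(database != query, axis=1)
    order = np.argsort(distances, kind="stable")  # ties keep database order
    return order, distances[order]
```

In practice the bits are usually packed into machine words and compared with XOR and popcount, which is what makes hash-based approximate nearest-neighbor search fast.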

### Weakly Supervised Sketch Based Person Search

• Lan Yan
• Wenbo Zheng
• Fei-Yue Wang
• Chao Gou

Person search often requires a query photo of the target person. However, in many practical scenarios, there is no guarantee that such a photo is always available. In this paper, we define the problem of sketch based person search, which uses a sketch instead of a photo as the probe for retrieval. We tackle this problem in a weak supervision setting and propose a clustering and feature attention based weakly supervised learning framework, which contains two stages: pedestrian detection and sketch based person re-identification. Specifically, we introduce multiple detectors, followed by fuzzy c-means clustering, to achieve weakly supervised pedestrian detection. Moreover, we design an attention module to learn discriminative features in the subsequent re-identification network. Extensive experiments show the superiority of our method.

### Personal Knowledge Base Construction from Multimodal Data

• An-Zi Yen
• Chia-Chung Chang
• Hen-Hsen Huang
• Hsin-Hsi Chen

With the passage of time, people often have misty memories of their past experiences. Information recall support for people by collecting personal lifelogs is emerging. Recently, people tend to record their daily life via filming Video Weblog (VLog), which contains visual and audio data. These large scale multimodal data can be used to support information recall service that enables users to query their past experiences. The challenging issue is the semantic gap between the visual concept and the textual query. In this paper, we aim to extract personal life events from vlogs shared on YouTube and construct a personal knowledge base (PKB) for individuals. A multitask learning model is proposed to extract the components of personal life events, such as subjects, predicates and objects. The evaluation is performed on a video collection from three YouTubers who are English native speakers. Experimental results show our model achieves promising performance.

### 2.5D Pose Guided Human Image Generation

• Kang Yuan
• Sheng Li

In this paper, we propose a 2.5D pose guided human image generation method that integrates depth information with 2D poses. Given a target 2.5D pose and an image of a person, our method generates a new image of that person with the target pose. To incorporate depth information into the pose structure, we design a three-layer pose space that allows more accurate pose transfer than the regular 2D pose structure. Specifically, our pose space enables the generative models to address the occlusion problems that commonly occur in human image generation and also helps recognize spatial front-back relations of limbs. Extensive quantitative and qualitative results on the DeepFashion and Human 3.6M datasets demonstrate the effectiveness of our method.

### Collaborative Representation for Deep Meta Metric Learning

• Min Zhu
• Weifeng Liu
• Kai Zhang
• Ye Li
• Peng Liu
• Baodi Liu

Most metric learning methods utilize all training data to construct a single metric, which usually over-fits to the "salient" feature. To overcome this issue, we propose a deep meta metric learning method based on collaborative representation. We construct multiple episodes from the original training data to train a general metric, where each episode consists of a query set and a support set. Then, we introduce a collaborative representation method, which fits the query sample with the support samples per class. We predict the query sample's label via the optimal fitness between the query sample and the support samples in each specific class. Besides, we adopt a hard mining strategy to learn a more discriminative metric by increasing the training tasks' difficulty. Experiments verify that our method achieves state-of-the-art results on three re-ID benchmark datasets.
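One common reading of "fitting the query sample with the support samples per class" is a ridge-regularized least-squares fit, with the label chosen by the smallest reconstruction residual. The sketch below follows that reading as an assumption; the regularizer `reg`, the residual criterion, and the function name are illustrative and may differ from the paper's formulation.

```python
import numpy as np

def collaborative_representation_predict(query, support, labels, reg=0.1):
    """Predict a label by per-class ridge reconstruction of the query.

    query: 1-D embedding of the query sample.
    support: 2-D array, one support embedding per row.
    labels: class label of each support row.
    Returns (predicted label, reconstruction residual for that class).
    """
    query = np.asarray(query, dtype=float)
    support = np.asarray(support, dtype=float)
    labels = np.asarray(labels)

    best_label, best_residual = None, np.inf
    for label in np.unique(labels):
        # Columns of A are the support embeddings of this class.
        A = support[labels == label].T
        # Closed-form ridge solution: (A^T A + reg * I)^-1 A^T q
        gram = A.T @ A + reg * np.eye(A.shape[1])
        coeffs = np.linalg.solve(gram, A.T @ query)
        residual = np.linalg.norm(query - A @ coeffs)
        if residual < best_residual:
            best_label, best_residual = label, residual
    return best_label, best_residual
```

The closed-form solve keeps the fit differentiable with respect to the embeddings, which is what would allow such a step to sit inside episode-based training of the metric.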

## SESSION: Brave New Idea

### Ten Questions in Lifelog Mining and Information Recall

• An-Zi Yen
• Hen-Hsen Huang
• Hsin-Hsi Chen

With the advance of science and technology, people are used to recording their daily life events via writing blogs, uploading social media posts, taking photos, or filming videos. Such a rich repository of personal information is useful for supporting human living assistance, such as an information recall service. The main challenges are how to store and manage personal knowledge from various sources, and how to provide support for people who may have difficulty recalling past experiences. In this position paper, we propose a research agenda on personal knowledge mining from various sources of lifelogs, personal knowledge base construction, and information recall for assisting people to recall their experiences. Ten research questions are formulated.

## SESSION: Challenge Papers

### Bag of Tricks for Building an Accurate and Slim Object Detector for Embedded Applications

• Yongkun Du
• Zhineng Chen
• Caiyan Jia
• Xuanya Li
• Yu-Gang Jiang

Object detection is an essential computer vision task with extensive application prospects in on-road applications. Many novel methods have been proposed in this branch recently. However, the majority of them have high computational cost, making them difficult to deploy on embedded devices. In this paper, taking YOLOv5s, the smallest model in the YOLOv5 family, as the baseline, we explore a bag of tricks that improve the detection performance for a specified on-road application without increasing the computational cost of YOLOv5s. Specifically, we introduce relevant external data to deal with the problem of sample imbalance. Meanwhile, knowledge distillation is employed to transfer knowledge from a cumbersome model to a compact model, where a united distillation scheme is developed to enhance the effectiveness. In addition, a pseudo-label based training strategy is utilized to further learn from the biggest YOLOv5 model. We have applied the above tricks to the Embedded Deep Learning Object Detection Model Compression Competition for Traffic in Asian Countries held in conjunction with ICMR 2021. The experiments have shown that all the tricks are useful. Their combination yields an accurate and slim detection model. It is highly competitive and ranked 2nd in the competition. We believe the tricks are also meaningful for building other application-oriented object detectors.

### Efficient-ROD: Efficient Radar Object Detection based on Densely Connected Residual Network

• Chih-Chung Hsu
• Chieh Lee
• Lin Chen
• Min-Kai Hung
• Yu-Lun Lin
• Xian-Yu Wang

### DANet: Dimension Apart Network for Radar Object Detection

• Bo Ju
• Wei Yang
• Jinrang Jia
• Xiaoqing Ye
• Qu Chen
• Xiao Tan
• Hao Sun
• Yifeng Shi
• Errui Ding

In this paper, we propose a dimension apart network (DANet) for radar object detection task. A Dimension Apart Module (DAM) is first designed to be lightweight and capable of extracting temporal-spatial information from the RAMap sequences. To fully utilize the hierarchical features from the RAMaps, we propose a multi-scale U-Net style network architecture termed DANet. Extensive experiments demonstrate that our proposed DANet achieves superior performance on the radar detection task at much less computational cost, compared to previous pioneer works. In addition to the proposed novel network, we also utilize a vast amount of data augmentation techniques. To further improve the robustness of our model, we ensemble the predicted results from a bunch of lightweight DANet variants. Finally, we achieve 82.2% on average precision and 90% on average recall of object detection performance and rank at 1st place in the ROD2021 radar detection challenge. Our code is available at: https://github.com/jb892/ROD2021_Radar_Detection_Challenge_Baidu.

### Object Detection on Embedded Systems for Traffic in Asian Countries

• Bao-Hong Lai
• Hsun-Ping Hsieh

In this paper, we present a novel embedded deep learning solution for traffic object detection. Considering the memory, computing speed, and environmental requirements of the MediaTek Dimensity 1000 Series embedded device, it is worth mentioning that we are the first to re-implement and propose an efficient object detection algorithm based on You Only Look Once version 5 (YOLOv5) in the TensorFlow framework to address this issue. The backbone of our network is mainly constructed from Cross Stage Partial (CSP) modules, which significantly boost the accuracy of our model and keep the model lightweight. Besides, to enhance the prediction effectiveness, we propose to combine the official training dataset and several external open datasets as our comprehensive training data. We also adopt multiple data augmentation techniques in the training phase, making the model learn a stronger feature extraction ability for various object categories. According to the results of extensive experiments and the final competition scores, our solution achieves reasonable performance with a low parameter count and low complexity. Our team is the third-place winner in the Embedded Deep Learning Object Detection Model Compression Competition at the ACM International Conference on Multimedia Retrieval 2021.

### Squeeze-and-Excitation network-Based Radar Object Detection With Weighted Location Fusion

• Pengliang Sun
• Xuetong Niu
• Pengfei Sun
• Kele Xu

Radar object detection refers to identifying objects from radar data, and the topic has received increasing interest in recent years, due to the appealing properties of radar imaging and its evident applications. However, the detection performance relies heavily on semantic information extraction, which is a great challenge in practical settings. Moreover, although remarkable progress has been made, most previous attempts are constrained by the inherently limited properties of the single modality employed. Inspired by the recent success of cross-modality deep learning, we propose a novel cross-modality deep learning framework for the radar object detection task using the Squeeze-and-Excitation network, aiming to provide more powerful feature representation. Moreover, a novel noisy detection approach is also explored in our study, to increase the model's ability to handle noise. Finally, a novel weighted location fusion strategy is introduced in our framework, to improve the detection performance further. To empirically investigate the effectiveness of the proposed framework, we conduct extensive experiments on the 2021 ICMR ROD challenge. The obtained results suggest that our framework outperforms related approaches. Our method ranks 3rd on the final leaderboard, with an average precision (AP) percentage of 76.1. Models and codes are available at https://github.com/sunpengliang/modelConfusion.

### ROD2021 Challenge: A Summary for Radar Object Detection Challenge for Autonomous Driving Applications

• Yizhou Wang
• Jenq-Neng Hwang
• Gaoang Wang
• Hui Liu
• Kwang-Ju Kim
• Hung-Min Hsu
• Jiarui Cai
• Haotian Zhang
• Zhongyu Jiang
• Renshu Gu

The Radar Object Detection 2021 (ROD2021) Challenge, held in the ACM International Conference on Multimedia Retrieval (ICMR) 2021, has been introduced to detect and classify objects purely using an FMCW radar for autonomous driving applications. As a robust sensor to all-weather conditions, radar has rich information hidden in the radio frequencies, which can potentially achieve object detection and classification. This insight will provide a new object perception solution for an autonomous vehicle even in adverse driving scenarios. The ROD2021 Challenge is the first public benchmark focusing on this topic, which attracts great attention and participation. There are more than 260 participants among 37 teams from more than 10 countries with different academic and industrial affiliations, contributing about 300 submissions in the first phase and 400 submissions in the second phase. The final performance is evaluated by average precision (AP). Results add strong value and a better understanding of the radar object detection task for the autonomous vehicle community.

### Embedded YOLO: Faster and Lighter Object Detection

• Wen-Kai Wu
• Chien-Yu Chen
• Jiann-Shu Lee

Object detection is a fundamental and very important task in computer vision. Most current algorithms require high computing resources, which hinders their deployment on embedded systems. In this research, we propose a neural network model named Embedded YOLO to solve this problem. We propose the DSC_CSP module to replace the middle layers of YOLOv5s to reduce the number of model parameters. On the other hand, in order to avoid the decrease of performance due to the reduction of parameters, we utilize knowledge distillation to maintain performance. To make good use of the information provided by data augmentation, we propose a new method called Dynamic Interpolation Mosaic to improve the original Mosaic. Due to serious imbalance in the number of samples of different data types, we employ a two-stage training scheme to overcome the data imbalance problem. The proposed model achieved the best results in the ICMR2021 Grand Challenge PAIR Competition with 0.59 mAP, a model size of 12MB, and 41 FPS on MediaTek's Dimensity 1000 platform. These results confirm that the proposed model is suitable for deployment in embedded systems for the object detection task.

### Radar Object Detection Using Data Merging, Enhancement and Fusion

• Jun Yu
• Xinlong Hao
• Xinjian Gao
• Qiang Sun
• Yuyu Liu
• Peng Chang
• Zhong Zhang
• Fang Gao
• Feng Shuang

Compared to visible images, radar images are generally considered to be an active and robust solution, even in adverse driving situations, for object detection. However, the accuracy of radar object detection (ROD) is always poor. By taking full advantage of data merging, enhancement and fusion, this paper proposes an effective ROD system with only radar images as the input. First, an aggregation module is designed to merge the data from all chirps in the same frame. Then, various Gaussian noises with different parameters are employed to increase data diversity and reduce over-fitting based on the analysis of training data. Moreover, because inference with default parameters is not accurate enough, some hyperparameters are changed to increase the accuracy. Finally, a combination strategy is adopted to benefit from multi-model fusion. ROD2021 Challenge is supported by ACM ICMR 2021, and our team (ustc-nelslip) ranked 2nd in the test stage of this challenge. Diverse evaluations also verify the superiority of the proposed system.

### Scene-aware Learning Network for Radar Object Detection

• Zangwei Zheng
• Xiangyu Yue
• Kurt Keutzer
• Alberto Sangiovanni Vincentelli

Object detection is essential to safe autonomous or assisted driving. Previous works usually utilize RGB images or LiDAR point clouds to identify and localize multiple objects in self-driving. However, cameras tend to fail in bad driving conditions, e.g. bad weather or weak lighting, while LiDAR scanners are too expensive to get widely deployed in commercial applications. Radar has been drawing more and more attention due to its robustness and low cost. In this paper, we propose a scene-aware radar learning framework for accurate and robust object detection. First, the learning framework contains branches conditioned on the scene category of the radar sequence, with each branch optimized for a specific type of scene. Second, three different 3D autoencoder-based architectures are proposed for radar object detection, and ensemble learning is performed over the different architectures to further boost the final performance. Third, we propose novel scene-aware sequence mix augmentation (SceneMix) and scene-specific post-processing to generate more robust detection results. In the ROD2021 Challenge, we achieved a final result of an average precision of 75.0% and an average recall of 81.0%. Moreover, in the parking lot scene, our framework ranks first with an average precision of 97.8% and an average recall of 98.6%, which demonstrates the effectiveness of our framework.

## SESSION: Conflict of Interest Papers

### GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization

• Jia-Hong Huang
• Luka Murn
• Marta Mrak
• Marcel Worring

Traditional video summarization methods generate fixed video representations regardless of user interest. Therefore such methods limit users' expectations in content search and exploration scenarios. Multi-modal video summarization is one of the methods utilized to address this problem. When multi-modal video summarization is used to help video exploration, a text-based query is considered as one of the main drivers of video summary generation, as it is user-defined. Thus, encoding both the text-based query and the video effectively is important for the task of multi-modal video summarization. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle this task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. Based on the evaluation of the existing multi-modal video summarization benchmark, experimental results show that the proposed model is effective with the increase of +5.88% in accuracy and +4.06% increase of F1-score, compared with the state-of-the-art method. https://github.com/Jhhuangkay/GPT2MVS-Generative-Pre-trained-Transformer....

### Impact of Interaction Strategies on User Relevance Feedback

• Omar Shahbaz Khan
• Björn Þór Jónsson
• Jan Zahálka
• Stevan Rudinac
• Marcel Worring

User Relevance Feedback (URF) is a class of interactive learning methods that rely on the interaction between a human user and a system to analyze a media collection. To improve URF system evaluation and design better systems, it is important to understand the impact that different interaction strategies can have. Based on the literature and observations from real user sessions from the Lifelog Search Challenge and Video Browser Showdown, we analyze interaction strategies related to (a) labeling positive and negative examples, and (b) applying filters based on users' domain knowledge. Experiments show that there is no single optimal labeling strategy, as the best strategy depends on both the collection and the task. In particular, our results refute the common assumption that providing more training examples is always beneficial: strategies with a smaller number of prototypical examples lead to better results in some cases. We further observe that while expert filtering is unsurprisingly beneficial, aggressive filtering, especially by novice users, can hinder the completion of tasks. Finally, we observe that combining URF with filters leads to better results than using filters alone.

## SESSION: Demonstrations

### Automatic Baseball Pitch Overlay

• Ting-Hsuan Chou
• Wei-Ta Chu

To provide a rich viewing experience and assist pitcher training, we propose an automatic baseball pitch overlay system in this paper. Given multiple pitching video sequences, this system detects and tracks the ball to construct ball trajectories. Because of occlusion, motion blur, and background noise, the ball usually cannot be detected successfully. We propose a series of processes like initial compensation and polynomial fitting to construct complete trajectories. To make the overlay results more appealing, different sequences are weighted differently, and different trajectories are intentionally drawn in different colors. We believe this would be the first fully-automatic pitch overlay system that only takes pitching videos as inputs. Source code is at https://github.com/chonyy/ML-auto-baseball-pitching-overlay.
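
The polynomial-fitting step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `complete_trajectory`, the polynomial degree, and the sample detections are all assumptions made for illustration. The idea shown is fitting each image coordinate against the frame index and evaluating the fit on the frames where detection failed.

```python
import numpy as np

def complete_trajectory(frames, xs, ys, degree=2):
    """Fit x(t) and y(t) polynomials to the detected ball positions and
    evaluate them on every frame between the first and last detection.
    The degree is an assumed, illustrative choice."""
    frames = np.asarray(frames, dtype=float)
    # Least-squares polynomial fit of each image coordinate against frame index.
    coeff_x = np.polyfit(frames, np.asarray(xs, dtype=float), degree)
    coeff_y = np.polyfit(frames, np.asarray(ys, dtype=float), degree)
    all_frames = np.arange(int(frames.min()), int(frames.max()) + 1)
    return all_frames, np.polyval(coeff_x, all_frames), np.polyval(coeff_y, all_frames)

# Hypothetical detections: the ball was missed in frames 3 and 4.
detected_frames = [0, 1, 2, 5, 6, 7]
detected_x = [10.0 * t for t in detected_frames]
detected_y = [0.5 * t**2 - 2.0 * t + 100.0 for t in detected_frames]

all_frames, fit_x, fit_y = complete_trajectory(detected_frames, detected_x, detected_y)
```

Here `fit_x` and `fit_y` hold a position for every frame from 0 to 7, including the two frames without a detection.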

### Video Action Retrieval Using Action Recognition Model

• Yuko Iinuma
• Shin'ichi Satoh

In addition to video sharing services such as YouTube, the spread of video-based social networking services such as TikTok and Instagram has led to the accumulation of vast amounts of video data. In this situation, it has become important to develop a technology to efficiently retrieve necessary information from the accumulated video data. In this paper, we propose a method to retrieve similar videos by focusing on people and their actions in the videos. We apply this method to various datasets and TV videos to demonstrate its usefulness. Demonstration video: https://youtu.be/y86api-FXpU

### MeTILDA: Platform for Melodic Transcription in Language Documentation and Application

• Mitchell Lee
• Praveena Avula
• Min Chen

The Blackfoot language is an endangered language that needs to be documented, analyzed, and preserved. Blackfoot is challenging to learn and teach because it is a pitch accent language in which words with the same characters can take on different meanings when the pitch changes. Linguistics researchers are working to create visual aids, called Pitch Art, to teach the nuance in pitch changes. However, the existing techniques used to create Pitch Art fail to accurately indicate changes in pitch and require time-consuming work across multiple applications. To address this issue, this project proposes a system called MeTILDA (Melodic Transcription in Language Documentation and Application) to provide new forms of audio analysis and to automate the process of creating Pitch Art. MeTILDA provides value to a variety of stakeholders. Linguistics researchers are provided with tools to analyze and compare Blackfoot speeches. Teachers are given collections of words and recordings from native speakers to teach students. Students are given the ability to compare their own pronunciation of Blackfoot words to that of native speakers. We also present a new form of audio analysis, called perceptual scale, to provide more effective visuals of perceived changes in pitch movement. By collaborating with domain experts in this field, we have validated the effectiveness of MeTILDA in creating Pitch Art using the perceptual scale.

### IR Questioner: QA-based Interactive Retrieval System

• Rintaro Yanagi
• Ren Togo
• Takahiro Ogawa
• Miki Haseyama

Image retrieval from a given text query (text-to-image retrieval) is one of the most essential systems, and it is effectively utilized for databases (DBs) on the Web. To make them more versatile and familiar, a retrieval system that is adaptive even for personal DBs such as images in smartphones and lifelogging devices should be considered. In this paper, we present a novel text-to-image retrieval system that is specialized for personal DBs. With the cross-modal scheme and the question-answering scheme, the developed system enables users to obtain the desired image effectively even from personal DBs. Our demo is available at https://sites.google.com/view/ir-questioner/.

## SESSION: Reproducibility Paper

### Reproducibility Companion Paper: Knowledge Enhanced Neural Fashion Trend Forecasting

• Yunshan Ma
• Yujuan Ding
• Xun Yang
• Lizi Liao
• Wai Keung Wong
• Tat-Seng Chua
• Jinyoung Moon
• Hong-Han Shuai

This companion paper supports the replication of the fashion trend forecasting experiments with the KERN (Knowledge Enhanced Recurrent Network) method that we presented at ICMR 2020. We provide an artifact that allows the replication of the experiments using a Python implementation. The artifact is easy to deploy with simple installation, training and evaluation. We reproduce the experiments conducted in the original paper and obtain performance similar to that previously reported. The replication results of the experiments support the main claims in the original paper.

## SESSION: Doctoral Consortium

### A Beneficial Dual Transformation Approach for Deep Learning Networks Used in Steel Surface Defect Detection

• Fityanul Akhyar
• Chih-Yang Lin
• Gugan S. Kathiresan

Steel surface defect detection represents a challenging task in real-world practical object detection. Based on our observations, there are two critical problems which create this challenge: the tiny size and the vagueness of the defects. To solve these problems, this study proposes a deep learning-based defect detection system that uses automatic dual transformation in the end-to-end network. First, the original training images in RGB are transformed into the HSV color model to re-arrange the difference in color distribution. Second, the feature maps are upsampled using bilinear interpolation to maintain the smaller resolution. The latest state-of-the-art object detection model, High-Resolution Network (HRNet), is utilized in this system, with the initial transformation performed via data augmentation. Afterward, the output of the backbone stage is applied to the second transformation. According to the experimental results, the proposed approach increases the accuracy of the detection of class 1 Severstal steel surface defects by 3.6% versus the baseline.

### Discrete Tchebichef Transform for Versatile Video Coding

• Ka-Hou Chan
• Sio-Kei Im

The Discrete Tchebichef Transform (DTT) is a transform method based on discrete orthogonal Tchebichef polynomials, which have applications in image compression and video coding. Our method is to construct all DTT-related discrete orthogonal transforms in the required size (corresponding to the coding unit supported by H.266/VVC). To investigate the features of Tchebichef polynomials, we make use of a novel discrete orthogonal matrix generation method with determined DTT roots, and we scale and round the DTT depending on the quantization parameter instead of using integer approximation. In this way, we can obtain an accurate integer DTT matrix. Experimental results show that this method can improve the video quality and require a lower bit rate.

### Fire Detection using Transformer Network

• Kai-lung Hua

Technological breakthroughs in computing have empowered vision-based surveillance systems to detect fire using the transformer framework. Over the last few decades, convolutional neural networks (CNNs) have been broadly applied to many computer vision-related problems and provided satisfactory results. However, due to the inductive biases embedded in convolutional operations, they cannot capture long-range dependencies. Vision transformers (ViT) have recently become an alternative to CNNs for vision problems by factoring an image into a sequence of patches and leveraging intra-attention between pixels. This paper shows that ViT is a viable tool for automated fire detection by aggregating features from the whole spatial context. The proposed method is tested on benchmark fire datasets to reveal the framework's strength and effectiveness.

## SESSION: Special Session Paper

### Visible-infrared Person Re-identification with Human Body Parts Assistance

• Huangpeng Dai
• Qing Xie
• Jiachen Li
• Yanchun Ma
• Lin Li
• Yongjian Liu

Person re-identification (re-id) has received ever-increasing research focus because of its important role in video surveillance applications. This paper addresses the re-id problem between visible images from color cameras and infrared images from infrared cameras, which is significant when the appearance information is insufficient in poor illumination conditions. In this field, there are two key challenges, i.e., the difficulty of locating the discriminative information to re-identify the same person between visible and infrared images, and the difficulty of learning a robust metric for such large-scale cross-modality retrieval. In this paper, we propose a novel human body parts assistance network (BANet) to tackle the two challenges above. BANet mainly focuses on extracting discriminative information and learning robust features by leveraging the human body part cues. Extensive experiments demonstrate that the proposed approach outperforms the baseline and the state-of-the-art methods.

### Look Back Again: Dual Parallel Attention Network for Accurate and Robust Scene Text Recognition

• Zilong Fu
• Hongtao Xie
• Guoqing Jin
• Junbo Guo

Using a parallel-decoupled encoder-decoder (PDED) framework has become a trend in scene text recognition because of its flexibility and efficiency. However, due to the inconsistent information content between queries and keys in the parallel positional attention module (PPAM) used in this kind of framework (queries: position information, keys: context and position information), visual misalignment tends to appear when confronting hard samples (e.g., blurred texts, irregular texts, or low-quality images). To tackle this issue, in this paper, we propose a dual parallel attention network (DPAN), in which a newly designed parallel context attention module (PCAM) is cascaded with the original PPAM, using linguistic contextual information to compensate for the information inconsistency between queries and keys. Specifically, in PCAM, we take the visual features from PPAM as inputs and present a bidirectional language model to enhance them with linguistic contexts to produce queries. In this way, we make the information content of the queries and keys consistent in PCAM, which helps to generate more precise visual glimpses to improve the entire PDED framework's accuracy and robustness. Experimental results verify the effectiveness of the proposed PCAM, showing the necessity of keeping the information consistency between queries and keys in the attention mechanism. On six benchmarks, including regular text and irregular text, the performance of DPAN surpasses the existing leading methods by large margins, achieving new state-of-the-art performance. The code is available at https://github.com/Jackandrome/DPAN.

### Contextualized Keyword Representations for Multi-modal Retinal Image Captioning

• Jia-Hong Huang
• Ting-Wei Wu
• Marcel Worring

Medical image captioning automatically generates a medical description to describe the content of a given medical image. Traditional medical image captioning models create a medical description based on a single medical image input only. Hence, an abstract medical description or concept is hard to generate with the traditional approach. Such a method limits the effectiveness of medical image captioning. Multi-modal medical image captioning is one of the approaches utilized to address this problem. In multi-modal medical image captioning, textual input, e.g., expert-defined keywords, is considered as one of the main drivers of medical description generation. Thus, encoding both the textual input and the medical image effectively is important for the task of multi-modal medical image captioning. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed. Contextualized keyword representations, textual feature reinforcement, and masked self-attention are used to develop the proposed approach. Based on an evaluation on an existing multi-modal medical image captioning dataset, experimental results show that the proposed model is effective, with increases of +53.2% in BLEU-avg and +18.6% in CIDEr compared with the state-of-the-art method. https://github.com/Jhhuangkay/Contextualized-Keyword-Representations-for...

### MSAV: An Unified Framework for Multi-view Subspace Analysis with View Consistence

• Huibing Wang
• Guangqi Jiang
• Jinjia Peng
• Xianping Fu

With the development of the multimedia era, information is always captured with multiple views, which has caused a research upsurge in multi-view learning. It is obvious that multi-view data contain more information than single-view data. Therefore, it is crucial to develop multi-view algorithms to meet the demands of many applications. Even though some excellent multi-view algorithms have been proposed, most of them can only deal with specific problems. To tackle this problem, this paper proposes a unified framework named Multi-view Subspace Analysis with View Consistence (MSAV), which provides a unified means to extend single-view dimension reduction algorithms into multi-view versions. MSAV first extends multi-view data into kernel space to avoid the problem caused by different dimensions of the data from multiple views. Then, we introduce a self-weighted learning strategy to automatically assign weights to all views according to their importance. Finally, in order to promote the consistence of all views, the Hilbert-Schmidt Independence Criterion is adopted by MSAV. Furthermore, we conducted experiments on several benchmark datasets to verify the performance of MSAV.
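
For readers unfamiliar with the Hilbert-Schmidt Independence Criterion (HSIC), the sketch below computes the standard biased empirical estimate, trace(KHLH)/(n-1)^2, between two sets of paired samples. It only illustrates the criterion itself; how MSAV incorporates it into its objective is not described in the abstract, and the RBF kernel, the bandwidth `sigma`, and the sample data are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    """Gaussian (RBF) kernel matrix for the rows of x (assumed kernel choice)."""
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    k = rbf_kernel(x, sigma)
    l = rbf_kernel(y, sigma)
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2

# Hypothetical paired samples standing in for two views of the same items.
view_a = np.linspace(0.0, 1.0, 6).reshape(-1, 1)
view_b = view_a ** 2                      # a deterministic function of view_a
constant_view = np.ones((6, 1))           # carries no information about view_a

dependent_score = hsic(view_a, view_b)
constant_score = hsic(view_a, constant_view)
```

A larger value indicates stronger statistical dependence between the two views; a constant view yields a value of approximately zero.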

### A Tensor Sparse Representation-Based CBMIR System for Computer-Aided Diagnosis of Focal Liver Lesions and its Pilot Trial

• Jian Wang
• Xian-Hua Han
• Lanfen Lin
• Hongjie Hu
• Yen-Wei Chen

Due to the complexity of focal liver lesions, clinicians refer to diagnosed medical cases in order to make a correct diagnosis and choose appropriate treatments. It is a heavy burden, however, for medical doctors to find similar and meaningful cases in the accumulated, extremely large medical datasets. Content-based medical image retrieval (CBMIR), which searches for similar images in a large database, has been attracting increasing research interest recently. A CBMIR system provides doctors with diagnosed cases to improve diagnosis accuracy and confidence. This paper proposes a tensor sparse representation method to extract temporal and spatial features of multi-phase CT images, so as to provide doctors with medical cases more relevant to the query one. The proposed tensor sparse representation method is applied to the retrieval of focal liver lesions (FLLs). Experiments show that the proposed method achieved better retrieval performance than conventional methods. A pilot trial was conducted, and the results show that diagnosis accuracy and confidence were improved significantly by the developed CBMIR system based on the proposed method.

### M-DFNet: Multi-phase Discriminative Feature Network for Retrieval of Focal Liver Lesions

• Yingying Xu
• Jing Liu
• Lanfen Lin
• Hongjie Hu
• Ruofeng Tong
• Jingsong Li
• Yen-Wei Chen

Content-based medical image retrieval (CBMIR) plays a great role in computer-aided diagnosis by assisting radiologists in detecting and characterizing focal liver lesions (FLLs). Deep learning has achieved exciting performance on CBMIR. However, the features generated by deep learning models trained using softmax loss are separable but not discriminative enough, which is insufficient for the retrieval task. In this paper, we propose a multi-phase discriminative feature network (M-DFNet) with a DeepExtracter and a feature refine module (FRModule) to learn discriminative and separable features under the joint supervision of center loss and softmax loss. The hybrid loss minimizes intra-class variations and enlarges inter-class differences as much as possible. The FRModule is proposed to recalibrate the deep features based on the learned class centers to tackle the complex imaging manifestations of FLLs and further enhance both feature discrimination and generalization. Multi-phase computed tomography (CT) images contain pivotal information for the diagnosis of FLLs. Thus, the M-DFNet is designed to cope with multi-phase information, and we explore an appropriate and effective method for multi-phase feature integration on limited data. Experimental results clearly demonstrate the superior performance of our proposed method.
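
The joint supervision of center loss and softmax loss can be sketched as follows. This is a generic illustration of the two loss terms, not the M-DFNet code: the weighting factor `lam`, the batch-mean normalization, and the toy inputs are assumptions, and the abstract does not specify how the class centers are updated during training.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch (numerically stabilized)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def center_loss(features, labels, centers):
    """Half the mean squared distance between each feature and its class center."""
    diffs = features - centers[labels]
    return float(0.5 * (diffs**2).sum(axis=1).mean())

def joint_loss(logits, features, labels, centers, lam=0.5):
    """Softmax loss plus center loss weighted by lam (hypothetical value)."""
    return softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)

# Toy batch: two samples, two classes, two-dimensional features.
labels = np.array([0, 1])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
logits = np.zeros((2, 2))                            # uniform predictions
features_on_center = centers[labels]                 # features equal to their centers
features_off_center = features_on_center + np.array([1.0, 0.0])

loss_tight = joint_loss(logits, features_on_center, labels, centers)
loss_spread = joint_loss(logits, features_off_center, labels, centers)
```

The softmax term keeps the classes separable, while the center term penalizes the spread of features around their class center, so `loss_spread` exceeds `loss_tight` even though the logits are identical.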

### M2GUDA: Multi-Metrics Graph-Based Unsupervised Domain Adaptation for Cross-Modal Hashing

• Chengyuan Zhang
• Zhi Zhong
• Lei Zhu
• Shichao Zhang
• Da Cao
• Jianfeng Zhang

Cross-modal hashing is a critical but very challenging task that aims to retrieve similar samples of one modality via queries of other modalities. To improve unsupervised cross-modal hashing, domain adaptation techniques can be used to support unsupervised hashing learning by transferring semantic knowledge from a labeled source domain to an unlabeled target domain. However, there are two problems that cannot be ignored: (1) most domain adaptation-based research has focused on unimodal hashing or cross-modal real value-based retrieval, while the study of cross-modal hashing is limited; (2) most existing studies only consider one or two consistency constraints during domain adaptation learning. To this end, this paper proposes a novel end-to-end framework to realize unsupervised domain adaptation for cross-modal hashing. This method, dubbed M$^2$GUDA, includes four different consistency constraints for domain adaptation learning: structure consistency, domain consistency, semantic consistency and modality consistency. Besides, to enhance the structure consistency learning, we develop a multi-metrics graph modeling method to capture structure information comprehensively. Extensive experiments are performed on three commonly used benchmarks to evaluate the effectiveness of our method. The results show that our method outperforms several state-of-the-art cross-modal hashing methods.

### Human Pose Estimation based on Attention Multi-resolution Network

• Congcong Zhang
• Ning He
• Qixiang Sun
• Xiaojie Yin
• Ke Lu

Recently, multi-resolution neural networks, which combine features of different resolutions, have achieved good results in human pose estimation tasks. In this paper, we propose an attention-mechanism-based multi-resolution network, which adds an attention mechanism to the High-Resolution Network (HRNet) to enhance the feature representation of the network. It improves the ability of networks with different resolutions to extract key features from images, and causes the output to contain more effective multi-resolution representation information, so that the corresponding point positions of human joints can be estimated more accurately. We conducted experiments on the MPII and COCO datasets: verification on the MPII dataset obtained an average accuracy of 90.3% under the PCKh@0.5 evaluation standard, and good results were also achieved on the COCO dataset (with an AP of 76.5). The experimental results show that our network model is effective in improving the accuracy of key point estimation in the human pose estimation task.

## SESSION: Workshop Summaries

### ICDAR'21: Intelligent Cross-Data Analysis and Retrieval

• Minh-Son Dao
• Michael Alexander Riegler
• Duc-Tien Dang-Nguyen
• Cathal Gurrin
• Minh-Triet Tran
• Thanh-Binh Nguyen

Cross-data analytics and retrieval have improved significantly in recent years. People can now extract more data insights precisely and quickly, enabling many excellent applications that serve human lives. Since people create multimedia and other types of data that reflect the diverse perspectives of human lives, these data are just pieces of the puzzle of the world's picture. Hence, it is necessary to assemble all these pieces to arrive at better solutions for human-centered problems. The workshop therefore welcomes those who work with multimedia and other data and come from diverse research domains and disciplines to work on intelligent cross-data analytics and retrieval to bring a smart, sustainable society to human beings. The research domain can vary from well-being, disaster prevention and mitigation, and mobility to food computing, to name a few.

### Introduction to the Fourth Annual Lifelog Search Challenge, LSC'21

• Cathal Gurrin
• Björn Þór Jónsson
• Klaus Schöffmann
• Duc-Tien Dang-Nguyen
• Jakub Lokoč
• Minh-Triet Tran
• Wolfgang Hürst
• Luca Rossetto
• Graham Healy

The Lifelog Search Challenge (LSC) is an annual benchmarking challenge for comparing approaches to interactive retrieval from multi-modal lifelogs. LSC'21, the fourth challenge, attracted sixteen participants, each of which had developed interactive retrieval systems for large multimodal lifelogs. These interactive retrieval systems participated in a comparative evaluation in front of an online live-audience at the LSC workshop at ACM ICMR'21. This overview presents the motivation for LSC'21, the lifelog dataset used in the competition, and the participating systems.

### MMArt-ACM'21: International Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia 2021

• Min-Chun Hu
• Ichiro Ide
• Kensuke Tobitani

The International Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia (MMArt-ACM) solicits contributions on methodology advancement and novel applications of multimedia artworks and attractiveness computing that emerge in the era of big data and social media. The topics of the accepted papers range from an analytic topic on comic contents understanding to generative topics on image synthesis and conversion. The actual MMArt-ACM'21 Proceedings are available at: https://dl.acm.org/doi/proceedings/10.1145/3460426.

### MMPT'21: International Joint Workshop on Multi-Modal Pre-Training for Multimedia Understanding

• Bei Liu
• Jianlong Fu
• Shizhe Chen
• Qin Jin
• Alexander Hauptmann
• Yong Rui

Pre-training has been an emerging topic that provides a way to learn strong representations in many fields (e.g., natural language processing, computer vision). In the last few years, we have witnessed many research works on multi-modal pre-training which have achieved state-of-the-art performance on many multimedia tasks (e.g., image-text retrieval, video localization, speech recognition). In this workshop, we aim to gather peer researchers on related topics for more insightful discussion. We also intend to attract more researchers to explore and investigate more opportunities for designing and using innovative pre-training models for multimedia tasks.

### CEA'21: The 13th Workshop on Multimedia for Cooking and Eating Activities

• Yoko Yamakata
• Atsushi Hashimoto

This overview introduces the aim of the CEA'21 workshop, the 13th Workshop on Multimedia for Cooking and Eating Activities, and the list of papers presented in the workshop.