ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval

SESSION: Short Papers

TransPCC: Towards Deep Point Cloud Compression via Transformers

Zujie Liang
Fan Liang

Highly efficient point cloud compression (PCC) techniques are necessary for various practical 3D applications, such as autonomous driving, holographic transmission, and virtual reality. The sparse and unordered nature of point clouds makes it challenging to design frameworks for point cloud compression. In this paper, we present a new model, called TransPCC, that adopts a fully Transformer-based auto-encoder architecture for deep Point Cloud Compression. By taking the input point cloud as a set in continuous space with learnable position embeddings, we employ self-attention layers and the necessary point-wise operations for point cloud compression. The self-attention-based architecture enables our model to better learn point-wise dependency information for point cloud compression. Experimental results show that our method outperforms state-of-the-art methods on a large-scale point cloud dataset.

The Impact of Dataset Splits on Classification Performance in Medical Videos

Markus Fox
Klaus Schoeffmann

The creation of datasets in medical imaging is a central topic of research, especially with the advances of deep learning in the past decade. Publications of such datasets typically report baseline results with one or more deep neural networks in the form of established performance metrics (e.g., F1-score, Jaccard, etc.). Then, much work is done trying to beat these baseline metrics to compare different neural architectures. However, these reported metrics are almost meaningless when the underlying data does not conform to specific standards. In order to better understand what standards we need, we have reproduced and analyzed a study of four medical image classification datasets in laparoscopy. With automated frame extraction of surgical videos, we find that the resulting images are way too similar and produce high evaluation metrics by design. We show this similarity with a basic SIFT algorithm that produces high evaluation metrics on the original data. We confirm our hypothesis by creating and evaluating a video-based dataset split from the original images. The original network evaluated on the video-based split performs worse than our basic SIFT algorithm on the original data.
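
A video-based split of the kind described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name split_by_video, the (video_id, frame) pair layout, and the 20% test fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_video(samples, test_fraction=0.2, seed=0):
    """Split (video_id, frame) pairs so that no video contributes to both sets.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    frames_by_video = defaultdict(list)
    for video_id, frame in samples:
        frames_by_video[video_id].append(frame)

    # Shuffle whole videos, not individual frames, so near-duplicate frames
    # from one video cannot land on both sides of the split.
    video_ids = sorted(frames_by_video)
    random.Random(seed).shuffle(video_ids)

    n_test = max(1, round(len(video_ids) * test_fraction))
    test_ids = set(video_ids[:n_test])

    train = [(v, f) for v in video_ids if v not in test_ids
             for f in frames_by_video[v]]
    test = [(v, f) for v in video_ids if v in test_ids
            for f in frames_by_video[v]]
    return train, test
```

A frame-level random split would instead shuffle the individual pairs, which is what lets consecutive, nearly identical frames appear in both the training and the test set.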

OSCARS: An Outlier-Sensitive Content-Based Radiography Retrieval System

Xiaoyuan Guo
Jiali Duan
Saptarshi Purkayastha
Hari Trivedi
Judy Wawira Gichoya
Imon Banerjee

Improving the retrieval relevance on noisy datasets is an emerging need for the curation of a large-scale clean dataset in the medical domain. While existing methods can be applied for class-wise retrieval (a.k.a. inter-class), they cannot distinguish the granularity of likeness within the same class (a.k.a. intra-class). The problem is exacerbated on external medical datasets, where noisy samples of the same class are treated equally during training. Our goal is to identify both intra- and inter-class similarities for fine-grained retrieval. To achieve this, we propose an Outlier-Sensitive Content-based rAdiography Retrieval System (OSCARS), consisting of two steps. First, we train an outlier detector on a clean internal dataset in an unsupervised manner. Then we use the trained detector to generate the anomaly scores on the external dataset, whose distribution will be used to bin intra-class variations. Second, we propose a quadruplet (a, p, n_intra, n_inter) sampling strategy, where intra-class negatives n_intra are sampled from bins of the same class other than the bin the anchor a belongs to, while inter-class negatives n_inter are randomly sampled from other classes. We suggest a weighted metric learning objective to balance the intra- and inter-class feature learning. We experimented on two representative public radiography datasets. Experiments show the effectiveness of our approach. The training and evaluation code can be found at https://github.com/XiaoyuanGuo/oscars.
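
The quadruplet sampling rule in this abstract can be sketched as below. This is a hedged illustration rather than the OSCARS implementation: the function name sample_quadruplet and the (class_label, bin_index) layout are hypothetical, and drawing the positive from the anchor's own bin is an assumption, since the abstract does not say how positives are chosen.

```python
import random


def sample_quadruplet(samples, anchor_index, rng):
    """Return indices (a, p, n_intra, n_inter) for one anchor, or None.

    `samples` is a list of (class_label, bin_index) pairs; the bin index is
    assumed to come from binning anomaly scores. Drawing the positive from
    the anchor's own bin is an assumption of this sketch.
    """
    a_class, a_bin = samples[anchor_index]
    positives, intra_negatives, inter_negatives = [], [], []
    for i, (label, bin_index) in enumerate(samples):
        if i == anchor_index:
            continue
        if label != a_class:
            inter_negatives.append(i)   # different class
        elif bin_index != a_bin:
            intra_negatives.append(i)   # same class, different bin
        else:
            positives.append(i)         # same class, same bin
    if not (positives and intra_negatives and inter_negatives):
        return None  # this anchor cannot form a full quadruplet
    return (anchor_index, rng.choice(positives),
            rng.choice(intra_negatives), rng.choice(inter_negatives))
```

Passing a seeded random.Random instance as rng keeps the sampling reproducible across runs.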

Unseen Food Segmentation

Yuma Honbu
Keiji Yanai

Food image segmentation is important for detailed analysis of food images, especially for classification of multiple food items and calorie estimation. However, training a semantic segmentation model is costly because it requires a large number of images with pixel-level annotations. In addition, the myriad of food categories leads to insufficient data in each category. Although several food segmentation datasets such as UEC-FoodPix Complete have been released so far, the number of food categories they cover is still small.

In this study, we propose a highly accurate unseen class segmentation method that uses both zero-shot and few-shot segmentation methods for any unseen classes. We make the following contributions: (1) We propose an UnSeen Food Segmentation method (USFoodSeg) that uses the zero-shot model to infer the segmentation mask from the class label words of unseen classes and those images, and uses the few-shot model to refine the segmentation masks. (2) We generate segmentation masks for 156 categories of the unseen class UEC-Food256, totaling 17,000 images, and 85 categories in the Food-101 dataset, totaling 85,000 images, with an accuracy of over 90%. Our proposed method is able to solve the problem of insufficient food segmentation data.

DMPCANet: A Low Dimensional Aggregation Network for Visual Place Recognition

Yinghao Wang
Haonan Chen
Jiong Wang
Yingying Zhu

Visual place recognition (VPR) aims to estimate the geographical location of a query image by finding its nearest reference images from a large geo-tagged database. Most of the existing methods adopt convolutional neural networks to extract feature maps from images. Nevertheless, such feature maps are high-dimensional tensors, and it is a challenge to effectively aggregate them into a compact vector representation for efficient retrieval. To tackle this challenge, we develop an end-to-end convolutional neural network architecture named DMPCANet. The network adopts the regional pooling module to generate feature tensors of the same size from images of different sizes. The core component of our network, the Differentiable Multilinear Principal Component Analysis (DMPCA) module, directly acts on tensor data and utilizes convolution operations to generate projection matrices for dimensionality reduction, thereby reducing the dimensionality to one sixteenth. This module can preserve crucial information while reducing data dimensions. Experiments on two widely used place recognition datasets demonstrate that our proposed DMPCANet can generate low-dimensional discriminative global descriptors and achieve the state-of-the-art results.

VideoCLIP: A Cross-Attention Model for Fast Video-Text Retrieval Task with Image CLIP

Yikang Li
Jenhao Hsiao
Chiuman Ho

Video-text retrieval is an essential task in cross-modal information retrieval, i.e., retrieving relevant videos from a large and unlabelled dataset given textual queries. Existing methods that simply pool the image features (e.g., based on the CLIP encoder [14]) from frames to build the video descriptor often result in sub-optimal video-text search accuracy since the information among different modalities is not fully exchanged and aligned. In this paper, we propose a novel dual-encoder model to address the challenging video-text retrieval problem, which uses a highly efficient cross-attention module to facilitate the information exchange between multiple modalities (i.e., video and text). The proposed VideoCLIP is evaluated on two benchmark video-text datasets, MSRVTT and DiDeMo, and the results show that our model can outperform existing state-of-the-art methods while the retrieval speed is much faster than the traditional query-agnostic search model.

Music-to-Dance Generation with Multiple Conformer

Mingao Zhang
Changhong Liu
Yong Chen
Zhenchun Lei
Mingwen Wang

Music-to-dance generation must consider both the kinematics of dance, which are highly complex and non-linear, and the connection between music and dance movement, which is far from deterministic. Existing approaches attempt to address the limited creativity problem, but it is still a very challenging task. First, it is a long-term sequence-to-sequence task. Second, the extracted motion keypoints are noisy. Last, there exist local and global dependencies in the music sequence and the dance motion sequence. To address these issues, we propose a novel autoregressive generative framework that predicts future motions based on past motions and music. This framework contains a music conformer, a motion conformer, and a cross-modal conformer. It uses conformers to encode the music and motion sequences and further adapts the cross-modal conformer to the noisy dance motion data, which enables it not only to capture local and global dependencies among the sequences but also to reduce the effect of noisy data. Quantitative and qualitative experimental results on the publicly available music-to-dance dataset demonstrate that our method improves greatly upon the baselines and can generate long-term coherent dance motions that are well coordinated with the music.

OCR-oriented Master Object for Text Image Captioning

Wenliang Tang
Zhenzhen Hu
Zijie Song
Richang Hong

Text image captioning aims to understand the scene text in images for image caption generation. The key issue of this challenging task is to understand the relationship between the text OCR tokens and images. In this paper, we propose a novel text image captioning method by purifying the OCR-oriented scene graph with the master object. The master object is the object to which the OCR is attached, which is the semantic relationship bridge between the OCR token and the image. We consider the master object as a proxy to connect OCR tokens and other regions in the image. By exploring the master object for each OCR token, we build the purified scene graph based on the master objects and then enrich the visual embedding with a Graph Convolution Network (GCN). Furthermore, we cluster the OCR tokens and feed the hierarchical information to provide a richer representation. Experiments on the TextCaps validation and test datasets demonstrate the effectiveness of the proposed method.

Supervised Contrastive Vehicle Quantization for Efficient Vehicle Retrieval

Yongbiao Chen
Kaicheng Guo
Fangxin Liu
Yusheng Huang
Zhengwei Qi

This paper considers large-scale efficient vehicle re-identification (Vehicle ReID). Existing works adopting deep hashing techniques function by projecting vehicle images into compact binary codes in the Hamming space. Since Hamming distance is less distinct, a considerable amount of discriminative information will be lost, leading to degraded retrieval performance. Inspired by the recent advancements in contrastive learning, we put forward the very first product quantization based framework for large-scale efficient vehicle re-identification: Supervised Contrastive Vehicle Quantization (SCVQ). Specifically, we integrate the product quantization process into deep supervised learning by designing a differentiable quantization network. In addition, we propose a novel supervised cross-quantized contrastive quantization (SCQC) loss for similarity-preserving learning, which is tailored for the asymmetric retrieval in the product quantization process. Comprehensive experiments on two public benchmarks have evidenced the superiority of our framework over the state of the art. Our work is open-sourced at https://github.com/chrisbyd/ContrastiveVehicleQuant
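
For readers unfamiliar with asymmetric retrieval in product quantization, the following sketch shows the generic idea: database vectors are stored as one centroid index per sub-vector, while the query stays uncompressed. It is plain textbook product quantization with fixed, hand-written codebooks, not the SCVQ network or its SCQC loss; the function names pq_encode and asymmetric_distance are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def pq_encode(vector, codebooks):
    """Encode a vector as one centroid index per sub-vector."""
    sub_dim = len(vector) // len(codebooks)
    code = []
    for m, codebook in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        # Pick the nearest centroid of this sub-space.
        code.append(min(range(len(codebook)),
                        key=lambda k: squared_distance(sub, codebook[k])))
    return code


def asymmetric_distance(query, code, codebooks):
    """Squared distance between an uncompressed query and a quantized item."""
    sub_dim = len(query) // len(codebooks)
    total = 0.0
    for m, (k, codebook) in enumerate(zip(code, codebooks)):
        sub = query[m * sub_dim:(m + 1) * sub_dim]
        total += squared_distance(sub, codebook[k])
    return total
```

Because only the database side is quantized, the query keeps its full precision, which is why this comparison is called asymmetric.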

Fashion Style-Aware Embeddings for Clothing Image Retrieval

Rino Naka
Marie Katsurai
Keisuke Yanagi
Ryosuke Goto

Clothing image retrieval is becoming increasingly important as users on social media grow to enjoy sharing their daily outfits. Most conventional methods offer single query-based retrieval and depend on visual features learnt via target classification training. This paper presents an embedding learning framework that uses novel style description features available on users' posts, allowing image-based and multiple choice-based queries for practical clothing image retrieval. Specifically, the proposed method exploits the following complementary information for representing fashion styles: season tags, style tags, users' heights, and silhouette descriptions. Then, we learn embeddings based on a quadruplet loss that considers the ranked pairings of the visual features and the proposed style description features, enabling flexible outfit search based on either of these two types of features as queries. Experiments conducted on WEAR posts demonstrated the effectiveness of the proposed method compared with several baseline methods.

SESSION: Session 1A: Reidentification

Multiple Biological Granularities Network for Person Re-Identification

Shuyuan Tu
Tianzhen Guan
Li Kuang

The task of person re-identification is to retrieve images of a specific pedestrian from a cross-camera person gallery captured in the wild. Previous approaches commonly concentrate on whole person images and local pre-defined body parts, which are ineffective under diverse person poses and occlusion. In order to alleviate the problem, researchers began to add attention mechanisms to their models using local convolutions with limited fields. However, previous attention mechanisms focus on the local feature representations, ignoring the exploration of global spatial relation knowledge. The global spatial relation knowledge contains clustering-like topological information, which is helpful for handling diverse person poses and occlusion. In this paper, we propose the Multiple Biological Granularities Network (MBGN) based on Global Spatial Relation Pixel Attention (GSRPA), taking the human body structure and global spatial relation pixel information into account. First, we design an adaptive adjustment algorithm (AABS) based on human body structure, which is complementary to our MBGN. Second, we propose a feature fusion strategy taking multiple biological granularities into account. Our strategy forces the model to learn diverse person poses by balancing the local semantic human body parts and global spatial relations. Third, we propose the attention mechanism GSRPA. GSRPA enhances the weight of spatial relational pixels, which uncovers the person topological information for overcoming the occlusion problem. Extensive evaluations on the popular datasets Market-1501 and CUHK03 demonstrate the superiority of MBGN over the state-of-the-art methods.

TriReID: Towards Multi-Modal Person Re-Identification via Descriptive Fusion Model

Yajing Zhai
Yawen Zeng
Da Cao
Shaofei Lu

The cross-modal person re-identification (ReID) aims to retrieve one person from one modality to the other single modality, such as text-based and sketch-based ReID tasks. However, for these different modalities of describing a person, combining multiple aspects can obviously make full use of complementary information and improve the identification performance. Therefore, to explore how to comprehensively consider multi-modal information, we advance a novel multi-modal person re-identification task, which utilizes both text and sketch as a descriptive query to retrieve desired images. In fact, the textual description and the visual description are understood together to retrieve the person in the database to be more aligned with real-world scenarios, which is promising but seldom considered. Besides, based on an existing sketch-based ReID dataset, we construct a new dataset, TriReID, to support this challenging task in a semi-automated way. Particularly, we implement an image captioning model under the active learning paradigm to generate sentences suitable for ReID, in which the quality scores of the three levels are customized. Moreover, we propose a novel framework named Descriptive Fusion Model (DFM) to solve the multi-modal ReID issue. Specifically, we first develop a flexible descriptive embedding function to fuse the text and sketch modalities. Further, the fused descriptive semantic feature is jointly optimized under the generative adversarial paradigm to mitigate the cross-modal semantic gap. Extensive experiments on the TriReID dataset demonstrate the effectiveness and rationality of our proposed solution.

Temporal-Consistent Visual Clue Attentive Network for Video-Based Person Re-Identification

Bingliang Jiao
Liying Gao
Peng Wang

Video-based person re-identification (ReID) aims to match video trajectories of pedestrians across multi-view cameras and has important applications in criminal investigation and intelligent surveillance. Compared with single image re-identification, the abundant temporal information contained in video sequences makes it describe pedestrian instances more precisely and effectively. Recently, most existing video-based person ReID algorithms have made use of temporal information by fusing diverse visual contents captured in independent frames. However, these algorithms only measure the salience of visual clues in each single frame, inevitably introducing momentary interference caused by factors like occlusion. Therefore, in this work, we introduce a Temporal-consistent Visual Clue Attentive Network (TVCAN), which is designed to capture temporal-consistently salient pedestrian contents among frames. Our TVCAN consists of two major modules, the TCSA module, and the TCCA module, which are responsible for capturing and emphasizing consistently salient visual contents from the spatial dimension and channel dimension, respectively. Through extensive experiments, the effectiveness of our designed modules has been verified. Additionally, our TVCAN outperforms all compared state-of-the-art methods on three mainstream benchmarks.

Pluggable Weakly-Supervised Cross-View Learning for Accurate Vehicle Re-Identification

Lu Yang
Hongbang Liu
Lingqiao Liu
Jinghao Zhou
Lei Zhang
Peng Wang
Yanning Zhang

Learning cross-view consistent feature representation is the key to accurate vehicle Re-identification (ReID), since the visual appearance of vehicles changes significantly under different viewpoints. To this end, many existing approaches resort to supervised cross-view learning using extensive extra viewpoint annotations, which, however, is difficult to deploy in real applications due to the expensive labelling cost and the continuous viewpoint variation that makes it hard to define discrete viewpoint labels. In this study, we present a pluggable Weakly-supervised Cross-View Learning (WCVL) module for vehicle ReID. Through hallucinating the cross-view samples as the hardest positive counterparts with small luminance difference and large local feature variance, we can learn the consistent feature representation via minimizing the cross-view feature distance based on vehicle IDs only, without using any viewpoint annotation. More importantly, the proposed method can be seamlessly plugged into most existing vehicle ReID baselines for cross-view learning without re-training the baselines. To demonstrate its efficacy, we plug the proposed method into a range of off-the-shelf baselines and obtain significant performance improvement on four public benchmark datasets, i.e., VeRi-776, VehicleID, VRIC and VRAI.

SESSION: Session 1B: Recommendations

An Effective Two-way Metapath Encoder over Heterogeneous Information Network for Recommendation

Yanbin Jiang
Huifang Ma
Xiaohui Zhang
Zhixin Li
Liang Chang

Heterogeneous information networks (HINs) are widely used in recommender system research due to their ability to model complex auxiliary information beyond historical interactions to alleviate the data sparsity problem. Existing HIN-based recommendation studies have achieved great success by performing graph convolution operators between pairs of nodes on predefined metapath-induced graphs, but they have the following major limitations. First, existing heterogeneous network construction strategies tend to exploit item attributes while failing to effectively model user relations. In addition, previous HIN-based recommendation models mainly convert heterogeneous graphs into homogeneous graphs by defining metapaths, ignoring the complicated relation dependency involved in the metapath. To tackle these limitations, we propose a novel recommendation model with a two-way metapath encoder for top-N recommendation, which models metapath similarity and sequence relation dependency in HIN to learn node representations. Specifically, our model first learns the initial node representation through a pre-training module, and then identifies potential friends and item relations based on their similarity to construct a unified HIN. We then develop the two-way encoder module with a similarity encoder and an instance encoder to capture the similarity collaborative signals and relational dependency on different metapaths. Finally, the representations on different metapaths are aggregated through the attention fusion layer to yield rich representations. Extensive experiments on three real datasets demonstrate the effectiveness of our method.

Multi-Modal Contrastive Pre-training for Recommendation

Zhuang Liu
Yunpu Ma
Matthias Schubert
Yuanxin Ouyang
Zhang Xiong

Personalized recommendation plays a central role in various online applications. To provide quality recommendation service, it is of crucial importance to consider multi-modal information associated with users and items, e.g., review text, description text, and images. However, many existing approaches do not fully explore and fuse multiple modalities. To address this problem, we propose a multi-modal contrastive pre-training model for recommendation. We first construct a homogeneous item graph and a user graph based on the relationship of co-interaction. For users, we propose intra-modal aggregation and inter-modal aggregation to fuse review texts and the structural information of the user graph. For items, we consider three modalities: description text, images, and item graph. Moreover, the description text and image complement each other for the same item. One of them can be used as promising supervision for the other. Therefore, to capture this signal and better exploit the potential correlation of intra-modalities, we propose a self-supervised contrastive inter-modal alignment task to make the textual and visual modalities as similar as possible. Then, we apply inter-modal aggregation to obtain the multi-modal representation of items. Next, we employ a binary cross-entropy loss function to capture the potential correlation between users and items. Finally, we fine-tune the pre-trained multi-modal representations using an existing recommendation model. We have performed extensive experiments on three real-world datasets. Experimental results verify the rationality and effectiveness of the proposed method.

Flexible Order Aware Sequential Recommendation

Mingda Qian
Xiaoyan Gu
Lingyang Chu
Feifei Dai
Haihui Fan
Bo Li

Sequential recommendations can dynamically model user interests, which has great value since users' interests may change rapidly with time. Traditional sequential recommendation methods assume that the user behaviors are rigidly ordered and sequentially dependent. However, some user behaviors have flexible orders, meaning the behaviors may occur in any order and are not sequentially dependent. Therefore, traditional methods may capture inaccurate user interests based on wrong dependencies. Motivated by this, several methods identify flexible orders by continuity or similarity. However, these methods fail to comprehensively understand the nature of flexible orders since continuity or similarity do not determine order flexibilities. Therefore, these methods may misidentify flexible orders, leading to inappropriate recommendations. To address these issues, we propose a Flexible Order aware Sequential Recommendation (FOSR) method to identify flexible orders comprehensively. We argue that orders' flexibilities are highly related to the frequencies of item pair co-occurrences. In light of this, FOSR employs a probabilistic based flexible order evaluation module to simulate item pair frequencies and infer accurate order flexibilities. The frequency labeling module extracts labels from the real item pair frequencies to guide the order flexibility measurement. Given the measured order flexibilities, we develop a flexible order aware self-attention module to model dependencies from flexible orders comprehensively and learn dynamic user interests effectively. Extensive experiments on four benchmark datasets show that our model outperforms various state-of-the-art sequential recommendation methods.
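
The link between order flexibility and item pair co-occurrence frequencies can be illustrated with a simple count. The sketch below is not the FOSR probabilistic module: the function names and the min/max ratio are assumptions chosen only to show how frequencies of the two possible orders of a pair might be compared.

```python
from collections import Counter


def ordered_pair_counts(sequences):
    """Count how often item x is immediately followed by item y."""
    counts = Counter()
    for sequence in sequences:
        for x, y in zip(sequence, sequence[1:]):
            counts[(x, y)] += 1
    return counts


def order_flexibility(counts, x, y):
    """Heuristic in [0, 1]: 1 means both orders are equally frequent.

    This ratio is an illustrative assumption, not the FOSR module.
    """
    forward, backward = counts[(x, y)], counts[(y, x)]
    if max(forward, backward) == 0:
        return 0.0  # the pair was never observed in either order
    return min(forward, backward) / max(forward, backward)
```

Under this heuristic, a pair seen equally often in both orders scores 1, while a pair that only ever appears in one order scores 0.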

Sequential Intention-aware Recommender based on User Interaction Graph

Jinpeng Chen
Yuan Cao
Fan Zhang
Pengfei Sun
Kaimin Wei

The next-item recommendation problem has received more and more attention from researchers in recent years. Ignoring the implicit item semantic information, existing algorithms focus more on the user-item binary relationship and suffer from high data sparsity. Inspired by the fact that a user's decision-making process is often influenced by both intention and preference, this paper presents a SequentiAl inTentiOn-aware Recommender based on a user Interaction graph (Satori). In Satori, we first use a novel user interaction graph to construct relationships between users, items, and categories. Then, we leverage a graph attention network to extract auxiliary features on the graph and generate the three embeddings. Next, we adopt a self-attention mechanism to model user intention and preference separately, which are later combined to form a hybrid user representation. Finally, the hybrid user representation and the previously obtained item representation are both sent to the prediction module to calculate the predicted item score. Tests on real-world datasets show that our approach outperforms state-of-the-art methods.

SESSION: Session 2A: Visual+Text Retrieval

TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval

Yongbiao Chen
Sheng Zhang
Fangxin Liu
Zhigang Chang
Mang Ye
Zhengwei Qi

Deep hashing has gained growing popularity in approximate nearest neighbor search for large-scale image retrieval. Until now, deep hashing for image retrieval has been dominated by convolutional neural network architectures, e.g., ResNet [22]. In this paper, inspired by the recent advancements of vision transformers, we present TransHash, a pure transformer-based framework for deep hashing learning. Concretely, our framework is composed of two major modules: (1) Based on the Vision Transformer (ViT), we design a siamese Multi-Granular Vision Transformer backbone (MGVT) for image feature extraction. To learn fine-grained features, we introduce a dual-stream multi-granular feature learning scheme on top of the transformer to learn discriminative global and local features. (2) Besides, we adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to learn compact binary hash codes. The entire framework is jointly trained in an end-to-end manner. To the best of our knowledge, this is the first work to tackle deep hashing learning problems without convolutional neural networks (CNNs). We perform comprehensive experiments on three widely studied datasets: CIFAR-10, NUS-WIDE and ImageNet. The experiments have evidenced our superiority over the existing state-of-the-art deep hashing methods. Specifically, we achieve 8.2%, 2.6%, and 12.7% performance gains in terms of average mAP for different hash bit lengths on the three public datasets, respectively.

Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval

Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Haijun Shan
Xuanjing Huang
Jianqing Fan

Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, semantic mismatch between an image and sentences usually happens at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision for the better identification of mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. In order to integrate both sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions of multi-grained semantic units on both the vision and the language side. For training, we propose multi-scale matching from both global and local perspectives, and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to some state-of-the-art models.

Relevance-based Margin for Contrastively-trained Video Retrieval Models

Alex Falcon
Swathikiran Sudhakaran
Giuseppe Serra
Sergio Escalera
Oswald Lanz

Video retrieval using natural language queries has attracted increasing interest due to its relevance in real-world applications, from intelligent access in private media galleries to web-scale video search. Learning the cross-similarity of video and text in a joint embedding space is the dominant approach. To do so, a contrastive loss is usually employed because it organizes the embedding space by putting similar items close together and dissimilar items far apart. This framework leads to competitive recall rates, as it solely focuses on the rank of the groundtruth items. Yet, assessing the quality of the ranking list is of utmost importance when considering intelligent retrieval systems, since multiple items may share similar semantics and hence a high relevance. Moreover, the aforementioned framework uses a fixed margin to separate similar and dissimilar items, treating all non-groundtruth items as equally irrelevant. In this paper we propose to use a variable margin: we argue that varying the margin used during training based on how relevant an item is to a given query, i.e., a relevance-based margin, easily improves the quality of the ranking lists measured through nDCG and mAP. We demonstrate the advantages of our technique using different models on EPIC-Kitchens-100 and YouCook2. We show that even if we carefully tuned the fixed margin, our technique (which does not have the margin as a hyper-parameter) would still achieve better performance. Finally, extensive ablation studies and qualitative analysis support the robustness of our approach. Code will be released at https://github.com/aranciokov/RelevanceMargin-ICMR22.
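
The difference between a fixed and a relevance-based margin can be sketched with a single hinge term. This is a minimal illustration, not the released code: the function names are hypothetical, and setting the margin to 1 minus the relevance of the non-groundtruth item is an assumption made for the example, with relevance taken to lie in [0, 1].

```python
def triplet_loss(sim_pos, sim_neg, margin):
    """Hinge-style contrastive term on similarity scores."""
    return max(0.0, margin + sim_neg - sim_pos)


def relevance_margin_loss(sim_pos, sim_neg, relevance_neg):
    """Same term, but the margin shrinks as the 'negative' becomes relevant.

    Using margin = 1 - relevance is an assumption made for this sketch.
    """
    return triplet_loss(sim_pos, sim_neg, margin=1.0 - relevance_neg)
```

With this choice, a non-groundtruth item that is fully relevant to the query is no longer pushed away by a margin, while a completely irrelevant one receives the largest margin.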

CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval

Yaoxin Zhuo
Yikang Li
Jenhao Hsiao
Chiuman Ho
Baoxin Li

With the ever-increasing multimedia data on the Web, cross-modal video-text retrieval has received a lot of attention in recent years. Deep cross-modal hashing approaches utilize the Hamming space for achieving fast retrieval. However, most existing algorithms have difficulties in seeking or constructing a well-defined joint semantic space. In this paper, an unsupervised deep cross-modal video-text hashing approach (CLIP4Hashing) is proposed, which mitigates the difficulties in bridging between different modalities in the Hamming space through building a single hashing net by employing the pre-trained CLIP model. The approach is enhanced by two novel techniques, the dynamic weighting strategy and the design of the min-max hashing layer, which are found to be the main sources of the performance gain. Compared with conventional deep cross-modal hashing algorithms, CLIP4Hashing does not require data-specific hyper-parameters. With evaluation using three challenging video-text benchmark datasets, we demonstrate that CLIP4Hashing is able to significantly outperform existing state-of-the-art hashing algorithms. Additionally, with larger bit sizes (e.g., 2048 bits), CLIP4Hashing can even deliver competitive performance compared with the results based on non-hashing features.

SESSION: Session 2B: Deep Learning - Methodological Advancements

Nearest Neighbor Search with Compact Codes: A Decoder Perspective

Kenza Amara
Matthijs Douze
Alexandre Sablayrolles
Hervé Jégou

Modern approaches for fast retrieval of similar vectors on billion-scaled datasets rely on compressed-domain approaches such as binary sketches or product quantization. These methods minimize a certain loss, typically the Mean Squared Error or other objective functions tailored to the retrieval problem. In this paper, we re-interpret popular methods such as binary hashing or product quantizers as auto-encoders, and point out that they implicitly make suboptimal assumptions on the form of the decoder. We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes, which translates to a better performance in nearest neighbor search. Our method significantly improves over binary hashing methods and product quantization on popular benchmarks.
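
The auto-encoder view can be illustrated with a toy example: keep a sign-based binary encoder fixed and compare the implicit decoder (bits mapped to ±1) with a lookup-table decoder that stores, for each code, the mean of the training vectors assigned to it. For a fixed encoder, the per-code mean minimizes squared reconstruction error on the training set. This is only a sketch under those assumptions; the decoders designed in the paper may take a different form, and all names are hypothetical.

```python
def encode(x):
    """Sign-based binary code, returned as a tuple of 0/1 bits."""
    return tuple(1 if v > 0 else 0 for v in x)


def naive_decode(code):
    """Implicit decoder of plain binary hashing: bit 1 -> +1.0, bit 0 -> -1.0."""
    return [1.0 if b else -1.0 for b in code]


def fit_mean_decoder(train):
    """Build a table mapping each observed code to the mean of its training vectors."""
    sums, counts = {}, {}
    for x in train:
        c = encode(x)
        if c not in sums:
            sums[c] = [0.0] * len(x)
            counts[c] = 0
        sums[c] = [s + v for s, v in zip(sums[c], x)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def reconstruction_mse(vectors, decode):
    """Average squared error between each vector and its decoded code."""
    total = 0.0
    for x in vectors:
        r = decode(encode(x))
        total += sum((a - b) ** 2 for a, b in zip(x, r))
    return total / len(vectors)
```

The codes stored in the index are unchanged; only the reconstruction used at search time differs, which is the sense in which such a decoder is backward-compatible.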

Teaching a New Dog Old Tricks: Contrastive Random Walks in Videos with Unsupervised Priors

Jan Schutte
Pascal Mettes

This paper focuses on self-supervised representation learning in videos with guidance from multimodal priors. Where the temporal dimension is commonly used as supervision proxy for learning frame-level or clip-level representations, a number of works have recently shown how to learn local representations in space and time through cycle-consistency. Given a starting patch, the contrastive goal is to track the patch in subsequent frames, followed by a backtracking to the original frame with the starting patch as goal. While effective for down-stream tasks such as segmentation and body joint propagation, affinities between patches need to be learned from scratch. This setup not only requires many videos for self-supervised optimization, it also fails when using smaller patches and more connections between consecutive frames. On the other hand, there are multiple generic cues from multiple modalities that provide valuable information about how patches should propagate in videos, from saliency and optical flow to photometric center biases. To that end, we introduce Guided Contrastive Random Walks. The main idea is to employ well-known multimodal priors to provide fixed prior affinities. We outline a general framework where prior affinities are combined with learned affinities to guide the cycle-consistency objective. Empirically, we show that Guided Contrastive Random Walks result in better spatio-temporal representations for two down-stream tasks. More importantly, when using smaller patches and therefore more connections between patches, our approach further improves, while the unguided baseline can no longer learn meaningful representations.

FedNKD: A Dependable Federated Learning Using Fine-tuned Random Noise and Knowledge Distillation

Shaoxiong Zhu
Qi Qi
Zirui Zhuang
Jingyu Wang
Haifeng Sun
Jianxin Liao

Multimedia retrieval models need the ability to extract useful information from large-scale data for clients. As an important part of multimedia retrieval, the image classification model directly affects the efficiency and effectiveness of multimedia retrieval. We need a lot of data to train an image classification model applied to multimedia retrieval tasks. However, with the protection of data privacy, the data used to train the model often needs to be kept on the client side. Federated learning is proposed to use data from all clients to train one model while protecting privacy. When federated learning is applied, the distribution of data across different clients varies greatly. Disregarding this problem yields a final model with unstable performance. To enable federated learning to work dependably in the real world with complex data environments, we propose FedNKD, which utilizes knowledge distillation and random noise. The superior knowledge of each client is distilled into a central server to mitigate the instability caused by Non-IID data. Importantly, a synthetic dataset is created from random noise through back propagation of neural networks. The synthetic dataset will contain the abstract features of the real data. Then we will use this synthetic dataset to realize the knowledge distillation while protecting users' privacy. In our experimental scenarios, FedNKD outperforms existing representative algorithms by about 1.5% in accuracy.

Weakly Supervised Fine-grained Recognition based on Combined Learning for Small Data and Coarse Label

Anqi Hu
Zhengxing Sun
Qian Li

Learning with weak supervision has already become one of the research trends in fine-grained image recognition. These methods aim to learn feature representations with less manual cost or expert knowledge. Most existing weakly supervised methods are based on either incomplete annotation or inexact annotation, and their performance is limited by the available supervision information. Therefore, using these two kinds of annotations for training at the same time could mine more relevance without increasing the annotation burden much. In this paper, we propose a combined learning framework that uses coarse-grained large data and fine-grained small data for weakly supervised fine-grained recognition. Combined learning contains two significant modules: 1) a discriminant module, which keeps the structure information consistent between the coarse label and the fine label through attention maps and part sampling, and 2) a cluster division strategy, which mines the detailed differences between fine categories by feature subtraction. Experimental results show that our method outperforms weakly supervised methods and achieves performance close to that of fully supervised methods on the CUB-200-2011 and Stanford Cars datasets.

SESSION: Demos

Real-Time Deepfake System for Live Streaming

Yifei Fan
Modan Xie
Peihan Wu
Gang Yang

This paper proposes a real-time deepfake framework that helps users apply deep forgery during live streaming, both to protect their privacy and to make streams more engaging, by selecting different reference faces to create a non-existent fake face. Nowadays, because of the demand for live broadcast functions such as selling goods, playing games, and auctions, anchors are increasingly exposed, which leads live streamers to pay more attention to protecting their privacy. Meanwhile, traditional deepfake technology is more likely to infringe on the portrait rights of others, so our framework lets users select different face features for facial tampering to avoid infringement. In our framework, face reenactment is achieved effectively through a feature extractor, a heatmap transformer, heatmap regression, and face blending. Users can enrich their personal face feature database by uploading different photos and then select the desired picture for tampering, so that a tampered live broadcast is finally produced in real time. Moreover, our framework is a closed-loop self-adaptive system, as it allows users to update the database themselves to extend the face feature data and improve conversion efficiency.

EmoMTB: Emotion-aware Music Tower Blocks

Alessandro B. Melchiorre
David Penz
Christian Ganhör
Oleg Lesota
Vasco Fragoso
Florian Friztl
Emilia Parada-Cabaleiro
Franz Schubert
Markus Schedl

We introduce Emotion-aware Music Tower Blocks (EmoMTB), an audiovisual interface to explore large music collections. It creates a musical landscape by adopting the metaphor of a city, where similar songs are grouped into the same building and nearby buildings form neighborhoods of particular genres. In order to personalize the user experience, an underlying classifier monitors textual user-generated content, predicting the user's emotional state and adapting the audiovisual elements of the interface accordingly. EmoMTB enables users to explore different musical styles either within their comfort zone or outside of it. Besides tailoring the results of the recommender engine to match the affective state of the user, EmoMTB offers a unique way to discover and enjoy music. EmoMTB supports exploring a collection of circa half a million streamed songs using a regular smartphone as a control interface to navigate in the landscape.

ViRMA: Virtual Reality Multimedia Analytics

Aaron Duane
Björn Þór Jónsson

In this paper we describe the latest iteration of the Virtual Reality Multimedia Analytics (ViRMA) system, a novel approach to multimedia analysis in virtual reality which is supported by the Multi-dimensional Multimedia Model.

Person Search by Uncertain Attributes

Tingting Dong
Jianquan Liu

This paper presents a person search system based on uncertain attributes. Attribute-based person search aims at finding the person images that best match a set of attributes specified by a user as a query. The specified query attributes are inherently uncertain due to many factors, such as the difficulty of recalling the characteristics of a target person from memory and environmental variations like light and viewpoint. Also, existing attribute recognition techniques typically extract confidence scores along with attributes. Most state-of-the-art approaches for attribute-based person search ignore the confidence scores or simply use a threshold to filter out attributes with low confidence scores. Moreover, they do not consider the uncertainty of query attributes. In this work, we resolve this uncertainty by enabling users to specify a level of confidence with each query attribute and consider uncertainty in both query attributes and attributes extracted from person images. We define a novel matching score to measure the degree to which a person matches the query attribute conditions by leveraging the knowledge of probabilistic databases. Furthermore, we propose a novel definition of Critical Point of Confidence and compute it for each query attribute to show the impact of confidence levels on rankings of results. We develop a web-based demonstration system and show its effectiveness using real-world surveillance videos.

SESSION: Best Paper Candidates

Dual-Level Decoupled Transformer for Video Captioning

Yiqi Gao
Xinglin Hou
Wei Suo
Mengyang Sun
Tiezheng Ge
Yuning Jiang
Peng Wang

Video captioning aims to understand the spatio-temporal semantic concept of the video and generate descriptive sentences. The de-facto approach to this task dictates a text generator to learn from offline-extracted motion or appearance features from pre-trained vision models. However, these methods may suffer from the so-called "couple" drawbacks on both video spatio-temporal representation and sentence generation. For the former, "couple" means learning the spatio-temporal representation in a single model (3DCNN), which results in a disconnection between the task and pre-training domains and makes end-to-end training hard. As for the latter, "couple" means treating the generation of visual semantic and syntax-related words equally. To this end, we present D2, a dual-level decoupled transformer pipeline that addresses the above drawbacks: (i) for video spatio-temporal representation, we decouple the process into a "first-spatial-then-temporal" paradigm, releasing the potential of using dedicated models (e.g., image-text pre-training) to connect the pre-training and downstream tasks, and making the entire model end-to-end trainable; (ii) for sentence generation, we propose a Syntax-Aware Decoder to dynamically measure the contribution of visual semantic and syntax-related words. Extensive experiments on three widely used benchmarks (MSVD, MSR-VTT and VATEX) show the great potential of the proposed D2, which surpasses previous methods by a large margin in the task of video captioning.

Cross-Modal Retrieval between Event-Dense Text and Image

Zhongwei Xie
Lin Li
Luo Zhong
Jianquan Liu
Ling Liu

This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.

Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval

Sheng Zeng
Changhong Liu
Jun Zhou
Yong Chen
Aiwen Jiang
Hanxi Li

Cross-modal image-text retrieval is a fundamental task in information retrieval. The key to this task is to address both heterogeneity and cross-modal semantic correlation between data of different modalities. Fine-grained matching methods can nicely model local semantic correlations between image and text but face two challenges. First, images may contain redundant information while text sentences often contain words without semantic meaning. Such redundancy interferes with the local matching between textual words and image regions. Furthermore, the retrieval shall consider not only low-level semantic correspondence between image regions and textual words but also a higher semantic correlation between different intra-modal relationships. We propose a multi-layer graph convolutional network with object-level, object-relational-level, and higher-level learning sub-networks. Our method learns hierarchical semantic correspondences by both local and global alignment. We further introduce a self-attention mechanism after the word embedding to weaken insignificant words in the sentence and a cross-attention mechanism to guide the learning of image features. Extensive experiments on Flickr30K and MS-COCO datasets demonstrate the effectiveness and superiority of our proposed method.

SESSION: Session 3A: Visual+Text Retrieval

Ingredient-enriched Recipe Generation from Cooking Videos

Jianlong Wu
Liangming Pan
Jingjing Chen
Yu-Gang Jiang

Cooking video captioning aims to generate the text instructions that describe the cooking procedures presented in the video. Current approaches tend to use large neural models or more robust feature extractors to increase the expressive ability of features, ignoring the strong correlation between consecutive cooking steps in the video. However, it is intuitive that previous cooking steps can provide clues for the next cooking step. Specifically, consecutive cooking steps tend to share the same ingredients. Therefore, accurate ingredient recognition can help to introduce more fine-grained information in captioning. To improve the performance of procedural captioning in cooking videos, this paper proposes a framework that introduces an ingredient recognition module, which uses the copy mechanism to fuse the predicted ingredient information into the generated sentence. Moreover, we integrate the visual information of the previous step into the generation of the current step, so that the visual information of the two steps together assists the generation process. Extensive experiments verify the effectiveness of our proposed framework, which achieves promising performance on both the YouCookII and Cooking-COIN datasets.

Cross-lingual Adaptation for Recipe Retrieval with Mixup

Bin Zhu
Chong-Wah Ngo
Jingjing Chen
Wing-Kwong Chan

Cross-modal recipe retrieval has attracted research attention in recent years, thanks to the availability of large-scale paired data for training. Nevertheless, obtaining adequate recipe-image pairs covering the majority of cuisines for supervised learning is difficult if not impossible. By transferring knowledge learnt from a data-rich cuisine to a data-scarce cuisine, domain adaptation sheds light on this practical problem. However, existing works assume that recipes in the source and target domains mostly originate from the same cuisine and are written in the same language. This paper studies unsupervised domain adaptation for image-to-recipe retrieval, where recipes in source and target domains are in different languages. Moreover, only recipes are available for training in the target domain. A novel recipe mixup method is proposed to learn transferable embedding features between the two domains. Specifically, recipe mixup produces mixed recipes to form an intermediate domain by discretely exchanging the section(s) between source and target recipes. To bridge the domain gap, a recipe mixup loss is proposed to enforce the intermediate domain to lie on the shortest geodesic path between the source and target domains in the recipe embedding space. Using the Recipe 1M dataset as the source domain (English) and the Vireo-FoodTransfer dataset as the target domain (Chinese), empirical experiments verify the effectiveness of recipe mixup for cross-lingual adaptation in the context of image-to-recipe retrieval.

Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis

Pei Dong
Lei Wu
Lei Meng
Xiangxu Meng

In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it to a high-resolution one. These methods typically refine all granularity features output from the previous stage indiscriminately. However, the ability to express different granularity features is not consistent across stages, and it is difficult to express precise semantics by further refining poor-quality features generated in the previous stage. Because current methods cannot refine different granularity features independently, it is challenging for them to clearly express all semantic factors in the generated image, and some features even become worse. To address this issue, we propose a Hierarchical Disentangled Representations Generative Adversarial Network (HDR-GAN) to generate photo-realistic images by explicitly disentangling and individually modeling the factors of semantics in the image. HDR-GAN introduces a novel component called multi-granularity feature disentangled encoder to represent image information comprehensively through explicitly disentangling multi-granularity features including pose, shape and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied and publicly available datasets (i.e., CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.

Style-woven Attention Network for Zero-shot Ink Wash Painting Style Transfer

Haochen Sun
Lei Wu
Xiang Li
Xiangxu Meng

Traditional Chinese painting is a unique form of artistic expression. Compared with western art painting, it pays more attention to the verve in visual effect, especially ink painting, which makes good use of lines and pays little attention to information such as texture. Some style transfer methods have recently begun to apply traditional Chinese painting style (such as ink wash style) to photorealistic images. Ink stylization of different types of real-world photos in a dataset using these style transfer methods has some limitations. When the input images show animal types that have not been seen in the training set, the generated results retain some semantic features of the data in the training set, resulting in distortion. Therefore, in this paper, we attempt to separate the feature representations for styles and contents and propose a style-woven attention network to achieve zero-shot ink wash painting style transfer. Our model learns to disentangle the data representations in an unsupervised fashion and capture the semantic correlations of content and style. In addition, an ink style loss is added to improve the learning ability of the style encoder. In order to verify the ability of ink wash stylization, we augmented the publicly available dataset ChipPhi. Extensive experiments based on a wide validation set show that our method achieves state-of-the-art results.

SESSION: Session 3B: Applications

Automatic Visual Recognition of Unexploded Ordnances Using Supervised Deep Learning

Georgios Begkas
Panagiotis Giannakeris
Konstantinos Ioannidis
Georgios Kalpakis
Theodora Tsikrika
Stefanos Vrochidis
Ioannis Kompatsiaris

Unexploded Ordnance (UXO) classification is a challenging task which is currently tackled using electromagnetic induction devices that are expensive and may require physical presence in potentially hazardous environments. The limited availability of open UXO data has, until now, impeded the progress of image-based UXO classification, which may offer a safe alternative at a reduced cost. In addition, the existing sporadic efforts focus mainly on small scale experiments using only a subset of common UXO categories. Our work aims to stimulate research interest in image-based UXO classification, with the curation of a novel dataset that consists of over 10000 annotated images from eight major UXO categories. Through extensive experimentation with supervised deep learning we uncover key insights into the challenging aspects of this task. Finally, we set the baseline on our novel benchmark by training state-of-the-art Convolutional Neural Networks and a Vision Transformer that are able to discriminate between highly overlapping UXO categories with 84.33% accuracy.

Generating Topological Structure of Floorplans from Room Attributes

Yu Yin
Will Hutchcroft
Naji Khosravan
Ivaylo Boyadzhiev
Yun Fu
Sing Bing Kang

Analysis of indoor spaces requires topological information. In this paper, we propose to extract topological information from room attributes using what we call Iterative and adaptive graph Topology Learning (ITL). ITL progressively predicts multiple relations between rooms; at each iteration, it improves node embeddings, which in turn facilitates the generation of a better topological graph structure. This notion of iterative improvement of node embeddings and topological graph structure is in the same spirit as [5]. However, while [5] computes the adjacency matrix based on node similarity, we learn the graph metric using a relational decoder to extract room correlations. Experiments using a new challenging indoor dataset validate our proposed method. Qualitative and quantitative evaluation for layout topology prediction and floorplan generation applications also demonstrate the effectiveness of ITL.

MultiCLU: Multi-stage Context Learning and Utilization for Storefront Accessibility Detection and Evaluation

Xuan Wang
Jiajun Chen
Hao Tang
Zhigang Zhu

In this work, a storefront accessibility image dataset is collected from Google street view and is labeled with three main objects for storefront accessibility: doors (for store entrances), doorknobs (for accessing the entrances) and stairs (for leading to the entrances). Then MultiCLU, a new multi-stage context learning and utilization approach, is proposed with the following four stages: Context in Labeling (CIL), Context in Training (CIT), Context in Detection (CID) and Context in Evaluation (CIE). The CIL stage automatically extends the label for each knob to include more local contextual information. In the CIT stage, a deep learning method is used to project the visual information extracted by a Faster R-CNN based object detector to a semantic space generated by a Graph Convolutional Network. The CID stage uses the spatial relation reasoning between categories to refine the confidence score. Finally, in the CIE stage, a new loose evaluation metric for storefront accessibility, especially for the knob category, is proposed to efficiently help blind or low-vision (BLV) users find estimated knob locations. Our experimental results show that the proposed MultiCLU framework achieves significantly better performance than the baseline detector using Faster R-CNN, with +13.4% on mAP and +15.8% on recall, respectively. Our new evaluation metric also introduces a new way to evaluate storefront accessibility objects, which could benefit the BLV group in real life.

UF-VTON: Toward User-Friendly Virtual Try-On Network

Yuan Chang
Tao Peng
Ruhan He
Xinrong Hu
Junping Liu
Zili Zhang
Minghua Jiang

Image-based virtual try-on aims to transfer a garment onto a person while preserving the attributes of both the person and the garment. However, the existing methods for this task require an image of the target clothes, which cannot be obtained in most cases. To address this issue, we propose a novel user-friendly virtual try-on network (UF-VTON), which only requires a person image and an image of another person wearing the target clothes to generate a result of the person wearing the target clothes. Specifically, we adopt a knowledge distillation scheme to construct a new triple dataset for supervised learning, propose a new three-step pipeline (coarse synthesis, clothing alignment, and refinement synthesis) for the try-on task, and utilize an end-to-end training strategy to further refine the results. In particular, we design a new synthesis network that includes both CNN blocks and swin-transformer blocks to capture global and local information and generate highly realistic try-on images. Qualitative and quantitative experiments show that our method achieves state-of-the-art virtual try-on performance.

SESSION: Session 3C: Synchronized MM

Learning Sample Importance for Cross-Scenario Video Temporal Grounding

Peijun Bao
Yadong Mu

The task of temporal grounding aims to locate a video moment in an untrimmed video, given a sentence query. This paper for the first time investigates some superficial biases that are specific to the temporal grounding task, and proposes a novel targeted solution. Most alarmingly, we observe that existing temporal grounding models heavily rely on some biases (e.g., high preference on frequent concepts or certain temporal intervals) in the visual modality. This leads to inferior performance when generalizing the model to a cross-scenario test setting. To this end, we propose a novel method called Debiased Temporal Language Localizer (Debias-TLL) to prevent the model from naively memorizing the biases and force it to ground the query sentence based on the true inter-modal relationship. Debias-TLL simultaneously trains two models. By our design, a large discrepancy between these two models' predictions when judging a sample indicates a higher probability that the sample is biased. Harnessing the informative discrepancy, we devise a data re-weighting scheme for mitigating the data biases. We evaluate the proposed model in cross-scenario temporal grounding, where the train / test data are heterogeneously sourced. Experiments show large-margin superiority of the proposed method in comparison with state-of-the-art competitors.

Efficient Linear Attention for Fast and Accurate Keypoint Matching

Suwichaya Suwanwimolkul
Satoshi Komorita

Recently, Transformers have provided state-of-the-art performance in sparse matching, which is crucial to realizing high-performance 3D vision applications. Yet, these Transformers lack efficiency due to the quadratic computational complexity of their attention mechanism. To solve this problem, we employ an efficient linear attention to achieve linear computational complexity. Then, we propose a new attentional aggregation that achieves high accuracy by aggregating both the global and local information from sparse keypoints. To further improve the efficiency, we propose the joint learning of feature matching and description. Our learning enables simpler and faster matching than Sinkhorn, often used in matching the learned descriptors from Transformers. Our method achieves competitive performance with only 0.84M learnable parameters against the bigger SOTAs, SuperGlue (12M parameters) and SGMNet (30M parameters), on three benchmarks, HPatch, ETH, Aachen Day-Night.
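
The complexity argument can be sketched with one common formulation of linear attention, in which the softmax is replaced by a positive feature map so that the key-value summary is accumulated once and reused for every query. The feature map `elu(x) + 1` and the function names below are assumptions made for illustration; the abstract does not say which linear attention variant the authors use.

```python
import math


def phi(x):
    """Positive feature map elu(v) + 1: v + 1 for v > 0, exp(v) otherwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Attention in O(N) over keypoints: summarize keys/values once, reuse per query."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # S = sum_j phi(k_j) v_j^T
    z = [0.0] * d                       # z = sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm for b in range(dv)])
    return out


def quadratic_reference(Q, K, V):
    """Same result via explicit pairwise weights, which costs O(N^2)."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same values; the first never materializes the N x N weight matrix, so its cost grows linearly with the number of keypoints.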

Video2Subtitle: Matching Weakly-Synchronized Sequences via Dynamic Temporal Alignment

Ben Xue
Chenchen Liu
Yadong Mu

This paper investigates a new research task in multimedia analysis, dubbed Video2Subtitle. The goal of this task is to find the most plausible subtitle from a large pool for a query video clip. We assume that the temporal duration of each sentence in a subtitle is unknown. Compared with existing cross-modal matching tasks, the proposed Video2Subtitle confronts several new challenges. In particular, video frames and subtitle sentences are each temporally ordered, yet no precise synchronization between them is available. This casts Video2Subtitle into a problem of matching weakly-synchronized sequences. In this work, our technical contributions are two-fold. First, we construct a large-scale benchmark for the Video2Subtitle task. It consists of about 100K video clip / subtitle pairs with a full duration of 759 hours. All data are automatically trimmed from conversational sub-parts of movies and YouTube videos. Secondly, an ideal algorithm for tackling Video2Subtitle requires not only temporal synchronization of the visual / textual sequences, but also strong semantic consistency between the two modalities. To this end, we propose a novel algorithm with the key traits of heterogeneous multi-cue fusion and dynamic temporal alignment. The proposed method demonstrates excellent performance in comparison with several state-of-the-art cross-modal matching methods. Additionally, we depict a few interesting applications of Video2Subtitle, such as re-generating subtitles for given videos.
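
Matching two ordered sequences without known synchronization can be illustrated with classical dynamic time warping (DTW), which finds the cheapest monotonic alignment through a frame-by-sentence cost matrix. DTW is used here as a stand-in to convey the idea; it is not claimed to be the dynamic temporal alignment module of the paper, and the cost values in the usage note are made up.

```python
def dtw_cost(cost):
    """Minimum cumulative cost of a monotonic alignment through a cost matrix.

    cost[i][j] is the dissimilarity between frame i and sentence j. The path
    starts at (0, 0), ends at the last cell, and at each step advances the
    frame index, the sentence index, or both by one.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    return acc[n][m]
```

For instance, with four frames and two sentences, a cost matrix in which the first two frames match the first sentence and the last two match the second yields a lower alignment cost than the same subtitle with its sentences swapped, so ranking candidate subtitles by this cost respects temporal order even though sentence durations are unknown.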

Dual-Channel Localization Networks for Moment Retrieval with Natural Language

Bolin Zhang
Bin Jiang
Chao Yang
Liang Pang

Given a natural language query, moment retrieval aims to localize the most relevant moment in an untrimmed video. The existing solutions for this problem can be roughly divided into two categories based on whether candidate moments are generated: i) Moment-based approach: It pre-cuts the video into a set of candidate moments, performs multimodal fusion, and evaluates matching scores with the query. ii) Clip-based approach: It directly aligns video clips with the query and predicts matching scores without generating candidate moments. Both frameworks have their respective shortcomings: the moment-based models suffer from heavy computations, while the performance of clip-based models is generally inferior to that of their moment-based counterparts. To this end, we design an intuitive and efficient Dual-Channel Localization Network (DCLN) to balance computational cost and retrieval performance. To reduce computational cost, we capture the temporal relations of only a few video moments with the same start or end boundary in the proposed dual-channel structure. The start or end channel map index represents the corresponding video moment's start or end time boundary. To improve model performance, we apply the proposed dual-channel localization network to efficiently encode the temporal relations on the dual-channel map and learn discriminative features to distinguish the matching degree between the natural language query and video moments. Extensive experiments on two standard benchmarks demonstrate the effectiveness of our proposed method.

SESSION: Session 4A: Alignment and Localization

Phrase-level Prediction for Video Temporal Localization

Sizhe Li
Chang Li
Minghang Zheng
Yang Liu

Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches have made steady progress on sentence localization, the performance of phrase localization is far from satisfactory. In principle, a phrase should be easier to localize, as fewer combinations of visual concepts need to be considered; this incapability indicates that the existing models only capture the sentence annotation bias in the benchmark but lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, which calls their generalization and interpretability into question. This paper proposes a unified framework that can deal with both sentence and phrase-level localization, namely Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues, while dissimilar ones should not, we build a contrastive mechanism to constrain phrase-level localization without requiring fine-grained phrase boundary annotation in training. Moreover, considering the sentence's flexibility and the wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods of phrase-level temporal localization while maintaining high performance in sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.

Joint Modality Synergy and Spatio-temporal Cue Purification for Moment Localization

Xingyu Shen
Long Lan
Huibin Tan
Xiang Zhang
Xurui Ma
Zhigang Luo

Currently, many approaches to the sentence query based moment location (SQML) task emphasize (inter-)modality interaction between video and language query via transformer-based cross-attention or contrastive learning. However, they could still face two issues: 1) modality interaction could be unexpectedly friendly to modality specific learning that merely learns modality specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately makes time cues in the original video ambiguous. In this paper, we propose a modality synergy with spatio-temporal cue purification method (MS2P) for SQML to address the above two issues. In particular, a conceptually simple modality synergy strategy is explored to keep features modality specific while absorbing complementary information from the other modality, using both a carefully designed cross-attention unit and non-contrastive learning. As a result, modality specific semantics can be calibrated progressively in a safer way. To preserve time cues in the original video, we further purify the video representation into spatial and temporal parts to enhance localization resolution via two proposed light-weight sentence-aware filtering operations. Experiments on the Charades-STA, TACoS, and ActivityNet Caption datasets show that our model outperforms the state-of-the-art approaches by a large margin.

HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment

Ru Peng
Yawen Zeng
Junbo Zhao

Multi-modal machine translation (MMT) aims to augment linguistic machine translation frameworks by incorporating aligned vision information. As the core research challenge for MMT, how to fuse the image information and further align it with the bilingual data remains critical. Existing works have either focused on a methodological alignment in the space of bilingual text or emphasized the combination of the one-sided text and the given image. In this work, we entertain the possibility of a triplet alignment among the source text, the target text, and the image instance. In particular, we propose the Multi-aspect AlignmenT (MAT) model, which augments the MMT task with three sub-tasks, namely cross-language translation alignment, cross-modal captioning alignment, and multi-modal hybrid alignment. Core to this model is a hybrid vocabulary which compiles the occurrences of visually depictable entities (nouns) on both sides of the text as well as the detected object labels appearing in the images. Through these sub-tasks, we postulate that MAT manages to further align the modalities by casting the three instances into a shared domain, in contrast to previously proposed methods. Extensive experiments and analyses demonstrate the superiority of our approach, which achieves several state-of-the-art results on two benchmark datasets of the MMT task.

SESSION: Session 4B: Captioning and Summarization

Improving Image Captioning via Enhancing Dual-Side Context Awareness

Yiqi Gao
Ning Wang
Wei Suo
Mengyang Sun
Peng Wang

Recent work on visual question answering demonstrates that grid features can work as well as region features on vision-language tasks. In the meantime, the transformer-based model and its variants have shown remarkable performance on image captioning. However, two problems remain unexplored: the missing object-contextual information caused by the single-granularity nature of grid features on the encoder side, and the missing future contextual information due to the left2right decoding paradigm of the transformer decoder. In this work, we tackle these two problems by enhancing contextual information on both sides: (i) on the encoder side, we propose a Context-Aware Self-Attention module, in which the key/value is expanded with adjacent rectangular regions, where each region contains two or more aggregated grid features; this provides grid features with varying granularity, storing adequate contextual information for objects of different scales. (ii) on the decoder side, we incorporate a dual-way decoding strategy, in which left2right and right2left decoding are conducted simultaneously and interactively. It utilizes both past and future contextual information when generating the current word. Combining these two modules with a vanilla transformer, our Context-Aware Transformer (CATNet) achieves a new state of the art on the MSCOCO benchmark.

Improve Image Captioning by Modeling Dynamic Scene Graph Extension

Minghao Geng
Qingjie Zhao

Recently, scene graph generation methods have been used in image captioning to encode the objects and their relationships in the encoder-decoder framework, where the decoder selects part of the graph nodes as input for word inference. However, current methods attend to the scene graph relying on ambiguous language information, neglecting the strong connections between scene graph nodes. In this paper, we propose a Scene Graph Extension (SGE) architecture to model the dynamic scene graph extension using the partly generated sentence. Our model first uses the generated words and previous attention results of scene graph nodes to make up a partial scene graph. Then we choose objects or relationships that have close connections with the generated graph to infer the next word. Our SGE is appealing in that it can be plugged into any scene graph based image captioning method. We conduct extensive experiments on the MSCOCO dataset. The results show that the proposed SGE significantly outperforms the baselines, resulting in state-of-the-art performance under most metrics.

Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames

Evlampios Apostolidis
Georgios Balaouras
Vasileios Mezaris
Ioannis Patras

In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frames' dependencies and the ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on the use of a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames' dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to the human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates about the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.
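
As an illustration of attention restricted to non-overlapping blocks on the main diagonal, here is a minimal sketch. It assumes the restriction is applied as a mask before a row-wise softmax; the abstract does not state this, and the function names and the fixed `block_size` are hypothetical.

```python
from math import exp

def block_diagonal_mask(num_frames, block_size):
    """mask[i][j] is True when frames i and j fall in the same diagonal block."""
    return [[i // block_size == j // block_size for j in range(num_frames)]
            for i in range(num_frames)]

def masked_softmax_rows(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    out = []
    for row, keep in zip(scores, mask):
        # Subtract the largest kept score for numerical stability.
        peak = max(s for s, k in zip(row, keep) if k)
        exps = [exp(s - peak) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Five frames, blocks of two: frames {0,1}, {2,3} and {4} attend only within their block.
mask = block_diagonal_mask(5, 2)
weights = masked_softmax_rows([[0.0] * 5 for _ in range(5)], mask)
```

Each frame always belongs to its own block, so every row keeps at least one position and the softmax is well defined.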

SESSION: Session 5A: Applications

Fashion Image Search via Anchor-Free Detector

Shanchuan Gao
Fankai Zeng
Lu Cheng
Jicong Fan
Mingbo Zhao

Clothes image search is the key technique for effectively retrieving the clothes items that are most relevant to the query clothes given by the customer. In this work, we propose an anchor-free framework for clothes image search by adopting an additional Re-ID branch for similarity learning and a global mask branch for instance segmentation. The Re-ID branch extracts richer features of the target clothes; here we develop a mask pooling layer that aggregates the features using the mask of the target clothes as guidance. In this way, the extracted features involve more information covered by the mask area of the targets instead of only the center point. The global mask branch is trained simultaneously with the detection and Re-ID branches, and the estimated mask of the target clothes can be utilized in the inference procedure to guide the feature extraction. Finally, to further enhance retrieval performance, we introduce a match loss to fine-tune the Re-ID embedding branch in the framework, so that a clothes target is pulled closer to matching targets while being pushed farther away from different clothes targets. Extensive simulations have been conducted and the results verify the effectiveness of the proposed work.
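
The mask pooling idea can be sketched as a masked average over a feature map. Averaging is an assumption here, since the abstract only says the mask guides the aggregation; the function name and the nested-list layout (height × width × channels) are also hypothetical.

```python
def mask_pool(feature_map, mask):
    """Average the feature vectors at positions where the mask is non-zero."""
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]

# 2x2 feature map with 2 channels; the bottom-right cell is background and is ignored.
features = [[[1.0, 0.0], [3.0, 2.0]],
            [[5.0, 4.0], [100.0, 100.0]]]
target_mask = [[1, 1],
               [1, 0]]
embedding = mask_pool(features, target_mask)
```

Compared with reading the feature at the center point only, the pooled vector reflects every position the mask covers.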

Unsupervised Contrastive Masking for Visual Haze Classification

Jingyu Li
Haokai Ma
Xiangxian Li
Zhuang Qi
Lei Meng
Xiangxu Meng

Haze classification has gained much attention recently as a cost-effective solution for air quality monitoring. Different from conventional image classification tasks, it requires the classifier to capture the haze patterns of different severity degrees. Existing efforts typically focus on the extraction of effective haze features, such as the dark channel and deep features. However, it is observed that light-haze images are often misclassified due to the presence of diverse background scenes. To address this issue, this paper presents an unsupervised contrastive masking (UCM) algorithm to segment the haze regions without any supervision, and develops a dual-channel model-agnostic framework, termed magnifier neural network (MagNet), to effectively use the segmented haze regions to enhance the learning of haze features by conventional deep learning models. Specifically, MagNet employs the haze regions to provide pixel- and feature-level visual information via three strategies, namely Input Augmentation, Network Constraint, and Feature Enhancement, which work as a soft-attention regularizer to alleviate the trade-off between capturing the global scene information and the local information in the haze regions. Experiments were conducted on two datasets in terms of performance comparison, parameter estimation, ablation studies, and case studies, and the results verified that UCM can accurately and rapidly segment the haze regions, and that the three proposed strategies of MagNet consistently improve the performance of state-of-the-art deep learning backbones.

MuLER: Multiplet-Loss for Emotion Recognition

Anwer Slimi
Mounir Zrigui
Henri Nicolas

With the rise of human-machine interactions, it has become necessary for machines to better understand humans in order to respond appropriately. Hence, in order to increase communication and interaction, it would be ideal for machines to automatically detect human emotions. Speech Emotion Recognition (SER) has been a focus of a lot of studies in the past few years. However, they can be considered poor in accuracy and must be improved. In our work, we propose a new loss function that aims to encode speeches instead of classifying them directly as the majority of the existing models do. The encoding will be done in a way that utterances with the same labels would have similar encodings. The encoded speeches were tested on two datasets and we managed to get 88.19% accuracy with the RAVDESS (Ryerson Audiovisual Database of Emotional Speech and Song) dataset and 91.66% accuracy with the RML (Ryerson Multimedia Research Lab) dataset.
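
The abstract does not define the multiplet loss itself. As a point of reference only, the sketch below shows the standard triplet margin loss, a well-known loss that likewise pulls same-label encodings together and pushes different-label encodings apart. It is not the authors' loss; the function names and the margin value are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two encodings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    return max(0.0, squared_distance(anchor, positive)
                    - squared_distance(anchor, negative) + margin)

# Same-label utterance close, different-label utterance far: no penalty.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
# Different-label utterance closer than the same-label one: positive penalty.
violated = triplet_loss([0.0, 0.0], [0.0, 2.0], [1.0, 0.0])
```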

STAFNet: Swin Transformer Based Anchor-Free Network for Detection of Forward-looking Sonar Imagery

Xingyu Zhu
Yingshuo Liang
Jianlei Zhang
Zengqiang Chen

Forward-looking sonar (FLS) is widely applied in underwater operations, among which the search for underwater crash objects and victims is an incredibly challenging task. An efficient detection method based on deep learning can intelligently detect objects in FLS images, which makes it a reliable tool to replace manual recognition. To achieve this aim, we propose a novel Swin Transformer based anchor-free network (STAFNet), which contains a strong Swin Transformer backbone and a lightweight head with a deformable convolution network (DCN). We employ an ROV equipped with an FLS to acquire a dataset that includes victim, boat and plane model objects. A series of experiments is carried out on this dataset to train STAFNet and verify its performance. Compared with other state-of-the-art methods, STAFNet significantly overcomes complex noise interference, and achieves the best balance between detection accuracy and inference speed.

SESSION: Session 5B: Robust MM

Camouflaged Poisoning Attack on Graph Neural Networks

Chao Jiang
Yi He
Richard Chapman
Hongyi Wu

Graph neural networks (GNNs) have enabled the automation of many web applications that entail node classification on graphs, such as scam detection in social media and event prediction in service networks. Nevertheless, recent studies revealed that GNNs are vulnerable to adversarial attacks, where feeding GNNs with poisoned data at training time can lead them to yield catastrophically low test accuracy. This finding heats up the frontier of attacks and defenses against GNNs. However, the prior studies mainly posit that the adversaries can enjoy free access to manipulate the original graph, while obtaining such access could be too costly in practice. To fill this gap, we propose a novel attacking paradigm, named Generative Adversarial Fake Node Camouflaging (GAFNC), with its crux lying in crafting a set of fake nodes in a generative-adversarial regime. These nodes carry camouflaged malicious features and can poison the victim GNN by passing their malicious messages to the original graph via learned topological structures, such that they 1) maximize the degradation of classification accuracy (i.e., global attack) or 2) force the victim GNN to misclassify a targeted node set into prescribed classes (i.e., target attack). We benchmark our experiments on four real-world graph datasets, and the results substantiate the viability, effectiveness, and stealthiness of our proposed poisoning attack approach. Code is released at github.com/chao92/GAFNC.

Accelerated Sign Hunter: A Sign-based Black-box Attack via Branch-Prune Strategy and Stabilized Hierarchical Search

Siyuan Li
Guangji Huang
Xing Xu
Yang Yang
Fumin Shen

We propose the Accelerated Sign Hunter (ASH), a sign-based black-box attack under the l∞ constraint. The proposed method searches for an approximate gradient sign of the loss w.r.t. the input image with few queries to the target model and crafts the adversarial example by updating the input image in this direction. It applies a Branch-Prune Strategy that infers the unknown sign bits according to the checked ones to avoid unnecessary queries. It also adopts a Stabilized Hierarchical Search to achieve better performance within a limited query budget. We provide a theoretical proof showing that the Accelerated Sign Hunter halves the queries without dropping the attack success rate (SR) compared with the state-of-the-art sign-based black-box attack. Extensive experiments also demonstrate the superiority of our ASH method over other black-box attacks. In particular, on Inception-v3 for ImageNet, our method achieves an SR of 0.989 with an average of 338.56 queries, which is 1/4 fewer than the state-of-the-art sign-based attack needs to achieve the same SR. Moreover, our ASH method works out of the box since there are no hyperparameters that need to be tuned.
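
The final step described above, updating the input image along an estimated gradient sign under an l∞ budget, can be sketched as follows. This shows only the perturbation step, not the Branch-Prune Strategy or the Stabilized Hierarchical Search; the function names, the flat pixel list, and the [0, 1] pixel range are assumptions.

```python
def sign_step(image, sign_bits, epsilon, lo=0.0, hi=1.0):
    """Shift each pixel by +/- epsilon along its sign bit, then clip to the valid range."""
    return [min(hi, max(lo, x + epsilon * s)) for x, s in zip(image, sign_bits)]

def linf_distance(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical 3-pixel image and an estimated sign vector with entries in {+1, -1}.
clean = [0.2, 0.5, 0.99]
adversarial = sign_step(clean, [1, -1, 1], epsilon=0.05)
budget_used = linf_distance(clean, adversarial)
```

Because every pixel moves by at most `epsilon` and clipping can only shrink that move, the result always stays inside the l∞ ball around the clean image.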

DiGAN: Directional Generative Adversarial Network for Object Transfiguration

Zhen Luo
Yingfang Zhang
Peihao Zhong
Jingjing Chen
Donglong Chen

The concept of cycle consistency in couple mapping has helped CycleGAN achieve remarkable performance in the context of image-to-image translation. However, its limitations in object transfiguration have not yet been fully resolved. In order to alleviate previous problems of wrong transformation position, degeneration, and artifacts, this work presents a new approach called Directional Generative Adversarial Network (DiGAN) in the field of object transfiguration. The major contribution of this work is threefold. First, paired directional generators are designed for both intra-domain and inter-domain generations. Second, a segmentation network based on Mask R-CNN is introduced to build conditional inputs for both generators and discriminators. Third, a feature loss and a segmentation loss are added to optimize the model. Experimental results indicate that, in horse-to-zebra mapping, DiGAN surpasses CycleGAN and AttentionGAN, scoring 17.2% and 60.9% higher on Inception Score, 15.5% and 2.05% lower on Fréchet Inception Distance, and 14.2% and 15.6% lower on VGG distance, respectively.

GIO: A Timbre-informed Approach for Pitch Tracking in Highly Noisy Environments

Xiaoheng Sun
Xia Liang
Qiqi He
Bilei Zhu
Zejun Ma

As one of the fundamental tasks in music and speech signal processing, pitch tracking has been attracting attention for decades. While a human can focus on the voiced pitch even in highly noisy environments, most existing automatic pitch tracking systems show unsatisfactory performance when encountering noise. To mimic human auditory perception, a data-driven model named GIO is proposed in this paper, in which timbre information is introduced to guide pitch tracking. The proposed model takes two inputs: a short audio segment to extract pitch from and a timbre embedding derived from the speaker's or singer's voice. In experiments, we use a music artist classification model to extract timbre embedding vectors. A dual-branch structure and a two-step training method are designed to enable the model to predict voice presence. The experimental results show that the proposed model gains a significant improvement in noise robustness and outperforms existing state-of-the-art methods with fewer parameters.

SESSION: Session 5C: Action, Pose and Body

Source-free Temporal Attentive Domain Adaptation for Video Action Recognition

Peipeng Chen
Andy J. Ma

With the rapidly increasing video data, many video analysis techniques have been developed and achieved success in recent years. To mitigate the distribution bias of video data across domains, unsupervised video domain adaptation (UVDA) has been proposed and become an active research topic. Nevertheless, existing UVDA methods need to access source domain data during training, which may result in problems of privacy policy violation and transfer inefficiency. To address this issue, we propose a novel source-free temporal attentive domain adaptation (SFTADA) method for video action recognition under the more challenging UVDA setting, such that source domain data is not required for learning the target domain. In our method, an innovative Temporal Attentive aGgregation (TAG) module is designed to combine frame-level features with varying importance weights for video-level representation generation. Without source domain data and label information in the target domain and during testing, an MLP-based attention network is trained to approximate the attentive aggregation function based on class centroids. By minimizing frame-level and video-level loss functions, both the temporal and spatial domain shifts in cross-domain video data can be reduced. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our proposed method in solving the challenging source-free UVDA task.
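
Combining frame-level features with varying importance weights can be sketched as a weighted sum. The softmax normalisation and the precomputed `frame_scores` (standing in for the output of an attention network) are assumptions; the abstract does not specify how the weights are normalised.

```python
from math import exp

def attentive_aggregate(frame_features, frame_scores):
    """Softmax the per-frame scores and return the weighted sum of frame features."""
    peak = max(frame_scores)  # subtract the max for numerical stability
    weights = [exp(s - peak) for s in frame_scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(frame_features[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_features)) for d in range(dim)]

# Two hypothetical 2-D frame features.
frames = [[1.0, 0.0], [3.0, 2.0]]
uniform = attentive_aggregate(frames, [0.0, 0.0])      # equal scores: plain average
peaked = attentive_aggregate(frames, [0.0, 1000.0])    # second frame dominates
```

With equal scores the result reduces to average pooling, so the attention weights can be read as a learned departure from uniform temporal pooling.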

Review of Deep Learning Models for Spine Segmentation

Neng Zhou
Hairu Wen
Yi Wang
Yang Liu
Longfei Zhou

Medical image segmentation has been a long-standing challenge due to the limitation in labeled datasets and the existence of noise and artifacts. In recent years, deep learning has shown its capability in achieving successive progress in this field, making its automatic segmentation performance gradually catch up with that of manual segmentation. In this paper, we select twelve state-of-the-art models and compare their performance in the spine MRI segmentation task. We divide them into two categories. One of them is the U-Net family, including U-Net, Attention U-Net, ResUNet++, TransUNet, and MiniSeg. The architectures of these models generally include the encoder-decoder structure, and their innovation generally lies in better ways of fusing low-level and high-level information. Models in the other category, named Models Using Backbone, often use ResNet, Res2Net, or other models pre-trained on ImageNet as the backbone to extract information. These models pay more attention to capturing multi-scale and rich contextual information. All models are trained and tested on the open-source spine MRI dataset with 20 labels and no pre-training. In the comparison, the models using a backbone outperform the U-Net family, and DeepLabv3+ works best. We suppose it is also necessary to extract multi-scale information in a multi-label medical segmentation task.

3D-Augmented Contrastive Knowledge Distillation for Image-based Object Pose Estimation

Zhidan Liu
Zhen Xing
Xiangdong Zhou
Yijiang Chen
Guichun Zhou

Image-based object pose estimation is appealing because in real applications the shape of the object is often unavailable or not as easy to capture as a photo. Although this is an advantage to some extent, leaving shape information unexplored in a 3D vision learning problem looks like a "flaw in jade". In this paper, we deal with the problem in a reasonable new setting, namely the 3D shape is exploited in the training process, and the testing is still purely image-based. We enhance the performance of image-based methods for category-agnostic object pose estimation by exploiting 3D knowledge learned by a multi-modal method. Specifically, we propose a novel contrastive knowledge distillation framework that effectively transfers 3D-augmented image representation from a multi-modal model to an image-based model. We integrate contrastive learning into the two-stage training procedure of knowledge distillation, which formulates an advanced solution to combine these two approaches for cross-modal tasks. We experimentally report state-of-the-art results, outperforming existing category-agnostic image-based methods by a large margin (up to +5% improvement on the ObjectNet3D dataset), demonstrating the effectiveness of our method.

Selective Hypergraph Convolutional Networks for Skeleton-based Action Recognition

Yiran Zhu
Guangji Huang
Xing Xu
Yanli Ji
Fumin Shen

In skeleton-based action recognition, Graph Convolutional Networks (GCNs) have achieved remarkable performance since the skeleton representation of human action can be naturally modeled by the graph structure. Most of the existing GCN-based methods extract skeleton features by exploiting single-scale joint information, while neglecting the valuable multi-scale contextual information. Besides, the commonly used strided convolution in the temporal dimension could evenly filter out the keyframes we expect to preserve and lead to the loss of keyframe information. To address these issues, we propose a novel Selective Hypergraph Convolution Network, dubbed Selective-HCN, which stacks two key modules: Selective-scale Hypergraph Convolution (SHC) and Selective-frame Temporal Convolution (STC). The SHC module represents the human skeleton as both a graph and a hypergraph to fully extract multi-scale information, and selectively fuses features at various scales. Instead of traditional strided temporal convolution, the STC module can adaptively select keyframes and filter redundant frames according to the importance of the frames. Extensive experiments on two challenging skeleton action benchmarks, i.e., NTU-RGB+D and Skeleton-Kinetics, demonstrate the superiority and effectiveness of our proposed method.

SESSION: Session 6: Multifarious Multimedia

Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning

Guangyu Chen
Deyuan Zhang
Tao Liu
Xiaoyong Du

Voice-face association learning (VFAL) aims to tap into the potential connections between voices and faces. Most studies currently address this problem in a supervised manner, which cannot exploit the wealth of unlabeled video data. To solve this problem, we propose an unsupervised learning framework: Self-Lifting (SL), which can use unlabeled video data for learning. This framework includes two iterative steps of "clustering" and "metric learning". In the first step, unlabeled video data is mapped into the feature space by a coarse model. Then unsupervised clustering is leveraged to assign a pseudo-label to each video. In the second step, the pseudo-label is used as supervisory information to guide the metric learning process, which produces the refined model. These two steps are performed alternately to lift the model's performance. Experiments show that our framework can effectively use unlabeled video data for learning. On the VoxCeleb dataset, our approach achieves SOTA results among the unsupervised methods and has competitive performance compared with the supervised competitors. Our code is released on GitHub.

Revisiting Performance Measures for Cross-Modal Hashing

Hongya Wang
Shunxin Dai
Ming Du
Bo Xu
Mingyong Li

Recently, cross-modal hashing has attracted much attention due to its low storage cost and fast query speed. Mean Average Precision (MAP) is the most widely used performance measure for cross-modal hashing. However, we found that MAP scores do not fully reflect the quality of the top-K results for cross-modal retrieval because they neglect multi-label information and overlook the label semantic hierarchy. In view of this, we propose a new performance measure named Normalized Weighted Discounted Cumulative Gains (NWDCG) by extending Normalized Discounted Cumulative Gains (NDCG) using a co-occurrence probability matrix. To verify the effectiveness of NWDCG, we conduct extensive experiments using three popular cross-modal hashing schemes over two publicly available datasets.
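
The gap between average precision and a graded measure can be seen in a small worked example. The sketch below computes standard average precision and standard NDCG only; it does not implement NWDCG or its co-occurrence weighting. Using the number of labels shared with the query as the gain is an assumption made for illustration.

```python
from math import log2

def average_precision(relevant_flags):
    """Average precision over a ranked list of binary relevance flags."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def dcg(gains):
    """Discounted cumulative gain of a ranked list of graded gains."""
    return sum(g / log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(gains, k):
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Gains = number of labels each retrieved item shares with the query (hypothetical).
ranking_a = [3, 1, 1]   # best match ranked first
ranking_b = [1, 1, 3]   # best match ranked last
ap_a = average_precision([g > 0 for g in ranking_a])
ap_b = average_precision([g > 0 for g in ranking_b])
ndcg_a = ndcg_at_k(ranking_a, 3)
ndcg_b = ndcg_at_k(ranking_b, 3)
```

Both rankings receive the same average precision because every item shares at least one label, whereas NDCG rewards placing the item with the most shared labels first.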

Local Slot Attention for Vision and Language Navigation

Yifeng Zhuang
Qiang Sun
Yanwei Fu
Lifeng Chen
Xiangyang Xue

Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing community. The VLN task requires an agent to navigate to a goal location following natural language instructions in unfamiliar environments.

Recently, transformer-based models have gained significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, there exist two problems in current transformer-based models. 1) The models process each view independently without taking the integrity of the objects into account. 2) During the self-attention operation in the visual modality, views that are spatially distant can be interwoven with each other without explicit restriction. This kind of mixing may introduce extra noise instead of useful information.

To address these issues, we propose 1) a slot-attention based module to incorporate information from segments of the same object, and 2) a local attention mask mechanism to limit the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use the Recurrent VLN-Bert as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
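
A local attention mask that limits the visual attention span can be sketched as a neighbourhood test between views. The layout assumed here, views arranged on a single ring of headings with a fixed `span`, is hypothetical and is not taken from the paper.

```python
def local_attention_mask(num_views, span):
    """mask[i][j] is True when views i and j are at most `span` steps apart on a ring."""
    mask = []
    for i in range(num_views):
        row = []
        for j in range(num_views):
            gap = abs(i - j)
            ring_gap = min(gap, num_views - gap)  # wrap around the panorama
            row.append(ring_gap <= span)
        mask.append(row)
    return mask

# Six views, each allowed to attend to itself and its two immediate neighbours.
view_mask = local_attention_mask(6, 1)
```

Positions marked False would be excluded from the attention computation, so spatially distant views cannot mix.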

Cross-Pixel Dependency with Boundary-Feature Transformation for Weakly Supervised Semantic Segmentation

Yuhui Guo
Xun Liang
Tang Hui
Bo Wu
Xiangping Zheng

Weakly supervised semantic segmentation with image-level labels is a challenging problem that typically relies on the initial responses generated by the classification network to locate object regions. However, such initial responses only cover the most discriminative parts of the object and may incorrectly activate in the background regions. To address this problem, we propose a Cross-pixel Dependency with Boundary-feature Transformation (CDBT) method for weakly supervised semantic segmentation. Specifically, we develop a boundary-feature transformation mechanism, to build strong connections among pixels belonging to the same object but weak connections among different objects. Moreover, we design a cross-pixel dependency module to enhance the initial responses, which exploits context appearance information and refines the prediction of current pixels by the relations of global channel pixels, thus generating pseudo labels of higher quality for training the semantic segmentation network. Extensive experiments on the PASCAL VOC 2012 segmentation benchmark demonstrate that our method outperforms state-of-the-art methods using image-level labels as weak supervision.

Mobile Emotion Recognition via Multiple Physiological Signals using Convolution-augmented Transformer

Kangning Yang
Benjamin Tag
Yue Gu
Chaofan Wang
Tilman Dingler
Greg Wadley
Jorge Goncalves

Recognising and monitoring emotional states play a crucial role in mental health and well-being management. Importantly, with the widespread adoption of smart mobile and wearable devices, it has become easier to collect long-term and granular emotion-related physiological data passively, continuously, and remotely. This creates new opportunities to help individuals manage their emotions and well-being in a less intrusive manner using off-the-shelf low-cost devices. Pervasive emotion recognition based on physiological signals is, however, still challenging due to the difficulty of efficiently extracting high-order correlations between physiological signals and users' emotional states. In this paper, we propose a novel end-to-end emotion recognition system based on a convolution-augmented transformer architecture. Specifically, it can recognise users' emotions on the dimensions of arousal and valence by learning both the global and local fine-grained associations and dependencies within and across multimodal physiological data (including blood volume pulse, electrodermal activity, heart rate, and skin temperature). We extensively evaluated the performance of our model using the K-EmoCon dataset, which was acquired in naturalistic conversations using off-the-shelf devices and contains spontaneous emotion data. Our results demonstrate that our approach outperforms the baselines and achieves state-of-the-art or competitive performance. We also demonstrate the effectiveness and generalizability of our system on another affective dataset which used affect inducement and commercial physiological sensors.

SESSION: Special Session 1: Adversarial Learning for Multimedia Understanding and Retrieval

VAC-Net: Visual Attention Consistency Network for Person Re-identification

Weidong Shi
Yunzhou Zhang
Shangdong Zhu
Yixiu Liu
Sonya Coleman
Dermot Kerr

Person re-identification (ReID) is a crucial aspect of recognising pedestrians across multiple surveillance cameras. Even though significant progress has been made in recent years, viewpoint changes and scale variations still affect model performance. In this paper, we observe that a model handles these issues better when its ability to extract consistent features across different transforms (e.g., flipping and scaling) of the same image is boosted. To this end, we propose a visual attention consistency network (VAC-Net). Specifically, we propose an Embedding Spatial Consistency (ESC) architecture that takes the flipped, scaled and original forms of the same image as inputs to learn a consistent embedding space. Furthermore, we design an Input-Wise visual attention consistent loss (IW-loss) so that the class activation maps (CAMs) from the three transforms are aligned with each other, which keeps their high-level semantic information consistent. Finally, we propose a Layer-Wise visual attention consistent loss (LW-loss) to further enforce the semantic information among different stages to be consistent with the CAMs within each branch. These two losses effectively help the model address viewpoint and scale variations. Experiments on the challenging Market-1501, DukeMTMC-reID, and MSMT17 datasets demonstrate the effectiveness of the proposed VAC-Net.
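
The input-wise consistency idea in the abstract above can be illustrated with a minimal sketch. It assumes the class activation maps are already available as small 2D grids, and it uses a mean squared difference as the penalty. The function names and the choice of squared error are illustrative assumptions, not the IW-loss as defined in the paper.

```python
def hflip(cam):
    """Horizontally flip a 2D map given as a list of rows."""
    return [list(reversed(row)) for row in cam]


def flip_consistency_loss(cam_original, cam_of_flipped_input):
    """Mean squared difference between the CAM of the flipped input and
    the flipped CAM of the original input (a hypothetical penalty)."""
    target = hflip(cam_original)
    total, count = 0.0, 0
    for row_t, row_f in zip(target, cam_of_flipped_input):
        for t, f in zip(row_t, row_f):
            total += (t - f) ** 2
            count += 1
    return total / count


# Toy 2x2 maps: `consistent` is exactly the mirrored version of `cam`.
cam = [[0.1, 0.9], [0.2, 0.8]]
consistent = [[0.9, 0.1], [0.8, 0.2]]
loss_when_aligned = flip_consistency_loss(cam, consistent)
loss_when_misaligned = flip_consistency_loss(cam, cam)
```

The loss is zero when attention moves with the image under flipping and positive otherwise; a scaling branch would need a resize step in place of `hflip`.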

MFGAN: A Lightweight Fast Multi-task Multi-scale Feature-fusion Model based on GAN

Lijia Deng
Yu-Dong Zhang

Cell segmentation and counting is a time-consuming task and an important experimental step in traditional biomedical research. Many current counting methods require exact cell locations. However, there are few cell datasets with detailed object coordinates. Most existing cell datasets only have the total number of cells and a global segmentation labelling. To make more effective use of existing datasets, we divided the cell counting task into cell number prediction and cell segmentation. This paper proposed a lightweight fast multi-task multi-scale feature fusion model based on generative adversarial networks (MFGAN). To coordinate the learning of these two tasks, we proposed a Combined Hybrid Loss function (CH Loss) and used conditional GAN to train our network. We proposed a Lightweight Fast Multitask Generator (LFMG), which reduced the number of parameters by 20% compared with U-Net but achieved better performance on cell segmentation. We used multi-scale feature fusion technology to improve the quality of reconstructed segmentation images. In addition, we proposed a Structure Fusion Discrimination (SFD) to refine the accuracy of the details of the features. Our method achieved non-point-based counting, which no longer requires annotating the exact position of each cell in the image during training, and achieved excellent results on cell counting and cell segmentation.

Adaptive Temporal Grouping for Black-box Adversarial Attacks on Videos

Zhipeng Wei
Jingjing Chen
Hao Zhang
Linxi Jiang
Yu-Gang Jiang

Deep-learning based video models, which have remarkable performance on action recognition tasks, have recently been shown to be vulnerable to adversarial samples, even those generated in the black-box setting. However, these black-box attack methods are insufficient to attack video models in real-world applications because they require a large number of queries. To this end, we propose to boost the efficiency of black-box attacks on video recognition models. Although videos carry rich temporal information, they include redundant spatial information from adjacent frames. This motivates us to introduce the adaptive temporal grouping (ATG) method, which groups video frames by the similarity of their features extracted from an ImageNet-pretrained image model. By selecting one key-frame from each group, ATG helps any black-box attack method to optimize the adversarial perturbations over key-frames instead of all frames, where the estimated gradient of a key-frame is shared with the other frames in its group. To balance the efficiency and precision of estimated gradients, ATG adaptively adjusts the group number by the magnitude of the current perturbation and the current query number. Through extensive experiments on the HMDB-51 dataset and the UCF-101 dataset, we demonstrate that ATG can significantly reduce the number of queries by more than 10% for the targeted attack.
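
The grouping and gradient-sharing steps described in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions: per-frame features are given as plain vectors, consecutive frames are grouped by cosine similarity to the group's first frame, and the similarity threshold is fixed. The function names and the threshold are hypothetical, and the adaptive adjustment of the group number described in the abstract is not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_frames(features, threshold=0.9):
    """Group consecutive frames that stay similar to the group's first frame.

    Returns a list of index lists; the first index of each group is
    treated as its key-frame (an assumption made for this sketch).
    """
    groups = []
    for idx, feat in enumerate(features):
        if groups and cosine(features[groups[-1][0]], feat) >= threshold:
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups


def share_gradients(groups, key_gradients):
    """Copy each key-frame's estimated gradient to every frame in its group."""
    per_frame = {}
    for group, grad in zip(groups, key_gradients):
        for idx in group:
            per_frame[idx] = grad
    return [per_frame[i] for i in sorted(per_frame)]


# Four toy frames: the first two and the last two have similar features.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 1.0]]
groups = group_frames(features)
shared = share_gradients(groups, ["grad_key_0", "grad_key_2"])
```

With this toy input, a black-box attack would only need to estimate two gradients instead of four, which is where the query savings would come from.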

SESSION: Special Session 2A: Transformer-based Multimedia Understanding: Model Design, Learning, Distillation

Parallelism Network with Partial-aware and Cross-correlated Transformer for Vehicle Re-identification

Guangqi Jiang
Huibing Wang
Jinjia Peng
Xianping Fu

Vehicle re-identification (ReID) aims to identify a specific vehicle in a dataset captured by non-overlapping cameras, and it plays a significant role in the development of intelligent transportation systems. Even though CNN-based models achieve impressive performance on the ReID task, the Gaussian distribution of their effective receptive fields limits their ability to capture long-term dependence between features. Moreover, it is crucial to capture fine-grained features and the relationships between features as much as possible from vehicle images.

To address these problems, we propose a partial-aware and cross-correlated transformer model (PCTM), which adopts a parallelism network that extracts discriminant features to optimize the feature representation for vehicle ReID. PCTM includes a cross-correlation transformer branch that fuses the features extracted by the transformer module and the feature guidance module, which guides the network to capture the long-term dependence of key features. In this way, the feature guidance module encourages the transformer-based features to focus on the vehicle itself and avoids interference from excessive background during feature extraction. Moreover, PCTM introduces a partial-aware structure in the second branch to explore fine-grained information from vehicle images and capture local differences between vehicles. Furthermore, we conducted experiments on two vehicle datasets to verify the performance of PCTM.

Motor Learning based on Presentation of a Tentative Goal

Siqi Sun
Yongqing Sun
Mitsuhiro Goto
Shigekuni Kondo
Dan Mikami
Susumu Yamamoto

This paper presents a motor learning method based on presenting a personalized target motion, which we call a tentative goal. While many prior studies have focused on helping users correct their motor skill motions, most of them present the reference motion to users regardless of whether the motion is attainable or not. This makes it difficult for users to appropriately modify their motion toward the reference motion when the difference between the two is too large. This study aims to provide a tentative goal that maximizes performance within a certain amount of motion change. To achieve this, it is necessary to predict the performance of any motion. However, it is challenging to estimate the performance of a tentative goal by building a general model because of the large variety of human motion. Therefore, we built an individual model that predicts performance from a small training dataset and implemented it using our proposed data augmentation method. Experiments with basketball free-throw data demonstrate the effectiveness of the proposed method.

Extracting Precedence Relations between Video Lectures in MOOCs

Kui Xiao
Youheng Bai
Yan Zhang

Nowadays, a high dropout rate has become a widespread phenomenon on various MOOC platforms. When taking a MOOC, many learners are reluctant to spend time learning from the first video lecture to the last one. If we can recommend a learning path based on learners' individual needs and ignore irrelevant video lectures in the MOOC, it will help them learn more efficiently. The premise of learning path recommendation is to understand the precedence relations between learning resources. In this paper, we propose a novel approach for extracting precedence relations between video lectures in a MOOC. Based on the "knowledge depth" of concepts, we accurately extract the core concepts from the video captions. Transformer-based models are used to discover concept prerequisite relations, which help us identify the precedence relations between video lectures in MOOCs. Experiments show that the proposed method outperforms the state-of-the-art methods.

M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

Junke Wang
Zuxuan Wu
Wenhao Ouyang
Xintong Han
Jingjing Chen
Yu-Gang Jiang
Ser-Nam Lim

The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images. In this paper, we aim to capture the subtle manipulation artifacts at different scales using transformer models. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which operates on patches of different sizes to detect local inconsistencies in images at different spatial levels. M2TR further learns to detect forgery artifacts in the frequency domain to complement RGB information through a carefully designed cross modality fusion block. In addition, to stimulate Deepfake detection research, we introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods. We conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods by clear margins.

SESSION: Special Session 2B: Transformer-based Multimedia Understanding: Model Design, Learning, Distillation

Blindfold Attention: Novel Mask Strategy for Facial Expression Recognition

Bo Fu
Yuanxin Mao
Shilin Fu
Yonggong Ren
Zhongxuan Luo

Facial Expression Recognition (FER) is a basic and crucial computer vision task that classifies emotional expressions in human face images into emotion categories such as happy, sad, surprised, scared, angry, etc. Recently, facial expression recognition based on deep learning has made great progress. However, regardless of the weight initialization technique or the attention mechanism used, deep-learning-based recognition methods find it hard to capture features that are visually insignificant but semantically important. To address this problem, we present a novel Facial Expression Recognition training strategy consisting of two components: Memo Affinity Loss (MAL) and Mask Attention Fine Tuning (MAFT). MAL is a variant of center loss that uses a memory bank strategy as well as discriminative centers. MAL widens the distance between different clusters and narrows the distance within each cluster. As a result, the features extracted by the CNN are comprehensive and independent, which produces a more robust model. MAFT is a strategy that temporarily blindfolds the attended parts and forces the model to learn from other important regions of the input image. It is not only an augmentation technique but also a novel fine-tuning approach. To the best of our knowledge, we are the first to apply the mask strategy to the attention part and to use this strategy to fine-tune models. Finally, to implement our ideas, we constructed a new network named Architecture Attention ResNet based on ResNet-18. Our methods are conceptually and practically simple, yet achieve superior results on popular public facial expression recognition benchmarks, with 88.75% on RAF-DB, 65.17% on AffectNet-7, and 60.72% on AffectNet-8. The code will be open-sourced soon.

MSSPQ: Multiple Semantic Structure-Preserving Quantization for Cross-Modal Retrieval

Lei Zhu
Liewu Cai
Jiayu Song
Xinghui Zhu
Chengyuan Zhang
Shichao Zhang

Cross-modal hashing, which generates compact hash codes from multimedia content for efficient cross-modal search, is a hot topic in the multimedia community. Two challenges cannot be ignored: (1) how to efficiently enhance cross-modal semantic mining, which is essential for cross-modal hash code learning, and (2) how to combine multiple semantic correlation learning to improve semantic similarity preservation. To this end, this paper proposes a novel end-to-end cross-modal hashing approach, named Multiple Semantic Structure-Preserving Quantization (MSSPQ), which integrates a deep hashing model with multiple semantic correlation learning to boost hash learning performance. The multiple semantic correlation learning consists of inter-modal and intra-modal pairwise correlation learning and Cosine correlation learning, which can comprehensively capture cross-modal consistent semantics and realize semantic similarity preservation. Extensive experiments conducted on three multimedia datasets confirm that the proposed method outperforms the baselines.

SESSION: Special Session 3A: Weakly Supervised Learning for Medical Image Analysis

Lesion Localization in OCT by Semi-Supervised Object Detection

Yue Wu
Yang Zhou
Jianchun Zhao
Jingyuan Yang
Weihong Yu
Youxin Chen
Xirong Li

Over 300 million people worldwide are affected by various retinal diseases. By noninvasive Optical Coherence Tomography (OCT) scans, a number of abnormal structural changes in the retina, namely retinal lesions, can be identified. Automated lesion localization in OCT is thus important for detecting retinal diseases at their early stage. To conquer the lack of manual annotation for deep supervised learning, this paper presents a first study on utilizing semi-supervised object detection (SSOD) for lesion localization in OCT images. To that end, we develop a taxonomy to provide a unified and structured viewpoint of the current SSOD methods, and consequently identify key modules in these methods. To evaluate the influence of these modules in the new task, we build OCT-SS, a new dataset consisting of over 1k expert-labeled OCT B-scan images and over 13k unlabeled B-scans. Extensive experiments on OCT-SS identify Unbiased Teacher (UnT) as the best current SSOD method for lesion localization. Moreover, we improve over this strong baseline, with mAP increased from 49.34 to 50.86.

Weakly Supervised Pediatric Bone Age Assessment Using Ultrasonic Images via Automatic Anatomical RoI Detection

Yunyan Yan
Chuanbin Liu
Hongtao Xie
Sicheng Zhang
Zhendong Mao

Bone age assessment (BAA) is vital in pediatric clinical diagnosis. Existing deep learning methods predict bone age based on detection or segmentation of Regions of Interest (RoIs) in hand radiographs, which requires expensive annotations. The imaging limitations and cost of radiographic techniques also hinder their clinical application. Compared to X-ray images, ultrasonic images are rather clean, cheap and flexible, but deep learning research on ultrasonic BAA remains unexplored. To fill this gap, we propose a weakly supervised interpretable framework entitled USB-Net, which utilizes ultrasonic pelvis images and only image-level age annotations. USB-Net consists of an automatic anatomical RoI detection stage and an age assessment stage. In the detection stage, USB-Net locates the discriminative anatomical RoIs of the pelvis through an attention heatmap without any extra RoI supervision. In the assessment stage, the cropped anatomical RoI patch is fed as fine-grained input to estimate age. In addition, we provide the first ultrasonic BAA dataset, composed of 1644 ultrasonic hip joint images with image-level labels of age and gender. The experimental results verify that our model's attention is consistent with human knowledge and that it achieves a mean absolute error (MAE) of 16.24 days on the USBAA dataset.

I2-Net: Intra- and Inter-scale Collaborative Learning Network for Abdominal Multi-organ Segmentation

Chao Suo
Xuanya Li
Donghui Tan
Yuan Zhang
Xieping Gao

Efficient and accurate abdominal multi-organ segmentation is the key to clinical applications such as computer-aided diagnosis and computer-aided surgery, but this task is extremely challenging due to blurred organ boundaries, complex backgrounds, and different organ sizes. Although existing segmentation methods have achieved good results, we found that their segmentation performance on small and medium abdominal organs is often unsatisfactory, even though the accurate location and segmentation of these organs plays an important role in the diagnosis and screening of clinical diseases. To address this problem, we propose an intra- and inter-scale collaborative learning network (I2-Net) for the abdominal multi-organ segmentation task. Firstly, we design a Feature Complementary Module (FCM) to adaptively complement the local and global features extracted by CNN and Transformer. Secondly, we propose a Feature Aggregation Module (FAM) to aggregate multi-scale semantic information. Finally, we employ a Focus Module (FM) for collaborative learning of intra- and inter-scale features. Extensive experiments on the Synapse dataset show that our method outperforms the state-of-the-art approaches and achieves accurate segmentation of abdominal organs, especially small and medium organs.

SA-NAS-BFNR: Spatiotemporal Attention Neural Architecture Search for Task-based Brain Functional Network Representation

Fenxia Duan
Chunhong Cao
Xieping Gao

The spatiotemporal representation of task-based brain functional networks is a key topic in functional magnetic resonance image (fMRI) research. At present, deep learning has been more powerful and flexible in brain functional network research than traditional methods. However, the dominant deep learning models fail to capture the long-distance dependency (LDD) in task-based fMRI images (tfMRI) due to the time correlation among different task stimuli, the nature between temporal and spatial dimensions, which results in inaccurate brain pattern extraction. To address this issue, this paper proposes a spatiotemporal attention neural architecture search (NAS) model for task-based brain functional network representation (SA-NAS-BFNR), where an attention mechanism and a gated recurrent unit (GRU) are integrated into a novel framework and the GRU structure is searched by differentiable neural architecture search. This model can not only obtain meaningful brain functional networks (BFNs) by addressing the LDD, but also simplify the existing recurrent structure models in tfMRI. Experiments show that the proposed model improves the fit between time series and the task stimulus sequence and extracts the BFNs effectively.

SESSION: Special Session 3B: Weakly Supervised Learning for Medical Image Analysis

Weakly-supervised Cerebrovascular Segmentation Network with Shape Prior and Model Indicator

Qian Wu
Yufei Chen
Ning Huang
Xiaodong Yue

Labeling cerebral vessels requires domain knowledge in neurology and can be extremely laborious, and there is a scarcity of public annotated cerebrovascular datasets. Traditional machine learning or statistical models can yield decent results on thick vessels with high contrast but perform poorly on regions of low contrast. In our work, we employ the output of a statistical model as noisy labels and propose a Transformer-based architecture that utilizes a Hessian shape prior as soft supervision. This enhances the network's ability to learn tubular structures, so that the model can make more accurate predictions for refined cerebrovascular segmentation. Furthermore, to combat overfitting to noisy labels during model training, we introduce an effective label extension strategy that only calls for a few manual strokes on one sample. These supplementary labels are not used for supervision but only as an indicator of where the model retains the most generalization capability, so as to further guide model selection in validation. Our experiments are carried out on a public TOF-MRA dataset from the MIDAS data platform, and the results demonstrate that our method shows superior performance on cerebrovascular segmentation, achieving a Dice of 0.831±0.040 on the dataset.
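
For readers unfamiliar with the Dice score reported above, the following minimal sketch computes the standard Dice coefficient, 2|A∩B| / (|A| + |B|), for two binary masks. The flat-list mask representation and the convention of returning 1.0 for two empty masks are assumptions made for illustration; the paper's evaluation code is not shown in the abstract.

```python
def dice(pred, truth):
    """Dice coefficient for two flat binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    size = sum(pred) + sum(truth)
    # Two empty masks are treated as a perfect match (a convention, not a rule).
    return 2.0 * intersection / size if size else 1.0


# Prediction marks two voxels as vessel, ground truth marks one of them.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```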

SESSION: Doctoral Symposium

FreqCAM: Frequent Class Activation Map for Weakly Supervised Object Localization

Runsheng Zhang

Class Activation Map (CAM) is a commonly used solution for weakly supervised tasks. However, most existing CAM-based methods share one crucial problem: they locate only small object parts instead of full object regions. In this paper, we find that the co-occurrence between the feature maps of different channels might provide more clues for object locations. Therefore, we propose a simple yet effective method, called Frequent Class Activation Map (FreqCAM), which exploits element-wise frequency information from the last convolutional layers as an attention filter to generate object regions. Our FreqCAM can filter out background noise and robustly obtain more accurate fine-grained object localization information. Furthermore, our approach is a post-hoc method applied to a trained classification model, and thus can be used to improve the performance of existing methods without modification. Experiments on the standard dataset CUB-200-2011 show that our proposed method achieves a significant increase in localization performance compared to existing state-of-the-art methods, without any architectural changes or re-training.
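
One possible reading of "element-wise frequency information" is sketched below: for each spatial location, count in how many channels the activation exceeds that channel's mean, and keep locations that are active in a large enough share of channels. This is an interpretation for illustration only. The per-channel mean threshold, the `ratio` parameter, and the function name are assumptions, and the actual FreqCAM procedure may differ.

```python
def frequency_mask(feature_maps, ratio=0.5):
    """Build a 0/1 mask from channel co-occurrence counts.

    feature_maps: list of C channels, each an H x W list of lists.
    A location counts as active in a channel when its value exceeds
    that channel's mean (an assumed rule). Locations active in at
    least ratio * C channels are kept.
    """
    channels = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    counts = [[0] * width for _ in range(height)]
    for fmap in feature_maps:
        mean = sum(sum(row) for row in fmap) / (height * width)
        for y in range(height):
            for x in range(width):
                if fmap[y][x] > mean:
                    counts[y][x] += 1
    return [[1 if c >= ratio * channels else 0 for c in row] for row in counts]


# Three toy 2x2 channels; only the top-right location is active in most of them.
maps = [
    [[0, 1], [0, 1]],
    [[0, 1], [0, 0]],
    [[1, 1], [0, 0]],
]
mask = frequency_mask(maps)
```

Because the mask depends only on feature maps from an already trained classifier, a filter of this kind can be applied post hoc, which matches the abstract's claim that no re-training is needed.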

SESSION: Reproducibility Paper

Reproducibility Companion Paper: Human Object Interaction Detection via Multi-level Conditioned Network

Yunqing He
Xu Sun
Hui Jiang
Tongwei Ren
Gangshan Wu
Maria Sinziiana Astefanoaei
Andreas Leibetseder

To support the replication of "Human Object Interaction Detection via Multi-level Conditioned Network", which was presented at ICMR'20, this companion paper provides the details of the artifacts. Human Object Interaction Detection (HOID) aims to recognize fine-grained object-specific human actions, which demands the capabilities of both visual perception and reasoning. In this paper, we explain the file structure of the source code and publish the details of our experimental settings. We also provide a program for component analysis to assist other researchers with experiments on alternative models that are not included in our experiments. Moreover, we provide a demo program to facilitate the use of our model.

SESSION: Workshop Summaries

Introduction to the Fifth Annual Lifelog Search Challenge, LSC'22

Cathal Gurrin
Liting Zhou
Graham Healy
Björn Þór Jónsson
Duc-Tien Dang-Nguyen
Jakub Lokoć
Minh-Triet Tran
Wolfgang Hürst
Luca Rossetto
Klaus Schöffmann

For the fifth time since 2018, the Lifelog Search Challenge (LSC) facilitated a benchmarking exercise to compare interactive search systems designed for multimodal lifelogs. LSC'22 attracted nine participating research groups who developed interactive lifelog retrieval systems enabling fast and effective access to lifelogs. The systems competed in front of a hybrid audience at the LSC workshop at ACM ICMR'22. This paper presents an introduction to the LSC workshop, the new (larger) dataset used in the competition, and introduces the participating lifelog search systems.

MAD '22 Workshop: Multimedia AI against Disinformation

Bogdan Ionescu
Giorgos Kordopatis-Zilos
Adrian Popescu
Luca Cuccovillo
Symeon Papadopoulos

The verification of multimedia content posted online becomes increasingly challenging due to recent advancements in synthetic media manipulation and generation. Moreover, malicious actors can easily exploit AI technologies to spread disinformation across social media at a rapid pace, which poses very high risks for society and democracy. There is, therefore, an urgent need for AI-powered tools that facilitate the media verification process. The objective of the MAD '22 workshop is to bring together those who work on the broader topic of disinformation detection in multimedia in order to share their experiences and discuss their novel ideas, reaching out to people with different backgrounds and expertise. The research domains of interest vary from the detection of manipulated and synthetic content in multimedia to the analysis of the spread of disinformation and its impact on society. The MAD '22 workshop proceedings are available at: https://dl.acm.org/citation.cfm?id=3512732.

ICDAR'22: Intelligent Cross-Data Analysis and Retrieval

Minh-Son Dao
Michael Alexander Riegler
Duc-Tien Dang-Nguyen
Cathal Gurrin
Yuta Nakashima
Mianxiong Dong

We have witnessed the rise of cross-data against multimodal data problems recently. Some examples of this research direction are: a cross-modal retrieval system uses a textual query to look for images; the air quality index can be predicted using lifelogging images; congestion can be predicted using weather and tweet data; and daily exercise and meals can help to predict sleep quality. Although vast investigations focusing on multimodal data analytics have been developed, little cross-data (e.g., cross-modal, cross-domain, cross-platform) research has been carried out. In order to promote intelligent cross-data analytics and retrieval research and to bring a smart, sustainable society to human beings, the specific article collection on "Intelligent Cross-Data Analysis and Retrieval" is introduced. This Research Topic welcomes those who come from diverse research domains and disciplines such as well-being, disaster prevention and mitigation, mobility, climate change, tourism, healthcare, and food computing.

MMArt-ACM 2022: 5th Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia

Naoko Nitta
Anita Hu
Kensuke Tobitani

In addition to classical art types like paintings and sculptures, new types of artworks emerge following the advancement of deep learning, social platforms, media capturing devices, and media processing tools. Large volumes of machine-/user-generated content or professionally-edited content are shared and disseminated on the Web. Novel multimedia artworks, therefore, emerge rapidly in the era of social media and big data. The ever-increasing amount of illustrations/comics/animations on this platform gives rise to challenges of automatic classification, indexing, and retrieval that have been studied widely in other areas but not necessarily for this emerging type of artwork. In addition to objective entities like objects, events, and scenes, studies of cognitive properties emerge. Among various kinds of computational cognitive analyses, we focus on attractiveness analysis in this workshop. The topics of the accepted papers cover the affective analysis of texts, images, and music. The actual MMArt-ACM 2022 Proceedings are available at: https://dl.acm.org/citation.cfm?id=3512730.
