ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval

SESSION: Regular Long Papers

Integrative Multi-Modal Computing for Personal Health Navigation

  • Nitish Nag
  • Hyungik Oh
  • Mengfan Tang
  • Mingshu Shi
  • Ramesh Jain

An individual’s health trajectory is most influenced by personal lifestyle choices made regularly and frequently. It is now possible to perpetually measure, store, and analyze both multimodal lifestyle signals and multimodal physiological and behavioral health signals. Moreover, by collecting a variety of longitudinal data, it is possible to model precisely how lifestyle affects health signals. These actions provide the inputs that change the health state of an individual based on their lifestyle decisions. With the advent of relatively inexpensive and now-common multi-modal data streams and sensing technologies, individuals have large amounts of data and information about themselves that have the potential to transform day-to-day decision-making for health improvement. Multimodal analytics, together with contextual prediction and retrieval, allow this data to support the best decisions for keeping the health state optimal while making good lifestyle choices. This critical problem requires making a large variety of data constantly usable, contextually relevant, and, most importantly, useful for personalized decision-making.

To address this challenge, we implement a generalized Personal Health Navigation (PHN) framework. PHN takes individuals toward their personal health goals through a system that perpetually digests multi-modal data streams from diverse sources, estimates the current health status, computes the best route through intermediate states using personal models, and recommends the inputs that carry a user towards their goal. We show the effectiveness of this approach with two examples in cardiac health. First, we test a knowledge-infused cardiovascular PHN system in a pilot prospective experiment with 41 users. Second, we create a data-driven personalized model of cardiovascular exercise response variability on a smartwatch dataset of 33,269 real-world users. We conclude with critical challenges in multi-modal computing for PHN systems that require deep future investigation.

Raising User Awareness about the Consequences of Online Photo Sharing

  • Hugo Schindler
  • Adrian Popescu
  • Khoa Nguyen
  • Jerome Deshayes-Chossart

Online social networks use AI techniques to automatically infer profiles from users’ shared data. However, these inferences and their effects remain, to a large extent, opaque to the users themselves. We propose a method that raises user awareness about the potential use of their profiles in impactful situations, such as searching for a job or an accommodation. These situations illustrate usage contexts that users might not have anticipated when deciding to share their data. User photographic profiles are described by objects automatically detected in profile photos and by the ratings associated with these objects in each situation. Human ratings of the profiles per situation are also available for training. These data are represented as graph structures which are fed into graph neural networks in order to learn how to rate them automatically. An adaptation of the learning procedure per situation is proposed, since the same profile is likely to be interpreted differently depending on the context. Automatic profile ratings are compared to one another in order to inform individual users of their standing with respect to others. Our method is evaluated on a public dataset and consistently outperforms competitive baselines. An ablation study gives insights into the role of its main components.

Explaining Image Aesthetics Assessment: An Interactive Approach

  • Sven Schultze
  • Ani Withöft
  • Larbi Abdenebaoui
  • Susanne Boll

Assessing visual aesthetics is important for organizing and retrieving photos. That is one reason why several works aim to automate such an assessment using deep neural networks. The underlying models, however, lack explainability. Due to the subjective nature of aesthetics, it is challenging to find objective ground truths and explanations for aesthetics based on them. Hence, such models are prone to socio-cultural biases that come with the data, which raises questions on a wide range of ethical and technical issues. This paper presents an explainable artificial intelligence framework that adapts and combines three types of explanations for the concept of aesthetic assessment: 1) model constraints for built-in interpretability, 2) analysis of perturbation impacts on decisions, and 3) generation of artificial images that represent maxima or minima of values in the latent feature space. The objective is to improve human understanding through the explanations by creating an intuition for the model’s decision making. We identify issues that arise when humans interact with the explanations and derive requirements from human feedback to address the needs of different user groups. We evaluate our novel interactive explainable artificial intelligence technology in a study with end users (N=20). Our participants have different levels of experience in deep learning, allowing us to include experts, intermediate users, and laypersons. Our results show the benefits of the interactivity of our approach. All users found our system helpful in understanding how the aesthetic assessment was executed, reporting varying needs for explanatory details.

Explicit Knowledge Integration for Knowledge-Aware Visual Question Answering about Named Entities

  • Omar Adjali
  • Paul Grimal
  • Olivier Ferret
  • Sahar Ghannay
  • Hervé Le Borgne

Recent years have shown unprecedented growth of interest in vision-language tasks, with the need to address the inherent challenges of integrating linguistic and visual information to solve real-world applications. A typical such task is Visual Question Answering (VQA), which aims to answer questions about visual content. The limitations of the VQA task in terms of question redundancy and poor linguistic variability encouraged researchers to propose knowledge-aware Visual Question Answering tasks as a natural extension of VQA. In this paper, we tackle the KVQAE (Knowledge-based Visual Question Answering about named Entities) task, which proposes to answer questions about named entities defined in a knowledge base and grounded in visual content. In particular, besides the textual and visual information, we propose to leverage the structural information extracted from syntactic dependency trees and external knowledge graphs to help answer questions about a large spectrum of entities of various types. Thus, by combining contextual and graph-based representations using Graph Convolutional Networks (GCNs), we are able to learn meaningful embeddings for information retrieval tasks. Experiments on the ViQuAE public dataset show that our approach improves on state-of-the-art baselines while demonstrating the value of injecting external knowledge to enhance multimodal information retrieval.

Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation

  • Shuo Chen
  • Yingjun Du
  • Pascal Mettes
  • Cees G.M. Snoek

This paper investigates the problem of scene graph generation in videos with the aim of capturing semantic relations between subjects and objects in the form of ⟨subject, predicate, object⟩ triplets. Recognizing the predicate between subject and object pairs is imbalanced and multi-label in nature, ranging from ubiquitous interactions such as spatial relationships (e.g. in front of) to rare interactions such as twisting. In widely-used benchmarks such as Action Genome and VidOR, the imbalance ratio between the most and least frequent predicates reaches 3,218 and 3,408, respectively, surpassing even benchmarks specifically designed for long-tailed recognition. Due to the long-tailed distributions and label co-occurrences, recent state-of-the-art methods predominantly focus on the most frequently occurring predicate classes, ignoring those in the long tail. In this paper, we analyze the limitations of current approaches for scene graph generation in videos and identify a one-to-one correspondence between predicate frequency and recall performance. To take a step towards unbiased scene graph generation in videos, we introduce a multi-label meta-learning framework to deal with the biased predicate distribution. Our meta-learning framework learns a meta-weight network for each training sample over all possible label losses. We evaluate our approach on the Action Genome and VidOR benchmarks by building upon two current state-of-the-art methods for each benchmark. The experiments demonstrate that the multi-label meta-weight network improves the performance for predicates in the long tail without compromising performance for head classes, resulting in better overall performance and favorable generalizability. Code: https://github.com/shanshuo/ML-MWN.
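
The per-label weighting idea can be sketched in a few lines of plain Python. The sigmoid unit below is a hypothetical stand-in for the learned meta-weight network (which the paper meta-learns rather than hard-codes), and the names `bce`, `meta_weight`, and `weighted_multilabel_loss` are illustrative, not from the released code:

```python
import math

def bce(p, y):
    # binary cross-entropy for a single predicate label
    eps = 1e-7
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def meta_weight(loss):
    # stand-in for the meta-weight network: one sigmoid unit mapping a
    # per-label loss to a weight in (0, 1); the paper learns this mapping
    return 1.0 / (1.0 + math.exp(-2.0 * loss))

def weighted_multilabel_loss(probs, labels):
    # weight every label's loss individually, then average over labels
    losses = [bce(p, y) for p, y in zip(probs, labels)]
    return sum(meta_weight(l) * l for l in losses) / len(losses)
```

Under this fixed heuristic, badly predicted (often tail) labels incur larger losses and receive larger weights; the meta-learned network instead adapts these weights per training sample.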

Cross-View Sample-Enriched Graph Contrastive Learning Network for Personalized Micro-video Recommendation

  • Ying He
  • Gongqing Wu
  • Desheng Cai
  • Xuegang Hu

Micro-video recommendation has attracted extensive research attention with the increasing popularity of micro-video sharing platforms. Recently, graph contrastive learning (GCL) has been adopted to enhance the performance of graph neural network based micro-video recommendation. However, these GCL methods may suffer from the following problems: (1) they fail to fully exploit the potential of contrastive learning because they ignore or misjudge highly similar samples, and (2) the complementary recommendation effects between graph structure information and multi-modal feature information are not effectively utilized. In this paper, we propose a novel Cross-View Sample-Enriched Graph Contrastive Learning Network (CSGCL) for micro-video recommendation. Specifically, we build a collaborative learning view and a semantic learning view to learn node representations. For the collaborative learning view, we leverage similar nodes at the structure level to construct an effective collaborative contrastive objective. For the semantic learning view, we derive k-nearest neighbor graphs generated from multi-modal features as semantic graphs and build a semantic contrastive objective for learning high-quality micro-video representations. Finally, a cross-view contrastive objective is designed to account for the mutually complementary recommendation effects by maximizing the agreement between the two views above. Extensive experiments on three real-world datasets demonstrate that the proposed model outperforms the baselines.
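
At their core, contrastive objectives of this kind are InfoNCE-style losses over node embeddings from two views. A minimal sketch with cosine similarity and hand-picked negatives; the sample-enrichment strategy for choosing positives and negatives, which is the paper's actual contribution, is not shown:

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def infonce(anchor, positive, negatives, tau=0.2):
    # pull the cross-view positive close to the anchor, push negatives away
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss is small when the anchor agrees with its cross-view positive and disagrees with the negatives, which is exactly the "maximizing the agreement between the two views" step.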

Improving Image Encoders for General-Purpose Nearest Neighbor Search and Classification

  • Konstantin Schall
  • Kai Uwe Barthel
  • Nico Hezel
  • Klaus Jung

Recent advances in computer vision research have led to large vision foundation models that generalize to a broad range of image domains and perform exceptionally well in various image-based tasks. However, content-based image-to-image retrieval is often overlooked in this context. This paper investigates the effectiveness of different vision foundation models on two challenging nearest neighbor search-based tasks: zero-shot retrieval and k-NN classification. A benchmark for evaluating the performance of various vision encoders and their pre-training methods is established, where significant differences in the performance of these models are observed. Additionally, we propose a fine-tuning regime that improves zero-shot retrieval and k-NN classification through training with a combination of large publicly available datasets without specializing in any data domain. Our results show that the retrained vision encoders have a higher degree of generalization across different search-based tasks and can be used as general-purpose embedding models for image retrieval.
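
The k-NN classification protocol itself is straightforward once embeddings are extracted: rank the gallery by distance and majority-vote the top k. A minimal sketch over toy 2-D "embeddings" (in the benchmark these would be the vision encoder's output vectors):

```python
def knn_classify(query, gallery, labels, k=3):
    # rank gallery embeddings by squared Euclidean distance to the query,
    # then take a majority vote over the k nearest labels
    dists = sorted(
        (sum((q - g) ** 2 for q, g in zip(query, emb)), lbl)
        for emb, lbl in zip(gallery, labels)
    )
    votes = [lbl for _, lbl in dists[:k]]
    return max(set(votes), key=votes.count)
```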

Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining

  • Giacomo Nebbia
  • Adriana Kovashka

Named entities are ubiquitous in text that naturally accompanies images, especially in domains such as news or Wikipedia articles. In previous work, named entities have been identified as a likely reason for low performance of image-text retrieval models pretrained on Wikipedia and evaluated on named entities-free benchmark datasets. Because they are rarely mentioned, named entities could be challenging to model. They also represent missed learning opportunities for self-supervised models: the link between named entity and object in the image may be missed by the model, but it would not be if the object were mentioned using a more common term. In this work, we investigate hypernymization as a way to deal with named entities for pretraining grounding-based multi-modal models and for fine-tuning on open-vocabulary detection. We propose two ways to perform hypernymization: (1) a “manual” pipeline relying on a comprehensive ontology of concepts, and (2) a “learned” approach where we train a language model to learn to perform hypernymization. We run experiments on data from Wikipedia and from The New York Times. We report improved pretraining performance on objects of interest following hypernymization, and we show the promise of hypernymization on open-vocabulary detection, specifically on classes not seen during training.
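
The "manual" pipeline can be pictured as an ontology lookup that rewrites entity mentions into more common terms. The toy dictionary below is an invented stand-in for the comprehensive ontology the paper relies on, and the entity strings are illustrative examples only:

```python
# toy ontology: named entity -> more common hypernym (illustrative entries)
HYPERNYMS = {
    "Eiffel Tower": "tower",
    "Barack Obama": "man",
    "Hudson River": "river",
}

def hypernymize(caption):
    # replace each known named entity in the caption with its hypernym,
    # so a grounding model can link the object to a frequent term
    for entity, hypernym in HYPERNYMS.items():
        caption = caption.replace(entity, hypernym)
    return caption
```

The "learned" variant replaces this lookup with a language model trained to produce the same kind of rewrite.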

CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval

  • Yizhao Gao
  • Zhiwu Lu

Video-text retrieval has drawn great attention due to the prosperity of online video content. Most existing methods extract video embeddings by densely sampling abundant (generally dozens of) video clips, which incurs tremendous computational cost. To reduce resource consumption, recent works propose to sparsely sample fewer clips from each raw video within a narrow time span. However, they still struggle to learn a reliable video representation with such locally sampled video clips, especially when testing in the cross-dataset setting. In this work, to overcome this problem, we sparsely and globally (with a wide time span) sample a handful of video clips from each raw video, which can be regarded as different samples of a pseudo video class (i.e., each raw video denotes a pseudo video class). From this viewpoint, we propose a novel Cross-Modal Meta-Transformer (CMMT) model that can be trained in a meta-learning paradigm. Concretely, in each training step, we conduct a cross-modal fine-grained classification task where the text queries are classified with pseudo video class prototypes (each of which aggregates all sampled video clips of its pseudo video class). Since each classification task is defined with different/new videos (simulating the evaluation setting), this task-based meta-learning process enables our model to generalize well to new tasks and thus learn generalizable video/text representations. To further enhance the generalizability of our model, we introduce a token-aware adaptive Transformer module to dynamically update our model (prototypes) for each individual text query. Extensive experiments on three benchmarks show that our model achieves new state-of-the-art results in cross-dataset video-text retrieval, demonstrating its stronger generalizability. Importantly, we find that our new meta-learning paradigm indeed brings improvements under both cross-dataset and in-dataset retrieval settings.
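
Stripped of the Transformer machinery, the pseudo-class construction reduces to averaging clip embeddings into a prototype and assigning each text query to its nearest prototype. A minimal sketch under that simplification (the paper's encoders and token-aware adaptation are omitted):

```python
def prototype(clip_embs):
    # pseudo-class prototype: mean of the sampled clip embeddings of one video
    n = len(clip_embs)
    return [sum(e[d] for e in clip_embs) / n for d in range(len(clip_embs[0]))]

def classify_query(query, prototypes):
    # assign the text-query embedding to the nearest video prototype
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(prototypes)), key=lambda i: d2(query, prototypes[i]))
```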

Dual-Modality Co-Learning for Unveiling Deepfake in Spatio-Temporal Space

  • Jiazhi Guan
  • Hang Zhou
  • Zhizhi Guo
  • Tianshu Hu
  • Lirui Deng
  • Chengbin Quan
  • Meng Fang
  • Youjian Zhao

The emergence of photo-realistic deepfakes on a large scale has become a significant societal concern, which has garnered considerable attention from the research community. Several recent studies have identified the critical issue of “temporal inconsistency” resulting from the frame reassembling process of deepfake generation techniques. However, due to the lack of task-specific design, the spatio-temporal modeling of current methods remains insufficient in three critical aspects: 1) inapparent temporal changes are prone to be undermined compared to abundant spatial cues; 2) minor inconsistent regions are often concealed by motions with greater amplitude during downsampling; 3) capturing both transient inconsistencies and persistent motions simultaneously remains a significant challenge. In this paper, we propose a novel Dual-Modality Co-Learning framework tailored to these characteristics, which achieves more effective deepfake detection with complementary information from the RGB and optical flow modalities. In particular, we design a Multi-Scale Motion Regularization module to encourage the network to equally prioritize both significant spatial cues and subtle temporal facial motion cues. Additionally, we develop a Multi-Span Cross-Attention module to effectively integrate the information from both RGB and optical flow modalities and improve detection accuracy with multi-span predictions. Extensive experiments validate the effectiveness of our ideas and demonstrate the superior performance of our approach.

A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset

  • Jiaxin Deng
  • Dong Shen
  • Haojie Pan
  • Xiangyu Wu
  • Ximan Liu
  • Gaofeng Meng
  • Fan Yang
  • Tingting Gao
  • Ruiji Fu
  • Zhongyuan Wang

Video understanding is an important task on short video business platforms, with wide application in video recommendation and classification. Most existing video understanding works only focus on the information that appears within the video content, including the video frames, audio, and text. However, introducing common sense knowledge from an external Knowledge Graph (KG) dataset is essential for video understanding when referring to content that is less relevant to the video itself. Owing to the lack of video knowledge graph datasets, work that integrates video understanding and KGs is rare. In this paper, we propose a heterogeneous dataset that contains multi-modal video entities and rich common sense relations. This dataset also provides multiple novel video inference tasks, such as the Video-Relation-Tag (VRT) and Video-Relation-Video (VRV) tasks. Furthermore, based on this dataset, we propose an end-to-end model that jointly optimizes the video understanding objective with knowledge graph embedding, which can not only better inject factual knowledge into video understanding but also generate effective multi-modal entity embeddings for the KG. Comprehensive experiments indicate that combining video understanding embedding with factual knowledge benefits content-based video retrieval performance. Moreover, it also helps the model generate better knowledge graph embeddings, which outperform traditional KGE-based methods on the VRT and VRV tasks with at least 42.36% and 17.73% improvement in HITS@10, respectively.
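
HITS@10, the metric reported above, is the standard link-prediction measure: the fraction of test queries for which the correct entity is ranked within the top 10 candidates. A minimal sketch:

```python
def hits_at_k(ranks, k=10):
    # ranks[i] is the rank of the correct entity for test query i (1 = best);
    # the metric is the fraction of queries ranked within the top k
    return sum(1 for r in ranks if r <= k) / len(ranks)
```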

Edge Enhanced Image Style Transfer via Transformers

  • Chiyu Zhang
  • Zaiyan Dai
  • Peng Cao
  • Jun Yang

In recent years, arbitrary image style transfer has attracted more and more attention. Given a pair of content and style images, the goal is to produce a stylized image that retains the content of the former while capturing the style patterns of the latter. However, it is difficult to maintain the trade-off between content details and style features: to stylize the image with sufficient style patterns, the content details may be damaged, and sometimes the objects in the images cannot be distinguished clearly. For this reason, we present a new transformer-based method named STT (Style Transfer via Transformers) for image style transfer, together with an edge loss function that enhances the content details and avoids generating blurred results due to excessive rendering of style features. Extensive qualitative and quantitative experiments demonstrate that STT achieves performance comparable to state-of-the-art image style transfer approaches while alleviating the content leak problem.
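
An edge loss of this kind is typically built by comparing edge maps of the content and stylized images. The simple gradient-magnitude detector below is a generic sketch on 2-D intensity grids, not necessarily the edge operator used in STT:

```python
def edge_map(img):
    # horizontal + vertical intensity differences as a crude edge detector
    h, w = len(img), len(img[0])
    return [
        [abs(img[y][x + 1] - img[y][x]) + abs(img[y + 1][x] - img[y][x])
         for x in range(w - 1)]
        for y in range(h - 1)
    ]

def edge_loss(content, stylized):
    # L1 distance between the edge maps of the content and stylized images;
    # a blurred stylization loses edges and is penalized
    ec, es = edge_map(content), edge_map(stylized)
    return sum(abs(a - b) for rc, rs in zip(ec, es) for a, b in zip(rc, rs))
```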

Unlocking Potential of 3D-aware GAN for More Expressive Face Generation

  • Juheon Hwang
  • Jiwoo Kang
  • Kyoungoh Lee
  • Sanghoon Lee

As style-based image generators have achieved feature disentanglement by converting the latent vector space to a style vector space, numerous efforts have been made to enhance the controllability of the latent space. However, existing controllable models have limitations in precisely creating high-resolution faces with large expressions. The degradation is due to the dependence on the training dataset, as high-resolution face datasets do not have sufficiently expressive images. To tackle this challenge, we propose a robust training framework for 3D-aware generative adversarial networks to learn the high-quality generation of more expressive faces through a signed distance field. First, we propose a novel 3D enforcement loss to generate more expressive images in an unsupervised manner. Second, we introduce a partial training method to fine-tune the network on multiple datasets without loss of image resolution. Finally, we propose a ray-scaling scheme for the volume renderer to represent a face at arbitrary scales. Through the proposed framework, the network learns 3D face priors, such as the expressional shapes of the parametric facial model, to generate detailed faces. Experimental results show that our method outperforms the state of the art, with strong benefits in the generation of high-resolution facial expressions.

RIP-NeRF: Learning Rotation-Invariant Point-based Neural Radiance Field for Fine-grained Editing and Compositing

  • Yuze Wang
  • Junyi Wang
  • Yansong Qu
  • Yue Qi

Neural Radiance Field (NeRF) shows dramatic results in synthesising novel views. However, existing controllable and editable NeRF methods are still incapable of both fine-grained editing and cross-scene compositing, greatly limiting their creative editing as well as potential applications. When the radiance field is edited at a fine granularity and composited, a severe drawback is that varying the orientation of the corresponding explicit scaffold, such as a point, mesh, or volume, may degrade rendering quality. In this work, by taking the respective strengths of the implicit NeRF-based representation and the explicit point-based representation, we present a novel Rotation-Invariant Point-based NeRF (RIP-NeRF) for both fine-grained editing and cross-scene compositing of the radiance field. Specifically, we introduce a novel point-based radiance field representation to replace the Cartesian coordinate as the network input. This rotation-invariant representation is achieved by carefully designing a Neural Inverse Distance Weighting Interpolation (NIDWI) module to aggregate neural points, significantly improving the rendering quality for fine-grained editing. To achieve cross-scene compositing, we disentangle the rendering module from the neural point-based representation in NeRF. After simply manipulating the corresponding neural points, a cross-scene neural rendering module is applied to achieve controllable cross-scene compositing without retraining. The advantages of RIP-NeRF in editing quality and capability are demonstrated by extensive editing and compositing experiments on room-scale real scenes and synthetic objects with complex geometry.
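
Classic inverse distance weighting (Shepard interpolation), of which NIDWI is a learned, neural variant, illustrates the aggregation step: nearby neural points dominate the blended feature. A minimal sketch with scalar point values standing in for neural point features:

```python
def inverse_distance_weighting(query, points, values, power=2.0, eps=1e-8):
    # blend per-point values with weights proportional to 1 / distance^power,
    # so points closest to the query sample dominate the result
    weights = []
    for p in points:
        d2 = sum((q - a) ** 2 for q, a in zip(query, p))
        weights.append(1.0 / (d2 ** (power / 2) + eps))
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total
```

Because the weights depend only on distances, the aggregation is unchanged when the whole point set is rotated, which is the rotation-invariance property the representation targets.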

A Multi-Teacher Assisted Knowledge Distillation Approach for Enhanced Face Image Authentication

  • Tiancong Cheng
  • Ying Zhang
  • Yifang Yin
  • Roger Zimmermann
  • Zhiwen Yu
  • Bin Guo

Recent deep-learning-based face recognition systems have achieved significant success. However, most existing face recognition systems are vulnerable to spoofing attacks, where a copy of the face image is used to deceive the authentication. A number of solutions overcome this problem by building a separate face anti-spoofing model, which, however, brings additional storage and computation requirements. Since both the recognition and face anti-spoofing tasks stem from the analysis of the same face image, this paper explores a unified approach to reduce the original dual-model redundancy. To this end, we introduce a compressed multi-task model that simultaneously performs both tasks in a lightweight manner, which has the potential to benefit lightweight IoT applications. Concretely, we regard the original two single-task deep models as teacher networks and propose a novel multi-teacher-assisted knowledge distillation method to guide our lightweight multi-task model to achieve satisfactory performance on both tasks. Additionally, to reduce the large gap between the deep teachers and the light student, a comprehensive feature alignment is further integrated by distilling multi-layer features. Extensive experiments are carried out on two benchmark datasets, where we achieve a task accuracy of 93% while reducing the model size by 97% and the inference time by 56% compared to the original dual-model setup.
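
Multi-teacher distillation of this kind commonly reduces to a weighted sum of divergences between each teacher's softened predictions and the student's. A generic sketch (the temperature, the teacher weights, and the paper's multi-layer feature alignment are all simplified away; this is not the authors' exact loss):

```python
import math

def softmax(logits, t=1.0):
    # temperature-softened distribution over classes
    exps = [math.exp(z / t) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL divergence between two discrete distributions
    eps = 1e-9
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def multi_teacher_kd_loss(student_logits, teacher_logits_list, t=2.0):
    # pull the student's softened distribution toward each teacher's,
    # averaging the divergences over teachers
    s = softmax(student_logits, t)
    n = len(teacher_logits_list)
    return sum(kl(softmax(tl, t), s) for tl in teacher_logits_list) / n
```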

FaceLivePlus: A Unified System for Face Liveness Detection and Face Verification

  • Ying Zhang
  • Lilei Zheng
  • Vrizlynn L.L. Thing
  • Roger Zimmermann
  • Bin Guo
  • Zhiwen Yu

Face verification is a trending way to verify someone’s identity in broad applications. But such systems are vulnerable to face spoofing attacks via, for example, a fraudulent copy of a photo, making it necessary to include face liveness detection as an additional safeguard. In most existing studies, face liveness detection is realized in a separate machine learning model in addition to the model for face verification. Such a two-model configuration may face challenges when deployed onto platforms with limited computation power and storage (e.g., mobile phones, IoT devices), especially considering that each model may have millions of parameters. Inspired by the fact that humans can verify a person’s identity and liveness at a single glance from a face, we develop a novel system, named FaceLivePlus, to learn a single and universal face descriptor for the two tasks (face verification and liveness detection) so that the computational workload and storage space can be halved. To achieve this, we formulate the underlying relationship between the two tasks and seamlessly embed this relationship in a distance-ranking deep model. The model works directly on features rather than classification labels, which makes the system generalize well on unseen data. Extensive experiments show that our average half total error rate (HTER) improves on the state of the art by at least 15% and 8% on two benchmark datasets. We anticipate this approach could become a new direction for face authentication.

SIGMA-DF: Single-Side Guided Meta-Learning for Deepfake Detection

  • Bing Han
  • Jianshu Li
  • Wenqi Ren
  • Man Luo
  • Jian Liu
  • Xiaochun Cao

The current challenge in Deepfake detection is cross-domain performance on unseen Deepfake data. Instead of extracting forgery artifacts that are robust to cross-domain scenarios, as most previous works do, we propose a novel method named Single-sIde Guided Meta-leArning framework for DeepFake detection (SIGMA-DF), which simulates cross-domain scenarios during training by synthesizing a virtual testing domain through meta-learning. In addition, SIGMA-DF integrates the meta-learning algorithm with a new ensemble meta-learning framework, which separately trains multiple meta-learners in the meta-train phase to aggregate multiple domain shifts in each iteration. Hence multiple cross-domain scenarios are simulated, better leveraging the domain knowledge. Moreover, considering the contribution of hard samples to single-side distribution optimization, a novel weighted single-side loss function is proposed to narrow only the intra-class distance between real faces and enlarge the inter-class distance between real and fake faces in the embedding space, with awareness of sample weights. Extensive experiments are conducted on several standard Deepfake detection datasets to demonstrate that the proposed SIGMA-DF achieves state-of-the-art performance. In particular, in the cross-domain evaluation from FF++ to Celeb-DF and DFDC, SIGMA-DF outperforms the baselines by 4.4% and 4.5% in terms of AUC, respectively.
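
The single-side idea can be sketched generically: compact only the real class around a center while pushing fakes beyond a margin, with per-sample weights letting hard examples contribute more. The margin value, distance choice, and weighting scheme below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def dist(u, v):
    # Euclidean distance between two embeddings
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_single_side_loss(reals, fakes, center, margin=2.0,
                              real_w=None, fake_w=None):
    # single-side: pull only real embeddings toward the real-class center,
    # and push fake embeddings at least `margin` away from that center;
    # per-sample weights emphasize hard samples
    real_w = real_w if real_w is not None else [1.0] * len(reals)
    fake_w = fake_w if fake_w is not None else [1.0] * len(fakes)
    pull = sum(w * dist(r, center) for w, r in zip(real_w, reals))
    push = sum(w * max(0.0, margin - dist(f, center))
               for w, f in zip(fake_w, fakes))
    return pull + push
```

Note that fakes are never pulled anywhere: only the real class is made compact, which is what "single-side" refers to.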

AVForensics: Audio-driven Deepfake Video Detection with Masking Strategy in Self-supervision

  • Yizhe Zhu
  • Jialin Gao
  • Xi Zhou

Existing cross-dataset deepfake detection approaches exploit mouth-related mismatches between the auditory and visual modalities in fake videos to enhance generalisation to unseen forgeries. However, such methods inevitably suffer performance degradation when mouth motions are limited or unaltered. We argue that face forgery detection consistently benefits from using high-level cues across the whole face region. In this paper, we propose a two-phase audio-driven multi-modal transformer-based framework, termed AVForensics, to perform deepfake video content detection from an audio-visual matching view related to the full face. In the first, pre-training phase, we apply a novel uniform masking strategy to model global facial features and learn temporally dense video representations in a self-supervised cross-modal manner, capturing the natural correspondence between the visual and auditory modalities without requiring large-scale labelled data or heavy memory usage. We then use these learned representations to fine-tune for the downstream deepfake detection task in the second phase, which encourages the model to offer accurate predictions based on the captured global facial movement features. Extensive experiments and visualizations on various public datasets demonstrate the superiority of our self-supervised pre-trained method for achieving generalisable and robust deepfake video detection.

Predicting Tweet Engagement with Graph Neural Networks

  • Marco Arazzi
  • Marco Cotogni
  • Antonino Nocera
  • Luca Virgili

Social Networks represent one of the most important online sources to share content across a world-scale audience. In this context, predicting whether a post will have any impact in terms of engagement is of crucial importance to drive the profitable exploitation of these media. In the literature, several studies address this issue by leveraging direct features of the posts, typically related to the textual content and the user publishing it. In this paper, we argue that the rise of engagement is also related to another key component, which is the semantic connection among posts published by users in social media. Hence, we propose TweetGage, a Graph Neural Network solution to predict the user engagement based on a novel graph-based model that represents the relationships among posts. To validate our proposal, we focus on the Twitter platform and perform a thorough experimental campaign providing evidence of its quality.

A Recurrent Neural Network based Generative Adversarial Network for Long Multivariate Time Series Forecasting

  • Peiwang Tang
  • Qinghua Zhang
  • Xianchao Zhang

Some multimedia data from real life, such as community-contributed social data or sensor data, can be collected as multivariate time series. Many methods have been proposed for multivariate time series forecasting. In light of its importance in wide-ranging applications, including traffic and electric power forecasting, the appearance of the Transformer model has rapidly revolutionized various architectural design efforts. In Transformers, self-attention is used to achieve state-of-the-art prediction, and it has recently been further studied for time series modeling in the frequency domain. These related works show that self-attention mechanisms can reach satisfactory performance in both the time and frequency domains, but we use recurrent neural networks (RNNs) to verify that self-attention is neither critical nor necessary. The correlation structure of RNNs carries a time-series-specific inductive bias, but there are still shortcomings in long multivariate time series forecasting. To break the forecasting bottleneck of traditional RNN architectures, we introduce RNNGAN, a novel and competitive RNN-based architecture combining the generation capability of Generative Adversarial Networks (GANs) with the forecasting power of RNNs. Unlike the Transformer, RNNGAN uses long short-term memory (LSTM) instead of self-attention layers to model long-range dependencies. Experiments show that, compared with state-of-the-art models, RNNGAN obtains competitive scores in many benchmark tests when training on multivariate time series datasets from many different fields.
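
The LSTM recurrence that RNNGAN relies on in place of self-attention can be written out for scalar inputs. This is the textbook cell with gate weights passed in a dict, not the RNNGAN generator or discriminator:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    # one LSTM cell step on scalars; W maps each gate to (input weight,
    # hidden weight, bias); the cell state c carries long-range information
    i = sigmoid(W["i"][0] * x + W["i"][1] * h + W["i"][2])    # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h + W["f"][2])    # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h + W["g"][2])  # candidate cell
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new
```

Iterating `lstm_step` over a sequence and reading predictions off the hidden state is the recurrent alternative to attending over all past positions at once.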

Multi-channel Convolutional Neural Network for Precise Meme Classification

  • Victoria Sherratt
  • Kevin Pimbblet
  • Nina Dethlefs

This paper proposes a multi-channel convolutional neural network (MC-CNN) for classifying memes and non-memes. Our architecture is trained and validated on a challenging dataset that includes non-meme formats with textual attributes, which also circulate online but are rarely accounted for in meme classification tasks. Alongside a transfer learning base, two additional channels capture the low-level and fundamental features that make memes distinct from other images with text. We contribute an approach that outperforms previous meme classifiers in live data evaluation and generalises better ‘in the wild’. Our research aims to improve the accurate collation of meme content to support continued research in meme content analysis and meme-related sub-tasks such as harmful content detection.

Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis

  • Yankun Wu
  • Yuta Nakashima
  • Noa Garcia

The duality of content and style is inherent to the nature of art. For humans, these two elements are clearly different: content refers to the objects and concepts in a piece of art, and style to the way they are expressed. This duality poses an important challenge for computer vision. The visual appearance of objects and concepts is modulated by the style, which may reflect the author’s emotions, social trends, an artistic movement, etc., and their deep comprehension undoubtedly requires handling both. A promising step towards a general paradigm for art analysis is to disentangle content and style, yet relying on human annotations to isolate a single aspect of artworks has limitations for learning the semantic concepts and the visual appearance of paintings. We thus present GOYA, a method that distills the artistic knowledge captured in a recent generative model to disentangle content and style. Experiments show that synthetically generated images serve as a sufficient proxy for the real distribution of artworks, allowing GOYA to represent the two elements of art separately while keeping more information than existing methods.

Attention-based Video Virtual Try-On

  • Wen-Jiin Tsai
  • Yi-Cheng Tien

This paper presents a parsing-free video virtual try-on model based on appearance flow warping. In this model, we utilize attention methods from the Transformer [15] and propose three attention-based modules: a Person-Cloth Transformer, a Self-Attention Generator, and a Cloth Refinement Transformer. The Person-Cloth Transformer enables clothing features to refer to person information, which benefits style vector calculation and also improves the style warping process to estimate better appearance flows. The Self-Attention Generator applies a self-attention mechanism at the deepest feature layer, enabling the feature map to learn global context from all other pixels and helping it synthesize more realistic results. The Cloth Refinement Transformer utilizes two cross-attention modules: one lets the current warped clothes refer to previously warped clothes to ensure temporal consistency, and the other lets the current warped clothes refer to person information to ensure spatial alignment. Our ablation study shows that each proposed module contributes to the improvement of the results. Experimental results show that our model generates realistic, high-quality try-on videos and performs better than existing methods.
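Cross-attention of the kind described here, where queries from one stream attend to keys and values from another (e.g. person features attending to clothing features), can be sketched generically (illustrative NumPy, not the authors’ implementation; all shapes are made up):

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention: queries from one modality
    attend to keys/values from another."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # (n_q, n_k) similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)       # softmax over the keys
    return w @ V                                # (n_q, d_v) attended features

rng = np.random.default_rng(0)
person = rng.normal(size=(4, 8))   # hypothetical person-feature tokens
cloth = rng.normal(size=(6, 8))    # hypothetical clothing tokens
out = cross_attention(person, cloth, cloth)
print(out.shape)  # (4, 8)
```

Self-attention is the special case where Q, K and V all come from the same token set.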

Intra-inter Modal Attention Blocks for RGB-D Semantic Segmentation

  • Soyun Choi
  • Youjia Zhang
  • Sungeun Hong

In this paper, we introduce a novel approach to address the challenge of effectively utilizing both RGB and depth information for semantic segmentation. Our approach, Intra-inter Modal Attention (IMA) blocks, considers both intra-modal and inter-modal aspects of the information to produce better results than prior methods which primarily focused on inter-modal relationships. The IMA blocks consist of a cross-modal non-local module and an adaptive channel-wise fusion module. The cross-modal non-local module captures both intra-modal and inter-modal variations at the spatial level through inter-modality parameter sharing, while the adaptive channel-wise fusion module refines the spatially-correlated features. Experimental results on RGB-D benchmark datasets demonstrate consistent performance improvements over various baseline segmentation networks when using the IMA blocks. Our in-depth analysis provides comprehensive results on the impact of intra-, inter-, and intra-inter modal attention on RGB-D segmentation.

Joint Geometric-Semantic Driven Character Line Drawing Generation

  • Cheng-Yu Fang
  • Xian-Feng Han

Character line drawing synthesis can be formulated as a special case of the image-to-image translation problem that automates the photo-to-line-drawing style transformation. In this paper, we present the first generative adversarial network-based, end-to-end trainable translation architecture, dubbed P2LDGAN, for the automatic generation of high-quality character drawings from input photos/images. The core component of our approach is the joint geometric-semantic driven generator, which uses our well-designed cross-scale dense skip connection framework to embed learned geometric and semantic information for generating delicate line drawings. To support the evaluation of our model, we release a new dataset of 1,532 well-matched pairs of freehand character line drawings and the corresponding character images/photos, where the line drawings, in diverse styles, are manually drawn by skilled artists. Extensive experiments on our dataset demonstrate the superior performance of our proposed model against state-of-the-art approaches in terms of quantitative, qualitative and human evaluations. Our code, models and dataset will be available on GitHub.

CurveSDF: Binary Image Vectorization Using Signed Distance Fields

  • Zeqing Xia
  • Zhouhui Lian

Binary image vectorization is a classical and fundamental problem in the areas of Computer Graphics and Computer Vision. Existing image vectorization methods are mainly based on global optimization, typically failing to preserve important details on outlines due to the incapability of learning high-level knowledge from training data. To address this problem, we propose CurveSDF to facilitate the learning of vectorization for 2D outlines. Our method consists of the following three modules: convex separation, signed distance field (SDF) generation, and curve intersection calculation. Specifically, we first divide an input binary image into convex elements. Then, we use restrained curve-hyperplane divisions to generate their SDFs and precisely reconstruct the original image. Finally, we convert the generated SDFs to vector outlines composed of both Bezier curves and line segments. Moreover, our method is self-constrained, and thus there is no need to use any vector data for training. Experimental results demonstrate the effectiveness of our method and its superiority against other existing approaches for binary image vectorization.
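A signed distance field assigns every pixel its distance to the shape boundary, negative inside and positive outside, so the zero level set recovers the outline exactly. A toy illustration with an analytic circle (not the learned, curve-hyperplane SDFs of CurveSDF):

```python
import numpy as np

def circle_sdf(h, w, cx, cy, r):
    """Signed distance to a circle of radius r centred at (cx, cy):
    negative inside the disc, zero on the outline, positive outside."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.hypot(xs - cx, ys - cy) - r

sdf = circle_sdf(64, 64, 32, 32, 20)
binary = sdf <= 0                 # reconstruct the binary shape from the sign
print(binary.sum())               # pixel area of the disc, ~ pi * 20**2
```

CurveSDF’s convex elements play the same role as this analytic primitive: each one contributes an SDF whose sign reconstructs part of the input image.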

EMP: Emotion-guided Multi-modal Fusion and Contrastive Learning for Personality Traits Recognition

  • Yusong Wang
  • Dongyuan Li
  • Kotaro Funakoshi
  • Manabu Okumura

Multi-modal personality traits recognition aims to recognize personality traits precisely by utilizing information from different modalities, and has received increasing attention for its potential applications in human-computer interaction. Current methods largely fail to extract distinguishable features, remove noise, and align features from different modalities, which dramatically affects the accuracy of personality traits recognition. To deal with these issues, we propose an emotion-guided multi-modal fusion and contrastive learning framework for personality traits recognition. Specifically, we first use supervised contrastive learning to extract deeper and more distinguishable features from different modalities. After that, considering the close correlation between emotions and personalities, we use an emotion-guided multi-modal fusion mechanism to guide the feature fusion, which eliminates noise and aligns the features from different modalities. Finally, we use an auto-fusion structure to enhance the interaction between modalities and further extract essential features for the final personality traits recognition. Extensive experiments on two benchmark datasets indicate that our method achieves state-of-the-art performance and robustness.
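The supervised contrastive step described above pulls together samples that share a label and pushes apart the rest. A minimal NumPy version of such a loss (a generic Khosla-style formulation with made-up features, not the authors’ code):

```python
import numpy as np

def sup_con_loss(feats, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, positives are all
    other samples with the same label; all remaining samples are negatives."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize
    sim = f @ f.T / tau                                       # cosine / temperature
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        denom = sum(np.exp(sim[i, k]) for k in range(n) if k != i)
        for j in pos:
            loss += -np.log(np.exp(sim[i, j]) / denom)        # pull positives up
            count += 1
    return loss / count

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16))                 # hypothetical modality features
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])      # hypothetical trait labels
print(sup_con_loss(x, y))
```

Minimizing this loss makes same-label features cluster on the unit sphere, which is what yields the “more distinguishable” representations the abstract refers to.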

Knowledge-Aware Causal Inference Network for Visual Dialog

  • Zefan Zhang
  • Yi Ji
  • Chunping Liu

Effective knowledge and interaction across modalities are key to Visual Dialog. The classic graph-based framework, with a direct connection between the history dialog and the answer, fails to give the right answer because of the spurious guidance and strong bias induced by the history dialog. A recent causal inference framework that removes this direct connection improves generalization but at the cost of accuracy. In this work, we propose a novel Knowledge-Aware Causal Inference framework (KACI-Net), which introduces commonsense knowledge into the causal inference framework to achieve both high accuracy and good generalization. Specifically, commonsense knowledge is first generated from the entities extracted from the question and then fused with language and visual features via co-attention to produce the final answer. Comparisons with a knowledge-unaware framework and a graph-based knowledge-aware framework on the VisDial v1.0 dataset show the superiority of our proposed framework and verify the effectiveness of commonsense knowledge for sound reasoning in Visual Dialog. High NDCG and MRR metrics together indicate a good trade-off between accuracy and generalization.

Less is More: Decoupled High-Semantic Encoding for Action Recognition

  • Chun Zhang
  • Keyan Ren
  • Qingyun Bian
  • Yu Shi

This paper focuses on how to improve the efficiency of the action recognition framework by optimizing its complicated feature extraction pipelines and enhancing explainability, benefiting future adaptation to more complex visual understanding tasks (e.g. video captioning). To achieve this task, we propose a novel decoupled two-stream framework for action recognition - HSAR, which utilizes high-semantic features for increased efficiency and provides well-founded explanations in terms of spatial-temporal perceptions that will benefit further expansions on visual understanding tasks. The inputs are decoupled into spatial and temporal streams with designated encoders aiming to extract only the pinnacle of representations, gaining high-semantic features while reducing computation costs greatly. A lightweight Temporal Motion Transformer (TMT) module is proposed for globally modeling temporal features through self-attention, omitting redundant spatial features. Decoupled spatial-temporal embeddings are further merged dynamically by an attention fusion model to form a joint high-semantic representation. The visualization of the attention in each module offers intuitive interpretations of HSAR’s explainability. Extensive experiments on three widely-used benchmarks (Kinetics400, 600, and Sthv2) show that our framework achieves high prediction accuracy with significantly reduced computation (only 64.07 GFLOPs per clip), offering a great trade-off between accuracy and computational costs.

Dual-Stream Multimodal Learning for Topic-Adaptive Video Highlight Detection

  • Ziwei Xiong
  • Han Wang

This paper targets topic-adaptive video highlight detection, aiming to identify the moments in a video described by arbitrary text inputs. The fundamental challenge is the availability of annotated training data: it is costly to scale up the number of topic-level categories, which requires manually identifying and labeling the corresponding highlights. To overcome this challenge, our method provides a new perspective on highlight detection by attaching importance to the semantic information of the topic text rather than simply classifying whether a snippet is a highlight. Specifically, we decompose a topic into a set of key concepts and utilize the remarkable ability of visual-language pre-trained models to learn knowledge from both videos and semantic language. With the merits of this reformulation, the highlight detection task can be modeled as a snippet-text matching problem within a dual-stream multimodal learning framework, which strengthens the video representation with semantic language supervision and enables our model to accomplish open-set topic-adaptive highlight detection without any further labeled data. Our empirical evaluation shows the effectiveness of our method on several publicly available datasets, where the proposed method outperforms competitive baselines and achieves a new state of the art for topic-adaptive highlight detection. Further, when transferring our pre-trained model to the open-set video highlight detection task, our method outperforms prior supervised work by a substantial margin.

TDEC: Deep Embedded Image Clustering with Transformer and Distribution Information

  • Ruilin Zhang
  • Haiyang Zheng
  • Hongpeng Wang

Image clustering is a crucial but challenging task in multimedia machine learning. Recently, combining clustering with deep learning has achieved promising performance over conventional methods on high-dimensional image data. Unfortunately, existing deep clustering (DC) methods often ignore the importance of information fusion with a global perception field among different image regions when clustering images, especially complex ones. Additionally, the learned features are usually not clustering-friendly in terms of dimensionality and rely only on simple distance information for clustering. In this regard, we propose TDEC, a deep embedded image clustering method which, for the first time to our knowledge, jointly considers feature representation, dimensional preference, and robust assignment for image clustering. Specifically, we introduce the Transformer to form a novel T-Encoder module that learns discriminative features with global dependency, while a Dim-Reduction block builds a clustering-friendly low-dimensional space. Moreover, the distribution information of embedded features is considered in the clustering process to provide reliable supervised signals for joint training. Our method is robust and allows more flexibility in data size, number of clusters, and context complexity. More importantly, the clustering performance of TDEC is much higher than that of recent competitors. Extensive experiments against state-of-the-art approaches on complex datasets demonstrate the superiority of TDEC.

MMSF: A Multimodal Sentiment-Fused Method to Recognize Video Speaking Style

  • Beibei Zhang
  • Yaqun Fang
  • Fan Yu
  • Jia Bei
  • Tongwei Ren

As talking occupies a large share of human life, a deeper understanding of human conversations is needed. Speaking style recognition aims to recognize the style of a conversation, providing a fine-grained description of talking. Current works adopt only visual clues to recognize speaking styles, which cannot accurately distinguish speaking styles that are visually similar. To recognize speaking styles more effectively, we propose a novel multimodal sentiment-fused method, MMSF, which extracts and integrates visual, audio and textual features of videos. In addition, as sentiment is one of the motivations of human behavior, we are the first to introduce sentiment into such a multimodal method, using a cross-attention mechanism that enhances the video features for speaking style recognition. The proposed MMSF is evaluated on the long-form video understanding benchmark, and the experimental results show that it is superior to the state-of-the-art methods.

Shot Retrieval and Assembly with Text Script for Video Montage Generation

  • Guoxing Yang
  • Haoyu Lu
  • Zelong Sun
  • Zhiwu Lu

With the development of video sharing websites, numerous users wish to create their own attractive video montages. However, it is difficult for inexperienced users to create well-edited video montages due to a lack of professional expertise, and it is time-consuming even for experts, who must effectively select shots from abundant candidates and assemble them together. Instead of manual creation, various automatic methods have been proposed for video montage generation; they typically take a single sentence as input for text-to-shot retrieval and ignore cross-sentence semantic coherence given a complicated text script of multiple sentences. To overcome this drawback, we propose a novel model for video montage generation that retrieves and assembles shots with arbitrary text scripts. To this end, a sequence consistency transformer is devised for cross-sentence coherence modeling. More importantly, with this transformer, two novel sequence-level tasks are defined for sentence-shot alignment at the sequence level: a Cross-Modal Sequence Matching (CMSM) task and a Chaotic Sequence Recovering (CSR) task. To facilitate research on video montage generation, we construct a new, highly varied dataset that collects thousands of video-script pairs from documentaries. Extensive experiments on the constructed dataset demonstrate the superior performance of the proposed model. The dataset and generated video demos are available at https://github.com/RATVDemo/RATV.

Multi-granularity Separation Network for Text-Based Person Retrieval with Bidirectional Refinement Regularization

  • Shenshen Li
  • Xing Xu
  • Fumin Shen
  • Yang Yang

Text-based person retrieval is one of the fundamental tasks in the field of computer vision, which aims to retrieve the most relevant pedestrian image from all candidates according to a textual description. Such a cross-modal retrieval task can be challenging since it requires properly selecting distinguishing clues and performing cross-modal alignments. To achieve cross-modal alignments, most previous works focus on different inter-modal constraints while overlooking the influence of intra-modal noise, yielding sub-optimal retrieval results in certain cases. To this end, we propose a novel framework termed Multi-granularity Separation Network with Bidirectional Refinement Regularization (MSN-BRR) to tackle the problem. The framework consists of two components: (1) the Multi-granularity Separation Network, which extracts multi-grained discriminative textual and visual representations at local and global semantic levels; and (2) Bidirectional Refinement Regularization, which alleviates the influence of intra-modal noise and facilitates proper alignment between the visual and textual representations. Extensive experiments on two widely used benchmarks, i.e., CUHK-PEDES and ICFG-PEDES, show that our MSN-BRR method outperforms current state-of-the-art methods.

Graph Interactive Network with Adaptive Gradient for Multi-Modal Rumor Detection

  • Tiening Sun
  • Zhong Qian
  • Peifeng Li
  • Qiaoming Zhu

With more and more messages in the form of text and images spreading on the Internet, multi-modal rumor detection has become a focus of recent research. However, most existing methods simply concatenate or fuse image features with text features, which cannot fully explore the interaction between modalities. Meanwhile, they ignore the convergence inconsistency between strong and weak modalities, that is, the dominant rumor text modality may inhibit the optimization of the image modality. In this paper, we investigate multi-modal rumor detection from a novel perspective and propose a Multi-modal Graph Interactive Network with Adaptive Gradient (MGIN-AG) to solve the problem of insufficient information mining within and between modalities and to alleviate the optimization imbalance. Specifically, we first construct a fine-grained graph for each rumor text or image to explicitly capture the relations between text tokens or image patches within a single modality. Then, a cross-modal interaction graph between text and image is designed to implicitly mine text-image interactions, focusing especially on the consistency and mutual enhancement between image patches and text tokens. Furthermore, we extract the text embedded in images as an important supplement to improve the performance of the model. Finally, a strategy of dynamically adjusting the model gradient is introduced to alleviate the under-optimization of weak modalities in the multi-modal rumor detection task. Extensive experiments demonstrate the superiority of our model in comparison with state-of-the-art baselines.

Towards Shape-regularized Learning for Mitigating Texture Bias in CNNs

  • Harsh Sinha
  • Adriana Kovashka

CNNs have emerged as powerful techniques for object recognition. However, the test performance of CNNs is contingent on similarity to the training distribution. Existing methods focus on data augmentation to address out-of-domain generalization. In contrast, we enforce a shape bias by encouraging our model to learn features that correlate with those learned from the shape of the object. We show that explicit shape cues enable CNNs to learn features that are robust to unseen image manipulations, i.e. novel textures with the same semantic content. Our models are validated on the Toys4K dataset, which consists of 4,179 3D-object and image pairs. To quantify texture bias, we synthesize dataset variants called Style (style transfer with GANs), CueConflict (conflicting texture and semantics), and Scrambled (obfuscating semantics by scrambling pixel blocks). Our experiments show that the benefits of using shape are not tied to a specific shape representation such as point clouds; the same benefits can be obtained from a simpler representation such as the distance transform.
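The distance transform mentioned as a simpler shape representation maps each pixel to its distance from the nearest foreground pixel. A brute-force NumPy sketch (illustrative only; real pipelines would use an efficient algorithm such as `scipy.ndimage.distance_transform_edt`):

```python
import numpy as np

def distance_transform(mask):
    """Brute-force Euclidean distance transform: each pixel gets its
    distance to the nearest foreground (True) pixel. O(pixels * foreground)."""
    fg = np.argwhere(mask)                        # foreground coordinates
    ys, xs = np.indices(mask.shape)
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1)
    d = np.sqrt(((pts[:, None, :] - fg[None, :, :]) ** 2).sum(-1)).min(1)
    return d.reshape(mask.shape)

m = np.zeros((5, 5), dtype=bool)
m[2, 2] = True                                    # single foreground pixel
dt = distance_transform(m)
print(dt[2, 2], dt[0, 0])                         # 0.0 and sqrt(8) ~ 2.83
```

The resulting map encodes the object's shape while discarding its texture entirely, which is why it serves as a texture-free supervisory signal.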

ASCS-Reinforcement Learning: A Cascaded Framework for Accurate 3D Hand Pose Estimation

  • Mingqi Chen
  • Feng Shuang
  • Shaodong Li
  • Xi Liu

3D hand pose estimation can be achieved by cascading a feature extraction module and a feature exploitation module, where reinforcement learning (RL) has proven to be an effective way to perform feature exploitation. This paper points out the prospect of improving accuracy with a better exploitation strategy and proposes an Adaptive Step-Critic Shared RL (ASCS-RL) strategy for accurate feature exploitation in 3D hand pose estimation. Hand joint features are exploited in a multi-task manner and divided into two groups according to the distributions of estimation error. An RL-based adaptive-step (AS-RL) strategy is then used to obtain the optimal step size for better exploitation. The exploitation process is finally performed using a critic-shared RL (CS-RL) strategy, where both groups share a universal critic mechanism. Ablation studies and extensive experiments evaluate the performance of ASCS-RL on the ICVL and NYU datasets. The results show that the strategy achieves state-of-the-art accuracy in monocular depth-based 3D hand pose estimation, performing best on ICVL. Experiments also validate that ASCS-RL realizes a better trade-off between accuracy and running speed.

Multi-modal Fake News Detection on Social Media via Multi-grained Information Fusion

  • Yangming Zhou
  • Yuzhou Yang
  • Qichao Ying
  • Zhenxing Qian
  • Xinpeng Zhang

The easy sharing of multimedia content on social media has caused a rapid dissemination of fake news, which threatens society’s stability and security. Therefore, fake news detection has garnered extensive research interest in the field of social forensics. Current methods primarily concentrate on the integration of textual and visual features but fail to effectively exploit multi-modal information at both fine-grained and coarse-grained levels. Furthermore, they suffer from an ambiguity problem due to a lack of correlation between modalities or a contradiction between the decisions made by each modality. To overcome these challenges, we present a Multi-grained Multi-modal Fusion Network (MMFN) for fake news detection. Inspired by the multi-grained process of human assessment of news authenticity, we respectively employ two Transformer-based pre-trained models to encode token-level features from text and images. The multi-modal module fuses fine-grained features, taking into account coarse-grained features encoded by the CLIP encoder. To address the ambiguity problem, we design uni-modal branches with similarity-based weighting to adaptively adjust the use of multi-modal features. Experimental results demonstrate that the proposed framework outperforms state-of-the-art methods on three prevalent datasets.

Learning and Fusing Multi-Scale Representations for Accurate Arbitrary-Shaped Scene Text Recognition

  • Mingjun Li
  • Shuo Xu
  • Feng Su

Scene text in natural images carries a wealth of valuable semantic information, while due to the largely varied appearance of the text, accurately recognizing scene text is a challenging task. In this work, we propose an arbitrary-shaped scene text recognition method based on learning and fusing multiple representations of text in the scale space with attention mechanisms. Specifically, as distinctive visual features of text often appear at different scales, given an input text image, we generate a family of multi-scale representations that capture complementary appearance characteristics of the text through multiple encoder branches with progressively increasing scale parameters. We further introduce edge map features as a supplementary high-frequency representation with useful text cues. We then refine the multi-scale representations with in-scale and cross-scale attention mechanisms and adaptively aggregate them into an enhanced representation of the text, which effectively improves the text recognition accuracy. The proposed text recognition method achieves competitive results on several scene text benchmarks, demonstrating its effectiveness in recognizing text of various shapes.

Modeling Functional Brain Networks with Multi-Head Attention-based Region-Enhancement for ADHD Classification

  • Chunhong Cao
  • Huawei Fu
  • Gai Li
  • Mengyang Wang
  • Xieping Gao

Increasing attention has been paid to attention-deficit hyperactivity disorder (ADHD)-assisted diagnosis using functional brain networks (FBNs), since FBN-based ADHD diagnosis can not only extract functional connectivities from FBNs as potential biomarkers for brain disease classification but also identify the focal regions of the disease. Therefore, modeling FBNs has become a key topic for ADHD diagnosis via resting-state functional magnetic resonance imaging (rfMRI). However, the dominant models either ignore the strong regional correlation between adjacent time series or fail to capture the long-distance dependency (LDD) in imaging series. To address these issues, we propose a multi-head attention-based region-enhancement model (MAREM) for ADHD classification. Firstly, a multi-head attention mechanism with region enhancement is designed to represent the FBNs, where a region-enhancement module processes the strong regional correlation between adjacent time series. Secondly, multi-head attention maps the region information of each time point into different subspaces to establish global dependencies in the imaging series. Thirdly, the proposed model is applied to the ADHD-200 dataset for classification. The results show that the proposed model outperforms the state of the art in both classification accuracy and generalization ability. Furthermore, we identify several brain networks that clinical studies have associated with ADHD.

SPAE: Spatial Preservation-based Autoencoder for ADHD functional brain networks modelling

  • Chunhong Cao
  • Gai Li
  • Huawei Fu
  • Xingxing Li
  • Xieping Gao

Spatio-temporal modelling based on resting-state functional magnetic resonance imaging (rsfMRI) of ADHD has been a major concern in the neuroimaging community, given the differences in the roles of brain regions between attention deficit hyperactivity disorder (ADHD) patients and the typically developing control group (TC). Several spatio-temporal deep learning models have been proposed for rsfMRI; however, due to the high dimensionality and small sample size of brain data, most models use dimension-reduced data as input, which loses the original spatial relationships in the brain data. Although the Recurrent Neural Network (RNN) and attention mechanisms proposed in recent years can extract local correlations and long-distance dependencies (LDD), the spatio-temporal relationships they rely on have lost their original high-dimensional spatial relevance. Therefore, we propose a spatial preservation-based autoencoder for modelling ADHD functional brain networks (FBNs) that embeds spatial information and combines both an RNN and a Transformer, addressing the issue that dimension-reduced data cannot preserve the original high-dimensional spatial correlations. Firstly, a spatial preservation module is designed to fill the gap between the original data and the dimension-reduced data. Secondly, a dimension reduction module and a feature extraction module are designed to improve the representation of spatio-temporal correlations. Thirdly, the extracted FBNs are applied to disease classification on the ADHD-200 dataset, showing the model’s effectiveness in classifying ADHD compared with state-of-the-art methods. Finally, we investigate the differences in regional correlations between ADHD and TC.

We Are Not So Similar: Alleviating User Representation Collapse in Social Recommendation

  • Bingchao Wu
  • Yangyuxuan Kang
  • Bei Guan
  • Yongji Wang

Integrating social relations into recommendation is an effective way to mitigate data sparsity. Most social recommendation methods encode user representations from a unified graph that includes user-user and user-item relations. Due to the enriched relations on this graph, a large fraction of users are aware of each other within only a few hops, and the user representations generated by existing methods may encode the information received from a large number of neighbors. Thus, many user representations are enforced to be too similar, which hinders modeling fine-grained user interest. Here, we name this phenomenon as user representation collapse. To address this problem, in this paper we propose a robust user representation learning method named RobustSR with social regularization and multi-view contrastive learning, which aim to enhance the model’s awareness of relation informativeness and the discriminativeness of user representations, respectively. Concretely, the social regularization mechanism encourages the model to learn from the relation importance weights derived from graph topologies, which helps recognize important observed relations meanwhile mining potential useful relations. To enhance the discriminativeness of user representations, we further perform multi-view contrastive learning between collaborative and social-enhanced user representations. Extensive experiments on four benchmark datasets show that RobustSR effectively alleviates user representation collapse and improves recommendation performance. Our code is deposited at https://github.com/paulpig/RobustSR.

Towards Practical Consistent Video Depth Estimation

  • Pengzhi Li
  • Yikang Ding
  • Linge Li
  • Jingwei Guan
  • Zhiheng Li

Monocular depth estimation algorithms aim to explore the possible links between 2D and 3D data, but existing methods still struggle to predict consistent depth from a casual video. Relying on camera poses and optical flow in time-consuming test-time training phases makes these methods fail in many scenarios and keeps them from practical use. In this work, we present a data-driven post-processing method that overcomes these challenges and achieves online processing. Based on a deep recurrent network, our method takes the adjacent original and optimized depth maps as inputs to learn temporal consistency from the dataset and achieves higher depth accuracy. Our approach can be applied to multiple single-frame depth estimation models and used for various real-world scenes in real time. In addition, to tackle the lack of a temporally consistent video depth training dataset of dynamic scenes, we propose an approach to generate training video sequences from a single image by inferring a motion field. To the best of our knowledge, this is the first data-driven plug-and-play method for improving the temporal consistency of depth estimation for casual videos. Extensive experiments on three datasets and three depth estimation models show that our method outperforms the state-of-the-art methods.

Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval

  • Jiancheng Pan
  • Qing Ma
  • Cong Bai

Recently, remote sensing cross-modal retrieval has received considerable attention from researchers. However, the unique nature of remote sensing images leads to many semantic confusion zones in the semantic space, which greatly affects retrieval performance. We propose a novel scene-aware aggregation network (SWAN) to reduce semantic confusion by improving scene perception capability. In visual representation, a visual multiscale fusion module (VMSF) is presented to fuse visual features at different scales as the visual representation backbone. Meanwhile, a scene fine-grained sensing module (SFGS) is proposed to establish associations among salient features at different granularities. A scene-aware visual aggregation representation is formed from the visual information generated by these two modules. In textual representation, a textual coarse-grained enhancement module (TCGE) is designed to enhance the semantics of text and to align visual information. Furthermore, as the diversity and differentiation of remote sensing scenes weaken the understanding of scenes, a new metric, namely scene recall, is proposed to measure the perception of scenes by evaluating scene-level retrieval performance, which can also verify the effectiveness of our approach in reducing semantic confusion. Through performance comparisons, ablation studies and visualization analysis, we validate the effectiveness and superiority of our approach on two datasets, RSICD and RSITMD. The source code is available at https://github.com/kinshingpoon/SWAN-pytorch.

Zero-shot Sketch-based Image Retrieval with Adaptive Balanced Discriminability and Generalizability

  • Jialin Tian
  • Xing Xu
  • Zuo Cao
  • Gong Zhang
  • Fumin Shen
  • Yang Yang

Zero-shot sketch-based image retrieval (ZS-SBIR) is a task that learns semantic knowledge and embedding extraction to retrieve similar images using a sketch, without any training examples of unseen classes. Existing methods have attempted to address the modal and semantic gaps in ZS-SBIR with various strategies, such as leveraging category linguistic information for improved discriminability and utilizing knowledge distillation to increase the model’s generalizability towards unseen classes. However, these methods fail to consider discriminability and generalizability in a unified manner. To address this, we propose a novel method called Adaptive Balanced Discriminability and Generalizability (ABDG) for ZS-SBIR. Specifically, our ABDG method utilizes an advanced two-stage knowledge distillation scheme to balance the learning of discriminability and generalizability for each instance. In addition to the task-agnostic teacher models used in existing work to preserve structural information, we introduce a task-specific teacher model pre-trained with a classification objective to emphasize the discriminability property during knowledge distillation. We also employ a novel entropy-based weighting strategy to balance the effects of structural information preservation and classification losses according to the classification progress of each instance. Furthermore, we use fine-grained semantic relevance to refine the student model’s own predictions, with the aim of improving its performance as the training objective converges. Experimental results on three benchmark datasets for ZS-SBIR demonstrate that our ABDG method achieves state-of-the-art performance by balancing the learning of discriminative and generalizable properties.

Label-wise Deep Semantic-Alignment Hashing for Cross-Modal Retrieval

  • Liang Li
  • Weiwei Sun

Hashing plays an important role in the content-based retrieval of multimedia data. Existing methods focus on designing various joint-optimization strategies to preserve the similarity relationships between different modalities and improve the performance of cross-modal retrieval. However, when the intrinsic information of any modality is significantly insufficient compared with the others, the final hash space collapses to a trivial space due to the joint training. Compared with inter-modal semantic alignment, directly aligning the semantics of each modality with the label-wise semantics can obtain higher-quality common semantic spaces, and only a pair-wise alignment between the spaces is needed to obtain a unified representation. Following this idea, we design a general hash generation framework for uni-modal embedding and directly align the hash codes from different modalities using a pair-wise loss, further improving the retrieval performance within the less-semantic modality. Within this framework, we introduce two optimizations to penalize dissimilar items. First, using a Gaussian distribution to describe the hashing semantic distribution, JS-divergence is introduced to keep label-wise semantics consistent with hashing similarity. Second, an attention mechanism is used for hard-sample re-weighting to further learn fine-grained distribution alignment. We conduct extensive experiments on three public datasets to validate the enhancements of our work.
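The JS-divergence used above to keep label-wise semantics consistent with hashing similarity is the standard symmetric divergence between two distributions. A minimal generic sketch for discrete distributions (illustrative only, not the paper’s implementation) is:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Because it is symmetric and bounded, JS-divergence is a common choice for aligning two learned distributions.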

TsP-Tran: Two-Stage Pure Transformer for Multi-Label Image Retrieval

  • Ying Li
  • Chunming Guan
  • Jiaquan Gao

Image retrieval aims to find similar images given a query. Most existing retrieval works are based on pre-trained models for single-label image classification. In practice, the query usually contains more than one instance, and a single label is far from enough to fully depict the attributes of an open-world image. Due to the complicated similarity relationships between multiple semantics, the multi-label image retrieval task is not as well solved as the single-label task. In this work, we propose a two-stage pure Transformer model for multi-label image retrieval, which leverages a Transformer encoder to exploit the complex dependencies among visual features and labels. Besides the Transformer encoder, the image feature embedding module is also based on Transformer, so that the optimal model weights can be learned in an end-to-end manner. To be specific, inputs of the Transformer encoder mainly consist of a Vision Transformer branch and a label embedding branch, which generate suitable image features and label descriptions, respectively. Given an input set of visual features and text labels, the developed Transformer encoder can be optimized accordingly in the training stage with a compressed multi-label output layer. In order to obtain sufficient outputs to accurately find images containing semantics similar to the query in the database, we adjust the network by removing the last fully connected layer in the retrieval stage. Specifically, images and labels are used in the training stage in a randomly masked manner to enhance model performance, and no labels are visible in the content-based image retrieval stage. Comprehensive experiments are performed on three multi-label datasets including MS-COCO, NUS-WIDE and VOC2007, demonstrating promising results of our proposed method against the state of the art for multi-label image retrieval.

MuseHash: Supervised Bayesian Hashing for Multimodal Image Representation

  • Maria Pegia
  • Björn Þór Jónsson
  • Anastasia Moumtzidou
  • Ilias Gialampoukidis
  • Stefanos Vrochidis
  • Ioannis Kompatsiaris

This paper presents a novel method for supporting multiple modalities in the field of image retrieval, called Multimodal Bayesian Supervised Hashing (MuseHash). The method takes into consideration the semantic information of the training data through the use of Bayesian regression to estimate the semantic probabilities and statistical properties in the retrieval process. MuseHash is an extension of the previously proposed Bayesian ridge-based Semantic Preserving Hashing (BiasHash) method. Experimentation on various domain-specific and benchmark datasets demonstrates that MuseHash outperforms seven existing state-of-the-art methods in image retrieval performance, regardless of the feature extractor type, code length, and visual or textual descriptors used. This highlights the robustness and adaptability of MuseHash, making it a promising solution for multimodal image retrieval.

Reference-Limited Compositional Zero-Shot Learning

  • Siteng Huang
  • Qiyao Wei
  • Donglin Wang

Compositional zero-shot learning (CZSL) refers to recognizing unseen compositions of known visual primitives, which is an essential ability for artificial intelligence systems to learn and understand the world. While considerable progress has been made on existing benchmarks, we question whether popular CZSL methods can address the challenges of few-shot and few-reference compositions, which are common when learning in real-world unseen environments. To this end, we study the challenging reference-limited compositional zero-shot learning (RL-CZSL) problem in this paper: given limited seen compositions that contain only a few samples as reference, unseen compositions of observed primitives should be identified. We propose a novel Meta Compositional Graph Learner (MetaCGL) that can efficiently learn compositionality from insufficient referential information and generalize to unseen compositions. Besides, we build a benchmark with two new large-scale datasets that consist of natural images with diverse compositional labels, providing more realistic environments for RL-CZSL. Extensive experiments on the benchmarks show that our method achieves state-of-the-art performance in recognizing unseen compositions when reference is limited for compositional learning.

Exploration of Lightweight Single Image Denoising with Transformers and Truly Fair Training

  • Haram Choi
  • Cheolwoong Na
  • Jinseop Kim
  • Jihoon Yang

As multimedia content often contains noise from intrinsic defects of digital devices, image denoising is an important step for high-level vision recognition tasks. Although several studies have advanced the denoising field by employing sophisticated Transformers, these networks are too memory-intensive for real-world applications. Additionally, there is a lack of research on lightweight denoising (LWDN) with Transformers. To handle this, we provide seven comparative baseline Transformers for LWDN, serving as a foundation for future research. We also demonstrate that the parts of randomly cropped patches significantly affect denoising performance during training. While previous studies have overlooked this aspect, we aim to train our baseline Transformers in a truly fair manner. Furthermore, we conduct empirical analyses of various components to determine the key considerations for constructing LWDN Transformers. Codes are available at https://github.com/rami0205/LWDN.

TAGM: Task-Aware Graph Model for Few-shot Node Classification

  • Feng Zhao
  • Min Zhang
  • Tiancheng Huang
  • Donglin Wang

Graph representation learning has attracted tremendous attention due to its remarkable performance in a variety of real-world applications. However, because data labeling is always time- and resource-intensive, current supervised graph representation learning models for particular tasks frequently suffer from label sparsity issues. In light of this, graph few-shot learning has been proposed to tackle the performance degradation in the face of limited annotated data. While recent advances in graph few-shot learning achieve promising performance, they are typically forced to use a generic feature embedding across various tasks. Ideally, we want to construct feature embeddings that are tuned for the given task because of the differences in distribution between tasks. In this work, we propose a novel Task-Aware Graph Model (TAGM) to learn task-aware node embeddings. Specifically, we provide a new graph cell design that includes a graph convolution layer for aggregating and updating graph information as well as a two-layer linear transformation for node feature transformation. On this basis, we encode task information to learn a binary weight mask set and a gradient mask set, where the weight mask set selects different network parameters for different tasks and the gradient mask set dynamically updates the selected network parameters in different manners during optimization. Our model is thus more sensitive to task identity and performs better on a given task’s graph input. Our extensive experiments on three graph-structured datasets demonstrate that our proposed method generally outperforms state-of-the-art baselines in few-shot learning.
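The binary weight-mask and gradient-mask idea described above follows a general masking pattern. The toy sketch below is hypothetical (the function, values, and learning rate are illustrative, not the TAGM code): a weight mask selects a task-specific sub-network, while a gradient mask gates which parameters are allowed to move during an update step.

```python
def masked_step(weights, grads, weight_mask, grad_mask, lr=0.1):
    """One illustrative update step with task-specific binary masks.

    weight_mask selects the active sub-network for this task;
    grad_mask selects which parameters may be updated.
    """
    # Forward pass uses only the parameters the weight mask keeps active.
    effective = [w * m for w, m in zip(weights, weight_mask)]
    # The update touches only parameters the gradient mask leaves unmasked.
    updated = [w - lr * g * m for w, g, m in zip(weights, grads, grad_mask)]
    return effective, updated
```

Different tasks supply different mask pairs, so they effectively train different sub-networks inside one shared parameter set.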

Learning with Adaptive Knowledge for Continual Image-Text Modeling

  • Yutian Luo
  • Yizhao Gao
  • Zhiwu Lu

In realistic application scenarios, existing methods for image-text modeling have limitations in dealing with data streams: training on all data requires excessive computation/storage resources, and full access to previous data may not even be available. In this work, we thus propose a new continual image-text modeling (CITM) setting that requires a model to be trained sequentially on a number of diverse image-text datasets. Although recent continual learning methods can be directly applied to the CITM setting, most of them only consider reusing part of the previous data or aligning the output distributions of the previous and new models, which is a partial or indirect way to acquire the old knowledge. In contrast, we propose a novel dynamic historical adaptation (DHA) method which can holistically and directly review the old knowledge from a historical model. Concretely, the historical model transfers all of its parameters to the main/current model to utilize the holistic old knowledge. In turn, the main model dynamically transfers its parameters to the historical model every five training steps to ensure that the knowledge gap between them does not grow too large. Extensive experiments show that our proposed DHA outperforms other representative/latest continual learning methods under the CITM setting.

A Dual-branch Enhanced Multi-task Learning Network for Multimodal Sentiment Analysis

  • Wenxiu Geng
  • Xiangxian Li
  • Yulong Bian

Multimodal sentiment analysis is a complex research problem. First, current multimodal approaches fail to adequately consider the intricate multi-level correspondence between modalities and the unique contextual information within each modality; second, cross-modal fusion methods for inter-modal fusion somewhat weaken the modality-specific internal features, which is a limitation of the traditional single-branch model. To this end, we propose a dual-branch enhanced multi-task learning network (DBEM), a new architecture that considers both the multiple dependencies of sequences and the heterogeneity of multimodal data, for better multimodal sentiment analysis. The global-local branch takes into account the intra-modal dependencies of time subsequences of different lengths and aggregates global and local features to enrich feature diversity. The cross-refine branch considers the difference in information density between modalities and adopts coarse-to-fine fusion learning to model the inter-modal dependencies. Coarse-grained fusion achieves low-level feature reinforcement of the audio and visual modalities, and fine-grained fusion improves the ability to integrate complementary information across different levels of modalities. Finally, multi-task learning is carried out to improve the generalization and performance of the model based on the enhanced fusion features obtained from the dual-branch network. Compared with a single-branch network (SBEM, a variant of the DBEM model) and SOTA methods, the experimental results on the two datasets CH-SIMS and CMU-MOSEI validate the effectiveness of the DBEM model.

FedPcf : An Integrated Federated Learning Framework with Multi-Level Prospective Correction Factor

  • Yu Zang
  • Zhe Xue
  • Shilong Ou
  • Yunfei Long
  • Hai Zhou
  • Junping Du

In recent years, the issue of data privacy has attracted more and more attention. Federated learning is a practical solution for training a model while guaranteeing data privacy. It has two main characteristics: first, the data in the clients is usually non-IID; second, the data of each client cannot be shared. Because of the non-IID data, the optimal solution of each client is often inconsistent with the global optimal solution: the non-IID data causes clients to optimize along their local optimal directions and drift away from the global optimal solution during training. Due to this client drift problem, the server tends to converge slowly, which limits the overall communication efficiency of federated learning. To improve communication efficiency, in this paper we propose a new federated learning framework which integrates a multi-level prospective correction factor into the training procedures of the server and clients. We propose a global prospective correction factor in server aggregation to reduce model communication rounds and accelerate convergence. In client training, we introduce a local prospective correction factor to alleviate client drift. Both global and local prospective correction factors are integrated into a unified federated learning framework to further improve communication efficiency. Extensive experiments conducted on several datasets demonstrate that our method can effectively improve communication efficiency and is robust to different federated learning environments.
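For context, plain FedAvg aggregation is the standard server-side baseline that correction-factor frameworks like the one above build on: the server takes a data-size-weighted average of client parameters. The sketch below shows only this generic baseline; the paper’s prospective correction factors are not reproduced here.

```python
def fedavg(client_weights, client_sizes):
    """Plain FedAvg aggregation: data-size-weighted average of client models.

    client_weights: list of parameter vectors (one per client)
    client_sizes:   list of local dataset sizes (aggregation weights)
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    # Each coordinate of the global model is the weighted mean of that
    # coordinate across clients, weighted by local data size.
    return [sum(w[i] * n / total for w, n in zip(client_weights, client_sizes))
            for i in range(dim)]
```

Under non-IID data, the weighted average alone converges slowly, which is exactly the client-drift problem the abstract targets.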

Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval

  • Lina Sun
  • Yewen Li
  • Yumin Dong

Unsupervised cross-modal hashing (UCMH) has attracted increasing research attention due to its efficient retrieval performance and label irrelevance. However, existing methods have some bottlenecks. First, existing unsupervised methods suffer from inaccurate similarity measures due to the lack of correlation between features of different modalities, and simple features cannot fully describe the fine-grained relationships of multi-modal data. Second, existing methods have rarely explored vision-language knowledge distillation schemes that distil the multi-modal knowledge of vision-language models to guide the learning of student networks. To address these bottlenecks, this paper proposes an effective unsupervised cross-modal hashing retrieval method, called Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval (VLKD). VLKD uses a vision-language pre-training (VLP) model to encode features of multi-modal data, and then constructs a similarity matrix to provide soft similarity supervision for the student model. It distils the knowledge of the VLP model into the student model to gain an understanding of multi-modal knowledge. In addition, we design an end-to-end unsupervised hashing learning model that incorporates a graph convolutional auxiliary network. The auxiliary network aggregates information from similar data nodes based on the similarity matrix distilled from the teacher model to generate more consistent hash codes. Finally, the teacher network does not require additional training; it only needs to guide the student network to learn high-quality hash representations, so VLKD is quite efficient in training and retrieval. Extensive experiments on three multimedia retrieval benchmark datasets show that the proposed method achieves better retrieval performance than existing unsupervised cross-modal hashing methods, demonstrating its effectiveness.

A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments

  • Yaoqing Li
  • Sheng-Hua Zhong
  • Shuai Li
  • Yan Liu

Simultaneous Localization and Mapping (SLAM) has developed into a fundamental method for intelligent robot perception over the past decades. Most existing feature-based SLAM systems rely on traditional hand-crafted visual features and a strong static-world assumption, which makes these systems vulnerable in complex dynamic environments. In this paper, we propose a robust monocular SLAM system that combines geometry-based methods with two convolutional neural networks. Specifically, a lightweight deep local feature detection network is proposed as the system front-end, which can efficiently generate keypoints and binary descriptors robust to variations in illumination and viewpoint. In addition, we propose a motion segmentation and depth estimation network that simultaneously predicts pixel-wise moving-object segmentation and a depth map, so that our system can easily discard dynamic features and reconstruct 3D maps without dynamic objects. The comparison against state-of-the-art methods on publicly available datasets shows the effectiveness of our system in highly dynamic environments.

Symbol Location-Aware Network for Improving Handwritten Mathematical Expression Recognition

  • Yingnan Fu
  • Wenyuan Cai
  • Ming Gao
  • Aoying Zhou

Recently, most handwritten mathematical expression recognition (HMER) methods adopt the attention-based encoder-decoder framework, which generates LaTeX sequences from given images. However, the accuracy of the attention mechanism limits the performance of HMER models. The lack of global context information in the decoding process is also a challenge for HMER. Some methods adopt symbol-level counting to localize symbols and improve model performance, but these methods do not work well. In this paper, we propose a method named SLAN, short for Symbol Location-Aware Network, to solve the HMER problem. Specifically, we propose an advanced relation-level counting method to detect symbols in the image. We solve the lack of global context with a new global context-aware decoder. To improve the accuracy of attention, we design a novel attention alignment loss function based on dynamic programming, which can learn attention alignment directly without pixel-level labels. We conducted extensive experiments on the CROHME dataset to demonstrate the effectiveness of each part of SLAN and achieved state-of-the-art performance.

SESSION: Regular Short Papers

Text-to-Image Fashion Retrieval with Fabric Textures

  • Daichi Suzuki
  • Go Irie
  • Kiyoharu Aizawa

In this study, we propose text-to-image fashion retrieval that captures the texture of clothing fabrics. A fabric’s texture is a major factor governing the comfort and appearance of clothes and significantly influences user preferences. However, unlike patterns and shapes that can readily be captured from a global image of the entire piece of clothing, extracting the fine and ambiguous characteristics of textures is considerably more challenging. The key idea is that by focusing on "local" regions of clothing, detailed fabric textures can be captured more accurately. To this end, we propose a framework for learning cross-modal features from both global (the entire garment) and local (a close-up detail) image-text pairs. To verify the idea, we constructed a new dataset named Global and Local FACAD (G&L FACAD) by modifying the existing large-scale public FACAD dataset used for fashion retrieval. The experimental results confirm that the retrieval accuracy is significantly improved compared to the baselines. The code is available at https://github.com/SuzukiDaichi-git/texture_aware_fashion_retrieval.git.

Escaping local minima in deep reinforcement learning for video summarization

  • Panagiota Alexoudi
  • Ioannis Mademlis
  • Ioannis Pitas

State-of-the-art deep neural unsupervised video summarization methods mostly fall under the adversarial reconstruction framework. This employs a Generative Adversarial Network (GAN) structure and Long Short-Term Memory (LSTM) autoencoders during its training stage. The typical result is a selector LSTM that sequentially receives video frame representations and outputs corresponding scalar importance factors, which are then used to select key-frames. This basic approach has been augmented with an additional Deep Reinforcement Learning (DRL) agent, trained using the Discriminator’s output as a reward, which learns to optimize the selector’s outputs. However, local minima are a well-known problem in DRL. Thus, this paper presents a novel regularizer for escaping local loss minima, in order to improve unsupervised key-frame extraction. It is an additive loss term employed during a second training phase, that rewards the difference of the neural agent’s parameters from those of a previously found good solution. Thus, it encourages the training process to explore more aggressively the parameter space in order to discover a better local loss minimum. Evaluation performed on two public datasets shows considerable increases over the baseline and against the state-of-the-art.
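The additive regularizer described above, a loss term that rewards distance from a previously found solution, can be sketched generically as follows. This is an illustration of the idea only; `lam` is a hypothetical coefficient, and the actual parameterization in the paper may differ.

```python
def regularized_loss(task_loss, params, prev_params, lam=0.01):
    """Loss with an additive term that *rewards* squared parameter distance
    from a previously found solution, pushing training away from that
    local minimum during the second phase."""
    distance = sum((p - q) ** 2 for p, q in zip(params, prev_params))
    # Subtracting the (scaled) distance lowers the loss as the parameters
    # move farther from the old solution, encouraging wider exploration.
    return task_loss - lam * distance
```

Minimizing this objective therefore trades task performance against exploration of the parameter space, with `lam` controlling the trade-off.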

A Comparison of Video Browsing Performance between Desktop and Virtual Reality Interfaces

  • Florian Spiess
  • Ralph Gasser
  • Silvan Heller
  • Heiko Schuldt
  • Luca Rossetto

Interactive retrieval with user-friendly and performant interfaces remains a necessity for video retrieval, even in light of significant gains in retrieval performance through multi-modal encoders. In recent years, novel interaction modalities such as virtual reality (VR) and augmented reality (AR) have gained popularity, but the best way to adapt paradigms from traditional retrieval interfaces, especially for result browsing and interaction, remains an open research question. In this paper, we compare two video retrieval interfaces in a controlled setting to gain insight into the differences in video browsing between VR and desktop interfaces. We formulate hypotheses explaining why there might be performance differences between the two interfaces, define metrics to test the hypotheses, and show results based on data gathered at an evaluation campaign. Our results show that VR interfaces can be competitive in browsing performance and indicate that there can even be an advantage when browsing larger result sets in VR.

More Than Simply Masking: Exploring Pre-training Strategies for Symbolic Music Understanding

  • Zhexu Shen
  • Liang Yang
  • Zhihan Yang
  • Hongfei Lin

Pre-trained language models have become the prevailing approach for handling natural language processing tasks in recent years. Given the similarities in sequential structure between symbolic music and natural language text, it is fairly logical to apply pre-training methods to symbolic music data. However, the disparity between music and natural language text makes it difficult to comprehensively model the unique features of music through traditional text-based pre-training strategies alone. To address this challenge, in this paper we design the quad-attribute masking (QM) strategy and propose the key prediction (KP) task to improve the extraction of generic knowledge from symbolic music. We evaluate the impact of various pre-training strategies on several public symbolic music datasets, and the results of our experiments reveal that the proposed multi-task pre-training model can effectively capture music domain knowledge from symbolic music data and significantly improve performance on downstream tasks.

SOFA: Style-based One-shot 3D Facial Animation Driven by 2D landmarks

  • Pu Ching
  • Hung-Kuo Chu
  • Min-Chun Hu

We propose a 2D landmark-driven 3D facial animation framework trained without the need for a 3D facial dataset. Our method decomposes the 3D facial avatar into geometry and texture. Given 2D landmarks as input, our models learn to estimate the parameters of FLAME and transfer the target texture to different facial expressions. The experiments show that our method achieves remarkable results. Using 2D landmarks as input data, our method has the potential to be deployed in scenarios where full RGB facial images are difficult to obtain (e.g., when the face is occluded by a VR head-mounted display).

Strong-Weak Cross-View Interaction Network for Stereo Image Super-Resolution

  • Kun He
  • Changyu Li
  • Jie Shao

Recently, super-resolution (SR) performance has been improved by using stereo images, since beneficial information can be provided by the other view. Transformers have shown significant performance gains for computer vision tasks, but they require huge computing resources and training time. To alleviate this problem, we introduce an efficient Transformer feature extraction block, which can efficiently capture long-range pixel interactions with lower resource consumption. Many kinds of cross-view interaction modules exist for stereo image SR, but each limits SR performance within its own model. To address this challenge, we propose the strong-weak cross-view interaction mechanism, which consists of a strong cross-view interaction module and a weak cross-view interaction module. Benefiting from the proposed mechanism, SR performance can be improved significantly with a negligible increase in computing cost. We integrate the efficient Transformer feature extraction module and the strong-weak cross-view interaction mechanism into a unified framework named the strong-weak cross-view interaction network (SWCVIN), and extensive experiments on three benchmark datasets show that the proposed model achieves state-of-the-art results.

Multi-view Contrastive Learning with Additive Margin for Adaptive Nasopharyngeal Carcinoma Radiotherapy Prediction

  • Jiabao Sheng
  • Sai-Kit Lam
  • Zhe Li
  • Jiang Zhang
  • Xinzhi Teng
  • Yuanpeng Zhang
  • Jing Cai

The accurate prediction of adaptive radiation therapy (ART) for nasopharyngeal carcinoma (NPC) patients before radiation therapy (RT) is crucial for minimizing toxicity and enhancing patient survival rates. Owing to the complexity of the tumor micro-environment, a single high-resolution image offers only limited insight. Furthermore, the traditional softmax-based loss falls short in quantifying a model’s discriminative power. To address these challenges, we introduce a supervised multi-view contrastive learning approach with an additive margin (MMCon). For each patient, we consider four medical images to form multi-view positive pairs, which supply supplementary information and bolster the representation of medical images. We employ supervised contrastive learning to determine the embedding space, ensuring that NPC samples from the same patient or with the same labels stay in close proximity while NPC samples with different labels are distant. To enhance the discriminative ability of the loss function, we incorporate a margin into the contrastive learning process. Experimental results show that this novel learning objective effectively identifies an embedding space with superior discriminative abilities for NPC images.
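An additive margin in contrastive learning, as referenced above, is commonly realized by subtracting a fixed margin from the positive pair’s similarity before the softmax, forcing a stricter decision boundary. The sketch below is a generic InfoNCE-style illustration, not the MMCon implementation; the similarity values, margin, and temperature are hypothetical.

```python
import math

def am_contrastive_loss(sims, pos_idx, margin=0.2, temp=0.1):
    """InfoNCE-style loss with an additive margin on the positive pair.

    sims:    cosine similarities of one anchor against all candidates
    pos_idx: index of the positive candidate
    """
    # Subtract the margin only from the positive similarity, so the model
    # must separate positive and negatives by at least that margin.
    logits = [(s - margin if i == pos_idx else s) / temp
              for i, s in enumerate(sims)]
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[pos_idx] / sum(exps))
```

A larger margin increases the loss for the same similarities, which is what gives the embedding space its stronger discriminative ability.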

Recommendation of Mix-and-Match Clothing by Modeling Indirect Personal Compatibility

  • Shuiying Liao
  • Yujuan Ding
  • P.Y. Mok

Fashion recommendation considers both product similarity and compatibility, and has drawn increasing research interest. It is a challenging task because it often needs to use information from different sources, such as visual content or textual descriptions for the prediction of user preferences. In terms of complementary recommendation, existing approaches were dedicated to modeling either product compatibility or users’ personalization in a direct and decoupled manner, yet overlooked additional relations hidden within historical user-product interactions. In this paper, we propose a Normalized indirect Personal Compatibility modeling scheme based on Bayesian Personalized Ranking (NiPC-BPR) for mix-and-match clothing recommendations. We exploit direct and indirect personalization and compatibility relations from the user and product interactions, and effectively integrate various multi-modal data. Extensive experimental results on two benchmark datasets show that our method outperforms other methods by large margins.
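The Bayesian Personalized Ranking objective that NiPC-BPR builds on is standard: it maximizes the score gap between an observed (positive) item and a sampled negative item for each user. A minimal sketch of the per-triple loss:

```python
import math

def bpr_loss(score_pos, score_neg):
    """Bayesian Personalized Ranking loss: -log sigmoid(s_pos - s_neg).

    score_pos / score_neg are a user's predicted preference scores for an
    observed item and a sampled unobserved item.
    """
    x = score_pos - score_neg
    return -math.log(1.0 / (1.0 + math.exp(-x)))
```

The loss shrinks toward zero as the positive item is ranked ever further above the negative one.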

Video Retrieval for Everyday Scenes With Common Objects

  • Arun Zachariah
  • Praveen Rao

We propose a video retrieval system for everyday scenes with common objects. Our system exploits the predictions made by deep neural networks for image understanding tasks using natural language processing (NLP). It aims to capture the relationships between objects in a video scene as well as the ordering of the matching scenes. For each video in the database, it identifies and generates a sequence of key scene images. For each such scene, it generates the most probable captions using state-of-the-art models for image captioning. The captions are parsed and represented by tree structures using NLP techniques. These are then stored and indexed in a database system. When a user poses a query video, a sequence of key scenes is generated. For each scene, its caption is generated using deep learning and parsed into its corresponding tree structure. After that, optimized tree-pattern queries are constructed and executed on the database to retrieve a set of candidate videos. Finally, these candidate videos are ranked using a combination of the longest common subsequence of scene matches and the tree-edit distance between parse trees. We evaluated the performance of our system using the MSR-VTT dataset, which contains everyday scenes. We observed that our system achieved higher mean average precision (mAP) compared to two recent techniques, namely CSQ and DnS.
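The final ranking step combines the longest common subsequence (LCS) of scene matches with tree-edit distance between parse trees. A minimal sketch of the LCS component follows; the scene identifiers and the candidate dictionary are illustrative stand-ins for the system's caption-derived scene representations.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rank_candidates(query_scenes, candidates):
    """Rank candidate videos by LCS overlap with the query's scene sequence.

    In the full system this score would be combined with tree-edit distance
    between caption parse trees; here only the ordering component is shown.
    """
    scored = [(name, lcs_length(query_scenes, scenes))
              for name, scenes in candidates.items()]
    return sorted(scored, key=lambda t: -t[1])

# hypothetical scene labels standing in for parsed caption representations
ranking = rank_candidates(
    ["dog", "park", "ball", "run"],
    {"vidA": ["dog", "ball", "run"], "vidB": ["cat", "sofa"], "vidC": ["park", "run"]},
)
```

Because LCS respects ordering, a candidate sharing the same scenes in the same sequence ranks above one that merely shares unordered content.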

Offensive Tactics Recognition in Broadcast Basketball Videos Based on 2D Camera View Player Heatmaps

  • subst Nico
  • Tse-Yu Pan
  • Herman Prawiro
  • Jian-Wei Peng
  • Wen-Cheng Chen
  • Hung-Kuo Chu
  • Min-Chun Hu

It is essential for sports teams to review their offensive and defensive tactical execution performance as well as understand their opponents’ tactics in order to identify effective counterattack strategies. This study focuses on basketball offensive tactics recognition based on 2D camera-view heatmaps. Most current tactics recognition methods learn the spatiotemporal correlation of players based on top-view trajectory information. To obtain correct top-view player trajectories, robust camera calibration and player tracking techniques are indispensable. However, for broadcast videos with large camera movement, serious player occlusions, and similar player jerseys, it is quite challenging to obtain accurate camera parameters and player tracking results, resulting in poor tactical analysis performance. Instead of applying camera calibration and player tracking, this study attempts to design a tactics recognition method that directly predicts the tactics class from 2D camera-view player heatmaps in the inference phase. Our proposed method uses a recurrent convolutional neural network with coordinate embedding to directly identify the tactics. Moreover, an auxiliary top-view player trajectory reconstruction module is added in the training phase to acquire better latent codes to represent the tactics. The experimental results show that for both supervised and unsupervised settings, our proposed method achieves comparable accuracy to current tactics classification methods that rely on perfect top-view trajectory input.

Graph Contrastive Learning on Complementary Embedding for Recommendation

  • Meishan Liu
  • Meng Jian
  • Ge Shi
  • Ye Xiang
  • Lifang Wu

Previous works build interest learning by mining deeply on interactions. However, interactions are incomplete and insufficient to support interest modeling, and can even introduce severe bias into recommendations. To address interaction sparsity and the consequent bias challenges, we propose graph contrastive learning on complementary embedding (GCCE), which introduces negative interests to assist the positive interests of interactions for interest modeling. To embed interest, we design a perturbed graph convolution that prevents the embedding distribution from becoming biased. Since negative samples are not available in the general scenario of implicit feedback, we design a complementary embedding generation scheme to depict users’ negative interests. Finally, we develop a new contrastive task that learns from the positive and negative interests to promote recommendation. We validate the effectiveness of GCCE on two real datasets, where it outperforms the state-of-the-art models for recommendation.

Improving Generalization for Multimodal Fake News Detection

  • Sahar Tahmasebi
  • Sherzod Hakimov
  • Ralph Ewerth
  • Eric Müller-Budack

The increasing proliferation of misinformation and its alarming impact have motivated both industry and academia to develop approaches for fake news detection. However, state-of-the-art approaches are usually trained on datasets of smaller size or with a limited set of specific topics. As a consequence, these models lack generalization capabilities and are not applicable to real-world data. In this paper, we propose three models that adopt and fine-tune state-of-the-art multimodal transformers for multimodal fake news detection. We conduct an in-depth analysis by manipulating the input data, aimed at exploring model performance in realistic use cases on social media. Our study across multiple models demonstrates that these systems suffer significant performance drops on manipulated data. To reduce the bias and improve model generalization, we suggest training data augmentation to conduct more meaningful experiments for fake news detection on social media. The proposed data augmentation techniques enable models to generalize better and yield improved state-of-the-art results.

MemeFier: Dual-stage Modality Fusion for Image Meme Classification

  • Christos Koutlis
  • Manos Schinas
  • Symeon Papadopoulos

Hate speech is a societal problem that has significantly grown through the Internet. New forms of digital content such as image memes have given rise to the spread of hate through multimodal means, which is far more difficult to analyse and detect than the unimodal case. Accurate automatic processing, analysis and understanding of this kind of content will facilitate the endeavor of hindering hate speech proliferation throughout the digital world. To this end, we propose MemeFier, a deep learning-based architecture for fine-grained classification of Internet image memes, utilizing a dual-stage modality fusion module. The first fusion stage produces feature vectors containing modality alignment information that captures non-trivial connections between the text and image of a meme. The second fusion stage leverages the power of a Transformer encoder to learn inter-modality correlations at the token level and yield an informative representation. Additionally, we consider external knowledge as an additional input, and background image caption supervision as a regularizing component. Extensive experiments on three widely adopted benchmarks, i.e., Facebook Hateful Memes, Memotion7k and MultiOFF, indicate that our approach competes with, and in some cases surpasses, the state of the art. Our code is available on GitHub.

CNNs with Multi-Level Attention for Domain Generalization

  • Aristotelis Ballas
  • Christos Diou

In the past decade, deep convolutional neural networks have achieved significant success in image classification and ranking, therefore finding numerous applications in multimedia content retrieval. Still, these models suffer from performance degradation when neural networks are tested in out-of-distribution scenarios or on data originating from previously unseen data domains. In the present work, we focus on this problem of Domain Generalization and propose an alternative neural network architecture for robust, out-of-distribution image classification. We attempt to produce a model that focuses on the causal features of the depicted class for robust image classification in the Domain Generalization setting. To achieve this, we propose attending to multiple levels of information throughout a Convolutional Neural Network and leveraging the most important attributes of an image by employing trainable attention mechanisms. To validate our method, we evaluate our model on four widely accepted Domain Generalization benchmarks, where it surpasses previously reported baselines on three of the four datasets and achieves the second-best score on the fourth.

Improving Query and Assessment Quality in Text-Based Interactive Video Retrieval Evaluation

  • Werner Bailer
  • Rahel Arnold
  • Vera Benz
  • Davide Coccomini
  • Anastasios Gkagkas
  • Gylfi Þór Guðmundsson
  • Silvan Heller
  • Björn Þór Jónsson
  • Jakub Lokoc
  • Nicola Messina
  • Nick Pantelidis
  • Jiaxin Wu

Different task interpretations are a highly undesired element in interactive video retrieval evaluations. When a participating team focuses partially on a wrong goal, the evaluation results might become partially misleading. In this paper, we propose a process for refining known-item and open-set type queries, and preparing the assessors that judge the correctness of submissions to open-set queries. Our findings from recent years reveal that a proper methodology can lead to objective query quality improvements and subjective participant satisfaction with query clarity.

Multimodal Topic Segmentation of Podcast Shows with Pre-trained Neural Encoders

  • Iacopo Ghinassi
  • Lin Wang
  • Chris Newell
  • Matthew Purver

We present two multimodal models for topic segmentation of podcasts built on pre-trained neural text and audio embeddings. We show that results can be improved by combining different modalities, but also by combining different encoders from the same modality, especially general-purpose sentence embeddings with specifically fine-tuned ones. We also show that audio embeddings can be substituted with two simple features related to sentence duration and inter-sentential pauses with comparable results. Finally, we publicly release our two datasets, to our knowledge the first publicly and freely available multimodal datasets for topic segmentation.

Tweaking EfficientDet for frugal training

  • Georgios Orfanidis
  • Konstantinos Ioannidis
  • Anastasios Tefas
  • Stefanos Vrochidis
  • Ioannis Kompatsiaris

Object detection appears to be omnipresent nowadays, with detectors available for every problem, covering solutions from extra-light to ultra-resource-demanding models. Yet, the vast majority of these approaches rely on large datasets to provide the required feature diversity. This work focuses on object detection solutions which do not rely heavily on abundant training datasets but rather on medium-sized data collections. It uses the EfficientDet object detector as a base for novel modifications that achieve better performance in both efficiency and effectiveness. The focus on medium-sized datasets aims to represent more commonplace datasets that can be accumulated and compiled with relative ease.

Deep Enhanced-Similarity Attention Cross-modal Hashing Learning

  • Mingyuan Ge
  • Yewen Li
  • Longfei Ma
  • Mingyong Li

Despite the great success of existing cross-modal retrieval methods, existing unsupervised cross-modal hashing methods still suffer from common problems. First, the features extracted from text are too sparse. Second, the similarity matrices of the different modalities cannot be fused adaptively. In this paper, we propose Deep Enhanced-Similarity Attention Hashing (DESAH) to alleviate the above problems. First, we construct a text encoder that extends a graph convolutional neural network to simultaneously extract features of samples and their semantic neighbors, enriching the text features. Second, we propose an enhanced attention fusion mechanism, which adaptively fuses the similarity matrices within different modalities into a unified inter-modal similarity matrix that guides the learning of hash functions. Extensive experiments have demonstrated that DESAH provides significant improvements in cross-modal retrieval tasks compared to baseline methods.

TNOD: Transformer Network with Object Detection for Tag Recommendation

  • Kai Feng
  • Tao Liu
  • Heng Zhang
  • Zihao Meng
  • Zemin Miao

In recent years, the hashtag has become an effective tool for managing and distributing social media content. Most existing tag recommendation methods rely on user profiles to improve the F1 score by roughly 20%; in their final results, multimodal information accounts for 58% of the total, while user information accounts for 42%. However, these methods neither sufficiently fuse information across modalities nor exploit visual information, one of the main sources of tags. In this paper, we propose a novel model entitled Transformer Network with Object Detection (TNOD), which utilizes the contextual semantics of entities combined with text and image information, forming a multi-layer attention mechanism. In particular, we use object detection to extract entities in images and fuse the relationships between entities, text, and images with a multi-layer attention mechanism. Experimental results validate the superiority of our proposed scheme in terms of recall and precision.

CLAP: Contrastive Language-Audio Pre-training Model for Multi-modal Sentiment Analysis

  • Tianqi Zhao
  • Ming Kong
  • Tian Liang
  • Qiang Zhu
  • Kun Kuang
  • Fei Wu

Multi-modal Sentiment Analysis (MSA) is a hotspot of multi-modal fusion. To make full use of the correlation and complementarity between modalities when fusing multi-modal data, we propose a two-stage framework of Contrastive Language-Audio Pre-training (CLAP) for the MSA task: 1) contrastive pre-training on unlabeled, large-scale external data to yield better single-modal representations; 2) a Transformer-based multi-modal fusion module that achieves further single-modal feature optimization and sentiment prediction via a task-driven training process. Our work fully demonstrates the importance and necessity of core elements such as pre-training, contrastive learning, and representation learning for the MSA task, and significantly outperforms existing methods on two well-recognized MSA benchmarks.

SESSION: Brave New Ideas Paper

Framing the News: From Human Perception to Large Language Model Inferences

  • David Alonso del Barrio
  • Daniel Gatica-Perez

Identifying the frames of news is important to understand an article’s vision, intention, message to be conveyed, and which aspects of the news are emphasized. Framing is a widely studied concept in journalism, and has emerged as a new topic in computing, with the potential to automate processes and facilitate the work of journalism professionals. In this paper, we study this issue with articles related to the Covid-19 anti-vaccine movement. First, to understand the perspectives used to treat this theme, we developed a protocol for human labeling of frames for 1786 headlines of No-Vax movement articles from European newspapers in 5 countries. Headlines are key units in the written press, and worthy of analysis, as many people read only headlines (or use them to decide whether to read further). Second, considering advances in Natural Language Processing (NLP) with large language models, we investigated two approaches for frame inference of news headlines: first with a GPT-3.5 fine-tuning approach, and second with GPT-3.5 prompt engineering. Our work contributes to the study and analysis of how well these models can facilitate journalistic tasks such as frame classification, while examining whether the models are able to replicate human perception in the identification of these frames.

SESSION: Doctoral Symposium Paper

Dual-Path Semantic Construction Network for Composed Query-Based Image Retrieval

  • Shenshen Li

Composed Query-Based Image Retrieval (CQBIR) aims to retrieve the most relevant image from all the candidates according to the composed query. However, the multi-modal query makes it more challenging to learn proper semantics, which must capture both the traits mentioned in the text and resemblance to the reference image. Improperly learned semantics reduce the performance of existing CQBIR methods. To this end, we propose a novel framework termed Dual-Path Semantic Construction Network for Composed Query-Based Image Retrieval (DSCN). It consists of three components: (1) a Multi-level Feature Extraction module that obtains textual and visual features at various hierarchies for learning multi-level semantics; (2) a Visual-to-Textual Semantic Construction module that refines the learned semantics at the textual level; and (3) a Textual-to-Visual Semantic Construction module that performs semantic guidance in the visual semantic space. Extensive experiments on three benchmarks, i.e., FashionIQ, Shoes, and Fashion200k, show that our DSCN method outperforms recent state-of-the-art methods.

SESSION: Reproducibility Track Paper

Reproducibility Companion Paper: MeTILDA - Platform for Melodic Transcription in Language Documentation and Application

  • Mitchell Lee
  • Chris Lee
  • Sanjay Penmetsa
  • Min Chen
  • Mizuki Miyashita
  • Naatosi Fish
  • Bo Wu
  • Omar Khan

This companion paper supports the replication of the development and evaluation of “MeTILDA - Platform for Melodic Transcription in Language Documentation and Application” that we presented at ICMR 2021. MeTILDA aims to help document and analyze pitch patterns of endangered languages including Blackfoot, whose prosodic system is characterized by pitch movements. It develops a new form of audio analysis (termed the MeT scale, a perceptual scale) and automates the process of creating visual aids (Pitch Art) to provide more effective visuals of perceived changes in pitch movement. In this paper, we explain the file structure of the source code and publish the details of our data as well as system operations. Moreover, we provide a link to a demo video to facilitate the use of our platform.

SESSION: Technical Demonstrations

CalorieCam360: Simultaneous Eating Action Recognition of Multiple People Using an Omnidirectional Camera

  • Kento Terauchi
  • Keiji Yanai

In recent years, as people have become more health-conscious, dietary management has become increasingly important. Existing methods record only one person’s meals or eating movements, but cannot record the meals of multiple people at the same time. Therefore, we aim to simultaneously record the meals of all people around a dining table using an omnidirectional camera.

In this study, we propose CalorieCam360, a system that records the entire dining table using only an omnidirectional camera and a smartphone. Note that all the processing is done inside the smartphone application without using any external servers. Since the images from the omnidirectional camera are distorted and cannot be used for detection as they are, the distortion is corrected using plane projection. The corrected images are used to detect rectangular objects that serve as references for object size, and the area is calculated by combining object detection and region segmentation to estimate the amount of calories from the area. The system then uses person detection and region segmentation to track the person and the food and records the amount of food consumed and its calorie content for each person. We demonstrate that CalorieCam360 can record an entire meal at once for multiple people around the table.

VISIONE: A Large-Scale Video Retrieval System with Advanced Search Functionalities

  • Giuseppe Amato
  • Paolo Bolettieri
  • Fabio Carrara
  • Fabrizio Falchi
  • Claudio Gennaro
  • Nicola Messina
  • Lucia Vadicamo
  • Claudio Vairo

VISIONE is a large-scale video retrieval system that integrates multiple search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system leverages cutting-edge AI technology for visual analysis and advanced indexing techniques to ensure scalability. As demonstrated by its runner-up position in the 2023 Video Browser Showdown competition, VISIONE effectively integrates these capabilities to provide a comprehensive video retrieval solution. A system demo is available online, showcasing its capabilities on over 2300 hours of diverse video content (V3C1+V3C2 dataset) and 12 hours of highly redundant content (Marine dataset). The demo can be accessed at https://visione.isti.cnr.it/.

navigu.net: NAvigation in Visual Image Graphs gets User-friendly

  • Kai Uwe Barthel
  • Nico Hezel
  • Konstantin Schall
  • Klaus Jung

Due to the size of today’s image collections, it can be challenging to fully understand their content. Recent technological advances have enabled efficient visual search systems that use joint visual and textual feature vectors to identify similar images based on image queries or text descriptions. Despite their effectiveness, high-dimensional feature vectors can lead to long search times for large collections. In this demonstration, we propose a solution that significantly reduces search times and increases the efficiency of the search system. By combining two separate image graphs, our method provides fast approximate nearest-neighbor search and allows seamless visual exploration of the entire collection in real time through a standard web browser, using familiar navigation techniques such as zooming and dragging, common in systems like Google Maps.

MAAM: Media Asset Annotation and Management

  • Manos Schinas
  • Panagiotis Galopoulos
  • Symeon Papadopoulos

Artificial intelligence can facilitate the management of large amounts of media content and enable media organizations to extract valuable insights from their data. Although AI for media understanding has made rapid progress over recent years, its deployment in applications and professional sectors poses challenges, especially to organizations with no AI expertise. This motivated the creation of the Media Asset Annotation and Management platform (MAAM), which employs state-of-the-art deep learning models to annotate and facilitate the management of image and video assets. Annotation models provided by MAAM include automatic captioning, object detection, action recognition and moderation models, such as NSFW and disturbing-content classifiers. By annotating media assets with these models, MAAM supports easy navigation, filtering and retrieval of media assets. In addition, our platform leverages the power of deep learning to support advanced visual and multi-modal retrieval capabilities. This allows accurately identifying assets that convey a similar idea or concept even if they are not visually identical, and supports a state-of-the-art reverse-search facility for images and videos.

Cross-Language Music Recommendation Exploration

  • Stefanos Stoikos
  • David Kauchak
  • Douglas Turnbull
  • Alexandra Papoutsaki

Recommendation systems are essential for music platforms to drive exploration and discovery for users. Little work has been done in exploring cross-language music recommendation systems, which represent another avenue for music exploration. In this paper, we collected and created a database of over 200,000 artists, which includes subsets of artists who sing in 8 languages other than English. Our goal was to recommend artists from those 8 language subsets for a given English-speaking artist. Using Spotify’s related-artists API feature, we implemented two approaches: a matrix factorization model using alternating least squares, and a breadth-first search system. Both systems perform significantly better than a random baseline based on the accuracy of the base artist’s genre, with the breadth-first search model outperforming the matrix factorization technique. We conclude with suggestions for improving the performance and reach of cross-language music recommendation systems.

SESSION: Keynote Talk Abstracts

How Responsible LLMs are beneficial to search and exploration in Retail industry

  • Nozha BOUJEMAA
  • Abdelrahman HASSAN
  • Giorgi Kokaia
  • Pratyush Kumar Sinha

Efficient CNNs and Transformers for Video Understanding and Image Synthesis

  • Jürgen Gall

In this talk, I will first discuss approaches that reduce the GFLOPs during inference for 3D convolutional neural networks (CNNs) and vision transformers. While state-of-the-art 3D CNNs and vision transformers achieve very good results on action recognition datasets, they are computationally very expensive and require many GFLOPs. While the GFLOPs of a 3D CNN or vision transformer can be decreased by reducing the temporal feature resolution or the number of tokens, there is no setting that is optimal for all input clips. I will therefore discuss two differentiable sampling approaches that can be plugged into any existing 3D CNN or vision transformer architecture. The sampling approaches adapt the computational resources to the input video so that as many resources as needed, but no more than necessary, are used to classify a video. The approaches substantially reduce the computational cost (GFLOPs) of state-of-the-art networks while preserving the accuracy. In the second part, I will discuss an approach that generates annotated training samples of very rare classes. It is based on a generative adversarial network (GAN) that jointly synthesizes images and the corresponding segmentation mask for each image. The generated data can then be used for one-shot video object segmentation.

Recognizing Actions in Videos under Domain Shift

  • Elisa Ricci

Action recognition, which consists of automatically recognizing the action being performed in a video sequence, is a fundamental task in computer vision and multimedia. Supervised action recognition has been widely studied because of the growing need to automatically categorize the video content that is being generated every day. However, it is nearly impossible for human annotators to keep pace with the enormous volumes of online videos, and thus supervised training becomes infeasible. A cheaper way of leveraging the massive pool of unlabelled data is to exploit an already trained model to infer labels on such data and then re-use them to build an improved model. Such an approach is also prone to failure because the unlabelled data may belong to a data distribution that differs from the annotated one. This is often referred to as the domain-shift problem. To address domain shift, Unsupervised Video Domain Adaptation (UVDA) methods have recently been proposed. However, these methods typically make strong and unrealistic assumptions. In this talk I will present some recent works of my research group on UVDA, showing that, thanks to recent advances in deep architectures and to the advent of foundation models, it is possible to deal with more challenging and realistic settings and recognize out-of-distribution classes.

SESSION: Tutorial Abstract

Algorithms for Generating and Evaluating Visually Sorted Grid Layouts

  • Kai Uwe Barthel

The increasing amount of visual data shared online highlights the importance of organizing and finding related content. However, current efforts to improve visual search and image classification lack support for exploratory image search. Sorting images by similarity offers a solution, allowing users to view several hundred images at once.

This tutorial covers the main principles of image sorting techniques, including visual feature vectors to be used, various sorting algorithms, and metrics used to evaluate sorting results. A new sorting algorithm (Linear Assignment Sorting), efficient optimization, and coding examples will also be presented. By the end of the workshop, participants will be able to implement image sorting schemes and address any special requirements related to layout and positioning constraints.

Despite efforts to improve visual search and image classification, there is little research in exploratory visual image search. Humans can easily understand complex images, but have difficulties with a large number of unordered individual images. When searching photo archives or trying to find products online, users are often presented with vast collections of images. However, as human perception is limited, overview is quickly lost when too many images are displayed at once. Typically, only about 10-20 images can be perceived on a single screen, which is a small fraction of the number of available images. Because image archives and e-commerce websites do not offer visual browsing or exploration of their collection, users are left with unstructured lists of images from keyword or similarity searches.

It has been shown that a sorted arrangement helps users to identify regions of interest more easily and thus find the images they are looking for more quickly [4, 9, 10, 13]. Figure 1 shows an example of 256 kitchenware images in random order on the left and visually sorted on the right. Although the sorted images may not be recognized perfectly, users can quickly identify where images of interest are located.

If the images are represented as high-dimensional feature vectors, their similarities can be expressed by appropriate visualization techniques. A variety of dimensionality reduction algorithms have been proposed to visualize high-dimensional data relationships in two dimensions. Conventional dimensionality reduction schemes like Principal Component Analysis (PCA) [8], Multidimensional Scaling (MDS) [12], Locally Linear Embedding (LLE) [11], Isomap [16], t-Distributed Stochastic Neighborhood Embedding (t-SNE) [17], Uniform Manifold Approximation and Projection (UMAP) [7] cannot be used for image sorting because they result in unequally distributed and overlapping images. Furthermore, only a fraction of the display area is used.

To arrange or sort a set of images based on their similarity and utilize the maximum display area effectively, three requirements must be satisfied: 1. The images should not overlap. 2. The image arrangement should cover the entire display area. 3. The similarity relationships of the high-dimensional image feature vectors should be preserved by the 2D image positions.

To satisfy these requirements, only grid-based arrangements can be used for sorting/arranging images. As the number of ways to arrange images in a dense regular grid increases factorially with the grid size, finding the optimal arrangement becomes impractical. However, approximate solutions can be obtained through the use of Self-Organizing Maps (SOMs) [5, 6], Self-Sorting Maps (SSMs) [14, 15], or discrete optimization algorithms like IsoMatch [3]. The main concepts of these image sorting techniques will be explained, and the newly proposed Linear Assignment Sorting (LAS) [1], as well as an approach based on neural networks for learning permutations, will be presented.
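The grid-assignment idea common to these methods can be sketched as a linear assignment problem between projected feature points and grid-cell centers. The sketch below (a PCA projection plus `scipy.optimize.linear_sum_assignment`) is illustrative of the general principle, not the LAS algorithm itself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def grid_arrange(features, rows, cols):
    """Assign each item to a unique grid cell so similar items land close together.

    Illustrative sketch: project high-dimensional features to 2D via PCA,
    then solve a linear assignment between projected points and cell centers.
    """
    n = rows * cols
    assert features.shape[0] == n, "need exactly rows*cols items"
    # 2D projection onto the top-2 principal directions
    X = features - features.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = X @ Vt[:2].T                                  # (n, 2) projected coordinates
    # normalize the projection into the grid's coordinate range
    P = (P - P.min(axis=0)) / (np.ptp(P, axis=0) + 1e-9)
    P = P * np.array([cols - 1, rows - 1])
    # grid-cell centers (x, y)
    gy, gx = np.mgrid[0:rows, 0:cols]
    G = np.stack([gx.ravel(), gy.ravel()], axis=1).astype(float)
    # cost = squared distance from each projected point to each cell center
    cost = ((P[:, None, :] - G[None, :, :]) ** 2).sum(axis=-1)
    _, cell_idx = linear_sum_assignment(cost)
    return cell_idx                                   # cell_idx[i] is item i's cell

# toy demo: 16 random 8-D "feature vectors" arranged on a 4x4 grid
rng = np.random.default_rng(0)
cells = grid_arrange(rng.normal(size=(16, 8)), rows=4, cols=4)
```

The assignment guarantees the three requirements above by construction: every item occupies exactly one cell, the grid is fully covered, and the cost minimization preserves neighborhood relations of the projection as well as possible.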

An overview is given of which visual feature vectors are best suited for visual image sorting. For images, one might expect that feature vectors from neural networks would be best suited to describe them, which is definitely true for retrieval tasks. However, when neural feature vectors are used to visually sort larger sets of images, the arrangements often look somewhat confusing because images can have very different appearances even though they represent a similar concept. Because people pay strong attention to color when viewing larger sets of images and visually group similar-looking images, feature vectors that describe visual appearance are generally better suited for arrangements that are perceived as "well-organized."

There are several metrics for evaluating 2D grid layouts, but there is little experimental evidence of the correlation between human-perceived quality and the value of the metric. We propose Distance Preservation Quality (DPQ) [1] as a new metric to evaluate the quality of an arrangement and present the results of extensive user testing, which revealed a stronger correlation of DPQ with user-perceived quality and performance in image retrieval tasks compared to other metrics.

Optimization techniques for fast image sorting of large image sets are presented, including filtering using integral images, fast matching/swapping using a solver for linear assignment problems, and other techniques. In addition, tips and tricks for sorting images are provided, especially when there are constraints on the layout shape and fixed positioning of certain images.
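The filtering speed-up mentioned above builds on the standard summed-area-table (integral image) trick, which turns any rectangular sum into four table lookups. A minimal sketch with hypothetical helper names (the tutorial's exact filtering scheme may differ):

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[r, c] = sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum over img[r0:r1, c0:c1] in O(1) via four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

After one O(rows * cols) pass to build the table, every box filter over the map costs constant time regardless of the filter size, which is what makes repeated smoothing of large grids affordable.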

We will show how the presented techniques can be extended to the visualization of image graphs for continuously changing image sets, enabling visual exploration and recommendation in image collections. Navigu.net [2] is an online demo showing how large image collections can be visually explored. Jupyter notebooks used in the tutorial will be available at


SESSION: Workshop Abstracts

ICDAR’23: Intelligent Cross-Data Analysis and Retrieval

  • Guillaume Habault
  • Minh-Son Dao
  • Michael Alexander Riegler
  • Duc Tien Dang Nguyen
  • Yuta Nakashima
  • Cathal Gurrin

Recently, there has been increased interest in cross-data research problems, such as predicting air quality from lifelogging images, predicting congestion from weather and tweet data, and predicting sleep quality from daily exercise and meals. Although several studies focusing on multimodal data analytics have been conducted, little research has addressed cross-data problems (e.g., cross-modal, cross-domain, cross-platform). The article collection “Intelligent Cross-Data Analysis and Retrieval” aims to encourage research in intelligent cross-data analytics and retrieval and to contribute to the creation of a sustainable society. Researchers from diverse domains such as well-being, disaster prevention and mitigation, mobility, climate, tourism, and healthcare are welcome to contribute to this Research Topic.

MAD ’23 Workshop: Multimedia AI against Disinformation

  • Luca Cuccovillo
  • Bogdan Ionescu
  • Giorgos Kordopatis-Zilos
  • Symeon Papadopoulos
  • Adrian Popescu

With recent advances in synthetic media manipulation and generation, verifying multimedia content posted online has become increasingly difficult. In addition, malicious actors are exploiting AI technologies to disseminate disinformation on social media, and the Web more generally, at an alarming pace, posing significant threats to society and democracy. The development of AI-powered tools that facilitate media verification is therefore urgently needed. The MAD ’23 workshop aims to bring together people working on the broader topic of detecting disinformation in multimedia to exchange experiences and discuss innovative ideas, attracting participants with varying backgrounds and expertise. The research areas of interest include the identification of manipulated and synthetic content in multimedia, as well as the dissemination of disinformation and its impact on society. The multimedia aspect is central, since content most often contains a mix of modalities whose joint analysis can boost the performance of verification methods.

Introduction to the Sixth Annual Lifelog Search Challenge, LSC’23

  • Cathal Gurrin
  • Björn Þór Jónsson
  • Duc Tien Dang Nguyen
  • Graham Healy
  • Jakub Lokoc
  • Liting Zhou
  • Luca Rossetto
  • Minh-Triet Tran
  • Wolfgang Hürst
  • Werner Bailer
  • Klaus Schoeffmann

For the sixth time since 2018, the Lifelog Search Challenge (LSC) was organized as a comparative benchmarking exercise for interactive lifelog search systems. The goal of this international competition is to test system capabilities for accessing large multimodal lifelogs. LSC’23 attracted twelve participating teams, each of whom had developed a competitive interactive lifelog retrieval system. The benchmark was organized in front of a live audience at the LSC workshop at ACM ICMR’23. As in previous editions, this introductory paper presents the LSC workshop and introduces the participating lifelog search systems.