MMIR '23: Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval


SESSION: Workshop Presentations

Metaverse Retrieval: Finding the Best Metaverse Environment via Language

  • Ali Abdari
  • Alex Falcon
  • Giuseppe Serra

In recent years, the metaverse has sparked increasing interest across the globe and is projected to reach a market size of more than $1000B by 2030. This is due to its many potential applications in highly heterogeneous fields, such as entertainment and multimedia consumption, training, and industry. This new technology raises many research challenges since, as opposed to more traditional scene understanding, metaverse scenarios contain additional multimedia content, such as movies in virtual cinemas and operas in digital theaters, which greatly influences the relevance of the metaverse to a user query. For instance, if a user is looking for Impressionist exhibitions in a virtual museum, only the museums that showcase exhibitions featuring various Impressionist painters should be considered relevant. In this paper, we introduce the novel problem of text-to-metaverse retrieval, which poses the challenging objective of ranking a list of metaverse scenarios based on a given textual query. To the best of our knowledge, this represents the first step towards understanding and automating cross-modal tasks dealing with metaverses. Since no public datasets contain these important multimedia contents inside the scenes, we also collect and annotate a dataset which serves as a proof-of-concept for the problem. To establish a foundation for the task, we implement and analyze several deep learning-based solutions, and to promote transparency and reproducibility, we will publicly release their source code and the collected data.
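As a rough illustration of the ranking formulation (not the authors' implementation), the sketch below ranks pre-computed metaverse-scenario embeddings against a query embedding by cosine similarity; the encoders producing these embeddings are assumed and not shown.

```python
# Minimal sketch of text-to-metaverse ranking (not the paper's implementation).
# Assumes each scenario is already encoded as a fixed-size vector, e.g. by
# pooling embeddings of its multimedia contents; the encoders are hypothetical.
import torch
import torch.nn.functional as F

def rank_scenarios(query_embedding: torch.Tensor,
                   scenario_embeddings: torch.Tensor) -> torch.Tensor:
    """Return scenario indices sorted by cosine similarity to the query.

    query_embedding: (d,) tensor for the textual query.
    scenario_embeddings: (n, d) tensor, one row per metaverse scenario.
    """
    q = F.normalize(query_embedding, dim=-1)
    s = F.normalize(scenario_embeddings, dim=-1)
    scores = s @ q                               # cosine similarities, shape (n,)
    return torch.argsort(scores, descending=True)

# Example with random embeddings standing in for real encoders.
query = torch.randn(512)
scenarios = torch.randn(10, 512)
print(rank_scenarios(query, scenarios)[:3])      # top-3 scenario indices
```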

TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content

  • Avinash Anand
  • Raj Jaiswal
  • Pijush Bhuyan
  • Mohit Gupta
  • Siddhesh Bangar
  • Md. Modassir Imam
  • Rajiv Ratn Shah
  • Shin'ichi Satoh

The automatic recognition of tabular data in document images presents a significant challenge due to the diverse range of table styles and complex structures. Tables offer valuable content representation, enhancing the predictive capabilities of various systems such as search engines and Knowledge Graphs. The two main problems, namely table detection (TD) and table structure recognition (TSR), have traditionally been addressed independently. In this research, we propose an end-to-end pipeline that integrates deep learning models, including DETR, Cascade TabNet, and PP-OCRv2, to achieve comprehensive image-based table recognition. This integrated approach effectively handles diverse table styles, complex structures, and image distortions, resulting in improved accuracy and efficiency compared to existing methods like Table Transformer. Our system achieves simultaneous table detection, table structure recognition, and table content recognition (TCR), preserving table structures and accurately extracting tabular data from document images. The integration of multiple models addresses the intricacies of table recognition, making our approach a promising solution for image-based table understanding, data extraction, and information retrieval applications. Our proposed approach achieves an IoU of 0.96 and an OCR accuracy of 78%, a remarkable improvement of approximately 25% in OCR accuracy compared to the previous Table Transformer approach.
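The detect-then-parse-then-read pipeline described above can be pictured with the following sketch; the `detector`, `structure_model`, and `ocr_model` objects are hypothetical wrappers standing in for models such as DETR, Cascade TabNet, and PP-OCRv2, not the paper's actual interfaces.

```python
# High-level sketch of a TD -> TSR -> TCR pipeline in the spirit of TC-OCR.
# The detector/structure/OCR objects are assumed wrappers, not real library APIs.
from dataclasses import dataclass
from typing import List

@dataclass
class Cell:
    row: int
    col: int
    text: str

def recognize_tables(image, detector, structure_model, ocr_model) -> List[List[Cell]]:
    tables = []
    for box in detector.detect(image):                 # table detection (TD)
        crop = image.crop(box)
        layout = structure_model.parse(crop)           # table structure recognition (TSR)
        cells = [Cell(c.row, c.col, ocr_model.read(crop.crop(c.box)))  # content (TCR)
                 for c in layout.cells]
        tables.append(cells)
    return tables
```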

Prescription Recommendation based on Intention Retrieval Network and Multimodal Medical Indicator

  • Feng Gao
  • Yao Chen
  • Maofu Liu

Knowledge-based Clinical Decision Support Systems can provide precise and interpretable results for prescription recommendation. Many existing knowledge-based prescription recommendation systems take into account multi-modal historical medical events to learn from past experience. However, such approaches treat those events as independent, static information and neglect the fact that a patient's history is a sequence of chronological events. Hence, they lack the ability to extract the dynamics of prescription intentions and cannot provide precise and interpretable results for chronic disease patients with long-term or repeated visits. To address these limitations, we propose a novel Intention Aware Conditional Generation Net (IACoGNet), which introduces an optimized copy-or-predict mechanism to learn prescription intentions from multi-modal health datasets and generate drug recommendations. IACoGNet first designs a knowledge representation model that captures multi-modal patient features. Then, it proposes a novel prescription intention representation model for the multi-visit scenario and predicts the diagnostic intention. Finally, it constructs a prescription recommendation framework utilizing the above two knowledge representations. We validate IACoGNet on the public MIMIC dataset, and the experimental results show that IACoGNet achieves the best F1 score and average precision.
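A toy PyTorch sketch of a copy-or-predict step, gating between drugs copied from past prescriptions and drugs predicted from the full vocabulary, is shown below; the module sizes and gating are illustrative assumptions, not the IACoGNet architecture.

```python
# Toy copy-or-predict gating for drug recommendation; purely illustrative.
import torch
import torch.nn as nn

class CopyOrPredict(nn.Module):
    def __init__(self, hidden_dim: int, num_drugs: int):
        super().__init__()
        self.predict_head = nn.Linear(hidden_dim, num_drugs)   # predict from full vocabulary
        self.copy_gate = nn.Linear(hidden_dim, 1)               # probability of copying history

    def forward(self, patient_state, history_mask):
        # patient_state: (B, H) encoding of the current visit and intention
        # history_mask:  (B, num_drugs) binary mask of drugs prescribed in past visits
        predict_probs = torch.softmax(self.predict_head(patient_state), dim=-1)
        copy_probs = history_mask / history_mask.sum(-1, keepdim=True).clamp(min=1)
        gate = torch.sigmoid(self.copy_gate(patient_state))     # (B, 1)
        return gate * copy_probs + (1 - gate) * predict_probs
```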

Boon: A Neural Search Engine for Cross-Modal Information Retrieval

  • Yan Gong
  • Georgina Cosma

Visual-Semantic Embedding (VSE) networks can help search engines understand the meaning behind visual content and associate it with relevant textual information, leading to accurate search results. VSE networks can be used in cross-modal search engines to embed images and textual descriptions in a shared space, enabling image-to-text and text-to-image retrieval tasks. However, the full potential of VSE networks for search engines has yet to be fully explored. This paper presents Boon, a novel cross-modal search engine that combines two state-of-the-art networks: the GPT-3.5-turbo large language model and the VSE network VITR (VIsion Transformers with Relation-focused learning), to enhance the engine's capabilities in extracting and reasoning with regional relationships in images. VITR employs encoders from CLIP that were trained on 400 million image-description pairs, and it was fine-tuned on the RefCOCOg dataset. Boon's neural-based components serve as its main functionalities: 1) a 'cross-modal search engine' that enables end-users to perform image-to-text and text-to-image retrieval; 2) a 'multi-lingual conversational AI' component that enables end-users to converse about one or more selected images, a feature that makes the search engine accessible to a wide audience, including those with visual impairments; 3) multi-lingual support, so that Boon can take queries and handle conversations about images in multiple languages. Boon was implemented using the Django and PyTorch frameworks. The interface and capabilities of the Boon search engine are demonstrated using the RefCOCOg dataset, and the engine's ability to search for multimedia through the web is facilitated by Google's API.
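The shared-embedding retrieval step can be illustrated with off-the-shelf CLIP encoders from the `transformers` library; Boon itself builds on VITR and GPT-3.5-turbo, so this sketch only shows generic text-to-image ranking, and the image paths are placeholders.

```python
# Generic text-to-image retrieval over a small gallery using CLIP encoders.
# Illustrative only; not Boon's VITR-based retrieval pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["img1.jpg", "img2.jpg"]]   # placeholder gallery paths
inputs = processor(text=["a dog playing in the park"], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text holds the similarity of the query to each gallery image.
ranking = out.logits_per_text.squeeze(0).argsort(descending=True)
print(ranking)   # gallery indices from most to least relevant
```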

Video Referring Expression Comprehension via Transformer with Content-conditioned Query

  • Jiang Ji
  • Meng Cao
  • Tengtao Song
  • Long Chen
  • Yi Wang
  • Yuexian Zou

Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on a natural-language query. Recent improvements in video REC have been made using Transformer-based methods with learnable queries. However, we contend that this naive query design is not ideal given the open-world nature of video REC brought by text supervision. With numerous potential semantic categories, relying on only a few slowly updated queries is insufficient to characterize them. Our solution to this problem is to create dynamic queries that are conditioned on both the input video and language to model the diverse objects referred to. Specifically, we place a fixed number of learnable bounding boxes throughout the frame and use the corresponding region features to provide prior information. We also notice that current query features overlook the importance of cross-modal alignment. To address this, we align specific phrases in the sentence with semantically relevant visual areas, annotating them in existing video datasets (VID-Sentence and VidSTG). By incorporating these two designs, our proposed model (called ConFormer) outperforms other models on widely benchmarked datasets. For example, on the testing split of the VID-Sentence dataset, ConFormer achieves an 8.75% absolute improvement on Accu.@0.6 compared to the previous state-of-the-art model.
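A rough sketch of content-conditioned query generation, where region features pooled at learnable box priors are fused with a pooled sentence embedding to initialise the decoder queries, might look as follows; shapes and the fusion operator are illustrative assumptions rather than the ConFormer design.

```python
# Illustrative content-conditioned query module; not the ConFormer implementation.
import torch
import torch.nn as nn

class ContentConditionedQueries(nn.Module):
    def __init__(self, num_queries: int, dim: int):
        super().__init__()
        self.anchors = nn.Embedding(num_queries, 4)   # learnable box priors (cx, cy, w, h)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, region_feats, lang_feat):
        # region_feats: (B, num_queries, D) features pooled at the anchor boxes
        # lang_feat:    (B, D) pooled sentence embedding
        lang = lang_feat.unsqueeze(1).expand_as(region_feats)
        queries = self.proj(torch.cat([region_feats, lang], dim=-1))
        return queries, self.anchors.weight           # dynamic queries + box priors
```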

Dynamic Network for Language-based Fashion Retrieval

  • Hangfei Li
  • Yiming Wu
  • Fangfang Wang

Language-based fashion image retrieval, as a kind of composed image retrieval, presents a substantial challenge in the domain of multi-modal retrieval. This task aims to retrieve the target fashion item from the gallery given a reference image and a modification text. Existing approaches primarily concentrate on developing a static multi-modal fusion module to learn the combined semantics of the reference image and modification text. Despite their commendable advancements, these approaches are still limited by a deficiency in flexibility, which is attributed to the application of a single fusion module across diverse input queries. In contrast to static fusion methods, we propose a novel method termed Dynamic Fusion Network (DFN) to compose multi-granularity features dynamically by considering the consistency of the routing path and modality-specific information simultaneously. Specifically, our proposed method consists of two modules: (1) Dynamic Network, which enables a flexible combination of different operation modules, providing multi-granularity modality interaction for each reference image and modification text; (2) Modality-Specific Routers (MSR), which generate precise routing decisions based on the distinct semantics and distributions of each reference image and modification text. Extensive experiments on three benchmarks, i.e., FashionIQ, Shoes, and Fashion200K, demonstrate the effectiveness of our proposed model compared with existing methods.
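An illustrative sketch of the dynamic routing idea, where a router produces per-query weights over a small set of fusion operations, is given below; the operation set and gating are assumptions for exposition, not the exact DFN modules.

```python
# Illustrative per-query dynamic fusion; not the DFN implementation.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.coarse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())          # coarse fusion
        self.fine = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)    # fine-grained fusion
        self.router = nn.Linear(2 * dim, 2)                                      # per-query routing weights

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (B, D) global features of reference image / modification text
        pair = torch.cat([img_feat, txt_feat], dim=-1)
        weights = torch.softmax(self.router(pair), dim=-1)                       # (B, 2)
        fused_a = self.coarse(pair)                                              # (B, D)
        attn_out, _ = self.fine(txt_feat.unsqueeze(1), img_feat.unsqueeze(1),
                                img_feat.unsqueeze(1))
        fused_b = attn_out.squeeze(1)                                            # (B, D)
        return weights[:, :1] * fused_a + weights[:, 1:] * fused_b
```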

On Popularity Bias of Multimodal-aware Recommender Systems: A Modalities-driven Analysis

  • Daniele Malitesta
  • Giandomenico Cornacchia
  • Claudio Pomo
  • Tommaso Di Noia

Multimodal-aware recommender systems (MRSs) exploit multimodal content (e.g., product images or descriptions) as items' side information to improve recommendation accuracy. While most such methods rely on factorization models (e.g., MFBPR) as the base architecture, it has been shown that MFBPR may be affected by popularity bias, meaning that it inherently tends to boost the recommendation of popular (i.e., short-head) items to the detriment of niche (i.e., long-tail) items from the catalog. Motivated by this observation, in this work we provide one of the first analyses of how multimodality in recommendation could further amplify popularity bias. Concretely, we evaluate the performance of four state-of-the-art MRS algorithms (i.e., VBPR, MMGCN, GRCN, LATTICE) on three datasets from Amazon by assessing, along with recommendation accuracy metrics, performance measures accounting for the diversity of recommended items and the portion of retrieved niche items. To better investigate this aspect, we study the separate influence of each modality (i.e., visual and textual) on popularity bias in different evaluation dimensions. The results, which demonstrate how a single modality may amplify the negative effect of popularity bias, shed light on the importance of providing a more rigorous analysis of the performance of such models.
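As an example of the kind of measurement involved, the helper below computes the share of recommended items that fall in the long tail of the popularity distribution; the 20% short-head threshold is an assumption for illustration, not the metric set used in the paper.

```python
# Share of recommended items coming from the long tail; illustrative helper.
from collections import Counter
from typing import Dict, List

def long_tail_share(recommendations: Dict[int, List[int]],
                    interactions: List[int],
                    head_fraction: float = 0.2) -> float:
    """Fraction of recommended items that belong to the long tail."""
    popularity = Counter(interactions)                       # item -> #interactions
    ranked = [item for item, _ in popularity.most_common()]
    head = set(ranked[: int(len(ranked) * head_fraction)])   # short-head items
    recs = [item for items in recommendations.values() for item in items]
    return sum(item not in head for item in recs) / max(len(recs), 1)

# Example: two users; item 1 is very popular, items 7 and 9 are niche.
print(long_tail_share({0: [1, 7, 9], 1: [1, 2, 3]}, [1] * 10 + [2] * 5 + [3, 7, 9]))
```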