LGM3A '23: Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications

SESSION: Keynote Talks

Large Generative Models Meet Multimodal Video Intelligence

  • Mike Zheng Shou

In this talk, I will share my recent research on multimodal video intelligence in the era of large generative models. I will first discuss video-language pretraining techniques (All-in-one, EgoVLP) that use a single model to power various understanding tasks ranging from retrieval to QA. Then I will introduce the challenges of adapting these large pretrained models to a real-world application, the AI Assistant, and our efforts toward it (AssistQ, AssistGPT). Finally, I will delve into the reverse problem, i.e., given an open-world textual description, how to generate videos with diffusion models (Tune-A-Video, Show-1).

Unlocking Multimedia Capabilities of Gigantic Pretrained Language Models

  • Boyang Li

Benefitting from unprecedented computational power, massive data throughput, and superhuman memory, large language models (LLMs) are fundamentally transforming multimodal machine learning. An LLM can be analogized to an enormous treasure box guarded by a lock: it contains extensive knowledge, but it can be non-trivial to access and apply the appropriate knowledge to the problem at hand. Researchers have developed many techniques to unlock the capabilities of LLMs. Well-known examples include chain-of-thought prompting, "let's think step by step", and instruction tuning. In this talk, I will discuss techniques to unlock the capability of LLMs to process both visual and linguistic information. VisualGPT is one of the earliest works that finetunes an LLM for a vision-language task. InstructBLIP is an instruction-tuned large vision-language model that set a new state of the art on several vision-language tasks and secured top positions on several comprehensive evaluation suites. In addition, I will talk about how to unlock zero-shot capabilities without end-to-end finetuning, or any form of finetuning at all. In Plug-and-Play VQA and Img2LLM, we achieve excellent results on visual question-answering datasets by connecting existing pretrained models using natural language and model interpretations, demonstrating a feasible alternative to the mainstream finetuning approach. Finally, I will describe a new multimodal dataset, Synopses of Movie Narratives (SyMoN), for story understanding, which constitutes a new challenge for large vision-language models. I will argue that story understanding is an important objective in the pursuit of artificial general intelligence (AGI), because stories are a preeminent form of human communication and story understanding requires many AGI capabilities, such as cause-effect reasoning and theory of mind. Compared to other multimodal story datasets, the special advantages of SyMoN include (1) event descriptions at the right level of granularity, (2) abundant mental state descriptions, (3) the use of diverse storytelling techniques, and (4) the provision of easy-to-use automatic performance evaluation.

Multi-Modal Generative AI with Foundation Models

  • Ziwei Liu

Generating photorealistic and controllable visual content has been a long-standing goal of artificial intelligence (AI), with extensive real-world applications. It is also at the core of embodied intelligence. In this talk, I will discuss our work on AI-driven visual content generation of humans [1, 2], objects [3], and scenes [4], with an emphasis on combining the power of neural rendering with large multimodal foundation models [5]. Our generative AI framework has shown its effectiveness and generalizability on a wide range of tasks.

SESSION: Session 1: Paper Presentation

NeurSEG: A Segment Driven Deep Neural Model for Nested Named Entity Recognition

  • Zheng Wang
  • Fei Li
  • Cheng Long

Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). Apart from flat entities, nested entities also commonly exist in real-life textual data. However, current methods cannot handle nested structures effectively. In this paper, we propose a novel segment-driven modeling method (NeurSEG) for the nested NER problem, which can effectively extract entities from nested structures in complex nesting scenarios. The proposed NeurSEG model first finds the nested label of each word in a sentence and determines the positional relationships between neighbouring words, and then extracts the entities and predicts the corresponding entity types. In addition, we propose an augmented training method for further improving performance. We have conducted extensive experiments on both flat and nested NER benchmark datasets. The results show that NeurSEG achieves promising performance while retaining runtime efficiency for the nested NER task. Moreover, the model also achieves very competitive results compared with existing models for the flat NER task, demonstrating its capability for tackling both nested and flat NER.
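
To make the nesting setting concrete, the sketch below gives a generic span-enumeration view of nested NER, in which overlapping spans (e.g., a location inside an organization name) can both be entities. It is only an illustration of the task, not the NeurSEG architecture, and the toy gazetteer classifier is a placeholder.

```python
# Generic illustration of nested NER (not the NeurSEG model): enumerate candidate
# segments and label them, so overlapping spans can both be recognized.
from typing import List, Tuple

def enumerate_spans(tokens: List[str], max_len: int = 4) -> List[Tuple[int, int]]:
    """All candidate (start, end) segments up to max_len tokens."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append((start, end))
    return spans

def classify_span(tokens: List[str], span: Tuple[int, int]) -> str:
    """Placeholder classifier; a real model would score each segment."""
    text = " ".join(tokens[span[0]:span[1]])
    gazetteer = {"New York": "LOC", "New York University": "ORG"}
    return gazetteer.get(text, "O")

tokens = "She studies at New York University".split()
entities = []
for span in enumerate_spans(tokens):
    label = classify_span(tokens, span)
    if label != "O":
        entities.append((span, label))
print(entities)  # recovers both the nested "New York" (LOC) and "New York University" (ORG)
```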

SAT: Self-Attention Control for Diffusion Models Training

  • Jing Huang
  • Tianyi Zhang
  • Wei Shi

Recent text-to-image diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, a persistent challenge lies in generating detailed images, especially human-related images, which often exhibit distorted faces and eyes. Existing approaches to this issue either rely on more specific yet lengthy prompts or directly apply restoration tools to the generated image. In addition, a few studies have shown that attention maps can enhance diffusion models' stability by guiding intermediate samples during the inference process. In this paper, we propose a novel training strategy (SAT) to improve sample quality during the training process. As a straightforward starting point, we introduce blur guidance to refine intermediate samples, enabling diffusion models to produce higher-quality outputs with a moderate ratio of control. Building on this, SAT leverages the intermediate attention maps of diffusion models to further improve training sample quality. Specifically, SAT adversarially blurs only the regions that diffusion models attend to and guides them during the training process. We examine and compare both cross-attention mask control (CAC) and self-attention mask control (SAC) based on Stable Diffusion (SD) v1.5, and our results show that our method under SAC (i.e., SAT) improves the performance of Stable Diffusion.
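
A minimal sketch of the underlying blur-guidance idea is given below, assuming an attention map already upsampled to the sample's spatial resolution; the thresholding rule and kernel size are illustrative choices, not the authors' exact SAT procedure.

```python
# Minimal sketch: blur only the spatial regions the model attends to.
import torch
import torchvision.transforms.functional as TF

def blur_attended_regions(x: torch.Tensor, attn: torch.Tensor,
                          thresh: float = 0.5, kernel: int = 9) -> torch.Tensor:
    """
    x:    (B, C, H, W) intermediate training samples
    attn: (B, 1, H, W) attention map in [0, 1], upsampled to sample resolution
    Returns x with the attended regions replaced by their blurred counterpart.
    """
    mask = (attn > thresh).float()               # regions the model attends to
    blurred = TF.gaussian_blur(x, kernel_size=kernel)
    return mask * blurred + (1.0 - mask) * x     # blur only inside the mask

x = torch.randn(2, 3, 64, 64)
attn = torch.rand(2, 1, 64, 64)
x_guided = blur_attended_regions(x, attn)
```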

Multimodal Data Augmentation for Image Captioning using Diffusion Models

  • Changrong Xiao
  • Sean Xin Xu
  • Kunpeng Zhang

Image captioning, an important vision-language task, often requires a tremendous number of finely labeled image-caption pairs to learn the underlying alignment between images and texts. In this paper, we propose a multimodal data augmentation method that leverages a recent text-to-image model, Stable Diffusion, to expand the training set through high-quality generation of image-caption pairs. Extensive experiments on the MS COCO dataset demonstrate the advantages of our approach over several benchmark methods, with a particularly significant boost when fewer training instances are available. In addition, models trained on our augmented datasets outperform prior unpaired image captioning methods by a large margin. Finally, further improvements in training efficiency and effectiveness can be obtained by filtering the generated data based on quality assessment.
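
As a rough sketch of this kind of pipeline, the snippet below generates a synthetic image from an existing caption with a Stable Diffusion checkpoint and keeps the pair only if a CLIP similarity score passes a threshold. The checkpoint names and the 0.25 threshold are assumptions for illustration; the paper's actual generation and filtering settings may differ.

```python
# Sketch: caption-conditioned synthesis plus a CLIP-score quality filter
# (assumed Hugging Face checkpoints; not the paper's exact configuration).
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def augment(caption: str, min_score: float = 0.25):
    """Generate a synthetic image for a caption; keep the pair only if CLIP agrees."""
    image = sd(caption).images[0]
    inputs = proc(text=[caption], images=image, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = clip(**inputs)
        score = torch.cosine_similarity(out.image_embeds, out.text_embeds).item()
    return (image, caption) if score >= min_score else None

pair = augment("a man riding a horse on a beach")
```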

ImEW: A Framework for Editing Image in the Wild

  • Tasnim Mohiuddin
  • Tianyi Zhang
  • Maowen Nie
  • Jing Huang
  • Qianqian Chen
  • Wei Shi

The ability to edit images in a realistic and visually appealing manner is a fundamental requirement in various computer vision applications. In this paper, we present ImEW, a unified framework designed for solving image editing tasks. ImEW utilizes off-the-shelf foundation models to address four essential editing tasks: object removal, object translation, object replacement, and generative fill beyond the image frame. These tasks are accomplished by leveraging the capabilities of state-of-the-art foundation models, namely the Segment Anything Model, Grounding DINO, LaMa, and Stable Diffusion. These models have undergone extensive training on large-scale datasets and have exhibited exceptional performance in understanding image context, object manipulation, and texture synthesis. Through extensive experimentation, we demonstrate the effectiveness and versatility of ImEW in accomplishing image editing tasks across a wide range of real-world scenarios. The proposed framework opens up new possibilities for realistic and visually appealing image editing and enables diverse applications requiring sophisticated image modifications. Additionally, we discuss the limitations and outline potential directions for future research in the field of image editing using off-the-shelf foundation models, enabling continued advancements in this domain.
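
To illustrate just the final editing step of such a pipeline, the sketch below performs mask-guided inpainting with a Stable Diffusion inpainting checkpoint. In ImEW the mask would come from Grounding DINO and the Segment Anything Model; here it is abstracted away as a precomputed mask file, and the file names, prompt, and checkpoint are assumptions.

```python
# Sketch of mask-guided inpainting only (object removal/replacement); the
# Grounding DINO + SAM mask-generation steps are assumed to have run already.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting").to(device)

image = Image.open("scene.jpg").convert("RGB").resize((512, 512))
mask = Image.open("object_mask.png").convert("L").resize((512, 512))  # white = region to edit

# Object replacement: describe what should appear inside the masked region.
edited = pipe(prompt="an empty park bench", image=image, mask_image=mask).images[0]
edited.save("edited.png")
```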

CGSMP: Controllable Generative Summarization via Multimodal Prompt

  • Qian Yong
  • Jueqi Wei
  • YiRen Zhang
  • XiLun Zhang
  • Chao Wei
  • Simiao Chen
  • Yunhe Li
  • Cheng Ye
  • Bing Huang
  • Hao Wang

Natural Language Generation (NLG) has improved rapidly in recent years thanks to the development of large language models (LLMs). This advancement has resulted in more fluent and coherent generation, which has in turn benefited downstream tasks such as abstractive summarization. Despite the recent progress in LLMs, hallucination has become a serious problem in NLG: language models generate nonsensical or unfaithful text, which leads to severe problems with reliability and effectiveness. In this paper, we propose a novel approach called Controllable Generative Summarization via Multimodal Prompt (CGSMP), which uses entities extracted from the content, together with images, as multimodal prompt control signals, thereby reducing hallucination. Specifically, the proposed CGSMP consists of three main modules: (1) an image prefix module that obtains image representations; (2) a prompt encoder module that fuses entities and images into multimodal prompts; and (3) a pre-trained causal language model that fuses the input with the controllable prompt and serves as the backbone of the language model. Experimental results demonstrate that the proposed method significantly improves the quality of generated summaries compared to the state of the art.
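
The sketch below shows one common way such prefix-style multimodal prompting can be wired up: an image feature is projected into the language model's embedding space and prepended to the embeddings of an entity prompt before a causal LM forward pass. The GPT-2 backbone, the 512-dimensional image feature, and the prompt format are assumptions for illustration, not CGSMP's exact modules.

```python
# Sketch of prefix-style multimodal prompting with a causal LM backbone.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
d_model = lm.config.n_embd

image_prefix = nn.Linear(512, d_model)            # maps image features into LM embedding space
image_feat = torch.randn(1, 512)                  # stand-in for a CLIP-style image embedding
prefix_embed = image_prefix(image_feat).unsqueeze(1)           # (1, 1, d_model)

entity_prompt = "Entities: Alice; Paris. Summary:"             # illustrative prompt format
ids = tok(entity_prompt, return_tensors="pt").input_ids
text_embed = lm.transformer.wte(ids)                           # (1, T, d_model)

inputs_embeds = torch.cat([prefix_embed, text_embed], dim=1)   # image prefix + entity prompt
out = lm(inputs_embeds=inputs_embeds)
next_token = out.logits[:, -1].argmax(-1)                      # greedy next-token step
```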

Generating Multimodal Augmentations with LLMs from Song Metadata for Music Information Retrieval

  • Federico Rossetto
  • Jeffrey Dalton
  • Roderick Murray-Smith

In this work we propose a set of new automatic text augmentations that leverage Large Language Models and song metadata to improve music information retrieval tasks. Compared to recent works, our proposed methods rely on large language models and copyright-free corpora from web sources, enabling us to release the knowledge sources we collected. We show that combining these representations with the audio signal provides a 21% relative improvement on five of six datasets across genre classification, emotion recognition, and music tagging, achieving state-of-the-art results on three (GTZAN, FMA-Small, and Deezer). We demonstrate the benefit of injecting external knowledge sources by comparing them with intrinsic text representation methods that rely only on the sample's own information.
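
A generic late-fusion sketch of how such LLM-generated descriptions can be combined with the audio signal is shown below; the embedding dimensions, the concatenation-based fusion, and the genre-classification head are illustrative assumptions rather than the authors' architecture.

```python
# Generic late-fusion classifier: concatenate an audio embedding with a text
# embedding of the LLM-generated song description.
import torch
import torch.nn as nn

class LateFusionGenreClassifier(nn.Module):
    def __init__(self, audio_dim=512, text_dim=384, n_genres=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_genres),
        )

    def forward(self, audio_emb, text_emb):
        return self.head(torch.cat([audio_emb, text_emb], dim=-1))

model = LateFusionGenreClassifier()
audio_emb = torch.randn(8, 512)   # e.g., from an audio encoder
text_emb = torch.randn(8, 384)    # e.g., sentence embedding of the generated description
logits = model(audio_emb, text_emb)
```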

Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model

  • Mingliang Liang
  • Martha Larson

In this paper, we introduce Subsampling of frequent Words for Contrastive Language-Image Pre-training (SW-CLIP), a novel approach for training Vision-Language Models (VLMs). SW-CLIP takes the frequency-based subsampling of words previously proposed for training skip-gram models in natural language processing and applies it to the textual training data of VLMs. We report experiments demonstrating that frequency-based subsampling speeds up training and also delivers a substantial improvement in accuracy on a number of downstream zero-shot (i.e., transfer) classification tasks. We observe that the classification test sets on which SW-CLIP is particularly effective are those whose class labels occur infrequently as words in the training data, and thus have a high probability of being retained during frequency-based subsampling of the model training data. Overall, the advantages of SW-CLIP demonstrated in this paper serve to motivate further work on text subsampling for the training of VLMs. Our code and pre-trained weights are available at https://github.com/Anastasiais-ml/sw_clip.git
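
For reference, the skip-gram-style subsampling in question discards a word w with probability p = 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold. The sketch below applies it to caption text; the threshold in the demo call is inflated for the toy corpus, and SW-CLIP's exact variant and hyperparameters may differ.

```python
# Sketch of frequency-based subsampling of frequent words applied to captions,
# following the skip-gram discard probability p = 1 - sqrt(t / f(w)).
import random
from collections import Counter

def subsample_captions(captions, t=1e-4, seed=0):
    rng = random.Random(seed)
    tokens = [w for c in captions for w in c.lower().split()]
    counts = Counter(tokens)
    total = len(tokens)
    freq = {w: n / total for w, n in counts.items()}

    def keep(w):
        p_discard = max(0.0, 1.0 - (t / freq[w]) ** 0.5)
        return rng.random() > p_discard

    return [" ".join(w for w in c.lower().split() if keep(w)) for c in captions]

demo = subsample_captions(
    ["a photo of a dog", "a photo of a rare axolotl"],
    t=0.1,  # inflated for this toy corpus; large corpora typically use ~1e-5 to 1e-4
)
print(demo)  # frequent words ("a", "photo", "of") are often dropped; rare words survive
```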

Fashion-GPT: Integrating LLMs with Fashion Retrieval System

  • Qianqian Chen
  • Tianyi Zhang
  • Maowen Nie
  • Zheng Wang
  • Shihao Xu
  • Wei Shi
  • Zhao Cao

Although customers on a fashion e-commerce platform express their clothing preferences through combined imagery and textual information, they are limited to retrieval with single-round, fixed inputs. At the same time, large language models (LLMs) have been gaining attention across various fields. ChatGPT is a remarkable example of an LLM, known for its user-friendly language interface, impressive conversational proficiency, and reasoning abilities. To this end, we propose Fashion-GPT, a system paradigm that integrates ChatGPT with a pool of AI models in the fashion domain to achieve multi-round, multi-modal search. Specifically, the system uses the LLM to understand user queries, selects retrieval models based on their function descriptions, executes each subtask with the selected fashion models, and leverages the LLM to summarize the execution results into a response.
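
A highly simplified sketch of this orchestration loop follows. The `chat()` helper is hypothetical and stands in for a ChatGPT API call, and the tool pool, descriptions, and prompts are invented for illustration; the real Fashion-GPT prompt design and model pool are not shown here.

```python
# Sketch of LLM-driven tool selection and result summarization (hypothetical
# `chat()` placeholder; fashion retrieval models exposed as plain callables).
from typing import Callable, Dict

def chat(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual ChatGPT API request."""
    raise NotImplementedError

TOOLS: Dict[str, Callable[[str], str]] = {
    "text_to_image_search":   lambda q: "top-k garments matching the text query",
    "image_plus_text_search": lambda q: "top-k garments matching image + modification text",
}

def answer(user_query: str) -> str:
    # 1. Ask the LLM which retrieval model fits, given each tool's name/description.
    tool_name = chat(
        f"Tools: {list(TOOLS)}. Pick the single best tool for: {user_query}. "
        "Reply with the tool name only."
    ).strip()
    # 2. Execute the selected fashion model on the subtask.
    results = TOOLS[tool_name](user_query)
    # 3. Let the LLM summarise the execution results for the user.
    return chat(f"Summarise these retrieval results for the user: {results}")
```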

To further boost the performance contributed by the fashion AI experts, we also introduce a novel pre-training framework called 3M (short for Multi-view Multi-modal Matching). In particular, unlike prior studies that rely solely on one-to-one matching of image-text pairs, 3M incorporates multiple texts describing the same image to achieve one-to-many alignment. Maximizing the mutual information between features extracted from these views helps capture high-level factors that influence multiple views, such as the occurrence of specific objects. In addition, owing to the characteristics of fashion data, multi-view images of the same product, such as front and side views, are naturally suited to intra-modal self-alignment. Therefore, 3M also introduces an intra-modal contrastive objective that provides additional benefits for representation learning from the image perspective. To the best of our knowledge, our framework is the first to consider one-to-many mapping for multi-modality representation learning. Experimental evaluations demonstrate that our fashion experts are competitive and achieve state-of-the-art performance, bringing a +3.47% R@10 boost on Fashion-200K and a +1.98% R@10 boost on the Fashion-IQ dress dataset compared to the previous SOTA results.
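
A generic sketch of such a one-to-many image-text contrastive objective is given below, where each image is paired with several captions; the symmetric InfoNCE-style formulation and the temperature value are assumptions, and the actual 3M loss may be formulated differently.

```python
# Generic one-to-many image-text contrastive loss (each image has several captions).
import torch
import torch.nn.functional as F

def multi_positive_contrastive(img_emb, txt_emb, txt_to_img, temperature=0.07):
    """
    img_emb:    (N, d) L2-normalised image embeddings
    txt_emb:    (M, d) L2-normalised text embeddings, M >= N
    txt_to_img: (M,) index of the image each caption describes
    """
    logits = txt_emb @ img_emb.t() / temperature          # (M, N) similarities
    loss_t2i = F.cross_entropy(logits, txt_to_img)        # each caption -> its image
    # image -> text: average log-probability over the captions belonging to each image
    pos_mask = F.one_hot(txt_to_img, num_classes=img_emb.size(0)).t().float()  # (N, M)
    log_probs = F.log_softmax(logits.t(), dim=-1)         # (N, M)
    loss_i2t = -(pos_mask * log_probs).sum(1) / pos_mask.sum(1)
    return 0.5 * (loss_t2i + loss_i2t.mean())

img = F.normalize(torch.randn(4, 256), dim=-1)
txt = F.normalize(torch.randn(12, 256), dim=-1)           # three captions per image
loss = multi_positive_contrastive(img, txt, torch.arange(12) // 3)
```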