MAD '23: Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation

SESSION: Session 0: Keynote Talks

Controllable image generation and manipulation

  • Ioannis Patras

Recent years have witnessed an unprecedented interest in developing Deep Learning methodologies for the generation of images and image sequences that are hardly distinguishable from real ones to the human eye. A major issue in this field is how the generation can be easily controlled. In this talk we will focus on some of our recent works in generative models that are primarily aimed at controllable generation. We will first present unsupervised methods for learning non-linear paths in the latent spaces of Generative Adversarial Networks such that following different paths leads to different types of changes (e.g., removing the background, changing head poses, or facial expressions) in the resulting images [4]. Subsequently, we will present a method that allows local editing by finding a Parts and Appearances decomposition in the GAN latent space [2]. Then, we will present recent works on reenactment [1], where the goal is to transfer the facial activity (pose, expressions, speech) of a certain person to another one, and recent works in which supervision for generation comes from language models [3]. Finally, we will touch on the technical challenges ahead, as well as on the challenges that this technology creates in terms of spreading misinformation.

Multimedia Forensics versus disinformation in images and videos: lesson learnt and new challenges

  • Roberto Caldelli

From the very beginning, when photography first appeared, and even more so in the digital era, images and videos have been edited not only to improve the visual quality of what they represent but also to alter what was actually acquired and thereby misrepresent reality. This is often done in order to convey a different meaning to the viewer and ultimately mislead his/her opinion. On the other side, over the years, the need for ever more effective defensive instruments able to detect such alterations has increased. This has become particularly crucial with the advent of deep-learning-based techniques, which have made it possible to achieve realistic results rather easily, both in content manipulation (deepfakes) and in multimedia synthetic generation. This talk will provide a look at the evolution of the various kinds of manipulations, with a parallel focus on the diverse multimedia forensic techniques and approaches [1, 2]. An analysis will be carried out to understand how needs and solutions have evolved [3], in order to consolidate the lessons learnt and to identify future research challenges.

SESSION: Session 1: AI for Audio Analysis

Synthetic Speech Detection through Audio Folding

  • Davide Salvi
  • Paolo Bestagini
  • Stefano Tubaro

In the field of synthetic speech generation, recent advancements in deep learning and speech synthesis methods have enabled the creation of highly realistic fake speech tracks that are difficult to distinguish from real ones. Since the malicious use of these data can lead to dangerous consequences, the audio forensics community has focused on developing synthetic speech detectors to determine the authenticity of speech tracks. In this work we focus on the wide class of detectors that analyze audio streams on a frame-by-frame basis. We propose a technique to reduce the inference time of these detectors by relying on the fact that it is possible to mix multiple audio frames into a single one (i.e., in the same way a mono track is obtained from a stereo one). We test the proposed audio folding technique on speech tracks obtained from the ASVspoof 2019 dataset. The technique proves effective with both entirely and partially fake speech tracks and shows remarkable results, reducing processing time down to 25%.
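
As an illustration of the idea, the following minimal sketch folds groups of fixed-length frames into single frames by averaging, analogous to downmixing stereo to mono; the frame length, folding factor, and detector interface are assumptions made for this example, not the authors' exact implementation.

```python
# Minimal sketch of "audio folding": mix groups of frames into one frame so a
# frame-by-frame detector has fewer frames to process. Illustrative only.
import numpy as np

def fold_audio(frames: np.ndarray, fold_factor: int) -> np.ndarray:
    """Average groups of `fold_factor` frames into single frames,
    analogous to obtaining a mono track from a stereo one."""
    n_frames, frame_len = frames.shape
    usable = (n_frames // fold_factor) * fold_factor
    grouped = frames[:usable].reshape(-1, fold_factor, frame_len)
    return grouped.mean(axis=1)  # one mixed frame per group

# Toy usage: 1 second of 16 kHz audio split into 40 ms frames, folded 4x,
# so a hypothetical detector would process roughly a quarter of the frames.
sr, frame_len = 16000, 640
audio = np.random.randn(sr).astype(np.float32)
frames = audio[: (sr // frame_len) * frame_len].reshape(-1, frame_len)
folded = fold_audio(frames, fold_factor=4)
print(frames.shape, "->", folded.shape)
```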

SpoTNet: A spoofing-aware Transformer Network for Effective Synthetic Speech Detection

  • Awais Khan
  • Khalid Mahmood Malik

The prevalence of voice spoofing attacks in today’s digital world has become a critical security concern. Attackers employ various techniques, such as voice conversion (VC) and text-to-speech (TTS), to generate synthetic speech that imitates the victim’s voice and gains access to sensitive information. Recent advances in synthetic speech generation pose a significant threat to modern security systems, while traditional voice authentication methods are incapable of detecting them effectively. To address this issue, a novel solution for logical access (LA)-based synthetic speech detection is proposed in this paper. SpoTNet is an attention-based spoofing transformer network that combines crafted front-end spoofing features with deep attentive features retrieved using the developed logical spoofing transformer encoder (LSTE). The derived attentive features are then processed by the proposed multi-layer spoofing classifier to classify speech samples as bona fide or synthetic. In synthetic speech produced by TTS algorithms, the spectral characteristics of the speech are altered to match the target speaker’s formant frequencies, while in VC attacks, the temporal alignment of the speech segments is manipulated to preserve the target speaker’s prosodic features. Building on these observations, this paper targets prosodic and phonetic crafted features, i.e., the Mel-spectrogram, spectral contrast, and spectral envelope, presenting a preprocessing pipeline proven to be effective in synthetic speech detection. The proposed solution achieves state-of-the-art performance against eight recent feature fusion methods, with a lower EER of 0.95% on the ASVspoof-LA dataset, demonstrating its potential to advance the field of speaker identification and improve speaker recognition systems.
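
The following sketch illustrates the kind of crafted front-end features named in the abstract (Mel-spectrogram, spectral contrast, spectral envelope), using librosa and an LPC-based envelope approximation; the parameters and the envelope computation are illustrative assumptions rather than the SpoTNet preprocessing pipeline itself.

```python
# Sketch of the three crafted features named above; parameters are assumptions.
import numpy as np
import librosa
from scipy.signal import freqz

def crafted_features(y: np.ndarray, sr: int) -> dict:
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)   # Mel-spectrogram
    mel_db = librosa.power_to_db(mel)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)      # spectral contrast
    lpc = librosa.lpc(y, order=16)                                 # all-pole fit of the clip
    _, h = freqz(1.0, lpc, worN=257)                               # response of 1/A(z)
    envelope = 20 * np.log10(np.abs(h) + 1e-9)                     # coarse spectral envelope
    return {"mel": mel_db, "contrast": contrast, "envelope": envelope}

# Toy usage on a synthetic 1-second tone (a real detector would use speech clips).
sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
feats = crafted_features(y, sr)
print({k: v.shape for k, v in feats.items()})
```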

SESSION: Session 2: Improving AI Generalization

Autoencoder-based Data Augmentation for Deepfake Detection

  • Dan-Cristian Stanciu
  • Bogdan Ionescu

Image generation has seen huge leaps in the last few years. Less than 10 years ago we could not generate accurate images using deep learning at all, and now it is almost impossible for the average person to distinguish a real image from a generated one. Although image generation has some amazing use cases, it can also be used with ill intent. As an example, deepfakes have become more and more indistinguishable from real pictures, and that poses a real threat to society. It is important to be vigilant and active against deepfakes, to ensure that the spread of false information is kept under control. In this context, the need for good deepfake detectors feels more and more urgent. There is a constant battle between deepfake generators and deepfake detection algorithms, each one evolving at a rapid pace. But there is a big problem with deepfake detectors: they can only be trained on a limited number of data points and on images generated by specific architectures. Therefore, while we can detect deepfakes on certain datasets with near 100% accuracy, it is sometimes very hard to generalize and catch all real-world instances. Our proposed solution is a way to augment deepfake detection datasets using deep learning architectures, such as Autoencoders or U-Net. We show that augmenting deepfake detection datasets using deep learning improves generalization to other datasets. We test our algorithm using multiple architectures, with experimental validation carried out on state-of-the-art datasets such as CelebDF and DFDC Preview. The framework we propose can give flexibility to any model, helping it generalize to unseen datasets and manipulations.
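
The following sketch illustrates the augmentation idea: a small convolutional autoencoder reconstructs training images, and the slightly imperfect reconstructions are added back to the detection training set; the architecture and sizes are illustrative assumptions, not the networks used in the paper.

```python
# Minimal sketch: autoencoder reconstructions as extra training samples.
import torch
import torch.nn as nn

class SmallAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 256 -> 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 128 -> 64
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy usage: after training the autoencoder with a reconstruction loss (omitted
# here), reconstruct a batch and append the outputs, keeping the original labels.
ae = SmallAutoencoder().eval()
images = torch.rand(8, 3, 256, 256)           # stand-in for a batch of face crops
with torch.no_grad():
    augmented = ae(images)
train_batch = torch.cat([images, augmented])  # originals + autoencoded variants
print(train_batch.shape)
```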

Improving Synthetically Generated Image Detection in Cross-Concept Settings

  • Pantelis Dogoulis
  • Giorgos Kordopatis-Zilos
  • Ioannis Kompatsiaris
  • Symeon Papadopoulos

New advancements for the detection of synthetic images are critical for fighting disinformation, as the capabilities of generative AI models continuously evolve and can lead to hyper-realistic synthetic imagery at unprecedented scale and speed. In this paper, we focus on the challenge of generalizing across different concept classes, e.g., when training a detector on human faces and testing on synthetic animal images – highlighting the ineffectiveness of existing approaches that randomly sample generated images to train their models. By contrast, we propose an approach based on the premise that the robustness of the detector can be enhanced by training it on realistic synthetic images that are selected based on their quality scores according to a probabilistic quality estimation model. We demonstrate the effectiveness of the proposed approach by conducting experiments with generated images from two seminal architectures, StyleGAN2 and Latent Diffusion, and using three different concepts for each, so as to measure the cross-concept generalization ability. Our results show that our quality-based sampling method leads to higher detection performance for nearly all concepts, improving the overall effectiveness of the synthetic image detectors.
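
A minimal sketch of quality-based sampling is shown below: each generated image receives a score from a quality-estimation model and only the highest-scoring images are kept for detector training; the scoring function here is a hypothetical placeholder, not the paper's probabilistic quality model.

```python
# Sketch of selecting the most realistic synthetic images for detector training.
import numpy as np

def select_by_quality(images: list, score_image, keep_ratio: float = 0.5) -> list:
    """Keep the top `keep_ratio` fraction of images by quality score."""
    scores = np.array([score_image(img) for img in images])
    n_keep = max(1, int(len(images) * keep_ratio))
    keep_idx = np.argsort(scores)[::-1][:n_keep]   # highest scores first
    return [images[i] for i in keep_idx]

# Toy usage with random "images" and a dummy scorer standing in for a real
# quality-estimation model.
rng = np.random.default_rng(0)
fake_images = [rng.random((64, 64, 3)) for _ in range(10)]
dummy_scorer = lambda img: float(img.mean())
training_pool = select_by_quality(fake_images, dummy_scorer, keep_ratio=0.3)
print(len(training_pool))
```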

Synthetic Misinformers: Generating and Combating Multimodal Misinformation

  • Stefanos-Iordanis Papadopoulos
  • Christos Koutlis
  • Symeon Papadopoulos
  • Panagiotis Petrantonakis

With the expansion of social media and the increasing dissemination of multimedia content, the spread of misinformation has become a major concern. This necessitates effective strategies for multimodal misinformation detection (MMD) that detect whether the combination of an image and its accompanying text could mislead or misinform. Due to the data-intensive nature of deep neural networks and the labor-intensive process of manual annotation, researchers have been exploring various methods for automatically generating synthetic multimodal misinformation - which we refer to as Synthetic Misinformers - in order to train MMD models. However, limited evaluation on real-world misinformation and a lack of comparisons with other Synthetic Misinformers make it difficult to assess progress in the field. To address this, we perform a comparative study on existing and new Synthetic Misinformers involving (1) out-of-context (OOC) image-caption pairs, (2) cross-modal named entity inconsistency (NEI), and (3) hybrid approaches, and we evaluate them against real-world misinformation using the COSMOS benchmark. The comparative study showed that our proposed CLIP-based Named Entity Swapping can lead to MMD models that surpass other OOC and NEI Misinformers in terms of multimodal accuracy, and that hybrid approaches can lead to even higher detection accuracy. Nevertheless, after alleviating information leakage from the COSMOS evaluation protocol, low Sensitivity scores indicate that the task is significantly more challenging than previous studies suggested. Finally, our findings showed that NEI-based Synthetic Misinformers tend to suffer from a unimodal bias, where text-only models can outperform multimodal ones.
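
The following sketch illustrates named entity swapping for generating synthetic misinformation: a recognized entity in a caption is replaced by a different entity of the same type, yielding a caption that no longer matches its image; the candidate list and selection rule stand in for the CLIP-based ranking used in the paper, so this is an approximation of the idea rather than the authors' method.

```python
# Sketch of named-entity swapping; assumes the spaCy English model is installed
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def swap_named_entity(caption: str, candidates: dict) -> str:
    """Replace the first recognized entity with a candidate of the same label."""
    doc = nlp(caption)
    for ent in doc.ents:
        pool = [c for c in candidates.get(ent.label_, []) if c != ent.text]
        if pool:
            replacement = pool[0]  # the paper ranks candidates via CLIP embeddings
            return caption.replace(ent.text, replacement, 1)
    return caption  # no swappable entity found

# Toy usage with a hypothetical candidate pool.
candidates = {"PERSON": ["Angela Merkel", "Barack Obama"], "GPE": ["Paris", "Kyiv"]}
original = "Emmanuel Macron speaks at a climate summit in Berlin."
print(swap_named_entity(original, candidates))
```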

SESSION: Session 3: AI for (Dis-)Information Analysis

In the Spotlight: The Russian Government's Use of Official Twitter Accounts to Influence Discussions About its War in Ukraine

  • Benjamin Shultz

Russia's war in Ukraine has marked an inflection point for the future of the global order and democracy itself. Widely condemned for waging a war of aggression, the Russian government has used its official social media channels to spread disinformation as justification for the war. This study examines how the Russian government has used its official Twitter accounts to shape English-language conversations about the war in Ukraine. A total of 2,685 English-language tweets posted by 70 Russian government accounts between 1 September 2022 and 31 January 2023 were analyzed using BERTopic. Initial topic analysis shows that the Russian government portrayed itself as a noble world leader interested in peace and cooperation, while deflecting blame onto the “Kiev Regime” for starting the war. A semantic similarity analysis was then conducted to compare the narratives originating from Russian government Twitter accounts with 149,732 English-language tweets about the war in Ukraine, in order to estimate these narratives’ spread. Results show that a segment of general discussion tweets exhibits language strongly similar to that of Russian government tweets, but also highlight differences in the frequency and saliency of Russian government narratives. This work contributes one of the first analyses of disinformation about the war in Ukraine originating from official Russian government social media channels.
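
A minimal sketch of the two analysis steps, topic modeling with BERTopic followed by a semantic-similarity comparison using sentence embeddings, is given below; the file names and embedding model are assumptions for illustration, not the study's actual data or configuration.

```python
# Sketch: BERTopic on government tweets, then similarity to general tweets.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util

def load_tweets(path: str) -> list:
    """One tweet per line; the file names below are hypothetical placeholders."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

gov_tweets = load_tweets("russian_gov_tweets_en.txt")
general_tweets = load_tweets("general_ukraine_tweets_en.txt")

# Step 1: discover the narratives (topics) present in the government tweets.
topic_model = BERTopic(language="english")
topics, _ = topic_model.fit_transform(gov_tweets)
print(topic_model.get_topic_info().head())

# Step 2: measure how closely general discussion tweets echo those narratives.
encoder = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model
gov_emb = encoder.encode(gov_tweets, convert_to_tensor=True)
gen_emb = encoder.encode(general_tweets, convert_to_tensor=True)
similarity = util.cos_sim(gen_emb, gov_emb)             # rows: general, cols: gov
print(similarity.max(dim=1).values[:10])                # best match per general tweet
```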

Examining European Press Coverage of the Covid-19 No-Vax Movement: An NLP Framework

  • David Alonso del Barrio
  • Daniel Gatica-Perez

This paper examines how the European press dealt with the no-vax reactions against the Covid-19 vaccine and with the dis- and misinformation associated with this movement. Using a curated dataset of 1786 articles from 19 European newspapers on the anti-vaccine movement over a period of 22 months in 2020-2021, we applied Natural Language Processing techniques, including topic modeling, sentiment analysis, semantic relationship analysis with word embeddings, political analysis, named entity recognition, and semantic networks, to understand the specific role of the European traditional press in the disinformation ecosystem. The results of this multi-angle analysis demonstrate that the well-established European press actively opposed a variety of hoaxes mainly spread on social media, and was critical of the anti-vax trend, regardless of the political orientation of the newspaper. This confirms the relevance of studying the role of high-quality press in the disinformation ecosystem.
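
For illustration, the following sketch applies two of the listed components, named entity recognition and sentiment analysis, to a single newspaper paragraph; the models are generic defaults and are not claimed to be those used in the study.

```python
# Sketch of NER + sentiment on a news paragraph; assumes the spaCy English model
# is installed and transformers can download its default sentiment model.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")          # named entity recognition
sentiment = pipeline("sentiment-analysis")  # default English sentiment model

paragraph = ("Health authorities in France rejected the claims spread by "
             "anti-vaccine groups about the Covid-19 vaccine.")

entities = [(ent.text, ent.label_) for ent in nlp(paragraph).ents]
tone = sentiment(paragraph)[0]
print(entities)
print(tone["label"], round(tone["score"], 3))
```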