ADGD '21: Proceedings of the 1st Workshop on Synthetic Multimedia - Audiovisual Deepfake Generation and Detection


SESSION: Keynote Talk

Session details: Keynote Talk

  • Abhinav Dhall

Fighting AI-synthesized Fake Media

  • Siwei Lyu

Recent years have witnessed an unexpected and astonishing rise of AI-synthesized fake
media (GAN-synthesized faces, face-swap videos, and style-transferred audio, commonly
known as DeepFakes), thanks to the rapid advancement of technology and the omnipresence
of social media. Together with other forms of online disinformation, AI-synthesized
fake media are eroding our trust in online information and have already caused real
damage. It is thus important to develop countermeasures to limit the negative impacts
of AI-synthesized fake media. In this presentation, I will first provide a high-level
overview of AI-synthesized fake media and their potential negative social impacts.
I will then highlight recent technical developments to fight AI-synthesized fake media,
including signal-level methods [1] and physical/physiological-based methods [2,3],
as well as our efforts in building large datasets [4] and open platforms for DeepFake
detection [5]. I will also point out some current challenges and discuss the future
of AI-synthesized fake media and their counter-technology.

Representations for Content Creation, Manipulation and Animation

  • Sergey Tulyakov

"What I cannot create, I do not understand," said the famous writing on Dr. Feynman's
blackboard. The ability to create or to change objects requires us to understand their
structure and factors of variation. For example, to draw a face an artist is required
to know its composition and have a good command of drawing skills (the latter is particularly
challenging for the presenter). Animation additionally requires the knowledge of rigid
and non-rigid motion patterns of the object. This talk shows that generation, manipulation
and animation skills of deep generative models substantially benefit from such understanding.
Moreover, we see that the better the models can explain the data they see during training,
the higher the quality of the content they are able to generate. Understanding and generation
form a loop in which improved understanding improves generation, which in turn improves
understanding even more. To show this, I detail our work in three areas: video synthesis
and prediction, image animation by motion retargeting, and a new direction in video
generation that allows the user to play videos as they are generated. In each
of these works, the internal representation was designed to facilitate better understanding
of the task, resulting in improved generation abilities. Without a single labeled
example, our models are able to understand factors of variation, object parts, their
shapes, their motion patterns and perform creative manipulations previously only available
to trained professionals equipped with specialized software and hardware.

"Deepfake" Portrait Image Generation

  • Jianfei Cai

With the prevalence of deep learning technology, especially generative adversarial
networks (GANs), generating photo-realistic facial images has made huge progress. Image
generation techniques have many good applications, such as data augmentation, entertainment,
and augmented/virtual reality, as well as bad applications like Deepfake, which has caused
huge concern in society. In this talk, we mainly review general image generation
techniques, particularly describing a few of our recent works on high-quality portrait/facial
image generation. Our work can be divided into 3D-based approaches and GAN-based 2D
approaches. For the 3D-based approaches, we will explain how to model facial geometry,
reflectance, and lighting explicitly, and then show how such 3D modelling knowledge
can be used in portrait manipulation. For the 2D GAN-based approaches, we will present
a framework for pluralistic facial image generation from a masked facial input, whereas
all previous approaches only aim to produce one output. At the end, some suggestions
will be provided for detecting deepfakes from a generation point of view.

SESSION: Session 1: Deepfake Detection

Session details: Session 1: Deepfake Detection

  • Pavel Korshunov

Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors

  • Hasam Khalid
  • Minha Kim
  • Shahroz Tariq
  • Simon S. Woo

Significant advancements made in the generation of deepfakes have caused security
and privacy issues. Attackers can easily impersonate a person's identity in an image
by replacing their face with the target person's face. Moreover, a new domain of cloning
human voices using deep-learning technologies is also emerging. Now, an attacker can
generate realistic cloned voices of humans using only a few seconds of audio of the
target person. With the emerging threat of potential harm deepfakes can cause, researchers
have proposed deepfake detection methods. However, they only focus on detecting a
single modality, i.e., either video or audio. On the other hand, to develop a good
deepfake detector that can cope with the recent advancements in deepfake generation,
we need a detector that can detect deepfakes across multiple modalities, i.e., both
video and audio. To build such a detector, we need a dataset that contains video deepfakes
together with their corresponding audio deepfakes. We use a recent multimodal deepfake
dataset, the Audio-Video Multimodal Deepfake Detection Dataset (FakeAVCeleb), which
contains not only deepfake videos but synthesized fake audio as well. We performed
detailed baseline experiments on this dataset using state-of-the-art unimodal,
ensemble-based, and multimodal detection methods. We conclude through detailed
experimentation that unimodal methods, which address only a single modality (video or
audio), do not perform well compared to ensemble-based methods, whereas purely
multimodal baselines provide the worst performance.

DmyT: Dummy Triplet Loss for Deepfake Detection

  • Nicolas Beuve
  • Wassim Hamidouche
  • Olivier Deforges

Recent progress in deep learning-based image generation has made it easier to create
convincing fake videos called deepfakes. While the benefits of such technology are
undeniable, it can also be used as realistic fake news support for mass disinformation.
In this context, different detectors were proposed; many of them use a CNN as a backbone
model and the binary cross-entropy as a loss function. Some more recent approaches
applied a triplet loss with semi-hard triplets. In this paper, we investigate the use
of triplet loss with fixed positive and negative vectors as a replacement for semi-hard
triplets. This loss, called dummy triplet loss (DmyT), follows the concept of the triplet
loss but requires less computation, as the triplets are fixed. It also doesn't rely
on a linear classifier for prediction. We have assessed the performance of the proposed
loss with four backbone networks, including two of the most popular CNNs in the Deepfake
realm, Xception and EfficientNet, alongside two visual transformer networks, ViT and CaiT.
Our loss function shows competitive results on the FaceForensics++ dataset compared to
triplet loss with semi-hard triplets while being less computationally intensive. The
source code of DmyT is available at
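To illustrate the idea in the abstract, a minimal NumPy sketch of a triplet-style loss with fixed ("dummy") anchor vectors follows. This is not the paper's implementation: the anchor vectors, function names, and margin are assumptions. Each embedding is pulled toward the fixed anchor of its own class and pushed away from the other class's anchor, and prediction reduces to nearest-anchor lookup, with no linear classifier and no triplet mining.

```python
import numpy as np

def dummy_triplet_loss(embeddings, labels, real_anchor, fake_anchor, margin=1.0):
    """Hinged triplet loss against fixed dummy anchors (illustrative sketch).

    Instead of mining semi-hard triplets within a batch, every embedding uses
    the fixed anchor of its own class as the positive and the other class's
    anchor as the negative, so the triplets never change between batches.
    """
    losses = []
    for emb, label in zip(embeddings, labels):
        pos = real_anchor if label == 1 else fake_anchor  # anchor of own class
        neg = fake_anchor if label == 1 else real_anchor  # anchor of other class
        d_pos = np.linalg.norm(emb - pos)
        d_neg = np.linalg.norm(emb - neg)
        losses.append(max(d_pos - d_neg + margin, 0.0))   # standard triplet hinge
    return float(np.mean(losses))

def predict(embedding, real_anchor, fake_anchor):
    """Classify by nearest dummy anchor -- no linear classifier needed."""
    d_real = np.linalg.norm(embedding - real_anchor)
    d_fake = np.linalg.norm(embedding - fake_anchor)
    return 1 if d_real < d_fake else 0
```

Because the anchors are constants, the loss needs only one distance pair per sample, which is where the computational saving over semi-hard mining comes from.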

SESSION: Session 2: Deepfake Generation

Session details: Session 2: Deepfake Generation

  • Weiling Chen

Invertable Frowns: Video-to-Video Facial Emotion Translation

  • Ian Magnusson
  • Aruna Sankaranarayanan
  • Andrew Lippman

We present Wav2Lip-Emotion, a video-to-video translation architecture that modifies
facial expressions of emotion in videos of speakers. Previous work modifies emotion
in images, uses a single image to produce a video with animated emotion, or puppets
facial expressions in videos with landmarks from a reference video. However, many
use cases such as modifying an actor's performance in post-production, coaching individuals
to be more animated speakers, or touching up emotion in a teleconference require a
video-to-video translation approach. We explore a method to maintain speakers' identity
and pose while translating their expressed emotion. Our approach extends an existing
multi-modal lip synchronization architecture to modify the speaker's emotion using
L1 reconstruction and pre-trained emotion objectives. We also propose a novel automated
emotion evaluation approach and corroborate it with a user study. These show that
we succeed in modifying emotion while maintaining lip synchronization. Visual quality
is somewhat diminished, with a trade-off across model variants between stronger emotion
modification and visual quality. Nevertheless, we demonstrate (1) that facial
expressions of emotion can be modified with nothing other than L1 reconstruction and
pre-trained emotion objectives and (2) that our automated emotion evaluation approach
aligns with human judgements.
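The abstract's claim (1) rests on combining just two training signals: an L1 reconstruction term and a loss from a pre-trained, frozen emotion classifier. As a hedged sketch of how such a combined objective is typically assembled (the function names, weights, and the stand-in for the classifier output are assumptions, not the paper's code):

```python
import numpy as np

def combined_objective(generated, target, emotion_probs, target_emotion,
                       w_rec=1.0, w_emo=0.5):
    """Weighted sum of an L1 reconstruction term and an emotion term (sketch).

    `generated` and `target` are frame tensors; `emotion_probs` stands in for
    the softmax output of a pre-trained, frozen emotion classifier applied to
    the generated frames. The weights w_rec and w_emo are illustrative.
    """
    # L1 reconstruction: keep identity, pose, and lip sync close to the target.
    l1_loss = np.mean(np.abs(generated - target))
    # Emotion objective: cross-entropy pushing frames toward the target emotion.
    emotion_loss = -np.log(emotion_probs[target_emotion] + 1e-8)
    return w_rec * l1_loss + w_emo * emotion_loss
```

The trade-off reported in the abstract corresponds to the balance between the two weights: raising the emotion weight strengthens expression modification at some cost to reconstruction fidelity.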