M4MM '22: Proceedings of the 1st International Workshop on Methodologies for Multimedia

SESSION: Keynote Talks

Tools for Collecting, Synchronizing, and Annotating Ecologically Valid Social Behavior In-the-Wild

  • Hayley Hung

So many multimedia systems are designed for the online world. We may be led to believe that the real world we now live in already exists online. When I talk about the online world, I am referring to the countless images and videos that are commonly scraped from the internet to generate many benchmark datasets. This 'finders keepers' attitude towards found online data has heralded a revolution in the development of multimodal computing systems, but it could be leading towards a dead end in the development of socially intelligent systems. The reality is that data containing social behavior is private; this growing realization has led to the withdrawal of major benchmarks. To develop truly socially intelligent systems, we desperately need ethically sourced, ecologically valid data. I argue that we need to return to the drawing board and reconsider our data gathering procedures to make meaningful progress. In this talk, I offer a different perspective on the first and perhaps fundamental step in developing socially intelligent multimodal systems. Using the ConfLab dataset and data collection concept as a case study, I will discuss the challenges of collecting such data. I will also present solutions for wireless multi-sensor synchronization, sensing (https://github.com/TUDelft-SPC-Lab/spcl_midge_hardware), and continuous annotation.

A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning

  • Simon Leglaive

High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower-dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE) [1, 2], which is equipped with both a generative and an inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended in many ways, including for dealing with data that are either multimodal [3] or dynamical (i.e., sequential) [4]. In this talk, we will present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities (e.g., the speaker's lip movements) from those that are specific to each modality (e.g., the speaker's pitch variation or eye movements). A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence (e.g., the speaker's identity or global emotional state). The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two steps. In the first step, a vector quantized VAE (VQ-VAE) [5] is learned independently for each modality, without temporal modeling. The second step consists of learning the MDVAE, whose inputs are the intermediate representations of the VQ-VAE before quantization. The disentanglement between static versus dynamical and modality-specific versus shared information occurs during this second training stage. Experimental results will be presented, showing which characteristics of the audiovisual speech data are encoded in the different latent spaces, how the proposed multimodal model can be beneficial compared with a unimodal one, and how the learned representation can be leveraged for downstream tasks.
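The described latent structure can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: it shows a static per-sequence latent variable alongside shared and modality-specific dynamical latents computed from per-modality features; module names and dimensions are hypothetical.

```python
# Minimal sketch (not the authors' code) of the latent structure described in
# the abstract: a static latent plus shared and modality-specific dynamical
# latents over pre-quantization per-modality VQ-VAE features.
import torch
import torch.nn as nn

class MDVAESketch(nn.Module):
    def __init__(self, feat_dim=64, z_shared=16, z_audio=8, z_visual=8, w_static=32):
        super().__init__()
        # Sequence encoders over per-modality features (stage-one VQ-VAE outputs)
        self.enc_shared = nn.GRU(2 * feat_dim, 2 * z_shared, batch_first=True)
        self.enc_audio = nn.GRU(feat_dim, 2 * z_audio, batch_first=True)
        self.enc_visual = nn.GRU(feat_dim, 2 * z_visual, batch_first=True)
        # Static latent summarizes the whole sequence (e.g., identity, global emotion)
        self.enc_static = nn.Linear(2 * feat_dim, 2 * w_static)
        # Decoders back to the pre-quantization feature spaces
        self.dec_audio = nn.Linear(z_shared + z_audio + w_static, feat_dim)
        self.dec_visual = nn.Linear(z_shared + z_visual + w_static, feat_dim)

    @staticmethod
    def reparam(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, feat_a, feat_v):                         # (B, T, feat_dim) each
        joint = torch.cat([feat_a, feat_v], dim=-1)
        w = self.reparam(self.enc_static(joint.mean(dim=1)))   # static, per sequence
        z_av = self.reparam(self.enc_shared(joint)[0])         # shared dynamics
        z_a = self.reparam(self.enc_audio(feat_a)[0])          # audio-specific dynamics
        z_v = self.reparam(self.enc_visual(feat_v)[0])         # visual-specific dynamics
        w_seq = w.unsqueeze(1).expand(-1, feat_a.size(1), -1)
        rec_a = self.dec_audio(torch.cat([z_av, z_a, w_seq], dim=-1))
        rec_v = self.dec_visual(torch.cat([z_av, z_v, w_seq], dim=-1))
        return rec_a, rec_v
```

In the two-step recipe from the abstract, `feat_a` and `feat_v` would be the stage-one VQ-VAE representations taken before quantization.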

SESSION: Workshop Presentations

HPFL: Federated Learning by Fusing Multiple Sensor Modalities with Heterogeneous Privacy Sensitivity Levels

  • Yuanjie Chen
  • Chih-Fan Hsu
  • Chung-Chi Tsai
  • Cheng-Hsin Hsu

Solving classification problems to understand multi-modality sensor data has become popular, but rich-media sensors, e.g., RGB cameras and microphones, are privacy-invasive. Though existing Federated Learning (FL) algorithms allow clients to keep their sensor data private, they suffer from degraded performance compared with centralized learning, particularly lower classification accuracy and longer training time. Because sensor data have diverse sensitivity levels, we propose a Heterogeneous Privacy Federated Learning (HPFL) paradigm that capitalizes on the information in privacy-insensitive data (such as mmWave point clouds) while keeping privacy-sensitive data (such as RGB images) private. We evaluate the HPFL paradigm on two representative classification problems: semantic segmentation and emotion recognition. Extensive experiments demonstrate that the HPFL paradigm outperforms: (i) the popular FedAvg by 18.20% in foreground accuracy (semantic segmentation) and 4.20% in F1-score (emotion recognition) under non-i.i.d. sample distributions and (ii) the state-of-the-art FL algorithms by 12.40%--17.70% in foreground accuracy and 2.54%--4.10% in F1-score.
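As a rough illustration of the stated idea (not the authors' algorithm), one federated round could keep the privacy-sensitive modality on the clients, sharing only model updates, while pooling the privacy-insensitive modality at the server. The helpers `local_train` and `server_train` below are hypothetical placeholders, and the fusion step is an assumption.

```python
# Hedged sketch of the general idea, not the HPFL algorithm itself: clients keep
# the privacy-sensitive modality (e.g., RGB) local and contribute only model
# weights trained on it, while the privacy-insensitive modality (e.g., mmWave
# point clouds) may be uploaded and used by the server.
import copy
import torch

def fedavg(state_dicts):
    """Element-wise average of client model state dicts (standard FedAvg)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def hpfl_round(global_model, clients, local_train, server_train, pooled_insensitive):
    local_states = []
    for client in clients:
        local_model = copy.deepcopy(global_model)
        # The sensitive modality never leaves the client; only weights are shared.
        local_train(local_model, client["sensitive_data"], client["labels"])
        local_states.append(local_model.state_dict())
        # The insensitive modality is not privacy-invasive and can be uploaded.
        pooled_insensitive.extend(client["insensitive_data"])
    global_model.load_state_dict(fedavg(local_states))
    # Server-side training pass over the pooled insensitive modality (assumed fusion step).
    server_train(global_model, pooled_insensitive)
    return global_model
```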

Playing Lottery Tickets in Style Transfer Models

  • Meihao Kong
  • Jing Huo
  • Wenbin Li
  • Jing Wu
  • Yu-Kun Lai
  • Yang Gao

Style transfer has achieved great success and attracted a wide range of attention from both academic and industrial communities due to its flexible application scenarios. However, the dependence on a large VGG-based autoencoder gives existing style transfer models high parameter complexity, which limits their applications on resource-constrained devices. Compared with many other tasks, the compression of style transfer models has been less explored. Recently, the lottery ticket hypothesis (LTH) has shown great potential in finding extremely sparse matching subnetworks which can achieve on par or even better performance than the original full networks when trained in isolation. In this work, we for the first time perform an empirical study to verify whether such trainable matching subnetworks also exist in style transfer models. Specifically, we take the two most popular style transfer models, i.e., AdaIN and SANet, as the main testbeds, which represent global and local transformation based style transfer methods, respectively. We carry out extensive experiments and comprehensive analysis, and draw the following conclusions. (1) Compared with fixing the VGG encoder, style transfer models can benefit more from training the whole network together. (2) Using iterative magnitude pruning, we find the matching subnetworks at 89.2% sparsity in AdaIN and 73.7% sparsity in SANet, which demonstrates that style transfer models can play lottery tickets too. (3) The feature transformation module should also be pruned to obtain a much sparser model without affecting the existence and quality of the matching subnetworks. (4) Besides AdaIN and SANet, other models such as LST, MANet, AdaAttN and MCCNet can also play lottery tickets, which shows that LTH can be generalized to various style transfer models.
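The iterative magnitude pruning mentioned in conclusion (2) follows the standard lottery-ticket recipe: train, prune the smallest-magnitude weights in each layer, rewind the survivors to their initial values, and repeat. The sketch below shows that generic loop; the paper's exact pruning rate, schedule, and rewinding point are not specified here, and `train_fn` is a hypothetical training routine that applies the masks during optimization.

```python
# Generic iterative magnitude pruning (IMP) loop as used in lottery-ticket
# studies; hedged sketch, not the paper's exact procedure.
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=10, prune_rate=0.2):
    init_state = copy.deepcopy(model.state_dict())        # weights to rewind to
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train_fn(model, masks)                            # train with masks applied
        # Prune the smallest-magnitude surviving weights in each layer.
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            alive = param.data[masks[name].bool()].abs()
            if alive.numel() == 0:
                continue
            k = max(1, int(prune_rate * alive.numel()))
            threshold = alive.kthvalue(k).values
            masks[name][param.data.abs() < threshold] = 0.0
        # Rewind surviving weights to their initial values (lottery ticket hypothesis).
        model.load_state_dict(init_state)
        for name, param in model.named_parameters():
            if name in masks:
                param.data.mul_(masks[name])
    return masks
```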

Optimal Tensor Bipartite Graph Learning

  • Haizhou Yang
  • Wenhui Zhao
  • Quanxue Gao
  • Xiangdong Zhang
  • Wei Xia

In this paper, we are concerned with a multi-view clustering framework based on bipartite graphs, and we propose an efficient multi-view clustering method, Optimal Tensor Bipartite Graph Learning (OTBGL). Our model is a novel tensorized bipartite-graph-based multi-view clustering method with a low tensor-rank constraint. Firstly, to remarkably reduce the computational complexity, we leverage the bipartite graphs of different views instead of the full similarity graphs of the corresponding views. Secondly, we measure the similarity between bipartite graphs of different views by minimizing the tensor Schatten p-norm as a tighter tensor rank approximation, and we explore the spatial low-rank structure embedded in intra-view graphs by minimizing the l1,2-norm of the learned graphs. Thirdly, we provide an efficient algorithm suitable for processing large-scale data. Extensive experimental results on six benchmark datasets indicate that our proposed OTBGL is superior to the state-of-the-art methods.
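For reference, a common t-SVD-based definition of the tensor Schatten p-norm from the literature is given below; this is a standard formulation stated as background, and the paper's exact definition (e.g., weighting of frontal slices) may differ.

```latex
% Common t-SVD-based definition of the tensor Schatten p-norm (0 < p <= 1),
% stated for reference; the paper's exact formulation may differ.
% \bar{\mathcal{A}}^{(k)} is the k-th frontal slice of the DFT of
% \mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3} along its third mode,
% and \sigma_i(\cdot) denotes singular values.
\left\| \mathcal{A} \right\|_{\mathrm{S}p}^{p}
  = \sum_{k=1}^{n_3} \left\| \bar{\mathcal{A}}^{(k)} \right\|_{\mathrm{S}p}^{p}
  = \sum_{k=1}^{n_3} \sum_{i=1}^{\min(n_1, n_2)}
      \sigma_i\!\left( \bar{\mathcal{A}}^{(k)} \right)^{p}
```

For p = 1 this reduces to the tensor nuclear norm, and smaller values of p give a tighter approximation to the tensor rank, which is what the abstract refers to as a tighter tensor rank approximation.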

Boosting Few-shot Learning by Self-calibration in Feature Space

  • Kaipeng Zheng
  • Liu Cheng
  • Jie Shen

Few-shot learning aims at adapting models to a novel task with extremely few labeled samples. Fine-tuning models pre-trained on a base dataset has recently been demonstrated to be an effective approach. However, a dilemma emerges as to whether to modify the parameters of the feature extractor: tuning a vast number of parameters based on only a handful of samples tends to induce overfitting, while fixing the parameters leads to inherent bias in the extracted features, since the novel classes are unseen by the pre-trained feature extractor. To alleviate this issue, we propose a novel reformulation of fine-tuning as calibrating the biased features of novel samples, conditioned on a fixed feature extractor, through an auxiliary network. Technically, a self-calibration framework is proposed to construct improved image-level features by progressively performing local alignment based on a self-supervised Transformer. Extensive experiments demonstrate that the proposed method vastly outperforms the state-of-the-art methods.
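A minimal sketch of the overall setup described above: a frozen pre-trained backbone whose local features are re-aligned by a small trainable calibration module. The module choice (a single Transformer encoder layer) and all dimensions are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: keep the pre-trained extractor frozen and fine-tune only a
# small auxiliary module that calibrates its (biased) local features.
import torch
import torch.nn as nn

class CalibratedFewShotModel(nn.Module):
    def __init__(self, backbone, feat_dim=640, n_heads=8):
        super().__init__()
        self.backbone = backbone                      # pre-trained, kept frozen
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Auxiliary Transformer layer that re-aligns local feature tokens
        # (illustrative stand-in for the paper's self-calibration network).
        self.calibrator = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)

    def forward(self, images):
        with torch.no_grad():
            fmap = self.backbone(images)              # assumed (B, C, H, W) feature map
        tokens = fmap.flatten(2).transpose(1, 2)      # (B, H*W, C) local feature tokens
        tokens = self.calibrator(tokens)              # calibrated local features
        return tokens.mean(dim=1)                     # image-level embedding

# During few-shot adaptation, only self.calibrator (a small fraction of the
# backbone's parameters) would be updated on the labeled support set.
```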