MMAsia '21: ACM Multimedia Asia

SESSION: Full Papers

Semantic Enhanced Cross-modal GAN for Zero-shot Learning

  • Haotian Sun
  • Jiwei Wei
  • Yang Yang
  • Xing Xu

The goal of Zero-shot Learning (ZSL) is to recognize categories that are not seen during the training process. The traditional method is to learn an embedding space and map visual features and semantic features to this common space. However, this method inevitably encounters the bias problem, i.e., unseen instances are often incorrectly recognized as seen classes. Some attempts address this with another paradigm, which uses generative models to hallucinate the features of unseen samples. However, generative models often suffer from instability, making it impractical for them to generate fine-grained features of unseen samples and thus resulting in very limited improvement. To resolve this, a Semantic Enhanced Cross-modal GAN (SECM GAN) is proposed, which imposes a cross-modal association to improve the semantic and discriminative properties of the generated features. Specifically, we first train a cross-modal embedding model called the Semantic Enhanced Cross-modal Model (SECM), which is constrained by discrimination and semantics. Then we train our generative model based on a Generative Adversarial Network (GAN), called SECM GAN, in which the generator generates cross-modal features and the discriminator distinguishes true cross-modal features from generated ones. We deploy SECM as a weak constraint of the GAN, which reduces reliance on the GAN. We conduct extensive experiments on three widely used ZSL datasets to demonstrate the superiority of our framework.
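
As a rough illustration of the feature-hallucination paradigm discussed above, the following sketch shows a conditional generator that maps class semantics plus noise to synthetic visual features, and a discriminator that judges feature-semantics pairs. This is a hypothetical minimal setup, not the authors' SECM GAN; all layer sizes, dimensions and names are assumptions.

import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Hallucinate visual features for a class from its semantics plus noise."""
    def __init__(self, sem_dim=85, noise_dim=64, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim), nn.ReLU())

    def forward(self, sem, noise):
        return self.net(torch.cat([sem, noise], dim=1))

class FeatureDiscriminator(nn.Module):
    """Score (visual feature, class semantics) pairs as real or generated."""
    def __init__(self, sem_dim=85, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + sem_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1))

    def forward(self, feat, sem):
        return self.net(torch.cat([feat, sem], dim=1))

Once such a generator is trained on seen classes, features for unseen classes can be sampled from their attribute vectors and used to train an ordinary classifier, which is the usual way this paradigm sidesteps the bias problem.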

Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos

  • Hehe Fan
  • Mohan Kankanhalli

Motion, according to its definition in physics, is the change in position with respect to time, regardless of the specific moving object and background. In this paper, we aim to learn appearance-independent motion representation in an unsupervised manner. The main idea is to separate motion from videos while leaving objects and background as content. Specifically, we design an encoder-decoder model which consists of a content encoder, a motion encoder and a video generator. To train the model, we leverage a one-step cycle-consistency in reconstruction within the same video and a two-step cycle-consistency in generation across different videos as self-supervised signals, and use adversarial training to remove the content representation from the motion representation. We demonstrate that the proposed framework can be used for conditional video generation and fine-grained action recognition.
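
A minimal sketch of how the two self-supervised signals described above could be wired together, assuming the content encoder, motion encoder and video generator are given as modules; this is one plausible reading of the abstract rather than the authors' implementation.

import torch.nn.functional as F

def one_step_cycle_loss(content_enc, motion_enc, generator, video):
    """Reconstruct a video from its own content and motion codes."""
    recon = generator(content_enc(video), motion_enc(video))
    return F.l1_loss(recon, video)

def two_step_cycle_loss(content_enc, motion_enc, generator, video_a, video_b):
    """Transfer video_b's motion onto video_a's content, then recover video_a."""
    fake = generator(content_enc(video_a), motion_enc(video_b))
    back = generator(content_enc(fake), motion_enc(video_a))
    return F.l1_loss(back, video_a)

An adversarial critic on the motion code (omitted here) would additionally penalize content information leaking into the motion representation.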

Towards Discriminative Visual Search via Semantically Cycle-consistent Hashing Networks

  • Zheng Zhang
  • Jianning Wang
  • Guangming Lu

Deep hashing has shown great potential in large-scale visual similarity search due to its preferable storage and computation efficiency. Typically, deep hashing encodes visual features into compact binary codes by preserving representative semantic visual features. Works in this area mainly focus on building the relationship between the visual and objective hash spaces, while they seldom study the triadic cross-domain semantic knowledge transfer among the visual, semantic and hashing spaces, leading to a serious semantic ignorance problem during space transformation. In this paper, we propose a novel deep tripartite semantically interactive hashing framework, dubbed Semantically Cycle-consistent Hashing Networks (SCHN), for discriminative hash code learning. Particularly, we construct a flexible semantic space and a transitive latent space, in conjunction with the visual space, to jointly deduce the privileged discriminative hash space. Specifically, the semantic space is conceived to strengthen the flexibility and completeness of categories in feature inference. Moreover, the transitive latent space is formulated to explore the shared semantic interactivity embedded in visual and semantic features. Our SCHN, for the first time, establishes a cyclic principle of deep semantic-preserving hashing by adaptive semantic parsing across different spaces in visual similarity search. In addition, the entire learning framework is jointly optimized in an end-to-end manner. Extensive experiments performed on diverse large-scale datasets evidence the superiority of our method against other state-of-the-art deep hashing algorithms.

Source-Style Transferred Mean Teacher for Source-data Free Object Detection

  • Dan Zhang
  • Mao Ye
  • Lin Xiong
  • Shuaifeng Li
  • Xue Li

Unsupervised cross-domain object detection transfers a detection model trained on a source domain to a target domain whose data distribution differs from the source. Conventional domain adaptation detection protocols need source domain data during adaptation. However, for reasons such as data security, privacy and storage, the source data cannot be accessed in many practical applications. In this paper, we focus on source-data free domain adaptive object detection, which uses the pre-trained source model instead of the source data for cross-domain adaptation. Due to the lack of source data, we cannot directly align the distributions of the two domains. To address this, we propose the Source-style transferred Mean Teacher (SMT) for source-data free object detection. The batch normalization layers in the pre-trained model contain the style information and data distribution of the unobserved source data. We therefore use the batch normalization information from the pre-trained source model to transfer target domain features to source-like style features, making full use of the knowledge in the pre-trained source model. Meanwhile, we use the consistency regularization of the Mean Teacher to further distill knowledge from the source domain to the target domain. Furthermore, we find that adding perturbations associated with the target domain distribution increases the model's robustness to domain-specific information, making the learned model generalize better to the target domain. Experiments on multiple domain adaptation object detection benchmarks verify that our method achieves state-of-the-art performance.
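
A minimal sketch of re-normalizing target-domain feature maps with the frozen batch normalization statistics of the pre-trained source model, which is one way to read the "source-like style" transfer described above; the function name and the use of a single BN layer are illustrative assumptions.

import torch
import torch.nn as nn

def to_source_style(feat, source_bn: nn.BatchNorm2d, eps=1e-5):
    """Re-normalize target features (N, C, H, W) with source BN statistics.

    source_bn is a frozen BN layer taken from the pre-trained source model;
    its running mean/variance summarize the style of the unseen source data.
    """
    # whiten with the current target batch statistics
    t_mean = feat.mean(dim=(0, 2, 3), keepdim=True)
    t_var = feat.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    normed = (feat - t_mean) / torch.sqrt(t_var + eps)
    # re-color with the stored source statistics
    s_mean = source_bn.running_mean.view(1, -1, 1, 1)
    s_var = source_bn.running_var.view(1, -1, 1, 1)
    return normed * torch.sqrt(s_var + eps) + s_mean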

S2TD: A Tree-Structured Decoder for Image Paragraph Captioning

  • Yihui Shi
  • Yun Liu
  • Fangxiang Feng
  • Ruifan Li
  • Zhanyu Ma
  • Xiaojie Wang

Image paragraph captioning, a task to generate a paragraph description for a given image, usually requires mining and organizing linguistic counterparts from abundant visual clues. Limited by the sequential decoding perspective, previous methods have difficulty in organizing the visual clues holistically or capturing the structural nature of linguistic descriptions. In this paper, we propose a novel tree-structured visual paragraph decoder network, called Splitting to Tree Decoder (S2TD), to address this problem. The key idea is to model the paragraph decoding process as a top-down binary tree expansion. S2TD consists of three modules: a split module, a score module, and a word-level RNN. The split module iteratively splits ancestral visual representations into two parts through a gating mechanism. To determine the tree topology, the score module uses cosine similarity to evaluate the node splitting. A novel tree structure loss is proposed to enable end-to-end learning. After the tree expansion, the word-level RNN decodes leaf nodes into sentences that form a coherent paragraph. Extensive experiments are conducted on the Stanford benchmark dataset. The experimental results show the promising performance of our proposed S2TD.

Latent Pattern Sensing: Deepfake Video Detection via Predictive Representation Learning

  • Shiming Ge
  • Fanzhao Lin
  • Chenyu Li
  • Daichi Zhang
  • Jiyong Tan
  • Weiping Wang
  • Dan Zeng

Increasingly advanced deepfake approaches have made the detection of deepfake videos very challenging. We observe that deepfake videos often exhibit appearance-level temporal inconsistencies in some facial components between frames, resulting in discriminable spatiotemporal latent patterns among semantic-level feature maps. Inspired by this finding, we propose a predictive representation learning approach termed Latent Pattern Sensing to capture these semantic change characteristics for deepfake video detection. The approach cascades a CNN-based encoder, a ConvGRU-based aggregator and a single-layer binary classifier. The encoder and aggregator are pre-trained in a self-supervised manner to form representative spatiotemporal context features. Finally, the classifier is trained to classify the context features, distinguishing fake videos from real ones. In this manner, the extracted features can simultaneously describe the latent patterns of videos across frames spatially and temporally in a unified way, leading to an effective deepfake video detector. Extensive experiments prove our approach's effectiveness, e.g., surpassing 10 state-of-the-art methods by at least 7.92% AUC on the challenging Celeb-DF(v2) benchmark.
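
The cascade named in the abstract (CNN encoder, ConvGRU aggregator, single-layer classifier) can be sketched as below. This is a hypothetical minimal instantiation with assumed channel sizes, not the authors' network; the self-supervised pre-training stage is not shown.

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """A standard convolutional GRU cell used here as the aggregator."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], 1))).chunk(2, 1)
        h_new = torch.tanh(self.conv_h(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new

class LatentPatternDetector(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(                    # frame-level CNN encoder
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.aggregator = ConvGRUCell(ch, ch)
        self.classifier = nn.Linear(ch, 1)               # single-layer real/fake head

    def forward(self, clip):                             # clip: (B, T, 3, H, W)
        h = None
        for t in range(clip.shape[1]):
            f = self.encoder(clip[:, t])
            h = torch.zeros_like(f) if h is None else h
            h = self.aggregator(f, h)                    # spatiotemporal context
        return self.classifier(h.mean(dim=(2, 3)))       # logit: fake vs. real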

Improving Camouflaged Object Detection with the Uncertainty of Pseudo-edge Labels

  • Nobukatsu Kajiura
  • Hong Liu
  • Shin'ichi Satoh

This paper focuses on camouflaged object detection (COD), which is a task to detect objects hidden in the background. Most current COD models aim to highlight the target object directly while outputting ambiguous camouflaged boundaries. On the other hand, the performance of models that consider edge information is not yet satisfactory. To this end, we propose a new framework that makes full use of multiple visual cues, i.e., saliency as well as edges, to refine the predicted camouflaged map. This framework consists of three key components, i.e., a pseudo-edge generator, a pseudo-map generator, and an uncertainty-aware refinement module. In particular, the pseudo-edge generator estimates the object boundary and outputs the pseudo-edge label, while a conventional COD method serves as the pseudo-map generator that outputs the pseudo-map label. Then, we propose an uncertainty-based module to reduce the uncertainty and noise of these two pseudo labels; it takes both pseudo labels as input and outputs an edge-accurate camouflaged map. Experiments on various COD datasets demonstrate the effectiveness of our method, with superior performance to existing state-of-the-art methods.

Blindly Predict Image and Video Quality in the Wild

  • Jiapeng Tang
  • Yi Fang
  • Yu Dong
  • Rong Xie
  • Xiao Gu
  • Guangtao Zhai
  • Li Song

There has been emerging interest in blind quality assessment for images/videos captured in the wild, known as in-the-wild I/VQA. Prior deep learning based approaches have achieved considerable progress in I/VQA, but are intrinsically troubled by two issues. Firstly, most existing methods fine-tune image-classification-oriented pre-trained models because of the absence of large-scale I/VQA datasets. However, the task misalignment between I/VQA and image classification leads to degraded generalization performance. Secondly, existing VQA methods directly conduct temporal pooling on the predicted frame-wise scores, resulting in ambiguous inter-frame relation modeling. In this work, we propose a two-stage architecture to separately predict image and video quality in the wild. In the first stage, we resort to supervised contrastive learning to derive quality-aware representations that facilitate the prediction of image quality. Specifically, we propose a novel quality-aware contrastive loss to pull together samples of similar quality and push away quality-different ones in the embedding space. In the second stage, we develop a Relation-Guided Temporal Attention (RTA) module for video quality prediction, which captures global inter-frame dependencies in the embedding space to learn frame-wise attention weights for frame quality aggregation. Extensive experiments demonstrate that our approach performs favorably against state-of-the-art methods on both authentically distorted image benchmarks and video benchmarks.
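
The quality-aware contrastive loss described above (pulling together samples of similar quality and pushing apart dissimilar ones) could look roughly like the following supervised-contrastive sketch. Defining positives as pairs whose mean opinion scores differ by less than a margin is an assumption made for illustration, not necessarily the paper's definition.

import torch
import torch.nn.functional as F

def quality_aware_contrastive_loss(emb, mos, tau=0.1, mos_margin=5.0):
    """emb: (N, D) image embeddings; mos: (N,) mean opinion scores."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / tau                               # scaled cosine similarities
    eye = torch.eye(len(mos), dtype=torch.bool, device=emb.device)
    pos = ((mos[:, None] - mos[None, :]).abs() < mos_margin) & ~eye
    # log-probability of each pair, excluding self-similarity from the denominator
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    pos = pos.float()
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()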

BRUSH: Label Reconstructing and Similarity Preserving Hashing for Cross-modal Retrieval

  • Peng-Fei Zhang
  • Pengfei Zhao
  • Xin Luo
  • Xin-Shun Xu

The hashing technique has recently sparked much attention in the information retrieval community due to its high efficiency in terms of storage and query processing. For cross-modal retrieval tasks, existing supervised hashing models either treat the semantic labels as the ground truth and formalize the problem as a classification task, or further add a similarity matrix as supervisory signals to pursue high-quality hash codes to represent coupled data. However, these approaches cannot ensure that the learnt binary codes preserve well the semantics and similarity relationships contained in the supervised information. Moreover, the resulting sophisticated discrete optimization problems are usually addressed by continuous relaxation or bit-wise solvers, which leads to large quantization error and inefficient computation. To relieve these issues, in this paper, we present a two-step supervised discrete hashing method, i.e., laBel ReconstrUcting and Similarity preserving Hashing (BRUSH). We formulate it as an asymmetric pairwise similarity-preserving problem by using two latent semantic embeddings deduced from decomposing semantics and reconstructing semantics, respectively. Meanwhile, the unified binary codes are jointly generated based on both embeddings with the affinity guarantee, such that the discriminative property of the obtained hash codes can be significantly enhanced while preserving semantics well. In addition, by adopting a two-step hash learning strategy, our method simplifies the procedure of hashing function and binary code learning, thus improving flexibility and efficiency. The resulting discrete optimization problem is also elegantly solved by the proposed alternating algorithm without any relaxation. Extensive experiments on benchmarks demonstrate that BRUSH outperforms the state-of-the-art methods in terms of efficiency and effectiveness.

Local Self-Attention on Fine-grained Cross-media Retrieval

  • Chen Wang
  • Yazhou Yao
  • Qiong Wang
  • Zhenmin Tang

Due to the heterogeneity gap, the data representation of different media is inconsistent and belongs to different feature spaces. Therefore, it is challenging to measure the fine-grained gap between them. To this end, we propose an attention space training method to learn common representations of different media data. Specifically, we utilize local self-attention layers to learn the common attention space between different media data. We propose a similarity concatenation method to understand the content relationship between features. To further improve the robustness of the model, we also train a local position encoding to capture the spatial relationships between features. In this way, our proposed method can effectively reduce the gap between different feature distributions on cross-media retrieval tasks. It also improves the fine-grained recognition performance by attaching attention to high-level semantic information. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. At the same time, our approach provides a new pipeline for fine-grained cross-media retrieval. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.

Self-Adaptive Hashing for Fine-Grained Image Retrieval

  • Yajie Zhang
  • Yuxuan Dai
  • Wei Tang
  • Lu Jin
  • Xinguang Xiang

The main challenge of fine-grained image hashing is how to learn highly discriminative hash codes to distinguish within-class and between-class variations. On the one hand, most existing methods treat all sample pairs as equivalent in hash learning, ignoring the more discriminative information contained in hard sample pairs. On the other hand, in the testing phase, these methods ignore the influence of outliers on retrieval performance. To solve the above issues, this paper proposes a novel Self-Adaptive Hashing method, which learns discriminative hash codes by mining hard sample pairs and improves retrieval performance by correcting outliers in the testing phase. In particular, to improve the discriminability of hash codes, a pair-weighted loss function is proposed to enhance the learning of hash functions on hard sample pairs. Furthermore, in the testing phase, a self-adaptive module is proposed to discover and correct outliers by generating self-adaptive boundaries, thereby improving retrieval performance. Experimental results on two widely-used fine-grained datasets demonstrate the effectiveness of the proposed method.
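
A pair-weighted hashing loss that emphasizes hard sample pairs, as described above, can be sketched as follows; the exponential weighting and the use of relaxed codes in [-1, 1] are illustrative choices rather than the paper's exact formulation.

import torch

def pair_weighted_hash_loss(codes, labels, alpha=2.0):
    """codes: (N, K) relaxed hash codes in [-1, 1]; labels: (N,) class ids."""
    k = codes.size(1)
    sim = codes @ codes.t() / k                         # 1 means identical codes
    same = (labels[:, None] == labels[None, :]).float()
    target = 2 * same - 1                               # +1 same class, -1 otherwise
    err = (sim - target).abs()                          # large error = hard pair
    weight = torch.exp(alpha * err.detach())            # up-weight hard pairs
    return (weight * err.pow(2)).mean()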

Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension

  • Hang Yu
  • Weixin Li
  • Jiankai Li
  • Ye Du

Referring Expression Comprehension (REC) is the task of grounding the object referred to by a language expression. Previous one-stage REC methods usually use a single language feature vector to represent the whole query for grounding, and no reasoning between different objects is performed despite the rich relation cues about objects contained in the language expression, which limits their grounding accuracy. Additionally, these methods mostly use feature pyramid networks for multi-scale visual object feature extraction but ground on different feature layers separately, neglecting the connections between objects of different scales. To address these problems, we propose a novel one-stage REC method, the Entity Relation Fusion Network (ERFN), which locates the referred object by relation-guided reasoning over different objects. In ERFN, instead of grounding objects at each layer separately, we propose a Language Guided Multi-Scale Fusion (LGMSF) model that uses language to guide the fusion of representations of objects with different scales into one feature map. For modeling connections between different objects, we design a Relation Guided Feature Fusion (RGFF) model that extracts entities in the language expression to enhance the referred entity feature in the visual object feature map, and further extracts relations to guide object feature fusion based on the self-attention mechanism. Experimental results show that our method is competitive with state-of-the-art one-stage and two-stage REC methods while keeping real-time inference.

A Local-Global Commutative Preserving Functional Map for Shape Correspondence

  • Qianxing Li
  • Shaofan Wang
  • Dehui Kong
  • Baocai Yin

Existing non-rigid shape matching methods suffer from two main disadvantages. (a) Local details and global features of shapes cannot be carefully explored. (b) A satisfactory trade-off between matching accuracy and computational efficiency can hardly be achieved. To address these issues, we propose a local-global commutative preserving functional map (LGCP) for shape correspondence. The core of LGCP involves an intra-segment geometric submodel and a local-global commutative preserving submodel, which accomplish the segment-to-segment matching and the point-to-point matching tasks, respectively. The first submodel consists of an ICP similarity term and two geometric similarity terms which guarantee the correct correspondence of segments of two shapes, while the second submodel guarantees the bijectivity of the correspondence on both the shape level and the segment level. Experimental results on both segment-to-segment matching and point-to-point matching show that LGCP not only generates quite accurate matching results, but also exhibits satisfactory portability and high efficiency.

Differentially Private Learning with Grouped Gradient Clipping

  • Haolin Liu
  • Chenyu Li
  • Bochao Liu
  • Pengju Wang
  • Shiming Ge
  • Weiping Wang

While deep learning has proved successful in many critical tasks by training models from large-scale data, some private information within the data can be recovered from the released models, leading to privacy leakage. To address this problem, this paper presents a differentially private deep learning paradigm to train private models. In the approach, we propose and incorporate a simple operation termed grouped gradient clipping to modulate the gradient weights. We also incorporate the smooth sensitivity mechanism into the differentially private deep learning paradigm, which bounds the added Gaussian noise. In this way, the resulting model can simultaneously provide strong privacy protection and avoid accuracy degradation, providing a good trade-off between privacy and performance. The theoretical advantages of grouped gradient clipping are well analyzed. Extensive evaluations on popular benchmarks and comparisons with 11 state-of-the-art methods clearly demonstrate the effectiveness and generalizability of our approach.
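
A minimal sketch of grouped gradient clipping followed by Gaussian noise injection, assuming each parameter tensor forms its own group; the paper's actual grouping rule and its smooth-sensitivity noise calibration (as well as the per-sample clipping required for formal DP guarantees) are not reproduced here.

import torch

def grouped_clip_and_noise(model, clip_norm=1.0, noise_mult=1.1, groups=None):
    """Clip gradients per group, then add Gaussian noise scaled to the clip norm."""
    if groups is None:
        groups = [[p] for p in model.parameters() if p.grad is not None]
    for group in groups:
        # bound the joint gradient norm of this group
        torch.nn.utils.clip_grad_norm_(group, clip_norm)
        for p in group:
            p.grad.add_(clip_norm * noise_mult * torch.randn_like(p.grad))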

Hierarchical Deep Residual Reasoning for Temporal Moment Localization

  • Ziyang Ma
  • Xianjing Han
  • Xuemeng Song
  • Yiran Cui
  • Liqiang Nie

Temporal Moment Localization (TML) in untrimmed videos is a challenging task in the field of multimedia, which aims at localizing the start and end points of the activity in the video described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or investigating the fusion manner of the two modalities. These works mainly understand the video and sentence coarsely, ignoring the fact that a sentence can be understood from various semantics, and that the dominant words affecting moment localization in these semantics are the action and object references. Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve finer-grained localization. Furthermore, considering that videos of different resolutions and sentences of different lengths have different difficulty in understanding, we design simple yet effective Res-BiGRUs for feature fusion, which are able to grasp the useful information in a self-adapting manner. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model compared with other state-of-the-art methods.

MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients

  • Yixin Zhang
  • Yoko Yamakata
  • Keishi Tajima

In this paper, we introduce a new recipe dataset, MIRecipe (Multimedia-Instructional Recipe). It has both text and image data for every cooking step, whereas conventional recipe datasets only contain final dish images and/or images for only some of the steps. It consists of 26,725 recipes, which include 239,973 steps in total. The recognition of ingredients in images associated with cooking steps poses a new challenge: since ingredients are processed during cooking, the appearance of the same ingredient is very different in the beginning and finishing stages of the cooking. General object recognition methods, which assume a constant appearance of objects, do not perform well for such objects. To solve the problem, we propose two stage-aware techniques: stage-wise model learning, which trains a separate model for each stage, and stage-aware curriculum learning, which starts with the training data from the beginning stage and proceeds to the later stages. Our experiment with our dataset shows that our method achieves higher accuracy than a model trained using all the data without considering the stages. Our dataset is available at our GitHub repository.
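
The stage-aware curriculum described above (start with beginning-stage data, then proceed to later stages) could be scheduled as in the sketch below, where each recipe step is binned into a coarse stage by its position in the recipe; the binning rule and the number of phases are assumptions, not the paper's exact schedule.

def stage_aware_curriculum(samples, num_stages=3):
    """samples: list of (image, label, step_index, total_steps) tuples.

    Returns one training subset per curriculum phase, growing from
    beginning-stage data to the full dataset.
    """
    def stage_of(step_index, total_steps):
        return min(num_stages - 1, step_index * num_stages // max(total_steps, 1))

    schedule = []
    for stage in range(num_stages):
        schedule.append([(img, lab) for img, lab, i, n in samples
                         if stage_of(i, n) <= stage])
    return schedule  # train one phase per entry, in order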

Video Saliency Prediction via Deep Eye Movement Learning

  • Jiazhong Chen
  • Jie Chen
  • Yuan Dong
  • Dakai Ren
  • Shiqi Zhang
  • Zongyi Li

Existing methods often utilize temporal motion information and spatial layout information in video to predict video saliency. However, the fixations are not always consistent with the moving object of interest, because human eye fixations are determined not only by the spatio-temporal information, but also by the velocity of eye movement. To address this issue, a new saliency prediction method via deep eye movement learning (EML) is proposed in this paper. Compared with previous methods that use human fixations as ground truth, our method uses the optical flow of fixations between successive frames as an extra ground truth for the purpose of eye movement learning. Experimental results on the DHF1K, Hollywood2, and UCF-sports datasets show that the proposed EML model achieves promising results across a wide range of metrics.

Structural Knowledge Organization and Transfer for Class-Incremental Learning

  • Yu Liu
  • Xiaopeng Hong
  • Xiaoyu Tao
  • Songlin Dong
  • Jingang Shi
  • Yihong Gong

Deep models are vulnerable to catastrophic forgetting when fine-tuned on new data. Popular distillation-based methods usually neglect the relations between data samples and may eventually forget essential structural knowledge. To address these shortcomings, we propose a structural graph knowledge distillation based incremental learning framework to preserve both the positions of samples and their relations. Firstly, a memory knowledge graph (MKG) is generated to fully characterize the structural knowledge of historical tasks. Secondly, we develop a graph interpolation mechanism to enrich the domain of knowledge and alleviate the inter-class sample imbalance issue. Thirdly, we introduce structural graph knowledge distillation to transfer the knowledge of historical tasks. Comprehensive experiments on three datasets validate the proposed method.

Learning to Decompose and Restore Low-light Images with Wavelet Transform

  • Pengju Zhang
  • Chaofan Zhang
  • Zheng Rong
  • Yihong Wu

Low-light images often suffer from low visibility and various kinds of noise. Most existing low-light image enhancement methods amplify noise when enhancing low-light images, because they neglect to separate valuable image information from noise. In this paper, we propose a novel wavelet-based attention network, where wavelet transform is integrated into attention learning for joint low-light enhancement and noise suppression. In particular, the proposed wavelet-based attention network includes a Decomposition-Net, an Enhancement-Net and a Restoration-Net. In Decomposition-Net, to facilitate denoising, wavelet transform layers are designed to separate noise and global content information into different frequency features. Furthermore, an attention-based strategy is introduced to progressively select suitable frequency features for accurately restoring illumination and reflectance according to Retinex theory. In addition, Enhancement-Net is introduced to further remove degradations in reflectance and adjust illumination, while Restoration-Net employs conditional adversarial learning to adversarially improve the visual quality of the final restored results based on the enhanced illumination and reflectance. Extensive experiments on several public datasets demonstrate that the proposed method achieves more pleasing results than state-of-the-art methods.

Conditional Extreme Value Theory for Open Set Video Domain Adaptation

  • Zhuoxiao Chen
  • Yadan Luo
  • Mahsa Baktashmotlagh

With the advent of media streaming, video action recognition has become progressively important for various applications, yet at the high expense of requiring large-scale data labelling. To overcome the problem of expensive data labelling, domain adaptation techniques have been proposed, which transfer knowledge from fully labelled data (i.e., the source domain) to unlabelled data (i.e., the target domain). The majority of video domain adaptation algorithms are proposed for closed-set scenarios in which all the classes are shared among the domains. In this work, we propose an open-set video domain adaptation approach to mitigate the domain discrepancy between the source and target data, allowing the target data to contain additional classes that do not belong to the source domain. Different from previous works, which only focus on improving accuracy for shared classes, we aim to jointly enhance the alignment of the shared classes and the recognition of unknown samples. Towards this goal, class-conditional extreme value theory is applied to enhance unknown-sample recognition. Specifically, the entropy values of target samples are modelled as generalised extreme value distributions, which allows separating unknown samples lying in the tail of the distribution. To alleviate the negative transfer issue, weights computed from the distance between the sample entropy and the threshold are leveraged in adversarial learning, so that confident source and target samples are aligned and unconfident samples are pushed away. The proposed method has been thoroughly evaluated on both small-scale and large-scale cross-domain video datasets and achieves state-of-the-art performance.
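
A sketch of the core idea of modelling entropies with extreme value theory: fit a generalised extreme value (GEV) distribution to target-sample entropies, flag tail samples as unknown, and weight samples by their distance to the threshold. The tail probability and the weighting formula are illustrative assumptions; the paper applies this class-conditionally and inside adversarial training.

import numpy as np
from scipy.stats import genextreme

def unknown_weights(entropies, tail_prob=0.05):
    """entropies: 1-D array of per-sample prediction entropies on the target domain."""
    c, loc, scale = genextreme.fit(entropies)            # fit the GEV distribution
    threshold = genextreme.ppf(1.0 - tail_prob, c, loc=loc, scale=scale)
    unknown = entropies > threshold                       # tail samples: likely unknown
    weights = np.abs(entropies - threshold)               # confidence grows with distance
    weights = weights / (weights.max() + 1e-8)
    return unknown, weights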

Hierarchical Composition Learning for Composed Query Image Retrieval

  • Yahui Xu
  • Yi Bin
  • Guoqing Wang
  • Yang Yang

Composed query image retrieval is a growing research topic. The objective is to retrieve images that not only generally resemble the reference image but also differ according to the desired modification text. Existing methods mainly explore composing the modification text with the global feature or local entity descriptors of the reference image. However, they ignore the fact that the modification text is indeed diverse and arbitrary. It not only relates to abstractive global features or concrete local entity transformations, but also often involves fine-grained structured visual adjustment. Thus, emphasizing only global or local entity visual features is insufficient for query composition. In this work, we tackle this task by hierarchical composition learning. Specifically, the proposed method first encodes images into three representations: global-, entity- and structure-level representations. The structure-level representation is richly explicable; it explicitly describes entities as well as attributes and relationships in the image with a directed graph. Based on these, we naturally perform hierarchical composition learning by fusing the modification text and the reference image in a global-entity-structure manner. It can transform the visual feature conditioned on the modification text towards the target image in a coarse-to-fine manner, which takes advantage of the complementary information among the three levels. Moreover, we introduce hybrid space matching to explore global, entity and structure alignments, which achieves high performance and good interpretability.

Hard-Boundary Attention Network for Nuclei Instance Segmentation

  • Yalu Cheng
  • Pengchong Qiao
  • Hongliang He
  • Guoli Song
  • Jie Chen

Image segmentation plays an important role in medical image analysis, and accurate segmentation of nuclei is especially crucial to clinical diagnosis. However, existing methods fail to segment dense nuclei due to hard boundaries whose texture is similar to the nuclear interior. To this end, we propose a Hard-Boundary Attention Network (HBANet) for nuclei instance segmentation. Specifically, we propose a Background Weaken Module (BWM) to weaken our model's attention to the nucleus background by integrating low-level features into high-level features. To improve the robustness of the model to hard boundaries of nuclei, we further design a Gradient-based boundary adaptive Strategy (GS) which generates boundary-weakened data for model training in an adversarial manner. We conduct extensive experiments on the MoNuSeg and CPM-17 datasets, and experimental results show that our HBANet outperforms state-of-the-art methods.

Few-shot Egocentric Multimodal Activity Recognition

  • Jinxing Pan
  • Xiaoshan Yang
  • Yi Huang
  • Changsheng Xu

Activity recognition based on egocentric multimodal data collected by wearable devices has become increasingly popular recently. However, conventional activity recognition methods face the dilemma of lacking large-scale labeled egocentric multimodal datasets due to the high cost of data collection. In this paper, we propose a new task of few-shot egocentric multimodal activity recognition, which has at least two significant challenges. On the one hand, it is difficult to extract effective features from the multimodal data sequences of video and sensor signals due to the scarcity of the samples. On the other hand, how to robustly recognize novel activity classes with very few labeled samples becomes another, more critical challenge due to the complexity of the multimodal data. To resolve these challenges, we propose a two-stream graph network, which consists of a heterogeneous graph-based multimodal association module and a knowledge-aware activity classifier module. The former uses a heterogeneous graph network to comprehensively capture the dynamic and complementary information contained in the multimodal data stream. The latter learns robust activity classifiers through knowledge propagation among the classifier parameters of different classes. In addition, we adopt an episodic training strategy to improve the generalization ability of the proposed few-shot activity recognition model. Experiments on two public datasets show that the proposed model achieves better performance than other baseline models.

Visual Storytelling with Hierarchical BERT Semantic Guidance

  • Ruichao Fan
  • Hanli Wang
  • Jinjing Gu
  • Xianhui Liu

Visual storytelling, which aims at automatically producing a narrative paragraph for a photo album, remains quite challenging due to the complexity and diversity of photo album content. In addition, open-domain photo albums cover a broad range of topics, which results in highly variable vocabularies and expression styles for describing them. In this work, a novel teacher-student visual storytelling framework with hierarchical BERT semantic guidance (HBSG) is proposed to address the above-mentioned challenges. The proposed teacher module consists of two joint tasks, namely, word-level latent topic generation and semantic-guided sentence generation. The first task aims to predict the latent topic of the story. As there is no ground-truth topic information, a pre-trained BERT model based on visual contents and annotated stories is utilized to mine topics. The topic vector is then distilled into a designed image-topic prediction model. In the semantic-guided sentence generation task, HBSG is introduced for two purposes. The first is to narrow down the language complexity across topics, where a co-attention decoder with vision and semantics is designed to leverage the latent topics to induce topic-related language models. The second is to employ sentence semantics as an online external linguistic knowledge teacher module. Finally, an auxiliary loss is devised to transfer linguistic knowledge into the language generation model. Extensive experiments are performed to demonstrate the effectiveness of the HBSG framework, which surpasses the state-of-the-art approaches evaluated on the VIST test set.

Language Based Image Quality Assessment

  • Lorenzo Seidenari
  • Leonardo Galteri
  • Pietro Bongini
  • Marco Bertini
  • Alberto Del Bimbo

Evaluation of generative models in the visual domain is often performed by providing anecdotal results to the reader. In the case of image enhancement, reference images are usually available. Nonetheless, using signal-based metrics often leads to counterintuitive results: highly natural, crisp images may obtain worse scores than blurry ones. On the other hand, blind (no-reference) image assessment may rank images reconstructed with GANs higher than the original undistorted images. To avoid time-consuming human-based image assessment, semantic computer vision tasks may be exploited instead [9, 25, 33]. In this paper we advocate the use of language generation tasks to evaluate the quality of restored images. We show experimentally that image captioning, used as a downstream task, may serve as a method to score image quality. Captioning scores align better with human rankings than signal-based metrics or no-reference image quality metrics do. We provide insights on how the corruption of local image structure by artifacts may steer image captions in the wrong direction.

Efficient Proposal Generation with U-shaped Network for Temporal Sentence Grounding

  • Ludan Ruan
  • Qin Jin

Temporal Sentence Grounding aims to localize the relevant temporal region in a given video according to a query sentence. It is a challenging task due to the semantic gap between the modalities and the diversity of event durations. Proposal generation plays an important role in previous mainstream methods. However, previous proposal generation methods apply the same feature extraction without considering the diversity of event durations. In this paper, we propose a novel temporal sentence grounding model with a U-shaped Network for efficient proposal generation (UN-TSG), which utilizes a U-shaped structure to encode proposals of different lengths hierarchically. Experiments on two benchmark datasets demonstrate that, with our more efficient proposal generation method, our model achieves state-of-the-art grounding performance at faster speed and with less computation cost.

A Model-Guided Unfolding Network for Single Image Reflection Removal

  • Dongliang Shao
  • Yunhui Shi
  • Jin Wang
  • Nam Ling
  • Baocai Yin

Removing undesirable reflections from a single image captured through a glass surface is of broad application to various image processing and computer vision tasks, but it is an ill-posed and challenging problem. Existing traditional single image reflection removal (SIRR) methods are often less effective at removing reflections due to the limited descriptive ability of handcrafted priors. State-of-the-art learning based methods often suffer from instability problems because they are designed as unexplainable black boxes. In this paper, we present an explainable approach for SIRR named model-guided unfolding network (MoG-SIRR), which is unfolded from our proposed reflection removal model with a non-local autoregressive prior and a dereflection prior. In order to complement the transmission layer and the reflection layer in a single image, we construct a deep learning framework with two streams by integrating reflection removal and non-local regularization into trainable modules. Extensive experiments on public benchmark datasets demonstrate that our method achieves superior performance for single image reflection removal.

Intra- and Inter-frame Iterative Temporal Convolutional Networks for Video Stabilization

  • Haopeng Xie
  • Liang Xiao
  • Huicong Wu

Video jitter is an unpleasant artifact of irregular camera motion over time. How to extract motion state information from a period of continuous video frames is a major issue for video stabilization. In this paper, we propose a novel sequence model, Intra- and Inter-frame Iterative Temporal Convolutional Networks (I3TC-Net), which alternately transfers the spatial-temporal correlation of motion within and between frames. We hypothesize that the motion state information can be represented by transmission states. Specifically, we employ a combination of Convolutional Long Short-Term Memory (ConvLSTM) and an embedded encoder-decoder to generate the latent stable frame, which is used to update the transmission states iteratively and to learn an effective global homography transformation for each unstable frame, generating the corresponding stabilized result along the time axis. Furthermore, we create a video dataset to address the lack of stable data and improve the training effect. Experimental results show that our method outperforms state-of-the-art results on publicly available videos, e.g., with a 5.4-point improvement in stability score. The project page is available at https://github.com/root2022IIITC/IIITC.

Improving Hyperspectral Super-Resolution via Heterogeneous Knowledge Distillation

  • Ziqian Liu
  • Qing Ma
  • Junjun Jiang
  • Xianming Liu

Hyperspectral images (HSI) contain rich spectral information, but their spatial resolution is often limited by the imaging system. Super-resolution (SR) reconstruction has become a hot topic, aiming to increase spatial resolution without extra hardware cost. Fusion-based hyperspectral image super-resolution (FHSR) methods use supplementary high-resolution multispectral images (HR-MSI) to recover spatial details, but well co-registered HR-MSI are hard to collect. Recently, single hyperspectral image super-resolution (SHSR) methods based on deep learning have made great progress. However, the lack of HR-MSI input makes it difficult for these SHSR methods to exploit spatial information. To take advantage of both FHSR and SHSR methods, in this paper we propose a new pipeline that treats HR-MSI as privileged information and improves our SHSR model with knowledge distillation. That is, our model is trained with paired MSI-HSI data and only needs LR-HSI as input during inference. Specifically, we combine SHSR and spectral super-resolution (SSR) and design a novel architecture, the Distillation-Oriented Dual-branch Net (DODN), to make the SHSR model fully employ knowledge transferred from the SSR model. Since mainstream SSR models are 2D CNNs and a fully 2D CNN causes spectral disorder in the SHSR task, a new mixed 2D/3D block, called the Distillation-Oriented Dual-branch Block (DODB), is proposed, where the 3D branch extracts spectral-spatial correlation while the 2D branch accepts information from the SSR model through knowledge distillation. The main idea is to distill knowledge of spatial information from HR-MSI to the SHSR model without changing its network architecture. Extensive experiments on two benchmark datasets, CAVE and NTIRE2020, demonstrate that our proposed DODN outperforms the state-of-the-art SHSR methods in terms of both quantitative and qualitative analysis.

Patch-Based Deep Autoencoder for Point Cloud Geometry Compression

  • Kang You
  • Pan Gao

The ever-increasing number of 3D applications makes point cloud compression unprecedentedly important and needed. In this paper, we propose a patch-based compression process using deep learning, focusing on lossy point cloud geometry compression. Unlike existing point cloud compression networks, which apply feature extraction and reconstruction on the entire point cloud, we divide the point cloud into patches and compress each patch independently. In the decoding process, we finally assemble the decompressed patches into a complete point cloud. In addition, we train our network with a patch-to-patch criterion, i.e., we use the local reconstruction loss for optimization to approximate the global reconstruction optimality. Our method outperforms the state-of-the-art in terms of rate-distortion performance, especially at low bitrates. Moreover, the proposed compression process can guarantee generating the same number of points as the input. The network model of this method can be easily applied to other point cloud reconstruction problems, such as upsampling.
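
A sketch of the patch partitioning step of such a pipeline: pick patch centers, group each center's nearest neighbours into a patch, and store coordinates relative to the center so that patches can be encoded independently and re-assembled after decoding. Random center selection and a fixed patch size are simplifying assumptions, not the paper's exact procedure.

import numpy as np

def split_into_patches(points, num_patches=64, patch_size=256):
    """points: (N, 3) point cloud; returns patch centers and per-patch coordinates."""
    num_patches = min(num_patches, len(points))
    centers = points[np.random.choice(len(points), num_patches, replace=False)]
    patches = []
    for c in centers:
        d = np.linalg.norm(points - c, axis=1)
        idx = np.argsort(d)[:patch_size]       # nearest neighbours form one patch
        patches.append(points[idx] - c)        # coordinates relative to the center
    return centers, patches                    # decoder re-adds centers when assembling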

Score Transformer: Generating Musical Score from Note-level Representation

  • Masahiro Suzuki

In this paper, we explore the tokenized representation of musical scores using the Transformer model to automatically generate musical scores. Thus far, sequence models have yielded fruitful results with note-level (MIDI-equivalent) symbolic representations of music. Although the note-level representations can comprise sufficient information to reproduce music aurally, they cannot contain adequate information to represent music visually in terms of notation. Musical scores contain various musical symbols (e.g., clef, key signature, and notes) and attributes (e.g., stem direction, beam, and tie) that enable us to visually comprehend musical content. However, automated estimation of these elements has yet to be comprehensively addressed. In this paper, we first design score token representation corresponding to the various musical elements. We then train the Transformer model to transcribe note-level representation into appropriate music notation. Evaluations of popular piano scores show that the proposed method significantly outperforms existing methods on all 12 musical aspects that were investigated. We also explore an effective notation-level token representation to work with the model and determine that our proposed representation produces the steadiest results.

Zero-shot Recognition with Image Attributes Generation using Hierarchical Coupled Dictionary Learning

  • Shuang Li
  • Lichun Wang
  • Shaofan Wang
  • Dehui Kong
  • Baocai Yin

Zero-shot learning (ZSL) aims to recognize images from unseen (novel) classes with training images from seen classes. The attributes of each class are exploited as auxiliary semantic information. Recently most ZSL approaches focus on learning visual-semantic embeddings to transfer knowledge from the seen classes to the unseen classes. However, few works study whether the auxiliary semantic information at the class level is extensive enough for the ZSL task. To tackle this problem, we propose a hierarchical coupled dictionary learning (HCDL) approach to hierarchically align the visual-semantic structures at both the class level and the image level. Firstly, the class-level coupled dictionary is trained to establish a basic connection between the visual space and the semantic space. Then, image attributes are generated based on this basic connection. Finally, fine-grained information can be embedded by training the image-level coupled dictionary. Zero-shot recognition is performed in multiple spaces by searching for the nearest neighbor class of the unseen image. Experiments on two widely used benchmark datasets show the effectiveness of the proposed approach.

Inter-modality Discordance for Multimodal Fake News Detection

  • Shivangi Singhal
  • Mudit Dhawan
  • Rajiv Ratn Shah
  • Ponnurangam Kumaraguru

The paradigm shift in the consumption of news via online platforms has cultivated the growth of digital journalism. Contrary to traditional media, lowering entry barriers and enabling everyone to be part of content creation have disabled the concept of centralized gatekeeping in digital journalism. This in turn has triggered the production of fake news. Current studies have made a significant effort towards multimodal fake news detection, with less emphasis on exploring the discordance between the different media present in a news article. We hypothesize that fabrication of either modality will lead to dissonance between the modalities, resulting in misrepresented, misinterpreted and misleading news. In this paper, we inspect the authenticity of news coming from online media outlets by exploiting the relationship (discordance) between the textual and multiple visual cues. We develop an inter-modality discordance based fake news detection framework to achieve this goal. The modality-specific discriminative features are learned by employing the cross-entropy loss and a modified version of contrastive loss that explores the inter-modality discordance. To the best of our knowledge, this is the first work that leverages information from different components of the news article (i.e., headline, body, and multiple images) for multimodal fake news detection. We conduct extensive experiments on real-world datasets to show that our approach outperforms the state-of-the-art by an average F1-score of 6.3%.

SESSION: Short Papers

A Coarse-to-fine Approach for Fast Super-Resolution with Flexible Magnification

  • Zhichao Fu
  • Tianlong Ma
  • Liang Xue
  • Yingbin Zheng
  • Hao Ye
  • Liang He

We perform fast single image super-resolution with flexible magnification for natural images. A novel coarse-to-fine super-resolution framework is developed in which the magnification is factorized into a maximum integer component and the remaining quotient. Specifically, our framework embeds a light-weight upscale network for super-resolution with the integer scale factor, followed by a fine-grained network that guides interpolation on feature maps and generates the super-resolved image. Compared with previous flexible magnification super-resolution approaches, the proposed framework achieves a favorable tradeoff between computational complexity and performance. We conduct experiments with the coarse-to-fine framework on standard benchmarks and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
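
A small sketch of the magnification factorization described above: a flexible scale is split into its maximum integer component, handled by the coarse upscale network, and a residual quotient handled by guided interpolation (e.g., 3.5x becomes 3x followed by roughly 1.17x). This is simply a reading of the abstract; the function and variable names are assumptions.

import math

def factorize_scale(scale):
    """Split a flexible magnification into an integer factor and a residual quotient."""
    k = max(1, math.floor(scale))   # maximum integer component (coarse network stage)
    q = scale / k                   # remaining fractional factor (fine-grained stage)
    return k, q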

Automatically Generate Rigged Character from Single Image

  • Zhanpeng Huang
  • Rui Han
  • Jianwen Huang
  • Hao Yin
  • Zipeng Qin
  • Zibin Wang

Animation plays an important role in virtual reality and augmented reality applications. However, it requires great effort for non-professional users to create animation assets. In this paper, we propose a systematic pipeline to generate ready-to-use characters from images for real-time animation without user intervention. Rather than per-pixel mapping or synthesis in image space using optical flow or generative models, we employ an approximate geometric embodiment to undertake 3D animation without large distortion. The geometry structure is generated from a type-agnostic character. A skeleton adaptation is then adopted to guarantee semantic motion transfer to the geometry proxy. The generated character is compatible with standard 3D graphics engines and ready to use for real-time applications. Experiments show that our method works on various images (e.g. sketches, cartoons, and photos) of most object categories (e.g. humans, animals, and non-creatures). We develop an AR demo to show its potential usage for fast prototyping.

Flat and Shallow: Understanding Fake Image Detection Models by Architecture Profiling

  • Jing Xu
  • Wei Zhang
  • Yalong Bai
  • Qibin Sun
  • Tao Mei

Digital image manipulations have been heavily abused to spread misinformation. Despite the great efforts dedicated by the research community, prior works are mostly performance-driven, i.e., optimizing performance using standard/heavy networks designed for semantic classification. A thorough understanding of fake image detection models is still missing. This paper studies the essential ingredients of a good fake image detection model by profiling the best-performing architectures. Specifically, we conduct a thorough analysis on a massive number of detection models and observe how the performances are affected by different patterns of network structure. Our key findings include: 1) with the same computational budget, flat network structures (e.g., large kernel sizes, wide connections) perform better than commonly used deep networks; 2) operations in shallow layers deserve more computational capacity to trade off performance and computational cost. These findings sketch a general profile for essential models of fake image detection, which show clear differences from those for semantic classification. Furthermore, based on our analysis, we propose a new Depth-Separable Search Space (DSS) for fake image detection. Compared to state-of-the-art methods, our model achieves competitive performance while saving more than 50% of parameters.

Multi-branch Semantic Learning Network for Text-to-Image Synthesis

  • Jiading Ling
  • Xingcai Wu
  • Zhenguo Yang
  • Xudong Mao
  • Qing Li
  • Wenyin Liu

In this paper, we propose a multi-branch semantic learning network (MSLN) to generate images according to textual descriptions by taking into account global and local textual semantics; it consists of two stages. The first stage generates a coarse-grained image based on the sentence features. In the second stage, a multi-branch fine-grained generation model is constructed to inject the sentence-level and word-level semantics into two coarse-grained images via global and local attention modules, which generate global and local fine-grained image textures, respectively. In particular, we devise a channel fusion module (CFM) to fuse the global and local fine-grained features in the multi-branch fine-grained stage and generate the output image. Extensive experiments conducted on the CUB-200 and Oxford-102 datasets demonstrate the superior performance of the proposed method (e.g., FID is reduced from 16.09 to 14.43 on CUB-200).

Attention-based Dual-Branches Localization Network for Weakly Supervised Object Localization

  • Wenjun Hui
  • Chuangchuang Tan
  • Guanghua Gu

Weakly supervised object localization exploits the last convolutional feature maps of a classification model and the weights of the fully-connected (FC) layer to achieve localization. However, the high-level feature maps used for localization lack edge features. Additionally, the weights are specific to the classification task, causing only discriminative regions to be discovered. In order to fuse edge features and adjust the attention distribution over feature map channels, we propose an efficient method called the Attention-based Dual-Branches Localization (ADBL) Network, in which a dual-branches structure and an attention mechanism are adopted to mine edge features and non-discriminative features for locating more target areas. Specifically, the dual-branches structure cascades low-level feature maps to mine target object edge regions. Additionally, during the inference stage, the attention mechanism assigns appropriate attention to different features to preserve non-discriminative areas. Extensive experiments on both the ILSVRC and CUB-200-2011 datasets show that the ADBL method achieves substantial performance improvements.

Pose-aware Outfit Transfer between Unpaired in-the-wild Fashion Images

  • Donnaphat Trakulwaranont
  • Marc A. Kastner
  • Shin'ichi Satoh

Virtual try-on systems have become popular for visualizing outfits, due to the importance of individual fashion in many communities. The objective of such a system is to transfer a piece of clothing to another person while preserving its detail and characteristics. Generating a realistic in-the-wild image requires joint visual optimization of the clothing, background and target person, which makes this task still very challenging. In this paper, we develop a method that generates realistic try-on images from unpaired images in in-the-wild datasets. Our proposed method starts by generating a mock-up paired image using geometric transfer. Then, the target's pose information is adjusted using a modified pose-attention module. We combine a reconstruction loss and a content loss to preserve the detail and style of the transferred clothing, the background and the target person. We evaluate the approach on the Fashionpedia dataset and show promising performance over a baseline approach.

Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation

  • Yang Wu
  • Shirui Feng
  • Guanbin Li
  • Liang Lin

In this paper, we focus on solving the navigation problem of embodied question answering (EmbodiedQA), where the lack of experience and common-sense information essentially results in a failure to find the target when the robot is spawned in unknown environments. We present a route planning method named the Path Estimation and Memory Recalling (PEMR) framework. PEMR includes a "looking ahead" process, i.e., a visual feature extractor module that estimates feasible paths for gathering 3D navigational information, and a "looking behind" process, a memory recalling mechanism that aims at fully leveraging the past experience collected by the feature extractor. To encourage the navigator to learn more accurate prior expert experience, we improve the original benchmark dataset and provide a family of evaluation metrics for diagnosing both the navigation and question answering modules. We show strong experimental results of PEMR on the EmbodiedQA navigation task.

Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

  • Shota Orihashi
  • Yoshihiro Yamazaki
  • Naoki Makishima
  • Mana Ihori
  • Akihiko Takashima
  • Tomohiro Tanaka
  • Ryo Masumura

This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using the encoder-decoder model based on Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages such as English, to train the resource-poor encoder-decoder model. Our key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder by using a multilingual dataset that combines the resource-poor language’s dataset and the resource-rich language’s dataset to learn language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder by using the resource-poor language’s dataset to make the decoder better suited to the resource-poor language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.

PLM-IPE: A Pixel-Landmark Mutual Enhanced Framework for Implicit Preference Estimation

  • Federico Becattini
  • Xuemeng Song
  • Claudio Baecchi
  • Shi-Ting Fang
  • Claudio Ferrari
  • Liqiang Nie
  • Alberto Del Bimbo

In this paper, we are interested in understanding how customers perceive fashion recommendations, in particular when observing a proposed combination of garments composing an outfit. Automatically understanding how a suggested item is perceived, without any kind of active engagement, is an essential building block for interactive applications. We propose a pixel-landmark mutual enhanced framework for implicit preference estimation, named PLM-IPE, which is capable of inferring the user’s implicit preferences by exploiting visual cues, without any active or conscious engagement. PLM-IPE consists of three key modules: a pixel-based estimator, a landmark-based estimator and a mutual learning based optimization. The former two modules capture the implicit reaction of the user at the pixel level and the landmark level, respectively. The last module serves to transfer knowledge between the two parallel estimators. For evaluation, we collected a real-world dataset, named SentiGarment, which contains 3,345 facial reaction videos paired with suggested outfits and human-labeled reaction scores. Extensive experiments show the superiority of our model over state-of-the-art approaches.

Head-Motion-Aware Viewport Margins for Improving User Experience in Immersive Video

  • Mehmet N. Akcay
  • Burak Kara
  • Saba Ahsan
  • Ali C. Begen
  • Igor Curcio
  • Emre Aksu

Viewport-dependent delivery (VDD) is a technique for saving network resources during the transmission of immersive videos. However, it results in a non-zero motion-to-high-quality delay (MTHQD), i.e., the time from the moment the current viewport contains at least one low-quality tile to the moment all tiles in the new viewport are rendered in high quality. MTHQD is an important metric in the evaluation of VDD systems. This paper improves an earlier concept called viewport margins by introducing head-motion awareness. The primary benefit of this improvement is a reduction (up to 64%) in the average MTHQD.
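
As an illustration of the metric (not the paper's measurement code), the following minimal Python sketch computes the average MTHQD from hypothetical playback logs, where each event is assumed to be a pair of timestamps: when the current viewport first contains a low-quality tile, and when all tiles of the new viewport are rendered in high quality.

    def average_mthqd(events):
        """events: list of (t_low, t_all_high) pairs in seconds (illustrative log format)."""
        delays = [t_high - t_low for t_low, t_high in events]
        return sum(delays) / len(delays) if delays else 0.0

    # Example: three head motions with varying recovery times -> average delay of 0.53 s.
    print(average_mthqd([(1.0, 1.4), (5.2, 5.5), (9.0, 9.9)]))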

Chinese White Dolphin Detection in the Wild

  • Hao Zhang
  • Qi Zhang
  • Phuong Anh Nguyen
  • Victor C. S. Lee
  • Antoni Chan

For ecological protection of the ocean, biologists usually conduct line-transect vessel surveys to measure the population density of sea species (such as dolphins) within their habitat. However, observing sea species via vessel surveys consumes substantial manpower and is more challenging than observing common objects, due to the scarcity of the animals in the wild, the tiny size of the objects, and similar-sized distracter objects (e.g., floating trash). To reduce the workload of human experts and improve observation accuracy, in this paper we develop a practical system to automatically detect Chinese White Dolphins in the wild. First, we construct a dataset named Dolphin-14k with more than 2.6k dolphin instances. To improve annotation efficiency given the rarity of dolphins, we design an interactive dolphin box annotation strategy to efficiently annotate sparse dolphin instances in long videos. Second, we compare the performance and efficiency of three off-the-shelf object detection algorithms, including Faster-RCNN, FCOS, and YoloV5, on the Dolphin-14k dataset and pick YoloV5 as the detector, where a new category (Distracter) is added to the model training to reject false positives. Finally, we incorporate the dolphin detector into a system prototype, which detects dolphins in video frames at 100.99 FPS per GPU with high accuracy (i.e., 90.95 mAP@0.5).

BAND: A Benchmark Dataset for Bangla News Audio Classification

  • Md. Rafi Ur Rashid
  • Mahim Mahbub
  • Muhammad Abdullah Adnan

Despite being the sixth most widely spoken language in the world, Bangla has barely received any attention in the domain of audio-visual news classification. In this work, we collect, annotate, and prepare a comprehensive news audio dataset in Bangla, comprising 5120 news clips with around 820 hours of total duration. We also conduct practical experiments to obtain a human baseline for the news audio classification task. We then implement one of the human approaches by performing news classification directly on the audio features using various state-of-the-art classifiers and a few transfer learning models. To the best of our knowledge, this is the first work to develop a benchmark dataset for news audio classification in Bangla.

A comparison study: the impact of age and gender distribution on age estimation

  • Chang Kong
  • Qiuming Luo
  • Guoliang Chen

Age estimation from a single facial image is a challenging and attractive research area in the computer vision community. Several facial datasets annotated with age and gender attributes have become available in the literature. However, one major drawback is that these datasets do not consider the label distribution during data collection. Therefore, models trained on these datasets are inevitably biased against the ages with the fewest images. In this work, we analyze the age and gender distributions of previous datasets and publish a Uniform Age and Gender Dataset (UAGD), which has an almost equal number of female and male images at each age. In addition, we investigate the impact of age and gender distribution on age estimation by comparing the DEX CNN model trained on several different datasets. Our experiments show that UAGD performs well for the age estimation task and is also suitable as an evaluation benchmark.

Spherical Image Compression Using Spherical Wavelet Transform

  • Huan Wang
  • Yunhui Shi
  • Jin Wang
  • Gang Wu
  • Nam Ling
  • Baocai Yin

The Spherical Measure Based Spherical Image Representation (SMSIR) has nearly uniformly distributed pixels in the spherical domain with effective index schemes. Based on SMSIR, the spherical wavelet transform can be efficiently designed; it captures spherical geometry features in a compact manner and provides a powerful tool for spherical image compression. In this paper, we propose an efficient compression scheme for SMSIR images, named Spherical Set Partitioning in Hierarchical Trees (S-SPIHT), using the spherical wavelet transform, which exploits the inherent similarities across the subbands in the spherical wavelet decomposition of an SMSIR image. The proposed S-SPIHT progressively transforms spherical wavelet coefficients into a bit-stream and generates an embedded compressed bit-stream that can be efficiently decoded at several spherical image quality levels. The most crucial part of our proposed S-SPIHT is the redesign of the scanning of the wavelet coefficients corresponding to different index schemes. We design three scanning methods, namely ordered root tree index scanning (ORTIS), dyadic index progressive scanning (DIPS) and dyadic index cross scanning (DICS), to efficiently reorganize the wavelet coefficients. These methods effectively exploit the self-similarity between sub-bands and the fact that the high-frequency sub-bands mostly contain insignificant coefficients. Experimental results on widely-used datasets demonstrate that our proposed S-SPIHT outperforms the straightforward SPIHT for SMSIR images in terms of PSNR, S-PSNR and SSIM.

FQM-GC: Full-reference Quality Metric for Colored Point Cloud Based on Graph Signal Features and Color Features

  • Ke-xin Zhang
  • Gang-yi Jiang
  • Mei Yu

A Colored Point Cloud (CPC) is often distorted in the processes of acquisition, processing, and compression, so reliable quality assessment metrics are required to estimate the perceived distortion of CPCs. We propose a Full-reference Quality Metric for colored point clouds based on Graph signal features and Color features (FQM-GC). For geometric distortion, the normal and coordinate information of the sub-clouds obtained via geometric segmentation is used to construct their underlying graphs, from which geometric structure features are extracted. For color distortion, the corresponding color statistical features are extracted from regions divided according to color attributes. Meanwhile, the color features of different regions are weighted to simulate the visual masking effect. Finally, all the extracted features are combined into a feature vector to estimate the quality of CPCs. Experimental results on three databases (CPCD2.0, IRPC and SJTU-PCQA) show that the proposed metric FQM-GC is more consistent with human visual perception than existing metrics.

Cross-layer Navigation Convolutional Neural Network for Fine-grained Visual Classification

  • Chenyu Guo
  • Jiyang Xie
  • Kongming Liang
  • Xian Sun
  • Zhanyu Ma

Fine-grained visual classification (FGVC) aims to classify sub-classes of objects in the same super-class (e.g., species of birds, models of cars). For FGVC tasks, the essential solution is to find discriminative subtle information about the target in local regions. Traditional FGVC models prefer to use refined features, i.e., high-level semantic information, for recognition and rarely use low-level information. However, low-level information, which contains rich details, also helps improve performance. Therefore, in this paper, we propose a cross-layer navigation convolutional neural network for feature fusion. First, the feature maps extracted by the backbone network are fed into a convolutional long short-term memory model sequentially from high level to low level to perform feature aggregation. Then, attention mechanisms are used after feature fusion to extract spatial and channel information while linking the high-level semantic information and the low-level texture features, which can better locate the discriminative regions for FGVC. In the experiments, three commonly used FGVC datasets, CUB-200-2011, Stanford-Cars, and FGVC-Aircraft, are used for evaluation, and comparisons with other FGVC methods demonstrate that the proposed method achieves superior results. Code: https://github.com/PRIS-CV/CN-CNN.git

NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels

  • Mohit Sharma
  • Raj Aaryaman Patra
  • Harshal Desai
  • Shruti Vyas
  • Yogesh Rawat
  • Rajiv Ratn Shah

Deep learning has shown remarkable progress in a wide range of problems. However, efficient training of such models requires large-scale datasets, and obtaining annotations for such datasets can be challenging and costly. In this work, we explore user-generated, freely available labels from web videos for video understanding. We create a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information. We utilize the collected dataset for action classification and demonstrate its usefulness with existing small-scale annotated datasets, UCF101 and HMDB51. We study different loss functions and two pretraining strategies, simple and self-supervised learning. We also show how a network pretrained on the proposed dataset can help against video corruption and label noise in downstream datasets. We present this as a benchmark dataset for noisy learning in video understanding. The dataset, code, and trained models are publicly available for future research, along with a longer version of our paper.

CMRD-Net: An Improved Method for Underwater Image Enhancement

  • Fengjie Xu
  • Changhua Zhang
  • Zhongshu Chen
  • Zhekai Du
  • Lei Han
  • Lin Zuo

Underwater image enhancement is a challenging task due to the degradation of image quality under complicated underwater lighting conditions and scenes. In recent years, most methods improve the visual quality of underwater images by using deep Convolutional Neural Networks and Generative Adversarial Networks. However, the majority of existing methods do not consider that the attenuation degrees of the R, G and B channels of an underwater image are different, leading to sub-optimal performance. Based on this observation, we propose a Channel-wise Multi-scale Residual Dense Network, called CMRD-Net, which learns the weights of different color channels instead of treating all channels equally. More specifically, a Channel-wise Multi-scale Fusion Residual Attention Block (CMFRAB) is included in CMRD-Net to obtain a better ability of feature extraction and representation. Notably, we evaluate the effectiveness of our model by comparing it with recent state-of-the-art methods. Extensive experimental results show that our method achieves satisfactory performance on a popular public dataset.
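
To illustrate the general idea of weighting color-related feature channels differently (a minimal sketch, not the CMFRAB block itself; the channel count and reduction ratio are assumptions), a squeeze-and-excitation-style channel attention in PyTorch looks like this:

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Re-weights feature channels so that, e.g., channels tied to heavily
        attenuated colors can receive different emphasis (illustrative only)."""
        def __init__(self, channels=64, reduction=4):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, x):                     # x: (N, C, H, W)
            w = self.fc(x.mean(dim=(2, 3)))       # global average pooling -> per-channel weights
            return x * w.unsqueeze(-1).unsqueeze(-1)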

Deep Multiple Length Hashing via Multi-task Learning

  • Letian Wang
  • Xiushan Nie
  • Quan Zhou
  • Yang Shi
  • Xingbo Liu

Hashing can compress heterogeneous high-dimensional data into compact binary codes. Most existing hashing methods first predetermine a fixed length for the hash code and then train the model based on this fixed length. However, when the task requirements change, these methods need to retrain the model for a new hash code length, which increases the time cost. To address this issue, we propose a deep supervised hashing method, called deep multiple length hashing (DMLH), which can learn hash codes of multiple lengths simultaneously based on a multi-task learning network. DMLH exploits the relationships among code lengths with a hard-parameter-sharing multi-task network. Specifically, in DMLH, the hash codes of different lengths are regarded as different views of the same sample. Furthermore, we introduce a mutual information loss to mine the associations among hash codes of different lengths. Extensive experiments have indicated that DMLH outperforms most existing models, verifying its effectiveness.
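
The hard-parameter-sharing idea can be sketched as one shared backbone with one head per code length; the following minimal PyTorch snippet is illustrative (feature dimension, hidden size and code lengths are assumptions, and the mutual information loss is omitted):

    import torch
    import torch.nn as nn

    class MultiLengthHashNet(nn.Module):
        def __init__(self, feat_dim=2048, code_lengths=(16, 32, 64)):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())   # shared parameters
            self.heads = nn.ModuleList([nn.Linear(512, L) for L in code_lengths])  # one head per length

        def forward(self, x):
            h = self.backbone(x)
            # tanh relaxes the binary constraint during training; sign() would give codes at test time
            return [torch.tanh(head(h)) for head in self.heads]

    codes = MultiLengthHashNet()(torch.randn(4, 2048))
    print([tuple(c.shape) for c in codes])   # [(4, 16), (4, 32), (4, 64)]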

Color Image Denoising via Tensor Robust PCA with Nonconvex and Nonlocal Regularization

  • Xiaoyu Geng
  • Qiang Guo
  • Caiming Zhang

Tensor robust principal component analysis (TRPCA) is an important algorithm for color image denoising; it treats the whole image as a tensor and shrinks all singular values equally. In this paper, to improve the denoising performance of TRPCA, we propose a variant of the TRPCA model. Specifically, we first introduce a nonconvex TRPCA (N-TRPCA) model which shrinks large singular values less and small singular values more, so that the physical meanings of different singular values can be preserved. To take advantage of the structural redundancy of an image, we further group similar patches into a tensor according to a nonlocal prior, and then apply the N-TRPCA model to this tensor. The denoised image is obtained by aggregating all processed tensors. Experimental results demonstrate the superiority of the proposed denoising method over the state of the art.
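
The contrast between uniform and nonconvex-style shrinkage can be illustrated on a single matrix slice; this NumPy sketch (not the paper's tensor solver) uses an inverse-magnitude weighting as a stand-in for a nonconvex surrogate, so larger singular values are shrunk less:

    import numpy as np

    def shrink_singular_values(X, tau, nonconvex=True, eps=1e-6):
        """Soft-threshold the singular values of X; uniform (nuclear-norm style)
        or weighted so that large singular values are penalized less."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        if nonconvex:
            weights = 1.0 / (s + eps)                  # larger singular value -> smaller shrinkage
            s_new = np.maximum(s - tau * weights, 0.0)
        else:
            s_new = np.maximum(s - tau, 0.0)           # equal shrinkage for all singular values
        return (U * s_new) @ Vt

    X = np.random.randn(32, 32)
    print(np.linalg.norm(X - shrink_singular_values(X, tau=1.0)))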

Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features

  • Alberto Baldrati
  • Marco Bertini
  • Tiberio Uricchio
  • Alberto Del Bimbo

Building on the recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants with respect to the reference image, we learn a Combiner network that is able to understand the image content, integrate the textual description and provide a combined feature used to perform the conditioned image retrieval. Starting from the bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network, based on such multimodal features, is extremely effective and outperforms more complex state-of-the-art approaches on the popular FashionIQ dataset.
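
A generic combiner over pre-extracted CLIP features can be sketched as below; this is an illustrative stand-in rather than the paper's exact architecture, and the 512-dimensional feature size, hidden width and residual mixing are assumptions:

    import torch
    import torch.nn as nn

    class Combiner(nn.Module):
        """Fuses a reference-image feature and a text feature into one retrieval feature."""
        def __init__(self, dim=512, hidden=1024):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            self.scale = nn.Parameter(torch.tensor(0.5))   # learnable mix between fused and raw features

        def forward(self, img_feat, txt_feat):
            fused = self.mlp(torch.cat([img_feat, txt_feat], dim=-1))
            out = fused + self.scale * img_feat + (1 - self.scale) * txt_feat
            return nn.functional.normalize(out, dim=-1)    # cosine-similarity retrieval space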

PBNet: Position-specific Text-to-image Generation by Boundary

  • Tian Tian
  • Li Liu
  • Huaxiang Zhang
  • Dongmei Liu

Most existing methods focus on improving the clarity and semantic consistency of the image with respect to a given text, but pay little attention to controlling the content of the generated image, such as the position of the object in the generated image. In this paper, we introduce a novel position-based generative network (PBNet) which can generate fine-grained images with the object at a specified location. PBNet combines an iterative structure with a generative adversarial network (GAN). A location information embedding module (LIEM) is proposed to combine the location information extracted from the boundary block image with the semantic information extracted from the text. In addition, a silhouette generation module (SGM) is proposed to train the generator to generate the object based on the location information. The experimental results on the CUB dataset demonstrate that PBNet effectively controls the location of the object in the generated image.

An Embarrassingly Simple Approach to Discrete Supervised Hashing

  • Shuguang Zhao
  • Bingzhi Chen
  • Zheng Zhang
  • Guangming Lu

Prior hashing works typically learn a projection function from a high-dimensional visual feature space to a low-dimensional latent space. However, such a projection function suffers from several crucial bottlenecks: 1) information loss and coding redundancy are inevitable; 2) the available information of semantic labels is not well explored; 3) the learned latent embedding lacks explicit semantic meaning. To overcome these limitations, we propose a novel supervised Discrete Auto-Encoder Hashing (DAEH) framework, in which a linear auto-encoder effectively projects the semantic labels of images into a latent representation space. Instead of using a visual feature projection, the proposed DAEH framework skillfully explores the semantic information of supervised labels to refine the latent feature embedding and further optimize the hashing function. Meanwhile, we reformulate the objective and relax the discrete constraints of the binary optimization problem. Extensive experiments on the Caltech-256, CIFAR-10, and MNIST datasets demonstrate that our method outperforms state-of-the-art hashing baselines.
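
The idea of running a linear auto-encoder over the label matrix can be illustrated with an SVD-based encoder and a ridge-style decoder; this is only a sketch under those assumptions (it is not the DAEH optimizer, and the latent dimension must not exceed the number of classes):

    import numpy as np

    def label_autoencoder(Y, latent_dim=8, lam=1e-2):
        """Y: (n, c) one-hot / multi-label matrix. Returns latent codes Z and
        linear encoder/decoder weights so that Y is approximately Z @ W_dec."""
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        W_enc = Vt[:latent_dim].T                       # encode: Z = Y @ W_enc
        Z = Y @ W_enc
        W_dec = np.linalg.solve(Z.T @ Z + lam * np.eye(latent_dim), Z.T @ Y)  # decode: Y ~ Z @ W_dec
        return Z, W_enc, W_dec

    Y = np.eye(10)[np.random.randint(0, 10, size=100)]  # toy one-hot labels for 10 classes
    Z, W_enc, W_dec = label_autoencoder(Y)
    print(Z.shape)                                      # (100, 8) latent semantic embedding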

Towards Transferable 3D Adversarial Attack

  • Qiming Lu
  • Shikui Wei
  • Haoyu Chu
  • Yao Zhao

Currently, most adversarial attacks focus on adding perturbations to 2D images. In this way, however, the adversarial attacks cannot easily be applied to a real-world AI system, since it is impossible for the AI system to open an interface to attackers. It is therefore more practical to add perturbations to the surfaces of real-world 3D objects, i.e., 3D adversarial attacks. The key challenges for 3D adversarial attacks are how to effectively deal with viewpoint changes and how to keep strong transferability across different state-of-the-art networks. In this paper, we mainly focus on improving the robustness and transferability of 3D adversarial examples generated by perturbing the surface textures of 3D objects. Towards this end, we propose an effective method, named Momentum Gradient-Filter Sign Method (M-GFSM), to generate 3D adversarial examples. Specifically, momentum is introduced into the generation of 3D adversarial examples, which yields multiview robustness of the 3D adversarial examples and high attack efficiency by updating the perturbation and stabilizing the update directions. In addition, a filter operation is employed to improve the transferability of 3D adversarial examples by selectively filtering gradient images and completing the gradients of pixels neglected due to downsampling in the rendering stage. Experimental results show the effectiveness and good transferability of the proposed method. Besides, we show that the 3D adversarial examples generated by our method remain robust under different illuminations.
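
The momentum component can be sketched as a momentum-based sign update on the texture tensor, in the spirit of momentum iterative attacks; the filtering step and the differentiable renderer that supplies the gradient are omitted, and the step size and decay factor are assumptions:

    import torch

    def momentum_sign_step(texture, grad, momentum, mu=1.0, alpha=1/255):
        """One update step: 'grad' is assumed to be the loss gradient w.r.t. the texture,
        computed elsewhere through a differentiable rendering pipeline."""
        momentum = mu * momentum + grad / (grad.abs().mean() + 1e-12)   # accumulate normalized gradient
        texture = (texture + alpha * momentum.sign()).clamp(0, 1)      # perturb, keep valid colors
        return texture, momentum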

Delay-sensitive and Priority-aware Transmission Control for Real-time Multimedia Communications

  • Ximing Wu
  • Lei Zhang
  • Yingfeng Wu
  • Haobin Zhou
  • Laizhong Cui

Today’s multimedia applications usually organize their contents into data blocks with different deadlines and priorities. Meeting or missing the deadline for different data blocks may contribute to or hurt the user experience to different degrees. With the goal of optimizing real-time multimedia communications, the transmission control scheme needs to make two challenging decisions: the proper sending rate and the best data block to send under dynamic network conditions. In this paper, we propose a delay-sensitive and priority-aware transmission control scheme with two modules, namely rate control and block selection. The rate control module constantly monitors the network condition and adjusts the sending rate accordingly. The block selection module classifies the blocks based on whether they are estimated to be delivered before their deadlines and then ranks them according to their effective priority scores. Extensive simulation results demonstrate the superiority of our proposed scheme over other representative baseline approaches.
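
A minimal sketch of deadline-aware, priority-aware block selection is shown below; the field names and the effective-priority formula are illustrative assumptions, not the paper's interface:

    def select_block(blocks, now, est_rate):
        """blocks: list of dicts with 'size' (bytes), 'deadline' (s), 'priority' (higher is better).
        est_rate: estimated sending rate in bytes/s from the rate control module."""
        deliverable = [b for b in blocks
                       if now + b['size'] / est_rate <= b['deadline']]   # estimated to arrive in time
        if not deliverable:
            return None
        # rank by an effective priority score: priority per unit of remaining slack
        return max(deliverable,
                   key=lambda b: b['priority'] / max(b['deadline'] - now, 1e-3))

    blocks = [{'size': 50_000, 'deadline': 1.2, 'priority': 3},
              {'size': 20_000, 'deadline': 0.4, 'priority': 5}]
    print(select_block(blocks, now=0.0, est_rate=200_000))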

Impression of a Job Interview training agent that gives rationalized feedback: Should Virtual Agent Give Advice with Rationale?

  • Nao Takeuchi
  • Tomoko Koda

The COVID-19 pandemic has had a significant socio-economic impact on the world. Specifically, social distancing has affected many activities that were previously conducted face-to-face. One of these is the training that students receive for job interviews. Thus, we developed a job interview training system that allows students to continue receiving this type of training. Our system recognizes the nonverbal behaviors of an interviewee, namely gaze, facial expression, and posture, and compares the recognition results with models of exemplary interviewee nonverbal behaviors. A virtual agent acting as an advisor gives feedback on the interviewee's behaviors that need improvement. In order to verify the effectiveness of two kinds of feedback, namely rationalized feedback (with quantitative recognition results) and non-rationalized feedback, we compared interviewees’ impressions. The results of the evaluation experiment indicated that the virtual agent with rationalized feedback was rated as more reliable but less friendly than the one with non-rationalized feedback.

SESSION: Demo Papers

An Efficient Bus Crowdedness Classification System

  • Lingcan Meng
  • Xiushan Nie
  • Zhifang Tan

We propose an efficient bus crowdedness classification system that can be used in daily life. In particular, we analyze and study data collected from real buses, aiming to deal with the difficulty of bus congestion classification. Besides, we combine deep learning and computer vision technology to process images and videos from the internal surveillance cameras of buses. The crowd information is finally integrated with the algorithms into a complete classification system. As a consequence, when the user enters the system and submits an image or video to be analyzed, the system displays the classification results in turn. The classification results include the passenger density distribution, the number of passengers, the date, and the algorithm running time. In addition, the user can use the mouse to delineate an area in the passenger density distribution map and count the passengers in any image area.

Private-Share: A Secure and Privacy-Preserving De-Centralized Framework for Large Scale Data Sharing

  • Arun Zachariah
  • Maha Alrasheed

The various data and privacy regulations introduced around the globe require data to be stored in a secure and privacy-preserving fashion, and non-compliance with these regulations comes with major consequences. This has led to the formation of huge data silos within organizations, making data analysis difficult and increasing the risk of a data breach. Isolating data also prevents collaborative research. To address this, we present Private-Share, a framework that enables secure sharing of large-scale data. In order to achieve this goal, Private-Share leverages recent advances in blockchain technology, specifically the InterPlanetary File System and Ethereum.

RoadAtlas: Intelligent Platform for Automated Road Defect Detection and Asset Management

  • Zhuoxiao Chen
  • Yiyun Zhang
  • Yadan Luo
  • Zijian Wang
  • Jinjiang Zhong
  • Anthony Southon

With the rapid development of intelligent detection algorithms based on deep learning, much progress has been made in automatic road defect recognition and road marking parsing. This can effectively address the expensive and time-consuming process of professional inspectors reviewing streets manually. Towards this goal, we present RoadAtlas, a novel end-to-end integrated system that supports 1) road defect detection, 2) road marking parsing, 3) a web-based dashboard for presenting and inputting data by users, and 4) a backend containing a well-structured database and developed APIs.

SESSION: Applied Research Papers

Goldeye: Enhanced Spatial Awareness for the Visually Impaired using Mixed Reality and Vibrotactile Feedback

  • Jun Yao Francis Lee
  • Narayanan Rajeev
  • Anand Bhojan

One in six people has some form of visual impairment, ranging from mild vision loss to total blindness. The visually impaired constantly face the danger of walking into people or hazardous objects. This paper proposes the use of vibrotactile feedback as an obstacle detection system for visually impaired users. We utilize a mixed reality headset with on-board depth sensors to build a digital map of the real world, and a suit with an array of actuators to indicate to the visually impaired the positions of obstacles around them. This is demonstrated with a simple prototype built using commercially available devices (Microsoft HoloLens and bHaptics Tactot), and a qualitative user study was conducted to evaluate the viability of the proposed system. Through user testing performed on subjects with simulated visual impairments, our results affirm the potential of using mixed reality to detect obstacles in the environment while transmitting only essential information through the haptic suit due to its limited bandwidth.

Convolutional Neural Network-Based Pure Paint Pigment Identification Using Hyperspectral Images

  • Ailin Chen
  • Rui Jesus
  • Marcia Vilarigues

This research presents the results of applying deep learning neural networks to the identification of pure pigments in heritage artwork, namely paintings. Our paper applies an innovative three-branch deep learning model to maximise the correct identification of pure pigments. The proposed model combines the feature maps obtained from hyperspectral images through multiple convolutional neural networks with numerical hyperspectral metric data computed with respect to a set of reference reflectances. The results obtained exhibit an accurate identification of the predicted pure pigments, which is confirmed through the use of analytical techniques. The model presented outperformed the compared counterparts and is deemed to be an important direction, not only in terms of the utilisation of hyperspectral data and concrete pigment data in heritage analysis, but also in the application of deep learning to other fields.

CFCR: A Convolution and Fusion Model for Cross-platform Recommendation

  • Shengze Yu
  • Xin Wang
  • Wenwu Zhu

With the emergence of various online platforms, associating different platforms is playing an increasingly important role in many applications. Cross-platform recommendation aims to improve recommendation accuracy by associating information from different platforms. Existing methods do not fully exploit high-order nonlinear connectivity information in the cross-domain recommendation scenario and suffer from the domain-incompatibility problem. In this paper, we propose an end-to-end convolution and fusion model for cross-platform recommendation (CFCR). The proposed CFCR model utilizes Graph Convolutional Networks (GCN) to extract user and item features from graphs of different platforms, and fuses cross-platform information with a Multimodal AutoEncoder (MAE) using common latent user features. Therefore, the high-order connectivity information is preserved to the greatest extent and domain-invariant user representations are automatically obtained. The domain-incompatible information is discarded to avoid contaminating the cross-platform association. Extensive experiments for the proposed CFCR model on real-world datasets demonstrate its advantages over existing cross-platform recommendation methods in terms of various evaluation metrics.

SESSION: Brave New Ideas

SangeetXML: An XML Format for Score Retrieval for Indic Music

  • Chandan Misra

Efficient retrieval of score information from a large set of XML-encoded scores and lyrics in an XML database requires such music data to be stored in a well-structured and systematic manner. Current search engines for Indic music (Tagore songs in the present context) retrieve only metadata and lack score and lyric retrieval schemes. Being vastly different from its western counterpart, an Indic music piece needs to be encoded differently from the XML formats used for western music, such as MusicXML. Such encoding requires a proper understanding of the structure of the music sheet and its careful implementation in XML. In this paper, we propose the development of an XML-based format, SangeetXML, for exchanging and retrieving Indic music information based on a theoretical 2D matrix model, Swaralipi. We implement SangeetXML by formatting a sample of Rabindra Sangeet (read Tagore Songs in English) compositions and highlight the feasibility of an easy and quick retrieval system based on SangeetXML through XQuery, the de-facto standard for querying XML-encoded data.

Holodeck: Immersive 3D Displays Using Swarms of Flying Light Specks [Extended Abstract]

  • Shahram Ghandeharizadeh

Unmanned Aerial Vehicles (UAVs) have moved beyond a platform for hobbyists to enable environmental monitoring, journalism, film industry, search and rescue, package delivery, and entertainment. This paper describes 3D displays using swarms of flying light specks, FLSs. An FLS is a small (hundreds of micrometers in size) UAV with one or more light sources to generate different colors and textures with adjustable brightness. A synchronized swarm of FLSs renders an illumination in a pre-specified 3D volume, an FLS display. An FLS display provides true depth, enabling a user to perceive a scene more completely by analyzing its illumination from different angles.

An FLS display may either be non-immersive or immersive. Both will support 3D acoustics. Non-immersive FLS displays may be the size of a 1980’s computer monitor, enabling a surgical team to observe and control micro robots performing heart surgery inside a patient’s body. Immersive FLS displays may be the size of a room, enabling users to interact with objects, e.g., a rock, a teapot. An object with behavior will be constructed using FLS-matters. FLS-matter will enable a user to touch and manipulate an object, e.g., a user may pick up a teapot or throw a rock. An immersive and interactive FLS display will approximate Star Trek’s holodeck.

A successful realization of the research ideas presented in this paper will provide fundamental insights into implementing a holodeck using swarms of FLSs. A holodeck will transform the future of human communication and perception, and how we interact with information and data. It will revolutionize the future of how we work, learn, play and entertain, receive medical care, and socialize.

Discovering Social Connections using Event Images

  • Ming Cheung
  • Weiwei Sun
  • Jiantao Zhou

Social events are very common activities where people can interact with each other. During an event, the organizer often hires photographers to take images, which provide rich information about the participants’ behaviour. In this work, we propose a method to discover the social graphs among event participants from event images for social network analytics. By studying over 94 events with 32,330 event images, we show that social graphs can be effectively extracted solely from event images. We find that the discovered social graphs follow properties similar to those of online social graphs; for instance, the degree distribution obeys a power law. The usefulness of the proposed method for social graph discovery from event images is demonstrated through two applications: important participant detection and community detection. To the best of our knowledge, this is the first work to show the feasibility of discovering social graphs by utilizing event images only. As a result, social network analytics such as recommendation become possible, even without access to the online social graph.

SESSION: Grand Challenge

Hybrid Improvements in Multimodal Analysis for Deep Video Understanding

  • Beibei Zhang
  • Fan Yu
  • Yaqun Fang
  • Tongwei Ren
  • Gangshan Wu

The Deep Video Understanding Challenge (DVU) is a task that focuses on comprehending long-duration videos involving many entities. Its main goal is to build a knowledge graph of relationships and interactions between entities to answer relevant questions. In this paper, we improve the joint learning method we previously proposed in many aspects, including few-shot learning, optical flow features, entity recognition, and video description matching. We verify the effectiveness of these measures through experiments.

SESSION: W1: Visual Tasks and Challenges under Low-quality Multimedia Data

Local-enhanced Multi-resolution Representation Learning for Vehicle Re-identification

  • Jun Zhang
  • Xian Zhong
  • Jingling Yuan
  • Shilei Zhao
  • Rongbo Zhang
  • Duxiu Feng
  • Luo Zhong

In real traffic scenarios, the vehicle resolution captured by a camera tends to change considerably with the distance to the vehicle, the viewing direction, and the height of the camera. When a resolution difference exists between the probe and the gallery vehicle, a resolution mismatch occurs, which seriously degrades the performance of vehicle re-identification (Re-ID). This problem is also known as multi-resolution vehicle Re-ID. An effective strategy is to utilize image super-resolution to handle the resolution gap. However, existing methods conduct super-resolution on global images instead of local representations of each image, leading to much more noisy information generated from the background and illumination variations. In our work, a local-enhanced multi-resolution representation learning (LMRL) framework is therefore proposed to address these problems by jointly training a local-enhanced super-resolution (LSR) module and a local-guided contrastive learning (LCL) module. Specifically, we use a parsing network to parse a vehicle into four different parts to extract local-enhanced vehicle representations. Then, the LSR module, which consists of two auto-encoders that share parameters, transforms low-resolution images into high-resolution ones in both the global and local branches. The LCL module learns discriminative vehicle representations by contrasting local representations between the high-resolution reconstructed image and the ground truth. We evaluate our approach on two public datasets that contain vehicle images at a wide range of resolutions, on which our approach shows significant superiority over existing solutions.

Dedark+Detection: A Hybrid Scheme for Object Detection under Low-light Surveillance

  • Xiaolei Luo
  • Sen Xiang
  • Yingfeng Wang
  • Qiong Liu
  • You Yang
  • Kejun Wu

Object detection under low-light surveillance is a crucial problem on which few efforts have been made. In this paper, we propose a hybrid method that jointly uses enhancement and object detection for this challenge, namely Dedark+Detection. In this method, the low-light surveillance video is processed by the proposed de-dark method, so that the video is converted to its appearance under normal lighting conditions. This enhancement brings benefits to the subsequent object detection stage. After that, an object detection network is trained on the enhanced dataset for practical applications under low-light surveillance. Experiments are performed on 18 low-light surveillance video test sequences, and superior performance is observed in comparison with the state of the art.

Making Video Recognition Models Robust to Common Corruptions With Supervised Contrastive Learning

  • Tomu Hirata
  • Yusuke Mukuta
  • Tatsuya Harada

The video understanding capability of video recognition models has been significantly improved by the development of deep learning techniques and the various video datasets available. However, video recognition models are still vulnerable to invisible perturbations, which limits the use of deep video recognition models in the real world. We present a new benchmark for the robustness of action recognition classifiers to common corruptions, and show that a supervised contrastive learning framework is effective in obtaining discriminative and stable video representations and makes deep video recognition models robust to common input corruptions. Experiments on the action recognition task for corrupted videos show the high robustness of the proposed method on the UCF101 and HMDB51 datasets with various common corruptions.
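
For reference, a generic supervised contrastive loss over clip embeddings can be sketched as follows; this is a minimal illustration (the video encoder, augmentations and temperature value are assumptions), not the authors' training code:

    import torch

    def sup_con_loss(features, labels, temperature=0.1):
        """features: (N, D) L2-normalized clip embeddings; labels: (N,) action classes.
        Pulls together clips that share a label, pushes apart the rest."""
        sim = features @ features.t() / temperature                    # pairwise similarities
        pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float() # positives share a label
        self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
        sim = sim.masked_fill(self_mask, -1e9)                         # exclude self-pairs
        pos_mask = pos_mask.masked_fill(self_mask, 0.0)
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        mean_pos = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
        return -mean_pos.mean()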

Visible-Infrared Cross-Modal Person Re-identification based on Positive Feedback

  • Lingyi Lu
  • Xin Xu

Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality person retrieval task that has been receiving increasing attention. Compared to traditional person ReID, which focuses on person images in a single RGB modality, VI-ReID suffers from an additional cross-modality discrepancy due to the different imaging processes of spectrum cameras. Several effective attempts have been made in recent years to narrow the cross-modality gap and improve re-identification performance, but the key problem of optimizing the search results with relevance feedback is rarely studied. In this paper, we present the idea of cross-modality visible-infrared person re-identification combined with human positive feedback. This method allows the user to quickly optimize the search performance by selecting strong positive samples during the re-identification process. We have validated the effectiveness of our method on a public dataset, SYSU-MM01, and the results confirm that the proposed method achieves superior performance compared to current state-of-the-art methods.

SESSION: W2: Multi-modal Embedding and Understanding

Focusing Attention across Multiple Images for Multimodal Event Detection

  • Yangyang Li
  • Jun Li
  • Hao Jin
  • Liang Peng

Multimodal social event detection has been attracting tremendous research attention in recent years, because it provides a comprehensive and complementary understanding of social events and is important to public security and administration. Most existing works focus on the fusion of multimodal information, especially for single image and text fusion. Such single image-text pair processing breaks the correlations between images of the same post and may affect the accuracy of event detection. In this work, we propose to focus attention across multiple images for multimodal event detection, which is also more reasonable for tweets with short text and multiple images. Towards this end, we elaborate a novel Multi-Image Focusing Network (MIFN) to connect text content with visual aspects in multiple images. Our MIFN consists of a feature extractor, a multi-focal network and an event classifier. The multi-focal network implements focal attention across all the images and fuses the most related regions with the texts as a multimodal representation. The event classifier finally predicts the social event class based on the multimodal representations. To evaluate the effectiveness of our proposed approach, we conduct extensive experiments on a commonly used disaster dataset. The experimental results demonstrate that, in both the humanitarian event detection task and its hurricane disaster variant, the proposed MIFN outperforms all the baselines. The ablation studies also demonstrate the ability to filter irrelevant regions across images, which improves the accuracy of multimodal event detection.

Adaptive Cross-stitch Graph Convolutional Networks

  • Zehui Hu
  • Zidong Su
  • Yangding Li
  • Junbo Ma

Graph convolutional networks (GCN) have been widely used to process graph and network data. However, recent research shows that existing graph convolutional networks have issues when integrating node features and topology structure. To remedy this weakness, we propose a new GCN architecture. Firstly, the proposed architecture introduces cross-stitch networks into GCN with improved cross-stitch units. The cross-stitch networks spread information between node features and topology structure, and obtain a consistent learned representation by integrating information from node features and topology structure at the same time. Therefore, the proposed model can capture information from various channels through multiple channels. Secondly, an attention mechanism is used to further extract the most relevant information between channel embeddings. Experiments on six benchmark datasets show that our method outperforms all comparison methods on different evaluation indicators.
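
The basic cross-stitch unit can be illustrated as a learnable 2x2 mixing matrix that exchanges information between two parallel branch representations (e.g., a feature-based and a topology-based GCN branch); this minimal PyTorch sketch is illustrative and not the improved unit of the paper:

    import torch
    import torch.nn as nn

    class CrossStitch(nn.Module):
        def __init__(self):
            super().__init__()
            # initialized close to identity so each branch initially keeps its own information
            self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1], [0.1, 0.9]]))

        def forward(self, h_feat, h_topo):
            mixed_feat = self.alpha[0, 0] * h_feat + self.alpha[0, 1] * h_topo
            mixed_topo = self.alpha[1, 0] * h_feat + self.alpha[1, 1] * h_topo
            return mixed_feat, mixed_topo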

Generation of Variable-Length Time Series from Text using Dynamic Time Warping-Based Method

  • Ayaka Ideno
  • Yusuke Mukuta
  • Tatsuya Harada

This study aims to find a suitable method for generating time-series data, such as video clips or avatar motions, from text stating multiple events. This paper addresses the generation of variable-length time-series data considering the order and variable duration of the events stated in the text. Although a variant of the Mean Squared Error (MSE) is a common training loss, it only considers the gap between the ground-truth (GT) data and the generated data at the same time step. Thus, variants of MSE are unsuitable for the task at hand, because the loss may not be small for generated and GT data with the same order of events if the times of the events do not overlap. To solve this problem, we propose a Dynamic Time Warping-Like method for Variable-Length data (DTWL-VL), which determines the corresponding elements of the GT and the generated data, allowing for a time difference between them, and brings them closer. We compared DTWL-VL, an MSE variant, and an existing method for time-series data generation which considers the time difference between corresponding parts of the GT and generated data. Since the existing method is aimed at generating fixed-length data, we extended it to generate variable-length time-series data. We conducted experiments using a dataset prepared for this study. Both DTWL-VL and the existing method outperformed the MSE variant. Moreover, although the existing method outperformed DTWL-VL under certain settings, DTWL-VL required a shorter training period.
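
The underlying alignment idea (match elements across sequences of different lengths instead of comparing only same-index elements) can be illustrated with a classic DTW dynamic program; this is a sketch of the alignment cost only, not the DTWL-VL loss, and in practice a differentiable soft variant would be used for training:

    import torch

    def dtw_cost(gen, gt):
        """gen: (T1, D) generated sequence, gt: (T2, D) ground truth; returns a length-normalized
        alignment cost that stays small when events match but are shifted in time."""
        T1, T2 = gen.size(0), gt.size(0)
        dist = torch.cdist(gen.unsqueeze(0), gt.unsqueeze(0)).squeeze(0)   # (T1, T2) pairwise costs
        acc = torch.full((T1 + 1, T2 + 1), float('inf'))
        acc[0, 0] = 0.0
        for i in range(1, T1 + 1):
            for j in range(1, T2 + 1):
                acc[i, j] = dist[i - 1, j - 1] + torch.min(
                    torch.stack([acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]))
        return acc[T1, T2] / (T1 + T2)

    print(dtw_cost(torch.randn(12, 8), torch.randn(20, 8)))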

Hierarchical Graph Representation Learning with Local Capsule Pooling

  • Zidong Su
  • Zehui Hu
  • Yangding Li

Hierarchical graph pooling has shown great potential for capturing high-quality graph representations through the node cluster selection mechanism. However, current node cluster selection methods suffer from inadequate clustering, and their scoring methods rely too much on the node representation, resulting in excessive loss of graph structure information during pooling. In this paper, a local capsule pooling network (LCPN) is proposed to alleviate these issues. Specifically, (i) a local capsule pooling (LCP) is proposed to alleviate the issue of insufficient clustering; (ii) a task-aware readout (TAR) mechanism is proposed to obtain a more expressive graph representation; (iii) a pooling information loss (PIL) term is proposed to further alleviate the information loss caused by pooling during training. Experimental results on the graph classification task, the graph reconstruction task, and the pooled graph adjacency visualization task show the superior performance of the proposed LCPN and demonstrate its effectiveness and efficiency.

Deep Adaptive Attention Triple Hashing

  • Yang Shi
  • Xiushan Nie
  • Quan Zhou
  • Li Zou
  • Yilong Yin

Recent studies have verified that learning compact hash codes can facilitate big data retrieval processing. In particular, learning a deep hash function can greatly improve retrieval performance. However, existing deep supervised hashing algorithms treat all samples in the same way, which leads to insufficient learning of difficult samples. Therefore, the similarity relations cannot be learned accurately, making it difficult to achieve satisfactory performance. In light of this, this work proposes a deep supervised hashing model, called deep adaptive attention triple hashing (DAATH), which weights the similarity prediction scores of positive and negative samples in the form of triplets, thus giving different degrees of attention to different samples. Compared with the traditional triplet loss, it places greater emphasis on difficult triplets, dramatically reducing redundant computation. Extensive experiments show that DAATH consistently outperforms the state of the art, confirming its effectiveness.
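
The general idea of giving difficult triplets more attention can be sketched as a weighted triplet loss on hash-like embeddings; the hardness weighting below is an illustrative assumption, not the exact DAATH formulation:

    import torch
    import torch.nn.functional as F

    def weighted_triplet_loss(anchor, pos, neg, margin=0.5):
        """anchor/pos/neg: (N, D) relaxed hash embeddings (e.g., tanh outputs)."""
        d_pos = F.pairwise_distance(anchor, pos)
        d_neg = F.pairwise_distance(anchor, neg)
        hardness = torch.sigmoid(d_pos - d_neg)           # close to 1 for difficult triplets
        loss = F.relu(d_pos - d_neg + margin)             # standard triplet margin term
        return (hardness.detach() * loss).mean()          # harder triplets contribute more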

SESSION: W3: Multi-model Computing of Marine Big Data

Deep Reinforcement Learning and Docking Simulations for autonomous molecule generation in de novo Drug Design

  • Hao Liu
  • Qian Wang
  • Xiaotong Hu

In medicinal chemistry programs, it is key to design and make compounds that are efficacious and safe. In this study, we developed a new deep reinforcement learning-based molecule generation method. Chemical space is impractically large, and many existing generation models generate molecules that lack effectiveness and novelty and have unsatisfactory molecular properties. Our proposed method, DeepRLDS, which integrates a transformer network, balanced binary tree search and docking simulation based on super large-scale supercomputing, can solve these problems well. Experiments show that more than 96% of the generated molecules are chemically valid, 99% are chemically novel, and the generated molecules have satisfactory molecular properties and a broader chemical space distribution.

Joint label refinement and contrastive learning with hybrid memory for Unsupervised Marine Object Re-Identification

  • Xiaorui Han
  • Zhiqi Chen
  • Ruixue Wang
  • Pengfei Zhao

Unsupervised object re-identification is a challenging task due to the lack of labels for the dataset. Many unsupervised object re-identification approaches combine clustering-based pseudo-label prediction with feature fine-tuning. These methods have achieved great success in the field of unsupervised object Re-ID. However, the inevitable label noise caused by the clustering procedure is ignored. Such noisy pseudo labels substantially hinder the model’s capability to further improve feature representations. To this end, we propose a novel joint label refinement and contrastive learning framework with a hybrid memory to alleviate this problem. Firstly, to reduce the noise of clustering pseudo labels, we propose a novel noise refinement strategy. This strategy refines pseudo labels at the clustering phase and promotes clustering quality by boosting label purity. In addition, we propose a hybrid memory bank. The hybrid memory dynamically generates prototype-level and un-clustered instance-level supervisory signals for learning feature representations. With all prototype-level and un-clustered instance-level supervision, the re-identification model is trained progressively. Our proposed unsupervised object Re-ID framework significantly reduces the influence of noisy labels and refines the learned features. Our method consistently achieves state-of-the-art performance on benchmark datasets.

Prediction of Transcription Factor Binding Sites Using Deep Learning Combined with DNA Sequences and Shape Feature Data

  • Yangyang Li
  • Jie Liu
  • Hao Liu

Knowing transcription factor binding sites (TFBSs) is essential for modeling the underlying binding mechanisms and cellular functions. Studies have shown that, in addition to the DNA sequence, the shape information of DNA is also an important factor affecting its activity. Here, we developed a CNN model that integrates 3D DNA shape information derived using a high-throughput method for predicting TFBSs. We identify the best performing architectures by varying the CNN window size, kernels, hidden nodes and hidden layers. The performance of the two types of data and their combination was evaluated using 69 different ChIP-seq [1] experiments. Our results show that the model integrating shape and sequence information compares favorably to the sequence-based model. This work combines knowledge from structural biology and genomics, and DNA shape features improve the description of TF binding specificity.

A Reinforcement Learning-Based Reward Mechanism for Molecule Generation that Introduces Activity Information

  • Hao Liu
  • Jinmeng Yan
  • Yuandong Zhou

In this paper, we propose an activity prediction method for molecule generation based on the framework of reinforcement learning. The method is used as a scoring module for the molecule generation process. By introducing information about known active molecules for a specific set of target conformations, it goes beyond the traditional molecular optimization strategy that uses only computable properties. As a result, our prediction method improves the quality of the generated molecules. The prediction method utilizes fusion features that combine traditional computable properties of molecules, such as atomic number, with the binding property of the molecule to the target. Furthermore, this paper designs an ultra-large-scale molecular docking parallel computing method, which greatly improves the performance of the molecular docking [1] scoring process and makes high-quality docking computation for predicting molecular activity possible. The final experimental results show that the molecule generation model using the prediction method can produce nearly twenty percent active molecules, which shows that the method proposed in this paper can effectively improve the performance of molecule generation.

A Fine-Grained River Ice Semantic Segmentation based on Attentive Features and Enhancing Feature Fusion

  • Rui Wang
  • Chengyu Zheng
  • Yanru Jiang
  • Zhaoxin Wang
  • Min Ye
  • Chenglong Wang
  • Ning Song
  • Jie Nie

The semantic segmentation of frazil ice and anchor ice is of great significance for river management, ship navigation, and ice hazard forecasting in cold regions. In particular, distinguishing frazil ice from sediment-carrying anchor ice can increase the estimation accuracy of the sediment transportation capacity of the river. Although river ice semantic segmentation methods based on deep learning have achieved good prediction accuracy, the problem of insufficient feature extraction remains. To address this problem, we propose a Fine-Grained River Ice Semantic Segmentation (FGRIS) method based on attentive features and enhanced feature fusion to deal with these challenges. First, we propose a Dual-Attention Mechanism (DAM) method, which uses a combination of channel attention features and position attention features to extract more comprehensive semantic features. Then, we propose a novel Branch Feature Fusion (BFF) module to bridge the semantic gap between high-level and low-level semantic features, which is robust to different scales. Experimental results conducted on the Alberta River Ice Segmentation Dataset demonstrate the superiority of the proposed method.

Multi-Scale Graph Convolutional Network and Dynamic Iterative Class Loss for Ship Segmentation in Remote Sensing Images

  • Yanru Jiang
  • Chengyu Zheng
  • Zhaoxin Wang
  • Rui Wang
  • Min Ye
  • Chenglong Wang
  • Ning Song
  • Jie Nie

The accuracy of ship semantic segmentation results is of great significance to coastline navigation, resource management, and territorial protection. Although ship semantic segmentation methods based on deep learning have made great progress, the correlations between targets remain largely unexplored. To avoid the above problems, this paper designs a multi-scale graph convolutional network and a dynamic iterative class loss for ship segmentation in remote sensing images to generate more accurate segmentation results. Based on DeepLabv3+, our network uses deep convolutional networks and atrous convolutions for multi-scale feature extraction. In particular, for multi-scale semantic features, we propose to construct a Multi-Scale Graph Convolution Network (MSGCN) to introduce semantic correlation information into pixel feature learning via GCN, which enhances the segmentation of ship objects. In addition, we propose a Dynamic Iterative Class Loss (DICL) based on iterative batch-wise class rectification instead of pre-computing fixed weights over the whole dataset, which addresses the imbalance between positive and negative samples. We compared the proposed algorithm with the most advanced deep learning object detection and ship detection methods and demonstrated the superiority of our method. On a High-Resolution SAR Images Dataset [1], ship detection and instance segmentation are performed well.
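
The batch-wise class rectification idea (re-estimating class weights from the current batch instead of pre-computing them over the whole dataset) can be sketched as follows; the smoothing scheme and parameter values are illustrative assumptions, not the exact DICL definition:

    import torch
    import torch.nn.functional as F

    def dynamic_class_loss(logits, targets, num_classes=2, momentum=0.9, running_freq=None):
        """logits: (N, C, H, W) segmentation scores; targets: (N, H, W) class indices.
        Class weights are derived from inverse batch frequencies and smoothed over iterations."""
        freq = torch.bincount(targets.flatten(), minlength=num_classes).float()
        freq = freq / freq.sum().clamp(min=1)
        if running_freq is not None:                       # rectify iteratively across batches
            freq = momentum * running_freq + (1 - momentum) * freq
        weights = 1.0 / (freq + 1e-6)
        weights = weights / weights.sum() * num_classes    # normalize so weights average to 1
        return F.cross_entropy(logits, targets, weight=weights), freq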