MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia

A treatment engine by multimodal EMR data

  • Zhaomeng Huang
  • Liyan Zhang
  • Xu Xu

In recent years, with the development of electronic medical record (EMR) systems, it has become possible to mine patients' clinical data to improve the quality of medical care. After a treatment engine learns knowledge from EMR data, it can automatically recommend the next stage of prescriptions and provide treatment guidelines for doctors and patients. However, this task is always challenged by the multi-modality of EMR data. To more effectively predict the next stage of treatment prescription by using multimodal information and the connections between modalities, we propose a cross-modal shared-specific feature complementary generation and attention fusion algorithm. In the feature extraction stage, specific information and shared information are obtained through a shared-specific feature extraction network. To capture the correlation between modalities, we propose a sorting network. In the multimodal feature fusion stage, we use an attention fusion network to assign different weights to the multimodal features at different stages and obtain a more accurate patient representation. Considering the redundancy between modality-specific and shared information, we introduce a complementary feature learning strategy, including modality adaptation for shared features, project adversarial learning for specific features, and reconstruction enhancement. Experimental results on the real EMR dataset MIMIC-III demonstrate the superiority of the proposed method and the effectiveness of each of its components.

Storyboard relational model for group activity recognition

  • Boning Li
  • Xiangbo Shu
  • Rui Yan

This work concerns how to effectively recognize a group activity performed collectively by multiple persons. As is known, Storyboards (i.e., medium shots, close shots) jointly describe the whole storyline of a movie in a compact way. Likewise, the actors in small subgroups (similar to Storyboards) of a group activity scene contribute substantially to the group activity and develop more compact relationships within their subgroups. Inspired by this, we propose a Storyboard Relational Model (SRM) to address the problem of group activity recognition by splitting and reintegrating the group activity based on the small yet compact Storyboards. SRM mainly consists of a Pose-Guided Pruning (PGP) module and a Dual Graph Convolutional Networks (Dual-GCN) module. Specifically, PGP is designed to refine a series of Storyboards from the group activity scene by leveraging the attention ranges of individuals. Dual-GCN models the compact relationships among actors within a Storyboard. Experimental results on two widely used datasets illustrate the effectiveness of the proposed SRM compared with state-of-the-art methods.

Distilling knowledge in causal inference for unbiased visual question answering

  • Yonghua Pan
  • Zechao Li
  • Liyan Zhang
  • Jinhui Tang

Current Visual Question Answering (VQA) models mainly exploit the statistical correlations between answers and questions, and thus fail to capture the relationship between the visual information and the answers. Their performance drops dramatically when the distribution of the handled data differs from that of the training data. Towards this end, this paper proposes a novel unbiased VQA model that explores Causal Inference with Knowledge Distillation (CIKD) to reduce the influence of bias. Specifically, a causal graph is first constructed to explore counterfactual causality and infer the causal target based on the causal effect, which effectively reduces the bias from questions and obtains answers without training. Knowledge distillation is then leveraged to transfer the knowledge of the inferred causal target to a conventional VQA model, which enables the proposed method to handle both biased data and standard data. To address the problem of harmful bias introduced by knowledge distillation, ensemble learning is introduced based on the hypothesized bias reason. Experiments are conducted to show the performance of the proposed method. The significant improvements over state-of-the-art methods on the VQA-CP v2 dataset validate the contributions of this work.
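
As a rough illustration of the distillation step described above (not the authors' code), the sketch below blends a soft target distribution from a hypothetical debiased teacher with the usual cross-entropy loss of a student VQA model; the tensor sizes, temperature and loss weight are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, labels, temperature=2.0, alpha=0.5):
    """Blend a soft KL term (teacher -> student) with the usual cross-entropy.

    student_logits: (batch, num_answers) raw scores from the VQA model.
    teacher_probs:  (batch, num_answers) answer distribution from the
                    (hypothetical) causal-inference teacher.
    labels:         (batch,) ground-truth answer indices.
    """
    # Soften the student distribution with the same temperature as the teacher.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_loss = F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Example usage with random tensors.
logits = torch.randn(4, 3000)                 # student VQA scores
teacher = torch.softmax(torch.randn(4, 3000), dim=1)
labels = torch.randint(0, 3000, (4,))
print(distillation_loss(logits, teacher, labels).item())
```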

Incremental multi-view object detection from a moving camera

  • Takashi Konno
  • Ayako Amma
  • Asako Kanezaki

Object detection in a single image is a challenging problem due to clutter, occlusions, and a large variety of viewing locations. This task can benefit from integrating multi-frame information captured by a moving camera. In this paper, we propose a method to incrementally accumulate object detection scores extracted from multiple frames captured from different viewpoints. For each frame, we run an efficient end-to-end object detector that outputs object bounding boxes, each of which is associated with category and pose scores. The scores of detected objects are then stored in grid locations in 3D space. After observing multiple frames, the object scores stored in each grid location are integrated based on the best object pose hypothesis. This strategy requires consistency of object categories and poses among multiple frames, and thus significantly suppresses misdetections. The performance of the proposed method is evaluated on our newly created multi-class object dataset captured in robot simulation and real environments, as well as on a public benchmark dataset.
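
The score-accumulation idea can be pictured with a small sketch of our own (not the authors' implementation): detections back-projected to world coordinates are summed into a coarse voxel grid per category; the grid resolution, voxel size and class count below are invented for illustration.

```python
import numpy as np

GRID = (20, 20, 10)        # x, y, z cells (illustrative)
NUM_CLASSES = 5

score_grid = np.zeros(GRID + (NUM_CLASSES,), dtype=np.float32)
hit_count = np.zeros(GRID, dtype=np.int32)

def register_detection(world_xyz, class_scores, voxel_size=0.25):
    """Add one detection (already back-projected to world coordinates)."""
    idx = tuple(np.clip((np.asarray(world_xyz) / voxel_size).astype(int), 0,
                        np.array(GRID) - 1))
    score_grid[idx] += class_scores       # accumulate per-category evidence
    hit_count[idx] += 1

# Simulate detections of the same object seen from three viewpoints.
rng = np.random.default_rng(0)
for _ in range(3):
    noisy_pos = np.array([1.0, 1.0, 0.5]) + rng.normal(0, 0.05, 3)
    register_detection(noisy_pos, rng.dirichlet(np.ones(NUM_CLASSES)))

occupied = hit_count > 0
best_class = score_grid[occupied].argmax(axis=-1)   # read out integrated scores
print("occupied cells:", occupied.sum(), "predicted classes:", best_class)
```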

An automated method with anchor-free detection and U-shaped segmentation for nuclei instance segmentation

  • Xuan Feng
  • Lijuan Duan
  • Jie Chen

Nuclei segmentation plays an important role in cancer diagnosis. Automated methods for digital pathology have become popular thanks to the development of deep learning and neural networks. However, this task still faces challenges. Most current techniques cannot be applied directly because of the clustered state and the large number of nuclei in images. Moreover, anchor-based methods for object detection lead to a huge amount of computation, which is even worse on pathological images with a large target density. To address these issues, we propose a novel network with anchor-free detection and U-shaped segmentation. An altered feature enhancement module is attached to improve performance in dense target detection. Meanwhile, the U-shaped structure in the segmentation block ensures the aggregation of features in different dimensions generated by the backbone network. We evaluate our work on the Multi-Organ Nuclei Segmentation dataset from the MICCAI 2018 challenge. In comparison with other methods, our proposed method achieves state-of-the-art performance.

Improving face recognition in surveillance video with judicious selection and fusion of representative frames

  • Zhaozhen Ding
  • Qingfang Zheng
  • Chunhua Hou
  • Guang Shen

Face recognition in unconstrained surveillance videos is challenging due to varying acquisition settings and face variations. We propose to utilize the complementary correlation among multiple frames to improve face recognition performance. We design an algorithm to build a representative frame set from the video sequence, selecting faces with high quality and large appearance diversity. We also devise a refined Deep Residual Equivariant Mapping (DREAM) block to improve the discriminative power of the extracted deep features. Extensive experiments on two relevant face recognition benchmarks, YouTube Faces and IJB-A, show the effectiveness of the proposed method. Our method is also lightweight and can be easily embedded into existing CNN-based face recognition systems.

Two-stage structure aware image inpainting based on generative adversarial networks

  • Jin Wang
  • Xi Zhang
  • Chen Wang
  • Qing Zhu
  • Baocai Yin

In recent years, image inpainting technology based on deep learning has made remarkable progress and can complete complex image inpainting tasks better than traditional methods. However, most existing methods cannot generate a reasonable structure and fine texture details at the same time. To solve this problem, in this paper we propose a two-stage, structure-aware image inpainting method based on Generative Adversarial Networks, which divides the inpainting process into two sub-tasks, namely image structure generation and image content generation. In the former stage, the network generates the structural information of the missing area; in the latter stage, the network uses this structural information as a prior and combines it with the existing texture and color information to complete the image. Extensive experiments are conducted to evaluate the performance of our proposed method on the Places2, CelebA and Paris StreetView datasets. The experimental results show the superior performance of the proposed method compared with other state-of-the-art methods, both qualitatively and quantitatively.

Low-quality watermarked face inpainting with discriminative residual learning

  • Zheng He
  • Xueli Wei
  • Kangli Zeng
  • Zhen Han
  • Qin Zou
  • Zhongyuan Wang

Most existing image inpainting methods assume that the location of the repair area (watermark) is known, but this assumption does not always hold. In addition, watermarked faces in practice come in a compressed, low-quality form, which is very disadvantageous to the repair due to compression distortion effects. To address these issues, this paper proposes a low-quality watermarked face inpainting method based on joint residual learning with a cooperative discriminant network. We first employ residual-learning-based global inpainting and facial-feature-based local inpainting to render clean and clear faces under unknown watermark positions. Because the repair process may distort the genuine face, we further propose a discriminative constraint network to maintain the fidelity of repaired faces. Experimentally, the average PSNR of inpainted face images is increased by 4.16 dB, and the average SSIM is increased by 0.08. The TPR is improved by 16.96% at an FPR of 10% in face verification.

A multimedia solution to motivate childhood cancer patients to keep up with cancer treatment

  • Carmen Wang Er Chai
  • Bee Theng Lau
  • Abdullah Al Mahmud
  • Mark Kit Tsun Tee

Childhood cancer is a deadly illness that requires the young patient to adhere to cancer treatment for survival. Sadly, the high treatment side-effect burden can make it difficult for patients to keep up with their treatment. However, childhood cancer patients can manage these treatment side effects through daily self-care to make the process more bearable. This paper outlines the design and development process of a multimedia-based solution to motivate these young patients to adhere to cancer treatment and manage their treatment side effects. Due to the high appeal of multimedia-based interventions and the proficiency of young children in using mobile devices, the intervention of this study takes the form of a virtual-pet serious game developed for mobile devices. The intervention, which is developed based on the Protection Motivation Theory, includes multiple game modules aimed at improving the coping appraisal of childhood cancer patients regarding the use of cancer treatment to fight cancer and of daily self-care to combat treatment side effects. The prototype testing results show that the intervention is well received by the voluntary play testers. Future work of this study includes evaluating the developed intervention with childhood cancer patients to determine its effectiveness.

Global and local feature alignment for video object detection

  • Haihui Ye
  • Qiang Qi
  • Ying Wang
  • Yang Lu
  • Hanzi Wang

Extending image-based object detectors into the video domain suffers from severe inadaptability due to deteriorated frames caused by motion blur, partial occlusion or unusual poses. The features generated from deteriorated frames therefore suffer from misalignment, which degrades the overall performance of video object detectors. How to capture valuable information locally and globally is important for feature alignment but remains quite challenging. In this paper, we propose a Global and Local Feature Alignment (abbreviated as GLFA) module for video object detection, which can distill both global and local information to excavate the deep relationship between features for feature alignment. Specifically, GLFA can model the spatial-temporal dependencies over frames by propagating global information and capture the interactive correspondences within the same frame by aggregating valuable local information. Moreover, we further introduce a Self-Adaptive Calibration (SAC) module to strengthen the semantic representation of features and distill valuable local information in a dual local-alignment manner. Experimental results on the ImageNet VID dataset show that the proposed method achieves high performance as well as a good trade-off between real-time speed and competitive accuracy.

Semantic feature augmentation for fine-grained visual categorization with few-sample training

  • Xiang Guan
  • Yang Yang
  • Zheng Wang
  • Jingjing Li

Small-data challenges have emerged in many learning problems, since the success of deep neural networks often relies on the availability of a huge number of labeled data that are expensive to collect. We explore a highly challenging task, few-sample training, which uses a small number of labeled images of each category and the corresponding textual descriptions to train a model for fine-grained visual categorization. To tackle the overfitting caused by small data, in this paper we propose two novel feature augmentation approaches, Semantic Gate Feature Augmentation (SGFA) and Semantic Boundary Feature Augmentation (SBFA). Instead of generating new image instances, we propose to directly synthesize instance features by leveraging semantic information. The main novelties are: (1) SGFA reduces the overfitting of small data by adding random noise to different regions of the image's feature maps through a gating mechanism. (2) SBFA optimizes the decision boundary of the classifier. Technically, the decision boundary of the image feature is estimated with the assistance of semantic information, and feature augmentation is then performed by sampling in this region. Experiments on fine-grained visual categorization benchmarks demonstrate that our proposed approach can significantly improve categorization performance.

Unsupervised learning of co-occurrences for face images retrieval

  • Thomas Petit
  • Pierre Letessier
  • Stefan Duffner
  • Christophe Garcia

Despite a huge leap in the performance of face recognition systems in recent years, some cases remain challenging for them while being trivial for humans. This is because the human brain exploits much more information than the face appearance to identify a person. In this work, we aim at capturing the social context of unlabeled observed faces in order to improve face retrieval. In particular, we propose a framework that substantially improves face retrieval by exploiting the faces occurring simultaneously in a query's context to infer a multi-dimensional social context descriptor. Combining this compact structural descriptor with the individual visual face features in a common feature vector considerably increases the correct face retrieval rate and makes it possible to disambiguate a large proportion of query results of different persons that are barely distinguishable visually.

To evaluate our framework, we also introduce a new large dataset of faces of French TV personalities organised in TV shows in order to capture the co-occurrence relations between people. On this dataset, our framework is able to improve the mean Average Precision over a set of internal queries from 67.93% (using only facial features extracted with a state-of-the-art pre-trained model) to 78.16% (using both facial features and faces co-occurrences), and from 67.88% to 77.36% over a set of external queries.
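
A toy sketch of the general idea (our illustration, not the paper's code): a co-occurrence matrix built from unlabeled face clusters that appear together in the same shows is row-normalised into a "social context" descriptor and concatenated with the visual embedding before nearest-neighbour retrieval; all sizes and the show list are made up.

```python
import numpy as np

num_clusters, visual_dim = 6, 128
shows = [[0, 1, 2], [0, 1], [2, 3], [3, 4, 5], [1, 2]]   # cluster ids per show

cooc = np.zeros((num_clusters, num_clusters))
for members in shows:
    for i in members:
        for j in members:
            if i != j:
                cooc[i, j] += 1

# Row-normalise the co-occurrence counts into a structural descriptor.
context = cooc / np.maximum(cooc.sum(axis=1, keepdims=True), 1)
rng = np.random.default_rng(1)
visual = rng.normal(size=(num_clusters, visual_dim))
visual /= np.linalg.norm(visual, axis=1, keepdims=True)

fused = np.concatenate([visual, context], axis=1)        # joint descriptor

def retrieve(query_id, feats, k=3):
    sims = feats @ feats[query_id]
    return np.argsort(-sims)[1:k + 1]                    # skip the query itself

print("visual-only:", retrieve(0, visual), "fused:", retrieve(0, fused))
```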

EvoGAN: an evolutionary GAN for face aging and rejuvenation

  • Lianli Gao
  • Jingqiu Zhang
  • Jingkuan Song
  • HengTao Shen

In biology, evolution is the gradual change in the characteristics of a species over several generations. It has two properties: 1) the change is gradual, and 2) long-term changes rely on short-term changes. Face aging/rejuvenation, which renders younger or older facial images, follows the principles of evolution. Inspired by this, we propose an Evolutionary GAN (EvoGAN) for face aging/rejuvenation that makes each age transformation smooth and decomposes a long-term transformation into several short-term ones. Specifically, since short-term facial changes are gradual and relatively easy to render, we first divide the ages into several groups (i.e., chronologically from child and adult to elder). Then, for each pair of adjacent groups, we design two age transforms for face aging and rejuvenation, which are supposed to preserve personal identity information and predict age-specific characteristics. Compared with the mainstream approach to face aging/rejuvenation, i.e., conditional GAN based methods utilizing a one-hot age vector as the age transformation condition, our smooth EvoGAN abandons this condition and can better predict age-specific factors (e.g., the drastic shape and appearance change from an adult to a child). To evaluate EvoGAN, we construct a challenging dataset, FFHQ_Age. Extensive experiments conducted on this dataset demonstrate that our model generates significantly better results than state-of-the-art methods, both qualitatively and quantitatively.

Destylization of text with decorative elements

  • Yuting Ma
  • Fan Tang
  • Weiming Dong
  • Changsheng Xu

Stylized text with decorative elements has a strong visual impact and enriches our daily work, study and life. However, it introduces new challenges to text detection and recognition. In this study, we propose a text destylization framework that can transform stylized texts with decorative elements into a form that is easily distinguishable by a detection or recognition model. We arrange and integrate an existing stylized text dataset to train the destylization network. The new destylized dataset contains English letters and Chinese characters. The proposed approach enables a single framework to handle both Chinese characters and English letters without the need for additional networks. Experiments show that the method is superior to state-of-the-art style-related models.

Hierarchical clustering via mutual learning for unsupervised person re-identification

  • Xu Xu
  • Liyan Zhang
  • Zhaomeng Huang
  • Guodong Du

Person re-identification (re-ID) aims to establish identity correspondence across different cameras. State-of-the-art re-ID approaches are mainly clustering-based Unsupervised Domain Adaptation (UDA) methods, which attempt to transfer the model trained on the source domain to the target domain by alternately generating pseudo labels through clustering target-domain instances and training the network with the generated pseudo labels to perform feature learning. However, these approaches suffer from the inevitable label noise caused by the clustering procedure, which dramatically impacts model training and feature learning in the target domain. To address this issue, we propose an unsupervised Hierarchical Clustering via Mutual Learning (HCML) framework, which can jointly optimize the dual training networks and the clustering procedure to learn more discriminative features from the target domain. Specifically, the proposed HCML framework can effectively update both the hard pseudo labels generated by the clustering process and the soft pseudo labels generated by the training network in an online manner. We jointly adopt the repelled loss, triplet loss, soft identity loss and soft triplet loss to optimize the model. Experimental results on the Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks demonstrate the superiority of our proposed HCML framework over other state-of-the-art methods.

Self-supervised adversarial learning for cross-modal retrieval

  • Yangchao Wang
  • Shiyuan He
  • Xing Xu
  • Yang Yang
  • Jingjing Li
  • Heng Tao Shen

Cross-modal retrieval aims at enabling flexible retrieval across different modalities. The core of cross-modal retrieval is to learn projections for different modalities and make instances comparable to each other in the learned common subspace. Self-supervised learning automatically creates a supervision signal by transforming the input data and learns semantic features by training to predict the artificial labels. In this paper, we propose a novel method named Self-Supervised Adversarial Learning (SSAL) for cross-modal retrieval, which deploys self-supervised learning and adversarial learning to seek an effective common subspace. A feature projector tries to generate modality-invariant representations in the common subspace that can confuse an adversarial discriminator consisting of two classifiers. One of the classifiers aims to predict the rotation angle from image representations, while the other tries to discriminate between different modalities from the learned embeddings. By confusing the self-supervised adversarial model, the feature projector filters out redundant high-level visual semantics and learns image embeddings that are better aligned with the text modality in the common subspace. Through the joint exploitation of the above, an effective common subspace is learned, in which representations of different modalities are better aligned and the common information of different modalities is well preserved. Comprehensive experimental results on three widely used benchmark datasets show that the proposed method is superior in cross-modal retrieval and significantly outperforms existing cross-modal retrieval methods.

Multi-level expression guided attention network for referring expression comprehension

  • Liang Peng
  • Yang Yang
  • Xing Xu
  • Jingjing Li
  • Xiaofeng Zhu

Referring expression comprehension is the task of identifying a text-related object or region in a given image from a natural language expression. In this task, it is essential to understand the expression sentence from multiple aspects and adapt it to region representations to generate discriminative information. Unfortunately, previous approaches usually focus on the important words or phrases of the expression using self-attention mechanisms, so they may fail to distinguish the target region from others, especially similar regions. To address this problem, we propose a novel model, termed Multi-level Expression Guided Attention network (MEGA-Net). It contains a multi-level visual attention schema guided by expression representations at different levels, i.e., sentence level, word level and phrase level, which allows generating discriminative region features and helps to locate the related regions accurately. In addition, to distinguish between similar regions, we design a two-stage structure, where we first select the top-K candidate regions according to their matching scores in the first stage, and then apply an object comparison attention mechanism to learn the difference between the candidates for matching the target region. We evaluate the proposed approach on three popular benchmark datasets, and the experimental results demonstrate that our model performs favorably against state-of-the-art methods.

Adaptive feature aggregation network for nuclei segmentation

  • Ruizhe Geng
  • Zhongyi Huang
  • Jie Chen

Nuclei instance segmentation is essential for cell morphometrics and analysis, playing a crucial role in digital pathology. The variability of nuclei characteristics among diverse cell types makes this task more challenging. Recently, proposal-based segmentation methods with a feature pyramid network (FPN) have shown good performance because FPN integrates multi-scale features with strong semantics. However, FPN suffers from information loss in the highest-level feature map and sub-optimal feature fusion strategies. This paper proposes a proposal-based adaptive feature aggregation method (AANet) to make full use of multi-scale features. Specifically, AANet consists of two components: a Context Augmentation Module (CAM) and a Feature Adaptive Selection Module (ASM). During feature fusion, CAM focuses on exploring extensive contextual information and capturing discriminative semantics to reduce the information loss of the feature map at the highest pyramid level. The enhanced features are then sent to ASM to obtain a combined feature representation adaptively over all feature levels for each RoI. The experiments show our model's effectiveness on two publicly available datasets: the Kaggle 2018 Data Science Bowl dataset and the Multi-Organ nuclei segmentation dataset.

Classification of multimedia SNS posts about tourist sites based on their focus toward predicting eco-friendly users

  • Naoto Kashiwagi
  • Tokinori Suzuki
  • Jounghun Lee
  • Daisuke Ikeda

Overtourism has had a negative impact on many aspects of tourist sites. One of the most serious problems is environmental issues, such as littering, caused by too many visitors. It is important to change people's mindset to be more environmentally aware in order to improve this situation. In particular, if we can find people with comparatively high awareness of the environmental issues caused by overtourism, we can work effectively to promote eco-friendly behavior. However, grasping a person's awareness is inherently difficult. To tackle this challenge, we introduce a new task, called Detecting Focus of Posts about Tourism, which takes users' SNS posts of pictures and comments about tourist sites and classifies them into types according to their focus, as a proxy for such awareness. Once we classify such posts, the results reveal tendencies in users' awareness, so we can discern users' awareness of environmental issues at tourist sites. Specifically, we define four labels for the focus of SNS posts about tourist sites. Based on these labels, we create an evaluation dataset. We present experimental results for the classification task with a CNN classifier for pictures and an LSTM classifier for comments, which will serve as baselines for the task.

Learning intra-inter semantic aggregation for video object detection

  • Jun Liang
  • Haosheng Chen
  • Kaiwen Du
  • Yan Yan
  • Hanzi Wang

Video object detection is a challenging task due to appearance deterioration in video frames. Thus, object features extracted from different frames of a video are usually deteriorated to varying degrees. Currently, some state-of-the-art methods enhance the deteriorated object features in a reference frame by aggregating the undeteriorated object features extracted from other frames, simply based on the learned appearance relations among object features. In this paper, we propose a novel intra-inter semantic aggregation method (ISA) to learn more effective intra and inter relations for semantically aggregating object features. Specifically, in the proposed ISA, we first introduce an intra semantic aggregation module (Intra-SAM) to enhance the deteriorated spatial features based on the learned intra relation among the features at different positions of an individual object. Then, we present an inter semantic aggregation module (Inter-SAM) to enhance the deteriorated object features in the temporal domain based on the learned inter relation among object features. As a result, by leveraging Intra-SAM and Inter-SAM, the proposed ISA can generate discriminative features from the novel perspective of intra-inter semantic aggregation for robust video object detection. We conduct extensive experiments on the ImageNet VID dataset to evaluate ISA. The proposed ISA obtains 84.5% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, and achieves superior performance compared with several state-of-the-art video object detectors.

Robust visual tracking via scale-aware localization and peak response strength

  • Ying Wang
  • Luo Xiong
  • Kaiwen Du
  • Yan Yan
  • Hanzi Wang

Existing regression-based deep trackers usually localize a target based on a response map, where the highest peak response corresponds to the predicted target location. Nevertheless, when background distractors appear or the target scale changes frequently, the response map is prone to produce multiple sub-peak responses that interfere with model prediction. In this paper, we propose a robust online tracking method via Scale-Aware localization and Peak Response strength (SAPR), which can learn a discriminative model predictor to estimate a target state accurately. Specifically, to cope with large scale variations, we propose a Scale-Aware Localization (SAL) module to provide multi-scale response maps based on the scale pyramid scheme. Furthermore, to focus on the target response, we propose a simple yet effective Peak Response Strength (PRS) module to fuse the multi-scale response maps and the response maps generated by a correlation filter. According to the response map with the maximum classification score, the model predictor iteratively updates its filter weights for accurate target state estimation. Experimental results on three benchmark datasets, including OTB100, VOT2018 and LaSOT, demonstrate that the proposed SAPR accurately estimates the target state, achieving favorable performance against several state-of-the-art trackers.

Hungry networks: 3D mesh reconstruction of a dish and a plate from a single dish image for estimating food volume

  • Shu Naritomi
  • Keiji Yanai

Dietary calorie management has been an important topic in recent years, and various methods and applications for image-based food calorie estimation have been published in the multimedia community. Most existing methods of estimating food calorie amounts use 2D-based image recognition. In this paper, by contrast, we make inferences based on 3D volume for more accurate estimation. We perform 3D reconstruction of a dish (food and plate) and a plate (without food) from a single image. We succeed in restoring the 3D shape with high accuracy while maintaining the consistency between the plate part of the estimated 3D dish and the estimated 3D plate. To achieve this, the following contributions are made in this paper: (1) the proposal of "Hungry Networks," a new network that generates two kinds of 3D volumes from a single image; (2) the introduction of a plate consistency loss that matches the shapes of the plate parts of the two reconstructed models; (3) the creation of a new dataset of 3D food models obtained by 3D-scanning actual foods and plates. We also conduct an experiment to infer the volume of only the food region from the difference of the two reconstructed volumes. The results show that the introduced loss function not only matches the 3D shape of the plate but also contributes to obtaining the volume with higher accuracy. Although there are some existing studies that consider the 3D shapes of foods, this is the first study to generate a 3D mesh volume from a single dish image.

Scene graph generation via multi-relation classification and cross-modal attention coordinator

  • Xiaoyi Zhang
  • Zheng Wang
  • Xing Xu
  • Jiwei Wei
  • Yang Yang

Scene graph generation intends to build a graph-based representation from an image, where nodes and edges respectively represent objects and the relationships between them. However, scene graph generation today is heavily limited by imbalanced class prediction. Specifically, most existing work achieves satisfying performance on simple and frequent relation classes (e.g., on), yet performs poorly on fine-grained and infrequent ones (e.g., walk on, stand on). To tackle this problem, in this paper we redesign the framework as two branches, a representation learning branch and a classifier learning branch, for a more balanced scene graph generator. For the representation learning branch, we propose a Cross-modal Attention Coordinator (CAC) to gather consistent features from multiple modalities using dynamic attention. For the classifier learning branch, we first transfer relation-class knowledge from a large-scale corpus, and then leverage a Multi-Relationship classifier via Graph Attention neTworks (MR-GAT) to bridge the gap between frequent and infrequent relations. Comprehensive experimental results on VG200, a challenging dataset, indicate the competitiveness and significant superiority of our proposed approach.

A novel system architecture and an automatic monitoring method for remote production

  • Yasuhiro Mochida
  • Daisuke Shirai
  • Takahiro Yamaguchi
  • Seiki Kuwabara
  • Hideki Nishizawa

Remote production is an emerging concept concerning the outside-broadcasting workflow enabled by Internet Protocol (IP)-based production systems, and it is expected to be much more efficient than the conventional workflow. However, long-distance transmission of uncompressed video signals and time synchronization of distributed IP-video devices are challenging. A system architecture for remote production using optical transponders (capable of long-distance and large-capacity optical communication) is proposed. A field experiment confirmed that uncompressed video signals can be transmitted successfully by this architecture. The status monitoring of uncompressed video transmission in remote production is also challenging. To address the challenge, a method for automatically monitoring the status of IP-video devices is also proposed. The monitoring system was implemented by using whitebox transponders, and it was confirmed that the system can automatically register IP-video devices, generate an IP-video flow model, and detect traffic anomalies.

Graph convolution network with node feature optimization using cross attention for few-shot learning

  • Ying Liu
  • Yanbo Lei
  • Sheikh Faisal Rashid

Graph convolution network (GCN) is an important method recently developed for few-shot learning. The adjacency matrix in GCN models is constructed based on graph node features to represent the graph node relationships, according to which the graph network achieves message-passing inference. Therefore, the representation ability of graph node features is an important factor affecting the learning performance of GCN. This paper proposes an improved GCN model with node feature optimization using cross attention, named GCN-NFO. Leveraging a cross attention mechanism to associate the image features of the support set and query set, the proposed model extracts more representative and discriminative salient region features as the initialization features of graph nodes through information aggregation. Since the graph network can represent the relationships between samples, the optimized graph node features transmit information through the graph network, implicitly enhancing the similarity of intra-class samples and the dissimilarity of inter-class samples, and thus enhancing the learning capability of GCN. Intensive experimental results on image classification tasks using different image datasets prove that GCN-NFO is an effective few-shot learning algorithm that significantly improves classification accuracy compared with other existing models.
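
A minimal sketch of the kind of cross attention described above (our illustration, not the paper's module): query and support features attend to each other through a scaled affinity matrix and are fused residually; feature dimensions and the residual fusion are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attention(support, query):
    """Toy cross-attention between support-set and query-set features.

    support: (Ns, C) pooled region features of support images.
    query:   (Nq, C) pooled region features of query images.
    Returns support/query features re-weighted by their mutual affinity,
    which could then initialise the graph node features.
    """
    affinity = query @ support.t() / support.shape[1] ** 0.5   # (Nq, Ns)
    q2s = F.softmax(affinity, dim=1) @ support                 # query attends to support
    s2q = F.softmax(affinity.t(), dim=1) @ query               # support attends to query
    return support + s2q, query + q2s                          # residual fusion

support = torch.randn(5, 64)
query = torch.randn(3, 64)
s_feat, q_feat = cross_attention(support, query)
print(s_feat.shape, q_feat.shape)   # torch.Size([5, 64]) torch.Size([3, 64])
```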

A multi-scale language embedding network for proposal-free referring expression comprehension

  • Taijin Zhao
  • Hongliang Li
  • Heqian Qiu
  • Qingbo Wu
  • King Ngi Ngan

Referring expression comprehension (REC) is a task that aims to find the location of an object specified by a language expression. Current solutions for REC can be classified into proposal-based methods and proposal-free methods. Proposal-free methods have become popular recently because of their flexibility and lightness. Nevertheless, existing proposal-free works give little consideration to visual context. As REC is a context-sensitive task, it is hard for current proposal-free methods to comprehend expressions that describe objects by their relative position to surrounding things. In this paper, we propose a multi-scale language embedding network for REC. Our method adopts the proposal-free structure, which directly feeds fused visual-language features into a detection head to predict the bounding box of the target. In the fusion process, we propose a grid fusion module and a grid-context fusion module to compute the similarity between language features and visual features in regions of different sizes. Meanwhile, we additionally add fully interacted vision-language information and position information to strengthen the feature fusion. This novel fusion strategy helps to utilize context flexibly, so the network can deal with varied expressions, especially expressions that describe objects by the things around them. Our proposed method outperforms state-of-the-art methods on the RefCOCO, RefCOCO+ and RefCOCOg datasets.

Similar scene retrieval in soccer videos with weak annotations by multimodal use of bidirectional LSTM

  • Tomoki Haruyama
  • Sho Takahashi
  • Takahiro Ogawa
  • Miki Haseyama

This paper presents a novel method to retrieve similar scenes in soccer videos with weak annotations via multimodal use of bidirectional long short-term memory (BiLSTM). The significant increase in the number of different types of soccer videos with the development of technology provides valuable assets for effective coaching, but it also increases the workload of players and training staff. We tackle this problem with a nontraditional combination of pre-trained models for feature extraction and BiLSTMs for feature transformation. By using the pre-trained models, no training data is required for feature extraction. Effective feature transformation for similarity calculation is then performed by applying a BiLSTM trained with weak annotations. This transformation allows for highly accurate capture of soccer video context with less annotation work. In this paper, we achieve accurate retrieval of similar scenes through the multimodal use of this BiLSTM-based transformation, which is trainable with less human effort. The effectiveness of our method was verified by comparative experiments with state-of-the-art methods on an actual soccer video dataset.
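
A minimal sketch of the retrieval pipeline described above, under assumed dimensions (not the paper's configuration): a BiLSTM transforms a sequence of pre-trained per-frame features into a single scene embedding, and similar scenes are ranked by cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneEncoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, frame_feats):               # (batch, time, feat_dim)
        out, _ = self.bilstm(frame_feats)         # (batch, time, 2*hidden)
        return F.normalize(out.mean(dim=1), dim=1)   # temporal average pooling

encoder = SceneEncoder()
database = encoder(torch.randn(10, 30, 512))      # 10 stored scenes, 30 frames each
query = encoder(torch.randn(1, 30, 512))
ranking = (database @ query.t()).squeeze(1).argsort(descending=True)
print(ranking[:3])                                # indices of the most similar scenes
```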

Patch assembly for real-time instance segmentation

  • Yutao Xu
  • Hanli Wang
  • Jian Zhu

The paradigm of sliding windows has proven effective for the task of visual instance segmentation in many popular research works. However, it still suffers from the bottleneck of inference time. To accelerate existing instance segmentation approaches based on dense sliding windows, this work introduces a novel approach, called patch assembly, which can be integrated into bounding box detectors for segmentation without extra up-sampling computations. A well-designed detector named PAMask is proposed to verify the effectiveness of the proposed approach. Benefiting from its simple structure as well as a fusion of multiple representations, PAMask can run in real time while achieving competitive performance. Besides, another effective technique called Center-NMS is designed to reduce the number of boxes for intersection-over-union calculation, which can be fully parallelized on device and contributes a 0.6% mAP improvement in both detection and segmentation for free.

Full-resolution encoder-decoder networks with multi-scale feature fusion for human pose estimation

  • Jie Ou
  • Mingjian Chen
  • Hong Wu

To achieve more accurate 2D human pose estimation, we extend the successful encoder-decoder network, simple baseline network (SBN), in three ways. To reduce the quantization errors caused by the large output stride size, two more decoder modules are appended to the end of the simple baseline network to get full output resolution. Then, the global context blocks (GCBs) are added to the encoder and decoder modules to enhance them with global context features. Furthermore, we propose a novel spatial-attention-based multi-scale feature collection and distribution module (SA-MFCD) to fuse and distribute multi-scale features to boost the pose estimation. Experimental results on the MS COCO dataset indicate that our network can remarkably improve the accuracy of human pose estimation over SBN, our network using ResNet34 as the backbone network can even achieve the same accuracy as SBN with ResNet152, and our networks can achieve superior results with big backbone networks.

Graph-based variational auto-encoder for generalized zero-shot learning

  • Jiwei Wei
  • Yang Yang
  • Xing Xu
  • Yanli Ji
  • Xiaofeng Zhu
  • Heng Tao Shen

Zero-shot learning has been a highlighted research topic in both the vision and language areas. Recently, generative methods have emerged as a new trend in zero-shot learning, synthesizing samples of unseen categories via generative models. However, the lack of fine-grained information in the synthesized samples makes it difficult to improve classification accuracy. It is also time-consuming and inefficient to synthesize samples and use them to train classifiers. To address such issues, we propose a novel Graph-based Variational Auto-Encoder for zero-shot learning. Specifically, we adopt a knowledge graph to model the explicit inter-class relationships, and design a full graph convolution auto-encoder framework to generate the classifier from the distribution of the class-level semantic features on individual nodes. The encoder learns the latent representations of individual nodes, and the decoder generates the classifiers from these latent representations. In contrast to synthesizing samples, our proposed method directly generates classifiers from the distribution of the class-level semantic features for both seen and unseen categories, which is more straightforward, accurate and computationally efficient. We conduct extensive experiments and evaluate our method on the widely used large-scale ImageNet-21K dataset. Experimental results validate the efficacy of the proposed approach.

A multi-scale human action recognition method based on Laplacian pyramid depth motion images

  • Chang Li
  • Qian Huang
  • Xing Li
  • Qianhan Wu

Human action recognition is an active research area in computer vision. To address the lack of spatial multi-scale information in human action recognition, we present a novel framework to recognize human actions from depth video sequences using multi-scale Laplacian pyramid depth motion images (LP-DMI). Each depth frame is projected onto three orthogonal Cartesian planes. For the three views, we generate depth motion images (DMI) and construct Laplacian pyramids as structured multi-scale feature maps, which enhance the multi-scale dynamic information of motions and reduce redundant static information of human bodies. We further extract a multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features. Finally, we utilize an extreme learning machine (ELM) for action classification. Through extensive experiments on the public MSRAction3D dataset, we show that our method outperforms state-of-the-art benchmarks.
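
A rough sketch of the two ingredients named above, with invented sizes and level count (not the paper's exact pipeline): a depth motion image accumulated from frame differences of one projection view, and a Laplacian pyramid built on top of it; HOG descriptors would then be computed per level.

```python
import cv2
import numpy as np

def depth_motion_image(depth_frames):
    """depth_frames: array of (H, W) depth maps from one Cartesian view."""
    frames = np.asarray(depth_frames, dtype=np.float32)
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)   # accumulate motion energy

def laplacian_pyramid(img, levels=3):
    pyramid, current = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyramid.append(current - up)          # band-pass detail at this scale
        current = down
    pyramid.append(current)                   # low-frequency residual
    return pyramid

frames = np.random.rand(20, 240, 320)         # synthetic depth sequence
dmi = depth_motion_image(frames)
for level, band in enumerate(laplacian_pyramid(dmi)):
    print(f"level {level}: {band.shape}")
```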

Fixed-size video summarization over streaming data via non-monotone submodular maximization

  • Ganfeng Lu
  • Jiping Zheng

Video summarization, which enables fast browsing of the large amount of emerging video data as well as saving storage cost, has attracted tremendous attention in machine learning and information retrieval. Among existing efforts, determinantal point processes (DPPs), designed for selecting a subset of video frames to represent the whole video, have shown great success in video summarization. However, existing methods show poor performance in generating fixed-size output summaries for video data, especially when video frames arrive in a streaming manner. In this paper, we provide an efficient approach, k-seqLS, which summarizes streaming video data with a fixed size k in the vein of DPPs. Our k-seqLS approach can fully exploit the sequential nature of video frames by setting a time window so that frames outside the window have no influence on the current frame. Since the log-form DPP probability of each subset of frames is a non-monotone submodular function, local search and greedy techniques with a cardinality constraint are adopted to make k-seqLS fixed-size, efficient and theoretically guaranteed. Our experiments show that the proposed k-seqLS exhibits higher performance while maintaining practical running time.
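
A much-simplified sketch of fixed-size streaming local search in this spirit (not the authors' k-seqLS; the time window and theoretical machinery are omitted): candidate summaries are scored by the log-determinant of their DPP kernel submatrix, and each arriving frame is considered exactly once for a single improving swap.

```python
import numpy as np

def log_det_score(kernel, subset):
    sign, logdet = np.linalg.slogdet(kernel[np.ix_(subset, subset)])
    return logdet if sign > 0 else -np.inf

def stream_summarize(features, k=5):
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    kernel = feats @ feats.T + 1e-6 * np.eye(len(feats))   # PSD similarity kernel
    summary = list(range(k))                               # seed with first k frames
    for t in range(k, len(feats)):
        # Streaming constraint: frame t is examined once, via the best single swap.
        best, best_score = summary, log_det_score(kernel, summary)
        for i in range(k):
            candidate = summary[:i] + [t] + summary[i + 1:]
            score = log_det_score(kernel, candidate)
            if score > best_score:
                best, best_score = candidate, score
        summary = best
    return sorted(summary)

frames = np.random.rand(200, 64)        # 200 streaming frame descriptors
print(stream_summarize(frames, k=5))    # indices of the fixed-size summary
```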

Overlap classification mechanism for skeletal bone age assessment

  • Pengyi Hao
  • Xuhang Xie
  • Tianxing Han
  • Cong Bai

Bone development is a continuous process; however, discrete labels are usually used to represent bone ages. This inevitably causes a semantic gap between the actual situation and the label representation scope. In this paper, we present a novel method, named the overlap classification network, to narrow the semantic gap in bone age assessment. In the proposed network, the discrete bone age labels (such as 0-228 months) are considered as a sequence that is used to generate a series of subsequences. The network then makes use of the overlapping information between adjacent subsequences and outputs several bone age ranges at the same time for one case. The overlapping part of these age ranges is taken as the final predicted bone age. The proposed method, without any preprocessing, achieves a much smaller mean absolute error than state-of-the-art methods on a public dataset.
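
A toy illustration of the overlapping-range read-out (our example, not the paper's code): each head votes for one contiguous age range, and the final age is read from the intersection of those ranges; the range widths below are invented.

```python
def intersect_ranges(ranges):
    """ranges: list of (low, high) month intervals predicted by the heads."""
    low = max(r[0] for r in ranges)
    high = min(r[1] for r in ranges)
    if low > high:
        raise ValueError("heads disagree: empty intersection")
    return low, high

# Three heads predicting overlapping subsequences of the 0-228 month label set.
predicted = [(96, 120), (108, 132), (102, 126)]
low, high = intersect_ranges(predicted)
print(f"final bone age ~ {(low + high) / 2:.1f} months (range {low}-{high})")
```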

Multi-focus noisy image fusion based on gradient regularized convolutional sparse representation

  • Xuanjing Shen
  • Yunqi Zhang
  • Haipeng Chen
  • Di Gai

This paper proposes a multi-focus noisy image fusion algorithm combining gradient regularized convolutional sparse representation and spatial frequency. Firstly, the source image is decomposed into a base layer and a detail layer through two-scale image decomposition. For the detail layer, the Alternating Direction Method of Multipliers (ADMM) is used to solve the convolutional sparse coefficients with gradient penalties and complete the fusion of the detail-layer coefficients. Then, the base layer uses the spatial frequency to judge the focus area, and the spatial frequency together with the "choose-max" strategy is applied to obtain the multi-focus fusion result of the base layer. Finally, the fused image is computed as a superposition of the base layer and the detail layer. Experimental results show that, compared with other algorithms, the proposed algorithm provides excellent subjective visual perception and objective evaluation metrics.
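
A hedged sketch of the base-layer step described above (the block size is an illustrative choice, not the paper's): the spatial frequency of each base layer is computed block-wise, and for every block the pixels of the source with the larger spatial frequency are kept ("choose-max").

```python
import numpy as np

def spatial_frequency(block):
    rf = np.sqrt(np.mean(np.diff(block, axis=1) ** 2))   # row frequency
    cf = np.sqrt(np.mean(np.diff(block, axis=0) ** 2))   # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)

def fuse_base_layers(a, b, block=8):
    fused = np.empty_like(a)
    for y in range(0, a.shape[0], block):
        for x in range(0, a.shape[1], block):
            pa = a[y:y + block, x:x + block]
            pb = b[y:y + block, x:x + block]
            # Keep the block from the source that is more "in focus".
            fused[y:y + block, x:x + block] = pa if spatial_frequency(pa) >= spatial_frequency(pb) else pb
    return fused

a = np.random.rand(64, 64)   # base layer of source image 1
b = np.random.rand(64, 64)   # base layer of source image 2
print(fuse_base_layers(a, b).shape)
```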

Fixation guided network for salient object detection

  • Zhe Cui
  • Li Su
  • Weigang Zhang
  • Qingming Huang

Convolutional neural network (CNN) based salient object detection (SOD) has achieved great development in recent years. However, in some challenging cases, i.e., small-scale salient objects, low-contrast salient objects and cluttered backgrounds, existing salient object detection methods are still not satisfying. In order to accurately detect salient objects, SOD networks need to fix the position of the most salient part. Fixation prediction (FP) focuses on the most visually attractive regions, so we believe it can assist in locating salient objects. As far as we know, few methods jointly consider the SOD and FP tasks. In this paper, we propose a fixation guided salient object detection network (FGNet) to leverage the correlation between SOD and FP. FGNet consists of two branches to deal with fixation prediction and salient object detection respectively. Further, an effective feature cooperation module (FCM) is proposed to fuse complementary information between the two branches. Extensive experiments on four popular datasets and comparisons with twelve state-of-the-art methods show that the proposed FGNet well captures the main context of images and locates salient objects more accurately.

Motion-transformer: self-supervised pre-training for skeleton-based action recognition

  • Yi-Bin Cheng
  • Xipeng Chen
  • Dongyu Zhang
  • Liang Lin

With the development of deep learning, skeleton-based action recognition has achieved great progress in recent years. However, most current works focus on extracting more informative spatial representations of the human body, but have not made full use of the temporal dependencies already contained in sequences of human actions. To this end, we propose a novel transformer-based model called Motion-Transformer to sufficiently capture the temporal dependencies via self-supervised pre-training on sequences of human actions. Besides, we propose to predict the motion flow of human skeletons to better learn the temporal dependencies in the sequence. The pre-trained model is then fine-tuned on the task of action recognition. Experimental results on the large-scale NTU RGB+D dataset show that our model is effective in modeling temporal relations and that the flow prediction pre-training is beneficial for exposing the inherent dependencies in the time dimension. With this pre-training and fine-tuning paradigm, our final model outperforms previous state-of-the-art methods.

Interactive re-ranking for cross-modal retrieval based on object-wise question answering

  • Rintaro Yanagi
  • Ren Togo
  • Takahiro Ogawa
  • Miki Haseyama

Cross-modal retrieval methods retrieve desired images from a query text by learning the relationships between texts and images. This retrieval approach is one of the most convenient in terms of ease of query preparation. Recent cross-modal retrieval is convenient and accurate when users input a query text that uniquely identifies the desired image. However, users frequently input ambiguous query texts, and these ambiguous queries make it difficult to obtain the desired images. To alleviate these difficulties, in this paper we propose a novel interactive cross-modal retrieval method based on question answering (QA) with users. The proposed method analyses candidate images and asks users about information that can effectively narrow down the retrieval candidates. By merely answering the questions generated by the proposed method, users can reach their desired images even from an ambiguous query text. Experimental results show the effectiveness of the proposed method.

A background-induced generative network with multi-level discriminator for text-to-image generation

  • Ping Wang
  • Li Liu
  • Huaxiang Zhang
  • Tianshi Wang

Most existing text-to-image generation methods focus on synthesizing images using only text descriptions, but this cannot meet the requirement of generating desired objects with given backgrounds. In this paper, we propose a Background-induced Generative Network (BGNet) that combines attention mechanisms, background synthesis, and a multi-level discriminator to generate realistic images with given backgrounds according to text descriptions. BGNet takes multi-stage generation as the basic framework to generate fine-grained images and introduces a hybrid attention mechanism to capture the local semantic correlation between texts and images. To adjust the impact of the given backgrounds on the synthesized images, synthesis blocks are added at each stage of image generation, which appropriately combine the foreground objects generated from the text descriptions with the given background images. Besides, a multi-level discriminator and its corresponding loss function are proposed to optimize the synthesized images. The experimental results on the CUB bird dataset demonstrate the superiority of our method and its ability to generate realistic images with given backgrounds.

WFN-PSC: weighted-fusion network with poly-scale convolution for image dehazing

  • Lexuan Sun
  • Xueliang Liu
  • Zhenzhen Hu
  • Richang Hong

Image dehazing is a fundamental task in computer vision and multimedia, and it usually faces challenges from two aspects: i) the uneven distribution of arbitrary haze, and ii) the distortion of image pixels caused by the hazy image. In this paper, we propose an end-to-end trainable framework, named Weighted-Fusion Network with Poly-Scale Convolution (WFN-PSC), to address these dehazing issues. The proposed method is designed based on the Poly-Scale Convolution (PSConv). It can extract image features at different scales without upsampling or downsampling, which avoids image distortion. Beyond this, we design spatial and channel weighted-fusion modules to make the WFN-PSC model focus on the hard dehazing parts of the image along two dimensions. Specifically, we design three Part Architectures followed by the channel weighted-fusion module. Each Part Architecture consists of three PSConv residual blocks and a spatial weighted-fusion module. Experiments on the benchmark demonstrate the dehazing effectiveness of the proposed method. Furthermore, considering that image dehazing is a low-level task in computer vision, we evaluate the dehazed images on an object detection task, and the results show that the proposed method can serve as good pre-processing to assist high-level computer vision tasks.

Video scene detection based on link prediction using graph convolution network

  • Yingjiao Pei
  • Zhongyuan Wang
  • Heling Chen
  • Baojin Huang
  • Weiping Tu

With the development of the Internet, multimedia data grows at an exponential rate. The demand for video organization, summarization and retrieval has been increasing, and scene detection plays an essential role in these tasks. Existing shot clustering algorithms for scene detection usually treat the temporal shot sequence as unconstrained data. Graph based scene detection methods can locate scene boundaries by taking the temporal relations among shots into account, while most of them rely only on low-level features to determine whether connected shot pairs are similar or not. Optimized algorithms that consider the temporal sequence of shots or combine multi-modal features bring parameter-tuning trouble and computational burden. In this paper, we propose a novel temporal clustering method based on a graph convolution network and the link transitivity of shot nodes, without involving complicated steps or prior parameter settings such as the number of clusters. In particular, the graph convolution network is used to predict the link possibility of node pairs that are close in the temporal sequence. The shots are then clustered into scene segments by merging all possible links. Experimental results on the BBC and OVSD datasets show that our approach is more robust and effective than the comparison methods in terms of F1-score.
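
The final merging step can be sketched independently of the GCN (our illustration, with a made-up link list): once linked shot pairs are predicted, scenes follow from merging all links transitively, e.g. with a plain union-find, and no cluster count is needed.

```python
def merge_links(num_shots, links):
    parent = list(range(num_shots))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in links:                      # union every predicted link
        parent[find(a)] = find(b)

    scenes = {}
    for shot in range(num_shots):
        scenes.setdefault(find(shot), []).append(shot)
    return list(scenes.values())

predicted_links = [(0, 1), (1, 2), (4, 5), (5, 6), (8, 9)]
print(merge_links(10, predicted_links))
# [[0, 1, 2], [3], [4, 5, 6], [7], [8, 9]]
```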

Cross-cultural design of facial expressions for humanoids: is there cultural difference between Japan and Denmark?

  • Ichi Kanaya
  • Meina Tawaki
  • Keiko Yamamoto

In this research, the authors succeeded in creating facial expressions made with the minimum elements necessary for recognizing a face. The elements are two eyes and a mouth made of precise circles, which are transformed geometrically, through rotation and vertical scaling, to make facial expressions. The facial expression patterns made by these geometric elements and transformations were composed along three dimensions of visual information suggested by many previous studies: slantedness of the mouth, openness of the face, and slantedness of the eyes. The authors found that these minimal facial expressions can be classified into 10 emotions: happy, angry, sad, disgust, fear, surprised, angry*, fear*, neutral (pleasant) indicating positive emotion, and neutral (unpleasant) indicating negative emotion. The authors also investigate and report cultural differences in the impressions of the facial expressions of the above-mentioned simplified face.

Table detection and cell segmentation in online handwritten documents with graph attention networks

  • Ying Liu
  • Heng Zhang
  • Xiao-Long Yun
  • Jun-Yu Ye
  • Cheng-Lin Liu

In this paper, we propose a multi-task learning approach for table detection and cell segmentation in free-form online documents with densely connected graph attention networks. Each online document is regarded as a graph, where nodes represent strokes and edges represent the relationships between strokes. We then propose a graph attention network model to classify nodes and edges simultaneously. According to the node classification results, tables can be detected in each document. By combining the node and edge classification results, the cells in each table can be segmented. To improve information flow in the network and enable efficient reuse of features among layers, dense connectivity among layers is used. Our proposed model has been experimentally validated on the online handwritten document dataset IAMOnDo and achieved encouraging results.

RICAPS: residual inception and cascaded capsule network for broadcast sports video classification

  • Abdullah Aman Khan
  • Saifullah Tumrani
  • Chunlin Jiang
  • Jie Shao

The field of broadcast sports video analysis requires attention from the research community. Identifying the semantic actions within a broadcast sports video aids better video analysis and highlight generation. One of the key challenges in sports video analysis is the availability of relevant datasets. In this paper, we introduce a new dataset, SP-2, for broadcast sports video (available at https://github.com/abdkhanstd/Sports2). SP-2 is a large dataset with several annotations such as sports category (class), playfield scenario, and game action. Along with the introduction of this dataset, we focus on accurately classifying the broadcast sports video category and propose a simple yet elegant method for the classification of broadcast sports videos. Broadcast sports video classification plays an important role in sports video analysis, as different sports follow different sets of rules and situations. Our method exploits and explores the true potential of the recently introduced capsule network with dynamic routing. First, we extract features using a residual convolutional neural network and build temporal feature sequences. Then, a cascaded capsule network is trained on the extracted feature sequences. The residual inception and cascaded capsule network (RICAPS) significantly improves the performance of broadcast sports video classification, as deeper features are captured by the cascaded capsule network. We conduct extensive experiments on the SP-2 dataset and compare against previously proposed methods; the results show that RICAPS outperforms them.

Transfer non-stationary texture with complex appearance

  • Cheng Peng
  • Na Qi
  • Qing Zhu

Texture transfer has been successfully applied in computer vision and computer graphics. Since non-stationary textures are usually complex and anisotropic, it is challenging to transfer them with simple supervised methods. In this paper, we propose a general solution for non-stationary texture transfer, which can preserve the local structure and visual richness of textures. The inputs of our framework are a source texture and semantic annotation pair. We record different semantics as different regions and obtain the color and distribution information of each region, which is used to guide the low-level texture transfer algorithm. Specifically, we exploit these local distributions to regularize the texture transfer objective function, which is minimized by iterative search and voting steps. In the search step, we search the nearest-neighbor fields from the source image to the target image with the Generalized PatchMatch (GPM) algorithm. In the voting step, we calculate histogram weights and coherence weights for different semantic regions to ensure color accuracy and texture continuity, and to further transfer the textures from the source to the target. By comparing with state-of-the-art algorithms, we demonstrate the effectiveness and superiority of our technique on various non-stationary textures.

Story segmentation for news broadcast based on primary caption

  • Heling Chen
  • Zhongyuan Wang
  • Yingjiao Pei
  • Baojin Huang
  • Weiping Tu

In the information explosion era, people only want to access the news information they are interested in. News broadcast story segmentation is therefore strongly needed, as an essential basis for personalized delivery and short-video production. Existing advanced story boundary segmentation methods utilize the semantic similarity of subtitles and thus entail complex semantic computation. The title texts of news broadcast programs include headline (or primary) captions, dialogue captions and the channel logo, and the clips of one story render only one primary caption in most news broadcasts. Inspired by this fact, we propose a simple method for story segmentation based on the primary caption, which combines YOLOv3-based primary caption extraction with preliminary location of boundaries. In particular, we introduce mean hash to achieve fast and reliable comparison of the detected small-size primary caption blocks. We further incorporate scene recognition to refine the preliminary boundaries, because the primary captions always appear later than the story boundary. Experimental results on two Chinese news broadcast datasets show that our method enjoys high accuracy in terms of R, P and F1-measures.
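
For clarity, here is a minimal sketch (not the paper's implementation) of mean-hash comparison between two detected primary-caption blocks: resize to 8x8 grayscale, binarize against the mean intensity, and compare the 64-bit hashes by Hamming distance. The Hamming threshold is an assumed parameter.

```python
import numpy as np
from PIL import Image

def mean_hash(image, hash_size=8):
    small = image.convert("L").resize((hash_size, hash_size), Image.BILINEAR)
    pixels = np.asarray(small, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()          # 64 boolean bits

def same_caption(block_a, block_b, max_hamming=5):      # threshold is an assumption
    dist = np.count_nonzero(mean_hash(block_a) != mean_hash(block_b))
    return dist <= max_hamming

# Usage: same_caption(Image.open("caption_t.png"), Image.open("caption_t1.png"))
```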

Intermediate coordinate based pose non-perspective estimation from line correspondences

  • Yujia Cao
  • Zhichao Cui
  • Yuehu Liu
  • Xiaojun Lv
  • Kaibei Peng

In this paper, a non-iterative solution to non-perspective pose estimation from line correspondences is proposed. Specifically, the proposed method uses an intermediate camera frame and an intermediate world frame, which simplifies the expression of the rotation matrix R by reducing its degrees of freedom from three to two. The pose estimation problem is then formulated as an optimization problem. Our method solves for the parameters of the rotation matrix by building fifteenth-order and fourth-order univariate polynomials. The proposed method can also be applied to pose estimation for perspective cameras. We use both simulated and real data to conduct comparative experiments. The experimental results show that the proposed method is comparable to or better than existing methods in terms of accuracy, stability and efficiency.

An autoregressive generation model for producing instant basketball defensive trajectory

  • Huan-Hua Chang
  • Wen-Cheng Chen
  • Wan-Lun Tsai
  • Min-Chun Hu
  • Wei-Ta Chu

Learning basketball tactics via a virtual reality environment requires real-time feedback to improve realism and interactivity. For example, the virtual defender should move immediately according to the player's movement. In this paper, we propose an autoregressive generative model for basketball defensive trajectory generation. To learn the continuous Gaussian distribution of player position, we adopt a differentiable sampling process to sample the candidate location with a standard deviation loss, which preserves the diversity of the trajectories. Furthermore, we design several additional loss functions based on the domain knowledge of basketball to make the generated trajectories match real situations in basketball games. The experimental results show that the proposed method can achieve better performance than previous works in terms of different evaluation metrics.
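
The sketch below is a hedged illustration of differentiable Gaussian sampling for a defender position; the network head, the reparameterization trick and the standard-deviation penalty are illustrative assumptions, not the authors' exact losses.

```python
import torch
import torch.nn as nn

class PositionHead(nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, 2)         # (x, y) mean position on the court
        self.log_sigma = nn.Linear(hidden_dim, 2)  # log standard deviation

    def forward(self, h):
        mu, sigma = self.mu(h), self.log_sigma(h).exp()
        sample = mu + sigma * torch.randn_like(sigma)   # reparameterized, so sampling stays differentiable
        return sample, mu, sigma

def losses(sample, mu, sigma, target, target_sigma=0.1):
    pos_loss = nn.functional.mse_loss(sample, target)
    # keep sigma close to a reference spread so generated trajectories stay diverse
    std_loss = nn.functional.mse_loss(sigma, torch.full_like(sigma, target_sigma))
    return pos_loss, std_loss
```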

Real-time arbitrary video style transfer

  • Xingyu Liu
  • Zongxing Ji
  • Piao Huang
  • Tongwei Ren

Video style transfer aims to synthesize a stylized video that has a content structure similar to a content video and is rendered in the style of a style image. Existing video style transfer methods cannot simultaneously realize high efficiency, arbitrary style and temporal consistency. In this paper, we propose the first real-time arbitrary video style transfer method with only one model. Specifically, we utilize a three-network architecture consisting of a prediction network, a stylization network and a loss network. The prediction network extracts style parameters from a given style image; the stylization network generates the corresponding stylized video; and the loss network is used to train the prediction and stylization networks, with a loss function that includes content loss, style loss and temporal consistency loss. We conduct three experiments and a user study to test the effectiveness of our method. The experimental results show that our method outperforms the state-of-the-art methods.

C3VQG: category consistent cyclic visual question generation

  • Shagun Uppal
  • Anish Madan
  • Sarthak Bhagat
  • Yi Yu
  • Rajiv Ratn Shah

Visual Question Generation (VQG) is the task of generating natural questions based on an image. Popular methods in the past have explored image-to-sequence architectures trained with maximum likelihood, which have demonstrated meaningful generated questions given an image and its associated ground-truth answer. VQG becomes more challenging if the image contains rich contextual information describing its different semantic categories. In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers. Our approach solves two major shortcomings of existing VQG systems: (i) it minimizes the level of supervision and (ii) it replaces generic questions with category-relevant generations. Most importantly, eliminating expensive answer annotations weakens the required supervision. Using different categories enables us to exploit different concepts, as the inference requires only the image and the category. Mutual information is maximized between the image, question, and answer category in the latent space of our VAE. A novel category consistent cyclic loss is proposed to enable the model to generate consistent predictions with respect to the answer category, reducing redundancies and irregularities. Additionally, we impose supplementary constraints on the latent space of our generative model to provide structure based on categories and to enhance generalization by encapsulating decorrelated features within each dimension. Through extensive experiments, the proposed model, C3VQG, outperforms state-of-the-art VQG methods with weak supervision.

Determining image age with rank-consistent ordinal classification and object-centered ensemble

  • Shota Ashida
  • Adam Jatowt
  • Antoine Doucet
  • Masatoshi Yoshikawa

A significant number of old photographs, including ones posted online, do not contain information about the date at which they were taken, or this information needs to be verified. Many such pictures are either scanned analog photographs or photographs taken with a digital camera with incorrect settings. Estimating the date of such pictures is useful for enhancing data quality and consistency, improving information retrieval, and other related applications. In this study, we propose a novel approach for automatic estimation of the shooting dates of photographs based on a rank-consistent ordinal classification method for neural networks. We also introduce an ensemble approach that involves object segmentation. We conclude that assuring rank consistency in the ordinal classification, as well as combining models trained on segmented objects, improves the results of the age determination task.
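
Below is a hedged sketch of a rank-consistent ordinal classification head in the spirit of CORAL-style methods: K-1 binary classifiers share one weight vector and differ only by their biases, so the predicted threshold probabilities stay monotonic and the decoded ranks are consistent. The feature dimension and number of periods are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    def __init__(self, feat_dim=512, num_periods=10):
        super().__init__()
        self.shared = nn.Linear(feat_dim, 1, bias=False)           # shared weight vector
        self.biases = nn.Parameter(torch.zeros(num_periods - 1))   # one bias per rank threshold

    def forward(self, features):
        return self.shared(features) + self.biases                 # (B, K-1) threshold logits

def decode_period(logits):
    # predicted rank = number of thresholds the sample exceeds
    probs = torch.sigmoid(logits)
    return (probs > 0.5).sum(dim=1)

def ordinal_loss(logits, period_labels, num_periods=10):
    # binary target for threshold k: is the true period greater than k?
    thresholds = torch.arange(num_periods - 1, device=logits.device)
    targets = (period_labels.unsqueeze(1) > thresholds).float()
    return nn.functional.binary_cross_entropy_with_logits(logits, targets)
```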

Cross-modal learning for saliency prediction in mobile environment

  • Dakai Ren
  • Xiangming Wen
  • Xiaoya Liu
  • Shuai Huang
  • Jiazhong Chen

Existing research reveals that viewing conditions have a significant impact on visual perception when media are viewed on mobile screens. This raises two issues in the area of visual saliency that we need to address: how saliency models perform in mobile conditions, and how to account for mobile conditions when designing a saliency model. To investigate the performance of saliency models in a mobile environment, eye fixations in four typical mobile conditions are collected as the mobile ground truth in this work. To account for mobile conditions in model design, we treat viewing factors and visual stimuli as two modalities, and a cross-modal deep learning architecture is proposed for visual attention prediction. Experimental results demonstrate that the model considering mobile viewing factors often outperforms the models without such consideration.

Objective object segmentation visual quality evaluation based on pixel-level and region-level characteristics

  • Ran Shi
  • Jian Xiong
  • Tong Qiao

Objective object segmentation visual quality evaluation is an emergent member of the visual quality assessment family. It aims at developing an objective measure, instead of a subjective survey, to evaluate object segmentation quality in agreement with human visual perception. It is an important benchmark for assessing and comparing the performance of object segmentation methods in terms of visual quality. In spite of its essential role, it still lacks sufficient study compared with other visual quality evaluation research. In this paper, we propose a novel full-reference objective measure consisting of a pixel-level sub-measure and a region-level sub-measure. The pixel-level sub-measure assigns proper weights not only to false positive and false negative pixels but also to true positive pixels, according to their certainty degrees. The region-level sub-measure considers the location distribution of false negative errors and the correlations among neighboring pixels. By combining these two sub-measures, our measure can evaluate the similarity of area, shape and object completeness between a segmentation result and its ground truth in terms of human visual perception. To evaluate the performance of our proposed measure, we tested it on an object segmentation subjective visual quality assessment database. The experimental results demonstrate that our proposed measure, with good robustness, matches subjective assessments better than other state-of-the-art objective measures.
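
As an illustrative sketch only (the weighting scheme here is an assumption, not the paper's formulation), a pixel-level sub-measure could weight TP/FP/FN pixels by a certainty degree derived from the distance to the ground-truth object boundary and combine them into a Dice-style score:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def weighted_pixel_score(pred, gt):
    """pred, gt: boolean masks of the same shape."""
    # certainty: pixels far from the GT boundary are more certainly object/background
    dist_in = distance_transform_edt(gt)
    dist_out = distance_transform_edt(~gt)
    certainty = np.where(gt, dist_in, dist_out)
    certainty = certainty / (certainty.max() + 1e-8)

    tp = np.sum(certainty[pred & gt])
    fp = np.sum(certainty[pred & ~gt])
    fn = np.sum(certainty[~pred & gt])
    return 2 * tp / (2 * tp + fp + fn + 1e-8)   # weighted Dice-style score
```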

Text-based visual question answering with knowledge base

  • Fang Zhou
  • Bei Yin
  • Zanxia Jin
  • Heran Wu
  • Dongyan Zhang

Text-based Visual Question Answering (VQA) usually needs to analyze and understand the text in a picture to give a correct answer to the given question. In this paper, a generic Text-based VQA with Knowledge Base (KB) is proposed, which performs text-based search on the text information obtained by optical character recognition (OCR) from images, constructs task-oriented knowledge information and integrates it into existing models. Due to the complexity of image scenes, the accuracy of OCR is not very high, and there are often cases where individual characters in a word are incorrect, resulting in inaccurate text information; with the help of the KB, some of the correct words can be recovered and the correct image text information can be added. Moreover, the knowledge information constructed with the KB can better explain the image information, allowing the model to fully understand the image and find the appropriate text answer. The experimental results on the TextVQA dataset show that our method improves accuracy, with a maximum increment of 39.2%.
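
A hedged sketch of one idea in the abstract follows: correcting noisy OCR tokens by fuzzy matching against a knowledge-base vocabulary. The use of difflib and the cutoff value are illustrative choices; the paper's actual KB lookup may differ.

```python
from difflib import get_close_matches

def correct_ocr_tokens(ocr_tokens, kb_vocabulary, cutoff=0.7):
    corrected = []
    for token in ocr_tokens:
        matches = get_close_matches(token.lower(), kb_vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected

# Toy example: the OCR error "c0ca" is mapped back to the vocabulary word "coca".
print(correct_ocr_tokens(["c0ca", "cola"], ["coca", "cola", "pepsi"]))
# -> ['coca', 'cola']
```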

Attention-constraint facial expression recognition

  • Qisheng Jiang

To make full use of the inherent correlation between facial regions and expressions, we propose an attention-constraint facial expression recognition method, where the prior correlation between facial regions and expressions is integrated into attention weights for extracting better representations. The proposed method mainly consists of four components: a feature extractor, a local self attention-constraint learner (LSACL), a global and local attention-constraint learner (GLACL) and a facial expression classifier. Specifically, the feature extractor extracts features from the overall facial image and its cropped facial regions. The extracted local features from facial regions are then fed into the local self attention-constraint learner, where prior rank constraints summarized from facial domain knowledge are embedded into the self-attention weights. Similarly, the rank correlation constraints between each facial region and a specified expression are further embedded into the global-to-local attention weights when the global feature and the local features from the local self attention-constraint learner are fed into the global and local attention-constraint learner. Finally, the feature from the global and local attention-constraint learner and the original global feature are fused and passed to the facial expression classifier. Experiments on two benchmark datasets validate the effectiveness of the proposed method.

Defense for adversarial videos by self-adaptive JPEG compression and optical texture

  • Yupeng Cheng
  • Xingxing Wei
  • Huazhu Fu
  • Shang-Wei Lin
  • Weisi Lin

Despite their demonstrated effectiveness in various computer vision tasks, Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Adversarial attacks and their defenses for DNNs in the image domain have been intensively studied, and some recent works have started to explore adversarial attacks on DNNs in the video domain. However, the corresponding defense is rarely studied. In this paper, we propose a new two-stage framework for defending against video adversarial attacks. It contains two main components, namely self-adaptive Joint Photographic Experts Group (JPEG) compression defense and optical texture based defense (OTD). In the self-adaptive JPEG compression defense, we adaptively choose an appropriate JPEG quality based on an estimate of the moving foreground object, so that JPEG compression suppresses most of the impact of adversarial noise without losing too much video quality. In OTD, we generate an "optical texture" containing high-frequency information based on the optical flow map, and use it to edit the Y channel (in YCrCb color space) of input frames, thus further reducing the influence of adversarial perturbation. Experimental results on a benchmark dataset demonstrate the effectiveness of our framework in recovering classification performance on perturbed videos.
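
The sketch below is a hedged illustration of the self-adaptive JPEG idea: estimate how much of the frame is moving foreground by simple frame differencing, then map a larger moving area to a higher JPEG quality (compress less where content matters). The differencing heuristic, thresholds and quality mapping are assumptions, not the paper's estimator.

```python
import io
import numpy as np
from PIL import Image

def moving_ratio(prev_frame, frame, diff_thresh=15):
    """frames: uint8 arrays of shape (H, W, 3); returns the fraction of 'moving' pixels."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16)).mean(axis=2)
    return float((diff > diff_thresh).mean())

def jpeg_defend(frame, ratio, q_min=30, q_max=90):
    quality = int(q_min + (q_max - q_min) * min(ratio * 5.0, 1.0))   # assumed mapping
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB"))
```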

Fusing CAMs-weighted features and temporal information for robust loop closure detection

  • Yaoqing Li
  • Sheng-hua Zhong
  • Tongwei Ren
  • Yan Liu

As a key component of simultaneous localization and mapping (SLAM) systems, loop closure detection (LCD) eliminates accumulated errors by recognizing previously visited places. In recent years, deep learning methods have proven effective for LCD. However, most existing methods do not make good use of the information provided by monocular images, which tends to limit their performance in challenging dynamic scenarios with partial occlusion by moving objects. To this end, we propose a novel workflow that combines multiple kinds of information provided by images. We first introduce semantic information into LCD by developing a local-aware Class Activation Map (CAM) weighting method for extracting features, which reduces the adverse effects of moving objects. Compared with previous methods based on semantic segmentation, our method has the advantage of not requiring additional models or other complex operations. In addition, we propose two effective temporal constraint strategies, which utilize the relationships among image sequences to improve detection performance. Moreover, we use a keypoint matching strategy as the final detector to further reject false positives. Experiments on four publicly available datasets indicate that our approach achieves higher accuracy and better robustness than the state-of-the-art methods.
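
As a hedged sketch of CAM-weighted feature pooling (shapes and the choice of a single class are illustrative, not the authors' exact pipeline): compute a class activation map from the last convolutional features and the classifier weights, then use it to re-weight the features before global pooling.

```python
import torch
import torch.nn.functional as F

def cam_weighted_descriptor(feat_maps, fc_weights, class_idx):
    """feat_maps: (C, H, W) conv features; fc_weights: (num_classes, C)."""
    cam = torch.einsum("c,chw->hw", fc_weights[class_idx], feat_maps)
    cam = F.relu(cam)
    cam = cam / (cam.max() + 1e-8)                  # normalize the activation map to [0, 1]
    weighted = feat_maps * cam.unsqueeze(0)         # down-weight regions the class does not activate
    return weighted.flatten(1).mean(dim=1)          # (C,) global image descriptor
```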

Fixations based personal target objects segmentation

  • Ran Shi
  • Gongyang Li
  • Weijie Wei
  • Zhi Liu

With the development of eye-tracking techniques, fixation has become an emerging interaction mode in many human-computer interaction fields. For a personal target objects segmentation task, although fixations can serve as a novel and more convenient interactive input, they induce a heavy ambiguity problem in indicating the intended target, so that segmentation quality is severely degraded. In this paper, to address this challenge, we develop an "extraction-to-fusion" strategy based iterative lightweight neural network, whose input is composed of an original image, a fixation map and a position map. Our neural network consists of two main parts: the first, extraction part is a concise interlaced structure of standard convolution layers and progressively higher dilated convolution layers, which better extracts and integrates local and global features of target objects; the second, fusion part is a convolutional long short-term memory component that refines the extracted features and stores them. Within the iteration framework, the currently extracted features are refined by fusing them with features stored from previous iterations, which acts as a feature transmission mechanism in our network. The current, improved segmentation result is then generated to further adjust the fixation map and the position map for the next iteration. Thus, the ambiguity problem induced by the fixations can be alleviated. Experiments demonstrate the better segmentation performance of our method and the effectiveness of each part of our model.

Improving auto-encoder novelty detection using channel attention and entropy minimization

  • Miao Tian
  • Dongyan Guo
  • Ying Cui
  • Xiang Pan
  • Shengyong Chen

Novelty detection is an important research area that mainly addresses the problem of classifying inliers, which usually consist of normal samples, versus outliers, composed of abnormal samples. Auto-encoders are often used for novelty detection. However, the generalization ability of the auto-encoder may cause undesirable reconstruction of abnormal elements and reduce the discriminative ability of the model. To solve this problem, we focus on better reconstructing the normal samples while retaining their unique information to improve the performance of auto-encoders for novelty detection. Firstly, we introduce an attention mechanism into the task: under the attention mechanism, the auto-encoder can pay more attention to the representation of inlier samples through adversarial training. Secondly, we apply information entropy to the latent layer to make it sparse and to constrain the expression of diversity. Experimental results on three public datasets show that the proposed method achieves performance comparable to previous popular approaches.
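
A hedged sketch of entropy minimization on a latent code follows: treat the latent activations as a per-sample distribution and penalize its entropy, which pushes the code toward a sparse, low-diversity expression. The softmax normalization is an assumption made for illustration.

```python
import torch

def latent_entropy_loss(z, eps=1e-8):
    """z: (B, D) latent activations of the auto-encoder bottleneck."""
    p = torch.softmax(z, dim=1)                 # per-sample distribution over latent dimensions
    entropy = -(p * (p + eps).log()).sum(dim=1)
    return entropy.mean()                       # minimizing this encourages sparse codes
```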

Relationship graph learning network for visual relationship detection

  • Yanan Li
  • Jun Yu
  • Yibing Zhan
  • Zhi Chen

Visual relationship detection aims to predict the relationships between detected object pairs. It is well believed that the correlations between image components (i.e., objects and the relationships between objects) are significant considerations when predicting objects' relationships. However, most current visual relationship detection methods only exploit the correlations among objects, while the correlations among objects' relationships remain underexplored. This paper proposes a relationship graph learning network (RGLN) to explore the correlations among objects' relationships for visual relationship detection. Specifically, RGLN obtains image objects using an object detector, and then every pair of objects constitutes a relationship proposal. All relationship proposals construct a relationship graph, in which the proposals are treated as nodes. Accordingly, RGLN designs bi-stream graph attention subnetworks to detect relationship proposals, in which one graph attention subnetwork analyzes correlations among relationships based on visual and spatial information, and the other analyzes correlations based on semantic and spatial information. Besides, RGLN exploits a relationship selection subnetwork to ignore redundant information from object pairs with no relationships. We conduct extensive experiments on two public datasets, VRD and VG. The experimental results demonstrate the competitiveness of RGLN compared with the state of the art.

Local structure alignment guided domain adaptation with few source samples

  • Yuying Cai
  • Jinfeng Li
  • Baodi Liu
  • Weifeng Liu
  • Kai Zhang
  • Changsheng Xu

Domain adaptation has received much attention for its efficiency in dealing with cross-domain learning tasks. Most existing domain adaptation methods adopt strategies that rely on large amounts of source label information, which limits their application in the real world, where only a few labeled samples are available. We exploit local geometric connections to tackle this problem and propose a Local Structure Alignment (LSA) guided domain adaptation method in this paper. LSA leverages the Nyström method to describe the distribution difference from the geometric perspective and then performs distribution alignment between domains. Specifically, LSA constructs a domain-invariant Hessian matrix to locally connect the data of the two domains by minimizing the Nyström approximation error. It then integrates the domain-invariant Hessian matrix with semi-supervised learning and finally builds an adaptive semi-supervised model. Extensive experimental results validate that the proposed LSA outperforms traditional domain adaptation methods, especially when only sparse source label information is available.

Multiplicative angular margin loss for text-based person search

  • Peng Zhang
  • Deqiang Ouyang
  • Feiyu Chen
  • Jie Shao

Text-based person search aims at retrieving the most relevant pedestrian images from a database in response to a query in the form of a natural language description. Existing algorithms mainly focus on embedding textual and visual features into a common semantic space so that the similarity score of features from different modalities can be computed directly. Softmax loss is widely adopted to classify textual and visual features into the correct category in the joint embedding space. However, softmax loss can only help classify features; it does not increase intra-class compactness or inter-class discrepancy. To this end, we propose a multiplicative angular margin (MAM) loss to learn angularly discriminative features for each identity. The multiplicative angular margin loss penalizes the angle between a feature vector and its corresponding classifier vector to learn more discriminative features. Moreover, to focus more on informative image-text pairs, we propose a pairwise similarity weighting (PSW) loss to assign higher weights to informative pairs. Extensive experimental evaluations have been conducted on the CUHK-PEDES dataset with our proposed losses. The results show the superiority of our proposed method. Code is available at https://github.com/pengzhanguestc/MAM_loss.
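
As a hedged, simplified sketch of a multiplicative angular margin (in the spirit of SphereFace-style losses, not necessarily the paper's exact formulation): the target logit cos(theta) is replaced by cos(m * theta) on normalized features and class weights. The margin and scale values, and the omission of the piecewise monotonicity correction, are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeAngularMargin(nn.Module):
    def __init__(self, feat_dim, num_ids, margin=2.0, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_ids, feat_dim))
        self.margin, self.scale = margin, scale

    def forward(self, features, labels):
        # cosine similarity between normalized features and classifier vectors
        cos = F.linear(F.normalize(features), F.normalize(self.weight))      # (B, num_ids)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target_cos = torch.cos(self.margin * theta)                          # multiplicative margin
        one_hot = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(one_hot, target_cos, cos) * self.scale
        return F.cross_entropy(logits, labels)
```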

Integrating aspect-aware interactive attention and emotional position-aware for multi-aspect sentiment analysis

  • Xiaoye Wang
  • Xiaowen Zhou
  • Zan Gao
  • Peng Yang
  • Xianbin Wen
  • Hongyun Ning

Aspect-level sentiment analysis is a fine-grained sentiment analysis task which aims to infer the sentiment polarity associated with different aspects in an opinion sentence. Attention-based neural networks have proven effective in extracting aspect terms, but prior models are context-dependent. Moreover, prior works only attend to aspect terms to detect the sentiment words and do not consider sentiment words that might be influenced by domain-specific knowledge. In this work, we propose a novel model integrating Aspect-aware Interactive Attention and an Emotional Position-aware module for multi-aspect sentiment analysis (abbreviated AIAEP), where the aspect-aware interactive attention is utilized to extract aspect terms: it fuses the domain-specific information of an aspect and its context and learns their relationship representations through global-context and local-context attention mechanisms. Specifically, syntactic parsing and a sentiment lexicon are used to incorporate prior domain knowledge. We then propose a novel position-aware fusion scheme to compose aspect-sentiment pairs, which combines the absolute and relative distances between aspect terms and sentiment words and can improve the accuracy of polarity classification. Extensive experimental results on the SemEval2014 task 4 restaurant and AIChallenge2018 datasets demonstrate that AIAEP outperforms state-of-the-art approaches and is very effective for aspect-level sentiment analysis.

Graph-based motion prediction for abnormal action detection

  • Yao Tang
  • Lin Zhao
  • Zhaoliang Yao
  • Chen Gong
  • Jian Yang

Abnormal action detection is the most noteworthy part of anomaly detection, which tries to identify unusual human behaviors in videos. Previous methods typically utilize future frame prediction to detect frames deviating from the normal scenario. While this strategy enjoys success in the accuracy of anomaly detection, critical information such as the cause and location of the abnormality cannot be acquired. This paper proposes human motion prediction for abnormal action detection. We employ sequences of human poses to represent human motion, and detect irregular behavior by comparing the predicted pose with the actual pose detected in the frame. Hence the proposed method is able to explain why an action is regarded as an irregularity and to locate where the anomaly happens. Moreover, pose sequences are robust to noise, complex backgrounds and small targets in videos. Since posture information is non-Euclidean data, a graph convolutional network is adopted for future pose prediction, which leads not only to greater expressive power but also to stronger generalization capability.

Experiments are conducted both on the widely used anomaly detection dataset ShanghaiTech and on our newly proposed dataset NJUST-Anomaly, which mainly contains irregular behaviors that happen on campus. Our dataset expands the existing datasets by providing more abnormal actions of public concern in social security, which happen in more complex scenes with dynamic backgrounds. Experimental results on both datasets demonstrate the superiority of our method over state-of-the-art methods. The source code and the NJUST-Anomaly dataset will be made public at https://github.com/datangzhengqing/MP-GCN.
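
A hedged sketch of the comparison step described above: score a frame as anomalous when the predicted pose deviates strongly from the detected pose. The per-joint L2 error and the threshold are illustrative choices, not the paper's scoring function.

```python
import numpy as np

def pose_anomaly_score(pred_pose, detected_pose):
    """poses: (num_joints, 2) arrays of joint coordinates."""
    return float(np.linalg.norm(pred_pose - detected_pose, axis=1).mean())

def is_abnormal(pred_pose, detected_pose, threshold=20.0):   # threshold in pixels, assumed
    return pose_anomaly_score(pred_pose, detected_pose) > threshold
```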

Attention feature matching for weakly-supervised video relocalization

  • Haoyu Tang
  • Jihua Zhu
  • Zan Gao
  • Tao Zhuo
  • Zhiyong Cheng

Localizing the desired video clip for a given query in an untrimmed video has been a hot research topic in multimedia understanding. Recently, a new task named video relocalization, in which the query is itself a video clip, has been proposed. Some methods have been developed for this task; however, they often require dense annotations of temporal boundaries inside long videos for training. A more practical solution is the weakly-supervised approach, which only needs the matching information between the query and the video.

Motivated by this, we propose a weakly-supervised video relocalization approach based on an attention-based feature matching method. Specifically, it recognizes the target video clip by finding the clip whose frames are most relevant to the query clip frames, based on the matching results of the frame embeddings. In addition, an attention module is introduced to identify the frames containing rich semantic correlations in the query video. Extensive experiments on the ActivityNet dataset demonstrate that our method consistently outperforms several weakly-supervised methods and even achieves competitive performance compared with supervised baselines.
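
The sketch below is a hedged illustration of frame-level feature matching: score each candidate clip by the average similarity between its frame embeddings and the query frame embeddings, then keep the best-scoring window. The sliding-window enumeration and max-pooling over query frames are illustrative simplifications, not the authors' matching module.

```python
import torch
import torch.nn.functional as F

def best_clip(video_feats, query_feats, window):
    """video_feats: (T, D), query_feats: (Q, D); returns (start, end) frame indices."""
    sim = F.normalize(video_feats, dim=1) @ F.normalize(query_feats, dim=1).t()  # (T, Q) cosine similarities
    frame_scores = sim.max(dim=1).values                       # best-matching query frame per video frame
    scores = frame_scores.unfold(0, window, 1).mean(dim=1)     # sliding-window average score
    start = int(scores.argmax())
    return start, start + window
```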

Pulse localization networks with infrared camera

  • Bohong Yang
  • Kai Meng
  • Hong Lu
  • Xinyao Nie
  • Guanhao Huang
  • Jingjing Luo
  • Xing Zhu

Pulse localization is a basic task in robotic pulse diagnosis. More accurate localization can reduce misdiagnosis caused by different types of pulse. Traditional works usually use a collection surface with a certain area for contact detection, and move the collection surface to collect power changes for pulse localization. These methods often require the subjects to place their wrists in a given position. In this paper, we propose a novel pulse localization method that uses an infrared camera as the input sensor and locates the pulse on the wrist with a neural network. This method not only reduces the contact between the machine and the subject and the discomfort of the process, but also reduces the preparation time for the test, which improves detection efficiency. The experiments show that our proposed method can locate the pulse with high accuracy. We have applied this method to a pulse diagnosis robot for pulse data collection.

Structure-preserving extremely low light image enhancement with fractional order differential mask guidance

  • Yijun Liu
  • Zhengning Wang
  • Ruixu Geng
  • Hao Zeng
  • Yi Zeng

Low visibility and high-level noise are two challenges for low-light image enhancement. In this paper, by introducing the fractional order differential, we propose an end-to-end conditional generative adversarial network (GAN) to solve these two problems. For the problem of low visibility, we set up a global discriminator to improve the overall reconstruction quality and restore brightness information. For the high-level noise problem, we introduce fractional order differentiation into both the generator and the discriminator. Compared with conventional end-to-end methods, the fractional order can better distinguish noise from high-frequency details, thereby achieving superior noise reduction while maintaining details. Finally, experimental results show that the proposed model obtains superior visual effects in low-light image enhancement. By introducing the fractional order differential, we anticipate that our framework will enable high-quality and detailed image recovery not only in low-light enhancement but also in other fields that require details.
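
As a hedged sketch of the fractional-order differential idea, the code below builds a 1-D fractional differential mask from the Grünwald-Letnikov coefficients w_k = (-1)^k * C(v, k) and applies it by convolution; the order v, the mask length and the 1-D setting are illustrative, and the 2-D masks used in the paper may differ.

```python
import numpy as np

def gl_coefficients(order, length):
    # recursion w_0 = 1, w_k = w_{k-1} * (k - order - 1) / k gives (-1)^k * C(order, k)
    w = np.empty(length)
    w[0] = 1.0
    for k in range(1, length):
        w[k] = w[k - 1] * (k - order - 1) / k
    return w

def fractional_diff_1d(signal, order=0.5, length=5):
    mask = gl_coefficients(order, length)
    # D^v f(x) ~ sum_k w_k * f(x - k), i.e. a discrete convolution with the mask
    return np.convolve(signal, mask, mode="full")[: len(signal)]
```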

Change detection from SAR images based on deformable residual convolutional neural networks

  • Junjie Wang
  • Feng Gao
  • Junyu Dong

Convolutional neural networks (CNNs) have made great progress in synthetic aperture radar (SAR) image change detection. However, the sampling locations of traditional convolutional kernels are fixed and cannot be changed according to the actual structure of the SAR images. Besides, objects may appear with different sizes in natural scenes, which requires the network to have a stronger multi-scale representation ability. In this paper, a novel Deformable Residual Convolutional Neural Network (DRNet) is designed for SAR image change detection. First, the proposed DRNet introduces deformable convolutional sampling locations, so the shape of the convolutional kernel can be adaptively adjusted according to the actual structure of ground objects. To create the deformable sampling locations, 2-D offsets are calculated for each pixel according to the spatial information of the input images; the sampling locations of pixels can then adaptively reflect the spatial structure of the input images. Moreover, we propose a novel pooling module that replaces the vanilla pooling to utilize multi-scale information effectively by constructing hierarchical residual-like connections within one pooling layer, which improves the multi-scale representation ability at a granular level. Experimental results on three real SAR datasets demonstrate the effectiveness of the proposed DRNet.
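
A hedged sketch of the deformable sampling idea using torchvision's DeformConv2d: a small convolution predicts per-pixel 2-D offsets (two values per kernel sampling point), and the deformable convolution samples the input at those shifted locations. The channel sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch=32, out_ch=32, k=3):
        super().__init__()
        # 2 offsets (dx, dy) for each of the k*k kernel sampling points
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_conv(x)          # (N, 2*k*k, H, W) predicted from the input
        return self.deform_conv(x, offsets)    # features sampled at the deformed locations

# x = torch.randn(1, 32, 64, 64); y = DeformableBlock()(x)  # y: (1, 32, 64, 64)
```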

Efficient inter-image relation graph neural network hashing for scalable image retrieval

  • Hui Cui
  • Lei Zhu
  • Wentao Tan

Unsupervised deep hashing is a promising technique for large-scale image retrieval, as it leverages powerful deep neural networks and has the advantage of label independence. However, the unsupervised deep hashing process needs to train a large number of deep neural network parameters, which are hard to optimize when no labeled training samples are provided. How to maintain the good scalability of unsupervised hashing while exploiting the advantages of deep neural networks is an interesting but challenging problem. With this motivation, in this paper we propose a simple but effective Inter-image Relation Graph Neural Network Hashing (IRGNNH) method. Different from existing complex models, we discover the latent inter-image semantic relations without any manual labels and exploit them to assist the unsupervised deep hashing process. Specifically, we first parse the images to extract the latent semantics involved. Then, a relation graph convolutional network is constructed to model the inter-image semantic relations and visual similarity, which generates representation vectors for image relations and contents. Finally, adversarial learning is performed to seamlessly embed the constructed relations into the image hash learning process and improve the discriminative capability of the hash codes. Experiments demonstrate that our method significantly outperforms state-of-the-art unsupervised deep hashing methods in both retrieval accuracy and efficiency.

Towards annotation-free evaluation of cross-lingual image captioning

  • Aozhu Chen
  • Xinyi Huang
  • Hailan Lin
  • Xirong Li

Cross-lingual image captioning, with its ability to caption an unlabeled image in a target language other than English, is an emerging topic in the multimedia field. In order to save precious human effort from re-writing reference sentences per target language, in this paper we make a brave attempt towards annotation-free evaluation of cross-lingual image captioning. Depending on whether we assume the availability of English references, two scenarios are investigated. For the first scenario, with references available, we propose two metrics, i.e., WMDRel and CLinRel. WMDRel measures the semantic relevance between a model-generated caption and a machine translation of an English reference using their Word Mover's Distance. By projecting both captions into a deep visual feature space, CLinRel is a visual-oriented cross-lingual relevance measure. For the second scenario, which has zero references and is thus more challenging, we propose CMedRel to compute a cross-media relevance between the generated caption and the image content, in the same visual feature space as used by CLinRel. We have conducted a number of experiments to evaluate the effectiveness of the three proposed metrics. The combination of WMDRel, CLinRel and CMedRel has a Spearman's rank correlation of 0.952 with the sum of BLEU-4, METEOR, ROUGE-L and CIDEr, four standard metrics computed using references in the target language. CMedRel alone has a Spearman's rank correlation of 0.786 with the standard metrics. The promising results show the high potential of the new metrics for evaluation with no need for references in the target language.
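
Below is a hedged sketch of the Word Mover's Distance computation underlying a metric like WMDRel, using gensim with pretrained word vectors; the choice of embedding model and the conversion from distance to a relevance score are assumptions, not the paper's exact setup.

```python
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-50")   # any pretrained word embedding model works

def wmd_relevance(generated_caption, translated_reference):
    gen_tokens = generated_caption.lower().split()
    ref_tokens = translated_reference.lower().split()
    distance = word_vectors.wmdistance(gen_tokens, ref_tokens)
    return 1.0 / (1.0 + distance)       # smaller distance -> higher relevance

# wmd_relevance("a dog runs on the grass", "a dog is running across a lawn")
```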

Synthesized 3D models with smartphone based MR to modify the PreBuilt environment: interior design

  • Anish Bhardwaj
  • Nikhil Chauhan
  • Rajiv Ratn Shah

The past few years have seen an increase in the number of products that use AR and VR, as well as the emergence of products that combine both categories, i.e., Mixed Reality (MR). However, current systems are exclusive to a market in the top 1% of the population in most countries due to the expensive and heavy technology they require. This project showcases a smartphone-based Mixed Reality system through an interior design solution that allows the user to visualise their design choices through the lens of a smartphone. Our system uses image processing algorithms to perceive room dimensions, alongside a GUI that allows a user to create their own blueprints. Navigable 3D models are created from these blueprints, allowing users to view their builds. Following this, users switch to the mobile application to visualise their ideas in their own homes (MR). This system/POC showcases the potential of MR as a field that can be explored for a larger portion of the population through a more efficient medium.

SeekSuspect: retrieving suspects from criminal datasets using visual memory

  • Aayush Jain
  • Meet Shah
  • Suraj Pandey
  • Mansi Agarwal
  • Rajiv Ratn Shah
  • Yifang Yin

It is crucial for police departments to automatically determine whether suspects are present in a criminal database, sometimes based on an informant's visual memory alone. FaceFetch [15] is a state-of-the-art face retrieval system capable of retrieving an envisioned face from a large-scale database. Although FaceFetch can retrieve images effectively, it lacks sophisticated techniques to produce results efficiently. To this end, we propose SeekSuspect, a faster interactive suspect retrieval framework, which introduces several optimization algorithms into FaceFetch's framework. We train and test our system on a real-world dataset curated in collaboration with a metropolitan police department in India. Results reveal that SeekSuspect beats FaceFetch and can be employed by law enforcement agencies to retrieve suspects.

A large-scale image retrieval system for everyday scenes

  • Arun Zachariah
  • Mohamed Gharibi
  • Praveen Rao

We present a system for large-scale image retrieval of everyday scenes with common objects. Our system leverages advances in deep learning and natural language processing (NLP) for improved understanding of images by capturing the relationships between the objects within an image. As a result, a user can retrieve highly relevant images and obtain suggestions for similar image queries to further explore the repository. Each image in the repository is processed (using deep learning) to obtain its most probable captions and objects. The captions are parsed into tree structures using NLP techniques, and stored and indexed in a database system. When a query image is posed, an optimized tree-pattern query is executed by the database system to obtain candidate matches, which are then ranked using the tree-edit distance between the tree structures to output the top-k matches. Word embeddings and Bloom filters are used to obtain similar image queries. By clicking the suggested similar image queries, a user can intuitively explore the repository.
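
As a hedged sketch of ranking candidate matches by tree-edit distance between parsed caption trees, the code below uses the Zhang-Shasha algorithm from the `zss` package; the toy trees and the ranking helper are illustrative only, not the system's actual data model.

```python
from zss import Node, simple_distance

# toy caption trees: root is the verb, children are its arguments
query_tree = Node("runs").addkid(Node("dog")).addkid(Node("grass"))
candidate_trees = {
    "img_001": Node("runs").addkid(Node("dog")).addkid(Node("park")),
    "img_002": Node("sits").addkid(Node("cat")),
}

def rank_candidates(query, candidates, top_k=5):
    scored = [(name, simple_distance(query, tree)) for name, tree in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]

print(rank_candidates(query_tree, candidate_trees))   # smaller edit distance ranks first
```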

10 years of video browser showdown

  • Klaus Schoeffmann
  • Jakub Lokoč
  • Werner Bailer

The Video Browser Showdown (VBS) has influenced the multimedia community for 10 years now. More than 30 unique teams from over 21 countries have participated in the VBS since 2012. In 2021, we celebrate the 10th anniversary of the VBS, where 17 international teams compete against each other in an unprecedented contest of fast and accurate multimedia retrieval. In this tutorial we discuss the motivation and details of the VBS contest, including its history, rules, evaluation metrics, and achievements in multimedia retrieval. We talk about the properties of specific VBS retrieval systems and their unique characteristics, as well as existing open-source tools that can be used as a starting point for first-time participants. Participants of this tutorial will gain a detailed understanding of the VBS and its search systems, and see the latest developments in interactive video retrieval.