MMAsia '19: Proceedings of the ACM Multimedia Asia on ZZZ

SESSION: Best Paper Session

Session details: Best Paper Session

  • Cheng Wen-Huang

Efficient Dense Modules of Asymmetric Convolution for Real-Time Semantic Segmentation

  • Lo Shao-Yuan

Real-time semantic segmentation plays an important role in practical applications such as self-driving and robots. Most semantic segmentation research focuses on improving estimation accuracy with little consideration of efficiency. Several previous studies that emphasize high-speed inference often fail to produce high-accuracy segmentation results. In this paper, we propose a novel convolutional network named Efficient Dense modules with Asymmetric convolution (EDANet), which employs an asymmetric convolution structure and incorporates dilated convolution and dense connectivity to achieve high efficiency at low computational cost and model size. EDANet is 2.7 times faster than the existing fast segmentation network, ICNet, while achieving a similar mIoU score without any additional context module, post-processing scheme, or pretrained model. We evaluate EDANet on the Cityscapes and CamVid datasets and compare it with other state-of-the-art systems. Our network can run on high-resolution inputs at 108 FPS on one GTX 1080Ti.
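To illustrate why an asymmetric convolution structure lowers cost, the following is a minimal sketch that compares weight counts. It assumes the common factorization of an n x n convolution into an n x 1 convolution followed by a 1 x n convolution, with the intermediate feature map keeping the output channel count; the abstract does not state EDANet's exact layer configuration, and the function names are hypothetical.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (bias terms ignored)."""
    return kernel * kernel * c_in * c_out


def asymmetric_conv_params(kernel, c_in, c_out):
    """Weights in a kernel x 1 convolution followed by a 1 x kernel one.

    Assumption: the intermediate feature map has c_out channels.
    """
    return kernel * c_in * c_out + kernel * c_out * c_out


if __name__ == "__main__":
    full = standard_conv_params(3, 64, 64)
    factored = asymmetric_conv_params(3, 64, 64)
    print(full, factored, factored / full)
```

Under these assumptions, a 3 x 3 layer with 64 input and 64 output channels needs 36864 weights in standard form and 24576 in factored form, a reduction of one third.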

Adaptive Bilinear Pooling for Fine-grained Representation Learning

  • Min Shaobo

Fine-grained representation learning aims to generate discriminative descriptions of fine-grained visual objects. Recently, bilinear feature interaction has proven effective in generating powerful high-order representations with spatially invariant information. However, existing methods apply a fixed feature interaction strategy to all samples, which ignores the image and region heterogeneity in a dataset. To this end, we propose a generalized feature interaction method, named Adaptive Bilinear Pooling (ABP), which can adaptively infer a suitable pooling strategy for a given sample based on image content. Specifically, ABP consists of two learning strategies: p-order learning (P-net) and spatial attention learning (S-net). The p-order learning predicts an optimal exponential coefficient rather than a fixed order number to extract moderate visual information from an image. The spatial attention learning aims to infer a weighted score that measures the importance of each local region, which can compact the image representations. To make ABP compatible with kernelized bilinear feature interaction, a crossed two-branch structure is utilized to combine the P-net and S-net. This structure can facilitate complementary information exchange between two different visual branches. Experiments on three widely used benchmarks, including fine-grained object classification and action recognition, demonstrate the effectiveness of the proposed method.

Weakly Supervised Video Summarization by Hierarchical Reinforcement Learning

  • Chen Yiyan

Conventional video summarization approaches based on reinforcement learning have the problem that the reward can only be received after the whole summary is generated. Such a reward is sparse, which makes reinforcement learning hard to converge. Another problem is that labelling each shot is tedious and costly, which usually prohibits the construction of large-scale datasets. To solve these problems, we propose a weakly supervised hierarchical reinforcement learning framework, which decomposes the whole task into several subtasks to enhance the summarization quality. This framework consists of a manager network and a worker network. For each subtask, the manager is trained to set a subgoal using only a task-level binary label, which requires far fewer labels than conventional approaches. Guided by the subgoal, the worker predicts the importance scores for video shots in the subtask by policy gradient according to both the global reward and innovatively defined sub-rewards to overcome the sparsity problem. Experiments on two benchmark datasets show that our proposal achieves the best performance, even better than supervised approaches.

Semantic Prior Guided Face Inpainting

  • Zhang Zeyang

Face inpainting is a sub-task of image inpainting designed to repair broken or occluded incomplete portraits. Due to the high complexity of face image details, inpainting on the face is more difficult. At present, face-related tasks often draw on excellent methods from face recognition and face detection, using multitasking to boost their effect. Therefore, this paper proposes to add face prior knowledge to an existing advanced inpainting model, combined with perceptual loss and SSIM loss, to improve the model's repair efficiency. A new face inpainting process and algorithm is implemented, and the repair effect is improved.

SESSION: Multimedia Search

Session details: Multimedia Search

  • Min Weiqing

Attention-Aware Feature Pyramid Ordinal Hashing for Image Retrieval

  • Sun Xie

Due to the effectiveness of representation learning, deep hashing methods have attracted increasing attention in image retrieval. However, most existing deep hashing methods merely encode the raw information of the last layer for hash learning, which results in the following deficiencies: (1) the useful information from the preceding layers is not fully exploited; (2) the local salient information of the image is neglected. To this end, we propose a novel deep hashing method, called Attention-Aware Feature Pyramid Ordinal Hashing (AFPH), which explores both the visual structure information and semantic information from different convolutional layers. Specifically, two feature pyramids based on spatial and channel attention are constructed to capture the local salient structure at multiple scales. Moreover, a multi-scale feature fusion strategy is proposed to aggregate the feature maps from multi-level pyramidal layers to generate the discriminative feature for ranking-based hashing. Experimental results on two widely used image retrieval datasets demonstrate the superiority of our method.

Measuring Similarity between Brands using Followers' Post in Social Media

  • Zhang Yiwei

In this paper, we propose a new measure to estimate the similarity between brands via posts of brands' followers on social network services (SNS). Our method was developed with the intention of exploring the brands that customers are likely to jointly purchase. Nowadays, brands use social media for targeted advertising because influencing users' preferences can greatly affect the trends in sales. We assume that data on SNS allows us to make quantitative comparisons between brands. Our proposed algorithm analyzes the daily photos and hashtags posted by each brand's followers. By clustering them and converting them to histograms, we can calculate the similarity between brands. We evaluated our proposed algorithm with purchase logs, credit card information, and answers to questionnaires. The experimental results show that the purchase data maintained by a mall or a credit card company can predict co-purchase very well, but not the customer's willingness to buy products of new brands. On the other hand, our method can predict the users' interest in brands with a correlation value over 0.53, which is fairly high considering that such interest in brands is highly subjective and individual-dependent.

Social Font Search by Multimodal Feature Embedding

  • Choi Saemi

Given a query term q, a typical tag/keyword-based search system retrieves the documents in the dataset in which q occurs. However, when applying these systems to a real-world font web community setting, practical challenges arise: font tags are more subjective than those of other benchmark datasets, which magnifies the tag mismatch problem. To address these challenges, we propose a tag dictionary space leveraged by word embedding, which relates undefined words that have a similar meaning. Even if a query is not defined in the tag dictionary, we can represent it as a vector on the tag dictionary space. The proposed system facilitates multi-modal inputs that can use both textual and image queries. By integrating a visual sentiment concept model that classifies affective concepts as adjective-noun pairs for a given image and uses them as a query, users can interact with the search system in a multi-modal way. We used crowdsourcing to collect user ratings for the retrieved fonts and observed that the fonts retrieved with the proposed methods obtained a higher score compared to other methods.

Video Summarization based on Sparse Subspace Clustering with Automatically Estimated Number of Clusters

  • Hao Pengyi

Advancements in technology have resulted in a sharp growth in the number of digital cameras at people's disposal all across the world. Consequently, the huge storage space consumed by the videos from these devices on video repositories makes video processing and analysis time-consuming. Furthermore, this also slows down video browsing and retrieval. Video summarization plays a crucial role in solving these issues. Across the many video summarization approaches proposed to date, the goal is to take a long video and generate a video summary in the form of a short video skim without losing the meaning or the message conveyed by the original lengthy video. This is done by selecting the important frames, called key-frames. The approach proposed in this work performs automatic summarization of digital videos based on detected objects' deep features. To this end, we apply sparse subspace clustering with an automatically estimated number of clusters to the objects' deep features. The summary generated by our scheme stores the meta-data for each short video inferred from the clustering results. In this paper, we also introduce a new video dataset for video summarization. We evaluate the performance of our work using the TVSum dataset and our video summarization dataset.

POSTER SESSION: Poster Session

Session details: Poster Session

  • Gan Tian

Residual Graph Convolutional Networks for Zero-Shot Learning

  • Wei Jiwei

Most existing Zero-Shot Learning (ZSL) approaches adopt the semantic space as a bridge to classify unseen categories. However, it is difficult to transfer knowledge from seen categories to unseen categories through the semantic space, since the correlations among categories are uncertain and ambiguous in the semantic space. In this paper, we formulate zero-shot learning as a classifier weight regression problem. Specifically, we propose a novel Residual Graph Convolution Network (ResGCN) which takes word embeddings and a knowledge graph as inputs and outputs a visual classifier for each category. ResGCN can effectively alleviate the problems of over-smoothing and over-fitting. At test time, an unseen image can be classified by ranking the inner products of its visual feature and the predicted visual classifiers. Moreover, we provide a new method to build a better knowledge graph. Our approach not only further enhances the correlations among categories, but also makes it easy to add new categories to the knowledge graph. Experiments conducted on the large-scale ImageNet 2011 21K dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.
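The test-time step described in this abstract, ranking the inner products between an image's visual feature and each category's classifier, can be sketched as follows. This is an illustration only: the class names, the three-dimensional vectors, and the function name `rank_classes` are hypothetical, and the classifier weights stand in for whatever a trained ResGCN would output.

```python
def rank_classes(feature, classifiers):
    """Return class names sorted by descending inner product with feature."""
    scores = {
        name: sum(f * w for f, w in zip(feature, weights))
        for name, weights in classifiers.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical visual feature and per-class classifier weights.
    feature = [1.0, 0.0, 2.0]
    classifiers = {
        "zebra": [0.5, 0.1, 1.0],
        "okapi": [0.2, 0.9, 0.3],
        "tapir": [1.0, 0.0, -1.0],
    }
    print(rank_classes(feature, classifiers))
```

The first entry of the returned list is the predicted category; keeping the full ranking also makes top-k evaluation straightforward.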

L0 Gradient Smoothing and Bimodal Histogram Analysis: A Robust Method for Sea-sky-line Detection

  • Jiao Jian

Sea-sky-line detection is an important research topic in the field of object detection and tracking on the sea. We propose a method based on L0 gradient smoothing and bimodal histogram analysis to improve the robustness and accuracy of sea-sky-line detection. The proposed method mainly depends on the brightness difference between the sea region and the sky region in the image. First, we use L0 gradient smoothing to eliminate discrete noise in the image and achieve the modularity of brightness. Differing from previous methods, diagonal dividing is applied to obtain the brightness thresholds for the sky and sea regions. Then the thresholds are used for bimodal histogram analysis, which helps to obtain the brightness near the sea-sky-line and narrow the detection region. After narrowing the detection region, the sea-sky-line in the image is extracted by a linear fitting method. To evaluate the performance of the proposed method, we manually construct a dataset of 40,000 images taken in five scenes. Moreover, we mark the corresponding ground-truth position of the sea-sky-line in each of the images. Extensive experiments on the dataset demonstrate that our method substantially outperforms the state-of-the-art methods.
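The final extraction step mentioned in this abstract is a linear fit. The sketch below shows one plausible form of it, an ordinary least-squares fit of a line to candidate boundary pixels; the abstract does not say which fitting method the authors use, and the candidate points and the function name `fit_line` are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("need at least two distinct x coordinates")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical (column, row) candidates inside the narrowed region.
    candidates = [(0, 100), (100, 102), (200, 104)]
    print(fit_line(candidates))
```

Plain least squares is sensitive to outliers, so a robust variant may be preferable when the candidate set still contains waves or clouds.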

Deep Distillation Metric Learning

  • Han Jiaxu

Due to the emergence of large-scale and high-dimensional data, measuring the similarity between data points becomes challenging. In order to obtain effective representations, metric learning has become one of the most active research areas in the field of computer vision and pattern recognition. However, models using trained networks for predictions are often cumbersome and difficult to deploy. Therefore, in this paper, we propose a novel deep distillation metric learning (DDML) method for online teaching in the procedure of learning the distance metric. Specifically, we employ model distillation to transfer the knowledge acquired by the larger model to the smaller model. Unlike the 2-step offline and mutual online manners, we propose to train a powerful teacher model, which transfers its knowledge to a lightweight and generalizable student model and is iteratively improved by the feedback from the student model. We show that our method achieves state-of-the-art results on CUB200-2011 and CARS196 while having advantages in computational efficiency.

Self-balance Motion and Appearance Model for Multi-object Tracking in UAV

  • Yu Hongyang

Under the tracking-by-detection framework, multi-object tracking methods try to connect object detections with target trajectories by a reasonable policy. Most methods represent objects by their appearance and motion. The association is mostly inferred from a fusion of appearance similarity and motion consistency. However, the fusion ratio between appearance and motion is often determined by a subjective setting. In this paper, we propose a novel self-balance method fusing appearance similarity and motion consistency. Extensive experimental results on public benchmarks demonstrate the effectiveness of the proposed method in comparison with several state-of-the-art trackers.

Deep Spherical Gaussian Illumination Estimation for Indoor Scene

  • Li Mengtian

In this paper, we propose a learning-based method to estimate high dynamic range (HDR) indoor illumination from only a single low dynamic range (LDR) photograph of limited field-of-view. Considering the extreme complexity of indoor illumination, which is virtually impossible to reconstruct perfectly, we choose to encode the environmental illumination in Spherical Gaussian (SG) functions with fixed centering directions and bandwidth and only allow the weights to vary. An end-to-end convolutional neural network (CNN) is designed and trained to build the complex relationship between a photograph and its illumination represented by SG functions. Moreover, we employ a masked L2 loss instead of a naive L2 loss to avoid the loss of high-frequency information, and propose a glossy loss to improve the rendering quality. Our experiments demonstrate that the proposed approach outperforms the state of the art both qualitatively and quantitatively.
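The representation described in this abstract, a sum of Spherical Gaussian lobes with fixed centers and bandwidth and free weights, can be sketched with the widely used SG form w * exp(bandwidth * (dot(v, center) - 1)). The abstract does not give the exact parameterization, so the scalar weights, the two-lobe setup, and the function name `sg_illumination` are assumptions made for illustration.

```python
import math


def sg_illumination(direction, centers, bandwidth, weights):
    """Evaluate sum_k w_k * exp(bandwidth * (dot(direction, center_k) - 1)).

    direction and every center are assumed to be unit-length 3-vectors.
    """
    total = 0.0
    for center, weight in zip(centers, weights):
        cosine = sum(d * c for d, c in zip(direction, center))
        total += weight * math.exp(bandwidth * (cosine - 1.0))
    return total


if __name__ == "__main__":
    # Hypothetical setup: two lobes pointing up and down, shared bandwidth.
    centers = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
    weights = [2.0, 5.0]
    print(sg_illumination((0.0, 0.0, 1.0), centers, 10.0, weights))
```

Because the centers and the bandwidth are fixed, the illumination is linear in the weights, so a network only has to regress one weight vector per image.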

NRQQA: A No-Reference Quantitative Quality Assessment Method for Stitched Images

  • Yu Shengju

Image stitching technology has been widely used in immersive applications, such as 3D modeling, VR and AR. The quality of stitching results is crucial. At present, the objective quality assessment methods of stitched images are mainly based on the availability of ground truth (i.e., Full-Reference). However, in most cases, ground truth is unavailable. In this paper, a no-reference quality assessment metric specifically designed for stitched images is proposed. We first find out the corresponding parts of source images in the stitched image. Then, the isolated points and the outer points generated by spherical projection are eliminated. After that, we take advantage of the bounding rectangle of stitching seams to locate the position of overlapping regions in the stitched image. Finally, the assessment of overlapping regions is taken as the final scoring result. Extensive experiments have shown that our scores are consistent with human vision. Even for the nuances that cannot be distinguished by human eyes, our proposed metric is also effective.

Gradient Guided Image Deblocking Using Convolutional Neural Networks

  • Jung Cheolkon

Block-based transform coding by its nature causes blocking artifacts, which severely degrade picture quality, especially at high compression rates. Although convolutional neural networks (CNNs) achieve good performance in image restoration tasks, existing methods mainly focus on deep or efficient network architectures. The gradient of a compressed image differs from that of the original image in that it exhibits dramatic changes in pixel values along block boundaries. Motivated by this observation, we propose gradient guided image deblocking based on CNNs in this paper. Guided by the gradient information of the input blocky image, the proposed network successfully preserves textural edges while reducing blocky edges, and thus restores the original clean image from compression degradation. Experimental results demonstrate that the gradient information in the input compressed image contributes to blocking artifact reduction and that the proposed method achieves a significant performance improvement in terms of visual quality and objective measurements.

Color Recovery from Multi-Spectral NIR Images Using Gray Information

  • Fu Qingtao

Converting near-infrared (NIR) images into color images is a challenging task due to the different characteristics of visible and NIR images. Most methods of generating color images directly from a single NIR image are limited by the scene and object categories. In this paper, we propose a novel approach to recovering object colors from multi-spectral NIR images using gray information. The multi-spectral NIR images are obtained by a 2-CCD NIR/RGB camera with narrow NIR bandpass filters of different wavelengths. The proposed approach is based on multi-spectral NIR images to estimate a conversion matrix for NIR to RGB conversion. In addition to the multi-spectral NIR images, a corresponding gray image is used as a complementary channel to estimate the conversion matrix for NIR to RGB color conversion. The conversion matrix is obtained from the ColorChecker's 24 color blocks using polynomial regression and applied to real-world scene NIR images for color recovery. The proposed approach has been evaluated by a large number of real-world scene images, and the results show that the proposed approach is simple yet effective for recovering color of objects.

An Efficient Parameter Optimization Algorithm and Its Application to Image De-noising

  • Liu Yinhao

Prevailing image enhancement algorithms offer a flexible tradeoff at different levels between image quality and implementation complexity, which is usually achieved by adjusting multiple algorithm parameters, i.e., multiple parameter optimization. A traditional exhaustive search over the whole solution space can resolve this optimization problem, but it suffers from high search complexity caused by the huge number of multi-parameter combinations. To resolve this problem, an Energy Efficiency Ratio Model (EERM) based algorithm is proposed, which is inspired by gradient descent in deep learning. To verify the effectiveness of the proposed algorithm, it is then applied to an image de-noising algorithm framework based on non-local means (NLM) plus iteration. The experimental results show that the optimal parameter combination chosen by our proposed algorithm achieves quality comparable to that of the exhaustive search based method. Specifically, an 86.7% complexity reduction can be achieved with only 0.05dB quality degradation with the proposed method.

WaveCSN: Cascade Segmentation Network for Hip Landmark Detection

  • Wu Hai

Landmark detection in hip X-ray images plays a critical role in the diagnosis of Developmental Dysplasia of the Hip (DDH) and in Total Hip Arthroplasty (THA) surgeries. Regression and heatmap techniques based on convolutional networks can obtain reasonable results. However, they have limitations in either robustness or precision given the complexities and intensity inhomogeneities of hip X-ray images. In this paper, we propose a Wave-like Cascade Segmentation Network (WaveCSN) to improve the accuracy of landmark detection by transforming landmark detection into area segmentation. The WaveCSN consists of three basic sub-networks, and each sub-network is composed of a U-net module, an indicate module and a max-MSER module. The U-net undertakes the task of generating masks, and the indicate module is trained to distinguish the masks from the ground truth. The U-net and indicate module are trained in turns, in which process the generated masks are supervised to become more and more similar to the ground truth. The max-MSER module ensures that landmarks can be extracted from the generated masks precisely. We present two professional datasets (DDH and THA) for the first time and evaluate the WaveCSN on them. Our results show that the WaveCSN improves accuracy by at least 2.66 and 4.11 pixels on these two datasets compared to other methods, and achieves state-of-the-art performance for landmark detection in hip X-ray images.

Shifted Spatial-Spectral Convolution for Deep Neural Networks

  • Xu Yuhao

Deep convolutional neural networks (CNNs) extract local features and learn spatial representations via convolutions in the spatial domain. Beyond the spatial information, some works also manage to capture the spectral information in the frequency domain by domain switching methods like discrete Fourier transform (DFT) and discrete cosine transform (DCT). However, most works only pay attention to a single domain, which is prone to ignoring other important features. In this work, we propose a novel network structure to combine spatial and spectral convolutions, and extract features in both spatial and frequency domains. The input channels are divided into two groups for spatial and spectral representations respectively, and then integrated for feature fusion. Meanwhile, we design a channel-shifting mechanism to ensure both spatial and spectral information of every channel are equally and adequately obtained throughout the deep networks. Experimental results demonstrate that compared with state-of-the-art CNN models in a single domain, our shifted spatial-spectral convolution based networks achieve better performance on image classification datasets including CIFAR10, CIFAR100 and SVHN, with considerably fewer parameters.

Multi-Scale Invertible Network for Image Super-Resolution

  • Li Zhuangzi

Image super-resolution approaches based on deep convolutional neural networks (CNNs) have achieved significant success in recent years. However, due to the information-discarding nature of CNNs, they inevitably suffer from information loss during the feature embedding process, in which extracted intermediate features cannot effectively represent or reconstruct the input. As a result, the super-resolved image will deviate substantially in image structure from its low-resolution version, leading to inaccurate representations in some local details. In this study, we address this problem by designing an end-to-end invertible architecture that can reversely represent low-resolution images at any feature embedding level. Specifically, we propose a novel image super-resolution method, named multi-scale invertible network (MSIN), to keep information lossless and introduce multi-scale learning in a unified framework. In MSIN, a novel multi-scale invertible stack is proposed, which adopts four parallel branches to respectively capture features at different scales and keeps balanced information interaction by branch shifting. In addition, we employ global and hierarchical feature fusion to learn elaborate and comprehensive feature representations, in order to further benefit the quality of the final image reconstruction. We show the reversibility of the proposed MSIN, and extensive experiments conducted on benchmark datasets demonstrate the state-of-the-art performance of our method.

Feature fusion adversarial learning network for liver lesion classification

  • Chen Peng

The amount of training data is the key bottleneck in achieving good results for medical image analysis, especially in deep learning. Due to small medical training datasets, deep learning models often fail to mine useful features and have serious over-fitting problems. In this paper, we propose a clean and effective feature fusion adversarial learning network to mine useful features and relieve over-fitting problems. Firstly, we train a fully convolutional autoencoder network with unsupervised learning to mine useful feature maps from our liver lesion data. Secondly, these feature maps are transferred to our adversarial SENet network for liver lesion classification. Our experiments on liver lesion classification in CT show an average accuracy of 85.47%, compared with the baseline training scheme, which demonstrates that our proposed method can mine useful features and relieve the over-fitting problem. It can assist physicians in the early detection and treatment of liver lesions.

Fast and Accurately Measuring Crack Width via Cascade Principal Component Analysis

  • Duan Lijuan

Crack width is an important indicator for diagnosing the safety of constructions, e.g., asphalt roads and concrete bridges. In practice, measuring crack width is a challenging task: (1) the irregular and non-smooth boundary makes the traditional method inefficient; (2) pixel-wise measurement guarantees the accuracy of a system; and (3) understanding the damage of constructions from any pre-selected points is a mandatory requirement. To address these problems, we propose a cascade Principal Component Analysis (PCA) to efficiently measure crack width from images. Firstly, the binary crack image is obtained to describe the crack via off-the-shelf crack detection algorithms. Secondly, given a pre-selected point, PCA is used to find the main axis of a crack. Thirdly, Robust Principal Component Analysis (RPCA) is proposed to compute the main axis of a crack with an irregular boundary. We evaluate the proposed method on a real data set. The experimental results show that the proposed method achieves state-of-the-art performance in terms of efficiency and effectiveness.
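The PCA step in this abstract, finding the main axis of a crack from binary crack pixels, can be sketched in closed form for two-dimensional points. The width estimate below, the extent of the pixels measured perpendicular to that axis, is an assumption made for illustration and not the authors' cascade procedure; the pixel patch and the function names are hypothetical.

```python
import math


def principal_axis_angle(points):
    """Angle (radians) of the first principal component of 2D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form orientation of the dominant eigenvector of a 2x2 covariance.
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def extent_across_axis(points, angle):
    """Spread of the points along the normal to the given axis direction."""
    normal_x, normal_y = -math.sin(angle), math.cos(angle)
    projections = [x * normal_x + y * normal_y for x, y in points]
    return max(projections) - min(projections)


if __name__ == "__main__":
    # Hypothetical crack patch: a horizontal strip, 10 pixels long, 3 rows high.
    pixels = [(x, y) for x in range(10) for y in range(3)]
    angle = principal_axis_angle(pixels)
    print(angle, extent_across_axis(pixels, angle))
```

The extent is measured between pixel centers, so a strip three rows high yields 2.0; adding one pixel converts this to a pixel count.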

Active Perception Network for Salient Object Detection

  • Wei Jun

To get better saliency maps for salient object detection, recent methods fuse features from different levels of convolutional neural networks and have achieved remarkable progress. However, the differences between different feature levels bring difficulties to the fusion process, thus it may lead to unsatisfactory saliency predictions. To address this issue, we propose Active Perception Network (APN) to enhance inter-feature consistency for salient object detection. First, Mutual Projection Module (MPM) is developed to fuse different features, which uses high-level features as guided information to extract complementary components from low-level features, and can suppress background noises and improve semantic consistency. Self Projection Module (SPM) is designed to further refine the fused features, which can be considered as the extended version of residual connection. Features that pass through SPM can produce more accurate saliency maps. Finally, we propose Head Projection Module (HPM) to aggregate global information, which brings strong semantic consistency to the whole network. Comprehensive experiments on five benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art approaches on different evaluation metrics.

Surface Normal Data Guided Depth Recovery with Graph Laplacian Regularization

  • Sun Longhua

High-quality depth information has been increasingly used in many real-world multimedia applications in recent years. Due to the limitations of depth sensors and sensing technology, the captured depth map usually has low resolution and black holes. In this paper, inspired by the geometric relationship between the surface normals of a 3D scene and their distance from the camera, we find that a surface normal map can provide more spatial geometric constraints for depth map reconstruction, as a depth map is a special image with spatial information, which we call a 2.5D image. To exploit this property, we propose a novel surface normal data guided depth recovery method, which uses surface normal data and observed depth values to estimate missing or interpolated depth values. Moreover, to preserve the inherent piecewise smooth characteristic of depth maps, a graph Laplacian prior is applied to regularize the inverse problem of depth map recovery and a graph Laplacian regularizer (GLR) is proposed. Finally, the spatial geometric constraint and graph Laplacian regularization are integrated into a unified optimization framework, which can be efficiently solved by conjugate gradient (CG). Extensive quantitative and qualitative evaluations compared with state-of-the-art schemes show the effectiveness and superiority of our method.

An Adaptive Dark Region Detail Enhancement Method for Low-light Images

  • Cheng Wengang

Images captured in low-light conditions are often of poor visual quality, as most details in dark regions are buried. Although some advanced low-light image enhancement methods can lighten an image and its dark regions, they still cannot reveal the details in dark regions very well. This paper presents an adaptive dark region detail enhancement method for low-light images. As our method is based on the Retinex theory, we first formulate the Retinex-based low-light image enhancement problem in a Bayesian optimization framework. Then, a dark region prior is proposed and an adaptive gradient amplification strategy is designed to incorporate this prior into the illumination estimation. The dark region prior, together with the widely used spatial smoothness and structure priors, leads to a dark region and structure-aware smoothness regularization term for illumination optimization. We provide a solver for this optimization and obtain the final enhanced results after post-processing. Experiments demonstrate that our method can obtain good enhancement results with better dark region details compared to several state-of-the-art methods.

SESSION: Multimedia Service

Session details: Multimedia Service

  • Yamasaki Toshihiko

Multiple Fisheye Camera Tracking via Real-Time Feature Clustering

  • Sio Chon Hou

Recently, Multi-Target Multi-Camera Tracking (MTMC) has made a breakthrough due to the release of DukeMTMC, showing the feasibility of related applications. However, most existing MTMC methods are batch methods, which attempt to find the globally optimal solution from the entire image sequence and thus are not suitable for real-time applications, e.g., customer tracking in unmanned stores. In this paper, we propose a low-cost online tracking algorithm, namely Deep Multi-Fisheye-Camera Tracking (DeepMFCT), to identify customers and locate their positions from multiple overlapping fisheye cameras. Based on any single camera tracking algorithm (e.g., Deep SORT), our proposed algorithm establishes the correlation between different single camera tracks. Owing to the lack of a well-annotated dataset from multiple overlapping fisheye cameras, the main challenge is to efficiently overcome the domain gap between normal cameras and fisheye cameras using existing deep-learning-based models. To address this challenge, we integrate a single camera tracking algorithm with cross camera clustering that includes location information, which achieves great performance on the unmanned store dataset and the Hall dataset. Experimental results show that the proposed algorithm improves on the baselines by at least 7% in terms of MOTA on the Hall dataset.

Salient Time Slice Pruning and Boosting for Person-Scene Instance Search in TV Series

  • Wang Zheng

TV audiences often want to quickly browse scenes with certain actors in TV series. Since 2016, the TREC Video Retrieval Evaluation (TRECVID) Instance Search (INS) task has focused on identifying a target person in a target scene simultaneously. In this paper, we call this kind of task P-S INS (Person-Scene Instance Search). To find P-S instances, most approaches search for person and scene separately and then directly combine the results by addition or multiplication. However, we find that the person and scene INS modules are not always effective at the same time, or they may suppress each other in some situations, so aggregating the results shot by shot is not a good choice. Fortunately, in TV series, video shots are arranged in chronological order. We extend our focus from a time point (a single video shot) to a time slice (multiple consecutive video shots) on the timeline. By detecting salient time slices, we prune the data. By evaluating the importance of salient time slices, we boost the aggregation results. Extensive experiments on the large-scale TRECVID INS dataset demonstrate the effectiveness of the proposed method.
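
The shot-level baseline described above (combining person and scene scores by addition or multiplication) can be sketched as follows. The windowed average is only an assumed stand-in for the idea of looking at consecutive shots; the abstract does not give the paper's actual pruning and boosting rules, and all names and scores here are hypothetical.

```python
def fuse_scores(person, scene, mode="add"):
    """Combine per-shot person and scene scores by addition or multiplication."""
    if len(person) != len(scene):
        raise ValueError("score lists must have the same length")
    if mode == "add":
        return [p + s for p, s in zip(person, scene)]
    if mode == "mul":
        return [p * s for p, s in zip(person, scene)]
    raise ValueError("mode must be 'add' or 'mul'")

def slice_scores(scores, width=3):
    """Average each shot's score over a centered window of consecutive shots."""
    half = width // 2
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

# Hypothetical scores for three chronologically ordered shots.
person = [0.9, 0.1, 0.8]
scene = [0.2, 0.7, 0.9]
added = fuse_scores(person, scene, mode="add")
multiplied = fuse_scores(person, scene, mode="mul")
smoothed = slice_scores([1.0, 0.0, 1.0, 0.0], width=3)
```

Multiplication penalizes a shot whenever either module scores it low, which is one way the two modules can suppress each other.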

Stop Hiding Behind Windshield: A Windshield Image Enhancer Based on a Two-way Generative Adversarial Network

  • Chang Chi-Rung

Windshield images captured by surveillance cameras are usually difficult to see through because of severe image degradation such as reflection, motion blur, low light, haze, and noise. Such degradation hinders the identification and tracking of people. In this paper, we address this challenging windshield image enhancement task by presenting a novel deep learning model based on a two-way generative adversarial network, called the Two-way Individual Normalization Perceptual Adversarial Network (TWIN-PAN). TWIN-PAN is an unpaired learning network that does not require pairs of degraded and corresponding ground-truth images for training. Also, unlike existing image restoration algorithms, which address only one specific type of degradation at a time, TWIN-PAN can restore images from various types of degradation. To restore the content inside extremely degraded windshields and ensure the semantic consistency of the image, we introduce a cyclic perceptual loss to the network and combine it with the cycle-consistency loss. Moreover, to generate better restored images, we introduce individual instance normalization layers for the generators, which help the generators better adapt to their own input distributions. Furthermore, we collect a large high-quality windshield image dataset (WIE-Dataset) to train our network and to validate the robustness of our method in restoring degraded windshield images. Experimental results on human detection, vehicle ReID, and a user study show that the proposed method is effective for windshield image restoration.

A Performance-Aware Selection Strategy for Cloud-based Video Services with Micro-Service Architecture

  • Xu Zhengjun

The cloud micro-service architecture provides loosely coupled services and efficient virtual resources, which makes it a promising solution for large-scale video services. It is difficult to efficiently select the optimal services under a micro-service architecture, because the large number of micro-services leads to an exponential increase in the number of candidate service selection solutions. In addition, the time sensitivity of video services increases the complexity of service selection, and the video data can affect the service selection results. However, current video service selection strategies are insufficient under a micro-service architecture, because they do not comprehensively take into account the resource fluctuation of the service instances and the features of the video service. In this paper, we focus on the video service selection strategy under a micro-service architecture. First, we propose a QoS Prediction (QP) method using explicit factor analysis and linear regression. QP can accurately predict QoS values based on the features of video data and service instances. Second, we propose a Performance-Aware Video Service Selection (PVSS) method. We prune the candidate services to reduce computational complexity and then efficiently select the optimal solution with the Fruit Fly Optimization (FFO) algorithm. Finally, we conduct extensive experiments to evaluate our strategy, and the results demonstrate its effectiveness.

SESSION: Human Analysis in Multimedia

Session details: Human Analysis in Multimedia

  • Bao Bing-Kun

Dense Attention Network for Facial Expression Recognition in the Wild

  • Wang Cong

Recognizing facial expressions is important for human-computer interaction systems and other applications. A number of facial expression datasets have been published in recent decades and have helped improve emotion classification algorithms. However, recognizing realistic expressions in the wild is still challenging because of uncontrolled lighting, brightness, pose, occlusion, etc. In this paper, we propose an attention-based module that helps the network focus on emotion-related locations. Furthermore, we build two network structures, named DenseCANet and DenseSANet, by adding the attention modules to a DenseNet backbone. These two networks and the original DenseNet are then trained on the in-the-wild dataset AffectNet and the lab-controlled dataset CK+. Experimental results show that DenseSANet improves performance on both datasets compared with state-of-the-art methods.

Make Skeleton-based Action Recognition Model Smaller, Faster and Better

  • Yang Fan

Although skeleton-based action recognition has achieved great success in recent years, most existing methods may suffer from large model size and slow execution speed. To alleviate this issue, we analyze skeleton sequence properties and propose a Double-feature Double-motion Network (DD-Net) for skeleton-based action recognition. By using a lightweight network structure (i.e., 0.15 million parameters), DD-Net reaches very high speeds: 3,500 FPS on an ordinary GPU (e.g., GTX 1080Ti) or 2,000 FPS on an ordinary CPU (e.g., Intel E5-2620). By employing robust features, DD-Net achieves state-of-the-art performance on our experiment datasets: SHREC (i.e., hand actions) and JHMDB (i.e., body actions). Our code is on

A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading

  • Zhao Ya

Lip reading aims at decoding text from the movement of a speaker's mouth. In recent years, lip reading methods have made great progress for English, at both the word level and the sentence level. Unlike English, however, Chinese Mandarin is a tone-based language that relies on pitch to distinguish lexical or grammatical meaning, which significantly increases the ambiguity of the lip reading task. In this paper, we propose a Cascade Sequence-to-Sequence Model for Chinese Mandarin (CSSMCM) lip reading, which explicitly models tones when predicting sentences. Tones are modeled based on visual information and syntactic structure, and are used to predict sentences along with visual information and syntactic structure. To evaluate CSSMCM, a dataset called CMLR (Chinese Mandarin Lip Reading) is collected and released, consisting of over 100,000 natural sentences from the China Network Television website. When trained on the CMLR dataset, the proposed CSSMCM surpasses the performance of state-of-the-art lip reading frameworks, which confirms the effectiveness of explicitly modeling tones for Chinese Mandarin lip reading.

Learn to Gesture: Let Your Body Speak

  • Gan Tian

Presentations are one of the most important and vivid ways to deliver information to an audience. Apart from the content of a presentation, how the speaker behaves during it makes a big difference. In other words, gestures, as part of the visual perception and synchronized with verbal information, express subtle information that the voice or words alone cannot deliver. One of the most effective ways to improve a presentation is to practice with feedback and suggestions from an expert. However, hiring human experts is expensive and thus impractical most of the time. To this end, we propose a speech-to-gesture network (POSE) that generates exemplary body language given speech as input. Specifically, we build an "expert" Speech-Gesture database based on featured TED talk videos, and design a two-layer attentive recurrent encoder-decoder network to learn the translation from speech to gesture, as well as the hierarchical structure within gestures. Lastly, given a speech audio sequence, the appropriate gestures are generated and visualized for more effective communication. Both objective and subjective validation show the effectiveness of our proposed method.

SESSION: Brave New Idea

Session details: Brave New Idea

  • Ji Rongrong

Multi-scale Features for Weakly Supervised Lesion Detection of Cerebral Hemorrhage with Collaborative Learning

  • Chen Zhiwei

Deep networks have recently been applied to computer-assisted medical diagnosis. The brain is the largest and most complex structure in the central nervous system, and it also appears complicated in medical images such as computed tomography (CT) scans. While reading a CT image, radiologists generally search across the image to find lesions, characterize and measure them, and then describe them in the radiological report. To automate this process, we quantitatively analyze the cerebral hemorrhage dataset and propose a Multi-scale Feature with Collaborative Learning (MFCL) strategy for Weakly Supervised Lesion Detection (WSLD), which not only adapts to the characteristics of detecting small lesions but also introduces a global constraint classification objective in training. Specifically, a multi-scale feature branch network and a collaborative learning scheme are designed to locate the lesion area. Experimental results demonstrate that the proposed method is valid on the cerebral hemorrhage dataset, and a new WSLD baseline is established on this dataset.

Tumor Tissue Segmentation for Histopathological Images

  • Huang Xiansong

Histopathological image analysis is considered a gold standard for cancer identification and diagnosis. Tumor segmentation for histopathological images is one of the most important research topics, and its performance directly affects doctors' diagnostic judgment of cancer categories and stages. With the remarkable development of deep learning, many methods have been proposed for tumor segmentation. However, there is little research analyzing a specific pipeline for tumor segmentation. Moreover, few studies have examined hard example mining for tumor segmentation in detail. To bridge this gap, this study first summarizes a specific pipeline for tumor segmentation. Then, hard example mining in tumor segmentation is explored. Finally, experiments are conducted to evaluate the segmentation performance of our method, demonstrating the effects of our method and of hard example mining.

SESSION: Doctorial Symposium

Session details: Doctorial Symposium

  • Jia Jia

Artistic Text Stylization for Visual-Textual Presentation Synthesis

  • Yang Shuai

In this research, we study a specific task of visual-textual presentation synthesis, where artistic text is generated and embedded in a background photo. The art form of visual-textual presentation is widely used in graphic design such as posters, billboards and trademarks, and therefore is of high application value. We propose a new framework to complete this task. First, the shape of the target text is adjusted and the textures are rendered to match the reference style image to generate artistic text. By considering both aesthetics and seamlessness, the layout where the artistic text is placed is determined. Finally the artistic text is blended with the background photo to obtain the visual-textual presentations. The experimental results demonstrate the effectiveness of the proposed framework in creating professionally designed visual-textual presentations.

Multimedia Information Retrieval

  • Guo Yangyang

My main research interests include product search and visual question answering (VQA), both of which lie in the field of information retrieval (IR), which aims to obtain information system resources relevant to an information need from a collection. Product search focuses on the e-commerce domain and aims to retrieve products that are not only relevant to the submitted queries but also fit users' personal preferences. Visual question answering aims to provide a natural-language answer for a given image and a free-form, open-ended, natural-language question about this image, which requires semantic understanding of natural language and visual content, as well as knowledge extraction and logical reasoning.

POSTER SESSION: Poster Session

Session details: Poster Session

  • Bai Cong

Deep Structural Feature Learning: Re-Identification of similar vehicles In Structure-Aware Map Space

  • Zhu Wenqian

Vehicle re-identification (re-ID) has received increasing attention in recent years as an important task that contributes greatly to intelligent video surveillance. The complex intra-class and inter-class variation of vehicle images brings huge challenges for vehicle re-ID, especially for similar vehicles. In this paper, we focus on an interesting and challenging problem: re-ID of vehicles of the same or similar model. Previous works mainly focus on extracting global features using deep models, ignoring the individual local regions in the vehicle front window, such as decorations and stickers attached to the windshield, that can be more discriminative for vehicle re-ID. Instead of directly embedding these regions to learn their features, we propose a Regional Structure-Aware model (RSA) to learn structure-aware cues from the position distribution of individual local regions in the vehicle front window (FW) area, constructing an FW structural map space. In this map space, deep models are able to learn more robust and discriminative spatial structure-aware features that improve re-ID performance for vehicles of the same or similar model. We evaluate our method on the large-scale vehicle re-ID dataset Vehicle-1M. The experimental results show that our method achieves promising performance and outperforms several recent state-of-the-art approaches.

Selective Attention Network for Image Dehazing and Deraining

  • Liang Xiao

Image dehazing and deraining are important low-level computer vision tasks. In this paper, we propose a novel method named Selective Attention Network (SAN) to solve these two problems. Because the density of haze and the directions of rain streaks are complex and non-uniform, SAN adopts channel-wise attention and spatial-channel attention to remove rain streaks and haze both globally and locally. To better capture various rain and haze details, we propose a Selective Attention Module (SAM) that re-scales the channel-wise attention and spatial-channel attention instead of using simple element-wise summation. In addition, we conduct ablation studies to validate the effectiveness of each module of SAN. Extensive experimental results on synthetic and real-world datasets show that SAN performs favorably against state-of-the-art methods.

Manifold Alignment with Multi-graph Embedding

  • Huang Changbin

In this paper, a novel manifold alignment approach via multi-graph embedding (MA-MGE) is proposed. Different from the traditional manifold alignment algorithms that use a single graph to describe the latent manifold structure of each dataset, our approach utilizes multiple graphs for modeling multiple local manifolds in multi-view data alignment. Therefore a composite manifold representation with complete and more useful information is obtained from each dataset through a dynamic reconstruction of multiple graphs. Experimental results on Protein and Face-10 datasets demonstrate that the mapping coordinates of the proposed method provide better alignment performance compared to the state-of-the-art methods, such as semi-supervised manifold alignment (SS-MA), manifold alignment using Procrustes analysis (PAMA) and manifold alignment without correspondence (UNMA).

Multi-Label Image Classification with Attention Mechanism and Graph Convolutional Networks

  • Meng Quanling

The task of multi-label image classification is to predict a set of proper labels for an input image. To this end, it is necessary to strengthen the association between the labels and the image regions, and to exploit the relationships between the labels. In this paper, we propose a novel framework for multi-label image classification that uses an attention mechanism and a Graph Convolutional Network (GCN) simultaneously. The attention mechanism focuses on specific target regions while ignoring irrelevant surrounding information, thereby enhancing the association of the labels with the image regions. By constructing a directed graph over the labels, the GCN can learn the relationships between the labels from a global perspective and map this label graph to a set of inter-dependent object classifiers. The framework first uses ResNet to extract features while using the attention mechanism to generate attention maps for all labels and obtain weighted features. The GCN then uses the weighted fusion features from the output of the ResNet and the attention mechanism to perform classification. Experimental results show that both the attention mechanism and the GCN effectively improve classification performance, and that the proposed framework is competitive with state-of-the-art methods.
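
To make the graph convolution step concrete, here is a minimal pure-Python sketch of one GCN layer using the widely used symmetric normalization with self-loops, ReLU(D^-1/2 (A + I) D^-1/2 H W). This is the generic propagation rule, shown under the assumption of a small symmetric label graph; the paper builds a directed label graph, whose normalization may differ. All matrices below are hypothetical toy values.

```python
import math

def matmul(a, b):
    """Plain-list matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so each label keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Node degrees of the self-looped graph (row sums).
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    # Element-wise ReLU.
    return [[max(0.0, v) for v in row] for row in out]

# Toy example: two mutually related labels with one-hot label embeddings.
adj = [[0.0, 1.0], [1.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, -1.0]]
hidden = gcn_layer(adj, feats, weight)
```

In a label-graph classifier of this kind, the rows of the final layer's output would act as per-label classifier weights applied to the image feature.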

RSC-DGS: Fusion of RGB and NIR Images Using Robust Spectral Consistency and Dynamic Gradient Sparsity

  • Yu Shengtao

Color (RGB) images captured under low-light conditions contain much noise and lose texture. Since near-infrared (NIR) images are robust to noise and retain clear textures even in low light, they can be used to enhance low-light RGB images by image fusion. In this paper, we propose fusion of RGB and NIR images using robust spectral consistency (RSC) and dynamic gradient sparsity (DGS), called RSC-DGS. We build the RSC model based on a robust error function to remove noise and preserve color/spectral consistency. We construct the DGS model based on vectorial total variation minimization that uses the NIR image as the reference image. The DGS model transfers the clear textures of the NIR image to the fusion result and successfully preserves the cross-channel interdependency of the RGB image. We use the alternating direction method of multipliers (ADMM) to solve the proposed RSC-DGS fusion efficiently. Experimental results confirm that the proposed method effectively preserves color/spectral consistency and textures in fusion results while successfully removing noise with high computational efficiency.

Multi-Feature Fusion for Multimodal Attentive Sentiment Analysis

  • A Man

Sentiment analysis has long been an interesting and challenging task. Researchers mostly pay attention to single-modal (image or text) emotion recognition, and less attention is paid to joint analysis of multi-modal data. Most existing multi-modal sentiment analysis algorithms that use an attention mechanism focus only on local areas of images and ignore the emotional information provided by the global features of the image. Motivated by this, we propose a novel multi-modal sentiment analysis model that attends to both local attentive features and global contextual features of the image, and then uses a novel feature fusion mechanism to fuse features from the different modalities. In our proposed model, we use a convolutional neural network (CNN) to extract the region maps of images and an attention mechanism to obtain attention coefficients. We then use a CNN with fewer hidden layers to extract the global feature, and a long short-term memory (LSTM) model to extract the textual feature. Finally, a tensor fusion network (TFN) is used to fuse all features from the different modalities. Extensive experiments are conducted on both weakly labeled and manually labeled datasets, and the results demonstrate the superiority of the proposed method.
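
The tensor fusion step can be sketched as follows. In the usual TFN formulation, each modality vector is extended with a constant 1 and the outer product of all extended vectors is taken, so the result contains unimodal, bimodal, and trimodal interaction terms. The code is a minimal sketch of that general idea; the feature values and names are hypothetical, and the abstract does not say whether the paper modifies this step.

```python
from itertools import product

def tensor_fusion(*features):
    """Flattened outer product of feature vectors, each extended with a constant 1."""
    # Appending 1 keeps the lower-order (unimodal and bimodal) terms in the product.
    extended = [list(f) + [1.0] for f in features]
    fused = []
    for combo in product(*extended):
        value = 1.0
        for x in combo:
            value *= x
        fused.append(value)
    return fused

# Hypothetical one-dimensional features for three inputs.
local_feat = [2.0]    # attended local image feature
global_feat = [3.0]   # global image feature
text_feat = [5.0]     # textual feature
fused = tensor_fusion(local_feat, global_feat, text_feat)
```

The fused vector has (d1 + 1)(d2 + 1)(d3 + 1) entries, so its size grows multiplicatively with the input dimensions.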

Multimodal Attribute and Feature Embedding for Activity Recognition

  • Zhang Weiming

Human Activity Recognition (HAR) automatically recognizes human activities, such as daily life and work activities, from digital records, which is of great significance to the medical and health fields. Egocentric video and human acceleration data comprehensively describe human activity patterns from different aspects, laying a foundation for activity recognition based on multimodal behavior data. However, the low-level multimodal signal structures differ greatly and their mapping to high-level activities is complicated. Moreover, labeling activities in multimodal behavior data is costly and the amount of labeled data is limited, which restricts technical development in this field. In this paper, an activity recognition model, MAFE, based on multimodal attribute feature embedding is proposed. Before activity recognition, middle-level attribute features are extracted from the low-level signals of the different modalities. This reduces the complexity of mapping the low-level signals to the high-level activities, and it allows a large amount of middle-level attribute labeling data to be used to reduce the dependency on activity labeling data. We conducted experiments on the Stanford-ECM dataset to verify the effectiveness of the proposed MAFE method.

Representative Feature Matching Network for Image Retrieval

  • Li Zhuangzi

Recent convolutional neural networks (CNNs) have shown promising performance on image retrieval owing to their powerful feature extraction capability. However, the potential relations among feature maps are not effectively exploited in previous CNNs, resulting in inaccurate feature representations. To address this issue, we mine feature channel-wise relations with a matching strategy to adaptively highlight informative features. In this paper, we propose a novel representative feature matching network (RFMN) for image hashing retrieval. Specifically, we propose a novel representative feature matching block (RFMB) that matches feature maps with their representative one, so that the significance of each feature map can be exploited according to the matching similarity. In addition, we present an innovative pooling layer based on representative feature matching that builds relations between pooled and unpooled features, so as to highlight the pooled features that retain more valuable information. Extensive experiments show that our approach improves the average results of a conventional residual network by more than 2.6% on Cifar-10 and 1.4% on the NUS-WIDE dataset, while achieving state-of-the-art performance.

Deep Feature Interaction Embedding for Pair Matching Prediction

  • Zhang Luwei

Online dating services have become popular in modern society. Pair matching prediction between two users of these services can help increase their chances of finding a life partner. Deep learning-based methods with automatic feature interaction functions, such as Factorization Machines (FM) and the cross network of Deep & Cross Network (DCN), can model sparse categorical features and are effective for many recommendation tasks in web applications. To solve the partner recommendation task, we improve these FM-based deep models and DCN by enhancing the representation of the feature interaction embedding and proposing a novel interaction layer design that avoids information loss. Through experiments on two real-world datasets from two online dating companies, we demonstrate the superior performance of our proposed designs.
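
As background for the FM-style feature interaction mentioned above, the second-order term of a standard Factorization Machine sums the products of all feature pairs, weighted by the inner product of their embeddings. The sketch below computes it with the well-known linear-time identity and checks it against the naive double loop. It illustrates plain FM only, not the paper's enhanced interaction layer; the feature values and embeddings are hypothetical.

```python
def fm_pairwise(x, v):
    """Second-order FM term, sum over i<j of <v[i], v[j]> * x[i] * x[j], in O(n*k)."""
    k = len(v[0])
    total = 0.0
    for f in range(k):
        # Square of the sum minus the sum of squares leaves only the cross terms.
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += 0.5 * (s * s - s_sq)
    return total

def fm_pairwise_naive(x, v):
    """Reference implementation with an explicit loop over all feature pairs."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total

# Hypothetical sparse input (features 0 and 2 active) with 2-dimensional embeddings.
x = [1.0, 0.0, 1.0]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fast = fm_pairwise(x, v)
slow = fm_pairwise_naive(x, v)
```

With sparse categorical inputs, only the active features contribute, which is why this term scales well to large one-hot feature spaces.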

Multi-source User Attribute Inference based on Hierarchical Auto-encoder

  • Zhang Boyu

With the rapid development of Online Social Networks (OSNs), it is crucial to construct users' portraits from their dynamic behaviors to address the increasing need for customized information services. Previous work on user attribute inference mainly concentrated on developing advanced features and models or on exploiting external information and knowledge, but ignored the contradiction between dynamic behaviors and stable demographic attributes, which results in a biased understanding of users.

To address this contradiction and accurately infer user attributes, we propose a Multi-source User Attribute Inference algorithm based on Hierarchical Auto-encoder (MUAI-HAE). The basic idea is that the patterns shared among the same individual's behaviors on different OSNs are a good indicator of his/her stable demographic attributes. The hierarchical auto-encoder realizes this idea by discovering the underlying non-linear correlation between different OSNs. The unsupervised scheme used in shared pattern learning alleviates the requirement for cross-OSN user accounts and improves practicability. Off-the-shelf classification methods are then used to infer user attributes from the derived shared behavior patterns. Experiments on real-world datasets from three OSNs demonstrate the effectiveness of the proposed method.

Comprehensive Event Storyline Generation from Microblogs

  • Sun Wenjin

Microblogging data contains a wealth of information about trending events and has gained increasing attention among users, organizations, and researchers for social media mining in different disciplines. Event storyline generation is a typical social media mining task, whose goal is to extract the development stages of events with associated descriptions. Existing storyline generation methods either generate storylines that lack integrity or fail to guarantee coherence between the discovered stages. In addition, there is no scientific method for evaluating the quality of a storyline. In this paper, we propose a comprehensive storyline generation framework to address these disadvantages. Given microblogging data related to a specified event, we first propose a Hot-Word-Based stage detection algorithm to identify the potential stages of the event, which effectively avoids missing important stages and prevents inconsistent ordering between stages. A community detection algorithm is then applied to select representative data for each stage. Finally, we apply a graph optimization algorithm to generate logically coherent storylines of the event. We also introduce a new evaluation metric, SLEU, to emphasize the importance of the integrity and coherence of the generated storyline. Extensive experiments on real-world Chinese microblogging data demonstrate the effectiveness of each module and of the overall framework.

Domain Specific and Idiom Adaptive Video Summarization

  • Dong Yi

As short videos become an increasingly popular form of storytelling, there is a growing demand for video summarization to convey information concisely with a subset of video frames. Existing efforts use criteria such as interestingness and diversity to pick appropriate segments of content. However, a mechanism to infuse insights from cinematography and persuasion into this process is lacking. As a result, video summaries sometimes deviate from the original. In addition, exploring the vast design space to create customized video summaries is costly for video producers. To address these challenges, we propose a domain specific and idiom adaptive video summarization approach. Specifically, our approach first segments the input video and extracts high-level information from each segment. The resulting labels are used to represent a collection of idioms and summarization metrics as submodular components, which users can combine in a variety of ways to create personalized summary styles. To identify the importance of the idioms and metrics in different domains, we leverage max-margin learning. Experimental results have validated the effectiveness of our approach. We also plan to release a dataset containing over 600 videos with expert annotations, which can benefit further research in this area.
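
A weighted combination of submodular components, as described above, is commonly maximized with a simple greedy procedure. The sketch below is a generic illustration under stated assumptions: the two components (label coverage and summed interestingness), the segment data, and the weights are all hypothetical, and in the paper the weights would come from max-margin learning rather than being set by hand.

```python
def coverage(selected, segments):
    """Number of distinct labels covered by the selected segments (submodular)."""
    covered = set()
    for i in selected:
        covered |= segments[i]["labels"]
    return float(len(covered))

def interest(selected, segments):
    """Sum of per-segment interestingness scores (modular)."""
    return sum(segments[i]["score"] for i in selected)

def greedy_summary(segments, components, weights, budget):
    """Greedily add the segment with the best weighted objective until the budget is hit."""
    def objective(sel):
        return sum(w * f(sel, segments) for f, w in zip(components, weights))

    selected = []
    while len(selected) < budget:
        remaining = [i for i in range(len(segments)) if i not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda i: objective(selected + [i]))
        if objective(selected + [best]) <= objective(selected):
            break  # no remaining segment improves the objective
        selected.append(best)
    return sorted(selected)

# Hypothetical segments with extracted labels and interestingness scores.
segments = [
    {"labels": {"close-up", "speech"}, "score": 0.9},
    {"labels": {"close-up"}, "score": 0.8},
    {"labels": {"wide", "music"}, "score": 0.3},
    {"labels": {"speech"}, "score": 0.7},
]
diverse = greedy_summary(segments, [coverage, interest], [1.0, 1.0], budget=2)
interesting = greedy_summary(segments, [coverage, interest], [0.0, 1.0], budget=2)
```

Changing the weights changes the summary style: with coverage weighted in, the second pick favors new labels, and with interestingness alone it favors the highest-scoring segments.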

An Automated Lung Nodule Segmentation Method Based On Nodule Detection Network and Region Growing

  • Tan Yanhao

Segmentation of a specific organ or tissue plays an important role in medical image analysis, given the rapid development of clinical decision support systems. Segmenting lung nodules in medical images can help physicians diagnose lung cancer and formulate proper treatment plans. Research on lung nodule segmentation has therefore attracted a lot of attention in recent years. However, this task faces several challenges, including the intensity similarity between lung nodules and vessels, inaccurate boundaries, and the presence of noise in most images. In this paper, an automated segmentation method is proposed for lung nodules in CT images. In the first stage, a nodule detection network is used to generate region proposals and locate the bounding boxes of nodules, which serve as the initial input for the subsequent segmentation. In the second stage, the nodules are segmented within the bounding boxes. Since locating the nodule in advance reduces the image area over which region growing runs, segmentation becomes more efficient. In addition, because the nodule is localized before segmentation, some tissues with similar intensity can be excluded from the object region. The proposed method is evaluated on a public lung nodule dataset, and the experimental results indicate its effectiveness and efficiency.
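
The second stage described above can be illustrated with a basic region-growing routine restricted to a detected bounding box. This is a generic sketch under stated assumptions: the 4-connectivity, the fixed intensity tolerance relative to the seed, the seed choice, and the toy image are all hypothetical, since the abstract does not specify the paper's growing criterion.

```python
from collections import deque

def region_grow(image, box, seed, tol):
    """Grow a 4-connected region from seed inside box = (r0, c0, r1, c1), end-exclusive."""
    r0, c0, r1, c1 = box
    sr, sc = seed
    if not (r0 <= sr < r1 and c0 <= sc < c1):
        raise ValueError("seed must lie inside the bounding box")
    ref = image[sr][sc]  # reference intensity taken at the seed
    region = {(sr, sc)}
    queue = deque([(sr, sc)])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = r0 <= nr < r1 and c0 <= nc < c1
            if inside and (nr, nc) not in region and abs(image[nr][nc] - ref) <= tol:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

# Toy slice: a nodule (values near 80) touching a vessel-like column (values of 90).
image = [
    [10, 10, 10, 10, 10],
    [10, 80, 85, 86, 90],
    [10, 82, 88, 10, 90],
    [10, 10, 10, 10, 10],
]
nodule = region_grow(image, box=(0, 0, 4, 4), seed=(1, 1), tol=10)
leaky = region_grow(image, box=(0, 0, 4, 5), seed=(1, 1), tol=10)
```

Here the tight box keeps the growth from leaking into the adjacent structure of similar intensity, and it also bounds the number of pixels visited.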

Food Photo Enhancer of One Sample Generative Adversarial Network

  • Wang Shudan

Image enhancement is an important branch of image processing. A few existing methods leverage Generative Adversarial Networks (GANs) for this task. However, they have several defects when applied to a specific type of image, such as food photos. First, a large set of original-enhanced image pairs is required to train GANs that have millions of parameters, and such image pairs are expensive to acquire. Second, the color distribution of enhanced images generated by previous methods is not consistent with that of the originals, which is undesirable. To alleviate these issues, we propose a novel method for food photo enhancement that requires only original images and no original-enhanced image pairs. We investigate Food Faithful Color Semantic Rules in Enhanced Dataset Photo Enhancement (Faith-EDPE) and carefully design a light generator that preserves semantic relations among colors. We evaluate the proposed method on public benchmark databases and demonstrate its effectiveness through visual results and user studies.

Generalizing Rate Control Strategies for Realtime Video Streaming via Learning from Deep Learning

  • Huang Tianchi

The leading learning-based rate control method, QARC, achieves state-of-the-art performance but does not explain its underlying principles, and thus lacks the ability to improve itself efficiently. In this paper, we propose EQARC (Explainable QARC) by reconstructing QARC's modules, aiming to demystify how QARC works. In detail, we first use a novel hybrid attention-based CNN+GRU model to re-characterize the original quality prediction network and replace QARC's 1D-CNN layers with 2D-CNN layers. Using trace-driven experiments, we demonstrate the superiority of EQARC over existing state-of-the-art approaches. Next, we collect useful information from each interpretable module and learn the insights of EQARC. Following this step, we further propose AQARC (Advanced QARC), a lightweight version of QARC. Experimental results show that AQARC achieves the same performance as QARC with an overhead reduction of 90%. In short, by learning from deep learning, we generalize a rate control method that both reaches high performance and reduces computation cost.

IKDMM: Iterative Knowledge Distillation Mask Model for Robust Acoustic Beamforming

  • Liu Zhaoyi

Microphone array beamforming has been proven to be an effective method for suppressing adverse interferences. Recently, acoustic beamformers that employ neural networks (NN) for estimating the time-frequency (T-F) mask, termed TFMask-BF, have received tremendous attention and have shown great benefits as a front-end for noise-robust Automatic Speech Recognition (ASR). However, our preliminary experiments using TFMask-BF for the ASR task show that a mask model trained with simulated data cannot perform well in the real environment because of a data mismatch problem. In this study, we adopt the knowledge distillation learning framework to make use of real-recording data together with simulated data in the training phase to reduce the impact of the data mismatch. Moreover, a novel iterative knowledge distillation mask model (IKDMM) training scheme has been systematically developed. Specifically, two bi-directional long short-term memory (BLSTM) models are designed as a teacher mask model (TMM) and a student mask model (SMM). The TMM is trained with simulated data at each iteration and is then employed to separately generate the soft mask labels of both simulated and real-recording data. The simulated data and the real-recording data, with their corresponding generated soft mask labels, are formed into the new training data to train our SMM at each iteration. The proposed approach is evaluated as a front-end for ASR on the six-channel CHiME-4 corpus. Experimental results show that the data mismatch problem can be reduced by our IKDMM, leading to a 5% relative Word Error Rate (WER) reduction compared to conventional TFMask-BF for the real-recording data under noisy conditions.

Multi-Objective Particle Swarm Optimization for ROI based Video Coding

  • Ren Guangjie

In this paper, we propose a new algorithm for High Efficiency Video Coding (HEVC) based on multi-objective particle swarm optimization (MOPSO) to enhance the visual quality of the region of interest (ROI) while ensuring a certain overall quality. According to the R-λ model of the detected ROI, the fitness functions in MOPSO are designed as the distortion of the ROI and that of the overall frame. Each particle consists of the ROI's rate and the other regions' rate. After iterating through the multi-objective particle swarm optimization algorithm, the Pareto front is obtained. Then, the final bit allocation result, i.e., the appropriate bit rates for the ROI and non-ROI regions, is selected from this set. Finally, according to the R-λ model, the coding parameters are determined for coding. The experimental results show that the proposed algorithm improves the visual quality of the ROI while guaranteeing overall visual quality.
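
As a rough illustration of the Pareto-front step described in this abstract, the sketch below filters candidate bit allocations down to the non-dominated ones when both objectives (ROI distortion and whole-frame distortion) are minimized. It is not the authors' implementation: the swarm update is omitted, and the names `dominates`, `pareto_front`, and `select_under_frame_budget`, along with the final selection rule, are assumptions made for this example only.

```python
from typing import List, Optional, Tuple

# One candidate = (ROI distortion, whole-frame distortion); both are minimized.
Candidate = Tuple[float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


def select_under_frame_budget(front: List[Candidate],
                              max_frame_distortion: float) -> Optional[Candidate]:
    """Hypothetical selection rule (an assumption, not from the paper):
    pick the lowest ROI distortion whose frame distortion stays within a limit."""
    feasible = [c for c in front if c[1] <= max_frame_distortion]
    return min(feasible, key=lambda c: c[0]) if feasible else None


if __name__ == "__main__":
    # Made-up distortion values, purely for illustration.
    candidates = [(1.0, 9.0), (2.0, 6.0), (3.0, 7.0), (4.0, 4.0), (5.0, 5.0)]
    front = pareto_front(candidates)
    print(front)
    print(select_under_frame_budget(front, max_frame_distortion=6.5))
```

The abstract does not state how the final point is picked from the Pareto set, so the budget-based rule here only stands in for whatever criterion the paper actually uses.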

An LSTM based Rate and Distortion Prediction Method for Low-delay Video Coding

  • Liu Feiyang

In this paper, an LSTM based rate-distortion (R-D) prediction method for low-delay video coding is proposed. Unlike traditional rate control algorithms, an LSTM is introduced to learn the latent pattern of the R-D relationship in the process of video coding. Temporal information, hierarchical coding structure information and the content of the frame to be encoded are used to achieve more accurate prediction. Based on the proposed network, a new R-D model parameter prediction method is proposed and tested on the test model of Versatile Video Coding (VVC). According to the experimental results, the proposed method achieves better performance than the state-of-the-art method used in VVC.

SESSION: Vision in Multimedia

Session details: Vision in Multimedia

  • Hang Hsueh-Ming

Multi-Dilation Network for Crowd Counting

  • Wang Shuheng

With the growth of urban population, crowd analysis has become an important and necessary task in the field of computer vision. The goal of crowd counting, which is a subfield of crowd analysis, is to count the number of people in an image or a zone of a picture. Due to problems such as heavy occlusions and variations in perspective and luminous intensity, crowd counting remains extremely challenging. Recent state-of-the-art approaches are mainly designed with convolutional neural networks to generate density maps. In this work, Multi-Dilation Network (MDNet) is proposed to solve the problem of crowd counting in congested scenes. The MDNet is made up of two parts: a VGG-16 based front end for feature extraction and a back end containing multi-dilation blocks to generate density maps. In particular, a multi-dilation block has four branches which are used to collect features at different sizes. By using dilated convolutional operations, the multi-dilation block can obtain various features while the maximum kernel size is still 3 x 3. The experiments on two challenging crowd counting datasets, UCF_CC_50 and ShanghaiTech, show that the proposed MDNet achieves better performance than other state-of-the-art methods, with a lower mean absolute error and mean squared error. Compared to a network with multi-scale blocks which adopt larger kernels to extract features, MDNet still achieves competitive performance with fewer model parameters.
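
To make the parameter argument in this abstract concrete, the sketch below computes the span covered by a dilated 3 x 3 kernel and compares its weight count with a dense kernel covering the same span. The span formula k + (k - 1)(d - 1) is the standard one for dilated convolution; the four dilation rates are an assumption for illustration and are not taken from the paper.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in pixels, along one axis) covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size: int) -> int:
    """Number of weights in one 2-D filter for a single input channel."""
    return kernel_size * kernel_size


# Hypothetical dilation rates for a four-branch block (not from the paper).
ASSUMED_DILATION_RATES = (1, 2, 3, 4)


def branch_summary(rates=ASSUMED_DILATION_RATES, kernel_size: int = 3):
    """For each branch, compare a dilated 3 x 3 kernel with a dense kernel
    that covers the same span."""
    rows = []
    for d in rates:
        span = effective_kernel_size(kernel_size, d)
        rows.append({
            "dilation": d,
            "span": span,
            "dilated_weights": weights_per_filter(kernel_size),
            "dense_weights": weights_per_filter(span),
        })
    return rows


if __name__ == "__main__":
    for row in branch_summary():
        print(row)
```

Under these assumed rates, every branch keeps 9 weights per filter and input channel, whereas a dense kernel with the same span as the dilation-4 branch (9 x 9) would need 81.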

Excluding the Misleading Relatedness Between Attributes in Multi-Task Attribute Recognition Network

  • Cai Sirui

In the attribute recognition area, attributes that are unrelated in the real world may have a high co-occurrence rate in a dataset due to dataset bias, which forms a misleading relatedness. A neural network, especially a multi-task neural network, trained on such a dataset would learn this relatedness and be misled when it is used in practice. In this paper, we propose the Share-and-Compete Multi-Task deep learning (SCMTL) model to handle this problem. This model uses adversarial training methods to enhance competition between unrelated attributes while keeping sharing between related attributes, making the task-specific layers of the multi-task model more specific and thus ruling out the misleading relatedness between the unrelated attributes. Experiments performed on elaborately designed datasets show that the proposed model outperforms the single-task neural network and the traditional multi-task neural network in the situation mentioned above.

Robust Visual Tracking via Statistical Positive Sample Generation and Gradient Aware Learning

  • Lin Lijian

In recent years, Convolutional Neural Network (CNN) based trackers have achieved state-of-the-art performance on multiple benchmark datasets. Most of these trackers train a binary classifier to distinguish the target from its background. However, they suffer from two limitations. Firstly, these trackers cannot effectively handle significant appearance variations due to the limited number of positive samples. Secondly, there exists a significant imbalance of gradient contributions between easy and hard samples, where the easy samples usually dominate the computation of the gradient. In this paper, we propose a robust tracking method via Statistical Positive sample generation and Gradient Aware learning (SPGA) to address the above two limitations. To enrich the diversity of positive samples, we present an effective and efficient statistical positive sample generation algorithm to generate positive samples in the feature space. Furthermore, to handle the imbalance between easy and hard samples, we propose a gradient sensitive loss to harmonize the gradient contributions between easy and hard samples. Extensive experiments on three challenging benchmark datasets, including OTB50, OTB100 and VOT2016, demonstrate that the proposed SPGA performs favorably against several state-of-the-art trackers.

Exploring Semantic Segmentation on the DCT Representation

  • Lo Shao-Yuan

Typical convolutional networks are trained and run on RGB images. However, images are often compressed for memory savings and efficient transmission in real-world applications. In this paper, we explore methods for performing semantic segmentation on the discrete cosine transform (DCT) representation defined by the JPEG standard. We first rearrange the DCT coefficients to form a preferred input type, then we tailor an existing network to the DCT inputs. The proposed method has an accuracy close to that of the RGB model at about the same network complexity. Moreover, we investigate the impact of selecting different DCT components on segmentation performance. With a proper selection, one can achieve the same level of accuracy using only 36% of the DCT coefficients. We further show the robustness of our method under quantization errors. To our knowledge, this paper is the first to explore semantic segmentation on the DCT representation.
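
The abstract does not say how the DCT coefficients are rearranged, so the sketch below shows only one plausible layout, stated here as an assumption: each of the 64 frequencies of an 8 x 8 block becomes its own channel, turning an H x W plane into 64 channels of size H/8 x W/8. It uses an orthonormal 8 x 8 DCT-II, which has the same scaling as the JPEG forward DCT, and omits JPEG's level shift, quantization, and entropy coding. The function names are hypothetical.

```python
import math

BLOCK = 8  # JPEG operates on 8 x 8 blocks


def dct_1d(values):
    """Orthonormal 1-D DCT-II of a list of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column.
    Returns coeffs[u][v] with u the vertical and v the horizontal frequency."""
    size = len(block)
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[r][c] for r in range(size)]) for c in range(size)]
    return [[cols[v][u] for v in range(size)] for u in range(size)]


def image_to_frequency_channels(image):
    """Rearrange a single-channel image (list of rows) into 64 channels,
    one per DCT frequency, each of size (H/8) x (W/8). Assumed layout."""
    height, width = len(image), len(image[0])
    if height % BLOCK or width % BLOCK:
        raise ValueError("image sides must be multiples of 8")
    blocks_y, blocks_x = height // BLOCK, width // BLOCK
    channels = [[[0.0] * blocks_x for _ in range(blocks_y)]
                for _ in range(BLOCK * BLOCK)]
    for by in range(blocks_y):
        for bx in range(blocks_x):
            block = [row[bx * BLOCK:(bx + 1) * BLOCK]
                     for row in image[by * BLOCK:(by + 1) * BLOCK]]
            coeffs = dct_2d(block)
            for u in range(BLOCK):
                for v in range(BLOCK):
                    channels[u * BLOCK + v][by][bx] = coeffs[u][v]
    return channels


if __name__ == "__main__":
    flat = [[10.0] * 8 for _ in range(16)]  # constant 16 x 8 test image
    channels = image_to_frequency_channels(flat)
    print(len(channels), len(channels[0]), len(channels[0][0]))
    print(round(channels[0][0][0], 6))  # DC term of the first block
```

With such a layout, keeping a subset of DCT components amounts to keeping a subset of the 64 channels; which components the paper actually retains is not stated in the abstract.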