ICDAR '20: Proceedings of the 2020 on Intelligent Cross-Data Analysis and Retrieval Workshop

Full Citation in the ACM Digital Library

SESSION: Keynote & Invited Talks

From Data Collection Merit to Data Connection Merit for Smart Sustainable Cities

  • Koji Zettsu

Smart data refers to IoT data that has been processed into valuable data that can
be turned into actionable information, which can be used for intelligence, planning,
controlling and decision making efficiently and effectively by governments, industries
and citizens. An unprecedentedly large amount and variety of sensory data can be collected
to explore how these big data can become smart data and offer intelligence. Advanced
data modeling and analytics, as well as data science solutions, are indispensable
for transforming big data into smart data. To accelerate the utilization of smart
data, the NICT Real Space Information Analytics Project is developing cross-data
analytics technology that uses data obtained from a variety of sensing technologies
and different kinds of social big data to construct a platform that will help develop
and expand smart services with a view towards smart sustainable cities. We are developing
the xData Platform on NICT's Integrated Testbed, which is a platform implementing
functions of a data loader for data collection, retrieval and conversion from a variety
of data sources, association mining for spatiotemporal data integration and discovery
of association rules, machine learning for prediction of spatiotemporal association
patterns, and creation and distribution of prediction result data in a GIS format
for route search and alert notification. To accelerate open innovation using the
xData Platform, we conducted field experiments with the participation of citizens.


Microwave Doppler Radar Sensing System for Vital Sign Detection: From Evaluated Accuracy
Models to the Intelligent System

  • Thi Phuoc Van Nguyen
  • Thanh Tung Tran

The development of microwave radar vital sign sensing systems brings many benefits
to mankind. Such a system can be used to locate living people buried
under debris. Other important applications of microwave radar sensors include smart homes,
health care and defense. A radar sensor has several main blocks, such as
a transmitter, a receiver, a signal processing circuit and a display block. The transmitter
propagates radio frequency signals toward the human, and the receiver collects the signals
reflected from the human chest. By analyzing the transmitted and received signals, useful
information such as breathing rate, heartbeat, and people's location is extracted. This work
focuses on a mathematical model for evaluating the accuracy of a radar vital sign sensing
system as the operating frequency and distance change. Moreover, the integration
of AI techniques with the radar sensor is also considered in this study.
The combination makes the system smarter, enables more applications and brings more
benefits to users.

Malware Detection Using System Logs

  • Nhu T. Nguyen
  • Thuy T. Pham
  • Tien X. Dang
  • Minh-Son Dao
  • Duc-Tien Dang-Nguyen
  • Cathal Gurrin
  • Binh T. Nguyen

Malware detection is one of the most critical features in many real applications,
especially for mobile platforms and Internet of Things (IoT) technology. Due
to the proliferation of mobile devices and the associated app stores, the volume of
new applications is growing extremely fast, which requires a better way to analyze all
possible malicious behaviors. In this paper, we investigate the malware prediction problem
using system log files that contain sequences of system calls recorded
from IoT devices. We construct a suitable multi-class classification model by using
a combination of hand-crafted features (including Bag-of-Ngrams, TF-IDF, and
statistical metrics computed from the consecutive repeated system calls in each log
file). We also consider different machine learning models, including Random Forest,
Support Vector Machines, and Extreme Gradient Boosting, and measure the performance
of each method in terms of precision, recall, and F1-score. The experimental results
show that a combination of different features, together with the Extreme Gradient
Boosting technique, achieves promising performance on the dataset provided
by the organizers of the CMDC 2019 competition.
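
The three feature families named in this abstract can be illustrated with a minimal
sketch. It is not the authors' implementation: the function names, the bigram size,
the plain log-based IDF weighting, and the toy logs are all assumptions made for
illustration, using only the Python standard library.

```python
import math
from collections import Counter
from itertools import groupby


def ngrams(calls, n):
    """Return the n-grams (as tuples) of a system-call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def tfidf(logs, n=2):
    """TF-IDF weight of every n-gram in each log (assumed weighting scheme)."""
    counts = [Counter(ngrams(log, n)) for log in logs]
    # Document frequency: in how many logs each n-gram appears.
    doc_freq = Counter(gram for c in counts for gram in c)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({
            gram: (k / total) * math.log(len(logs) / doc_freq[gram])
            for gram, k in c.items()
        })
    return vectors


def repeat_stats(calls):
    """Longest and mean run length of consecutive repeated calls."""
    runs = [len(list(group)) for _, group in groupby(calls)]
    if not runs:
        return {"max_run": 0, "mean_run": 0.0}
    return {"max_run": max(runs), "mean_run": sum(runs) / len(runs)}


# Hypothetical toy logs; real logs would come from IoT devices.
logs = [
    ["open", "read", "read", "read", "close"],
    ["open", "write", "close"],
]
vectors = tfidf(logs, n=2)
stats = repeat_stats(logs[0])
```

In a full pipeline, the per-log dictionaries would be aligned into fixed-length vectors
and concatenated with the run-length statistics before being passed to a classifier.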

Residence and Workplace Recovery: User Privacy Risk in Mobility Data

  • Yuchen Qiu
  • Yuanyuan Qiao
  • Aimin Zhang
  • Jie Yang

Mobility data collected through mobile devices and cellular networks has been used
in academic research and for commercial purposes for the last decade. Since releasing
individuals' mobility records or trajectories gives rise to privacy issues, dataset
owners tend to publish only encrypted mobility data, which does not contain users'
identifiers such as telephone numbers. However, we argue and prove that even
publishing encrypted mobility data can lead to privacy problems, the most critical of
which is the identification of users' residences and workplaces. We develop an attack system
that is able to identify users' important locations using a semi-supervised learning
model. In addition to traditional time features, our system takes the users' mobility
and living patterns into consideration, which are important and affect each other.
Our model requires fewer ground truth labels and produces considerable improvement
in learning accuracy. With large-scale real-world mobile data and long-term tracking
ground truth data captured from a big city, we show that our attack system is able
to identify users' residence and workplace with an accuracy of about 98%, which indicates
severe privacy leakage in such datasets. And we provide advice for this kind of privacy-preserving

MNR-HCM Data: A Personal Lifelog and Surrounding Environment Dataset in Ho-Chi-Minh
City, Viet Nam

  • Tan-Loc Nguyen-Tai
  • Dang-Hieu Nguyen
  • Minh-Tam Nguyen
  • Thanh-Duong Nguyen
  • Thanh-Hai Dang
  • Minh-Son Dao

In this paper, we introduce a new dataset that contains personal lifelog and surrounding
environment data, collected periodically along predefined routes in Ho-Chi-Minh city,
Vietnam. We also introduce self-developed devices as well as system architecture for
gathering, storing, accessing, and visualizing data. Moreover, we discuss some exciting
research topics and applications arising from the insights of this dataset, especially
for understanding correlations between people's health, air pollution, and congestion.

A Digital Insight Provider From Financial Documents In Banking

  • Gokce Aydugan Baydar
  • Cisem Altan
  • Bilge Koroglu
  • Seçil Arslan

Calculating credibility and ensuring financial sustainability of companies are major
challenges for financial centers. Customer relationship managers in a bank are responsible
for maintaining and expanding their portfolios. Trial balance is a semi-structured
financial document in which related transactions, assets, and liabilities of a company
are recorded in detail. Trial balances are very important for customer relationship
managers because they not only picture the financial situation of the company but
also contain new sales opportunities. However, due to the size of these documents
and other responsibilities of customer relationship managers, examining trial balances
manually is not feasible. In this paper, we present a structured information retrieval
and data analysis system which provides new sales opportunities by automatically and
recursively digitalizing documents of the bank's customers. With high-end interactive
visualization capabilities, such a system is crucial for customer relationship managers
to maintain and expand their portfolios. This system works with high confidence, hence
it not only saves customer relationship managers' time but also eliminates human mistakes.

Duplicate Identification Algorithms in SaaS Platforms

  • Dac Nguyen
  • Quy H. Nguyen
  • Minh-Son Dao
  • Duc-Tien Dang-Nguyen
  • Cathal Gurrin
  • Binh T. Nguyen

The existence of duplicate records is one of the most common issues in many Software-as-a-Service
(SaaS) platforms. In this paper, we study the duplicate identification problem in
one specific SaaS platform related to quality and compliance management by using the
address information. We analyze the typical user mistakes that can produce
duplicated organizations in a given dataset collected from the SaaS
platform. We also create another dataset by crawling location data from Open Address
(US Zone). We compare different methods, including Bag-of-words (using Cosine Distance),
Record Linkage Toolkits, and Siamese Neural Networks using the triplet loss, in terms
of precision, recall, and F1-score. The experimental results show that Siamese
Neural Networks achieve better performance than the other techniques.
We plan to publish our Open Address dataset and all implementation code to facilitate
further research in the related fields.
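
Two of the ingredients mentioned in this abstract, cosine distance over a bag-of-words
representation and the triplet loss, can be sketched as follows. This is a minimal
illustration and not the authors' code: the tokenizer, the function names, the margin
value, and the example addresses are assumptions, and the triplet loss is shown on
precomputed distances rather than inside a Siamese network.

```python
import math
import re
from collections import Counter


def tokens(address):
    """Lowercase an address and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", address.lower())


def cosine_distance(a, b):
    """1 - cosine similarity between the bag-of-words vectors of a and b."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    if norm == 0:
        return 1.0  # an empty address is treated as maximally distant
    return 1.0 - dot / norm


def triplet_loss(d_pos, d_neg, margin=0.5):
    """Hinge-style triplet loss on anchor-positive and anchor-negative distances."""
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical addresses: the first two differ only in case and punctuation.
near = cosine_distance("12 Main St.", "12 main st")
far = cosine_distance("12 Main St", "99 Oak Ave")
loss = triplet_loss(near, far)
```

A pair of records could then be flagged as a candidate duplicate when its distance
falls below a threshold chosen on labeled data.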