ICDAR '21: Proceedings of the 2021 Workshop on Intelligent Cross-Data Analysis and Retrieval


SESSION: Keynote & Invited Talks

Session details: Keynote & Invited Talks

  • Minh-Son Dao

Discovering Knowledge Hidden in Raster Images using RasterMiner

  • R. Uday Kiran

Satellite imagery naturally exists as raster data. Useful information that can empower
domain experts to improve their decision-making lies hidden in this data. However,
finding this hidden knowledge is non-trivial and challenging due to the lack of integrated
open-source software for discovering knowledge from raster data. In particular, existing
open-source general-purpose data mining libraries, such as Knime [1], Mahout [3],
Weka [5], Sci-kit [4], and SPMF [2], are inadequate for finding knowledge hidden in
raster datasets.

In this talk, we present rasterMiner, an integrated open-source software package for
discovering knowledge in raster imagery datasets. It currently provides unsupervised
learning techniques, such as pattern mining and clustering, to discover knowledge
hidden in raster data. The key features of our software are as follows: (i) it provides
four pattern mining algorithms and four clustering algorithms to discover knowledge
from raster data; (ii) it provides the "elbow method" to choose an appropriate k value
for the k-means and k-means++ algorithms; (iii) it presents an integrated GUI that
lets domain experts choose the algorithm(s) of their choice; (iv) it can also be accessed
as a Python library; and (v) the knowledge it discovers can be stored in standard
formats, so the generated knowledge can be visualized using any GIS software.
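
The "elbow method" named in feature (ii) picks k by looking for the point where the
clustering inertia (sum of squared distances to the nearest center) stops dropping
sharply. The sketch below is generic NumPy code, not rasterMiner's implementation;
the `kmeans` helper and the synthetic data are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's-algorithm k-means; returns labels and inertia (SSE)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        for j in range(k):  # keep the old center if a cluster empties out
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, float(((X - centers[labels]) ** 2).sum())

def best_inertia(X, k, restarts=8):
    """Best (lowest) inertia over several random initializations."""
    return min(kmeans(X, k, seed=s)[1] for s in range(restarts))

# Synthetic "raster pixel" features with three well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

# Inertia falls sharply until k reaches the true cluster count, then flattens.
inertias = {k: best_inertia(X, k) for k in range(1, 7)}
```

Plotting k against inertia shows the curve bending at the true cluster count; here
the elbow should appear at k = 3.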

Multimodal Virtual Avatars for Investigative Interviews with Children

  • Gunn Astrid Baugerud
  • Miriam S. Johnson
  • Ragnhild Klingenberg Røed
  • Michael E. Lamb
  • Martine Powell
  • Vajira Thambawita
  • Steven A. Hicks
  • Pegah Salehi
  • Syed Zohaib Hassan
  • Pål Halvorsen
  • Michael A. Riegler

In this article, we present our ongoing work in the field of training police officers
who conduct interviews with abused children. The objectives in this context are to
protect vulnerable children from abuse, facilitate prosecution of offenders, and ensure
that innocent adults are not accused of criminal acts. There is therefore a need for
more data that can be used for improved interviewer training to equip police with
the skills to conduct high-quality interviews. To support this important task, we
propose to research a training program that utilizes different system components and
multimodal data from the field of artificial intelligence such as chatbots, generation
of visual content, text-to-speech, and speech-to-text. This program will be able to
generate an almost unlimited amount of interview and training data. The goal of combining
these different technologies and data types is to create an immersive and interactive
child avatar that responds in a realistic way, one that supports the training of police
interviewers but can also produce synthetic data of interview situations for solving
related problems in the same domain.

SESSION: Session 1: Full Papers

Session details: Session 1: Full Papers

  • Cathal Gurrin

ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

  • Meng-Jiun Chiou
  • Chun-Yu Liao
  • Li-Wei Wang
  • Roger Zimmermann
  • Jiashi Feng

Detecting human-object interactions (HOI) is an important step toward a comprehensive
visual understanding of machines. While detecting non-temporal HOIs (e.g., sitting
on a chair) from static images is feasible, it is unlikely even for humans to guess
temporal-related HOIs (e.g., opening/closing a door) from a single video frame, where
the neighboring frames play an essential role. However, conventional HOI methods operating
on only static images have been used to predict temporal-related interactions, which
is essentially guessing without temporal contexts and may lead to sub-optimal performance.
In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal
information. We first show that a naive temporal-aware variant of a common action
detection baseline does not work on video-based HOIs due to a feature-inconsistency
issue. We then propose a simple yet effective architecture named Spatial-Temporal
HOI Detection (ST-HOI) utilizing temporal information such as human and object trajectories,
correctly-localized visual features, and spatial-temporal masking pose features. We
construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves
as a solid baseline.

Temperature Forecasting using Tower Networks

  • Siri S. Eide
  • Michael A. Riegler
  • Hugo L. Hammer
  • John Bjørnar Bremnes

In this paper, we present the tower network, a novel, computationally lightweight
deep neural network for multimodal data analytics and video prediction. The tower
network is especially useful when it comes to combining different types of input data,
a problem not greatly explored within deep learning. The architecture is further applied
to a real-world example, where information from historic meteorological observations
and numerical weather predictions is combined to produce high-quality forecasts of
temperature for 1 to 6 hours into the future. The performance of the proposed model
is assessed in terms of root mean squared error (RMSE), and the tower network outperforms
even state-of-the-art forecasts from the Norwegian weather forecasting app yr.no from
3 hours into the future. On average, the RMSE of the tower network is approximately
6% smaller than that of yr.no, and approximately 27% smaller than that of the raw
numerical weather predictions.
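
For reference, the RMSE figures quoted above are the square root of the mean squared
difference between observed and forecast values. A minimal sketch with made-up toy
numbers (not the paper's data):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observations and forecasts."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example: observed temperatures vs. two competing forecasts.
obs = [12.1, 11.8, 13.0, 12.5]
forecast_a = [12.0, 12.0, 12.8, 12.4]   # small errors -> low RMSE
forecast_b = [11.0, 12.5, 13.9, 13.2]   # larger errors -> higher RMSE
```

A "6% smaller RMSE" then means the tower network's `rmse` is 0.94 times that of the
reference forecast over the same observations.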

Scattering Transform Based Image Clustering using Projection onto Orthogonal Complement

  • Angel Villar-Corrales
  • Veniamin I. Morgenshtern

In the last few years, large improvements in image clustering have been driven by
the recent advances in deep learning. However, due to the architectural complexity
of deep neural networks, there is no mathematical theory that explains the success
of deep clustering techniques. In this work we introduce Projected-Scattering Spectral
Clustering (PSSC), a state-of-the-art, stable, and fast algorithm for image clustering,
which is also mathematically interpretable. PSSC includes a novel method to exploit
the geometric structure of the scattering transform of small images. This method is
inspired by the observation that, in the scattering transform domain, the subspaces
formed by the eigenvectors corresponding to the few largest eigenvalues of the data
matrices of individual classes are nearly shared among different classes. Therefore,
projecting out those shared subspaces reduces the intra-class variability, substantially
increasing the clustering performance. We call this method 'Projection onto Orthogonal
Complement' (POC). Our experiments demonstrate that PSSC obtains the best results
among all shallow clustering algorithms. Moreover, it achieves comparable clustering
performance to that of recent state-of-the-art clustering techniques, while reducing
the execution time by more than one order of magnitude.
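
The core POC step, estimating a subspace that is (nearly) shared across classes and
projecting it out to shrink intra-class variability, can be sketched on synthetic
data. This is a deliberately simplified version using the pooled data's top singular
vector; the paper works with per-class eigenvectors of scattering-domain data matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for scattering features: both classes share a strong common
# direction (e.g., overall brightness) on top of a small class-specific offset.
shared = rng.normal(size=64)
shared /= np.linalg.norm(shared)
classes = []
for offset in (-3.0, 3.0):
    coeffs = rng.normal(scale=5.0, size=(100, 1))       # large shared variation
    cls = (coeffs * shared                              # shared direction
           + offset * np.eye(64)[0]                     # class-specific signal
           + 0.1 * rng.normal(size=(100, 64)))          # small noise
    classes.append(cls)
X = np.vstack(classes)

# POC (simplified): estimate the dominant shared direction from the pooled,
# centered data and project every sample onto its orthogonal complement.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
V_shared = Vt[:1].T                        # top right-singular vector
P = np.eye(64) - V_shared @ V_shared.T     # projector onto orthogonal complement
X_poc = X @ P
```

After the projection, the within-class spread along the shared direction largely
disappears while the class-specific offsets survive, which is what makes the subsequent
spectral clustering easier.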

Pyramidal Segmentation of Medical Images using Adversarial Training

  • Espen Naess
  • Vajira Thambawita
  • Steven A. Hicks
  • Michael A. Riegler
  • Paal Halvorsen

Colorectal cancer is a severe health issue globally and a significant cause of cancer-related
mortality, but it is treatable if found at an early stage. Early detection is usually
done through a colonoscopy, where clinicians search for cancer precursors called polyps.
Research has shown that clinicians miss between 14% and 30% of polyps during standard
screenings of the gastrointestinal tract. Furthermore, once the polyps have been found,
clinicians often overestimate the size of the polyps. In this respect, automatic analysis
of medical images for detecting and locating polyps is a research area where machine
learning has excelled in recent years. Still, current models have much room for improvement.
In this paper, we propose a novel approach based on learning to segment within several
grids, which we introduce into the U-Net and Pix2Pix architectures. In short, we have
experimented with several grid sizes, using two open-source polyp segmentation datasets
for cross-data training and testing. Our results suggest that segmentation at lower resolutions
produces better results at the cost of less precision, which proved useful for the
cases where higher precision segmentations gave limited results. Generally, compared
to traditional U-Net and Pix2Pix, our grid-based approaches improve segmentation performance.
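
The idea of segmenting at several grid sizes can be pictured as supervising the model
on coarsened versions of the ground-truth mask. The helper below is a generic illustration
of such coarsening (majority pooling onto a grid), not the exact mechanism used in
the paper's U-Net/Pix2Pix variants.

```python
import numpy as np

def mask_to_grid(mask, cell):
    """Downsample a binary segmentation mask onto a coarse grid: each
    cell x cell block becomes one grid value via majority pooling."""
    h, w = mask.shape
    gh, gw = h // cell, w // cell
    blocks = mask[:gh * cell, :gw * cell].reshape(gh, cell, gw, cell)
    return (blocks.mean(axis=(1, 3)) > 0.5).astype(np.uint8)

# An 8x8 toy "polyp" mask whose foreground fills the top-left quadrant.
mask = np.zeros((8, 8))
mask[:4, :4] = 1
coarse = mask_to_grid(mask, 4)   # 2x2 grid: only the top-left cell is on
```

Coarser grids trade boundary precision for easier, more robust targets, matching the
precision trade-off the results describe.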

Two-Faced Humans on Twitter and Facebook: Harvesting Social Multimedia for Human Personality

  • Qi Yang
  • Aleksandr Farseev
  • Andrey Filchenkov

Human personality traits are the key drivers behind our decision-making, influencing
our life path on a daily basis. Inference of personality traits, such as Myers-Briggs
Personality Type, as well as an understanding of dependencies between personality
traits and users' behavior on various social media platforms is of crucial importance
to modern research and industry applications. The emergence of diverse and cross-purpose
social media avenues makes it possible to perform user personality profiling automatically
and efficiently based on data represented across multiple data modalities. However,
the research efforts on personality profiling from multi-source multi-modal social
media data are relatively sparse, and the level of impact of different social network
data on machine learning performance has yet to be comprehensively evaluated. Furthermore,
there is no such dataset available to the research community for benchmarking. This
study is one of the first attempts to bridge this important research gap. Specifically,
in this work, we infer the Myers-Briggs Personality Type indicators by applying a
novel multi-view fusion framework called "PERS" and comparing the performance results
not just across data modalities but also with respect to different social network
data sources. Our experimental results demonstrate PERS's ability to learn from multi-view
data for personality profiling by efficiently leveraging the significantly different
data arriving from diverse social multimedia sources. We also found that the selection
of a machine learning approach is of crucial importance when choosing social network
data sources, and that people tend to reveal multiple facets of their personality
in different social media avenues. Our released social multimedia dataset facilitates
future research in this direction.

SESSION: Session 2: Short Papers

Session details: Session 2: Short Papers

  • Thanh-Binh Nguyen

Cross-Modal Deep Neural Networks based Smartphone Authentication for Intelligent Things

  • Tran Anh Khoa
  • Dinh Nguyen The Truong
  • Duc N. M. Dang

Nowadays, identity authentication technology, including biometric identification features
such as the iris and fingerprints, plays an essential role in the safety of intelligent
devices. However, it cannot provide real-time, continuous identification of user identity.
This paper presents a framework for user authentication from motion signals, such
as accelerometer and gyroscope signals received from smartphones. The proposed scheme
includes i) a data preprocessing step and ii) a novel feature extraction and authentication
scheme based on a cross-modal deep neural network, applying a time-distributed
Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models. The
experimental results show the advantage of our approach over existing methods.
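
A common preprocessing step for such motion signals, and one plausible reading of
step i), is to cut the continuous accelerometer and gyroscope streams into fixed-length
overlapping windows before they are fed to time-distributed CNN and LSTM layers. The
sketch below is illustrative only; the window length, stride, and channel layout are
assumed values, not the authors' pipeline.

```python
import numpy as np

def window_signals(signal, win_len=128, stride=64):
    """Split a (T, channels) motion-signal stream into overlapping
    fixed-length windows of shape (n_windows, win_len, channels)."""
    starts = range(0, len(signal) - win_len + 1, stride)
    return np.stack([signal[s:s + win_len] for s in starts])

# Toy 6-channel stream: 3 accelerometer + 3 gyroscope axes, 1000 samples.
stream = np.random.default_rng(0).normal(size=(1000, 6))
batch = window_signals(stream)   # ready for a time-distributed CNN + LSTM
```

Each window can then be encoded by the CNN, with the LSTM modeling the sequence of
window embeddings over time.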

Models to Predict Sleeping Quality from Activities and Environment: Current Status,
Challenges and Opportunities

  • Thi Phuoc Van Nguyen
  • Do Van Nguyen
  • Koji Zettsu

The development of remote and wearable sensors enables more research in the health
care area. Based on these kinds of sensors, information about a person's activity
level and health parameters can be collected to predict their health status. Sleeping
quality is an important factor in making a person feel healthy. In this work, we summarize
the current models for predicting sleeping quality. Inputs to those models can be
environmental factors, activities, or time-series data from wearable sensors. The
characteristics of the input data may guide the choice of prediction model. The domain
of the data used to forecast sleeping quality is considered carefully alongside the
prediction model itself. Challenges and future work for this research direction are
also discussed in this paper.

Dutkat: A Multimedia System for Catching Illegal Catchers in a Privacy-Preserving
Manner

  • Tor-Arne S. Nordmo
  • Aril B. Ovesen
  • Håvard D. Johansen
  • Michael A. Riegler
  • Pål Halvorsen
  • Dag Johansen

Fish crime is considered a serious global problem for the healthy and sustainable
development of one of mankind's important sources of food. Technological surveillance
and control solutions are emerging as remedies to combat criminal activities, but
such solutions might also come with impractical and negative side-effects and challenges.
In this paper, we present the concept and design of a surveillance system that, in
contrast to current surveillance trends, strikes a delicate balance between the privacy
of legal actors and the need to capture evidence-based footage, sensory data, and
forensic proof of illicit activities. Our proposed novel approach is to assist human operators
in the 24/7 surveillance loop of remote professional fishing activities with a privacy-preserving
Artificial Intelligence (AI) surveillance system operating in the same proximity as
the activities being surveyed. The system will primarily be using video surveillance
data, but also other sensor data captured on the fishing vessel. Additionally, the
system correlates with other sources such as reports from other fish catches in the
approximate area and time, etc. Only upon true positive flagging of specific potentially
illicit activities by the locally executing AI algorithms, can forensic evidence be
accessed from this physical edge, the fishing vessel. Besides being a more privacy-preserving
solution, our edge-based AI system also benefits from having far less data to transfer
over unreliable, low-bandwidth satellite-based networks.

Investigation on Privacy-Preserving Techniques For Personal Data

  • Rafik Hamza
  • Koji Zettsu

Privacy protection technology has become a crucial part of almost every existing cross-data
analysis application. Privacy-preserving techniques allow sensitive personal information
to be shared while preserving users' privacy. This new trend improves data collection
by increasing analytical accuracy, raising the number of participants, and providing
a better understanding of the participants' environments. Collecting such personal
data is important for many beneficial applications, such as health monitoring. Nevertheless,
these applications face real privacy threats and concerns about the handling of personal
information. This paper surveys privacy-preserving personal data mining technologies
and analyzes their advantages and shortcomings. Our purpose is to provide an in-depth
understanding of personal data privacy and to highlight important viewpoints, existing
challenges, and future research directions.