MuCAI'20: Proceedings of the 1st International Workshop on Multimodal Conversational AI
SESSION: Keynote Talk
Augment Machine Intelligence with Multimodal Information
- Zhou Yu
Humans interact with other humans or the world through information from various channels
including vision, audio, language, haptics, etc. To simulate intelligence, machines
require similar abilities to process and combine information from different channels
to acquire better situation awareness, better communication ability, and better decision-making
ability. In this talk, we describe three projects. In the first study, we enable a
robot to utilize both vision and audio information to achieve better user understanding
[1]. Then we use incremental language generation to improve the robot's communication
with a human. In the second study, we utilize multimodal history tracking to optimize
policy planning in task-oriented visual dialogs. In the third project, we tackle the
well-known trade-off between dialog response relevance and policy effectiveness in
visual dialog generation. We propose a new machine learning procedure that alternates
between supervised learning and reinforcement learning to optimize language generation
and policy planning jointly in visual dialogs [2]. We will also cover some recent
ongoing work on image synthesis through dialogs.
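To make the alternating training procedure concrete, the following is a minimal, hypothetical sketch of such a loop; the model and environment interface (supervised_loss, rollout, policy_gradient_loss) is illustrative and not taken from the talk or from [2].

```python
# Illustrative sketch only: alternating supervised and reinforcement learning
# phases to jointly optimize response generation and dialog policy.
# The model/environment interface below is hypothetical.

def train_alternating(model, sl_batches, dialog_env,
                      rounds=10, sl_steps=200, rl_steps=200):
    for _ in range(rounds):
        # Supervised phase: keep generated responses close to human references,
        # which preserves response relevance.
        for _ in range(sl_steps):
            batch = next(sl_batches)
            model.step(model.supervised_loss(batch))   # cross-entropy on reference replies
        # Reinforcement phase: roll out dialogs and reward task success,
        # which improves policy effectiveness.
        for _ in range(rl_steps):
            episode = dialog_env.rollout(model)        # self-play visual dialog
            loss = model.policy_gradient_loss(episode, episode.task_reward())
            model.step(loss)
    return model
```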
SESSION: Workshop Presentations
Assisted Speech to Enable Second Language
- Mehmet Altinkaya
- Arnold W.M. Smeulders
Speaking a second language (L2) is a desired capability for billions of people. Currently,
the only way to achieve it naturally is through lengthy and tedious training, which
ends at various stages of fluency. The process is far from the natural acquisition of a
language. In this paper, we propose a system that enables any person with some basic
understanding of L2 to speak fluently through "Instant Assistance" provided by digital
conversational agents such as Google Assistant, Microsoft Cortana, or Apple Siri, which
monitor the speaker. The assistant attends to the speaker and provides help to continue
speaking when speech is interrupted because the language is not yet completely mastered.
The not yet acquired elements of language can be missing words, unfamiliarity with
expressions, the implicit rules of articles, and the habits of sayings. We can employ
the hardware and software of the assistants to create an immersive, adaptive learning
environment that trains the speaker online through symbiotic interaction for implicit,
unnoticeable correction.
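As a rough illustration of the "Instant Assistance" loop described above, the sketch below monitors a stream of partial ASR transcripts and whispers a suggested continuation when the speaker stalls; the function names and the pause threshold are assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch of an instant-assistance loop: watch partial transcripts
# and offer a continuation when speech stalls. All interfaces are illustrative.

PAUSE_THRESHOLD_S = 1.2   # assumed: a silence longer than this counts as a stall

def assist_loop(listen, predict_continuation, whisper_prompt):
    """listen() is assumed to yield (timestamp, partial_transcript) events from the
    assistant's ASR; predict_continuation() proposes the likely next words from a
    language model; whisper_prompt() delivers the hint unobtrusively (e.g. earpiece)."""
    transcript, last_speech_time = "", None
    for timestamp, partial in listen():
        if partial != transcript:                    # the speaker is still talking
            transcript, last_speech_time = partial, timestamp
        elif (last_speech_time is not None
              and timestamp - last_speech_time > PAUSE_THRESHOLD_S):
            hint = predict_continuation(transcript)  # e.g. a missing word or expression
            whisper_prompt(hint)                     # implicit, unnoticeable correction
            last_speech_time = timestamp             # avoid repeating the same hint
    return transcript
```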
A Dynamic, Self Supervised, Large Scale AudioVisual Dataset for Stuttered Speech
- Mehmet Altinkaya
- Arnold W.M. Smeulders
Stuttering affects at least 1% of the world population. It is caused by irregular
disruptions in speech production. These interruptions occur in various forms and frequencies.
Repetition of words or parts of words, prolongations, or blocks in getting the words
out are the most common ones.
Accurate detection and classification of stuttering would be important in the assessment
of severity for speech therapy. Furthermore, real time detection might create many
new possibilities to facilitate reconstruction into fluent speech. Such an interface
could help people to utilize voice-based interfaces like Apple Siri and Google Assistant,
or to make (video) phone calls more fluent by delayed delivery.
In this paper we present the first expandable audio-visual database of stuttered speech.
We explore an end-to-end, real-time, multi-modal model for detection and classification
of stuttered blocks in unbound speech. We also make use of video signals, since acoustic
signals are not produced immediately during a block. We use multiple modalities because
acoustic signals, together with secondary characteristics exhibited in visual signals,
permit increased detection accuracy.
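As a rough sketch of the kind of multi-modal model the abstract describes, the following PyTorch module fuses an audio encoder and a video encoder for disfluency classification; the feature dimensions, label set, and architecture are illustrative assumptions rather than the authors' design.

```python
# Illustrative audio-visual late-fusion classifier for disfluency types.
# Dimensions and labels are assumptions, not taken from the paper.

import torch
import torch.nn as nn

LABELS = ["fluent", "repetition", "prolongation", "block"]  # assumed classes

class AudioVisualStutterNet(nn.Module):
    def __init__(self, audio_dim=40, video_dim=136, hidden=128):
        super().__init__()
        # audio branch: e.g. 40-dim log-mel frames per time step
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        # video branch: e.g. 68 facial landmarks -> 136 coordinates per frame
        self.video_rnn = nn.GRU(video_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, len(LABELS))

    def forward(self, audio, video):
        # audio: (batch, T_audio, audio_dim), video: (batch, T_video, video_dim)
        _, h_audio = self.audio_rnn(audio)
        _, h_video = self.video_rnn(video)
        fused = torch.cat([h_audio[-1], h_video[-1]], dim=-1)  # late fusion of final states
        return self.classifier(fused)                          # logits over disfluency classes
```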
FUN-Agent: A 2020 HUMAINE Competition Entrant
- Robert Geraghty
- James Hale
- Sandip Sen
- Timothy S. Kroecker
Of late, there has been a significant surge of interest in industry and the general
populace about the future potential of human-AI collaboration [20]. Academic researchers
have been pushing the frontier of new modalities of peer-level and ad-hoc human-agent
collaboration [10, 22] for a longer period. We have been particularly interested in
research on agents representing human users in negotiating deals with other human
and autonomous agents [12, 16, 18]. Here we present the design for the conversational
aspect of our agent entry into the HUMAINE League of the 2020 Automated Negotiation
Agent Competition (ANAC). We discuss how our agent utilizes conversational and negotiation
strategies that mimic those used in human negotiations to maximize its utility as
a simulated street vendor. We leverage verbal influence tactics, offer pricing, and
increased human convenience to entice the buyer, build trust, and discourage exploitation.
Additionally, we discuss the results of some in-house testing we conducted.
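For illustration, here is a minimal sketch of a time-based concession rule of the kind a simulated street-vendor agent might combine with verbal influence tactics; the formula, parameters, and function name are hypothetical and do not describe the actual FUN-Agent policy.

```python
# Hypothetical street-vendor concession rule (not the FUN-Agent implementation):
# concede from the list price toward a reservation price as the deadline nears,
# and accept any buyer offer at or above the current target.

def vendor_counteroffer(list_price, cost, round_idx, max_rounds, buyer_offer):
    reservation = cost * 1.10                     # assumed minimum acceptable margin
    progress = min(round_idx / max_rounds, 1.0)   # 0.0 at the start, 1.0 at the deadline
    # Boulware-style concession: concede slowly at first, faster near the deadline.
    target = list_price - (list_price - reservation) * progress ** 2
    if buyer_offer >= target:
        return ("accept", buyer_offer)
    return ("counter", round(target, 2))
```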
Motivation and Design of the Conversational Components of DraftAgent for Human-Agent
Negotiation
- Dale Peasley
- Michael Naguib
- Bohan Xu
- Sandip Sen
- Timothy S. Kroecker
In sync with the significant interest in industry and the general populace about the future
potential of human-AI collaboration [14], academic researchers have been pushing the
frontier of new modalities of peer-level and ad-hoc human-agent collaboration [4, 15].
We have been particularly interested in research on agents representing human users
in negotiating deals with other human and autonomous agents [6,11,13]. We present
the design motivation and key components of the conversational aspect of our agent
entry into the Human-Agent League (HAL) (http://web.tuat.ac.jp/~katfuji/ANAC2020/cfp/ham_cfp.pdf)
of the 2020 Automated Negotiation Agent Competition (ANAC). We explore how language
can be used to promote human-agent collaboration even in the domain of a competitive
negotiation. We present small-scale in-lab testing to demonstrate the potential of
our approach.
Automatic Speech Recognition and Natural Language Understanding for Emotion Detection
in Multi-party Conversations
- Ilja Popovic
- Dubravko Culibrk
- Milan Mirkovic
- Srdjan Vukmirovic
Conversational emotion and sentiment analysis approaches rely on Natural Language
Understanding (NLU) and audio processing components to achieve the goal of detecting
emotions and sentiment based on what is being said. While there has been marked progress
in pushing the state of the art of these methods on benchmark multimodal data sets,
such as the Multimodal EmotionLines Dataset (MELD), the advances still seem to lag
behind what has been achieved in mainstream Automatic Speech Recognition (ASR) and
NLU applications, and we were unable to identify any widely used products, services,
or production-ready systems that would enable the user to reliably detect emotions
from audio recordings of multi-party conversations. Published, state-of-the-art
scientific studies of multi-view emotion recognition seem to take it for granted that
a human-generated or edited transcript is available as input to the NLU modules, providing
no information about what happens in a realistic application scenario, where only audio
is available and the NLU processing has to rely on text generated by ASR. Motivated
by this insight, we present a study designed to evaluate the possibility of applying
widely-used state-of-the-art commercial ASR products as the initial audio processing
component in an emotion-from-speech detection system. We propose an approach which
relies on commercially available products and services, such as Google Speech-to-Text,
Mozilla DeepSpeech and the NVIDIA NeMo toolkit to process the audio and applies state-of-the-art
NLU approaches for emotion recognition, in order to quickly create a robust, production-ready
emotion-from-speech detection system applicable to multi-party conversations.
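A minimal sketch of such an ASR-then-NLU pipeline, assuming Mozilla DeepSpeech for transcription and an off-the-shelf Hugging Face emotion classifier; the model files and the specific classifier named below are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Sketch of an emotion-from-speech pipeline: off-the-shelf ASR produces a
# transcript, then a text-based emotion classifier labels each utterance.
# Model paths and the classifier choice are illustrative assumptions.

import wave
import numpy as np
import deepspeech                      # Mozilla DeepSpeech (pip install deepspeech)
from transformers import pipeline      # Hugging Face Transformers

def transcribe(wav_path, model_path="deepspeech-0.9.3-models.pbmm"):
    """Run DeepSpeech on a 16 kHz mono WAV file and return the transcript."""
    model = deepspeech.Model(model_path)
    with wave.open(wav_path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return model.stt(audio)

# Any text-classification model trained for emotions could be substituted here.
emotion_clf = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base")

def emotions_from_speech(utterance_wavs):
    """Label each utterance of a conversation with the emotion predicted
    from its ASR transcript; returns (transcript, emotion) pairs."""
    results = []
    for wav in utterance_wavs:
        text = transcribe(wav)
        results.append((text, emotion_clf(text)[0]["label"]))
    return results
```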