MAHCI '19: Proceedings of the 2nd Workshop on Multimedia for Accessible Human Computer Interfaces



SESSION: Workshop Presentations

Emotion Recognition with Simulated Phosphene Vision

  • Caroline J.M. Bollen
  • Richard J.A. van Wezel
  • Marcel A. J. van Gerven
  • Yağmur Güçlütürk

Electrical stimulation of the retina, optic nerve, or cortex is known to elicit visual sensations called phosphenes. This allows visual prosthetics to partially restore vision by representing the visual field as a phosphene pattern. Since the resolution and performance of visual prostheses are limited, only a fraction of the information in a visual scene can be represented by phosphenes. Here, we propose a simple yet powerful image processing strategy for recognizing facial expressions with prosthetic vision, supporting communication and social interaction in the blind. A psychophysical study was conducted to investigate whether a landmark-based representation of facial expressions could improve emotion detection with prosthetic vision. Our approach was compared to edge detection, which is commonly used in current retinal prosthetic devices. Additionally, the relationship between the number of phosphenes and the accuracy of emotion recognition was studied. The landmark model improved the accuracy of emotion recognition regardless of the number of phosphenes. Furthermore, accuracy improved with an increasing number of phosphenes up to a saturation point, and performance saturated with fewer phosphenes under the landmark model than under edge detection. These results suggest that landmark-based image pre-processing makes more efficient use of the limited information that can be encoded in a phosphene pattern, providing a route towards a more meaningful and higher-quality perceptual experience for subjects with prosthetic vision.
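
As an illustration of the kind of landmark-based pre-processing compared in this study, the following sketch renders a set of facial-landmark coordinates as Gaussian blobs on a low-resolution phosphene grid. The grid size, phosphene spread, and example landmark positions are assumptions chosen for illustration, not values from the paper.

    import numpy as np

    def landmark_phosphenes(landmarks, grid_size=32, sigma=0.6):
        """Render normalized facial landmarks as Gaussian phosphenes on a square grid.

        landmarks : array of shape (N, 2) with x, y coordinates in [0, 1]
        sigma     : spread of each simulated phosphene, in grid cells
        """
        ys, xs = np.mgrid[0:grid_size, 0:grid_size]
        pattern = np.zeros((grid_size, grid_size), dtype=np.float32)
        for x, y in landmarks:
            cx, cy = x * (grid_size - 1), y * (grid_size - 1)
            # Each landmark lights up one Gaussian "phosphene" centred on its location.
            pattern += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        return np.clip(pattern, 0.0, 1.0)

    # Hypothetical landmarks roughly marking the eyes and mouth corners.
    demo = np.array([[0.35, 0.40], [0.65, 0.40], [0.40, 0.70], [0.60, 0.70], [0.50, 0.75]])
    print(landmark_phosphenes(demo, grid_size=16).shape)  # (16, 16) phosphene intensity map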

A Refreshable Tactile Display Effectively Supports Cognitive Mapping Followed by Orientation and Mobility Tasks: A Comparative Multi-modal Study Involving Blind and Low-vision Participants

  • Luca Brayda
  • Fabrizio Leo
  • Caterina Baccelliere
  • Claudia Vigini
  • Elena Cocchi

We investigate the role of a refreshable tactile display in supporting the learning of cognitive maps, followed by actual exploration of a real environment that matches the learned map. We test both blind and low-vision persons and compare maps displayed in three information modes: a pin array matrix, raised paper, and verbal descriptions. We find that the pin matrix leads to better externalization of a cognitive map and reduces the performance gap between blind and low-vision people. The entire evaluation is performed by the participants autonomously and suggests that refreshable tactile displays may be used to train blind persons in orientation and mobility tasks.

HaptWrap: Augmenting Non-Visual Travel via Visual-to-Tactile Mapping of Objects in Motion

  • Bryan Duarte
  • Troy McDaniel
  • Abhik Chowdhury
  • Sana Gill
  • Sethuraman Panchanathan

Access to real-time situational information at a distance, including the relative position and motion of surrounding objects, is essential for an individual to travel safely and independently. For blind and low-vision travelers, access to critical environmental information is unattainable if it lies beyond the reach of their preferred mobility aid or outside their path of travel. Due to its low cost and versatility, and the dynamic information that can be gathered through its use, the long white cane remains the most widely used mobility aid for non-visual travelers. Physical characteristics such as texture, slope, and position can be identified with the long white cane, but only when the traveler is in close proximity to an object. In this work, we introduce a wearable technology that augments non-visual travel methods by communicating spatial information at a distance. We propose a vibrotactile device, the HaptWrap, equipped with vibration motors capable of communicating an object's position relative to the user's orientation, as well as variations in that position as the object moves about the user. An experiment supports the use of haptics to represent objects in motion around an individual as a substitute modality for vision.
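
As a rough illustration of how an object's bearing could be mapped onto a ring of vibration motors, the sketch below spreads a cosine-shaped cue across the motors nearest the object's direction. The motor count, spread angle, and function name are assumptions for illustration, not the HaptWrap's actual encoding.

    import math

    def motor_intensities(bearing_deg, n_motors=8, spread_deg=60.0):
        """Return per-motor vibration intensities in [0, 1] for an object at bearing_deg.

        bearing_deg : object direction relative to the wearer's heading (0 = straight ahead)
        n_motors    : number of motors spaced evenly around the wrap
        spread_deg  : angular width over which the cue is shared between neighbouring motors
        """
        intensities = []
        for i in range(n_motors):
            motor_angle = i * 360.0 / n_motors
            # Smallest angular difference between the object's bearing and this motor's position.
            diff = abs((bearing_deg - motor_angle + 180.0) % 360.0 - 180.0)
            # Cosine falloff: full intensity at the motor facing the object, zero beyond spread_deg.
            intensity = math.cos(math.radians(diff * 90.0 / spread_deg)) if diff < spread_deg else 0.0
            intensities.append(round(intensity, 2))
        return intensities

    # An object 30 degrees to the wearer's right mostly drives the front and front-right motors.
    print(motor_intensities(30.0))  # [0.71, 0.92, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]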

Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning

  • Yuling Gui
  • Dan Guo
  • Ye Zhao

Video captioning is a challenging problem at the intersection of neural networks, computer vision, and natural language processing. It aims to translate a given video into a sequence of words that can be understood by humans. The dynamic information in videos and the complexity of language make this task difficult. This paper proposes a semantic enhanced encoder-decoder network to tackle the problem. To exploit a richer variety of video information, it implements a three-path fusion strategy on the encoder side that combines complementary features. In the decoding stage, the model adopts an attention mechanism to weigh the different contributions of the fused features. Video information is thus well captured on both the encoder and decoder sides. Furthermore, we use the idea of reinforcement learning to calculate rewards based on a semantically designed computation. Experimental results on the Microsoft Video Description Corpus (MSVD) dataset show the effectiveness of the proposed approach.
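
The sketch below shows one possible form of the decoder-side attention over fused per-frame video features: a single additive-attention step inside a GRU decoder, written in PyTorch. The layer sizes, class name, and random stand-in features are assumptions, not the paper's SEN implementation.

    import torch
    import torch.nn as nn

    class AttentionDecoderStep(nn.Module):
        def __init__(self, feat_dim=512, hid_dim=512, vocab_size=10000, emb_dim=300):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.attn_feat = nn.Linear(feat_dim, hid_dim, bias=False)
            self.attn_hid = nn.Linear(hid_dim, hid_dim, bias=False)
            self.attn_score = nn.Linear(hid_dim, 1, bias=False)
            self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, prev_word, hidden, fused_feats):
            # fused_feats: (batch, n_frames, feat_dim) from the (assumed) multi-path fusion.
            energy = self.attn_score(torch.tanh(self.attn_feat(fused_feats)
                                                + self.attn_hid(hidden).unsqueeze(1)))
            weights = torch.softmax(energy, dim=1)        # (batch, n_frames, 1) attention weights
            context = (weights * fused_feats).sum(dim=1)  # attended video context vector
            hidden = self.gru(torch.cat([self.embed(prev_word), context], dim=-1), hidden)
            return self.out(hidden), hidden               # next-word logits, updated state

    # Usage with random tensors standing in for fused video features.
    step = AttentionDecoderStep()
    logits, h = step(torch.zeros(2, dtype=torch.long), torch.zeros(2, 512), torch.randn(2, 20, 512))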

Continuous Sign Language Recognition Based on Pseudo-supervised Learning

  • Xiankun Pei
  • Dan Guo
  • Ye Zhao

Continuous sign language recognition is challenging because the ordered words have no exact temporal locations in the video. To address this problem, we propose a method based on pseudo-supervised learning. First, we use a 3D residual convolutional network (3D-ResNet) pre-trained on the UCF101 dataset to extract visual features. Second, we employ a sequence model with connectionist temporal classification (CTC) loss to learn the mapping between the visual features and sentence-level labels; this model can then be used to generate clip-level pseudo-labels. Since the CTC objective function has only a limited effect on the visual features extracted by the initial 3D-ResNet, we fine-tune the 3D-ResNet on the video clips and their clip-level pseudo-labels to obtain better feature representations. The feature extractor and the sequence model are optimized alternately with the CTC loss. The effectiveness of the proposed method is verified on the large RWTH-PHOENIX-Weather-2014 dataset.
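
The sketch below illustrates the sequence-model half of such a pipeline: a bidirectional LSTM over pre-extracted clip features trained with PyTorch's CTC loss, whose per-clip predictions could then serve as pseudo-labels for fine-tuning the feature extractor. Feature dimensions, vocabulary size, and tensor shapes are assumptions, not values from the paper.

    import torch
    import torch.nn as nn

    class CTCSequenceModel(nn.Module):
        def __init__(self, feat_dim=512, hid_dim=256, vocab_size=1232):  # vocab size is an assumption
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hid_dim, bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * hid_dim, vocab_size + 1)     # +1 for the CTC blank

        def forward(self, clip_feats):
            out, _ = self.lstm(clip_feats)               # (batch, n_clips, 2*hid_dim)
            return self.classifier(out).log_softmax(-1)  # per-clip gloss log-probabilities

    model = CTCSequenceModel()
    ctc = nn.CTCLoss(blank=1232)          # blank index matches the extra class above
    feats = torch.randn(2, 40, 512)       # 2 videos, 40 clips each, 512-dim clip features
    log_probs = model(feats).permute(1, 0, 2)            # CTC expects (time, batch, classes)
    targets = torch.randint(0, 1232, (2, 7))             # sentence-level gloss labels
    loss = ctc(log_probs, targets,
               input_lengths=torch.full((2,), 40, dtype=torch.long),
               target_lengths=torch.full((2,), 7, dtype=torch.long))
    loss.backward()
    # The per-clip argmax of log_probs can serve as clip-level pseudo-labels for
    # fine-tuning the feature extractor, alternating with this CTC training step.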

Gaze Detection and Prediction Using Data from Infrared Cameras

  • Yingxuan Zhu
  • Wenyou Sun
  • Tim Tingqiu Yuan
  • Jian Li

Knowing the point of gaze on a screen can benefit a variety of applications and improve user experience. Some electronic devices with infrared cameras can generate 3D point clouds for user identification. We propose a paradigm that uses 3D point clouds and eye images for gaze detection and prediction. Our method fuses the 3D point cloud with eye images using image registration methods. We develop a cost function to detect the sagittal plane from the point cloud data and reconstruct a symmetric face from that plane. The symmetric face data increase the accuracy of gaze detection. We use long short-term memory (LSTM) models to track head and eye movement and to predict the next point of gaze. Our method utilizes the existing hardware setup and provides options to improve the user experience.
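
One plausible form of a symmetry-based cost for sagittal-plane detection is sketched below: reflect the face point cloud across a candidate plane and measure how closely the reflected points match the original cloud, using SciPy's k-d tree for nearest-neighbour queries. This is an assumption about the general idea, not the paper's actual cost function.

    import numpy as np
    from scipy.spatial import cKDTree

    def sagittal_plane_cost(points, normal, offset):
        """Mean distance between reflected points and their nearest original neighbours.

        points : (N, 3) face point cloud
        normal : (3,) normal vector of the candidate plane
        offset : plane offset d in the equation n . x = d
        """
        normal = normal / np.linalg.norm(normal)
        signed = points @ normal - offset                  # signed distance of each point to the plane
        reflected = points - 2.0 * signed[:, None] * normal
        tree = cKDTree(points)
        dists, _ = tree.query(reflected)
        return dists.mean()

    # A perfectly symmetric toy cloud scores 0 for the plane x = 0; lower cost means a more
    # plausible sagittal plane, and the reflected cloud yields a symmetric face reconstruction.
    cloud = np.array([[0.1, 0.0, 0.5], [-0.1, 0.0, 0.5], [0.2, 0.3, 0.4], [-0.2, 0.3, 0.4]])
    print(sagittal_plane_cost(cloud, np.array([1.0, 0.0, 0.0]), 0.0))  # 0.0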