MADiMa '19: Proceedings of the 5th International Workshop on Multimedia Assisted Dietary Management

SESSION: Invited Talk 01

Personalized Nutrition in the Connected Era: Challenges and Opportunities

  •      Frederic Ronga

Today's consumers live in a connected world, with trends in self-quantification and personalization driving preferences and habits. This talk will discuss the mission of Nestlé Research's Digital Nutrition & Health team, as well as its challenges and approaches in feeding into holistic digital platforms that contribute to Nestlé's purpose of enhancing quality of life and contributing to a healthier future, with a special focus on dietary intake capture, assessment, and recommendations.

SESSION: Oral Paper Session 1

Deep Cooking: Predicting Relative Food Ingredient Amounts from Images

  •      Jiatong Li
  • Ricardo Guerrero
  • Vladimir Pavlovic

In this paper, we study the novel problem of not only predicting the ingredients in a food image, but also predicting the relative amounts of the detected ingredients. We propose two deep-learning prediction models that output sparse and dense predictions, coupled with an important semi-automatic, multi-database integrative data pre-processing step, to solve the problem. Experiments on a dataset of recipes collected from the Internet show that the models produce encouraging results.

Impact of Mixed Reality Food Labels on Product Selection: Insights from a User Study using Headset-mediated Food Labels at a Vending Machine

  •      Klaus Fuchs
  • Tobias Grundmann
  • Mirella Haldimann
  • Elgar Fleisch

The rise in diet-related non-communicable diseases suggests that consumers find it difficult to make healthy food-related purchases. This situation is most pertinent in fast-paced retail environments, where customers are confronted with sugar-rich or savory food items. Counter-measures such as front-of-package labelling are not yet mandated in most regions, and barcode-scanning mobile applications are impractical when purchasing groceries. We thus applied a mixed reality (MR) wearable headset-mediated intervention (N = 61) at vending machines to explore the potential of passively activated, pervasive MR food labels to affect beverage purchasing choices. Through a between-subjects randomized controlled trial, we find significant, strong improvements in the nutritional quality of the selected products (energy: -34% kJ/100 ml; sugar: -28% g/100 ml). Our post-hoc analysis suggests that the intervention is especially effective for consumers with existing food literacy. This study motivates further research on MR food labels, given the promising intervention effects observed.

SESSION: Invited Talk 02

Eat, Drink, and be Happy!

  •      Ramesh Jain

Food is essential for human life and fundamental to the human experience. Selecting the right food, at the right place, in the right situation is very important, but a challenging problem. Food recommendation must consider a personal food model, analyze unique food characteristics, and incorporate various contexts and food-related domain knowledge. Each of these is a challenge for multimedia computing. In this presentation, we will discuss technical challenges and opportunities in addressing food recommendation to make people happy and healthy. Recent progress in multimodal computing makes it possible to start addressing this important problem at a personal level. We will discuss our ideas alongside some concrete emerging approaches for building multimodal personal food recommendation systems.

SESSION: Invited Talk 03

Learning User Preferences from Social Multimedia Analysis and Overview of the iFood 2019 Challenge

  •      Karan Sikka

There has been an explosion of multimodal content generated on social media networks in the last few years, which has necessitated a deeper understanding of social media content and user behavior. In the first part of the talk, I will present a novel method that provides a unified framework for understanding content as well as modeling user preferences from noisy social media posts. I will discuss some applications of this algorithm in understanding food preferences and trends. I will then give an overview of the second large-scale food image classification challenge (the iFood challenge), held as part of the sixth Fine-Grained Visual Categorization Workshop at CVPR 2019. We introduce a new dataset of 251 fine-grained (prepared) food categories, with 118K training images collected from the web and human-verified labels for both the validation set (11K images) and the test set (12K images). 40 teams from academia and industry competed in this challenge, with the top team obtaining a top-3 error rate of 5.6%, almost 2 percentage points better than the previous year's challenge.

SESSION: Oral Paper Session 2

Unseen Food Creation by Mixing Existing Food Images with Conditional StyleGAN

  •      Daichi Horita
  • Wataru Shimoda
  • Keiji Yanai

In recent years, thanks to the development of generative adversarial networks (GANs), it has become possible to generate food images. However, the quality is still low, and it is difficult to generate appetizing, delicious-looking images. In a recent GAN study, StyleGAN enabled high-level feature separation and stochastic variation of generated images through unsupervised learning. However, to manipulate an arbitrary style, it is necessary to understand the representation of the latent space and to input reference images. In this paper, we propose a conditional version of StyleGAN that controls the stochastic variation of disentangled features. The conditional style-based generator can manipulate the style of any domain given a conditioning vector. By applying conditional StyleGAN to the food image domain, we successfully generated higher-quality food images than before. In addition, introducing conditioning vectors enabled us to control the food categories of generated images and to generate unseen foods by mixing multiple kinds of foods. In our experiments, to demonstrate the results of the proposed method explicitly, we construct the Food13 dataset and evaluate it both qualitatively and quantitatively.

Self-Attention and Ingredient-Attention Based Model for Recipe Retrieval from Image Queries

  •      Matthias Fontanellaz
  • Stergios Christodoulidis
  • Stavroula Mougiakakou

Direct computer-vision-based nutrient content estimation is a demanding task, due to the deformation and occlusion of ingredients, as well as high intra-class and low inter-class variability between meal classes. To tackle these issues, we propose a system for recipe retrieval from images. The retrieved recipe information can subsequently be used to estimate the nutrient content of the meal. In this study, we utilize the multi-modal Recipe1M dataset, which contains over 1 million recipes accompanied by over 13 million images. The proposed model can operate as the first step in an automatic pipeline for the estimation of nutrient content by providing hints related to ingredients and instructions. Through self-attention, our model can directly process raw recipe text, making the upstream instruction sentence-embedding process redundant and thus reducing training time, while providing desirable retrieval results. Furthermore, we propose an ingredient-attention mechanism to gain insight into which instructions, parts of instructions, or single instruction words are important for processing a single ingredient within a certain recipe. Attention-based recipe text encoding helps address the high intra-class/low inter-class variability by focusing on preparation steps specific to the meal. The experimental results demonstrate the potential of such a system for recipe retrieval from images. A comparison with two baseline methods is also presented.

SESSION: Invited Talk 04

FoodLog: Multimedia Food Recording Platform and its Application

  •      Kiyoharu Aizawa

Most existing food recording tools rely on text-based recording, which makes it difficult for users to record continuously and is an unintuitive approach to understanding food content. Photo-based recording is advantageous because it provides image-based input assistance and a quick understanding of food intake. We have therefore constructed a multimedia food recording platform called FoodLog that helps users record their food intake via photos through personalized image recognition. The number of food records uploaded by general public users exceeded 10 million as of July 2019. We discuss the technical framework and present various analyses of this big data of food records from general public users. We demonstrate that food intake trends correlate strongly with information trends on TV and social media. Based on the FoodLog platform, we collaborated with a university hospital and developed a new tool, FoodLog Athlete, that assists nutrition management and communication for sports dietitians and athletes. Thus, FoodLog is beneficial not only for healthcare purposes but also for other applications, such as nutrition management for athletes.

SESSION: Oral Paper Session 3

Convolutional Neural Networks for Food Image Recognition: An Experimental Study

  •      Yi Sen Ng
  • Wanqi Xue
  • Wei Wang
  • Panpan Qi

The global rising trend of consumer health awareness has increasingly drawn attention to the development of nutrition tracking applications. A key component in enabling large-scale, low-cost, and personalized nutrition analysis is a food image recognition engine based on a convolutional neural network (CNN). While there is strong demand for many variants of such applications to serve different localized needs around the world, it remains an ongoing challenge to piece together a complete solution involving (i) dataset preparation, (ii) choice of CNN architecture, and (iii) CNN optimization strategies in the context of food classification. In this work, we address this problem by conducting a series of experimental studies to help engineers quickly prepare datasets and deploy and optimize CNN-based solutions for food image recognition tasks. Based on our results, we recommend that (i) larger networks such as Xception should be used; (ii) a minimum of 300 images per category should be available to kick-start training for optimal performance; (iii) image augmentation techniques that do not alter shapes are more beneficial, and various augmentation techniques can be combined for better results; (iv) a dataset balancing strategy may not be required to address class imbalance in food datasets below an imbalance ratio of 7; and (v) higher native image resolution during training benefits classifier performance, especially for networks requiring larger input sizes.
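Recommendation (iv) refers to the class imbalance ratio, i.e. the size of the largest class divided by that of the smallest. A minimal sketch of this check (the function name and example counts are illustrative, not from the paper):

```python
def imbalance_ratio(class_counts):
    """Largest class size divided by smallest class size.

    Per the study's recommendation (iv), food datasets with a ratio
    below about 7 may not need an explicit balancing strategy.
    """
    return max(class_counts) / min(class_counts)

# e.g. a 3-class food dataset with 700, 350, and 100 images per class
ratio = imbalance_ratio([700, 350, 100])  # 7.0, at the study's threshold
```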

Categorization of Cooking Actions Based on Textual/Visual Similarity

  •      Yixin Zhang
  • Yoko Yamakata
  • Keishi Tajima

In this paper, we propose a method for automatically categorizing the cooking actions that appear in recipe data. We extract verbs from textual descriptions of cooking procedures in recipe data and vectorize them using word embeddings. These vectors provide a way to compute contextual similarity between verbs. We also extract the images associated with each step of the procedures and vectorize them using a standard feature extraction method. For each verb, we collect the images associated with the steps whose descriptions include the verb and calculate the average of their vectors. These vectors provide a way to compute visual similarity between verbs. However, one type of action is sometimes represented by several types of images in recipe data. In such cases, the average of the associated image vectors is not an appropriate representation of the action. To mitigate this problem, we propose yet another way to vectorize verbs. We first cluster all the images in the recipe data into 20 clusters. For each verb, we calculate the ratio of each cluster within the set of images associated with the verb, creating a 20-dimensional vector representing the distribution over the 20 clusters. We calculate the similarity of verbs using these three kinds of vector representations. We conducted a preliminary experiment comparing the three approaches, and the results show that each of them is useful for categorizing cooking actions.
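The third representation described above can be sketched as follows; the function names and the toy cluster assignments are illustrative assumptions, not taken from the paper:

```python
from collections import Counter
import math

def cluster_distribution(cluster_ids, num_clusters=20):
    """Ratio of each image cluster among the images associated with one verb."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return [counts.get(c, 0) / total for c in range(num_clusters)]

def cosine_similarity(u, v):
    """Similarity between two verb vectors (contextual, visual, or distributional)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# toy example with 4 clusters instead of 20
chop = cluster_distribution([0, 0, 1, 2], num_clusters=4)    # [0.5, 0.25, 0.25, 0.0]
slice_ = cluster_distribution([0, 1, 1, 2], num_clusters=4)
similarity = cosine_similarity(chop, slice_)
```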

Scraping Social Media Photos Posted in Kenya and Elsewhere to Detect and Analyze Food Types

  •      Mona Jalal
  • Kaihong Wang
  • Sankara Jefferson
  • Yi Zheng
  • Elaine O. Nsoesie
  • Margrit Betke

Monitoring population-level changes in diet could be useful for education and for implementing interventions to improve health. Research has shown that data from social media sources can be used for monitoring dietary behavior. We propose a scrape-by-location methodology for creating food image datasets from Instagram posts, and used it to collect 3.56 million images over a period of 20 days in March 2019. We also propose a scrape-by-keywords methodology and used it to scrape ~30,000 images and their captions covering 38 Kenyan food types. We publish two datasets of 104,000 and 8,174 image/caption pairs, respectively. With the first dataset, Kenya104K, we train a Kenyan food classifier, KenyanFC, to distinguish Kenyan food from non-food images posted in Kenya. We used the second dataset, KenyanFood13, to train a classifier, KenyanFTR (short for Kenyan Food Type Recognizer), to recognize 13 popular food types in Kenya. KenyanFTR is a multimodal deep neural network that identifies 13 types of Kenyan foods using both images and their corresponding captions. Experiments show that the average top-1 accuracy of KenyanFC is 99% over 10,400 tested Instagram images, and that of KenyanFTR is 81% over 8,174 tested data points. Ablation studies show that three of the 13 food types are particularly difficult to categorize based on image content alone, and that adding caption analysis to the image analysis yields a classifier that is 9 percentage points more accurate than one relying on images only. Our food trend analysis revealed that cakes and roasted meats were the most popular foods in photographs on Instagram in Kenya in March 2019.

Flavour Enhanced Food Recommendation

  •      Nitish Nag
  • Aditya Narendra Rao
  • Akash Kulhalli
  • Kushal Samir Mehta
  • Nishant Bhattacharya
  • Pratul Ramkumar
  • Aditya Bharadwaj
  • Dinkar Sitaram
  • Ramesh Jain

We propose a mechanism that uses flavour features to enhance the quality of food recommendations. An empirical method for determining the flavour of food, based on the major gustatory nerves, is incorporated into a recommendation engine. Such a system has the advantage of suggesting food items that the user is more likely to enjoy, by matching them against the user's flavour profile using biological domain knowledge of taste. This preliminary work is intended to spark more robust mechanisms by which the flavour of food is incorporated as a major feature set in food recommendation systems. Our long-term vision is to integrate this with health factors to recommend healthy and tasty food to users and enhance their quality of life.

SESSION: Poster and Demo Session

Towards Tailoring Digital Food Labels: Insights of a Smart-RCT on User-specific Interpretation of Food Composition Data

  •      Klaus Fuchs
  • Timothée Barattin
  • Mirella Haldimann
  • Alexander Ilic

Front-of-pack nutrition labelling (FoPL) supports healthier food choices yet remains unmandated in most countries. At the same time, such labels are criticized for giving standardized recommendations that overlook individual needs. To assess the potential of consumer-specific tailored labels, we developed and tested a tailoring logic for adapting labels to individual dietary requirements, together with a smartphone app that provides tailored food labels after scanning a product's barcode. The tailoring logic was developed with dieticians, accounting for gender, age, activity, preferences, and diet-related diseases. The label showed a combination of established labelling systems: Nutri-Score and Multiple Traffic Light. The application followed a smart-RCT design, randomly assigning users either tailored or standardized labels. 33 users met the eligibility criteria for our exploratory study. We found promising evidence that tailored digital food labels are perceived as more helpful, relevant, and recommendable than current static food labels, especially in the absence of FoPL.

DepthCalorieCam: A Mobile Application for Volume-Based Food Calorie Estimation using Depth Cameras

  •      Yoshikazu Ando
  • Takumi Ege
  • Jaehyeong Cho
  • Keiji Yanai

Some recent smartphones, such as the iPhone XS, have a pair of rear cameras that can be used as a stereo camera. On iPhones running iOS 11 or later, the official API provides a function to estimate depth information from the two rear cameras in real time. Taking advantage of this function, we have developed an iOS app, "DepthCalorieCam", which estimates food calories based on food volumes. The app takes an RGB-D image of a dish, estimates the categories and volumes of the foods on it, and calculates their calories using the pre-registered calorie density of each food category. By using depth information, we have achieved very accurate calorie estimation: the error of the estimated calories was greatly reduced compared with existing size-based systems.
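The volume-to-calorie step described above can be sketched as follows; the per-pixel height integration and the density value are our illustrative assumptions, not the app's actual implementation:

```python
def food_volume_cm3(plate_depth, food_depth, pixel_area_cm2):
    """Integrate height above the plate over the segmented food pixels.

    plate_depth / food_depth: per-pixel distances (cm) from the camera,
    so food rising above the plate has a smaller depth value.
    """
    heights = [max(p - f, 0.0) for p, f in zip(plate_depth, food_depth)]
    return sum(heights) * pixel_area_cm2

def estimate_calories(volume_cm3, kcal_per_cm3):
    """Calories = estimated volume times the per-category calorie density."""
    return volume_cm3 * kcal_per_cm3

# 3 food pixels, plate 30 cm away; hypothetical calorie density (not from the paper)
volume = food_volume_cm3([30.0, 30.0, 30.0], [28.0, 27.0, 30.0], pixel_area_cm2=0.1)
kcal = estimate_calories(volume, kcal_per_cm3=1.2)
```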

A New Large-scale Food Image Segmentation Dataset and Its Application to Food Calorie Estimation Based on Grains of Rice

  •      Takumi Ege
  • Wataru Shimoda
  • Keiji Yanai

To estimate food calories accurately from food images, accurate food image segmentation is needed. So far, no large-scale food image segmentation dataset with pixel-wise labels exists. In this paper, we semi-automatically added segmentation masks to the food images in the existing UEC-Food100 dataset. To estimate segmentation masks, we revised the bounding boxes included in UEC-Food100 so that they bound not dish regions but only food regions. By applying GrabCut, we then obtained good segmentation masks, and we manually checked and, where needed, corrected 1,000 of them for use as testing masks. We trained segmentation networks with the newly created food image masks, and segmentation accuracy was much improved, which is expected to yield more accurate food calorie estimation. In addition, we propose a new method for food calorie estimation that uses the grains of steamed rice typically contained in Japanese meals instead of a reference card. Our experiments show that real food size can be estimated from rice images, which helps accurate food calorie estimation.
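The rice-grain idea replaces a reference card of known size with objects of roughly known size already in the scene. A minimal sketch of the scale recovery, assuming an average cooked-grain length and hypothetical function names (neither is from the paper):

```python
ASSUMED_GRAIN_LENGTH_CM = 0.5  # typical cooked rice grain length; an assumption, not a value from the paper

def cm_per_pixel(grain_pixel_lengths, grain_length_cm=ASSUMED_GRAIN_LENGTH_CM):
    """Derive the image scale from detected rice grains, replacing a reference card."""
    mean_px = sum(grain_pixel_lengths) / len(grain_pixel_lengths)
    return grain_length_cm / mean_px

def real_area_cm2(region_pixel_count, scale_cm_per_px):
    """Convert a segmented food region's pixel count to real-world area."""
    return region_pixel_count * scale_cm_per_px ** 2

# grains measured as 9, 10, and 11 pixels long -> 0.05 cm per pixel
scale = cm_per_pixel([9, 10, 11])
area = real_area_cm2(4000, scale)  # 10.0 cm^2 of food region
```

The recovered real-world area (or volume, with depth) can then feed the same calorie-density lookup used by size-based estimation systems.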