MADiMa '23: Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management

Full Citation in the ACM Digital Library

SESSION: Oral Paper 1 Session

Estimating Amount of Food in a Circular Dining Bowl from a Single Image

  • Wenyan Jia
  • Boyang Li
  • Yaguang Zheng
  • Zhi-Hong Mao
  • Mingui Sun

Unhealthy diet is a top risk factor causing obesity and numerous chronic diseases. To help the public adopt a healthy diet, nutrition scientists need user-friendly tools to conduct Dietary Assessment (DA). In recent years, new DA tools have been developed using a smartphone or a wearable device which acquires images during a meal. These images are then processed to estimate calories and nutrients of the consumed food. Although considerable progress has been made, 2D food images lack scale reference and 3D volumetric information. In addition, food must be sufficiently observable from the image. This basic condition can be met when the food is stand-alone (no food container is used) or it is contained in a shallow plate. However, the condition cannot be met easily when a bowl is used. The food is often occluded by the bowl edge, and the shape of the bowl may not be fully determined from the image. Yet bowls are the most widely used food containers for billions of people in many parts of the world, especially in Asia and Africa. In this work, we propose to premeasure plates and bowls using a marked adhesive strip before a dietary study starts. This simple procedure eliminates the use of a scale reference throughout the DA study. In addition, we use mathematical models and image processing to reconstruct the bowl in 3D. Our key idea is to estimate how full the bowl is rather than how much food (in either volume or weight) the bowl contains. This idea reduces the effect of occlusion. Experimental data show satisfactory results for our methods, which enable accurate DA studies using both plates and bowls with reduced burden on research participants.
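
As a rough illustration of why fullness is a convenient quantity for a premeasured bowl, the sketch below converts a fill level into a volume. It assumes the bowl interior is a truncated cone (flat circular bottom, straight sloping wall). That shape is a simplifying assumption made here, not the bowl model used by the authors, and the function names and example dimensions are hypothetical.

```python
import math


def frustum_volume(r_bottom, r_top, depth, fill_height):
    """Volume of a truncated cone filled from the bottom up to fill_height.

    All lengths share one unit (e.g. cm), so the result is in that unit cubed.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= fill_height <= depth:
        raise ValueError("fill_height must lie between 0 and depth")
    # Radius of the food surface, interpolated linearly along the wall.
    r_fill = r_bottom + (r_top - r_bottom) * fill_height / depth
    return math.pi * fill_height / 3.0 * (
        r_bottom ** 2 + r_bottom * r_fill + r_fill ** 2
    )


def volume_fraction(r_bottom, r_top, depth, height_fraction):
    """Fraction of the bowl's capacity used at a given fraction of its depth."""
    fill_height = height_fraction * depth
    filled = frustum_volume(r_bottom, r_top, depth, fill_height)
    capacity = frustum_volume(r_bottom, r_top, depth, depth)
    return filled / capacity


if __name__ == "__main__":
    # Hypothetical bowl: 4 cm bottom radius, 8 cm rim radius, 6 cm deep.
    capacity = frustum_volume(4.0, 8.0, 6.0, 6.0)
    fraction = volume_fraction(4.0, 8.0, 6.0, 0.5)
    print(f"capacity: {capacity:.1f} cm^3, half-height fill: {fraction:.1%}")
```

Because the wall widens toward the rim, a bowl filled to half its depth holds less than half its capacity, so the height-to-volume conversion has to use the measured bowl geometry.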

NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches

  • Chi-en Amy Tai
  • Matthew Keller
  • Saeejith Nair
  • Yuhao Chen
  • Yifan Wu
  • Olivia Markham
  • Krish Parmar
  • Pengcheng Xi
  • Heather Keller
  • Sharon Kirkpatrick
  • Alexander Wong

Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life. However, self-reporting methods such as food diaries suffer from substantial bias. Other conventional dietary assessment techniques and emerging alternative approaches such as mobile applications incur high time costs and may necessitate trained personnel. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities and food annotations hinders the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). Additionally, we collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism. Leveraging these novel datasets, we develop and benchmark NutritionVerse, an empirical study of various dietary intake estimation approaches, including indirect segmentation-based and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to provide insights into the fusion of synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth, NutritionVerse-Real) and the collection of models (NutritionVerse) as part of an open initiative to accelerate machine learning for dietary sensing.

A Comparative Analysis of Sensor-, Geometry-, and Neural-Based Methods for Food Volume Estimation

  • Lubnaa Abdur Rahman
  • Ioannis Papathanail
  • Lorenzo Brigato
  • Stavroula Mougiakakou

With the rapid advancements in artificial intelligence and computer vision within health and nutrition fields, image-based automatic dietary assessment is gaining popularity. This automation involves food segmentation, recognition, volume estimation, and estimation of nutritional content. While considerable progress has been made in food segmentation and recognition, accurate volume estimation remains challenging. Measuring food volume is crucial in many fields, even though it is difficult to automate precisely. This difficulty hampers progress and leads to continued reliance on time-consuming traditional methods, such as manual computation of food volume through water displacement. The manuscript presents a comparative analysis of sensor-, geometry-, and neural-based methods for computing food volume. We have performed multiple experiments using 20 meal images captured under different settings, with reliable measurements of ground-truth volume obtained by capturing 360-degree views of the food items and computing their volumes in a 3D space. An extensive analysis of our results then serves to identify the strengths and limitations of each approach, and offers valuable insights for selecting the most suitable method in specific settings. Moreover, we have made the collected data (including RGB images, ground-truth point clouds, volumes, etc.) open-source. We intend this as a contribution to the research community and to address the scarcity of food datasets with depth-related information.

SESSION: Invited Talk 2

Computer Vision Assisted Dietary Assessment Through Mobile Phones in Female Youth 18-24y in Urban Ghana: Validity Against Weighed Records and Comparison with 24-Hour Recall

  • Aulo Geli

The objective was to validate FRANI (Food Recognition Assistance and Nudging Insights), a mobile phone application with computer vision assisted dietary assessment, and multi-pass 24-hour recalls (24HR), against weighed records (WR) in female youth aged 18-24y in Ghana. Dietary intake was assessed on two non-consecutive days using FRANI, WRs and 24HRs. Equivalence of nutrient intake was tested using mixed effect models adjusted for repeated measures, by comparing ratios (FRANI/WR and 24HR/WR) with equivalence margins at 10%, 15% and 20% error bounds. Agreement between methods was assessed using the concordance correlation coefficient (CCC). Equivalence for FRANI and WR was determined at the 10% bound for fibre and folate, and at the 15% bound for energy, protein, fat, iron, riboflavin, thiamine, vitamin B6, and zinc, while intake of calcium was equivalent at the 20% bound. Comparisons between 24HR and WR found protein and riboflavin intake estimates falling within a 10% bound. Iron, niacin, thiamine, and zinc intakes were equivalent at the 15% bound; and folate and vitamin B6 were equivalent at the 20% bound. The CCCs between FRANI and WR ranged between 0.44 and 0.72 (mean=0.57), and between 0.45 and 0.76 (mean=0.62) for 24HR and WR. FRANI-assisted dietary assessment and 24HRs were found to accurately estimate nutrient intake in female youth in urban Ghana.
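
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Lin's concordance correlation coefficient for two paired series. It uses the standard definition with population (1/n) moments and ignores the repeated-measures structure of the study, so it is not necessarily the exact variant the authors computed. The function name and the sample intake values are hypothetical.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for paired measurements.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2),
    with variances and covariance computed using 1/n.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both series are the same constant")
    return 2 * cov_xy / denominator


if __name__ == "__main__":
    # Hypothetical energy intakes (kcal) from a test method and a reference.
    test_method = [1800.0, 2100.0, 1650.0, 2400.0]
    reference = [1750.0, 2200.0, 1700.0, 2300.0]
    print(round(concordance_correlation(test_method, reference), 3))
```

Unlike Pearson correlation, the CCC penalizes a systematic offset between the two methods: two series that differ by a constant are perfectly correlated but have a CCC below 1.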

SESSION: Oral Paper 2 Session

Memory-Efficient High-Accuracy Food Intake Activity Recognition with 3D mmWave Radars

  • Hsin-Che Chiang
  • Yi-Hung Wu
  • Shervin Shirmohammadi
  • Cheng-Hsin Hsu

Non-invasive and privacy-preserving recognition of food intake activities has applications in diet management, telecare, and smart-home data monetization. In this paper, we tackle the challenging problem of recognizing food intake activity using sparse point clouds captured by a single mmWave radar, which is non-invasive and privacy-preserving. We propose: (i) an enhanced Skeletal Pose Estimator (SPE) capable of generating more precise skeletons for food intake activity recognition, outperforming the state-of-the-art MARS with an average reduction of 45.16% in the error of the estimated joints, (ii) a Dynamic Point Cloud Recognizer (DPR), which adapts SPE to directly process dynamic point clouds for food intake activity recognition, outperforming the state-of-the-art FIA with a 4.10% enhancement in classification accuracy and a 78.29% reduction in memory consumption, and (iii) a Lightweight Dynamic Point Cloud Recognizer (LDPR), which eliminates the need for CNNs, hence reducing model complexity and outperforming DPR with a 0.15% enhancement in classification accuracy and a 42.80% reduction in memory consumption. In addition to food intake activity recognition, the skeletons generated by our SPE can also be used to recognize other fine- and coarse-grained activities for applications like rehabilitation, driver monitoring, and fitness assistance.

Dining on Details: LLM-Guided Expert Networks for Fine-Grained Food Recognition

  • Jesús M. Rodríguez-de-Vera
  • Pablo Villacorta
  • Imanol G. Estepa
  • Marc Bolaños
  • Ignacio Sarasúa
  • Bhalaji Nagarajan
  • Petia Radeva

In the field of fine-grained food recognition, subset learning-based methods offer a strategic approach that groups classes into subsets to guide the training process. Our study introduces Dining on Details (DoD), a novel expert learning framework for food classification. This method harnesses large language models to construct subsets of classes within the dataset. DoD's efficacy is rooted in the robustness of the ImageBind multi-modality embedding space, which can identify meaningful similarities across varied categories. Trained through an end-to-end multi-task learning process, this method enhances performance in the fine-grained food recognition task, particularly on highly similar classes. A key advantage of DoD is its compatibility: it can be applied to any existing classification architecture. Our comprehensive validation of this method on various food datasets and backbones, both convolutional and transformer-based, reveals competitive results with performance gains ranging from 0.5% to 1.61%. Notably, it achieves state-of-the-art results on the Food-101 dataset.

SESSION: Poster Session

An Improved Encoder-Decoder Framework for Food Energy Estimation

  • Jack Ma
  • Jiangpeng He
  • Fengqing Zhu

Dietary assessment is essential to maintaining a healthy lifestyle. Automatic image-based dietary assessment is a growing field of research due to the increasing prevalence of image capturing devices (e.g., mobile phones). In this work, we estimate food energy from a single monocular image, a difficult task because the energy information present in an image is limited and hard to extract. To do so, we employ an improved encoder-decoder framework for energy estimation; the encoder transforms the image into a representation embedded with food energy information in an easier-to-extract format, from which the decoder then extracts the energy information. To implement our method, we compile a high-quality food image dataset verified by registered dietitians containing eating scene images, food-item segmentation masks, and ground truth calorie values. Our method improves upon previous caloric estimation methods by over 10% and 30 kCal in terms of MAPE and MAE respectively.
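
The two error metrics quoted above can be sketched as follows. This is a generic illustration of mean absolute error (MAE) and mean absolute percentage error (MAPE) under their usual definitions, not the authors' evaluation code. The function names and the calorie values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the same unit as the inputs (e.g. kcal)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the true value, as a percentage."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical ground-truth and predicted energy values in kcal.
    truth = [350.0, 520.0, 610.0]
    predicted = [330.0, 560.0, 580.0]
    print("MAE:", mean_absolute_error(truth, predicted))
    print("MAPE:", mean_absolute_percentage_error(truth, predicted))
```

MAE reports the error in kilocalories, while MAPE normalizes by the true value, so a given absolute error counts for more on low-calorie meals.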

Diffusion Model with Clustering-based Conditioning for Food Image Generation

  • Yue Han
  • Jiangpeng He
  • Mridul Gupta
  • Edward J. Delp
  • Fengqing Zhu

Image-based dietary assessment serves as an efficient and accurate solution for recording and analyzing nutrition intake using eating occasion images as input. Deep learning-based techniques are commonly used to perform image analysis such as food classification, segmentation, and portion size estimation, which rely on large amounts of food images with annotations for training. However, such data dependency poses significant barriers to real-world applications, because acquiring a substantial, diverse, and balanced set of food images can be challenging. One potential solution is to use synthetic food images for data augmentation. Although existing work has explored the use of generative adversarial network (GAN)-based structures for generation, the quality of synthetic food images remains subpar. In addition, while diffusion-based generative models have shown promising results for general image generation tasks, the generation of food images can be challenging due to the substantial intra-class variance. In this paper, we investigate the generation of synthetic food images based on the conditional diffusion model and propose an effective clustering-based training framework, named ClusDiff, for generating high-quality and representative food images. The proposed method is evaluated on the Food-101 dataset and shows improved performance when compared with existing image generation works. We also demonstrate that the synthetic food images generated by ClusDiff can help address the severe class imbalance issue in long-tailed food classification using the VFN-LT dataset.

HowToEat: Exploring Human Object Interaction and Eating Action in Eating Scenarios

  • Yingcheng Wang
  • Junwen Chen
  • Keiji Yanai

Recently, multimedia analysis of eating and diet has become a new trend in research. Detecting eating activities in videos and images is a basic requirement for further analysis. However, existing human-centric action detection tasks, such as human-object interaction detection and hand-object interaction detection, lack data on eating scenarios and annotations of eating actions. To fill this gap in research, we introduce a new large-scale dataset, HowToEat, which contains 66 days of videos in 12 eating scenarios, and 95k images with automatic annotations of hand-object interactions and eating actions. Based on the dataset, we propose an eating analysis system, which uses a single model to detect hand-object interaction and eating action at the same time.

Multi-Stage Hierarchical Food Classification

  • Xinyue Pan
  • Jiangpeng He
  • Fengqing Zhu

Food image classification serves as a fundamental and critical step in image-based dietary assessment, facilitating nutrient intake analysis from captured food images. However, existing work in food classification predominantly focuses on predicting 'food types', which do not contain direct nutritional composition information. This limitation arises from the inherent discrepancies in nutrition databases, which are tasked with associating each 'food item' with its respective information. Therefore, in this work we aim to classify food items to align with a nutrition database. To this end, we first introduce the VFN-nutrient dataset by annotating each food image in VFN with a food item that includes nutritional composition information. Such annotation of food items, being more discriminative than food types, creates a hierarchical structure within the dataset. However, since the food item annotations are solely based on nutritional composition information, they do not always show visual relations with each other, which poses significant challenges when applying deep learning-based techniques for classification. To address this issue, we then propose a multi-stage hierarchical framework for food item classification by iteratively clustering and merging food items during the training process, which allows the deep model to extract image features that are discriminative across labels. Our method is evaluated on the VFN-nutrient dataset and achieves promising results compared with existing work in terms of both food type and food item classification.