SUMAC'20: Proceedings of the 2nd Workshop on Structuring and Understanding of Multimedia heritAge Contents

SESSION: Keynote Talks

Deep Image Features for Instance-level Recognition and Matching

  • André Araujo

In this talk, I will discuss recent work from our team at Google Research, covering novel methods and datasets. Instance-level recognition, retrieval and matching are key computer vision problems which generally depend on effective image representations, both global and local. Our team has proposed a suite of state-of-the-art models to address these tasks: DELF (ICCV'17), one of the first deep learning methods for joint detection & description of local image features; Detect-to-Retrieve (CVPR'19), where deep local features can be efficiently aggregated guided by a trained object detector; DELG (ECCV'20), the first end-to-end trained deep model for joint local and global feature extraction. I will also present our team's efforts on pushing for larger scale and more realistic benchmarks in this area, with the Google Landmarks Dataset (CVPR'20), and three workshops at computer vision conferences (CVPR'18, CVPR'19, ECCV'20).

Towards the Semantic-aware 3D Digitisation of Architectural Heritage: The "Notre-Dame de Paris" Digital Twin Project

  • Livio De Luca

The introduction of digital technologies into the practices of documentation, analysis and dissemination of cultural heritage is today an important issue not only in the sphere of computer science, but also in the humanities and social sciences as well as in conservation sciences. New scientific challenges are today at the crossroads of a few trends that shape the contemporary landscape of digital humanities: the democratisation of 3D digitisation means, the emergence of new approaches for the massive cross-analysis of digitised content, the on-going harmonisation of heritage information systems through the formalisation of multidisciplinary knowledge. Whilst recent advances in digital technologies have made it possible to introduce new tools that are making documentation practices evolve within the cultural heritage community, the management of multi-dimensional and multi-format data introduces new challenges, in particular the development of relevant analysis and interpretation methods, the sharing and correlation of heterogeneous data among several actors and contexts, and the centralised archiving of documentation results for long-term preservation purposes. The restoration of Notre-Dame de Paris is today an unprecedented opportunity to gather and analyse the many analytical and documentary resources produced by a large number of scientists and heritage professionals from different backgrounds on the same building. 
In collaboration with the other working groups of the French CNRS (National Centre for Scientific Research) / MC (Ministry of Culture) task force on the restoration of the cathedral (stone, stained glass, wood, metals, structure, acoustics, heritage emotions, etc.), the "digital data" WG focuses on the introduction of an innovative digital ecosystem, a 'Notre-Dame de Paris' digital twin, to bring together, analyse and correlate the multiple levels of reading that converge towards the production of new knowledge on the cathedral, its transformations over time, its materials, the alterations produced by the fire, as well as towards the restoration project. By combining recent advances in 3D digitisation, knowledge engineering, computer vision and shape analysis, our approach takes into account two complementary aspects: on the one hand, the description of the different steps taken by scientists to move from the observation of raw data to their interpretation in relation to other contextual data, through the analysis of typical operating chains; on the other hand, the analysis of the spatial, temporal and semantic overlap of regions of "multidisciplinary interest", based on the correlation of annotations, vocabulary terms, qualitative attributes and morphological features.

SESSION: Workshop Presentations

Semantics Preserving Hierarchy based Retrieval of Indian heritage monuments

  • Ronak Gupta
  • Prerana Mukherjee
  • Brejesh Lall
  • Varshul Gupta

Monuments can be classified from coarse to fine categories on the basis of their appearance and shape, yet they also carry rich semantic information, reflected in the eras in which they were built, their type or purpose, the dynasty that established them, etc. The Indian subcontinent in particular exhibits a great deal of variation in architectural styles owing to its rich cultural heritage. In this paper, we propose a framework that uses hierarchy to preserve semantic information while performing image classification or image retrieval. We encode the learnt deep semantic embeddings to construct a dictionary of images and then re-rank the retrieved results using DELF features. The semantic information preserved in these embeddings helps classify unknown monuments at a higher level of granularity in the hierarchy. We have curated a large, novel Indian heritage monuments dataset (IHMD) comprising images of historical, cultural and religious importance, with subtypes of eras, dynasties and architectural styles. We demonstrate the performance of the proposed framework on image classification and retrieval tasks and compare it with competing methods on this dataset. The dataset can be accessed at:
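The global-retrieval-then-local-re-ranking step can be sketched roughly as follows. This is an illustrative outline, not the authors' code: the local match scores, which the paper derives from DELF feature matching, are assumed here to be precomputed.

```python
import numpy as np

def retrieve_and_rerank(query_emb, index_embs, local_scores, k=5):
    """Retrieve top-k images by cosine similarity of global embeddings,
    then re-rank the shortlist with a local-feature match score.
    local_scores is a per-image array standing in for DELF matching."""
    # Normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    X = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = X @ q
    shortlist = np.argsort(-sims)[:k]             # global retrieval
    order = np.argsort(-local_scores[shortlist])  # local re-ranking
    return shortlist[order]
```

In practice the shortlist keeps the expensive local matching confined to a handful of candidates, which is the usual motivation for this two-stage design.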

A Generative Adversarial Approach with Residual Learning for Dust and Scratches Artifacts Removal

  • Ionuţ Mironică

Retouching can significantly elevate the visual appeal of photos, but many casual photographers lack the expertise to do it professionally. One particularly challenging task in old photo retouching is the removal of dust and scratch artifacts. Traditionally, this task has been performed manually with specialized image enhancement software and is tedious work that requires expert knowledge of photo editing applications. However, recent research has shown that Generative Adversarial Networks (GANs) obtain good results in various automated image enhancement tasks compared to traditional methods. This motivated us to explore the use of GANs in the context of film photo editing. In this paper, we present a GAN-based method that removes dust and scratch artifacts from film scans. Specifically, residual learning is used to speed up the training process and boost denoising performance. An extensive evaluation of our model on a community-provided dataset shows that it generalizes remarkably well and does not depend on any particular type of image. Finally, our method significantly outperforms state-of-the-art methods and software applications, providing superior results.
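The residual learning formulation can be illustrated with a minimal sketch: the network predicts the artifact layer rather than the clean image, and restoration subtracts it. The `predict_residual` callable below stands in for the trained generator and is a placeholder, not the paper's model.

```python
import numpy as np

def residual_restore(degraded, predict_residual):
    """Residual learning for restoration: the model estimates the
    artifact layer (dust/scratches), and the clean image is recovered
    by subtraction. Images are floats in [0, 1]."""
    return np.clip(degraded - predict_residual(degraded), 0.0, 1.0)

def residual_target(degraded, clean):
    """Training target for the residual branch: the artifact layer
    itself, i.e. the difference between degraded and clean images."""
    return degraded - clean
```

Predicting the (mostly sparse) residual rather than the full image is what reportedly speeds up training: most of the output is near zero, so the network does not have to reproduce intact image content.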

Face Detection on Pre-modern Japanese Artworks using R-CNN and Image Patching for Semi-Automatic Annotation

  • Alexis Mermet
  • Asanobu Kitamoto
  • Chikahiko Suzuki
  • Akira Takagishi

We propose a face detection method for the semi-automatic annotation of faces in pre-modern Japanese artworks, to assist art historians in identifying objects in the art collection. Our method is based on R-CNN variants, such as Faster R-CNN and Cascade R-CNN, for object detection, combined with image patching to take advantage of high-resolution images. Our face detectors were first trained on the KaoKore dataset to demonstrate that existing object detection models with image patching can successfully learn faces in artworks. The detectors were then applied to the Kouhon dataset to assist art historians in creating a new facial expression dataset. Finally, the impact of face detection on art history research was measured by the reduction in annotation time, estimated to be 1/3 of that of fully manual annotation.
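Image patching for high-resolution detection can be sketched as follows. The tile size, overlap, and `detector` callable are illustrative assumptions rather than the paper's settings, and duplicate boxes from overlapping tiles would still need non-maximum suppression.

```python
import numpy as np

def detect_with_patches(image, detector, patch=512, overlap=64):
    """Run a detector on overlapping tiles of a high-resolution image
    and map the boxes back to full-image coordinates. `detector` is any
    callable returning boxes as (x1, y1, x2, y2, score) in tile coords."""
    h, w = image.shape[:2]
    step = patch - overlap
    boxes = []
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            tile = image[y0:y0 + patch, x0:x0 + patch]
            for (x1, y1, x2, y2, s) in detector(tile):
                # Offset tile-local boxes into global coordinates.
                boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, s))
    return boxes
```

The overlap ensures a face lying on a tile boundary is fully contained in at least one tile, at the cost of detecting it twice.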

An Automated Pipeline for a Browser-based, City-scale Mobile 4D VR Application based on Historical Images

  • Sander Münster
  • Ferdinand Maiwald
  • Christoph Lehmann
  • Taras Lazariv
  • Mathias Hofmann
  • Florian Niebling

The process of automatically creating 3D city models from contemporary photographs and visualizing them on mobile devices is now well established, but historical 4D city models, where the fourth dimension is time, are more challenging. This article describes an automated VR pipeline that starts from historical photographs and results in an interactive, browser-based 4D visualization and information system rendered on mobile devices. Since the pipeline is still under development, we present initial results for individual stages of the process and assess them for accuracy and usability.

New Interactive Methods for Image Registration with Applications in Repeat Photography

  • Axel Schaffland
  • Tri Hiep Bui
  • Oliver Vornberger
  • Gunther Heidemann

We present two methods to interactively register images that cannot be registered automatically due to the large differences between them. Such differences may occur with images that are multitemporal, multisensory, or multipositional. Our methods are used to register repeat photography compilations consisting of a historical image and a contemporary image of the same scene. Method I iteratively computes a rigid transformation registering the two images based on user-added point pairs. More complex transformations are generated as more point pairs are added, starting from a translation with one point pair up to an optimized perspective transformation with more than four point pairs. Method II allows users to grab and drag one of the images and pin it to the second image; depending on the number of positions at which the images are pinned together, more complex transformations are computed. Further, we present a third method that assists already in the creation of the second image: Method III lets the user compare a live camera stream (e.g. from a smartphone camera or webcam) with the first image using different composition forms, allowing users to check whether they have reached the camera position of the first image. All methods give direct feedback to the user during the registration process and run client-side in a web browser without additional software. They are also planned to be integrated into our repeat photography web portal.
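Method I's progression from a translation (one point pair) to a perspective transform (four or more pairs) can be sketched as below. This is a minimal illustration using the standard direct linear transform (DLT), not the authors' implementation, and the intermediate similarity/affine cases are omitted.

```python
import numpy as np

def estimate_transform(src, dst):
    """Pick a transform family by the number of user-clicked point
    pairs: one pair gives a translation, four or more give a full
    perspective transform (homography via the DLT).
    Returns a 3x3 matrix mapping homogeneous src points to dst."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    n = len(src)
    if n == 1:                       # translation only
        t = dst[0] - src[0]
        return np.array([[1, 0, t[0]], [0, 1, t[1]], [0, 0, 1.0]])
    if n >= 4:                       # perspective via DLT
        A = []
        for (x, y), (u, v) in zip(src, dst):
            A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
            A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
        # The homography is the null vector of A (smallest singular value).
        _, _, Vt = np.linalg.svd(np.array(A))
        H = Vt[-1].reshape(3, 3)
        return H / H[2, 2]
    raise NotImplementedError("2-3 pairs: similarity/affine fits omitted")
```

With more than four pairs the same SVD yields a least-squares perspective fit, matching the "optimized perspective transformation" the abstract describes.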

Hybrid Human-Machine Classification System for Cultural Heritage Data

  • Shaban Shabani
  • Maria Sokhn
  • Heiko Schuldt

The advancement of digital technologies has helped cultural heritage organizations digitize their collections and improve accessibility via online platforms. These platforms have enabled citizens to contribute to the digital preservation of cultural heritage by sharing documents and their knowledge. However, many historical datasets suffer from incomplete metadata, and to solve this issue cultural heritage organizations depend heavily on domain experts. In this paper, we address the problem of completing the metadata of historical digital collections. To this end, we introduce a new hybrid human-machine model that jointly integrates predictions of a deep multi-input model with labels inferred from multiple crowd judgements. The multi-input model uses visual features extracted from the images and textual features from the metadata, complemented with Wikipedia classes of concepts extracted from the text. For crowd answer aggregation, our method takes the workers' reliability scores into account; this score is based on each worker's task history and their performance on our task. We have applied our hybrid approach to a cultural heritage platform, and evaluations show that it outperforms both deep learning and crowdsourcing when applied individually.
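Reliability-weighted aggregation of crowd answers might look roughly like this. It is a sketch, not the paper's exact scheme: the data shapes and the neutral default score for unseen workers are assumptions.

```python
from collections import defaultdict

def aggregate_labels(answers, reliability):
    """Reliability-weighted vote over crowd answers.
    `answers` maps worker id -> proposed label; `reliability` maps
    worker id -> score in [0, 1] derived from task history."""
    votes = defaultdict(float)
    for worker, label in answers.items():
        votes[label] += reliability.get(worker, 0.5)  # neutral default
    # The label with the largest accumulated reliability mass wins.
    return max(votes, key=votes.get)
```

Compared with plain majority voting, this lets a single worker with a strong track record outweigh several unreliable ones.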

PP-LinkNet: Improving Semantic Segmentation of High Resolution Satellite Imagery with Multi-stage Training

  • An Tran
  • Ali Zonoozi
  • Jagannadan Varadarajan
  • Hannes Kruppa

Road network and building footprint extraction is essential for many applications such as updating maps, traffic regulation, city planning, ride-hailing, and disaster response. Mapping road networks is currently both expensive and labor-intensive. Recently, improvements in image segmentation through the application of deep neural networks have shown promising results in extracting road segments from large-scale, high-resolution satellite imagery. However, significant challenges remain due to the lack of sufficient labeled training data needed to build models for industry-grade applications. In this paper, we propose a two-stage transfer learning technique to improve the robustness of semantic segmentation for satellite images that leverages noisy pseudo ground truth masks obtained automatically (without human labor) from crowd-sourced OpenStreetMap (OSM) data. We further propose Pyramid Pooling-LinkNet (PP-LinkNet), an improved deep neural network for segmentation that uses focal loss, a poly learning rate, and a context module. We demonstrate the strengths of our approach through evaluations on three popular datasets over two tasks, namely road extraction and building footprint detection. Specifically, we obtain 78.19% mean IoU on the SpaceNet building footprint dataset, and 67.03% and 77.11% on the road topology metric on the SpaceNet and DeepGlobe road extraction datasets, respectively.
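The focal loss used in PP-LinkNet can be sketched in its standard binary form (Lin et al.); the `alpha` and `gamma` defaults below are the commonly used values, not necessarily the paper's settings.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples so training
    focuses on hard pixels, which helps with the extreme class
    imbalance of road/building masks.
    p: predicted foreground probabilities, y: binary ground truth."""
    p = np.clip(p, 1e-7, 1 - 1e-7)            # numerical stability
    pt = np.where(y == 1, p, 1 - p)           # prob. of the true class
    w = np.where(y == 1, alpha, 1 - alpha)    # class-balancing weight
    return float(np.mean(-w * (1 - pt) ** gamma * np.log(pt)))
```

With `gamma=0` and `alpha=1` this reduces to plain binary cross-entropy; raising `gamma` shrinks the contribution of well-classified pixels, which matters when road pixels are a tiny fraction of each satellite tile.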