AMC-SME '23: Proceedings of the 2023 Workshop on Advanced Multimedia Computing for Smart Manufacturing and Engineering

SESSION: Keynote Talk

AI/Machine Learning for Internet of Dependable and Controllable Things

  • Houbing Herbert Song

The Internet of Things (IoT) has the potential to enable a variety of applications and services. However, it also presents grand challenges in security, safety, and privacy. There is therefore a need to move from the IoT to the Internet of Dependable Things, defined as an Internet of Things that is designed, built, deployed, and operated in a highly trustworthy manner, and the Internet of Controllable Things, defined as an Internet of Things that is operated in a highly controllable manner. The massive resurgence of artificial intelligence (AI) and machine learning (ML) presents tremendous opportunities for the Internet of Dependable and Controllable Things, as well as significant challenges. In this lecture, I will present the state of the art by reviewing and classifying the existing literature, evaluate the opportunities and challenges, and identify trends by assessing what needs to be done to enable AI/ML for the Internet of Dependable and Controllable Things.

SESSION: Session 1: Applications in AMC-SME

A Scalable Real-time Semantic Segmentation Network for Autonomous Driving

  • Jun Liu
  • Chao Wu
  • Geng Yuan
  • Wei Niu
  • Wenbin Zhang
  • Houbing Herbert Song

Computational resources are limited on real-time embedded devices, so the computing budget available on the target platform must be considered at deployment. We develop a feature extraction module based on the MobileNet backbone whose computational complexity and capacity can be adjusted through the depth multiplier, the classifier depth, and the kernel depth. These three parameters control the number of channels within the network, effectively managing the model's capacity and computational requirements. To achieve semantic segmentation, we incorporate additional components in an extension module. This extension module includes 1x1 pointwise convolutional layers for pixel-level classification and a transposed convolutional layer for upsampling the output to the original input image size. Combining the feature extraction module with this extension module yields a complete architecture capable of performing semantic segmentation: the feature extraction module provides the initial features, and the extension module adds the components needed for accurate pixel-wise classification and upsampling. Compared to hardware-aware Neural Architecture Search (NAS), pruning, runtime pruning, and knowledge distillation methods, our model has advantages in modular design, structural controllability, ease of implementation, and cost-effectiveness, and its computational efficiency, as measured by FLOPs, is highly competitive. Our method is distinguished by solving MobileNet's inability to adjust the size and number of convolution kernels. It achieves this through adaptable parameter tuning, including MobileNet's depth multiplier, the kernel size in the FCN head's separable convolution layer, and the depth of the first pointwise convolution layer. These adjustments are matched to the hardware's maximum multiply-accumulate operations (MACs), optimizing network capacity and maximizing resource utilization.
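The abstract's core knob is the depth multiplier, which scales every layer's channel count. A minimal sketch of this mechanism, following MobileNet's convention of rounding channel counts to a multiple of 8 (the function names and base widths here are illustrative, not the paper's code):

```python
def make_divisible(value, divisor=8):
    """Round a channel count to a nearby multiple of `divisor`,
    never dropping below 90% of the original value (MobileNet convention)."""
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded

def scale_channels(base_channels, depth_multiplier):
    """Scale a list of per-layer channel counts by the depth multiplier."""
    return [make_divisible(c * depth_multiplier) for c in base_channels]

# Halving the multiplier roughly quarters a pointwise conv's MACs,
# since its cost is proportional to in_channels * out_channels.
base = [32, 64, 128, 256, 512, 1024]
print(scale_channels(base, 1.0))   # [32, 64, 128, 256, 512, 1024]
print(scale_channels(base, 0.5))   # [16, 32, 64, 128, 256, 512]
```

Because MAC cost grows with the product of input and output channels, sweeping the multiplier gives a smooth trade-off curve that can be matched to a target platform's MAC budget.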

Adapting Segment Anything Model for Shield Tunnel Water Leakage Segmentation

  • Shichang Liu
  • Junxin Chen
  • Ben-Guo He
  • Tao Chen
  • Gwanggil Jeon
  • Wei Wang

Intelligent algorithms are a significant driving force in smart engineering. Water leakage detection in shield tunnels is a critical task in such engineering projects. Manual inspection methods for detecting leakage areas are time-consuming and may yield inconsistent results, so automated methods should be developed to improve efficiency and reliability. Recently, foundation segmentation models pre-trained on huge datasets have been gaining attention. By modifying the Segment Anything Model (SAM) without retraining or fine-tuning the entire model, we provide an accurate automatic water leakage segmentation approach using an adaptation technique. Our method simplifies tunnel maintenance, helping ensure the integrity and safety of these critical infrastructures. Our study highlights the limitations of SAM and the need to adapt it to specific tasks. The contributions of this paper include the efficient adaptation of SAM for shield tunnel water leakage segmentation and a demonstration of its effectiveness through data experiments. With AI models, we improve the efficiency of tackling engineering challenges.

Plug-and-Play Multi-class Lane Detection Module

  • Qiankun Li
  • Huabao Chen
  • Debin Liu
  • Zengyu Qiu
  • Zengfu Wang

Lanes play a crucial role in visual navigation systems for autonomous driving. Several studies have employed deep learning to design networks for lane detection. However, most methods simply detect lane areas, ignoring that different types of lanes, as traffic markings, carry different high-level semantic meanings. In this paper, we propose a plug-and-play multi-class lane detection module (MLDM) aimed at distinguishing different kinds of lanes. The module identifies lane regions based on the lane coordinates predicted by the network. It then uses pixel density and lane length to detect color and determine whether a lane is solid or dashed. In addition, to demonstrate the practical value of MLDM, we devise a lane departure warning system that integrates its detection results. Experimental results demonstrate the efficacy of the proposed method.
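The solid-versus-dashed decision described above rests on pixel density along the detected lane. A hypothetical sketch of that heuristic (the threshold and sampling scheme are assumptions, not values from the paper):

```python
def classify_lane_style(lane_mask_samples, density_threshold=0.8):
    """Classify a lane as solid or dashed from samples along its fitted curve.

    lane_mask_samples: list of 0/1 flags, one per point sampled along the
    lane coordinates predicted by the network; 1 means the point lands on
    painted lane pixels. A solid lane is painted along nearly its whole
    length, while a dashed lane leaves regular gaps, so its painted-pixel
    density over the lane's length is markedly lower.
    """
    if not lane_mask_samples:
        return "unknown"
    density = sum(lane_mask_samples) / len(lane_mask_samples)
    return "solid" if density >= density_threshold else "dashed"

samples_solid = [1] * 50                    # continuously painted
samples_dashed = [1, 1, 1, 0, 0] * 10      # painted segments with gaps
print(classify_lane_style(samples_solid))   # solid
print(classify_lane_style(samples_dashed))  # dashed
```

Color detection would proceed similarly, aggregating pixel statistics (e.g. mean hue) inside the same sampled lane region.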

CMFS-Net: Common Mode Features Suppression Network for Gaze Estimation

  • Xu Xu
  • Lei Yang
  • Yan Yan
  • Congsheng Li

Gaze estimation typically involves determining the direction or point of gaze from a single image of the eye or face. However, due to variations in the internal structure and morphology of eyes among individuals, existing gaze estimation models have limited accuracy: they often exhibit subject-dependent bias and high output variance. To address these limitations, calibration is commonly used to improve accuracy by mapping an individual's gaze predictions to their true gaze. In this study, we propose an innovative approach for gaze estimation called the image common-mode feature suppression network. By training this network to suppress common-mode features, we can forecast the difference in gaze between two input images from the same subject. Furthermore, by leveraging the inferred differences and a set of calibration images for a specific subject, we can forecast the gaze direction for a new eye sample. Our hypothesis is that contrasting the two eye images significantly reduces the confounding factors that typically affect single-image forecasting methods, thereby producing superior predictions. To evaluate our approach, we curated a dataset specifically for training and testing the network. The experimental results demonstrate that our proposed model achieves gaze estimation with an error of less than 5.3 mm.
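The calibration step described above can be sketched as follows: the network predicts a gaze offset between the new image and each calibration image, and the estimates are averaged. This is a minimal stand-in, assuming a `predict_difference` model and 1-D gaze values purely for illustration:

```python
def estimate_gaze(new_image_feat, calibration_set, predict_difference):
    """Differential gaze estimation with a per-subject calibration set.

    calibration_set: list of (image_feature, known_gaze) pairs for one subject.
    predict_difference(a, b): stand-in for the trained network; returns the
    predicted gaze offset from image b to image a.
    The final estimate averages (known gaze + predicted offset) over all
    calibration pairs, which damps per-pair prediction noise.
    """
    estimates = [
        gaze + predict_difference(new_image_feat, feat)
        for feat, gaze in calibration_set
    ]
    return sum(estimates) / len(estimates)

# Toy 1-D example: "features" equal the true gaze and the difference model
# is exact, so the estimate recovers the true value.
calib = [(10.0, 10.0), (12.0, 12.0), (15.0, 15.0)]
diff = lambda a, b: a - b
print(estimate_gaze(13.0, calib, diff))  # 13.0
```

In practice the features are CNN embeddings and the gaze is a 2-D or 3-D vector, but the averaging structure is the same.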

Real-Time Machine Learning Based Object Detection and Recognition System for the Visually Impaired

  • Jenny Liu

For the severely vision impaired, day-to-day life and navigation pose much risk: current aids are either unreliable or expensive. This project aimed to develop a lower-cost, wearable assistive device targeting two main aspects: object detection/recognition and image captioning. The final system uses MobileNet-SSD as a base model trained for 80 classification labels on the open-source COCO object detection, segmentation, and captioning data set. The data set consists of 400,000 images from COCO and 200 images taken by the student researcher. The design uses a Raspberry Pi 4 (Model B) as the hardware platform, powered by a portable USB battery, with a high-definition camera and earbuds attached. Image captioning was implemented using the Python library Google Text-to-Speech (gTTS). The system can identify and caption up to 80 different basic objects for under $200. A field test demonstrated that the system performs with high accuracy both indoors (95%, 90%) and outdoors (78.6%, 85.7%), suggesting it offers a novel machine-learning device that could be a versatile alternative to current technology. Should the system be further optimized and developed into assistive technology using the prototype designs created, it could help keep the 2.2 billion people living with vision impairment safe.

SESSION: Session 2: Supportive Technologies in AMC-SME

Physically Robust Reversible Watermarking

  • Jiale Chen
  • Xiaojian Ji
  • Li Dong
  • Rangding Wang
  • Diqun Yan
  • Jinyu Tian

In this work, we propose, for the first time, a novel invisible digital image watermarking framework dubbed Physically Robust Reversible Watermarking (PRRW). The proposed PRRW enjoys two merits. First, it allows complete extraction of the watermark and perfect reconstruction of the cover image when the watermarked image is shared through a lossless digital channel. Second, PRRW is resilient to distortions caused by the physical screen-to-camera communication channel. Specifically, PRRW builds on a template-based robust watermarking scheme, which we further arm with reversibility. For watermark embedding, we first identify a sub-image of the cover image that can accommodate the maximum embeddable strength, and then embed the watermark into it. For watermark extraction, we introduce a watermark detection network that can accurately locate the marked sub-image and extract the watermark without assistance from any landmarks. Experimental results demonstrate the effectiveness and practical usability of the proposed PRRW framework.

Intelligent Classification of Multimedia Images Based on Class Information Mining

  • Shuai Xiao
  • Xiaotong Shen
  • Guipeng Lan
  • Jiachen Yang
  • Jiabao Wen
  • Yong Zhu

The classification of multimedia images has long been of great concern, and related technologies are constantly being improved to increase accuracy. Adequate evaluation and mining of sample information is an important direction, but it remains a challenge. In this article, we propose a method for constructing deep learning training datasets that fully considers the intra-class and inter-class features of the samples. The intra-class dispersion of a sample is evaluated by the distance from its features to the class prototype, while inter-class confusion is evaluated by combining the distance between prototypes in the feature space with intra-class dispersion. Based on these intra-class and inter-class features, we determine the imbalance proportions used to construct imbalanced datasets. This method has the potential to be applied to different multimedia visual tasks.
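The two quantities above can be made concrete with prototype distances. A sketch of one plausible instantiation (the exact formula combining the terms is not given in the abstract, so the ratio used here is an assumption):

```python
import math

def prototype(features):
    """Class prototype = per-dimension mean of the class's feature vectors."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def intra_class_dispersion(features):
    """Mean distance from each sample to its class prototype."""
    p = prototype(features)
    return sum(euclidean(f, p) for f in features) / len(features)

def inter_class_confusion(feats_a, feats_b):
    """One plausible confusion score: the classes' combined dispersion
    relative to the distance between their prototypes. A larger value
    means the classes overlap more and are easier to confuse."""
    gap = euclidean(prototype(feats_a), prototype(feats_b))
    spread = intra_class_dispersion(feats_a) + intra_class_dispersion(feats_b)
    return spread / gap if gap > 0 else float("inf")

# Two tight, well-separated clusters yield a small confusion score.
class_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
class_b = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
print(inter_class_confusion(class_a, class_b))
```

Samples from class pairs with high confusion scores would then be over- or under-represented when building the imbalanced training set.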

Self-Augmentation Graph Contrastive Learning for Multi-view Attribute Graph Clustering

  • Shan Jin
  • Zhikui Chen
  • Shuo Yu
  • Muhammad Altaf
  • Zhenchao Ma

Multi-view attribute graph clustering is a fundamental task that aims to partition multi-view attributes into multiple clusters in an unsupervised manner. Existing multi-view attribute graph clustering methods underuse the comprehensive structural information within each view and further ignore unreliable relations between different views, leading to suboptimal clustering results. To this end, we develop Self-Augmentation Graph Contrastive Learning (SAGCL) for multi-view attribute graph clustering, which integrates comprehensive view-specific structural learning and the alignment of multi-level reliable relations between views into a unified framework. A graph self-augmentation strategy is proposed to adaptively explore the structural information within each view, comprehensively capturing the critical structure of each view of the multi-view attribute graph. A dual-alignment constraint is developed to enforce the consistency of inter-view relations at the embedding and clustering levels, extracting the structure consistent across views and yielding more discriminative, cluster-oriented graph embeddings. Furthermore, with the help of a robust contrastive loss, the proposed network suppresses noisy information within each view and unreliable relations between views. Extensive experiments show that SAGCL outperforms state-of-the-art methods.

Cooperative Spectrum Sensing with Deep Q-Network for Multimedia Applications

  • Qingying Wu
  • Benjamin K. Ng
  • Han Zhu
  • Chan-Tong Lam

With increasingly strict requirements from multimedia applications, spectrum inefficiency urgently needs to be relieved by sensing and utilizing Spectrum Holes (SHs) over a wide spectrum. The Cognitive Radio Sensor Network (CRSN), which determines the state of Primary Users (PUs) by implementing Cooperative Spectrum Sensing (CSS) and thereby overcomes various noise and fading issues in the radio environment, has drawn much attention. A survey on the application of Reinforcement Learning (RL) to CSS is conducted, focusing on performance optimization problems that traditional methods cannot solve. Specifically, we transform the traditional Fusion Center (FC) into an intelligent agent responsible for making fusion decisions based on the results of Energy Detection (ED). In this way, by learning from experience, the system's global detection probabilities can be improved by making fusion decisions as accurately as possible. Comparison studies demonstrate the effectiveness of the proposed method in improving CSS system performance, as well as its robustness across various environments. The combination and complementary use of the traditional and proposed schemes are also suggested in this paper.
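The fusion-center-as-agent idea can be sketched with a toy stand-in. The paper uses a Deep Q-Network, but for a handful of sensors the vote space is tiny, so tabular Q-learning (a simplification, with all parameters here assumed for illustration) shows the same mechanism: the agent learns a fusion rule from rewards rather than from a fixed voting formula.

```python
import random

def train_fusion_agent(episodes=5000, n_sensors=3, flip_prob=0.1,
                       alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning stand-in for the DQN fusion center.

    Each episode: a hidden PU state is drawn, each sensor's energy-detection
    vote flips with probability `flip_prob` (noise/fading), and the agent
    maps the vote vector to a global busy/idle decision. Reward is +1 for a
    correct fusion decision, -1 otherwise.
    """
    rng = random.Random(seed)
    q = {}  # (votes, action) -> estimated reward
    for _ in range(episodes):
        pu_active = rng.random() < 0.5
        votes = tuple(
            int(pu_active) ^ (rng.random() < flip_prob) for _ in range(n_sensors)
        )
        if rng.random() < epsilon:                      # explore
            action = rng.randrange(2)
        else:                                           # exploit
            action = max((0, 1), key=lambda a: q.get((votes, a), 0.0))
        reward = 1.0 if action == int(pu_active) else -1.0
        key = (votes, action)
        q[key] = q.get(key, 0.0) + alpha * (reward - q.get(key, 0.0))
    # Learned fusion policy: one decision per observed vote pattern.
    return {v: max((0, 1), key=lambda a: q.get((v, a), 0.0))
            for v in {k[0] for k in q}}

policy = train_fusion_agent()
print(policy[(1, 1, 1)], policy[(0, 0, 0)])  # learns a majority-like rule
```

With enough episodes the learned policy matches the optimal fusion rule for symmetric noise, and unlike a hard-coded rule it adapts if sensors have unequal reliability.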

Mitigating Context Bias in Action Recognition via Skeleton-Dominated Two-Stream Network

  • Qiankun Li
  • Xiaolong Huang
  • Yuwen Luo
  • Xiaoyu Hu
  • Xinyu Sun
  • Zengfu Wang

In the realm of intelligent manufacturing and industrial upgrading, sophisticated multimedia computing technologies play a pivotal role in the recognition of video actions. However, most studies suffer from background bias, where models focus excessively on the contextual information in the videos rather than on comprehending the human actions themselves. This can lead to severe misjudgments in industrial applications. In this paper, we propose a Skeleton-Dominated Two-Stream Network (SDTSN), a novel two-stream framework that fuses and ensembles the skeleton and RGB modalities for video action recognition. Experimental results on the Mimetics dataset, which is free of background bias, demonstrate the efficacy of our approach.
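Why "skeleton-dominated" helps can be shown with a late-fusion sketch: skeleton features carry no background appearance, so giving that stream the larger weight damps context bias. The weighting and fusion scheme below are illustrative assumptions, not the paper's exact architecture:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_predictions(skeleton_logits, rgb_logits, skeleton_weight=0.7):
    """Late fusion of the two streams' class scores.

    The skeleton stream sees only pose, not scenery, so weighting it above
    0.5 keeps background-driven RGB mistakes from dominating the ensemble.
    Returns the index of the fused top-scoring class.
    """
    sk = softmax(skeleton_logits)
    rgb = softmax(rgb_logits)
    fused = [skeleton_weight * a + (1 - skeleton_weight) * b
             for a, b in zip(sk, rgb)]
    return max(range(len(fused)), key=fused.__getitem__)

# The RGB stream is fooled by context (favors class 2), but the skeleton
# stream's confident vote for class 0 dominates the fused decision.
print(fuse_predictions([4.0, 0.0, 1.0], [0.0, 0.0, 3.0]))  # 0
```

With the weights reversed (e.g. 0.3 on the skeleton stream), the same inputs would follow the context-biased RGB vote instead.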

Adaptive Multiobjective Evolutionary Neural Architecture Search for GANs based on Two-Factor Cooperative Mutation Mechanism

  • Bin Cao
  • Zirun Zhou
  • Xin Liu
  • M. Shamim Hossain
  • Zhihan Lv

Neural architecture search (NAS) currently handles the automated design of generative adversarial networks (GANs) well, although some issues remain. One problem is that the vast majority of NAS-for-GANs methods are based on only a single evaluation metric or a linear superposition of multiple evaluation metrics. Another is that conventional evolutionary neural architecture search (ENAS) cannot adjust its mutation probabilities in accordance with the progress of the search, making it prone to settling into a local optimum. To address these issues, we first design a two-factor cooperative mutation mechanism that controls the mutation probability based on the population's current iteration round, population fitness, and other information. Second, we divide the evolutionary process into three stages based on the properties of NAS, so that each stage can adaptively adjust the mutation probability according to the population state and the expected development goals. Finally, we incorporate multiple optimization objectives from GAN image generation tasks into ENAS and construct an adaptive multiobjective ENAS based on the two-factor cooperative mutation mechanism. We test and ablate our algorithm on the STL-10 and CIFAR-10 datasets, and the experimental results show that our method outperforms the majority of traditional NAS-GANs.
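The two-factor idea (iteration progress plus population state jointly steering mutation probability) can be sketched as follows. The linear schedule and stagnation boost are assumptions for illustration; the paper's actual three-stage schedule differs:

```python
def mutation_probability(iteration, max_iterations, stagnation,
                         p_min=0.05, p_max=0.4, stagnation_boost=0.05):
    """Adaptive mutation probability from two cooperating factors.

    Factor 1: search progress -- early iterations explore (high p),
    late iterations exploit (low p), via linear annealing.
    Factor 2: population state -- each generation without a fitness
    improvement (`stagnation`) nudges p back up to escape local optima.
    """
    progress = iteration / max_iterations
    p = p_max - (p_max - p_min) * progress          # anneal with progress
    p += stagnation_boost * stagnation              # boost when stuck
    return min(p_max, max(p_min, p))

# Late in the run p is small, but stagnation pushes it back up.
print(mutation_probability(90, 100, stagnation=0))  # near p_min
print(mutation_probability(90, 100, stagnation=5))  # boosted to escape
```

A fixed mutation probability would have to pick one point on this trade-off in advance; coupling it to both factors lets the search exploit late in the run without losing its escape hatch.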