McGE '23: Proceedings of the 1st International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice

Full Citation in the ACM Digital Library

SESSION: Session 1: Multimedia Content Evaluation: New Methods and Practice

Automatic Image Aesthetic Assessment for Human-designed Digital Images

  • Yitian Wan
  • Weijie Li
  • Xingjiao Wu
  • Junjie Xu
  • Jing Yang

Recently, with the ever-growing scale of aesthetic assessment data, researchers have turned their attention to the image aesthetic assessment (IAA) task. Meanwhile, as technology develops, more and more human-designed digital images, created with software such as Photoshop, appear on the Internet. However, existing datasets focus only on real-world images, leaving a gap in the aesthetic assessment of human-designed digital images. In addition, many existing IAA datasets rely solely on the Mean Opinion Score (MOS) to calculate aesthetic scores. We contend that scores from individuals with different expertise should be treated differently, since different fields of knowledge are likely to yield different opinions of the same image. To address these challenges, we construct the first Human-Designed Digital (HDDI) dataset for IAA tasks and develop a multi-angle method to generate aesthetic scores. Furthermore, we present the TAHF model as a novel baseline for our newly curated dataset. Empirical validation demonstrates the superior performance of our TAHF model over the current state-of-the-art (SOTA) model on the HDDI dataset.
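
The difference between a plain MOS and an expertise-aware score can be sketched as follows. This is only an illustration: the rater groups, the weights, and the weighted-mean rule are assumptions, since the abstract does not specify how the multi-angle method combines scores.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Plain MOS: every rater counts equally, regardless of expertise."""
    return mean(score for _, score in ratings)

def group_weighted_score(ratings, group_weights):
    """Average within each expertise group, then combine the group means by weight.

    The grouping and weights are hypothetical, not taken from the paper.
    """
    by_group = {}
    for group, score in ratings:
        by_group.setdefault(group, []).append(score)
    total_weight = sum(group_weights[g] for g in by_group)
    weighted = sum(group_weights[g] * mean(s) for g, s in by_group.items())
    return weighted / total_weight

# Hypothetical ratings of one image: (rater group, score on a 1-10 scale).
ratings = [
    ("designer", 8.0), ("designer", 7.0),
    ("layperson", 4.0), ("layperson", 5.0),
    ("layperson", 3.0), ("layperson", 4.0),
]
weights = {"designer": 0.6, "layperson": 0.4}  # assumed weights

mos = mean_opinion_score(ratings)
weighted = group_weighted_score(ratings, weights)
```

With these made-up numbers, the plain MOS is pulled toward the larger layperson group, while the group-weighted score keeps the designers' opinion from being averaged away.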

Multimedia Cognition and Evaluation in Open Environments

  • Wei Feng
  • Haoyang Li
  • Xin Wang
  • Xuguang Duan
  • Zi Qian
  • Wu Liu
  • Wenwu Zhu

Within the past decade, a plethora of emerging multimedia applications and services has catalyzed the production of an enormous quantity of multimedia data. This data-driven epoch has significantly propelled advanced research in various facets of multimedia, including image/video content analysis, multimedia search and recommendation systems, multimedia streaming, and multimedia content delivery, among others. In parallel, the discipline of cognition has embarked on a renewed trajectory of progress, largely owing its remarkable success to the advent of machine learning methodologies. This concurrent evolution of the two domains presents an intriguing question: What happens when multimedia meets cognition? To decipher this interplay, we delve into the concept of Multimedia Cognition, which encapsulates the mutual influence between multimedia and cognition. This exploration is directed toward three crucial aspects. First, we examine the way multimedia and cognition influence each other, which prompts theoretical developments towards multiple intelligence and cross-media intelligence. Second, and more importantly, cognition reciprocates this interaction by infusing novel perspectives and methodologies into multimedia research, which can promote the interpretability, generalization ability, and logical thinking of intelligent systems in open environments. Last but not least, these two aspects form a loop in which multimedia and cognition interactively enhance each other, raising a new research problem: the proper evaluation of multimedia cognition in open environments. In this paper, we discuss what efforts have been made in the literature and how, and share our insights on research directions that deserve further study and could have a profound impact on multimedia cognition and evaluation in open environments.

How Art-like are AI-generated Images? An Exploratory Study

  • Junyu Chen
  • Jie An
  • Hanjia Lyu
  • Jiebo Luo

Assessing the artness or artistic quality of AI-generated images continues to be a challenge within the realm of image generation. Most existing metrics cannot be used to perform instance-level and reference-free artness evaluation. This paper presents ArtScore, a metric designed to evaluate the degree to which an image resembles authentic artworks by artists (or conversely photographs), thereby offering a novel approach to artness assessment. We first blend pre-trained models for photo and artwork generation, resulting in a series of mixed models. Subsequently, we utilize these mixed models to generate images exhibiting varying degrees of artness with pseudo-annotations. Each photorealistic image has a corresponding artistic counterpart and a series of interpolated images that range from realistic to artistic. This dataset is then employed to train a neural network that learns to estimate quantized artness levels of arbitrary images. Extensive experiments reveal that the artness levels predicted by ArtScore align more closely with human artistic evaluation than existing evaluation metrics, such as Gram loss and ArtFID.
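
The idea of blending two pre-trained generators into a series of mixed models, each tied to a quantized artness level, can be sketched as below. This is a minimal sketch under assumptions: it uses plain linear interpolation of parameters, and the function names and toy parameter dictionaries are hypothetical; the abstract does not state the exact blending rule.

```python
def blend_weights(photo_params, art_params, alpha):
    """Return (1 - alpha) * photo + alpha * art for every named parameter."""
    if photo_params.keys() != art_params.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [(1.0 - alpha) * p + alpha * a
               for p, a in zip(photo_params[name], art_params[name])]
        for name in photo_params
    }

def mixed_model_series(photo_params, art_params, num_levels):
    """Build num_levels blends; the level index serves as a pseudo artness label."""
    if num_levels < 2:
        raise ValueError("need at least two levels (pure photo and pure art)")
    series = []
    for level in range(num_levels):
        alpha = level / (num_levels - 1)  # 0.0 = photo model, 1.0 = art model
        series.append((level, blend_weights(photo_params, art_params, alpha)))
    return series

# Toy stand-ins for two generators with identical architectures.
photo = {"w": [0.0, 2.0]}
art = {"w": [1.0, 4.0]}
series = mixed_model_series(photo, art, num_levels=5)
```

Images sampled from the model at level k would then carry k as their pseudo-annotation when training the artness estimator.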

Exploring Anchor-Free Approach for Reading Chinese Characters

  • Zhuoyao Wang
  • Yiqun Wang
  • Zhao Zhou
  • Han Yuan
  • Cheng Jin

Scene text spotting has achieved impressive performance in recent years. Currently, most text localization methods are designed around text-line instances. We argue that a character-level spotting network is better suited to recognizing Chinese text, which is also common in scene text images. In this paper, we explore an anchor-free spotting framework that treats a character as a single point. To better capture Chinese character features, we first apply a Canny edge detector and superimpose the obtained edge information onto the RGB image channels. After that, a feed-forward network is set up, and inference can be completed in a single forward pass of the network, without complex post-processing steps. Experiments are performed on a Chinese text dataset, and the quantitative comparisons demonstrate the effectiveness of the anchor-free approach.

Semi-supervised Learning with Easy Labeled Data via Impartial Labeled Set Extension

  • Xuan Han
  • Mingyu You
  • Wanjing Ma

Traditional Semi-supervised Learning (SSL) methods usually assume that the labeled data is independent and identically distributed (i.i.d.) with respect to the underlying distribution. However, several studies have shown that the i.i.d. assumption may not always hold. Under the influence of human preference or automatic labeling, the labels (or trusted labels) are in some cases concentrated in easy samples that have distinctive characteristics. Such a biased labeled set leads to severe misestimation of the decision boundaries during learning. In this paper, we propose a novel evolutionary SSL framework, Solar Eclipse (SE), to address the problem. This framework is based on the concept of progressively enlarging the labeled set with the closest unlabeled samples. Specifically, a novel relative distance measurement, Regional Label Propagation (R-LP), is designed. In R-LP, the sample space is divided into several regions according to class similarities, and the distance is calculated independently in each region with label propagation. This segregation strategy efficiently reduces the complexity of distance measurement in the feature space. Moreover, R-LP also facilitates the ensemble of different feature views. In our practice, an unbiased self-supervised feature view is introduced to assist the measurement. Experiments show that this dual-view scheme helps find more reliable extending samples. The evaluation on popular SSL benchmarks shows that the proposed SE framework achieves state-of-the-art performance with easy labeled data. In addition, it also shows advantages when only a few i.i.d. labeled samples are provided, given that they may also have sampling bias.
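
The core loop of progressively enlarging the labeled set with the closest unlabeled samples can be illustrated as follows. This sketch deliberately swaps in plain Euclidean nearest-neighbour distance for the R-LP measurement described above, which the abstract does not specify in enough detail to reproduce; all function names and data points are hypothetical.

```python
import math

def nearest_labeled(point, labeled):
    """Return the (point, label) pair in the labeled set closest to `point`."""
    return min(labeled, key=lambda item: math.dist(point, item[0]))

def extend_once(labeled, unlabeled, batch_size):
    """Move the batch_size closest unlabeled points into the labeled set."""
    scored = []
    for point in unlabeled:
        neighbour, label = nearest_labeled(point, labeled)
        scored.append((math.dist(point, neighbour), point, label))
    scored.sort(key=lambda entry: entry[0])
    chosen = scored[:batch_size]
    chosen_points = {point for _, point, _ in chosen}
    new_labeled = labeled + [(point, label) for _, point, label in chosen]
    remaining = [p for p in unlabeled if p not in chosen_points]
    return new_labeled, remaining

def extend_until_done(labeled, unlabeled, batch_size=1):
    """Repeat the extension step until every sample carries a label."""
    while unlabeled:
        labeled, unlabeled = extend_once(labeled, unlabeled, batch_size)
    return labeled

# Two "easy" labeled samples far from the true boundary, plus unlabeled points.
labeled = [((0.0, 0.0), "a"), ((10.0, 0.0), "b")]
unlabeled = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (9.0, 0.0), (5.5, 0.0)]
labels = dict(extend_until_done(labeled, unlabeled))
```

In this toy example the point (5.5, 0.0) ends up labeled "a" because the "a" region grows toward it step by step, whereas a one-shot nearest-neighbour rule using only the two original labels would assign it "b".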

EMID: An Emotional Aligned Dataset in Audio-Visual Modality

  • Jialing Zou
  • Jiahao Mei
  • Guangze Ye
  • Tianyu Huai
  • Qiwei Shen
  • Daoguo Dong

In this paper, we propose the Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that primarily focus on semantic correlations or roughly divided emotional relations, EMID emphasizes the significance of emotional consistency between music and images using an advanced 13-dimension emotional model. By incorporating emotional alignment into the dataset, it aims to establish pairs that closely align with human perceptual understanding, thereby raising the performance of auditory-visual cross-modal tasks. We also design a supplemental module named EMI-Adapter to optimize existing cross-modal alignment methods. To validate the effectiveness of EMID, we conduct a psychological experiment, which demonstrates that considering the emotional relationship between the two modalities effectively improves matching accuracy from an abstract perspective. This research lays the foundation for future cross-modal research in domains such as psychotherapy and contributes to advancing the understanding and utilization of emotions in cross-modal alignment. The EMID dataset is available at

2CET-GAN: Pixel-Level GAN Model for Human Facial Expression Transfer

  • Xiaohang Hu
  • Nuha Aldausari
  • Gelareh Mohammadi

Recent studies have used GANs to transfer expressions between human faces. However, existing models have some flaws, such as relying on emotion labels, lacking continuous expressions, and failing to capture the expression details. To address these limitations, we propose a novel two-cycle network called 2 Cycles Expression Transfer GAN (2CET-GAN), which can learn continuous expression transfer without using emotion labels in an unsupervised fashion. The proposed network learns the transfer between two distributions while preserving identity information. The quantitative and qualitative experiments on two public datasets of emotions (CFEE and RafD) show our network can generate diverse and high-quality expressions and can generalize to unknown identities. We also compare our methods with other GAN models and show the proposed model generates expressions that are closer to the real distribution and discuss the findings. To the best of our knowledge, we are among the first to successfully use an unsupervised approach to disentangle expression representation from identities at the pixel level. Our code is available at

4DSR-GCN: 4D Video Point Cloud Upsampling using Graph Convolutional Networks

  • Lorenzo Berlincioni
  • Stefano Berretti
  • Marco Bertini
  • Alberto Del Bimbo

Time-varying sequences of 3D point clouds, or 4D point clouds, are now being acquired at an increasing pace in several applications (personal avatar representation, LiDAR in autonomous or assisted driving). In many cases, such a volume of data is transmitted, thus requiring proper compression tools to reduce either the resolution or the bandwidth. In this paper, we propose a new solution for upscaling and restoration of time-varying 3D video point clouds after they have been heavily compressed. Our model consists of a specifically designed Graph Convolutional Network that combines Dynamic Edge Convolution and Graph Attention Networks for feature aggregation in a Generative Adversarial setting. We present a different way to sample dense point clouds, with the intent of making these modules work in synergy to provide each node with enough features about its neighbourhood to later generate new vertices. Compared to other solutions in the literature that address the same task, our proposed model obtains comparable reconstruction quality while using a substantially lower number of parameters (≈ 300 KB), making our solution deployable on edge computing devices.

SESSION: Session 2: Multimedia Content Generation

Taming Vector-Wise Quantization for Wide-Range Image Blending with Smooth Transition

  • Zeyu Wang
  • Haibin Shen
  • Kejie Huang

Wide-range image blending is a novel image processing technique that merges two different images into a panorama with a transition region. Conventional image inpainting and outpainting methods have been used to complete this task, but always create significantly distorted and blurry structures. The State-Of-The-Art (SOTA) method uses a U-Net-like model with a feature prediction module for content inference. However, it fails to generate panoramas with smooth transitions and visual realness, particularly when the input images have distinct scenery features. This indicates that the predicted features may deviate from the natural latent distribution of authentic images. In this paper, we propose an effective deep-learning model that integrates vector-wise quantization for feature prediction. This approach searches a discrete codebook for the most similar latent features, resulting in high-quality wide-range image blending. In addition, we propose to use a global-local discriminator for adversarial training to improve the predicted content quality and smooth the transition. Our experiments demonstrate that our method generates visually appealing panoramic images and outperforms baseline approaches on the Scenery6000 dataset.
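
The codebook search at the heart of vector-wise quantization can be sketched as a nearest-neighbour lookup. This is a generic illustration, not the paper's model: the squared Euclidean distance, the tiny codebook, and the function names are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(features, codebook):
    """Replace each predicted feature vector with its nearest codebook entry."""
    indices = [
        min(range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]))
        for feature in features
    ]
    return indices, [codebook[i] for i in indices]

# Hypothetical 2-D codebook and predicted latent features.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
predicted = [[0.9, 1.2], [3.5, 0.4], [0.1, -0.2]]
indices, quantized = quantize(predicted, codebook)
```

Snapping predictions onto codebook entries constrains them to vectors the codebook already contains, which is one way to keep predicted features from drifting away from the latent distribution of real images.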

TCGIS: Text and Contour Guided Controllable Image Synthesis

  • Zhiqiang Zhang
  • Wenxin Yu
  • Yibo Fan
  • Jinjia Zhou

Recently, text-to-image synthesis (T2I) has received extensive attention with encouraging results. However, the research still faces the following challenges: 1) the quality of the synthesized images cannot be effectively guaranteed; 2) human participation in the synthesis process is still insufficient. To address these challenges, we propose a text- and contour-guided, manually controllable image synthesis method. The method synthesizes images based on manually input text and a simple contour, wherein the text determines the basic content and the contour determines the shape and position. Based on this idea, we propose a reasonable and efficient network structure using the attention mechanism and achieve strong synthesis results. We validate the effectiveness of our proposed method on three widely used datasets, and both qualitative and quantitative experimental results show promising performance. In addition, we design a lightweight structure to further improve the practicability of the model.

Emotionally Enhanced Talking Face Generation

  • Sahil Goyal
  • Sarthak Bhagat
  • Shagun Uppal
  • Hitkul Jangra
  • Yi Yu
  • Yifang Yin
  • Rajiv Ratn Shah

Several works have developed end-to-end pipelines for generating lip-synced talking faces with real-world applications, such as teaching and language translation in videos. However, these prior works fail to create realistic-looking videos since they focus little on people's expressions and emotions. Moreover, these methods' effectiveness largely depends on the faces in the training dataset, which means they may not perform well on unseen faces. To mitigate this, we build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions, making them more realistic and convincing. With a broad range of six emotions, i.e., happiness, sadness, fear, anger, disgust, and neutral, we show that our model can adapt to arbitrary identities, emotions, and languages. Our proposed framework has a user-friendly web interface with a real-time experience for talking face generation with emotions. We also conduct a user study for subjective evaluation of our interface's usability, design, and functionality. Project page:

Human Pose Recommendation and Professionalization

  • Xin Jin
  • Chenyu Fan
  • Biao Wang
  • Chaoen Xiao
  • Chao Xia

Thanks to the proliferation of smartphones, taking photos is a breeze. However, we often find it difficult to strike a proper pose due to a lack of professional photography knowledge or guidance, and the resulting photos are less than satisfactory. Nowadays, there are plenty of scenarios where original person images need to be automatically modified, such as social-media apps and photo-sharing sites. To solve this problem, we introduce a two-stage framework for human pose recommendation and professionalization. This novel pose-guided person image generation task transforms a source person image into a professional pose. The framework consists of a Pose Recommendation Stage followed by a Reposing Stage. In the recommendation stage, we first collect a dataset of professional person pose images (Human Posture Recommendation Templates, HPRT) and propose a human posture recommendation algorithm. Given a source person image, the algorithm finds suitable reference posture images for the shooting scene from the collected template dataset. Then, in the reposing stage, we propose a pose-conditioned transformer-based StyleGAN generator to translate the source person image to the reference posture. To make the image more realistic, we also add a completion operation. As a result, our method can automatically replace the person's pose in the image with a more professional one without any manual operation. Qualitative and quantitative evaluations show that our two-stage framework handles this task well.

Alleviating Training Bias with Less Cost via Multi-expert De-biasing Method in Scene Graph Generation

  • Xuezhi Tong
  • Rui Wang
  • Lihua Jing

Scene graph generation (SGG) methods have suffered from a severe training bias towards frequent (head) predicate classes. Recent works attribute this to the long-tailed distribution of predicates and perform de-biasing by alleviating the long-tailed problem. However, the "unbiased" models are in turn biased towards tail predicate classes, resulting in a significant performance loss on head predicate classes. The main cause of this trade-off between head and tail predicates is that multiple predicates, from either the head or the tail classes, can be labeled as the ground truth. To this end, we propose a multi-expert de-biasing method (MED) for SGG that can produce unbiased scene graphs with only a minor influence on the recognition of head predicates. We avoid the dilemma of balancing between head and tail predicates by adaptively classifying the predicates with multiple complementary models. Experiments on the Visual Genome dataset show that MED provides significant gains on mRecall@K without harming the performance on Recall@K, and achieves state-of-the-art results on the mean of Recall@K and mRecall@K.
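
The two metrics named above reward different behaviour, which a small sketch makes concrete. The definitions below follow common usage (Recall@K counts ground-truth triplets found in the top K predictions; mRecall@K averages that recall per predicate class) and are simplified to a single image; in practice the statistics are accumulated over a whole dataset. The triplets are made up.

```python
from collections import defaultdict

def recall_at_k(ground_truth, ranked_predictions, k):
    """Fraction of ground-truth triplets that appear in the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)

def mean_recall_at_k(ground_truth, ranked_predictions, k):
    """Recall computed per predicate class, then averaged over the classes."""
    top_k = set(ranked_predictions[:k])
    hits, totals = defaultdict(int), defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        hits[predicate] += int(triplet in top_k)
    return sum(hits[p] / totals[p] for p in totals) / len(totals)

# Hypothetical (subject, predicate, object) triplets: "on" is a head predicate.
ground_truth = [
    ("man", "on", "horse"), ("hat", "on", "man"),
    ("dog", "on", "grass"), ("man", "riding", "horse"),
]
predictions = [
    ("man", "on", "horse"), ("hat", "on", "man"),
    ("dog", "on", "grass"), ("man", "near", "horse"),
]
r = recall_at_k(ground_truth, predictions, k=3)
mr = mean_recall_at_k(ground_truth, predictions, k=3)
```

A model that only ever predicts the head predicate "on" scores well on Recall@K here but poorly on mRecall@K, which is why a method is expected to hold up on both.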

Multi-View Predicate Recognition for Solving Semantic Ambiguity Problem in Scene Graph Generation

  • Xuezhi Tong
  • Lihua Jing
  • Cong Zou
  • Rui Wang

Recent works on Scene Graph Generation (SGG) have concentrated on solving the problem of long-tailed distribution. While these methods make significant improvements on the tail predicate categories, they severely sacrifice performance on the head ones. The major issue lies in the semantic ambiguity problem, which is the contradiction between the commonly used criterion and the nature of relationships in the SGG datasets. The models are evaluated with graph constraint, which allows only one relationship between a pair of objects. However, relationships are much more complex and can always be described from different views. For example, when a man is in front of a computer, we can also say he is watching it. Both options are plausible, describing different aspects of the relationship. Which of them is chosen as the ground truth is highly subjective. In this paper, we claim that relationships should be considered from multiple views to avoid semantic ambiguity. In other words, the model should provide all the possibilities, rather than being biased towards any one of the options. To this end, we propose Multi-View Predicate Recognition (MVPR), which separates the label set into multiple views and enables the model to represent and predict in a "multi-view" style. Specifically, MVPR can be divided into three parts: Adaptive Bounding Box for Predicate is proposed to help the model attend to the crucial areas for the predicate categories in different views; Multi-View Predicate Feature Learning is designed to separate the feature space of different views of predicate categories; Multi-View Predicate Prediction and Multi-View Graph Constraint are used to allow the model to provide multi-view predictions to accurately estimate ambiguous relationships. Experimental results on the Visual Genome dataset show that MVPR significantly improves model performance on the SGG task and achieves a new state of the art.

Nonword-to-Image Generation Considering Perceptual Association of Phonetically Similar Words

  • Chihaya Matsuhira
  • Marc A. Kastner
  • Takahiro Komamizu
  • Takatsugu Hirayama
  • Keisuke Doman
  • Ichiro Ide

Text-to-Image (T2I) generation has long been a popular field of multimedia processing. Recent advances in large-scale vision and language pretraining have brought a number of models capable of very high-quality T2I generation. However, they are reported to generate unexpected images when users input words that have no definition within a language (nonwords), including coined words and pseudo-words. To make the behavior of T2I generation models against nonwords more intuitive, we propose a method that considers phonetic information of text inputs. The phonetic similarity is adopted so that the generated images from a nonword contain the concept of its phonetically similar words. This is based on the psycholinguistic finding that humans would also associate nonwords with their phonetically similar words when they perceive the sound. Our evaluations confirm a better agreement of the generated images of the proposed method with both phonetic relationships and human expectations than a conventional T2I generation model. The cross-lingual comparison of generated images for a nonword highlights the differences in language-specific nonword-imagery correspondences. These results provide insight into the usefulness of the proposed method in brand naming and language learning.

Language Guidance Generation Using Aesthetic Attribute Comparison for Human Photography and AIGC

  • Xin Jin
  • Qianqian Qiao
  • Qiang Deng
  • Chaoen Xiao
  • Heng Huang
  • Hao Lou

With the proliferation of mobile photography technology, leading mobile phone manufacturers are racing to enhance the shooting capabilities of their equipment and the photo beautification algorithms of their software. However, the development of intelligent equipment and algorithms cannot supplant human subjective photography techniques. Simultaneously, with the rapid advancement of AIGC technology, AI simulation shooting has become an integral part of people's daily lives. If it were possible to assist human photography and AIGC with language guidance, this would be a significant step forward in subjectively improving the aesthetic quality of photographic images. In this paper, we propose Aesthetic Language Guidance of Image (ALG) and present a series of language guidance rules (ALG Rules). ALG is divided into ALG-T and ALG-I based on whether the guiding rules are derived from photography templates or reference images, respectively. Both ALG-T and ALG-I provide guidance for photography based on three image attributes: color, light, and composition. They provide aesthetic language guidance for two types of input images: landscape and portrait. We employ two methods to conduct confirmatory experiments: human photography and AIGC imitation shooting. In the experiments, a comparison of the aesthetic scores of original and modified images shows that our proposed guidance scheme significantly improves the aesthetic quality of photos in terms of color, composition, and lighting attributes.

Responsive Listening Head Synthesis with 3DMM and Dual-Stream Prediction Network

  • Jun Yu
  • Shenshen Du
  • Haoxiang Shi
  • Yiwei Zhang
  • Renbin Su
  • Zhongpeng Cai
  • Lei Wang

In a conversation, it is crucial for the listener to provide appropriate reactions to the speaker, as the dialogue becomes challenging to sustain without the listener's involvement. Consequently, responsive listening head synthesis has become an important task. However, existing methods fail to adequately utilize the audio and video of the speaker to generate listening heads, resulting in unnatural or even distorted generated videos. In this paper, we propose a framework that effectively encodes audio and video features to address this problem. The framework includes a speaker audio encoder, a speaker video encoder, a Dual-Stream Prediction Network, and a rendering network. The audio encoder utilizes a transformer encoder to encode the audio features, enabling better focus on the contextual features of long audio inputs. The speaker video encoder uses a 3D morphable model (3DMM) to extract the speaker's video features. We then fuse the speaker's audio-video features with the listener's identity information extracted by 3DMM to preserve the listener's style. Additionally, a Dual-Stream Prediction Network is introduced to further enhance the prediction capability of the network. Finally, a rendering network generates the listening heads based on the output of the prediction network. Comprehensive experiments demonstrate that our approach is capable of generating responsive listening heads with higher visual quality, better naturalness, and higher reconstruction fidelity.