The development of generative models for text-to-image synthesis has advanced significantly. Recently, these models have found wide success in AI art, since they allow artists and designers to create images of excellent visual quality automatically from descriptive text.
Several works in the literature deal with synthetic image generation. Although Generative Adversarial Networks (GANs) effectively sample high-resolution images with acceptable perceptual quality, they are challenging to tune and have trouble capturing the whole data distribution. On the other hand, the emphasis on accurate density estimation in likelihood-based approaches makes optimization more controlled. Several models in the field of text-to-image synthesis have demonstrated the potential to assist artists in producing new works of art and have driven the explosive expansion of the AI-art sector. Due to their high computational requirements, however, these models can currently only be applied to the tasks for which they were initially developed. Recently, authors from Germany proposed a new approach to training an accessible and controllable retrieval-augmented diffusion model (RDM) that creates new images of visual art in a text-to-image synthesis fashion.
The proposed approach combines a relatively small generative model with a vast image database to dramatically reduce the computational complexity required during training. The authors also suggest exploiting CLIP's shared text-image feature space to guide the synthesis process with text prompts. Since CLIP provides a shared image/text feature space, and RDMs learn to cover a neighborhood of a query image during training, it is possible to take the CLIP text embedding of a given prompt and condition on it directly. Following this strategy, they obtain a controllable synthesis model that is trained on image data only.
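The key trick above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the `embed_text` function and the random database stand in for CLIP's real text encoder and a database of CLIP image embeddings, which the actual pipeline would compute with a pretrained CLIP model. It shows why a shared, L2-normalized feature space lets a text embedding serve directly as the retrieval query and as conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for CLIP's text encoder: in the real pipeline this
# would be CLIP's text tower, which maps prompts into the same feature space
# as the image tower.
def embed_text(prompt: str, dim: int = 512) -> np.ndarray:
    vec = rng.standard_normal(dim)      # placeholder for CLIP text features
    return vec / np.linalg.norm(vec)    # CLIP features are L2-normalized

# Stand-in for a database of CLIP image embeddings, one row per training image.
database = rng.standard_normal((10_000, 512))
database /= np.linalg.norm(database, axis=1, keepdims=True)

def retrieve_neighbors(query: np.ndarray, db: np.ndarray, k: int = 4) -> np.ndarray:
    # On L2-normalized vectors, cosine similarity is just a dot product.
    sims = db @ query
    top_k = np.argsort(sims)[-k:][::-1]
    return db[top_k]

# Because text and images share one feature space, the text embedding can be
# used directly as the retrieval query -- and as conditioning for the RDM,
# even though the model was trained on image data only.
query = embed_text("an oil painting of a lighthouse at dawn")
conditioning = np.vstack([query[None, :], retrieve_neighbors(query, database)])
print(conditioning.shape)  # (5, 512): the query plus k=4 retrieved neighbors
```

The diffusion model itself (omitted here) would then denoise an image while cross-attending to these conditioning vectors.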
After training, the original RDM's train dataset, used to train the model, is swapped out for an alternative "style dataset" drawn from art collections, producing a post-hoc model modification and, as a result, zero-shot stylization. The style dataset steers the generated image toward a specific visual style.
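The database swap can be made concrete with a small sketch. Again this is an assumption-laden illustration, not the paper's code: the two randomly generated databases stand in for embeddings of a training corpus and of an art collection. The point is structural: only the retrieval stage's database changes, while the generative model and its weights are untouched, so no retraining is needed.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_db(n: int, dim: int = 512) -> np.ndarray:
    """Mock database of n L2-normalized embeddings (placeholder for CLIP features)."""
    db = rng.standard_normal((n, dim))
    return db / np.linalg.norm(db, axis=1, keepdims=True)

class RetrievalStage:
    """Nearest-neighbor lookup over an exchangeable embedding database.

    Swapping `self.db` post hoc changes which neighborhoods condition the
    synthesis -- the mechanism behind zero-shot stylization.
    """
    def __init__(self, db: np.ndarray):
        self.db = db

    def swap_database(self, new_db: np.ndarray) -> None:
        self.db = new_db  # no retraining of the generative model required

    def retrieve(self, query: np.ndarray, k: int = 4) -> np.ndarray:
        sims = self.db @ query
        return self.db[np.argsort(sims)[-k:][::-1]]

train_db = make_db(5_000)   # stands in for train-time image embeddings
style_db = make_db(2_000)   # stands in for an art-collection "style dataset"

stage = RetrievalStage(train_db)
q = make_db(1)[0]           # a query embedding
before = stage.retrieve(q)
stage.swap_database(style_db)   # post-hoc replacement at inference time
after = stage.retrieve(q)
# Same query, different neighbors: the conditioning now comes from art images.
```

In the paper's setup, `style_db` would hold embeddings of WikiArt or ArtBench images rather than random vectors.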
To demonstrate the efficiency of the proposed approach, the authors trained two models. The first RDM is trained on a dataset from OpenImages; at inference, a style dataset drawn from the WikiArt image database is used to achieve stylization. The second RDM, larger than the first, is trained on 100M examples from LAION-2B-en, with a style dataset taken from ArtBench at inference. The results demonstrate that RDMs can be used for fine-grained stylization without any additional training, and the provided sample images illustrate the style-specific stylization capabilities of the suggested approach.
This article introduces a novel method for developing controllable and accessible models of visual art. The approach is controllable because it enables the specification of a desired visual style through post-hoc replacement of the external database, which the experiments show to be a potent substitute for purely text-based methods.
This article is written as a research summary by Marktechpost Staff based on the research paper 'Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models'. All credit for this research goes to the researchers on this project. Check out the paper and the GitHub link.
Mahmoud is a PhD researcher in machine learning. He also holds a bachelor's degree in physical science and a master's degree in telecommunications and networking systems. His current areas of research concern computer vision, stock market prediction and deep learning. He has produced several scientific articles about person re-identification and the study of the robustness and stability of deep