Research Paper
Abstract
Fashion designers encounter numerous difficulties when crafting in-vogue fashion designs. As a result, many fashion-design productions never reach the market at scale and fall below standard. These problems affect the demand and supply chain of the fashion design industry.
These problems can be addressed through text-to-image synthesis: the process of transforming text descriptions into high-quality two-dimensional images. After critically analyzing the existing systems, the author found that transforming text descriptions containing a large number of words into images has not yet been addressed in the text-to-image synthesis domain. The author has therefore decided to bridge the gap by building a novel algorithm using attentional generative adversarial networks (AttnGAN) ensembled with a contrastive learning approach to synthesize fashion-design-based descriptions into high-quality fashion designs.
This system is developed using deep learning, following a multi-level architecture of GAN networks. An image-text encoder is simulated and trained to emphasize the words provided in a text description and make them semantically consistent with the images. Additionally, during training, a contrastive loss between image and text is computed to minimize the distance between textual descriptions related to the same image and maximize the distance between those related to different images. An AttnGAN network is employed to train on the text descriptions. After training for a maximum of 800 epochs, the GAN model was able to generate images for a variety of text descriptions across the classes Shirts, Trousers, Blazers, Shorts & Tops, and Dresses. Also, with the use of an ESRGAN trained on the FashionGen dataset, the final generated image was of good resolution.
Subject Descriptors
Acknowledgment
It is a great honor to have completed successful research in the text-to-image synthesis domain by generating images from descriptions containing numerous words. It would not have been feasible
without the great assistance, directions, and information supplied by numerous people I met along the
way. But it was my supervisor, Mr. Guhanathan Poravi, who helped me finish my project. It was thanks
to his critiques, suggestions, and research viewpoints that I was able to perform proper research. Thank
you to all industry and academic experts who took the time to listen to my research presentations and
provide extensive feedback. The teachers, administration, and colleagues who helped me with my final
year project are greatly appreciated. Finally, but most importantly, I want to thank my parents, who
were always there to help me succeed.
Contents
Declaration............................................................................................................................................................. i
Abstract................................................................................................................................................................. ii
Acknowledgment ................................................................................................................................................. iii
1. INTRODUCTION............................................................................................................................................ 1
1.1 Chapter Overview ...................................................................................................................................... 1
1.2 Problem Domain ........................................................................................................................................ 1
1.2.1 Generative adversarial networks ....................................................................................................... 1
1.2.2 Text-to-Image Synthesis ..................................................................................................................... 1
1.2.3 In-Vogue Fashion Design Industry and Problems Encountered .................................................... 2
1.3 Problem Definition ..................................................................................................................................... 2
1.3.1 Problem Statement .............................................................................................................................. 3
1.4 Research Motivation .................................................................................................................................. 3
1.5 Existing Work............................................................................................................................................. 3
1.6 Research Gap ............................................................................................................................................. 5
1.7 Contribution to The Body of Knowledge ................................................................................................. 5
1.7.1 Technical Contribution....................................................................................................................... 6
1.7.2 Domain Contribution .......................................................................................................................... 6
1.8 Research Challenges .................................................................................................................................. 6
1.9 Research Questions .................................................................................................................................... 6
1.10 Research Aim ........................................................................................................................................... 7
1.11 Research Objectives ................................................................................................................................. 7
1.12 Project Scope ............................................................................................................................................ 9
1.12.1 In-Scope ............................................................................................................................................. 9
1.12.2 Out-Scope......................................................................................................................................... 10
1.12.3 Diagram Explaining Prototype Feature Diagram ................................................................... 10
1.13 Document Structure............................................................................................................................... 11
1.14 Chapter Summary ................................................................................................................................. 11
2. LITERATURE REVIEW ............................................................................................................................. 12
2.1 Chapter Overview .................................................................................................................................... 12
2.2 Concept Graph ......................................................................................................................................... 12
2.3 Problem Domain ...................................................................................................................................... 12
2.3.1 Introduction to the in-Vogue Fashion Design Industry ................................................................. 12
7. IMPLEMENTATION ................................................................................................................................... 61
7.1 Chapter Overview .................................................................................................................................... 61
7.2 Technological Selection ........................................................................................................................... 61
7.2.1 Technological Stack .......................................................................................................................... 61
7.2.2 Data Selection .................................................................................................................................... 61
7.2.3 Selection of Development Framework ............................................................................................ 62
7.2.4 Programming Language ................................................................................................................... 62
7.2.5 Libraries Utilized .............................................................................................................................. 62
7.2.6 IDE’s Utilized .................................................................................................................................... 62
7.2.7 Summary of Technology Selection .................................................................................................. 63
7.3 Implementation of Core Functionalities ................................................................................................ 63
7.3.1 Core Research Contribution ............................................................................................................ 63
7.3.2 System Benchmarking Algorithms .................................................................................................. 66
7.4 Implementation of APIs........................................................................................................................... 67
7.5 Chapter Summary ................................................................................................................................... 67
8. TESTING ........................................................................................................................................................ 68
8.1 Chapter Overview .................................................................................................................................... 68
8.2 Objectives and Goals of Testing ............................................................................................................. 68
8.3 Testing Criteria ........................................................................................................................................ 68
8.4 Model Evaluation ..................................................................................................................................... 68
8.5 Benchmarking .......................................................................................................................................... 71
8.6 Functional Testing ................................................................................................................................... 71
8.7 Module Integration Testing..................................................................................................................... 73
8.8 Non-Functional Testing ........................................................................................................................... 74
8.8.1 Accuracy ............................................................................................................................................ 74
8.8.2 Performance ...................................................................................................................................... 74
8.8.3 Security .............................................................................................................................................. 75
8.8.4 User Friendliness ............................................................................................................................... 75
8.9 Limitations of the Testing Process .......................................................................................................... 76
8.10 Chapter Summary ................................................................................................................................. 76
9. EVALUATION .............................................................................................................................................. 77
9.1 Chapter Overview .................................................................................................................................... 77
9.2 Evaluation Methodology & Approach ................................................................................................... 77
9.3 Evaluation Criteria .................................................................................................................................. 77
List of Figures
Figure 1.1 - Prototype Feature Diagram (Self-Composed) ................................................................... 10
Figure 2.1 - Proposed Architecture (Self- Composed) .......................................................................... 16
Figure 4.1 - Rich Picture of the System (Self-Composed) .................................................................... 35
Figure 4.2 - Stakeholder Onion Model (Self-Composed) ..................................................................... 36
Figure 4.3 - Context Diagram (Self-Composed) ................................................................................... 47
Figure 4.4 - Use case Diagram (Self-Composed) .................................................................................. 48
Figure 6.1 - Tiered Architecture (Self-Composed) ............................................................................... 55
Figure 6.2 - Component Diagram (Self-Composed) ............................................................................. 58
Figure 6.3 - Sequence Diagram (Self-Composed)................................................................................. 58
Figure 6.4 - Class Diagram (Self-Composed) ....................................................................................... 59
Figure 6.5 - System Process Flowchart (Self-Composed) ..................................................................... 60
Figure 7.1 - Contrastive Loss Function ................................................................................................. 64
Figure 7.2 - Training GAN Network ..................................................................................................... 65
Figure 7.3 - Inception Score Calculation ............................................................................................... 66
Figure 7.4 - R-Precision Calculation ..................................................................................................... 66
Figure 7.5 - API Route (Generate Fashion Design) .............................................................................. 67
Figure 7.6 - API Route (Download Image) ........................................................................................... 67
Figure 8.1 - Inception Score .................................................................................................................. 69
Figure 8.2 - R-Precision......................................................................................................................... 70
Figure 8.3 - DAMSM Model ................................................................................................................. 70
Figure 8.4 - Benchmarking of Existing Systems ................................................................................... 71
Figure 8.5 - GPU Performance .............................................................................................................. 75
Figure 8.6 - Usability of the Prototype .................................................................................................. 76
Figure 9.1 - Presented Solution Evaluation ........................................................................................... 83
Figure 9.2 - Solution to the identified Problem Evaluation ................................................................... 83
Figure 9.3 - Evaluation metrics Evaluation ........................................................................................... 84
Figure 9.4 - Accuracy of the Prototype Evaluation ............................................................................... 84
List of Tables
Table 1.1 - Existing Works ...................................................................................................................... 5
Table 1.2 - Research Objectives .............................................................................................................. 9
Table 3.1 - Research Methodology........................................................................................................ 31
Table 3.2 - Project Deliverables ............................................................................................................ 32
Table 3.3 - Hardware Requirements ...................................................................................................... 33
Table 3.4 - Software Requirements ....................................................................................................... 33
Table 3.5 - Risk Management................................................................................................................ 34
Table 4.1 - Stakeholder Viewpoints ...................................................................................................... 38
Table 4.2 - Selection of Requirement Elicitation Methods ................................................................... 39
Table 4.3 - Findings through Literature Review ................................................................................... 40
Table 4.4 - Survey Findings .................................................................................................................. 45
Table 4.5 - Brainstorm Findings ............................................................................................................ 46
Table 4.6 - Summary of Findings .......................................................................................................... 47
Table 4.7 - Use case Description (Input Description) ........................................................................... 49
Table 4.8 - Use case Description (Download Image Output)................................................................ 50
Table 4.9 - MoSCoW Techniques ......................................................................................... 50
Table 4.10 - Functional Requirements .................................................................................................. 51
Table 4.11 - Non-Functional Requirements ........................................................................... 52
Table 5.1 - SLEP Issues & Mitigations ................................................................................................. 54
Table 6.1 - Choice of Design Paradigm ................................................................................................ 57
Table 7.1 - Summary of Technology Selection ..................................................................................... 63
Table 8.1 - Quantitative Test Results .................................................................................................... 70
Table 8.2- Functional Testing ................................................................................................................ 73
Table 8.3 - Module Integration Testing ................................................................................................. 74
Table 9.1 - Evaluation Criteria .............................................................................................................. 78
Table 9.2 - Self-Evaluation .................................................................................................................... 80
Table 9.3 - Research Concept of the Project ......................................................................................... 80
Table 9.4 - Novelty of the Project Evaluation ....................................................................................... 81
Table 9.5 - Proposed Architecture of the Project Evaluation ................................................................ 82
List of Abbreviations
Abbreviation Meaning
AI Artificial Intelligence
CNN Convolutional Neural Network
GAN/s Generative Adversarial Network/s
GPU Graphics Processing Unit
GUI Graphical User Interface
LR Literature Review
LSTM Long Short-Term Memory
OS Operating System
SSADM Structured Systems Analysis and Design Method
DAMSM Deep Attentional Multimodal Similarity Model
SOTA State of The Art
IS Inception Score
LO Learning Outcome
FR Functional Requirement
NFR Non-Functional Requirement
RNN Recurrent Neural Network
SLEP Social, Legal, Ethical and Professional
NLP Natural Language Processing
RO Research Objectives
VAE Variational Autoencoder
CPU Central Processing Unit
FID Fréchet Inception Distance
1. INTRODUCTION
1.1 Chapter Overview
The introduction chapter gives a comprehensive overview of the entire research project. The background is covered first, followed by the problem domain and definition. Following that, the research contributions, aims, and objectives are all addressed in detail. Overall, the chapter thoroughly discusses the problem, the research domain, and the author's motivation for conducting the research.
Text-to-image synthesis has been applied in modern multimodal applications to facilitate the generation of images from a wide variety of textual inputs such as keywords and phrases. It has been used as a generic approach in CAD design, scientific engineering, graphic design, image fine-tuning, and even animation (Zaidi, 2017). However, these transformations have only produced outputs for simple phrases, clauses, and pre-defined text trained against the image. At their core, generative adversarial networks (GANs) (Goodfellow et al., 2014) have successfully demonstrated learning probability distributions to synthesize realistic example images from textual descriptions, and they have been used widely in the text-to-image domain. Recent progress in generative models, especially GANs in many forms, has brought significant improvements in synthesizing plausible images. However, success has only been proven for text descriptions spanning fewer than roughly 15-20 words that use only terms specific to the domain. When a larger span of text, such as storylines, paragraphs, or detailed descriptions, is given as input, GANs have not been able to produce versatile, high-quality images. In simple terms, only fine-grained texts are viable inputs for producing domain-specific images (Nasr, Mutasim and Imam, 2021). The current model improves the original GAN Inception Score by +4.13% and the Fréchet Inception Distance score by +13.93% when benchmarked against the CIFAR-100 dataset, which is still inadequate for generating images from a large text description (Cheng and Gu, 2020a).
• Analyzing and choosing the appropriate technology to solve the problem of taking larger descriptions of substantial size as input and synthesizing them into high-quality images.
• Designing and developing a framework that will overcome limitations such as mode collapse, blurry images, and inconsistency between visual and textual semantics incurred by the existing SOTA text-to-image synthesis models.
• Analyzing and choosing the best unsupervised text-to-image synthesis approach with contrastive learning to synthesize textual descriptions into high-quality two-dimensional images.
• Using techniques and identifying the appropriate tools, libraries, and technologies to develop the text-to-image synthesis model that will accept larger text descriptions.
• Analyzing and identifying tools and technologies to train multi-language-based text-to-image synthesis models.
RQ2: What are the current problems being faced and areas that should be improved when developing
text-to-image synthesis models to generate better-detailed images for the descriptions given?
RQ3: What are the current problems being faced and areas that should be improved when developing
text-to-image synthesis models to generate better-detailed images for the descriptions given?
RQ4: What are the challenging problems that need to be solved when transforming text to images?
RQ5: What existing text to image synthesis models can be used as a partial resolution to solve the
challenging problems faced when transforming text to images?
To further elaborate on the aim, this research project will produce a text-to-image synthesis model for fashion designers. The system will allow stakeholders to input descriptions explaining the desired in-vogue fashion style they wish to visualize, and it will produce high-quality two-dimensional images with a high level of detail. The outputs of state-of-the-art models in the domain will be compared using proper evaluation metrics. The research domain and the technological body of knowledge will be properly researched before proceeding to the development of this project. The knowledge gained will be used in developing the components needed for the model and several other areas required to achieve the outcome. The system will be open source and able to run on any device, including mobile and desktop.
Literature Review (LO1, LO4, LO8; RQ2, RQ3)
Conduct extensive research to evaluate how to achieve the target outcome.
RO1: Research and analyze existing text-to-image synthesis models.
RO2: Analyze techniques and technologies used to transform text to images.
RO3: Research how the current text-to-image synthesis models have been applied to different industries.
RO4: Elaborate the research gap applied to the fashion design industry and the training of larger spans of text in three natural languages.
RO5: Provide an analysis document on the critically evaluated systems.

Data Gathering and Analysis (LO3, LO4, LO6, LO8; RQ3)
Carry out a requirement gathering analysis.
RO1: Gather feedback on building a text-to-image transformation model that will take larger spans of text in three natural languages, applied for the first time to the fashion design domain.
RO2: Evaluate the requirements gathered to develop a GAN-based network that will allow the transformation of text to images.
RO3: Gather, evaluate, and define the end-user requirements through questionnaires.

Design & Implementation (LO2, LO5, LO7, LO8; RQ4, RQ5)
Plan the timeline and design a text-to-image synthesis model that takes larger spans of fashion-based textual descriptions.
RO1: Design and develop a DAMSM image-text encoder network.
RO2: Design and develop the AttnGAN model architecture ensembled with contrastive learning, using DL to combine the image and text vectors into batch layers.
RO3: Design and develop the frontend and backend for the system.

Testing and Evaluation (LO8, LO9; RQ3, RQ4, RQ5)
Evaluation of generative models includes the inception score and visual quality comparisons. Test and evaluate the prototype.
RO1: Evaluate and test the created text-to-image model and compare it against current existing models.
RO2: Create a test plan to perform unit, integration, and functional requirement testing of the prototype.
RO3: Produce a detailed report for the academic and research community.

Table 1.2 - Research Objectives
1.12.1 In-Scope
Parts that will be focused on during the research process are as follows:
• Reviewing and analyzing the SOTA generative adversarial network models applied to the text-to-image domain and other text-to-image synthesis models.
• Deciding on a proper approach to transform larger spans of text into a detailed image with all the features mentioned in the text.
• Effectively reading in-vogue fashion-based text descriptions, mapping each feature word from the text to an image, and finally generating an in-vogue fashion design image with all the relevant details drawn from the text.
• Evaluating the results of the system based on common evaluation metrics.
• Developing a full-stack application that will take in-vogue fashion text descriptions and generate a two-dimensional high-quality image.
1.12.2 Out-Scope
Parts that will not be focused on during the research process are as follows:
• The project will be limited to taking only in-vogue fashion-based textual descriptions; the model will only be effective for fashion-related texts and other supporting words used to generate in-vogue fashion design images.
• The project's focus is only to extend the ability to train larger spans of textual description and to work with three natural languages, not to generate videos from text or to generate images outside of the fashion design scope.
A thorough literature review of the research, existing systems, and possible evaluation metrics is presented in Chapter 2.
All of the approaches utilized in research, software development, and project management are described in Chapter 3.
The prototype's functional and non-functional requirements are discussed in detail in Chapter 4, using the requirement-elicitation techniques selected in that chapter's software requirement specification.
The possible social, legal, ethical, and professional concerns that may arise throughout the course of the project, and the steps that can be taken to mitigate those issues, are discussed in detail in Chapter 5.
Chapter 6 looks at how the gathered literature and requirements are used to create a system design that both solves the identified problem and fills the research gap.
Chapter 7 explains how the design decisions were turned into a working software implementation.
An in-depth examination of the software testing procedures carried out to ensure the system meets the required standard is provided in Chapter 8.
Chapter 9 explains how the system was evaluated using a variety of quantitative and qualitative methods.
Chapter 10 wraps up the research and reflects on the project's achievements.
2. LITERATURE REVIEW
2.1 Chapter Overview
Text-to-image synthesis has received significant exposure in the recent past. Many of the approaches followed by different researchers have been experimented with, examined, and evaluated to show the successes and failures this domain produces. Critically analyzing the existing work to understand how these prototypes function is beneficial for continuing research in this domain. This chapter presents an analysis of the existing work in the text-to-image synthesis domain. The pros and cons of the various approaches, reviewed in terms of algorithms supported, features, and implementations, are explored.
based on global linguistic representation. However, due to the sparsity of the global representation, GAN training is challenging and the resulting images lack fine-grained information. As of now, the majority of images are generated from descriptions that are merely a few sentences long (Hu, Long and Xiao, 2021). The promising aspect of text-to-image synthesis, however, is the ability to generate high-quality synthesized images using different techniques, which is beneficial in various ways (Li et al., 2020a).
The art of generating images from text using AI and deep learning has recently received a lot of attention (Bodnar, 2018b). The results of several text-to-image synthesis applications have attracted attention from a variety of interested stakeholders. The fashion design industry expends a great deal of in-house energy, additional raw materials, human effort, and time in manually designing in-vogue fashion designs. To avoid these hassles, a fashion-related system belonging to the text-to-image synthesis domain can resolve the issue of manually creating in-vogue fashion designs. With just a few pieces of textual description combined with technical intervention, high-quality, unique, two-dimensional synthesized fashion designs can be generated.
single language still exists. To overcome this, the author has chosen to build a system which can process a substantial amount of text and generate unique two-dimensional images in multiple languages, which will be a welcome turning point in the text-to-image synthesis domain.
they are connected. The decoder is the generating model, while the encoder is the recognition model. In terms of approach, these two models are complementary to one another (Grønbech et al., 2020). A standard autoencoder converts the input to a fixed vector, whereas a VAE converts it to a distribution. A sample z is drawn from a prior distribution p(z); then, using the conditional distribution p(x | z), x is generated. The process can be described mathematically as

pθ(x) = ∫ pθ(x | z) pθ(z) dz
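To make this concrete, below is a minimal, illustrative PyTorch sketch of a VAE (not the thesis implementation): the encoder predicts the parameters of q(z | x), a latent z is sampled via the reparameterisation trick, and the decoder models p(x | z). The layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE sketch: encoder -> q(z|x), decoder -> p(x|z)."""
    def __init__(self, x_dim: int = 784, z_dim: int = 16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # outputs mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)      # reconstructs x from the latent z

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterisation
        return self.dec(z), mu, log_var
```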
evaluation, the SOTA system, AttnGAN, which is based on attention-driven mechanisms, has reached an Inception Score improvement of +4.3% and an FID improvement of +13.93%.
GANs began to thrive in text-to-image synthesis problems after their creative power was unlocked by Goodfellow et al. (2014a). As a result of the high-quality synthetic outputs created by GANs, their performance in such text-to-image tasks is rated substantially higher than that of other generative models.
image synthesis, many systems were built using GANs (Arjovsky, Chintala and Bottou, 2017). Given a dataset consisting of text and images as pairs, a generative model extracts features from the words using a text encoder and generates samples from neural networks at a fine-grained level (Parihar et al., 2020). Based on the training weights of the generative models, it reproduces realistic data distributions that do not exist in the real world. As a result, the model is forced to explore and extract meaningful conceptual representations of two-dimensional images. The main advantage of unsupervised learning is the ability of generative models to generate samples from unlabeled data (Bekesh et al., 2020). CGAN and DC-GAN were both built on top of the original unsupervised text-to-image synthesis approach, where the CGAN takes into account extra information such as the class name, file name, and textual description mapped to the dataset (Jin et al., 2019), while the DC-GAN uses a deep convolutional network for the generation of realistic synthetic images.
Since word embedding transforms raw data (text document characters) into a meaningful arrangement of word vectors in the embedding space, it is a widely used text-encoding strategy. Word-embedding techniques collect information from the pattern and occurrence of words and go further than typical token-modeling approaches in decoding and identifying the meaning and context of the words, allowing the model to solve the underlying problem with more critical and valuable features. When the model is trained, these
word embeddings are used as machine-readable inputs along with the image (Levy and Goldberg, 2014). In our case, we will be using a character-level text encoder (S. Reed et al., 2016) that encodes raw fashion terms into 1024-dimensional embeddings. These embeddings will be concatenated with the universal Gaussian latent-space z-dim distribution and passed as a pair with the encoded image to the model.
Text to Image Synthesis Using Generative Adversarial Networks (Bodnar, 2018b) – For the
first time, text-to-image synthesis research has been conducted using unsupervised learning techniques.
The Wasserstein T-M model was also referred to as the Wasserstein GAN-CLS framework for this
research. The Wasserstein distance similarity was also the first unsupervised learning approach for
conditional image generation. Preliminary results show that the Wasserstein GAN-CLS loss function is
stable. With the help of deep learning, images were generated by the generator and then compared to
the discriminator's output to see which one performed better. However, the images were blurred and
this first attempt at unsupervised learning failed to demonstrate consistency.
DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis (Tao
et al., 2020) – This model was introduced as a new approach to the text-to-image synthesis domain that
uses a single self-supervised GAN network. This novel approach follows a one-stage network model
having a single generator and discriminator for the generation of images from text descriptions. A novel
fusion-based module called the deep text-image fusion block deepens the text-image fusion process in the generator, ensembled with a novel target-aware discriminator composed of a matching-aware gradient penalty and a one-way output, which promotes the generator to generate more realistic images while maintaining semantic consistency between the image and the text without the use of
additional networks. This GAN was solely introduced to decrease the training time and avoid mode
collapse from the discriminators whilst maintaining the generation of high-quality synthetic images.
This model was evaluated on the flower and bird datasets, and it showed a remarkable inception score
of 12.2%.
FA-GAN: Feature-Aware GAN for Text to Image Synthesis (Jeon, Kim and Kim, 2021) –
The FA-GAN was inspired by the StackGAN. This system uses unsupervised learning with a focus on the adversarial training loss, and was built using a deep learning architecture.
Combining a self-supervised generator, discriminator, and feature-aware loss, it generates images from
text. The auxiliary decoder for the self-supervised discriminator model provided a feature-aware loss,
which the self-supervised generator used to better represent features in the image. Images were created
using text keywords. Using the MS-COCO dataset, the suggested model reduces the current FID score
from 28.92 to 24.58.
The Inception Score (IS) is an evaluation metric that shows good correlation with human judgment. However, IS does not capture image attributes that indicate a text-to-image synthesis method's capacity to appropriately express the semantics of the input text description (Sommer and Iosifidis, 2020). As a result of its good correlation with subjective human judgment, it is the most commonly used statistic for evaluating image quality. This statistic, on the other hand, only takes into account the quality and variety of the generated images. The technique performs well for unrestricted image generation but fails to capture essential textual information when the task is conditioned on classes or descriptions. A pre-trained Inception-v3 network is used to evaluate the generated images and produce a conditional label distribution (Frolov et al., 2021).
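For clarity, the sketch below shows the standard Inception Score computation from the conditional label distributions p(y|x) predicted by a pre-trained Inception-v3. It is a simplified illustration (no split averaging), not the exact calculation shown in Figure 7.3.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (N, num_classes) softmax outputs of Inception-v3 for N generated images.
    IS = exp( E_x[ KL( p(y|x) || p(y) ) ] )."""
    p_y = probs.mean(axis=0, keepdims=True)                               # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)  # KL per image
    return float(np.exp(kl.mean()))
```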
In terms of features extracted by a pre-trained network, the FID quantifies the difference between the distributions of real and generated images. The FID is much more consistent than the IS in analyzing GANs and detects more types of disruptions (Zhang et al., 2021). As with the IS, the FID is computed from real and generated image samples using a pre-trained Inception-v3 model's last pooling layer to acquire visual attributes. To calculate the FID score, the activations of real and generated images are assumed to follow multivariate Gaussian distributions (Frolov et al., 2021).
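For reference, with (μr, Σr) and (μg, Σg) denoting the mean and covariance of the real and generated Inception-v3 pooling features respectively, the FID is computed as

FID = ||μr − μg||² + Tr(Σr + Σg − 2(Σr Σg)^(1/2))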
R-Precision
Using R-precision, we can measure the visual-semantic similarity between generated images and their text descriptions. In addition to the ground-truth caption from the test dataset, additional mismatched captions are randomly chosen from the dataset. The metric is computed by calculating the cosine similarity between a visual feature and the text embedding of each candidate caption, then sorting the captions by decreasing cosine similarity (Frolov et al., 2021).
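The sketch below illustrates this ranking procedure under a simplified protocol (a fixed set of candidate captions per image, with the ground-truth caption assumed to sit at index 0); the array shapes are assumptions for illustration only.

```python
import numpy as np

def r_precision(image_feats: np.ndarray, caption_feats: np.ndarray, r: int = 1) -> float:
    """image_feats: (N, d); caption_feats: (N, C, d) candidate caption embeddings
    per image, with the ground-truth caption assumed to be at index 0."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    cap = caption_feats / np.linalg.norm(caption_feats, axis=-1, keepdims=True)
    sims = np.einsum('nd,ncd->nc', img, cap)        # cosine similarity per candidate
    top_r = np.argsort(-sims, axis=1)[:, :r]        # indices of the r most similar captions
    return float(np.mean([0 in row for row in top_r]))
```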
Text-to-image synthesis refers to the process of creating a two-dimensional image from a textual description in any natural language. Picturing an image in the mind's eye from a written description is an apparently straightforward feat for humans. At the same time, it is one of the most difficult topics in the fields of natural language processing and computer vision, and it has received a lot of attention in recent years (Zhou, Jiang and Xu, 2021). The text-to-image synthesis systems developed over the years can be categorized into two main sub-systems, which are discussed briefly below.
Text-to-Picture Synthesis System for Augmenting Communication (Zhu et al., 2007) - The first supervised learning system to create images from text was proposed. It converts text into speech, but also displays the text's meaning. Natural language processing, computer vision, computer graphics, and machine learning were used to create the system. To add images to text, these components are first combined to identify picture and textual units. Combining image and text creates a full image that can be viewed from any angle. Newspapers and children's books were used to test this method. In experiments, blurry images were generated, and only 30% of its images were successful. The training images helped, with a few simple modifications made to the images.
“CookGAN”: Meal Image Synthesis from Ingredients (Han, Guerrero and Pavlovic, 2020)
– This system was built using the model “StackGAN V2", which addresses text-to-image from a
completely different perspective. This system focuses more on the visual effects that are depicted in the
image and preserves fine-grained details and progressively up-samples the images. Textual descriptions
of the image are fed into a simulator network, which makes instantaneous modifications to its
appearance. It has a cycle-constant limitation as well, to enhance image quality and maintain a
consistent appearance. Food chain image generation was the only use for this technology because it
used a visual representation of the ingredients rather than a written means of displaying them.
Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions (Nasir et al.,
2019) - Textual descriptions of facial features are used to generate face models in the system. CelebA
dataset is used for the purpose of building an algorithm that automatically generates images with a list
of attributes. The text-to-face generation problem is then modeled as learning the distribution of faces
(conditioned on text) in the same latent space. An advanced version of GAN is used for conditional
multi-modality learning in this system (DC-GAN with GAN-CLS loss). It is necessary to switch the
labels for actual and fake images, and then introduce noise to the discriminator as a result. Using skip
thought vectors, the text is encoded before being provided to the generator as a tuple along with the
image. The results of generated images for various textual descriptions are impressive, based on the
final training.
Recipe2Image (El, Licht and Yosephian, 2018) - It presents a novel method of creating images
from written descriptions that do not directly describe the visual content of the image. They achieve
this by creating a system that uses their recipes to create images of food with a resolution of 256*256
pixels (or higher). It's unclear how the recipe's visual content relates to its accompanying text because
recipes have a complex language structure, with two segments (ingredients and instructions) each
including several words and expressions. A Stacked Generative Adversarial Network with two recipe
embeddings computed to construct food images conditioned on their recipes forms a baseline for this
challenge, which is a first step in solving this problem. REG is used for the embedding of the ingredients
and the cooking instructions. As a result, a common space is created by concatenating the embeddings
with the image and comparing their cosine similarity to the recipe images. Recipes are then sent into a
two-stage Stack GAN network, which generates the final recipe images.
Dall-E: Zero-Shot Text-to-Image Generation (Ramesh et al., 2021) – The system is trained
using a transformer (Vaswani et al., 2017) to autoregressively model the text and image tokens as a
single stream of data. They've used a two-stage training approach. First, a discrete variational
autoencoder (dVAE) is trained to reduce each 256*256 RGB image to a 32*32 image token grid size.
After concatenating up to 256 BPE-encoded text tokens with the 32×32 = 1024 image tokens, they train an autoregressive transformer to model the joint distribution over the text and image tokens before creating an image. In-scope objects that DALL-E generates include furniture, people, flowers, and birds, and the system was built with the goal of promoting open AI research.
Text to Fashion image synthesis (Ak et al., 2020) – For text-to-image synthesis, the system
uses an e-AttnGAN with greater training stability. AttnGAN's attention module combines phrase and
word context features with feature-wise linear modulation (FiLM) to merge visual and natural-language representations. Also provided are similarity and feature-matching losses between real and generated pictures, as well as classification losses for "relevant characteristics" in AttnGAN's
multimodal similarity learning. To improve training stability and prevent mode collapse, the
discriminator uses spectral normalization, a two-time scale updating technique, and instance noise. An
LSTM network extracts word and sentence features from text. The hierarchical architecture uses
upsampling layers to build a low-resolution image from the phrase feature. FiLM-ed ResBlocks
integrate linguistic and image data to create higher-resolution images. The architecture includes feature
matching, cosine similarity, and classification losses to improve text-to-image synthesis. e-AttnGAN
surpasses state-of-the-art techniques in inception score, R-precision, and classification accuracy using
FashionGen and DeepFashion-Synthesis datasets.
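To clarify how FiLM-style conditioning works in this kind of architecture, the following is an illustrative PyTorch sketch (not the e-AttnGAN implementation): a sentence embedding predicts per-channel scale and shift parameters that modulate the image feature maps.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: text-conditioned scale (gamma) and shift (beta)."""
    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(text_dim, 2 * num_channels)

    def forward(self, feat_maps: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(sent_emb).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) for broadcasting
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feat_maps + beta             # modulated feature maps (B, C, H, W)
```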
and are amongst the principal technologies; GANs (Goodfellow et al., 2014a) and image-generative models with autoencoders (Kingma and Welling, 2019) were primarily used. (Zhang et al., 2017) started the work on unsupervised learning using a multi-stage GAN framework, in which the text and image were trained as pairs. Following this technology, many existing systems were built on top of it. (Han, Guerrero and Pavlovic, 2020) built a system for visualizing cooking ingredients and their instructions. This system was developed using StackGAN-V2, which uses a two-stage model architecture. The text is feature-extracted using a char-CNN-RNN model and concatenated with a noise dimension before being passed to the two-stage GAN model. The system worked at a fine-grained level, paying attention to the highly correlated words in the text description when generating the image, such that the image proved to be semantically consistent with the text. (Dong et al., 2017) developed a system called "I2T2I" built using GAN-CLS, in which the semantic consistency of text and image was measured using the Wasserstein distance. This system was prone to mode collapse and paid attention to the text only at a fine-grained level. In 2021, recognizing issues with image quality and the lack of features presented in the visual representation of the image, (Hossain et al., 2021) introduced a system which focuses more on image captioning; the system was built using a GAN-LSTM architecture to extract both the semantic and spatial relationships of an image. Though it paid a lot of attention to the semantic consistency of the text and image, the generated images were prone to look entirely fake and sometimes did not meet the expected standard, due to the high priority given to the image captioning module. (Nasir et al., 2019, p. 2) developed a system that generates a synthetic face from a textual description. It used GAN-CLS to extract fine-grained details, map them to a latent space, and learn their distribution in that latent space, but as stated, this system focused more on the fine-grained details of the text description when generating the face. (Ak et al., 2020) developed a system for fashion design which focused more on the consistency of the images being generated; it was developed using "e-AttnGAN" fused with a FiLM layer to enable consistent image generation, but paid less attention to the text descriptions. Thus, it too concentrated on the text at a fine-grained level. Throughout these works there were many gains and losses. However, none of the systems focused on generating images from large text descriptions and multiple languages, which remains a big gap in the text-to-image synthesis domain.
2.5.4 Benchmarking
Existing systems have exposed a number of potential drawbacks and challenges in terms of accuracy, performance, quality of images, semantic consistency, hyperparameter tuning, time complexity, and the number of models used. In the development of text-to-image synthesis, the dataset plays a key role. Since the author has chosen applied research oriented towards the in-vogue fashion domain, there is no global dataset against which the system will be tested. In fact, the system's scope will focus only on the fashion dataset (FashionGen) (Rostamzadeh et al., 2018) to measure and evaluate the system. In terms of performance, the author has predominantly focused on avoiding mode collapse and on efficient model training compared to other existing systems. The author intends to achieve this using an AttnGAN network ensembled with a contrastive learning approach as a hybrid method, such that the loss functions are preserved. The generated images will be of size 256*256 pixels and will be upscaled to a better resolution by employing an ESRGAN trained on the dataset, which will compare favorably with the image quality of the existing systems. Image augmentation and encoding with the latent-space distribution will be of high priority when generating output at 256*256 resolution. Unlike certain other existing systems, the semantic consistency between the text description and the visual output will be preserved throughout using conditioning augmentation, zero-gradient loss, and Adam optimization with cross-entropy classification. The time complexity will be linear compared to other systems. Finally, two generators and discriminators will be used, one of each being self-supervised, devised to avoid mode collapse, control momentum, and decrease training time. Acknowledging the above benchmarking prospects against the existing systems will be beneficial for the development of the system.
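As an illustration of the kind of image-text contrastive objective described above, the following is a hedged PyTorch sketch of a symmetric InfoNCE-style loss over a batch of paired image and text embeddings. The exact loss used in the prototype (Figure 7.1) may differ, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Matching image-text pairs (same row) are pulled together; mismatched pairs
    within the batch are pushed apart."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)   # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```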
3. METHODOLOGY
3.1 Chapter Overview
As part of this chapter, we'll go through how we'll do our study and how we'll handle our software
development and project management. The relevant sub-sections of each approach will be explained in
detail.
Research Methodology
Philosophy Pragmatism was chosen as the philosophy because research is based on data
to develop a hypothesis and this research is comparing both qualitative and
quantitative results produced by different text-to-image synthesis models
using GANs. This is also applied research on the domain of In-vogue
fashion designs and stylings.
Approach This research aims to test and prove a hypothesis that needs to be solved. This
is to input larger spans of In-vogue fashion-based textual descriptions in three
natural languages (English, Sinhala, Tamil) and achieve high-quality detailed
In-vogue fashion design images as output. A deductive approach was selected
to follow, as the research chooses to apply an existing theory to the domain of
interest.
Strategy The strategy of research is how the answers to the research questions are
proposed. Interviews, documents, and research analysis, experiments, and
surveys were chosen as strategies to fit the research strategy.
Choice Choice of research will depend on the research paradigm that is chosen.
Among the mono, multi and mixed methods for choice, mixed-method was
chosen, as text to image synthesis model had both quantitative and qualitative
results which were gained through interviews, survey papers, and other
documents such as journal articles and conference papers that could be used
as a comparison for the model prototype that is going to be developed in this
research.
Time Horizon Data needs to be gathered at a single point in time to do evaluations. Hence,
out of longitudinal and cross-sectional, cross-sectional time horizon looks the
most convenient and was chosen for the research.
Techniques and Procedures For the collection and analysis of data, techniques such as observations, documents, conversations, evaluation reports, interviews, and questionnaires will be used.
Based on the above research methodologies, the below mentioned aspects of the research were
determined.
Research Hypothesis: If text-to-image transformation models can take in larger spans of text in multiple natural languages and be trained and validated to provide high-quality fashion design images, then developers and other stakeholders can train larger text descriptions in multiple natural languages and apply them to any other domain for productive outcomes.
Research Process: Finding the best possible way to train larger corpora of text in multiple languages whilst simultaneously applying the trained text descriptions to the images, extending the process so that the unsupervised GAN network produces images for larger textual descriptions.
Prototype Input: Text descriptions of the in-vogue fashion-based designs the user desires to visualize.
Prototype Output: High-quality two-dimensional synthesized in-vogue fashion design images containing all the detailed features given in the input.
Prototype Features:
1. A novel framework to generate high-quality images from larger spans of texts in multiple
languages.
2. It will be an open-source application to be used by end-users to input in-vogue fashion
descriptions and achieve high-quality in-vogue fashion design images as output.
3. A GUI for better user experience.
adjustments and testing until he achieves a positive conclusion, the author chose to go with the prototype
model.
Agile PRINCE2 was selected for project management from a number of options because of its emphasis on management, recursive planning, and flexible delivery while responding to risks. As a result of these considerations, the author chose the Agile PRINCE2 methodology since it allows a focus on both management and delivery at once, aids in meeting deadlines on time, fosters cooperation, and increases stakeholder confidence.
3.4.1 Project Deliverables
Deliverable Component Tentative Delivery Date
Project Proposal 1st Nov 21
Review Paper 15th Oct 21
Literature Review Document 18th Oct 21
Software Requirement Specification 22nd Nov 21
System Design Document 6th Dec 21
Prototype 20th Apr 22
Thesis 20th May 22
Project Research Paper 19th June 22
Table 3.2 - Project Deliverables
memory.
Graphics Processing Unit A powerful enough GPU to train models will
Fashion Designers (Functional Beneficiary): Fashion designers use the system, where they will input text and visualize fashion designs for their business benefits.
Purchasing Agents (Functional Beneficiary): Purchasing agents are stakeholders who will use the system to visualize fashion designs before they commit to purchasing from a fashion store or any other fashion-based warehouse.
Students/Researchers (Functional Beneficiary): Students and researchers are stakeholders who will use the system to understand how it works and its functionalities, study its core components, and attempt to build something new from it as a research gap.
Containing System Stakeholders
Fashion Consultants (Functional Beneficiary): Fashion designers, after viewing a design produced by the system, show it to fashion consultants to receive feedback.
Product Owner (Managerial/Financial Beneficiary): Assists the developer in developing the system optimally by removing any obstructions and by defining clear goals.
Wider Environment Stakeholders
Supervisor / Colleagues (Advisory): Provide the required guidance and support for a successful prototype of the system to be completed.
Developer (Engineering Employee): Responsible for building the primary software product before its deployment in production.
Maintenance Engineer (Operator): Setting up the cloud environment in which the system must function and deploying new versions of the system to the production environment are the responsibilities.
Hackers (Negative Stakeholder): Hackers may target the system in one way or another, attempting to modify valuable
technical specialists with a viable fashion background. The goal of the questionnaire is to help
the author understand the expectations of the users regarding the prototype. It also helps the
author determine the system's expected goals and project aims.
Method 3: Brainstorming
During the placement year, the author received a client request to build a system that inputs
facial descriptions and generates facial images as output, with the goal of identifying criminals,
which would be useful to the police department. Upon brainstorming and analyzing the positive
outcomes such a system could produce for other industries, the author decided to dig deeper by
performing a literature review to identify gaps and issues in current systems. As a result, the
author decided to develop a text-to-image synthesis application that takes in large fashion-related
text descriptions and generates high-quality fashion designs. This was restricted to the fashion
design industry due to time constraints.
Table 4.2 - Selection of Requirement Elicitation Methods
4.5.2 Survey
A questionnaire was emailed to 179 people who represented a mixture of fashion designers and
the general public, i.e., purchasing agents of fashion designs. Appendix F contains the
questionnaire form that was distributed.
Observations
It was observed that 53.7% of the participants were
fashion designers who shared their experience crafting
fashion designs and 46.3% of the participants were
purchasing agents of fashion designs who shared their
experience on how they purchase fashion designs.
Conclusion
This survey was divided into two sections to capture the insights of the different categories of people
who are interested in fashion design. As per the results, a wide range of people representing the
general public and certain fashion wholesale purchasing merchants filled in this survey, along with a
wide range of fashion designers from various fashion organizations. The survey took the mixture of
these responses into consideration to ensure the system meets the requirements identified by these
participants for a better user experience.
Question How often do you purchase fashion designs (T-shirts/shirts/jeans, fabric,
ladies wear, shoes, etc)?
Aim of Question This question was for purchasing agents of fashion designs, to identify how often
people will use the system to check for designs.
Observation
This part of the survey was designed for the purchasing
agents of fashion design, where most of the purchasing
agents shopped a few times a year while the rest of the
respondents had other times they shopped.
Conclusion
As per the findings, a vast number of users purchase designs a few times a year. The system must
therefore remain fully functional so that it is available whenever these users need it. If the system
requires updates or version upgrades, they can be carried out when there is less traffic on the system,
ensuring that users have a quality experience.
Question Do you check for samples online before purchasing a fashion warehouse
or any online/physical fashion store?
Aim of Question Identifying participants who check for samples before purchasing as they
will be the system’s target audience.
Observation
It was observed that 82% of the users check for samples
before making a purchase at a fashion store and 27% of the
users directly approach the store and purchase designs.
Conclusion
The majority of users who purchase fashion designs check for samples before purchasing from a
fashion warehouse or store. These responses ensure that this system caters to its requirements
specifically for these users.
Question How do you check for fashion samples?
Aim of Question Identifying user preferences
Observation
Conclusion
The majority of users use social media and Google search to narrow down their search for samples
before making a purchase.
Conclusion
The majority of users, over 87%, think that an AI-based solution that will transform the text into
fashion designs will be helpful for them, as they will be able to visualize fashion designs before
making a purchase.
Question If Yes, please explain how you craft in-vogue fashion designs? (Tools &
Technology)
Aim of Question Identifying the tools and technology used by users to craft fashion designs.
These data will be used to understand the quality of fashion designs being
produced.
Findings
To understand the quality of fashion designs being produced by identifying the current tools and
technology used by fashion designers when crafting in-vogue fashion designs.
Theme Analysis
Identifying user’s Diverse responses from participants were received. Some participants
preferences when preferred to use manual resources to craft fashion designs, but the majority
crafting fashion of the participants preferred an automated approach to crafting fashion
designs designs. These preferences helped the author identify that the quality of
the fashion designs produced depends on the tools and technology used,
thus building the system matching those standards.
Automated To reduce the physical problems encountered, participants used
Production Vs automated tools to craft fashion designs. Only 10% of the participants
Manual Production were willing to manually craft fashion designs as they believed going old
school will always produce quality designs, even though it was time-
consuming.
Quality of Outcome Participants confirmed that automated production using fashion designer
software, and image editing applications were superior compared to
manual production in terms of the quality of fashion designs produced.
Question Have you ever faced issues and encountered problems while designing
in-vogue fashion designs?
Aim of Question Identifying the problems encountered when crafting fashion designs
Observation
58% of the fashion designers in the survey have
encountered problems while crafting fashion designs,
either technically or manually, whilst 42% of the
respondents did not mention problems they encounter
while crafting fashion designs.
Conclusion
Over 58% of the fashion designers who use handmade instruments and resources to craft fashion
designs run into problems. This confirms the prototype will be very useful for the users who craft
designs manually, allowing them to automate the process and reducing the problems they encounter.
Question Do you think it will be better to limit raw material, industrial resources,
human energy or get rid of it completely while crafting in-vogue fashion
design?
Aim of Question Identifying to check if users are willing to limit the raw materials or
resources while crafting fashion designs
Observation
76% of the users think that it will be better to limit the
raw material and other energy wastes being exposed
during the crafting of fashion designs, while 24% of the
users think it’s not necessary to do so based on particular
reasons.
Conclusion
Most of the participants agreed to reduce the burden of human energy and the wastage of additional
raw materials. These statistics show that over 75% of users would like to limit or completely get rid
of these resources while crafting fashion designs.
Question Will a computer-aided solution that will transform fashion-related
descriptions into in-vogue fashion designs be helpful?
Aim of Question To validate if the system will be useful to the end-user.
Observation
100% of the users who participated as fashion designers have
decided that such a solution will be of complete help to their
industry.
Conclusion
All the participants thought an automated option that would transform the text into fashion designs
would be helpful for the users.
Question If 'Yes' was selected above, please explain how you plan to use this solution to craft
in-vogue fashion designs and benefit from it.
Aim of Question To understand what the users will plan on achieving from this system.
Findings
Responses reflecting the different perspectives of users who will benefit from using the system.
Theme Analysis
Research Gap and Scope Depth – All participants approved and liked the idea of having such a
system, which will help reduce the impact of unwanted costs emitted during the design stage.
Overall, this niche describes fashion designs with a lot of fashion terms, so it was valid to develop a
system that synthesizes high-quality fashion designs from a substantial number of words.
Beneficial Prototype Features and Suggestions – Survey participants said utilizing this prototype has
several benefits: less raw material, time, and labor are used, and they think this prototype will
decrease these effects over time. Participants suggested selecting a language before converting a
written description to a fashion design and proposed a 360-degree design perspective to visualize
the whole design. All of these will be considered before obtaining the final result.
Table 4.4 - Survey Findings
4.5.3 Brainstorming
Criteria – Findings
Deriving research idea – Acquiring a research idea based on a client request to develop a
text-to-image application for facial image generation, depending on the descriptions provided as
input.
Identifying technical contribution from research idea – Based on the research idea found, analyzing
and brainstorming how the text-to-image system will make a technical contribution to the project
and how it will impact the fashion design industry, together with the pros and cons that end users
can face.
Deriving how the user interface should resemble the prototype – Since it is a fashion-based
application, the author brainstormed how the GUI should appear; based on similar fashion websites
and the insights gathered, a plain template coupled with a black-and-white theme was chosen.
Table 4.5 - Brainstorm Findings
The findings below were consolidated from the literature review, survey, and brainstorming sessions:
1 Acquired an idea from a third-party source and started
doing thorough research on it.
2 The research gap in the domain of text-to-image synthesis
has extended to synthesizing larger text descriptions into
high quality, two-dimensional images.
3 The prototype should be a system that takes in an input of
fashion-related text descriptions and outputs two-
dimensional fashion designs.
4 To synthesize text descriptions with a large number of
words, the best technique that can be used is to ensemble
AttnGAN with a contrastive learning approach, whereby
the image and text encoder of the AttnGAN will be used to
unify the words that occur close to the captions per image
and disregard the words that occur less frequently in a
caption. Thus, the semantic consistency and balance
between the image and text is maintained.
5 Identifying the tools and technology used whilst crafting
fashion designs, to match the same quality and standard
which should be generated by the prototype.
6 Identifying the problems encountered whilst crafting
fashion designs, and making sure the prototype solves all
those problems.
Use Case ID UC – 02
Use Case Name Download image Output
Description Users can download the image outputs which have been synthesized from
fashion descriptions.
End Objective To download the image and visualize it locally, and share it with other
end users.
Priority High
User/Actor Fashion Designer/ Purchasing Agent/ Fashion Organization & Firms
Trigger The user has to click the “Download” button
Frequency of Use Realtime
Preconditions 1. The system should be functional.
4.10 Requirements
Priority levels of system requirements were defined using the MoSCoW technique, based on their
importance.
Must have (M) Requirements at this level are the prototype's primary functional requirements and
must be implemented.
Should have (S) Important requirements that are not essential for the intended prototype to work
but provide a lot of value.
Could have (C) Desirable requirements that are always optional and are never regarded as critical
to the project's scope.
Will not have (W) Requirements that the system will not include, as they are not a top priority at
this time.
Table 4.9 - MoSCoW Technique
NFR1 User Users should be able to readily grasp and move around the M
Friendliness system without the need for extra training of the text-to-
image synthesis models.
NFR2 Performance Provided that a user enters a text description into the M
system, the text-to-image synthesis process must not take
too long to complete and give the user a result. It must also
be verified that the application will not crash during the
processing of the text and creation of the image.
NFR3 Quality of Once a provided input text description has been M
Image synthesized into a fashion design, the quality of the
resulting fashion design must be acceptable and usable by
the user, as delivering a quality result to the user is equally
important.
NFR4 Security It is critical to maintain strong security levels inside the system, as it
processes sensitive images submitted by users. M
NFR5 Scalability The system must be able to adapt to a large number of users as the
system grows. C
Table 4.11 - Non-Functional Requirements
… contributes to the project's goals and objectives through completing the questionnaires. This
dissertation does not allow for the use of fabrication, falsification, or plagiarism of any kind. All of
the data and information presented is correct, and all of the knowledge and facts that were extracted
have been properly cited and referenced in the document.
Compliance with all applicable industry standards and guidelines was maintained at all times. Each
step of the prototype development process took place in highly secure environments that were
password protected and kept up to date with the most recent security patches available. There was
no piracy in any of the software or tools that were used in the development of the prototype, and
none of them were illegally obtained. No commercial or student licenses were used at any point in
the process; only open-source licenses were used throughout the duration of the project. No
fabrication or falsification of data or results from the project was used to deceive or lead viewers or
evaluators to believe in a successful state that was never achieved.
Table 5.1 - SLEP Issues & Mitigations
6. DESIGN
6.1 Chapter Overview
This chapter discusses the project's design aspects in depth, covering everything from the system's
core to the user interface. The requirements acquired through the literature review, the questionnaire,
and the discoveries from the brainstorming process were used to make design selections. The rationale
behind various design decisions is also explained.
The system's tiered design is divided into three tiers: the Presentation Layer, the Logic Layer, and the
Data Layer. The Logic Layer serves as a link between the Data Layer, which stores all of the
application's data, and the Presentation Layer, which displays the user-interactive components. When
it comes to operations, the Logic Layer organizes the processes better and provides more flexibility.
Data Layer
Existing Fashion Dataset – These are the stored fashion datasets in the form of .h5 files. Each file
consists of the text descriptions, the images associated with them, and the category type. These
datasets will only be used to create the image encoder and text encoder.
Trained text-encoder and image encoder – This storage contains the trained image-text
encoder that is developed to compute the image-text matching loss for the training of the
generator.
Trained Models Generator & Discriminator – Using the fashion dataset, text-image
encoders, different models of generator, and discriminators were trained. This storage
offers all the models for the process of synthesizing fashion designs from text descriptions.
Image Output Storage – This is the directory where generated images are stored so that they can
be sent to the client layer to be visualized.
Logic Layer
GAN Module – This module consists of the GAN network that is used to generate fashion images
from the fashion training dataset. The network comprises a generator model that generates fake
images based on the pre-trained DAMSM encoder, and a discriminator model that is trained
separately on real images and real captions.
Text-to-Image Synthesis Module – This module is responsible for synthesizing fashion designs
from text descriptions. It consists of a DAMSM encoder to compute the image-text similarity score,
a contrastive learning module that maximizes and minimizes loss based on the text-image DAMSM
similarity, an RNN encoder that extracts word and sentence embeddings from the text description,
and the generator module that generates the fashion design images from the provided text
description.
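A minimal sketch of the kind of RNN text encoder described above is given below, assuming a bidirectional LSTM; the vocabulary size, embedding dimension, and hidden dimension are illustrative values, not the exact configuration used in the prototype.

```python
import torch
import torch.nn as nn

# Minimal sketch of an RNN text encoder that yields word-level and sentence-level
# embeddings for a batch of tokenized descriptions. Sizes are illustrative only.
class TextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)                     # (batch, seq_len, embed_dim)
        word_feats, (hidden, _) = self.rnn(embedded)             # per-word features
        sent_feats = torch.cat([hidden[-2], hidden[-1]], dim=1)  # sentence embedding
        return word_feats, sent_feats

# Example: encode a batch of two 8-token descriptions.
encoder = TextEncoder()
words, sentence = encoder(torch.randint(0, 5000, (2, 8)))
```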
Presentation Layer
Landing Webpage Wizard - This will be displayed to the user when they first access the
online application, along with a quick introduction and instructions on how to use it.
Input Text Description - This is the wizard that will invite the user to provide their text
description to produce a high-quality fashion design. This module will send the text
description to the text-to-image synthesis processing module once the user has entered it.
Loading Screen Wizard - It will take some time for the system to synthesize the fashion
design once the user submits the text description, so the user must be informed that their
text description is being processed. For this reason, a loading screen appears.
Display Image Wizard - The loading screen will transfer the user to a new page that will
display both their input description and the output image once the text-to-image process is
complete and the image is ready.
6.4.5 UI Design
The project's UI (User Interface) had a straightforward goal. It intended to create a user-friendly
interface that would simplify the process of synthesizing fashion designs from user-provided
descriptions. Another requirement was for the user interface to be responsive for mobile
applications, as consumers tend to save a large number of photographs on their mobile devices. A
four-page responsive web application was created to meet all of these goals and to give a basic yet
comfortable user experience with a lower learning curve. The wireframes below will appear on
mobile and desktop devices.
7. IMPLEMENTATION
7.1 Chapter Overview
This chapter describes the implementation of the research prototype. It shows the technological
stack of the system and how it was built, examines the data selection by explaining the dataset and
the attributes used by the text-to-image synthesis model, and discusses the libraries and IDEs used.
It also describes the core components and the various decisions taken during the development of
the system.
The project aimed to use fashion-based text and image pairs to train and develop a text-to-image
synthesis model so that the system could generate fashion design images from large text
descriptions consisting of a substantial amount of words. The FashionGen dataset was used to train
the framework's GAN network. FashionGen included text descriptions and images for T-shirts,
Shirts, Jeans, Pants, Suits & Blazers, and Tops.
React JS - As a requirement of the project, the React JS framework will be used for the frontend.
The front end is a simple, user-friendly web application. JavaScript libraries, supported by node
modules, are incorporated to simplify frontend development, and the code is separated into a
component-page architecture so that API calls from the frontend can be executed easily.
Flask - To connect the application's backend and frontend, the Flask web framework was used.
Flask was chosen because communication between the front end and the back end was required.
Flask is also a lightweight web framework that connects both ends via API requests. It also helps
the application run more efficiently.
PyCharm and Visual Studio Code were chosen as the best free IDE options. The backend was
developed using the PyCharm IDE, and the front end of the application was developed using the
Visual Studio Code IDE. PyCharm was chosen for the backend because it is the most well-known
IDE for Python application development and makes it easier to manage the required packages
during development.
The route shown above is used to generate the fashion design using the trained GAN model.
8. TESTING
8.1 Chapter Overview
This chapter covers how testing was done to verify Fashionable's intended functional flow. It
discusses the testing in detail, including model testing, benchmarking, functional testing,
non-functional testing, and module and integration testing.
To ensure that all system models are performing as expected and are thoroughly tested in
order to achieve the best results.
Identify how the system can be benchmarked and achieve proper benchmarking against
other systems.
To determine whether the system satisfies all the functional requirements and non-
functional requirements.
To improve the system's experience based on test results.
When looking at model evaluation metrics for text-to-image synthesis approaches, it was
discovered in the Evaluation Methodology section that the Inception Score (IS) and R-Precision
were the most commonly used metrics.
Inception Score
The Inception Score is a mean-based measure for assessing image quality, and it has been
demonstrated to correspond well with human opinion.
The generated images are taken as input and evaluated with a pretrained Inception v3 model against
CIFAR-100 images to obtain the mean score and the standard deviation from the generated images
relative to the CIFAR-100 images.
R-Precision
R-Precision ranks retrieval performance between retrieved image and text attributes to determine
the visual-semantic similarity between text descriptions and generated images. It is defined as r/R,
in other words the ratio of correctly matched text-image pairs to the total number of predicted texts
along with their images.
The R-Precision function takes the trained image and text encoders as input and computes the
cosine similarity between the test texts and images using the text and image encoders respectively.
The closest similarity is counted as a success, and the number of successes divided by the total
processed gives the precision score.
The average discriminator loss for a text-to-image discriminator typically ranges between 0.1 and 0.3
(Bodnar, 2018a). A loss of 0 indicates that the discriminator network has overfitted to the training data
and has won the min-max game, which means the adversarial training has failed; in this scenario, the
generator network produces a random set of outputs.
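As a minimal sketch of the quantity being monitored, a standard GAN discriminator loss can be written as below, assuming real_logits and fake_logits are the discriminator outputs for real and generated images; this is the generic adversarial loss, not the exact AttnGAN formulation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a standard GAN discriminator loss. Values close to 0 suggest the
# discriminator has "won" the min-max game, which usually means adversarial training
# has collapsed and the generator produces poor, near-random outputs.
def discriminator_loss(real_logits, fake_logits):
    real_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real_loss + fake_loss
```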
The contrastive learning component is essentially a loss function that pulls together the captions
corresponding to the same image and pushes away the captions that do not correspond to the same
image. It uses the cross-entropy loss function with "sum" reduction over the cosine similarities
between the positive and negative samples. Finally, the loss is derived by dividing it by the current
batch size.
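A minimal sketch of this loss is given below, assuming img_feats and txt_feats are DAMSM features for a batch in which row i of each tensor corresponds to the same fashion item (the positive pair) and every other row is a negative; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the contrastive loss described above: cross-entropy with "sum"
# reduction over cosine similarities, divided by the current batch size.
def contrastive_loss(img_feats, txt_feats, temperature=0.1):
    img_feats = F.normalize(img_feats, dim=1)
    txt_feats = F.normalize(txt_feats, dim=1)
    logits = img_feats @ txt_feats.t() / temperature             # cosine similarities as logits
    targets = torch.arange(logits.size(0), device=logits.device) # positives lie on the diagonal
    loss = F.cross_entropy(logits, targets, reduction="sum")
    return loss / logits.size(0)                                 # divide by the batch size
```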
The DAMSM model, consisting of an image encoder and a text encoder, computes the multimodal
similarity between the text and the image. A good text-image matching loss is used by the generator
to generate images in a semi-supervised manner.
8.5 Benchmarking
The best-performing ensemble model for generating images from long phrases in the text-to-image
synthesis domain was selected for the final system after extensive testing of various loss functions,
optimizers, splits, epochs, and batch sizes. The Inception Score and R-Precision were used to
benchmark the system against other systems in the domain.
8.8.1 Accuracy
As part of the model review and benchmarking process, accuracy testing was carried out. The goal
was for the proposed approach to outperform hand-designed architectures, and a higher Inception
Score was achieved by utilizing the proposed strategy. Benchmarking included comparisons to other
SOTA models. The prototype's correctness was determined through functional and non-functional
testing.
8.8.2 Performance
Research Development Performance
The primary focus of the research was to investigate the learning of graphical representations as
well as the text-to-image conversion process. Achieving successful results therefore required not
only sufficient CPU and RAM, but also sufficient GPU power. Google Colab was used, and its
Tesla K80 GPU, with 11 Gigabytes of video RAM, was used for model training because the training
phase was the most resource-intensive phase of the research. The information about GPU utilization
can be seen in the screenshot below.
As a result of the high consumption of graphical resources, GPU video RAM usage exceeded 8
Gigabytes out of the 11 Gigabytes available. In summary, the training process was the phase of the
research component that required the most time and resources.
8.8.3 Security
When building a system that handles user data, security is crucial, especially when the system is
hosted on a public cloud server. The measures below make the application safer for its users.
1. It was decided to deploy a web application with the capability of communicating with the
server using HTTPS, which is an encrypted communication protocol. Thus, no opportunity
will exist for an attacker to intercept communication between the application and the server
in this situation.
2. After a user session ended, none of the user data was retained within the application, which
was also a functional requirement of the application. The use of this practice eliminated the
possibility of unnecessary user data being disclosed to unapproved parties.
9. EVALUATION
9.1 Chapter Overview
It was time to analyze the research project's overall conclusion after the designed prototype had
been successfully implemented and optimized to achieve the greatest performance feasible through
a large number of training combinations. This chapter will focus on implementing the project's
evaluation process, which includes self-evaluation, domain and technical expert evaluations, and
other stakeholder evaluations.
… the undertaken development process of the prototype that was used to demonstrate the concept
to the consumer.
Quantitative benchmarking approach – The goal is to validate the metrics that have been used to
benchmark the model and to assess how well it has performed when compared to other models in
the same domain.
Quality of the GUI – To evaluate the quality of the GUI.
Table 9.1 - Evaluation Criteria
… The refined scope and depth were both significantly higher than average due to the nature of
GANs.
System Design, Architecture & Implementation – The design, architecture, and implementation of
the concepts and components based on the research were complicated because of the difficulty of
the underlying concepts. Through frequent discussions, prototyping, and interviews, the functional
and non-functional requirements, as well as the main components, were identified. A variety of
cutting-edge technologies were employed in the development process, and aspects such as code
quality and industry standards were taken into consideration during the entire design and
implementation process. It was completed to a high level of quality in terms of functionality, design,
architecture, and implementation.
Model Implementation – To generate images from text descriptions, an AttnGAN model ensembled
with a contrastive learning loss function was used as the model implementation in this research.
This model implementation was the core component of the research. A very satisfactory level of
model implementation was achieved based on the refined scope, following best practices and
industry standards throughout its development.
Solution and the prototype – The prototype was developed as a web-based solution. A lot of thought
was put into the components, and they can be improved even further; product-level enhancements
are noted in the future work section. The prototype's usability demonstrated that text-to-image
synthesis can be learned and experimented with in a straightforward and simple manner.
Limitations and future enhancements – The primary constraint on the development process was the
availability of GPU resources. Through hyperparameter tuning, the model can be made even better.
All of the limitations and enhancements that have been identified are detailed in Chapter 10.
Appendix J lists the evaluators in a tabular format with their respective names, positions, and
affiliations.
9.6.1.5 GUI
Eval ID Feedback
EV12 The GUI of the application was clear and responsive. It was easy to understand
how the application works and steps to follow to complete my task. The
description about the application was clear and the text and graphics used were
also understandable.
EV11 The responsiveness of the GUI is great and the user interface was simple to use.
Portraying how the prototype should be used beforehand was a very good idea and makes
the application very handy.
EV3 The GUI features are convenient and cover most of what is required in this use case.
Much can be admired in terms of the research idea, the project's management, the
outcomes and output, and even the user interface of this project. Considering that
this was an undergraduate effort, this appears to be an impressive outcome.
Table 9.7 – GUI Evaluation
3. Do you think the used evaluation metrics are relevant for the project?
form; this eliminated any opportunity for hands-on experimentation and understanding of the
prototype by the experts.
10. CONCLUSION
10.1 Chapter Overview
This chapter summarizes the findings from the research as a whole. The project's aims, objectives,
and learning outcomes are laid down, along with the difficulties that were encountered. Knowledge
and abilities developed over four years of study, as well as new talents learned through research,
are presented in this section. Also covered are the project's deviation from its initial design and
restrictions, as well as future work.
The aim of the project was successfully achieved by designing, developing, and evaluating a
novel text-to-image synthesis model that is fully capable of transforming text descriptions, written
in three natural languages and containing a substantial number of words, into high-quality
two-dimensional fashion designs.
Software Development Group Project – This module laid the groundwork for the research, covering
every stage of the process, from identifying the problem to developing and testing a prototype.
Although it was team work at that stage, writing the final project report was found easier because
of the exposure to the SDGP module.
Client-Server Architecture – This module helped in understanding how the front end and back end
work together, and how the APIs should be structured and used during development.
Algorithms, Theory, Design & Implementation – Data structures and algorithms were covered
extensively in this module. This aided in the implementation of time-critical algorithms by using
logic and reasoning.
Table 10.1 - Utilization of the Course
Use of GANs Complementing Text-to-Image Synthesis - The author had never had
hands-on experience with GANs or model ensembling before. This endeavor necessitated
research into how these techniques are used in text-to-image synthesis.
Deep Learning - Prior to this study, the author had no prior knowledge of deep learning.
In order to complete this project, it was necessary to study deep learning techniques from
sources such as Coursera and Udemy.
Ability to Fine-Tune Models – The author extensively researched and gained knowledge on
how deep learning models should be optimized and fine-tuned to perform better.
Diverse domains involved in the research – Text-to-image synthesis with GANs was necessary in
order to solve the problem identified in the field of computer vision. To overcome this obstacle, the
author had to conduct extensive study of each topic and dedicate a significant amount of time to
mastering deep learning, GAN development, and text-to-image synthesis skills prior to the start of
the project timeframe.
Vast Learning Curve – It was a challenge to learn about and implement several different GAN
models. As soon as the project idea was finalized, trials with the prototype were begun in order to
overcome this obstacle.
Table 10.3 - Problems & Challenges Faced
10.8 Deviations
1. The initial plan was to use StackGAN with contrastive learning, but because of the longer
training time and the inability to focus on fine-grained details, the algorithmic selection was
changed to AttnGAN. AttnGAN also required a long training time, but it focuses more on the
fine-grained details of the description.
2. It was evident that ensembling AttnGAN with contrastive learning could generate high-quality
images. However, because AttnGAN has multiple generators and discriminators, each
corresponding to a specific low-resolution image size, and because of the GPU constraints, the
author decided to use only two generators and discriminators, which in turn gave an image
resolution of 128*128 pixels. So, in order to improve the resolution, ESRGAN was trained on
the dataset to produce high-quality images (a minimal up-scaling sketch follows this list).
3. Though the R-Precision score was superior compared to other models, the model's Inception
Score was fairly low, because the author was unable to train the model on a larger dataset due
to GPU constraints, which affected the accuracy of the model.
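The snippet below is a minimal sketch of the up-scaling step referenced in item 2 above, assuming esrgan_generator is the ESRGAN generator trained on the fashion data and low_res is a 1x3x128x128 tensor produced by the AttnGAN stage; the scale factor and value range are assumptions.

```python
import torch

# Minimal sketch of post-hoc super-resolution with a trained ESRGAN generator.
# `esrgan_generator` and the 4x scale factor are assumptions about the trained model.
@torch.no_grad()
def upscale(esrgan_generator, low_res):
    esrgan_generator.eval()
    high_res = esrgan_generator(low_res)   # e.g. 128x128 -> 512x512 fashion design
    return high_res.clamp(0, 1)            # keep pixel values in a valid display range
```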
A major future enhancement would be to introduce text-to-video synthesis for the fashion
design industry.
The model can be further improved by increasing the dataset size and improving the model
inception score.
The core of the AttnGAN can be improved to up-sample a larger resolution of the image
at the initial stages of the generator and discriminator.
There may be a number of additional ways to handle the same problem that could increase
the accuracy or performance of the image generation process.
REFERENCES
Abid, M. (2020) ‘Fashion Designing’, Medium, 24 January. Available at:
https://medium.com/@maryamabidkhan2/fashion-designing-6b4106dbf254 (Accessed: 23
November 2021).
Agnese, J. et al. (2020) ‘A survey and taxonomy of adversarial neural networks for text-to-image
synthesis’, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10.
Ak, K.E. et al. (2020) ‘Semantically consistent text to fashion image synthesis with an enhanced
attentional generative adversarial network’, Pattern Recognit. Lett., 135, pp. 22–29.
Arjovsky, M., Chintala, S. and Bottou, L. (2017) ‘Wasserstein GAN’, ArXiv, abs/1701.07875.
Bekesh, R. et al. (2020) ‘Structural Modeling of Technical Text Analysis and Synthesis Processes’,
in COLINS.
Bodnar, C. (2018a) ‘Text to Image Synthesis Using Generative Adversarial Networks’, ArXiv,
abs/1805.00676.
Bodnar, C. (2018b) ‘Text to Image Synthesis Using Generative Adversarial Networks’, ArXiv,
abs/1805.00676.
Cheng, Q. and Gu, X. (2020a) ‘Cross-modal Feature Alignment based Hybrid Attentional
Generative Adversarial Networks for text-to-image synthesis’, Digit. Signal Process., 107, p.
102866.
Cheng, Q. and Gu, X. (2020b) ‘Cross-modal Feature Alignment based Hybrid Attentional
Generative Adversarial Networks for text-to-image synthesis’, Digit. Signal Process., 107, p.
102866.
Dhivya, K. and Navas, N.S. (2020) ‘Text to Realistic Image Generation Using Stackgan’, 2020
7th International Conference on Smart Structures and Systems (ICSSS), pp. 1–7.
Dong, H. et al. (2017) ‘I2T2I: Learning text to image synthesis with textual data augmentation’,
in 2017 IEEE International Conference on Image Processing (ICIP), pp. 2015–2019.
doi:10.1109/ICIP.2017.8296635.
Dong, Y. et al. (2021) ‘Unsupervised text-to-image synthesis’, Pattern Recognit., 110, p. 107573.
El, O.B., Licht, O. and Yosephian, N. (2018) ‘Recipe 2 Image : Multimodal High-Resolution Text
to Image Synthesis using Stacked Generative Adversarial Network’, in.
Frolov, S. et al. (2021) ‘Adversarial Text-to-Image Synthesis: A Review’, Neural networks : the
official journal of the International Neural Network Society, 144, pp. 187–209.
Grønbech, C.H. et al. (2020) ‘scVAE: variational auto-encoders for single-cell gene expression
data’, Bioinformatics [Preprint].
Grover, A., Dhar, M. and Ermon, S. (2018) ‘Flow-GAN: Combining Maximum Likelihood and
Adversarial Learning in Generative Models’, arXiv:1705.08868 [cs, stat] [Preprint]. Available at:
http://arxiv.org/abs/1705.08868 (Accessed: 10 November 2021).
Haddi, E., Liu, X. and Shi, Y. (2013) ‘The Role of Text Pre-processing in Sentiment Analysis’, in
ITQM.
Han, F., Guerrero, R. and Pavlovic, V. (2020) ‘CookGAN: Meal Image Synthesis from
Ingredients’, arXiv:2002.11493 [cs] [Preprint]. Available at: http://arxiv.org/abs/2002.11493
(Accessed: 13 November 2021).
Heusel, M. et al. (2017) ‘GANs Trained by a Two Time-Scale Update Rule Converge to a Local
Nash Equilibrium’, in NIPS.
Hong, S. et al. (2018) ‘Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis’, 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7986–7994.
Hossain, Md.Z. et al. (2021) ‘Text to Image Synthesis for Improved Image Captioning’, IEEE
Access, 9, pp. 64918–64928. doi:10.1109/ACCESS.2021.3075579.
Hu, K. et al. (2021) ‘Text to Image Generation with Semantic-Spatial Aware GAN’, ArXiv,
abs/2104.00567.
Hu, T., Long, C. and Xiao, C. (2021) ‘CRD-CGAN: Category-Consistent and Relativistic
Constraints for Diverse Text-to-Image Generation’, ArXiv, abs/2107.13516.
Isola, P. et al. (2017) ‘Image-to-Image Translation with Conditional Adversarial Networks’, 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976.
Jeon, E.S., Kim, K. and Kim, D. (2021) ‘FA-GAN: Feature-Aware GAN for Text to Image
Synthesis’, ArXiv, abs/2109.00907.
Jin, Q. et al. (2019) ‘Image Generation Method Based on Improved Condition GAN’, 2019 6th
International Conference on Systems and Informatics (ICSAI), pp. 1290–1294.
Karras, T. et al. (2018) ‘Progressive Growing of GANs for Improved Quality, Stability, and
Variation’, ArXiv, abs/1710.10196.
Kingma, D.P. and Welling, M. (2019) ‘An Introduction to Variational Autoencoders’, Foundations
and Trends® in Machine Learning, 12(4), pp. 307–392. doi:10.1561/2200000056.
Levy, O. and Goldberg, Y. (2014) ‘Neural Word Embedding as Implicit Matrix Factorization’, in
NIPS.
Li, R. et al. (2020a) ‘Exploring Global and Local Linguistic Representations for Text-to-Image
Synthesis’, IEEE Transactions on Multimedia, 22, pp. 3075–3087.
Li, R. et al. (2020b) ‘Exploring Global and Local Linguistic Representations for Text-to-Image
Synthesis’, IEEE Transactions on Multimedia, 22, pp. 3075–3087.
Liang, J., Pei, W. and Lu, F. (2020) ‘CPGAN: Content-Parsing Generative Adversarial Networks
for Text-to-Image Synthesis’, in ECCV.
Maslej-Kresnáková, V. et al. (2020) ‘Comparison of Deep Learning Models and Various Text Pre-
Processing Techniques for the Toxic Comments Classification’, Applied Sciences, 10, p. 8631.
Mihalcea, R. and Leong, C.W. (2006) ‘Toward communicating simple sentences using pictorial
representations’, Machine Translation, 22, pp. 153–173.
Mishra, P. et al. (2020) ‘Text to Image Synthesis using Residual GAN’, 2020 3rd International
Conference on Emerging Technologies in Computer Engineering: Machine Learning and Internet
of Things (ICETCE), pp. 139–144.
Nasir, O.R. et al. (2019) ‘Text2FaceGAN: Face Generation from Fine Grained Textual
Descriptions’, 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), pp.
58–67.
Nasr, A., Mutasim, R. and Imam, H. (2021) ‘SemGAN: Text to Image Synthesis from Text
Semantics using Attentional Generative Adversarial Networks’, 2020 International Conference on
Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), pp. 1–6.
Parihar, A.S. et al. (2020) ‘A Primer on Conditional Text based Image Generation through
Generative Models’, 2020 5th IEEE International Conference on Recent Advances and
Innovations in Engineering (ICRAIE), pp. 1–6.
Qiao, Y. et al. (2021) ‘R-GAN: Exploring Human-like Way for Reasonable Text-to-Image
Synthesis via Generative Adversarial Networks’, Proceedings of the 29th ACM International
Conference on Multimedia [Preprint].
Radford, A., Metz, L. and Chintala, S. (2016) ‘Unsupervised Representation Learning with Deep
Convolutional Generative Adversarial Networks’, arXiv:1511.06434 [cs] [Preprint]. Available at:
http://arxiv.org/abs/1511.06434 (Accessed: 24 November 2021).
Reed, S. et al. (2016) ‘Generative adversarial text to image synthesis’, in International Conference
on Machine Learning. PMLR, pp. 1060–1069.
Reed, S.E. et al. (2016) ‘Learning What and Where to Draw’, in NIPS.
Rostamzadeh, N. et al. (2018) ‘Fashion-Gen: The Generative Fashion Dataset and Challenge’,
arXiv:1806.08317 [cs, stat] [Preprint]. Available at: http://arxiv.org/abs/1806.08317 (Accessed:
30 November 2021).
Ruan, S. et al. (2021) ‘DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis’,
ArXiv, abs/2108.12141.
Sommer, W.L. and Iosifidis, A. (2020) ‘Text-To-Image Synthesis Method Evaluation Based On
Visual Patterns’, in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Barcelona, Spain: IEEE, pp. 4097–4101.
doi:10.1109/ICASSP40776.2020.9053034.
Souza, D.M., Wehrmann, J. and Ruiz, D.D.A. (2020) ‘Efficient Neural Architecture for Text-to-
Image Synthesis’, 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
Tan, H. et al. (2019) ‘Semantics-Enhanced Adversarial Nets for Text-to-Image Synthesis’, 2019
IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10500–10509.
Tan, H. et al. (2021) ‘KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-
to-Image Synthesis’, IEEE Transactions on Image Processing, 30, pp. 1275–1290.
Tanaka, F.H.K. dos S. and Aranha, C. (2019) ‘Data Augmentation Using GANs’,
arXiv:1904.09135 [cs, stat] [Preprint]. Available at: http://arxiv.org/abs/1904.09135 (Accessed:
11 November 2021).
Tao, M. et al. (2020) ‘DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image
Synthesis’, ArXiv, abs/2008.05865.
Vani, A. and Venkatesh, S. (2016) ‘Computer Vision Report : Text to Image Synthesis’, in.
Wang, M. et al. (2020) ‘End-to-End Text-to-Image Synthesis with Spatial Constrains’, ACM
Transactions on Intelligent Systems and Technology (TIST), 11, pp. 1–19.
Xia, W. et al. (2020) ‘TediGAN: Text-Guided Diverse Image Generation and Manipulation’,
ArXiv, abs/2012.03308.
Xu, T. et al. (2018a) ‘AttnGAN: Fine-Grained Text to Image Generation with Attentional
Generative Adversarial Networks’, 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 1316–1324.
Xu, T. et al. (2018b) ‘AttnGAN: Fine-Grained Text to Image Generation with Attentional
Generative Adversarial Networks’, 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 1316–1324.
Yang, Y. et al. (2021) ‘Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-
Image Synthesis’, IEEE Transactions on Image Processing, 30, pp. 2798–2809.
Ye, H. et al. (2021a) ‘Improving Text-to-Image Synthesis Using Contrastive Learning’, arXiv
preprint arXiv:2107.02423 [Preprint].
Yu, J. et al. (2019) ‘Free-Form Image Inpainting With Gated Convolution’, 2019 IEEE/CVF
International Conference on Computer Vision (ICCV), pp. 4470–4479.
Yuan, M. and Peng, Y. (2020) ‘CKD: Cross-Task Knowledge Distillation for Text-to-Image
Synthesis’, IEEE Transactions on Multimedia, 22, pp. 1955–1968.
Zaidi, A. (2017) ‘Text to Image Synthesis Using Stacked Generative Adversarial Networks’, in.
Zhang, H. et al. (2017) ‘Stackgan: Text to photo-realistic image synthesis with stacked generative
adversarial networks’, in Proceedings of the IEEE international conference on computer vision,
pp. 5907–5915.
Zhang, H. et al. (2021) ‘Cross-Modal Contrastive Learning for Text-to-Image Generation’, 2021
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 833–842.
Zhang, M., Li, C. and Zhou, Z.-P. (2021a) ‘Text to image synthesis using multi-generator text
conditioned generative adversarial networks’, Multimedia Tools and Applications, 80, pp. 7789–
7803.
Zhang, M., Li, C. and Zhou, Z.-P. (2021b) ‘Text to image synthesis using multi-generator text
conditioned generative adversarial networks’, Multimedia Tools and Applications, 80, pp. 7789–
7803.
Zhang, Z., Xie, Y. and Yang, L. (2018) ‘Photographic text-to-image synthesis with a
hierarchically-nested adversarial network’, in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 6199–6208.
Zhou, R., Jiang, C. and Xu, Q. (2021) ‘A survey on generative adversarial network-based text-to-
image synthesis’, Neurocomputing, 451, pp. 316–336.
Zhu, B. and Ngo, C.-W. (2020) ‘CookGAN: Causality Based Text-to-Image Synthesis’, 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5518–5526.
Zhu, M. et al. (2019) ‘DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-
To-Image Synthesis’, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 5795–5803.
APPENDICES
Appendix A – Concept Map
Image Encoder
Conditional Augmentation
Generator Network
Discriminator Network
Attention Model