
Fashionable

Abstract
Fashion designers encounter many difficulties when crafting and designing in-vogue fashion designs. As a result, many fashion-design productions never reach the market at scale and fall below standard. These problems affect the demand and supply chain of the fashion design industry.

These problems can be addressed with text-to-image synthesis, the process of transforming text descriptions into high-quality two-dimensional images. After critically analyzing the existing systems, no work in the text-to-image synthesis domain was identified that transforms text descriptions containing a large number of words into images. The author has therefore decided to bridge this gap by building a novel algorithm that ensembles an attentional generative adversarial network (AttnGAN) with a contrastive learning approach to synthesize fashion-design descriptions into high-quality fashion designs.

This system is developed using deep learning, following a multi-level GAN architecture. An image-text encoder is built and trained to emphasize the words provided in a text description and make them semantically consistent with the images. Additionally, during training, a contrastive loss between image and text is computed to minimize the distance between textual descriptions related to the same image and maximize the distance between those related to different images. An AttnGAN network is employed to train on the text descriptions. After training for a maximum of 800 epochs, the GAN model was able to generate images for a variety of text descriptions across the classes Shirts, Trousers, Blazers, Shorts & Tops, and Dresses. Also, with the use of an ESRGAN trained on the FashionGen dataset, the final generated image was of good resolution.

Keywords: Text-to-Image Synthesis, Image Generative Models, Computer Vision, Generative Adversarial Networks, PyTorch

Subject Descriptors

1. Computing Methodologies >> Machine Learning >> Neural Networks


2. Computing Methodologies >> Machine Learning >> Machine Learning Algorithms >>
Ensemble Methods
3. Computing Methodologies >> Artificial Intelligence >> Computer Vision >> Computer Vision
Representation >> Image Representation


Acknowledgment
It is a great honor to have completed successful research in the text-to-image synthesis domain by generating images from descriptions containing numerous words. It would not have been feasible without the great assistance, direction, and information supplied by the numerous people I met along the way. Above all, it was my supervisor, Mr. Guhanathan Poravi, who helped me finish my project; it was thanks to his critiques, suggestions, and research viewpoints that I was able to perform proper research. Thank you to all the industry and academic experts who took the time to listen to my research presentations and provide extensive feedback. The teachers, administration, and colleagues who helped me with my final year project are greatly appreciated. Finally, but most importantly, I want to thank my parents, who were always there to help me succeed.


Contents
Declaration............................................................................................................................................................. i
Abstract................................................................................................................................................................. ii
Acknowledgment ................................................................................................................................................. iii
1. INTRODUCTION............................................................................................................................................ 1
1.1 Chapter Overview ...................................................................................................................................... 1
1.2 Problem Domain ........................................................................................................................................ 1
1.2.1 Generative adversarial networks ....................................................................................................... 1
1.2.2 Text-to-Image Synthesis ..................................................................................................................... 1
1.2.3 In-Vogue Fashion Design Industry and Problems Encountered .................................................... 2
1.3 Problem Definition ..................................................................................................................................... 2
1.3.1 Problem Statement .............................................................................................................................. 3
1.4 Research Motivation .................................................................................................................................. 3
1.5 Existing Work............................................................................................................................................. 3
1.6 Research Gap ............................................................................................................................................. 5
1.7 Contribution to The Body of Knowledge ................................................................................................. 5
1.7.1 Technical Contribution....................................................................................................................... 6
1.7.2 Domain Contribution .......................................................................................................................... 6
1.8 Research Challenges .................................................................................................................................. 6
1.9 Research Questions .................................................................................................................................... 6
1.10 Research Aim ........................................................................................................................................... 7
1.11 Research Objectives ................................................................................................................................. 7
1.12 Project Scope ............................................................................................................................................ 9
1.12.1 In-Scope ............................................................................................................................................. 9
1.12.2 Out-Scope......................................................................................................................................... 10
1.12.3 Prototype Feature Diagram ........................................................................................ 10
1.13 Document Structure............................................................................................................................... 11
1.14 Chapter Summary ................................................................................................................................. 11
2. LITERATURE REVIEW ............................................................................................................................. 12
2.1 Chapter Overview .................................................................................................................................... 12
2.2 Concept Graph ......................................................................................................................................... 12
2.3 Problem Domain ...................................................................................................................................... 12
2.3.1 Introduction to the in-Vogue Fashion Design Industry ................................................................. 12


2.3.2 How is Text-to-Image Synthesis Performed? .............................................................. 13


2.3.3 The Popularity of Text-to-Image Synthesis .................................................................................... 15
2.3.4 Why Text-to-Image Synthesis? ........................................................................................................ 15
2.3.5 Proposed Architecture ...................................................................................................................... 16
2.4 Review of Technologies............................................................................................................................ 16
2.4.1 Deep Learning ................................................................................................................................... 16
2.4.2 Image Generative Models ................................................................................................................. 16
2.4.3 Text-to-Image Synthesis ................................................................................................................... 17
2.4.4 Generative Adversarial Networks ................................................................................................... 18
2.4.5 Text Preprocessing ............................................................................................................................ 19
2.4.6 Text Encoding.................................................................................................................................... 19
2.4.7 Image Encoding ................................................................................................................................. 20
2.4.8 Contrastive Learning ........................................................................................................................ 20
2.4.9 Image Super resolution..................................................................................................................... 20
2.4.10 Algorithmic Selection ...................................................................................................................... 20
2.4.11 Model Evaluation ............................................................................................................................ 23
2.5 Existing Systems ....................................................................................................................................... 24
2.5.1 Supervised Learning Systems of Text-to-Image Synthesis............................................................ 24
2.5.2 Unsupervised Learning Systems for Text-to-Image Synthesis ..................................................... 25
2.5.3 Critical Analysis of Existing Systems .............................................................................................. 27
2.5.4 Benchmarking ................................................................................................................................... 28
2.6 Chapter Summary ............................................................................................................................. 29
3. METHODOLOGY .................................................................................................................................... 30
3.1 Chapter Overview .................................................................................................................................... 30
3.2 Research Methodology ............................................................................................................................ 30
3.3 Development Methodology ...................................................................................................................... 31
3.4 Project Management Methodology ........................................................................................................ 32
3.4.1 Project Deliverables .......................................................................................................................... 32
3.4.2 Project Plan ....................................................................................................................................... 32
3.4.3 Resource Requirements .................................................................................................................... 32
3.4.4 Risk Management ............................................................................................................................. 34
3.5 Chapter Summary ................................................................................................................................... 34
4. SOFTWARE REQUIREMENT SPECIFICATION .................................................................................. 35
4.1 Chapter Overview .................................................................................................................................... 35

4.2 Rich Picture of The System ..................................................................................................................... 35


4.3 Stakeholder Analysis ............................................................................................................................... 35
4.3.1 Stakeholder Onion Model ................................................................................................................ 36
4.3.2 Stakeholder Viewpoints .................................................................................................................... 36
4.4 Selection of Requirement Elicitation Methods ...................................................................................... 38
4.5 Discussion of Findings through Different Elicitation Methods............................................................ 39
4.5.1 Literature Review ............................................................................................................................. 39
4.5.2 Survey................................................................................................................................................. 40
4.5.3 Brainstorming ................................................................................................................................... 45
4.6 Summary of Findings .............................................................................................................................. 46
4.7 Context Diagram ...................................................................................................................................... 47
4.8 Use Case Diagram .................................................................................................................................... 48
4.9 Use Case Descriptions .............................................................................................................................. 48
4.10 Requirements.......................................................................................................................................... 50
4.10.1 Functional Requirements ............................................................................................................... 51
4.10.2 Non-Functional Requirements ....................................................................................................... 52
4.11 Chapter Summary ................................................................................................................................. 52
5. SOCIAL, LEGAL, ETHICAL AND PROFESSIONAL ISSUES ............................................................. 53
5.1 Chapter Overview .................................................................................................................................... 53
5.2 SLEP Issues & Mitigation ....................................................................................................................... 53
5.3 Chapter Summary ................................................................................................................................... 54
6. DESIGN .......................................................................................................................................................... 55
6.1 Chapter Overview .................................................................................................................................... 55
6.2 Design Goals ............................................................................................................................................. 55
6.3 System Architecture Design .................................................................................................................... 55
6.3.1 Tiered Architecture........................................................................................................................... 55
6.4 System Design ........................................................................................................................................... 57
6.4.1 Choice of the Design Paradigm ........................................................................................................ 57
6.4.2 Component Diagram......................................................................................................................... 57
6.4.3 Sequence Diagram............................................................................................................................. 58
6.4.4 Class Diagram ................................................................................................................................... 59
6.4.5 UI Design............................................................................................................................................ 59
6.4.6 System Process Flow Chart .............................................................................................................. 60
6.5 Chapter Summary ................................................................................................................................... 60

7. IMPLEMENTATION ................................................................................................................................... 61
7.1 Chapter Overview .................................................................................................................................... 61
7.2 Technological Selection ........................................................................................................................... 61
7.2.1 Technological Stack .......................................................................................................................... 61
7.2.2 Data Selection .................................................................................................................................... 61
7.2.3 Selection of Development Framework ............................................................................................ 62
7.2.4 Programming Language ................................................................................................................... 62
7.2.5 Libraries Utilized .............................................................................................................................. 62
7.2.6 IDE’s Utilized .................................................................................................................................... 62
7.2.7 Summary of Technology Selection .................................................................................................. 63
7.3 Implementation of Core Functionalities ................................................................................................ 63
7.3.1 Core Research Contribution ............................................................................................................ 63
7.3.2 System Benchmarking Algorithms .................................................................................................. 66
7.4 Implementation of APIs........................................................................................................................... 67
7.5 Chapter Summary ................................................................................................................................... 67
8. TESTING ........................................................................................................................................................ 68
8.1 Chapter Overview .................................................................................................................................... 68
8.2 Objectives and Goals of Testing ............................................................................................................. 68
8.3 Testing Criteria ........................................................................................................................................ 68
8.4 Model Evaluation ..................................................................................................................................... 68
8.5 Benchmarking .......................................................................................................................................... 71
8.6 Functional Testing ................................................................................................................................... 71
8.7 Module Integration Testing..................................................................................................................... 73
8.8 Non-Functional Testing ........................................................................................................................... 74
8.8.1 Accuracy ............................................................................................................................................ 74
8.8.2 Performance ...................................................................................................................................... 74
8.8.3 Security .............................................................................................................................................. 75
8.8.4 User Friendliness ............................................................................................................................... 75
8.9 Limitations of the Testing Process .......................................................................................................... 76
8.10 Chapter Summary ................................................................................................................................. 76
9. EVALUATION .............................................................................................................................................. 77
9.1 Chapter Overview .................................................................................................................................... 77
9.2 Evaluation Methodology & Approach ................................................................................................... 77
9.3 Evaluation Criteria .................................................................................................................................. 77

9.4 Self Evaluation ......................................................................................................................................... 78


9.5 Selection of the Evaluators ...................................................................................................................... 80
9.6 Evaluation Results and Expert Opinions ............................................................................................... 80
9.6.1 Qualitative Result Analysis .............................................................................................................. 80
9.6.2 Quantitative Evaluation ................................................................................................................... 83
9.7 Limitations of Evaluation ........................................................................................................................ 84
9.8 Evaluation of Functional Requirements ................................................................................................ 85
9.9 Evaluation of Non-Functional Requirements ........................................................................................ 85
9.10 Chapter Summary ................................................................................................................................. 85
10. CONCLUSION ............................................................................................................................................ 86
10.1 Chapter Overview .................................................................................................................................. 86
10.2 Achievement of Research Aims & Objectives ..................................................................................... 86
10.3 Utilization of Knowledge from the Course .......................................................................................... 86
10.4 Use of Existing Skills .............................................................................................................................. 87
10.5 New Skills................................................................................................................................................ 87
10.6 Achievement of Learning Outcomes .................................................................................................... 88
10.7 Problems & Challenges Faced .............................................................................................................. 88
10.8 Deviations................................................................................................................................................ 89
10.9 Limitations of the Research .................................................................................................................. 89
10.10 Future Enhancements .......................................................................................................................... 90
10.11 Research Contribution of Achievement ............................................................................................. 90
10.12 Concluding Remarks ........................................................................................................................... 90
REFERENCES ...................................................................................................................................................... I
APPENDICES .....................................................................................................................................................IX
Appendix A – Concept Map ...........................................................................................................................IX
Appendix B – Comparison of Text-to-Image Synthesis Models ..................................................................X
Appendix C – Comparison of Supervised Learning Text-to-Image Systems ...........................................XII
Appendix D – Comparison of Unsupervised Learning Text-to-Image Systems......................................XII
Appendix E – Gantt Chart ........................................................................................................................... XV
Appendix F – Requirement Engineering Survey ...................................................................................... XVI
Appendix G – Design Goals ...................................................................................................................... XVIII
Appendix H – UI Design & Mockups ..........................................................................................................XIX
Appendix I – Implementation DAMSM Network & ATTN GAN.............................................................XX
Appendix J – Selection of Evaluators (Name, Position, Evaluation) ..................................................... XXVI

Appendix K – Evaluation of Functional Requirements......................................................................... XXVII


Appendix L - Evaluation of Non-Functional Requirements ................................................................ XXVIII

List of Figures
Figure 1.1 - Prototype Feature Diagram (Self-Composed) ................................................................... 10
Figure 2.1 - Proposed Architecture (Self- Composed) .......................................................................... 16
Figure 4.1 - Rich Picture of the System (Self-Composed) .................................................................... 35
Figure 4.2 - Stakeholder Onion Model (Self-Composed) ..................................................................... 36
Figure 4.3 - Context Diagram (Self-Composed) ................................................................................... 47
Figure 4.4 - Use case Diagram (Self-Composed) .................................................................................. 48
Figure 6.1 - Tiered Architecture (Self-Composed) ............................................................................... 55
Figure 6.2 - Component Diagram (Self-Composed) ............................................................................. 58
Figure 6.3 - Sequence Diagram (Self-Composed)................................................................................. 58
Figure 6.4 - Class Diagram (Self-Composed) ....................................................................................... 59
Figure 6.5 - System Process Flowchart (Self-Composed) ..................................................................... 60
Figure 7.1 - Contrastive Loss Function ................................................................................................. 64
Figure 7.2 - Training GAN Network ..................................................................................................... 65
Figure 7.3 - Inception Score Calculation ............................................................................................... 66
Figure 7.4 - R-Precision Calculation ..................................................................................................... 66
Figure 7.5 - API Route (Generate Fashion Design) .............................................................................. 67
Figure 7.6 - API Route (Download Image) ........................................................................................... 67
Figure 8.1 - Inception Score .................................................................................................................. 69
Figure 8.2 - R-Precision......................................................................................................................... 70
Figure 8.3 - DAMSM Model ................................................................................................................. 70
Figure 8.4 - Benchmarking of Existing Systems ................................................................................... 71
Figure 8.5 - GPU Performance .............................................................................................................. 75
Figure 8.6 - Usability of the Prototype .................................................................................................. 76
Figure 9.1 - Presented Solution Evaluation ........................................................................................... 83
Figure 9.2 - Solution to the identified Problem Evaluation ................................................................... 83
Figure 9.3 - Evaluation metrics Evaluation ........................................................................................... 84
Figure 9.4 - Accuracy of the Prototype Evaluation ............................................................................... 84

Figure 9.5 - Presented GUI Evaluation ................................................................................................. 84

List of Tables
Table 1.1 - Existing Works ...................................................................................................................... 5
Table 1.2 - Research Objectives .............................................................................................................. 9
Table 3.1 - Research Methodology........................................................................................................ 31
Table 3.2 - Project Deliverables ............................................................................................................ 32
Table 3.3 - Hardware Requirements ...................................................................................................... 33
Table 3.4 - Software Requirements ....................................................................................................... 33
Table 3.5 - Risk Management................................................................................................................ 34
Table 4.1 - Stakeholder Viewpoints ...................................................................................................... 38
Table 4.2 - Selection of Requirement Elicitation Methods ................................................................... 39
Table 4.3 - Findings through Literature Review ................................................................................... 40
Table 4.4 - Survey Findings .................................................................................................................. 45
Table 4.5 - Brainstorm Findings ............................................................................................................ 46
Table 4.6 - Summary of Findings .......................................................................................................... 47
Table 4.7 - Use case Description (Input Description) ........................................................................... 49
Table 4.8 - Use case Description (Download Image Output)................................................................ 50
Table 4.9 - MoSCoW Technique ......................................................................................... 50
Table 4.10 - Functional Requirements .................................................................................................. 51
Table 4.11 - Non-Functional Requirements .......................................................................... 52
Table 5.1 - SLEP Issues & Mitigations ................................................................................................. 54
Table 6.1 - Choice of Design Paradigm ................................................................................................ 57
Table 7.1 - Summary of Technology Selection ..................................................................................... 63
Table 8.1 - Quantitative Test Results .................................................................................................... 70
Table 8.2- Functional Testing ................................................................................................................ 73
Table 8.3 - Module Integration Testing ................................................................................................. 74
Table 9.1 - Evaluation Criteria .............................................................................................................. 78
Table 9.2 - Self-Evaluation .................................................................................................................... 80
Table 9.3 - Research Concept of the Project ......................................................................................... 80
Table 9.4 - Novelty of the Project Evaluation ....................................................................................... 81
Table 9.5 - Proposed Architecture of the Project Evaluation ................................................................ 82


Table 9.6 - Model Implementation Code Evaluation ............................................................................ 82


Table 9.7 – GUI Evaluation ................................................................................................................... 83
Table 10.1 - Utilization of the Course ................................................................................................... 87
Table 10.2 - Achievement of Learning Outcomes ................................................................................ 88
Table 10.3 - Problems & Challenges Faced .......................................................................................... 89

List of Abbreviations

Abbreviation Full Form
AI Artificial Intelligence
CNN Convolutional Neural Network
GAN/s Generative Adversarial Network/s
GPU Graphics Processing Unit
GUI Graphical User Interface
LR Literature Review
LSTM Long Short-Term Memory
OS Operating System
SSADM Structured Systems Analysis and Design Method
DAMSM Deep Attentional Multimodal Similarity Model
SOTA State of The Art
IS Inception Score
LO Learning Outcome
FR Functional Requirement
NFR Non-Functional Requirement
RNN Recurrent Neural Network
SLEP Social, Legal, Ethical and Professional
NLP Natural Language Processing
RO Research Objectives
VAE Variational Autoencoder
CPU Central Processing Unit
FID Fréchet Inception Distance


1. INTRODUCTION
1.1 Chapter Overview
The introduction chapter gives a comprehensive overview of the entire research project. The background is covered first, followed by the problem domain and definition. After that, the research contributions, aims, and objectives are addressed in detail. Overall, the chapter thoroughly discusses the problem, the research domain, and the author's motivation for conducting the research.

1.2 Problem Domain


1.2.1 Generative adversarial networks
According to Goodfellow et al. (2014a), generative adversarial networks are a type of generative model that generates images, audio, video, and text based on the characteristics of the training dataset. In other words, the ultimate goal is to generate new output similar to the training data. Different types of GANs have been trained, tested, and evaluated to outperform the SOTA models that existed in various domains. A GAN features a generator and a discriminator model, trained simultaneously in an adversarial manner: the generator keeps improving until it can fool the discriminator, and is thus able to produce high-quality synthesized outputs based on the training dataset. Widely known use cases of GANs include generating realistic-looking images, generating visual emotions of human beings, and audio generation.
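
To make the adversarial training loop described above concrete, the following is a minimal illustrative sketch in PyTorch (the library used later in this project). The toy fully connected networks, learning rates, and flattened 28x28 image size are assumptions chosen purely for illustration, not the architecture used in this research.

import torch
import torch.nn as nn

latent_dim = 100  # size of the random noise vector fed to the generator

# Toy networks; real GANs use much deeper convolutional architectures.
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, 784), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator update: real images should be scored as real, generated ones as fake.
    fake_images = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fake_images), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to make the discriminator score generated images as real.
    g_loss = bce(discriminator(generator(torch.randn(batch, latent_dim))), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example: one step on a dummy batch of eight flattened 28x28 images.
d_loss, g_loss = train_step(torch.randn(8, 784))

Each call to train_step performs one round of the minimax game: the discriminator is pushed to separate real from generated samples, and the generator is pushed to reduce that separation.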

1.2.2 Text-to-Image Synthesis


Text-to-image synthesis aims to generate high-quality photorealistic images that are semantically consistent with the input text description (Hu et al., 2021). Existing systems built using unsupervised learning approaches such as GANs have demonstrated impressive results, and the area has received a lot of attention because it is usefully applicable in many industries and various other domains. Previous studies and SOTA models have shown commendable results on the visual quality of the generated images, and text-to-image synthesis has been experimented with, tested, and evaluated on several datasets; the inception scores achieved by GANs are among the most remarkable results in this domain so far. However, mode collapse and the inability of GANs to generate images from text descriptions at a fine-grained level, taking only certain phrases or keywords from the text, leave the image visually incomplete and not useful for domains that prefer to input text descriptions of substantial size.
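
As a schematic of how the text typically conditions the generator in such systems, the fragment below simply concatenates a sentence embedding with the noise vector before generation; the embedding size, layer widths, and 64x64 output resolution are illustrative assumptions rather than the architecture used in later chapters.

import torch
import torch.nn as nn

noise_dim, text_dim = 100, 256

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # The sentence embedding is concatenated with the noise vector, so the
        # generated image depends on both randomness and the description.
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 512), nn.ReLU(),
            nn.Linear(512, 64 * 64 * 3), nn.Tanh())

    def forward(self, noise, text_emb):
        return self.net(torch.cat([noise, text_emb], dim=1)).view(-1, 3, 64, 64)

# Example: generate four 64x64 images conditioned on four sentence embeddings.
gen = ConditionalGenerator()
images = gen(torch.randn(4, noise_dim), torch.randn(4, text_dim))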

1.2.3 In-Vogue Fashion Design Industry and Problems Encountered


The fashion design industry is a large-scale manufacturing business that has created plenty of demand and supply in the 21st century, which underlines the significance of this domain. There are also a lot of problems that the fashion design industry encounters: it takes a lot of hard work, creativity, and time to produce fashion designs. It is very clear that manually creating fashion designs and in-vogue stylings can be time-consuming, involving a lot of human effort, raw materials, and in-house energy consumption. With all of that said, the following problem definition was brought out.

1.3 Problem Definition


Generating fashion designs and in-vogue styles by experts in the industry is very time-consuming, and it needs a lot of human effort, such as creativity, the ability to reason, and the ability to comprehend various phenomenal factors (nature, background, luminosity). Although these tasks can be carried out manually, they remain tedious and need technical intervention.

Text-to-image synthesis has been applied to modern multimodal applications to make it easier to generate images from a wide variety of textual inputs such as keywords and phrases. It has been used as a generic approach in CAD design, scientific engineering, graphic design, image fine-tuning, and perhaps even animation (Zaidi, 2017). However, these transformations have only produced outputs for simple phrases, clauses, and pre-defined text trained toward the image. At the most basic level, generative adversarial networks (GANs) (Goodfellow et al., 2014) have successfully demonstrated learning a probability distribution to synthesize realistic example images from textual descriptions, and they have been used widely in the text-to-image domain. Recent progress in generative models, especially GANs in many variants, has brought significant improvements in synthesizing plausible images. But these models have only proven successful for text descriptions spanning at most 15-20 words that also use only terms specific to the domain; when a larger span of text such as storylines, paragraphs, or detailed descriptions is given as input, GANs have not been able to produce versatile, high-quality images. In simple terms, only fine-grained texts are considered as input, producing images specific to the domain (Nasr, Mutasim and Imam, 2021). The current best model boosts the original GAN Inception Score by +4.13% and the Fréchet Inception Distance score by +13.93% benchmarked against the CIFAR-100 dataset, which is still not adequate compared with generating images from a large text description (Cheng and Gu, 2020a).
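
For reference, the two metrics quoted above have standard definitions (stated here for clarity; they are not specific to the cited works). The Inception Score rewards images whose class predictions are individually confident yet diverse across the generated set, while the Fréchet Inception Distance measures the distance between Gaussian fits of real and generated image features:

\mathrm{IS} = \exp\big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big)

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big)

Here p(y|x) is the Inception network's label distribution for a generated image x, p(y) is its marginal over the generated set, and (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the mean and covariance of Inception features for real and generated images respectively. A higher IS and a lower FID indicate better generation quality.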


1.3.1 Problem Statement


Designing in-vogue stylings and fashion designs at a mass scale through manual crafting, with the use of in-house energy and raw materials, is very time-consuming. To avoid all of this, the author proposes a generic text-to-image model that takes in fashion terms along with substantial descriptions in three natural languages and produces fashion designs of various in-vogue styles.

1.4 Research Motivation


Ian Goodfellow's revolutionary introduction of the initial GAN (Goodfellow et al., 2014b) and the variants built by other researchers are being applied to image generation in many domains. GANs have traditionally been relied upon for unsupervised learning, and they have shown significant potential in the image generation field, which has recently received a lot of attention (Salimans et al., 2016). This is especially helpful for the fashion business, where the latest trends could easily be depicted in several ways in only a few paragraphs. Accordingly, the author is convinced that an AI-based solution can reduce the amount of time, personnel, money, and other resources required to create an in-vogue fashion design, which will have a huge impact on fashion designers, professionals, and the business as a whole.

1.5 Existing Work


Several GANs have been used in the past three years as experimental prototypes to test text-to-image synthesis in terms of resolution, the length of text they can process, performance time, non-blurriness of the image, and the number of different image varieties they can generate per training run. Different research papers have used different variants of GANs to train for the above features, notably StackGAN (two-stage), DF-GAN, MirrorGAN and DM-GAN, as well as variational autoencoders. The papers in the section below explain in depth how they achieved text-to-image synthesis, tested with different datasets.

Citation: (Bodnar, 2018a)
Technique: Wasserstein GAN-CLS method was employed. For the image to be formed, the input text is encoded as a vector and delivered to the generator with a succession of CNN layers that generate random noise.
Improvement: The model boosts the Inception Score by 7.07%, the best score (on the Caltech birds dataset) among the models that use only phrase-level keywords to generate visual semantics (images).
Limitations: Resolution of images limited to 64x64 pixels; phrase-level keywords considered for training; blurry images with a lot of background noise.

Citation: (Zaidi, 2017)
Technique: A two-step procedure where each step is its own, easy-to-handle GAN implementation. Stage 1: a GAN makes a rough, blurry sketch of the image based on the given text. Stage 2: the sketched image is raised to a higher resolution based on the embedded text and the rough image from Stage 1.
Improvement: The evaluation shows Stack-GAN productively generates highly realistic synthetic images from a text phrase; the model boosts the Fréchet Inception score by 14.4%.
Limitations: Resolution limited to 256x256 pixels; phrase-level keywords considered for training.

Citation: (Zhu et al., 2019)
Technique: A dynamic memory component processes the incoming text data and refines its representation of the highlighted text to provide more accurate visuals and memory representations.
Improvement: Proposes the dynamic memory module in order to visually depict a global image feature in less time; produces high-quality synthetic images.
Limitations: Refines photos with poor color and shape; accepts sentence-level text, but only a few relevant words are taken into consideration, thereby generating some images outside the scope; generates images from varied scopes irrespective of the input image scope.

Citation: (Zhang, Li and Zhou, 2021a)
Technique: The discriminator is constrained by the textual description of genuine images in the noise vector. The image is generated using Deep Convolutional Generative Adversarial Networks (DCGAN).
Improvement: Focused on generating more diverse images, thus avoiding mode collapse to some extent; the synthesized images have relatively high resolution and quality.
Limitations: Phrase-level textual descriptions; uses local word embeddings, thereby limiting the vocabulary available to the training set.

Citation: (Mishra et al., 2020)
Technique: The GAN's primary goal is a reduced residual loss from the generator and discriminator when used together.
Improvement: Stabilizes the learning process and generates images from text descriptions faster.
Limitations: Texts are query-related and only fine-grained relevant words are taken as input; the quality of images is low, at 64x64 pixels.

Table 1.1 - Existing Works

1.6 Research Gap


The research gap that the author addresses is the ability of text-to-image synthesis models to take larger descriptions of multiple sentences as input and generate images from them, which was identified as future work by Mishra et al. (2020). Moreover, to improve the resolution of the generated image, the author has employed an ESRGAN trained on the dataset to generate images of 512x512 pixels. Another gap addressed in the prototype is the ability to take three natural languages (English, Sinhala, and Tamil) as input, so that the model can generate images from all three languages.

1.7 Contribution to The Body of Knowledge


The gap that the author wishes to address is to extend the model architecture to take in descriptions with multiple sentences and to apply it to the trending in-vogue style and design industry. In doing so, the work contributes significantly, in a variety of ways, to both the technical and domain bodies of knowledge.


1.7.1 Technical Contribution


There have been mixed results with the text-to-image synthesis models produced so far (Dhivya and Navas, 2020): they only accept texts at a fine-grained level. Even when longer sentences are provided, just the most relevant words are used to generate images, resulting in images that are less detailed and lack unexpected features. To prevent all of this, the widely preferred AttnGAN (Xu et al., 2018a) will be combined with a contrastive learning approach to learn and train on images with larger textual descriptions in three natural languages.
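
As a rough illustration of the contrastive objective, the sketch below computes a symmetric InfoNCE-style loss over a batch of image and sentence embeddings, pulling matching image-text pairs together and pushing mismatched pairs apart. The embedding dimension, temperature, and use of random tensors in place of real encoder outputs are illustrative assumptions; the exact loss used in this research is described in the implementation chapter.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.1):
    # image_emb, text_emb: (batch, dim) embeddings of matching image-text pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0))        # the i-th text matches the i-th image
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return (loss_i2t + loss_t2i) / 2

# Example usage with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))

Minimizing this loss drives descriptions of the same image toward its embedding while pushing descriptions of different images apart, which is the behaviour described in the abstract.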

1.7.2 Domain Contribution


Based on the specified description, the system will generate a variety of diverse outputs of in-vogue
stylings. This way, fashion designers can easily develop designs and obtain vast ideas for more
inventive patterns to use in their creations. This will save time, manpower, mental effort, and raw
materials.

1.8 Research Challenges


According to the author's investigation, there are plenty of challenges to be faced during the research process; they are listed below.

 Analyzing and choosing the appropriate technology to solve the problem of taking input of
larger descriptions of substantial size and synthesizing them into high quality images.
 Designing and developing a framework that will overcome limitations such as mode collapse, blurry images, and inconsistency of visual and textual semantics found in the existing SOTA text-to-image synthesis models.
 Analyzing and choosing the best unsupervised text-to-image synthesis approach with contrastive learning to synthesize textual descriptions into high-quality two-dimensional images.
 Identifying the appropriate techniques, tools, libraries, and technologies to develop a text-to-image synthesis model that will accept larger text descriptions.
 Analyzing and identifying tools and technologies to train multi-language-based text-to-image
synthesis models.

1.9 Research Questions


RQ1: What are the current technology frameworks/algorithms/network architectures used in the text-
to-image synthesis models?


RQ2: What are the current problems being faced and areas that should be improved when developing
text-to-image synthesis models to generate better-detailed images for the descriptions given?

RQ3: What are the current problems being faced and areas that should be improved when developing
text-to-image synthesis models to generate better-detailed images for the descriptions given?

RQ4: What are the challenging problems that need to be solved when transforming text to images?

RQ5: What existing text to image synthesis models can be used as a partial resolution to solve the
challenging problems faced when transforming text to images?

1.10 Research Aim


This research aims to design, develop, and evaluate a text-to-image synthesis model that takes as input in-vogue fashion-related text descriptions containing a large number of words, in any of three natural languages (English, Sinhala, and Tamil), and generates highly synthesized in-vogue fashion design images with better detail, resolution, and features.

To further elaborate on the aim, this research project will produce a text-to-image synthesis model for fashion designers. The system will allow stakeholders to input descriptions explaining the desired in-vogue fashion style they wish to visualize, and it will produce high-quality two-dimensional images with a lot of detail present. The outputs will be compared against state-of-the-art models in the domain using proper evaluation metrics. The research domain and the technological body of knowledge will be properly researched before proceeding with development, and the knowledge gained will be used to develop the components needed for the model and several other areas that will achieve the outcome. The system will be open source and will be able to run on any device, including mobile and desktop.

1.11 Research Objectives


After investigating the research aim and questions, the following objectives were set out for the research.

Research Objective: Problem Identification (Learning Outcomes: LO1, LO4 | Research Questions: RQ1)
Description: Carry out in-depth research to identify a potential problem that needs to be solved.
 RO1: Research the fashion design domain and analyze the problems encountered.
 RO2: Research how to achieve the ability to visualize in-vogue images simply from fashion textual descriptions.

Research Objective: Literature Review (Learning Outcomes: LO1, LO4, LO8 | Research Questions: RQ2, RQ3)
Description: Conduct extensive research to evaluate how to achieve the target outcome.
 RO1: Research and analyze existing text-to-image synthesis models.
 RO2: Analyze techniques and technologies used to transform text to images.
 RO3: Research how current text-to-image synthesis models have been applied to different industries.
 RO4: Elaborate the research gap applied to the fashion design industry and to training larger spans of text in three natural languages.
 RO5: Provide an analysis document on the critically evaluated systems.

Research Objective: Data Gathering and Analysis (Learning Outcomes: LO3, LO4, LO6, LO8 | Research Questions: RQ3)
Description: Carry out requirement gathering and analysis.
 RO1: Gather feedback on building a text-to-image transformation model that will take larger spans of text in three natural languages, applied for the first time to the fashion design domain.
 RO2: Evaluate the requirements gathered to develop a GAN-based network that will allow the transformation of text to images.
 RO3: Gather, evaluate, and define the end-user requirements through questionnaires.

Research Objective: Design & Implementation (Learning Outcomes: LO2, LO5, LO7, LO8 | Research Questions: RQ4, RQ5)
Description: Plan the timeline and design a text-to-image synthesis model that takes larger spans of fashion-based textual descriptions.
 RO1: Design and develop a DAMSM image-text encoder network.
 RO2: Design and develop the AttnGAN model architecture ensembled with contrastive learning, using DL to combine the image and text vectors into batch layers.
 RO3: Design and develop the frontend and backend for the system.

Research Objective: Testing and Evaluation (Learning Outcomes: LO8, LO9 | Research Questions: RQ3, RQ4, RQ5)
Description: Evaluation of the generative models includes inception score and visual quality comparisons. Test and evaluate the prototype.
 RO1: Evaluate and test the created text-to-image model and compare it against current existing models.
 RO2: Create a test plan to perform unit, integration, and functional requirement testing of the prototype.
 RO3: Produce a detailed report for the academic and research community.

Table 1.2 - Research Objectives

1.12 Project Scope


Based on the initial literature review and research objectives, focused and non-focused points of
research are defined.

1.12.1 In-Scope
Parts that will be focused on during the research process are as follows:

 Reviewing and analyzing the SOTA generative adversarial network models applied to the text-
to-image domain and other text-to-image synthesis models.


• Deciding on a proper approach to transform larger spans of text into a detailed image with all the features mentioned in the text.
• Effectively reading in-vogue fashion-based text descriptions, mapping each feature word from the text to an image, and finally generating an in-vogue fashion design image with all the relevant details needed from the text.
• Evaluating results of the system based on common evaluation metrics.
• Developing a full-stack application that will take in-vogue fashion text descriptions and generate a two-dimensional high-quality image.

1.12.2 Out-Scope
Parts that will not be focused on during the research process are as follows:

• The project will be limited to taking only in-vogue fashion-based textual descriptions, and the model will only be effective on fashion-related texts and other supporting words when generating in-vogue fashion design images.
• The project's focus is only to extend the ability to train larger spans of textual description and the ability to work on three natural languages, not to generate videos from texts or generate images outside of the fashion design scope.

1.12.3 Prototype Feature Diagram

Figure 1.1 - Prototype Feature Diagram (Self-Composed)


1.13 Document Structure


The dissertation comprises ten chapters in all. Chapter one presents an overview of the project, including the identified problem, its domain, the research gap, the objectives, and the scope of the investigation.

A thorough literature review of the research, existing systems, and possible evaluation metrics is
presented in Chapter two.

All of the approaches utilized in research, software development, and project management are described
in Chapter three.

Chapter four presents the software requirement specification, in which the prototype's functional and non-functional requirements are discussed in detail using the selected elicitation techniques.

The possible social, legal, ethical, and professional concerns that may arise throughout the course of
the project and the steps that can be taken to mitigate those issues are discussed in detail in chapter five.

In chapter six, we'll look at how to use the literature and requirements we've gathered to create a system
that both solves the problem we've identified and fills in the research gap.

Chapter seven explains how the design decisions were translated into a form that could be implemented as a computer program.

An in-depth examination of the software testing procedures is provided in Chapter eight to ensure that the system meets the required standard.

Chapter nine explains how the system was evaluated using a variety of quantitative and qualitative
methods.

Chapter ten wraps up the research and reflects on the project's achievements.

1.14 Chapter Summary


The first chapter provided a thorough overview of the research project. It described how the problem
area was investigated, current systems were evaluated, and a research gap was identified. It explained
how bridging the gap will benefit both the domain and the software industry. The project's research
problems, goal, and objectives were determined.


2. LITERATURE REVIEW
2.1 Chapter Overview
Text-to-image synthesis has received significant exposure in the recent past. Many of the approaches followed by different researchers have been experimented with, examined, and evaluated to show the successes and failures this domain produces. Critically analyzing the existing work to understand how these prototypes function is beneficial for continuing research in this domain. This chapter presents an analysis of the existing work in the text-to-image synthesis domain. The pros and cons of the various approaches and their reviews in terms of algorithms supported, features, and implementations are explored.

2.2 Concept Graph


A visually represented concept-map is offered to understand the broader domain of text-to-image
synthesis. The complete walkthrough of the relationship between the research concepts and the text-to-
image synthesis domain is graphically presented using a funnel approach. The mapping explains the
deep learning approaches, existing work, and the evaluation metrics that will be used. The graph can be
found in Appendix A.

2.3 Problem Domain


2.3.1 Introduction to the In-Vogue Fashion Design Industry
The in-vogue fashion design industry has been building its presence in various ways. The active visual appeal of design and fashion components has increased customer buying in recent times. Fashion designers manually create in-vogue fashion designs, which involves the usage of raw materials, in-house energy, and human effort. These fashion designs comprise fashion products with unique designs focused on in-vogue styling concepts such as haute couture, label, ensemble, silhouette, off-the-rack, hemline, in vogue, fashion-forward, monochrome, and peplum, and a high price is paid to deliver them to the market (Abid, 2020).

2.3.1.1 Benefits of Using a Text to Image System in this domain


Text-to-image synthesis has been a difficult task that has also been shown to be incredibly beneficial. Due to its possible outcomes, it has recently gained great interest from multimedia communities, architecture firms, CAD design, and other media-related agencies (Tan et al., 2019). The majority of current methods are based on generative adversarial network (GAN) models, which synthesize images based on a global linguistic representation. However, due to the sparsity of the global representation, GAN training is challenging and the resulting images lack fine-grained information. As of now, the majority of generated images are conditioned on descriptions that are merely a few sentences long (Hu, Long and Xiao, 2021). But the promising thing about text-to-image synthesis is the ability to generate high-quality synthesized images using different techniques, which is beneficial in various ways (Li et al., 2020a).

The art of generating images from text using AI and deep learning has recently received a lot of attention (Bodnar, 2018b). The results of several text-to-image synthesis applications have attracted attention from a variety of interested stakeholders. The fashion design industry consumes a lot of in-house energy, additional raw materials, human effort, and time in manually designing in-vogue fashion designs. To avoid these hassles, a fashion-related system belonging to the text-to-image synthesis domain can resolve the issue of manually creating in-vogue fashion designs. With just a few pieces of textual description combined with technical intervention, high-quality, two-dimensional, unique synthesized fashion designs can be generated.

2.3.2 How is Text-to-Image Synthesis Performed?


Text-to-image synthesis is the procedure of synthesizing textual descriptions containing a substantial number of words into high-quality two-dimensional images. Throughout the existing works on text-to-image synthesis, the generation of high-quality images has played an important role, as the textual descriptions were prioritized in order to generate images that were semantically consistent. The results so far using approaches like DCGAN (Radford, Metz and Chintala, 2016), CGAN (Bodnar, 2018b), and StackGAN (Zhang et al., 2017) have been promising, but only to a certain extent; the text and image were paid attention to only at a fine-grained level (Cheng and Gu, 2020b). The textual description differs based on the number of words and the visual context it appeals to. Simultaneously focusing on the texts and the images during the training stages using many existing technologies has not maintained semantic consistency; as a result, the visual output is often out of scope or less detailed (Souza, Wehrmann and Ruiz, 2020). In the early stages, traditional approaches were used to transform text to images, but later, with the emergence of deep learning, unsupervised learning received a lot of attention. With the use of attention mechanisms, text encoders, and GANs, deep learning approaches have shown remarkable progress over traditional approaches because of their powerful training capability.


2.3.2.1 Traditional Approaches


Traditional approaches to text-to-image synthesis follow supervised learning. Highly occurring words in a given text description were given the utmost priority, and the images generated pertained to the text description with the most occurring words. The first works were staged in 2007, but unfortunately, the drastically high correlation required and the weak semantic consistency between texts and images destroyed the potential of researching further in this domain. The synthesis was executed in such a way that the image was searched from the dataset and a supervised learning approach was applied to it (S. Reed et al., 2016). Text and images that had a strong correlation and could be visualized were joined to form textual units, which then found the most likely image to be generated (Zhu et al., 2007). But the images that were generated were far from the textual description. Also, this approach does not have the potential to generate new images; in fact, the new images generated are merely changes to the characteristics of the training images. In short, it was just a manipulation of existing images. The inability to generate new images and the lack of semantic accuracy did not benefit developers. Thus, research started on unsupervised learning approaches using deep learning (Dong et al., 2017).

2.3.2.2 Deep Learning Approaches


Works based on text-to-image synthesis have emerged with the rise of deep learning. Understanding the potential of the text-to-image synthesis domain, many unsupervised learning approaches and techniques using deep learning have been established. GANs (Goodfellow et al., 2014a), which were implemented for image generation (Goodfellow et al., 2014a), image-to-image translation (Isola et al., 2017), and image inpainting (Yu et al., 2019), were used in this domain. CGAN (Bodnar, 2018b), StackGAN (Zhang et al., 2017), MirrorGAN (Qiao et al., 2019), DC-GAN (Radford, Metz and Chintala, 2016), DM-GAN (Zhu et al., 2019), and AttnGAN (Xu et al., 2018b) were some of the GAN-based techniques which followed the deep learning approach in the text-to-image synthesis domain. These approaches followed a unique multi-staged architecture with a self-supervised generator and discriminator which trained iteratively until neither network could outperform the other. These deep learning approaches not only helped in the generation of unique high-quality images but also paid attention to the semantic consistency between the images and the text, such that the unsupervised global spatial constraints were minimized and images related to the textual description were accurately generated.


2.3.3 The Popularity of Text-to-Image Synthesis


Deep convolutional and recurrent neural network architectures have proven popular for learning discriminative text feature representations used to automatically synthesize realistic images from text (Wang et al., 2020). Images defined by natural language descriptions can be visualized and conceptualized by humans. What looks to the human brain to be a straightforward task is one of the most difficult challenges in Natural Language Processing and Computer Vision. Images of all kinds are plentiful in the digital world (Frolov et al., 2020). However, obtaining an image with the requisite resolution, scale, or mode remains a pipe dream. Due to the complexity involved, capturing advanced medical imaging, satellite images, or CAD designs with the required quality is quite difficult. For these disciplines, text-to-image synthesis is a boon. Text-to-image synthesis is also used in various sectors, such as interior design, graphic design, and so on. Due to the multimodal nature of the challenge and the intricacy involved in problem-solving, it is considered a difficult task when compared to the reverse problem of captioning images. At the same time, many new applications are being generated in this domain, as it solves the problems of manual effort, time consumption, and the expenditure of additional raw materials.

2.3.4 Why Text-to-Image Synthesis?


Technological intervention has become a necessity with the advancement of many new industries. It comes in handy for maintenance, security, recommendations, predictions, and so on. Human beings have had a compulsive tendency towards manual hard work, especially in the design industry. A text-to-image application can resolve many of the difficult encounters faced during such work (Frolov et al., 2021). Existing work in this domain dates back to 2007, but the evolution remained stagnant until the advancement of GANs. With the use of unsupervised learning, many new architectures and techniques have been established, but each technique has its own advantages and disadvantages (Yang et al., 2021). The ideology behind the generation of images should not focus only on a single word from a sentence, but on the entire sentence or sentences, generating a semantically consistent, visually appealing image that maps all the features backed up by the textual description (Ruan et al., 2021). As of now, the works have focused on generating images, but only to a fine-grained level, and there are many drawbacks of the text-to-image synthesis domain when applied in real-world scenarios (Tan et al., 2019). Even with the various techniques following the deep learning approach, problems such as mode collapse, momentum de-optimization, blurry images, unpreserved semantic consistency, attention paid only at a fine-grained level, and training data from a single language still exist. To overcome this, the author has chosen to build a system which can handle a substantial amount of text and generate unique two-dimensional images in multiple languages, which will be a welcome turning point in the text-to-image synthesis domain.

2.3.5 Proposed Architecture

Figure 2.1 - Proposed Architecture (Self- Composed)

2.4 Review of Technologies


2.4.1 Deep Learning
Machine learning has a subset called "deep learning" that attempts to replicate the process by which
humans acquire specific sorts of information.

2.4.2 Image Generative Models


Image generative models are models that generate random but meaningful, distinct images after being trained on a domain with datasets that indicate important qualities (Wang et al., 2018). Examples include uniquely created handwritten digits, human faces, and other non-existent objects. These images are so convincing to the naked eye that it is easy to believe they are real. Flow-based deep generative models (Xu, 2020), GANs (Goodfellow et al., 2014a), and variational autoencoders (Kingma and Welling, 2019) are examples of image generative models.

2.4.2.1 Variational Autoencoders


To discover more efficient and compressed visual representations, variational autoencoders, a type of unsupervised learning neural network, compress the data before re-creating the original input from it. They are made up of two connected models, the encoder and the decoder. The decoder is the generative model, while the encoder is the recognition model; in terms of approach, these two models are complementary to one another (Grønbech et al., 2020). A plain autoencoder converts the input to a fixed vector, whereas a VAE converts it to a distribution. A sample z is taken from a prior distribution p(z); then, using a conditional distribution pθ(x | z), x is created. The process can be described mathematically as

pθ(x) = ∫ pθ(x | z) pθ(z) dz
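
To make the encoder/decoder relationship concrete, the following is a minimal sketch in PyTorch of a VAE with the reparameterization trick; the layer sizes and dimensions are illustrative assumptions and not the configuration used in this project.

import torch
import torch.nn as nn

class MiniVAE(nn.Module):
    # Illustrative sizes only; not the encoder/decoder used in this project.
    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)    # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term plus KL(q(z|x) || p(z)) against a standard normal prior.
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl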

2.4.2.2 Generative Adversarial Networks


Generative Adversarial Networks (Goodfellow et al., 2014a) produce images of great subjective visual quality (Grover, Dhar and Ermon, 2018) compared to VAEs, which tend to produce blurrier, less sharply detailed pixel patterns. They have successfully replicated real-world content, including images, human conversation, video, and music, in a wide range of sectors. A GAN is a game with two players, a generator and a discriminator, based on game theory (Heusel et al., 2017). The generator G uses an input noise variable z to generate synthetic samples that capture the real data distribution, in order to mislead the discriminator into assigning the generated samples a high probability of being real. The discriminator D estimates the probability that a sample was taken from the real-world distribution; acting as a critic, it tells the difference between fake and genuine samples. The two networks simultaneously try to outperform each other until the generator produces plausible results. According to (Karras et al., 2018), GANs are used for many computer vision and natural language tasks, such as face generation, image-to-image translation, text generation, and text-to-image synthesis. In the recent past, GANs have gained a lot of attention and popularity because of their ability to perform a wide range of image generation and text-to-image synthesis tasks with high-quality outputs.
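
For reference, the two-player game described above is commonly written as the minmax objective from (Goodfellow et al., 2014a):

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]

where the discriminator D is trained to maximize this value by separating real samples x from generated samples G(z), while the generator G is trained to minimize it by driving D(G(z)) towards 1.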

2.4.3 Text-to-Image Synthesis


Text-to-image synthesis is the process of transforming a textual description into a visual representation of an image (Zhu and Ngo, 2020). The unsupervised learning approach uses paired tuples of text and images as datasets to train the GAN network so that it learns the semantic consistency between the text and the image. Once it learns this semantic consistency, the GAN network uses adversarial training to generate new and unique images that look real but do not exist in the real world (Yuan and Peng, 2020). Some of the examples exhibited by text-to-image synthesis are the generation of ingredient images from recipes, the generation of birds and flowers, and the generation of human poses. All of these systems used an unsupervised GAN-based approach for the development of the prototype (Qiao et al., 2021). The current existing systems have been trained, tested, and evaluated on a few datasets such as Flowers, Birds, and COCO. Upon evaluation, the SOTA system, AttnGAN, which is based on attention-driven mechanisms, has reached an inception score of +4.3% and an FID score of +13.93%.

GANs began to thrive in text-to-image synthesis problems after their creative power was unlocked by (Goodfellow et al., 2014a). As a result, due to the high-quality synthetic outputs created by GANs, their performance in such text-to-image tasks is rated substantially higher than that of other generative models.

2.4.4 Generative Adversarial Networks


2.4.4.1 What is a GAN?
A Generative Adversarial Network (GAN) is an innovative and widely used generative model (Goodfellow et al., 2014a). There are two key components to it: the generator network and the discriminator network. These two networks are created to play a minmax game to find the balance between them, influenced by game theory. The discriminator is taught to categorize fake and real images from the image collection, while the generator network is trained to generate fake images of a specific type, learning the probable distribution of the training data. Both networks are trained in tandem to get the best performance out of the network. The generator network improves over time as a result of the discriminator network's feedback. Finally, as a result of this adversarial learning, the generator network generates high-quality counterfeit images, fooling the discriminator network.
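
As an illustration of this adversarial training loop, the following is a minimal sketch in PyTorch; the fully connected generator/discriminator pair, learning rates, and tensor sizes are placeholder assumptions and are far simpler than the multi-stage networks used in this project.

import torch
import torch.nn as nn

# Placeholder networks; the real AttnGAN generator/discriminator are multi-stage.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                        # real: (batch, 784) tensor of real samples
    batch = real.size(0)
    z = torch.randn(batch, 100)
    fake = G(z)

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step: fool D into labelling fakes as real.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()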

2.4.4.2 Uses of Generative Adversarial Networks


GANs perform well in a variety of tasks, including image generation and text-to-image synthesis, and have been shown to be beneficial in a variety of activities. They have been widely used to generate audio, videos, text, and especially images. Data augmentation has played a major role through the use of GANs (Tanaka and Aranha, 2019). Some of the domains GANs have been applied to are voice enhancement (Kaneko and Kameoka, 2018), image-to-image translation (Isola et al., 2017) to convert a set of images to a completely different domain, and image inpainting (Yu et al., 2019) to fill in missing regions of an image using masked inputs.

2.4.4.3 How GAN Is Used Effectively for Text-to-Image Synthesis


As per the unsupervised learning approaches to text-to-image synthesis, GAN-based text-to-image synthesis is one of the most common approaches followed. Generative Adversarial Networks are an unsupervised deep learning approach that is widely used in the fields of video, audio, image, and text generation (Goodfellow et al., 2014a). In the field of text-to-image synthesis, many systems were built using GANs (Arjovsky, Chintala and Bottou, 2017). Given a dataset inclusive of text and images as pairs, a generative model extracts the features from words using a text encoder and generates samples from neural networks at a fine-grained level (Parihar et al., 2020). Based on the training weights of the generative models, it reproduces realistic data samples that do not exist in the real world. As a result, the model is forced to explore and extract meaningful conceptual representations of two-dimensional images. The main advantage of unsupervised learning is the ability of generative models to generate samples from unlabeled data (Bekesh et al., 2020). CGAN and DC-GAN were both built on top of the original unsupervised text-to-image synthesis approach, where the CGAN takes into account extra information such as the class name, file name, and textual description mapped to the dataset (Jin et al., 2019), while the DC-GAN uses a deep convolutional network for the generation of realistic synthetic images.

2.4.5 Text Preprocessing


Text preprocessing is crucial in text mining and NLP since it is the first step in transforming the input into machine-readable, ordered data that can be used by any subsequent procedure. Date and number formats are frequently used in text descriptions, which also include special characters and other formats. When these text descriptions are transformed straight into word embeddings, data inaccuracy occurs, which results in erroneous outcomes (Maslej-Kresnáková et al., 2020). Preprocessing the input data involves replacing special characters and structural (HTML) tags with symbols, removing punctuation marks, and tokenizing, standardizing, and normalizing the raw text to return clean word tokens that better serve the model's performance (Haddi, Liu and Shi, 2013).
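
A minimal sketch of these cleaning steps in Python is shown below; the regular expressions and the example caption are illustrative assumptions, not the exact preprocessing pipeline of the final system.

import re

def preprocess(caption):
    # Strip HTML tags, special characters and punctuation, then lower-case
    # and tokenize; a minimal illustration of the steps described above.
    caption = re.sub(r'<[^>]+>', ' ', caption)           # HTML remnants
    caption = re.sub(r'[^a-zA-Z0-9\s]', ' ', caption)    # punctuation / symbols
    tokens = caption.lower().split()                     # normalize and tokenize
    return tokens

print(preprocess("A <b>red</b> off-the-rack dress, with a peplum hemline!"))
# ['a', 'red', 'off', 'the', 'rack', 'dress', 'with', 'a', 'peplum', 'hemline']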

2.4.6 Text Encoding


Text encoding is the process of recognizing and extracting text attributes from a piece of written content. Essentially, it is the process of transforming unstructured data into the format required by machine learning algorithms. These are the raw-data-derived encoded vectors that can be used to address the dataset's textual descriptions. However, we will use word embeddings as our primary source for text encoding, because they are distributed representations of text in an n-dimensional space.

Since word embedding helps to transform raw data (text document characters) into a meaningful orientation of word vectors in the embedding space, it is a popular text encoding strategy. Word embedding techniques collect data from the pattern and occurrence of words and go farther than typical token modeling approaches in decoding and identifying the meaning and context of the words, allowing the model to solve the underlying problem with more critical and valuable features. When the model is trained, these word embeddings are used as machine-readable inputs along with the image (Levy and Goldberg, 2014). In our case, we will use a character-level text encoder (S. Reed et al., 2016) that encodes raw fashion terms into 1024-dimensional embeddings. These embeddings are concatenated with the Gaussian latent noise distribution (z-dim) and passed as a pair with the encoded image to the model.
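
The following is a minimal PyTorch sketch of this idea: a character-level encoder mapping a raw fashion caption to a 1024-dimensional embedding that is then concatenated with a Gaussian noise vector. The GRU-based encoder, the vocabulary handling, and the assumed z-dimension of 100 are illustrative stand-ins for the char-CNN-RNN encoder described above, not the project's exact implementation.

import torch
import torch.nn as nn

class CharTextEncoder(nn.Module):
    # Illustrative character-level encoder; sizes are assumptions.
    def __init__(self, vocab_size=128, emb_dim=1024):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, 64)
        self.rnn = nn.GRU(64, emb_dim, batch_first=True)

    def forward(self, char_ids):              # char_ids: (batch, seq_len)
        _, h = self.rnn(self.char_emb(char_ids))
        return h.squeeze(0)                   # (batch, 1024) sentence embedding

encoder = CharTextEncoder()
caption = "a monochrome silhouette dress"
char_ids = torch.tensor([[min(ord(c), 127) for c in caption]])
text_emb = encoder(char_ids)                  # (1, 1024)
z = torch.randn(1, 100)                       # Gaussian latent noise (z-dim assumed)
cond_input = torch.cat([text_emb, z], dim=1)  # conditioning vector passed to the GAN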

2.4.7 Image Encoding


Text-to-image synthesis necessitates the conversion of images into a semantic feature space. The visual vector representations and sub-region features are extracted from the image samples using an image encoder (Tan et al., 2021). In our case, we will encode the image using the Python library NumPy, converting the raw image to a float32 array represented in an n-dimensional space so that the model is able to read the inputs paired with the text embeddings.
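
A minimal sketch of this encoding step is shown below, assuming Pillow and NumPy; scaling the pixel values to the range [-1, 1] is an assumption about the normalization used, not a detail taken from the original text.

import numpy as np
from PIL import Image

def encode_image(path, size=(256, 256)):
    # Load, resize and scale an RGB image to a float32 array in [-1, 1].
    img = Image.open(path).convert('RGB').resize(size)
    arr = np.asarray(img, dtype=np.float32) / 127.5 - 1.0
    return arr                                 # shape (256, 256, 3)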

2.4.8 Contrastive Learning


Due to linguistic discrepancies between different descriptions of the same image, synthetic images can diverge from the ground truth. By employing a contrastive learning approach, the quality and semantic consistency of the synthetic image can be enhanced (Zhang et al., 2021). The contrastive learning approach is utilized in the pre-training stage to learn continuous text representations for descriptions matching the same image. The contrastive loss is combined with the GAN model loss as the textual and visual representations are learned during training. Using the contrastive loss, we can pull together captions that describe the same image while pushing apart captions that describe another image (Ye et al., 2021a).
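
A minimal sketch of such a contrastive loss is given below in PyTorch; the NT-Xent-style formulation and the temperature value are assumptions about the exact form used, but they illustrate how matching pairs in a batch are pulled together while mismatched pairs are pushed apart.

import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.1):
    # Matching text/image pairs sit on the diagonal of the similarity matrix
    # and are encouraged to score higher than every mismatched pair.
    text_emb = F.normalize(text_emb, dim=1)
    image_emb = F.normalize(image_emb, dim=1)
    logits = text_emb @ image_emb.t() / temperature     # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2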

2.4.9 Image Super resolution


Since the images being generated are 128*128 pixels, an ESRGAN (Wang et al., 2018) trained on the fashion dataset was employed to improve the resolution and quality of the generated image. Visual quality is enhanced with more realistic and natural textures, and the resolution is increased to 512*512 pixels, claiming a better image quality.
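
Purely as a usage illustration, the sketch below shows how a fine-tuned ESRGAN generator might be applied to upscale a generated image; the checkpoint name and the way the model is loaded are hypothetical and not the project's actual artefacts.

import torch

# Hypothetical checkpoint containing a full ESRGAN generator module fine-tuned
# on FashionGen; the file name and loading convention are assumptions.
esrgan_generator = torch.load('esrgan_fashiongen.pth', map_location='cpu')
esrgan_generator.eval()

low_res = torch.rand(1, 3, 128, 128)           # a generated 128x128 image tensor
with torch.no_grad():
    high_res = esrgan_generator(low_res)       # expected shape (1, 3, 512, 512) for 4x upscaling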

2.4.10 Algorithmic Selection


There are many GAN-based unsupervised learning approaches used to develop text-to-image synthesis systems. They are discussed in detail below.

Text to Image Synthesis Using Generative Adversarial Networks (Bodnar, 2018b) – For the
first time, text-to-image synthesis research has been conducted using unsupervised learning techniques.


The Wasserstein T-M model was also referred to as the Wasserstein GAN-CLS framework for this
research. The Wasserstein distance similarity was also the first unsupervised learning approach for
conditional image generation. Preliminary results show that the Wasserstein GAN-CLS loss function is
stable. With the help of deep learning, images were generated by the generator and then compared to
the discriminator's output to see which one performed better. However, the images were blurred and
this first attempt at unsupervised learning failed to demonstrate consistency.

Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks (Zhang et al., 2017) – StackGAN was one of the most widely approved and used unsupervised learning models adapted for text-to-image synthesis. This work was developed to resolve the scalability issues found in the Wasserstein GAN-CLS model, and it was designed and developed using the deep learning approach. A two-stage GAN was built to generate images, which are learned from the text embeddings. Stage one generates a rough 64*64 image, and stage two generates a 256*256 image. Stage one uses a global char-CNN-RNN text embedding combined with random noise to generate a rough image. With the rough output from stage one, the stage-two GAN, along with the text embeddings, generates high-quality synthesized images of 256*256 resolution. This work was tested and evaluated using the Birds, Flowers, and COCO datasets. The high inception score of 7.7% demonstrated by this GAN inspired the development of new text-to-image synthesis models, and much attention was paid to this domain.

Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks (Xu et al., 2018b) – AttnGAN was inspired by the development of the StackGAN model. It was introduced as a refinement to focus more on the semantic consistency between the textual description and the image. It is a novel framework built to enhance features from previous SOTA GAN networks in the text-to-image synthesis domain, designed and developed using the deep learning approach. It is a multi-stage GAN which is attention-driven for fine-grained text-to-image generation. With a novel attentional generative network, the relevant words that proclaim the image features get generated. In addition, a DAMSM model is adapted to compute the fine-grained image-text matching loss after training the generator. The AttnGAN was built to give more priority to the keywords in the text and generate the image with all the features of the visual representation. It was successfully evaluated on the flower and CUB datasets and produced an inception score of 14.4%, which is remarkable compared to other SOTA text-to-image synthesis models.


DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis (Zhu et al., 2019) – DM-GAN was introduced to experiment with the behavior it demonstrates in the text-to-image synthesis domain. It was created and developed using the deep learning approach. Fuzzy images formed by this GAN are fine-tuned using a dynamic memory module that keeps a summary of losses from the generator and discriminator, thereby enhancing the generated image. In addition, a memory writing gate is built to select the most relevant text keywords for each image and to optimize for better images. To avoid overfitting the model, a response gate is used to integrate data from the memory module and the writing gate. Using the flower dataset, the model achieved an impressive inception score of 11%.

DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis (Tao et al., 2020) – This model was introduced as a new approach to the text-to-image synthesis domain that uses a single self-supervised GAN network. This novel approach follows a one-stage network model with a single generator and discriminator for the generation of images from text descriptions. A novel fusion-based module called the deep text-image fusion block deepens the text-image fusion process in the generator. It is ensembled with a novel target-aware discriminator, composed of a matching-aware gradient penalty and a one-way output, which promotes the generator to produce more realistic images while maintaining semantic consistency between the image and the text without the use of additional networks. This GAN was solely introduced to decrease the training time and avoid mode collapse from the discriminators whilst maintaining the generation of high-quality synthetic images. This model was evaluated on the flower and bird datasets, and it showed a remarkable inception score of 12.2%.

FA-GAN: Feature-Aware GAN for Text to Image Synthesis (Jeon, Kim and Kim, 2021) – The FA-GAN was inspired by the StackGAN. This system uses unsupervised learning and focuses on the adversarial training loss. It was built using a deep learning architecture. Combining a self-supervised generator, a discriminator, and a feature-aware loss, it generates images from text. An auxiliary decoder for the self-supervised discriminator provides the feature-aware loss, which the self-supervised generator uses to better represent features in the image. Images were created using text keywords. Using the MS-COCO dataset, the suggested model reduces the current FID score from 28.92 to 24.58.

2.4.10.1 Comparison of Text-to-Image Synthesis models


The detailed comparison of text-to-image synthesis models can be found in Appendix B.

2.4.11 Model Evaluation


Throughout the works in this domain, various evaluation metrics were used to evaluate the performance of text-to-image synthesis models. The main metrics used to measure the results and the model performance are IS, R-Precision, and FID. These metrics are computed based on the model loss or on cosine similarity. SOA (Semantic Object Accuracy) is the only evaluation metric that is measured based on word and image correlation. The other qualitative evaluations were used in different works and have shown plausible results in the text-to-image synthesis domain.

Inception Score (IS)

The Inception Score is an evaluation metric that correlates well with human judgment. However, IS does not capture image attributes that indicate a text-to-image synthesis method's capacity to appropriately express the semantics of the input text description (Sommer and Iosifidis, 2020). As a result of its good correlation with subjective human judgment, it is the most commonly used statistic for evaluating image quality. This statistic, on the other hand, only takes into account the generated images' quality and variety. The technique performs well for unrestricted image generation but fails to capture essential textual information when the task is conditioned on classes or descriptions. A pre-trained Inception-v3 network is used to evaluate generated images and produce a conditional label distribution (Frolov et al., 2021).
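
For reference, the score is commonly computed as IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is the Inception-v3 label distribution for a generated image x and p(y) is the marginal over all generated images. A minimal NumPy sketch (omitting the usual splitting into subsets) is shown below.

import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, num_classes) softmax outputs of Inception-v3 for generated images.
    p_y = probs.mean(axis=0, keepdims=True)                         # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))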

Frechet Inception Distance (FID)

In terms of features extracted by a pre-trained network, the FID quantifies the difference between the distributions of real and generated images. The FID is much more consistent than the IS in analyzing GANs and detects more types of disruptions (Zhang et al., 2021). The FID is computed from real and generated image samples using a pre-trained Inception-v3 model's last pooling layer to acquire visual features, as with the IS. To calculate the FID score, the activations of real and generated images are modelled as multivariate Gaussian distributions (Frolov et al., 2021).
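
The score itself is the Frechet distance between the two Gaussians, FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2(Sigma_r Sigma_g)^(1/2)). A minimal sketch using NumPy and SciPy over pre-extracted Inception-v3 pooling features is shown below.

import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    # real_feats / fake_feats: (N, 2048) Inception-v3 pooling features.
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)            # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                       # drop tiny imaginary parts
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2 * covmean))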

R-Precision

Using R-precision, we can compare the visual-semantic similarity between retrieved or generated images and text descriptions. In addition to the ground-truth caption from the test dataset, mismatched captions are randomly chosen from the dataset. The metric is computed by calculating the cosine similarity between a visual feature and the text embedding of each caption and sorting the captions by decreasing cosine similarity (Frolov et al., 2021).
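
One common formulation ranks the ground-truth caption against R randomly drawn mismatched captions for each generated image and counts a hit when the true caption ranks first; R-precision is then the hit rate over many such queries. A minimal PyTorch sketch of a single query is shown below (the embedding shapes and candidate counts are assumptions).

import torch
import torch.nn.functional as F

def r_precision_hit(image_feat, true_caption_emb, distractor_embs):
    # image_feat: (D,), true_caption_emb: (D,), distractor_embs: (R, D)
    candidates = torch.cat([true_caption_emb.unsqueeze(0), distractor_embs], dim=0)
    sims = F.cosine_similarity(candidates, image_feat.unsqueeze(0).expand_as(candidates), dim=1)
    return bool(sims.argmax().item() == 0)    # hit if the true caption ranks first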

2.5 Existing Systems


Text-to-image synthesis helps us visualize our ideas: a single written description is transformed into several persuasive images. Text-to-image synthesis can be done in various ways. The initial research employed supervised learning algorithms to align visuals with textual descriptions (Agnese et al., 2020), while deep learning's recent advancements have ushered in a new set of unsupervised learning methods, such as image generative models, computer vision-based techniques, and the automated production of attractive images from textual descriptions (Vani and Venkatesh, 2016). With the current advancement of generative modelling, the text-to-image domain is developing drastically using GANs, and the process of generating images has become less complex.

Text-to-image synthesis refers to the process of creating a two-dimensional image from a textual description in any natural language. Seeing an image in your mind's eye from a written description is an apparently straightforward feat for humans. At the same time, it is one of the most difficult topics in the fields of natural language and computer vision, and it has received a lot of attention in recent years (Zhou, Jiang and Xu, 2021). The text-to-image synthesis systems developed over the years are categorized into two main sub-systems, which will be discussed briefly.

1. Supervised learning systems of text to image synthesis.


2. Unsupervised learning systems of text to image synthesis.

2.5.1 Supervised Learning Systems of Text-to-Image Synthesis


During the early stages of the text-to-image evolution, the synthesis was executed in such a way that the image was searched from the dataset and a supervised learning approach was applied to it (S. Reed et al., 2016). The high correlation between words and the image, which was instructive and visualizable, was used to construct textual units, and these textual units found the most likely image to be generated (Zhu et al., 2007). But the images that were generated were far from the textual description. Also, this approach does not have the potential to generate new images; in fact, the new images generated are merely changes to the characteristics of the training images. In short, it was just a manipulation of existing images. The inability to generate new images and the lack of semantic consistency for each generated image did not benefit the developers. Thus, research started on unsupervised learning approaches (Dong et al., 2017).


Some of the supervised-learning systems are discussed below.

PicNet: Toward Communicating Simple Sentences Using Pictorial Representations (Mihalcea and Leong, 2006) – This work sparked a revolution in the field of text-to-image conversion. To test the theories, the researchers used relatively simple sentences, and more often single words. The prototype demonstrated that graphical representations can transmit simple, brief phrases. The visual information is contrasted with the linguistic sentence that was presented, and in the end a grayscale image is created. The first time this technique was put to use was in a desktop program for teaching youngsters about the advancements in artificial intelligence via distance learning.

Text-to-Picture Synthesis System for Augmenting Communication (Zhu et al., 2007) – This was the first approved supervised learning system to create images from text. It converts text into speech, but also displays the text's meaning. Natural language processing, computer vision, computer graphics, and machine learning were used to create the system. To add images to text, these components are first combined to identify picturable and textual units. Combining image and text creates a full image that can be viewed from any angle. Newspapers and children's books were used to test this method. In experiments, blurry images were generated, and only 30% of its images were successful. The training images helped with a few simple modifications to the images.

2.5.1.1 Comparison of Supervised learning systems of Text-to-Image Synthesis


The detailed comparison of supervised learning systems of text-to-image synthesis can be found in Appendix C.

2.5.2 Unsupervised Learning Systems for Text-to-Image Synthesis


Text-to-image synthesis has achieved great progress with the advancement of the Generative Adversarial Network (GAN). The revolution of systems using unsupervised learning emerged due to the potential of training models with unlabeled data (Dong et al., 2021). The use of GANs in the development of text-to-image synthesis, aided by the deep learning approach, has built systems over time. Some of these systems are explained below.

“CookGAN”: Meal Image Synthesis from Ingredients (Han, Guerrero and Pavlovic, 2020) – This system was built using the StackGAN-V2 model, which addresses text-to-image synthesis from a completely different perspective. This system focuses more on the visual effects depicted in the image, preserves fine-grained details, and progressively up-samples the images. Textual descriptions of the image are fed into a simulator network, which makes instantaneous modifications to its appearance. It also has a cycle-consistency constraint to enhance image quality and maintain a consistent appearance. Food-chain image generation was the only use for this technology because it used a visual representation of the ingredients rather than a written means of displaying them.

Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions (Nasir et al.,
2019) - Textual descriptions of facial features are used to generate face models in the system. CelebA
dataset is used for the purpose of building an algorithm that automatically generates images with a list
of attributes. The text-to-face generation problem is then modeled as learning the distribution of faces
(conditioned on text) in the same latent space. An advanced version of GAN is used for conditional
multi-modality learning in this system (DC-GAN with GAN-CLS loss). It is necessary to switch the
labels for actual and fake images, and then introduce noise to the discriminator as a result. Using skip
thought vectors, the text is encoded before being provided to the generator as a tuple along with the
image. The results of generated images for various textual descriptions are impressive, based on the
final training.

Recipe2Image (El, Licht and Yosephian, 2018) - It presents a novel method of creating images
from written descriptions that do not directly describe the visual content of the image. They achieve
this by creating a system that uses their recipes to create images of food with a resolution of 256*256
pixels (or higher). It's unclear how the recipe's visual content relates to its accompanying text because
recipes have a complex language structure, with two segments (ingredients and instructions) each
including several words and expressions. A Stacked Generative Adversarial Network with two recipe
embeddings computed to construct food images conditioned on their recipes forms a baseline for this
challenge, which is a first step in solving this problem. REG is used for the embedding of the ingredients
and the cooking instructions. As a result, a common space is created by concatenating the embeddings
with the image and comparing their cosine similarity to the recipe images. Recipes are then sent into a
two-stage Stack GAN network, which generates the final recipe images.

Dall-E: Zero-Shot Text-to-Image Generation (Ramesh et al., 2021) – The system is trained using a transformer (Vaswani et al., 2017) to autoregressively model the text and image tokens as a single stream of data. A two-stage training approach is used. First, a discrete variational autoencoder (dVAE) is trained to reduce each 256*256 RGB image to a 32*32 grid of image tokens. After concatenating up to 256 BPE-encoded text tokens with the 32*32 = 1024 image tokens, an autoregressive transformer is trained to model the joint distribution over the text and image tokens before creating an image. In-scope subjects that Dall-E generates include furniture, people, flowers, and birds, and the system was built with the goal of promoting open AI.

Text to Fashion Image Synthesis (Ak et al., 2020) – For text-to-image synthesis, the system uses an e-AttnGAN with greater training stability. AttnGAN's attention module combines sentence and word context features, and feature-wise linear modulation (FiLM) is used to merge visual and natural language representations. Also provided are similarity and feature-matching losses between real and generated pictures, as well as classification losses for "relevant characteristics" in AttnGAN's multimodal similarity learning. To improve training stability and prevent mode collapse, the discriminator uses spectral normalization, a two-time-scale update rule, and instance noise. An LSTM network extracts word and sentence features from the text. The hierarchical architecture uses upsampling layers to build a low-resolution image from the sentence feature, and FiLM-ed ResBlocks integrate linguistic and image data to create higher-resolution images. The architecture includes feature matching, cosine similarity, and classification losses to improve text-to-image synthesis. e-AttnGAN surpasses state-of-the-art techniques in inception score, R-precision, and classification accuracy on the FashionGen and DeepFashion-Synthesis datasets.

2.5.2.1 Comparison of Unsupervised learning systems of Text-to-Image Synthesis


The detailed comparison of unsupervised learning systems of text-to-image synthesis models can be found in Appendix D.

2.5.3 Critical Analysis of Existing Systems


The era of text-to-image synthesis started with the development of supervised learning. The pictorial representation model (Mihalcea and Leong, 2006) was built using language linguistics and was trained on a specified dataset, but the output was images merely produced from words and simple phrases. Following this research, (Zhu et al., 2007) introduced a process similar to speech synthesis, but as a visual procedure that converts sentences to images. The system first identifies the picturable and textual units, then finds the most relevant image parts augmented with the text, and the image is generated. But the images were not unique; they were manipulated from the training dataset and comprised very few features compared to the text description, thus proving to be semantically inconsistent. Understanding the drawbacks and challenges that supervised learning approaches induced, work on unsupervised learning was established. The quality of the images and the semantic consistency between images and text were drastically improved in terms of performance, versatility, and accuracy. The unsupervised learning approaches used deep learning technology in developing the new algorithms; among the principal technologies, GANs (Goodfellow et al., 2014a) and image generative models with autoencoders (Kingma and Welling, 2019) were primarily used. (Zhang et al., 2017) started the work on unsupervised learning using a multi-stage GAN framework in which the text and image were trained as pairs, and many existing systems were built on top of this technology. (Han, Guerrero and Pavlovic, 2020) built a system for visualizing cooking ingredients and their instructions. This system was developed using StackGAN-V2, which uses a two-stage model architecture. The text features are extracted using a char-CNN-RNN model and concatenated with a noise dimension before being passed to the two-stage GAN model. The system worked at a fine-grained level, paying attention to the highly correlated words in the text description when generating the image, such that the image proved to be semantically consistent with the text. (Dong et al., 2017) developed a system called "I2T2I" which was built using GAN-CLS; the text and image semantic consistency was measured using the Wasserstein distance. This system was prone to mode collapse and paid attention to the text at a fine-grained level. In 2021, understanding the image quality and the lack of features present in the visual representation of the image, (Hossain et al., 2021) introduced a system which focuses more on image captioning; the system was built using a GAN-LSTM architecture to extract both the semantic and spatial relationships of an image. Though it paid a lot of attention to the semantic consistency of the text and image, the generated images were prone to being entirely fake and sometimes did not meet the expected standard due to the high priority given to the image captioning module. (Nasir et al., 2019, p. 2) developed a system that generates a synthetic face from a textual description; it used GAN-CLS to extract fine-grained details, map them to a latent space, and learn their distribution in that latent space, but as stated, this system focused more on the fine-grained details of the text description when generating the face. (Ak et al., 2020) developed a system for fashion design which focused more on the consistency of the generated images; it was developed using "e-AttnGAN" fused with a FiLM layer to enable consistent image generation, but it paid less attention to the text descriptions and thus also concentrated on the text at a fine-grained level. Throughout these works there were many gains and losses. However, none of the systems focused on generating images from large text descriptions and multiple languages, which still remains a big gap in the text-to-image synthesis domain.

2.5.4 Benchmarking
Existing systems have revealed a lot of potential drawbacks and challenges in terms of accuracy, performance, quality of images, semantic consistency, hyperparameter tuning, time complexity, and the number of models used. In the development of text-to-image synthesis, the dataset plays a key role. Since the author has chosen applied research pertaining to the in-vogue fashion domain, there are no global datasets that the system will be tested upon; in fact, the system's scope will focus only on the fashion dataset (FashionGen) (Rostamzadeh et al., 2018) to measure and evaluate the system. In terms of performance, the author has predominantly focused on avoiding mode collapse and on efficient model training compared to other existing systems. The author decides to achieve this using an AttnGAN network ensembled with a contrastive learning approach as a hybrid method such that the loss functions are preserved. The generated images will be of size 256*256 pixels and will be raised to a better resolution by employing an ESRGAN trained on the dataset, giving better image quality compared to the existing systems. Image augmentation and encoding with the latent space distribution will be a high priority when generating output at 256*256 resolution. Unlike certain other existing systems, the semantic consistency between the text description and the visual description will be preserved throughout using conditioning augmentation, zero gradient loss, and Adam optimization with cross-entropy classification. The time complexity will be linear compared to other systems. Finally, two generators and discriminators will be used, one of each being self-supervised, devised to avoid mode collapse and momentum issues and to decrease training time. Acknowledging the above benchmarking prospects against the existing systems will be beneficial for the development of the system.

2.6 Chapter Summary


Based on the above literature review, it is clear that text-to-image synthesis is a wide field of study and that producing images from extensive text descriptions in different languages is a challenging task. It was shown that generating images from text is highly dependent on the dataset used during training, and that a multi-staged GAN network ensembled with a contrastive learning approach should be utilized as a hybrid method to avoid mode collapse and improve performance. Moreover, using this method will provide the ability to accept substantial textual descriptions in multiple languages whilst generating images. It was evident that the domain requires training on paired text and image data for text-to-image synthesis. Existing text-to-image synthesis models and systems were studied for suitable methodologies and technologies to address the gap in the domain. The proposed system was then benchmarked against current systems, and a study of evaluation measures for text-to-image synthesis models was conducted.


3. METHODOLOGY
3.1 Chapter Overview
As part of this chapter, we'll go through how we'll do our study and how we'll handle our software
development and project management. The relevant sub-sections of each approach will be explained in
detail.

3.2 Research Methodology

Research Methodology

Philosophy: Pragmatism was chosen as the philosophy because the research is based on data to develop a hypothesis, and this research compares both qualitative and quantitative results produced by different text-to-image synthesis models using GANs. This is also applied research in the domain of in-vogue fashion designs and stylings.

Approach: This research aims to test and prove a hypothesis that needs to be solved: to input larger spans of in-vogue fashion-based textual descriptions in three natural languages (English, Sinhala, Tamil) and achieve high-quality, detailed in-vogue fashion design images as output. A deductive approach was selected, as the research chooses to apply an existing theory to the domain of interest.

Strategy: The strategy of research is how the answers to the research questions are proposed. Interviews, documents and research analysis, experiments, and surveys were chosen as strategies to fit the research.

Choice: The choice of research depends on the research paradigm that is chosen. Among the mono, multi, and mixed methods, the mixed method was chosen, as the text-to-image synthesis model had both quantitative and qualitative results, which were gained through interviews, survey papers, and other documents such as journal articles and conference papers that could be used as a comparison for the model prototype that is going to be developed in this research.

Time Horizon: Data needs to be gathered at a single point in time to do evaluations. Hence, out of the longitudinal and cross-sectional options, the cross-sectional time horizon is the most convenient and was chosen for the research.

Techniques and Procedures: For the collection and analysis of data, techniques such as observations, documents, conversations, evaluation reports, interviews, and questionnaires will be used.

Table 3.1 - Research Methodology

Based on the above research methodologies, the below mentioned aspects of the research were
determined.

Research Hypothesis: With the ability of text-to-image transformation models to take in larger spans of texts in multiple natural languages, trained and validated to provide high-quality fashion design images, it is possible for developers and other stakeholders to train larger text descriptions in multiple natural languages and apply them to any other domain for productive outcomes.

Research Process: Finding the best possible way to train larger corpora of texts in multiple languages while simultaneously applying the trained text descriptions to the images, extending the process so that the unsupervised GAN network can produce images for larger textual descriptions.

Prototype Input: Text descriptions of in-vogue fashion-based designs that you desire to visualize.
Prototype Output: High-quality two-dimensional synthesized In-vogue fashion design images
having all the detailed features given as input.
Prototype Features:
1. A novel framework to generate high-quality images from larger spans of texts in multiple
languages.
2. It will be an open-source application to be used by end-users to input in-vogue fashion
descriptions and achieve high-quality in-vogue fashion design images as output.
3. A GUI for better user experience.

3.3 Development Methodology


There are a variety of software development methodologies. The author decided to use the prototyping technique, which involves building and re-testing a system multiple times, because the author will conduct different evaluations at various points in the project. Since the author has to keep making adjustments and testing until a positive conclusion is achieved, the prototype model was chosen.

3.4 Project Management Methodology

Agile Prince2 was selected for project management from a number of options because of its emphasis on management, recursive planning, and flexible delivery while responding to risks. The author chose the Agile Prince2 methodology since it allows a focus on both management and delivery at once, aids in meeting deadlines on time, fosters cooperation, and increases stakeholder confidence.
3.4.1 Project Deliverables
Deliverable Component Tentative Delivery Date
Project Proposal 1st Nov 21
Review Paper 15th Oct 21
Literature Review Document 18th Oct 21
Software Requirement Specification 22nd Nov 21
System Design Document 6th Dec 21
Prototype 20th Apr 22
Thesis 20th May 22
Project Research Paper 19th June 22
Table 3.2 - Project Deliverables

3.4.2 Project Plan


The Gantt chart can be found in Appendix E.

3.4.3 Resource Requirements


Required hardware, software, dataset, and skills requirements are decided based on research
objectives and the plan to develop and test.
3.4.3.1 Hardware Requirements

Hardware Requirement: Justification

Core i7 9th generation, 6-core processor: To provide the required intensive processing power.
16GB RAM: To load heavy image datasets and hold them in memory.
Graphics Processing Unit: A sufficiently powerful GPU is necessary to train the models; Google Colab can be used for this.
Storage space of 40 GB or more: Disk space is required to save application code, datasets, and testing files in non-volatile memory.

Table 3.3 - Hardware Requirements

3.4.3.2 Software Requirements

Software Requirement: Justification

Operating System (Windows 10, Linux): The OS is required to handle all heavy processes. Linux has been the favored OS in the deep learning community; however, a 64-bit version of either Windows 10 or Linux can be used.
PyTorch: A Pythonic deep learning framework by Facebook's AI research lab.
TensorFlow GPU: A deep learning framework built by Google to handle deep learning development.
Google Colab: To research and experiment with different available models, with GPU power in the cloud.
MS Office: To create all documentation and reports.
Google Drive: To keep backups of application code and datasets.
Zotero: To manage all citations and references.
Table 3.4 - Software Requirements

3.4.3.3 Dataset Requirements


 Available datasets required for the project (DeepFashion and FashionGen)


3.4.3.4 Skill Requirements


 Skills and knowledge in GANs.
 Knowledge in using neural networks and evaluating them
 Knowledge in evaluating GANs.
 Knowledge about techniques and loss functions while training with GANs.
3.4.4 Risk Management
Risk (Severity / Frequency): Mitigation Plan

Constant changes to the requirements of the project; like any other research, this research will be subject to evolving requirements with each recurrence of the prototype, and such changes need to be addressed (Severity 4 / Frequency 5): Following the prototype methodology will help overcome frequent requirement changes.
In-depth domain knowledge needs to be acquired, as the in-vogue fashion domain is subject to many changes (Severity 4 / Frequency 5): Thorough research about the domain and existing technologies will mitigate the risk of lacking domain and technical knowledge.
Limited availability of hardware resources, as the prototype system may require advanced computational power and processing speed (Severity 5 / Frequency 5): To mitigate this issue, Google Colab and cloud services from Amazon, Google, or Microsoft can be used.
Table 3.5 - Risk Management

3.5 Chapter Summary


The research methodology was first explored under the subheadings of philosophy, approach, strategy, choice, time horizon, and techniques. "Prototyping" was then chosen as the best software development methodology for the investigation. After that, the "Agile Prince2" project management technique was selected, and the project plan, deliverables, and potential risks associated with the project, along with how to mitigate them, were discussed.


4. SOFTWARE REQUIREMENT SPECIFICATION


4.1 Chapter Overview
The main purpose of this chapter is to gather project requirements. A rich picture of the system
and its surroundings is first depicted. After that, a stakeholder onion model is used to display the
stakeholders of the system, and all stakeholder viewpoints are discussed in detail. Following that,
there is a discussion of different approaches to requirement elicitation methods and a critical
analysis of the requirements obtained. A context diagram and a use case diagram with the use case
description are used to describe the system. This chapter also discusses the functional and non-
functional requirements of the system.

4.2 Rich Picture of The System

Figure 4.1 - Rich Picture of the System (Self-Composed)

4.3 Stakeholder Analysis


The stakeholder onion model graphically portrays the contributing and negative stakeholders
involved in the system and its surroundings. The onion model is used to determine the roles played
by the acting stakeholders accordingly from three different environments.


4.3.1 Stakeholder Onion Model

Figure 4.2 - Stakeholder Onion Model (Self-Composed)

4.3.2 Stakeholder Viewpoints


Stakeholder / Role / Description

System Stakeholders
Fashion Designers (Functional Beneficiary): Fashion designers use the system to input text and visualize fashion designs for their business benefit.
Purchasing Agents (Functional Beneficiary): Purchasing agents use the system to visualize fashion designs before they commit to purchasing from a fashion store or any other fashion-based warehouse.
Students/Researchers (Functional Beneficiary): Students and researchers use the system to understand how it works and its functionalities, study its core components, and attempt to build something new from it as a research gap.

Containing System Stakeholders
Fashion Consultants (Functional Beneficiary): Fashion designers, after viewing the designs produced by the system, show them to fashion consultants to receive feedback.
Product Owner (Managerial/Financial Beneficiary): Assists the developer to develop the system optimally by removing any obstructions and by defining clear goals.

Wider Environment Stakeholders
Colleagues (Advisory Supervisor): Provide the required guidance and support for a successful prototype of the system to be completed.
Developer (Engineering Employee): Responsible for building the primary software product before its deployment in production.
Maintenance Engineer (Operator): Responsible for setting up the cloud environment in which the system must function and for deploying new versions of the system to the production environment.
Hackers (Negative Stakeholder): May target the system in one way or another, attempting to modify valuable inputs and causing irrational behavior inside the system.
Competitors (Negative Stakeholder): Attempt to develop better solutions than Fashionable.
Investors (Financial Beneficiary): Invest heavily in the system's development with the expectation that it will benefit them financially with a good return.
Technical Experts (Expert Advisory): Technical specialists will assess the system and provide expert comments and feedback, ensuring that it satisfies the required criteria.
Domain Experts (Expert Advisory): Domain specialists will assess the system and provide expert comments and feedback, ensuring that it satisfies the required criteria.
Organizations (Financial Beneficiary): Organizations that will look at the potential of the system to be purchased.
Table 4.1 - Stakeholder Viewpoints

4.4 Selection of Requirement Elicitation Methods


Requirement elicitation is the process of gathering relevant project requirements utilizing a variety
of approaches. Several such methods are discussed in this section, as well as why they were chosen
for usage in this project. The following methods are discussed below: Survey, Brainstorming and
literature review.

Method 1: Literature Review


Research gaps and issues in existing works are easy to identify with a thorough literature review,
and these flaws can be highly useful in determining engineering requirements. As a result, a
detailed literature review of the research topic, existing systems, and potential methodologies
and technologies was done, so that it will benefit how the final system should behave.
Method 2: Survey
The author decided to distribute a questionnaire among general system users representing fashion designers and purchasing agents to gather requirements, with a sample of one hundred and seventy-nine participants drawn from fashion designers and the general public, as well as domain and technical specialists with a viable fashion background. The goal of the questionnaire is to help the author understand the expectations of the users regarding the prototype. It also helps the author determine the system's expected goals and project aims.
Method 3: Brainstorming
During the placement year, the author experienced a client request to build a system that inputs
facial descriptions and generates facial images as output to attain the goal of identifying
criminals which will be useful to the police department. Thus, upon brainstorming and analyzing
the positive outcome this system could produce for other industries, the author decided to dig
deeper into this by performing a literature review to identify gaps and issues in current systems.
As a result, the author decided to develop a text-to-image synthesis application that will input
large text-related fashion terms and generate high-quality fashion designs. This was only
restricted to the fashion design industry due to time constraints.
Table 4.2 - Selection of Requirement Elicitation Methods

4.5 Discussion of Findings through Different Elicitation Methods


4.5.1 Literature Review
Findings
Numerous systems use both traditional and deep learning methodologies. Deep learning
algorithms outperform traditional approaches in terms of performance, but they do have
drawbacks like underfitting, overfitting, mode collapse, non-convergence & diminished gradient
during the training of GAN models for text-to-image synthesis.
Research Findings: Citation

Improve semantic consistency between the generated image and its captions. (Ye et al., 2021b)
Improve the quality of the images generated. (Qiao et al., 2019)
Ability to generate images from large text descriptions consisting of many words. (Mishra et al., 2020)
Generate images from text descriptions with multiple language linguistics. (Li et al., 2020b)
Generate images from text descriptions whilst avoiding mode collapse, a common problem faced by GANs where the same images are repeatedly generated. (Zhu et al., 2019)
Table 4.3 - Findings through Literature Review

4.5.2 Survey
A questionnaire was emailed to 179 people representing a mixture of fashion designers and the general public, i.e., purchasing agents of fashion designs. Appendix F contains the questionnaire form that was distributed.

Question Are you a fashion designer?


Aim of Question User identification and filtering based on stakeholder criteria

Observations
It was observed that 53.7% of the participants were
fashion designers who shared their experience crafting
fashion designs and 46.3% of the participants were
purchasing agents of fashion designs who shared their
experience on how they purchase fashion designs.
Conclusion
This survey was divided into two sections to capture the insights of different categories among
people who are interested in fashion design. As per the results, a wide range of people representing
the general public and certain fashion wholesale purchase merchants filled this survey. The
questionnaire also had a wide range of fashion designers participating in this survey from various
fashion organizations. This survey took into consideration the mixture of these responses to ensure
the system meets the requirements filled by these participants for a better user experience.
Question How often do you purchase fashion designs (T-shirts/shirts/jeans, fabric,
ladies wear, shoes, etc)?
Aim of Question This question was for purchasing agents of fashion designs to identify
how often people will use the system to check for designs from the system.


Observation
This part of the survey was designed for the purchasing agents of fashion designs; most of these respondents shopped a few times a year, while the rest shopped at other frequencies.

Conclusion
As per the findings, a vast number of users purchase designs a few times a year. To keep the system
up and running and identify the frequency of users who use it, it must be fully functional as a
whole. If at all the system requires updates or version upgrades, it can be done when there is less
traffic on the system. This will ensure that the user will have a quality experience with the system.
Question Do you check for samples online before purchasing a fashion warehouse
or any online/physical fashion store?
Aim of Question Identifying participants who check for samples before purchasing as they
will be the system’s target audience.
Observation
It was observed that 82% of the users check for samples
before making a purchase at a fashion store and 27% of the
users directly approach the store and purchase designs.

Conclusion
The majority of users who purchase fashion designs check for samples before purchasing from a
fashion warehouse or store. These responses ensure that this system caters to its requirements
specifically for these users.
Question How do you check for fashion samples?
Aim of Question Identifying user preferences

Observation


66% of the users check for fashion samples on social media, 63% through Google search, and the rest through visual fit-ons and recommendations.

Conclusion
The majority of users use social media and Google search to narrow down their search for samples
before making a purchase.

Question Do you think a computer-aided solution using AI which will transform


fashion-related descriptions to fashion designs can help you view more
samples before making your purchase at a fashion store?
Aim of Question To validate if the system will be useful to the end-user.
Observations
87% of the users are interested in the development of this
prototype while the remaining 13% of the users are not
interested under certain instances.

Conclusion
The majority of users, over 87%, think that an AI-based solution that will transform the text into
fashion designs will be helpful for them, as they will be able to visualize fashion designs before
making a purchase.
Question If Yes, please explain how you craft in-vogue fashion designs? (Tools &
Technology)
Aim of Question Identifying the tools and technology used by users to craft fashion designs.
These data will be used to understand the quality of fashion designs being
produced.
Findings


To understand the quality of fashion designs being produced by identifying the current tools and
technology used by fashion designers when crafting in-vogue fashion designs.
Theme: Analysis

Identifying users' preferences when crafting fashion designs: Diverse responses were received. Some participants preferred to use manual resources to craft fashion designs, but the majority preferred an automated approach. These preferences helped the author identify that the quality of the fashion designs produced depends on the tools and technology used, and thus to build the system to match those standards.
Automated production vs manual production: To reduce the physical problems encountered, participants used automated tools to craft fashion designs. Only 10% of the participants were willing to craft fashion designs manually, as they believed going old school always produces quality designs, even though it is time-consuming.
Quality of outcome: Participants confirmed that automated production using fashion design software and image editing applications was superior to manual production in terms of the quality of fashion designs produced.
Question Have you ever faced issues and encountered problems while designing
in-vogue fashion designs?
Aim of Question Identifying the problems encountered when crafting fashion designs

Observation
58% of the fashion designers in the survey have encountered problems while crafting fashion designs, either technically or manually, whilst 42% of the respondents did not mention problems they encounter while crafting fashion designs.

Conclusion


Over 58% of the fashion designers who use handmade instruments and resources to craft fashion designs run into problems. This confirms the prototype will be very useful for users who craft designs manually, allowing them to automate the process and reduce the problems they encounter.
Question Do you think it will be better to limit raw material, industrial resources,
human energy or get rid of it completely while crafting in-vogue fashion
design?
Aim of Question Identifying to check if users are willing to limit the raw materials or
resources while crafting fashion designs
Observation
76% of the users think that it will be better to limit the
raw material and other energy wastes being exposed
during the crafting of fashion designs, while 24% of the
users think it’s not necessary to do so based on particular
reasons.
Conclusion
Most of the participants agreed to reduce the burden of human energy and the wastage of additional raw materials. These statistics show that over 75% of users would like to limit or completely get rid of these resources while crafting fashion designs.
Question Will a computer-aided solution that will transform fashion-related
descriptions into in-vogue fashion designs be helpful?
Aim of Question To validate if the system will be useful to the end-user.

Observation
100% of the users who participated as fashion designers have
decided that such a solution will be of complete help to their
industry.

Conclusion
All the participants thought an automated option that would transform the text into fashion designs
would be helpful for the users.


Question If yes, was selected above. Please explain how you will plan to use this
solution to craft in-vogue fashion designs and benefit from it?
Aim of Question To understand what the users will plan on achieving from this system.

Findings
Different perceptive responses from users who will benefit from using the system.
Theme: Analysis

Research gap and scope depth: All participants approved and liked the idea of having such a system. It will help reduce the impact of unwanted costs incurred during the design stage. Overall, this niche describes fashion designs with a lot of fashion terms, so it was valid to develop a system that synthesizes high-quality fashion designs from a substantial number of words.
Beneficial prototype features and suggestions: Survey participants said utilizing this prototype has several benefits: less raw material, time, and labor are used, and they think this prototype will decrease these effects over time. Participants suggested selecting a language before converting a written description to a fashion design and proposed a 360-degree design perspective to visualize the whole design. All of these will be considered before obtaining the final result.
Table 4.4 - Survey Findings

4.5.3 Brainstorming
Criteria: Findings

Deriving the research idea: Acquiring a research idea based on a client request to develop a text-to-image application that generates facial images depending on the descriptions provided as input.
Identifying the technical contribution of the research idea: Based on the research idea found, analyzing and brainstorming how the text-to-image system will make a technical contribution towards the project, how it will impact the fashion design industry, and the pros and cons that end users may face.
Deriving how the user interface of the prototype should look: Since it is a fashion-based application, the author brainstormed how the GUI should appear. Based on similar fashion websites and the insights gathered, the author decided to use a plain template coupled with a black-and-white theme.
Table 4.5 - Brainstorm Findings

4.6 Summary of Findings


ID / Finding (elicited through the literature review, survey, and brainstorming)
1 Acquired an idea from a third-party source and started
doing thorough research on it.
2 The research gap in the domain of text-to-image synthesis
has extended to synthesizing larger text descriptions into
high quality, two-dimensional images.
3 The prototype should be a system that takes in an input of
fashion-related text descriptions and outputs two-
dimensional fashion designs.
4 To synthesize text descriptions with a large number of
words, the best technique that can be used is to ensemble
AttnGAN with a contrastive learning approach, whereby
the image and text encoder of the AttnGAN will be used to
unify the words that occur close to the captions per image
and disregard the words that occur less frequently in a
caption. Thus, the semantic consistency and balance
between the image and text is maintained.
5 Identifying the tools and technology used whilst crafting
fashion designs, to match the same quality and standard
which should be generated by the prototype.
6 Identifying the problems encountered whilst crafting
fashion designs, and making sure the prototype solves all
those problems.


7 Text descriptions should follow the steps of text-image


preprocessing to tokenize keywords from the captions and
match them to that particular image to maintain semantic
consistency.
8 The prototype should have a graphical user interface.

9 The user interface should be straightforward, with the


capacity to become familiar in a short period of time,
making it more accessible to all users.
10 The prototype should have an option to select a language
(English, Sinhala, or Tamil) before synthesizing the text
description.
11 Output fashion designs should have good viewable quality,
with 360-degree view capability after the process of text-
to-image synthesis.
Table 4.6 - Summary of Findings

4.7 Context Diagram

Figure 4.3 - Context Diagram (Self-Composed)


4.8 Use Case Diagram

Figure 4.4 - Use case Diagram (Self-Composed)

4.9 Use Case Descriptions


Use Case ID UC – 01
Use Case Name Input Description
Description Users can input any fashion-related description that they need to
transform into fashion designs.
End Objective To synthesize fashion designs from the provided text description
Priority High
User/Actor Fashion Designer/ Purchasing Agent/ Fashion Organization & Firms
Trigger The user selects a language, provides a description in a preferred
language, and clicks the “Synthesize” button.
Frequency of Use Realtime
Preconditions 1. The system should be functional.
2. The user should enter the description from the fashionable
website.
3. The user should select a language, type the description in that
preferred language.


4. The user should have clicked the “Synthesize” button.


Included Use Case Input Description in Three Languages.
Extended Use Case Display Provided Descriptions, Display Error Messages,
Synthesize Fashion Designs from Provided Descriptions.
Basic Flow 1. The user selects a language and types fashion-based
descriptions.
2. The front-end sends the chosen language, and the description via
an API to the backend.
3. The backend system takes the description and the selected language, then
converts the description into word embeddings based on that language.
4. The backend system classifies the extracted word embeddings.
5. The backend system uses the saved model to synthesize fashion
designs from the provided description via word embeddings.
6. The fashion designs as images are sent to the front.
7. The front-end displays the output images.
Exception Flow 1 1. The system displays error messages to the user.
Exception Flow 2 1. Use case ends in failure
Postconditions Users must be able to visualize fashion designs
Table 4.7 - Use case Description (Input Description)

Use Case ID UC – 02
Use Case Name Download image Output
Description Users can download the image outputs which have been synthesized from
fashion descriptions.
End Objective To download the image and visualize it locally, and share it with other
end users.
Priority High
User/Actor Fashion Designer/ Purchasing Agent/ Fashion Organization & Firms
Trigger The user has to click the “Download” button
Frequency of Use Realtime
Preconditions 1. The system should be functional.


2. The user should be able to visualize the synthesized images.


3. The user should have clicked the “Download” button.
Included Use Case Display Output
Basic Flow 1. The user clicks the download button.
2. The front-end sends the image in image/jpeg format to the
backend via API.
3. The backend system takes the image and returns a download link
to the frontend.
4. The frontend shows a popup to save the image in a preferred
location.

Exception Flow 1 1. The system displays error messages to the user.


Exception Flow 2 1. Use case ends in failure
Postconditions Users must be able to download and visualize fashion design images
locally.
Table 4.8 - Use case Description (Download Image Output)

4.10 Requirements
Priority levels of system requirements were defined using the MoSCoW technique, based on their
importance.

Priority Level Description

Must have (M) Requirements at this level are the prototype's primary functional requirements and must be implemented.
Should have (S) Important requirements that are not essential for the intended prototype to work but provide significant value.
Could have (C) Desirable requirements that are always optional and never regarded as critical to the project's scope.
Will not have (W) Requirements that the system will not have and that are not a priority at this time.
Table 4.9 - MoSCoW Technique


4.10.1 Functional Requirements


FR / Requirement / Description / Priority / Use Case Mapping

FR1 Enter description (M): Users must be able to enter text descriptions. Use case: Input Description.
FR2 Choose Language (M): Users must be able to select a language from the three provided languages. Use case: Input Description in Any Three Languages.
FR3 Language Validation (M): Language validation should happen in the front end. Use case: Input Description.
FR4 Model Synthesis (M): Provided descriptions must be synthesized into a high-end fashion design. Use case: Text-to-Image Synthesis.
FR5 Display Output (M): Synthesized fashion designs from descriptions should be displayed to the user. Use case: Display Output.
FR6 Download Image (M): Users must be able to download the synthesized fashion design. Use case: Download Image.
FR7 Reenter Description (M): Users must be able to reenter the descriptions. Use case: Display Output.
FR8 Display Error (M): Users should be able to receive error messages on system failure. Use case: Display Error Messages.
FR9 User Data (S): User data should only be used during the text-to-image synthesis process and not for any other purpose. Use case: Text-to-Image Synthesis.
FR10 User Input (S): User input should not be permanently stored within the system or its database. Use case: Text-to-Image Synthesis.
Table 4.10 - Functional Requirements


4.10.2 Non-Functional Requirements


NFR / Requirement / Description / Priority

NFR1 User Friendliness (M): Users should be able to readily grasp and move around the system without the need for extra training on the text-to-image synthesis models.
NFR2 Performance (M): Once a user enters a text description into the system, the text-to-image synthesis process must not take too long to complete and return a result. It must also be verified that the application will not crash while processing the text and creating the image.
NFR3 Quality of Image (M): Once a provided input text description has been synthesized into a fashion design, the quality of the resulting fashion design must be acceptable and usable by the user, as delivering a quality result is equally important.
NFR4 Security (M): It is critical to maintain strong security levels inside the system, as it processes sensitive inputs submitted by users.
NFR5 Scalability (C): The system must be able to adapt to a large number of users as it grows.
Table 4.11 - Non-Functional Requirements

4.11 Chapter Summary


The chapter's main focus was on recognizing all project stakeholders and determining their roles towards the system. The author then gathered requirements within the project scope and summarized them to produce a final list of functional and non-functional requirements. A rich picture was employed to depict the project's concept, and the stakeholders were depicted in the stakeholder onion model. The literature review, questionnaire, and brainstorming were the main requirement-gathering methods adopted here.


5. SOCIAL, LEGAL, ETHICAL AND PROFESSIONAL ISSUES


5.1 Chapter Overview
In this chapter, the author discusses the social, legal, ethical, and professional issues that may arise
during the project's execution, with the goal of defining mitigation strategies for those issues.

5.2 SLEP Issues & Mitigation


Social
 None of the respondents' personal information was gathered through the questionnaire. Users were made aware that the questionnaire was anonymous and that, by completing it, they were agreeing to allow the author to use the information gathered for research purposes.
 Responses to the distributed questionnaire were never incorporated verbatim into the project thesis. Only the quantitative analysis was added, with the collected data's privacy, confidentiality, and anonymity being respected and protected throughout the entire process.

Legal
 The prototype will be released under GPL, with the rights of GPL-licensed packages honored.
 Participants' personal information and privacy were well-protected during the questionnaire's administration.
 Those who participated in requirement collection and project evaluation interviews gave non-sensitive personal information with their consent.

Ethical
 Through the questionnaire description, participants who completed the questionnaires were made aware of the project's goals and objectives, as well as how their participation contributes to those goals and objectives.
 This dissertation does not allow for fabrication, falsification, or plagiarism of any kind. All of the data and information presented is correct, and all of the knowledge and facts that were extracted have been properly cited and referenced in the document.

Professional
 It was critical to follow best software engineering practices throughout the software development process in order to ensure that the software was in compliance with all applicable industry standards and guidelines at all times.
 Each step of the prototype development process took place in highly secure environments that were password protected and kept up-to-date with the most recent security patches available.
 There was no piracy in any of the software or tools used in the development of the prototype, and none of them were illegally obtained. No commercial or student licenses were used at any point; only open-source licenses were used throughout the duration of the project.
 No fabrication or falsification of data or results from the project was used to deceive viewers or evaluators into believing in a successful state that was never achieved.
Table 5.1 - SLEP Issues & Mitigations

5.3 Chapter Summary


In this chapter, the sole purpose was to discuss all social, legal, ethical, and professional issues that
arose as a result of the project and to describe how those issues were mitigated and resolved. All
of the major SLEP issues were discussed in depth under a variety of headings.


6. DESIGN
6.1 Chapter Overview
This chapter discussed the project's design aspects in depth. It covers everything from the system's
core to the user interface. The requirements acquired through the literature review, questionnaire,
and discoveries from the brainstorming process were used to make design selections. The rationale
behind various design decisions is also explained.

6.2 Design Goals


Refer to Appendix G where design goals are defined.

6.3 System Architecture Design


6.3.1 Tiered Architecture

Figure 6.1 - Tiered Architecture (Self-Composed)

The contributing system's tiered design is divided into three tiers: Presentation Layer, Logic Layer,
and Data Layer. The Logic Layer serves as a link between the Data Layer which stores all of the


application's data and the Client Layer, which displays the user-interactive components. When it
comes to operations, Logic Layer organizes the processes better and provides more flexibility.

Data Layer

 Existing Fashion Dataset – These are the stored fashion dataset of type h5 files. This file
consists of the text description, image associated with it, and category type. These datasets
only will be used to create the image-encoder and text-encoder.
 Trained text-encoder and image encoder – This storage contains the trained image-text
encoder that is developed to compute the image-text matching loss for the training of the
generator.
 Trained Models Generator & Discriminator – Using the fashion dataset, text-image
encoders, different models of generator, and discriminators were trained. This storage
offers all the models for the process of synthesizing fashion designs from text descriptions.
 Image Output Storage - This is the file directory where the generated image is stored so that it can be sent to the client layer to be visualized.

Logic Layer

 GAN Module – This module will consist of the GAN network which will be used to
generate fashion images from the fashion training dataset. This GAN network will consist
of a generator model to generate fake images based on the pre-trained DAMSM encoder,
and a discriminator model that will be trained separately on real images and real captions.
 Text-to-Image Synthesis Module - This module is responsible for synthesizing fashion designs from the text description. It consists of a DAMSM encoder to compute the image-text similarity score, a contrastive learning module that maximizes and minimizes loss based on the DAMSM text-image similarity, an RNN encoder to extract word and sentence embeddings from the text description, and the generator module to generate the fashion design images from the provided text description.

Presentation Layer

 Landing Webpage Wizard - This will be displayed to the user when they first access the
online application, along with a quick introduction and instructions on how to use it.


 Input Text Description - This is the wizard that will invite the user to provide their text
description to produce a high-quality fashion design. This module will send the text
description to the text-to-image synthesis processing module once the user has entered it.
 Loading Screen Wizard - It will take some time for the system to synthesize the fashion
design from the text description once the user submits the text description. As a result, the
user must be informed that their text description is being processed by the system. As a
result, a loading screen appears.
 Display Image Wizard - The loading screen will transfer the user to a new page that will
display both their input description and the output image once the text-to-image process is
complete and the image is ready.

6.4 System Design


6.4.1 Choice of the Design Paradigm
In software engineering, design paradigms describe the developer's perspective on the recognized
problem and how the solution is organized. As a result, the part that follows provides an overview
of the design paradigms explored as well as the reasoning for their selection.

Design Paradigm / Selection / Justification

Structured System Analysis and Design Methods (SSADM): No. The primary concern of SSADM is the system's procedure; it mainly revolves around the functions created in the project. It is better suited to projects with clearly defined requirements and is therefore not the best paradigm for this project.
Object-Oriented Analysis and Design Methods (OOADM): Yes. The primary focus of OOADM is on data structure and the impact of the system on real-world objects. It is highly reusable, making it ideal for applications with changing needs. As a result, the OOADM design paradigm was chosen.
Table 6.1 - Choice of Design Paradigm

6.4.2 Component Diagram


The components identified from the high-level architecture, and their data flow between each of
them are represented below.


Figure 6.2 - Component Diagram (Self-Composed)

6.4.3 Sequence Diagram


The sequence diagram demonstrates how the text-to-image synthesis process has been divided into
sub-entities, as well as the sequence of interactions that occur between the user and the system's
core entities when the user anticipates a fashion design from the provided description.

Figure 6.3 - Sequence Diagram (Self-Composed)


6.4.4 Class Diagram


The class diagram depicts the relationships between classes, their methods, and their
characteristics.

Figure 6.4 - Class Diagram (Self-Composed)

6.4.5 UI Design
The project's UI (User Interface) had a straightforward goal. It intended to create a user-friendly
interface that would simplify the process of synthesizing fashion designs from user-provided
descriptions. Another requirement was for the user interface to be responsive for mobile
applications, as consumers tend to save a large number of photographs on their mobile devices. A
four-page responsive web application was created to meet all of these goals and to give a basic yet
comfortable user experience with a lower learning curve. The wireframes below will appear on
mobile and desktop devices.

The UI wireframes can be found on Appendix H


6.4.6 System Process Flow Chart


The system process flow chart depicts the system's flow and how its choices are made when a user
uses the system to synthesize an image from a text description. This primarily demonstrates how
structured programming decisions are made and processes are controlled in the critical parts of the
application's logic tier.

Figure 6.5 - System Process Flowchart (Self-Composed)

6.5 Chapter Summary


This chapter explains in detail all of the design decisions made based on the outcomes of the
literature review and requirement collection via questionnaires and brainstorming. The component
diagram modularized the entire system into separate components and depicted the flow of data
among those components, whilst the tiered architecture provided an understanding of the three
primary tiers of the system. The sequence diagram further modularized the system into four
primary components and demonstrated how it will operate when a user interacts with it at a high
level. The main notion of the system's planned user interfaces was shown in the UI design.


7. IMPLEMENTATION
7.1 Chapter Overview
This chapter describes the implementation of the research project's prototype, covering the technological stack of the system and how it was built. The data selection is also examined closely, explaining the dataset and the attributes used by the text-to-image synthesis model, and the libraries and IDEs used are discussed in detail. It also describes the core components and the various decisions taken during the development of the system.

7.2 Technological Selection


7.2.1 Technological Stack
Python is the most commonly used programming language for developing machine learning models, so the core framework was built in Python, with a PyTorch backend used to develop the DAMSM encoder ensembled with contrastive learning and the AttnGAN network. Libraries like scikit-image, NLTK, NumPy, pandas, and Pillow came in handy to preprocess text and images and convert them to vectors so that the model can use them as inputs. For prototype development, Google Colab was used, and PyCharm and VS Code were utilized as IDEs. For version control, Git will be utilized. For the backend, Python Flask will be used, and React JS will be used for the development of the frontend. Google Cloud App Engine will be used to deploy the system on a server so it will be accessible to the public.

7.2.2 Data Selection


The fundamental requirement in a data science project is its data and the relevant attributes; in our case, these are text and image tuple pairs. This project's data is comprised of images and written
descriptions. In terms of text-to-image synthesis, the goal of this project is that text descriptions
are synthesized into high-quality two-dimensional fashion designs. As a result, the models were
trained with the help of datasets containing fashion descriptions and images.

The project aimed to use fashion-based text and image pairs to train and develop a text-to-image
synthesis model so that the system could generate fashion design images from large text
descriptions consisting of a substantial amount of words. The FashionGen dataset was used to train
the framework's GAN network. FashionGen included text descriptions and images for T-shirts,
Shirts, Jeans, Pants, Suits & Blazers, and Tops.
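
As a rough, illustrative sketch of how a text-image pair could be read from a FashionGen-style h5 file, the snippet below uses h5py; the key names ('input_image', 'input_description', 'input_category') and the file name are assumptions and may not match the exact dataset release.

import h5py
from PIL import Image

def load_sample(h5_path, index):
    # Read one (image, caption, category) tuple from the dataset file.
    with h5py.File(h5_path, "r") as data:
        image = Image.fromarray(data["input_image"][index])               # H x W x 3 uint8 array
        caption = data["input_description"][index][0].decode("utf-8")     # raw text description
        category = data["input_category"][index][0].decode("utf-8")       # e.g. "SHIRTS"
    return image, caption, category

# Illustrative usage (path is a placeholder):
# image, caption, category = load_sample("fashiongen_train.h5", 0)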


7.2.3 Selection of Development Framework


PyTorch - Since this is a deep learning project, the PyTorch framework was chosen for the
development of RNN model for text encoders, CNN model for image encoders, GAN model
comprising of the generator and discriminator for the generation of fashion design images.

React JS - As a requirement of the project, the React JS framework will be used for the frontend.
The front end is a simple user-friendly web application. The usage of JavaScript libraries with the
support of node modules will be incorporated to make the process of the frontend easy, it will be
separated to use a component-page architecture so that the API calls from the frontend can be
easily executed.

Flask - To connect the application's backend and frontend, the Flask web framework was used.
Flask was chosen because communication between the front end and the back end was required.
Flask is also a lightweight web framework that connects both ends via API requests. It also helps
the application run more efficiently.

7.2.4 Programming Language


Python was the development language considered for this type of system and was chosen as the language to develop the GAN models and write the code logic. Python was also chosen because of its strong deep learning and image processing libraries for image generation and synthesis.

7.2.5 Libraries Utilized


The prototype made use of libraries such as NumPy, pandas, Pillow, scikit-image, SciPy, and NLTK. NumPy was used to work with all types of text and image vectors, as well as 2D arrays. Pillow was used to deal with images, particularly when opening and saving them. Scikit-image was used for image resizing and augmentation. The SciPy library was used to compute the GAN network's inception score as well as the contrastive learning loss, by applying the computed image-text matching loss captured from the DAMSM model.
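
To make the preprocessing role of these libraries concrete, the sketch below shows one plausible way a caption and an image could be prepared before vectorization; it is an illustration under assumed conventions (image scaled to [-1, 1], simple word tokenization), not the project's exact code.

import numpy as np
from PIL import Image
from skimage.transform import resize
from nltk.tokenize import RegexpTokenizer

def preprocess_pair(image_path, caption, image_size=256):
    # Image: open with Pillow, resize with scikit-image, then scale to [-1, 1] for the GAN.
    image = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32) / 255.0
    image = resize(image, (image_size, image_size), anti_aliasing=True)
    image = image * 2.0 - 1.0

    # Text: lowercase and tokenize so each word can later be mapped to a vocabulary index.
    tokens = RegexpTokenizer(r"\w+").tokenize(caption.lower())
    return image, tokens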

7.2.6 IDE’s Utilized


Google Colab was used as the primary IDE for the research and development of the system's core
components because the training process of the components required high GPU and CPU power
as well as longer training times. As a result, Google Colab with the use of multiple accounts was


the best free option for an IDE. Initially, the backend was developed using PyCharm IDE, and the
front end of the application was developed using Visual Studio Code IDE. PyCharm was chosen
to create the backend because it is the most well-known IDE for Python application development
and also makes it easier to use required packages when developing.

7.2.7 Summary of Technology Selection


Component Tool
Programming Language Python
Development Framework - DAMSM Component PyTorch
Development Framework – GAN Component PyTorch
Development Framework – Final Product Flask
Libraries NumPy, Pillow, SciPy, Scikit-Image,
Pandas, NLTK
UI Framework React JS
IDE – Research Component Google Colab, Kaggle Notebooks
IDE – Product VS Code, PyCharm
Version Control Git
Deployment Google Cloud App Engine
Table 7.1 - Summary of Technology Selection

7.3 Implementation of Core Functionalities


Refer to Appendix I for the DAMSM text and image encoder network & Attn GAN network’s
generator and discriminator.
7.3.1 Core Research Contribution
The code snippets below are entirely contributions of this project, developed following the software engineering discipline.


7.3.1.1 Contrastive Learning Loss Function

Figure 7.1 - Contrastive Loss Function

7.3.1.2 Training DAMSM Network


7.3.1.3 Training GAN Network

Figure 7.2 - Training GAN Network


7.3.2 System Benchmarking Algorithms


Inception Score and R-Precision are the two main quantitative metrics used. The Inception Score is based on image quality, and R-Precision is the average rate at which generated images are correctly matched to their text descriptions drawn from the FashionGen dataset.

7.3.2.1 Inception Score Calculation

Figure 7.3 - Inception Score Calculation

7.3.2.2 R-Precision Calculation

Figure 7.4 - R-Precision Calculation


7.4 Implementation of APIs

Figure 7.5 - API Route (Generate Fashion Design)

The above route is to generate the fashion design using the trained GAN Model

Figure 7.6 - API Route (Download Image)

The above API route is to download a particular image
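
A minimal Flask sketch of the two routes described above is given below; the route paths, payload fields, and the synthesize() helper are illustrative assumptions rather than the project's exact implementation.

from flask import Flask, request, jsonify, send_file

app = Flask(__name__)

@app.route("/api/generate", methods=["POST"])
def generate_fashion_design():
    payload = request.get_json()
    description = payload.get("description", "")
    language = payload.get("language", "en")
    # synthesize() is a hypothetical helper wrapping the trained AttnGAN generator;
    # it is assumed to save the generated design and return its file path.
    image_path = synthesize(description, language)
    return jsonify({"image_path": image_path})

@app.route("/api/download", methods=["GET"])
def download_image():
    # Returns the previously generated image so the frontend can trigger a save dialog.
    image_path = request.args.get("path")
    return send_file(image_path, mimetype="image/jpeg", as_attachment=True)

if __name__ == "__main__":
    app.run()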

7.5 Chapter Summary


This chapter concentrated on how the high- and low-level design concepts were transformed into
implementation. The technology stack that was used was thoroughly explained, along with the
reasons for using it. All of the code snippets for the system's core functionalities, as well as the
core of the research contribution which is the contrastive learning module and contrastive loss,
were accompanied by implementation.


8. TESTING
8.1 Chapter Overview
This chapter covers how testing was done to verify Fashionable's intended functional flow. It discusses the tests in detail, including model testing, benchmarking, functional testing, non-functional testing, and module and integration testing.

8.2 Objectives and Goals of Testing


The primary goal of software testing is to ensure that the system is operating in accordance with
the expectations set forth by the requirements. The following are the primary goals of the
Fashionable testing process.

 To ensure that all system models are performing as expected and are thoroughly tested in
order to achieve the best results.
 Identify how the system can be benchmarked and achieve proper benchmarking against
other systems.
 To determine whether the system satisfies all the functional requirements and non-
functional requirements.
 To improve the system's experience based on test results.

8.3 Testing Criteria


A criterion for testing the system in two ways is defined to narrow the gap between the anticipated
and implemented systems. The two types of tests are:

1. Functional Quality - Using functional requirements, this focuses on the system's


development characteristics and technical features to see how well they match the
specified design.
2. Structural Quality - Testing the code that has been developed to determine if it is in
compliance with software engineering best practices.

8.4 Model Evaluation


Model testing was carried out in two different ways: text-to-image model testing and GAN model
ensembled with contrastive loss testing.


When looking at model evaluation metrics for text-to-image synthesis approaches, it was
discovered in the Evaluation Methodology section that the Inception Score (IS) and R-Precision
were the most commonly used metrics.

Inception Score

The Inception Score is the mean score used for assessing image quality, since it has been demonstrated to correspond well with human opinion.

The generated images are taken as input and compared against the pretrained Inception v3 model's CIFAR-100 images to obtain the mean score and the standard deviation of the generated images relative to the CIFAR-100 images.

Figure 8.1 - Inception Score

R-Precision

R-Precision ranks retrieval performance between the retrieved image and text attributes to determine the visual-semantic similarity between text descriptions and the generated image. It is defined as r/R, in other words the ratio of correct text-image pairs to the total number of predicted text-image pairs.

The R-Precision function takes the trained image and text encoders as input and computes the cosine similarity between the test texts and images using the text and image encoders respectively. A pair whose ground-truth caption has the closest similarity is counted as a success, and the number of successes is divided by the total processed to obtain the precision score.
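
A sketch of this R-Precision check is shown below: for each generated image, the ground-truth caption must be the most cosine-similar among a pool of candidate captions. The encoder interfaces (each returning one embedding vector per input) are assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def r_precision(image_encoder, text_encoder, images, true_captions, mismatched_captions):
    img_emb = F.normalize(image_encoder(images), dim=-1)                   # (N, D)
    true_emb = F.normalize(text_encoder(true_captions), dim=-1)            # (N, D)
    wrong_emb = F.normalize(text_encoder(mismatched_captions), dim=-1)     # (M, D)

    correct = 0
    for i in range(img_emb.size(0)):
        candidates = torch.cat([true_emb[i:i + 1], wrong_emb], dim=0)      # truth + mismatched texts
        sims = candidates @ img_emb[i]                                     # cosine similarities
        if sims.argmax().item() == 0:                                      # ground truth ranked first
            correct += 1
    return correct / img_emb.size(0)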


Figure 8.2 - R-Precision

Quantitative Testing Results of the model performance

Model Specification IS R-Precision


AttnGAN + Contrastive Learning 4.78 ± 0.3 80%
Table 8.1 - Quantitative Test Results

AttnGAN Model Evaluation

The average discriminator loss for a text-to-image discriminator ranges between 0.1 and 0.3 (Bodnar, 2018a). A loss of 0 indicates that the discriminator network has overfitted to the training data and has won the min-max game; in that scenario, the generator network produces a random set of outputs.

Contrastive Learning Evaluation

Contrastive learning here is essentially a loss function used to pull together the captions corresponding to the same image and push apart the captions that do not correspond to the same image. It uses the cross-entropy loss function with "sum" reduction applied to the cosine similarities between the positive and negative samples. Finally, the loss is derived by dividing by the current batch size.
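
A minimal sketch of such a loss is shown below: cosine similarities between every text-image pair in a batch are scored with cross-entropy ("sum" reduction), with matching pairs on the diagonal as positives, and the result is divided by the batch size. The temperature value is an assumption.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.1):
    # image_emb, text_emb: (B, D) embeddings for the same batch of text-image pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                        # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)     # diagonal = positive pairs
    loss = F.cross_entropy(logits, targets, reduction="sum")
    return loss / image_emb.size(0)                                        # normalize by the batch size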

DAMSM Model Evaluation

Figure 8.3 - DAMSM Model


The DAMSM model, consisting of an image encoder and a text encoder, computes the multimodal similarity between the text and the image. A good text-image matching loss is then used by the generator to generate images in a semi-supervised manner.
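
To illustrate the sentence-level part of this matching loss, a simplified DAMSM-style sketch is given below (the word-region attention term is omitted, and the gamma scaling factor is an assumption): each image should be most similar to its own caption and vice versa.

import torch
import torch.nn.functional as F

def sentence_matching_loss(cnn_code, sent_emb, gamma=10.0):
    # cnn_code: (B, D) global image features; sent_emb: (B, D) sentence embeddings.
    cnn_code = F.normalize(cnn_code, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    scores = gamma * cnn_code @ sent_emb.t()                       # scaled similarity matrix
    labels = torch.arange(cnn_code.size(0), device=cnn_code.device)
    loss_i2t = F.cross_entropy(scores, labels)                     # match each image to its caption
    loss_t2i = F.cross_entropy(scores.t(), labels)                 # match each caption to its image
    return loss_i2t + loss_t2i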

8.5 Benchmarking
The best-performing ensemble model for the text-to-image synthesis domain, able to generate images for long phrases, was selected for the final system after extensive testing of various loss functions, optimizers, splits, epochs, and batch sizes. The Inception Score and R-Precision were used to benchmark the system against other systems in the domain.

Model Specification IS (higher is better) R-Precision (higher is better)

Fashionable (Ours) 4.78 ± 0.3 80%
(Xu et al., 2018b) 4.36 ± 0.5 67.82%
(S. Reed et al., 2016) 2.88 ± 0.4 -
(S. E. Reed et al., 2016) 3.62 ± 0.7 -
(Zhang et al., 2017) 3.70 ± 0.4 -
(Zhu et al., 2019) 4.75 ± 0.07 72.31%
(Tao et al., 2020) 5.10 ± 0.00 -
(Qiao et al., 2019) 4.56 ± 0.41 -
(Hong et al., 2018) 4.46 ± 0.09 -
(Liang, Pei and Lu, 2020) 5.27 ± 0.61 93%
Figure 8.4 - Benchmarking of Existing Systems

8.6 Functional Testing


Functional testing of the application was done for the functional requirements of the prototype.

Test Case ID / FR / User Action / Expected Result / Actual Result / Status

1 FR1: User inputs a description. Expected: description gets entered. Actual: description gets entered. Passed
2 FR2: User must be able to select a language from the three provided languages. Expected: language gets selected. Actual: language gets selected. Passed
3 FR3: User must be able to select a language and provide descriptions to generate fashion designs. Expected: language validation happens. Actual: language validation happens. Passed
4 FR4: User-provided descriptions must be synthesized into a high-end fashion design when the generate button is clicked. Expected: fashion design gets generated. Actual: fashion designs get generated. Passed
5 FR5: User must be able to visualize the output. Expected: fashion designs get displayed. Actual: fashion designs get displayed. Passed
6 FR6: User must be able to download the image. Expected: image gets downloaded. Actual: image gets downloaded. Passed
7 FR7: Users must be able to re-enter a description after visualizing the output image. Expected: ability to re-enter description. Actual: ability to re-enter description. Passed
8 FR8: Users should be able to receive error messages on system failure. Expected: receives error message on system failure. Actual: receives error message on system failure. Passed
9 FR9: User-provided data should only be used during the text-to-image synthesis process and not for any other purpose. Expected: user-provided description is only used for text-to-image synthesis. Actual: user-provided description is only used for text-to-image synthesis. Passed
10 FR10: User input should not be permanently stored within the system or its database. Expected: no database, session, or cookies are used to store user input. Actual: no database, session, or cookies are used to store user input. Passed
Table 8.2- Functional Testing

8.7 Module Integration Testing


Module / Input / Expected Output / Actual Output / Status

Text Description Validation:
Correct text description in any selected language. Expected: text description gets processed. Actual: text description gets processed. Passed
Correct multiple text descriptions with use of a line separator in any selected language. Expected: text descriptions get processed. Actual: text descriptions get processed. Passed
Text description in the wrong selected language. Expected: error message displayed. Actual: error message displayed. Passed
Multiple text descriptions without a line separator. Expected: error message displayed. Actual: error message displayed. Passed

Text Preprocessing:
Text description in the selected language. Expected: preprocessed text description. Actual: preprocessed text description. Passed
Multiple text descriptions in the selected language. Expected: preprocessed text descriptions. Actual: preprocessed text descriptions. Passed

Text-to-Image Synthesis:
Preprocessed text description in the selected language. Expected: high-quality fashion design. Actual: high-quality fashion design. Passed
Preprocessed text descriptions in the selected language. Expected: high-quality fashion designs. Actual: high-quality fashion designs. Passed

Fashion Designs Output Loader:
Fashion designs. Expected: show fashion designs. Actual: show fashion designs. Passed
Fashion designs. Expected: download fashion designs. Actual: download fashion designs. Passed
Table 8.3 - Module Integration Testing

8.8 Non-Functional Testing


Regarding non-functional requirements, four major ones stood out in both the research and the implemented system: accuracy, performance, output quality, and security. Because output quality was already fully covered by other types of tests, the remaining requirements are covered under non-functional testing below.

8.8.1 Accuracy
As part of the model review and benchmarking process, accuracy testing was carried out. The goal was for the proposed approach to outperform hand-designed architectures, and a higher Inception Score was achieved by utilizing the proposed strategy. Benchmarking included comparisons to other SOTA models. The prototype's correctness was further determined through functional and non-functional testing.

8.8.2 Performance
Research Development Performance

The primary focus of the research was to investigate the learning of graphical representations as well as the text-to-image conversion process. To achieve successful results, it was necessary to have not only sufficient CPU and RAM but also sufficient GPU power. Google Colab was therefore used, with its Tesla K80 GPU and 11 gigabytes of video RAM employed for model training, since the training phase was the most resource-intensive phase of the research. The information about GPU utilization can be seen in the screenshot below.


Figure 8.5 - GPU Performance

As a result of the high consumption of graphical resources, GPU video RAM usage exceeded 8 gigabytes out of the 11 gigabytes available. In summary, the training process was the phase of the research component that required the most time and resources.
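
The GPU utilization reported above can also be monitored programmatically from inside the training notebook. The following is a minimal sketch using PyTorch's CUDA utilities, assuming a CUDA runtime such as the Colab instance described here.

import torch

# Minimal sketch: report the active GPU and its memory usage from inside a training run.
if torch.cuda.is_available():
    device = torch.device("cuda")
    gib = 1024 ** 3
    print("GPU:", torch.cuda.get_device_name(device))
    print(f"Allocated by tensors: {torch.cuda.memory_allocated(device) / gib:.2f} GiB")
    print(f"Reserved by the caching allocator: {torch.cuda.memory_reserved(device) / gib:.2f} GiB")
    total = torch.cuda.get_device_properties(device).total_memory / gib
    print(f"Total device memory: {total:.2f} GiB")
else:
    print("No CUDA device available; training would fall back to the CPU.")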

8.8.3 Security
Security is crucial when building a system that handles user data, especially when it is hosted on a public cloud server. The measures below make the application safer for users.

1. It was decided to deploy a web application with the capability of communicating with the
server using HTTPS, which is an encrypted communication protocol. Thus, no opportunity
will exist for an attacker to intercept communication between the application and the server
in this situation.
2. After a user session ended, none of the user data was retained within the application, which
was also a functional requirement of the application. The use of this practice eliminated the
possibility of unnecessary user data being disclosed to unapproved parties.

8.8.4 User Friendliness


Usability testing of the prototype was conducted as focus-group testing, using the same target audience as the evaluations.


How would you rate the usability of the prototype?

52.9% of the evaluators rated the usability as excellent, 41.2% rated it as good, and only a few rated it as neutral.

Figure 8.6 - Usability of the Prototype

8.9 Limitations of the Testing Process


Because the models consumed a significant amount of hardware resources and time during testing, it was difficult to run them for very large numbers of epochs or batch sizes. In addition, since the GANs were trained with combinations of different loss functions and optimizations that demanded substantial GPU power, many loss functions could not be compared against the evaluation metric. The limited GPU power also meant longer training periods and higher CUDA memory usage when training ESRGAN for super-resolution, which in turn affected the quality of the generated images.

8.10 Chapter Summary


This chapter laid out the goals and objectives of testing and covered functional testing, non-functional testing, and module integration testing. Extensive model testing was carried out to obtain the best-performing models without overfitting the training data, and additional comparisons were made against previously developed approaches.


9. EVALUATION
9.1 Chapter Overview
It was time to analyze the research project's overall conclusion after the designed prototype had
been successfully implemented and optimized to achieve the greatest performance feasible through
a large number of training combinations. This chapter will focus on implementing the project's
evaluation process, which includes self-evaluation, domain and technical expert evaluations, and
other stakeholder evaluations.

9.2 Evaluation Methodology & Approach


In developing an evaluation strategy for the research, the author combined a quantitative and a qualitative approach. Using two quantitative metrics, the research outcome, which the tests in the previous chapter had established as the best-performing configuration, was evaluated against other credible and novel research in the domain. An objective of primary importance was to conduct a qualitative evaluation based on the feedback of domain and industry experts: questionnaires were distributed and interviews were conducted with domain experts to gather feedback for a thematic analysis.

9.3 Evaluation Criteria


Criteria | Purpose
The proposed novelty of the project | To evaluate the problem and solution novelty of the project.
The scope of the project | To evaluate the depth of the scope of the project.
Proposed architecture | To evaluate the proposed architecture for the project.
Model implementation depth | To evaluate the depth of implementing the solution and the model used to solve the problem.
Proposed solution | To evaluate whether the system provides a solution to the identified problem.
Development approach undertaken | To determine whether the expected approach and best practices were correctly followed during the development of the prototype used to demonstrate the concept.
Quantitative benchmarking approach | To validate the metrics used to benchmark the model and to assess how well it performed compared to other models in the same domain.
Quality of the GUI | To evaluate the quality of the GUI.

Table 9.1 - Evaluation Criteria

9.4 Self Evaluation


Theme | Author's Evaluation

Concept of the Research | AttnGAN is a novel approach to text-to-image synthesis in the context of machine learning, and it is used here together with contrastive learning. Text-to-image synthesis is a fundamental and difficult problem in computer vision that is also important to many other domains. Proposing a novel architecture for text-to-image synthesis in light of recent advances in deep learning can be considered timely research that addresses an important problem in the field, and the success of this research will have ramifications in a variety of other fields.

Scope and depth of the project | Many existing works have developed text-to-image synthesis applications using unsupervised learning and a GAN-based approach. The GAN-based approach gives plausible results, but the text description is not kept semantically consistent with the image. Thus, the author decided to contribute to the research domain by focusing on large text descriptions to generate images using model ensembling. Moreover, in the Sri Lankan context, this prototype allows users to generate images from text descriptions in multiple languages. During the requirement-gathering phase, the scope was refined in response to suggestions from domain and technical experts on how to improve the model so that images could be generated more accurately and efficiently. The refined scope and depth were both significantly higher than average due to the nature of GANs.

System Design, Architecture & Implementation | The design, architecture, and implementation of the concepts and components based on the research were complicated because of the difficulty of those concepts and components. Through frequent discussions, prototyping, and interviews, the functional and non-functional requirements, as well as the main components, were identified. A variety of cutting-edge technologies were employed in the development process, and aspects such as code quality and industry standards were taken into consideration throughout the design and implementation. The functionality, design, architecture, and implementation were completed to a high level of quality.

Model Implementation | To generate images from text descriptions, an AttnGAN model ensembled with a contrastive learning loss function was used; this model implementation was the core component of the research. A very satisfactory level of model implementation was achieved based on the refined scope, following best practices and industry standards throughout its development.

Solution and the prototype | The prototype was developed as a web-based solution. A lot of thought was put into the components, and they can be improved even further; product-level enhancements were noted in the future work section. The prototype's usability demonstrated that text-to-image synthesis can be learned and experimented with in a straightforward and simple manner.

Limitations and future enhancements | The primary constraint on the development process was the availability of GPU resources. Through hyperparameter tuning, the model can be made even better. All of the identified limitations and enhancements are detailed in Chapter 10.

Table 9.2 - Self-Evaluation

9.5 Selection of the Evaluators


Three groups of evaluators were used in order to obtain evaluations from a diverse group of individuals with differing levels of education and experience. The evaluation panels are described in greater detail in the Appendix, and the evaluators are listed below.

1. An evaluation panel comprised of researchers with intermediate experience in the fields of generative models and deep learning was selected.
2. Researchers from the deep learning, image processing, and generative modelling domains were chosen.
3. Industry experts who were familiar with the computer vision domain were chosen.

Appendix J contains the evaluators in tabular format with their respective names, positions, and affiliations.

9.6 Evaluation Results and Expert Opinions


9.6.1 Qualitative Result Analysis
Qualitative result analysis was conducted using thematic analysis; the feedback of the selected evaluators is presented below.

9.6.1.1 The Research Concept of the Project


Eval ID Feedback
EV9 The project idea is very interesting. Proper research has been carried out based on the LR.
EV3 Text-to-image generation is a very good niche in the deep learning domain. The idea
of generating images from large text descriptions in multiple languages is a valuable
contribution to the text to image domain.
EV11 The idea has a very good scope and will produce a valuable contribution to the
fashion design and the deep learning domain.
EV5 Developing such a research concept within a timeline of 6 months is very
challenging and completing it within the timeline is a wonderful achievement.
Table 9.3 - Research Concept of the Project

Aarthif Nawaz | w1715752 80


Fashionable

9.6.1.2 Novelty of the Project


Eval ID Feedback
EV1 This project using AttnGAN ensembled with contrastive learning is a novel
approach that I have witnessed in the text-to-image synthesis domain.
EV3 I have read research papers on text-to-image synthesis, but understanding that it is not yet viable to generate images from large text descriptions, and developing a prototype to improve on that, is a really good attempt.
EV11 Text-to-image models of this kind have not yet been proposed in the literature, but the researcher has come up with a fresh notion for one. Text-to-image synthesis would benefit from this project's originality, according to the literature review.
EV4 I have also done a novel project using NAS and GANs, and looking at this project's novelty and the author's LR, I have not yet seen such a development in this domain.
Table 9.4 - Novelty of the Project Evaluation

9.6.1.3 Proposed Architecture & Solution


Eval ID Feedback
EV5 Features are all good and are easy to use. Prototype offers relevant features and does
the job it is supposed to do.
EV1 It's great to have such a system for real world industry usage.
EV4 The solution of offering a multi-language feature in the Sri Lankan context, allowing input in multiple languages, is an outstanding feature of the system. Apart from the minor blurry effect in the image quality, the rest of the prototype is good. Good luck.
EV9 Well Implemented
EV10 The presented solution is addressing the problem very well and it is also doing a
good job at producing an identifiable output. The research gap which is being
addressed has been contributed for. However, the outputs being produced have room
for improvement as those are not highly usable in real world scenarios. But, the
produced outputs go on to show the capability and potential of the model and the
technology being used. As an initial contribution to the field of research, addressing
a valid gap on which the solutions/contributions have so much potential to grow on, this is a very good contribution, especially at the undergraduate level. Quantitative evaluations seem to have been done on all possible aspects, to compare the
implemented model against other works in the research domain.
Table 9.5 - Proposed Architecture of the Project Evaluation

9.6.1.4 Model Implementation/Code


Eval ID Feedback
EV8 First and foremost, I really like the idea of having a system that transforms text to images using an ensembled GAN. It is a good option to have multiple languages, especially for users in Sri Lanka.
EV1 The model developed using GANs could have been really complex, but the effort put into the development of the prototype within a short period of time is a good effort. As you mentioned, the lack of GPU resources affected the model output; you could have purchased a cloud-computing, GPU-powered machine and trained it for a day for better quality. But overall, good job.
EV6 The features offered are direct and clear. There is not much clutter in the interface, which keeps the end goal of using the product simple and quick. The flexibility of languages offered is good, since all languages widely used in Sri Lanka have been included. Overall, the implementation is satisfactory in achieving the end use case/goal of the project/research.
EV10 Using AttnGAN is a very good approach as it focuses on fine-grained details, and ensembling it with a loss function to generate images for the fashion-based industry is a good research feature. Also, the ability to synthesize from multiple languages is a good additional option for users.
EV9 A very good aspect is offering a multi-language feature for synthesizing images from text descriptions in different languages. Moreover, it is a wonderful job to have trained an ensembled GAN network within a short period of time.
Table 9.6 - Model Implementation Code Evaluation
Table 9.6 - Model Implementation Code Evaluation

9.6.1.5 GUI
Eval ID Feedback


EV12 The GUI of the application was clear and responsive. It was easy to understand
how the application works and steps to follow to complete my task. The
description about the application was clear and the text and graphics used were
also understandable.
EV11 The responsiveness of the GUI is great and the user interface was simple to use. Portraying beforehand how the prototype should be used is a very good idea that makes the application very handy.
EV3 The GUI features are convenient and cover most of what is required in this use case.
Much can be admired in terms of the research idea, the project's management, the
outcomes and output, and even the user interface of this project. Considering that
this was an undergraduate effort, this appears to be an impressive outcome.
Table 9.7 – GUI Evaluation

9.6.2 Quantitative Evaluation


1. Do you think the presented solution has the depth to solve the problem?

The majority of the evaluators rated the presented solution as having the depth to solve the problem.

Figure 9.1 - Presented Solution Evaluation

2. Do you think the system provides a solution to the identified problem?

The majority of the evaluators said that the system provides the most appropriate solution to the identified problem.

Figure 9.2 - Solution to the Identified Problem Evaluation

3. Do you think the evaluation metrics used are relevant for the project?

65% of the evaluators responded that the most appropriate evaluation metrics were used for this project.

Figure 9.3 - Evaluation Metrics Evaluation

4. How would you rate the accuracy of the prototype?

The majority of the evaluators rated the accuracy of the prototype as 4/5.

Figure 9.4 - Accuracy of the Prototype Evaluation

5. How do you rate the presented GUI for the solution?

All of the expert evaluators liked the GUI of the prototype.

Figure 9.5 - Presented GUI Evaluation

9.7 Limitations of Evaluation


A few obstacles had to be worked around in order to complete the project's evaluation phase. As a result of the situation in the country, most of the interviews had to be conducted online. Continuous power-cut schedules delayed many virtual interviews or forced them to be cut short, which may have reduced the feedback that could only be obtained by meeting people in person. One-on-one interviews with some industry experts were not possible because participants were spread across the country and even abroad; the only option was an online Google form, which eliminated any opportunity for the experts to experiment with and understand the prototype hands-on.

9.8 Evaluation of Functional Requirements


Evaluation of functional requirement implementations resulted in a 100% implementation success
rate; a detailed breakdown of requirements is shown in Appendix K.

9.9 Evaluation of Non-Functional Requirements


Evaluation of non-functional requirement implementations resulted in an 80% implementation success rate; a detailed breakdown of requirements is shown in Appendix L.

9.10 Chapter Summary


This chapter went into great detail on the process used to conduct the evaluation. Different criteria
for evaluating the project were established, dividing it into distinct parts. A self-evaluation of the
project by the author was provided based on those criteria. A thematic analysis was performed
based on the expert opinion acquired through interviews and questionnaires. Quantitative data was
depicted graphically, while the results of the thematic analysis were summarized in tables.


10. CONCLUSION
10.1 Chapter Overview
This chapter summarizes the findings from the research as a whole. The project's aims, objectives, and learning outcomes are laid out, along with the difficulties that were encountered. Knowledge and abilities developed over four years of study, as well as new skills learned through the research, are presented in this section. Also covered are the project's deviations from its initial design, its limitations, and future work.

10.2 Achievement of Research Aims & Objectives


The aim of this research was to design, develop, and evaluate a text-to-image synthesis model that takes as input fashion-related text descriptions with a large number of words, in any of three natural languages (English, Sinhala, Tamil), and generates highly synthesized in-vogue fashion design images with better details, resolution, and features.

The aim of the project was successfully achieved by designing, developing, and evaluating a novel text-to-image synthesis model which is fully capable of transforming text descriptions with a substantial number of words, in the three natural languages, into high-quality two-dimensional fashion designs.

10.3 Utilization of Knowledge from the Course


Module | Description
Programming Principles & Object-Oriented Programming | These modules helped the author learn the fundamentals of object-oriented programming and the basics of development.
Web Design and Development, Server-side Web Development | Beginning with fundamentals and progressing to API execution and response sending, these modules provided an introduction to client-server architecture for sophisticated web frameworks. Different backend and frontend frameworks were learnt after gathering the basic foundation from these modules.
Database Design & Implementation | The basics of database design and the different types of SQL and NoSQL programming were learnt.
Software Development Group Project | This module laid the groundwork for this research: every stage of the process was covered, from identifying the problem to developing and testing a prototype. Though it was team work at that stage, writing the final project report was made easier by the exposure to the SDGP module.
Client-Server Architecture | This module helped in understanding how the front end and back end work together, and how APIs should be structured and used during development.
Algorithms, Theory, Design & Implementation | Data structures and algorithms were covered extensively in this module. This aided in the implementation of time-critical algorithms by using logic and reasoning.
Table 10.1 - Utilization of the Course

10.4 Use of Existing Skills


 Using LinkedIn Learning and completing courses such as “Generative Adversarial Networks”, and self-learning through websites like freecodecamp and Kaggle. Extensive skills were also acquired through Udemy courses:
o Deep Learning from A-Z
o How GANs work
o Mathematical loss functions of GANs
 Knowledge and skills gained from the author's internship at WSO2 were really helpful in developing the final prototype.

10.5 New Skills


The following new skills were gained as a result of this project:

 Use of GANs Complementing Text-to-Image Synthesis - The author had never had hands-on experience with GANs or model ensembling before. This endeavour necessitated research into how these techniques are used in text-to-image synthesis.
 Deep Learning - Prior to this study, the author had no knowledge of deep learning. In order to complete this project, it was necessary to study deep learning techniques from sources like Coursera and Udemy.
 Ability to Fine-Tune Models - The author extensively researched and gained knowledge of how deep learning models should be optimized and fine-tuned to perform better.


10.6 Achievement of Learning Outcomes


What was Learnt | LOs
The author was able to conduct proper research into the domain of interest and to critically analyze previous works in the domain after conducting this research. | LO1, LO2, LO4
The author had never developed a prototype system for text-to-image synthesis before, so a plan was needed to gather the information required to build such a system from scratch. | LO3
Deep learning and GANs were new to the author, so he had to brush up on all of his technical knowledge, starting with the basics. It was a steep learning curve. | LO5, LO6
There was a lot of trial and error during the research's prototyping phase. Identifying difficulties, understanding how to solve them, and finally reaching the desired outcomes helped to develop independent problem-solving skills. | LO6
Detailed documentation was required at every stage of the research process, and these skills were crucial when it came time to write research papers and dissertations. | LO8
After developing the prototype, domain and technical experts had to provide their views and evaluations based on the author's presentation of the project idea and implemented solution. | LO9
Table 10.2 - Achievement of Learning Outcomes

10.7 Problems & Challenges Faced


Problem/Challenge | Solution
Long hours of training time | GANs required long hours of training time; the entire model took up to two and a half months to train for 800 epochs. It was therefore not feasible to train, test, and optimize with many different loss functions to see whether the model performed better under different hyperparameter tunings. Multiple Colab notebooks were used to train at the same time to reduce the overall training time.
Hardware requirements | High-end hardware was required to run the models efficiently because they were resource-hungry in nature. A machine with a powerful CPU and GPU was needed to take on this problem head-on.
Diverse domains involved in the research | Text-to-image synthesis with GANs was necessary in order to solve the problem identified in the field of computer vision. To overcome this obstacle, the author had to conduct an extensive study of each topic and dedicate a significant amount of time to mastering deep learning, GAN development, and text-to-image synthesis skills prior to the start of the project timeframe.
Vast learning curve | It was a challenge to learn about and implement several different GAN models. As soon as the project idea was finalized, trials with the prototype were begun in order to address this obstacle.
Table 10.3 - Problems & Challenges Faced

10.8 Deviations
 The initial plan was to use StackGAN with contrastive learning, but because of the longer training time and its inability to focus on fine-grained details, the algorithmic selection was changed to AttnGAN. AttnGAN also takes a long time to train, but it focuses more on the fine-grained details of the description.
 It was evident that an AttnGAN ensembled with contrastive learning could generate high-quality images. However, AttnGAN has multiple generators and discriminators, each corresponding to a specific, relatively low image resolution, and given the GPU constraints the author decided to use only two generators and discriminators, which gave an image resolution of 128*128 pixels. To improve the resolution, ESRGAN was therefore trained on the dataset to produce high-quality images (a sketch of this two-stage pipeline is shown after this list).
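
A minimal sketch of that two-stage inference path is given below. It assumes the RRDB generator definition from the open-source ESRGAN implementation (Wang et al., 2018) is importable as RRDBNet_arch, and the checkpoint file name is a hypothetical placeholder rather than the project's actual artefact.

import torch
import RRDBNet_arch as arch  # RRDB generator definition from the open-source ESRGAN code

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the super-resolution generator (the checkpoint name is a hypothetical placeholder).
esrgan = arch.RRDBNet(3, 3, 64, 23, gc=32)
esrgan.load_state_dict(torch.load("esrgan_fashiongen.pth", map_location=device))
esrgan.eval().to(device)

def upscale(fake_image: torch.Tensor) -> torch.Tensor:
    """Take a (1, 3, 128, 128) AttnGAN output scaled to [0, 1] and return a 4x upscaled image."""
    with torch.no_grad():
        sr = esrgan(fake_image.to(device))  # (1, 3, 512, 512)
    return sr.clamp(0.0, 1.0).cpu()

Keeping only two AttnGAN stages holds the adversarial training within the available GPU memory, while the separate ESRGAN pass restores the resolution that is lost by stopping at 128*128 pixels.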

10.9 Limitations of the Research


1. Since the GAN model took a long time to train, the dataset size was reduced immensely, which in turn reduced the quality of the images.
2. Since the AttnGAN was restricted to only two generators and discriminators because of GPU constraints, the generated image resolution was only 128*128 pixels, at which quality details were barely visible to the naked eye. The author therefore had to employ an ESRGAN trained on the dataset to generate a high-quality image from the generated image, which affected the performance of the prototype.


3. Though the R-precision score was superior compared to other models, the model's Inception score was fairly low, because the model could not be trained with a large dataset due to GPU constraints, which affected its accuracy.

10.10 Future Enhancements


There are many opportunities for future developments based on the restrictions and the novel
dimension that this project is introducing into the area.

 A major future enhancement would be to introduce text-to-video synthesis for the fashion
design industry.
 The model can be further improved by increasing the dataset size and improving the model
inception score.
 The core of the AttnGAN can be improved to up-sample a larger resolution of the image
at the initial stages of the generator and discriminator.
 There may be a number of additional ways to handle the same problem that could increase
the accuracy or performance of the image generation process.

10.11 Research Contribution of Achievement


This study aimed to fill a research gap in the field of text-to-image synthesis by converting written descriptions containing a large number of words into fashion designs, in any of three languages. Contrastive learning ensembled with AttnGAN was used to bridge this gap. The proposed approach addresses both the language barrier and the limitation of existing text-to-image synthesis techniques in generating high-quality images from text descriptions with a vast number of words.
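
To make the contrastive component concrete, the following is a minimal, assumption-laden sketch of a symmetric image-text contrastive (InfoNCE-style) loss. It illustrates the general idea of pulling matching image and text embeddings together while pushing mismatched pairs apart, rather than reproducing the exact loss used in the implementation.

import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE-style loss for a batch of matching (image, text) embedding pairs.

    image_emb, text_emb: (B, D) tensors where row i of both tensors describes the same sample.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; cross-entropy pulls them together
    # and pushes every off-diagonal (mismatched) pair apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

In a training loop, a term of this form would be weighted and added to the adversarial and DAMSM objectives, so that descriptions of the same design are drawn together in the embedding space while descriptions of different designs are pushed apart.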

10.12 Concluding Remarks


This chapter concludes the research thesis by examining if the study objectives have been reached,
how the author's skills have been employed, how the author has overcome challenges, limitations
of the project, possible future works, and the contribution to the domain. GAN and computer vision
could benefit greatly from this research, which culminates in a significant contribution. All aspects
of the research were carried out in accordance with the strategy that was devised before the project
began. It was highly praised by professionals in academia and industry who have extensive
knowledge of this and other general technological sectors.


REFERENCES
Abid, M. (2020) ‘Fashion Designing’, Medium, 24 January. Available at:
https://medium.com/@maryamabidkhan2/fashion-designing-6b4106dbf254 (Accessed: 23
November 2021).

Agnese, J. et al. (2020) ‘A survey and taxonomy of adversarial neural networks for text-to-image synthesis’, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10.

Ak, K.E. et al. (2020) ‘Semantically consistent text to fashion image synthesis with an enhanced
attentional generative adversarial network’, Pattern Recognit. Lett., 135, pp. 22–29.

Arjovsky, M., Chintala, S. and Bottou, L. (2017) ‘Wasserstein GAN’, ArXiv, abs/1701.07875.

Bekesh, R. et al. (2020) ‘Structural Modeling of Technical Text Analysis and Synthesis Processes’,
in COLINS.

Bodnar, C. (2018a) ‘Text to Image Synthesis Using Generative Adversarial Networks’, ArXiv,
abs/1805.00676.

Bodnar, C. (2018b) ‘Text to Image Synthesis Using Generative Adversarial Networks’, ArXiv,
abs/1805.00676.

Cheng, Q. and Gu, X. (2020a) ‘Cross-modal Feature Alignment based Hybrid Attentional
Generative Adversarial Networks for text-to-image synthesis’, Digit. Signal Process., 107, p.
102866.

Cheng, Q. and Gu, X. (2020b) ‘Cross-modal Feature Alignment based Hybrid Attentional
Generative Adversarial Networks for text-to-image synthesis’, Digit. Signal Process., 107, p.
102866.

Dhivya, K. and Navas, N.S. (2020) ‘Text to Realistic Image Generation Using Stackgan’, 2020
7th International Conference on Smart Structures and Systems (ICSSS), pp. 1–7.


Dong, H. et al. (2017) ‘I2T2I: Learning text to image synthesis with textual data augmentation’,
in 2017 IEEE International Conference on Image Processing (ICIP), pp. 2015–2019.
doi:10.1109/ICIP.2017.8296635.

Dong, Y. et al. (2021) ‘Unsupervised text-to-image synthesis’, Pattern Recognit., 110, p. 107573.

El, O.B., Licht, O. and Yosephian, N. (2018) ‘Recipe 2 Image : Multimodal High-Resolution Text
to Image Synthesis using Stacked Generative Adversarial Network’, in.

Frolov, S. et al. (2020) ‘Leveraging Visual Question Answering to Improve Text-to-Image


Synthesis’, ArXiv, abs/2010.14953.

Frolov, S. et al. (2021) ‘Adversarial Text-to-Image Synthesis: A Review’, Neural networks : the
official journal of the International Neural Network Society, 144, pp. 187–209.

Goodfellow, I. et al. (2014a) ‘Generative Adversarial Nets’, in NIPS.

Goodfellow, I. et al. (2014b) ‘Generative Adversarial Nets’, in NIPS.

Grønbech, C.H. et al. (2020) ‘scVAE: variational auto-encoders for single-cell gene expression
data’, Bioinformatics [Preprint].

Grover, A., Dhar, M. and Ermon, S. (2018) ‘Flow-GAN: Combining Maximum Likelihood and
Adversarial Learning in Generative Models’, arXiv:1705.08868 [cs, stat] [Preprint]. Available at:
http://arxiv.org/abs/1705.08868 (Accessed: 10 November 2021).

Haddi, E., Liu, X. and Shi, Y. (2013) ‘The Role of Text Pre-processing in Sentiment Analysis’, in
ITQM.

Han, F., Guerrero, R. and Pavlovic, V. (2020) ‘CookGAN: Meal Image Synthesis from
Ingredients’, arXiv:2002.11493 [cs] [Preprint]. Available at: http://arxiv.org/abs/2002.11493
(Accessed: 13 November 2021).

Heusel, M. et al. (2017) ‘GANs Trained by a Two Time-Scale Update Rule Converge to a Local
Nash Equilibrium’, in NIPS.


Hong, S. et al. (2018) ‘Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis’, 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7986–7994.

Hossain, Md.Z. et al. (2021) ‘Text to Image Synthesis for Improved Image Captioning’, IEEE
Access, 9, pp. 64918–64928. doi:10.1109/ACCESS.2021.3075579.

Hu, K. et al. (2021) ‘Text to Image Generation with Semantic-Spatial Aware GAN’, ArXiv,
abs/2104.00567.

Hu, T., Long, C. and Xiao, C. (2021) ‘CRD-CGAN: Category-Consistent and Relativistic
Constraints for Diverse Text-to-Image Generation’, ArXiv, abs/2107.13516.

Isola, P. et al. (2017) ‘Image-to-Image Translation with Conditional Adversarial Networks’, 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976.

Jeon, E.S., Kim, K. and Kim, D. (2021) ‘FA-GAN: Feature-Aware GAN for Text to Image
Synthesis’, ArXiv, abs/2109.00907.

Jin, Q. et al. (2019) ‘Image Generation Method Based on Improved Condition GAN’, 2019 6th
International Conference on Systems and Informatics (ICSAI), pp. 1290–1294.

Kaneko, T. and Kameoka, H. (2018) ‘CycleGAN-VC: Non-parallel Voice Conversion Using


Cycle-Consistent Adversarial Networks’, 2018 26th European Signal Processing Conference
(EUSIPCO), pp. 2100–2104.

Karras, T. et al. (2018) ‘Progressive Growing of GANs for Improved Quality, Stability, and
Variation’, ArXiv, abs/1710.10196.

Kingma, D.P. and Welling, M. (2019) ‘An Introduction to Variational Autoencoders’, Foundations
and Trends® in Machine Learning, 12(4), pp. 307–392. doi:10.1561/2200000056.

Levy, O. and Goldberg, Y. (2014) ‘Neural Word Embedding as Implicit Matrix Factorization’, in
NIPS.

Li, R. et al. (2020a) ‘Exploring Global and Local Linguistic Representations for Text-to-Image
Synthesis’, IEEE Transactions on Multimedia, 22, pp. 3075–3087.


Li, R. et al. (2020b) ‘Exploring Global and Local Linguistic Representations for Text-to-Image
Synthesis’, IEEE Transactions on Multimedia, 22, pp. 3075–3087.

Liang, J., Pei, W. and Lu, F. (2020) ‘CPGAN: Content-Parsing Generative Adversarial Networks
for Text-to-Image Synthesis’, in ECCV.

Maslej-Kresnáková, V. et al. (2020) ‘Comparison of Deep Learning Models and Various Text Pre-
Processing Techniques for the Toxic Comments Classification’, Applied Sciences, 10, p. 8631.

Mihalcea, R. and Leong, C.W. (2006) ‘Toward communicating simple sentences using pictorial
representations’, Machine Translation, 22, pp. 153–173.

Mishra, P. et al. (2020) ‘Text to Image Synthesis using Residual GAN’, 2020 3rd International
Conference on Emerging Technologies in Computer Engineering: Machine Learning and Internet
of Things (ICETCE), pp. 139–144.

Nasir, O.R. et al. (2019) ‘Text2FaceGAN: Face Generation from Fine Grained Textual
Descriptions’, 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), pp.
58–67.

Nasr, A., Mutasim, R. and Imam, H. (2021) ‘SemGAN: Text to Image Synthesis from Text
Semantics using Attentional Generative Adversarial Networks’, 2020 International Conference on
Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), pp. 1–6.

Parihar, A.S. et al. (2020) ‘A Primer on Conditional Text based Image Generation through
Generative Models’, 2020 5th IEEE International Conference on Recent Advances and
Innovations in Engineering (ICRAIE), pp. 1–6.

Qiao, T. et al. (2019) ‘MirrorGAN: Learning Text-To-Image Generation by Redescription’, 2019


IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1505–1514.

Qiao, Y. et al. (2021) ‘R-GAN: Exploring Human-like Way for Reasonable Text-to-Image
Synthesis via Generative Adversarial Networks’, Proceedings of the 29th ACM International
Conference on Multimedia [Preprint].


Radford, A., Metz, L. and Chintala, S. (2016) ‘Unsupervised Representation Learning with Deep
Convolutional Generative Adversarial Networks’, arXiv:1511.06434 [cs] [Preprint]. Available at:
http://arxiv.org/abs/1511.06434 (Accessed: 24 November 2021).

Ramesh, A. et al. (2021) ‘Zero-Shot Text-to-Image Generation’, arXiv:2102.12092 [cs] [Preprint].


Available at: http://arxiv.org/abs/2102.12092 (Accessed: 13 November 2021).

Reed, S. et al. (2016) ‘Generative adversarial text to image synthesis’, in International Conference
on Machine Learning. PMLR, pp. 1060–1069.

Reed, S.E. et al. (2016) ‘Learning What and Where to Draw’, in NIPS.

Rostamzadeh, N. et al. (2018) ‘Fashion-Gen: The Generative Fashion Dataset and Challenge’,
arXiv:1806.08317 [cs, stat] [Preprint]. Available at: http://arxiv.org/abs/1806.08317 (Accessed:
30 November 2021).

Ruan, S. et al. (2021) ‘DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis’,
ArXiv, abs/2108.12141.

Salimans, T. et al. (2016) ‘Improved Techniques for Training GANs’, in NIPS.

Sommer, W.L. and Iosifidis, A. (2020) ‘Text-To-Image Synthesis Method Evaluation Based On
Visual Patterns’, in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Barcelona, Spain: IEEE, pp. 4097–4101.
doi:10.1109/ICASSP40776.2020.9053034.

Souza, D.M., Wehrmann, J. and Ruiz, D.D.A. (2020) ‘Efficient Neural Architecture for Text-to-
Image Synthesis’, 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.

Tan, H. et al. (2019) ‘Semantics-Enhanced Adversarial Nets for Text-to-Image Synthesis’, 2019
IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10500–10509.

Tan, H. et al. (2021) ‘KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-
to-Image Synthesis’, IEEE Transactions on Image Processing, 30, pp. 1275–1290.


Tanaka, F.H.K. dos S. and Aranha, C. (2019) ‘Data Augmentation Using GANs’,
arXiv:1904.09135 [cs, stat] [Preprint]. Available at: http://arxiv.org/abs/1904.09135 (Accessed:
11 November 2021).

Tao, M. et al. (2020) ‘DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image
Synthesis’, ArXiv, abs/2008.05865.

Vani, A. and Venkatesh, S. (2016) ‘Computer Vision Report : Text to Image Synthesis’, in.

Vaswani, A. et al. (2017) ‘Attention is All you Need’, ArXiv, abs/1706.03762.

Wang, M. et al. (2020) ‘End-to-End Text-to-Image Synthesis with Spatial Constrains’, ACM
Transactions on Intelligent Systems and Technology (TIST), 11, pp. 1–19.

Wang, X. et al. (2018) ‘ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks’,


arXiv:1809.00219 [cs] [Preprint]. Available at: http://arxiv.org/abs/1809.00219 (Accessed: 11
May 2022).

Xia, W. et al. (2020) ‘TediGAN: Text-Guided Diverse Image Generation and Manipulation’,
ArXiv, abs/2012.03308.

Xu, J. (2020) ‘Flow-based Deep Generative Models’, in.

Xu, T. et al. (2018a) ‘AttnGAN: Fine-Grained Text to Image Generation with Attentional
Generative Adversarial Networks’, 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 1316–1324.

Xu, T. et al. (2018b) ‘AttnGAN: Fine-Grained Text to Image Generation with Attentional
Generative Adversarial Networks’, 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 1316–1324.

Yang, Y. et al. (2021) ‘Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-
Image Synthesis’, IEEE Transactions on Image Processing, 30, pp. 2798–2809.

Ye, H. et al. (2021a) ‘Improving Text-to-Image Synthesis Using Contrastive Learning’, arXiv
preprint arXiv:2107.02423 [Preprint].


Ye, H. et al. (2021b) ‘Improving Text-to-Image Synthesis Using Contrastive Learning’,


arXiv:2107.02423 [cs] [Preprint]. Available at: http://arxiv.org/abs/2107.02423 (Accessed: 10
February 2022).

Yu, J. et al. (2019) ‘Free-Form Image Inpainting With Gated Convolution’, 2019 IEEE/CVF
International Conference on Computer Vision (ICCV), pp. 4470–4479.

Yuan, M. and Peng, Y. (2020) ‘CKD: Cross-Task Knowledge Distillation for Text-to-Image
Synthesis’, IEEE Transactions on Multimedia, 22, pp. 1955–1968.

Zaidi, A. (2017) ‘Text to Image Synthesis Using Stacked Generative Adversarial Networks’, in.

Zhang, H. et al. (2017) ‘Stackgan: Text to photo-realistic image synthesis with stacked generative
adversarial networks’, in Proceedings of the IEEE international conference on computer vision,
pp. 5907–5915.

Zhang, H. et al. (2021) ‘Cross-Modal Contrastive Learning for Text-to-Image Generation’, 2021
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 833–842.

Zhang, M., Li, C. and Zhou, Z.-P. (2021a) ‘Text to image synthesis using multi-generator text
conditioned generative adversarial networks’, Multimedia Tools and Applications, 80, pp. 7789–
7803.

Zhang, M., Li, C. and Zhou, Z.-P. (2021b) ‘Text to image synthesis using multi-generator text
conditioned generative adversarial networks’, Multimedia Tools and Applications, 80, pp. 7789–
7803.

Zhang, Z., Xie, Y. and Yang, L. (2018) ‘Photographic text-to-image synthesis with a
hierarchically-nested adversarial network’, in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 6199–6208.

Zhou, R., Jiang, C. and Xu, Q. (2021) ‘A survey on generative adversarial network-based text-to-
image synthesis’, Neurocomputing, 451, pp. 316–336.


Zhu, B. and Ngo, C.-W. (2020) ‘CookGAN: Causality Based Text-to-Image Synthesis’, 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5518–5526.

Zhu, M. et al. (2019) ‘DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-
To-Image Synthesis’, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 5795–5803.

Zhu, X. et al. (2007) ‘A Text-to-Picture Synthesis System for Augmenting Communication’, in


AAAI.


APPENDICES
Appendix A – Concept Map


Appendix B – Comparison of Text-to-Image Synthesis Models


Citation: (Bodnar, 2018b)
Improvements:
 The model was developed based on the Wasserstein similarity.
 The model boosts by 7.07% the best Inception Score (on the Caltech birds dataset) of the models which use only phrase-level visual semantics.
Limitations:
 Resolution of images limited to 256*256 px.
 Only the English natural language.
 Phrase-level semantics.
 Blurry images with a lot of background artifacts.

Citation: (Zaidi, 2017)
Improvements:
 Examined the training and evaluation of a StackGAN for the highly realistic synthesis of images from text phrases.
Limitations:
 Resolution limited to 256*256 px.
 Only the English natural language.
 Phrase-level semantics.

Citation: (Zhu et al., 2019)
Improvements:
 Proposes a dynamic memory module to select the most relevant words from a sentence based on the global image feature.
 Produces high-quality synthesized images.
Limitations:
 Refines initial images with wrong colors and rough shapes.
 Sentence-level text, but only a few relevant words are taken into consideration, thereby generating some images outside the scope.
 Generates images from varied scopes irrespective of the input image scope.
 Only the English natural language.

Citation: (Zhang, Li and Zhou, 2021)
Improvements:
 Focused on generating more diverse images, thus avoiding mode collapse to some extent.
 The images synthesized by MTC-GAN have relatively high resolution and high quality.
 Multiple generators are used to avoid mode collapse and achieve a faster training time.
Limitations:
 Phrase-level textual descriptions.
 Only the English natural language.

Citation: (Jeon, Kim and Kim, 2021)
Improvements:
 Images generated were of high quality.
 Fewer mode collapse attacks during the training process.
Limitations:
 Phrase-level textual descriptions.
 Only the English natural language.
 The feature-aware loss predominantly gets mutated towards the generation of new images, which makes them less unique.

Citation: (Zhang, Xie and Yang, 2018)
Improvements:
 Images were of extremely high quality.
 The Inception score for the image generation was boosted by up to 60%.
Limitations:
 Phrase-level textual descriptions.
 Only the English natural language.
 Only important keywords were focused on during the training and testing period, thus images and text matched less closely.

Citation: (Xia et al., 2020)
Improvements:
 Optimizes the image after every loss.
 High-quality images were generated to mimic robotic emotions.
Limitations:
 Only phrase-level keywords were given high priority.
 Only the English natural language.


Appendix C – Comparison of Supervised Learning Text-to-Image Systems


Citation: (Mihalcea and Leong, 2006)
Improvements:
 Was the first research hypothesis to prove the possibility of producing visual representations as images from simple sentences.
Limitations:
 Only a few phrases were taken into consideration.
 Pictorial representations were not unique.
 Images were of lower quality.
 30% model accuracy.
 Potential image errors.

Citation: (Zhu et al., 2007)
Improvements:
 First proper development approach for text-to-image synthesis in the supervised learning domain.
 Phrases and keywords were given much importance to generate images.
Limitations:
 Keywords/phrases of high priority were given importance.
 Images were less detailed.
 Images were of lower quality.

Appendix D – Comparison of unsupervised Learning Text-to-image Systems


Citation: (Han, Guerrero and Pavlovic, 2020)
Improvements:
 Semantic consistency between the ingredient instructions and the image is preserved.
 Use of a cycle-consistency constraint to improve image quality.
Limitations:
 Generates a low image resolution of 128*128 pixels.
 Only the English natural language.
 Only ingredients, cooking instructions and a recipe image can be taken as input.
 Only fine-grained-level text is given priority.

Citation: (Dong et al., 2017)
Improvements:
 Introduced an image-caption and image-text mapping module to preserve semantic consistency.
 An RNN LSTM vector to decode latent vectors, focus more on the text terms and represent them in the image.
Limitations:
 Image resolution is of size 128*128 pixels.
 Longer training time compared to other systems.
 Only the English natural language.
 Only fine-grained-level text is given priority.

Citation: (Hossain et al., 2021)
Improvements:
 Focused more on image captioning.
 The use of a GCN-LSTM architecture to extract both semantic and spatial relationships of an image.
 Good image resolution of 256*256 pixels.
Limitations:
 Image synthesis is only done from fake images.
 Applied to only the English natural language.
 Longer training time due to the complex network architecture.

Citation: (Nasir et al., 2019, p. 2)
Improvements:
 Extracts fine-grained details, maps them to a latent space and learns their distribution in the latent space to decrease training time.
 Feature extraction is done on sentences using skip-thought vectors to understand the sentences.
 Inferences of the context of the image and the text are semantically preserved.
Limitations:
 Limited only to face generation.
 Applied to only the English natural language.
 Generation of low-resolution 128*128 pixel images.
 Does not improve on the selection of wrong images being generated.
 No proper evaluation metric was used to study the system's performance.

Citation: (El, Licht and Yosephian, 2018)
Improvements:
 Uses semantic regularization and non-semantic regularization for high-level classification between the text and images.
 Concatenates the embedding to focus more on the text.
 Less training time.
Limitations:
 Some images and text are not semantically consistent.
 Applied to only the English natural language.
 Image synthesis was done more towards fake images.

Citation: (Ak et al., 2020)
Improvements:
 FiLM layers integrate language and visual features to boost attention by using sentence/word-context features.
 Improves text/image similarity and stabilizes the training by avoiding mode collapse.
 The "e-AttnGAN" prototype model beat state-of-the-art models, and the researchers share an ablation study to recommend improvements.
Limitations:
 Usage of different learning rates for the generator and discriminator results in mode collapse during the training period.
 Fine-grained-level text is given priority.
 Applied to only the English natural language.

Citation: (Ramesh et al., 2021)
Improvements:
 Applied to a variety of domains using a multi-modal image encapsulation module.
Limitations:
 Generates images of low resolution.
 Fine-grained-level textual information is given priority.
 Longer training time.
 Usage of only the English natural language.


Appendix E – Gantt Chart


Appendix F – Requirement Engineering Survey


Appendix G – Design Goals


Design Goal Description
Performance The backend and the frontend must communicate with each other via API calls without any mistakes. Training the text-to-image synthesis model is typically regarded as a time-consuming task in this project. At inference time, however, the trained model is utilized to serve a user who interacts with it, so it is important that the output is generated quickly and without frustrating the user.
Versatility It is critical for the components in the system to be created flexibly. In the GAN
and DAMSM models, contrastive learning loss functions are separated into
multiple classes. Anyone should be able to easily modify the models in the
system to test or reform the productivity of the system.
Scalability The system can handle various unique and different loss functions that are
developed to maintain semantic consistency of the image and text descriptions
generated, as this will be an important aspect in boosting system performance.
In addition, the system must be able to run on a server that can serve several
users at the same time.
Accessibility The system must be constructed in such a way that the end-user can utilize it
without difficulty. For the users, it must be a straightforward and easy-to-
understand system with a single-page web browser to input text descriptions and
visualize the output fashion designs. It must be a system where code is
commented in detail and step-by-step documentation is provided which will be
easy for a new developer to take up the development process to upgrade the
system and add additional features.


Appendix H – UI Design & Mockups


Appendix I – Implementation DAMSM Network & ATTN GAN


Text Encoder


Image Encoder

Conditional Augmentation


Generator Network


Discriminator Network


Attention Model


Appendix J – Selection of Evaluators (Name, Position, Evaluation)


Eval ID | Name | Position | Affiliation
EV1 | Maninda Edirisooriya | Software Engineer | WSO2
EV2 | Dilantha Harith Wanasinghe | AI/ML Engineer | DIPTERON UG
EV3 | Vinula Uthsara Buthgamumudalige | Software Engineer | WSO2
EV4 | Vishmi Ganepola | Software Engineer | Pearson Lanka (Pvt) Ltd
EV5 | Indula Kulawardana | Senior Data Scientist | Deep Data Insight
EV6 | Sravani Rao | Data Scientist | Quantic Tech Analysis
EV7 | Sahan Dilshan | Machine Learning Engineer | WSO2
EV8 | Chamath Peiris | Associate Data Engineer | Creative Software
EV9 | Amila Viraj | Data Scientist | Engenuity Ai
EV10 | Prince Canuma | Data Scientist | Neptune Ai
EV11 | Hasal Fernando | Data Engineer | Circles.Life
EV12 | Amjad Nazar | Software Engineer | Calcey Technologies


Appendix K – Evaluation of Functional Requirements


IM – Implemented NIM – Not Implemented

ID | Requirement and Description | Priority Level | Status
FR1 | Users must be able to enter text descriptions. | M | IM
FR2 | Users must be able to select a language from the three provided languages. | M | IM
FR3 | Language validation should happen in the front end. | M | IM
FR4 | Text descriptions about the fashion industry must be identified. | M | IM
FR5 | Synthesized fashion designs from descriptions should be displayed to the user. | M | IM
FR6 | Users must be able to download the synthesized fashion design. | M | IM
FR7 | Users must be able to re-enter descriptions after visualizing the output fashion design image. | M | IM
FR8 | Users should be able to receive error messages on system failure. | M | IM
FR9 | User data should only be used during the text-to-image synthesis process and not for any other purpose. | S | IM
FR10 | User input should not be permanently stored within the system or its database. | S | IM
Functional requirement completion rate = (10/10)*100 = 100%


Appendix L - Evaluation of Non-Functional Requirements


IM – Implemented NIM – Not Implemented

ID | Description | Priority Level | Status
NFR1 | Users should be able to readily grasp and move around the system without the need for extra training in the text-to-image synthesis models. | M | IM
NFR2 | Provided that a user enters a text description into the system, the text-to-image synthesis process must not take too long to complete and return a result to the user. It must also be verified that the application will not crash during the processing of the text and the creation of the image. | M | NIM
NFR3 | Once a provided input text description has been synthesized into a fashion design, the quality of the resulting fashion design must be acceptable and usable by the user, as delivering a quality result to the user is equally important. | M | IM
NFR4 | It is critical to maintain strong security levels inside the system, as it processes sensitive inputs submitted by users. | M | IM
NFR5 | The system must be able to adapt to a large number of users as the system grows. | C | IM
Completion rate of Non-Functional Requirements = (4/5)*100 = 80%
