GenAssist: Making Image Generation Accessible
Figure 1: GenAssist makes image generation accessible by providing rich visual descriptions of image generation results. Given a text prompt and a set of generated images, GenAssist uses a large language model (GPT-4) to generate prompt verification questions from the prompt and image-based questions from the image captions. GenAssist then answers the visual questions (BLIP-2) and uses a vision-language model (CLIP) and an object detection model (Detic) to extract additional visual information. GenAssist then uses GPT-4 to summarize all of the information into comparison descriptions and per-image descriptions.
ABSTRACT
Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images based on a text description (i.e., prompt), it is difficult to assess the content and quality of generated images. We present GenAssist, a system to make text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates followed the prompt, access additional details in the image not specified in the prompt, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all.

CCS CONCEPTS
• Human-centered computing → Accessibility systems and tools;

KEYWORDS
Accessibility, Generative AI, Image Generation, Creativity Support Tools

ACM Reference Format:
Mina Huh, Yi-Hao Peng, and Amy Pavel. 2023. GenAssist: Making Image Generation Accessible. In The 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23), October 29–November 01, 2023, San Francisco, CA, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3586183.3606735

This work is licensed under a Creative Commons Attribution International 4.0 License.
UIST '23, October 29–November 01, 2023, San Francisco, CA, USA
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0132-0/23/10.
https://doi.org/10.1145/3586183.3606735

1 INTRODUCTION
BLV creators use images in presentations [52], social media [5], videos [24], and art [8]. To obtain images, creators currently either describe their desired images to the sighted collaborators who then search for or create the image [52, 75], or limit the types of images they create [61]. Large-scale text-to-image generation models, such as DALL-E [58], Stable Diffusion [60], and Midjourney [41], present an opportunity for these creators to generate images directly from text descriptions (i.e., prompts). However, current text-to-image generation tools are inaccessible to BLV creators, as creators must visually inspect the content and quality of the generated images to iteratively refine their prompt and select from multiple generated candidate images.
While BLV creators can gain access to images using automated descriptions [34, 40], existing descriptions are intended primarily for image consumption. As a result, the descriptions leave out details that may help authors decide whether or not to use the image (e.g., style, lighting, colors, objects, emotions). Prior work also enables users to gain flexible access to the spatial layout of objects in images [32], but exploring details per image makes it difficult to assess similarities and differences between image options provided during image generation. To make authoring visuals more accessible, prior work has explored describing visuals to help creators author presentations [52] or videos [24]. While such work helps creators identify low-quality visuals (e.g., blurry footage in a video [24]) or graphic design changes (e.g., changing slide layouts [52]), prior work has not yet explored how to improve the accessibility of image generation.

To understand the opportunities and challenges of text-to-image generation, we conducted a formative study with 8 BLV creators who regularly create or search for images. Creators in our study reported their existing strategies for making images themselves (e.g., using SVG editors or code), searching for images, or asking others to search for or create images (similar to prior work [5, 24, 52]). All creators expressed excitement about using image generation to improve their efficiency and expressivity in image authoring. Creators all used image generation for the first time during our study and enjoyed creating high-fidelity images for their own uses (e.g., creating a logo for their website, making a card for their family). While we invited participants to ask the researchers visual questions to gain access to the visual details (e.g., "What are the differences?", "Is the color calm or aggressive?"), it remained challenging for participants to: craft a well-specified prompt especially without visual experience, assess how well the generated image followed the prompt, recognize generated details that were not originally specified in the prompt, and understand or remember the similarities and differences between images.

To improve the accessibility of image generation, we present GenAssist, a system that provides access to text-to-image generation results via prompt-guided image descriptions and comparisons (Figure 1). Our system lets creators skim an overview of similarities and differences between images using our comparison descriptions and per-image descriptions (Figure 1, right), assess if the images followed their prompt using prompt verification (Figure 1, center), and recognize visual details not in the prompt using our content and style extraction (Figure 1, center). Creators can also interactively ask questions across multiple images to gain additional details. Our interface design enables creators to easily navigate visual information via a screen reader-accessible table format. Our tables let creators selectively gain information about individual images (columns) or visual questions (rows) (Figure 4).

We evaluated GenAssist in a within-subjects study with 12 BLV creators who compared GenAssist with a baseline interface that was designed to encompass existing practices of accessing images (e.g., automated captioning [77], object detection [40], and Visual Question Answering [34]). Participants rated GenAssist as more useful than the baseline interface for understanding similarities and differences between the images, and they reported higher satisfaction with their image generation performance. Participants all expressed excitement about using GenAssist in their own workflows for authoring images and for new uses.

Our work contributes:
• Design opportunities for making image generation accessible, derived from a formative study
• GenAssist, a system that provides access to image generation results via prompt-guided summaries and descriptions
• A user study that demonstrates how BLV creators use GenAssist to interpret and generate images

2 BACKGROUND
As we aim to enhance the experience of BLV content creators working with AI-powered image-generation tools, our work builds upon prior research that explores the accessibility of authoring tools and images, and text-to-image generation tools.

2.1 Accessibility of Authoring Tools
Enabling access to authoring tools unlocks new forms of self-expression. Recent research has investigated how BLV people take and edit photos and videos [5, 24], compose music [48], draw digital images [8], and make presentations [52, 61, 83]. Such work includes studies of current practices that highlight accessibility concerns of existing authoring tools and the authored visuals. For example, features of current authoring tools remain difficult to access using screen readers [24, 35, 51], and it can be difficult to assess the effect of visual edits such as color changes [61].

To improve the accessibility of authoring tools, researchers have explored methods for providing feedback to authors as they modify visual elements. For example, prior work has developed tactile devices that assist BLV designers in understanding and adjusting the layout of user interface elements [33, 53]. Tactile feedback has also been used to help developers interpret code structure, such as indentation [15]. Other prior work has used audio notifications to inform users about scene changes when reviewing videos [24, 49], while text descriptions have been used to convey visual details important to authoring, such as brightness and layout [24, 52]. Sound and text feedback have also been used to keep blind authors informed about their collaborators' edits to documents [30]. Similar to prior research, we also aim to make authoring tools accessible by providing in-situ feedback, but we instead provide creation-specific information to facilitate authoring images.

In addition to offering authoring feedback, researchers have developed systems to automate visual authoring. Prior systems recommend 2D layouts for visual elements during graphic design [45] and transform text into visual presentations [29, 63, 79]. To accommodate individual preferences and mitigate the impact of errors produced during generation, these systems typically offer multiple options for users to choose from and allow iterative generation attempts. Iterative generation and selection are not accessible to BLV creators, as they require visually inspecting the output designs to choose a generated option or revise the input. In this work, we seek to make automated authoring tools, such as image generation, more accessible to BLV creators. Our approach provides a structured format for assessing and comparing generated results, and
on-demand access to additional visual details to support creators in selecting a result and revising their input.

2.2 Accessibility of Images
Improving the accessibility of image generation systems involves not only ensuring access to image generation features, but also making the produced images accessible. A primary method for making images more accessible is representing them as text descriptions, such as image captions or alt text (e.g., "A person walking on the street"). Early work hired crowd workers to create alt text [6, 72], while recent research has developed machine-learning-based systems that automatically generate image descriptions [34, 71, 81]. Building on auto-generated captions, researchers have developed systems that further improve users' understanding of images by providing additional information, such as regional descriptions [40, 84], and structuring detailed descriptions into an overview [14, 32, 43]. This approach enables users to review visual information more efficiently and has been found to help blind people better understand images compared to using captions alone [31]. Our work builds upon this idea by presenting descriptions of image generation results in a hierarchical, easy-to-compare format, and tailoring the descriptions to the task of authoring rather than consuming images.

Automatic descriptions do not always capture all of the important image details. Visual Question Answering (VQA) tools can fill this gap by offering on-demand answers to visual questions (e.g., "What is the person walking on the street wearing?"). Previous research has explored what visual questions blind people would like to have answered [9] and provided on-demand visual question answering support using both crowdsourcing [6, 25] and automated methods [17]. While VQA provides control over visual information gathering, it takes effort to ask individual questions. We investigate what types of visual questions BLV creators ask to create images during our formative study (similar to Brady et al. [9]), then use VQA to extract visual information and summarize this information as image descriptions. Thus, we explore how VQA and image descriptions work together as interconnected rather than separate accessibility solutions.

2.3 Text-to-Image Generation Tools
In recent years, significant progress has been made in the field of generative image models, particularly text-to-image models. These models employ pre-trained vision-language models to encode text input into guiding vectors for image generation, allowing users to create images using text prompts. This advancement can be attributed to various factors, including innovations in deep learning architectures (e.g., Variational Autoencoders (VAEs) [26] and Generative Adversarial Networks (GANs) [16]), novel training paradigms like masked modeling for language and vision tasks [10, 12, 13, 70], and the availability of large-scale image-text datasets [62]. With these advancements, recent diffusion-based models like DALL-E 2 [57], Stable Diffusion [60], and Midjourney [41] have successfully demonstrated the ability to synthesize high-quality images in versatile styles, including photorealism. This opens up potential practical applications for the content production industry [37]. However, none of the image generation tools provide text descriptions of the output, so they are not accessible to BLV creators. In this work, we chose to use Midjourney due to its popularity among designers and content creators for its high-quality results. Midjourney enables creators to generate 4 candidate images for a single text prompt via a text-based interface hosted on Discord. However, our approach is not limited to any particular model, as we focus on comparing and describing multiple generated results from a single prompt, helping creators select the ideal image from various candidates produced by image generation tools.

With the development of these models, recent works have conducted studies to understand the relationship between content creators and generative AI tools, introducing design guidelines for such systems [28, 36]. These guidelines emphasize the need for more user controllability. Researchers have thus developed various tools to help designers better make use of generative AI, including assistance in exploring and writing better prompts [36, 76], recommending potential illustrations for news articles [37], and supporting collaboration between writers and artists [27]. While these studies offer valuable insights into how designers interact with generative models, none have focused on creators with disabilities. Given the potential of text-to-image models for BLV creators, our work is the first to explore how to increase inclusivity in the expressiveness of image generation tools and make this emerging authoring approach more broadly accessible.

3 FORMATIVE STUDY
To understand the strategies and challenges of authoring and searching for images, we conducted a formative study with BLV creators. The formative study consisted of a semi-structured interview to investigate current strategies and challenges of obtaining images, and two image generation tasks to explore current strategies and challenges of using text-to-image generation.

3.1 Method
We recruited 8 BLV creators who create or use visual assets on a regular basis (P1-P8, Table 4). Participants were recruited using mailing lists and compensated 50 USD for the 1.5-hour remote study conducted via Zoom (this study was approved by our institution's Institutional Review Board). Participants were totally blind (6 participants) or legally blind (2 participants) with light and color perception. All participants had previously produced or selected images for their work across several professions: teacher (English, Music), professor (Computer Science, Climate), software engineer, graduate student, and artist. 7 participants had prior knowledge of text-to-image generation models; none had previously used such tools.

We first conducted a semi-structured interview asking participants how they currently created or used visual assets, and what accessibility barriers they encountered with their current approaches. We then provided a short tutorial on text-to-image generation and shared Midjourney's guidelines for creating text prompts [42] and example prompts from a Midjourney dataset [69]. Participants then completed two image generation tasks (20 minutes per task): a guided task in which participants generated a cover image for a news article [44] given the article's title and full text, and a freeform task in which participants generated their own image. To limit onboarding time, participants emailed us their prompt (text and/or image) instead of using Midjourney's Discord interface, then we
shared the four generated candidate images back to the participants. We encouraged participants to ask questions about the four candidate images to select one or change the prompt.

We recorded and transcribed the formative studies. To analyze the types of visual questions asked in the image generation task, two of the researchers labeled questions based on their goals and the types of information asked (see Supplemental Material for the full list of prompts, images, and visual questions of the formative study).

3.2 Findings
Current Practice. Participants reported that they currently use images for a variety of contexts including slides, website images, paintings for commission, cartoons, scientific diagrams, and music album covers (Table 4). Five participants noted that they created images on their own using image creation software such as SVG editors, slides, Photoshop, and ProCreate (P7, P1, P5, P6), code packages including Python and LaTeX (P4, P5), or by taking photos (P3). Among them, three participants asked sighted people to review them (P3, P4, P6), and two participants reviewed the images using accessibility tools (e.g., audioScreen, tactile graphs, ZoomText) (P7, P3). Five participants searched for images online (P7, P8, P2, P3, P5), and three participants recruited another person to create or search the images for them (P7, P4, P5).

All participants who searched for images mentioned that they ask sighted people to describe the images for them in addition to reading any available alt text. P7 noted "Alt text has never been helpful. It's too short without important details." P8 and P5 mentioned that while a few established websites (e.g., New York Times, NASA) have good alt text, Google Image Search returns options other than established websites and "it is hard to compare the results of the image search" (P5). Participants also noted barriers to asking others to describe the image search results, including finding available people to describe the images and avoiding false perceptions: "I only ask a handful of people because it might lead to some subconscious bias 'that I'm not independent', cause it's a basic task" (P7).

Generating Prompts. All prompts written by participants specified the content they wanted to appear in the image (e.g., P6 used the prompt "A person pushing a grocery cart down a produce aisle."), and only two participants specified the style of the image (P1 and P7 specified "a photograph of..."). Participants mentioned several challenges of creating prompts. First, while prompt guidelines [42] recommend users to specify multiple attributes in their prompt (e.g., style, lighting), participants reported that they were unfamiliar with visual attributes (P5: "I'm trying not to leave much to system randomness, I want to detail more things. But I don't know a lot about different styles.") and others found it difficult to remember what to mention in the prompt: "I want the model to behave more like a wizard – asking me a series of questions 'What do you want to create?', 'What style?' and so on. It is hard to create detailed prompts in one attempt" (P2). Participants also noticed that it is challenging to create a prompt that AI would be capable of generating: "If I pin down something really specific or narrow [in the prompt], AI seems to break down" (P1). P5 mentioned that transparency could inform prompt iteration: "I want to know how the model works! [...] then I will know how to write a good prompt." Finally, while participants easily generated prompts during the free-form task motivated by their own creation goals, they mentioned it was challenging to know what content would effectively convey the article in the guided task: "I have no experience reading a news article with images, so it's hard to think of one. What do these images usually contain?" (P7).

Understanding Image Candidates with Visual Questions. After generating images, participants asked visual questions to understand and select the images. Participants asked a total of 89 questions (47 asked in the guided task, 42 in the freeform task). The goals of the questions asked were to check whether the generated images followed the prompt (51), compare two or more images (34), request clarification of the answer provided by the interviewer (3), or understand a single image (1). The type of visual information asked by participants also varied. Participants asked about medium (5), settings (6), object presence (18), object types (11), position attributes (11), color/light/perspective (16), and others (22).

Participants typically started by asking general questions, narrowing down to more specific questions as they ruled out images. For example, P4 progressively asked: "Can you describe the images?", "What are the differences between the four images?", "What are the differences between the [store] aisles?", "Is the second image realistic?". Alternatively, participants started their questioning by directly checking if the image followed their prompts, such as in P5's first question: "Do we actually get the woman sitting at a desk?" Finally, P1 and P2 started with questions about the style of the images: "Is it realistic or cartoony?" (P1) and "Is the color calm or aggressive?" (P2). Through asking questions, participants realized differences between their prompt and the generated images: "it seems like the model generator is filling in details according to the context, even if I didn't specify some details. I didn't specify the clothes but in all images, the women are wearing office clothes" (P5). Participants then asked follow-up questions based on new details. While the visual questions revealed the content and structure of what participants wanted to know about the images, participants reported that asking questions for each image was "very time-consuming and confusing" (P4). 5 participants noted that they would prefer to receive descriptions before asking questions, and participants reported that remembering all of the answers was difficult, as P2 summarized: "I wish there were more description provided in the first place. I don't know what to ask. Also, it's hard to remember all the answers for each image."

Selecting an Image Candidate. While participants initially asked questions based on their prompt, they ultimately selected the final image considering both prompt-based descriptions and descriptions of extra details produced by the model. P7 suggested that information on whether the prompt is reflected in each image should be presented early so that he can decide whether to explore the image in detail or skip to the next candidate. P8 highlighted the importance of additional details: "The model has randomness. It showed items I didn't ask for and didn't show what I asked for in the prompt. I want much information to be surfaced so that I can make a decision. Whether that unexpected parts can be still used." We also observed that similarities between images guided participants in deciding whether to further explore the images or to refine the prompt. For instance, after P3 generated images using the prompt "A photo looking
down on a kitchen table with a plate of pizza, a plate of fried chicken, and a bowl of ice cream on it.", he realized that all four images did not display drinks and iterated on the prompt to explicitly mention "fizzy drinks". On the other hand, differences between the images ultimately informed the final selection, as participants cited unique backgrounds, objects, and mediums as reasons for selecting the image (e.g., P3 selected the final image because it was the only image that presented a dog putting his paw on the books).

Uses of Image Generation. When participants generated their own images in the free-form task, they created a variety of images including logos, art, website decorative images, presentations, and music album covers. All participants expressed excitement about using the text-to-image model as part of their image creation process in the future. Participants mentioned that with image generation, they can create new types of images they had not created before. P6 mentioned "With SVG editor, I cannot make realistic images. But now I can!" Also, participants mentioned that the quick creation will lead them to use images more often: "Because it's so quick, I will use it for communication. Similar to how sighted people draw on a whiteboard during a Zoom meeting, I can quickly generate an image because representing a concept visually is easier for sighted team members." (P8). P4 also compared the experience of image generation with image search: "This simplifies things when I'm looking for things very niche, something that is hard to find online." Finally, participants also mentioned the benefit of creating images alone. P7 said that because there is no need to ask a sighted person to help search for images, it brings more autonomy and privacy. Participants also noted limitations and potential downsides of image generation including potential bias (P8, P4), copyright and training data concerns (P3, P4), wanting to use it only for inspiration (P1), and potential errors (P8). However, P8 expressed that he expected future models to produce fewer errors.

3.3 Reflection
Creators in our formative study currently employ resourceful strategies for creating or searching for images, but all creators expressed excitement to use image generation in their workflow. To improve access to image generation, our formative study reveals design opportunities (D1-D5) to make image generation accessible through technical or social support for:

D1. Authoring prompts that specify content and style.
D2. Understanding high-level image similarities and differences.
D3. Assessing if images followed the prompt.
D4. Accessing image details not specified by the prompt.
D5. Organizing responses to visual questions.

These design opportunities address key user tasks in accessible text-to-image generation: generating the prompt (D1), understanding and selecting images (D2, D3, D4, D5), and revising the prompt for iteration (D4). Our work aims to help creators understand their image generation results through prompt-guided descriptions and comparisons (D2-D5). While providing high-quality descriptions may help creators improve their future prompts (D1), future work should explore how to actively support creators in authoring prompts.

4 SYSTEM
We present GenAssist, a system that supports accessible image generation via prompt-guided image descriptions and comparisons (Figure 1). To illustrate GenAssist, we follow Vito, a professional blogger who uses a screen reader to author his articles. Vito recently wrote an article about the benefits of teaching children to cook, and he wants to add an image to the article to engage his sighted readers. He attempts to use image search to find a stock photo of "a young chef" but notices that many of the images are missing detailed captions and alt text, or feature adult chefs instead of children. He decides to create an image using text-to-image generation with the prompt "a young chef is cooking dinner for his parents". The text-to-image generation model returns four candidates. To decide whether to use one of these images or change his prompt, Vito enters his prompt and image results into GenAssist.

4.1 Prompt Verification
While the text-to-image model generates output images based on the prompt, the generated image often does not reflect the specifications in the prompt, especially if the prompt is long, complicated, or ambiguous [22]. To help users assess how well their generated images adhered to their prompt, GenAssist provides prompt verification. To perform prompt verification, we first use GPT-4 [46] to generate visual questions that verify each part of the prompt. We input the text instruction "Generate visual questions that verify whether each part of the prompt is correct. Number the questions." followed by the user's prompt. For Vito's prompt ("A young chef is cooking dinner for his parents."), GPT-4 outputs a series of questions:

1. Is there a chef in the image?
2. How old is the young chef?
3. Is the young chef cooking food?
4. Are the parents present in the image?

We generate answers to the visual prompt verification questions for each of the four generated candidate images using the BLIP-2 model with the ViT-G Flan-T5-XXL setup [34]. For each generated image and prompt verification question, we instruct the BLIP-2 model with the starting sequence "Answer the given question. Don't imagine any contents that are not in the image." to reduce hallucinations with non-existent information. For Vito's candidates (Images 1-4), BLIP-2 answers:

Is there a chef in the image? Yes / Yes / Yes / Yes
How old is the young chef? Young kid / Young kid / Young kid / Young man
Is the young chef cooking food? Yes / Yes / Yes / Yes
Are the parents present in the image? Yes / No / Yes / Yes
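To make the two steps above concrete, the sketch below shows one way to call GPT-4 for question generation and BLIP-2 for per-image answering. It is a minimal sketch rather than the authors' implementation: the OpenAI SDK usage, the Hugging Face checkpoint name (assumed to approximate the ViT-G Flan-T5-XXL setup), the helper function names, and the candidate filenames are assumptions; only the two instruction strings come from the description above.

```python
# Illustrative sketch of the prompt verification step (not the authors' code).
# Assumes the OpenAI Python SDK and Hugging Face Transformers are installed.
from openai import OpenAI
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

client = OpenAI()
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

def generate_verification_questions(user_prompt: str) -> list[str]:
    """Ask GPT-4 for numbered questions that verify each part of the prompt."""
    instruction = ("Generate visual questions that verify whether each part of "
                   "the prompt is correct. Number the questions. Prompt: " + user_prompt)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
    )
    questions = []
    for line in response.choices[0].message.content.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and ". " in line:  # keep "1. ..." style lines
            questions.append(line.split(". ", 1)[1])
    return questions

def answer_question(image: Image.Image, question: str) -> str:
    """Answer one visual question for one candidate image with BLIP-2."""
    prefix = ("Answer the given question. Don't imagine any contents "
              "that are not in the image. ")
    inputs = processor(images=image, text=prefix + question, return_tensors="pt")
    output_ids = blip2.generate(**inputs, max_new_tokens=20)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

prompt = "A young chef is cooking dinner for his parents."
questions = generate_verification_questions(prompt)
candidates = [Image.open(f"candidate_{i}.png") for i in range(1, 5)]  # hypothetical files
answers = {q: [answer_question(img, q) for img in candidates] for q in questions}
```

Keeping the parsed questions in a list keeps them aligned with the rows of the prompt verification table that the interface later presents.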
To help users quickly find which images do or do not adhere to the prompt, we use GPT-4 to summarize the responses to each question using the following prompt: "Below are the answers of four similar images to one visual question. Write one sentence summary that captures the similarities and differences of these results. The summary should fit within 250 character limit". When using GPT-4's chat completion API, we set the role of the system as "You are a helpful assistant that is describing images for blind and low vision individuals." The temperature value was set to 0.8. The summaries either indicate that all images have the same answer (e.g., "All images have a chef in the image"), or they alert users to differences (e.g., "Three images depict a young kid, while Image 4 depicts a young man."; "Three images show parents present in the image, while Image 2 does not.").
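A minimal sketch of this summarization call is shown below, assuming the OpenAI Python SDK and the per-question answers gathered in the previous sketch; the system role, user prompt, and temperature of 0.8 follow the description above, while the function name and message formatting are assumptions.

```python
# Illustrative sketch of the per-question summarization step (not the authors' code).
from openai import OpenAI

client = OpenAI()

SYSTEM_ROLE = ("You are a helpful assistant that is describing images "
               "for blind and low vision individuals.")
SUMMARY_PROMPT = ("Below are the answers of four similar images to one visual question. "
                  "Write one sentence summary that captures the similarities and "
                  "differences of these results. The summary should fit within "
                  "250 character limit")

def summarize_answers(question: str, per_image_answers: list[str]) -> str:
    """Summarize the four per-image answers to one question into a single sentence."""
    listing = "\n".join(f"Image {i + 1}: {a}" for i, a in enumerate(per_image_answers))
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.8,  # value reported above
        messages=[
            {"role": "system", "content": SYSTEM_ROLE},
            {"role": "user", "content": f"{SUMMARY_PROMPT}\nQuestion: {question}\n{listing}"},
        ],
    )
    return response.choices[0].message.content.strip()

# Example: summarize_answers("Are the parents present in the image?", ["Yes", "No", "Yes", "Yes"])
# might return "Three images show parents present in the image, while Image 2 does not."
```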
To enable screen reader users to easily access the answers to each question, we present the prompt verification results as a table including the prompt verification questions (rows, with the question in column #1), prompt verification summaries (column #2), and per-image prompt verification answers (columns #3-6) (Figure 4).

Using our prompt verification table, Vito reads the answer summaries to check if the images follow his prompt. He notices that the 4th image contains an older chef, so it does not apply to his article about teaching children how to cook. While Vito also realizes the 2nd image does not feature the chef's parents, he keeps the image in consideration as it may still apply to his article.
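The paper does not describe how the interface itself is implemented; the sketch below is one hypothetical way to emit the row and column layout described above as an HTML table whose header cells carry scope attributes, so that a screen reader can announce the question and the image number alongside each answer cell. The function name and input format are illustrative assumptions.

```python
# Hypothetical rendering of the prompt verification table layout (not the authors' UI code).
from html import escape

def render_verification_table(rows: list[dict]) -> str:
    """rows: [{"question": str, "summary": str, "answers": [str, str, str, str]}, ...]"""
    header_cells = ["Question", "Summary"] + [f"Image {i}" for i in range(1, 5)]
    header = "".join(f'<th scope="col">{escape(h)}</th>' for h in header_cells)
    body = ""
    for row in rows:
        # The question is a row header so screen readers repeat it per answer cell.
        cells = [f'<th scope="row">{escape(row["question"])}</th>']
        cells += [f"<td>{escape(cell)}</td>" for cell in [row["summary"], *row["answers"]]]
        body += "<tr>" + "".join(cells) + "</tr>"
    return ("<table><caption>Prompt verification</caption>"
            f"<thead><tr>{header}</tr></thead><tbody>{body}</tbody></table>")

html = render_verification_table([
    {"question": "Are the parents present in the image?",
     "summary": "Three images show parents present in the image, while Image 2 does not.",
     "answers": ["Yes", "No", "Yes", "Yes"]},
])
```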
4.2 Visual Content & Style Extraction
Generated image candidates often feature similarities or differences that are not present in the original prompt. For example, Vito's prompt "A young chef is cooking dinner for his parents" does not specify the style, such that the resulting images include three illustrations and one photo. To enable access to image content and style details that were not specified in the prompt, we extract the visual content and visual style of the generated image candidates. To surface content and style similarities and differences that are important for improving image generation prompts, we used text-to-image prompt guidelines [20, 42, 47] to inform our approach.

We first created a list of visual questions about the image based on existing prompt guidelines, i.e., prompt guideline questions. The prompt guideline questions consist of questions about the content of the image (subjects, setting, objects), the purpose of the image (emotion, likely use), the style of the image (medium, lighting, perspective, color), and an additional question about errors in the image to surface distortions in the generated images such as blurring or unnatural human body features (Table 1).

To answer our prompt guideline questions for each image, we answered 5 questions (setting, subjects, emotion, likely use, colors) using Visual Question Answering with BLIP-2, similar to our prompt verification approach. For Vito's candidates (Images 1-4), BLIP-2 answers:

What is the setting of the image? Kitchen / Kitchen / Kitchen / Kitchen
What are the subjects of the image? Father and children / Chef, kitchen, vegetables / Father, mother and son / Father, mother and son
What is the emotion of the image? Happy / Happy / Happy / Happy
Where would this image be used? On a website / A children's cookbook / In a cooking class / On a website
What are the main colors? Brown, blue, yellow / Black, white, red, green / Blue and white / Red, yellow, green

For our objects question, we used Detic [85], a state-of-the-art object detection model, with an open detection vocabulary and a confidence threshold of 0.3 to enable users to access all objects (e.g., spoon, pot, cup, sink, apron, bowl, tomato, lettuce, fork, knife, hat, and plate across Vito's candidates).

For the remaining questions covering medium, lighting, perspective, and errors, we answer the question for each image candidate by using CLIP [56] to determine the similarity between the image and a limited set of answer choices (similar to CLIP Interrogator [19]). To provide answers that could inform future prompts, we curated our answer choices for medium, lighting, and perspective from Midjourney's list of styles [20] and DALL-E's prompt book [47]. To address common image generation errors, we retrieved the answer choices for our errors question from prior work [18, 59]. We include the full list of answer choices in the Supplementary Material. For each question, GenAssist presents the top three answer choices with a similarity score between the answer choice embedding and the image embedding above a threshold of 0.18. For Vito's candidates (Images 1-4), the extracted styles are:

What is the medium of the image? Cartoon, storybook illustration / A stock photo / Vector art / Cartoon, storybook illustration
What is the lighting of the image? Natural lighting / Natural lighting / Natural lighting / Natural lighting
What is the perspective of the image? Medium shot / Centered shot / Medium shot / Medium shot
What are the errors in this image? Poorly drawn hands / None / None / None
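The sketch below illustrates the CLIP-based matching described above using the Hugging Face CLIP API. The checkpoint name and the abbreviated answer-choice lists are assumptions (the full curated lists are in the Supplementary Material); the 0.18 similarity threshold and the top-three selection follow the description above, and the Detic object detector is noted only in a comment because its setup is not sketched here.

```python
# Illustrative sketch of the CLIP-based style matching (not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

ANSWER_CHOICES = {  # abbreviated placeholders for the curated answer-choice lists
    "medium": ["a stock photo", "vector art", "a storybook illustration", "a cartoon"],
    "lighting": ["natural lighting", "spotlight", "studio lighting", "neon lighting"],
    "perspective": ["centered shot", "medium shot", "headshot", "landscape shot"],
    "errors": ["poorly drawn hands", "poorly drawn face", "blurry", "duplicated limbs"],
}

def top_style_answers(image: Image.Image, choices: list[str],
                      threshold: float = 0.18, k: int = 3):
    """Return up to k answer choices whose CLIP similarity to the image exceeds the threshold."""
    inputs = clip_processor(text=choices, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(0)  # cosine similarity per answer choice
    ranked = sorted(zip(choices, sims.tolist()), key=lambda pair: pair[1], reverse=True)
    return [(choice, sim) for choice, sim in ranked[:k] if sim > threshold]

# Objects come from a separate open-vocabulary detector (Detic in the paper),
# keeping only detections with confidence >= 0.3; that model is not sketched here.
image = Image.open("candidate_1.png")  # hypothetical filename
styles = {name: top_style_answers(image, choices) for name, choices in ANSWER_CHOICES.items()}
```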
To inform creators about unfamiliar visual style types, GenAssist provides the definition and the usage of each answer choice for visual style questions (Medium, Lighting, Perspective) by generating the description with GPT-4 and the prompt "Describe the definition and the usage of the following [QUESTION NAME] in one sentence: [STYLE NAME]". Similar to the prompt verification table, we present the prompt guideline results in a table format including the prompt guideline questions (rows, with the question in column #1), prompt guideline summaries (column #2), and per-image prompt guideline answers (columns #3-6). We further split the prompt guideline results into two tables to improve ease of navigation: the visual content table includes answers to the content and purpose questions, and the visual style table includes answers to the style and errors questions. Finally, users can ask their own questions at the bottom of either table, and GenAssist adds a row to the table by generating the answer for each image using BLIP-2 and the summary of answers using GPT-4.

Using the visual content table, Vito notices from the objects summary that Image 1 has more food items than Images 2-4. As the purpose of the article is partially to introduce children to more ingredients, he decides to remove Image 1 from consideration. Using the visual style table, Vito realizes that Image 2 is a photo, while the other images are illustrations. As Vito was initially searching for a photo, he notes he may want to further refine his prompt to get more photo results. Vito also wants to check if the images will match his blog, which is primarily black and white, so he adds a question about the background color ("What color is the background?"). GenAssist summarizes the answers: "Image 1 and Image 4 are light brown, Image 2 is black and Image 3 is blue." As Image 2 fits his article and includes a black background, he ranks Image 2 as his current top choice.
Figure 4: The GenAssist interface consists of screen reader accessible tables that enable users to flexibly gain more information about the content of interest.

Table 2: Accuracy (% of correctly predicted information) of the pipeline results (Prompt verification, Content, Style, and Errors) with 20 sets of images.

descriptions for 10 randomly selected image sets each. For each image set, the describers provided descriptions of each individual image, and the similarities and differences between the images. We provided describers with prompt guidelines [42], image description guidelines [2], an example set of descriptions created by GenAssist, and the prompt for each image set to inform their descriptions. Both describers spent 3.5 hours to create descriptions for the 10 sets of images, or around 21 minutes per image set.

We compared the coverage of GenAssist-generated descriptions to those generated by a baseline captioning tool (BLIP-2) and human describers. For comparison, we annotated the similarities and differences descriptions for all 20 sets of images and annotated the individual descriptions for 10 sets of images. We chose the 10 sets with the longest human descriptions to compare GenAssist with the highest quality descriptions. Because BLIP-2 cannot take multiple images as input to extract similarities and differences, we generated captions of the 4 images using BLIP-2, then prompted GPT-4 with the same prompt we used in our system to generate summary descriptions. We tallied whether the descriptions contained details about the image in each of our set of pre-defined visual information categories (Table 1). We counted only the correct information in the descriptions. One of the researchers annotated the descriptions and the other researcher reviewed the annotations. To compute the accuracy of the detailed visual information in GenAssist, one of the researchers examined the 20 sets of images with the three tables (prompt verification table, visual content table, and visual style table) and counted the number of correct and incorrect answers in each table.

5.2 Results
5.2.1 Coverage. GenAssist's comparison descriptions captured more similarities and differences than the human describers'. In the coverage of differences, GenAssist spotted more than twice the number of total differences than the human describers (4.55 vs. 2.25). The coverage of GenAssist's individual image descriptions was comparable to that of the human describers. When compared to the human-generated descriptions, GenAssist captured more information about the content and styles but revealed fewer image generation errors. For instance, one human describer specified in the comparison description "...All of the images have some AI generation error with fingers or clothing." While GenAssist and the baseline used the same GPT-4 prompt to extract the similarities and differences, the baseline's comparison description did not capture many differences.

5.2.2 Accuracy. Table 2 summarizes the results of the accuracy evaluation. Prompt verification, content, and style categories all achieved over 90% accuracy except for medium, perspective, and emotion. In the 80 images in the dataset, GenAssist only detected five images as having errors, and detected the correct error types in three of them. The most common errors made in our pipeline were from the perspective, medium, and error categories, which are all extracted using the CLIP score. For perspective and medium, the majority of the errors were due to CLIP matching images to common style expressions (e.g., natural lighting, centered-shot), which likely reflects the prevalence of these expressions in the training data. In the incorrect outputs of the errors question, GenAssist detected cartoon or sketch images as having 'poorly drawn faces' errors. One reason for the relatively low accuracy of the object detection results is that we empirically set the output threshold of GenAssist's object detection (Detic) to 0.3 to present diverse objects to users in addition to information about the main subject extracted by BLIP-2 in our pipeline.
Figure 5: Two image sets and the descriptions of the similarities and differences used in the pipeline coverage evaluation (each image set described by a different human describer). GenAssist captured more information in the similarities and differences caption than the human describers.

Table 3 (mean (σ) counts of correct details; Human / Baseline / GenAssist):
Similarities: Content 1.50 (0.61) / 1.65 (0.59) / 2.45 (1.10); Style 0.70 (0.80) / 0.00 (0.00) / 0.80 (0.83); Error 0.10 (0.31) / 0.00 (0.00) / 0.00 (0.00); Total 2.35 (0.83) / 1.65 (0.85) / 3.25 (1.29)
Differences: Content 1.50 (0.69) / 1.95 (0.39) / 2.35 (0.49); Style 0.65 (0.75) / 0.35 (0.49) / 2.20 (1.01); Error 0.05 (0.22) / 0.00 (0.00) / 0.00 (0.00); Total 2.25 (0.84) / 2.30 (0.93) / 4.55 (1.26)
Per-Image Descriptions: Content 1.71 (0.39) / 0.69 (0.10) / 1.71 (0.26); Style 0.71 (0.22) / 0.04 (0.07) / 0.68 (0.30); Error 0.05 (0.05) / 0.00 (0.00) / 0.01 (0.03); Total 2.47 (0.74) / 0.73 (0.33) / 2.41 (0.75)
Table 3: We compared the coverage of GenAssist-generated descriptions to those generated by a baseline captioning tool and human describers. GenAssist captured more similarities and differences than the human describers.

Prompt (S0): A young chef is cooking the dinner for his parents
Prompt (S1): George Washington and Abraham Lincoln shaking hands.
Prompt (S2): A video of an old lady dancing. Happy smile, cute granny. Security camera footage. On CCTV. Ultra realistic
Prompt (S3): Trending stock photo
Prompt (S4): Man sitting at his computer, home office, fireplace, Paint man like he was a crystal clear water
Figure 6: We selected two sets of images from Midjourney's community feed generated with a short prompt without detailed descriptions of objects or styles (S1, S3) and two sets with a long prompt with detailed descriptions of objects or styles (S2, S4). We selected long and short prompts to explore how users compared images when they are similar (long prompts) vs. dissimilar (short prompts).

prompts. We selected the two articles from the New York Times: 'Why Multitasking is Bad for You' and 'My Kids Want Plastic Toys. I Want to Go Green.' [67, 68]. The order of the interfaces and articles was counterbalanced and randomly assigned to participants. After each interface, we asked the participants to choose one image from the generated images and explain their reasoning. We also conducted a post-stimulus survey that included the following ratings: Mental Demand, Performance, Effort, Frustration, Usefulness of the caption, Satisfaction with the final image, and Confidence in posting the final image. All ratings were on a 7-point Likert scale. At the end of the study, we conducted a semi-structured interview to understand participants' strategies using GenAssist and the pros and cons of both GenAssist and the baseline.

The study was 1.5 hours long, conducted in a 1:1 session via Zoom, and approved by our institution's IRB. We compensated participants 50 USD for their time.

Analysis. We recorded the study video, user-generated prompts and images, and the survey responses. We transcribed the exit interviews and participants' spontaneous comments during the tasks and grouped the transcript according to (1) strategies of using GenAssist and (2) perceived benefits and limitations of our system.

6.2 Results
Overall, all participants stated they would like to use GenAssist rather than the baseline interface to create images in the future. Participants expressed that GenAssist would be immediately useful in their workflows: "This is usable out of the box!" [...] "I need access to this technology" (P14), "I'd even pay for this! I really need this" (P15). In particular, participants rated GenAssist to be significantly more useful for understanding the differences between images in both tasks (interpretation: M=1.50, Mdn=1.00 vs. M=3.58, Mdn=4.00; Z=-2.31; p<0.05; generation: M=1.92, Mdn=2.00 vs. M=4.33, Mdn=5.00; Z=-2.77; p<0.01) (Figure 7). For the interpretation task, participants reported significantly better performance (M=1.83, Mdn=2.00 vs. M=3.67, Mdn=3.00; Z=-2.47; p<0.05), significantly less frustration (M=1.75, Mdn=1.00 vs. M=3.50, Mdn=3.50; Z=2.46; p<0.05), and less effort (M=2.25, Mdn=2.00 vs. M=4.00, Mdn=4.00; Z=-2.00; p<0.05). For the generation task, participants rated that they were significantly more satisfied with the final image (M=3.17, Mdn=3.00 vs. M=5.00, Mdn=5.50; Z=-2.17; p<0.05). Significance was measured with the Wilcoxon signed-rank test.
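For reference, paired Likert ratings like these can be compared with a Wilcoxon signed-rank test as sketched below; the rating vectors are placeholders rather than the study data, and only the choice of test follows the description above.

```python
# Minimal sketch of the Wilcoxon signed-rank comparison using SciPy (placeholder data).
from scipy.stats import wilcoxon

# Paired 7-point ratings (1 = positive) from the same participants for one measure,
# e.g., usefulness for understanding differences in the interpretation task.
genassist_ratings = [1, 2, 1, 1, 2, 1, 2, 1, 3, 1, 2, 1]  # placeholder values
baseline_ratings = [4, 3, 4, 5, 3, 4, 2, 4, 5, 3, 4, 3]   # placeholder values

statistic, p_value = wilcoxon(genassist_ratings, baseline_ratings)
print(f"Wilcoxon signed-rank: statistic={statistic:.2f}, p={p_value:.3f}")
```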
Gaining a summary of image content. With GenAssist, across both tasks, all participants started by reading the summary table including the comparison description (summary of similarities and differences), as well as the per-image descriptions. Participants all stated that the summary table was helpful for understanding the images they generated, as P6 explained: "I cannot do without the summary. Highlighting the differences was very useful." In addition, participants noted that the summary table's per-image descriptions were valuable for understanding the images. For example, P19 mentioned "This is more like an audio description because I can make a very clear mental image!" and slowed down his screen reader pace to mimic the experience of listening to an audio description. P20 reported "I always thought that AI is not as capable of describing as humans, because usually alt-text generated by AI is short and doesn't capture much information. But reading this, I am rethinking AI's capabilities." P12 found the detailed descriptions particularly helpful when authoring rather than interpreting images: "The first table (comparison description table) is so comprehensive. When I'm authoring images I need more information than when I'm looking at what others uploaded." (P12).

Using the baseline, participants all initially read all of the information they had access to (the caption and objects) for each image. All participants mentioned the inconvenience of having short image captions for gaining an overview, especially when the generated images are similar to each other. For example, after reading the BLIP-2 caption of S4, P18 asked "Are they all same images?"

Selectively accessing additional information. While all participants accessed the summary table first, we observed multiple strategies of using additional information provided by GenAssist to understand the differences between the generated images. First, P9, P7, P16, P18, and P20 checked the information from all tables before making their decision. P20 mentioned "They are equally important but in different ways. If the generated images are different, the summary table would be sufficient. For similar ones, I'd have to go down the tables more." P16 noted "We never have too much information. All the details provided here matter to me". After checking all the tables, P18 and P20 revisited the summary table again to remember and organize all information. The other seven participants (P10-P12, P8-P15, P17, P19) checked the tables selectively. Participants' preferences reflected their prior experiences creating images. For instance, P7, who typically creates images using an SVG editor, prioritized the prompt verification table. He said "I detail more things in the prompt and want everything to be in the image, 'cause I am more used to programming-drawing." P13 skipped the style and errors table as he was not familiar with the concepts despite the definitions provided: "As a born blind person, most information in the visual attributes is not useful as it's hard to imagine those." Participants also mentioned that they liked that GenAssist provided the breakdown of the summary description into multiple tables. P16 described that GenAssist has "So much transparency because it provides access to intermediate tables that constitute the summary table, just like a [programming tool]! I can look at the inside of the models and see what they're doing."
Figure 7: Distribution of the rating scores for GenAssist and the baseline interface (1 = positive, 7 = negative) in the two tasks. Note that a lower value indicates positive feedback and vice versa. The asterisks indicate statistical significance as a result of the Wilcoxon test (p < 0.05 is marked with * and p < 0.01 is marked with **). In the interpretation task, GenAssist significantly outperformed the baseline interface in performance, effort, frustration, and usefulness for understanding the differences between images. In the generation task, GenAssist received significantly lower (i.e., more positive) ratings for usefulness in understanding the differences and for outcome satisfaction.

P10 and P11 both mentioned that they appreciated the order of the tables: "The summary [table] is the bigger picture. Then the tables go into the details. I also like that the prompt questions come first because they're important."

Participants also employed multiple strategies for navigating within the tables. Participants browsed through questions in the tables to identify questions they found to be important and skipped questions that were less important (e.g., not interested, or already appeared in the summary descriptions). We also identified multiple patterns of navigating within the tables. Participants checked all cells in a row when they found the table to be important. For instance, P11 checked the answers of all four images in the prompt verification table. In other cases, participants first checked the questions, then decided whether to read the row or skip to the next row. Participants skipped rows if the answers to the questions were already mentioned in the summary table, or if they were not interested in the question. For example, P8 skipped the medium, lighting, and perspective rows in the visual style & errors table and only attended to the error row. Sometimes, participants only checked the answer cells if the summary column highlighted the differences between the images and skipped to the next row if the summary stated mainly the similarities between the images. Participants stated that GenAssist's table format was easy to navigate. P19 noted the ease of navigation within the table: "I like having control with the tables. If the question or summary doesn't seem interesting, I can skip to the next row instead of reading all answers of four images."

Asking additional information. With the baseline, most participants (12 participants in the interpretation task, 9 participants in the generation task) asked follow-up questions to try to understand the images, while with our system participants rarely asked follow-up questions (1 participant in the interpretation task and none in the generation task). P16 was the only participant who asked additional visual questions with GenAssist after reading the table ("Is the data showed falling or rising?" and "What is the date of the x-axis?" for S3 in Figure 6). When asked about the reason for not asking any additional questions, P18 said "Looking at captions I already had a big picture so I didn't ask additional questions." P7 similarly reflected: "I like that [GenAssist] asks questions that I haven't thought of but are still important. The answers to the questions told me additional stuff about the images." In contrast, with the baseline interface, participants asked many additional visual questions. Because each image was presented separately, participants often asked the same question for each image to compare the answers. Most of the questions were about the objects detected, especially when the object was not mentioned in the caption or did not seem relevant to the setting (e.g., P11 asked "Where is the beachball in the picture?" after reading the object detection results of an image with a kitchen setting). P10, who experienced the baseline condition after GenAssist, reflected that "This one [Baseline] is not simply laid out for me. The previous one [GenAssist] is easy peasy presenting everything for me. And this one is 'Here you have to figure out.'"

Refining and Iterating the Prompt. In the generation task, none of the participants refined the prompt using the baseline, and five participants refined the prompt when using GenAssist (P9, P10, P13, P16, P17). Among the remaining 7 participants, 5 participants reported that they did not iterate as they were satisfied with the results, and 2 participants were unsure how to iterate on the prompt after realizing that the image generation model did not reflect some parts of the original prompt (P15, P20).

Participants often quickly made the decision to revise the prompt while reading the summary table and before they moved on to other tables. For instance, while generating an image for an article about multitasking, P10 first attempted to generate an image with the following prompt: 'A woman who is holding the iPhone is texting on it while she glances at another device which displayed some funny videos going on. She's in the kitchen trying to cook. it looks like the
GenAssist to support the images that text-to-image generation models currently support: content-driven photos or illustrations with simple structures. However, both text-to-image generation and GenAssist do not yet support images that are information-rich or densely structured, such as information visualizations [64, 65] or diagrams [3, 66]. As text-to-image generation improves, future research will explore extending GenAssist to complex graphics with text. For example, GenAssist could help creators recognize if their prompt-generated diagram contains the desired text (by integrating Optical Character Recognition), relationships, and perceptual qualities (e.g., legibility, saliency of important information).
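As one illustration of the OCR integration suggested above, the sketch below flags words from a prompt that do not appear in the text recognized from a generated diagram. It assumes the pytesseract and Pillow packages, the required-word list is a hypothetical input, and it illustrates a possible direction rather than a current GenAssist feature:

    # Sketch: report which requested words a generated diagram failed to render.
    # Assumes Tesseract OCR via pytesseract; the inputs are hypothetical.
    from PIL import Image
    import pytesseract

    def missing_prompt_text(image_path: str, required_words: list[str]) -> list[str]:
        ocr_text = pytesseract.image_to_string(Image.open(image_path)).lower()
        return [word for word in required_words if word.lower() not in ocr_text]

    # Example (hypothetical file and labels):
    # print(missing_prompt_text("diagram.png", ["Revenue", "Q3", "Forecast"]))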
Second, the descriptions that GenAssist is capable of providing are also limited by the capabilities of the pre-trained vision-language models [34, 56, 85]. For example, while GenAssist helped creators notice image generation errors such as omitted prompt details [36], distortions to human bodies [78], and objects placed illogically [80], some errors remained undetected. Also, GenAssist occasionally included hallucinations (e.g., missing or non-existent objects) in the descriptions. While these issues may be mitigated with improvements to text-to-image models (e.g., better aligning with human preferences [78]) and vision-language models (e.g., better composition reasoning [38], reducing hallucinations [7]), GenAssist could also learn what prompts are prone to generation errors and guide BLV creators in creating strong prompts.

Finally, while GenAssist's pipeline surfaced large differences between images (e.g., different objects, characters, expressions, or styles), its descriptions often missed smaller differences between images that were less likely to be described in training data captions (e.g., slightly different compositions or makeup styles). Thus, GenAssist is currently useful in the early stages of prompt iteration, where large differences between images remain. In the future, GenAssist could detect detailed changes by adding more detailed or domain-specific content and style questions, or integrating vision models that explicitly compare images [74].

Understanding Multiple Images. Creators in the formative study revealed that it is difficult to understand multiple images at the same time (D2. Understanding high-level image similarities and differences). To tackle this challenge, we designed GenAssist with three strategies: (1) providing an overview of similarities and differences between the generated image candidates, (2) progressively disclosing the information from high-level to low-level to give the user control over the level of detail received [23, 43, 50], and (3) presenting the descriptions in a table format so that users can easily navigate between images to compare them. Participants highlighted that not only these detailed summaries but also the ability to selectively gain information about the underlying questions were helpful in narrowing down their choices. For example, some participants prioritized the prompt verification table to assess if the image followed their instructions (D3. Assessing if images followed the prompt), and other participants used the content and style table to learn how to improve their prompts (D4. Accessing image details not specified by the prompt). In the future, GenAssist could support sorting or filtering images based on visual attributes to limit the number of images creators consider at once (e.g., sorting images based on prompt adherence or filtering images that have AI-generated distortions). GenAssist could also read image descriptions with multiple voice styles to help creators distinguish generation candidates.

GenAssist's ability to attend to multiple similar images and surface differences can be useful in broader contexts. Our study participants expressed interest in using GenAssist for comparing image search results or similar photos on social media. It can also help BLV people in decision-making situations based on visual information (e.g., online shopping, communicating with a design team in software development, selecting a photo from similar shots).

Implications for Visual Question Answering. Comparing GenAssist to our baseline of typical descriptions with visual question answering (VQA), all participants rated GenAssist as more useful for understanding differences between images, and creators asked fewer follow-up questions with GenAssist. GenAssist reduced follow-up questions by predicting visual questions based on the formative study and applying the questions to multiple images. Our predict-ask-summarize approach also reduced the requirement for reading individual question answers. Future VQA systems intended for real-world environments may benefit from our approach, as repetitive questions, "unknown unknowns", and complex visuals are likely.
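The predict-ask-summarize pattern described above can be sketched in a few lines. The helper callables below (an LLM question generator, a visual question answering model, and an LLM summarizer) are hypothetical placeholders standing in for the kinds of models used in our pipeline, not the exact GenAssist implementation:

    # Minimal sketch of a predict-ask-summarize loop over several images.
    from typing import Callable, Dict, List

    def predict_ask_summarize(
        prompt: str,
        images: List[str],
        generate_questions: Callable[[str], List[str]],         # hypothetical LLM step
        answer_question: Callable[[str, str], str],             # hypothetical VQA step
        summarize: Callable[[Dict[str, Dict[str, str]]], str],  # hypothetical LLM step
    ) -> str:
        # Predict: derive a shared question set up front instead of waiting
        # for the user to ask follow-up questions image by image.
        questions = generate_questions(prompt)
        # Ask: apply every question to every image so answers are comparable.
        answers = {
            image: {q: answer_question(image, q) for q in questions}
            for image in images
        }
        # Summarize: condense per-image answers into similarities and
        # differences so the user need not read each answer individually.
        return summarize(answers)

Sharing a single question set across all candidates is what makes the per-image answers directly comparable in the tables.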
Support in Creating Prompts. In the formative study, we distilled the need to support creating prompts (D1. Authoring prompts that specify content and style). While we do not directly support prompt creation, we designed our system to reveal visual content and styles based on prompt guidelines to inform users about details the model filled in. In the user study, participants cited that reading the tables in GenAssist helped inform their prompt iterations and learn about what styles to use. Prior work has explored using structured search for visual concepts for writing prompts [37, 39], and combining our system with such prior work is a promising avenue for future work. We are currently exploring suggesting content and styles for the prompt when the user specifies the context of image use, and new ways to help users add specificity to their prompt (e.g., a chatbot, as suggested in the formative study). In addition to text input, we can also consider multimodal input from users in the future, such as image prompts [54], sketch prompts [11, 82], or music prompts [55] to create an image for a music album cover, as desired by P6.

Supporting Creators with Different Visual Impairments. BLV creators' interest in color or style information (e.g., medium, lighting, angle) often depended on their prior experience with visuals and onset of blindness. GenAssist supports creators in selectively accessing description details, but in the future GenAssist will let creators control which details to filter out or prioritize. To support creators without knowledge of visual style, GenAssist could recommend popular styles given the image's intended use, provide style descriptions, or deliver style in another modality (e.g., sound [21], tactile interfaces). We will also improve GenAssist in the future to support users with remaining vision beyond providing descriptions. For example, GenAssist could provide descriptions based on the current zoom viewing window or support further visual edits to the generated images, as desired by P1.

Implications of GenAssist on Creativity. Text-to-image generation models have sparked conversations about their implications for creativity. For BLV creators, image generation can improve creative agency compared to existing approaches for creating or
selecting images. In our formative study, creators wanted to use image generation as it provided fewer limits over content and style than searching for images online and greater autonomy than asking a sighted person to create the image. GenAssist supports BLV creators in exercising creative control over generated images by letting creators examine image details to revise the prompt or make an informed selection. Compared to sighted artists who use generated images primarily as references [37], BLV creators often intend to use generated images directly. In the future, GenAssist will further support creative control through prompt-based editing [4].

Implications of GenAssist on Communication. We designed GenAssist to support the communication goals of BLV creators. BLV creators in our formative study aimed to create images to express their ideas to a broad audience and achieve self-expression. Images are particularly useful for capturing visual attention and communicating with sighted people who have difficulty reading text. For example, P4 generated an image of his family to share with his child. BLV creators also wanted to use GenAssist in the workplace and on digital platforms. As GenAssist exists in an ableist environment that prioritizes visual communication, there is a risk that GenAssist may cause sighted people to expect image-based communication from BLV people. Tools like GenAssist must be coupled with research and activism to make digital, workplace, and educational environments accessible (e.g., enabling non-visual communication and providing access to existing visuals). Our work also reveals that generated images themselves should be shared with descriptions, in addition to the prompt, as the prompt might not accurately reflect the image.

Generative AI for Accessible Media Authoring. Advances in large-scale generative models enable people to create new types of content, yet no existing research has explored people with disabilities as the users of these tools [28]. We see opportunities for generative AI models to broaden the type of content that people with disabilities can create. For example, our study participants mentioned that they are interested in using generative models for creating dynamic graphics like cartoons and videos. Similarly, generative models may be useful for people with motor impairments authoring visual media, or people with hearing impairments authoring music.

8 CONCLUSION

We created GenAssist, an accessible text-to-image generation system for BLV creators. Informed by our formative study with 8 BLV creators, our interface enables users to verify the adherence of generated images to their prompts, access additional image details, and quickly assess similarities and differences between image candidates. Our system is powered by large language and vision-language models that generate visual questions, extract answers, and summarize the visual information. Our user study with 12 BLV creators demonstrated the effectiveness of our approach. We hope this research will catalyze future work in supporting people with disabilities to express their creativity.

REFERENCES

[1] Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. 2019. Gradio: Hassle-free sharing and testing of ML models in the wild. arXiv preprint arXiv:1906.02569 (2019).
[2] AccessiblePublishing.ca. 2023 (accessed Apr 2, 2023). Guide to Image Descriptions. https://www.accessiblepublishing.ca/a-guide-to-image-description/
[3] David Austin and Volker Sorge. 2023. Authoring Web-accessible Mathematical Diagrams. In Proceedings of the 20th International Web for All Conference. 148–152.
[4] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. 2022. Text2LIVE: Text-driven layered image and video editing. In European Conference on Computer Vision. Springer, 707–723.
[5] Cynthia L Bennett, Jane E, Martez E Mott, Edward Cutrell, and Meredith Ringel Morris. 2018. How teens with visual impairments take, edit, and share photos on social media. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–12.
[6] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samuel White, et al. 2010. VizWiz: Nearly real-time answers to visual questions. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. 333–342.
[7] Ali Furkan Biten, Lluís Gómez, and Dimosthenis Karatzas. 2022. Let there be a clock on the beach: Reducing object hallucination in image captioning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1381–1390.
[8] Jens Bornschein and Gerhard Weber. 2017. Digital drawing tools for blind users: A state-of-the-art and requirement analysis. In Proceedings of the 10th International Conference on Pervasive Technologies Related to Assistive Environments. 21–28.
[9] Erin Brady, Meredith Ringel Morris, Yu Zhong, Samuel White, and Jeffrey P Bigham. 2013. Visual challenges in the everyday lives of blind people. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2117–2126.
[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[11] Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. 2009. Sketch2photo: Internet image montage. ACM Transactions on Graphics (TOG) 28, 5 (2009), 1–10.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[14] Facebook. 2021. How Facebook is using AI to improve photo descriptions for people who are blind or visually impaired. https://ai.facebook.com/blog/how-facebook-is-using-ai-to-improve-photo-descriptions-for-people-who-are-blind-or-visually-impaired/
[15] Olutayo Falase, Alexa F Siu, and Sean Follmer. 2019. Tactile code skimmer: A tool to help blind programmers feel the structure of code. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility. 536–538.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
[17] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3608–3617.
[18] https://github.com/mikhail-bot/. 2023 (accessed Apr 2, 2023). Stable Diffusion Negative Prompts. https://github.com/mikhail-bot/stable-diffusion-negative-prompts
[19] https://github.com/pharmapsychotic/. 2023 (accessed Apr 2, 2023). CLIP Interrogator. https://github.com/pharmapsychotic/clip-interrogator
[20] https://github.com/willwulfken/. 2023 (accessed Apr 2, 2023). Midjourney Styles and Keywords. https://github.com/willwulfken/MidJourney-Styles-and-Keywords-Reference
[21] https://huggingface.co/spaces/fffiloni/. 2023 (accessed Apr 2, 2023). Image to Music. https://huggingface.co/spaces/fffiloni/img-to-music
[22] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. 2023. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. arXiv preprint arXiv:2303.11897 (2023).
[23] Mina Huh, YunJung Lee, Dasom Choi, Haesoo Kim, Uran Oh, and Juho Kim. 2022. Cocomix: Utilizing Comments to Improve Non-Visual Webtoon Accessibility. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–18.
[24] Mina Huh, Saelyne Yang, Yi-Hao Peng, Xiang 'Anthony' Chen, Young-Ho Kim, and Amy Pavel. 2023. AVscript: Accessible Video Editing with Audio-Visual Scripts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
[25] Jiho Kim, Arjun Srinivasan, Nam Wook Kim, and Yea-Seul Kim. 2023. Exploring Chart Question Answering for Blind and Low Vision Users. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
[26] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[27] Hyung-Kwon Ko, Subin An, Gwanmo Park, Seung Kwon Kim, Daesik Kim, Bohyoung Kim, Jaemin Jo, and Jinwook Seo. 2022. We-toon: A Communication Support System between Writers and Artists in Collaborative Webtoon Sketch Revision. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–14.
[28] Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon, Jaemin Jo, Juho Kim, and Jinwook Seo. 2023. Large-scale text-to-image generation models for visual artists' creative works. In Proceedings of the 28th International Conference on Intelligent User Interfaces. 919–933.
[29] Mackenzie Leake, Hijung Valentina Shin, Joy O Kim, and Maneesh Agrawala. 2020. Generating Audio-Visual Slideshows from Text Articles Using Word Concreteness. In CHI, Vol. 20. 25–30.
[30] Cheuk Yin Phipson Lee, Zhuohao Zhang, Jaylin Herskovitz, JooYoung Seo, and Anhong Guo. 2022. CollabAlly: Accessible Collaboration Awareness in Document Editing. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems.
[31] Jaewook Lee, Jaylin Herskovitz, Yi-Hao Peng, and Anhong Guo. 2022. ImageExplorer: Multi-Layered Touch Exploration to Encourage Skepticism Towards Imperfect AI-Generated Image Captions. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–15.
[32] Jaewook Lee, Yi-Hao Peng, Jaylin Herskovitz, and Anhong Guo. 2021. Image Explorer: Multi-Layered Touch Exploration to Make Images Accessible. In Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility. 1–4.
[33] Jingyi Li, Son Kim, Joshua A Miele, Maneesh Agrawala, and Sean Follmer. 2019. Editing spatial layouts through tactile templates for people with visual impairments. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–11.
[34] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023).
[35] Junchen Li, Garreth W. Tigwell, and Kristen Shinohara. 2021. Accessibility of high-fidelity prototyping tools. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–17.
[36] Vivian Liu and Lydia B Chilton. 2022. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–23.
[37] Vivian Liu, Han Qiao, and Lydia Chilton. 2022. Opal: Multimodal Image Generation for News Illustration. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–17.
[38] Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, and Ranjay Krishna. 2023. CREPE: Can Vision-Language Foundation Models Reason Compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10910–10921.
[39] Shane McGeehan. 2023 (accessed Apr 2, 2023). Prompter. https://prompterguide.com/prompter/
[40] Microsoft. 2021. Seeing AI. https://www.microsoft.com/en-us/ai/seeing-ai
[41] Midjourney. 2023 (accessed Apr 2, 2023). Midjourney. https://www.midjourney.com
[42] Midjourney. 2023 (accessed Apr 2, 2023). Midjourney Prompt Guidelines. https://docs.midjourney.com/docs/prompts
[43] Meredith Ringel Morris, Jazette Johnson, Cynthia L Bennett, and Edward Cutrell. 2018. Rich representations of visual content for screen reader users. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–11.
[44] Hospital News. 2016. You are what you eat. https://hospitalnews.com/you-are-what-you-eat-why-nutrition-matters/
[45] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. 2015. DesignScape: Design with interactive layout suggestions. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 1221–1224.
[46] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[47] Guy Parsons. 2023 (accessed Apr 2, 2023). DALL-E 2 Prompt Book. https://dallery.gallery/the-dalle-2-prompt-book/
[48] William Christopher Payne, Alex Yixuan Xu, Fabiha Ahmed, Lisa Ye, and Amy Hurst. 2020. How blind and visually impaired composers, producers, and songwriters leverage and adapt music technology. In Proceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility. 1–12.
[49] Yi-Hao Peng, Jeffrey P Bigham, and Amy Pavel. 2021. Slidecho: Flexible Non-Visual Exploration of Presentation Videos. In The 23rd International ACM SIGACCESS Conference on Computers and Accessibility. 1–12.
[50] Yi-Hao Peng, Peggy Chi, Anjuli Kannan, Meredith Morris, and Irfan Essa. 2023. Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual Access. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
[51] Yi-Hao Peng, JiWoong Jang, Jeffrey P Bigham, and Amy Pavel. 2021. Say It All: Feedback for Improving Non-Visual Presentation Accessibility. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–12.
[52] Yi-Hao Peng, Jason Wu, Jeffrey Bigham, and Amy Pavel. 2022. Diffscriber: Describing Visual Design Changes to Support Mixed-Ability Collaborative Presentation Authoring. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–13.
[53] Venkatesh Potluri, Liang He, Christine Chen, Jon E Froehlich, and Jennifer Mankoff. 2019. A multi-modal approach for blind and visually impaired developers to edit webpage designs. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility. 612–614.
[54] Han Qiao, Vivian Liu, and Lydia Chilton. 2022. Initial Images: Using Image Prompts to Improve Subject Representation in Multimodal AI Generated Art. In Creativity and Cognition. 15–28.
[55] Yue Qiu and Hirokatsu Kataoka. 2018. Image generation associated with music data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2510–2513.
[56] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[57] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
[58] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
[59] Mr D Murahari Reddy, Mr Sk Masthan Basha, Mr M Chinnaiahgari Hari, and Mr N Penchalaiah. 2021. Dall-e: Creating images from text. UGC Care Group I Journal 8, 14 (2021), 71–75.
[60] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
[61] Anastasia Schaadhardt, Alexis Hiniker, and Jacob O Wobbrock. 2021. Understanding blind screen-reader users' experiences of digital artboards. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–19.
[62] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402 (2022).
[63] Athar Sefid, Prasenjit Mitra, and Lee Giles. 2021. SlideGen: An abstractive section-based slide generator for scholarly documents. In Proceedings of the 21st ACM Symposium on Document Engineering. 1–4.
[64] Ather Sharif, Olivia H Wang, Alida T Muongchan, Katharina Reinecke, and Jacob O Wobbrock. 2022. VoxLens: Making online data visualizations accessible with an interactive JavaScript plug-in. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19.
[65] Ather Sharif, Andrew M Zhang, Katharina Reinecke, and Jacob O Wobbrock. 2023. Understanding and Improving Drilled-Down Information Extraction from Online Data Visualizations for Screen-Reader Users. In Proceedings of the 20th International Web for All Conference. 18–31.
[66] Volker Sorge, Mark Lee, and Sandy Wilkinson. 2015. End-to-end solution for accessible chemical diagrams. In Proceedings of the 12th International Web for All Conference. 1–10.
[67] NY Times. 2023 (accessed Apr 2, 2023). My Kids Want Plastic Toys. I Want to Go Green. https://time.com/6126981/my-kids-want-plastic-toys-i-want-to-go-green-heres-a-fix/
[68] NY Times. 2023 (accessed Apr 2, 2023). Why Multitasking is Bad for You. https://time.com/4737286/multitasking-mental-health-stress-texting-depression/
[69] Iulia Turc and Gaurav Nemade. 2022. Midjourney User Prompts & Generated Images (250k). https://doi.org/10.34740/KAGGLE/DS/2349267
[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[71] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156–3164.
[72] Luis Von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 319–326.
[73] W3C Web Accessibility Initiative (WAI). 2022 (accessed Dec 12, 2022). Introduction to web accessibility. https://www.w3.org/WAI/fundamentals/accessibility-intro/
[74] Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B Chan. 2020. Compare and reweight: Distinctive image captioning using similar images sets. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 370–386.
[75] Ruolin Wang, Zixuan Chen, Mingrui Ray Zhang, Zhaoheng Li, Zhixiu Liu, Zihan Dang, Chun Yu, and Xiang 'Anthony' Chen. 2021. Revamp: Enhancing Accessible Information Seeking Experience of Online Shopping for Blind or Low Vision