GenAssist: Making Image Generation Accessible
Figure 1: GenAssist makes image generation accessible by providing rich visual descriptions of image generation results. Given a text prompt and a set of generated images, GenAssist uses a large language model (GPT-4) to generate prompt verification questions from the prompt and image-based questions from the image captions. GenAssist then answers the visual questions (BLIP-2) and uses a vision-language model (CLIP) and an object detection model (Detic) to extract additional visual information. GenAssist then uses GPT-4 to summarize all of the information into comparison descriptions and per-image descriptions.
ABSTRACT
Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images based on a text description (i.e., prompt), it is difficult to assess the content and quality of generated images. We present GenAssist, a system to make text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates followed the prompt, access additional details in the image not specified in the prompt, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all.

CCS CONCEPTS
• Human-centered computing → Accessibility systems and tools;

KEYWORDS
Accessibility, Generative AI, Image Generation, Creativity Support Tools

ACM Reference Format:
Mina Huh, Yi-Hao Peng, and Amy Pavel. 2023. GenAssist: Making Image Generation Accessible. In The 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23), October 29–November 01, 2023, San Francisco, CA, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3586183.3606735

This work is licensed under a Creative Commons Attribution International 4.0 License.
UIST '23, October 29–November 01, 2023, San Francisco, CA, USA
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0132-0/23/10.
https://doi.org/10.1145/3586183.3606735

1 INTRODUCTION
BLV creators use images in presentations [52], social media [5], videos [24], and art [8]. To obtain images, creators currently either describe their desired images to the sighted collaborators who then search for or create the image [52, 75], or limit the types of images they create [61]. Large-scale text-to-image generation models, such as DALL-E [58], Stable Diffusion [60], and Midjourney [41], present an opportunity for these creators to generate images directly from text descriptions (i.e., prompts). However, current text-to-image generation tools are inaccessible to BLV creators, as creators must visually inspect the content and quality of the generated images to iteratively refine their prompt and select from multiple generated candidate images.
While BLV creators can gain access to images using automated descriptions [34, 40], existing descriptions are intended primarily for image consumption. As a result, the descriptions leave out details that may help authors decide whether or not to use the image (e.g., style, lighting, colors, objects, emotions). Prior work also enables users to gain flexible access to the spatial layout of objects in images [32], but exploring details per image makes it difficult to assess similarities and differences between image options provided during image generation. To make authoring visuals more accessible, prior work has explored describing visuals to help creators author presentations [52] or videos [24]. While such work helps creators identify low-quality visuals (e.g., blurry footage in a video [24]) or graphic design changes (e.g., changing slide layouts [52]), prior work has not yet explored how to improve the accessibility of image generation.

To understand the opportunities and challenges of text-to-image generation, we conducted a formative study with 8 BLV creators who regularly create or search for images. Creators in our study reported their existing strategies for making images themselves (e.g., using SVG editors or code), searching for images, or asking others to search for or create images (similar to prior work [5, 24, 52]). All creators expressed excitement about using image generation to improve their efficiency and expressivity in image authoring. Creators all used image generation for the first time during our study and enjoyed creating high-fidelity images for their own uses (e.g., creating a logo for their website, making a card for their family). While we invited participants to ask the researchers visual questions to gain access to the visual details (e.g., "What are the differences?", "Is the color calm or aggressive?"), it remained challenging for participants to: craft a well-specified prompt especially without visual experience, assess how well the generated image followed the prompt, recognize generated details that were not originally specified in the prompt, and understand or remember the similarities and differences between images.

To improve the accessibility of image generation, we present GenAssist, a system that provides access to text-to-image generation results via prompt-guided image descriptions and comparisons (Figure 1). Our system lets creators skim an overview of similarities and differences between images using our comparison descriptions and per-image descriptions (Figure 1, right), assess if the images followed their prompt using prompt verification (Figure 1, center), and recognize visual details not in the prompt using our content and style extraction (Figure 1, center). Creators can also interactively ask questions across multiple images to gain additional details. Our interface design enables creators to easily navigate visual information via a screen reader-accessible table format. Our tables let creators selectively gain information about individual images (columns) or visual questions (rows) (Figure 4).

We evaluated GenAssist in a within-subjects study with 12 BLV creators who compared GenAssist with a baseline interface that was designed to encompass existing practices of accessing images (e.g., automated captioning [77], object detection [40], and Visual Question Answering [34]). Participants rated GenAssist as more useful than the baseline interface for understanding similarities and differences between the images, and they reported higher satisfaction with their image generation performance. Participants all expressed excitement about using GenAssist in their own workflows for authoring images and for new uses.

Our work contributes:
• Design opportunities for making image generation accessible, derived from a formative study
• GenAssist, a system that provides access to image generation results via prompt-guided summaries and descriptions
• A user study that demonstrates how BLV creators use GenAssist to interpret and generate images

2 BACKGROUND
As we aim to enhance the experience of BLV content creators working with AI-powered image-generation tools, our work builds upon prior research that explores the accessibility of authoring tools and images, and text-to-image generation tools.

2.1 Accessibility of Authoring Tools
Enabling access to authoring tools unlocks new forms of self-expression. Recent research has investigated how BLV people take and edit photos and videos [5, 24], compose music [48], draw digital images [8], and make presentations [52, 61, 83]. Such work includes studies of current practices that highlight accessibility concerns of existing authoring tools and the authored visuals. For example, features of current authoring tools remain difficult to access using screen readers [24, 35, 51], and it can be difficult to assess the effect of visual edits such as color changes [61].

To improve the accessibility of authoring tools, researchers have explored methods for providing feedback to authors as they modify visual elements. For example, prior work has developed tactile devices that assist BLV designers in understanding and adjusting the layout of user interface elements [33, 53]. Tactile feedback has also been used to help developers interpret code structure, such as indentation [15]. Other prior work has used audio notifications to inform users about scene changes when reviewing videos [24, 49], while text descriptions have been used to convey visual details important to authoring, such as brightness and layout [24, 52]. Sound and text feedback have also been used to keep blind authors informed about their collaborators' edits to documents [30]. Similar to prior research, we also aim to make authoring tools accessible by providing in-situ feedback, but we instead provide creation-specific information to facilitate authoring images.

In addition to offering authoring feedback, researchers have developed systems to automate visual authoring. Prior systems recommend 2D layouts for visual elements during graphic design [45] and transform text into visual presentations [29, 63, 79]. To accommodate individual preferences and mitigate the impact of errors produced during generation, these systems typically offer multiple options for users to choose from and allow iterative generation attempts. Iterative generation and selection are not accessible to BLV creators, as they require visually inspecting the output designs to choose a generated option or revise the input. In this work, we seek to make automated authoring tools, such as image generation, more accessible to BLV creators. Our approach provides a structured format for assessing and comparing generated results, and
on-demand access to additional visual details to support creators in selecting a result and revising their input.

2.2 Accessibility of Images
Improving the accessibility of image generation systems involves not only ensuring access to image generation features, but also making the produced images accessible. A primary method for making images more accessible is representing them as text descriptions, such as image captions or alt text (e.g., "A person walking on the street"). Early work hired crowd workers to create alt text [6, 72], while recent research has developed machine-learning-based systems that automatically generate image descriptions [34, 71, 81]. Building on auto-generated captions, researchers have developed systems that further improve users' understanding of images by providing additional information, such as regional descriptions [40, 84], and structuring detailed descriptions into an overview [14, 32, 43]. This approach enables users to review visual information more efficiently and has been found to help blind people better understand images compared to using captions alone [31]. Our work builds upon this idea by presenting descriptions of image generation results in a hierarchical, easy-to-compare format, and tailoring the descriptions to the task of authoring rather than consuming images.

Automatic descriptions do not always capture all of the important image details. Visual Question Answering (VQA) tools can fill this gap by offering on-demand answers to visual questions (e.g., "What is the person walking on the street wearing?"). Previous research has explored what visual questions blind people would like to have answered [9] and provided on-demand visual question answering support using both crowdsourcing [6, 25] and automated methods [17]. While VQA provides control over visual information gathering, it takes effort to ask individual questions. We investigate what types of visual questions BLV creators ask to create images during our formative study (similar to Brady et al. [9]), then use VQA to extract visual information and summarize this information as image descriptions. Thus, we explore how VQA and image descriptions work together as interconnected rather than separate accessibility solutions.

2.3 Text-to-Image Generation Tools
In recent years, significant progress has been made in the field of generative image models, particularly text-to-image models. These models employ pre-trained vision-language models to encode text input into guiding vectors for image generation, allowing users to create images using text prompts. This advancement can be attributed to various factors, including innovations in deep learning architectures (e.g., Variational Autoencoders (VAEs) [26] and Generative Adversarial Networks (GANs) [16]), novel training paradigms like masked modeling for language and vision tasks [10, 12, 13, 70], and the availability of large-scale image-text datasets [62]. With these advancements, recent diffusion-based models like DALL-E 2 [57], Stable Diffusion [60], and Midjourney [41] have successfully demonstrated the ability to synthesize high-quality images in versatile styles, including photorealism. This opens up potential practical applications for the content production industry [37]. However, none of the image generation tools provide text descriptions of the output, so they are not accessible to BLV creators. In this work, we chose to use Midjourney due to its popularity among designers and content creators for its high-quality results. Midjourney enables creators to generate 4 candidate images for a single text prompt via a text-based interface hosted on Discord. However, our approach is not limited to any particular model, as we focus on comparing and describing multiple generated results from a single prompt, helping creators select the ideal image from various candidates produced by image generation tools.

With the development of these models, recent works have conducted studies to understand the relationship between content creators and generative AI tools, introducing design guidelines for such systems [28, 36]. These guidelines emphasize the need for more user controllability. Researchers have thus developed various tools to help designers better make use of generative AI, including assistance in exploring and writing better prompts [36, 76], recommending potential illustrations for news articles [37], and supporting collaboration between writers and artists [27]. While these studies offer valuable insights into how designers interact with generative models, none have focused on creators with disabilities. Given the potential of text-to-image models for BLV creators, our work is the first to explore how to increase inclusivity in the expressiveness of image generation tools and make this emerging authoring approach more broadly accessible.

3 FORMATIVE STUDY
To understand the strategies and challenges of authoring and searching for images, we conducted a formative study with BLV creators. The formative study consisted of a semi-structured interview to investigate current strategies and challenges of obtaining images, and two image generation tasks to explore current strategies and challenges of using text-to-image generation.

3.1 Method
We recruited 8 BLV creators who create or use visual assets on a regular basis (P1-P8, Table 4). Participants were recruited using mailing lists and compensated 50 USD for the 1.5-hour remote study conducted via Zoom (this study was approved by our institution's Institutional Review Board). Participants were totally blind (6 participants) or legally blind (2 participants) with light and color perception. All participants had previously produced or selected images for their work across several professions: teacher (English, Music), professor (Computer Science, Climate), software engineer, graduate student, and artist. 7 participants had prior knowledge of text-to-image generation models; none had previously used such tools.

We first conducted a semi-structured interview asking participants how they currently created or used visual assets, and what accessibility barriers they encountered with their current approaches. We then provided a short tutorial on text-to-image generation and shared Midjourney's guidelines for creating text prompts [42] and example prompts from a Midjourney dataset [69]. Participants then completed two image generation tasks (20 minutes per task): a guided task in which participants generated a cover image for a news article [44] given the article's title and full text, and a freeform task in which participants generated their own image. To limit onboarding time, participants emailed us their prompt (text and/or image) instead of using Midjourney's Discord interface, then we
shared the four generated candidate images back to the participants. We encouraged participants to ask questions about the four candidate images to select one or change the prompt.

We recorded and transcribed the formative studies. To analyze the types of visual questions asked in the image generation task, two of the researchers labeled questions based on their goals and the types of information asked (see Supplemental Material for the full list of prompts, images, and visual questions of the formative study).

3.2 Findings
Current Practice. Participants reported that they currently use images for a variety of contexts including slides, website images, paintings for commission, cartoons, scientific diagrams, and music album covers (Table 4). Five participants noted that they created images on their own using image creation software such as SVG editors, slides, Photoshop, and ProCreate (P7, P1, P5, P6), code packages including Python and LaTeX (P4, P5), or by taking photos (P3). Among them, three participants asked sighted people to review them (P3, P4, P6), and two participants reviewed the images using accessibility tools (e.g., audioScreen, tactile graphs, ZoomText) (P7, P3). Five participants searched for images online (P7, P8, P2, P3, P5), and three participants recruited another person to create or search the images for them (P7, P4, P5).

All participants who searched for images mentioned that they ask sighted people to describe the images for them in addition to reading any available alt text. P7 noted "Alt text has never been helpful. It's too short without important details." P8 and P5 mentioned that while a few established websites (e.g., New York Times, NASA) have good alt text, Google Image Search returns options other than established websites and "it is hard to compare the results of the image search" (P5). Participants also noted barriers to asking others to describe the image search results, including finding available people to describe the images and avoiding false perceptions: "I only ask a handful of people because it might lead to some subconscious bias 'that I'm not independent', cause it's a basic task" (P7).

Generating Prompts. All prompts written by participants specified the content they wanted to appear in the image (e.g., P6 used the prompt "A person pushing a grocery cart down a produce aisle."), and only two participants specified the style of the image (P1 and P7 specified "a photograph of..."). Participants mentioned several challenges of creating prompts. First, while prompt guidelines [42] recommend users to specify multiple attributes in their prompt (e.g., style, lighting), participants reported that they were unfamiliar with visual attributes (P5: "I'm trying not to leave much to system randomness, I want to detail more things. But I don't know a lot about different styles.") and others found it difficult to remember what to mention in the prompt: "I want the model to behave more like a wizard – asking me a series of questions 'What do you want to create?', 'What style?' and so on. It is hard to create detailed prompts in one attempt" (P2). Participants also noticed that it is challenging to create a prompt that AI would be capable of generating: "If I pin down something really specific or narrow [in the prompt], AI seems to break down" (P1). P5 mentioned that transparency could inform prompt iteration: "I want to know how the model works! [...] then I will know how to write a good prompt." Finally, while participants easily generated prompts during the free-form task motivated by their own creation goals, they mentioned it was challenging to know what content would effectively convey the article in the guided task: "I have no experience reading a news article with images, so it's hard to think of one. What do these images usually contain?" (P7).

Understanding Image Candidates with Visual Questions. After generating images, participants asked visual questions to understand and select the images. Participants asked a total of 89 questions (47 asked in the guided task, 42 in the freeform task). The goals of the questions asked were to check whether the generated images followed the prompt (51), compare two or more images (34), request clarification of the answer provided by the interviewer (3), or understand a single image (1). The type of visual information asked by participants also varied. Participants asked about medium (5), settings (6), object presence (18), object types (11), position attributes (11), color/light/perspective (16), and others (22).

Participants typically started by asking general questions, narrowing down to more specific questions as they ruled out images. For example, P4 progressively asked: "Can you describe the images?", "What are the differences between the four images?", "What are the differences between the [store] aisles?", "Is the second image realistic?". Alternatively, participants started their questioning by directly checking if the image followed their prompts, such as in P5's first question: "Do we actually get the woman sitting at a desk?" Finally, P1 and P2 started with questions about the style of the images: "Is it realistic or cartoony?" (P1) and "Is the color calm or aggressive?" (P2). Through asking questions, participants realized differences between their prompt and the generated images: "it seems like the model generator is filling in details according to the context, even if I didn't specify some details. I didn't specify the clothes but in all images, the women are wearing office clothes" (P5). Participants then asked follow-up questions based on new details. While the visual questions revealed the content and structure of what participants wanted to know about the images, participants reported that asking questions for each image was "very time-consuming and confusing" (P4). 5 participants noted that they would prefer to receive descriptions before asking questions, and participants reported that remembering all of the answers was difficult, as P2 summarized: "I wish there were more description provided in the first place. I don't know what to ask. Also, it's hard to remember all the answers for each image."

Selecting an Image Candidate. While participants initially asked questions based on their prompt, they ultimately selected the final image considering both prompt-based descriptions and descriptions of extra details produced by the model. P7 suggested that information on whether the prompt is reflected in each image should be presented early so that he can decide whether to explore the image in detail or skip to the next candidate. P8 highlighted the importance of additional details: "The model has randomness. It showed items I didn't ask for and didn't show what I asked for in the prompt. I want much information to be surfaced so that I can make a decision. Whether that unexpected parts can be still used." We also observed that similarities between images guided participants in deciding whether to further explore the images or to refine the prompt. For instance, after P3 generated images using the prompt "A photo looking
down on a kitchen table with a plate of pizza, a plate of fried chicken, and a bowl of ice cream on it.", he realized that all four images did not display drinks and iterated on the prompt to explicitly mention "fizzy drinks". On the other hand, differences between the images ultimately informed the final selection, as participants cited unique backgrounds, objects, and mediums as reasons for selecting the image (e.g., P3 selected the final image because it was the only image that presented a dog putting his paw on the books).

Uses of Image Generation. When participants generated their own images in the free-form task, they created a variety of images including logos, art, website decorative images, presentations, and music album covers. All participants expressed excitement about using the text-to-image model as part of their image creation process in the future. Participants mentioned that with image generation, they can create new types of images they had not created before. P6 mentioned "With SVG editor, I cannot make realistic images. But now I can!" Also, participants mentioned that the quick creation will lead them to use images more often: "Because it's so quick, I will use it for communication. Similar to how sighted people draw on a whiteboard during a Zoom meeting, I can quickly generate an image because representing a concept visually is easier for sighted team members." (P8). P4 also compared the experience of image generation with image search: "This simplifies things when I'm looking for things very niche, something that is hard to find online." Finally, participants also mentioned the benefit of creating images alone. P7 said that because there is no need to ask a sighted person to help search for images, it brings more autonomy and privacy. Participants also noted limitations and potential downsides of image generation including potential bias (P8, P4), copyright and training data concerns (P3, P4), wanting to use it only for inspiration (P1), and potential errors (P8). However, P8 expressed that he expected future models to produce fewer errors.

3.3 Reflection
Creators in our formative study currently employ resourceful strategies for creating or searching for images, but all creators expressed excitement to use image generation in their workflow. To improve access to image generation, our formative study reveals design opportunities (D1-D5) to make image generation accessible through technical or social support for:

D1. Authoring prompts that specify content and style.
D2. Understanding high-level image similarities and differences.
D3. Assessing if images followed the prompt.
D4. Accessing image details not specified by the prompt.
D5. Organizing responses to visual questions.

These design opportunities address key user tasks in accessible text-to-image generation: generating the prompt (D1), understanding and selecting images (D2, D3, D4, D5), and revising the prompt for iteration (D4). Our work aims to help creators understand their image generation results through prompt-guided descriptions and comparisons (D2-D5). While providing high-quality descriptions may help creators improve their future prompts (D1), future work should explore how to actively support creators in authoring prompts.

4 SYSTEM
We present GenAssist, a system that supports accessible image generation via prompt-guided image descriptions and comparisons (Figure 1). To illustrate GenAssist, we follow Vito, a professional blogger who uses a screen reader to author his articles. Vito recently wrote an article about the benefits of teaching children to cook, and he wants to add an image to the article to engage his sighted readers. He attempts to use image search to find a stock photo of "a young chef" but notices that many of the images are missing detailed captions and alt text, or feature adult chefs instead of children. He decides to create an image using text-to-image generation with the prompt "a young chef is cooking dinner for his parents". The text-to-image generation model returns four candidates. To decide whether to use one of these images or change his prompt, Vito enters his prompt and image results into GenAssist.

4.1 Prompt Verification
While the text-to-image model generates output images based on the prompt, the generated image often does not reflect the specifications in the prompt, especially if the prompt is long, complicated, or ambiguous [22]. To help users assess how well their generated images adhered to their prompt, GenAssist provides prompt verification. To perform prompt verification, we first use GPT-4 [46] to generate visual questions that verify each part of the prompt. We input the text instruction "Generate visual questions that verify whether each part of the prompt is correct. Number the questions." followed by the user's prompt. For Vito's prompt ("A young chef is cooking dinner for his parents."), GPT-4 outputs a series of questions:

1. Is there a chef in the image?
2. How old is the young chef?
3. Is the young chef cooking food?
4. Are the parents present in the image?

We generate answers to the visual prompt verification questions for each of the four generated candidate images using the BLIP-2 model with the ViT-G Flan-T5-XXL setup [34]. For each generated image and prompt verification question, we instruct the BLIP-2 model with the starting sequence "Answer the given question. Don't imagine any contents that are not in the image." to reduce hallucinations with non-existent information. For Vito's candidates (Images 1-4), BLIP-2 answers:

Is there a chef in the image? Yes / Yes / Yes / Yes
How old is the young chef? Young kid / Young kid / Young kid / Young man
Is the young chef cooking food? Yes / Yes / Yes / Yes
Are the parents present in the image? Yes / No / Yes / Yes
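To make the two steps above concrete, the sketch below shows one way to call GPT-4 for question generation and BLIP-2 for per-image answering. It is a minimal sketch rather than the authors' implementation: the OpenAI SDK usage, the Hugging Face checkpoint name (assumed to approximate the ViT-G Flan-T5-XXL setup), the helper function names, and the candidate filenames are assumptions; only the two instruction strings come from the description above.

```python
# Illustrative sketch of the prompt verification step (not the authors' code).
# Assumes the OpenAI Python SDK and Hugging Face Transformers are installed.
from openai import OpenAI
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

client = OpenAI()
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

def generate_verification_questions(user_prompt: str) -> list[str]:
    """Ask GPT-4 for numbered questions that verify each part of the prompt."""
    instruction = ("Generate visual questions that verify whether each part of "
                   "the prompt is correct. Number the questions. Prompt: " + user_prompt)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
    )
    questions = []
    for line in response.choices[0].message.content.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and ". " in line:  # keep "1. ..." style lines
            questions.append(line.split(". ", 1)[1])
    return questions

def answer_question(image: Image.Image, question: str) -> str:
    """Answer one visual question for one candidate image with BLIP-2."""
    prefix = ("Answer the given question. Don't imagine any contents "
              "that are not in the image. ")
    inputs = processor(images=image, text=prefix + question, return_tensors="pt")
    output_ids = blip2.generate(**inputs, max_new_tokens=20)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

prompt = "A young chef is cooking dinner for his parents."
questions = generate_verification_questions(prompt)
candidates = [Image.open(f"candidate_{i}.png") for i in range(1, 5)]  # hypothetical files
answers = {q: [answer_question(img, q) for img in candidates] for q in questions}
```

Keeping the parsed questions in a list keeps them aligned with the rows of the prompt verification table that the interface later presents.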
To help users quickly find which images do or do not adhere to the prompt, we use GPT-4 to summarize the responses to each question using the following prompt: "Below are the answers of four similar images to one visual question. Write one sentence summary that captures the similarities and differences of these results. The summary should fit within 250 character limit". When using GPT-4's chat completion API, we set the role of the system as "You are a helpful assistant that is describing images for blind and low vision individuals." The temperature value was set to 0.8. The summaries either indicate that all images have the same answer (e.g., "All images have a chef in the image"), or they alert users to differences (e.g., "Three images depict a young kid, while Image 4 depicts a young man."; "Three images show parents present in the image, while Image 2 does not.").
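A minimal sketch of this summarization call is shown below, assuming the OpenAI Python SDK and the per-question answers gathered in the previous sketch; the system role, user prompt, and temperature of 0.8 follow the description above, while the function name and message formatting are assumptions.

```python
# Illustrative sketch of the per-question summarization step (not the authors' code).
from openai import OpenAI

client = OpenAI()

SYSTEM_ROLE = ("You are a helpful assistant that is describing images "
               "for blind and low vision individuals.")
SUMMARY_PROMPT = ("Below are the answers of four similar images to one visual question. "
                  "Write one sentence summary that captures the similarities and "
                  "differences of these results. The summary should fit within "
                  "250 character limit")

def summarize_answers(question: str, per_image_answers: list[str]) -> str:
    """Summarize the four per-image answers to one question into a single sentence."""
    listing = "\n".join(f"Image {i + 1}: {a}" for i, a in enumerate(per_image_answers))
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.8,  # value reported above
        messages=[
            {"role": "system", "content": SYSTEM_ROLE},
            {"role": "user", "content": f"{SUMMARY_PROMPT}\nQuestion: {question}\n{listing}"},
        ],
    )
    return response.choices[0].message.content.strip()

# Example: summarize_answers("Are the parents present in the image?", ["Yes", "No", "Yes", "Yes"])
# might return "Three images show parents present in the image, while Image 2 does not."
```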
To enable screen reader users to easily access the answers to each question, we present the prompt verification results as a table including the prompt verification questions (rows, with the question in column #1), prompt verification summaries (column #2), and per-image prompt verification answers (columns #3-6) (Figure 4).

Using our prompt verification table, Vito reads the answer summaries to check if the images follow his prompt. He notices that the 4th image contains an older chef, so it does not apply to his article about teaching children how to cook. While Vito also realizes the 2nd image does not feature the chef's parents, he keeps the image in consideration as it may still apply to his article.
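The paper does not describe how the interface itself is implemented; the sketch below is one hypothetical way to emit the row and column layout described above as an HTML table whose header cells carry scope attributes, so that a screen reader can announce the question and the image number alongside each answer cell. The function name and input format are illustrative assumptions.

```python
# Hypothetical rendering of the prompt verification table layout (not the authors' UI code).
from html import escape

def render_verification_table(rows: list[dict]) -> str:
    """rows: [{"question": str, "summary": str, "answers": [str, str, str, str]}, ...]"""
    header_cells = ["Question", "Summary"] + [f"Image {i}" for i in range(1, 5)]
    header = "".join(f'<th scope="col">{escape(h)}</th>' for h in header_cells)
    body = ""
    for row in rows:
        # The question is a row header so screen readers repeat it per answer cell.
        cells = [f'<th scope="row">{escape(row["question"])}</th>']
        cells += [f"<td>{escape(cell)}</td>" for cell in [row["summary"], *row["answers"]]]
        body += "<tr>" + "".join(cells) + "</tr>"
    return ("<table><caption>Prompt verification</caption>"
            f"<thead><tr>{header}</tr></thead><tbody>{body}</tbody></table>")

html = render_verification_table([
    {"question": "Are the parents present in the image?",
     "summary": "Three images show parents present in the image, while Image 2 does not.",
     "answers": ["Yes", "No", "Yes", "Yes"]},
])
```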
4.2 Visual Content & Style Extraction
Generated image candidates often feature similarities or differences that are not present in the original prompt. For example, Vito's prompt "A young chef is cooking dinner for his parents" does not specify the style, such that the resulting images include three illustrations and one photo. To enable access to image content and style details that were not specified in the prompt, we extract the visual content and visual style of the generated image candidates. To surface content and style similarities and differences that are important for improving image generation prompts, we used text-to-image prompt guidelines [20, 42, 47] to inform our approach.

We first created a list of visual questions about the image based on existing prompt guidelines, i.e., prompt guideline questions. The prompt guideline questions consist of questions about the content of the image (subjects, setting, objects), the purpose of the image (emotion, likely use), the style of the image (medium, lighting, perspective, color), and an additional question about errors in the image to surface distortions in the generated images such as blurring or unnatural human body features (Table 1).

To answer our prompt guideline questions for each image, we answered 5 questions (setting, subjects, emotion, likely use, colors) using Visual Question Answering with BLIP-2, similar to our prompt verification approach. For Vito's candidates (Images 1-4), BLIP-2 answers:

What is the setting of the image? Kitchen / Kitchen / Kitchen / Kitchen
What are the subjects of the image? Father and children / Chef, kitchen, vegetables / Father, mother and son / Father, mother and son
What is the emotion of the image? Happy / Happy / Happy / Happy
Where would this image be used? On a website / A children's cookbook / In a cooking class / On a website
What are the main colors? Brown, blue, yellow / Black, white, red, green / Blue and white / Red, yellow, green

For our objects question, we used Detic [85], a state-of-the-art object detection model, with an open detection vocabulary and a confidence threshold of 0.3 to enable users to access all objects (e.g., spoon, pot, cup, sink, apron, bowl, tomato, lettuce, fork, knife, hat, and plate across Vito's candidates).

For the remaining questions covering medium, lighting, perspective, and errors, we answer the question for each image candidate by using CLIP [56] to determine the similarity between the image and a limited set of answer choices (similar to CLIP Interrogator [19]). To provide answers that could inform future prompts, we curated our answer choices for medium, lighting, and perspective from Midjourney's list of styles [20] and DALL-E's prompt book [47]. To address common image generation errors, we retrieved the answer choices for our errors question from prior work [18, 59]. We include the full list of answer choices in the Supplementary Material. For each question, GenAssist presents the top three answer choices with a similarity score between the answer choice embedding and the image embedding above a threshold of 0.18. For Vito's candidates (Images 1-4), the extracted styles are:

What is the medium of the image? Cartoon, storybook illustration / A stock photo / Vector art / Cartoon, storybook illustration
What is the lighting of the image? Natural lighting / Natural lighting / Natural lighting / Natural lighting
What is the perspective of the image? Medium shot / Centered shot / Medium shot / Medium shot
What are the errors in this image? Poorly drawn hands / None / None / None
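The sketch below illustrates the CLIP-based matching described above using the Hugging Face CLIP API. The checkpoint name and the abbreviated answer-choice lists are assumptions (the full curated lists are in the Supplementary Material); the 0.18 similarity threshold and the top-three selection follow the description above, and the Detic object detector is noted only in a comment because its setup is not sketched here.

```python
# Illustrative sketch of the CLIP-based style matching (not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

ANSWER_CHOICES = {  # abbreviated placeholders for the curated answer-choice lists
    "medium": ["a stock photo", "vector art", "a storybook illustration", "a cartoon"],
    "lighting": ["natural lighting", "spotlight", "studio lighting", "neon lighting"],
    "perspective": ["centered shot", "medium shot", "headshot", "landscape shot"],
    "errors": ["poorly drawn hands", "poorly drawn face", "blurry", "duplicated limbs"],
}

def top_style_answers(image: Image.Image, choices: list[str],
                      threshold: float = 0.18, k: int = 3):
    """Return up to k answer choices whose CLIP similarity to the image exceeds the threshold."""
    inputs = clip_processor(text=choices, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(0)  # cosine similarity per answer choice
    ranked = sorted(zip(choices, sims.tolist()), key=lambda pair: pair[1], reverse=True)
    return [(choice, sim) for choice, sim in ranked[:k] if sim > threshold]

# Objects come from a separate open-vocabulary detector (Detic in the paper),
# keeping only detections with confidence >= 0.3; that model is not sketched here.
image = Image.open("candidate_1.png")  # hypothetical filename
styles = {name: top_style_answers(image, choices) for name, choices in ANSWER_CHOICES.items()}
```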
To inform creators about unfamiliar visual style types, GenAssist provides the definition and the usage of each answer choice for visual style questions (Medium, Lighting, Perspective) by generating the description with GPT-4 and the prompt "Describe the definition and the usage of the following [QUESTION NAME] in one sentence: [STYLE NAME]". Similar to the prompt verification table, we present the prompt guideline results in a table format including the prompt guideline questions (rows, with the question in column #1), prompt guideline summaries (column #2), and per-image prompt guideline answers (columns #3-6). We further split the prompt guideline results into two tables to improve ease of navigation: the visual content table includes answers to the content and purpose questions, and the visual style table includes answers to the style and errors questions. Finally, users can ask their own questions at the bottom of either table, and GenAssist adds a row to the table by generating the answer for each image using BLIP-2 and the summary of answers using GPT-4.

Using the visual content table, Vito notices from the objects summary that Image 1 has more food items than Images 2-4. As the purpose of the article is partially to introduce children to more ingredients, he decides to remove Image 1 from consideration. Using the visual style table, Vito realizes that Image 2 is a photo, while the other images are illustrations. As Vito was initially searching for a photo, he notes he may want to further refine his prompt to get more photo results. Vito also wants to check if the images will match his blog, which is primarily black and white, so he adds a question about the background color ("What color is the background?"). GenAssist summarizes the answers: "Image 1 and Image 4 are light brown, Image 2 is black and Image 3 is blue." As Image 2 fits his article and includes a black background, he ranks Image 2 as his current top choice.
Figure 4: The GenAssist interface consists of screen reader accessible tables that enable users to flexibly gain more information about the content of interest.

Table 2: Accuracy (% of correctly predicted information) of the pipeline results (Prompt verification, Content, Style, and Errors) with 20 sets of images.

descriptions for 10 randomly selected image sets each. For each image set, the describers provided descriptions of each individual image, and the similarities and differences between the images. We provided describers with prompt guidelines [42], image description guidelines [2], an example set of descriptions created by GenAssist, and the prompt for each image set to inform their descriptions. Both describers spent 3.5 hours to create descriptions for the 10 sets of images, or around 21 minutes per image set.

We compared the coverage of GenAssist-generated descriptions to those generated by a baseline captioning tool (BLIP-2) and human describers. For comparison, we annotated the similarities and differences descriptions for all 20 sets of images and annotated the individual descriptions for 10 sets of images. We chose the 10 sets with the longest human descriptions to compare GenAssist with the highest quality descriptions. Because BLIP-2 cannot take multiple images as input to extract similarities and differences, we generated captions of the 4 images using BLIP-2, then prompted GPT-4 with the same prompt we used in our system to generate summary descriptions. We tallied whether the descriptions contained details about the image in each of our set of pre-defined visual information categories (Table 1). We counted only the correct information in the descriptions. One of the researchers annotated the descriptions and the other researcher reviewed the annotations. To compute the accuracy of the detailed visual information in GenAssist, one of the researchers examined the 20 sets of images with the three tables (prompt verification table, visual content table, and visual style table) and counted the number of correct and incorrect answers in each table.

5.2 Results
5.2.1 Coverage. GenAssist's comparison descriptions captured more similarities and differences than the human describers'. In the coverage of differences, GenAssist spotted more than twice the number of total differences than the human describers (4.55 vs. 2.25). The coverage of GenAssist's individual image descriptions was comparable to that of the human describers. When compared to the human-generated descriptions, GenAssist captured more information about the content and styles but revealed fewer image generation errors. For instance, one human describer specified in the comparison description "...All of the images have some AI generation error with fingers or clothing." While GenAssist and the baseline used the same GPT-4 prompt to extract the similarities and differences, the baseline's comparison description did not capture many differences.

5.2.2 Accuracy. Table 2 summarizes the results of the accuracy evaluation. Prompt verification, content, and style categories all achieved over 90% accuracy except for medium, perspective, and emotion. In the 80 images in the dataset, GenAssist only detected five images as having errors, and detected the correct error types in three of them. The most common errors made in our pipeline were from the perspective, medium, and error categories, which are all extracted using the CLIP score. For perspective and medium, the majority of the errors were due to CLIP matching images to common style expressions (e.g., natural lighting, centered-shot), which likely reflects the prevalence of these expressions in the training data. In the incorrect outputs of the errors question, GenAssist detected cartoon or sketch images as having 'poorly drawn faces' errors. One reason for the relatively low accuracy of the object detection results is that we empirically set the output threshold of GenAssist's object detection (Detic) to 0.3 to present diverse objects to users in addition to information about the main subject extracted by BLIP-2 in our pipeline.
Figure 5: Two image sets and the descriptions of the similarities and differences used in the pipeline coverage evaluation (each image set described by a different human describer). GenAssist captured more information in the similarities and differences caption than the human describers.

Table 3 (mean (σ) counts of correct details; Human / Baseline / GenAssist):
Similarities: Content 1.50 (0.61) / 1.65 (0.59) / 2.45 (1.10); Style 0.70 (0.80) / 0.00 (0.00) / 0.80 (0.83); Error 0.10 (0.31) / 0.00 (0.00) / 0.00 (0.00); Total 2.35 (0.83) / 1.65 (0.85) / 3.25 (1.29)
Differences: Content 1.50 (0.69) / 1.95 (0.39) / 2.35 (0.49); Style 0.65 (0.75) / 0.35 (0.49) / 2.20 (1.01); Error 0.05 (0.22) / 0.00 (0.00) / 0.00 (0.00); Total 2.25 (0.84) / 2.30 (0.93) / 4.55 (1.26)
Per-Image Descriptions: Content 1.71 (0.39) / 0.69 (0.10) / 1.71 (0.26); Style 0.71 (0.22) / 0.04 (0.07) / 0.68 (0.30); Error 0.05 (0.05) / 0.00 (0.00) / 0.01 (0.03); Total 2.47 (0.74) / 0.73 (0.33) / 2.41 (0.75)
Table 3: We compared the coverage of GenAssist-generated descriptions to those generated by a baseline captioning tool and human describers. GenAssist captured more similarities and differences than the human describers.

Prompt (S0): A young chef is cooking the dinner for his parents
Prompt (S1): George Washington and Abraham Lincoln shaking hands.
Prompt (S2): A video of an old lady dancing. Happy smile, cute granny. Security camera footage. On CCTV. Ultra realistic
Prompt (S3): Trending stock photo
Prompt (S4): Man sitting at his computer, home office, fireplace, Paint man like he was a crystal clear water
Figure 6: We selected two sets of images from Midjourney's community feed generated with a short prompt without detailed descriptions of objects or styles (S1, S3) and two sets with a long prompt with detailed descriptions of objects or styles (S2, S4). We selected long and short prompts to explore how users compared images when they are similar (long prompts) vs. dissimilar (short prompts).

prompts. We selected the two articles from the New York Times: 'Why Multitasking is Bad for You' and 'My Kids Want Plastic Toys. I Want to Go Green.' [67, 68]. The order of the interfaces and articles was counterbalanced and randomly assigned to participants. After each interface, we asked the participants to choose one image from the generated images and explain their reasoning. We also conducted a post-stimulus survey that included the following ratings: Mental Demand, Performance, Effort, Frustration, Usefulness of the caption, Satisfaction with the final image, and Confidence in posting the final image. All ratings were on a 7-point Likert scale. At the end of the study, we conducted a semi-structured interview to understand participants' strategies using GenAssist and the pros and cons of both GenAssist and the baseline.

The study was 1.5 hours long, conducted in a 1:1 session via Zoom, and approved by our institution's IRB. We compensated participants 50 USD for their time.

Analysis. We recorded the study video, user-generated prompts and images, and the survey responses. We transcribed the exit interviews and participants' spontaneous comments during the tasks and grouped the transcript according to (1) strategies of using GenAssist and (2) perceived benefits and limitations of our system.

6.2 Results
Overall, all participants stated they would like to use GenAssist rather than the baseline interface to create images in the future. Participants expressed that GenAssist would be immediately useful in their workflows: "This is usable out of the box!" [...] "I need access to this technology" (P14), "I'd even pay for this! I really need this" (P15). In particular, participants rated GenAssist to be significantly more useful for understanding the differences between images in both tasks (interpretation: M=1.50, Mdn=1.00 vs. M=3.58, Mdn=4.00; Z=-2.31; p<0.05; generation: M=1.92, Mdn=2.00 vs. M=4.33, Mdn=5.00; Z=-2.77; p<0.01) (Figure 7). For the interpretation task, participants reported significantly better performance (M=1.83, Mdn=2.00 vs. M=3.67, Mdn=3.00; Z=-2.47; p<0.05), significantly less frustration (M=1.75, Mdn=1.00 vs. M=3.50, Mdn=3.50; Z=2.46; p<0.05), and less effort (M=2.25, Mdn=2.00 vs. M=4.00, Mdn=4.00; Z=-2.00; p<0.05). For the generation task, participants rated that they were significantly more satisfied with the final image (M=3.17, Mdn=3.00 vs. M=5.00, Mdn=5.50; Z=-2.17; p<0.05). Significance was measured with the Wilcoxon signed-rank test.
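For reference, paired Likert ratings like these can be compared with a Wilcoxon signed-rank test as sketched below; the rating vectors are placeholders rather than the study data, and only the choice of test follows the description above.

```python
# Minimal sketch of the Wilcoxon signed-rank comparison using SciPy (placeholder data).
from scipy.stats import wilcoxon

# Paired 7-point ratings (1 = positive) from the same participants for one measure,
# e.g., usefulness for understanding differences in the interpretation task.
genassist_ratings = [1, 2, 1, 1, 2, 1, 2, 1, 3, 1, 2, 1]  # placeholder values
baseline_ratings = [4, 3, 4, 5, 3, 4, 2, 4, 5, 3, 4, 3]   # placeholder values

statistic, p_value = wilcoxon(genassist_ratings, baseline_ratings)
print(f"Wilcoxon signed-rank: statistic={statistic:.2f}, p={p_value:.3f}")
```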
Gaining a summary of image content. With GenAssist, across both tasks, all participants started by reading the summary table including the comparison description (summary of similarities and differences), as well as the per-image descriptions. Participants all stated that the summary table was helpful for understanding the images they generated, as P6 explained: "I cannot do without the summary. Highlighting the differences was very useful." In addition, participants noted that the summary table's per-image descriptions were valuable for understanding the images. For example, P19 mentioned "This is more like an audio description because I can make a very clear mental image!" and slowed down his screen reader pace to mimic the experience of listening to an audio description. P20 reported "I always thought that AI is not as capable of describing as humans, because usually alt-text generated by AI is short and doesn't capture much information. But reading this, I am rethinking AI's capabilities." P12 found the detailed descriptions particularly helpful when authoring rather than interpreting images: "The first table (comparison description table) is so comprehensive. When I'm authoring images I need more information than when I'm looking at what others uploaded." (P12).

Using the baseline, participants all initially read all of the information they had access to (the caption and objects) for each image. All participants mentioned the inconvenience of having short image captions for gaining an overview, especially when the generated images are similar to each other. For example, after reading the BLIP-2 caption of S4, P18 asked "Are they all same images?"

Selectively accessing additional information. While all participants accessed the summary table first, we observed multiple strategies of using additional information provided by GenAssist to understand the differences between the generated images. First, P9, P7, P16, P18, and P20 checked the information from all tables before making their decision. P20 mentioned "They are equally important but in different ways. If the generated images are different, the summary table would be sufficient. For similar ones, I'd have to go down the tables more." P16 noted "We never have too much information. All the details provided here matter to me". After checking all the tables, P18 and P20 revisited the summary table again to remember and organize all information. The other seven participants (P10-P12, P8-P15, P17, P19) checked the tables selectively. Participants' preferences reflected their prior experiences creating images. For instance, P7, who typically creates images using an SVG editor, prioritized the prompt verification table. He said "I detail more things in the prompt and want everything to be in the image, 'cause I am more used to programming-drawing." P13 skipped the style and errors table as he was not familiar with the concepts despite the definitions provided: "As a born blind person, most information in the visual attributes is not useful as it's hard to imagine those." Participants also mentioned that they liked that GenAssist provided the breakdown of the summary description into multiple tables. P16 described that GenAssist has "So much transparency because it provides access to intermediate tables that constitute the summary table, just like a [programming tool]! I can look at the inside of the models and see what they're doing."
Figure 7: Distribution of the rating scores for GenAssist and the baseline interface (1 = positive, 7 = negative) in the two tasks. Note that a lower value indicates positive feedback and vice versa. The asterisks indicate statistical significance as a result of the Wilcoxon test (p < 0.05 is marked with * and p < 0.01 is marked with **). In the interpretation task, GenAssist significantly outperformed the baseline interface in performance, effort, frustration, and usefulness for understanding the differences between images. In the generation task, GenAssist received significantly lower (i.e., more positive) ratings for usefulness in understanding the differences and for outcome satisfaction.

P10 and P11 both mentioned that they appreciated the order of the tables: "The summary [table] is the bigger picture. Then the tables go into the details. I also like that the prompt questions come first because they're important."

Participants also employed multiple strategies for navigating within the tables. Participants browsed through questions in the tables to identify questions they found to be important and skipped questions that were less important (e.g., not interested, or already appeared in the summary descriptions). We also identified multiple patterns of navigating within the tables. Participants checked all cells in a row when they found the table to be important. For instance, P11 checked the answers of all four images in the prompt verification table. In other cases, participants first checked the questions, then decided whether to read the row or skip to the next row. Participants skipped rows if the answers to the questions were already mentioned in the summary table, or if they were not interested in the question. For example, P8 skipped the medium, lighting, and perspective rows in the visual style & errors table and only attended to the error row. Sometimes, participants only checked the answer cells if the summary column highlighted the differences between the images and skipped to the next row if the summary stated mainly the similarities between the images. Participants stated that GenAssist's table format was easy to navigate. P19 noted the ease of navigation within the table: "I like having control with the tables. If the question or summary doesn't seem interesting, I can skip to the next row instead of reading all answers of four images."

Asking additional information. With the baseline, most participants (12 participants in the interpretation task, 9 participants in the generation task) asked follow-up questions to try to understand the images, while with our system participants rarely asked follow-up questions (1 participant in the interpretation task and none in the generation task). P16 was the only participant who asked additional visual questions with GenAssist after reading the table ("Is the data showed falling or rising?" and "What is the date of the x-axis?" for S3 in Figure 6). When asked about the reason for not asking any additional questions, P18 said "Looking at captions I already had a big picture so I didn't ask additional questions." P7 similarly reflected: "I like that [GenAssist] asks questions that I haven't thought of but are still important. The answers to the questions told me additional stuff about the images." In contrast, with the baseline interface, participants asked many additional visual questions. Because each image was presented separately, participants often asked the same question for each image to compare the answers. Most of the questions were about the objects detected, especially when the object was not mentioned in the caption or did not seem relevant to the setting (e.g., P11 asked "Where is the beachball in the picture?" after reading the object detection results of an image with a kitchen setting). P10, who experienced the baseline condition after GenAssist, reflected that "This one [Baseline] is not simply laid out for me. The previous one [GenAssist] is easy peasy presenting everything for me. And this one is 'Here you have to figure out.'"

Refining and Iterating the Prompt. In the generation task, none of the participants refined the prompt using the baseline, and five participants refined the prompt when using GenAssist (P9, P10, P13, P16, P17). Among the remaining 7 participants, 5 participants reported that they did not iterate as they were satisfied with the results, and 2 participants were unsure how to iterate on the prompt after realizing that the image generation model did not reflect some parts of the original prompt (P15, P20).

Participants often quickly made the decision to revise the prompt while reading the summary table and before they moved on to other tables. For instance, while generating an image for an article about multitasking, P10 first attempted to generate an image with the following prompt: 'A woman who is holding the iPhone is texting on it while she glances at another device which displayed some funny videos going on. She's in the kitchen trying to cook. it looks like the
GenAssist to support the images that text-to-image generation models currently support: content-driven photos or illustrations with simple structures. However, both text-to-image generation and GenAssist do not yet support images that are information-rich or densely structured, such as information visualizations [64, 65] or diagrams [3, 66]. As text-to-image generation improves, future research will explore extending GenAssist to complex graphics with text. For example, GenAssist could help creators recognize if their prompt-generated diagram contains the desired text (by integrating Optical Character Recognition), relationships, and perceptual qualities (e.g., legibility, saliency of important information).
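As one illustration of the OCR integration suggested above, the sketch below flags words from a prompt that do not appear in the text recognized from a generated diagram. It assumes the pytesseract and Pillow packages, the required-word list is a hypothetical input, and it illustrates a possible direction rather than a current GenAssist feature:

    # Sketch: report which requested words a generated diagram failed to render.
    # Assumes Tesseract OCR via pytesseract; the inputs are hypothetical.
    from PIL import Image
    import pytesseract

    def missing_prompt_text(image_path: str, required_words: list[str]) -> list[str]:
        ocr_text = pytesseract.image_to_string(Image.open(image_path)).lower()
        return [word for word in required_words if word.lower() not in ocr_text]

    # Example (hypothetical file and labels):
    # print(missing_prompt_text("diagram.png", ["Revenue", "Q3", "Forecast"]))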
Second, the descriptions that GenAssist is capable of providing are also limited by the capabilities of the pre-trained vision-language models [34, 56, 85]. For example, while GenAssist helped creators notice image generation errors such as omitted prompt details [36], distortions to human bodies [78], and objects placed illogically [80], some errors remained undetected. Also, GenAssist occasionally included hallucinations (e.g., missing or non-existent objects) in the descriptions. While these issues may be mitigated with improvements to text-to-image models (e.g., better aligning with human preferences [78]) and vision-language models (e.g., better composition reasoning [38], reducing hallucinations [7]), GenAssist could also learn what prompts are prone to generation errors and guide BLV creators in creating strong prompts.

Finally, while GenAssist's pipeline surfaced large differences between images (e.g., different objects, characters, expressions, or styles), its descriptions often missed smaller differences between images that were less likely to be described in training data captions (e.g., slightly different compositions or makeup styles). Thus, GenAssist is currently useful in the early stages of prompt iteration, where large differences between images remain. In the future, GenAssist could detect detailed changes by adding more detailed or domain-specific content and style questions, or integrating vision models that explicitly compare images [74].

Understanding Multiple Images. Creators in the formative study revealed that it is difficult to understand multiple images at the same time (D2. Understanding high-level image similarities and differences). To tackle this challenge, we designed GenAssist with three strategies: (1) providing an overview of similarities and differences between the generated image candidates, (2) progressively disclosing the information from high-level to low-level to give the user control over the level of detail received [23, 43, 50], and (3) presenting the descriptions in a table format so that users can easily navigate between images to compare them. Participants highlighted that not only these detailed summaries but also the ability to selectively gain information about the underlying questions were helpful in narrowing down their choices. For example, some participants prioritized the prompt verification table to assess if the image followed their instructions (D3. Assessing if images followed the prompt), and other participants used the content and style table to learn how to improve their prompts (D4. Accessing image details not specified by the prompt). In the future, GenAssist could support sorting or filtering images based on visual attributes to limit the number of images creators consider at once (e.g., sorting images based on prompt adherence or filtering images that have AI-generated distortions). GenAssist could also read image descriptions with multiple voice styles to help creators distinguish generation candidates.

GenAssist's ability to attend to multiple similar images and surface differences can be useful in broader contexts. Our study participants expressed interest in using GenAssist for comparing image search results or similar photos on social media. It can also help BLV people in decision-making situations based on visual information (e.g., online shopping, communicating with a design team in software development, selecting a photo from similar shots).

Implications for Visual Question Answering. Comparing GenAssist to our baseline of typical descriptions with visual question answering (VQA), all participants rated GenAssist as more useful for understanding differences between images, and creators asked fewer follow-up questions with GenAssist. GenAssist reduced follow-up questions by predicting visual questions based on the formative study and applying the questions to multiple images. Our predict-ask-summarize approach also reduced the requirement for reading individual question answers. Future VQA systems intended for real-world environments may benefit from our approach, as repetitive questions, "unknown unknowns", and complex visuals are likely.
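The predict-ask-summarize pattern described above can be sketched in a few lines. The helper callables below (an LLM question generator, a visual question answering model, and an LLM summarizer) are hypothetical placeholders standing in for the kinds of models used in our pipeline, not the exact GenAssist implementation:

    # Minimal sketch of a predict-ask-summarize loop over several images.
    from typing import Callable, Dict, List

    def predict_ask_summarize(
        prompt: str,
        images: List[str],
        generate_questions: Callable[[str], List[str]],         # hypothetical LLM step
        answer_question: Callable[[str, str], str],             # hypothetical VQA step
        summarize: Callable[[Dict[str, Dict[str, str]]], str],  # hypothetical LLM step
    ) -> str:
        # Predict: derive a shared question set up front instead of waiting
        # for the user to ask follow-up questions image by image.
        questions = generate_questions(prompt)
        # Ask: apply every question to every image so answers are comparable.
        answers = {
            image: {q: answer_question(image, q) for q in questions}
            for image in images
        }
        # Summarize: condense per-image answers into similarities and
        # differences so the user need not read each answer individually.
        return summarize(answers)

Sharing a single question set across all candidates is what makes the per-image answers directly comparable in the tables.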
Support in Creating Prompts. In the formative study, we distilled the need to support creating prompts (D1. Authoring prompts that specify content and style). While we do not directly support prompt creation, we designed our system to reveal visual content and styles based on prompt guidelines to inform users about details the model filled in. In the user study, participants cited that reading the tables in GenAssist helped inform their prompt iterations and learn about what styles to use. Prior work has explored using structured search for visual concepts for writing prompts [37, 39], and combining our system with such prior work is a promising avenue for future work. We are currently exploring suggesting content and styles for the prompt when the user specifies the context of image use, and new ways to help users add specificity to their prompt (e.g., a chatbot, as suggested in the formative study). In addition to text input, we can also consider multimodal input from users in the future, such as image prompts [54], sketch prompts [11, 82], or music prompts [55] to create an image for a music album cover, as desired by P6.

Supporting Creators with Different Visual Impairments. BLV creators' interest in color or style information (e.g., medium, lighting, angle) often depended on their prior experience with visuals and onset of blindness. GenAssist supports creators in selectively accessing description details, but in the future GenAssist will let creators control which details to filter out or prioritize. To support creators without knowledge of visual style, GenAssist could recommend popular styles given the image's intended use, provide style descriptions, or deliver style in another modality (e.g., sound [21], tactile interfaces). We will also improve GenAssist in the future to support users with remaining vision beyond providing descriptions. For example, GenAssist could provide descriptions based on the current zoom viewing window or support further visual edits to the generated images, as desired by P1.

Implications of GenAssist on Creativity. Text-to-image generation models have sparked conversations about their implications for creativity. For BLV creators, image generation can improve creative agency compared to existing approaches for creating or
selecting images. In our formative study, creators wanted to use image generation as it provided fewer limits over content and style than searching for images online and greater autonomy than asking a sighted person to create the image. GenAssist supports BLV creators in exercising creative control over generated images by letting creators examine image details to revise the prompt or make an informed selection. Compared to sighted artists who use generated images primarily as references [37], BLV creators often intend to use generated images directly. In the future, GenAssist will further support creative control through prompt-based editing [4].

Implications of GenAssist on Communication. We designed GenAssist to support the communication goals of BLV creators. BLV creators in our formative study aimed to create images to express their ideas to a broad audience and achieve self-expression. Images are particularly useful for capturing visual attention and communicating with sighted people who have difficulty reading text. For example, P4 generated an image of his family to share with his child. BLV creators also wanted to use GenAssist in the workplace and on digital platforms. As GenAssist exists in an ableist environment that prioritizes visual communication, there is a risk that GenAssist may cause sighted people to expect image-based communication from BLV people. Tools like GenAssist must be coupled with research and activism to make digital, workplace, and educational environments accessible (e.g., enabling non-visual communication and providing access to existing visuals). Our work also reveals that generated images themselves should be shared with descriptions, in addition to the prompt, as the prompt might not accurately reflect the image.

Generative AI for Accessible Media Authoring. Advances in large-scale generative models enable people to create new types of content, yet no existing research has explored people with disabilities as the users of these tools [28]. We see opportunities for generative AI models to broaden the type of content that people with disabilities can create. For example, our study participants mentioned that they are interested in using generative models for creating dynamic graphics like cartoons and videos. Similarly, generative models may be useful for people with motor impairments authoring visual media, or people with hearing impairments authoring music.

8 CONCLUSION

We created GenAssist, an accessible text-to-image generation system for BLV creators. Informed by our formative study with 8 BLV creators, our interface enables users to verify the adherence of generated images to their prompts, access additional image details, and quickly assess similarities and differences between image candidates. Our system is powered by large language and vision-language models that generate visual questions, extract answers, and summarize the visual information. Our user study with 12 BLV creators demonstrated the effectiveness of our approach. We hope this research will catalyze future work in supporting people with disabilities to express their creativity.

REFERENCES

[1] Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. 2019. Gradio: Hassle-free sharing and testing of ML models in the wild. arXiv preprint arXiv:1906.02569 (2019).
[2] AccessiblePublishing.ca. 2023 (accessed Apr 2, 2023). Guide to Image Descriptions. https://www.accessiblepublishing.ca/a-guide-to-image-description/
[3] David Austin and Volker Sorge. 2023. Authoring Web-accessible Mathematical Diagrams. In Proceedings of the 20th International Web for All Conference. 148–152.
[4] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. 2022. Text2LIVE: Text-driven layered image and video editing. In European Conference on Computer Vision. Springer, 707–723.
[5] Cynthia L Bennett, Jane E, Martez E Mott, Edward Cutrell, and Meredith Ringel Morris. 2018. How teens with visual impairments take, edit, and share photos on social media. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–12.
[6] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samuel White, et al. 2010. VizWiz: Nearly real-time answers to visual questions. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. 333–342.
[7] Ali Furkan Biten, Lluís Gómez, and Dimosthenis Karatzas. 2022. Let there be a clock on the beach: Reducing object hallucination in image captioning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1381–1390.
[8] Jens Bornschein and Gerhard Weber. 2017. Digital drawing tools for blind users: A state-of-the-art and requirement analysis. In Proceedings of the 10th International Conference on Pervasive Technologies Related to Assistive Environments. 21–28.
[9] Erin Brady, Meredith Ringel Morris, Yu Zhong, Samuel White, and Jeffrey P Bigham. 2013. Visual challenges in the everyday lives of blind people. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2117–2126.
[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[11] Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. 2009. Sketch2photo: Internet image montage. ACM Transactions on Graphics (TOG) 28, 5 (2009), 1–10.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[14] Facebook. 2021. How Facebook is using AI to improve photo descriptions for people who are blind or visually impaired. https://ai.facebook.com/blog/how-facebook-is-using-ai-to-improve-photo-descriptions-for-people-who-are-blind-or-visually-impaired/
[15] Olutayo Falase, Alexa F Siu, and Sean Follmer. 2019. Tactile code skimmer: A tool to help blind programmers feel the structure of code. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility. 536–538.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
[17] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3608–3617.
[18] https://github.com/mikhail-bot/. 2023 (accessed Apr 2, 2023). Stable Diffusion Negative Prompts. https://github.com/mikhail-bot/stable-diffusion-negative-prompts
[19] https://github.com/pharmapsychotic/. 2023 (accessed Apr 2, 2023). CLIP Interrogator. https://github.com/pharmapsychotic/clip-interrogator
[20] https://github.com/willwulfken/. 2023 (accessed Apr 2, 2023). Midjourney Styles and Keywords. https://github.com/willwulfken/MidJourney-Styles-and-Keywords-Reference
[21] https://huggingface.co/spaces/fffiloni/. 2023 (accessed Apr 2, 2023). Image to Music. https://huggingface.co/spaces/fffiloni/img-to-music
[22] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. 2023. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. arXiv preprint arXiv:2303.11897 (2023).
[23] Mina Huh, YunJung Lee, Dasom Choi, Haesoo Kim, Uran Oh, and Juho Kim. 2022. Cocomix: Utilizing Comments to Improve Non-Visual Webtoon Accessibility. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–18.
[24] Mina Huh, Saelyne Yang, Yi-Hao Peng, Xiang 'Anthony' Chen, Young-Ho Kim, and Amy Pavel. 2023. AVscript: Accessible Video Editing with Audio-Visual Scripts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
[25] Jiho Kim, Arjun Srinivasan, Nam Wook Kim, and Yea-Seul Kim. 2023. Exploring Chart Question Answering for Blind and Low Vision Users. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
[26] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[27] Hyung-Kwon Ko, Subin An, Gwanmo Park, Seung Kwon Kim, Daesik Kim, Bohyoung Kim, Jaemin Jo, and Jinwook Seo. 2022. We-toon: A Communication Support System between Writers and Artists in Collaborative Webtoon Sketch Revision. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–14.
[28] Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon, Jaemin Jo, Juho Kim, and Jinwook Seo. 2023. Large-scale text-to-image generation models for visual artists' creative works. In Proceedings of the 28th International Conference on Intelligent User Interfaces. 919–933.
[29] Mackenzie Leake, Hijung Valentina Shin, Joy O Kim, and Maneesh Agrawala. 2020. Generating Audio-Visual Slideshows from Text Articles Using Word Concreteness. In CHI, Vol. 20. 25–30.
[30] Cheuk Yin Phipson Lee, Zhuohao Zhang, Jaylin Herskovitz, JooYoung Seo, and Anhong Guo. 2022. CollabAlly: Accessible Collaboration Awareness in Document Editing. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems.
[31] Jaewook Lee, Jaylin Herskovitz, Yi-Hao Peng, and Anhong Guo. 2022. ImageExplorer: Multi-Layered Touch Exploration to Encourage Skepticism Towards Imperfect AI-Generated Image Captions. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–15.
[32] Jaewook Lee, Yi-Hao Peng, Jaylin Herskovitz, and Anhong Guo. 2021. Image Explorer: Multi-Layered Touch Exploration to Make Images Accessible. In Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility. 1–4.
[33] Jingyi Li, Son Kim, Joshua A Miele, Maneesh Agrawala, and Sean Follmer. 2019. Editing spatial layouts through tactile templates for people with visual impairments. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–11.
[34] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023).
[35] Junchen Li, Garreth W. Tigwell, and Kristen Shinohara. 2021. Accessibility of high-fidelity prototyping tools. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–17.
[36] Vivian Liu and Lydia B Chilton. 2022. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–23.
[37] Vivian Liu, Han Qiao, and Lydia Chilton. 2022. Opal: Multimodal Image Generation for News Illustration. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–17.
[38] Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, and Ranjay Krishna. 2023. CREPE: Can Vision-Language Foundation Models Reason Compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10910–10921.
[39] Shane McGeehan. 2023 (accessed Apr 2, 2023). Prompter. https://prompterguide.com/prompter/
[40] Microsoft. 2021. Seeing AI. https://www.microsoft.com/en-us/ai/seeing-ai
[41] Midjourney. 2023 (accessed Apr 2, 2023). Midjourney. https://www.midjourney.com
[42] Midjourney. 2023 (accessed Apr 2, 2023). Midjourney Prompt Guidelines. https://docs.midjourney.com/docs/prompts
[43] Meredith Ringel Morris, Jazette Johnson, Cynthia L Bennett, and Edward Cutrell. 2018. Rich representations of visual content for screen reader users. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–11.
[44] Hospital News. 2016. You are what you eat. https://hospitalnews.com/you-are-what-you-eat-why-nutrition-matters/
[45] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. 2015. DesignScape: Design with interactive layout suggestions. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 1221–1224.
[46] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[47] Guy Parsons. 2023 (accessed Apr 2, 2023). DALL-E 2 Prompt Book. https://dallery.gallery/the-dalle-2-prompt-book/
[48] William Christopher Payne, Alex Yixuan Xu, Fabiha Ahmed, Lisa Ye, and Amy Hurst. 2020. How blind and visually impaired composers, producers, and songwriters leverage and adapt music technology. In Proceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility. 1–12.
[49] Yi-Hao Peng, Jeffrey P Bigham, and Amy Pavel. 2021. Slidecho: Flexible Non-Visual Exploration of Presentation Videos. In The 23rd International ACM SIGACCESS Conference on Computers and Accessibility. 1–12.
[50] Yi-Hao Peng, Peggy Chi, Anjuli Kannan, Meredith Morris, and Irfan Essa. 2023. Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual Access. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
[51] Yi-Hao Peng, JiWoong Jang, Jeffrey P Bigham, and Amy Pavel. 2021. Say It All: Feedback for Improving Non-Visual Presentation Accessibility. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–12.
[52] Yi-Hao Peng, Jason Wu, Jeffrey Bigham, and Amy Pavel. 2022. Diffscriber: Describing Visual Design Changes to Support Mixed-Ability Collaborative Presentation Authoring. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–13.
[53] Venkatesh Potluri, Liang He, Christine Chen, Jon E Froehlich, and Jennifer Mankoff. 2019. A multi-modal approach for blind and visually impaired developers to edit webpage designs. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility. 612–614.
[54] Han Qiao, Vivian Liu, and Lydia Chilton. 2022. Initial Images: Using Image Prompts to Improve Subject Representation in Multimodal AI Generated Art. In Creativity and Cognition. 15–28.
[55] Yue Qiu and Hirokatsu Kataoka. 2018. Image generation associated with music data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2510–2513.
[56] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[57] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
[58] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
[59] Mr D Murahari Reddy, Mr Sk Masthan Basha, Mr M Chinnaiahgari Hari, and Mr N Penchalaiah. 2021. Dall-e: Creating images from text. UGC Care Group I Journal 8, 14 (2021), 71–75.
[60] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
[61] Anastasia Schaadhardt, Alexis Hiniker, and Jacob O Wobbrock. 2021. Understanding blind screen-reader users' experiences of digital artboards. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–19.
[62] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402 (2022).
[63] Athar Sefid, Prasenjit Mitra, and Lee Giles. 2021. SlideGen: An abstractive section-based slide generator for scholarly documents. In Proceedings of the 21st ACM Symposium on Document Engineering. 1–4.
[64] Ather Sharif, Olivia H Wang, Alida T Muongchan, Katharina Reinecke, and Jacob O Wobbrock. 2022. VoxLens: Making online data visualizations accessible with an interactive JavaScript plug-in. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19.
[65] Ather Sharif, Andrew M Zhang, Katharina Reinecke, and Jacob O Wobbrock. 2023. Understanding and Improving Drilled-Down Information Extraction from Online Data Visualizations for Screen-Reader Users. In Proceedings of the 20th International Web for All Conference. 18–31.
[66] Volker Sorge, Mark Lee, and Sandy Wilkinson. 2015. End-to-end solution for accessible chemical diagrams. In Proceedings of the 12th International Web for All Conference. 1–10.
[67] NY Times. 2023 (accessed Apr 2, 2023). My Kids Want Plastic Toys. I Want to Go Green. https://time.com/6126981/my-kids-want-plastic-toys-i-want-to-go-green-heres-a-fix/
[68] NY Times. 2023 (accessed Apr 2, 2023). Why Multitasking is Bad for You. https://time.com/4737286/multitasking-mental-health-stress-texting-depression/
[69] Iulia Turc and Gaurav Nemade. 2022. Midjourney User Prompts & Generated Images (250k). https://doi.org/10.34740/KAGGLE/DS/2349267
[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[71] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156–3164.
[72] Luis Von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 319–326.
[73] W3C Web Accessibility Initiative (WAI). 2022 (accessed Dec 12, 2022). Introduction to web accessibility. https://www.w3.org/WAI/fundamentals/accessibility-intro/
[74] Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B Chan. 2020. Compare and reweight: Distinctive image captioning using similar images sets. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 370–386.
[75] Ruolin Wang, Zixuan Chen, Mingrui Ray Zhang, Zhaoheng Li, Zhixiu Liu, Zihan Dang, Chun Yu, and Xiang 'Anthony' Chen. 2021. Revamp: Enhancing Accessible Information Seeking Experience of Online Shopping for Blind or Low Vision