
ImageInWords:

Unlocking Hyper-Detailed Image Descriptions

Roopal Garg¹, Andrea Burns¹, Burcu Karagol Ayan¹, Yonatan Bitton¹, Ceslee Montgomery¹, Yasumasa Onoe¹, Andrew Bunner¹, Ranjay Krishna¹,³, Jason Baldridge², and Radu Soricut¹
¹ Google Research, ² Google DeepMind, ³ University of Washington
https://github.com/google/imageinwords

Abstract. Despite the longstanding adage “an image is worth a thousand words,” creating accurate and hyper-detailed image descriptions for
training Vision-Language models remains challenging. Current datasets
typically have web-scraped descriptions that are short, low-granularity,
and often contain details unrelated to the visual content. As a result,
models trained on such data generate descriptions replete with missing
information, visual inconsistencies, and hallucinations. To address these
issues, we introduce ImageInWords (IIW), a carefully designed human-
in-the-loop annotation framework for curating hyper-detailed image de-
scriptions and a new dataset resulting from this process. We validate the
framework through evaluations focused on the quality of the dataset and
its utility for fine-tuning with considerations for readability, comprehen-
siveness, specificity, hallucinations, and human-likeness. Our dataset sig-
nificantly improves across these dimensions compared to recently released
datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models
fine-tuned with IIW data excel by +31% against prior work along the
same human evaluation dimensions. Given our fine-tuned models, we
also evaluate text-to-image generation and vision-language reasoning.
Our model’s descriptions can generate images closest to the original, as
judged by both automated and human metrics. We also find our model
produces more compositionally rich descriptions, outperforming the best
baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.

1 Introduction

Today’s state-of-the-art Vision-Language Models (VLMs) are trained using large, noisy web datasets. Datasets such as WebImageText [41], ALIGN [18], Concep-
tual Captions [47] and LAION [46] are crawled from the internet and rely on
alt-text associated with each image as an imperfect proxy description for its vi-
sual contents. Some alt-text captions only mention the geographical location of
where the photo was taken (e.g. “Europe”), or the camera model used to capture
the photo (e.g. “Canon EOS R6 Mark II”), or are SEO-specific (e.g., “keep calm
and carry on”). While data filtering and post-processing can remove noisy text,
alt-text is inherently ambiguous between conveying the content or the intent

Fig. 1: ImageInWords Seeded Annotation Framework. Humans enrich and refine out-
puts sequentially, building upon previous human and/or machine provided inputs. The
human annotation flow starts with fine-grained object captions in Task 1. They are used
as building blocks to help compose image-level hyper-detailed descriptions in Task 2.
The VLMs (in orange) are updated in an active learning loop to produce better object-
level captions and image-level descriptions as more annotated data becomes available.
Screenshots of the end-to-end annotation process and UI are in the Supplemental.

of the image [56]. Therefore, scraping image descriptions from the web as the
primary source of effective vision-language (VL) pairing is fundamentally flawed
and limits model capabilities [17, 29, 42, 48, 51].
With the goal of curating higher quality image-text data, recent works have
released dense human-written caption datasets (e.g., DCI [52], DOCCI [32])
or model-generated datasets (e.g., DAC [13], PixLore [5]). Both types have
their limitations, as using annotators without comprehensive guidelines results
in vague, subjective outputs that vary by human attention span, bias, and ef-
fort [6, 31, 36, 63]. In contrast, model-generated captions are cheaper to generate
but result in outputs that are incomplete and rife with hallucinations [11, 44].
In this work, we describe ImageInWords (IIW), a human-in-the-loop frame-
work for curating hyper-detailed and hallucination-free image descriptions, and
its resulting annotations. ImageInWords combines the irreplaceable quality of hu-
man annotators with seeded metadata from machine generations. The process
begins with object detectors first identifying individual object instances in the
image. Next, a VLM generates granular captions for each detected object which
seed our human annotation process. These seed captions may contain hallucina-
tions or lack object-level comprehensiveness and specificity. Our crowd workers
augment and fix the object-level captions to make them richer and hallucination
free to seed the next step. Next, we operate at image-level, where an image cap-
tion is generated by the VLM to seed our final image description. Crowd workers
now consume the image-level seed captions along with the object-level human

annotations to fill in contextual gaps missing from the existing image captions.
We design guidelines for crowd workers to attend to concepts beyond objects,
such as visual perspective, spatial arrangement, and human-object interactions.
This process is iterative to ensure a final high quality dataset (Fig. 1).
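Schematically, the seeded, human-in-the-loop flow above can be summarized in pseudocode. The sketch below is illustrative only: detect_objects, caption_crop, caption_image, human_edit, human_compose, and human_refine are hypothetical placeholders for the internal object detector, the PaLI-3 5B seeding model, and the crowd-worker tasks detailed in Sec. 3.

from dataclasses import dataclass
from typing import List

@dataclass
class ObjectAnnotation:
    label: str        # open-vocabulary object label
    bbox: tuple       # (x_min, y_min, x_max, y_max)
    description: str  # object-level caption/description

def annotate_image(image, detect_objects, caption_crop, caption_image,
                   human_edit, human_compose, human_refine, num_rounds=3):
    """Schematic IIW-style seeded, sequential annotation for one image."""
    # Task 1: seed (label, bbox) pairs with an object detector, then caption each crop.
    seeds = [ObjectAnnotation(label, bbox, caption_crop(image, bbox))
             for label, bbox in detect_objects(image)]
    # Crowd workers fix, merge, and augment the object-level seeds.
    objects: List[ObjectAnnotation] = human_edit(image, seeds)

    # Task 2: seed an image-level caption with the VLM, then compose the first
    # hyper-detailed description using the object-level annotations as context.
    description = human_compose(image, caption_image(image), objects)

    # Sequential refinement: later annotators augment/correct the previous output.
    for _ in range(1, num_rounds):
        description = human_refine(image, description, objects)
    return objects, description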
Following this process, we construct the IIW dataset containing 9018 images,
each with its hyper-detailed description. In comparison to other dense descrip-
tion datasets (Tab. 1), our dataset contains more information along multiple
dimensions: our descriptions have an average of 9.8 sentences, and 52.5 nouns,
28 adjectives, 5 adverbs, and 19.1 verbs per description. To evaluate description
quality, we perform a side-by-side (SxS) human evaluation against those from
human-generated datasets (DCI, DOCCI) and from those generated by GPT-4V.
Evaluators rated our descriptions as more comprehensive, specific, and human-
like, containing fewer hallucinations and better leading sentences (Tab. 3) at an
average of +66% (DCI and DOCCI) and +48% (GPT-4V).
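The per-description statistics above (sentences, nouns, adjectives, adverbs, and verbs, also summarized in Tab. 1) can be approximated with an off-the-shelf part-of-speech tagger. The sketch below uses spaCy purely for illustration; the paper does not state which tagger produced its counts, so exact numbers may differ.

import spacy
from collections import Counter

# Requires the small English pipeline: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def description_stats(text):
    """Approximate Tab. 1 style statistics for a single description."""
    doc = nlp(text)
    pos = Counter(tok.pos_ for tok in doc if not tok.is_space)
    return {
        "tokens": sum(1 for tok in doc if not tok.is_space and not tok.is_punct),
        "sentences": sum(1 for _ in doc.sents),
        # Whether proper nouns count toward NN is our assumption; drop PROPN if not.
        "nouns": pos["NOUN"] + pos["PROPN"],
        "adjectives": pos["ADJ"],
        "adverbs": pos["ADV"],
        "verbs": pos["VERB"],
    }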
To comprehensively evaluate the utility of our framework’s generated descrip-
tions, we use it as a fine-tuning dataset. We fine-tune a PaLI-3 5B model [8] to
successfully generate hyper-detailed descriptions. We conduct four evaluations to
compare the same model fine-tuned on our dataset versus other fine-tuning sets.
First, a battery of well-established readability metrics finds IIW to be superior to
DCI and DOCCI, when measuring either the annotations directly or fine-tuned
model-generated descriptions. Second, using these fine-tuned models, we gener-
ate descriptions on the Localized Narratives (LocNar) [39] dataset to create a
common pool of image-description pairs and run human SxS evaluations. We
find that IIW model-generated outputs are preferred by +31% compared
to models fine-tuned on prior work. Third, we show that images generated using
our model’s descriptions were considered closer reconstructions of the original image than images generated from other models’ descriptions. The generations were evaluated both by human evaluators and by CLIP. Fourth, we test the generated
descriptions’ compositionality by swapping the images in compositionality rea-
soning benchmarks with the generated description (ARO [66], SVO-Probes [15]
and Winoground [51]). Our trained model results in higher reasoning accuracy
compared to LLaVA-v1.5 and InstructBLIP.
Overall, our framework produces higher quality image description data that
serve as an effective fine-tuning dataset, and our evaluations along a dozen di-
mensions validate its utility. We posit that the level of detail attained in our an-
notations facilitates a strong new capability that can help advance VL research
and its applications; this level of descriptive detail is important for multiple
downstream applications, including generating descriptions for Text-to-Image
(T2I) models [4, 45, 65] and in applications for people with visual impairments.

2 Related Work

Image captioning has been studied for many years, starting with CNN and LSTM
encoder-decoder frameworks for generating generic captions [2, 16, 43, 55], to the
more recent Transformer-based VLMs which are evaluated on more challenging

Table 1: Dataset Statistics Comparing ImageInWords (IIW) to Prior Work. We include
the number of samples (i.e., number of captions/descriptions) and the average number
of tokens, sentences, nouns (NN), adjectives (ADJ), adverbs (ADV), and verbs (VB).
The statistics are noted as average counts.

Dataset      Sample Count  Tokens/Sentence  Tokens/Description  Sentences  NN    ADJ   ADV  VB
SVP [22]     19,561        11.9             68.5                5.7        17.1  6.7   1.1  5.0
LocNar [39]  873,107       15.7             41.0                2.6        10.7  1.6   0.4  3.5
DCI [52]     7,805         15.8             148.0               9.3        35.3  16.3  3.6  10.5
DOCCI [32]   14,647        19.2             135.7               7.1        34.0  16.6  2.7  9.6
IIW (ours)   9,018         22.1             217.2               9.8        52.5  28.0  5.0  19.1

captions [9,25,53] (e.g., VizWiz [14], NoCaps [1], TextCaps [49]). These datasets,
among many others, contain captions with an average of 15 words or fewer [1,
12, 14, 19, 23, 27, 30, 38, 38, 47, 49, 64] and may differ by caption grounding level
(e.g. whole image or region-level captions) or image domain (e.g. images taken
by people who are blind or images capturing text).
There are, however, few dense image description datasets. PixLore [5] pro-
posed using multiple vision-language datasets to generate more verbose captions
with BLIP-2 [25]. DAC [13] uses a machine-generated approach: pretrained lan-
guage models expand the original image caption and pretrained VLMs are used
to generate captions over smaller image regions. The resulting descriptions are
then used to fine-tune a VLM for better compositional reasoning. While
model-only approaches are cost effective and avoid the challenges of designing
instructions for crowd workers, they risk introducing hallucinations and sys-
tematic biases, which may not be easily mitigated. Using only crowd workers,
DOCCI [32] collects image descriptions that we later show can be considerably
improved. Closest to IIW is DCI [52], which uses human annotators to reach
denser image descriptions. DCI uses the SAM [20] object detector to generate
smaller regions to be described and then composes them into an overall image
description. In contrast to DCI, we seed our annotation pipeline with VLM gen-
erated outputs and allow crowd workers to update or correct every component
of the seeded information. The output is then sequentially refined over multiple
annotation rounds to produce a single coherent hyper-detailed annotation out-
put. In comparison to DCI’s “caption-extra” annotation, we collect significantly
better descriptions. In Tab. 1, we provide quantitative comparisons to prior work
designed for longer image captions.

3 ImageInWords Dataset Collection

The IIW dataset is composed of 9018 images (IIW-Train: 8573, IIW-Test: 445) that are sampled from a dataset built similarly to WebLI [9] and then human annotated. Details on the human annotator pool are provided in the Supplemental.
In Sec. 3.1, we briefly review the foundational guidelines we define for use with
crowd workers. We describe our annotation methodology, which consists of seeding and sequential description refinement, in Sec. 3.2. Finally, details on the types
of image-text annotations we collect (Fig. 3) are described in Sec. 3.3.

3.1 Annotation Guidelines


Our goal is to curate image descriptions rich enough to paint a vivid mental
picture that is closely aligned with the actual image. To reach this level of de-
tailed descriptions, composed in an articulate manner, we compile an extensive
set of guidelines for human annotators and iterate over them with multiple pilot
rounds. A high-level version of the guidelines is presented here; see the Supple-
mentary for a detailed version.
Annotators are asked to operate as if they are instructing a painter to paint
with their words and only include details that can be deduced from visual cues,
erring on the side of higher precision. To compose flowing, coherent descriptions,
unnecessary fragmentation of sentences should be avoided; annotators should
avoid the use of filler phrases like “in this image,” “we can see,” “there is a,” and
“this is a picture of ” since they are verbose and add no visual detail.
While describing the overall image, we instruct annotators to start with a
newspaper style TLDR (Too Long Didn’t Read; meant to serve as a succinct
summary) sentence that paints a very clear, high-level idea. Objects should be
described in the order of their saliency, while noting individual objects and re-
lationships in a well organized manner. Descriptions should include the overall
setting, background, and style, and the annotators should consider the camera
angle, overall composition, and any rendered text in the image. We also ask them
to pay special attention to attributes that relate to people, apparel, art pieces,
and locale specific attributes. When noting object attributes in the image, the
following are used as example features (but are not limited to): function, shape,
size, color, design, pattern, texture, material, condition, opacity, orientation, lo-
cation, relationship to other components/objects, and text written on objects.
Attributes that make the objects unique should be explicitly included.

3.2 Annotation Methodology


This section describes the seeded, sequential process employed in annotating the
IIW dataset.
Seeded Annotation Describing images in detail is a highly subjective and
complicated task. To ease some of the effort required here, we use PaLI-3 5B
outputs to seed the annotation process. This provides a starting point for crowd
workers, expediting human annotation compared to starting from scratch.
While VLMs have quickly improved in their ability to capture details about im-
ages, attempts to generate a consistent rich output still fall prey to hallucinations
and information recall issues. Starting from a seeded caption, our human anno-
tation pipeline is designed to ensure that VLM hallucinations can be corrected
and missing details filled in to obtain detailed descriptions.
An initial machine-generated caption and optional high-precision, domain-specific metadata (e.g., art style or title of a painting) provide a minimal quality

[Fig. 2 plots: (a) Description Token Count per Annotation Round; (b) Time (sec) per Annotation Round; (c) Jaccard-Similarity b/w Annotation Rounds in the Beginning; (d) Jaccard-Similarity b/w Annotation Rounds after Human-in-the-loop Learning]

Fig. 2: Effects of Sequential Annotation: Over annotation rounds, (a) token count
goes up as (b) time spent goes down with (c) higher agreement, measured by Jaccard
Similarity [61]. A very low agreement across rounds (1,3) indicates considerable edits
which annotators largely agreed upon in rounds (2,3). Round (1,2) provides the highest
mean token count gain of ∼30. (d) Over time, with a constant human annotator pool, annotators learn from one another via an implicit feedback loop, and the high agreement previously seen only between rounds (2,3) in (c) is now observed between rounds (1,2).

and coverage guarantee from the human outputs. As data is collected, the PaLI-
3 5B model is fine-tuned to produce better quality descriptions in an active
learning loop (as reflected by the loops in Fig. 1).
Sequential Augmentation In addition to seeding the IIW annotation pipeline
with VLM outputs to ease human annotation, we also improve the efficiency of
our framework with sequential image description augmentations. For this, hu-
mans augment on top of a previous crowd worker’s or VLM’s outputs instead
of starting from scratch each time in parallel. The seed information and annota-
tions from previous rounds are available to the annotators for reference. During
the annotation process, we observed (Fig. 2) that it is far more effective in terms
of both time and output to read and augment image descriptions than to write
from scratch. From Fig. 2 we see that if annotations were done in parallel, we
would have 3 competing outputs per image, each with its own style, perspective, strengths, and weaknesses, and each containing ∼170 words and taking ∼800 seconds. In the sequential process, by contrast, we get a single all-inclusive output per image that has been verified and augmented by three humans, with a +20% token count in -30% of the time. Higher Jaccard Similarity over rounds suggests higher
inter-annotator agreement on the output, which serves as a proxy for quality.
Finally, our framework has the benefit of injecting an implicit human-to-
human learning loop, as each human annotator has the opportunity to read
and learn from other perspectives across the annotation rounds, leading to im-
proved individual quality. This is evident from a ∼2x improved inter-annotator
agreement between rounds (1, 2); compare (c) and (d) in Fig. 2.

Fig. 3: IIW Annotation Tasks. Objects and their attributes are first individually an-
notated to note the salient objects and focus on coverage of their attributes in Task 1.
These outputs, along with a seed VLM caption, are passed to humans to build the ini-
tial image-level description. The initial caption is then human augmented and refined
in N sequential rounds to attain the final hyper-detailed description in Task 2.

3.3 Annotation Framework


Based on the above guidelines and methodologies, we present the IIW annotation
framework to annotate images across two tasks. The tasks are annotated sequen-
tially, seeded from VLMs or previous human annotations (Fig. 3). Each task can
have multiple sequential rounds of annotations, each building on top of another.
Annotation examples and UI screenshots are provided in the Supplementary.
Annotation Task 1: Fine Grained Objects and Attributes Similar to the
Visual Genome [23] dataset, we design this annotation task to capture a (label,
bounding box, object description) triplet per salient image object. An object’s
label is open vocabulary with no verbosity restrictions, and its description is fo-
cused on the object but additionally takes the context of the image into account.
The bounding box localizes where the object is in the image (Fig. 3 (left)). To
seed the data, we first used an internal object detection (OD) model to obtain
a list of (label, bounding box) pairs. Then, object captions are generated by
cropping the image to the object bounding box and generating a caption via a
periodically fine-tuned PaLI-3 5B. Our methodology is agnostic to which VLM,
OD (or image-segmentation) model is used.
From the seeded list of (label, bounding box, object caption), the annota-
tors are first asked to determine the salient objects and fix the list of (label,
bounding box) by editing, removing, adding or merging the object annotations
based on their accuracy, importance, and role in the overall image. They have
the freedom to update the labels to make them more accurate and specific. The
object captions are then updated to obtain a final list of (label, bounding box,
object description). By limiting the scope to specific individual objects, human
annotators can better focus and capture details more comprehensively. All of the

annotations are done using the guidelines in Sec. 3.1 as reference. More details
on the annotation UI, seeded inputs and outputs are in the Supplemental.
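As a concrete illustration of the Task 1 seeding step described above, the sketch below crops each detected bounding box and captions the crop. The detector and captioner arguments are hypothetical stand-ins (the paper uses an internal OD model and a periodically fine-tuned PaLI-3 5B), so treat this as a schematic rather than the released pipeline.

from PIL import Image

def seed_object_annotations(image_path, detector, captioner):
    """Task 1 seeding sketch: (label, bbox) pairs from an object detector,
    each paired with a caption generated on the cropped region."""
    image = Image.open(image_path).convert("RGB")
    seeds = []
    for label, (x0, y0, x1, y1) in detector(image):
        crop = image.crop((x0, y0, x1, y1))      # localize the object
        seeds.append({
            "label": label,                      # open vocabulary, editable by annotators
            "bbox": (x0, y0, x1, y1),
            "object_caption": captioner(crop),   # seed caption, to be fixed and augmented
        })
    return seeds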
Annotation Task 2: Overall Image Description Our second annotation
task is to formulate the final holistic hyper-detailed description using the guide-
lines from Sec. 3.1. Seeded data from Task-1 (detailed above), optional domain
specific metadata (e.g., art style of a painting, otherwise requiring domain ex-
pertise from humans), and a VLM seed image caption are used to hint and help
the annotators compose the overall image description.
The bulk of the annotation responsibility falls on the first annotator who
composes the initial description; note that the crowd worker annotation order
is randomly assigned per sample and the same annotator is not re-employed
across rounds for the same sample. This output is then refined and augmented
in sequential rounds to improve the quality of the output. Sequential annotation
helps mitigate subjectivity and quality drops as data is human verified across the
rounds. The annotators are encouraged to focus more on the augmentation efforts
and only remove things if they are obvious errors from previous rounds. They
are, however, free to re-frame existing information to incorporate new details. In our
data collection experiments, we started with 3 sequential annotation rounds and
monitored the n-gram Jaccard similarity between the outputs. With human-in-
the-loop learning, once a high round-over-round output similarity was achieved
(we used a 0.8 threshold), we reduced the number of rounds. Optionally, early
stopping support could be added to the annotation framework itself to make this
instance specific. We found our similarity threshold can be met between the first
two rounds, i.e., (1,2), (Fig. 2) suggesting a high individual-annotator quality.
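The round-over-round agreement check described above can be implemented with a simple n-gram Jaccard similarity. A minimal sketch follows; the n-gram order is left as a parameter since the paper does not specify it, while the 0.8 stopping threshold matches the text above.

def ngrams(text, n=1):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_similarity(a, b, n=1):
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

def needs_another_round(prev_round_text, curr_round_text, threshold=0.8):
    """Early-stopping check: stop adding rounds once consecutive outputs agree enough."""
    return jaccard_similarity(prev_round_text, curr_round_text) < threshold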

4 Experiments
We perform qualitative and quantitative experiments to evaluate the quality of
the IIW dataset and its utility for fine-tuning. We start the evaluation with
text-based automatic readability metrics in Sec. 4.1 and extend to human SxS
evaluations (defined in Sec. 4.2) to compare our human annotations to prior
work (e.g. DCI, DOCCI, GPT-4V) in Sec. 4.3.
In Sec. 4.4, we fine-tune separate PaLI-3 5B models on DCI, DOCCI and IIW
training splits, with their detailed human-authored text as target. Each model is
trained with an identical setup (∼40 epochs, learning-rate 0.0003, batch-size 32)
and using the generic input instruction: “Generate a detailed image description.”
Additional details on the fine-tuning setup are provided in the Supplemental.
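For reference, the stated fine-tuning settings can be collected in a small config object. This is only a record of the hyper-parameters quoted above; the PaLI-3 training code, optimizer, and schedule are not specified here, so anything beyond these fields would be an assumption.

from dataclasses import dataclass

@dataclass(frozen=True)
class FinetuneConfig:
    base_model: str = "PaLI-3 5B"
    epochs: int = 40                    # approximate, per Sec. 4
    learning_rate: float = 3e-4
    batch_size: int = 32
    instruction: str = "Generate a detailed image description."

# One config per fine-tuning dataset (DCI, DOCCI, IIW), identical except for the data.
configs = {name: FinetuneConfig() for name in ("DCI", "DOCCI", "IIW")}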
Existing text similarity metrics like BLEU [37] and ROUGE [26] have been
shown to correlate poorly with human judgement as they are heavily dependent
on n-gram overlaps, and thus ill-suited for long texts [24]. As such, to get reliable
results, IIW fine-tuned model outputs are compared with models fine-tuned with
prior work (DCI and DOCCI), and also GPT-4V outputs.
Finally, we quantify the richness of the IIW model outputs via two down-
stream evaluations. First, in Sec. 4.5, we use generated descriptions from DCI,
DOCCI, and IIW fine-tuned models to prompt a Text-to-Image (T2I) model for

Table 2: Readability Metrics on Human and Model Annotated Data. We include
ARI [57], Flesch Kincaid (FK) [58], Gunning Fog (GF) [59], and SMOG [60] metrics.
They approximate the grade level needed to comprehend the text; the results indicate
a more mature writing style in IIW human-authored and model-generated outputs.

           Human Authored               Model Generated
Dataset    ARI↑  FK↑   GF↑   SMOG↑      ARI↑  FK↑   GF↑   SMOG↑
DCI        5.8   5.7   8.1   8.1        2.9   3.7   6.2   6.9
DOCCI      7.5   7.1   9.5   8.7        6.4   6.6   8.7   8.2
IIW        10.4  9.5   11.8  11.5       9.3   9.0   11.3  11.7

image reconstruction and evaluate which descriptions result in higher fidelity generated images. Then, in Sec. 4.6, we quantitatively show how IIW models
can generate descriptions to aid in vision-language compositional reasoning.

4.1 Automatic Readability Measurements


Before we compare descriptions with human SxS, we use a suite of readability
metrics to quantify writing style differences between DCI, DOCCI, and IIW. We
run heuristics based readability metrics over both human-authored and model-
generated descriptions (see Sec. 4.4 for more on fine-tuned model evaluation)
representing each style, and present the results in Tab. 2. Each metric roughly
estimates the level of education needed to understand a piece of written text
using different units, e.g. education years or grade-level. While they are proxy
signals, a consistent pattern across all of them clearly indicates a more mature
and articulate writing style for IIW in comparison with the alternatives.
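As a concrete reference for how such scores are computed, the sketch below uses the open-source textstat package, which implements all four formulas. Whether the paper's numbers were produced with this exact package is an assumption; different implementations make slightly different sentence- and syllable-counting choices, so values may deviate from Tab. 2.

import textstat

def readability_scores(description):
    """Grade-level readability estimates, as in Tab. 2 (higher = more mature text)."""
    return {
        "ARI": textstat.automated_readability_index(description),
        "FK": textstat.flesch_kincaid_grade(description),
        "GF": textstat.gunning_fog(description),
        "SMOG": textstat.smog_index(description),
    }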

4.2 Side-by-Side (SxS) Evaluation Framework


We design our human SxS framework to evaluate on 5 metrics: Comprehensive-
ness, Specificity, Hallucinations, quality of the first few line(s) as a TLDR (Too
Long Didn’t Read; meant to serve as a succinct summary), and Human-Like. The
evaluation is done on a 5-point scale defined using “substantially better” (++) or
“marginally better” (+) ratings on both sides of a “neutral” (-). Higher numbers
indicate higher quality across each metric, and our tables report percentages for
ease of comparison. This setup is used to run IIW human-authored (Sec. 4.3) and
IIW fine-tuned model (Sec. 4.4) quality comparative evaluations. We emphasize
that this is an extremely challenging human annotation task, where per image,
two text pieces of 100+ words need to be evaluated across 5 metrics in a SxS
setting. On average, we observe each comparison takes 15-20 minutes. Details
on the annotation setup and UI are in the Supplemental.
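To make the reporting concrete, the sketch below tallies per-image 5-point ratings into the percentage columns used in Tabs. 3 and 5. The string encoding of the scale is our own illustration, not the released annotation format.

from collections import Counter

# 5-point scale: the left side is "substantially better" (++) or "marginally better" (+),
# "neutral" (-) sits in the middle, and the right side mirrors it for the other model.
SCALE = ("left++", "left+", "neutral", "right+", "right++")

def sxs_percentages(ratings):
    """Tally per-image SxS ratings for one metric into the percentage columns
    reported in Tabs. 3 and 5."""
    counts = Counter(ratings)
    total = sum(counts.values())
    return {level: round(100.0 * counts[level] / total) for level in SCALE}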

4.3 Human Annotation Quality Evaluation


To run a SxS experiment on human-authored description quality, we first need a
common pool of human annotated images. For this, we additionally annotate the

Table 3: Human SxS to Evaluate IIW Human-Authored Data. The eval reports percentages comparing data from prior work with data annotated by the IIW framework.

                   DCI Test                        DOCCI Test
Metric             DCI++  DCI+  -   IIW+  IIW++    DOCCI++  DOCCI+  -   IIW+  IIW++
Comprehensiveness  3      7     19  30    41       4        6       38  33    19
Specificity        5      3     4   20    68       3        2       8   22    65
Hallucinations     2      3     48  32    15       0        12      41  34    13
TLDR               3      0     3   20    74       1        4       11  30    54
Human-Like         1      1     14  25    59       1        0       30  46    23

DCI test set (112) and a comparable number of samples (100) from the DOCCI
test set with our IIW annotation framework. We thus have human-authored IIW
annotations for direct comparison on images in the DCI and DOCCI datasets.
Table 3 reports preference percentages for each human-authored test set on
our five metrics. Comparing IIW to DCI and DOCCI, Comprehensiveness is
higher by +61% and +42%, Specificity by +80% and +82%, Hallucinations
are lower by 42% and 35%, TLDR quality is higher by +91% and +79%, and
Human-Likeness improves by +82% and +68%, respectively. This indicates that
the IIW human-authored image descriptions on images from DCI and DOCCI
are considerably better than those originally published with prior work.
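For clarity on how these aggregate numbers follow from Tab. 3, the preference gain is the rating mass on the IIW side minus the mass on the prior-work side; this reading is our interpretation, but it reproduces every figure quoted above. For example, for Comprehensiveness on the DCI test set:

# Net preference = (IIW+ + IIW++) - (DCI+ + DCI++), using the Tab. 3 row (percentages).
row = {"DCI++": 3, "DCI+": 7, "neutral": 19, "IIW+": 30, "IIW++": 41}
net = (row["IIW+"] + row["IIW++"]) - (row["DCI+"] + row["DCI++"])
print(net)  # 61, i.e. the +61% Comprehensiveness gain over DCI quoted above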
To further quantify the quality of IIW human annotations, we compare with
GPT-4V outputs [33–35] in Tab. 5 (right). We use GPT-4V to generate image
descriptions on 100 IIW images. The descriptions are generated with the prompt
“Generate a detailed image description” and no other specifications. The results
from the Model-Human section of Tab. 5 show that we reach Comprehensive-
ness (+35%), Specificity (+53%), Hallucination (+59%), TLDR (+70%), and
Human-Likeness (+21%) improvements over GPT-4V outputs. Although GPT-
4V performs relatively better than the human-authored data from DCI and
DOCCI when compared to IIW annotations, we assess that considerable future
modeling efforts are needed for VLMs to reach IIW human-authored quality.

4.4 IIW Fine-tuned VLM Evaluation

After evaluating IIW human annotations, we turn to quantifying the impact of fine-tuning with IIW data versus fine-tuning with prior annotations. We begin
by directly evaluating model-generated outputs with automated text similarity
metrics, and then run additional human SxS for comparison. Using the standard
automatic metrics, Tab. 4 illustrates how fine-tuned models largely perform bet-
ter in replicating their own style. We also report CIDEr, BERTScore [67], and
BLEURT [40] metrics in the Supplemental.
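For reproducibility of the Tab. 4 style scores, a minimal sketch using the open-source sacrebleu and rouge-score packages is below; whether these exact implementations and tokenizations were used for the reported numbers is an assumption.

import sacrebleu
from rouge_score import rouge_scorer

def automatic_metrics(predictions, references):
    """Corpus BLEU-4 plus mean ROUGE-1/ROUGE-2 F1 over prediction/reference pairs."""
    bleu4 = sacrebleu.corpus_bleu(predictions, [references]).score
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
    r1 = r2 = 0.0
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)   # target first, prediction second
        r1 += scores["rouge1"].fmeasure
        r2 += scores["rouge2"].fmeasure
    n = len(predictions)
    return {"bleu-4": bleu4, "rouge-1": r1 / n, "rouge-2": r2 / n}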
We report these results simply to emphasize the limitations of these metrics
when measuring the quality of hyper-detailed image descriptions. Our main eval-
uation instead uses the same human SxS setup as in Sec. 4.2. We evaluate the
three fine-tuned models on a random sample of LocNar Eval images, which can

Table 4: Cross Dataset Automatic Metric Evaluation of Fine-tuned Models

           DCI Test (112)               DOCCI Test (5k)              IIW Test (445)
PaLI-ft    bleu-4  rouge-1  rouge-2     bleu-4  rouge-1  rouge-2     bleu-4  rouge-1  rouge-2
DCI        4.97    35.38    12.70       5.24    39.55    12.95       2.30    31.70    8.58
DOCCI      4.24    34.60    10.70       8.68    45.50    17.07       3.50    36.10    10.02
IIW        3.02    31.59    8.02        4.60    38.10    10.06       5.66    38.57    11.73

Table 5: Human SxS to Evaluate IIW Model Predictions. Model Generated compares
IIW to prior work DCI and DOCCI using PaLI-5B fine-tuned models and GPT-4V
outputs. Model-Human then compares the GPT-4V model to IIW human annotations.
In each block, left-side ratings favor the named baseline and right-side ratings favor IIW.

                Model Generated:            Model Generated:              Model Generated:            Model-Human:
                DCI vs. IIW (LocNar Eval)   DOCCI vs. IIW (LocNar Eval)   GPT-4V vs. IIW (IIW-400)    GPT-4V vs. IIW (IIW-400)
Metric          ++  +   -   +   ++          ++  +   -   +   ++            ++  +   -   +   ++          ++  +   -   +   ++
Comprehensive   7   10  24  32  27          5   22  42  26  5             21  29  36  10  4           3   10  39  29  19
Specificity     6   10  14  24  46          6   14  23  33  24            46  32  12  8   2           6   10  15  35  34
Hallucinations  12  21  43  11  13          9   25  39  21  6             22  29  23  20  6           0   6   29  34  31
TLDR            9   11  9   30  41          6   7   17  42  28            7   15  27  31  20          5   6   8   47  34
Human-Like      11  5   13  32  39          6   12  41  27  14            8   22  60  7   3           6   13  41  27  13

be considered as out-of-distribution for each of these fine-tuning datasets. The results mirror Tab. 3’s human-authored statistics more closely. We see IIW gains
over existing datasets (DCI, DOCCI) on Comprehensiveness (+42, +4)%, Speci-
ficity (+54, +37)%, TLDR (+51, +57)% and Human-Likeness (+55, +23)% with
a relatively small hallucination trade-off (-9, -7)%, dominated largely by marginally rated losses. Overall, compared to DCI and DOCCI, IIW model-generated out-
puts show a higher average preference from human judgement by +31%.
From Tab. 5 (middle), we see that the IIW PaLI-5B fine-tuned model has
clear room for improvement compared to GPT-4V, as expected given its 5B
size. It is worth noting that it competes well on the Human-Likeness writing-
style metric, and actually excels at learning the TLDR concept, which we built
as a distinct feature of our dataset.

4.5 Reconstructing Images with IIW Descriptions


To complement our direct qualitative analysis, we consider how model-generated
descriptions are useful for other downstream use-cases. We first consider how
IIW generated descriptions can empower T2I models to produce more controlled
and specific image reconstructions. For this study, we re-use the PaLI-5B (DCI,
DOCCI and IIW) fine-tuned VLMs from Sec. 4.4 to generate descriptions on 240
images from the LocNar eval set. We then split the image description for each
(original image, VLM) pair into sentences as units which are fed as cumulative
inputs (i.e., sentence 1, sentence 1 & 2, sentence 1 & 2 & 3...) to an Imagen
model variant [45]. By breaking up a description into its sentences, we aim to
study how IIW descriptions are higher quality at the same input unit length
compared to prior work. We ultimately evaluate ∼1000 generated images across

Table 6: T2I Reconstruction Rankings from Image Descriptions. The original image
is compared to generated images from cumulative sentence inputs on both relative
(Mean Rank Position) and absolute (CLIP similarity) metrics. For both, we limit the
comparisons to a rough estimate of the maximum token length of the T2I and CLIP models.

           Mean Rank Position ↓                CLIP Similarity ↑
PaLI-ft    1     1-2   1-3   1-4   1-5         1      1-2    1-3    1-4
DCI        2.05  2.06  1.95  2.00  1.88        0.844  0.852  0.855  0.850
DOCCI      1.74  1.79  1.83  1.84  1.86        0.853  0.862  0.865  0.855
IIW        1.63  1.69  1.62  1.66  1.66        0.861  0.867  0.870  0.868

Fig. 4: Example T2I Outputs and the Resulting Human Rankings. We show an exam-
ple output when the first sentence of the image description from DCI, DOCCI and IIW
PaLI-5B fine-tuned models are fed as input to the same T2I model. Richer information
in the first sentence from IIW makes the T2I model reconstruct the original image
more closely. Additional examples are provided in the Supplemental.

the varied input sentence unit lengths (over the 240 random LocNar eval images)
as a 3-way human ranking evaluation (Fig. 4) and report results in Tab. 6. To
assess image-reconstruction power, we also report the CLIP [41] similarity score
between the reconstructed image and the original image. The results indicate
that IIW’s detailed outputs consistently lead to better T2I reconstruction, with the best mean rank position and highest CLIP similarity regardless of the length of the input units.
Additional rank plots and examples are shared in the Supplemental.
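The cumulative-input protocol and the CLIP-based reconstruction check can be sketched as follows. The generate_image callable is a hypothetical placeholder (the Imagen variant used here is not public), the naive period-based sentence split is an approximation, and the named CLIP checkpoint is an assumption rather than the one behind Tab. 6.

import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def cumulative_prompts(description):
    """Sentence 1, sentences 1-2, sentences 1-2-3, ... as T2I prompts (naive split)."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    return [". ".join(sentences[:i + 1]) + "." for i in range(len(sentences))]

def clip_image_similarity(image_a, image_b):
    """Cosine similarity between CLIP image embeddings of two PIL images."""
    inputs = clip_proc(images=[image_a, image_b], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return torch.nn.functional.cosine_similarity(feats[0:1], feats[1:2]).item()

def reconstruction_scores(original_image, description, generate_image):
    # generate_image(prompt) is a hypothetical stand-in for the T2I model.
    return [clip_image_similarity(original_image, generate_image(p))
            for p in cumulative_prompts(description)]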

4.6 Compositional Reasoning with IIW Descriptions

We look to a second downstream evaluation to quantify the impact of our hyper-


detailed image descriptions. Specifically, we use IIW generated descriptions to aid
in vision-language compositional reasoning. Probing datasets ARO [66], SVO-
Probes [15], and Winoground [51] modify image captions to no longer match

Table 7: VL Compositional Reasoning Accuracy with Image Descriptions. We evaluate
whether richer IIW descriptions help distinguish the true matching image caption in the
ARO, SVO-Probes, and Winoground datasets. The COCO and Flickr30k Order subsets
of ARO are not reported due to a very high language-bias baseline of 98%.

                                 ARO [66]
Image Description Model          VG-A    VG-R    SVO-Probes [15]  Winoground [51]
None (Language Bias Baseline)    56.50   59.94   50.71            49.88
InstructBLIP-Vicuna-7B           83.99   62.73   89.35            65.25
LLaVA-V1.5-7B                    84.80   63.71   87.89            63.38
IIW PaLI-3 5B                    90.37   66.19   88.66            69.38

the paired image¹: changing visual attributes or relationships, swapping verbs, or shuffling image captions such that they contain the same words but reflect
different semantics. This is done to evaluate different types of vision-language
reasoning, e.g., visual attribute understanding or verb understanding.
In this experiment we evaluate if IIW descriptions can be used to distinguish
the real image caption from the incorrect negative caption in ARO, SVO-Probes,
and Winoground datasets using an LLM-only setup. We prompt PaLM2-340B [3]
to select which of the caption options is true given the image description (see the
Supplemental for the exact input prompt). This essentially replaces the image in
these datasets with a generated description; the amount the description is able
to boost accuracy on these compositional reasoning tests should correlate to
the description’s comprehensiveness and specificity. We compare IIW fine-tuned
models to two larger (7B) open source models: InstructBLIP-Vicuna-7B [10] and
LLaVA-V1.5-7B [28] in Tab. 7, with additional models in the Supplemental.
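Structurally, the evaluation loop looks like the sketch below. The prompt wording and the describe_image and llm callables are placeholders (the exact prompt is in the Supplemental, and PaLM 2 access is assumed); the key point is that the image is replaced by a generated description and the LLM must pick the caption consistent with it.

import random

def build_prompt(description, caption_a, caption_b):
    # Placeholder wording; the exact prompt used with PaLM 2 is in the Supplemental.
    return (
        "Image description:\n"
        f"{description}\n\n"
        "Which caption matches the described image?\n"
        f"(A) {caption_a}\n(B) {caption_b}\n"
        "Answer with A or B."
    )

def reasoning_accuracy(examples, describe_image, llm, seed=0):
    """examples: iterable of (image, true_caption, false_caption) triples.
    describe_image and llm are hypothetical callables (e.g. an IIW fine-tuned
    PaLI-3 5B model and PaLM 2). Option order is randomized to avoid position bias."""
    rng = random.Random(seed)
    correct = total = 0
    for image, true_cap, false_cap in examples:
        options = [true_cap, false_cap]
        rng.shuffle(options)
        answer = llm(build_prompt(describe_image(image), options[0], options[1]))
        picked = options[0] if answer.strip().upper().startswith("A") else options[1]
        correct += (picked == true_cap)
        total += 1
    return correct / total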
Our first baseline is the no-image condition, which simply asks an LLM which
image caption is more likely. This serves an important language-bias baseline (the
first row of Tab. 7), and quantifies whether the vision-language compositional
reasoning task really requires vision at all. Our results show that SVO-Probes
and Winoground have the lowest language bias (baseline performs nearly at
random). On the other hand, ARO visual genome attribution and relation sub-
sets are not quite at random baseline; we also note that we do not include the
Flickr30k or COCO Order ARO subsets, as the LLM can distinguish the true
caption at 98% accuracy without any image description.
When incorporating image descriptions, all models perform significantly bet-
ter than the language-bias baseline. Our IIW fine-tuned model results in the
best task performance for ARO Visual Genome Attribution and Relation (VG-
A, VG-R) and Winoground, with accuracy gains of nearly 34%, 6%, and 20%,
respectively. Moreover, we can further boost performance compared to the In-
structBLIP and LLaVA image captions: we improve reasoning accuracy by about
6%, 2%, and 4% compared to the best image description model-based baseline.
For SVO-Probes, we find smaller differences amongst the image description mod-
els, with IIW, InstructBLIP, and LLaVA within ∼1 point of each other.
¹ SVO-Probes has a negative image for each positive image-caption pair. The negative
images also have captions, so we use those in our experiments.

5 Data Release
We release a subset of human- and model-annotated IIW images and descriptions,
as well as human SxS results on Human Authored and Model-Human sourced
pairs of descriptions. The model generated descriptions may have hallucinations,
information recall losses, or non-human like writing style artifacts. By releasing
this subset along with human SxS judgements, we encourage the development
of new metrics and evaluation systems to detect them in an automated, scalable
manner. It also promotes fair comparison across methods in future work. The
set is released under a CC BY 4.0 license on GitHub.
Human Annotated We provide human-authored annotations from the IIW framework: IIW-400, a new eval dataset of 400 random images sampled from DOCCI-AAR [32], with full IIW Task 1 and Task 2 annotations (Sec. 3.3) and 100 human SxS results each against GPT-4V and IIW PaLI-5B models (including model predictions); DCI-Test, 112 images re-annotated with the IIW framework, along with human SxS results comparing them to DCI's original human annotations; and DOCCI-Test, 100 random images re-annotated with the IIW framework, along with human SxS results comparing them to DOCCI's human annotations.
Model Annotated We release 2.4k random images annotated by the IIW
PaLI-5B model (Tab. 4), comprising the IIW-400 set (as used for the human
SxS evaluation), 1k samples from the LocNar Eval set (from the OpenImages [21]
subset), and 1k samples from the XM3600 images [50].

6 Conclusion
In this work, we described ImageInWords (IIW), a new framework for rich, hyper-
detailed image descriptions. Our annotation guidelines and seeded sequential
annotation process lead to human authored descriptions that are strongly pre-
ferred over both prior work’s human annotations (+66%) and fine-tuned models
(+31%). Images reconstructed from IIW descriptions were ranked 1st more of-
ten regardless of how much of the image description was used, reflecting higher
saliency earlier and better overall quality. Our compositional reasoning evalua-
tion showed IIW descriptions to best contain fine-grained visual detail needed
to decipher true from false visual attributes and semantics, with accuracy gains
of up to 6% over our most performant baselines. Collectively, our results show
the quality and utility of IIW image descriptions as state-of-the-art.
In future work, we will explore the potential of hyper-detailed image descrip-
tions to improve tasks like image retrieval, visual question answering, synthetic
data generation, T2I fine-tuning, and automatic evaluation metrics to measure
these outputs. We are working on extending our careful and meticulous data
vetting process to ensure geodiverse images and continuing to improve the effi-
ciency of our data collection framework. Our current work focused on producing
detailed descriptions in English, and future iterations will expand this to more
languages with locale-specificity. Our goal is to make the annotation guidelines
holistic, reduce human effort and dependency in the annotation process, and
help shift the narrative from captions to descriptions.

References
1. Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D.,
Parikh, D., Lee, S., Anderson, P.: nocaps: novel object captioning at scale. In:
Proceedings of the IEEE International Conference on Computer Vision. pp. 8948–
8957 (2019)
2. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang,
L.: Bottom-up and top-down attention for image captioning and visual question
answering (2018)
3. Anil, R., Dai, A.M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri,
S., Taropa, E., Bailey, P., Chen, Z., Chu, E., Clark, J.H., Shafey, L.E., Huang,
Y., Meier-Hellstern, K., Mishra, G., Moreira, E., Omernick, M., Robinson, K.,
Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Abrego, G.H., Ahn, J., Austin,
J., Barham, P., Botha, J., Bradbury, J., Brahma, S., Brooks, K., Catasta, M.,
Cheng, Y., Cherry, C., Choquette-Choo, C.A., Chowdhery, A., Crepy, C., Dave,
S., Dehghani, M., Dev, S., Devlin, J., Díaz, M., Du, N., Dyer, E., Feinberg, V.,
Feng, F., Fienber, V., Freitag, M., Garcia, X., Gehrmann, S., Gonzalez, L., Gur-
Ari, G., Hand, S., Hashemi, H., Hou, L., Howland, J., Hu, A., Hui, J., Hurwitz,
J., Isard, M., Ittycheriah, A., Jagielski, M., Jia, W., Kenealy, K., Krikun, M.,
Kudugunta, S., Lan, C., Lee, K., Lee, B., Li, E., Li, M., Li, W., Li, Y., Li, J.,
Lim, H., Lin, H., Liu, Z., Liu, F., Maggioni, M., Mahendru, A., Maynez, J., Misra,
V., Moussalem, M., Nado, Z., Nham, J., Ni, E., Nystrom, A., Parrish, A., Pellat,
M., Polacek, M., Polozov, A., Pope, R., Qiao, S., Reif, E., Richter, B., Riley, P.,
Ros, A.C., Roy, A., Saeta, B., Samuel, R., Shelby, R., Slone, A., Smilkov, D., So,
D.R., Sohn, D., Tokumine, S., Valter, D., Vasudevan, V., Vodrahalli, K., Wang,
X., Wang, P., Wang, Z., Wang, T., Wieting, J., Wu, Y., Xu, K., Xu, Y., Xue, L.,
Yin, P., Yu, J., Zhang, Q., Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S.,
Wu, Y.: Palm 2 technical report (2023)
4. Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang,
J., Lee, J., Guo, Y., Manassra, W., Dhariwal, P., Chu, C., Jiao, Y., Ramesh, A.:
Improving image generation with better captions (2023), https://cdn.openai.com/papers/dall-e-3.pdf
5. Bonilla, D.: Pixlore: A dataset-driven approach to rich image captioning (2023)
6. Burghardt, K., Hogg, T., Lerman, K.: Quantifying the impact of cognitive biases
in question-answering systems (2019)
7. Chen, T., Saxena, S., Li, L., Fleet, D.J., Hinton, G.: Pix2seq: A language modeling
framework for object detection (2022)
8. Chen, X., Wang, X., Beyer, L., Kolesnikov, A., Wu, J., Voigtlaender, P., Mustafa,
B., Goodman, S., Alabdulmohsin, I., Padlewski, P., Salz, D., Xiong, X., Vlasic,
D., Pavetic, F., Rong, K., Yu, T., Keysers, D., Zhai, X., Soricut, R.: Pali-3 vision
language models: Smaller, faster, stronger (2023)
9. Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D.,
Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J.,
Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A., Bradbury,
J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B.K., Riquelme, C., Steiner, A.,
Angelova, A., Zhai, X., Houlsby, N., Soricut, R.: Pali: A jointly-scaled multilingual
language-image model (2023)
10. Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi,
S.: Instructblip: Towards general-purpose vision-language models with instruction
tuning (2023)

11. Dai, W., Liu, Z., Ji, Z., Su, D., Fung, P.: Plausible may not be faithful: Probing
object hallucination in vision-language pre-training (2023)
12. Desai, K., Kaul, G., Aysola, Z., Johnson, J.: Redcaps: web-curated image-text data
created by the people, for the people (2021)
13. Doveh, S., Arbelle, A., Harary, S., Herzig, R., Kim, D., Cascante-bonilla, P., Al-
fassy, A., Panda, R., Giryes, R., Feris, R., Ullman, S., Karlinsky, L.: Dense and
aligned captions (dac) promote compositional reasoning in vl models (2023)
14. Gurari, D., Zhao, Y., Zhang, M., Bhattacharya, N.: Captioning images taken by
people who are blind (2020)
15. Hendricks, L.A., Nematzadeh, A.: Probing image-language transformers for verb
understanding (2021)
16. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9,
1735–80 (12 1997). https://doi.org/10.1162/neco.1997.9.8.1735
17. Hsieh, C.Y., Zhang, J., Ma, Z., Kembhavi, A., Krishna, R.: Sugarcrepe: Fixing
hackable benchmarks for vision-language compositionality. Advances in Neural In-
formation Processing Systems 36 (2024)
18. Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y.,
Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning
with noisy text supervision (2021)
19. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to
objects in photographs of natural scenes. In: Proceedings of the 2014 conference
on empirical methods in natural language processing (EMNLP). pp. 787–798 (2014)
20. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything
(2023)
21. Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova,
A., Rom, H., Uijlings, J., Popov, S., Kamali, S., Malloci, M., Pont-Tuset, J.,
Veit, A., Belongie, S., Gomes, V., Gupta, A., Sun, C., Chechik, G., Cai, D.,
Feng, Z., Narayanan, D., Murphy, K.: Openimages: A public dataset for large-
scale multi-label and multi-class image classification. Dataset available from
https://storage.googleapis.com/openimages/web/index.html (2017)
22. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for gen-
erating descriptive image paragraphs. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 317–325 (2017)
23. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalan-
tidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual genome: Con-
necting language and vision using crowdsourced dense image annotations (2016)
24. Kryściński, W., Keskar, N.S., McCann, B., Xiong, C., Socher, R.: Neural text
summarization: A critical evaluation. arXiv preprint arXiv:1908.08960 (2019)
25. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models (2023)
26. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text
Summarization Branches Out. pp. 74–81. Association for Computational Linguis-
tics, Barcelona, Spain (Jul 2004), https://aclanthology.org/W04-1013
27. Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P.,
Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft coco: Common objects in context
(2015)
28. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)
29. Ma, Z., Hong, J., Gul, M.O., Gandhi, M., Gao, I., Krishna, R.: Crepe: Can
vision-language foundation models reason compositionally? In: Proceedings of the

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).


pp. 10910–10921 (June 2023)
30. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation
and comprehension of unambiguous object descriptions. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. pp. 11–20 (2016)
31. Marshall, C., Shipman, F.: Experiences surveying the crowd: Reflections on meth-
ods, participation, and reliability. In: Proceedings of the 3rd Annual ACM Web
Science Conference, WebSci 2013. pp. 234–243 (05 2013). https://doi.org/10.1145/2464464.2464485
32. Onoe, Y., Rane, S., Berger, Z., Bitton, Y., Cho, J., Garg, R., Ku, A., Parekh,
Z., Pont-Tuset, J., Tanzer, G., Wang, S., Baldridge, J.: DOCCI: Descriptions of
connected and contrasting images (2024)
33. OpenAI, :, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L.,
Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I.,
Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I.,
Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd,
M., Brakman, A.L., Brockman, G., Brooks, T., Brundage, M., Button, K., Cai, T.,
Campbell, R., Cann, A., Carey, B., Carlson, C., Carmichael, R., Chan, B., Chang,
C., Chantzis, F., Chen, D., Chen, S., Chen, R., Chen, J., Chen, M., Chess, B.,
Cho, C., Chu, C., Chung, H.W., Cummings, D., Currier, J., Dai, Y., Decareaux,
C., Degry, T., Deutsch, N., Deville, D., Dhar, A., Dohan, D., Dowling, S., Dunning,
S., Ecoffet, A., Eleti, A., Eloundou, T., Farhi, D., Fedus, L., Felix, N., Fishman,
S.P., Forte, J., Fulford, I., Gao, L., Georges, E., Gibson, C., Goel, V., Gogineni,
T., Goh, G., Gontijo-Lopes, R., Gordon, J., Grafstein, M., Gray, S., Greene, R.,
Gross, J., Gu, S.S., Guo, Y., Hallacy, C., Han, J., Harris, J., He, Y., Heaton, M.,
Heidecke, J., Hesse, C., Hickey, A., Hickey, W., Hoeschele, P., Houghton, B., Hsu,
K., Hu, S., Hu, X., Huizinga, J., Jain, S., Jain, S., Jang, J., Jiang, A., Jiang, R.,
Jin, H., Jin, D., Jomoto, S., Jonn, B., Jun, H., Kaftan, T., Łukasz Kaiser, Kamali,
A., Kanitscheider, I., Keskar, N.S., Khan, T., Kilpatrick, L., Kim, J.W., Kim, C.,
Kim, Y., Kirchner, H., Kiros, J., Knight, M., Kokotajlo, D., Łukasz Kondraciuk,
Kondrich, A., Konstantinidis, A., Kosic, K., Krueger, G., Kuo, V., Lampe, M., Lan,
I., Lee, T., Leike, J., Leung, J., Levy, D., Li, C.M., Lim, R., Lin, M., Lin, S., Litwin,
M., Lopez, T., Lowe, R., Lue, P., Makanju, A., Malfacini, K., Manning, S., Markov,
T., Markovski, Y., Martin, B., Mayer, K., Mayne, A., McGrew, B., McKinney, S.M.,
McLeavey, C., McMillan, P., McNeil, J., Medina, D., Mehta, A., Menick, J., Metz,
L., Mishchenko, A., Mishkin, P., Monaco, V., Morikawa, E., Mossing, D., Mu, T.,
Murati, M., Murk, O., Mély, D., Nair, A., Nakano, R., Nayak, R., Neelakantan, A.,
Ngo, R., Noh, H., Ouyang, L., O’Keefe, C., Pachocki, J., Paino, A., Palermo, J.,
Pantuliano, A., Parascandolo, G., Parish, J., Parparita, E., Passos, A., Pavlov, M.,
Peng, A., Perelman, A., de Avila Belbute Peres, F., Petrov, M., de Oliveira Pinto,
H.P., Michael, Pokorny, Pokrass, M., Pong, V., Powell, T., Power, A., Power, B.,
Proehl, E., Puri, R., Radford, A., Rae, J., Ramesh, A., Raymond, C., Real, F.,
Rimbach, K., Ross, C., Rotsted, B., Roussez, H., Ryder, N., Saltarelli, M., Sanders,
T., Santurkar, S., Sastry, G., Schmidt, H., Schnurr, D., Schulman, J., Selsam, D.,
Sheppard, K., Sherbakov, T., Shieh, J., Shoker, S., Shyam, P., Sidor, S., Sigler, E.,
Simens, M., Sitkin, J., Slama, K., Sohl, I., Sokolowsky, B., Song, Y., Staudacher, N.,
Such, F.P., Summers, N., Sutskever, I., Tang, J., Tezak, N., Thompson, M., Tillet,
P., Tootoonchian, A., Tseng, E., Tuggle, P., Turley, N., Tworek, J., Uribe, J.F.C.,
Vallone, A., Vijayvergiya, A., Voss, C., Wainwright, C., Wang, J.J., Wang, A.,
Wang, B., Ward, J., Wei, J., Weinmann, C., Welihinda, A., Welinder, P., Weng, J.,

Weng, L., Wiethoff, M., Willner, D., Winter, C., Wolrich, S., Wong, H., Workman,
L., Wu, S., Wu, J., Wu, M., Xiao, K., Xu, T., Yoo, S., Yu, K., Yuan, Q., Zaremba,
W., Zellers, R., Zhang, C., Zhang, M., Zhao, S., Zheng, T., Zhuang, J., Zhuk, W.,
Zoph, B.: Gpt-4 technical report (2023)
34. OpenAI: Gpt-4. https://openai.com/research/gpt-4 (2023), [Online; accessed 19-February-2024]
35. OpenAI: Gpt-4v(ision) technical work and authors. https://cdn.openai.com/contributions/gpt-4v.pdf (2023), [Online; accessed 19-February-2024]
36. Pandey, R., Purohit, H., Castillo, C., Shalin, V.L.: Modeling and mitigating
human annotation errors to design efficient stream processing systems with
human-in-the-loop machine learning. International Journal of Human-Computer
Studies 160, 102772 (2022). https://doi.org/10.1016/j.ijhcs.2022.102772, https://www.sciencedirect.com/science/article/pii/S1071581922000015
37. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. In: Isabelle, P., Charniak, E., Lin, D. (eds.)
Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics. pp. 311–318. Association for Computational Linguistics, Philadelphia,
Pennsylvania, USA (Jul 2002). https://doi.org/10.3115/1073083.1073135,
https://aclanthology.org/P02-1040
38. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazeb-
nik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer
image-to-sentence models. In: Proceedings of the IEEE international conference on
computer vision. pp. 2641–2649 (2015)
39. Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V.: Connecting
vision and language with localized narratives. In: ECCV (2020)
40. Pu, A., Chung, H.W., Parikh, A.P., Gehrmann, S., Sellam, T.: Learning compact
metrics for mt. In: Proceedings of EMNLP (2021)
41. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International conference on machine learning. pp.
8748–8763. PMLR (2021)
42. Ray, A., Radenovic, F., Dubey, A., Plummer, B.A., Krishna, R., Saenko, K.: Cola:
A benchmark for compositional text-to-image retrieval (2023)
43. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object
detection with region proposal networks. In: Cortes, C., Lawrence, N., Lee, D.,
Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Sys-
tems. vol. 28. Curran Associates, Inc. (2015), https://proceedings.neurips.cc/paper_files/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf
44. Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object halluci-
nation in image captioning (2019)
45. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour,
S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J.,
Norouzi, M.: Photorealistic text-to-image diffusion models with deep language un-
derstanding (2022)
46. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M.,
Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy,
S., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: Laion-5b: An open
large-scale dataset for training next generation image-text models (2022)

47. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In: Gurevych,
I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565. Association
for Computational Linguistics, Melbourne, Australia (Jul 2018). https://doi.org/10.18653/v1/P18-1238, https://aclanthology.org/P18-1238
48. Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E.,
Bernardi, R.: Foil it! find one mismatch between image and language caption.
In: Proceedings of the 55th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers). Association for Computational Lin-
guistics (2017). https://doi.org/10.18653/v1/p17-1024, http://dx.doi.org/10.18653/v1/P17-1024
49. Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: a dataset for image cap-
tioning with reading comprehension (2020)
50. Thapliyal, A.V., Pont-Tuset, J., Chen, X., Soricut, R.: Crossmodal-3600: A mas-
sively multilingual multimodal evaluation dataset (2022)
51. Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.:
Winoground: Probing vision and language models for visio-linguistic composition-
ality (2022)
52. Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano,
A.: A picture is worth more than 77 text tokens: Evaluating clip-style models on
dense captions (2023)
53. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need (2023)
54. Vedantam, R., Zitnick, C.L., Parikh, D.: Cider: Consensus-based image description
evaluation (2015)
55. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image
caption generator (2015)
56. Wikipedia contributors: Alt attribute — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Alt_attribute&oldid=1189330128 (2023), [Online; accessed 15-January-2024]
57. Wikipedia contributors: Automated readability index — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Automated_readability_index&oldid=1145735758 (2023), [Online; accessed 22-February-2024]
58. Wikipedia contributors: Flesch–Kincaid readability tests — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Flesch–Kincaid_readability_tests&oldid=1192056958 (2023), [Online; accessed 22-February-2024]
59. Wikipedia contributors: Gunning fog index — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Gunning_fog_index&oldid=1181089308 (2023), [Online; accessed 22-February-2024]
60. Wikipedia contributors: Smog — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=SMOG&oldid=1192815974 (2023), [Online; accessed 22-February-2024]
61. Wikipedia contributors: Jaccard index — Wikipedia, the free encyclopedia (2024), https://en.wikipedia.org/w/index.php?title=Jaccard_index&oldid=1196092673, [Online; accessed 24-January-2024]
62. Yarom, M., Bitton, Y., Changpinyo, S., Aharoni, R., Herzig, J., Lang, O., Ofek,
E., Szpektor, I.: What you see is what you read? improving text-image alignment
evaluation (2023)
63. Ye, A., Santy, S., Hwang, J.D., Zhang, A.X., Krishna, R.: Cultural and linguistic
diversity improves visual representations. arXiv preprint arXiv:2310.14356 (2023)
64. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual
denotations: New similarity metrics for semantic inference over event descriptions.
Transactions of the Association for Computational Linguistics 2, 67–78 (02 2014).
https://doi.org/10.1162/tacl_a_00166, https://doi.org/10.1162/tacl_a_
00166
65. Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A.,
Yang, Y., Ayan, B.K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H.,
Baldridge, J., Wu, Y.: Scaling autoregressive models for content-rich text-to-image
generation (2022)
66. Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why
vision-language models behave like bags-of-words, and what to do about it? (2023)
67. Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: Bertscore: Evaluating
text generation with BERT. In: 8th International Conference on Learning Repre-
sentations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net
(2020), https://openreview.net/forum?id=SkeHuCVFDr
7 Supplementary Material
7.1 Annotation Guidelines
We now present the full detailed annotation guideline used for IIW annotations.
Image descriptions should be composed such that they paint a vivid mental picture of the actual image in the mind of someone who hears the description with their eyes closed. To reach this level of detail in an articulate manner, we compiled an extensive set of annotation guidelines, which we iterated over with multiple pilot rounds.
The annotators are asked to operate as if they are instructing a painter to paint with their words, and to only include details that can be deduced from visual cues, erring on the side of higher precision. Unnecessary fragmentation of sentences should be avoided in favor of a flowing, coherent style, and filler phrases like "In this image," "we can see," "there is a," and "this is a picture of" should be dropped, since they add no visual detail and come at a cost of verbosity.
Objects form the Lego blocks of an image; interactions and spatial arrangements among them help form its context. In complex multi-object images with dense settings, noting each and every object independently can become cumbersome and highly dependent on the effort a particular human annotator puts in. To make this better defined and obtain consistent behavior across annotation outputs, we introduce the notion of salient objects. Key objects without which the image would lose its context and meaning are considered salient. This can include individual objects or combinations of them, depending on the role they play in the image; consider the following two cases as examples:

– Three people in the blurry background of an image, with the scene set inside
a coffee shop, who play no concrete role individually can be grouped as people
in the background instead of 3 individual people object annotations.
– Two people in the foreground and in-focus, engaged in a conversation in the
same scene. The two individuals are likely the focus of the image and hence
worth noting individually in detail as separate objects. This is likely what
the photographer was attempting to capture.

While annotating each of these salient objects in an image, the annotators should consider the following axes as reference (but not limit themselves to this list), paying special attention to features that make an object unique or salient:

– Function Purpose of the component or the role it plays in the image


– Shape Specific geometric shape, organic, or abstract
– Size Large, small, or relative size to other objects
– Color Specific color with nuances like solid or variegated
– Design/Pattern Solid, flowers, or geometric
– Texture Smooth, rough, bumpy, shiny, or dull
– Material Wooden, metallic, glass, or plastic
– Condition Good, bad, old, new, damaged, or worn out
– Opacity Transparent, translucent, or opaque

– Orientation Upright, horizontal, inverted, or tilted


– Location Foreground, middle ground, or background
– Relationship to other components Interactions or relative spatial ar-
rangement
– Text written on objects Where and how it is written, the font and its attributes, single/multi-line, or multiple pieces of individual text

Humans typically associate a set of default features with objects. Consider the following examples:

– A car by default is assumed to have four of each: tires, doors, and windows, and one of each: trunk, hood, steering wheel, and roof. Mentioning these separately is not that useful, as it adds no specific visual detail beyond the norm. However, if the car is a coupe, has a missing window, or has a door painted a different color than the rest of the body, i.e., a unique feature, then that is worth mentioning in the description since it adds specific visual value.
– The Golden Gate Bridge by default is orange. That being said, it does not hurt to include extra detail depending on the use case. If the annotators do not recognize the bridge as a famous, well-known entity, then it would make sense to include the color and additional attributes.

When composing the overall image description, start with a newspaper-style TLDR sentence that paints a very clear high-level picture. Describe the objects in order of their saliency, noting the descriptions of individual objects and their relationships in a coherent manner. Include the overall setting, background, and style, and consider:

– Overall composition Arrangement of the elements in the image, focal point, balanced, or asymmetrical
– Lighting Natural or artificial, light source
– Color palette Colors or how they interact with each other
– Texture Smooth or rough, shiny or dull
– Depth of field Entire image or only a portion of it is in focus, what effect
this has on the overall composition
– Subject matter Main subject of the image, other elements that are present,
how they relate to the subject matter
– Mood or feeling Overall mood or feeling of the image

Camera angle (i.e., the position of the camera in relation to the subject) is crucial, as it sets expectations for the level and kind of information in the description. The choice of camera angle can have a significant impact on the mood and meaning of a photograph. Different camera angles can be used to create different effects and convey different messages; e.g., the details of a close-up differ from those of a wide-angle shot. Examples of camera angles (see Figure 5):
Fig. 5: Camera Angles to Consider when Annotating Images. These are important because they set expectations for the level and kind of information in the image description.

– Eye level: The camera is positioned at the same level as the subject’s eyes.
This is the most natural and neutral camera angle.
– High angle: The camera is positioned above the subject. This angle can make
the subject appear smaller, weaker, or less important.
– Low angle: The camera is positioned below the subject, anywhere below the
eye line, looking up. This angle can make the subject appear larger, stronger,
or more important. Sometimes, it is even directly below the subject’s feet.
– Ground level: The camera is positioned at the ground level. This angle cap-
tures what is in the frame at ground level, that is, the feet, or maybe the
character lying on the ground.
– Dutch tilt: The camera is tilted on its axis. This angle can be used to create
a sense of unease or disorientation.
– Bird’s-eye view: The camera is positioned directly above the subject. This
angle can be used to show the subject’s relationship to their surroundings.
– Worm’s-eye view: The camera is positioned directly below the subject. This
angle can be used to create a sense of awe or wonder.
– Top-down view or Overhead shot: The camera is above the subject and
you’re taking the photograph downwards from straight above, and not at
any kind of angle. It is typically closer to the subject than a bird’s eye view
(see Figure 5 for comparison).

Some other terms that are sometimes used to describe camera angles and
depths:
– Close-up: A close-up is a photograph that is taken from a very close distance. Close-ups can be used to show details that would not be visible from a further distance.
– Medium shot: A medium shot is a photograph that shows the subject from
the waist up or from the knees up. Medium shots are often used to show the
subject’s body language and facial expressions.
– Long shot: A long shot is a photograph that shows the subject from a dis-
tance. Long shots can be used to show the subject’s relationship to their
surroundings.
– Full shot: A full shot is a photograph that shows the subject’s entire body.
Full shots are often used to show the subject’s height and stature.
– Over-the-shoulder shot: An over-the-shoulder shot is a photograph that is
taken from behind one person’s shoulder, showing the other person in the
foreground. Over-the-shoulder shots are often used to create a sense of inti-
macy or connection between the two people.
– Point-of-view shot: A point-of-view shot is a photograph that is taken from
the perspective of the subject. Point-of-view shots can be used to create a
sense of immersion in the scene.
When text is present, include details such as whether the text is on a single line or spread over multiple lines and, if multi-line, whether the lines are mutually aligned; the features of the font such as size, style, color, and orientation (e.g., vertical, horizontal, arched); casing (e.g., lower, upper, mixed); and attributes like italics, underlining, bold, quotation marks, and whether the text is clearly visible or blurred. Quote the words in the casing in which they are written.
If text is written on multiple lines, we should:

– Quote them as individual units that exist on the same line


– Mention their mutual alignment using references like vertically stacked, aligned to the left, etc.

For example, in Figure 6, the phrase (“Juice,” “ACROSS THE,” “Universe”) has the words “Juice” and “Universe” capitalized while “ACROSS THE” is all uppercase, and the components are aligned along a diagonal. Information on the font color, type, and shadow effect should be included. As another example from the same image, the words of the phrase (“FREE,” “ARCADE,” “GAMES”) are all uppercase, vertically stacked, and centrally aligned.
If you have a good idea of the font family and are confident, that would be
valuable to note.
When people are present, special care should be taken. The tone should be respectful to the subject and should not make assumptions or guesses about their gender, identity, ancestry, where they are from, sexuality, religion, etc. We emphasize that the descriptions should use objective, neutral, and fair language for related attributes and focus solely on the visual aspects. Consider the following axes with respect to such attributes:
– How much of their body is visible
Fig. 6: An Example where Quoting Text in a Detailed Manner can Enable Precise Reconstruction. In the multi-line phrase (“Juice,” “ACROSS THE,” “Universe”), the words “Juice” and “Universe” are capitalized while “ACROSS THE” is all uppercase, and all components are aligned along a diagonal. Information on the font color, type, and shadow effect should be included. In the phrase (“FREE,” “ARCADE,” “GAMES”), all words are uppercase, vertically stacked, and centrally aligned.

– Whether the face is fully visible


– Whether they are facing the camera or looking somewhere else
– Where and what they are looking at
– What the person is doing (standing, posing, sitting, running, playing a sport)
– What they are wearing. For each piece, note the clothing item name (dress, pants, shorts, gloves, shoes), color, pattern (plain, striped), length (if applicable)
– What they are carrying, details about that object (bag, purse, camera)
– Whether they are using any assistance device (wheelchair, cane)
– Whether they have any unique features like marks, tattoos, scars on their
body that are visible. If applicable, note the respective positions on their
body where each is present
– For professions with known gender biases like “nurse,” “doctor,” or “con-
struction worker,” explicitly include the gender (if clearly deducible) and do
not operate under the assumption that one gender is more common in that
profession.

For any apparel, the descriptions should focus on the overall style, unique details, silhouette of the garment, how it fits, fabric, color, shades, and tone. If branding is visible, it should be included, while attributes like size should be skipped unless visually verifiable.
Where applicable use locale specific names of objects like clothing (e.g.,
sherwani, kurta, kimono, saree), food (e.g., shawarma, dosa, paneer tikka) etc.
The aim is to capture the locale specific vocabulary so the downstream models
can pick them up instead of using generic abstract terms.
For art pieces, include art styles, time periods, mediums, moods, viewpoints,
subject matters, cultures as much as possible from the visual cues.

7.2 Dataset Collection

Human Annotation Worker Pool We employed and worked with a fixed human annotator pool comprising 20+ annotators with mixed backgrounds in creative writing, art, history, photography, and related relevant domains, in order to utilize critical domain expertise and perspectives. The pool is based in multiple countries, currently with a US majority. In the future, we plan to intentionally increase diversity in our annotator pool to ensure more locale-specific vocabulary in our image descriptions. The annotators were compensated appropriately, taking their skill set, qualifications, location, and the complexity of the task into account.
For text-to-image generation rankings, we employed an internal group of six people to rank the images generated from the different model-generated image descriptions (i.e., we did not hire crowd workers). The participants are domain experts familiar with text-to-image generation technology.

Annotation Methodology

Seeded Annotation Considerations to keep in mind:

1. Quality of the seeding data is critical. It is counterproductive if the seed is noisy, as the human annotators will take longer to comb the signal from the noise than to come up with the information themselves. We recommend restricting the seeding signal to high-precision information only.
2. Risk of biasing the outputs, as the human annotators may take the easy route of relying on the seed signal more heavily than intended. We suggest noting this point explicitly in the annotation guidelines and spot-checking the annotations for quality control. Additionally, running annotations with no seeding and comparing the outputs can help judge the bias being induced.

Sequential Augmentation Considerations to keep in mind:


Fig. 7: Human-in-the-Loop Learning. Over time, with a constant annotator pool, each annotator gets an opportunity to read and learn from others' perspectives via an implicit feedback loop. This has been shown to improve individual annotator quality, as shown in the main paper.

1. Heavy reliance on the quality of the base dense description from the first annotator. If the quality is not good, the annotator in the next round will spend considerable time fixing the input. There are two mitigating steps:
(a) Monitor this at the beginning of the annotation project, when the annotators are still new to the task, using metrics like edit distance, and provide explicit feedback to the annotators as needed.
(b) Annotators in each round have the option to start from scratch if they deem the quality from the previous round to be considerably low. Use this as feedback for the annotator from the previous round by presenting them the edited output to learn from.

Human-in-the-Loop Learning Our annotation framework implicitly unlocks a feedback loop for the annotators due to the sequential augmentation process discussed above. Each annotator gets an opportunity to read and learn from others' perspectives, which in turn improves their individual quality. As an example from Fig. 7, Annotator-1 gets an opportunity to learn from Annotator-3 on the first image and Annotator-2 gets an opportunity to learn from Annotator-1 on the second image.
Model-in-the-Loop Annotation We employ an active learning loop for the
VLMs where after some initial annotation data is available, a model version M1
can be trained over the base VLM to improve the seed description quality. As
more data gets annotated, M1 can be updated by M2 , M3 , ..., Mn to reduce the
human effort needed.
Fig. 8: IIW Annotation UI for Task-1. We illustrate the seed objects from an object-detection model and the VLM-generated object-level captions, which take the object-cropped image bytes as input.

Advantages:
1. Reduces the dependency on the human both in terms of number of rounds
and annotation time.
2. Provides a way to evaluate current model quality by monitoring the time,
volume and patterns of augmentations during the human annotation stage.
Some considerations to keep in mind:
1. As discussed above, the effectiveness relies very heavily on the capability of
the model, i.e., having high comprehensiveness and low hallucinations.

Annotation Framework We now discuss the annotation framework with concrete examples and UI illustrations:
Annotation Task-1: Fine Grained Objects and Attributes In Task-1, the human annotators are presented with seed annotations for the objects from an Object-Detection (OD) model and VLM-generated seed captions for each object (see Figure 8). The annotators then refine these to note the salient objects and their corresponding descriptions (see Figure 9).
Annotators can make the following augmentations to annotate salient objects:

– Edit make adjustments to the label and/or bounding box. This can include:
• Making the labels more specific, e.g., Animal to German Shepherd
• Enlarging or tightening the bounds of the bounding box by expanding
or contracting the seed box.
Fig. 9: IIW Annotation UI for Task-1. We illustrate the human augmented salient
objects and their human-authored descriptions. The annotations are built on seed
information from Figure 8. This example demonstrates how humans can alter the seed
annotations based on the annotation guidelines, which can include merging, deleting,
editing and adding new salient objects and then describing each.

– Remove any invalid pre-populated objects or considerably invalid bounding boxes.
– Add any missing salient object by drawing out a tight bounding box and
adding an appropriate fine-grained label to it.
– Merge if object(s) are fragmented and/or pre-populated as two or more objects; the annotators can remove the individual objects and create a new single object.
• Closely placed objects of the same/similar label/type which individually hold low value but can be described as a collection with higher context value should be combined, e.g., five identical cups lined up next to each other in an image do not need to be tagged as separate objects. If there are attributes that separate one or more of them from the others, we expect the annotators to split them into groups and proceed accordingly.
• Sub-components of a larger object should not be explicitly tagged unless there is something unique and/or worth mentioning about them. Consider whether missing this detail would create a different mental picture than the actual image; e.g., the doors, windows, or tires of a car can be omitted unless there is something unique about them, as they are standard expectations for a car object.
Fig. 10: IIW Annotation UI for Task-2 with seed VLM description. This VLM has
been fine-tuned in an active learning mode as data was collected iteratively. The seed
caption from the same VLM (PaLI-5B) without the IIW fine-tuning is “a pink bicycle
with a basket of flowers on it.” The seed annotation is then refined and augmented by
human annotators, see Figure 11 for the final resulting description.

For each (label, bounding box) pair, we ask the annotators to generate a detailed description focused on the object in the context of the image, considering the axes in Section 7.1 as reference.
Annotation Task-2: Overall Image Description In Task-2, human annota-
tors are presented with the annotations from Task-1 and a seeded VLM descrip-
tion (see Figure 10) which is then refined by human annotators in sequential
rounds to produce the final hyper-detailed description (see Figure 11).

7.3 IIW Fine-Tuning Tasks


We define seven tasks with the IIW Task-1 and Task-2 annotations to fine-tune two IIW-based VLM variants of PaLI-3 5B [8]. Our models include IIW Combined, trained on a mixture of all seven tasks, and the Task-2-based IIW Model, which is trained only on the final, most detailed image description output. The seven tasks can be grouped into three categories: image region, salient objects, and detailed description based tasks; see Figure 12 for an illustration.
As we later discuss, we generally find the IIW (Task-2 only) Model to be preferred over the IIW Combined variant, but we include details on the additional training tasks and resulting ablations here for completeness. All results in the main paper use the IIW Model.

Image Region Tasks Using one object at a time from the list of (label, bound-
ing box, description) Task 1 annotations, we perform three region-based tasks.
We use normalized bounding boxes in [ymin, xmin, ymax, xmax] format as in Pix2Seq [7]. Our first task is description-label grounding. In multi-object dense images, a label in itself is not enough to uniquely identify an object. Thus, we create a grounding task with (image, label, description) inputs, where the model is tasked to predict the corresponding normalized bounding box coordinates. Our second image region task is label prediction, in which we predict an open-vocabulary label for the object with input (image, bounding box). Lastly, we perform object description generation, which produces a description for each object in the image given (image, bounding box, label).

Fig. 11: IIW Final Annotation UI for Task-2. We illustrate the human annotations available from Task-1 as the human annotators hover over the salient objects in the image. The annotators can additionally switch between hiding all salient objects to view the image properly. Task-2 annotation starts with the seed caption from the VLM and is then refined by human annotators in sequential rounds, building on top of the previous round's output.
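The sketch below illustrates, under assumed string templates and rounding, how the Pix2Seq-style normalized boxes and the three region-task input/target pairs can be constructed from a Task-1 annotation; the label, description, and coordinate values are hypothetical.

```python
# A minimal sketch (illustrative, not the exact IIW training code) of the
# Pix2Seq-style box normalization and the three region-task input/target
# pairings described above. String templates and rounding are assumptions.

def normalize_box(ymin, xmin, ymax, xmax, height, width):
    """Normalize pixel coordinates to [0, 1] in [ymin, xmin, ymax, xmax] order."""
    return [round(ymin / height, 3), round(xmin / width, 3),
            round(ymax / height, 3), round(xmax / width, 3)]

label = "pink bicycle"  # hypothetical Task-1 annotation
object_description = "A pink road bicycle with a wicker basket mounted on its handlebars."
box = normalize_box(120, 40, 460, 380, height=512, width=512)

# 1) Description-label grounding: (image, label, description) -> box coordinates.
grounding_target = " ".join(f"{c:.3f}" for c in box)
# 2) Label prediction: (image, box) -> open-vocabulary label.
label_target = label
# 3) Object description generation: (image, box, label) -> object description.
description_target = object_description
print(grounding_target, label_target, description_target, sep="\n")
```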

Salient Objects Tasks Our next category of fine-tuning tasks concerns the
salient objects in an image. We target the aggregated list of (label, bounding
box) object features per image from Task 1. Our first task is label generation, in which, given an image, we aim to generate a text list of the salient object labels. The object labels are sorted alphabetically for consistency; in future work, ordering by saliency would be useful. Our second object-level task is grounded
label generation. The task is to generate the list of (label, bounding box) pairs
per object in the image; we similarly sort the list alphabetically with respect to
label name.
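As an illustration, the following sketch constructs the two salient-object targets from hypothetical Task-1 annotations; the serialization format (separators, coordinate precision) is an assumption, and only the alphabetical sorting is specified above.

```python
# A minimal sketch of the salient-object fine-tuning targets: an alphabetically
# sorted label list and a grounded (label, box) list. The objects and the
# serialization format are illustrative assumptions.
salient_objects = [
    {"label": "wicker basket", "box": [0.20, 0.30, 0.35, 0.55]},
    {"label": "pink bicycle", "box": [0.23, 0.08, 0.90, 0.74]},
]

# Label generation target: text list of salient object labels, sorted alphabetically.
label_target = ", ".join(sorted(obj["label"] for obj in salient_objects))

# Grounded label generation target: (label, bounding box) pairs, sorted by label.
grounded_target = "; ".join(
    f"{obj['label']} [{', '.join(f'{c:.2f}' for c in obj['box'])}]"
    for obj in sorted(salient_objects, key=lambda o: o["label"])
)
print(label_target)
print(grounded_target)
```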

Detailed Description Tasks Finally, our last fine-tuning tasks relate to the
sequentially annotated descriptions from Task 2. We perform description elab-
oration in addition to direct description generation. Given the image and the description from the Nth step, description elaboration trains the model to elaborate the current description into the final description. We also create synthetically cor-
rupted versions of the final description to serve as additional training samples.
Fig. 12: IIW based VLM Fine-tuning Tasks. We show tasks based on data collected
from Task-1 and Task-2 per the IIW annotation framework. Different tasks enable the
fine-tuning to focus on the image at (object, attribute), (image, objects) or (image,
hyper-detailed description) levels.

Specifically, we randomly drop X% of sentences. Sentences are dropped starting from the last sentence so that the structure of the overall text piece is maintained (as opposed to random sentence removal). For final description generation, given the image, a VLM learns to generate the final, most hyper-detailed description available from the entire annotation framework. This final task (and not description elaboration) is the only task used to train the IIW Model (whereas all tasks are used for the IIW Combined ablation).
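The following sketch illustrates the tail-sentence corruption used to create additional elaboration pairs; the naive sentence splitter, the fixed drop fraction, and the example description are illustrative assumptions rather than the exact preprocessing used.

```python
# A minimal sketch of the synthetic corruption for description elaboration:
# drop a fraction of sentences from the end so the overall structure of the
# text is preserved. The sentence splitter is intentionally naive.
def corrupt_description(description: str, drop_fraction: float) -> str:
    """Return a truncated description with the last `drop_fraction` of sentences removed."""
    sentences = [s.strip() for s in description.split(". ") if s.strip()]
    n_keep = max(1, len(sentences) - int(len(sentences) * drop_fraction))
    return ". ".join(sentences[:n_keep]).rstrip(".") + "."

final_description = (
    "A medium shot of a pink bicycle leaning against a wooden fence. "
    "A wicker basket is mounted on the handlebars. "
    "The basket is filled with red and yellow flowers. "
    "The background shows a blurred garden in natural daylight."
)
# (corrupted description, final description) forms one elaboration training pair.
print(corrupt_description(final_description, drop_fraction=0.5))
```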

7.4 Experiments
Automatic Readability Measurements Building on the discussion in the main paper around the automatic readability metrics, we study additional distribution-based readability metrics in Figure 13. The distributions further support the previous metrics, demonstrating a more mature writing style in both the IIW human-authored dataset and the fine-tuned model generated outputs.
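For reference, grade-level readability scores of this kind can be computed as in the sketch below, which uses the open-source textstat package; this is one possible tooling choice and not necessarily the exact implementation behind Figure 13.

```python
# A minimal sketch (illustrative tooling) for computing the readability
# metrics shown in Figure 13 on a single image description.
import textstat  # pip install textstat

def readability_profile(description: str) -> dict:
    """Return grade-level style readability metrics for one description."""
    return {
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(description),
        "flesch_reading_ease": textstat.flesch_reading_ease(description),
        "smog_grade": textstat.smog_index(description),
        "gunning_fog_grade": textstat.gunning_fog(description),
        "automated_readability_index": textstat.automated_readability_index(description),
    }

example = (
    "A medium shot of a pink bicycle leaning against a wooden fence. "
    "The bicycle has a wicker basket mounted on the handlebars, filled with "
    "red and yellow flowers."
)
for metric, value in readability_profile(example).items():
    print(f"{metric}: {value:.2f}")
```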

Side-by-Side (SxS) Evaluation Framework We demonstrate the Human SxS annotation UI, showing the input (see Figure 14) and the corresponding human responses (see Figure 15) across the 5 metrics, each on a 5-point scale. The metrics are defined as:
– Comprehensiveness: The description should capture all of the important el-
ements of the image, including objects, people, locations, actions, relation-
ships between objects, etc.
Fig. 13: Distribution-based Readability Metrics (Flesch-Kincaid Grade Level, Flesch Ease, SMOG Grade Level, Gunning Fog Grade Level). (a) Distribution on the Human Authored Datasets from DCI, DOCCI and IIW. (b) Distribution on the Fine-tuned Model Generated Outputs from DCI, DOCCI and IIW. We compare both human authored and model generated outputs from IIW and prior work to show the distribution of education-based units reflected in the writing style. IIW outputs from both the human annotators and the model produce a more mature style across the metrics.

– Specificity: The description should use precise and descriptive language to avoid vagueness and ambiguity, e.g., "3 apples" and "Taj Mahal" are more specific than "some apples" and "a white marble structure," respectively.
– Hallucinations: The description should be factually correct and avoid making
assumptions or interpretations that are not visually supported by the image.
– First few line(s) as tldr: The first few line(s) should paint a high level picture
of what to expect in the image and create a succinct summary.
– Human-Like: The descriptions should feel as if an educated person wrote
them and be free from artifacts hinting that a machine generated them (e.g.
stuttering, repeating facts, fragmented chain of thought, etc.).

The 5 metrics are defined to capture 3 broad umbrella metrics of precision, recall, and writing style. An overall metric score can further be computed by taking an average of the 3 umbrella metrics. Each can be defined as follows:

Recall = avg(Comprehensiveness, Specificity)
Precision = Hallucination
Writing Style = avg(TLDR, Human-Like)
Overall = avg(Recall, Precision, Writing Style)
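A minimal sketch of this aggregation is given below; how the 5-point SxS ratings are mapped onto numeric per-metric scores is left open and is an assumption of the sketch, while the averaging structure follows the formulas above.

```python
# A minimal sketch of the umbrella-metric aggregation defined above.
from statistics import mean

def overall_sxs_score(comprehensiveness, specificity, hallucination, tldr, human_like):
    recall = mean([comprehensiveness, specificity])
    precision = hallucination
    writing_style = mean([tldr, human_like])
    return {
        "recall": recall,
        "precision": precision,
        "writing_style": writing_style,
        "overall": mean([recall, precision, writing_style]),
    }

# Example with per-metric scores already mapped onto a numeric scale (assumed).
print(overall_sxs_score(0.8, 0.9, 0.7, 0.85, 0.75))
```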
Fig. 14: Human SxS Annotation UI. Annotators are shown the input image and two
input image descriptions to evaluate side-by-side. The input descriptions could be from
any combination of (human, model) sources. This information is not shared with the
annotators and the sources are randomly flipped and marked as A or B to prevent any
source or order based bias.

IIW Fine-tuned Model Ablations As an IIW ablation study, we fine-tune a separate PaLI-5B model, IIW Combined, using all the data from Task 1 and Task 2 as a mixture of the 7 training tasks defined in Section 7.3. Table 9 shows that this yields no clear significant gains on Task-2's final description eval set. This currently remains a less explored area, and we aim to investigate it in future work to further improve the model on Task-2 evaluations.

Additional Automatic Metrics In addition to reporting BLEU-4, ROUGE-1, and ROUGE-2 automatic metrics, we now include CIDEr [54], BERTScore [67], and BLEURT [40] metrics in Table 10. We include BERTScore and BLEURT as they are newer, model-based metrics which have been shown to correlate more closely with human judgements. CIDEr, like the BLEU and ROUGE metrics, is not limited by sequence length. BERTScore and BLEURT have a maximum sequence length of 512 (we specifically use the “wwm_cased_L-24_H-1024_A-16” BERT checkpoint and the latest BLEURT-20 model), but our descriptions likely fit under this maximum length, with only outliers being truncated.
CIDEr and BERTScore generally show the same trend of each fine-tuned
model performing best on the same test domain (i.e., DCI fine-tuned models
perform best on DCI test set, DOCCI models perform best on DOCCI test
set, and so on). One anomaly occurs with CIDEr on the DCI test set, where
PaLI models fine-tuned with DOCCI slightly outperform the DCI trained model
(4.91 versus 4.57). Given how low the metric values are, these differences may not be significant. When evaluating the DCI, DOCCI, and IIW test sets with BLEURT, we instead find a slight preference for IIW models: across all three datasets, BLEURT shows that PaLI-IIW variants perform better than or comparably to the model fine-tuned on the same-domain data. Thus, newer metrics may reveal that IIW fine-tuned models

Fig. 15: Human SxS Annotation UI responses for the input image and two image description pairs (see Figure 14). The annotators respond to the 5 metrics independently on a 5-point scale. They are additionally asked to justify their choices, which can be used to sanity check and perform quality sweeps.

Reconstructing Images with IIW Descriptions For reconstructing images sentence-by-sentence, we fed the T2I model the first sentence, the first two sentences, the first three sentences, etc. as prompts from each of the three datasets

Table 8: Human SxS to Evaluate IIW Fine-tuned PaLI-3 5B Model Predictions when
compared to IIW Human-Authored Data on IIW-400 using 100 samples.

IIW-400
Metric              IIW-Human ++   IIW-Human +   -   IIW-Model +   IIW-Model ++
Comprehensiveness 40 43 12 4 1
Specificity 79 14 5 2 0
Hallucinations 6 46 33 17 4
TLDR 29 43 14 10 4
Human-Like 27 32 34 6 1
Table 9: Ablation Results Comparing IIW Variants on Automatic Metrics

PaLI-ft            DCI Test (112)                  DOCCI Test (5k)                 IIW Test (445)
                   bleu-4  rouge-1  rouge-2        bleu-4  rouge-1  rouge-2        bleu-4  rouge-1  rouge-2
IIW 3.02 31.59 8.02 4.60 38.10 10.06 5.66 38.57 11.73
IIW Combined 2.95 30.63 7.30 4.76 38.25 10.48 5.40 37.64 11.62

Table 10: Additional Automatic Metric Results. We report CIDEr, BERTScore (re-
ferred to as BERT in table due to space), and BLEURT metrics for all fine-tuned
models. We compare DCI, DOCCI, IIW, and IIW Comb. (Combined).

PaLI-ft            DCI Test (112)               DOCCI Test (5k)              IIW Test (445)
                   CIDEr  BERT  BLEURT          CIDEr  BERT  BLEURT          CIDEr  BERT  BLEURT
DCI 4.57 0.60 0.41 4.71 0.61 0.42 0.75 0.56 0.40
DOCCI 4.91 0.58 0.39 11.09 0.65 0.45 2.40 0.59 0.41
IIW 1.87 0.56 0.41 4.52 0.59 0.46 4.04 0.61 0.45
IIW Comb. 0.61 0.56 0.43 4.15 0.59 0.46 1.77 0.60 0.46

(DCI, DOCCI and IIW). Figure 16 showcases the prompts and the T2I model
outputs from three descriptions along with the original image.
We then asked human annotators to rank the generated images by how sim-
ilar they are to the original image. The image most similar to the original image
is ranked number 1. We allowed generated images to be ranked the same if they
are very similar. Figure 17(a) shows the reconstruction rank counts over all cumulative sentence chunks, and Figure 17(b) shows the rank counts when we use sentence 1; sentences 1 and 2; sentences 1, 2, and 3; and sentences 1, 2, 3, and 4. Sentences
from IIW descriptions are ranked first much more frequently than sentences from
DCI and DOCCI descriptions. Specifically, for the first sentence, the difference
is most notable, supporting our claim that IIW descriptions are higher quality
earlier on and IIW first sentences are designed to capture a TLDR.
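The cumulative prompts can be formed as in the sketch below; the sentence splitter is a simplification, and the actual T2I model call is omitted.

```python
# A minimal sketch of how the cumulative T2I prompts are formed: the first
# sentence, the first two sentences, and so on. The splitter is naive and the
# text-to-image call itself is not shown.
def cumulative_prompts(description: str) -> list[str]:
    sentences = [s.strip() for s in description.split(". ") if s.strip()]
    return [". ".join(sentences[: i + 1]).rstrip(".") + "." for i in range(len(sentences))]

for prompt in cumulative_prompts(
    "A pink bicycle leans against a wooden fence. A wicker basket sits on the "
    "handlebars. The basket is filled with red and yellow flowers."
):
    print(prompt)  # each prompt would be fed to the T2I model in turn
```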

Compositional Reasoning with IIW Descriptions In our downstream evaluation of the ARO, SVO-Probes, and Winoground compositional reasoning benchmarks with IIW descriptions, we formulate a new LLM-only method of evaluation. We prompt an LLM (e.g., PaLM 2) to determine which is the true matching
caption given the generated image description and the image caption options to
select from. We define the LLM prompt which includes an image description as:

“Given the following image description and image caption options, choose
the most likely OPTION number :
IMAGE-DESCRIPTION : <DESCRIPTION>
OPTIONS : <CHOICES>
RESPONSE : ”

where we fill in the <DESCRIPTION> from each VLM description model (e.g., either our IIW fine-tuned model, InstructBLIP, or LLaVA) and the list of <CHOICES> from the corresponding evaluation dataset, respectively.


Choices are enumerated in a list-like fashion, and we ask the model to generate
the number of the most likely caption. We define a different prompt for the
language bias baseline, which serves as a sanity check that the image/image
description is truly needed for these datasets. It provides a lower bound for
comparison, too. While the prompt is different as we do not input any image
description, we try to make it as similar as possible to the prior prompt. We set
the language bias prompt to:
“Given the following image caption options, choose the most likely OP-
TION number :
OPTIONS : <CHOICES>
RESPONSE : ”
where <CHOICES> is filled in using the same format as previously described.
Importantly, when filling in the caption choices, we deterministically swap the index of the “answer,” i.e., the true matching caption, within the list of choices in the prompt. This is done to ensure an equal distribution and reduce any order bias (e.g., an LLM may be more prone to believing the first option is the correct option).
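A minimal sketch of this evaluation protocol is given below; call_llm is a placeholder for whichever LLM is queried (e.g., PaLM 2), and the prompt template mirrors the one quoted above.

```python
# A minimal sketch of the LLM-based compositional reasoning evaluation.
# `call_llm` is a placeholder callable that returns the model's text response.
PROMPT = (
    "Given the following image description and image caption options, "
    "choose the most likely OPTION number :\n"
    "IMAGE-DESCRIPTION : {description}\n"
    "OPTIONS : {options}\n"
    "RESPONSE : "
)

def evaluate_sample(idx, description, true_caption, false_caption, call_llm):
    # Deterministically alternate the position of the true caption to balance
    # the answer distribution and reduce order bias.
    options = [true_caption, false_caption] if idx % 2 == 0 else [false_caption, true_caption]
    answer = "1" if idx % 2 == 0 else "2"
    options_str = " ".join(f"[{i + 1}] {opt}" for i, opt in enumerate(options))
    response = call_llm(PROMPT.format(description=description, options=options_str)).strip()
    # Any response that is not the valid option number counts as incorrect.
    return response == answer
```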
To obtain the image description which is then fed into the LLM, we prompt our fine-tuned models with “Generate a detailed image description.” For the InstructBLIP and LLaVA models, we define similar prompts based on the prompts used in their published papers: “Write a long and detailed description for the photo.” and “Provide a detailed description of the given image” for InstructBLIP and LLaVA, respectively.
We process the LLM outputs as classes (e.g., when choosing between image caption choices [1] and [2], LLM responses are ‘1’ or ‘2’) and calculate accuracy with respect to the true image caption class. If the LLM does not produce a valid class, it is considered an incorrect prediction. Note that this task setup is different from how VLM models are typically evaluated on these reasoning datasets: prior work considers a sample to be correctly reasoned about if the image-text similarity of the true image caption is higher than the image-text similarity of the incorrect image caption. Due to the long length of our descriptions, we cannot compute image-text similarity reasonably with models like CLIP without significantly truncating our image descriptions. In future work, once input length limitations are mitigated, dual-encoder VLMs like CLIP can be fine-tuned with our rich data, which will help to improve VLM reasoning.
Note that ARO and Winoground datasets are built with positive and negative
captions for each image. SVO-Probes differs in that it originally contained a
positive and negative image for each positive caption. For our experiments, we
need a true and false caption associated with an image. A large portion (∼90%)
of the SVO-Probes negative images also serve as separate samples (where they
are considered positive images, with associated captions). Thus, we can pull
these captions to serve as the negative caption for the original sample.
For the remaining ∼10%, we use the negative triplet (the S, V, O triplet
specifying the subject, object, and verb, with one of them being modified) to
Table 11: VL Compositional Reasoning Accuracy with Image Descriptions. We evaluate whether rich descriptions can distinguish the true matching image caption in the ARO, SVO-Probes, and Winoground datasets. The COCO and Flickr30k Order subsets of ARO are not reported due to a very high language bias baseline of 98%.

Image Description Model            ARO [66] VG-A   ARO [66] VG-R   SVO-Probes [15]   Winoground [51]
None (Language Bias Baseline) 56.50 59.94 50.71 49.88
InstructBLIP-Vicuna-7B 83.99 62.73 89.35 65.25
LLaVA-V1.5-7B 84.80 63.71 87.89 63.38
PaLI-3 + DCI 5B 88.19 66.47 86.50 64.62
PaLI-3 + DOCCI 5B 89.70 68.85 88.73 69.50
PaLI-3 + IIW 5B 90.37 66.19 88.66 69.38
PaLI-3 + IIW Combined 5B 89.46 64.88 87.78 66.88

automatically flip the negative S, V, or O in the positive caption. Ten of these


samples did not have negative triplets in the dataset, so they were removed.
Lastly, there were 114 samples with positive captions not containing the S, V, or
O that needed to be swapped to form the negative caption. This happens as a
result of SVO triplets containing root forms of the words, which were not spelled
the same way in the caption. For example, an SVO may be “man,lie,beach” with
the caption stating “A man lying on a beach.” Due to the verb tense differences,
it would require additional processing to match “lie” to “lying.” We remove these
edge cases for simplicity.
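The negative-caption construction can be sketched as follows; the helper below is illustrative only and simply drops the root-form edge cases described above.

```python
# A minimal sketch of constructing a negative SVO-Probes caption by swapping
# the word that differs between the positive and negative triplets. Samples
# where the triplet word does not appear verbatim in the caption (e.g., "lie"
# vs. "lying") are dropped, as described in the text.
def make_negative_caption(caption, pos_triplet, neg_triplet):
    changed = [(p, n) for p, n in zip(pos_triplet, neg_triplet) if p != n]
    if len(changed) != 1:
        return None  # expect exactly one of S, V, O to be modified
    pos_word, neg_word = changed[0]
    if pos_word not in caption.split():
        return None  # root-form mismatch edge case; skipped for simplicity
    return " ".join(neg_word if w == pos_word else w for w in caption.split())

print(make_negative_caption(
    "a man sits on a beach", ("man", "sit", "beach"), ("man", "sit", "bench")))
```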
Finally, we include more vision language compositional reasoning results with
different PaLI fine-tuned models in Table 11. Here we additionally include the
models fine-tuned with the DCI and DOCCI datasets. The IIW descriptions still result in the highest reasoning accuracy for ARO VG-A and are comparable with DOCCI on Winoground. Trends also stay the same with SVO-Probes, with DOCCI performing similarly to IIW but InstructBLIP performing slightly better (by less than 1 accuracy point). Finally, we find that DOCCI performs best on VG-R, which might be a result of its dataset being designed to explicitly contain connected and contrasting images, which may more frequently capture similar images that differ only by the visual relationship between objects.
While performance differences between DCI, DOCCI, and IIW are smaller,
this could be an artifact of the reasoning datasets; ARO, SVO-Probes, and
Winoground are all built upon short caption datasets, so the utility and qual-
ity differences between DCI, DOCCI, and IIW are not fully captured by these
probing datasets.

7.5 Enriching Image Caption Datasets

As discussed in the main paper, we enrich 1k samples from two existing im-
age caption datasets, namely, Localized Narratives and Crossmodal (XM) 3600,
with new image descriptions generated by IIW fine-tuned models. The goal of
Table 12: Dataset Statistics Comparing ImageInWords (IIW) Descriptions of Prior Work to their Original Annotations. We include the number of samples (i.e., the subset of captions/descriptions that we enrich) and the average number of tokens, sentences, nouns (NN), adjectives (ADJ), adverbs (ADV), and verbs (VB). Language statistics are averages reported per description unless otherwise noted.

Dataset         Sample Count   Tokens/Sent.   Tokens/Desc.   Sentences   NN      ADJ     ADV    VB
LocNar [39]     1000           14.35          30.56          2.12        8.02    1.09    0.16   2.39
IIW Enriched    1000           22.19          128.87         5.80        32.37   16.02   1.82   11.44
XM3600 [50]     1000           10.40          10.40          1.00        3.45    1.08    0.04   0.61
IIW Enriched    1000           22.25          130.56         5.86        33.18   15.82   1.72   11.87

releasing these enriched versions is to provide longer, hyper-detailed image descriptions that can be used for evaluation or fine-tuning purposes in future work.
The enriched versions not only allow for finer-grained, full coverage evaluations
of the content in images (via new metrics or probing datasets), but also may
enable autorater models which learn from the precision and recall errors in the
generated descriptions.
In Table 12, we report the language statistics on the original 1k samples from
each dataset and the enriched versions. It is clear that the IIW descriptions are
significantly longer and richer, as we have higher counts of tokens, sentences,
and each part of speech.
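Statistics of this kind can be reproduced with standard NLP tooling; the sketch below uses NLTK's Penn Treebank tagger, which is one possible choice rather than the exact pipeline behind Table 12.

```python
# A minimal sketch (illustrative tooling) for the Table 12 language statistics.
import nltk  # requires the punkt and averaged perceptron tagger resources

def language_stats(description: str) -> dict:
    sentences = nltk.sent_tokenize(description)
    tokens = nltk.word_tokenize(description)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    count = lambda prefix: sum(tag.startswith(prefix) for tag in tags)
    return {
        "tokens_per_desc": len(tokens),
        "tokens_per_sent": len(tokens) / max(1, len(sentences)),
        "sentences": len(sentences),
        "NN": count("NN"), "ADJ": count("JJ"),
        "ADV": count("RB"), "VB": count("VB"),
    }
```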

7.6 Percentages Reported in the Main Paper


We re-quote all analysis percentages reported in the main paper and define how they were calculated, for clarity, in Tables 13-15. The reference location is defined by the section, paragraph, and line it appeared in. We only include the paragraph number for multi-paragraph sections, and only include the line number if the same percentage occurs more than once within a paragraph. For example, “S4.3 P2 L3” means Section 4.3, Paragraph 2, Line 3. Most percentages were rounded to the nearest point in the main paper.
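As a small sanity check, the headline aggregates can be recomputed from the per-metric deltas quoted in Table 13:

```python
# Reproducing the headline aggregates from the per-metric preference deltas
# quoted in Table 13 (comprehensiveness, specificity, hallucinations, tldr,
# human-likeness).
from statistics import mean

dci_deltas = [61, 80, 42, 91, 82]     # IIW vs. DCI human-authored data
docci_deltas = [42, 82, 35, 79, 68]   # IIW vs. DOCCI human-authored data
gpt4v_deltas = [35, 53, 59, 70, 21]   # IIW vs. GPT-4V outputs

print(mean([mean(dci_deltas), mean(docci_deltas)]))  # 66.2 -> "+66%"
print(mean(gpt4v_deltas))                            # 47.6 -> "+48%"
```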

7.7 Limitations
Finally, we discuss the limitations of our annotation framework and evaluations.
In our annotation framework, we define a seeded and sequential annotation pro-
cess, with both aspects having potential limitations. The quality of the seeded
data is of high importance as it will ultimately affect the rest of our human
annotation pipeline. Additionally, even with the best possible seeds, they may
limit the scope of what our crowd workers write by biasing them towards cer-
tain objects or phrases. In terms of limitations for the sequential augmentation
used, unnecessary time may be spent by annotators if the first annotator output
quality is low. By monitoring the initial draft descriptions, quality can be better
ensured so that the framework is as efficient as possible.
With respect to the evaluation of our human-annotated data and model-generated outputs, we only perform evaluations on hundreds of samples (as opposed to thousands or more). This is due to how expensive and time consuming human SxS evaluations are, but we note that IIW is rated marginally and substantially better at a much higher rate, which would likely hold at larger scale. Our work is also inherently limited by the lack of metrics available for long descriptions. We still report standard text similarity metrics and complement them with human SxS, but in future work the text length limitations should be addressed so that automated metrics can be applied.
While we currently do not plan to open source our models or training set,
we do release an evaluation set over images that can serve as a unified bench-
mark for IIW, recent and future related work. We also open source the human
SxS judgements and samples enriched from Localized Narratives and XM3600.
Lastly, as also mentioned in the Conclusion of the main text, our initial IIW
dataset and resulting models are English-only. In the future, we plan to expand
our work to have multilingual coverage. We also would like to curate image de-
scriptions that have more specifics with respect to locale/geographical location,
so that we do not strictly have descriptions with a western lens.
Fig. 16: T2I Outputs and Human Ranking Evaluations. We show example T2I results
where the first sentence, first two sentences, ..., all the sentences of the image descrip-
tions from DCI, DOCCI and IIW models are fed sequentially as inputs, i.e., at each
step an additional sentence chunk is fed to the T2I model.
(a) Reconstruction Rank Counts across Inputs over All Cumulative Sentence Chunks

(b) Reconstruction Rank Counts across Inputs of Specific Cumulative Sentence Chunks

Fig. 17: T2I Human Rank Distributions. We illustrate bar plots for the image reconstruction evaluation results using image descriptions from fine-tuned PaLI-5B models on three datasets (DCI, DOCCI, IIW). Images reconstructed from IIW descriptions are consistently ranked better than other descriptions.
Table 13: Percentages from the Main Text. We reference each percentage and define
how they were calculated for clarity.

Percent Reference Equation and Explanation


+66% Abstract, Average difference of IIW preference vs. other dataset
Intro P4, preference, averaged over DCI and DOCCI datasets and
S4.3 P4, averaged over the five metrics corresponding to
Conclusion (comprehensiveness, specificity, hallucinations, tldr,
human-likeness). Differences of IIW marginally and
substantially better - other dataset marginally and
substantially better for (comprehensiveness, specificity,
hallucinations, tldr, human-likeness) metrics from Table 3
correspond to DCI (61, 80, 42, 91, 82) and DOCCI (42,
82, 35, 79, 68). The final average preference over the five
metrics and two datasets is 66.2%.
+48% Abstract, Average difference of IIW preference vs. GPT-4V outputs,
Intro P4, averaged over the five metrics corresponding to
S4.3 P4 (comprehensiveness, specificity, hallucinations, tldr,
human-likeness). Differences of IIW marginally and
substantially better - GPT-4V marginally and
substantially better for (comprehensiveness, specificity,
hallucinations, tldr, human-likeness) metrics from Table 5
correspond to (35, 53, 59, 70, 21). The final average
preference over the five metrics is 47.6%.
+31% Abstract, Average difference of IIW model output preference vs.
Intro P5, other fine-tuned model output preference, averaged over
S4.4 P2, DCI and DOCCI fine-tuned models and averaged over the
Conclusion five metrics corresponding to (comprehensiveness,
specificity, hallucinations, tldr, human-likeness).
Differences of IIW marginally and substantially better -
other dataset marginally and substantially better for
(comprehensiveness, specificity, hallucinations, tldr,
human-likeness) metrics from Table 5 correspond to DCI
(42, 54, -9, 51, 57) and DOCCI (4, 37, -7, 57, 23). The
final average preference over the five metrics and two
datasets is 30.9%.
20% more S3.2 P3 The average increase in token count from annotation
round 1 to round 3: (205-170)/170 = 20%.
30% less S3.2 P3 The average decrease in time spent annotating from
round 1 to round 3 compared to if three individual round
1s occurred: ((800*3)-(800+600+300))/(800*3) = 30%.
+61% S4.3 P2 L3 The amount IIW is more comprehensive than DCI in
Table 3: (30+41) - (3+7) = 61%.
+42% S4.3 P2 L3 The amount IIW is more comprehensive than DOCCI in
Table 3: (33+19) - (4+6) = 42%.
+80% S4.3 P2 L3 The amount IIW is more specific than DCI in Table 3:
(20+68) - (5+3) = 80%.
Table 14: Percentages from the Main Text. We reference each percentage and define
how they were calculated for clarity.

Percent Reference Equation and Explanation


+82% S4.3 P2 L3 The amount IIW is more specific than DOCCI in Table 3:
(22+65) - (3+2) = 82%.
42% S4.3 P2 L4 The amount IIW contains fewer hallucinations than DCI
in Table 3: (32+15) - (2+3) = 42%.
35% S4.3 P2 L4 The amount IIW contains fewer hallucinations than
DOCCI in Table 3: (34+13) - (0+12) = 35%.
+91% S4.3 P2 L4 The amount IIW contains better TLDR than DCI in
Table 3: (20+74) - (3+0) = 91%.
+79% S4.3 P2 L4 The amount IIW contains better TLDR than DOCCI in
Table 3: (30+54) - (1+4) = 79%.
+82% S4.3 P2 L5 The amount IIW is more human-like than DCI in Table
3: (25+59) - (1+1) = 82%.
+68% S4.3 P2 L5 The amount IIW is more human-like than DOCCI in
Table 3: (46+23) - (1+0) = 68%.
+35% S4.3 P3 The amount IIW is more comprehensive than GPT-4V
outputs in Table 5: (29+19)-(3+10) = 35%.
+53% S4.3 P3 The amount IIW is more specific than GPT-4V outputs
in Table 5: (35+34) - (6+10) = 53%.
+59% S4.3 P3 The amount IIW contains fewer hallucinations than
GPT-4V outputs in Table 5: (34+31) - (0+6) = 59%.
+70% S4.3 P3 The amount IIW contains better TLDR than GPT-4V
outputs in Table 5: (47+34) - (5+6) = 70%.
+21% S4.3 P3 The amount IIW is more human-like than GPT-4V
outputs in Table 5: (27+13) - (6+13) = 21%.
+42% S4.4 P2 The amount IIW is more comprehensive than DCI in
Table 5: (32+27) - (7+10) = 42%.
+4% S4.4 P2 The amount IIW is more comprehensive than DOCCI in
Table 5: (26+5) - (5+22) = 4%.
+54% S4.4 P2 The amount IIW is more specific than DCI in Table 5:
(24+46) - (6+10) = 54%.
+37% S4.4 P2 The amount IIW is more specific than DOCCI in Table 5:
(33+24) - (6+14) = 37%.
Table 15: Percentages from the Main Text. We reference each percentage and define
how they were calculated for clarity.

Percent Reference Equation and Explanation


+51% S4.4 P2 The amount IIW contains better TLDR than DCI in Table
5: (30+41) - (9+11) = 51%.
+57% S4.4 P2 The amount IIW contains better TLDR than DOCCI in
Table 5: (42+28) - (6+7) = 57%.
+55% S4.4 P2 The amount IIW is more human-like than DCI in Table 5:
(32+39) - (11+5) = 55%.
+23% S4.4 P2 The amount IIW is more human-like than DOCCI in Table
5: (27+14) - (6+12) = 23%.
-9% S4.4 P2 The amount IIW contains fewer hallucinations than DCI in
Table 5: (11+13) - (12+21) = -9%.
-7% S4.4 P2 The amount IIW contains fewer hallucinations than
DOCCI in Table 5: (21+6) - (9+25) = -7%.
34% S4.6 P4 The accuracy improvement on VG-A from using IIW over
the language bias baseline: (90.37) - (56.50) = 33.87%.
6% S4.6 P4 The accuracy improvement on VG-R from using IIW over
the language bias baseline: (66.19) - (59.94) = 6.25%.
20% S4.6 P4 The accuracy improvement on Winoground from using IIW
over the language bias baseline: (69.38) - (49.88) = 19.5%.
6% Abstract, The accuracy improvement on VG-A from using IIW over
S4.6 P4, the next best baseline LLaVA: (90.37) - (84.80) = 5.57%.
Conclusion
2% S4.6 P4 The accuracy improvement on VG-R from using IIW over
the next best baseline LLaVA: (66.19) - (63.71) = 2.48%.
4% S4.6 P4 The accuracy improvement on Winoground from using IIW
over the next best baseline InstructBLIP: (69.38) - (65.25)
= 4.13%.
