Image in Words
Roopal Garg1 , Andrea Burns1 , Burcu Karagol Ayan1 , Yonatan Bitton1 , Ceslee
Montgomery1 , Yasumasa Onoe1 , Andrew Bunner1 , Ranjay Krishna1,3 , Jason
Baldridge2 , and Radu Soricut1
1 Google Research, 2 Google DeepMind, 3 University of Washington
https://github.com/google/imageinwords
1 Introduction
Fig. 1: ImageInWords Seeded Annotation Framework. Humans enrich and refine out-
puts sequentially, building upon previous human and/or machine provided inputs. The
human annotation flow starts with fine-grained object captions in Task 1. They are used
as building blocks to help compose image-level hyper-detailed descriptions in Task 2.
The VLMs (in orange) are updated in an active learning loop to produce better object-
level captions and image-level descriptions as more annotated data becomes available.
Screenshots of the end-to-end annotation process and UI are in the Supplemental.
of the image [56]. Therefore, scraping image descriptions from the web as the
primary source of effective vision-language (VL) pairing is fundamentally flawed
and limits model capabilities [17, 29, 42, 48, 51].
With the goal of curating higher quality image-text data, recent works have
released dense human-written caption datasets (e.g., DCI [52], DOCCI [32])
or model-generated datasets (e.g., DAC [13], PixLore [5]). Both types have
their limitations, as using annotators without comprehensive guidelines results
in vague, subjective outputs that vary by human attention span, bias, and ef-
fort [6, 31, 36, 63]. In contrast, model-generated captions are cheaper to generate
but result in outputs that are incomplete and rife with hallucinations [11, 44].
In this work, we describe ImageInWords (IIW), a human-in-the-loop frame-
work for curating hyper-detailed and hallucination-free image descriptions, and
its resulting annotations. ImageInWords combines the irreplaceable quality of hu-
man annotators with seeded metadata from machine generations. The process
begins with object detectors first identifying individual object instances in the
image. Next, a VLM generates granular captions for each detected object which
seed our human annotation process. These seed captions may contain hallucinations
or lack object-level comprehensiveness and specificity. Our crowd workers augment
and fix the object-level captions to make them richer and hallucination-free,
seeding the next step. We then operate at the image level, where the VLM generates
an image caption to seed the final image description. Crowd workers consume this
image-level seed caption along with the object-level human annotations to fill in
contextual gaps missing from the existing image captions.
We design guidelines for crowd workers to attend to concepts beyond objects,
such as visual perspective, spatial arrangement, and human-object interactions.
This process is iterative to ensure a final high quality dataset (Fig. 1).
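To make the flow concrete, the following is a minimal sketch of the seeded, sequential process described above; the detector, VLM, and annotator interfaces (detect, caption, refine_object, compose, augment) are hypothetical placeholder names used only for illustration, not our production components.

```python
# Sketch of the IIW seeded annotation flow (cf. Fig. 1). All interfaces are
# hypothetical placeholders that only illustrate the order of operations.

def annotate_image(image, detector, vlm, annotators, rounds=3):
    # Task 1: seed object-level captions with a VLM, then have a human
    # enrich and de-hallucinate each one.
    objects = detector.detect(image)                        # (label, bbox) seeds
    object_captions = []
    for obj in objects:
        seed = vlm.caption(image, region=obj.bbox)          # seed object caption
        object_captions.append(annotators[0].refine_object(image, obj, seed))

    # Task 2: seed an image-level caption, compose the initial description,
    # then refine it over sequential rounds, one annotator per round.
    seed_caption = vlm.caption(image)
    description = annotators[0].compose(image, object_captions, seed_caption)
    for annotator in annotators[1:rounds]:
        description = annotator.augment(image, description, object_captions)

    # The VLM itself is periodically fine-tuned on the collected data
    # (active learning), so later seeds require fewer human edits.
    return object_captions, description
```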
Following this process, we construct the IIW dataset containing 9018 images,
each with its hyper-detailed description. In comparison to other dense descrip-
tion datasets (Tab. 1), our dataset contains more information along multiple
dimensions: our descriptions have an average of 9.8 sentences, 52.5 nouns,
28 adjectives, 5 adverbs, and 19.1 verbs per description. To evaluate description
quality, we perform a side-by-side (SxS) human evaluation against those from
human-generated datasets (DCI, DOCCI) and from those generated by GPT-4V.
Evaluators rated our descriptions as more comprehensive, specific, and human-
like, with fewer hallucinations and better leading sentences (Tab. 3), by an
average margin of +66% (DCI and DOCCI) and +48% (GPT-4V).
To comprehensively evaluate the utility of our framework’s generated descrip-
tions, we use it as a fine-tuning dataset. We fine-tune a PaLI-3 5B model [8] to
successfully generate hyper-detailed descriptions. We conduct four evaluations to
compare the same model fine-tuned on our dataset versus other fine-tuning sets.
First, a battery of well-established readability metrics finds IIW to be superior to
DCI and DOCCI, when measuring either the annotations directly or fine-tuned
model-generated descriptions. Second, using these fine-tuned models, we gener-
ate descriptions on the Localized Narratives (LocNar) [39] dataset to create a
common pool of image-description pairs and run human SxS evaluations. We
find that IIW model-generated outputs are preferred by +31% over outputs from
models fine-tuned on prior work. Third, we show that images generated from our
model's descriptions were judged a closer reconstruction of the original image
than images generated from other models' descriptions. The generations were eval-
uated both by human evaluators as well as CLIP. Fourth, we test the generated
descriptions’ compositionality by swapping the images in compositionality rea-
soning benchmarks with the generated description (ARO [66], SVO-Probes [15]
and Winoground [51]). Our trained model results in higher reasoning accuracy
compared to LLaVA-v1.5 and InstructBLIP.
Overall, our framework produces higher quality image description data that
serve as an effective fine-tuning dataset, and our evaluations along a dozen di-
mensions validate its utility. We posit that the level of detail attained in our an-
notations facilitates a strong new capability that can help advance VL research
and its applications; this level of descriptive detail is important for multiple
downstream applications, including generating descriptions for Text-to-Image
(T2I) models [4, 45, 65] and in applications for people with visual impairments.
2 Related Work
Image captioning has been studied for many years, starting with CNN and LSTM
encoder-decoder frameworks for generating generic captions [2, 16, 43, 55], to the
more recent Transformer-based VLMs which are evaluated on more challenging
captions [9,25,53] (e.g., VizWiz [14], NoCaps [1], TextCaps [49]). These datasets,
among many others, contain captions with an average of 15 words or fewer [1,
12, 14, 19, 23, 27, 30, 38, 47, 49, 64] and may differ by caption grounding level
(e.g. whole image or region-level captions) or image domain (e.g. images taken
by people who are blind or images capturing text).
There are, however, few dense image description datasets. PixLore [5] pro-
posed using multiple vision-language datasets to generate more verbose captions
with BLIP-2 [25]. DAC [13] uses a machine-generated approach: pretrained lan-
guage models expand the original image caption and pretrained VLMs are used
to generate captions over smaller image regions. The resulting descriptions are
then used to fine-tune a VLM for better compositional reasoning. While
model-only approaches are cost effective and avoid the challenges of designing
instructions for crowd workers, they risk introducing hallucinations and sys-
tematic biases, which may not be easily mitigated. Using only crowd workers,
DOCCI [32] collects image descriptions that we later show can be considerably
improved. Closest to IIW is DCI [52], which uses human annotators to reach
denser image descriptions. DCI uses the SAM [20] object detector to generate
smaller regions to be described and then composes them into an overall image
description. In contrast to DCI, we seed our annotation pipeline with VLM gen-
erated outputs and allow crowd workers to update or correct every component
of the seeded information. The output is then sequentially refined over multiple
annotation rounds to produce a single coherent hyper-detailed annotation out-
put. In comparison to DCI’s “caption-extra” annotation, we collect significantly
better descriptions. In Tab. 1, we provide quantitative comparisons to prior work
designed for longer image captions.
3 ImageInWords Annotation Framework
The IIW dataset is composed of 9018 images (IIW-Train: 8573, IIW-Test: 445)
that are sampled from a dataset built similarly to WebLI [9] and human-annotated.
Details on the human annotator pool are provided in the Supplemental.
In Sec. 3.1, we briefly review the foundational guidelines we define for use with
crowd workers. We describe our annotation methodology, which consists of seeding
and sequential description refinement, in Sec. 3.2. Finally, details on the types
of image-text annotations we collect (Fig. 3) are described in Sec. 3.3.
[Fig. 2 bar plots: (a) description token count per annotation round; (b) time (sec) per annotation round; (c) Jaccard similarity between annotation rounds in the beginning; (d) Jaccard similarity between annotation rounds after human-in-the-loop learning.]
Fig. 2: Effects of Sequential Annotation: Over annotation rounds, (a) token count
goes up as (b) time spent goes down with (c) higher agreement, measured by Jaccard
Similarity [61]. A very low agreement across rounds (1,3) indicates considerable edits
which annotators largely agreed upon in rounds (2,3). Round (1,2) provides the highest
mean token count gain of ∼30. (d) Over time with a constant human annotator pool,
annotators learn from each other via an implicit feedback loop, and a high agreement rate in
round (1,2) can now be observed as was previously only seen in round (2,3) in (c).
and coverage guarantee from the human outputs. As data is collected, the PaLI-
3 5B model is fine-tuned to produce better quality descriptions in an active
learning loop (as reflected with loops in Fig. 1).
Sequential Augmentation In addition to seeding the IIW annotation pipeline
with VLM outputs to ease human annotation, we also improve the efficiency of
our framework with sequential image description augmentations. For this, hu-
mans augment on top of a previous crowd worker’s or VLM’s outputs instead
of starting from scratch each time in parallel. The seed information and annota-
tions from previous rounds are available to the annotators for reference. During
the annotation process, we observed (Fig. 2) that it is far more effective in terms
of both time and output to read and augment image descriptions than to write
from scratch. From Fig. 2 we see that if annotations were done in parallel, we
would have three competing outputs per image, each with its own style, perspective,
strengths, and weaknesses, and each containing ∼170 words and taking ∼800 seconds.
In the sequential process, by contrast, we get a single all-inclusive output per
image that has been verified and augmented by three humans, with a 20% higher
token count in 30% less time. Higher Jaccard Similarity across rounds indicates higher
inter-annotator agreement on the output, which serves as a proxy for quality.
Finally, our framework has the benefit of injecting an implicit human-to-
human learning loop, as each human annotator has the opportunity to read
and learn from other perspectives across the annotation rounds, leading to im-
proved individual quality. This is evident from a ∼2x improved inter-annotator
agreement between rounds (1, 2); compare (c) and (d) in Fig. 2.
Fig. 3: IIW Annotation Tasks. Objects and their attributes are first individually an-
notated to note the salient objects and focus on coverage of their attributes in Task 1.
These outputs, along with a seed VLM caption, are passed to humans to build the ini-
tial image-level description. The initial caption is then human augmented and refined
in N sequential rounds to attain the final hyper-detailed description in Task 2.
annotations are done using the guidelines in Sec. 3.1 as reference. More details
on the annotation UI, seeded inputs and outputs are in the Supplemental.
Annotation Task-2: Overall Image Description Our second annotation
task is to formulate the final holistic hyper-detailed description using the guide-
lines from Sec. 3.1. Seeded data from Task-1 (detailed above), optional domain
specific metadata (e.g., art style of a painting, otherwise requiring domain ex-
pertise from humans), and a VLM seed image caption are used to hint and help
the annotators compose the overall image description.
The bulk of the annotation responsibility falls on the first annotator who
composes the initial description; note that the crowd worker annotation order
is randomly assigned per sample and the same annotator is not re-employed
across rounds for the same sample. This output is then refined and augmented
in sequential rounds to improve the quality of the output. Sequential annotation
helps mitigate subjectivity and quality drops as data is human verified across the
rounds. Annotators are encouraged to focus on augmentation and to remove content
only when it is an obvious error from a previous round; they are, however, free to
re-frame existing information when adding new details. In our
data collection experiments, we started with 3 sequential annotation rounds and
monitored the n-gram Jaccard similarity between the outputs. With human-in-
the-loop learning, once a high round-over-round output similarity was achieved
(we used a 0.8 threshold), we reduced the number of rounds. Optionally, early
stopping support could be added to the annotation framework itself to make this
instance specific. We found our similarity threshold can be met between the first
two rounds, i.e., (1,2), (Fig. 2) suggesting a high individual-annotator quality.
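A minimal sketch of the round-over-round agreement check, assuming unigram Jaccard similarity over whitespace tokens; the exact tokenization and n-gram order used in practice may differ.

```python
def ngram_jaccard(text_a: str, text_b: str, n: int = 1) -> float:
    """Jaccard similarity between the n-gram sets of two descriptions."""
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    set_a, set_b = ngrams(text_a), ngrams(text_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def needs_another_round(previous: str, current: str, threshold: float = 0.8) -> bool:
    """Early stopping: skip further rounds once consecutive outputs agree above the threshold."""
    return ngram_jaccard(previous, current) < threshold
```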
4 Experiments
We perform qualitative and quantitative experiments to evaluate the quality of
the IIW dataset and its utility for fine-tuning. We start the evaluation with
text-based automatic readability metrics in Sec. 4.1 and extend to human SxS
evaluations (defined in Sec. 4.2) to compare our human annotations to prior
work (e.g. DCI, DOCCI, GPT-4V) in Sec. 4.3.
In Sec. 4.4, we fine-tune separate PaLI-3 5B models on DCI, DOCCI and IIW
training splits, with their detailed human-authored text as target. Each model is
trained with an identical setup (∼40 epochs, learning-rate 0.0003, batch-size 32)
and using the generic input instruction: “Generate a detailed image description.”
Additional details on the fine-tuning setup are provided in the Supplemental.
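For reference, the shared fine-tuning setup can be summarized in a small configuration; the field names below are illustrative (not our actual training code), and only the values come from the description above.

```python
# Illustrative configuration shared by the DCI-, DOCCI-, and IIW-tuned models.
FINETUNE_CONFIG = {
    "base_model": "PaLI-3 5B",
    "epochs": 40,               # approximately 40 epochs
    "learning_rate": 3e-4,
    "batch_size": 32,
    "instruction": "Generate a detailed image description.",
    "target": "detailed human-authored description from the respective dataset",
}
```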
Existing text similarity metrics like BLEU [37] and ROUGE [26] have been
shown to correlate poorly with human judgement as they are heavily dependent
on n-gram overlaps, and thus ill-suited for long texts [24]. As such, to get reliable
results, IIW fine-tuned model outputs are compared against models fine-tuned on
prior work (DCI and DOCCI), as well as against GPT-4V outputs.
Finally, we quantify the richness of the IIW model outputs via two down-
stream evaluations. First, in Sec. 4.5, we use generated descriptions from DCI,
DOCCI, and IIW fine-tuned models to prompt a Text-to-Image (T2I) model for image reconstruction.
Table 3: Human SxS to Evaluate IIW Human-Authored Data. The eval reports per-
centages comparing data from prior work with data annotated by the IIW framework.
Metric             DCI ++    +    -    +   ++ IIW    DOCCI ++    +    -    +   ++ IIW
Comprehensiveness        3    7   19   30   41              4    6   38   33   19
Specificity              5    3    4   20   68              3    2    8   22   65
Hallucinations           2    3   48   32   15              0   12   41   34   13
TLDR                     3    0    3   20   74              1    4   11   30   54
Human-Like               1    1   14   25   59              1    0   30   46   23
To enable this comparison, we re-annotate the DCI test set (112 images) and a
comparable number of samples (100) from the DOCCI test set with our IIW annotation
framework. We thus have human-authored IIW annotations for direct comparison on
images from the DCI and DOCCI datasets.
Table 3 reports preference percentages for each human-authored test set on
our five metrics. Comparing IIW to DCI and DOCCI, Comprehensiveness is
higher by +61% and +42%, Specificity by +80% and +82%, Hallucinations
are lower by 42% and 35%, TLDR quality is higher by +91% and +79%, and
Human-Likeness improves by +82% and +68%, respectively. This indicates that
the IIW human-authored image descriptions on images from DCI and DOCCI
are considerably better than those originally published with prior work.
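The margins quoted above are net preferences: the share of judgments marginally or substantially favoring IIW minus the share favoring the other source, assuming the five buckets in Tab. 3 run from a strong preference for the prior dataset to a strong preference for IIW. A short worked example:

```python
def net_preference(row):
    """row = [much_other, slight_other, neutral, slight_iiw, much_iiw] (percentages)."""
    much_other, slight_other, _, slight_iiw, much_iiw = row
    return (slight_iiw + much_iiw) - (slight_other + much_other)

# Comprehensiveness row of Tab. 3
print(net_preference([3, 7, 19, 30, 41]))  # IIW vs. DCI   -> 61
print(net_preference([4, 6, 38, 33, 19]))  # IIW vs. DOCCI -> 42
```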
To further quantify the quality of IIW human annotations, we compare with
GPT-4V outputs [33–35] in Tab. 5 (right). We use GPT-4V to generate image
descriptions on 100 IIW images. The descriptions are generated with the prompt
“Generate a detailed image description” and no other specifications. The results
from the Model-Human section of Tab. 5 show that we reach Comprehensive-
ness (+35%), Specificity (+53%), Hallucination (+59%), TLDR (+70%), and
Human-Likeness (+21%) improvements over GPT-4V outputs. Although GPT-
4V performs relatively better than the human-authored data from DCI and
DOCCI when compared to IIW annotations, we assess that considerable future
modeling efforts are needed for VLMs to reach IIW human-authored quality.
Table 5: Human SxS to Evaluate IIW Model Predictions. Model Generated compares
IIW to prior work DCI and DOCCI using PaLI-5B fine-tuned models and GPT-4V
outputs. Model-Human then compares GPT-4V outputs to IIW human annotations.
Table 6: T2I Reconstruction Rankings from Image Descriptions. The original image
is compared to generated images from cumulative sentence inputs on both relative
(Mean Rank Position) and absolute (CLIP similarity) metrics. For both, we limit the
comparisons to a rough estimate of maximum token length in the T2I and CLIP Model.
Fig. 4: Example T2I Outputs and the Resulting Human Rankings. We show an exam-
ple output when the first sentence of the image descriptions from the DCI, DOCCI, and
IIW PaLI-5B fine-tuned models is fed as input to the same T2I model. Richer information
in the first sentence from IIW makes the T2I model reconstruct the original image
more closely. Additional examples are provided in the Supplemental.
We run the comparison across the varied input sentence unit lengths (over the 240
random LocNar eval images) as a 3-way human ranking evaluation (Fig. 4) and report
results in Tab. 6. To
assess image-reconstruction power, we also report the CLIP [41] similarity score
between the reconstructed image and the original image. The results indicate
that IIW’s detailed outputs consistently lead to better T2I reconstruction, with
highest mean rank and CLIP similarity regardless of the length of input units.
Additional rank plots and examples are shared in the Supplemental.
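A sketch of the CLIP-based reconstruction score: cosine similarity between the image embeddings of the original and reconstructed images. The ViT-B/32 checkpoint and the Hugging Face API below are assumptions made for illustration; the paper does not tie the metric to a specific implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP image encoder could be substituted.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_similarity(original: Image.Image, reconstruction: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings of two images."""
    inputs = processor(images=[original, reconstruction], return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    return float(features[0] @ features[1])
```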
Image Description Model          ARO [66]            SVO-Probes [15]  Winoground [51]
                                 VG-A      VG-R
None (Language Bias Baseline)    56.50     59.94      50.71            49.88
InstructBLIP-Vicuna-7B           83.99     62.73      89.35            65.25
LLaVA-V1.5-7B                    84.80     63.71      87.89            63.38
IIW PaLI-3 5B                    90.37     66.19      88.66            69.38
5 Data Release
We release a subset of human- and model-annotated IIW images and descriptions,
as well as human SxS results on Human Authored and Model-Human sourced
pairs of descriptions. The model generated descriptions may have hallucinations,
information recall losses, or non-human like writing style artifacts. By releasing
this subset along with human SxS judgements, we encourage the development
of new metrics and evaluation systems to detect them in an automated, scalable
manner. It also promotes fair comparison across methods in future work. The
set is released under a CC BY 4.0 license on GitHub.
Human Annotated We provide human-authored annotations from the IIW framework:
(1) IIW-400, a new eval set of 400 random images sampled from DOCCI-AAR [32],
with full IIW Task-1 and Task-2 annotations (Sec. 3.3) along with 100 human SxS
results each for GPT-4V and IIW PaLI-5B models (including model predictions);
(2) DCI-Test, 112 images re-annotated with the IIW framework, along with human
SxS results comparing them to DCI's original human annotations; (3) DOCCI-Test,
100 random images re-annotated with the IIW framework, along with human SxS
results comparing them to DOCCI's human annotations.
Model Annotated We release 2.4k random images annotated by the IIW
PaLI-5B model (Tab. 4), comprising the IIW-400 set (as used for the human
SxS evaluation), 1k samples from the LocNar eval set (from the OpenImages [21]
subset), and 1k samples from the XM3600 images [50].
6 Conclusion
In this work, we described ImageInWords (IIW), a new framework for rich, hyper-
detailed image descriptions. Our annotation guidelines and seeded sequential
annotation process lead to human authored descriptions that are strongly pre-
ferred over both prior work’s human annotations (+66%) and fine-tuned models
(+31%). Images reconstructed from IIW descriptions were ranked 1st more of-
ten regardless of how much of the image description was used, reflecting higher
saliency earlier and better overall quality. Our compositional reasoning evalua-
tion showed that IIW descriptions best capture the fine-grained visual detail needed
to distinguish true from false visual attributes and semantics, with accuracy gains
of up to 6% over our most performant baselines. Collectively, our results show
the quality and utility of IIW image descriptions as state-of-the-art.
In future work, we will explore the potential of hyper-detailed image descrip-
tions to improve tasks like image retrieval, visual question answering, synthetic
data generation, T2I fine-tuning, and automatic evaluation metrics to measure
these outputs. We are working on extending our careful and meticulous data
vetting process to ensure geodiverse images and continuing to improve the effi-
ciency of our data collection framework. Our current work focused on producing
detailed descriptions in English, and future iterations will expand this to more
languages with locale-specificity. Our goal is to make the annotation guidelines
holistic, reduce human effort and dependency in the annotation process, and
help shift the narrative from captions to descriptions.
References
1. Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D.,
Parikh, D., Lee, S., Anderson, P.: nocaps: novel object captioning at scale. In:
Proceedings of the IEEE International Conference on Computer Vision. pp. 8948–
8957 (2019)
2. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang,
L.: Bottom-up and top-down attention for image captioning and visual question
answering (2018)
3. Anil, R., Dai, A.M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri,
S., Taropa, E., Bailey, P., Chen, Z., Chu, E., Clark, J.H., Shafey, L.E., Huang,
Y., Meier-Hellstern, K., Mishra, G., Moreira, E., Omernick, M., Robinson, K.,
Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Abrego, G.H., Ahn, J., Austin,
J., Barham, P., Botha, J., Bradbury, J., Brahma, S., Brooks, K., Catasta, M.,
Cheng, Y., Cherry, C., Choquette-Choo, C.A., Chowdhery, A., Crepy, C., Dave,
S., Dehghani, M., Dev, S., Devlin, J., Díaz, M., Du, N., Dyer, E., Feinberg, V.,
Feng, F., Fienber, V., Freitag, M., Garcia, X., Gehrmann, S., Gonzalez, L., Gur-
Ari, G., Hand, S., Hashemi, H., Hou, L., Howland, J., Hu, A., Hui, J., Hurwitz,
J., Isard, M., Ittycheriah, A., Jagielski, M., Jia, W., Kenealy, K., Krikun, M.,
Kudugunta, S., Lan, C., Lee, K., Lee, B., Li, E., Li, M., Li, W., Li, Y., Li, J.,
Lim, H., Lin, H., Liu, Z., Liu, F., Maggioni, M., Mahendru, A., Maynez, J., Misra,
V., Moussalem, M., Nado, Z., Nham, J., Ni, E., Nystrom, A., Parrish, A., Pellat,
M., Polacek, M., Polozov, A., Pope, R., Qiao, S., Reif, E., Richter, B., Riley, P.,
Ros, A.C., Roy, A., Saeta, B., Samuel, R., Shelby, R., Slone, A., Smilkov, D., So,
D.R., Sohn, D., Tokumine, S., Valter, D., Vasudevan, V., Vodrahalli, K., Wang,
X., Wang, P., Wang, Z., Wang, T., Wieting, J., Wu, Y., Xu, K., Xu, Y., Xue, L.,
Yin, P., Yu, J., Zhang, Q., Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S.,
Wu, Y.: Palm 2 technical report (2023)
4. Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang,
J., Lee, J., Guo, Y., Manassra, W., Dhariwal, P., Chu, C., Jiao, Y., Ramesh, A.:
Improving image generation with better captions (2023), https://cdn.openai.
com/papers/dall-e-3.pdf
5. Bonilla, D.: Pixlore: A dataset-driven approach to rich image captioning (2023)
6. Burghardt, K., Hogg, T., Lerman, K.: Quantifying the impact of cognitive biases
in question-answering systems (2019)
7. Chen, T., Saxena, S., Li, L., Fleet, D.J., Hinton, G.: Pix2seq: A language modeling
framework for object detection (2022)
8. Chen, X., Wang, X., Beyer, L., Kolesnikov, A., Wu, J., Voigtlaender, P., Mustafa,
B., Goodman, S., Alabdulmohsin, I., Padlewski, P., Salz, D., Xiong, X., Vlasic,
D., Pavetic, F., Rong, K., Yu, T., Keysers, D., Zhai, X., Soricut, R.: Pali-3 vision
language models: Smaller, faster, stronger (2023)
9. Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D.,
Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J.,
Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A., Bradbury,
J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B.K., Riquelme, C., Steiner, A.,
Angelova, A., Zhai, X., Houlsby, N., Soricut, R.: Pali: A jointly-scaled multilingual
language-image model (2023)
10. Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi,
S.: Instructblip: Towards general-purpose vision-language models with instruction
tuning (2023)
11. Dai, W., Liu, Z., Ji, Z., Su, D., Fung, P.: Plausible may not be faithful: Probing
object hallucination in vision-language pre-training (2023)
12. Desai, K., Kaul, G., Aysola, Z., Johnson, J.: Redcaps: web-curated image-text data
created by the people, for the people (2021)
13. Doveh, S., Arbelle, A., Harary, S., Herzig, R., Kim, D., Cascante-bonilla, P., Al-
fassy, A., Panda, R., Giryes, R., Feris, R., Ullman, S., Karlinsky, L.: Dense and
aligned captions (dac) promote compositional reasoning in vl models (2023)
14. Gurari, D., Zhao, Y., Zhang, M., Bhattacharya, N.: Captioning images taken by
people who are blind (2020)
15. Hendricks, L.A., Nematzadeh, A.: Probing image-language transformers for verb
understanding (2021)
16. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9,
1735–80 (12 1997). https://doi.org/10.1162/neco.1997.9.8.1735
17. Hsieh, C.Y., Zhang, J., Ma, Z., Kembhavi, A., Krishna, R.: Sugarcrepe: Fixing
hackable benchmarks for vision-language compositionality. Advances in Neural In-
formation Processing Systems 36 (2024)
18. Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y.,
Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning
with noisy text supervision (2021)
19. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to
objects in photographs of natural scenes. In: Proceedings of the 2014 conference
on empirical methods in natural language processing (EMNLP). pp. 787–798 (2014)
20. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything
(2023)
21. Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova,
A., Rom, H., Uijlings, J., Popov, S., Kamali, S., Malloci, M., Pont-Tuset, J.,
Veit, A., Belongie, S., Gomes, V., Gupta, A., Sun, C., Chechik, G., Cai, D.,
Feng, Z., Narayanan, D., Murphy, K.: Openimages: A public dataset for large-
scale multi-label and multi-class image classification. Dataset available from
https://storage.googleapis.com/openimages/web/index.html (2017)
22. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for gen-
erating descriptive image paragraphs. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 317–325 (2017)
23. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalan-
tidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual genome: Con-
necting language and vision using crowdsourced dense image annotations (2016)
24. Kryściński, W., Keskar, N.S., McCann, B., Xiong, C., Socher, R.: Neural text
summarization: A critical evaluation. arXiv preprint arXiv:1908.08960 (2019)
25. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models (2023)
26. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text
Summarization Branches Out. pp. 74–81. Association for Computational Linguis-
tics, Barcelona, Spain (Jul 2004), https://aclanthology.org/W04-1013
27. Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P.,
Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft coco: Common objects in context
(2015)
28. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)
29. Ma, Z., Hong, J., Gul, M.O., Gandhi, M., Gao, I., Krishna, R.: Crepe: Can
vision-language foundation models reason compositionally? In: Proceedings of the
Weng, L., Wiethoff, M., Willner, D., Winter, C., Wolrich, S., Wong, H., Workman,
L., Wu, S., Wu, J., Wu, M., Xiao, K., Xu, T., Yoo, S., Yu, K., Yuan, Q., Zaremba,
W., Zellers, R., Zhang, C., Zhang, M., Zhao, S., Zheng, T., Zhuang, J., Zhuk, W.,
Zoph, B.: Gpt-4 technical report (2023)
34. OpenAI: Gpt-4. https://openai.com/research/gpt-4 (2023), [Online; ac-
cessed 19-February-2024]
35. OpenAI: Gpt-4v(ision) technical work and authors. https://cdn.openai.com/
contributions/gpt-4v.pdf (2023), [Online; accessed 19-February-2024]
36. Pandey, R., Purohit, H., Castillo, C., Shalin, V.L.: Modeling and mitigating
human annotation errors to design efficient stream processing systems with
human-in-the-loop machine learning. International Journal of Human-Computer
Studies 160, 102772 (2022). https://doi.org/10.1016/j.ijhcs.2022.102772,
https://www.sciencedirect.com/science/article/pii/S1071581922000015
37. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. In: Isabelle, P., Charniak, E., Lin, D. (eds.)
Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics. pp. 311–318. Association for Computational Linguistics, Philadelphia,
Pennsylvania, USA (Jul 2002). https://doi.org/10.3115/1073083.1073135,
https://aclanthology.org/P02-1040
38. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazeb-
nik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer
image-to-sentence models. In: Proceedings of the IEEE international conference on
computer vision. pp. 2641–2649 (2015)
39. Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V.: Connecting
vision and language with localized narratives. In: ECCV (2020)
40. Pu, A., Chung, H.W., Parikh, A.P., Gehrmann, S., Sellam, T.: Learning compact
metrics for mt. In: Proceedings of EMNLP (2021)
41. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International conference on machine learning. pp.
8748–8763. PMLR (2021)
42. Ray, A., Radenovic, F., Dubey, A., Plummer, B.A., Krishna, R., Saenko, K.: Cola:
A benchmark for compositional text-to-image retrieval (2023)
43. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object
detection with region proposal networks. In: Cortes, C., Lawrence, N., Lee, D.,
Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Sys-
tems. vol. 28. Curran Associates, Inc. (2015), https://proceedings.neurips.cc/
paper_files/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf
44. Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object halluci-
nation in image captioning (2019)
45. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour,
S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J.,
Norouzi, M.: Photorealistic text-to-image diffusion models with deep language un-
derstanding (2022)
46. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M.,
Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy,
S., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: Laion-5b: An open
large-scale dataset for training next generation image-text models (2022)
47. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In: Gurevych,
I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565. Association
for Computational Linguistics, Melbourne, Australia (Jul 2018). https://doi.
org/10.18653/v1/P18-1238, https://aclanthology.org/P18-1238
48. Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E.,
Bernardi, R.: Foil it! find one mismatch between image and language caption.
In: Proceedings of the 55th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers). Association for Computational Lin-
guistics (2017). https://doi.org/10.18653/v1/p17-1024, http://dx.doi.org/
10.18653/v1/P17-1024
49. Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: a dataset for image cap-
tioning with reading comprehension (2020)
50. Thapliyal, A.V., Pont-Tuset, J., Chen, X., Soricut, R.: Crossmodal-3600: A mas-
sively multilingual multimodal evaluation dataset (2022)
51. Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.:
Winoground: Probing vision and language models for visio-linguistic composition-
ality (2022)
52. Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano,
A.: A picture is worth more than 77 text tokens: Evaluating clip-style models on
dense captions (2023)
53. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need (2023)
54. Vedantam, R., Zitnick, C.L., Parikh, D.: Cider: Consensus-based image description
evaluation (2015)
55. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image
caption generator (2015)
56. Wikipedia contributors: Alt attribute — Wikipedia, the free encyclopedia. https:
//en.wikipedia.org/w/index.php?title=Alt_attribute&oldid=1189330128
(2023), [Online; accessed 15-January-2024]
57. Wikipedia contributors: Automated readability index — Wikipedia, the free
encyclopedia. https://en.wikipedia.org/w/index.php?title=Automated_
readability_index&oldid=1145735758 (2023), [Online; accessed 22-February-
2024]
58. Wikipedia contributors: Flesch–kincaid readability tests — Wikipedia, the free en-
cyclopedia. https://en.wikipedia.org/w/index.php?title=Flesch–Kincaid_
readability_tests&oldid=1192056958 (2023), [Online; accessed 22-February-
2024]
59. Wikipedia contributors: Gunning fog index — Wikipedia, the free encyclopedia.
https://en.wikipedia.org/w/index.php?title=Gunning_fog_index&oldid=
1181089308 (2023), [Online; accessed 22-February-2024]
60. Wikipedia contributors: Smog — Wikipedia, the free encyclopedia. https://en.
wikipedia.org/w/index.php?title=SMOG&oldid=1192815974 (2023), [Online; ac-
cessed 22-February-2024]
61. Wikipedia contributors: Jaccard index — Wikipedia, the free encyclopedia
(2024), https://en.wikipedia.org/w/index.php?title=Jaccard_index&oldid=
1196092673, [Online; accessed 24-January-2024]
62. Yarom, M., Bitton, Y., Changpinyo, S., Aharoni, R., Herzig, J., Lang, O., Ofek,
E., Szpektor, I.: What you see is what you read? improving text-image alignment
evaluation (2023)
63. Ye, A., Santy, S., Hwang, J.D., Zhang, A.X., Krishna, R.: Cultural and linguistic
diversity improves visual representations. arXiv preprint arXiv:2310.14356 (2023)
64. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual
denotations: New similarity metrics for semantic inference over event descriptions.
Transactions of the Association for Computational Linguistics 2, 67–78 (02 2014).
https://doi.org/10.1162/tacl_a_00166
65. Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A.,
Yang, Y., Ayan, B.K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H.,
Baldridge, J., Wu, Y.: Scaling autoregressive models for content-rich text-to-image
generation (2022)
66. Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why
vision-language models behave like bags-of-words, and what to do about it? (2023)
67. Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: Bertscore: Evaluating
text generation with BERT. In: 8th International Conference on Learning Repre-
sentations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net
(2020), https://openreview.net/forum?id=SkeHuCVFDr
7 Supplementary Material
7.1 Annotation Guidelines
We now present the full detailed annotation guideline used for IIW annotations.
Image descriptions should be composed such that they paint a vivid mental
picture of the actual image in the mind of someone hearing the description with
their eyes closed. In order to reach this level of detail, composed in an articulate
manner, we compile an extensive set of annotation guidelines. We iterated over
these guidelines with multiple pilot rounds.
The annotators are asked to operate as if they are instructing a painter to
paint with their words and only include details that can be deduced from visual
cues, erring on the side of higher precision. Unnecessary fragmentation of sen-
tences should be avoided in favor of a flowing, coherent style, and filler phrases
like "In this image," "we can see," "there is a," and "this is a picture of" should
be avoided, since they add no visual detail and only add verbosity.
Objects form the building blocks of an image. Interactions and spatial arrange-
ments among them help to form the context of the image. In complex multi-
object images with dense settings, noting each and every object independently
can become cumbersome and highly dependent on the effort the particular hu-
man annotator puts in. To define this better and expect a consistent behavior
from the annotation outputs, we introduce the notion of salient objects. Key ob-
jects without which the image would lose its context and meaning are considered
salient. This can include individual objects or combinations of them depending
on the role they play in the image; consider the following 2 cases as examples:
– Three people in the blurry background of an image, with the scene set inside
a coffee shop, who play no concrete role individually can be grouped as people
in the background instead of 3 individual people object annotations.
– Two people in the foreground and in-focus, engaged in a conversation in the
same scene. The two individuals are likely the focus of the image and hence
worth noting individually in detail as separate objects. This is likely what
the photographer was attempting to capture.
When composing the overall image description, start with a newspaper style
tldr sentence that paints a very clear high level picture. Describe the objects
in order of their saliency while noting the description of individual objects and
relationships in a coherent manner. Include the overall setting, background, style,
and consider:
Camera angle (i.e., the position of the camera in relation to the subject)
is crucial, as this sets a precedent for what level and kind of information to
expect. The choice of camera angle can have a significant impact on the mood
and meaning of a photograph. Different camera angles can be used to create
different effects and convey different messages, e.g., details about a close-up
are different from those of a wide angle shot. Examples of camera angles (see
Figure 5):
Fig. 5: Camera Angles to Consider when Annotating Images. These are important as
they set expectations for the level and kind of information in the image description.
– Eye level: The camera is positioned at the same level as the subject’s eyes.
This is the most natural and neutral camera angle.
– High angle: The camera is positioned above the subject. This angle can make
the subject appear smaller, weaker, or less important.
– Low angle: The camera is positioned below the subject, anywhere below the
eye line, looking up. This angle can make the subject appear larger, stronger,
or more important. Sometimes, it is even directly below the subject’s feet.
– Ground level: The camera is positioned at the ground level. This angle cap-
tures what is in the frame at ground level, that is, the feet, or maybe the
character lying on the ground.
– Dutch tilt: The camera is tilted on its axis. This angle can be used to create
a sense of unease or disorientation.
– Bird’s-eye view: The camera is positioned directly above the subject. This
angle can be used to show the subject’s relationship to their surroundings.
– Worm’s-eye view: The camera is positioned directly below the subject. This
angle can be used to create a sense of awe or wonder.
– Top-down view or Overhead shot: The camera is above the subject and the
photograph is taken straight downwards, not at any kind of angle. It is typically
closer to the subject than a bird's-eye view (see Figure 5 for comparison).
Some other terms that are sometimes used to describe camera angles and
depths:
Fig. 6: An Example where Quoting Text in a Detailed Manner can Enable Precise
Reconstruction. In the multi-line phrase ("Juice," "ACROSS THE," "Universe"), the
words "Juice" and "Universe" are capitalized while "ACROSS THE" is fully upper-cased,
and all components are aligned along a diagonal. Information on the font color, type,
and shadow effect should be included. For the phrase ("FREE," "ARCADE," "GAMES"), all
words are upper-cased, vertically stacked, and centrally aligned.
not operate under the assumption that one gender is more common in that
profession.
For any apparel, the description should focus on the overall style, unique de-
tails, silhouette of the garment, how it fits, and the fabric, color, shades, and tone
of the garment. If branding is visible, it should be included, while attributes
like size should be skipped unless visually verifiable.
Where applicable use locale specific names of objects like clothing (e.g.,
sherwani, kurta, kimono, saree), food (e.g., shawarma, dosa, paneer tikka) etc.
The aim is to capture the locale specific vocabulary so the downstream models
can pick them up instead of using generic abstract terms.
For art pieces, include art styles, time periods, mediums, moods, viewpoints,
subject matters, cultures as much as possible from the visual cues.
Fig. 7: Human-in-the-Loop Learning. Over time with a constant annotator pool, each
annotator gets an opportunity to read and learn from others’ perspective via an implicit
feedback loop. This has been shown to improve individual annotator quality, as discussed in the
main paper.
1. Heavy reliance on the quality of the base dense description from the first
annotator. If the quality is not good, the annotator in the next round will
spend considerable time fixing the input. There are 2 mitigating steps:
(a) Monitor this at the beginning of the annotation project when the an-
notators are still new to the task using metrics like edit-distance and
provide explicit feedback to the annotators as needed.
(b) Annotators in each round have the option to start from scratch if they
deem the quality from the previous round to be considerably low. Use
this as feedback for the annotator from the previous round by presenting
them the edited output to learn from.
Fig. 8: IIW Annotation UI for Task-1. We illustrate the seed object-detection outputs
and the VLM-generated object-level captions, with the cropped object image bytes as input.
Advantages:
1. Reduces the dependency on the human both in terms of number of rounds
and annotation time.
2. Provides a way to evaluate current model quality by monitoring the time,
volume and patterns of augmentations during the human annotation stage.
Some considerations to keep in mind:
1. As discussed above, the effectiveness relies very heavily on the capability of
the model, i.e., having high comprehensiveness and low hallucinations.
– Edit: make adjustments to the label and/or bounding box. This can include:
• Making the labels more specific, e.g., Animal to German Shepherd.
• Enlarging or tightening the bounding box by expanding or contracting the
seed box.
Fig. 9: IIW Annotation UI for Task-1. We illustrate the human augmented salient
objects and their human-authored descriptions. The annotations are built on seed
information from Figure 8. This example demonstrates how humans can alter the seed
annotations based on the annotation guidelines, which can include merging, deleting,
editing and adding new salient objects and then describing each.
Fig. 10: IIW Annotation UI for Task-2 with seed VLM description. This VLM has
been fine-tuned in an active learning mode as data was collected iteratively. The seed
caption from the same VLM (PaLI-5B) without the IIW fine-tuning is “a pink bicycle
with a basket of flowers on it.” The seed annotation is then refined and augmented by
human annotators, see Figure 11 for the final resulting description.
For each (label, bounding box) pair, we ask the annotators to generate a
detailed description focused on the object in the context of the image considering
the axes described above as reference (see Section 7.1).
Annotation Task-2: Overall Image Description In Task-2, human annota-
tors are presented with the annotations from Task-1 and a seeded VLM descrip-
tion (see Figure 10) which is then refined by human annotators in sequential
rounds to produce the final hyper-detailed description (see Figure 11).
Image Region Tasks Using one object at a time from the list of (label, bound-
ing box, description) Task 1 annotations, we perform three region-based tasks.
We use normalized bounding boxes in [ymin, xmin, ymax, xmax ] format as in
Pix2Seq [7]. Our first task is description-label grounding. In multi-object dense
Fig. 11: IIW Final Annotation UI for Task-2. We illustrate the human annotations
available from Task-1 as the human annotators hover over the salient objects in the
image. The annotators can additionally switch between hiding all salient objects to
view the image properly. Task-2 annotation starts with the seed caption from the
VLM and is then refined by human annotators in sequential rounds, building on top
of the previous round’s output.
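Returning to the region tasks above, here is a minimal sketch of the normalized [ymin, xmin, ymax, xmax] box representation; the optional integer binning shown here follows the general Pix2Seq idea and is an illustrative choice, not the exact recipe.

```python
def normalize_box(box_xyxy, width, height):
    """Convert a pixel-space (xmin, ymin, xmax, ymax) box into normalized
    [ymin, xmin, ymax, xmax] order, as used for the region tasks."""
    xmin, ymin, xmax, ymax = box_xyxy
    return [ymin / height, xmin / width, ymax / height, xmax / width]

def box_to_tokens(norm_box, bins=1000):
    """Optionally quantize normalized coordinates into discrete bins
    (Pix2Seq-style sequence tokens; the bin count here is arbitrary)."""
    return [min(int(coord * bins), bins - 1) for coord in norm_box]

# Example: a 100x160 pixel box with top-left corner (20, 40) in a 640x480 image.
print(box_to_tokens(normalize_box((20, 40, 120, 200), width=640, height=480)))
# -> [83, 31, 416, 187]
```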
Salient Objects Tasks Our next category of fine-tuning tasks concerns the
salient objects in an image. We target the aggregated list of (label, bounding
box) object features per image from Task 1. Our first task is label generation, in
which given an image, we aim to generate a text list of the salient object labels.
The object labels are sorted alphabetically for consistency, but in future work
ordering by saliency would be useful. Our second object-level task is grounded
label generation. The task is to generate the list of (label, bounding box) pairs
per object in the image; we similarly sort the list alphabetically with respect to
label name.
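A sketch of how the two object-level targets can be serialized from the Task-1 (label, bounding box) lists; the exact target string format is an assumption for illustration.

```python
def label_generation_target(objects):
    """Text list of salient object labels, sorted alphabetically for consistency."""
    return ", ".join(sorted(obj["label"] for obj in objects))

def grounded_label_generation_target(objects):
    """(label, bounding box) pairs, sorted alphabetically by label; boxes are the
    normalized [ymin, xmin, ymax, xmax] values from Task 1."""
    parts = []
    for obj in sorted(objects, key=lambda o: o["label"]):
        box = " ".join(f"{coord:.3f}" for coord in obj["box"])
        parts.append(f"{obj['label']} [{box}]")
    return "; ".join(parts)

objects = [
    {"label": "pink bicycle", "box": [0.20, 0.10, 0.95, 0.80]},
    {"label": "flower basket", "box": [0.25, 0.40, 0.55, 0.70]},
]
print(label_generation_target(objects))           # flower basket, pink bicycle
print(grounded_label_generation_target(objects))
```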
Detailed Description Tasks Finally, our last fine-tuning tasks relate to the
sequentially annotated descriptions from Task 2. We perform description elab-
oration in addition to direct description generation. Given the image and de-
scription from the Nth step, description elaboration trains the model to elaborate
the current description into the final description. We also create synthetically cor-
rupted versions of the final description to serve as additional training samples.
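A sketch of how the elaboration and corruption training pairs might be built; the sentence-dropping corruption below is an assumed scheme, since the exact corruption method is not specified.

```python
import random

def elaboration_pairs(round_outputs):
    """(input, target) pairs: each intermediate round's description should be
    elaborated into the final description."""
    final = round_outputs[-1]
    return [(draft, final) for draft in round_outputs[:-1]]

def corrupt_description(description, drop_prob=0.3, seed=0):
    """Synthetically corrupted variant of the final description, produced here by
    randomly dropping sentences (illustrative corruption only)."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in description.split(". ") if s.strip()]
    kept = [s for s in sentences if rng.random() > drop_prob] or sentences[:1]
    return ". ".join(kept).rstrip(".") + "."
```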
Fig. 12: IIW based VLM Fine-tuning Tasks. We show tasks based on data collected
from Task-1 and Task-2 per the IIW annotation framework. Different tasks enable the
fine-tuning to focus on the image at (object, attribute), (image, objects) or (image,
hyper-detailed description) levels.
7.4 Experiments
Automatic Readability Measurements Building on the discussion from
the main paper around the automatic readability metrics, we study additional
distribution-based readability metrics in Figure 13. The distributions further
support the previous metrics to demonstrate a more mature writing style in
both the IIW human-authored dataset and fine-tuned model generated outputs.
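Such readability profiles can be computed per description with off-the-shelf tooling, e.g. the open-source textstat package (our choice for illustration; not necessarily the implementation behind the reported numbers).

```python
import textstat  # pip install textstat

def readability_profile(description: str) -> dict:
    """Readability scores for a single image description."""
    return {
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(description),
        "flesch_reading_ease": textstat.flesch_reading_ease(description),
        "smog_grade": textstat.smog_index(description),
        "gunning_fog": textstat.gunning_fog(description),
    }
```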
Fig. 13: Distributions of readability metrics (Flesch-Kincaid Grade Level, Flesch Reading Ease, SMOG Grade Level, Gunning Fog Grade Level) for IIW, DCI, and DOCCI.
(a) Distribution on the Human Authored Datasets from DCI, DOCCI and IIW.
(b) Distribution on the Fine-tuned Model Generated Outputs from DCI, DOCCI and IIW.
\begin{align*}
\text{Recall} &= \mathrm{avg}(\text{Comprehensiveness}, \text{Specificity})\\
\text{Precision} &= \text{Hallucination}\\
\text{Writing Style} &= \mathrm{avg}(\text{TLDR}, \text{Human-Like})\\
\text{Overall} &= \mathrm{avg}(\text{Recall}, \text{Precision}, \text{Writing Style})
\end{align*}
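Expressed as code, the aggregation above (assuming each input is already a net SxS preference score on the same scale):

```python
def aggregate_sxs(comprehensiveness, specificity, hallucination, tldr, human_like):
    """Roll the five SxS metrics up into recall, precision, writing style, and overall."""
    recall = (comprehensiveness + specificity) / 2
    precision = hallucination
    writing_style = (tldr + human_like) / 2
    overall = (recall + precision + writing_style) / 3
    return {
        "recall": recall,
        "precision": precision,
        "writing_style": writing_style,
        "overall": overall,
    }
```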
Fig. 14: Human SxS Annotation UI. Annotators are shown the input image and two
input image descriptions to evaluate side-by-side. The input descriptions could be from
any combination of (human, model) sources. This information is not shared with the
annotators and the sources are randomly flipped and marked as A or B to prevent any
source or order based bias.
Fig. 15: Human SxS Annotation UI responses for the input image and two image de-
scription pairs (see Figure 14). The annotators respond to the 5 metrics independently
on a 5 point scale. They are additionally asked to justify their choices which can be
used to sanity check and perform quality sweeps.
not be significant. When evaluating the DCI, DOCCI, and IIW test sets with
BLEURT, we instead find a slight preference for IIW models. Across all three
datasets, BLEURT shows PaLI-IIW variants perform better or similarly to the
same-domain test set. Thus, newer metrics may reveal IIW fine-tuned models
generalize better than models fine-tuned on other datasets.
Table 8: Human SxS to Evaluate IIW Fine-tuned PaLI-3 5B Model Predictions when
compared to IIW Human-Authored Data on IIW-400 using 100 samples.
IIW-400
Metric             IIW-Human ++    +    -    +   ++ IIW-Model
Comprehensiveness             40   43   12    4    1
Specificity                   79   14    5    2    0
Hallucinations                 6   46   33   17    4
TLDR                          29   43   14   10    4
Human-Like                    27   32   34    6    1
Table 10: Additional Automatic Metric Results. We report CIDEr, BERTScore (re-
ferred to as BERT in table due to space), and BLEURT metrics for all fine-tuned
models. We compare DCI, DOCCI, IIW, and IIW Comb. (Combined).
(DCI, DOCCI and IIW). Figure 16 showcases the prompts and the T2I model
outputs from three descriptions along with the original image.
We then asked human annotators to rank the generated images by how sim-
ilar they are to the original image. The image most similar to the original image
is ranked number 1. We allowed generated images to be ranked the same if they
are very similar. Figure 17(a) shows the reconstruction rank counts for all the
sentence counts and Figure 17(b) shows the rank counts when we use sentence 1,
sentences 1–2, sentences 1–3, and sentences 1–4. Sentences
from IIW descriptions are ranked first much more frequently than sentences from
DCI and DOCCI descriptions. Specifically, for the first sentence, the difference
is most notable, supporting our claim that IIW descriptions are higher quality
earlier on and IIW first sentences are designed to capture a TLDR.
“Given the following image description and image caption options, choose
the most likely OPTION number :
IMAGE-DESCRIPTION : <DESCRIPTION>
OPTIONS : <CHOICES>
RESPONSE : ”
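A sketch of how this prompt can be assembled for each benchmark example; the option numbering and spacing are assumed conventions that mirror the template above.

```python
def build_compositionality_prompt(description: str, choices: list) -> str:
    """Fill the prompt template above with a generated image description and the
    benchmark's caption options (e.g., the true and false captions in ARO)."""
    options = "\n".join(f"OPTION {i + 1}: {choice}" for i, choice in enumerate(choices))
    return (
        "Given the following image description and image caption options, "
        "choose the most likely OPTION number:\n"
        f"IMAGE-DESCRIPTION: {description}\n"
        f"OPTIONS:\n{options}\n"
        "RESPONSE: "
    )
```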
Image Description Model          ARO [66]            SVO-Probes [15]  Winoground [51]
                                 VG-A      VG-R
None (Language Bias Baseline)    56.50     59.94      50.71            49.88
InstructBLIP-Vicuna-7B           83.99     62.73      89.35            65.25
LLaVA-V1.5-7B                    84.80     63.71      87.89            63.38
PaLI-3 + DCI 5B                  88.19     66.47      86.50            64.62
PaLI-3 + DOCCI 5B                89.70     68.85      88.73            69.50
PaLI-3 + IIW 5B                  90.37     66.19      88.66            69.38
PaLI-3 + IIW Combined 5B         89.46     64.88      87.78            66.88
As discussed in the main paper, we enrich 1k samples from two existing im-
age caption datasets, namely, Localized Narratives and Crossmodal (XM) 3600,
with new image descriptions generated by IIW fine-tuned models. The goal of this
enrichment is to make richer descriptions available for these widely used evaluation sets.
7.7 Limitations
Finally, we discuss the limitations of our annotation framework and evaluations.
In our annotation framework, we define a seeded and sequential annotation pro-
cess, with both aspects having potential limitations. The quality of the seeded
data is of high importance as it will ultimately affect the rest of our human
annotation pipeline. Additionally, even with the best possible seeds, they may
limit the scope of what our crowd workers write by biasing them towards cer-
tain objects or phrases. In terms of limitations for the sequential augmentation
used, unnecessary time may be spent by annotators if the first annotator output
quality is low. By monitoring the initial draft descriptions, quality can be better
ensured so that the framework is as efficient as possible.
With respect to the evaluation of our human-annotated data and model-
generated outputs, we only perform evaluations on hundreds of samples (as
opposed to thousands or more). This is due to how expensive and time consum-
ing human SxS evaluations are, but we note that IIW is rated marginally and
substantially better at a much higher rate, which would likely scale to more sam-
ples. Our work is also inherently limited by the lack of metrics available for long
descriptions. We still report standard text similarity metrics and complement
them with human SxS, but in future work the text length limitations should be
addressed so that automated metrics can be applied.
While we currently do not plan to open source our models or training set,
we do release an evaluation set over images that can serve as a unified bench-
mark for IIW, recent and future related work. We also open source the human
SxS judgements and samples enriched from Localized Narratives and XM3600.
Lastly, as also mentioned in the Conclusion of the main text, our initial IIW
dataset and resulting models are English-only. In the future, we plan to expand
our work to have multilingual coverage. We also would like to curate image de-
scriptions that have more specifics with respect to locale/geographical location,
so that we do not strictly have descriptions with a western lens.
Fig. 16: T2I Outputs and Human Ranking Evaluations. We show example T2I results
where the first sentence, first two sentences, ..., all the sentences of the image descrip-
tions from DCI, DOCCI and IIW models are fed sequentially as inputs, i.e., at each
step an additional sentence chunk is fed to the T2I model.
(a) Reconstruction Rank Counts across Inputs over All Cumulative Sentence Chunks
(b) Reconstruction Rank Counts across Inputs of Specific Cumulative Sentence Chunks
Fig. 17: T2I Human Rank Distributions. We illustrate bar plots for the image recon-
struction evaluation results using image descriptions from finetuned PaLI-5B models
on three datasets (DCI, DOCCI, IIW). Images reconstructed from IIW descriptions
are consistently ranked better than other descriptions.
Table 13: Percentages from the Main Text. We reference each percentage and define
how they were calculated for clarity.
Table 14: Percentages from the Main Text. We reference each percentage and define
how they were calculated for clarity.
Table 15: Percentages from the Main Text. We reference each percentage and define
how they were calculated for clarity.