
INDUS: Effective and Efficient Language Models for Scientific Applications

Bishwaranjan Bhattacharjee1, Aashka Trivedi1, Masayasu Muraoka1,
Muthukumaran Ramasubramanian3, Takuma Udagawa1, Iksha Gurung3, Rong Zhang1,
Bharath Dandala1, Rahul Ramachandran2, Manil Maskey2, Kaylin Bugbee2, Mike Little4,
Elizabeth Fancher2, Lauren Sanders5, Sylvain Costes5, Sergi Blanco-Cuaresma6, Kelly Lockhart6,
Thomas Allen6, Felix Grezes6, Megan Ansdell7, Alberto Accomazzi6, Yousef El-Kurdi1,
Davis Wertheimer1, Birgit Pfitzmann1, Cesar Berrospi Ramis1, Michele Dolfi1, Rafael Teixeira de Lima1,
Panagiotis Vagenas1, S. Karthik Mukkavilli1, Peter Staar1, Sanaz Vahidinia7, Ryan McGranaghan8,
Armin Mehrabian9, Tsendgar Lee7

1 IBM Research AI, 2 NASA MSFC, 3 UAH, 4 Navteca, 5 NASA Ames, 6 Harvard-Smithsonian CfA, 7 NASA HQ, 8 JPL, 9 NASA GSFC

arXiv:2405.10725v2 [cs.CL] 20 May 2024

Abstract

Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained using a diverse set of datasets drawn from multiple sources to address information retrieval tasks, and (3) smaller versions of these models created using knowledge distillation techniques to address applications which have latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SCIBERT) on these new tasks as well as existing benchmark tasks in the domains of interest.

1 Introduction

Large language models (LLMs) trained on huge amounts of data have demonstrated impressive capabilities on natural language understanding and generation tasks. Most popular LLMs rely on the transformer architecture (Vaswani et al., 2017) and are trained using general-purpose corpora like Wikipedia or CommonCrawl (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Brown et al., 2020; Touvron et al., 2023). Although these general-purpose models exhibited strong performance, the distributional shift of vocabulary led to sub-optimal performance on domain-specific natural language understanding and generation tasks (Beltagy et al., 2019). Following this observation, several domain-specific LLMs such as SCIBERT (Beltagy et al., 2019), BIOBERT (Lee et al., 2019), MATBERT (Walker et al., 2021), BATTERYBERT (Huang and Cole, 2022) and SCHOLARBERT (Hong et al., 2023) were developed with the goal of improving accuracy on in-domain NLP tasks (Lee et al., 2019; Araci, 2019; Wu et al., 2023).

In this research, we specifically focused on interdisciplinary fields related to the Earth, celestial bodies, the Sun, and planets within our solar system, such as physics, Earth science, astrophysics, heliophysics, planetary sciences and biology. While the training corpora of existing domain-specific models such as SCIBERT, BIOBERT and SCHOLARBERT partially cover some of these fields, there is currently no model available that encompasses all of the fields of interest collectively. Further, the interdisciplinary nature of these domains is reflected in a vast body of literature scattered across diverse sources. Thus, we developed INDUS, a collection of encoder-based LLMs focused on these domains of interest (Figure 1), trained using meticulously curated corpora from diverse sources. We believe this work will facilitate research organizations and enterprises working in these fields by providing efficient access to relevant literature and enabling informed decision-making.

Contact: bhatta@us.ibm.com, aashka.trivedi@ibm.com, muthukumaranr17@gmail.com, rahul.ramachandran@nasa.gov
[Figure 1 diagram: the encoder training corpus (Earth science, biomedical, astrophysics, astronomy, general science and general English data) feeds masked-language-modeling pretraining of Indus-Base, which is distilled into Indus-Small; contrastive-learning fine-tuning on scientific corpora, open QA, duplicate pairs, citation pairs and fact-verification data produces Indus-Retriever-Base, which is distilled into Indus-Retriever-Small. Evaluation spans the BLURB and BEIR benchmarks plus CLIMATE-CHANGE NER, NASA-QA and NASA-IR.]

Figure 1: Overview of INDUS models: the general-purpose encoder model and the retriever built from it, and their distilled counterparts. Also shown are the benchmarks used for evaluation, highlighting our new benchmarks, NASA-QA, CLIMATE-CHANGE NER and NASA-IR.

Specifically, we make the following contributions:
1. Utilizing the byte-pair encoding algorithm, we constructed INDUSBPE, a customized tokenizer, from the curated scientific corpus.
2. We pretrained multiple encoder-only LLMs using the curated scientific corpora and the INDUSBPE tokenizer (§2, §3). We further created sentence-embedding models by fine-tuning the encoder-only models with a contrastive learning objective to learn "universal" sentence embeddings (Gao et al., 2021) (§4). We also trained smaller, more efficient versions of these models using knowledge-distillation techniques (§3.3, §4.2).
3. We created three new scientific benchmark datasets, CLIMATE-CHANGE NER (an entity recognition task), NASA-QA (an extractive question answering task) and NASA-IR (a retrieval task) (§5), to further accelerate research in this multi-disciplinary field.
4. Through experimental results, we show strong performance by our models on these benchmark tasks as well as on existing domain-specific benchmarks, outperforming general-purpose models like RoBERTa (Liu et al., 2019) as well as scientific-domain encoders like SCIBERT (Beltagy et al., 2019). We also show that the knowledge-distilled models achieved a significant boost in latency while maintaining strong empirical performance compared to the original models on most of the benchmark tasks.

2 Data

Sufficient high-quality in-domain corpora are essential to develop models that perform better than their counterparts trained on open-domain corpora. We meticulously identified corpora for each of the aforementioned domains, and created English-only models for the sake of containment. Specifically, for each of the domains, we used open-source data which has a permissive license, and further augmented it with specific data from NASA and its data providers. To aid in the learning of general English, we also included English Wikipedia in our training corpora. We briefly describe each data source below, and present statistics of the data in Table 1.

• SAO/NASA Astrophysics Data System (ADS, https://ui.adsabs.harvard.edu): ADS is the biggest source of data, covering publications in the areas of astronomy and astrophysics, physics and general science, including all arXiv e-prints.
Dataset | Domain | #Tokens | Ratio
NASA CMR | Earth Science | 0.3B | 1%
AMS and AGU papers | Earth Science | 2.8B | 4%
English Wikipedia | General | 5.0B | 8%
PubMed Abstracts | Biomedical | 6.9B | 10%
PMC | Biomedical | 18.5B | 28%
SAO/NASA ADS | Astronomy, Astrophysics, Physics, General Science | 32.7B | 49%
Total | | 66.2B | 100%

Table 1: Basic statistics of our pretraining dataset.

• PubMed Central (PMC, https://www.ncbi.nlm.nih.gov/pmc): PMC is a full-text archive of biomedical and life science journal literature maintained by the National Library of Medicine and the National Institutes of Health. We used the portion of PMC that has a commercial-friendly license, along with the PubMed abstracts of all the articles in PMC.
• American Meteorological Society (AMS, https://www.ametsoc.org/index.cfm/ams/publications/): We used full-text journal documents spanning topics in Earth systems, Earth interactions, applied meteorology and climatology, physical oceanography, atmospheric sciences, climate, hydrometeorology, weather and forecasting, and societal impacts.
• American Geophysical Union (AGU, https://agupubs.onlinelibrary.wiley.com/): The AGU dataset included journal documents across the topics of atmospheres, biogeosciences, Earth surface, machine learning and computation, oceans, planets, solid earth, and space physics.
• NASA Common Metadata Repository (CMR, https://www.earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr): CMR is a high-performance, high-quality, continuously evolving metadata system that catalogs all data and service metadata records for NASA's Earth Science Data and Information System (ESDIS). It contains text descriptions of the NASA Earth science data products.

3 Methodology: Encoder Models

3.1 INDUSBPE Tokenizer

We trained a BPE tokenizer (Radford et al., 2019), INDUSBPE, from scratch using a subset of our training dataset (§2) with the Hugging Face tokenizers library (https://github.com/huggingface/tokenizers). For a fair comparison, we set the vocabulary size to 50,265, which is equal to that of the RoBERTa tokenizer (Liu et al., 2019), and used the uncased variation of both tokenizers.

We performed a brief analysis to understand the differences between the vocabularies of INDUSBPE and the RoBERTa tokenizer. Out of 50,265 tokens, 22,355 (44.5%) are common to both tokenizers while the remaining 27,910 (55.5%) are included in only one of them, indicating a significant distributional shift in domain. To further understand the effect, we applied both RoBERTa and INDUSBPE to 1,000 randomly sampled text fragments from our datasets. These text fragments varied from full documents to abstracts to single sentences. As shown in Table 2, the INDUSBPE tokenizer produced fewer tokens than the RoBERTa tokenizer, leading to an approximately 8% drop in computation cost during training.

Tokenizer | ADS | PMC | Wikipedia
RoBERTa | 12,867,439 | 7,549,075 | 15,859
+lower_cased | 12,862,227 | 7,557,868 | 16,901
INDUSBPE | 12,309,023 | 6,920,659 | 16,056

Table 2: Number of tokens produced by the RoBERTa and INDUSBPE tokenizers applied to 1k samples from each dataset. Fewer tokens lead to a smaller computation cost.

Table 3 compares the RoBERTa tokenizer and the INDUSBPE tokenizer, illustrating that the proposed tokenizer treats scientific terms (such as biomarkers, phosphorylated, alzheimer) as single tokens while the RoBERTa tokenizer splits these words into multiple subword pieces.

3.2 Encoder Model

We first trained an encoder-only model, INDUSBASE, using a masked language modeling objective. The model architecture follows RoBERTaBASE (Liu et al., 2019), which consists of 12 layers and has 125M parameters. We adopted the default hyperparameters (we refer readers to Table 9 in Liu et al. (2019)) but with an effective batch size of 9,216. We trained the model for 500K steps using 192 V100 GPUs.
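To make the pretraining setup in §3.2 concrete, the following is a minimal sketch of masked-language-model pretraining with the Hugging Face transformers library. It is an illustration only: the tokenizer file, corpus path, learning rate and per-device batch size below are assumptions, not the exact INDUSBASE training configuration.

```python
# Minimal sketch: MLM pretraining with a RoBERTa-base-sized config and a custom BPE tokenizer.
from transformers import (
    RobertaConfig, RobertaForMaskedLM, PreTrainedTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)
from datasets import load_dataset

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="indus_bpe.json",          # hypothetical path to the trained BPE tokenizer
    unk_token="<unk>", pad_token="<pad>",
    bos_token="<s>", eos_token="</s>", mask_token="<mask>",
)

config = RobertaConfig(vocab_size=50265, num_hidden_layers=12)   # RoBERTa-base-like, ~125M params
model = RobertaForMaskedLM(config)

dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]  # hypothetical corpus file
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Dynamic 15% masking, as in RoBERTa-style MLM pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="indus-base-mlm",
    per_device_train_batch_size=48,           # illustrative; the effective batch size comes from devices/accumulation
    max_steps=500_000,
    learning_rate=6e-4,                       # assumed value; the paper reuses RoBERTa's default hyperparameters
    save_steps=50_000,
)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```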
Input text
novel tau biomarkers phosphorylated at t181, t217 or t231 rise in the initial stages of the preclinical
alzheimer’s continuum when only subtle changes in a pathology are detected
Tokenization by RoBERTa
<s> no vel t au biomark ers phosph ory lated at t 181 , t 217 , or t 231 rise in the initial stages of the preclinical
al z heimer ’ s continuum when only subtle changes in a pathology are detected </s>
Tokenization by I NDUS BPE
<s> novel tau biomarkers phosphorylated at t 181 , t 217 , or t 231 rise in the initial stages of the preclinical
alzheimer ’ s continuum when only subtle changes in a pathology are detected </s>

Table 3: Tokenization comparison between RoBERTa and our tokenizers. Input text adapted from Suárez-Calvet
et al. (2020).
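The comparison in Tables 2 and 3 can be reproduced along the following lines. This is a minimal sketch assuming the Hugging Face tokenizers and transformers libraries; the paper states that HF tokenizers was used, but the specific training options shown here (special-token list, byte-level pre-tokenization, lowercasing setup) are assumptions.

```python
# Minimal sketch: train an uncased byte-level BPE tokenizer with a 50,265-token
# vocabulary and compare its token counts against RoBERTa's tokenizer.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers, decoders
from transformers import AutoTokenizer

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.normalizer = normalizers.Lowercase()                     # uncased variant
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50265,                                        # matches RoBERTa's vocabulary size
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tok.train(files=["scientific_corpus.txt"], trainer=trainer)  # hypothetical corpus subset
tok.save("indus_bpe.json")

# Token-count comparison in the spirit of Table 2.
roberta = AutoTokenizer.from_pretrained("roberta-base")
samples = ["novel tau biomarkers phosphorylated at t181, t217 or t231 ..."]
n_roberta = sum(len(roberta.tokenize(s)) for s in samples)
n_custom = sum(len(tok.encode(s).tokens) for s in samples)
print(f"RoBERTa tokens: {n_roberta}, custom BPE tokens: {n_custom}")
```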

3.3 Knowledge Distillation for Efficient Encoder Model

We also trained a smaller model, INDUSSMALL, with 38M parameters, through knowledge distillation techniques using INDUSBASE as the teacher. INDUSSMALL follows a 4-layer architecture recommended by the Neural Architecture Search engine (Trivedi et al., 2023) with an optimal trade-off between performance and latency. We adopted the distillation objective proposed in MiniLMv2 (Wang et al., 2021) to transfer fine-grained self-attention relations, which has been shown to be the current state-of-the-art (Udagawa et al., 2023). Using this objective, we trained the model for 500K steps with an effective batch size of 480 on 30 V100 GPUs.

4 Methodology: Sentence Embedding Models

4.1 Sentence Embedding Model

Text embeddings represent text as low-dimensional vectors, allowing for efficient use in dense retrieval systems, where relevant passages for a given query are identified on the basis of the similarity between their embeddings (Karpukhin et al., 2020).

Contrastive Learning Objective  Sentence embedding models trained using a contrastive learning objective (Khosla et al., 2020; Gao et al., 2021) push the embeddings of a query closer to that of a relevant passage and further away from that of a non-relevant passage.

Inspired by recent work (Li et al., 2023), we used an improved contrastive loss by introducing an additional bidirectional signal. Specifically, for a triple {q, p^+, P^-} of a query, a relevant (positive) passage, and a set of non-relevant (negative) passages P^- = \{p_j^-\}_{j=1}^{m}, we define the InfoNCE loss (van den Oord et al., 2019) as:

\mathcal{L}_{IC} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{s(q_i, p_i^{+})}}{Z_i}    (1)

Z_i = \sum_{j} e^{s(q_i, p_j)} + \sum_{j} e^{s(q_j, p_i)} + \sum_{j \neq i} e^{s(q_i, q_j)} + \sum_{j \neq i} e^{s(p_i, p_j)}    (2)

where s(q, p) is the temperature-scaled cosine similarity between the embeddings of a query and a passage:

s(q, p) = \frac{1}{\tau} \cdot \frac{E(q) \cdot E(p)}{\lVert E(q) \rVert \, \lVert E(p) \rVert}    (3)

Training Data  Similar to prior work (Wang et al., 2022; Li et al., 2023; Xiao et al., 2023), we employed a stage-wise training approach for our sentence embedding model:
1. Unsupervised training: we first trained on a large corpus of 300 million samples of naturally occurring pairs collected from internet sources, such as Wikipedia, StackExchange, etc. We also included scientific data from PubMed, PMC (§2), Arxiv and S2ORC (Lo et al., 2020) as in-domain data for our science-oriented retriever model. Furthermore, we created a domain-specific dataset from the ADS data (§2) by including title-abstract pairs.
2. Supervised fine-tuning: we further trained on high-quality annotated datasets, such as NQ (Kwiatkowski et al., 2019), SQuAD (Rajpurkar et al., 2016), SPECTER pairs (Cohan et al., 2020), etc. We included the aforementioned ADS data and a sample of the S2ORC data in this step, to boost domain-specific signals.

Appendix A contains comprehensive details about the datasets used in training. For both training stages, we used large batch sizes and in-batch negatives to better approximate the contrastive learning objective. During training, we sampled batches from each data source proportionately to its size, similar to Li et al. (2023).
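To illustrate Equations 1-3, the following is a minimal PyTorch sketch of the bidirectional, temperature-scaled InfoNCE loss over a batch of query and passage embeddings obtained with mean pooling. It is a simplified illustration under stated assumptions (all other in-batch items act as negatives), not the exact training code.

```python
# Minimal sketch of the bidirectional InfoNCE loss in Eqs. 1-3 (illustrative only).
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state, attention_mask):
    # Mean pooling over non-padding tokens, the pooling used for the base retriever.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def bidirectional_info_nce(q_emb, p_emb, tau=0.05):
    """q_emb, p_emb: (n, d) embeddings of queries and their positive passages.
    In-batch items other than the aligned pair act as negatives (Eq. 2)."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    n = q.size(0)

    s_qp = q @ p.T / tau            # s(q_i, p_j)
    s_qq = q @ q.T / tau            # s(q_i, q_j)
    s_pp = p @ p.T / tau            # s(p_i, p_j)

    off_diag = ~torch.eye(n, dtype=torch.bool, device=q.device)
    numerator = torch.diag(s_qp)    # s(q_i, p_i^+)

    # Z_i: query-passage terms in both directions plus query-query and passage-passage terms (Eq. 2).
    Z = (
        torch.exp(s_qp).sum(dim=1)                     # sum_j exp(s(q_i, p_j))
        + torch.exp(s_qp.T).sum(dim=1)                 # sum_j exp(s(q_j, p_i))
        + (torch.exp(s_qq) * off_diag).sum(dim=1)      # sum_{j!=i} exp(s(q_i, q_j))
        + (torch.exp(s_pp) * off_diag).sum(dim=1)      # sum_{j!=i} exp(s(p_i, p_j))
    )
    return -(numerator - torch.log(Z)).mean()          # Eq. 1
```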
Model Specifications  We created our sentence embedding model by fine-tuning INDUSBASE. Hereafter, we refer to the resulting retriever model as INDUS-RETRIEVERBASE. We followed a bi-encoder framework (Reimers and Gurevych, 2019) and experimented with multiple pooling strategies, finding that mean pooling of the contextualized transformer representations performed the best.

Training Details  We trained each stage on 2 A100 GPUs with an effective batch size of 1,024. We first trained with unsupervised data for 300K steps, followed by an additional 100K steps with the supervised data. We used a learning rate of 2e-5 during both these steps.

4.2 Knowledge Distillation for Embedding Models

To optimize the latency for retrieval applications, we also created a small retriever model, with the aim of transferring the capability of the large teacher model (INDUS-RETRIEVERBASE) to a smaller student model (INDUSSMALL) by distilling the teacher's distribution of similarity scores. Furthermore, we find that it is necessary to modify the training strategy for distillation, as described below.

Distillation Loss  We used knowledge distillation techniques introduced in Xu et al. (2023). Specifically, for a sentence x_i and its corresponding in-batch element pairs \{x_i, x_j\}_{j=1, j \neq i}^{m}, we minimized the cross entropy between the teacher's distribution p_t of similarity scores between pairs and the student's distribution p_s. Following Hinton et al. (2014), we also scaled the output distribution of both teacher and student by a temperature \tau_{KD}:

\mathcal{L}_{KD} = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_t(x_i, x_j) \log p_s(x_i, x_j)    (4)

p_s(x_i, x_j) = \frac{e^{s_s(x_i, x_j)/\tau_{KD}}}{\sum_{k=1}^{m} e^{s_s(x_i, x_k)/\tau_{KD}}}    (5)

p_t(x_i, x_j) = \frac{e^{s_t(x_i, x_j)/\tau_{KD}}}{\sum_{k=1}^{m} e^{s_t(x_i, x_k)/\tau_{KD}}}    (6)

Here, s_s(x_i, x_j) and s_t(x_i, x_j) represent the similarity scores between two pairs {x_i, x_j}, defined as in Equation 3, for the student and teacher respectively.

Training Data  We first conducted an embedding-oriented pretraining step, as presented in RetroMAE (Xiao et al., 2022), on English Wikipedia, BooksCorpus, and StackExchange data, totalling approximately 56M sentences. This masked auto-encoder model consists of a full encoder along with a shallow decoder, and is trained to recover the original sentence from the decoder's masked input and the sentence embedding generated from the encoder's masked input, via masked language modelling. There is no distillation loss contributing to this step, which can be viewed as an extended pretraining mechanism. We find that the RetroMAE pretraining does not give good gains for the larger model but improves the performance of the smaller model.

For distilling the sentence embedding model, we found that a stage-wise training approach does not benefit performance as much as in the non-distillation case (ablation presented in Appendix B). We thus distilled in a single step with all the data described in §4.1 and Appendix A, and added labelled pairs from FEVER (Thorne et al., 2018) and HOTPOTQA (Yang et al., 2018).

Model Specifications  We built the sentence embedding model by distilling into INDUSSMALL. This is a 4-layer model with an embedding dimension of 576. We refer to the resulting retriever model as INDUS-RETRIEVERSMALL. It follows a bi-encoder framework, and here we find that using the vector representation of the first token as the embedding (CLS pooling) gives better performance than mean pooling.

Training Details  For the RetroMAE-style pretraining (Xiao et al., 2022), we trained on 8 A100 GPUs with an effective batch size of 128 for 2 epochs with a learning rate of 2e-5. For the distillation, we trained on 2 A100 GPUs for 300K steps with an effective batch size of 2,048 and a learning rate of 7e-4. Through experimentation, we found that \tau_{KD} = 4 performed the best.
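A minimal PyTorch sketch of the similarity-distribution distillation in Equations 4-6 follows. It assumes precomputed teacher and student embeddings for a batch and is an illustration rather than the exact training code.

```python
# Minimal sketch of distilling the teacher's similarity distribution (Eqs. 4-6).
import torch
import torch.nn.functional as F

def similarity_matrix(emb, tau=0.05):
    # Temperature-scaled cosine similarities between all in-batch pairs (Eq. 3).
    emb = F.normalize(emb, dim=-1)
    return emb @ emb.T / tau

def distillation_loss(student_emb, teacher_emb, tau_kd=4.0):
    """Cross entropy between the teacher's and student's distributions over
    in-batch similarity scores, both softened by the temperature tau_kd."""
    s_s = similarity_matrix(student_emb)
    s_t = similarity_matrix(teacher_emb)

    # Exclude the trivial self-similarity of each sentence with itself.
    n = s_s.size(0)
    diag = torch.eye(n, dtype=torch.bool, device=s_s.device)
    s_s = s_s.masked_fill(diag, -1e9)
    s_t = s_t.masked_fill(diag, -1e9)

    log_p_s = F.log_softmax(s_s / tau_kd, dim=-1)   # log p_s(x_i, x_j), Eq. 5
    p_t = F.softmax(s_t / tau_kd, dim=-1)           # p_t(x_i, x_j), Eq. 6
    return -(p_t * log_p_s).sum(dim=-1).sum()       # Eq. 4

# Usage (illustrative): embeddings come from the frozen teacher and the trainable student.
# loss = distillation_loss(student(batch), teacher(batch).detach())
```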
5 Creating Benchmarks

Benchmark datasets play a crucial role in assessing the language understanding capabilities of models. However, to the best of our knowledge, there is a noticeable absence of datasets tailored for the diverse and multidisciplinary field under study. Thus, to effectively benchmark the proposed NLP models and further accelerate research in this multidisciplinary domain, we introduced three new datasets, an NER task, a QA task, and an IR task, described below.

 | Train | Validation | Test
Num. Abstracts | 382 | 77 | 75
Num. Tokens | 32,031 | 6,443 | 5,850
Entity Labels: climate-nature, climate-greenhouse-gases, climate-assets, climate-problem-origins, climate-mitigations, climate-properties, climate-impacts, climate-datasets, climate-organizations, climate-observations, climate-models, climate-hazards, climate-organisms

Table 4: CLIMATE-CHANGE NER statistics and entity labels.

5.1 CLIMATE-CHANGE NER

While traditional search engines and databases offer some assistance in exploring data related to climate change, the complexity of climate-related queries often requires more sophisticated natural language processing techniques. This necessity is underscored by the extensive array of climate models, datasets, and organizations involved, which demand meticulous curation and continuous updates. While databases like those maintained by NASA or the UN provide valuable observational data, comprehensive overviews of climate models and impact assessments are scarce and not easily accessible.

In order to bridge this gap, we introduced a comprehensive dataset for developing and evaluating NLP models tailored towards understanding and addressing climate-related topics across various domains. Specifically, we created a new manually annotated dataset, CLIMATE-CHANGE NER (https://huggingface.co/datasets/ibm/Climate-Change-NER), in which the named entities of interest originate from complex taxonomies used in climate-related literature. The dataset comprises 534 abstracts sourced from the Semantic Scholar Academic Graph (Kinney et al., 2023), collected using a seed set of climate-related keywords such as wildfire or floods. The abstracts were annotated using the IOB (inside, outside, beginning) tagging scheme and encompass a diverse array of entity types, as shown in Table 4.

5.2 NASA-QA

We present NASA-QA (https://huggingface.co/datasets/nasa-impact/nasa-smd-qa-benchmark), an extractive question answering task focused on the Earth science domain. First, 39 paragraphs from Earth science papers which appeared in AGU and AMS journals (§2) were sourced. Subject matter experts from NASA formulated questions and marked the corresponding spans of the paragraphs which answer the questions. We used 29 paragraphs (with 145 QA pairs in total) as the training set and the remaining 10 paragraphs (with 50 questions in total) as the evaluation set. The training set was further augmented with paragraphs and QA pairs related to Earth science from the SQuAD dataset (Rajpurkar et al., 2018). Specifically, those related to oxygen, the Amazon rain forest and geology were used. This resulted in a pruned SQuAD set comprising 686 paragraphs with 5,081 questions (2,817 answerable and 2,264 unanswerable). We evaluated the performance of the models by adding these SQuAD pairs to the training data sourced from Earth science papers, while keeping the evaluation set intact.

5.3 NASA-IR

We introduced a domain-specific information retrieval benchmark, NASA-IR (https://huggingface.co/datasets/nasa-impact/nasa-smd-IR-benchmark), spanning almost 500 question-answer pairs related to the Earth science, planetary science, heliophysics, astrophysics and biological physical sciences domains. Specifically, we sampled a set of 166 paragraphs from AGU, AMS, ADS, PMC and PubMed (§2) and manually annotated each with 3 questions answerable from that paragraph, resulting in 498 questions. We used 398 of these questions as the training set and the remaining 100 as the validation set. To comprehensively evaluate information retrieval systems and mimic real-world data, we combined 26,839 random ADS abstracts with these annotated paragraphs. On average, each query is 12 words long and each paragraph is 120 words long. We used Recall@10 as the evaluation metric since each question has only one relevant document.

6 Experimental Results

Baselines  We compared INDUS models against open-source models of similar sizes:
• INDUSBASE was compared to RoBERTaBASE (https://huggingface.co/FacebookAI/roberta-base) and SCIBERT (https://huggingface.co/allenai/scibert_scivocab_uncased).
• INDUSSMALL was compared to MINILM (6-layer, https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) and TINYBERT (4-layer, https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D).
• INDUS-RETRIEVERBASE was compared to BGEBASE (https://huggingface.co/BAAI/bge-base-en-v1.5) and a RoBERTaBASE model finetuned with the same method presented in §4.1.
• INDUS-RETRIEVERSMALL was compared to MINILM-V2 (sentence-transformers/all-MiniLM-L6-v2) and BGESMALL (https://huggingface.co/BAAI/bge-small-en-v1.5).
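Each encoder above is evaluated by attaching a task head and fine-tuning. As an illustration, here is a minimal token-classification fine-tuning sketch for an IOB-tagged NER dataset such as CLIMATE-CHANGE NER, assuming the Hugging Face transformers and datasets APIs; the dataset column names ("tokens", "ner_tags") and the hyperparameters are assumptions, not the paper's exact setup.

```python
# Minimal sketch: fine-tune an encoder for IOB token classification (e.g., NER benchmarks).
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    DataCollatorForTokenClassification, Trainer, TrainingArguments,
)
from datasets import load_dataset

dataset = load_dataset("ibm/Climate-Change-NER")          # dataset ID from §5.1; schema below is assumed
label_list = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(label_list)            # swap in any encoder checkpoint under comparison
)

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        prev, lab = None, []
        for wid in word_ids:
            # Ignore special tokens and subword continuations when computing the loss.
            lab.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        labels.append(lab)
    enc["labels"] = labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

args = TrainingArguments(output_dir="ner-finetune", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=10)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        eval_dataset=tokenized.get("validation"),
        data_collator=DataCollatorForTokenClassification(tokenizer)).train()
```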
Task | Metric | Dataset | RoBERTa | SciBERT | INDUSBASE | TinyBERT | MiniLM | INDUSSMALL
NER | Entity F1 | BC5-chem | 90.3 (0.2) | 91.4 (0.2) | 93.3 (0.2) | 84.6 (0.2) | 86.1 (0.3) | 90.7 (0.1)
NER | Entity F1 | BC5-disease | 81.5 (0.3) | 83.7 (0.3) | 85.2 (0.3) | 74.0 (0.4) | 77.4 (0.3) | 81.3 (0.3)
NER | Entity F1 | NCBI-disease | 87.6 (0.6) | 87.6 (0.4) | 88.3 (0.4) | 81.2 (0.4) | 83.1 (0.5) | 85.6 (0.6)
NER | Entity F1 | BC2GM | 82.1 (0.3) | 82.3 (0.2) | 84.0 (0.3) | 74.7 (0.4) | 77.1 (0.2) | 79.7 (0.3)
NER | Entity F1 | JNLPBA | 79.1 (0.2) | 78.2 (0.2) | 80.3 (0.2) | 70.3 (0.2) | 73.4 (0.3) | 75.7 (0.2)
PICO | Macro F1 | EBM PICO | 72.3 (0.3) | 72.4 (0.3) | 73.1 (0.2) | 67.4 (0.2) | 70.3 (0.1) | 73.1 (0.2)
Relation Extraction | Micro F1 | ChemProt | 50.4 (28.2) | 73.9 (0.7) | 76.9 (0.5) | 56.2 (3.2) | 55.9 (2.1) | 71.7 (0.9)
Relation Extraction | Micro F1 | DDI | 78.6 (1.5) | 80.1 (1.0) | 81.7 (0.5) | 39.3 (5.3) | 51.5 (2.9) | 69.0 (1.2)
Relation Extraction | Micro F1 | GAD | 80.0 (1.1) | 81.6 (1.2) | 79.4 (5.6) | 76.4 (1.3) | 77.3 (1.0) | 81.3 (0.7)
Document Classification | Micro F1 | HoC | 82.2 (0.7) | 83.1 (0.6) | 83.7 (0.5) | 41.6 (6.8) | 62.8 (4.7) | 80.2 (0.6)
Question Answering | Accuracy | PubMedQA | 53.1 (3.3) | 54.3 (3.8) | 58.2 (6.7) | 50.3 (1.4) | 51.6 (1.7) | 56.1 (1.4)
Question Answering | Accuracy | BioASQ | 69.1 (4.8) | 74.6 (4.5) | 69.6 (5.8) | 74.3 (3.6) | 66.7 (2.3) | 75.4 (3.3)
Sentence Similarity | Pearson | BIOSSES | 79.8 (6.3) | 86.3 (3.5) | 72.2 (9.5) | 88.2 (1.1) | 26.6 (8.7) | 70.4 (3.3)
Micro Average | - | - | 75.9 (3.7) | 79.2 (1.3) | 78.9 (2.4) | 67.6 (1.9) | 66.1 (1.9) | 76.2 (1.0)
Macro Average | - | - | 74.9 (3.7) | 78.2 (1.6) | 76.4 (3.2) | 65.6 (2.4) | 60.6 (3.0) | 74.3 (1.3)

(RoBERTa, SciBERT and INDUSBASE are base models with 125M parameters; TinyBERT, MiniLM and INDUSSMALL are small models with ~30M parameters.)

Table 5: Evaluation results on BLURB. Results reported are averaged on 10 random seeds with standard deviation in parenthesis. Micro average is reported across datasets while macro average is computed by first averaging scores on each task (the task average), followed by averaging the task averages across tasks. Results in bold indicate highest performance while underlined results indicate significant difference from second highest result by more than two standard deviations in each model size.
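As a small illustration of the micro versus macro averaging described in the Table 5 caption, the sketch below computes both from per-dataset scores grouped by task, using the INDUSBASE scores from Table 5 for three of the task groups:

```python
# Illustrative micro vs. macro averaging over BLURB-style per-dataset scores.
scores_by_task = {
    "NER": [93.3, 85.2, 88.3, 84.0, 80.3],
    "PICO": [73.1],
    "Relation Extraction": [76.9, 81.7, 79.4],
}

all_scores = [s for scores in scores_by_task.values() for s in scores]
micro_avg = sum(all_scores) / len(all_scores)               # average across datasets

task_avgs = [sum(s) / len(s) for s in scores_by_task.values()]
macro_avg = sum(task_avgs) / len(task_avgs)                 # average of per-task averages

print(f"micro: {micro_avg:.1f}, macro: {macro_avg:.1f}")
```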

6.1 Natural Language Understanding Benchmarks

We evaluated our models on BLURB (Gu et al., 2021), a benchmark suite for natural language understanding and reasoning tasks in the biomedical domain. We followed the original work to compute the overall score (i.e., macro average).

Table 5 shows the evaluation results. Among base models, INDUSBASE significantly outperforms the general-purpose RoBERTa model on the micro/macro average while achieving competitive performance to the bio-domain-specific counterpart, SCIBERT.

As for smaller models, we noticed INDUSSMALL outperformed the baselines, TINYBERT and MINILM, by a large margin in most cases, showing a significant difference from the second-best models in the NER, PICO, relation extraction, and document classification tasks. This demonstrates the effectiveness of knowledge distillation from our domain-specific teacher model, INDUSBASE.

We also noticed SCIBERT tends to perform better than our model on paired input-text tasks, such as QA and semantic similarity tasks, although the results have relatively large standard deviations. We hypothesized that the additional next sentence prediction objective used during training of BERT-style models (such as SCIBERT), in contrast to RoBERTa-style models (such as RoBERTaBASE and INDUS), may be beneficial for paired input-text tasks. This trend is consistent with the observations of Tinn et al. (2023).

Model | F1 (SD)
RoBERTa | 60.8 (0.8)
SCIBERT | 61.8 (0.7)
INDUSBASE | 64.0 (1.0)
TINYBERT | 34.3 (1.6)
MINILM | 44.7 (1.3)
INDUSSMALL | 54.8 (0.8)

Table 6: CLIMATE-CHANGE NER benchmark results. Standard deviation over 10 random seeds shown in parenthesis. Results in bold and underline indicate highest performance and significant difference from second highest result by more than two standard deviations in each model size, respectively.

6.2 CLIMATE-CHANGE NER

As shown in Table 6, our models clearly outperformed the corresponding baseline models on the CLIMATE-CHANGE NER task, suggesting the effectiveness of training on large domain-specific data.
6.3 NASA-QA

As mentioned in §5, we augmented the training set with relevant SQuAD pairs for fine-tuning. All models were fine-tuned for 15 epochs, and the results are shown in Table 7. We observed that INDUSBASE outperformed all models of similar size, while INDUSSMALL had relatively strong performance compared to its counterparts.

Model | F1 (SD)
RoBERTa | 66.8 (3.1)
SCIBERT | 63.5 (1.9)
INDUSBASE | 68.2 (2.9)
TINYBERT | 43.2 (2.3)
MINILM | 59.2 (3.9)
INDUSSMALL | 47.4 (1.8)

Table 7: NASA-QA benchmark results. Standard deviation over 3 random seeds shown in parenthesis. Results in bold and underline indicate highest performance and significant difference from second highest result by more than two standard deviations in each model size, respectively.

6.4 Information Retrieval Benchmarks

We evaluated our models on the NASA-IR dataset as well as the BEIR benchmark (Thakur et al., 2021), which consists of 12 retrieval tasks spanning a variety of domains. The BEIR benchmark uses Normalized Discounted Cumulative Gain (nDCG@10) (Wang et al., 2013) as its main metric. Table 8 shows the performance of our domain-specific sentence embedding models, along with our baselines. As shown, both of our sentence embedding models significantly outperformed the baselines on the NASA-IR task while still maintaining good performance on several of the BEIR tasks. (We present results for each BEIR task in Appendix C.)

We also measured the average time per query for retrieval on the 4,202 test queries of the Natural Questions set of BEIR, on a single A100 GPU. This time includes the time to encode the query and the corpus, and the time to retrieve relevant documents. Notably, INDUS-RETRIEVERSMALL outperformed INDUS-RETRIEVERBASE on both NASA-IR and BEIR, while being about 4.6x faster.

Model | NASA-IR ↑ | BEIR Avg. ↑ | Retrieval Time ↓
RoBERTaBASE | 0.66 | 0.37 | 1.20
BGEBASE | 0.67 | 0.52 | 1.18
INDUS-RETRIEVERBASE | 0.71 | 0.41 | 1.19
MINILM-V2 | 0.62 | 0.39 | 0.24
BGESMALL | 0.66 | 0.51 | 0.42
INDUS-RETRIEVERSMALL | 0.73 | 0.42 | 0.26

Table 8: Evaluation results on NASA-IR and BEIR. NASA-IR shows Recall@10 while BEIR reports the average nDCG@10 across all tasks. Retrieval time is per query on the NQ task from BEIR, reported in seconds.
7 Conclusions

In this research, we presented INDUS, a constellation of models for use in the science domain. We demonstrated the effectiveness of a custom tokenizer and in-domain data for training high-quality encoder models and sentence embedding models. Further, we created smaller versions of the proposed models, suitable for applications with latency or resource constraints, through state-of-the-art knowledge distillation techniques. For the benefit of the scientific community, we will release the developed models and benchmark datasets on Hugging Face.

References

Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Colin B. Clement, Matthew Bierbaum, Kevin P. O'Keeffe, and Alexander A. Alemi. 2019. On the use of arXiv as a dataset.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 1156–1165, New York, NY, USA. Association for Computing Machinery.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthcare, 3(1).

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2014. Distilling the Knowledge in a Neural Network. In NeurIPS Deep Learning Workshop.

Zhi Hong, Aswathy Ajith, Gregory Pauloski, Eamon Duede, Kyle Chard, and Ian Foster. 2023. The diminishing returns of masked language models to science.

Shu Huang and Jacqueline M Cole. 2022. BatteryBERT: A pretrained language model for battery database enhancement. J. Chem. Inf. Model., DOI: 10.1021/acs.jcim.2c00035.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. In Advances in Neural Information Processing Systems, volume 33, pages 18661–18673. Curran Associates, Inc.

Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey MacMillan, Tyler Murray, Chris Newell, Smita Rao, Shaurya Rohatgi, Paul Sayre, Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar Subramanian, Amber Tanaka, Alex D. Wade, Linda Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen, and Daniel S. Weld. 2023. The semantic scholar open data platform.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the ACL.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021. PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them. Transactions of the Association for Computational Linguistics, 9:1098–1115.

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1).

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Marc Suárez-Calvet, Thomas K Karikari, Nicholas J Ashton, Juan Lantero Rodríguez, Marta Milà-Alomà, Juan Domingo Gispert, Gemma Salvadó, Carolina Minguillon, Karine Fauria, Mahnaz Shekari, Oriol Grau-Rivera, Eider M Arenaza-Urquijo, Aleix Sala-Vila, Gonzalo Sánchez-Benavides, José Maria González-de-Echávarri, Gwendlyn Kollmorgen, Erik Stoops, Eugeen Vanmechelen, Henrik Zetterberg, Kaj Blennow, José Luis Molinuevo, Annabella Beteta, Raffaele Cacciaglia, Alba Cañas, Carme Deulofeu, Irene Cumplido, Ruth Dominguez, Maria Emilio, Carles Falcon, Sherezade Fuentes, Laura Hernandez, Gema Huesa, Jordi Huguet, Paula Marne, Tania Menchón, Grégory Operto, Albina Polo, Sandra Pradas, Anna Soteras, Marc Vilanova, and Natalia Vilor-Tejedor. 2020. Novel tau biomarkers phosphorylated at t181, t217 or t231 rise in the initial stages of the preclinical alzheimer's continuum when only subtle changes in aβ pathology are detected. EMBO Molecular Medicine, 12(12):e12921.

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.

Robert Tinn, Hao Cheng, Yu Gu, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2023. Fine-tuning large neural language models for biomedical natural language processing. Patterns, 4(4).

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.

Aashka Trivedi, Takuma Udagawa, Michele Merler, Rameswar Panda, Yousef El-Kurdi, and Bishwaranjan Bhattacharjee. 2023. Neural architecture search for effective teacher-student knowledge transfer in language models. arXiv preprint arXiv:2303.09639.

Takuma Udagawa, Aashka Trivedi, Michele Merler, and Bishwaranjan Bhattacharjee. 2023. A comparative analysis of task-agnostic distillation methods for compressing transformer language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 20–31, Singapore. Association for Computational Linguistics.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2019. Representation learning with contrastive predictive coding.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Nicholas Walker, Amalie Trewartha, Haoyan Huo, Sanghoon Lee, Kevin Cruse, John Dagdelen, Alexander Dunn, Kristin Persson, Gerbrand Ceder, and Anubhav Jain. 2021. The impact of domain-specific pre-training on named entity recognition tasks in materials science. Available at SSRN 3950755.

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training.

Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140–2151, Online. Association for Computational Linguistics.
Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG type ranking measures. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 25–54, Princeton, NJ, USA. PMLR.

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. BloombergGPT: A large language model for finance.

Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2022. RetroMAE: Pre-training retrieval-oriented language models via masked auto-encoder. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 538–548, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged resources to advance general Chinese embedding.

Jiahao Xu, Wei Shao, Lihui Chen, and Lemao Liu. 2023. DistillCSE: Distilled contrastive learning for sentence embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8153–8165, Singapore. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

A Sentence Embedding Training Data


Table 10 shows the various data sources used for
training embedding models. All data is presented
in the form of text-pairs, where each item in the
pair may be a sentence or a paragraph. In the table,
Data Format denotes s2p for sentence-to-paragraph
mappings, s2s for sentence-to-sentence mappings,
and p2p for paragraph-to-paragraph mappings. We
used about 360 million pairs for training, with in-batch negatives.

B Ablation Study: Stage-wise Distillation for Embedding Model

For the distilled embedding models, we find that stage-wise distillation does not benefit performance as much as a one-step process combining all the supervised and unsupervised data. As shown in Table 9, the stage-wise approach underperformed the one-stage approach by 1 percentage point on both NASA-IR and BEIR.

Model | Training | NASA-IR | BEIR Avg.
INDUS-RETRIEVERSMALL | One-Stage | 0.73 | 0.42
INDUS-RETRIEVERSMALL | Stagewise | 0.72 | 0.41

Table 9: Ablation study: evaluation results on NASA-IR and BEIR. NASA-IR shows Recall@10 while BEIR reports nDCG@10.
Dataset Num. Pairs Data Category Data Format
StackOverflow† 18562443 Title-Body s2p
StackExchange Math† 2201906 Title-Body s2p
S2ORC [title - abstract] (Lo et al., 2020) 41769185 Title-Body s2p
S2ORC Citation Pairs [Abstracts] (Lo et al., 2020) 52603982 Title-Body p2p
StackExchange [title - body]† 5415570 Title-Body s2p
Wikipedia (Fader et al., 2014) 6458670 Title-Body s2p
Arxiv (Clement et al., 2019) 2358545 Title-Body s2p
NASA ADS [title - abstract] (§2) 2633240 Title-Body s2p
PubMed [title - abstract] (§2) 24001387 Title-Body s2p
PMC [title - abstract] (§2) 2585537 Title-Body s2p
StackExchange Duplicate Questions [title-body - title-body]† 250460 Duplicate Questions p2p
StackExchange Duplicate Questions [body - body]† 250519 Duplicate Questions p2p
StackExchange Duplicate Questions [title - title]† 304525 Duplicate Questions s2s
WikiAnswer Pairs (Fader et al., 2014) 77427422 Duplicate Questions s2s
Specter Pairs (Cohan et al., 2020) 684100 Citation Pairs s2s
S2ORC Citation Pairs [Titles] (Lo et al., 2020) 52603982 Citation Pairs s2s
SQuAD (Rajpurkar et al., 2016) 87599 Question Answers s2p
NQ (Kwiatkowski et al., 2019) 100231 Question Answers s2p
SearchQA (Dunn et al., 2017) 582261 Question Answers s2p
StackExchange [title - answer]† 4067139 Question Answers s2p
StackExchange [title-body - answer]† 187195 Question Answers p2p
PAQ (Lewis et al., 2021) 64371441 Question Answers s2p
FEVER (Thorne et al., 2018)∗ 109810 Fact Verification s2p
HotpotQA (Yang et al., 2018)∗ 85000 Question Answering s2p

Table 10: Training Data for Embedding Models. The training data totals to around 360M pairs. Data Format denotes
s2p for sentence-to-paragraph mappings, s2s for sentence-to-sentence mappings, and p2p for paragraph-to-paragraph
mappings. † Downloaded from https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml.

∗ Only used for Distillation.

C Complete Results on BEIR Benchmark

Table 11 shows the per-dataset results on the BEIR tasks.

Model | TREC-Covid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SciDocs | FEVER | Climate-FEVER | SciFact | BEIR Avg.
RoBERTaBASE | 0.47 | 0.30 | 0.54 | 0.34 | 0.38 | 0.52 | 0.18 | 0.25 | 0.22 | 0.46 | 0.14 | 0.67 | 0.37
BGEBASE | 0.78 | 0.37 | 0.54 | 0.73 | 0.41 | 0.64 | 0.26 | 0.41 | 0.22 | 0.86 | 0.31 | 0.74 | 0.52
INDUS-RETRIEVERBASE | 0.56 | 0.32 | 0.54 | 0.49 | 0.36 | 0.54 | 0.17 | 0.31 | 0.21 | 0.56 | 0.14 | 0.74 | 0.41
MINILM-V2 | 0.47 | 0.32 | 0.44 | 0.47 | 0.35 | 0.50 | 0.17 | 0.32 | 0.22 | 0.52 | 0.25 | 0.65 | 0.39
BGESMALL | 0.76 | 0.34 | 0.50 | 0.70 | 0.40 | 0.60 | 0.26 | 0.40 | 0.21 | 0.87 | 0.32 | 0.71 | 0.51
INDUS-RETRIEVERSMALL | 0.55 | 0.31 | 0.53 | 0.48 | 0.29 | 0.50 | 0.21 | 0.33 | 0.23 | 0.61 | 0.23 | 0.71 | 0.42

Table 11: Evaluation results on BEIR.
