FAISS
Contents

Installing Faiss
Getting started
Lower memory footprint
Running on GPUs
Semantic search with FAISS
Installing Faiss
Standard installs
We support compiling Faiss with cmake from source and installing
via conda on a limited set of platforms: Linux (x86 and ARM), Mac
(only x86), Windows (only x86). For this, see INSTALL.md.
Why don't you support installing via XXX?
The reason why we don't support more platforms is because it is a
lot of work to make sure Faiss runs in the supported configurations:
building the conda packages for a new release of Faiss always
surfaces compatibility issues. Anaconda provides a sufficiently
controlled environment that we can be confident it will run on the
user's machines (this is not the case with pip). Besides, the platform
(hardware and OS) has to be supported by our CI tool (CircleCI).
So we are very careful before we add new officially supported
platforms (hardware and software). We are very interested in
success (or failure!) stories about porting to other platforms, and
related PRs.
Special configurations
Compiling the python interface within an Anaconda install
The idea is to install everything via Anaconda and link Faiss against
that. This is useful to make sure the MKL implementation is as fast as
possible.
source ~/anaconda3/etc/profile.d/conda.sh
conda activate host_env_for_faiss   # an environment that contains python and numpy
git clone https://github.com/facebookresearch/faiss.git faiss_xx
cd faiss_xx

LD_LIBRARY_PATH= MKLROOT=/private/home/matthijs/anaconda3/envs/host_env_for_faiss/lib CXX=$(which g++) \
  $cmake -B build -DBUILD_TESTING=ON -DFAISS_ENABLE_GPU=OFF \
    -DFAISS_OPT_LEVEL=avx2 \
    -DFAISS_ENABLE_C_API=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DBLA_VENDOR=Intel10_64_dyn .

make -C build -j10 swigfaiss && (cd build/faiss/python ; python3 setup.py build)

(cd tests ; PYTHONPATH=../build/faiss/python/build/lib/ OMP_NUM_THREADS=1 python -m unittest -v discover)
Compiling Faiss on ARM
Commands for an Ubuntu 18 image on an Amazon c6g.8xlarge machine:
set -e

sudo apt-get install libatlas-base-dev libatlas3-base
sudo apt-get install clang-8
sudo apt-get install swig

# cmake provided with ubuntu is too old
wget https://github.com/Kitware/CMake/releases/download/v3.19.3/cmake-3.19.3.tar.gz
tar xvzf cmake-3.19.3.tar.gz
cd cmake-3.19.3/
./configure --prefix=/home/matthijs/cmake && make -j
cd $HOME
alias cmake=$HOME/cmake/bin/cmake

# clone Faiss
git clone https://github.com/facebookresearch/faiss.git
cd faiss
cmake -B build -DCMAKE_CXX_COMPILER=clang++-8 -DFAISS_ENABLE_GPU=OFF \
  -DPython_EXECUTABLE=$(which python3) -DFAISS_OPT_LEVEL=generic \
  -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON
make -C build -j swigfaiss   # build the swig wrapper (same step as in the x86 instructions above)
(cd build/faiss/python/ ; python3 setup.py build)

# run tests
export PYTHONPATH=$PWD/build/faiss/python/build/lib/
python3 -m unittest discover
Getting started
For the following, we assume Faiss is installed. We provide code
examples in C++ and Python. The code can be run by copy/pasting
it or running it from the tutorial/ subdirectory of the Faiss
distribution.
Getting some data
Faiss handles collections of vectors of a fixed dimensionality d,
typically a few 10s to 100s. These collections can be stored in
matrices. We assume row-major storage, i.e. the j'th component of
vector number i is stored in row i, column j of the matrix. Faiss uses
only 32-bit floating point matrices.

We need two matrices:

- xb for the database, that contains all the vectors that must be
  indexed, and that we are going to search in. Its size is nb-by-d.
- xq for the query vectors, for which we need to find the nearest
  neighbors. Its size is nq-by-d. If we have a single query vector,
  nq=1.

In the following examples we are going to work with vectors that are
drawn from a uniform distribution in d=64 dimensions. Just for fun,
we add a small translation along the first dimension that depends on
the vector index.
In Python
import numpy as np
d = 64 # dimension
nb = 100000 # database size
nq = 10000 # nb of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
In Python, the matrices are always represented as numpy arrays.
The data type dtype must be float32.
In C++
int d = 64;      // dimension
int nb = 100000; // database size
int nq = 10000;  // nb of queries
float *xb = new float[d * nb];
float *xq = new float[d * nq];
for(int i = 0; i < nb; i++) {
    for(int j = 0; j < d; j++) xb[d * i + j] = drand48();
    xb[d * i] += i / 1000.;
}
for(int i = 0; i < nq; i++) {
    for(int j = 0; j < d; j++) xq[d * i + j] = drand48();
    xq[d * i] += i / 1000.;
}
This example uses plain arrays, because this is the lowest common
denominator all C++ matrix libraries support. Faiss can
accommodate any matrix library, as long as it exposes a pointer to
the underlying data. For example, std::vector<float>'s internal
pointer is given by the data() method.
Building an index and adding the vectors to it
Faiss is built around the Index object. It encapsulates the set of
database vectors, and optionally preprocesses them to make
searching efficient. There are many types of indexes, we are going to
use the simplest version that just performs brute-force L2 distance
search on them: IndexFlatL2.
All indexes need to know, when they are built, the dimensionality of
the vectors they operate on (d in our case). Then, most of the
indexes also require a training phase to analyze the distribution of
the vectors. For IndexFlatL2, we can skip this operation.
When the index is built and trained, two operations can be
performed on the index: add and search.
To add elements to the index, we call add on xb. We can also display
the two state variables of the index: is_trained, a boolean that
indicates whether training is required and ntotal, the number of
indexed vectors.
Some indexes can also store integer IDs corresponding to each of
the vectors (but not IndexFlatL2). If no IDs are provided, add just
uses the vector ordinal as the ID, i.e. the first vector gets 0, the
second 1, and so on.
In Python
import faiss # make faiss available
index = faiss.IndexFlatL2(d) # build the index
print(index.is_trained)
index.add(xb) # add vectors to the index
print(index.ntotal)
In C++
faiss::IndexFlatL2 index(d); // call constructor
printf("is_trained = %s\n", index.is_trained ? "true" : "false");
index.add(nb, xb); // add vectors to the index
printf("ntotal = %ld\n", index.ntotal);
Results
This should just display true (the index is trained) and 100000
(vectors are stored in the index).
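If you do need custom IDs together with a flat index, a common pattern is to wrap it in an IndexIDMap. The following is a minimal sketch, not part of the tutorial code: the ID values are arbitrary, and d, nb and xb are the variables defined above.

ids = np.arange(100000, 100000 + nb)              # arbitrary 64-bit IDs, one per database vector
index_ids = faiss.IndexIDMap(faiss.IndexFlatL2(d))
index_ids.add_with_ids(xb, ids)                   # searches now return these IDs instead of 0, 1, 2, ...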
Searching
The basic search operation that can be performed on an index is
the k-nearest-neighbor search, i.e. for each query vector, find
its k nearest neighbors in the database.
The result of this operation can be conveniently stored in an integer
matrix of size nq-by-k, where row i contains the IDs of the neighbors
of query vector i, sorted by increasing distance. In addition to this
matrix, the search operation returns a nq-by-k floating-point matrix
with the corresponding squared distances.
As a sanity check, we can first search a few database vectors, to
make sure the nearest neighbor is indeed the vector itself.
In Python
k=4 # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
D, I = index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
In C++
int k = 4;
{ // sanity check: search 5 first vectors of xb
    idx_t *I = new idx_t[k * 5];
    float *D = new float[k * 5];
    index.search(5, xb, k, D, I);
    printf("I=\n");
    for(int i = 0; i < 5; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }
    ...
    delete [] I;
    delete [] D;
}
{ // search xq
    idx_t *I = new idx_t[k * nq];
    float *D = new float[k * nq];
    index.search(nq, xq, k, D, I);
    ...
The extract is edited because otherwise the C++ version becomes
very verbose; see the full code in the tutorial/cpp subdirectory of
Faiss.
Results
The output of the sanity check should look like
[[ 0 393 363 78]
[ 1 555 277 364]
[ 2 304 101 13]
[ 3 173 18 182]
[ 4 288 370 531]]
[[ 0. 7.17517328 7.2076292 7.25116253]
[ 0. 6.32356453 6.6845808 6.79994535]
[ 0. 5.79640865 6.39173603 7.28151226]
[ 0. 7.27790546 7.52798653 7.66284657]
[ 0. 6.76380348 7.29512024 7.36881447]]
i.e. the nearest neighbor of each query is indeed the vector itself
(the returned ID equals the query's own index), and the corresponding
distance is 0. And within a row, distances are increasing.
The output of the actual search is similar to
[[ 381 207 210 477]
[ 526 911 142 72]
[ 838 527 1290 425]
[ 196 184 164 359]
[ 526 377 120 425]]
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
Because of the value added to the first component of the vectors,
the dataset is smeared along the first axis in d-dim space. So the
neighbors of the first few vectors are around the beginning of the
dataset, and the ones of the vectors around ~10000 are also around
index 10000 in the dataset.
Executing the search above takes about 3.3s on a 2016 machine.
Faster search
This is too slow, how can I make it faster?
To speed up the search, it is possible to segment the dataset into
pieces. We define Voronoi cells in the d-dimensional space, and each
database vector falls in one of the cells. At search time, only the
database vectors y contained in the cell the query x falls in and a
few neighboring ones are compared against the query vector.
This is done via the IndexIVFFlat index. This type of index requires a
training stage, that can be performed on any collection of vectors
that has the same distribution as the database vectors. In this case
we just use the database vectors themselves.
The IndexIVFFlat also requires another index, the quantizer, that
assigns vectors to Voronoi cells. Each cell is defined by a centroid,
and finding the Voronoi cell a vector falls in consists in finding the
nearest neighbor of the vector in the set of centroids. This is the task
of the other index, which is typically an IndexFlatL2.
There are two parameters to the search method: nlist, the number of
cells, and nprobe, the number of cells (out of nlist) that are visited to
perform a search. The search time roughly increases linearly with
the number of probes plus some constant due to the quantization.
In Python
nlist = 100
k=4
quantizer = faiss.IndexFlatL2(d) # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist)
assert not index.is_trained
index.train(xb)
assert index.is_trained
index.add(xb) # add may be a bit slower as well
D, I = index.search(xq, k) # actual search
print(I[-5:]) # neighbors of the 5 last queries
index.nprobe = 10 # default nprobe is 1, try a few more
D, I = index.search(xq, k)
print(I[-5:]) # neighbors of the 5 last queries
In C++
int nlist = 100;
int k = 4;
faiss::IndexFlatL2 quantizer(d); // the other index
faiss::IndexIVFFlat index(&quantizer, d, nlist);
assert(!index.is_trained);
index.train(nb, xb);
assert(index.is_trained);
index.add(nb, xb);
{ // search xq
    idx_t *I = new idx_t[k * nq];
    float *D = new float[k * nq];
    index.search(nq, xq, k, D, I);
    printf("I=\n"); // print neighbors of 5 last queries
    ...
    index.nprobe = 10; // default nprobe is 1, try a few more
    index.search(nq, xq, k, D, I);
    printf("I=\n");
    ...
Results
For nprobe=1, the result looks like
[[ 9900 10500 9831 10808]
[11055 10812 11321 10260]
[11353 10164 10719 11013]
[10571 10203 10793 10952]
[ 9582 10304 9622 9229]]
The values are similar, but not exactly the same as for the brute-
force search (see above). This is because some of the results were
not in the exact same Voronoi cell. Therefore, visiting a few more
cells may prove useful.
Increasing nprobe to 10 does exactly this:
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
which is the correct result. Note that getting a perfect result in this
case is merely an artifact of the data distribution, as it has a
strong component on the x-axis which makes it easier to handle.
The nprobe parameter is always a way of adjusting the tradeoff
between speed and accuracy of the result. Setting nprobe = nlist
gives the same result as the brute-force search (but slower).
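To see this tradeoff concretely, you can sweep nprobe and measure both speed and accuracy against the brute-force result. The sketch below is illustrative only: I_flat is a hypothetical variable assumed to hold the exact neighbor IDs from the IndexFlatL2 search earlier, and index, xq, k and nq are the variables from the example above.

import time

gt = I_flat[:, :1]                        # exact nearest neighbor of each query (assumed kept from the flat search)
for nprobe in (1, 5, 10, 20, 50):
    index.nprobe = nprobe
    t0 = time.time()
    D, I = index.search(xq, k)
    t1 = time.time()
    recall = (I[:, :1] == gt).sum() / nq  # fraction of queries whose true nearest neighbor was found
    print("nprobe=%3d  time=%.3fs  1-recall@1=%.3f" % (nprobe, t1 - t0, recall))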
Lower memory footprint
This uses too much memory, how can I shrink the storage?
The indexes we have seen, IndexFlatL2 and IndexIVFFlat both store
the full vectors. To scale up to very large datasets, Faiss offers
variants that compress the stored vectors with a lossy compression
based on product quantizers.
The vectors are still stored in Voronoi cells, but their size is reduced
to a configurable number of bytes m (d must be a multiple of m).
The compression is based on a Product Quantizer, that can be seen
as an additional level of quantization, that is applied on sub-vectors
of the vectors to encode.
In this case, since the vectors are not stored exactly, the distances
that are returned by the search method are also approximations.
In Python
nlist = 100
m = 8                             # number of subquantizers
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                  # 8 specifies that each sub-vector is encoded as 8 bits
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k)    # sanity check
print(I)
print(D)
index.nprobe = 10                 # make comparable with experiment above
D, I = index.search(xq, k)        # search
print(I[-5:])
In C++
int nlist = 100;
int k = 4;
int m = 8; // number of subquantizers
faiss::IndexFlatL2 quantizer(d); // the other index
faiss::IndexIVFPQ index(&quantizer, d, nlist, m, 8);
index.train(nb, xb);
index.add(nb, xb);
{ // sanity check
    ...
    index.search(5, xb, k, D, I);
    printf("I=\n");
    ...
    printf("D=\n");
    ...
}
{ // search xq
    ...
    index.nprobe = 10;
    index.search(nq, xq, k, D, I);
    printf("I=\n");
    ...
Results
The results look like:
[[ 0 608 220 228]
[ 1 1063 277 617]
[ 2 46 114 304]
[ 3 791 527 316]
[ 4 159 288 393]]
[[ 1.40704751 6.19361687 6.34912491 6.35771513]
[ 1.49901485 5.66632462 5.94188499 6.29570007]
[ 1.63260388 6.04126883 6.18447495 6.26815748]
[ 1.5356375 6.33165455 6.64519501 6.86594009]
[ 1.46203303 6.5022912 6.62621975 6.63154221]]
We can observe that the nearest neighbor is found correctly (it is the
vector ID itself), but the estimated distance of the vector to itself is
not 0, although it is significantly lower than the distance to the other
neighbors. This is due to the lossy compression.
Here we compress 64 32-bit floats to 8 bytes, so the compression
factor is 32.
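As a quick back-of-the-envelope check of that figure (using the values from the examples above):

d = 64           # vector dimensionality
m = 8            # bytes per PQ code
nb = 100000      # database size

flat_bytes = nb * d * 4   # IndexFlatL2 stores full 32-bit float vectors
pq_bytes = nb * m         # IndexIVFPQ stores one m-byte code per vector (ignoring small per-list overheads)
print(flat_bytes // pq_bytes)  # -> 32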
When searching on real queries, the results look like:
[[ 9432 9649 9900 10287]
[10229 10403 9829 9740]
[10847 10824 9787 10089]
[11268 10935 10260 10571]
[ 9582 10304 9616 9850]]
They can be compared with the IVFFlat results above. For this case,
most results are wrong, but they are in the correct area of the space,
as shown by the IDs around 10000. The situation is better for real
data because:
- uniform data is very difficult to index because there is no
  regularity that can be exploited to cluster or reduce
  dimensionality;
- for natural data, the semantic nearest neighbor is often
  significantly closer than irrelevant results.
Simplifying index construction
Since building indexes can become complicated, there is a factory
function that constructs them given a string. The indexes above can
be obtained with the following shorthand:
index = faiss.index_factory(d, "IVF100,PQ8")
faiss::Index *index = faiss::index_factory(d, "IVF100,PQ8");
Replace PQ8 with Flat to get an IndexFlat. The factory is particularly
useful when preprocessing (PCA) is applied to the input vectors. For
example, the factory string to reduce the vectors to 32 dimensions by
a PCA projection before indexing is "PCA32,IVF100,Flat".
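As an illustration, here is a minimal sketch of how such a factory-built index could be used. The calls follow the same train/add/search pattern as above; extract_index_ivf reaches the IVF layer that sits behind the PCA pre-transform (this sketch assumes the standard faiss Python API and the d, xb, xq, k variables from earlier).

index = faiss.index_factory(d, "PCA32,IVF100,Flat")
index.train(xb)                              # trains both the PCA transform and the IVF quantizer
index.add(xb)
faiss.extract_index_ivf(index).nprobe = 10   # the IVF index is wrapped in an IndexPreTransform
D, I = index.search(xq, k)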
Further reading
Explore the next sections to get more specific information about the
types of indexes, GPU faiss, coding structure, etc.
Running on GPUs
Faiss can leverage your NVIDIA GPUs almost seamlessly.
First, declare a GPU resource, which encapsulates a chunk of the
GPU memory:
In Python
res = faiss.StandardGpuResources() # use a single GPU
In C++
faiss::gpu::StandardGpuResources res; // use a single GPU
Then build a GPU index using the GPU resource:
In Python
# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
In C++
faiss::gpu::GpuIndexFlatL2 gpu_index_flat(&res, d);
Note: a single GPU resource object can be used by multiple indices,
as long as they are not issuing concurrent queries.
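For example, a single StandardGpuResources object can back several indexes on the same device. This is a small sketch only: the index names are made up, and res, d and xb are the variables from the examples above.

flat_gpu = faiss.index_cpu_to_gpu(res, 0, faiss.IndexFlatL2(d))
ivf_gpu = faiss.index_cpu_to_gpu(res, 0, faiss.index_factory(d, "IVF100,Flat"))
ivf_gpu.train(xb)     # the IVF index still needs training
flat_gpu.add(xb)
ivf_gpu.add(xb)       # fine, as long as the two indexes are not queried concurrently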
The obtained GPU index can be used the exact same way as a CPU
index:
In Python
gpu_index_flat.add(xb) # add vectors to the index
print(gpu_index_flat.ntotal)
k=4 # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
In C++
gpu_index_flat.add(nb, xb);  // add vectors to the index
printf("ntotal = %ld\n", gpu_index_flat.ntotal);

int k = 4;
{ // search xq
    idx_t *I = new idx_t[k * nq];
    float *D = new float[k * nq];

    gpu_index_flat.search(nq, xq, k, D, I);

    // print results
    printf("I (5 first results)=\n");
    for(int i = 0; i < 5; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }

    printf("I (5 last results)=\n");
    for(int i = nq - 5; i < nq; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }

    delete [] I;
    delete [] D;
}
Results
The results are the same as for the CPU version. Note also that the
performance increase will not be noticeable on a small dataset.
Using multiple GPUs
Making use of multiple GPUs is mainly a matter of declaring several
GPU resources. In python, this can be done implicitly using
the index_cpu_to_all_gpus helper.
Examples:
In Python
ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(  # build the index
    cpu_index
)

gpu_index.add(xb)                         # add vectors to the index
print(gpu_index.ntotal)
k=4 # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
In C++
int ngpus = faiss::gpu::getNumDevices();
printf("Number of GPUs: %d\n", ngpus);

std::vector<faiss::gpu::GpuResources*> res;
std::vector<int> devs;
for(int i = 0; i < ngpus; i++) {
    res.push_back(new faiss::gpu::StandardGpuResources);
    devs.push_back(i);
}

faiss::IndexFlatL2 cpu_index(d);

faiss::Index *gpu_index =
    faiss::gpu::index_cpu_to_gpu_multiple(
        res,
        devs,
        &cpu_index
    );

printf("is_trained = %s\n", gpu_index->is_trained ? "true" : "false");
gpu_index->add(nb, xb);  // add vectors to the index
printf("ntotal = %ld\n", gpu_index->ntotal);

int k = 4;
{ // search xq
    idx_t *I = new idx_t[k * nq];
    float *D = new float[k * nq];

    gpu_index->search(nq, xq, k, D, I);

    // print results
    printf("I (5 first results)=\n");
    for(int i = 0; i < 5; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }

    printf("I (5 last results)=\n");
    for(int i = nq - 5; i < nq; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }

    delete [] I;
    delete [] D;
}

delete gpu_index;

for(int i = 0; i < ngpus; i++) {
    delete res[i];
}
Semantic search with FAISS
In section 5, we created a dataset of GitHub issues and comments
from the 🤗 Datasets repository. In this section we’ll use this
information to build a search engine that can help us find answers to
our most pressing questions about the library!
Using embeddings for semantic search
As we saw in Chapter 1, Transformer-based language models
represent each token in a span of text as an embedding vector. It
turns out that one can “pool” the individual embeddings to create a
vector representation for whole sentences, paragraphs, or (in some
cases) documents. These embeddings can then be used to find
similar documents in the corpus by computing the dot-product
similarity (or some other similarity metric) between each embedding
and returning the documents with the greatest overlap.
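As a toy illustration of that idea (nothing FAISS- or Transformers-specific yet; the embeddings below are random placeholders standing in for pooled sentence vectors):

import numpy as np

doc_embeddings = np.random.randn(1000, 768).astype("float32")  # one pooled vector per document
query_embedding = np.random.randn(768).astype("float32")       # pooled vector for the query

scores = doc_embeddings @ query_embedding   # dot-product similarity with every document
top_5 = np.argsort(-scores)[:5]             # indices of the 5 most similar documents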
In this section we’ll use embeddings to develop a semantic search
engine. These search engines offer several advantages over
conventional approaches that are based on matching keywords in a
query with the documents.
Loading and preparing the dataset
The first thing we need to do is download our dataset of GitHub
issues, so let's use the load_dataset() function as usual:
from datasets import load_dataset
issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset
Dataset({
features: ['url', 'repository_url', 'labels_url', 'comments_url',
'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels',
'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments',
'created_at', 'updated_at', 'closed_at', 'author_association',
'active_lock_reason', 'pull_request', 'body',
'performed_via_github_app', 'is_pull_request'],
num_rows: 2855
})
Here we’ve specified the default train split in load_dataset(), so it
returns a Dataset instead of a DatasetDict. The first order of
business is to filter out the pull requests, as these tend to be rarely
used for answering user queries and will introduce noise in our
search engine. As should be familiar by now, we can use
the Dataset.filter() function to exclude these rows in our dataset.
While we’re at it, let’s also filter out rows with no comments, since
these provide no answers to user queries:
issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
)
issues_dataset
Dataset({
features: ['url', 'repository_url', 'labels_url', 'comments_url',
'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels',
'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments',
'created_at', 'updated_at', 'closed_at', 'author_association',
'active_lock_reason', 'pull_request', 'body',
'performed_via_github_app', 'is_pull_request'],
num_rows: 771
})
We can see that there are a lot of columns in our dataset, most of
which we don’t need to build our search engine. From a search
perspective, the most informative columns are title, body,
and comments, while html_url provides us with a link back to the
source issue. Let’s use the Dataset.remove_columns() function to
drop the rest:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset
Dataset({
features: ['html_url', 'title', 'comments', 'body'],
num_rows: 771
})
To create our embeddings we’ll augment each comment with the
issue’s title and body, since these fields often include useful
contextual information. Because our comments column is currently a
list of comments for each issue, we need to “explode” the column so
that each row consists of an (html_url, title, body, comment) tuple.
In Pandas we can do this with the DataFrame.explode() function,
which creates a new row for each element in a list-like column, while
replicating all the other column values. To see this in action, let’s
first switch to the Pandas DataFrame format:
issues_dataset.set_format("pandas")
df = issues_dataset[:]
If we inspect the first row in this DataFrame we can see there are
four comments associated with this issue:
df["comments"][0].tolist()
['the bug code locate in :\r\n if data_args.task_name is not None:\r\n # Downloading and loading a dataset from the hub.\r\n datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir)',
 'Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com\r\n\r\nNormally, it should work if you wait a little and then retry.\r\n\r\nCould you please confirm if the problem persists?',
 'cannot connect ,even by Web browser ,please check that there is some problems。',
 'I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem...']
When we explode df, we expect to get one row for each of these
comments. Let’s check if that’s the case:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)
| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues/2787 | ConnectionError: Couldn't reach https://raw.githubusercontent.com | the bug code locate in :\r\n if data_args.task_name is not None... | Hello,\r\nI am trying to run run_glue.py and it gives me this error... |
| 1 | https://github.com/huggingface/datasets/issues/2787 | ConnectionError: Couldn't reach https://raw.githubusercontent.com | Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com... | Hello,\r\nI am trying to run run_glue.py and it gives me this error... |
| 2 | https://github.com/huggingface/datasets/issues/2787 | ConnectionError: Couldn't reach https://raw.githubusercontent.com | cannot connect , even by Web browser , please check that there is some problems。 | Hello,\r\nI am trying to run run_glue.py and it gives me this error... |
| 3 | https://github.com/huggingface/datasets/issues/2787 | ConnectionError: Couldn't reach https://raw.githubusercontent.com | I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem... | Hello,\r\nI am trying to run run_glue.py and it gives me this error... |
Great, we can see the rows have been replicated, with
the comments column containing the individual comments! Now that
we’re finished with Pandas, we can quickly switch back to
a Dataset by loading the DataFrame in memory:
from datasets import Dataset
comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset
Dataset({
features: ['html_url', 'title', 'comments', 'body'],
num_rows: 2842
})
Okay, this has given us a few thousand comments to work with!
✏️Try it out! See if you can use Dataset.map() to explode
the comments column of issues_dataset without resorting to the use
of Pandas. This is a little tricky; you might find the “Batch
mapping” section of the 🤗 Datasets documentation useful for this
task.
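One possible shape for such a solution is sketched below, using a batched map whose output has more rows than its input. This is just one way to do it, not the course's reference answer; the function and variable names are ours.

def explode_comments(batch):
    # emit one output row per (issue, comment) pair, replicating the other columns
    exploded = {column: [] for column in batch}
    for i, comments in enumerate(batch["comments"]):
        for comment in comments:
            for column in batch:
                exploded[column].append(comment if column == "comments" else batch[column][i])
    return exploded

comments_dataset_alt = issues_dataset.map(
    explode_comments, batched=True, remove_columns=issues_dataset.column_names
)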
Now that we have one comment per row, let’s create a
new comment_length column that contains the number of words
per comment:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)
We can use this new column to filter out short comments, which
typically include things like “cc @lewtun” or “Thanks!” that are not
relevant for our search engine. There’s no precise number to select
for the filter, but around 15 words seems like a good start:
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset
Dataset({
features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
num_rows: 2098
})
Having cleaned up our dataset a bit, let’s concatenate the issue title,
description, and comments together in a new text column. As usual,
we’ll write a simple function that we can pass to Dataset.map():
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)
We’re finally ready to create some embeddings! Let’s take a look.
Creating text embeddings
We saw in Chapter 2 that we can obtain token embeddings by using
the AutoModel class. All we need to do is pick a suitable checkpoint
to load the model from. Fortunately, there’s a library
called sentence-transformers that is dedicated to creating
embeddings. As described in the library’s documentation, our use
case is an example of asymmetric semantic search because we have
a short query whose answer we’d like to find in a longer document,
like an issue comment. The handy model overview table in the
documentation indicates that the multi-qa-mpnet-base-dot-
v1 checkpoint has the best performance for semantic search, so
we’ll use that for our application. We’ll also load the tokenizer using
the same checkpoint:
from transformers import AutoTokenizer, TFAutoModel
model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)
Note that we’ve set from_pt=True as an argument of
the from_pretrained() method. That’s because the multi-qa-mpnet-
base-dot-v1 checkpoint only has PyTorch weights, so
setting from_pt=True will automatically convert them to the
TensorFlow format for us. As you can see, it is very simple to switch
between frameworks in 🤗 Transformers!
As we mentioned earlier, we’d like to represent each entry in our
GitHub issues corpus as a single vector, so we need to “pool” or
average our token embeddings in some way. One popular approach
is to perform CLS pooling on our model’s outputs, where we simply
collect the last hidden state for the special [CLS] token. The
following function does the trick for us:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
Next, we’ll create a helper function that will tokenize a list of
documents, feed the tensors to the model, and finally apply CLS
pooling to the outputs:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="tf"
    )
    encoded_input = {k: v for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)
We can test the function works by feeding it the first text entry in
our corpus and inspecting the output shape:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape
TensorShape([1, 768])
Great, we’ve converted the first entry in our corpus into a 768-
dimensional vector! We can use Dataset.map() to apply
our get_embeddings() function to each row in our corpus, so let’s
create a new embeddings column as follows:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)
Notice that we’ve converted the embeddings to NumPy arrays —
that’s because 🤗 Datasets requires this format when we try to index
them with FAISS, which we’ll do next.
Using FAISS for efficient similarity search
Now that we have a dataset of embeddings, we need some way to
search over them. To do this, we’ll use a special data structure in 🤗
Datasets called a FAISS index. FAISS (short for Facebook AI Similarity
Search) is a library that provides efficient algorithms to quickly
search and cluster embedding vectors.
The basic idea behind FAISS is to create a special data structure
called an index that allows one to find which embeddings are similar
to an input embedding. Creating a FAISS index in 🤗 Datasets is
simple — we use the Dataset.add_faiss_index() function and specify
which column of our dataset we’d like to index:
embeddings_dataset.add_faiss_index(column="embeddings")
We can now perform queries on this index by doing a nearest
neighbor lookup with the Dataset.get_nearest_examples() function.
Let’s test this out by first embedding a question as follows:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape
(1, 768)
Just like with the documents, we now have a 768-dimensional vector
representing the query, which we can compare against the whole
corpus to find the most similar embeddings:
scores, samples = embeddings_dataset.get_nearest_examples(
"embeddings", question_embedding, k=5
)
The Dataset.get_nearest_examples() function returns a tuple of
scores that rank the overlap between the query and the document,
and a corresponding set of samples (here, the 5 best matches). Let’s
collect these in a pandas.DataFrame so we can easily sort them:
import pandas as pd
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
Now we can iterate over the first few rows to see how well our query
matched the available comments:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()
"""
COMMENT: Requiring online connection is a deal breaker in some
cases unfortunately so it'd be great if offline mode is added similar
to how `transformers` loads models offline fine.
@mandubian's second bullet point suggests that there's a
workaround allowing you to use your offline (custom?) dataset with
`datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505046844482422
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
COMMENT: The local dataset builders (csv, text , json and pandas)
are now part of the `datasets` package since #1726 :)
You can now use them offline
\`\`\`python
datasets = load_dataset("text", data_files=data_files)
\`\`\`
We'll do a new release soon
SCORE: 24.555509567260742
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
COMMENT: I opened a PR that allows to reload modules that have
already been loaded once even if there's no internet.
Let me know if you know other ways that can make the offline mode
experience better. I'd be happy to add them :)
I already note the "freeze" modules option, to prevent local modules
updates. It would be a cool feature.
----------
> @mandubian's second bullet point suggests that there's a
workaround allowing you to use your offline (custom?) dataset with
`datasets`. Could you please elaborate on how that should look like?
Indeed `load_dataset` allows to load remote dataset script (squad,
glue, etc.) but also you own local ones.
For example if you have a dataset script at
`./my_dataset/my_dataset.py` then you can do
\`\`\`python
load_dataset("./my_dataset")
\`\`\`
and the dataset script will generate your dataset once and for all.
----------
About I'm looking into having `csv`, `json`, `text`, `pandas` dataset
builders already included in the `datasets` package, so that they are
available offline by default, as opposed to the other datasets that
require the script to be downloaded.
cf #1724
SCORE: 24.14896583557129
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
COMMENT: > here is my way to load a dataset offline, but it
**requires** an online machine
>
> 1. (online machine)
>
> ```
>
> import datasets
>
> data = datasets.load_dataset(...)
>
> data.save_to_disk(/YOUR/DATASET/DIR)
>
> ```
>
> 2. copy the dir from online to the offline machine
>
> 3. (offline machine)
>
> ```
>
> import datasets
>
> data = datasets.load_from_disk(/SAVED/DATA/DIR)
>
> ```
>
>
>
> HTH.
SCORE: 22.893993377685547
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
COMMENT: here is my way to load a dataset offline, but it
**requires** an online machine
1. (online machine)
\`\`\`
import datasets
data = datasets.load_dataset(...)
data.save_to_disk(/YOUR/DATASET/DIR)
\`\`\`
2. copy the dir from online to the offline machine
3. (offline machine)
\`\`\`
import datasets
data = datasets.load_from_disk(/SAVED/DATA/DIR)
\`\`\`
HTH.
SCORE: 22.406635284423828
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
"""
Not bad! Our second hit seems to match the query.
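To make this reusable, the embedding and lookup steps can be folded into one small helper. This is a sketch built only from the objects defined above; the function name is ours, not from the course.

def search_issues(query, k=5):
    query_embedding = get_embeddings([query]).numpy()
    scores, samples = embeddings_dataset.get_nearest_examples(
        "embeddings", query_embedding, k=k
    )
    # sort the k hits by descending score before returning them
    return sorted(
        zip(scores, samples["title"], samples["html_url"], samples["comments"]),
        key=lambda item: item[0],
        reverse=True,
    )

for score, title, url, comment in search_issues("How can I load a dataset offline?"):
    print(f"{score:.2f}  {title}  ({url})")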