
FAISS

Contents

Installing Faiss
Getting started
Lower memory footprint
Running on GPUs
Semantic search with FAISS

Installing Faiss

Standard installs

We support compiling Faiss from source with cmake and installing via conda on a limited set of platforms: Linux (x86 and ARM), Mac (x86 only), Windows (x86 only). For this, see INSTALL.md.

Why don't you support installing via XXX?

The reason we don't support more platforms is that it is a lot of work to make sure Faiss runs in the supported configurations: building the conda packages for a new release of Faiss always surfaces compatibility issues. Anaconda provides a sufficiently controlled environment that we can be confident it will run on users' machines (this is not the case with pip). Besides, the platform (hardware and OS) has to be supported by our CI tool (CircleCI).

So we are very careful before we add new officially supported platforms (hardware and software). We are very interested in success (or failure!) stories about porting to other platforms, and related PRs.

Special configurations

Compiling the python interface within an Anaconda install

The idea is to install everything via anaconda and link Faiss against that. This is useful to make sure the MKL implementation is as fast as possible.

source ~/anaconda3/etc/profile.d/conda.sh
conda activate host_env_for_faiss  # an environment that contains python and numpy
git clone https://github.com/facebookresearch/faiss.git faiss_xx
cd faiss_xx

LD_LIBRARY_PATH= MKLROOT=/private/home/matthijs/anaconda3/envs/host_env_for_faiss/lib CXX=$(which g++) \
$cmake -B build -DBUILD_TESTING=ON -DFAISS_ENABLE_GPU=OFF \
    -DFAISS_OPT_LEVEL=avx2 \
    -DFAISS_ENABLE_C_API=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DBLA_VENDOR=Intel10_64_dyn .

make -C build -j10 swigfaiss && (cd build/faiss/python ; python3 setup.py build)

(cd tests ; PYTHONPATH=../build/faiss/python/build/lib/ OMP_NUM_THREADS=1 python -m unittest -v discover)

Compiling Faiss on ARM

Commands for an Ubuntu 18 image on an Amazon c6g.8xlarge machine:

set -e

sudo apt-get install libatlas-base-dev libatlas3-base
sudo apt-get install clang-8
sudo apt-get install swig

# cmake provided with ubuntu is too old
wget https://github.com/Kitware/CMake/releases/download/v3.19.3/cmake-3.19.3.tar.gz
tar xvzf cmake-3.19.3.tar.gz
cd cmake-3.19.3/
./configure --prefix=/home/matthijs/cmake && make -j
cd $HOME
alias cmake=$HOME/cmake/bin/cmake

# clone Faiss
git clone https://github.com/facebookresearch/faiss.git
cd faiss

cmake -B build -DCMAKE_CXX_COMPILER=clang++-8 -DFAISS_ENABLE_GPU=OFF \
    -DPython_EXECUTABLE=$(which python3) -DFAISS_OPT_LEVEL=generic \
    -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON

(cd build/faiss/python/ ; python3 setup.py build)

# run tests
export PYTHONPATH=$PWD/build/faiss/python/build/lib/
python3 -m unittest discover

Getting started

For the following, we assume Faiss is installed. We provide code examples in C++ and Python. The code can be run by copy/pasting it or running it from the tutorial/ subdirectory of the Faiss distribution.

Getting some data

Faiss handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s. These collections can be stored in matrices. We assume row-major storage, i.e. the j'th component of vector number i is stored in row i, column j of the matrix. Faiss uses only 32-bit floating point matrices.

We need two matrices:

 xb for the database, which contains all the vectors that must be indexed, and that we are going to search in. Its size is nb-by-d.
 xq for the query vectors, for which we need to find the nearest neighbors. Its size is nq-by-d. If we have a single query vector, nq=1.

In the following examples we are going to work with vectors that are drawn from a uniform distribution in d=64 dimensions. Just for fun, we add a small translation along the first dimension that depends on the vector index.

In Python

import numpy as np

d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

In Python, the matrices are always represented as numpy arrays. The data type dtype must be float32.

In C++

int d = 64;                            // dimension
int nb = 100000;                       // database size
int nq = 10000;                        // nb of queries
float *xb = new float[d * nb];
float *xq = new float[d * nq];
for(int i = 0; i < nb; i++) {
    for(int j = 0; j < d; j++) xb[d * i + j] = drand48();
    xb[d * i] += i / 1000.;
}
for(int i = 0; i < nq; i++) {
    for(int j = 0; j < d; j++) xq[d * i + j] = drand48();
    xq[d * i] += i / 1000.;
}

This example uses plain arrays, because this is the lowest common denominator all C++ matrix libraries support. Faiss can accommodate any matrix library, provided it exposes a pointer to the underlying data. For example, std::vector<float>'s internal pointer is given by the data() method.

Building an index and adding the vectors to it

Faiss is built around the Index object. It encapsulates the set of


8
database vectors, and optionally preprocesses them to make

searching efficient. There are many types of indexes, we are going to

use the simplest version that just performs brute-force L2 distance

search on them: IndexFlatL2.

All indexes need to know when they are built which is the

dimensionality of the vectors they operate on, d in our case. Then,

most of the indexes also require a training phase, to analyze the

distribution of the vectors. For IndexFlatL2, we can skip this

operation.

When the index is built and trained, two operations can be

performed on the index: add and search.

To add elements to the index, we call add on xb. We can also display

the two state variables of the index: is_trained, a boolean that

indicates whether training is required and ntotal, the number of

indexed vectors.

Some indexes can also store integer IDs corresponding to each of

the vectors (but not IndexFlatL2). If no IDs are provided, add just

uses the vector ordinal as the id, ie. the first vector gets 0, the

second 1, etc.
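
For indexes that do accept explicit IDs, one common option is to wrap the index in an IndexIDMap and call add_with_ids. The following is a minimal sketch, not part of the original tutorial; the custom_ids values are arbitrary and only for illustration:

import numpy as np
import faiss

# assumes d and xb are defined as in the data-generation step above
custom_ids = np.arange(100000, 100000 + xb.shape[0]).astype('int64')  # made-up example IDs

id_index = faiss.IndexIDMap(faiss.IndexFlatL2(d))  # wrapper that stores one user-provided ID per vector
id_index.add_with_ids(xb, custom_ids)              # add vectors together with their IDs
# searches on id_index now return the custom IDs instead of sequential ordinals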

In Python

import faiss                   # make faiss available
index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)

In C++

faiss::IndexFlatL2 index(d);           // call constructor
printf("is_trained = %s\n", index.is_trained ? "true" : "false");
index.add(nb, xb);                     // add vectors to the index
printf("ntotal = %ld\n", index.ntotal);

Results

This should just display true (the index is trained) and 100000 (vectors are stored in the index).

Searching

The basic search operation that can be performed on an index is the k-nearest-neighbor search, i.e. for each query vector, find its k nearest neighbors in the database.

The result of this operation can be conveniently stored in an integer matrix of size nq-by-k, where row i contains the IDs of the neighbors of query vector i, sorted by increasing distance. In addition to this matrix, the search operation returns an nq-by-k floating-point matrix with the corresponding squared distances.

As a sanity check, we can first search a few database vectors, to make sure the nearest neighbor is indeed the vector itself.


In Python

k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

In C++

int k = 4;
{   // sanity check: search 5 first vectors of xb
    idx_t *I = new idx_t[k * 5];
    float *D = new float[k * 5];
    index.search(5, xb, k, D, I);
    printf("I=\n");
    for(int i = 0; i < 5; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }
    ...
    delete [] I;
    delete [] D;
}
{   // search xq
    idx_t *I = new idx_t[k * nq];
    float *D = new float[k * nq];
    index.search(nq, xq, k, D, I);
    ...
}

The extract is edited because otherwise the C++ version becomes very verbose; see the full code in the tutorial/cpp subdirectory of Faiss.

Results

The output of the sanity check should look like

[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]

[[ 0.          7.17517328  7.2076292   7.25116253]
 [ 0.          6.32356453  6.6845808   6.79994535]
 [ 0.          5.79640865  6.39173603  7.28151226]
 [ 0.          7.27790546  7.52798653  7.66284657]
 [ 0.          6.76380348  7.29512024  7.36881447]]

i.e. the nearest neighbor of each query is indeed the vector itself, and the corresponding distance is 0. Within each row, the distances are increasing.

The output of the actual search is similar to

[[ 381  207  210  477]
 [ 526  911  142   72]
 [ 838  527 1290  425]
 [ 196  184  164  359]
 [ 526  377  120  425]]

[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]

Because of the value added to the first component of the vectors, the dataset is smeared along the first axis in d-dimensional space. So the neighbors of the first few vectors are around the beginning of the dataset, and the neighbors of the vectors around ~10000 are also around index 10000 in the dataset.

Executing the search above takes about 3.3s on a 2016 machine.


Faster search

This is too slow, how can I make it faster?

To speed up the search, it is possible to segment the dataset into pieces. We define Voronoi cells in the d-dimensional space, and each database vector falls in one of the cells. At search time, only the database vectors y contained in the cell the query x falls in, and in a few neighboring cells, are compared against the query vector.

This is done via the IndexIVFFlat index. This type of index requires a training stage, which can be performed on any collection of vectors that has the same distribution as the database vectors. In this case we just use the database vectors themselves.

The IndexIVFFlat also requires another index, the quantizer, that assigns vectors to Voronoi cells. Each cell is defined by a centroid, and finding the Voronoi cell a vector falls in consists in finding the nearest neighbor of the vector in the set of centroids. This is the task of the other index, which is typically an IndexFlatL2.

There are two parameters to the search method: nlist, the number of cells, and nprobe, the number of cells (out of nlist) that are visited to perform a search. The search time roughly increases linearly with the number of probes, plus some constant due to the quantization.


In Python

nlist = 100
k = 4
quantizer = faiss.IndexFlatL2(d)  # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist)
assert not index.is_trained
index.train(xb)
assert index.is_trained

index.add(xb)                  # add may be a bit slower as well
D, I = index.search(xq, k)     # actual search
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = 10              # default nprobe is 1, try a few more
D, I = index.search(xq, k)
print(I[-5:])                  # neighbors of the 5 last queries

In C++

int nlist = 100;
int k = 4;
faiss::IndexFlatL2 quantizer(d);       // the other index
faiss::IndexIVFFlat index(&quantizer, d, nlist);
assert(!index.is_trained);
index.train(nb, xb);
assert(index.is_trained);
index.add(nb, xb);

{   // search xq
    idx_t *I = new idx_t[k * nq];
    float *D = new float[k * nq];
    index.search(nq, xq, k, D, I);
    printf("I=\n");                // print neighbors of 5 last queries
    ...
    index.nprobe = 10;             // default nprobe is 1, try a few more
    index.search(nq, xq, k, D, I);
    printf("I=\n");
    ...
}

Results

For nprobe=1, the result looks like

[[ 9900 10500  9831 10808]
 [11055 10812 11321 10260]
 [11353 10164 10719 11013]
 [10571 10203 10793 10952]
 [ 9582 10304  9622  9229]]

The values are similar, but not exactly the same as for the brute-force search (see above). This is because some of the results were not in the exact same Voronoi cell. Therefore, visiting a few more cells may prove useful.

Increasing nprobe to 10 does exactly this:

[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]

which is the correct result. Note that getting a perfect result in this case is merely an artifact of the data distribution, as it has a strong component on the x-axis which makes it easier to handle.

The nprobe parameter is always a way of adjusting the tradeoff between speed and accuracy of the result. Setting nprobe = nlist gives the same result as the brute-force search (but slower).
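
To get a feel for this tradeoff, here is a minimal sketch (not part of the original tutorial) that times the same query batch at a few nprobe settings, assuming the index, xq, k and nlist defined above:

import time

# Sweep nprobe: search time grows with the number of probed cells,
# while the results get closer to the brute-force answer.
for nprobe in (1, 5, 10, 50, nlist):
    index.nprobe = nprobe
    t0 = time.time()
    D, I = index.search(xq, k)
    print("nprobe=%4d  search time: %.3fs" % (nprobe, time.time() - t0))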

Lower memory footprint

This uses too much memory, how can I shrink the storage?

The indexes we have seen, IndexFlatL2 and IndexIVFFlat, both store the full vectors. To scale up to very large datasets, Faiss offers variants that compress the stored vectors with a lossy compression based on product quantizers.

The vectors are still stored in Voronoi cells, but their size is reduced to a configurable number of bytes m (d must be a multiple of m). The compression is based on a Product Quantizer, which can be seen as an additional level of quantization applied to sub-vectors of the vectors to encode.

In this case, since the vectors are not stored exactly, the distances returned by the search method are also approximations.

In Python

nlist = 100
m = 8                             # number of subquantizers
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                  # 8 specifies that each sub-vector is encoded as 8 bits
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k)    # sanity check
print(I)
print(D)
index.nprobe = 10                 # make comparable with experiment above
D, I = index.search(xq, k)        # search
print(I[-5:])

In C++

int nlist = 100;
int k = 4;
int m = 8;                             // number of subquantizers
faiss::IndexFlatL2 quantizer(d);       // the other index
faiss::IndexIVFPQ index(&quantizer, d, nlist, m, 8);
index.train(nb, xb);
index.add(nb, xb);

{   // sanity check
    ...
    index.search(5, xb, k, D, I);
    printf("I=\n");
    ...
    printf("D=\n");
    ...
}
{   // search xq
    ...
    index.nprobe = 10;
    index.search(nq, xq, k, D, I);
    printf("I=\n");
    ...
}

Results

The results look like:

[[   0  608  220  228]
 [   1 1063  277  617]
 [   2   46  114  304]
 [   3  791  527  316]
 [   4  159  288  393]]

[[ 1.40704751  6.19361687  6.34912491  6.35771513]
 [ 1.49901485  5.66632462  5.94188499  6.29570007]
 [ 1.63260388  6.04126883  6.18447495  6.26815748]
 [ 1.5356375   6.33165455  6.64519501  6.86594009]
 [ 1.46203303  6.5022912   6.62621975  6.63154221]]

We can observe that the nearest neighbor is found correctly (it is the vector ID itself), but the estimated distance of the vector to itself is not 0, although it is significantly lower than the distance to the other neighbors. This is due to the lossy compression.

Here we compress 64 32-bit floats to 8 bytes, so the compression factor is 32.
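
As a back-of-the-envelope check of that figure, a small sketch using the d and m values from the example above (not part of the original tutorial):

d, m = 64, 8                       # values from the example above
flat_bytes = d * 4                 # IndexFlatL2 stores d float32 values = 256 bytes per vector
pq_code_bytes = m                  # IndexIVFPQ stores m 8-bit codes = 8 bytes per vector (plus a small per-entry overhead)
print(flat_bytes / pq_code_bytes)  # -> 32.0, the compression factor quoted above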

When searching on real queries, the results look like:

[[ 9432  9649  9900 10287]
 [10229 10403  9829  9740]
 [10847 10824  9787 10089]
 [11268 10935 10260 10571]
 [ 9582 10304  9616  9850]]

They can be compared with the IVFFlat results above. In this case, most results are wrong, but they are in the correct area of the space, as shown by the IDs around 10000. The situation is better for real data because:

 uniform data is very difficult to index because there is no regularity that can be exploited to cluster or reduce dimensionality
 for natural data, the semantic nearest neighbor is often significantly closer than irrelevant results.

Simplifying index construction

Since building indexes can become complicated, there is a factory function that constructs them given a string. The indexes above can be obtained with the following shorthand:

index = faiss.index_factory(d, "IVF100,PQ8")

faiss::Index *index = faiss::index_factory(d, "IVF100,PQ8");

Replace PQ8 with Flat to get an IndexFlat. The factory is particularly useful when preprocessing (PCA) is applied to the input vectors. For example, the factory string to reduce the vectors to 32 dimensions by a PCA projection as preprocessing is: "PCA32,IVF100,Flat".
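
A minimal sketch of using that PCA factory string end to end, assuming the d, xb, xq and k variables from the earlier examples (not part of the original tutorial):

import faiss

# "PCA32,IVF100,Flat": a PCA transform down to 32 dimensions, followed by an
# IVF index with 100 cells that stores the reduced vectors without further compression.
index = faiss.index_factory(d, "PCA32,IVF100,Flat")
index.train(xb)                             # trains both the PCA matrix and the IVF quantizer
index.add(xb)
faiss.extract_index_ivf(index).nprobe = 10  # reach inside the pre-transform wrapper to set the IVF search parameter
D, I = index.search(xq, k)
print(I[:5])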

Further reading

Explore the next sections to get more specific information about the types of indexes, GPU Faiss, coding structure, etc.

Running on GPUs

Faiss can leverage your NVIDIA GPUs almost seamlessly.

First, declare a GPU resource, which encapsulates a chunk of the GPU memory:

In Python

res = faiss.StandardGpuResources()  # use a single GPU

In C++

faiss::gpu::StandardGpuResources res;  // use a single GPU

Then build a GPU index using the GPU resource:

In Python

# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)

In C++

faiss::gpu::GpuIndexFlatL2 gpu_index_flat(&res, d);

Note: a single GPU resource object can be used by multiple indices, as long as they are not issuing concurrent queries.

The obtained GPU index can be used the exact same way as a CPU index:

In Python

gpu_index_flat.add(xb)               # add vectors to the index
print(gpu_index_flat.ntotal)

k = 4                                # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k)  # actual search
print(I[:5])                         # neighbors of the 5 first queries
print(I[-5:])                        # neighbors of the 5 last queries

In C++

gpu_index_flat.add(nb, xb);  // add vectors to the index
printf("ntotal = %ld\n", gpu_index_flat.ntotal);

int k = 4;
{   // search xq
    idx_t *I = new idx_t[k * nq];
    float *D = new float[k * nq];
    gpu_index_flat.search(nq, xq, k, D, I);

    // print results
    printf("I (5 first results)=\n");
    for(int i = 0; i < 5; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }
    printf("I (5 last results)=\n");
    for(int i = nq - 5; i < nq; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }
    delete [] I;
    delete [] D;
}

Results

The results are the same as for the CPU version. Note also that the performance increase will not be noticeable on a small dataset.

Using multiple GPUs

Making use of multiple GPUs is mainly a matter of declaring several GPU resources. In Python, this can be done implicitly using the index_cpu_to_all_gpus helper.

Examples:

In Python

ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)

cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(  # build the index
    cpu_index
)

gpu_index.add(xb)              # add vectors to the index
print(gpu_index.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

In C++

int ngpus = faiss::gpu::getNumDevices();
printf("Number of GPUs: %d\n", ngpus);

std::vector<faiss::gpu::GpuResources*> res;
std::vector<int> devs;
for(int i = 0; i < ngpus; i++) {
    res.push_back(new faiss::gpu::StandardGpuResources);
    devs.push_back(i);
}

faiss::IndexFlatL2 cpu_index(d);

faiss::Index *gpu_index =
    faiss::gpu::index_cpu_to_gpu_multiple(
        res,
        devs,
        &cpu_index
    );

printf("is_trained = %s\n", gpu_index->is_trained ? "true" : "false");
gpu_index->add(nb, xb);  // add vectors to the index
printf("ntotal = %ld\n", gpu_index->ntotal);

int k = 4;
{   // search xq
    idx_t *I = new idx_t[k * nq];
    float *D = new float[k * nq];
    gpu_index->search(nq, xq, k, D, I);

    // print results
    printf("I (5 first results)=\n");
    for(int i = 0; i < 5; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }
    printf("I (5 last results)=\n");
    for(int i = nq - 5; i < nq; i++) {
        for(int j = 0; j < k; j++) printf("%5ld ", I[i * k + j]);
        printf("\n");
    }
    delete [] I;
    delete [] D;
}

delete gpu_index;

for(int i = 0; i < ngpus; i++) {
    delete res[i];
}

Semantic search with FAISS

In section 5, we created a dataset of GitHub issues and comments from the 🤗 Datasets repository. In this section we’ll use this information to build a search engine that can help us find answers to our most pressing questions about the library!

Using embeddings for semantic search

As we saw in Chapter 1, Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can “pool” the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and returning the documents with the greatest overlap.

In this section we’ll use embeddings to develop a semantic search engine. These search engines offer several advantages over conventional approaches that are based on matching keywords in a query with the documents.
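
As a rough illustration of that idea (a toy example with made-up vectors, not part of the course), ranking documents by dot-product similarity is a single matrix-vector product:

import numpy as np

# Toy embeddings: 4 "documents" and 1 query, each a 768-dimensional vector (random placeholders).
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(4, 768)).astype("float32")
query_embedding = rng.normal(size=(768,)).astype("float32")

scores = doc_embeddings @ query_embedding  # dot-product similarity per document
ranking = np.argsort(-scores)              # document indices, most similar first
print(ranking, scores[ranking])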

Loading and preparing the dataset

The first thing we need to do is download our dataset of GitHub issues, so let’s use the load_dataset() function as usual:

from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 2855
})

Here we’ve specified the default train split in load_dataset(), so it returns a Dataset instead of a DatasetDict. The first order of business is to filter out the pull requests, as these tend to be rarely used for answering user queries and will introduce noise in our search engine. As should be familiar by now, we can use the Dataset.filter() function to exclude these rows in our dataset. While we’re at it, let’s also filter out rows with no comments, since these provide no answers to user queries:

issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 771
})

We can see that there are a lot of columns in our dataset, most of which we don’t need to build our search engine. From a search perspective, the most informative columns are title, body, and comments, while html_url provides us with a link back to the source issue. Let’s use the Dataset.remove_columns() function to drop the rest:

columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 771
})

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to “explode” the column so that each row consists of an (html_url, title, body, comment) tuple. In Pandas we can do this with the DataFrame.explode() function, which creates a new row for each element in a list-like column, while replicating all the other column values. To see this in action, let’s first switch to the Pandas DataFrame format:

issues_dataset.set_format("pandas")
df = issues_dataset[:]

If we inspect the first row in this DataFrame we can see there are four comments associated with this issue:

df["comments"][0].tolist()

['the bug code locate in :\r\n if data_args.task_name is not None:\r\n # Downloading and loading a dataset from the hub.\r\n datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir)',
 'Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com\r\n\r\nNormally, it should work if you wait a little and then retry.\r\n\r\nCould you please confirm if the problem persists?',
 'cannot connect ,even by Web browser ,please check that there is some problems。',
 'I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem...']

When we explode df, we expect to get one row for each of these comments. Let’s check if that’s the case:

comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

   html_url                                             title                                                               comments                                                              body
0  https://github.com/huggingface/datasets/issues/2787  ConnectionError: Couldn't reach https://raw.githubusercontent.com  the bug code locate in :\r\n if data_args.task_name is not None...  Hello,\r\nI am trying to run run_glue.py and it gives me this error...
1  https://github.com/huggingface/datasets/issues/2787  ConnectionError: Couldn't reach https://raw.githubusercontent.com  Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com...  Hello,\r\nI am trying to run run_glue.py and it gives me this error...
2  https://github.com/huggingface/datasets/issues/2787  ConnectionError: Couldn't reach https://raw.githubusercontent.com  cannot connect , even by Web browser , please check that there is some problems。  Hello,\r\nI am trying to run run_glue.py and it gives me this error...
3  https://github.com/huggingface/datasets/issues/2787  ConnectionError: Couldn't reach https://raw.githubusercontent.com  I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem...  Hello,\r\nI am trying to run run_glue.py and it gives me this error...

Great, we can see the rows have been replicated, with the comments column containing the individual comments! Now that we’re finished with Pandas, we can quickly switch back to a Dataset by loading the DataFrame in memory:

from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2842
})

Okay, this has given us a few thousand comments to work with!

✏️ Try it out! See if you can use Dataset.map() to explode the comments column of issues_dataset without resorting to the use of Pandas. This is a little tricky; you might find the “Batch mapping” section of the 🤗 Datasets documentation useful for this task.

Now that we have one comment per row, let’s create a new comment_length column that contains the number of words per comment:

comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

We can use this new column to filter out short comments, which typically include things like “cc @lewtun” or “Thanks!” that are not relevant for our search engine. There’s no precise number to select for the filter, but around 15 words seems like a good start:

comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2098
})

Having cleaned up our dataset a bit, let’s concatenate the issue title, description, and comments together in a new text column. As usual, we’ll write a simple function that we can pass to Dataset.map():

def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

We’re finally ready to create some embeddings! Let’s take a look.

Creating text embeddings

We saw in Chapter 2 that we can obtain token embeddings by using the AutoModel class. All we need to do is pick a suitable checkpoint to load the model from. Fortunately, there’s a library called sentence-transformers that is dedicated to creating embeddings. As described in the library’s documentation, our use case is an example of asymmetric semantic search because we have a short query whose answer we’d like to find in a longer document, like an issue comment. The handy model overview table in the documentation indicates that the multi-qa-mpnet-base-dot-v1 checkpoint has the best performance for semantic search, so we’ll use that for our application. We’ll also load the tokenizer using the same checkpoint:

from transformers import AutoTokenizer, TFAutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)

Note that we’ve set from_pt=True as an argument of the from_pretrained() method. That’s because the multi-qa-mpnet-base-dot-v1 checkpoint only has PyTorch weights, so setting from_pt=True will automatically convert them to the TensorFlow format for us. As you can see, it is very simple to switch between frameworks in 🤗 Transformers!

As we mentioned earlier, we’d like to represent each entry in our GitHub issues corpus as a single vector, so we need to “pool” or average our token embeddings in some way. One popular approach is to perform CLS pooling on our model’s outputs, where we simply collect the last hidden state for the special [CLS] token. The following function does the trick for us:

def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

Next, we’ll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs:

def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="tf"
    )
    encoded_input = {k: v for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

We can test the function works by feeding it the first text entry in our corpus and inspecting the output shape:

embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

TensorShape([1, 768])

Great, we’ve converted the first entry in our corpus into a 768-dimensional vector! We can use Dataset.map() to apply our get_embeddings() function to each row in our corpus, so let’s create a new embeddings column as follows:

embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)

Notice that we’ve converted the embeddings to NumPy arrays; that’s because 🤗 Datasets requires this format when we try to index them with FAISS, which we’ll do next.

Using FAISS for efficient similarity search

Now that we have a dataset of embeddings, we need some way to search over them. To do this, we’ll use a special data structure in 🤗 Datasets called a FAISS index. FAISS (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an index that allows one to find which embeddings are similar to an input embedding. Creating a FAISS index in 🤗 Datasets is simple: we use the Dataset.add_faiss_index() function and specify which column of our dataset we’d like to index:

embeddings_dataset.add_faiss_index(column="embeddings")

We can now perform queries on this index by doing a nearest neighbor lookup with the Dataset.get_nearest_examples() function. Let’s test this out by first embedding a question as follows:

question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

(1, 768)

Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings:

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

The Dataset.get_nearest_examples() function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples (here, the 5 best matches). Let’s collect these in a pandas.DataFrame so we can easily sort them:

import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

Now we can iterate over the first few rows to see how well our query matched the available comments:

for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

"""
COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505046844482422
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
\`\`\`python
datasets = load_dataset("text", data_files=data_files)
\`\`\`

We'll do a new release soon
SCORE: 24.555509567260742
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no internet.

Let me know if you know other ways that can make the offline mode experience better. I'd be happy to add them :)

I already note the "freeze" modules option, to prevent local modules updates. It would be a cool feature.

----------

> @mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?

Indeed `load_dataset` allows to load remote dataset script (squad, glue, etc.) but also you own local ones.
For example if you have a dataset script at `./my_dataset/my_dataset.py` then you can do
\`\`\`python
load_dataset("./my_dataset")
\`\`\`
and the dataset script will generate your dataset once and for all.

----------

About I'm looking into having `csv`, `json`, `text`, `pandas` dataset builders already included in the `datasets` package, so that they are available offline by default, as opposed to the other datasets that require the script to be downloaded.
cf #1724
SCORE: 24.14896583557129
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: > here is my way to load a dataset offline, but it **requires** an online machine
>
> 1. (online machine)
>
> ```
>
> import datasets
>
> data = datasets.load_dataset(...)
>
> data.save_to_disk(/YOUR/DATASET/DIR)
>
> ```
>
> 2. copy the dir from online to the offline machine
>
> 3. (offline machine)
>
> ```
>
> import datasets
>
> data = datasets.load_from_disk(/SAVED/DATA/DIR)
>
> ```
>
>
>
> HTH.
SCORE: 22.893993377685547
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: here is my way to load a dataset offline, but it **requires** an online machine
1. (online machine)
\`\`\`
import datasets
data = datasets.load_dataset(...)
data.save_to_disk(/YOUR/DATASET/DIR)
\`\`\`
2. copy the dir from online to the offline machine
3. (offline machine)
\`\`\`
import datasets
data = datasets.load_from_disk(/SAVED/DATA/DIR)
\`\`\`
HTH.
SCORE: 22.406635284423828
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
"""

Not bad! Our second hit seems to match the query.
