Local Density Estimation in High Dimensions

Xian Wu¹  Moses Charikar¹  Vishnu Natchu²

¹Stanford University, USA  ²Laserlike Inc, USA. Correspondence to: Xian Wu <xwu20@stanford.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

An important question that arises in the study of high dimensional vector representations learned from data is: given a set D of vectors and a query q, estimate the number of points within a specified distance threshold of q. Our algorithm uses locality sensitive hashing to preprocess the data to accurately and efficiently estimate the answers to such questions via an unbiased estimator that uses importance sampling. A key innovation is the ability to maintain a small number of hash tables via preprocessing data structures and algorithms that sample from multiple buckets in each hash table. We give bounds on the space requirements and query complexity of our scheme, and demonstrate the effectiveness of our algorithm by experiments on a standard word embedding dataset.

1. Introduction

In this work, we study a basic question that arises in the study of high dimensional vector representations: given a dataset D of vectors and a query q, estimate the number of points within a specified distance threshold of q. Such density estimates are important building blocks in non-parametric clustering, determining the popularity of topics, search and recommendation systems, the analysis of the neighborhoods of nodes in social networks, and in outlier detection, where geometric representations of data are frequently used. Yet for high dimensional datasets, we still lack simple, practical, experimentally verified and theoretically justified solutions to tackle this question.

Our questions have been studied in the context of spherical range counting. One class of solution methods arising in the computational geometry literature, such as hierarchical splitting via trees (Arya et al., 2010), have performance guarantees that depend exponentially on dimension. These are unsuitable for the higher dimensional models that machine learning methods are increasingly shifting towards, e.g. word embeddings (Pennington et al., 2014; Mikolov et al., 2013) and graph embeddings (Perozzi et al., 2014; Tang et al., 2015; Cao et al., 2015; Grover & Leskovec, 2016; Yang et al., 2016; Wang et al., 2017; Hamilton et al., 2017). Over-parameterized models are oftentimes easier to train (Livni et al., 2014), and perform just as well, if not better (Zhang et al., 2016). Word embeddings are one example where rigorous evaluation has shown increased performance with higher dimensionality (Melamud et al., 2016; Lai et al., 2016).

In this paper, we develop an estimation scheme for high dimensional datasets to count the number of elements around a query that are within a given radius of cosine similarity. Angular distance, which corresponds to Euclidean distance for data points on the unit sphere, is commonly used in applications related to word and document embeddings, and image and video search (Jegou et al., 2011; Huang et al., 2012). Brute force search requires a linear scan over the entire dataset, which is prohibitively expensive. Our approach uses indexing and search via locality sensitive hashing (LSH) functions in order to estimate the size of the neighborhood in a more efficient manner than retrieving the neighbors within the given radius of similarity.

Recent work has also explored LSH techniques for spherical range counting and related questions around density estimation for high-dimensional models. For example, (Aumüller et al., 2017) generalizes nearest neighbor LSH hash functions to be sensitive to custom distance ranges. (Ahle et al., 2017) builds many different parameterized versions of the prototypical LSH hash tables and adaptively probes them for spherical range reporting. The closest works to ours in terms of solution method that we are aware of are those of (Spring & Shrivastava, 2017), giving an LSH based estimator to compute the partition function of a log-linear model, and (Charikar & Siminelakis, 2017), adapting LSH to solve a class of kernel density estimation problems. Both works produce an unbiased estimator, using LSH to implement a biased sampling scheme that lowers the variance of this estimator. However their technique leverages only one hash bucket per table, and hence requires a large number of tables for an accurate estimate. The biggest drawback to these works is the very high storage (hash tables) and query complexities – their techniques, as presented, are impractical for adoption.
Our approach improves upon the storage and sample complexities of previous methods using a combination of extracting information from multiple buckets per table (hence reducing table complexity) and importance sampling (hence reducing sample complexity). As we show in our experimental study on GLOVE embeddings, our estimate of the number of elements that are 60 degrees from a query q (which corresponds to synonyms and/or related words to q in the English vocabulary) achieves multiple orders of magnitude improved accuracy over competing methods, subject to reasonable and practical resource constraints. Our theoretical analysis develops a rigorous understanding of our technique and offers practitioners further insight on optimizing our solution method for their particular datasets.

2. Problem Formulation and Approach

Given a dataset D of vectors $v_1, \ldots, v_n \in \mathbb{R}^d$ on the unit sphere, a query $q \in \mathbb{R}^d$ also on the unit sphere, and a range of angles of interest A, for example 0-60 degrees, how many elements v in D are such that the angle between q and v, denoted $\theta_{qv}$, is within range A? We use $A_q$ to denote the set of data vectors v that are within angle A to q (that have angular distance to query q that is in the range of interest A). Our goal is to preprocess D in order to estimate the cardinality of this set, denoted $|A_q|$, efficiently for any given q.

One final note is that our scheme is conceptualized using bit-wise LSH functions: functions that hash vectors to 0-1 bits, and where the hamming distance between the hash sequences of two data points captures information about their angular distance. For their simplicity, easy implementation, and high performance in practice, bit hashes such as hyperplane LSH (Charikar, 2002) are the standard hash functions used in practice for angular distance (Andoni et al., 2015). Our technique and results can be extended to other hash functions; however, we will use hamming distance and other implementation details specific to bit-wise LSH functions in this work.

2.1. Approach Overview

Our overall estimation scheme is an implementation of importance sampling. It consists of two steps. The first is a preprocessing step that applies locality sensitive hash functions to our dataset to produce hash tables. After this preprocessing step, we sample from our hash tables to produce our final estimate.

To help guide the reader through the technical details of our implementation, we first offer an intuitive explanation of our approach. Our importance sampling scheme achieves 2 main objectives: we concentrate the elements of interest in our overall dataset into a few buckets that we can easily sample from, and we sample from these buckets to produce our estimate. In order to compensate for the concentrated sampling, we adjust the value of each sample by the inverse of the probability that the sample lands in the target buckets.

Our technique relies on the key insight that LSH functions can effectively implement both of these objectives. Using LSH functions to index our dataset ensures that for a given query q, elements that are close to q in angular distance have a comparatively higher probability of hashing to q's bucket and to buckets that are of small hamming distance to q's bucket, thereby concentrating the elements of interest into certain buckets that we can selectively sample from.

Additionally, the hamming distance collision probabilities for bit-wise LSH functions are well expressed in terms of angular distance. Consider random hyperplane LSH (Charikar, 2002), where each hash vector is chosen uniformly at random from the d-dimensional unit sphere. Each hash vector r contributes one bit to the hash sequence of a data point v, based on the rule:

$$h_r(v) = \begin{cases} 0 & \text{if } r \cdot v \le 0 \\ 1 & \text{otherwise.} \end{cases}$$

It is well-known that for any particular hamming distance i and any data point x,

$$P(d_{qx} = i \mid \theta_{qx}) = \binom{t}{i}\left(1 - \frac{\theta_{qx}}{\pi}\right)^{t-i}\left(\frac{\theta_{qx}}{\pi}\right)^{i},$$

where $d_{qx}$ is the hamming distance between the hash for query q and the hash for data vector x, $\theta_{qx}$ denotes the angle between the 2 vectors, and t is the total number of bits in the hash sequence.
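To make this concrete, here is a minimal sketch (NumPy assumed; the helper names are ours, not the paper's algorithms) of the hyperplane hash bit rule and of the derived quantity $p(x) = P(d_{qx} \in I \mid \theta_{qx})$ that the scheme relies on.

```python
import numpy as np
from math import comb

def hyperplane_hash(R, v):
    """t-bit hash of a unit vector v under t random hyperplanes R (t x d)."""
    return (R @ v > 0).astype(np.uint8)   # bit is 0 if r.v <= 0, else 1

def collision_pmf(theta, t):
    """P(d_qx = i | theta_qx = theta) for i = 0..t under hyperplane LSH."""
    b = theta / np.pi                     # per-bit disagreement probability
    return np.array([comb(t, i) * (1 - b) ** (t - i) * b ** i
                     for i in range(t + 1)])

def p_of_x(theta, t, I):
    """p(x) = P(d_qx in I | theta_qx): probability mass on the distances in I."""
    return collision_pmf(theta, t)[list(I)].sum()
```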
Thus, the choice of t affects the sensitivity of the LSH scheme – the correlation between the hamming distances of two hash sequences and the angle between the two underlying data points. Moreover, depending on the design choice for t, the set of hamming distances I that contains most of the probability mass for collision with elements of angular distance in range A is different. This is also a consideration in our sampling scheme; we want to sample from buckets of hamming distances I that have a high probability of containing elements that are within angle A of q.

Our sampling scheme picks elements over K hash tables from buckets that are at hamming distance I to the query, where I is tuned to A. Given a sample, x, we compute the angular distance $\theta_{qx} = \cos^{-1}(q \cdot x)$. Let $p(x) = P(d_{qx} \in I \mid \theta_{qx})$, the collision probability that x lands in a bucket that is hamming distance I from q over the random choice of hash functions.
We define a random variable Z as a function of sample x as follows:

$$Z = \begin{cases} \dfrac{\sum_{k=1}^{K} C_q^k(I)}{K \cdot p(x)} & \text{if } \theta_{qx} \in A \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $C_q^k(I)$ is the total number of elements in buckets of hamming distance I from q's bucket in table k.

We take S samples and construct $Z_1, Z_2, \ldots, Z_S$. We report $\frac{\sum_{i=1}^{S} Z_i}{S}$ as our estimate for $|A_q|$.
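As an illustrative sketch of this scoring rule (our own naming; it reuses p_of_x from the earlier snippet and assumes the per-table totals $C_q^k(I)$ have already been computed during preprocessing):

```python
import numpy as np

def estimate_neighborhood_size(samples, q, theta_max, C_qk_I, t, I):
    """Average of the importance-sampling scores Z_i (Equation 1).
    samples: unit vectors drawn from buckets at hamming distances I across
    the K tables; C_qk_I: list with C_q^k(I) for each of the K tables;
    the range A is taken to be [0, theta_max] radians."""
    total = float(sum(C_qk_I))            # sum_k C_q^k(I)
    K = len(C_qk_I)
    scores = []
    for x in samples:
        theta = np.arccos(np.clip(np.dot(q, x), -1.0, 1.0))
        score = total / (K * p_of_x(theta, t, I)) if theta <= theta_max else 0.0
        scores.append(score)
    return float(np.mean(scores))         # (1/S) * sum_i Z_i
```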
Comparison to Related Work: Note that our problem can be viewed as a kernel density estimation problem for a specific kernel function that has value 1 for pairs of points within the required angle range of interest and 0 outside. However, the analysis of (Charikar & Siminelakis, 2017) does not apply to our setting because they need a scale free hash function (with collision probabilities related to the kernel value) and there is no such function for our 0-1 kernel. The work of (Spring & Shrivastava, 2017) does not make such an assumption on the hash function, but they do not give an analysis that gives meaningful bounds in our setting. As noted previously, both works only look at a single hash bucket in each hash table, leading to a high storage overhead.

2.2. Main Result

We establish the following theoretical bounds on the storage and sample complexity of our estimator in order to achieve a $(1 \pm \varepsilon)$-approximation to the true count with high probability.

Theorem 2.1 (Main Result). For a given angular distance range of interest A and a given query q, with probability $1 - \delta$, our estimator returns a $(1 \pm \varepsilon)$-approximation to $|A_q|$, the true number of elements within angle A to q, using $O\!\left(\frac{1}{\varepsilon^2 \min_{x \in A_q} p(x)} \log\frac{1}{\delta}\right)$ tables and $O\!\left(\frac{E(C_q(I))}{\varepsilon^2 |A_q| \cdot \min_{x \in A_q} p(x)} \log\frac{1}{\delta}\right)$ samples.

To help the reader digest this result, we briefly compare this statement to the sample complexity of naive random sampling. It can be shown through a standard Bernoulli-Chernoff argument that the sample complexity for random sampling is $O\!\left(\frac{n}{|A_q|\varepsilon^2} \ln\frac{1}{\delta}\right)$, where $\frac{n}{|A_q|}$ is the inverse proportion of elements of interest in the overall population. Intuitively this says that you need to take more random samples if $|A_q|$ is very small compared to n.

Our sample complexity replaces the $\frac{n}{|A_q|}$ term with $\frac{E(C_q(I))}{|A_q| \cdot \min_{x \in A_q} p(x)}$, where $|A_q| \cdot \min_{x \in A_q} p(x)$ is a measure of the expected number of elements from the set of interest $A_q$ that will land in hamming distance I to q, and $E(C_q(I))$ is the expected size of the overall sampling pool of elements in hamming distance I. This ratio of expectations seems intuitive – one would expect to get such an expression if our scheme took one sample per table. Surprisingly, we achieve this same type of sample complexity bound while sampling from relatively few hash tables.

Just like random sampling, our sample complexity bound is also based on the proportion of elements of interest in hamming distance I to the total number of elements in hamming distance I. However, it is easy to see that applying LSH to our dataset will increase this proportion to yield a smaller sample complexity. We choose I so that $\min_{x \in A_q} p(x)$ is high (this probability can be high even for a small set of hamming distances I, since p(x) is the cumulative probability mass of I successes in t trials, and binomial distributions in t concentrate in an $O(\sqrt{t})$ sized interval around the mean), and $E(C_q(I))$ to be small (to filter out elements that are not interesting).

There are certain tradeoffs to choosing I. If more hamming distances are included in I, then $\min_{x \in A_q} p(x)$ is higher; however, $E(C_q(I))$ is also larger. The optimal choice for I is to choose the hamming distances that substantially increase $\min_{x \in A_q} p(x)$ yet do not substantially increase $E(C_q(I))$ (so not too many uninteresting elements are infiltrating those buckets).
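To illustrate one simple way of navigating this tradeoff (a sketch under our own conventions, not the paper's tuning procedure), one can take I to be the smallest prefix $\{0, \ldots, D\}$ whose collision mass at the boundary angle of A reaches a target, reusing collision_pmf from the earlier snippet:

```python
import numpy as np

def choose_I(theta_boundary, t, target_mass=0.9):
    """Smallest prefix I = {0,...,D} such that p(x) >= target_mass for every
    x with theta_qx <= theta_boundary. The minimum of p(x) over the angle
    range is attained at theta_boundary, since the binomial collision
    distribution shifts toward larger hamming distances as the angle grows."""
    cdf = np.cumsum(collision_pmf(theta_boundary, t))
    D = int(np.searchsorted(cdf, target_mass))
    return set(range(D + 1))
```

Larger target masses enlarge I, raising $\min_{x \in A_q} p(x)$ at the cost of a larger expected pool $E(C_q(I))$, mirroring the tradeoff described above.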
In the following sections, we explain our scheme further and present our experimental results.

3. Preprocessing

The preprocessing step contributes 3 key ingredients to the overall estimation scheme:

Hash Tables: Given a family of bit-wise hash functions H, define a function family $G = \{g : D \to \{0,1\}^t\}$ such that $g(v) = (h_1(v), \ldots, h_t(v))$, where $h_j \in H$. To construct K tables, we choose K functions $g_1, g_2, \ldots, g_K$ from G independently and uniformly at random. We store each $v \in D$ in bucket $g_k(v)$ for $k = 1, 2, \ldots, K$. This step sets up the hash tables that we will sample from in our scheme.

Counts Vector: We create a counts vector, denoted $C_{i_k} \in \mathbb{R}^{t+1}$, for each hash address $i_k$ for each table $k \in \{1, \ldots, K\}$, where $C_{i_k}(d)$ is the count of the total number of items in buckets that are at hamming distance $d = 0, 1, \ldots, t$ away from $i_k$ in table k.

Sampler: We create a sampler that, given a separate hash address $i_k$ for each table $k \in \{1, \ldots, K\}$ and a set of hamming distances I, returns a data point uniformly at random from the union of elements that were hashed to buckets of hamming distance I from $i_k$ across the K tables.
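A minimal sketch of the first ingredient under hyperplane LSH (NumPy assumed; the integer bucket encoding and helper names are our own):

```python
import numpy as np
from collections import defaultdict

def build_tables(data, K, t, seed=0):
    """Hash every vector into K independent t-bit hyperplane-LSH tables.
    Returns the K hyperplane matrices (the g_k) and K dicts mapping an
    integer bucket address in {0,...,2^t - 1} to the point indices stored there."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    powers = 1 << np.arange(t, dtype=np.int64)
    hyperplanes, tables = [], []
    for _ in range(K):
        R = rng.standard_normal((t, d))            # t random hyperplanes
        bits = (data @ R.T > 0).astype(np.int64)   # n x t bit matrix
        addrs = bits @ powers                      # t-bit integer address per point
        table = defaultdict(list)
        for idx, a in enumerate(addrs):
            table[int(a)].append(idx)              # store v in bucket g_k(v)
        hyperplanes.append(R)
        tables.append(table)
    return hyperplanes, tables
```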
We describe in greater detail the 3 contributions of the preprocessing step. For the rest of this paper, all omitted proofs appear in Appendix C.

3.1. Hash Tables

Setting up quality hash tables to enable accurate and efficient importance sampling is vital to our scheme. Since we are importance sampling from buckets of hamming distance I across K tables, we need to make enough tables to guarantee unbiasedness or near-unbiasedness for our sampling-based estimator; due to the variance of the randomly generated hash functions, if we make too few tables we may not find enough elements of interest contained in those tables within hamming distance I. We want to characterize the bias of our importance sampling scheme in relation to the contents of the buckets of our hash tables.

We let $B_q^k(I)$ denote the set of hash buckets that are at hamming distance I from the hash address of query q for table k. Next, we introduce an intermediate random variable:

$$W = \frac{1}{K} \sum_{k=1}^{K} \sum_{x \in A_q} \frac{\mathbf{1}(x \in B_q^k(I))}{p(x)},$$

where $p(x) = P(d_{qx} \in I \mid \theta_{qx})$.

W is a random variable that represents the sum of the elements of interest $|A_q|$ that are hashed to the buckets of sampling focus $B_q^k(I)$, weighted by their probabilities p(x). It is clear that once the set of hash functions is fixed, W becomes deterministic.

We first show that the random variable Z, as defined in Equation (1), is an unbiased estimator.

Lemma 3.1 (Expectation of Z). The expectation of Z over the random choice of hash functions is $|A_q|$, i.e. $E(Z) = |A_q|$. The expectation of Z given a specific realization of hash functions, or equivalently, given W, is $E(Z \mid W) = W$.

As a consequence, it is immediately clear that $E(W) = |A_q|$. It is important to understand the implications of this lemma. In particular, the expression for $E(Z \mid W)$ says that in a specific realization of a choice of hash functions (or a set of tables), the estimator Z is biased if $W \neq |A_q|$. Therefore K is essential for helping concentrate the realized value of W around its mean.

Since in expectation our estimator Z gives W, we want to understand how many tables K are required to ensure that W concentrates around its mean, $|A_q|$. This is related to the variance of W.

We also introduce a new quantity $p(x,y) = P(d_{qx} \in I \cap d_{qy} \in I \mid \theta_{qx}, \theta_{qy})$, the collision probability that x and y both land in buckets that are hamming distance I from q over the random choice of hash functions.

Lemma 3.2 (Variance of W). $\sigma^2(W) = \frac{1}{K} \sum_{x,y \in A_q} \left(\frac{p(x,y)}{p(x)p(y)} - 1\right)$

We want to put these pieces together to make a statement about the number of tables K we should create to guarantee low inherent bias in our estimator. We use Chebyshev's Inequality to bound W's deviation from its mean as a function of K with a constant failure probability 1/8. For simplicity, we fix a constant failure probability that we will boost later by averaging over several sets of estimators. This analysis is without loss of generality, as the bounds can be adjusted for any desired failure probability δ. We will use this piece again when we analyze our overall estimator.

Lemma 3.3 (Bound on Number of Tables). It suffices to make $K \ge \frac{8}{\varepsilon^2 \min_{x \in A_q} p(x)}$ tables to guarantee that W is within ε of $|A_q|$ (relatively) with probability 7/8.

Proof. Chebyshev's inequality states: $P(|W - |A_q|| \ge \varepsilon |A_q|) \le \frac{\sigma^2(W)}{\varepsilon^2 |A_q|^2}$.

Therefore, to achieve a constant failure probability δ = 1/8, it suffices to create enough tables so that

$$\sigma^2(W) = \frac{1}{K} \sum_{x,y \in A_q} \left(\frac{p(x,y)}{p(x)p(y)} - 1\right) \le \frac{\varepsilon^2 |A_q|^2}{8}.$$

Hence K needs to be large enough so that:

$$K \ge \frac{8 \sum_{x,y \in A_q} \left(\frac{p(x,y)}{p(x)p(y)} - 1\right)}{\varepsilon^2 |A_q|^2}.$$

Since $p(x,y) \le \min\{p(x), p(y)\}$, we see that it is sufficient for K to satisfy

$$K \ge \frac{8 |A_q|^2 \left(\frac{1}{\min_{x \in A_q} p(x)} - 1\right)}{\varepsilon^2 |A_q|^2}.$$

Therefore we conclude with the following bound on K:

$$K \ge \frac{8}{\varepsilon^2 \min_{x \in A_q} p(x)}. \qquad (2)$$

We emphasize that the joint probability bound $p(x,y) \le \min\{p(x), p(y)\}$ is a very loose worst-case bound assuming high correlation between data points. The final bound for K, Equation (2), is also a worst-case bound in the sense that it is possible that a very minuscule fraction of $x \in A_q$ have small values for p(x). In the experimental section of the paper, we do an empirical analysis of the inherent bias for different values of K and demonstrate that for real datasets the number of tables needed can be far fewer than what is theoretically required in the worst case scenario.
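For a sense of scale (illustrative numbers of our own choosing, not taken from the paper), the worst-case bound of Equation (2) is demanding: for a relative error of $\varepsilon = 0.25$ and $\min_{x \in A_q} p(x) = 0.5$,

$$K \ge \frac{8}{\varepsilon^2 \min_{x \in A_q} p(x)} = \frac{8}{0.25^2 \cdot 0.5} = 256,$$

whereas the experiments in Section 5 reach roughly 5% bias with only 20 tables.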
3.2. Counts Vector

Query q maps to a bucket $i_k$ for each table $k = 1, 2, \ldots, K$. The preprocessing step produces an average counts vector corresponding to bucket $i_k$, denoted $C_q^k$, where $C_q^k(i)$ is the count of the total number of items in buckets that are at hamming distance $i = 0, 1, \ldots, t$ away from the hash address for q in table k. For the hamming distances of interest I, we let $C_q^k(I) = \sum_{d \in I} C_q^k(d)$.

$C_q^k(I)$ is an integral part of our weighted importance sampling scheme. In Appendix A, we show how to compute these vectors efficiently.

Theorem 3.1 (Aggregate-Counts). Given a set of K hash tables, each with $2^t$ hash buckets with addresses in $\{0,1\}^t$, Aggregate-Counts (Algorithm 1) computes, for each hash address i, the number of elements in buckets that are hamming distance $0, 1, \ldots, t$ away from i, in each of the K tables, in time $O(K t^2 2^t)$.

Note that the t in our hashing scheme is the length of the hash sequence; as a general rule of thumb, for bit-wise hash functions, implementers choose $t \approx \log(n)$, so as to average out to one element per hash bucket. Therefore, the preprocessing runtime of a reasonable hashing implementation for Aggregate-Counts (Algorithm 1) is approximately $O(nK \log^2(n))$.

The key benefit of Aggregate-Counts is that it computes via a message-passing or dynamic programming strategy that is much more efficient than a naive brute-force approach that would take time $O(K 2^{2t})$, or $O(Kn^2)$ if $t \approx \log(n)$.
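The paper's Algorithm 1 appears in Appendix A and is not reproduced here; the following is only a sketch, with our own naming, of one dynamic program that matches the per-table $O(t^2 2^t)$ pattern of Theorem 3.1: fold in one bit position at a time, where flipping that bit costs exactly one mismatch.

```python
import numpy as np

def aggregate_counts(table, t):
    """C[a, d] = number of stored points whose bucket is at hamming distance d
    from address a, for every address a in {0,...,2^t - 1}.
    `table` maps integer bucket addresses to lists of point indices
    (as produced by the build_tables sketch above)."""
    C = np.zeros((1 << t, t + 1), dtype=np.int64)
    for a, members in table.items():
        C[a, 0] = len(members)                      # distance 0: the bucket itself
    for j in range(t):                              # incorporate bit position j
        partner = C[np.arange(1 << t) ^ (1 << j)]   # rows of the bit-j neighbors
        Cnew = C.copy()
        Cnew[:, 1:] += partner[:, :-1]              # differing in bit j adds one mismatch
        C = Cnew
    return C
```

The counts vector $C_q^k$ for a query is then simply the row of this matrix indexed by $g_k(q)$; at t = 20 the matrix is large ($2^t \times (t+1)$ integers), so the sketch is meant to illustrate the recurrence rather than to be a tuned implementation.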
3.3. Sampler

We create a sampler that, given a hash address $i_k$ for each table, and a set of hamming distances I that we want to sample from, generates a sample uniformly at random from the union of elements that were hashed to hamming distance I across the K tables. For an implementation and analysis, please consult Appendix B.

Theorem 3.2 (Sampler). Given a set of K hash tables, each with $2^t$ hash buckets with addresses in $\{0,1\}^t$, a sampling scheme consisting of a data structure and a sampling algorithm can generate a sample uniformly at random from any fixed hash table k, an element at hamming distance d to hash address i. The data structure is a counts matrix that can be precomputed in preprocessing time $O(K t^3 2^t)$, and the sampling algorithm Hamming-Distance-Sampler (Algorithm 2) generates a sample in time O(t).
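The O(t)-per-sample Hamming-Distance-Sampler itself is deferred to Appendix B; as a simpler stand-in with the same output distribution (our own construction, without the O(t) guarantee), one can enumerate the candidate buckets, pick one with probability proportional to its size, and then pick a member uniformly:

```python
import random
from itertools import combinations

def sample_from_hamming_distances(tables, q_addrs, I, t, rng=random):
    """Return one point index drawn uniformly from the multiset union of all
    buckets at hamming distance in I from q's address, across the K tables."""
    buckets, weights = [], []
    for table, a_q in zip(tables, q_addrs):
        for d in I:
            for flip in combinations(range(t), d):   # all addresses at distance d
                a = a_q
                for b in flip:
                    a ^= 1 << b
                members = table.get(a)
                if members:
                    buckets.append(members)
                    weights.append(len(members))
    chosen = rng.choices(buckets, weights=weights, k=1)[0]
    return rng.choice(chosen)
```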

Again, if we follow $t \approx \log(n)$, the preprocessing time comes out to roughly $O(nK \log^3(n))$. Also we expect the O(t) online sample generation cost to be negligible compared to, say, the inner product computation cost for $q \cdot x$, which our method and all competing methods use. We describe the importance sampling scheme in the next section.

4. Sampling

We now analyze our sampling algorithm. Recall that our sampling scheme works in the following way. Given query q, we generate the hash for q in each of our K tables, by solving for $i_k = g_k(q)$ for $k = 1, \ldots, K$. Given the hash for q in each of our K tables and the set of hamming distances I that we want to sample from, we invoke our sampler to generate a sample from across the K tables.

Given this sample, x, we compute the angular distance $\theta_{qx} = \cos^{-1}(q \cdot x)$. Let $p(x) = P(d_{qx} \in I \mid \theta_{qx})$, the collision probability that x lands in a bucket that is hamming distance I from q over the random choice of hash functions; p(x) is an endogenous property of an LSH function.

We score each sample as in Equation (1). We take S samples and construct $Z_1, Z_2, \ldots, Z_S$. We report $\frac{\sum_{i=1}^{S} Z_i}{S}$ as our estimate for $|A_q|$.

As an immediate consequence of Lemma 3.1, it is clear that

$$E\!\left[\frac{\sum_{i=1}^{S} Z_i}{S}\right] = |A_q|.$$

Now we analyze the variance of our estimator:

Lemma 4.1 (Variance of Estimator).

$$E\!\left[\left(\frac{\sum_{i=1}^{S} Z_i}{S} - |A_q|\right)^2\right] \le \frac{E[Z^2]}{S} + \sigma^2(W)$$

This decomposition of the variance into the two terms indicates that the variance is coming from two sources. The first source is the variance of the samples, $\frac{E[Z^2]}{S}$. If we don't take enough samples, we do not get a good estimate. The second source is the variance from the random variable W, $\sigma^2(W)$, which corresponds to the contents in the tables. As we have shown, it is crucial to create enough tables so that W is concentrated around its expectation, $|A_q|$. Therefore, this second source of variance of the overall estimator comes from the variance of the hash functions that underlie table creation and composition.

The $\sigma^2(W)$ term has already been analyzed in Section 3.1, see Lemma 3.2. Now we analyze the second moment of Z.

Lemma 4.2 (Variance of Z).

$$E[Z^2] = \sum_{x \in A_q} \sum_{y \in D} \left[\frac{p(x,y)}{K \cdot p(x)^2} + \left(1 - \frac{1}{K}\right)\frac{p(y)}{p(x)}\right]$$
Now that we have all the components, we are ready to put together the final sample and storage complexities for our estimator. We want a final estimate that concentrates with at most ε error around its mean, $|A_q|$, with probability $1 - \delta$. To do this, we make several sets $1, 2, \ldots, M$ of our estimator (one estimator consists of a set of K tables and S samples). We choose K and S so that the failure probability of our estimator is a constant, say 1/4. Each estimator produces an estimate, call it $E_m$, for $m \in \{1, \ldots, M\}$. We report our final estimate as the median of these estimates. This is the classic Median-of-Means technique.
classic Median-of-Means technique. S p(x)
x∈Aq y∈D

Let Fm be the indicator variable indicating if the esti- E[Z 2 ] ε2 |Aq |2


mator Em fails to concentrate. Clearly E(Fm ) ≤ 14 . In order to guarantee S ≤ 8 ,
we need:
PM M
Moreover, E(F = m=1 Fm ) ≤ 4 . The probability
h i
p(y)
(1 + ε2 ) p(x)
P P
that the median estimate is bad, P(median of Em fails) ≤ x∈Aq y∈D
P(half of Em fails) = P(F ≥ M S≥
2 ). By a simple Chernoff ε2 |Aq |2
M −M
bound, we see that: P(F ≥ M ) ≤ e−(2 ln 2−1) 4 ≤ e 11 . P 1
P
2
2 x∈Aq p(x) y∈D p(y)
So to satisfy a desired failure probability δ, it suffices to = (1 + ε ) 2 2
−M ε |Aq |
have e 11 ≤ δ, therefore M ∈ O(log( 1δ )). P 1
1 x∈Aq p(x) E(Cq (I))
In the rest of the section, we establish bounds on K and S = (1 + )
ε2 |Aq |2
so that one estimator fails with probability at most 14 . We
appeal again to Chebyshev’s Inequality: Therefore, we conclude that
 
PS
Z
!
PS
σ 2 ( i=1 i
) E(C (I))
i=1 Zi
q
P − |Aq | ≥ ε|Aq | ≤ S S ∈ O 2 
S ε2 |Aq |2 ε |Aq | · min p(x)
x∈Aq

is sufficient.
In Lemma 4.1, P we analyze the variance of our estimator, and
S 2
2 Zi ] Putting together Lemmas 3.3 and 4.1 with the median of
show that σ ( i=1 S ) ≤ E[Z
S + σ 2 (W ). Therefore, in
order so that the failure probability is less than 14 , it suffices means strategy yields our main result, Theorem 2.1.
PS
Zi ε2 |Aq |2
to have σ 2 ( i=1
S )≤ 4 , which can be obtained by In the rest of this paper we discuss the results of our experi-
E[Z 2 ] ε2 |Aq |2 2 ε2 |Aq |2 ments on real datasets.
letting S ≤ 8 and σ (W ) ≤ 8 .
Focusing on the σ 2 (W ) term, which depends on the number
5. Experiments
of tables K created, we show in Lemma 3.3 from Section
8
3.1 that it suffices to take K ≥ ε2 min p(x) . We describe our experiments using the GLOVE dataset. We
x∈Aq
use the set of 400,000 pre-trained 50-dimensional word
Now that we have our table complexity, we can analyze our embedding vectors trained from Wikipedia 2014 + Giga-
2
] word 5, provided by (Pennington et al., 2014). We nor-
sampling complexity S to bound E[Z
S .
malize the embeddings, as is standard in many word em-
8
Lemma 4.1. Suppose K ≥ ε2 min p(x) . Then S ∈ bedding applications (Sugawara et al., 2016) We choose 3
x∈Aq
! query words with different neighborhood profiles: “venice”,
E(Cq (I)) E[Z 2 ] ε2 |Aq |2 “cake”, “book”. Venice has the smallest neighborhood, with
O ε2 |Aq |· min p(x) suffices to achieve S ≤ 8 .
x∈Aq 206 elements with angular distance less than 60 degrees,
cake has a medium sized neighborhood with about 698
elements, book has the largest neighborhood with 1275 el-
ements. The histogram for these 3 queries are shown in
Proof. By Lemma 4.2 we have: Figure 1.
We also choose our angle range of interest, A, to be 0-
E[Z 2 ]
   
1 X X p(x, y) 1 p(y) 60 degrees. A search through our dataset gave “florence”,
= + 1−
S S K · p(x)2 K p(x) “cannes”, “rome” as representative elements that are 40-50
x∈Aq y∈D
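For concreteness, a small sketch of this setup (the file name and helper names are illustrative, and the brute-force count is only the ground truth against which the estimator is compared):

```python
import numpy as np

def load_normalized_glove(path="glove.6B.50d.txt"):
    """Load GloVe vectors from a whitespace-separated text file and
    L2-normalize them onto the unit sphere."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    X = np.vstack(vecs)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return words, X

def true_neighborhood_size(X, q, max_degrees=60.0):
    """Brute-force ground truth |A_q|: points within max_degrees of q."""
    cosines = X @ q                     # unit vectors, so dot product = cosine
    return int((cosines >= np.cos(np.radians(max_degrees))).sum())
```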
Figure 1: The statistical profiles of angular distance between our 3 queries and the rest of the dataset D. "venice" has the smallest neighborhood with only about 206 points in the 0-60 range, followed by "cake" with 698, followed by "book" with 1275. We plot the number of elements at different intervals of cosine similarity from the 3 queries. Notice that most (more than 370,000 out of the total 400,000) embeddings in the dataset fall at about 60-120 degrees to the query. The leftmost blue bins that represent the number of elements between 0-60 degrees to our 3 queries are barely noticeable.
As Table 1 illustrates, the biggest challenge for this estimation problem is the fact that the count of the number of elements within 0-60 degrees is dwarfed by the number of elements 60-120 degrees away from the queries. This issue makes locality sensitive techniques necessary for efficient search and retrieval in high dimensions.

Table 1: Statistics of Queries

Query     # within 60 degrees    % of population
venice    206                    .0515
cake      698                    .1745
book      1275                   .31875

As we have previously mentioned in Section 3.1, the number of tables K theoretically required for (near) unbiased estimation relies on a worst-case variance bound; real-world data do not necessarily exhibit worst-case behavior. In our studies of our 3 queries (see Figure 2), the inherent bias of our estimator decreases as we increase the sampling hamming threshold. This is as expected: using a larger range of hamming distances helps concentrate the count of the elements of interest $A_q$ that fall into the specified range of hamming distances around the mean, which means that a smaller K is required to achieve small bias.

Moreover, the empirical bias of our estimator at hamming threshold 5 is around 5% for 20 hash tables, with very little improvement with 40 hash tables. This is consistent with our 3 queries. With this in mind, we compare our estimator against the benchmark estimator introduced by (Spring & Shrivastava, 2017). Though their work originally intended to solve a different problem, their technique can solve our problem by adapting the weight function appropriately. The key differences between their work and ours are that they only probe the 0 hamming distance bucket in each table, similar to the classic LSH literature, and instead of sampling, they simply enumerate the elements in the hamming distance 0 bucket for each table. For higher values of K, which our experiments demonstrate that their estimator needs in order to get good results, enumeration might not be so efficient.

In Figure 3, we compare (Spring & Shrivastava, 2017)'s technique of enumerating and importance-weighting hamming distance 0 elements to our technique of importance sampling from different hamming thresholds. Our experiments use random hyperplane LSH and we report relative error averaged over 25 trials, where in each trial we generate a new set of K tables. Panel (b) experiments with (Spring & Shrivastava, 2017)'s technique for the 3 queries, with different choices of K, the number of tables. Our results show that even for K = 40 tables, the relative error of their technique can still be higher than 50%, particularly for queries with small neighborhoods such as "venice". For "venice" the increase in table allocation from 20 to 40 made a very small difference to the overall estimation error. "book" and "cake" fared better at 40 tables; however, the error was still around 25%, while our estimator (panel a) estimated to within about 10% error using only 20 tables.

Panel (a) of Figure 3 shows that utilizing any hamming threshold greater than 0 gives superior estimation performance to staying only within the 0 hamming distance bucket. In this experiment, we fix our sampling budget to 1000 samples and the table budget to 20 tables. The hamming distance 0 error reported in this figure uses enumeration; all other hamming thresholds use the 1000 sampling budget. In our experiments for the 3 queries, one can expect about 80 points in total in the hamming distance 0 buckets across 20 tables. In this experiment, our technique uses 1000 samples vs 80 points; however, this (somewhat negligible in today's computing infrastructure) sample complexity trades off against a large improvement in precision, as well as a much lower storage cost in the number of tables K.

Finally, we note that panel (a) of Figure 3 shows the smallest error for "venice" at hamming threshold 3.
This is related to the characteristics of this query and the sampling budget. We see in this example that for "venice", which is a fairly isolated data point compared to the other 2 queries, going to further hamming distances actually hurts the quality of the estimate because we actually dilute the proportion of interesting elements. Using higher thresholds typically requires more samples, as shown in Figure 4. However, higher thresholds typically lower the inherent bias in the importance sampling scheme, as demonstrated in Figure 2. Implementers should consider this tradeoff in their algorithmic design choices.

Figure 2 ((a) venice, (b) cake, (c) book): The empirical bias for different values of K for queries "venice", "cake" and "book". For each hamming threshold, the relative bias is averaged over 50 sets of K tables, using random hyperplane hash as the LSH function. The bias decreases as the hamming threshold increases, and the bias decreases with more hash tables, keeping the hamming threshold fixed. The bias does not drop significantly from 20 to 40 hash tables. At 20 hash tables, the bias at hamming threshold 5 is around 5%. This demonstrates that for real datasets the number of tables needed can be far fewer than what is theoretically required in the worst case scenario.

Figure 3 ((a) Estimation from different thresholds, fixing 20 tables and 1000 samples; (b) Estimation using different numbers of tables, fixing I = 0): Comparison of our estimator against the benchmark LSH estimator adapted from ideas introduced in (Spring & Shrivastava, 2017). In our experiments, panel (b), even after 40 hash tables, the error remained above 20%, and above 50% for queries with very small neighborhoods, such as "venice". In contrast, panel (a) illustrates the relative accuracy for our estimator. We fix 20 hash tables and take 1000 samples from different hamming thresholds. Note that for queries like "venice", which has a very small neighborhood, taking 1000 samples at hamming threshold 5 performed worse than at hamming threshold 3; this is likely because 1000 samples was too few for hamming threshold 5 – the ratio of total elements to elements of interest at hamming threshold 5 is high.

Figure 4: The inverse proportion of relative elements of interest in the overall subsampling pool for various hamming thresholds, averaged over 50 trials of sets of 20 hash tables. This figure shows that using higher thresholds typically requires more samples. However, higher thresholds typically have less inherent bias in the importance sampling scheme, as demonstrated in Figure 2. Implementers should consider this tradeoff in their algorithmic design choices.

6. Discussion

Given the case study of our estimator achieving the smallest estimation error for "venice" at hamming threshold 3, whereas for the more popular queries "cake" and "book" performance improves steadily at higher hamming thresholds, it would be interesting, from the practitioner's point of view, to understand what is the best hamming threshold to sample from, and, given a hamming threshold, how many samples should be taken for a quality estimate. The optimal sample complexity is data-dependent, and cannot be known without a sense of $|A_q|$, the very quantity we aim to estimate. But instead of fixing the sample complexity up-front, is there a way we can iteratively, in an on-line fashion, determine whether we should keep sampling or stop, based on a current belief of $|A_q|$?
Acknowledgements

This work was initiated while the authors were visiting Laserlike, Inc. Xian Wu was supported by a Harold Thomas Hahn Jr. Fellowship from the Department of Management Science and Engineering at Stanford University. Moses Charikar was supported by NSF grant CCF-1617577 and a Simons Investigator Award.

References

Ahle, T. D., Aumüller, M., and Pagh, R. Parameter-free locality sensitive hashing for spherical range reporting. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 239–256. Society for Industrial and Applied Mathematics, 2017.

Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., and Schmidt, L. Practical and optimal LSH for angular distance. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 1225–1233, 2015.

Arya, S., da Fonseca, G. D., and Mount, D. M. A unified approach to approximate proximity searching. European Symposium on Algorithms, 2010.

Aumüller, M., Christiani, T., Pagh, R., and Silvestri, F. Distance-sensitive hashing. 2017. URL https://arxiv.org/abs/1703.07867.

Cao, S., Lu, W., and Xu, Q. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 891–900. ACM, 2015.

Charikar, M. and Siminelakis, P. Hashing-based-estimators for kernel density in high dimensions. IEEE Symposium on Foundations of Computer Science, 2017.

Charikar, M. S. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, pp. 380–388, New York, NY, USA, 2002. ACM. URL http://doi.acm.org/10.1145/509907.509965.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. ACM, 2016.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035, 2017.

Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pp. 873–882, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, Jan 2011.

Lai, S., Liu, K., He, S., and Zhao, J. How to generate a good word embedding. IEEE Intelligent Systems, 31, 2016.

Livni, R., Shalev-Shwartz, S., and Shamir, O. On the computational efficiency of training neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 855–863. Curran Associates, Inc., 2014.

Melamud, O., McClosky, D., Patwardhan, S., and Bansal, M. The role of context types and dimensionality in learning word embeddings. 2016. URL https://arxiv.org/abs/1601.00893.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. 2013. URL https://arxiv.org/abs/1301.3781.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.

Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. ACM, 2014.

Spring, R. and Shrivastava, A. A new unbiased and efficient class of LSH-based samplers and estimators for partition function computation in log-linear models. 2017. URL https://arxiv.org/abs/1703.05160.

Sugawara, K., Kobayashi, H., and Iwasaki, M. On approximately searching for similar word embeddings. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.

Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. International World Wide Web Conferences Steering Committee, 2015.

Wang, X., Cui, P., Wang, J., Pei, J., Zhu, W., and Yang, S. Community preserving network embedding. In AAAI, pp. 203–209, 2017.
Yang, Z., Cohen, W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, pp. 40–48, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. 2016. URL https://arxiv.org/abs/1611.03530.
