Local Density Estimation in High Dimensions
Xian Wu, Moses Charikar, Vishnu Natchu
adoption. Our approach improves upon the storage and sample complexities of previous methods using a combination of extracting information from multiple buckets per table (hence reducing table complexity) and importance sampling (hence reducing sample complexity). As we show in our experimental study on GLOVE embeddings, our estimate of the number of elements that are 60 degrees from a query q (which corresponds to synonyms and/or related words to q in the English vocabulary) achieves multiple orders of magnitude improved accuracy over competing methods, subject to reasonable and practical resource constraints. Our theoretical analysis develops a rigorous understanding of our technique and offers practitioners further insight on optimizing our solution method for their particular datasets.

2. Problem Formulation and Approach

Given a dataset D of vectors v_1, ..., v_n ∈ R^d on the unit sphere, a query q ∈ R^d also on the unit sphere, and a range of angles of interest A, for example 0-60 degrees, how many elements v in D are such that the angle between q and v, denoted θ_qv, is within range A? We use A_q to denote the set of data vectors v that are within angle A of q (that have angular distance to the query q in the range of interest A). Our goal is to preprocess D in order to estimate the cardinality of this set, denoted |A_q|, efficiently for any given q.
One final note is that our scheme is conceptualized using bit-wise LSH functions: functions that hash vectors to 0-1 bits, and where the hamming distance between the hash sequences of two data points captures information about their angular distance. For their simplicity, easy implementation, and high performance in practice, bit hashes such as hyperplane LSH (Charikar, 2002) are the standard hash functions used in practice for angular distance (Andoni et al., 2015). Our technique and results can be extended to other hash functions; however, we will use hamming distance and other implementation details specific to bit-wise LSH functions in this work.

2.1. Approach Overview

Our overall estimation scheme is an implementation of importance sampling. It consists of two steps: a preprocessing step that applies locality sensitive hash functions to our dataset to produce hash tables, followed by a sampling step in which we sample from those hash tables to produce our final estimate.

To help guide the reader through the technical details of our implementation, we first offer an intuitive explanation of our approach. Our importance sampling scheme achieves 2 main objectives: we concentrate the elements of interest in our overall dataset into a few buckets that we can easily sample from, and we sample from these buckets to produce our estimate. In order to compensate for the concentrated sampling, we adjust the value of each sample by the inverse of the probability that the sample lands in the target buckets.

Our technique relies on the key insight that LSH functions can effectively implement both of these objectives. Using LSH functions to index our dataset ensures that for a given query q, elements that are close to q in angular distance have a comparatively higher probability of hashing to q's bucket and to buckets that are a small hamming distance from q's bucket, thereby concentrating the elements of interest into certain buckets that we can selectively sample from.
Additionally, the hamming distance collision probabilities for bit-wise LSH functions are well expressed in terms of angular distance. Consider random hyperplane LSH (Charikar, 2002), where each hash vector is chosen uniformly at random from the d-dimensional unit sphere. Each hash vector r contributes one bit to the hash sequence of a data point v, based on the rule:

$$h_r(v) = \begin{cases} 0 & \text{if } r \cdot v \le 0 \\ 1 & \text{otherwise.} \end{cases}$$
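As a concrete illustration of this rule, the sketch below builds t-bit hash sequences with random hyperplane LSH; the function name hyperplane_hashes and all parameter choices are ours, not the paper's.

```python
import numpy as np

def hyperplane_hashes(vectors, t, seed=0):
    """Hash each row of `vectors` to a t-bit sequence with random hyperplane LSH.

    Each hash vector r is a uniformly random direction (a normalized Gaussian
    draw); bit j of v's hash is 1 if r_j . v > 0 and 0 otherwise, as in the
    rule above.
    """
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((t, vectors.shape[1]))
    r /= np.linalg.norm(r, axis=1, keepdims=True)      # rows lie on the unit sphere
    return (vectors @ r.T > 0).astype(np.uint8)        # shape (n, t), bits in {0, 1}

# Example: hamming distance between the hash sequences of two unit vectors.
rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 50))
data /= np.linalg.norm(data, axis=1, keepdims=True)
codes = hyperplane_hashes(data, t=18)
d_01 = int(np.count_nonzero(codes[0] != codes[1]))     # hamming distance
```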
It is well known that for any particular hamming distance i and any data point x,

$$P(d_{qx} = i \mid \theta_{qx}) = \binom{t}{i} \left(\frac{\theta_{qx}}{\pi}\right)^{i} \left(1 - \frac{\theta_{qx}}{\pi}\right)^{t-i}$$

where d_qx is the hamming distance between the hash for query q and the hash for data vector x, θ_qx denotes the angle between the two vectors, and t is the total number of bits in the hash sequence.
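Under this hash family d_qx is Binomial(t, θ_qx/π), so the probability mass over any set of hamming distances I can be summed directly from the binomial pmf. A minimal sketch of such a helper (our own, not code from the paper):

```python
from math import comb, pi

def collision_prob(theta_qx, t, I):
    """p(x) = P(d_qx in I | theta_qx): d_qx is Binomial(t, theta_qx/pi), so we
    sum the binomial pmf over the hamming distances in I."""
    p_bit = theta_qx / pi                               # per-bit disagreement probability
    return sum(comb(t, i) * p_bit**i * (1 - p_bit)**(t - i) for i in I)

# e.g. probability that a point 60 degrees (pi/3) from q lands within
# hamming distance at most 3 of q's bucket, with t = 18 bits:
p_x = collision_prob(pi / 3, t=18, I=range(0, 4))
```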
Thus, the choice of t affects the sensitivity of the LSH scheme – the correlation between the hamming distances of two hash sequences and the angle between the two underlying data points. Moreover, depending on the design choice for t, the set of hamming distances I that contains most of the probability mass for collisions with elements of angular distance in range A is different. This is also a consideration in our sampling scheme; we want to sample from buckets of hamming distances I that have a high probability of containing elements that are within angle A of q.

Our sampling scheme picks elements over K hash tables from buckets that are at hamming distance I to the query, where I is tuned to A. Given a sample, x, we compute the angular distance θ_qx = cos⁻¹(q · x). Let p(x) = P(d_qx ∈ I | θ_qx), the collision probability that x lands in a bucket that is hamming distance I from q over the random choice of hash functions.
We define a random variable Z as a function of sample x as follows:

$$Z = \begin{cases} \dfrac{\sum_{k=1}^{K} C_q^k(I)}{K \cdot p(x)} & \text{if } \theta_{qx} \in A \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where C_q^k(I) is the total number of elements in buckets of hamming distance I from q's bucket in table k.

We take S samples and construct Z_1, Z_2, ..., Z_S. We report (1/S) Σ_{i=1}^S Z_i as our estimate for |A_q|.
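A direct transcription of Equation (1) into code might look like the sketch below; collision_prob is the hypothetical helper sketched earlier, and counts_I stands for the C_q^k(I) values defined above.

```python
import numpy as np

def score_sample(x, q, counts_I, t, I, theta_max):
    """Importance-sampling score Z for one sample x, as in Equation (1).

    counts_I  : list of C_q^k(I), one entry per table k = 1..K
    theta_max : upper end of the angle range A (here A = [0, theta_max])
    Relies on the hypothetical collision_prob helper sketched earlier.
    """
    theta_qx = float(np.arccos(np.clip(np.dot(q, x), -1.0, 1.0)))
    if theta_qx > theta_max:                            # theta_qx not in A
        return 0.0
    p_x = collision_prob(theta_qx, t, I)                # P(d_qx in I | theta_qx)
    return sum(counts_I) / (len(counts_I) * p_x)        # sum_k C_q^k(I) / (K p(x))
```

Averaging S such scores gives the estimate reported above.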
Comparison to Related Work: Note that our problem can be viewed as a kernel density estimation problem for a specific kernel function that has value 1 for pairs of points within the required angle range of interest and 0 outside. However, the analysis of (Charikar & Siminelakis, 2017) does not apply to our setting because they need a scale-free hash function (with collision probabilities related to the kernel value) and there is no such function for our 0-1 kernel. The work of (Spring & Shrivastava, 2017) does not make such an assumption on the hash function, but they do not give an analysis that yields meaningful bounds in our setting. As noted previously, both works only look at a single hash bucket in each hash table, leading to a high storage overhead.

2.2. Main Result

We establish the following theoretical bounds on the storage and sample complexity of our estimator in order to achieve a (1 ± ε)-approximation to the true count with high probability.
Theorem 2.1 (Main Result). For a given angular distance range of interest A and a given query q, with probability 1 − δ, our estimator returns a (1 ± ε)-approximation to |A_q|, the true number of elements within angle A of q, using

$$O\!\left(\frac{1}{\varepsilon^2 \min_{x \in A_q} p(x)} \log\frac{1}{\delta}\right) \text{ tables and } O\!\left(\frac{E(C_q(I))}{\varepsilon^2 |A_q| \cdot \min_{x \in A_q} p(x)} \log\frac{1}{\delta}\right) \text{ samples.}$$
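Dropping the hidden constants, the bounds of Theorem 2.1 can be turned into back-of-the-envelope resource estimates; the quantities min_p, expected_C_I, and A_q_size below are hypothetical inputs a practitioner would estimate for their own dataset, and treating the constant factor as 1 is our simplification.

```python
import math

def resource_bounds(eps, delta, min_p, expected_C_I, A_q_size):
    """Order-of-magnitude table and sample counts from Theorem 2.1
    (hidden constants treated as 1)."""
    tables = math.ceil(math.log(1 / delta) / (eps ** 2 * min_p))
    samples = math.ceil(expected_C_I * math.log(1 / delta)
                        / (eps ** 2 * A_q_size * min_p))
    return tables, samples

# e.g. eps = 0.2, delta = 0.05, min p(x) = 0.3, E(C_q(I)) = 5000, |A_q| = 500
K, S = resource_bounds(0.2, 0.05, 0.3, 5000, 500)
```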
To help the reader digest this result, we briefly compare this statement to the sample complexity of naive random sampling. It can be shown through a standard Bernoulli-Chernoff argument that the sample complexity for random sampling is O(n/(|A_q| ε²) · ln(1/δ)), where n/|A_q| is the inverse proportion of elements of interest in the overall population. Intuitively, this says that you need to take more random samples if |A_q| is very small compared to n.

Our sample complexity replaces the n/|A_q| term with E(C_q(I))/(|A_q| · min_{x∈A_q} p(x)), where |A_q| · min_{x∈A_q} p(x) is a measure of the expected number of elements from the set of interest A_q that will land in hamming distance I to q, and E(C_q(I)) is the expected size of the overall sampling pool of elements in hamming distance I. This ratio of expectations seems intuitive – one would expect to get such an expression if our scheme took one sample per table. Surprisingly, we achieve this same type of sample complexity bound while sampling from relatively few hash tables.

Just like random sampling, our sample complexity bound is also based on the proportion of elements of interest in hamming distance I to the total number of elements in hamming distance I. However, it is easy to see that applying LSH to our dataset will increase this proportion to yield a smaller sample complexity. We choose I so that min_{x∈A_q} p(x) is high (this probability can be high even for a small set of hamming distances I, since p(x) is the cumulative probability mass of I successes in t trials, and binomial distributions on t trials concentrate in an O(√t)-sized interval around the mean), and E(C_q(I)) to be small (to filter out elements that are not interesting).

There are certain tradeoffs to choosing I. If more hamming distances are included in I, then min_{x∈A_q} p(x) is higher; however, E(C_q(I)) is also larger. The optimal choice for I is to choose the hamming distances that substantially increase min_{x∈A_q} p(x) yet do not substantially increase E(C_q(I)) (so that not too many uninteresting elements infiltrate those buckets).
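One simple way to act on this tradeoff, assuming I is taken to be a prefix {0, 1, ..., h} of hamming distances and A = [0, θ_max], is to grow h until the worst-case collision probability reaches a target. The sketch below reuses the hypothetical collision_prob helper from Section 2 and is our own heuristic, not the paper's procedure.

```python
from math import pi

def choose_hamming_threshold(theta_max, t, target_min_p=0.9):
    """Smallest h such that I = {0, ..., h} gives min p(x) >= target_min_p
    over A = [0, theta_max].

    For a prefix I the collision probability is smallest at the largest angle,
    so checking theta_max suffices.  Larger h raises min p(x) but also admits
    more uninteresting elements (larger E(C_q(I))); that is the tradeoff above.
    """
    for h in range(t + 1):
        if collision_prob(theta_max, t, range(h + 1)) >= target_min_p:
            return h
    return t

# e.g. threshold for A = 0-60 degrees and t = 18 bits
h = choose_hamming_threshold(pi / 3, t=18)
```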
In the following sections, we explain our scheme further and present our experimental results.

3. Preprocessing

The preprocessing step contributes 3 key ingredients to the overall estimation scheme:

Hash Tables: Given a family of bit-wise hash functions H, define a function family G = {g : D → {0,1}^t} such that g(v) = (h_1(v), ..., h_t(v)), where h_j ∈ H. To construct K tables, we choose K functions g_1, g_2, ..., g_K from G independently and uniformly at random. We store each v ∈ D in bucket g_k(v) for k = 1, 2, ..., K. This step sets up the hash tables that we will sample from in our scheme (a code sketch of this step appears after these three ingredients).
Counts Vector: We create a counts vector, denoted C_{i_k} ∈ R^{t+1}, for each hash address i_k for each table k ∈ {1, ..., K}, where C_{i_k}(d) is the count of the total number of items in buckets that are at hamming distance d = 0, 1, ..., t away from i_k in table k.

Sampler: We create a sampler that, given a separate hash address i_k for each table k ∈ {1, ..., K} and a set of hamming distances I, returns a data point uniformly at random from the union of elements that were hashed to buckets of hamming distance I from i_k across the K tables.
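Returning to the Hash Tables step above, here is a minimal sketch of the table construction, using a dictionary of buckets per table; all names are ours and the representation is only one of many reasonable choices.

```python
import numpy as np
from collections import defaultdict

def build_tables(data, K, t, seed=0):
    """Build K independent LSH tables: table k maps a packed t-bit address to
    the indices of the data vectors that hash there under g_k."""
    rng = np.random.default_rng(seed)
    powers = 1 << np.arange(t)                       # to pack t bits into an integer
    hyperplanes, tables = [], []
    for _ in range(K):
        R = rng.standard_normal((t, data.shape[1]))  # t random hyperplanes defining g_k
        bits = (data @ R.T > 0).astype(np.int64)     # n x t bit matrix
        addresses = bits @ powers                    # g_k(v) as an integer address
        buckets = defaultdict(list)
        for idx, addr in enumerate(addresses):
            buckets[int(addr)].append(idx)           # store v in bucket g_k(v)
        hyperplanes.append(R)
        tables.append(buckets)
    return hyperplanes, tables
```

The query's addresses i_k = g_k(q) are obtained by hashing q with the same K sets of hyperplanes.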
We describe in greater detail the 3 contributions of the preprocessing step. For the rest of this paper, all omitted proofs appear in Appendix C.

As a consequence, it is immediately clear that E(W) = |A_q|.

It is important to understand the implications of this lemma. In particular, the expression for E(Z|W) says that in a specific realization of a choice of hash functions (or a set of tables), the estimator Z is biased if W ≠ |A_q|. Therefore K is essential for helping concentrate the realized value of W around its mean.

Since in expectation our estimator Z gives W, we want to understand how many tables K are required to ensure that W concentrates around its mean, |A_q|. This is related to the variance of W.

We also introduce a new quantity p(x, y) = P(d_qx ∈ I ∩ d_qy ∈ I | θ_qx, θ_qy), the collision probability that x and y both land in buckets that are hamming distance I from q over the random choice of hash functions.

Lemma 3.2 (Variance of W).

$$\sigma^2(W) = \frac{1}{K} \sum_{x, y \in A_q} \left( \frac{p(x,y)}{p(x)\,p(y)} - 1 \right)$$

Therefore we conclude with the following bound on K:

$$K \ge \frac{8}{\varepsilon^2 \min_{x \in A_q} p(x)} \qquad (2)$$

We emphasize that the joint probability p(x, y) ≤ min{p(x), p(y)} is a very loose worst-case bound assuming high correlation between data points. The final bound for K, Equation (2), is also a worst-case bound in the sense that it is possible that a very minuscule fraction of x ∈ A_q have small values for p(x). In the experimental section of the paper, we do an empirical analysis of the inherent bias for different values of K and demonstrate that for real datasets the number of tables needed can be far fewer than what is theoretically required in the worst-case scenario.
3.2. Counts Vector

Query q maps to a bucket i_k for each table k = 1, 2, ..., K. The preprocessing step produces an average counts vector corresponding to bucket i_k, denoted C_q^k, where C_q^k(i) is the count of the total number of items in buckets that are at hamming distance i = 0, 1, ..., t away from the hash address for q in table k. For the hamming distances of interest I, we let C_q^k(I) = Σ_{d∈I} C_q^k(d).

C_q^k(I) is an integral part of our weighted importance sampling scheme. In Appendix A, we show how to compute these vectors efficiently.

Theorem 3.1 (Aggregate-Counts). Given a set of K hash tables, each with 2^t hash buckets with addresses in {0,1}^t, Aggregate-Counts (Algorithm 1) computes, for each hash address i, the number of elements in buckets that are hamming distance 0, 1, ..., t away from i, in each of the K tables, in time O(K t² 2^t).
Note that the t in our hashing scheme is the length of the hash sequence; as a general rule of thumb for bit-wise hash functions, implementers choose t ≈ log(n), so as to average out to one element per hash bucket. Therefore, the preprocessing runtime of a reasonable hashing implementation for Aggregate-Counts (Algorithm 1) is approximately O(nK log²(n)).

The key benefit of Aggregate-Counts is that it computes via a message-passing or dynamic programming strategy that is much more efficient than a naive brute-force approach that would take time O(K 2^{2t}), or O(K n²) if t ≈ log(n).
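Algorithm 1 itself is given in Appendix A, which we do not reproduce; the sketch below is our own version of a standard dynamic program over the t-dimensional hypercube with the same O(t² 2^t) per-table cost: process one bit position at a time, maintaining for every address the count of elements that differ from it in exactly d of the processed bits.

```python
import numpy as np

def aggregate_counts(bucket_sizes, t):
    """For every address i in {0,1}^t, compute counts[i][d] = number of stored
    elements whose bucket is at hamming distance exactly d from i.

    bucket_sizes : array of length 2**t, bucket_sizes[j] = #elements in bucket j.
    Runs in O(t^2 * 2^t) time for one table (a DP over the bit positions).
    """
    n_addr = 1 << t
    # cnt[i][d]: elements agreeing with i on the not-yet-processed bits and
    # differing from i in exactly d of the already-processed bits.
    cnt = np.zeros((n_addr, t + 1), dtype=np.int64)
    cnt[:, 0] = bucket_sizes
    for b in range(t):
        partner = cnt[np.arange(n_addr) ^ (1 << b)]   # counts at the bit-b neighbor
        new = cnt.copy()
        new[:, 1:] += partner[:, :-1]                 # bit b may now disagree
        cnt = new
    return cnt                                        # shape (2**t, t+1)

# Per-table usage: bucket_sizes[j] = number of elements stored at address j,
# e.g. np.bincount(addresses, minlength=1 << t) for one table from build_tables.
```

Running this once per table gives the counts vectors C_{i_k} for every address, from which C_q^k(I) is a sum over the selected distances.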
3.3. Sampler

We create a sampler that, given a hash address i_k for each table and a set of hamming distances I that we want to sample from, generates a sample uniformly at random from the union of elements that were hashed to hamming distance I across the K tables. For an implementation and analysis, please consult Appendix B.

Theorem 3.2 (Sampler). Given a set of K hash tables, each with 2^t hash buckets with addresses in {0,1}^t, a sampling scheme consisting of a data structure and a sampling algorithm can generate, for any fixed hash table k, an element uniformly at random from those at hamming distance d to hash address i. The data structure is a counts matrix that can be precomputed in preprocessing time O(K t³ 2^t), and the sampling algorithm Hamming-Distance-Sampler (Algorithm 2) generates a sample in time O(t).

Again, if we follow t ≈ log(n), the preprocessing time comes out to roughly O(nK log³(n)). Also, we expect the O(t) online sample generation cost to be negligible compared to, say, the inner product computation cost for q · x, which our method and all competing methods use. We describe the importance sampling scheme in the next section.
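Algorithm 2 and its O(t)-time counts-matrix data structure are likewise in Appendix B and are not reproduced here; the sketch below is a deliberately simple (and slower) stand-in that only makes the target sampling distribution explicit: choose an occupied (table, bucket) pair with probability proportional to the bucket size among buckets at hamming distance in I, then choose an element uniformly within it.

```python
import numpy as np

def sample_from_hamming_ball(tables, query_addresses, I, rng):
    """Return one element drawn uniformly from the multiset of (element, table)
    pairs whose bucket lies at hamming distance in I from the query's address.

    tables : list of K dicts mapping t-bit integer address -> list of element indices
    query_addresses : list of K integer addresses i_k = g_k(q)
    This brute-force version enumerates occupied buckets; the paper's
    Hamming-Distance-Sampler achieves O(t) time per sample instead.
    """
    candidates, weights = [], []
    for k, buckets in enumerate(tables):
        i_k = query_addresses[k]
        for addr, members in buckets.items():
            if bin(addr ^ i_k).count("1") in I:
                candidates.append((k, addr))
                weights.append(len(members))
    weights = np.asarray(weights, dtype=float)
    k, addr = candidates[rng.choice(len(candidates), p=weights / weights.sum())]
    return rng.choice(tables[k][addr])               # uniform within the chosen bucket
```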
4. Sampling

We now analyze our sampling algorithm. Recall that our sampling scheme works in the following way. Given query q, we generate the hash for q in each of our K tables by solving for i_k = g_k(q) for k = 1, ..., K. Given the hash for q in each of our K tables and the set of hamming distances I that we want to sample from, we invoke our sampler to generate a sample from across the K tables.

Given this sample, x, we compute the angular distance θ_qx = cos⁻¹(q · x). Let p(x) = P(d_qx ∈ I | θ_qx), the collision probability that x lands in a bucket that is hamming distance I from q over the random choice of hash functions; p(x) is an endogenous property of an LSH function.

We score each sample as in Equation (1). We take S samples and construct Z_1, Z_2, ..., Z_S. We report (1/S) Σ_{i=1}^S Z_i as our estimate for |A_q|.

As an immediate consequence of Lemma 3.1, it is clear that

$$E\!\left[\frac{\sum_{i=1}^{S} Z_i}{S}\right] = |A_q|.$$

Now we analyze the variance of our estimator:

Lemma 4.1 (Variance of Estimator).

$$E\!\left[\left(\frac{\sum_{i=1}^{S} Z_i}{S} - |A_q|\right)^{2}\right] \le \frac{E[Z^2]}{S} + \sigma^2(W)$$

This decomposition of the variance into two terms indicates that the variance is coming from two sources. The first source is the variance of the samples, E[Z²]/S. If we don't take enough samples, we do not get a good estimate. The second source is the variance from the random variable W, σ²(W), which corresponds to the contents in the tables. As we have shown, it is crucial to create enough tables so that W is concentrated around its expectation, |A_q|. Therefore, this second source of variance of the overall estimator comes from the variance of the hash functions that underlie table creation and composition.

The σ²(W) term has already been analyzed in Section 3.1; see Lemma 3.2. Now we analyze the second moment of Z.

Lemma 4.2 (Variance of Z).

$$E[Z^2] = \sum_{x \in A_q} \sum_{y \in D} \left( \frac{p(x,y)}{K \cdot p(x)^2} + \left(1 - \frac{1}{K}\right) \frac{p(y)}{p(x)} \right)$$
Now that we have all the components, we are ready to put together the final sample and storage complexities for our estimator. We want a final estimate that concentrates with at most ε error around its mean, |A_q|, with probability 1 − δ. To do this, we make several sets 1, 2, ..., M of our estimator (one estimator consists of a set of K tables and S samples). We choose K and S so that the failure probability of one estimator is a constant, say 1/4. Each estimator produces an estimate, call it E_m, for m ∈ {1, ..., M}. We report our final estimate as the median of these estimates. This is the classic Median-of-Means technique.
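A compact sketch of this outer loop, treating one (K tables, S samples) estimator as a black-box callable; run_estimator is hypothetical:

```python
import numpy as np

def median_of_estimates(run_estimator, M, seed=0):
    """Boost a constant-failure-probability estimator to failure probability
    exponentially small in M by running M independent copies and reporting
    the median of their estimates."""
    rng = np.random.default_rng(seed)
    estimates = [run_estimator(rng) for _ in range(M)]   # each builds K tables, takes S samples
    return float(np.median(estimates))
```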
In Lemma 4.1, we analyze the variance of our estimator and show that σ²((Σ_{i=1}^S Z_i)/S) ≤ E[Z²]/S + σ²(W). Therefore, in order for the failure probability to be less than 1/4, it suffices to have σ²((Σ_{i=1}^S Z_i)/S) ≤ ε²|A_q|²/4, which can be obtained by letting E[Z²]/S ≤ ε²|A_q|²/8 and σ²(W) ≤ ε²|A_q|²/8.

Focusing on the σ²(W) term, which depends on the number of tables K created, we show in Lemma 3.3 from Section 3.1 that it suffices to take K ≥ 8/(ε² min_{x∈A_q} p(x)).

Now that we have our table complexity, we can analyze our sampling complexity S to bound E[Z²]/S.

Lemma 4.1. Suppose K ≥ 8/(ε² min_{x∈A_q} p(x)). Then S ∈ O(E(C_q(I))/(ε² |A_q| · min_{x∈A_q} p(x))) suffices to achieve E[Z²]/S ≤ ε²|A_q|²/8.

Proof. By Lemma 4.2 we have:

$$\frac{E[Z^2]}{S} = \frac{1}{S} \sum_{x \in A_q} \sum_{y \in D} \left( \frac{p(x,y)}{K \cdot p(x)^2} + \left(1 - \frac{1}{K}\right) \frac{p(y)}{p(x)} \right)$$

Substituting for K ≥ 8/(ε² min_{x∈A_q} p(x)) gives:

$$\frac{E[Z^2]}{S} \le \frac{1}{S} \sum_{x \in A_q} \sum_{y \in D} \left( \frac{\varepsilon^2\, p(x,y) \min_{x \in A_q} p(x)}{8\, p(x)^2} + \frac{p(y)}{p(x)} \right) \le \frac{1}{S} \sum_{x \in A_q} \sum_{y \in D} \left( \frac{\varepsilon^2\, p(x,y)}{8\, p(x)} + \frac{p(y)}{p(x)} \right) \le \frac{1}{S} (1 + \varepsilon^2) \sum_{x \in A_q} \sum_{y \in D} \frac{p(y)}{p(x)}$$

Since Σ_{y∈D} p(y) = E(C_q(I)) and p(x) ≥ min_{x∈A_q} p(x) for x ∈ A_q, taking S ∈ O(E(C_q(I))/(ε² |A_q| · min_{x∈A_q} p(x))) is sufficient.

Putting together Lemmas 3.3 and 4.1 with the median of means strategy yields our main result, Theorem 2.1.

In the rest of this paper we discuss the results of our experiments on real datasets.

5. Experiments

We describe our experiments using the GLOVE dataset. We use the set of 400,000 pre-trained 50-dimensional word embedding vectors trained from Wikipedia 2014 + Gigaword 5, provided by (Pennington et al., 2014). We normalize the embeddings, as is standard in many word embedding applications (Sugawara et al., 2016). We choose 3 query words with different neighborhood profiles: "venice", "cake", "book". Venice has the smallest neighborhood, with 206 elements with angular distance less than 60 degrees; cake has a medium sized neighborhood with about 698 elements; book has the largest neighborhood with 1275 elements. The histograms for these 3 queries are shown in Figure 1.

We also choose our angle range of interest, A, to be 0-60 degrees. A search through our dataset gave "florence", "cannes", "rome" as representative elements that are 40-50
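For reference, the ground-truth neighborhood sizes quoted above can be recomputed by brute force on the normalized embeddings; the sketch assumes the pre-trained vectors have already been loaded into a word-to-vector dict called glove (loading code omitted) and excludes the query word itself from the count.

```python
import numpy as np

def neighborhood_size(glove, query_word, max_angle_deg=60.0):
    """Count embeddings within max_angle_deg of the query word (brute force)."""
    words = list(glove)
    X = np.stack([glove[w] for w in words])
    X /= np.linalg.norm(X, axis=1, keepdims=True)        # normalize, as in the experiments
    q = X[words.index(query_word)]
    cos_threshold = np.cos(np.radians(max_angle_deg))
    return int(np.count_nonzero(X @ q >= cos_threshold) - 1)   # exclude the query itself

# e.g. neighborhood_size(glove, "venice"), neighborhood_size(glove, "cake"), ...
```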
Figure 2: The empirical bias for different values of K for queries "venice", "cake" and "book". For each hamming threshold, the relative bias is averaged over 50 sets of K tables, using random hyperplane hash as the LSH function. The bias decreases as the hamming threshold increases, and the bias decreases with more hash tables, keeping the hamming threshold fixed. The bias does not drop significantly from 20 to 40 hash tables. At 20 hash tables, the bias at hamming threshold 5 is around 5%. This demonstrates that for real datasets the number of tables needed can be far fewer than what is theoretically required in the worst-case scenario.
Figure 3: Comparison of our estimator against the benchmark LSH estimator adapted from ideas introduced in (Spring & Shrivastava, 2017). (a) Estimation from different hamming thresholds, fixing 20 tables and 1000 samples; (b) estimation using different numbers of tables, fixing I = 0. In our experiments, panel (b), even after 40 hash tables, the error remained above 20%, and above 50% for queries with very small neighborhoods, such as "venice". In contrast, panel (a) illustrates the relative accuracy of our estimator. We fix 20 hash tables and take 1000 samples from different hamming thresholds. Note that for queries like "venice", which has a very small neighborhood, taking 1000 samples at hamming threshold 5 performed worse than at hamming threshold 3; this is likely because 1000 samples was too few for hamming threshold 5 – the ratio of total elements to elements of interest at hamming threshold 5 is high.