LSH Families
Currently, FALCONN supports two LSH families: hyperplane LSH and cross-polytope LSH. In general, hyperplane LSH is simpler to tune (since it has fewer parameters), and preprocessing the dataset when setting up the table is somewhat faster. On the other hand, cross-polytope LSH offers better query times and can be 4-10x faster than hyperplane LSH on medium to large datasets.
Formally, hyperplane and cross-polytope LSH have theoretical guarantees for the cosine similarity (angular distance). In practice, both hyperplane and cross-polytope LSH are often still useful for the Euclidean distance (in all of R^d) or maximum inner product search, despite being designed for cosine similarity.
At their core, both hyperplane LSH and cross-polytope LSH are randomized space partitions of a d-dimensional unit sphere centered at the origin. The two hash families differ in how granular these partitions are.
Hyperplane LSH is fairly simple: to partition the sphere, we sample a random hyperplane through the center of the sphere and cut the sphere along it into two parts of equal size. An alternative way of thinking about it is as follows: we sample a random d-dimensional vector r whose coordinates are i.i.d. standard Gaussians. To hash a point v, we then compute the sign of the dot product <r, v>.
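To make this concrete, here is a minimal NumPy sketch of a single hyperplane hash function (the names are our own and this is illustrative, not FALCONN's actual implementation):

```python
import numpy as np

def sample_hyperplane_hash(d, rng):
    """Sample one hyperplane hash function: a random Gaussian direction r."""
    r = rng.standard_normal(d)
    # The hash of v is the sign of <r, v>, i.e., which side of the
    # hyperplane with normal r the point v lies on.
    return lambda v: int(np.dot(r, v) >= 0)

rng = np.random.default_rng(0)
h = sample_hyperplane_hash(128, rng)
v = rng.standard_normal(128)
v /= np.linalg.norm(v)  # points live on the unit sphere
print(h(v))             # 0 or 1
```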
For two points with angle α between them, the probability of collision is exactly equal to 1 - α/π.
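This formula is easy to check empirically. The following sketch (again illustrative, not FALCONN code) estimates the collision probability for two unit vectors at angle α = π/4, where 1 - α/π = 0.75:

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials, alpha = 64, 100_000, np.pi / 4  # 45-degree angle

# Two fixed unit vectors with angle alpha between them.
u = np.zeros(d); u[0] = 1.0
w = np.zeros(d); w[0], w[1] = np.cos(alpha), np.sin(alpha)

r = rng.standard_normal((trials, d))  # one random hyperplane per trial
collisions = np.mean((r @ u >= 0) == (r @ w >= 0))
print(collisions, 1 - alpha / np.pi)  # both close to 0.75
```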
We now turn to cross-polytope LSH. First, we quickly review what a d-dimensional cross-polytope is. Let e1, ..., e2d be the set of signed standard basis vectors, i.e., each ei has exactly one nonzero coordinate, which is either +1 or -1. The d-dimensional cross-polytope is the convex hull of these signed standard basis vectors. So in 2 dimensions, the cross-polytope is a rotated square; and in 3 dimensions, the cross-polytope is the octahedron. The cross-polytope is also known as the l1-unit ball (i.e., all points on the surface of the cross-polytope have l1-norm 1).
We now describe a simplified version of cross-polytope LSH, which has fewer ingredients but is slow to compute. To construct a cross-polytope hash function, we sample a random rotation S. In order to hash a point v, we first apply the rotation S to v and then find the vertex of the cross-polytope that is closest to Sv. So a cross-polytope hash function partitions the unit sphere according to the Voronoi cells of the vertices of a randomly-rotated cross-polytope.
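A minimal sketch of this simplified scheme, using the QR decomposition of a Gaussian matrix as a standard way to sample a random orthogonal matrix (the names are illustrative, not FALCONN's API):

```python
import numpy as np

def sample_cross_polytope_hash(d, rng):
    """Simplified cross-polytope hash: a truly random rotation, O(d^2) per hash."""
    g = rng.standard_normal((d, d))
    S, _ = np.linalg.qr(g)  # QR of a Gaussian matrix: a random orthogonal matrix

    def h(v):
        x = S @ v
        i = int(np.argmax(np.abs(x)))   # the closest vertex is +/- e_i, where i is
        return (i, int(np.sign(x[i])))  # the coordinate of largest magnitude
    return h

rng = np.random.default_rng(2)
h = sample_cross_polytope_hash(128, rng)
v = rng.standard_normal(128); v /= np.linalg.norm(v)
print(h(v))  # e.g., (17, -1), meaning the vertex -e_17
```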
The above simplified version is slow because a truly random rotation is expensive to apply: computing Sv given v takes time O(d^2). To overcome this issue, we use pseudo-random rotations instead. The underlying primitive here is the Fast Hadamard Transform. Unlike truly random rotations, we might need more than one pseudo-random rotation to get enough randomness. But the overall approach still brings the evaluation time down to O(d log d).
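A sketch of one way to implement such a pseudo-random rotation, alternating random sign flips with a Fast (Walsh-)Hadamard Transform; it assumes the dimension is a power of two and is meant to illustrate the idea, not to mirror FALCONN's internals:

```python
import numpy as np

def fwht(x):
    """In-place Fast Walsh-Hadamard Transform, O(d log d); len(x) must be a power of 2."""
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    x /= np.sqrt(d)  # normalize so the transform is orthogonal

def pseudo_random_rotation(v, sign_diagonals):
    """Apply x -> H D x once per sign diagonal D (random +/-1 flips)."""
    x = v.copy()
    for signs in sign_diagonals:
        x *= signs
        fwht(x)
    return x

rng = np.random.default_rng(3)
d = 256  # must be a power of two for the FWHT
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]  # e.g., three rotations
v = rng.standard_normal(d); v /= np.linalg.norm(v)
print(np.linalg.norm(pseudo_random_rotation(v, signs)))  # still 1: rotations preserve norm
```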
For very high-dimensional sparse data (e.g., a bag-of-words tf-idf dataset), we apply feature hashing to reduce the high-dimensional vectors to an intermediate dimension d'. We then apply cross-polytope LSH in this smaller dimension d', which significantly speeds up the hash computation. A smaller d' leads to faster hash computation, but also decreases the quality of the hash.
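A minimal sketch of feature hashing, assuming the sparse vector is given as an index-to-value dictionary (the bucket/sign scheme below is one common variant, not necessarily the one FALCONN uses):

```python
import numpy as np

def feature_hash(sparse_vec, d_prime, seed=0):
    """Hash a sparse vector {index: value} down to d' dimensions.

    Each original coordinate is mapped to a pseudo-random bucket in [0, d')
    with a pseudo-random sign, which approximately preserves dot products.
    """
    out = np.zeros(d_prime)
    for index, value in sparse_vec.items():
        h = hash((seed, index))
        bucket = h % d_prime
        sign = 1.0 if (h >> 32) & 1 else -1.0
        out[bucket] += sign * value
    return out

# e.g., a 1,000,000-dimensional tf-idf vector with three nonzero entries:
v = {12: 0.5, 40321: 1.2, 987654: 0.3}
print(feature_hash(v, d_prime=512))
```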
Finally, instead of using a full-dimensional cross-polytope, we can also use lower-dimensional cross-polytopes with fewer vertices. This allows for finer control over the granularity of the hash. For instance, in the one-dimensional case, the cross-polytope hash simply becomes the hyperplane LSH family.
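To illustrate the idea (as a hedged sketch, not FALCONN's exact scheme): after the (pseudo-)random rotation, keep only the first d'' coordinates before finding the closest vertex. With d'' = 1 the hash is just the sign of a single coordinate, i.e., hyperplane LSH:

```python
import numpy as np

def low_dim_cross_polytope_hash(rotated_v, d_cp):
    """Hash with a d_cp-dimensional cross-polytope: look only at the
    first d_cp coordinates of the already-rotated vector."""
    x = rotated_v[:d_cp]
    i = int(np.argmax(np.abs(x)))
    return (i, int(np.sign(x[i])))

rng = np.random.default_rng(4)
x = rng.standard_normal(128)  # stands in for an already-rotated point
print(low_dim_cross_polytope_hash(x, 16))  # one of 2 * 16 = 32 buckets
print(low_dim_cross_polytope_hash(x, 1))   # (0, +/-1): just a sign, i.e., hyperplane LSH
```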