Billion-scale Approximate Nearest Neighbor Search
https://speakerdeck.com/matsui_528/cvpr20-tutorial-billion-scale-approximate-nearest-neighbor-search
Recorded video
Nearest Neighbor Search: NN
N D-dim database vectors
Given a query q, find the closest vector from the database
Solution: linear scan, O(ND), slow
Always try this simple baseline first
Naive implementation
Faiss implementation
M = number of query vectors
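The linear scan above can be sketched directly in NumPy; the function and variable names below are my own, not from any library (Faiss's `IndexFlatL2` is the optimized equivalent of this exact scan):

```python
import numpy as np

def linear_scan(database, queries):
    """Exact NN by brute force: O(N * D) work per query."""
    # dists[m, n] = || queries[m] - database[n] ||^2, via broadcasting
    dists = ((queries[:, None, :] - database[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # index of the closest database vector per query

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16)).astype(np.float32)            # N = 1000, D = 16
Q = X[:5] + 0.01 * rng.standard_normal((5, 16)).astype(np.float32)  # M = 5 queries
ids = linear_scan(X, Q)   # each query should find its (slightly perturbed) source
```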
SIMD (Single Instruction, Multiple Data)
also known as vectorization
BLAS (Basic Linear Algebra Subprograms)
1. SIMD
The function first processes four dimensions at a time (the d >= 4 loop), then handles the d > 0 remainder at the end of the function
The SIMD code in Faiss is simple and easy to read
Being able to read SIMD code comes in handy; it explains why this implementation is so fast
2. BLAS
Norm: Euclidean norm
k-means: one of the benchmarks for NN search (the assignment step is an NN problem)
Faster when using multiple GPUs
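One reason the BLAS route is fast: squared Euclidean distances decompose as ||q − x||² = ||q||² − 2 q·x + ||x||², so the pairwise term reduces to a single matrix multiply (one BLAS GEMM call). A minimal NumPy sketch, with names of my own choosing:

```python
import numpy as np

def l2sqr_blas(queries, database):
    """Squared L2 distances via one GEMM instead of per-pair loops:
       ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2"""
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)   # (M, 1)
    x_norms = (database ** 2).sum(axis=1)                 # (N,)
    # queries @ database.T dispatches to BLAS sgemm: the heavy lifting
    return q_norms - 2.0 * queries @ database.T + x_norms  # (M, N)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32)).astype(np.float32)
Q = rng.standard_normal((4, 32)).astype(np.float32)
naive = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # reference result
fast = l2sqr_blas(Q, X)
```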
Some useful references
Approximate Nearest Neighbor Search: ANN
Faster search
Trade-off: runtime, accuracy, and memory consumption
A sense of scale: billion-scale data on memory
Application
NN/ANN for CV
Image retrieval
Person re-identification
Clustering
kNN recognition
Fast construction of bag-of-features
One of the benchmarks is still SIFT
Locality Sensitive Hashing (LSH)
The hash function that is generally used:
Good sides:
Math-friendly
Popular in the theory area (FOCS, STOC, ...)
Bad sides:
Large memory cost
Need several tables to boost the accuracy
Need to store the original data (all N vectors) in memory
Data-dependent methods such as PQ are better for real-world data
Thus, in recent CV papers, LSH has been treated as a classic method
Falconn
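As a rough sketch of the idea (not FALCONN's actual implementation), here is random-hyperplane LSH for cosine similarity; the class and parameter names are illustrative:

```python
import numpy as np

class HyperplaneLSH:
    """Signed random projections: one bit per hyperplane, h(x) = [w . x >= 0]."""
    def __init__(self, dim, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # random hyperplane normals

    def hash(self, x):
        # Similar vectors tend to fall on the same side of most hyperplanes,
        # so their codes share many bits
        return ((self.planes @ x) >= 0).astype(np.uint8)

lsh = HyperplaneLSH(dim=64, n_bits=16)
rng = np.random.default_rng(1)
v = rng.standard_normal(64)
near = v + 0.01 * rng.standard_normal(64)  # a slightly perturbed copy
far = -v                                   # the opposite direction
code_v, code_near, code_far = lsh.hash(v), lsh.hash(near), lsh.hash(far)
```

In practice, several such tables (with independent random planes) are kept to boost recall, which is exactly the memory cost noted above.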
Tree / Space Partitioning
FLANN
Annoy
Graph Traversal
Very popular in recent years
Around 2017, it turned out that the graph-traversal-based methods work well for million-scale data
Pioneer
Navigable Small World Graphs (NSW)
Hierarchical NSW (HNSW)
Implementation: nmslib, hnsw, faiss
Record Phase
Early links can be long
In the early phase, there are not many vectors
Such long links encourage large hops, enabling fast convergence during search
Search Phase
Task: find the node most similar to the query
Then, traverse in a greedy manner
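The greedy traversal can be sketched as follows. This is a simplified hill-climbing version; NSW/HNSW actually maintain a candidate list (beam) to escape poor local optima:

```python
import numpy as np

def greedy_search(graph, vectors, query, start):
    """Walk the graph, always moving to a neighbor closer to the query;
       stop when no neighbor improves (a local minimum)."""
    current = start
    current_dist = np.sum((vectors[current] - query) ** 2)
    while True:
        improved = False
        for nb in graph[current]:
            d = np.sum((vectors[nb] - query) ** 2)
            if d < current_dist:
                current, current_dist, improved = nb, d, True
        if not improved:
            return current  # no neighbor is closer

# Toy example: 5 points on a line, each linked to its adjacent points
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
found = greedy_search(graph, vectors, np.array([3.2]), start=0)
```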
Extension: Hierarchical NSW; HNSW
Build the layers by sub-sampling nodes, from the bottom layer to the top
NMSLIB (Non-Metric Space Library)
Other implementations of HNSW
Hnswlib
Spin-off library from nmslib
Includes only HNSW
Simpler; useful if you want to extend HNSW
Faiss
Library for PQ-based methods
This library also includes HNSW
Other graph-based approaches
Compressed data
Basic idea
Product Quantization: PQ
Split the vector into sub-vectors (three 2-D parts in this example)
For the first 2-D sub-vector, find the closest entry in the codebook
Then, record its ID
Memory efficient: e.g., a 128-D float32 vector (512 bytes) is compressed to just a few bytes of sub-vector IDs
2. Distance Estimation
First, apply product quantization on the database
Then, get the PQ code
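A minimal sketch of PQ encoding and asymmetric distance computation (ADC). Real implementations train each codebook with k-means and precompute a per-query lookup table; for brevity this illustration samples codebook entries from the data, and all names are my own:

```python
import numpy as np

def train_pq(data, M, K, seed=0):
    """One codebook of K centroids per sub-space.
       (Real PQ runs k-means per sub-space; here we just sample K points.)"""
    rng = np.random.default_rng(seed)
    d_sub = data.shape[1] // M
    return [data[rng.choice(len(data), K, replace=False),
                 m * d_sub:(m + 1) * d_sub] for m in range(M)]

def encode(x, codebooks):
    """PQ code: for each sub-vector, the ID of its nearest centroid."""
    d_sub = codebooks[0].shape[1]
    return np.array([
        np.argmin(((cb - x[m * d_sub:(m + 1) * d_sub]) ** 2).sum(1))
        for m, cb in enumerate(codebooks)], dtype=np.uint8)

def adc(query, code, codebooks):
    """Asymmetric distance: sum of per-sub-space squared distances.
       (In practice one precomputes an (M, K) table once per query and
       estimates each database distance by M table lookups.)"""
    d_sub = codebooks[0].shape[1]
    return sum(((codebooks[m][code[m]] - query[m * d_sub:(m + 1) * d_sub]) ** 2).sum()
               for m in range(len(codebooks)))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8)).astype(np.float32)
cbs = train_pq(X, M=4, K=16)     # 8-D vectors -> four 2-D sub-spaces
code = encode(X[0], cbs)         # 4 bytes instead of 32 bytes
q = rng.standard_normal(8).astype(np.float32)
est = adc(q, code, cbs)          # estimated distance from q to X[0]
```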
Deep PQ
Hamming-based vs. lookup-based
Billion-scale
Inverted Index + PQ: Record
Inverted Index + PQ: Search
Faiss
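The record/search phases above can be sketched as follows. This simplified version stores raw IDs and scans the probed lists exactly; Faiss's `IndexIVFPQ` additionally stores a PQ code per entry and estimates distances with ADC. Function names and the toy setup are mine:

```python
import numpy as np

def build_ivf(data, coarse_centroids):
    """Record phase: assign each vector to its nearest coarse centroid
       and append its ID to that centroid's inverted list."""
    assign = ((data[:, None, :] - coarse_centroids[None, :, :]) ** 2).sum(-1).argmin(1)
    return {c: np.where(assign == c)[0] for c in range(len(coarse_centroids))}

def ivf_search(query, data, coarse_centroids, lists, nprobe=2):
    """Search phase: visit only the nprobe closest lists, then scan them."""
    order = ((coarse_centroids - query) ** 2).sum(1).argsort()[:nprobe]
    candidates = np.concatenate([lists[c] for c in order])
    dists = ((data[candidates] - query) ** 2).sum(1)
    return candidates[dists.argmin()]

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8)).astype(np.float32)
centroids = X[rng.choice(1000, 16, replace=False)]  # stand-in for k-means centroids
lists = build_ivf(X, centroids)
q = X[123] + 0.001                                  # query near a known vector
found = ivf_search(q, X, centroids, lists, nprobe=4)
```

With nprobe equal to the number of lists this degenerates to an exact scan; smaller nprobe trades accuracy for speed, which is the core runtime/accuracy knob of IVF indexes.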
Graph Traversal
Cheat sheet of choosing index
Benchmark
Nearest neighbor search engine
Problems of ANN