CtrlK

Billion-scale Approximate Nearest Neighbor Search

https://speakerdeck.com/matsui_528/cvpr20-tutorial-billion-scale-approximate-nearest-neighbor-search

Recorded video

Image Retrieval in the Wild

Nearest Neighbor Search: NN

N D-dim database vectors
Given a query q, find the closest vector from the database
Solution: linear scan, O(ND), slow
Should try this first of all

Naive implementation

Faiss implementation

M = number of query vectors
SIMD (Single-Instruction Multiple Data)
- also known as vectorization
BLAS (Basic Linear Algebra Subprograms)

SIMD

Then go into the d >= 4 and d > 0 and the end of the function

SIMD codes of faiss are simple and easy to read
Being able to read SIMD codes comes in handy sometimes; why this impl is super fast

2. BLAS

Norm: Euclidean norm

k-means: one of the benchmarks for KNN
Faster if using multiple GPU

Some useful reference

Approximate Nearest Neighbor Search: ANN

Faster search
Trade-off: runtime, accuracy, and memory consumption
A sense of scale: billion-scale data on memory

Application

NN/ANN for CV
- Image retrieval
- Person re-identification
- Clustering
- kNN recognition
Fast construction of bag-of-features
One of the benchmarks is still SIFT

Locality Sensitive Hashing (LSH)

Hash function that we use, generally speaking:

Good sides:

Math-friendly
Popular in the theory area (FOCS, STOC, ...)

Bad sides:

Large memory cost
- Need several tables to boost the accuracy
- Need to store the original data (# = N) on memory
Data-dependent methods such as PQ are better for real-world data
Thus, in recent CV papers, LSH has been treated as a classic method

Falconn

Tree / Space Partitioning

FLANN

Annoy

Graph Traversal

Very popular in recent years
Around 2017, it turned out that the graph-traversal-based methods work well for million-scale data
Pioneer
- Navigable Small World Graphs (NSW)
- Hierarchical NSW (HNSW)
Implementation: nmslib, hnsw, faiss

Record Phase

Early links can be long
- In the early phase, there are not many vectors
Such long links encourage a large hop, making the fast convergence for search

Search Phase

Task: find the similar node for the query node

Then, traverse in a greedy manner

Extension: Hierarchical NSW; HNSW

Sub-sample from bottom to the top

NMSLIB (Non-Metric Space Library)

Other implementation of HNSW

Hnswlib
- Spin-off library from nmslib
- Include only hnsw
- Simpler; may be useful if you want to extend hnsw
Faiss
- Libraries for PQ-based methods
- This lib also include hnsw

Other graph-based approaches

Compressed data

Basic idea

Product Quantization: PQ

Split the vector into three parts
For this first 2-D vector, find the closest in this codebook
Then, record the ID

Memory Efficient

2. Distance Estimation

First, apply product quantization on the database
Then, get the PQ code

Deep PQ

Hamming-based v.s Lookup-based

Billion-scale

Inverted Index + PQ: Record

Inverted Index + PQ: Search

Faiss

Graph Traversal

Cheat sheet of choosing index

Benchmark

Nearest neighbor search engine

Problems of ANN

PreviousQuery2box: Reasoning over Knowledge Graphs in Vector Space using Box Embeddings NextDiskANN: Fast accurate billion-point nearest neighbor search on a single node

Last updated 4 years ago

Was this helpful?