# Billion-scale Approximate Nearest Neighbor Search

## Recorded video&#x20;

{% embed url="<https://matsui528.github.io/cvpr2020_tutorial_retrieval/>" %}

## Nearest Neighbor Search: NN&#x20;

* N D-dim database vectors&#x20;
* Given a query q, find the closest vector from the database&#x20;
* Solution: linear scan, O(ND), slow&#x20;
* Should try this first of all&#x20;

![Problem setting ](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdAJNyQ0dU7-BA6PHF%2Fimage.png?alt=media\&token=632453b9-3206-4f86-bfb7-8219bdc33186)

#### Naive implementation&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdAQPikFrscIllX4NO%2Fimage.png?alt=media\&token=d42bb1a6-068b-490c-9ada-c0c531c322d9)

#### Faiss implementation&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdAfB4mgkEZBGwKN_-%2Fimage.png?alt=media\&token=81c8a61e-1d77-4acf-9ce7-d963b38faa29)

* M = number of query vectors&#x20;
* SIMD (Single-Instruction Multiple Data)&#x20;
  * also known as vectorization
* BLAS (Basic Linear Algebra Subprograms)&#x20;

1. SIMD&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdC67xUOxnGUvnQclJ%2Fimage.png?alt=media\&token=3294e92e-b758-4ccc-be43-a3c836ab9b32)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdCD2s2P46vcY3yGE5%2Fimage.png?alt=media\&token=68ee3c02-f567-4736-aaa7-922580f51fde)

Then go into the d >= 4 and d > 0 and the end of the function&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdD9_BC25YUtrmeUnz%2Fimage.png?alt=media\&token=f0fff9d7-fa3a-4d2a-b46e-61af7316a0b8)

* SIMD codes of faiss are simple and easy to read&#x20;
* Being able to read SIMD codes comes in handy sometimes; why this impl is super fast&#x20;

2\. BLAS&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdDLTXnJ1cOmDOE3NZ%2Fimage.png?alt=media\&token=38d77274-a621-4f70-b1c0-4a5a21bf447f)

* Norm: Euclidean norm

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdDpvlanh3bw8e3XwC%2Fimage.png?alt=media\&token=2d49fa36-8b49-4927-a912-dc0f0b483775)

* k-means: one of the benchmarks for KNN&#x20;
* Faster if using multiple GPU

Some useful reference&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdE5WzwAoGCH-4tMvX%2Fimage.png?alt=media\&token=19c7cb2b-dbba-4d67-a33d-4b2e569b13d5)

## Approximate Nearest Neighbor Search: ANN&#x20;

* Faster search&#x20;
* Trade-off: runtime, accuracy, and memory consumption&#x20;
* A sense of scale: billion-scale data on memory&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-Mbd9GTMdfPEJUg28SnR%2Fimage.png?alt=media\&token=b0a09775-bd8e-43eb-871e-b67c32d0b766)

#### Application&#x20;

* NN/ANN for CV&#x20;
  * Image retrieval&#x20;
  * Person re-identification&#x20;
  * Clustering&#x20;
  * kNN recognition&#x20;
* Fast construction of bag-of-features
* One of the benchmarks is still SIFT&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdEIxgWl3meUxIK5gK%2Fimage.png?alt=media\&token=d5a711dc-7a85-47f7-ba92-d3be9634650b)

### Locality Sensitive Hashing (LSH)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdElzXjrbbtOpKn9ZP%2Fimage.png?alt=media\&token=ed386fce-7775-40b1-9b02-bc6a4b7c05fa)

Hash function that we use, generally speaking:&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdFCaJbtuEsKegglYH%2Fimage.png?alt=media\&token=b02ae321-7f4c-43b3-819b-779d54f05223)

Good sides:

* Math-friendly&#x20;
* Popular in the theory area (FOCS, STOC, ...)

Bad sides:

* Large memory cost&#x20;
  * Need several tables to boost the accuracy&#x20;
  * Need to store the original data (# = N) on memory&#x20;
* Data-dependent methods such as PQ are better for real-world data&#x20;
* Thus, in recent CV papers, LSH has been treated as a classic method&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdFgYcKzIbRNOviBwK%2Fimage.png?alt=media\&token=ce38062d-5123-4ce3-9fbd-399f51678f08)

#### Falconn&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdG-gIUAQwY5RK6PBV%2Fimage.png?alt=media\&token=009ffd6d-9e95-4ab8-becf-2bbbfadd47ca)

### Tree / Space Partitioning&#x20;

#### FLANN&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdGOwkOL7a7pOESiXE%2Fimage.png?alt=media\&token=9932e0f8-777f-40c0-82c6-7d9e7d496add)

#### Annoy&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdGct7bGTsCAf3hJaJ%2Fimage.png?alt=media\&token=04ac57ed-9f05-49c9-9917-49f70ff74f42)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdGuMXT1GROV7BxqsX%2Fimage.png?alt=media\&token=e1c38b96-d191-40e2-98e9-3df7426f18a7)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdH0YdsrNZ-cr9IS0v%2Fimage.png?alt=media\&token=d9469d75-f94a-4a58-b181-06588e5179bc)

### Graph Traversal&#x20;

* **Very popular** in recent years&#x20;
* Around 2017, it turned out that the graph-traversal-based methods work well for million-scale data&#x20;
* Pioneer&#x20;
  * Navigable Small World Graphs (NSW)
  * Hierarchical NSW (HNSW)&#x20;
* Implementation: nmslib, hnsw, faiss&#x20;

#### Record Phase&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdHpgqiw0SsdsaqlTh%2Fimage.png?alt=media\&token=e2cddbe0-4470-435c-a5f4-f165dfa9a763)

* Early links can be long
  * In the early phase, there are not many vectors&#x20;
* Such long links encourage a large hop, making the fast convergence for search&#x20;

#### Search Phase&#x20;

Task: find the similar node for the query node&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdIL_LsiEtY-u1lzWq%2Fimage.png?alt=media\&token=85e7845b-bf83-49fb-b6c5-6dc304927aac)

* Then, traverse in a greedy manner&#x20;

#### Extension: Hierarchical NSW; HNSW&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdI_4z_0fIk2GWO9b0%2Fimage.png?alt=media\&token=557b3f7d-f01b-466e-a876-81777599943c)

* Sub-sample from bottom to the top&#x20;

#### NMSLIB (Non-Metric Space Library)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdIoxbCRyUdk_rN1n3%2Fimage.png?alt=media\&token=a566acc9-3423-4daa-891d-a60b8916291d)

#### Other implementation of HNSW

* Hnswlib&#x20;
  * Spin-off library from nmslib
  * Include only hnsw&#x20;
  * Simpler; may be useful if you want to extend hnsw&#x20;
* Faiss&#x20;
  * Libraries for PQ-based methods&#x20;
  * This lib also include hnsw&#x20;

#### Other graph-based approaches&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdJAU1VMPRTNLPVQtr%2Fimage.png?alt=media\&token=7c454abf-bad7-45f8-aeba-4ff9492f702d)

### Compressed data&#x20;

#### Basic idea&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdJg50k1f7AVvHJ-fu%2Fimage.png?alt=media\&token=fe9ef0f0-4ba2-44e4-92c7-8415378da888)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdJzChdScWam2BqVN8%2Fimage.png?alt=media\&token=93e1a5ea-8747-4810-a6b7-39f03e1ba8ea)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdK9UVdhfbpbBIlnzT%2Fimage.png?alt=media\&token=32b87a91-c350-4dff-aae4-0d010fffcbeb)

#### Product Quantization: PQ&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdKM1uQRqeXKduETon%2Fimage.png?alt=media\&token=9c989714-ffe4-4ea5-9ddf-8452ddad2115)

* Split the vector into three parts&#x20;
* For this first 2-D vector, find the closest in this codebook
* Then, record the ID&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdKqfeHghxeZ4nR-6q%2Fimage.png?alt=media\&token=900b460c-9a1c-4d28-b680-ef68e3951ba2)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdKuKuyTEzrFVrxWOF%2Fimage.png?alt=media\&token=b64cd32c-971b-4276-a88d-536fef392dfa)

1. Memory Efficient&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdLOUsxbSMnRX7UJRd%2Fimage.png?alt=media\&token=4e8040b3-33ef-4fa1-816f-1c6857aeda36)

2\. Distance Estimation&#x20;

* First, apply product quantization on the database&#x20;
* Then, get the PQ code&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdLiK7VyCFBeBnpnHA%2Fimage.png?alt=media\&token=858b8eb1-5af0-4c75-ac75-0cb965614298)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdLkaYH6rrkUp9KF5O%2Fimage.png?alt=media\&token=6276db57-325f-4747-83ca-1785aea88010)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdLxthUlG-TaxfourW%2Fimage.png?alt=media\&token=64dac211-ea8d-4114-9aed-56b8f5efa22f)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdMEoJU2Pq84G71JhQ%2Fimage.png?alt=media\&token=0cf783eb-122e-4d1c-b889-404477309b9f)

#### Deep PQ&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdMTPocZPMCVeScAJH%2Fimage.png?alt=media\&token=1793555e-ad68-4efc-9704-c6defa35e0b5)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdMWhjBZlQct1HuOUU%2Fimage.png?alt=media\&token=aea24085-03ec-4ad3-9c6b-c5346dadeb24)

#### Hamming-based v.s Lookup-based&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdM_oM1h3l-RjQBsq7%2Fimage.png?alt=media\&token=24cf1f84-ce87-4b78-aa53-704e115b7491)

### Billion-scale&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdMlmqnkqWyCY_7Oh-%2Fimage.png?alt=media\&token=ba27209e-9442-41f8-bdff-e91680b4cf2a)

#### Inverted Index + PQ: Record&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdN-2LjOkVyJUuRycj%2Fimage.png?alt=media\&token=e7c83597-cf9e-4222-aebf-dfe63299dde5)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdNQkUnOM7M0I0nJ2d%2Fimage.png?alt=media\&token=3cd1b353-931e-4ac2-a223-c00b952bb03e)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdNZM50OqPuVmJlmMX%2Fimage.png?alt=media\&token=e980df7e-454b-48ce-9d66-dfe0fe35926e)

#### Inverted Index + PQ: Search&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdNk_mHCxZ1UYywY1N%2Fimage.png?alt=media\&token=66f8fd34-d8c3-47f0-9ec2-c811f1906cbd)

#### Faiss&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdO4ZvaeIW-GEVg3zq%2Fimage.png?alt=media\&token=b047dfc5-fbc9-4fb2-9647-02c28e37b44b)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdOAF5mR9JyST1Rn_O%2Fimage.png?alt=media\&token=a2825e41-605a-4dfb-90bb-98dd14e2e716)

#### Graph Traversal

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdOXJgA5VMrRJ50WSZ%2Fimage.png?alt=media\&token=1bab0561-07a5-4d48-9bfc-3536284810d0)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdOqVExoFl6YknO1NK%2Fimage.png?alt=media\&token=7d4bdec8-f2a0-4c8c-b645-ec3718b8f68a)

## Cheat sheet of choosing index &#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-Mbd9qSKQ8ZTC9NK_avC%2Fimage.png?alt=media\&token=33fdeb3f-0237-47d2-ab47-fe2c4860a1bb)

### Benchmark&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdQ-OJ8xPEhiVfmnpn%2Fimage.png?alt=media\&token=7bbc0209-a4df-478e-a889-ea2820a96b4c)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdQ6zSyUqj-_XZqdaZ%2Fimage.png?alt=media\&token=012fc091-0dc9-4d37-9d57-e1972c5b2d65)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdQCfhloHY9UwveMmg%2Fimage.png?alt=media\&token=ca21979e-e0ed-4762-8370-1dedd5cd4eb8)

### Nearest neighbor search engine

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdQTD1sRoCxFHDZoHK%2Fimage.png?alt=media\&token=b9c9742e-5b68-4972-ae5c-63f9b1c30ce7)

### Problems of ANN&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mbc8ln4gmKDQ4hnlFMY%2F-MbdQZJRY_5bGBgpvnLU%2Fimage.png?alt=media\&token=6469fef0-98da-499a-8c0b-4c99af99e1ad)
