# DiskANN: Fast accurate billion-point nearest neighbor search on a single node

https://papers.nips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf

### Intro & Problem Statement

KNN:

Given a dataset P, a target k, and a query point q, retrieve the k nearest neighbors of q from P quickly.
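For reference, a brute-force exact k-NN baseline (a minimal Python sketch with illustrative sizes): it scans all of P for every query, which is exactly what becomes infeasible at billion scale.

```python
import numpy as np

def exact_knn(P: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k nearest neighbors of q in P (L2 distance).

    Scans all n points: O(n * d) work per query, infeasible at billion scale.
    """
    dists = np.linalg.norm(P - q, axis=1)   # distance from q to every point
    return np.argsort(dists)[:k]            # indices of the k smallest

# Example: 10k random 128-dim points (illustrative sizes)
P = np.random.rand(10_000, 128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
print(exact_knn(P, q, k=5))
```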

Application domains: search over embeddings (e.g., learned representations of documents or images)

Curse of dimensionality: exact search in high dimensions is prohibitively expensive, which motivates approximate nearest neighbor (ANN) search

ANN:

Goal: maximize recall while retrieving the results as quickly as possible

Trade-off: **recall vs. latency (and/or indexing time)**

Algorithms:

k-d trees

Locality Sensitive Hashing (LSH)

Graph-based: HNSW, NSG

#### Indexing large datasets: two approaches

Approach 1: based on **Inverted Index + Data Compression**; includes methods such as FAISS and IVFOADC+G+P. The data is clustered, and the vectors are stored compressed (e.g., via PQ).

*Benefit*: small memory footprint, and low latency of retrieving results (can use an accelerator)

*Problem*: recall is low because the compression is lossy

Approach 2: divide the dataset into disjoint **shards** and build an in-memory index for each shard.

*Problem*: increased search latency and reduced throughput, since each query needs to be routed to several shards

### Motivation

SOTA **approximate nearest neighbor search (ANNS)** algorithms generate indices that must be stored in **main memory** for fast high-recall search. RAM is expensive, which limits the size of the dataset that can be served.

E.g., FAISS supports searching only from RAM

Search **throughput** of an SSD-resident index is limited by the number of random disk accesses per query, and **latency** is limited by the number of round-trips to the disk (each round-trip can consist of multiple reads).

SSD: takes **a few hundred microseconds** to serve a random read and can service about ~300K random reads per second. But search applications require mean latencies of **a few milliseconds** for KNN search.

**Main challenges for an SSD-resident index** are:

reducing the number of random SSD accesses per query to a few dozen

reducing the number of round-trip requests to disk to under ten, preferably five

Naively mapping an in-memory index onto an SSD does not work! (It generates several hundred disk reads per query.)
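A back-of-envelope check of these budgets (my illustrative numbers, assuming ~100 µs per random read and that reads within one round trip are issued in parallel):

```python
# Illustrative latency-budget arithmetic (not figures from the paper):
read_latency_us = 100    # roughly a few hundred us per random SSD read
naive_reads = 300        # sequential reads a naive greedy search might issue
round_trips = 5          # DiskANN's target number of round trips per query

# Naive mapping: every read sits on the critical path, one after another.
print(naive_reads * read_latency_us / 1000, "ms")   # 30.0 ms, far over budget

# DiskANN: only round trips are sequential; reads within one are parallel.
print(round_trips * read_latency_us / 1000, "ms")   # 0.5 ms, within a few ms
```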

### Contribution

DiskANN: an SSD-resident ANNS system based on a new graph-based indexing algorithm called Vamana

**DiskANN**: indexes and serves a 1B-point dataset (d in the 100s) using 64GB RAM, with low latency and high precision

**Vamana**: generates graph indices with smaller diameter to minimize the number of sequential disk reads

Used in memory, Vamana matches or outperforms SOTA algorithms like HNSW and NSG

Partition-and-merge: build indices on overlapping partitions and merge them, so datasets too large for RAM can still be indexed

Compression: keep PQ-compressed vectors cached in RAM to guide the search

Note: what's the contribution of the last two...?

### Vamana Graph Construction Algorithm

#### Comparison to HNSW and NSG

HNSW and NSG have no tunable parameter α (it is effectively fixed at 1). Vamana's ability to set α > 1 is the main factor in achieving a better trade-off between graph degree and diameter.

Pruning over candidate sets generated by greedy search helps Vamana and NSG add long-range edges directly, while HNSW instead obtains them through an additional step: constructing a hierarchy of graphs over a nested sequence of samples of the dataset

Initial graph: HNSW (starts from an empty graph) and Vamana (starts from a random graph) have simpler initializations than NSG, which requires an approximate k-NN graph to be built first

Vamana makes two passes over the dataset to improve graph quality
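To make the role of α concrete, here is a compact Python sketch of the Vamana primitives as I read them from the paper (illustrative, not the reference implementation; `graph` maps each point id to its out-neighbor list, `data` is the matrix of vectors):

```python
import numpy as np

def greedy_search(graph, data, start, q, k, L):
    """GreedySearch(s, q, k, L): returns (k nearest found, visited set)."""
    dist = lambda i: float(np.linalg.norm(data[i] - q))
    frontier = {start}                         # current candidate set
    visited = set()
    while frontier - visited:
        p = min(frontier - visited, key=dist)  # closest unvisited candidate
        visited.add(p)
        frontier |= set(graph[p])              # expand p's out-neighbors
        if len(frontier) > L:                  # keep only the L closest to q
            frontier = set(sorted(frontier, key=dist)[:L])
    return sorted(frontier, key=dist)[:k], visited

def robust_prune(graph, data, p, candidates, alpha, R):
    """RobustPrune(p, V, alpha, R): pick at most R diverse out-neighbors."""
    d = lambda a, b: float(np.linalg.norm(data[a] - data[b]))
    pool = (set(candidates) | set(graph[p])) - {p}
    graph[p] = []
    while pool:
        p_star = min(pool, key=lambda v: d(p, v))  # closest remaining candidate
        graph[p].append(p_star)
        if len(graph[p]) == R:
            break
        # Drop v if p_star already serves it "well enough"; alpha > 1 keeps
        # longer edges in the graph, which lowers its diameter.
        pool = {v for v in pool if alpha * d(p_star, v) > d(p, v)}

def vamana_pass(graph, data, medoid, L, alpha, R):
    """One Vamana pass: re-wire every point via search + prune."""
    for p in np.random.permutation(len(data)):
        _, visited = greedy_search(graph, data, medoid, data[p], 1, L)
        robust_prune(graph, data, p, visited, alpha, R)
        for v in list(graph[p]):               # add reverse edges, prune overflow
            graph[v].append(p)
            if len(graph[v]) > R:
                robust_prune(graph, data, v, set(graph[v]), alpha, R)
```

The full build initializes `graph` to a random R-regular graph and, if I read the paper correctly, runs `vamana_pass` twice: first with alpha = 1, then with the user's alpha > 1.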

### DiskANN System Design

#### Key questions to address

How do we build a graph over a billion points?

How do we do distance comparisons between the query point and points in our candidate list at search time, if we cannot even store the vector data in RAM?

#### Index construction algorithms

Addresses question 1 (building the graph):

Key idea: send each base point to multiple nearby centers to obtain overlapping clusters

Partition the dataset into k clusters using k-means

Assign each base point to the l closest centers (typically l = 2)

Build a Vamana index over the points assigned to each cluster

Merge the per-cluster graphs into a single graph by taking the union of edges

Why this works: the overlapping clusters provide enough connectivity for the search to succeed even when a query's nearest neighbors are split across multiple shards (see the sketch below)
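A sketch of these steps under stated assumptions (Python with scikit-learn's KMeans for the clustering step; `build_vamana` is a hypothetical stand-in that builds a single-shard graph and returns a dict from local id to neighbor list):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_merged_index(data, k, l=2):
    """Partition-and-merge build: overlapping shards, then union of edges."""
    centers = KMeans(n_clusters=k).fit(data).cluster_centers_
    # Each point goes to its l closest centers -> overlapping clusters.
    # (Dense n x k distance matrix is fine for a sketch; at billion scale
    # this assignment would be done in blocks.)
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k)
    assign = np.argsort(d2, axis=1)[:, :l]                         # (n, l)

    merged = {i: set() for i in range(len(data))}
    for c in range(k):
        shard = np.where((assign == c).any(axis=1))[0]  # global ids in shard c
        sub = build_vamana(data[shard])                 # hypothetical per-shard build
        for u_local, nbrs in sub.items():               # map local ids to global
            merged[shard[u_local]].update(shard[v] for v in nbrs)
    return merged
```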

**Note**: the graph index is built using full-precision coordinates, and the graph together with the full-precision vectors is stored on the SSD

#### Search algorithms

Addresses question 2 (distance comparisons): use PQ data at search time, storing the compressed vectors of all data points in memory (a sketch follows).
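A minimal sketch of how PQ keeps distance computations in RAM (illustrative Python; the subspace count m and codebook size 256 are example parameters, not values from the paper). Each point is stored as m one-byte codes, so for example m = 32 gives roughly 32 GB for a billion points.

```python
import numpy as np

def pq_distance(q, codes, codebooks):
    """Approximate ||q - x||^2 from x's PQ code via per-subspace lookups.

    codes:     (m,) uint8 code of one data point
    codebooks: (m, 256, d // m) subspace centroids
    """
    m = len(codes)
    sub_q = q.reshape(m, -1)                                  # split q by subspace
    # tables[j][c] = squared distance of q's j-th chunk to centroid c.
    # (In practice the tables are built once per query and reused for
    # every candidate, not recomputed per point as in this sketch.)
    tables = ((codebooks - sub_q[:, None, :]) ** 2).sum(-1)   # (m, 256)
    return tables[np.arange(m), codes].sum()
```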

**Beam Search**

The natural way to search is to run Algorithm 1 (greedy search), fetching neighborhood information from the SSD as needed. The compressed vectors guide which are the best vertices (and neighbors) to read from disk.

To reduce the number of round trips to the SSD (rather than fetching neighborhoods one at a time) without increasing compute (distance calculations) excessively, fetch the neighborhoods of a small number W (say 4 or 8) of the closest points in L \ V in one shot, and update L to be the top-L candidates in L along with all the neighbors retrieved in this step.

The beam width W strikes a balance between latency and throughput (sketched below).
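A sketch of this loop under stated assumptions (Python; `read_node` and `pq_dist` are stand-ins for the batched SSD fetch and the in-RAM PQ distance, not DiskANN's actual API):

```python
import numpy as np

def beam_search(q, start, L, W, k, read_node, pq_dist):
    """Beam search sketch: each iteration is one round trip of W reads.

    read_node(p) -> (neighbor ids, full-precision vector)  [one SSD read]
    pq_dist(p)   -> approximate distance from q via in-RAM PQ codes
    """
    frontier, visited, full_vecs = {start}, set(), {}
    while frontier - visited:
        beam = sorted(frontier - visited, key=pq_dist)[:W]  # W best unvisited
        for p in beam:          # one round trip: these reads go out together
            neighbors, vec = read_node(p)
            visited.add(p)
            full_vecs[p] = vec  # full precision piggybacks on the read
            frontier.update(neighbors)
        frontier = set(sorted(frontier, key=pq_dist)[:L])   # keep top-L
    # Re-rank everything fetched using exact distances before returning top-k.
    return sorted(visited,
                  key=lambda p: float(np.linalg.norm(full_vecs[p] - q)))[:k]
```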

**Other**

Cache frequently visited vertices in RAM, chosen either from a known query distribution or as all vertices within 3 to 4 hops of the starting point

Fetch and re-rank with the full-precision coordinates stored on the SSD. Since each point's full-precision vector is stored alongside its neighborhood, fetching it essentially piggybacks on the cost of expanding the neighborhood.
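One way to picture why this re-ranking is nearly free (my sketch; the actual DiskANN layout, e.g. sector alignment, may differ): store each node's full-precision vector and adjacency list contiguously, so the single random read that expands a neighborhood also returns the exact coordinates.

```python
import struct
import numpy as np

DIM, MAX_DEG = 128, 64                   # illustrative sizes
NODE_BYTES = 4 * DIM + 4 + 4 * MAX_DEG   # vector + degree + neighbor ids

def pack_node(vec: np.ndarray, neighbors: list) -> bytes:
    """Serialize one node: full-precision vector, then its adjacency list."""
    deg = min(len(neighbors), MAX_DEG)
    padded = list(neighbors[:deg]) + [0] * (MAX_DEG - deg)
    return (vec.astype(np.float32).tobytes()
            + struct.pack("<I", deg)
            + struct.pack(f"<{MAX_DEG}I", *padded))

def read_node(f, node_id: int):
    """One seek + one read returns both the adjacency list and the vector."""
    f.seek(node_id * NODE_BYTES)
    buf = f.read(NODE_BYTES)
    vec = np.frombuffer(buf, dtype=np.float32, count=DIM)
    (deg,) = struct.unpack_from("<I", buf, 4 * DIM)
    nbrs = struct.unpack_from(f"<{deg}I", buf, 4 * DIM + 4)
    return list(nbrs), vec
```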

### Metrics for success

High recall

Low query latency

### Result

#### Comparison with HNSW and NSG: in-memory search performance

Datasets: SIFT1M, GIST1M, DEEP1M

Vamana matches or outperforms the current best ANNS methods on both hundred- and thousand-dimensional datasets obtained from different sources

#### Comparison with HNSW and NSG: number of hops

Vamana is more suitable for SSD-based serving than other graph-based algorithms, as it makes 2-3x fewer hops for the search to converge on large datasets compared to HNSW and NSG

Hops = number of rounds of disk reads on the critical path of the search

#### Billion-scale datasets: one-shot Vamana vs. merged Vamana

Dataset: ANN_SIFT1B

Takeaways:

Partitioning and merging are fast and can be done directly on disk, so the entire build process consumes under 64 GB of main memory

The single (one-shot) index outperforms the merged index, which traverses more links to reach the same neighborhoods, thus increasing search latency

The merged index is still a good choice for billion-scale k-ANN indexing: it requires no more than 20% extra latency at the same target recall compared to the single index

#### Billion-scale datasets: DiskANN vs. IVF-based methods

IVFOADC+G+P

Uses inverted indexing and PQ to develop indices with a low memory footprint and serve queries with high throughput and good 1-recall@100

Not compared with FAISS: FAISS needs GPUs, which might not be available, and https://arxiv.org/abs/1802.02422 demonstrates superior recall over FAISS (??)

DiskANN matches their memory footprint and achieves significantly higher recall at the same latency

Thoughts

Single machine setting. Are these meaningful experimental comparisons in our case? It's a single node and it hasn't been compared with FAISS

Apply this ...
