# DiskANN: Fast accurate billion-point nearest neighbor search on a single node

https://papers.nips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf

### Intro & Problem Statement

KNN:

Given a dataset P, a target k, and a query point q, retrieve the k nearest neighbors of q from P quickly.
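For reference, a brute-force exact k-NN baseline (a minimal Python sketch with illustrative sizes): it scans all of P for every query, which is exactly what becomes infeasible at billion scale.

```python
import numpy as np

def exact_knn(P: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k nearest neighbors of q in P (L2 distance).

    Scans all n points: O(n * d) work per query, infeasible at billion scale.
    """
    dists = np.linalg.norm(P - q, axis=1)   # distance from q to every point
    return np.argsort(dists)[:k]            # indices of the k smallest

# Example: 10k random 128-dim points (illustrative sizes)
P = np.random.rand(10_000, 128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
print(exact_knn(P, q, k=5))
```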

Application domains: search over embeddings (e.g., learned representations of documents or images)

Curse of dimensionality: exact search in high dimensions is prohibitively expensive, which motivates approximate nearest neighbor (ANN) search

ANN:

Goal: maximize recall while retrieving the results as quickly as possible

Trade-off: **recall vs. latency (and/or indexing time)**

Algorithms:

k-d trees

Locality Sensitive Hashing (LSH)

Graph-based: HNSW, NSG

#### Indexing large datasets: two approaches

Approach 1: based on **Inverted Index + Data Compression**; includes methods such as FAISS and IVFOADC+G+P. The data is clustered, and the vectors are stored compressed (e.g., via PQ).

*Benefit*: small memory footprint, and low latency of retrieving results (can use an accelerator)

*Problem*: recall is low because the compression is lossy

Approach 2: divide the dataset into disjoint **shards** and build an in-memory index for each shard.

*Problem*: increased search latency and reduced throughput, since each query needs to be routed to several shards

### Motivation

SOTA **approximate nearest neighbor search (ANNS)** algorithms generate indices that must be stored in **main memory** for fast high-recall search. RAM is expensive, which limits the size of the dataset that can be served.

E.g., FAISS supports searching only from RAM

Search **throughput** of an SSD-resident index is limited by the number of random disk accesses per query, and **latency** is limited by the number of round-trips to the disk (each round-trip can consist of multiple reads).

SSD: takes **a few hundred microseconds** to serve a random read and can service about ~300K random reads per second. But search applications require mean latencies of **a few milliseconds** for KNN search.

**Main challenges for an SSD-resident index** are:

reducing the number of random SSD accesses per query to a few dozen

reducing the number of round-trip requests to disk to under ten, preferably five

Naively mapping an in-memory index onto an SSD does not work! (It generates several hundred disk reads per query.)
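A back-of-envelope check of these budgets (my illustrative numbers, assuming ~100 µs per random read and that reads within one round trip are issued in parallel):

```python
# Illustrative latency-budget arithmetic (not figures from the paper):
read_latency_us = 100    # roughly a few hundred us per random SSD read
naive_reads = 300        # sequential reads a naive greedy search might issue
round_trips = 5          # DiskANN's target number of round trips per query

# Naive mapping: every read sits on the critical path, one after another.
print(naive_reads * read_latency_us / 1000, "ms")   # 30.0 ms, far over budget

# DiskANN: only round trips are sequential; reads within one are parallel.
print(round_trips * read_latency_us / 1000, "ms")   # 0.5 ms, within a few ms
```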

### Contribution

DiskANN: an SSD-resident ANNS system based on a new graph-based indexing algorithm called Vamana

**DiskANN**: indexes and serves a 1B-point dataset (d in the 100s) using 64GB RAM, with low latency and high precision

**Vamana**: generates graph indices with smaller diameter to minimize the number of sequential disk reads

Used in memory, Vamana matches or outperforms SOTA algorithms like HNSW and NSG

Partition-and-merge: build indices on overlapping partitions and merge them, so datasets too large for RAM can still be indexed

Compression: keep PQ-compressed vectors cached in RAM to guide the search

Note: what's the contribution of the last two...?

### Vamana Graph Construction Algorithm

#### Comparison to HNSW and NSG

HNSW and NSG have no tunable parameter α (it is effectively fixed at 1). Vamana's ability to set α > 1 is the main factor in achieving a better trade-off between graph degree and diameter.

Pruning over candidate sets generated by greedy search helps Vamana and NSG add long-range edges directly, while HNSW instead obtains them through an additional step: constructing a hierarchy of graphs over a nested sequence of samples of the dataset

Initial graph: HNSW (starts from an empty graph) and Vamana (starts from a random graph) have simpler initializations than NSG, which requires an approximate k-NN graph to be built first

Vamana makes two passes over the dataset to improve graph quality
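To make the role of α concrete, here is a compact Python sketch of the Vamana primitives as I read them from the paper (illustrative, not the reference implementation; `graph` maps each point id to its out-neighbor list, `data` is the matrix of vectors):

```python
import numpy as np

def greedy_search(graph, data, start, q, k, L):
    """GreedySearch(s, q, k, L): returns (k nearest found, visited set)."""
    dist = lambda i: float(np.linalg.norm(data[i] - q))
    frontier = {start}                         # current candidate set
    visited = set()
    while frontier - visited:
        p = min(frontier - visited, key=dist)  # closest unvisited candidate
        visited.add(p)
        frontier |= set(graph[p])              # expand p's out-neighbors
        if len(frontier) > L:                  # keep only the L closest to q
            frontier = set(sorted(frontier, key=dist)[:L])
    return sorted(frontier, key=dist)[:k], visited

def robust_prune(graph, data, p, candidates, alpha, R):
    """RobustPrune(p, V, alpha, R): pick at most R diverse out-neighbors."""
    d = lambda a, b: float(np.linalg.norm(data[a] - data[b]))
    pool = (set(candidates) | set(graph[p])) - {p}
    graph[p] = []
    while pool:
        p_star = min(pool, key=lambda v: d(p, v))  # closest remaining candidate
        graph[p].append(p_star)
        if len(graph[p]) == R:
            break
        # Drop v if p_star already serves it "well enough"; alpha > 1 keeps
        # longer edges in the graph, which lowers its diameter.
        pool = {v for v in pool if alpha * d(p_star, v) > d(p, v)}

def vamana_pass(graph, data, medoid, L, alpha, R):
    """One Vamana pass: re-wire every point via search + prune."""
    for p in np.random.permutation(len(data)):
        _, visited = greedy_search(graph, data, medoid, data[p], 1, L)
        robust_prune(graph, data, p, visited, alpha, R)
        for v in list(graph[p]):               # add reverse edges, prune overflow
            graph[v].append(p)
            if len(graph[v]) > R:
                robust_prune(graph, data, v, set(graph[v]), alpha, R)
```

The full build initializes `graph` to a random R-regular graph and, if I read the paper correctly, runs `vamana_pass` twice: first with alpha = 1, then with the user's alpha > 1.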

### DiskANN System Design

#### Key questions to address

How do we build a graph over a billion points?

How do we do distance comparisons between the query point and points in our candidate list at search time, if we cannot even store the vector data in RAM?

#### Index construction algorithms

Addresses question 1 (building the graph):

Key idea: send each base point to multiple nearby centers to obtain overlapping clusters

Partition the dataset into k clusters using k-means

Assign each base point to the l closest centers (typically l = 2)

Build a Vamana index over the points assigned to each cluster

Merge the per-cluster graphs into a single graph by taking the union of edges

Why this works: the overlapping clusters provide enough connectivity for the search to succeed even when a query's nearest neighbors are split across multiple shards (see the sketch below)
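A sketch of these steps under stated assumptions (Python with scikit-learn's KMeans for the clustering step; `build_vamana` is a hypothetical stand-in that builds a single-shard graph and returns a dict from local id to neighbor list):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_merged_index(data, k, l=2):
    """Partition-and-merge build: overlapping shards, then union of edges."""
    centers = KMeans(n_clusters=k).fit(data).cluster_centers_
    # Each point goes to its l closest centers -> overlapping clusters.
    # (Dense n x k distance matrix is fine for a sketch; at billion scale
    # this assignment would be done in blocks.)
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k)
    assign = np.argsort(d2, axis=1)[:, :l]                         # (n, l)

    merged = {i: set() for i in range(len(data))}
    for c in range(k):
        shard = np.where((assign == c).any(axis=1))[0]  # global ids in shard c
        sub = build_vamana(data[shard])                 # hypothetical per-shard build
        for u_local, nbrs in sub.items():               # map local ids to global
            merged[shard[u_local]].update(shard[v] for v in nbrs)
    return merged
```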

**Note**: the graph index is built using full-precision coordinates, and the graph together with the full-precision vectors is stored on the SSD

#### Search algorithms

Addresses question 2 (distance comparisons): use PQ data at search time, storing the compressed vectors of all data points in memory (a sketch follows).
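A minimal sketch of how PQ keeps distance computations in RAM (illustrative Python; the subspace count m and codebook size 256 are example parameters, not values from the paper). Each point is stored as m one-byte codes, so for example m = 32 gives roughly 32 GB for a billion points.

```python
import numpy as np

def pq_distance(q, codes, codebooks):
    """Approximate ||q - x||^2 from x's PQ code via per-subspace lookups.

    codes:     (m,) uint8 code of one data point
    codebooks: (m, 256, d // m) subspace centroids
    """
    m = len(codes)
    sub_q = q.reshape(m, -1)                                  # split q by subspace
    # tables[j][c] = squared distance of q's j-th chunk to centroid c.
    # (In practice the tables are built once per query and reused for
    # every candidate, not recomputed per point as in this sketch.)
    tables = ((codebooks - sub_q[:, None, :]) ** 2).sum(-1)   # (m, 256)
    return tables[np.arange(m), codes].sum()
```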

**Beam Search**

The natural way to search is to run Algorithm 1 (greedy search), fetching neighborhood information from the SSD as needed. The compressed vectors guide which are the best vertices (and neighbors) to read from disk.

To reduce the number of round trips to the SSD (rather than fetching neighborhoods one at a time) without increasing compute (distance calculations) excessively, fetch the neighborhoods of a small number W (say 4 or 8) of the closest points in L \ V in one shot, and update L to be the top-L candidates in L along with all the neighbors retrieved in this step.

The beam width W strikes a balance between latency and throughput (sketched below).
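A sketch of this loop under stated assumptions (Python; `read_node` and `pq_dist` are stand-ins for the batched SSD fetch and the in-RAM PQ distance, not DiskANN's actual API):

```python
import numpy as np

def beam_search(q, start, L, W, k, read_node, pq_dist):
    """Beam search sketch: each iteration is one round trip of W reads.

    read_node(p) -> (neighbor ids, full-precision vector)  [one SSD read]
    pq_dist(p)   -> approximate distance from q via in-RAM PQ codes
    """
    frontier, visited, full_vecs = {start}, set(), {}
    while frontier - visited:
        beam = sorted(frontier - visited, key=pq_dist)[:W]  # W best unvisited
        for p in beam:          # one round trip: these reads go out together
            neighbors, vec = read_node(p)
            visited.add(p)
            full_vecs[p] = vec  # full precision piggybacks on the read
            frontier.update(neighbors)
        frontier = set(sorted(frontier, key=pq_dist)[:L])   # keep top-L
    # Re-rank everything fetched using exact distances before returning top-k.
    return sorted(visited,
                  key=lambda p: float(np.linalg.norm(full_vecs[p] - q)))[:k]
```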

**Other**

Cache frequently visited vertices in RAM, chosen either from a known query distribution or as all vertices within 3 to 4 hops of the starting point

Fetch and re-rank with the full-precision coordinates stored on the SSD. Since each point's full-precision vector is stored alongside its neighborhood, fetching it essentially piggybacks on the cost of expanding the neighborhood.
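One way to picture why this re-ranking is nearly free (my sketch; the actual DiskANN layout, e.g. sector alignment, may differ): store each node's full-precision vector and adjacency list contiguously, so the single random read that expands a neighborhood also returns the exact coordinates.

```python
import struct
import numpy as np

DIM, MAX_DEG = 128, 64                   # illustrative sizes
NODE_BYTES = 4 * DIM + 4 + 4 * MAX_DEG   # vector + degree + neighbor ids

def pack_node(vec: np.ndarray, neighbors: list) -> bytes:
    """Serialize one node: full-precision vector, then its adjacency list."""
    deg = min(len(neighbors), MAX_DEG)
    padded = list(neighbors[:deg]) + [0] * (MAX_DEG - deg)
    return (vec.astype(np.float32).tobytes()
            + struct.pack("<I", deg)
            + struct.pack(f"<{MAX_DEG}I", *padded))

def read_node(f, node_id: int):
    """One seek + one read returns both the adjacency list and the vector."""
    f.seek(node_id * NODE_BYTES)
    buf = f.read(NODE_BYTES)
    vec = np.frombuffer(buf, dtype=np.float32, count=DIM)
    (deg,) = struct.unpack_from("<I", buf, 4 * DIM)
    nbrs = struct.unpack_from(f"<{deg}I", buf, 4 * DIM + 4)
    return list(nbrs), vec
```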

### Metrics for success

High recall

Low query latency

### Result

#### Comparison with HNSW and NSG: in-memory search performance

Datasets: SIFT1M, GIST1M, DEEP1M

Vamana matches or outperforms the current best ANNS methods on both hundred- and thousand-dimensional datasets obtained from different sources

#### Comparison with HNSW and NSG: number of hops

Vamana is more suitable for SSD-based serving than other graph-based algorithms, as it makes 2-3x fewer hops for the search to converge on large datasets compared to HNSW and NSG

Hops = number of rounds of disk reads on the critical path of the search

#### Billion-scale datasets: one-shot Vamana vs. merged Vamana

Dataset: ANN_SIFT1B

Takeaways:

Partitioning and merging are fast and can be done directly on disk, so the entire build process consumes under 64 GB of main memory

The single (one-shot) index outperforms the merged index, which traverses more links to reach the same neighborhoods, thus increasing search latency

The merged index is still a good choice for billion-scale k-ANN indexing: it requires no more than 20% extra latency at the same target recall compared to the single index

#### Billion-scale datasets: DiskANN vs. IVF-based methods

IVFOADC+G+P

Uses inverted indexing and PQ to develop indices with a low memory footprint and serve queries with high throughput and good 1-recall@100

Not compared with FAISS: FAISS needs GPUs, which might not be available, and https://arxiv.org/abs/1802.02422 demonstrates superior recall over FAISS (??)

DiskANN matches their memory footprint and achieves significantly higher recall at the same latency

Thoughts

Single machine setting. Are these meaningful experimental comparisons in our case? It's a single node and it hasn't been compared with FAISS

Apply this ...
