Milvus: A Purpose-Built Vector Data Management System
https://www.cs.purdue.edu/homes/csjgwang/pubs/SIGMOD21_Milvus.pdf
Abstract
Limitations of existing management systems & algorithms for high-dim vector data:
Performance issues when handling large-scale and dynamic vector data
Limited functionalities that cannot meet the requirements of versatile applications
This paper: Milvus - purpose-built data management system to efficiently store and search large-scale vector data for DS and AI applications
easy-to-use application interfaces (SDKs, RESTful APIs); see the SDK sketch after this list
optimize for heterogeneous computing platforms with modern CPUs / GPUs
enable advanced query processing
handle dynamic data (fast update)
distribute data across nodes for scalability and availability
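A minimal sketch of what the SDK interface looks like from the application side, using the present-day pymilvus 2.x client (which post-dates the paper's version of Milvus); the collection, field, and index names are illustrative only.

```python
# Minimal sketch: inserting and searching vectors through the Python SDK.
# Assumes a Milvus server at localhost:19530 and pymilvus 2.x; names are illustrative.
import numpy as np
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("size_sqft", DataType.INT64),              # scalar attribute
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=128),
])
houses = Collection("demo_houses", schema)

ids = list(range(10_000))
sizes = np.random.randint(500, 5000, 10_000).tolist()
vecs = np.random.rand(10_000, 128).tolist()
houses.insert([ids, sizes, vecs])                           # column-ordered insert

houses.create_index("embedding", {
    "index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128},
})
houses.load()

# Top-10 nearest neighbours with an attribute filter (advanced query processing).
res = houses.search(
    data=[np.random.rand(128).tolist()], anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10, expr="size_sqft > 1000 and size_sqft < 2000",
)
```

The `expr` argument illustrates the attribute-filtering functionality discussed further down in these notes.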
Motivation
Trend: growth of unstructured data and ML that can transform data into vectors for analytics
Unique requirements
Fast query processing on large-scale vector data
Efficient handling of dynamic vector data
Advanced query processing
e.g. attribute filtering, multi-vector query processing
Existing solutions
Poor performance on large-scale, dynamic data
Limited functionality for advanced query processing
Two categories
Algorithms
Open-source libraries: Faiss, SPTAG
Limitations
Not full-fledged systems for managing vector data (assume in-memory storage, cannot span multiple machines)
Cannot easily handle dynamic data while ensuring fast search
Do not support advanced query processing
Not optimized for heterogeneous computing (with CPU & GPU)
Systems
e.g. Alibaba AnalyticDB-V, Alibaba PASE
Limitations
Legacy database components (e.g. optimizer and storage engine) prevent fine-tuned optimizations for vectors
Do not support advanced query processing
e.g. Vearch
Not efficient on large-scale data or for multi-vector query processing
System Design
Query engine
Types: vector query, attribute filtering, multi-vector query
Indexing
Quantization-based: IVF (Flat, SQ, PQ)
Graph-based: HNSW, RNSG
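The IVF family partitions vectors into nlist clusters and probes only a few clusters per query. A minimal IVF_FLAT sketch using the Faiss library (mentioned above among the existing solutions), not Milvus's internal index code:

```python
# Minimal IVF_FLAT sketch with Faiss: base vectors are partitioned into nlist
# clusters; a query scans only nprobe of them, trading recall for speed.
import numpy as np
import faiss

d, nlist = 128, 1024
xb = np.random.rand(100_000, d).astype("float32")    # base vectors
xq = np.random.rand(5, d).astype("float32")           # query vectors

quantizer = faiss.IndexFlatL2(d)                       # coarse quantizer (centroids)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                        # k-means over the base vectors
index.add(xb)

index.nprobe = 16                                      # clusters scanned per query
distances, ids = index.search(xq, 10)                  # top-10 neighbours per query
```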
Dynamic data management
LSM-tree
How it works (a toy sketch follows this block):
Inserted entities are first accumulated in a MemTable
Once the accumulated size exceeds a threshold, the MemTable becomes immutable and is flushed as a new segment
Smaller segments merged into larger ones for fast sequential access
Build indexes only for large segments by default
Segment: basic unit of searching, scheduling, and buffering
Snapshot isolation
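A toy sketch of the LSM-style flow described above: inserts accumulate in a MemTable, a full MemTable is flushed as an immutable segment, and small segments are merged into larger ones. Class names and thresholds are illustrative, not Milvus code.

```python
# Toy LSM-style flow: MemTable -> flush on a size threshold -> merge small segments.
FLUSH_THRESHOLD = 4      # entities per MemTable (tiny, for illustration)
MERGE_THRESHOLD = 2      # merge once this many flushed segments accumulate

class VectorLSM:
    def __init__(self):
        self.memtable = []       # mutable, receives inserts
        self.segments = []       # immutable, searchable segments

    def insert(self, entity):
        self.memtable.append(entity)
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self._flush()

    def _flush(self):
        self.segments.append(tuple(self.memtable))   # becomes immutable
        self.memtable = []
        if len(self.segments) >= MERGE_THRESHOLD:
            self._merge()

    def _merge(self):
        # Merge small segments into one larger segment for sequential access;
        # a real system would also build a vector index on large segments.
        merged = tuple(e for seg in self.segments for e in seg)
        self.segments = [merged]

store = VectorLSM()
for i in range(10):
    store.insert({"id": i, "vec": [0.0] * 4})
print(len(store.segments), len(store.memtable))
```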
Storage management
Columnar fashion
Vector storage, attribute storage, buffer pool (caching unit is a segment, LRU-based), multi-storage
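A minimal sketch of a segment-granularity LRU buffer pool as described above; `load_segment` is a hypothetical placeholder for reading a segment from the underlying storage.

```python
# Minimal sketch of a buffer pool whose caching unit is a whole segment (LRU eviction).
from collections import OrderedDict

class SegmentBufferPool:
    def __init__(self, capacity, load_segment):
        self.capacity = capacity
        self.load_segment = load_segment
        self.cache = OrderedDict()                  # segment_id -> segment data

    def get(self, segment_id):
        if segment_id in self.cache:
            self.cache.move_to_end(segment_id)      # mark as most recently used
            return self.cache[segment_id]
        segment = self.load_segment(segment_id)     # cache miss: load from storage
        self.cache[segment_id] = segment
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)          # evict least recently used
        return segment

pool = SegmentBufferPool(capacity=2, load_segment=lambda sid: f"segment-{sid}")
pool.get(1); pool.get(2); pool.get(3)               # segment 1 gets evicted
```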
Advanced Query Processing
Attribute Filtering
Example application: finding similar houses (vector data) whose sizes are within a specific range (non-vector attribute)
Strategy A: attribute-first-vector-full-scan
Use attribute constraint to obtain relevant entities via index search
Binary search if data fits in memory
Otherwise, use skip pointers for fast search
All qualifying entities' vectors are then scanned to obtain the top-k results
Outcome: produces exact results
Limitation: only suitable when the attribute constraint is highly selective
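A toy sketch of Strategy A over in-memory NumPy arrays: binary search on a sorted attribute column selects the qualifying entity IDs, then an exact scan over only those vectors produces the top-k. Purely illustrative; a real system would keep the attribute column pre-sorted or indexed.

```python
# Toy Strategy A: attribute filter first (binary search on a sorted attribute
# column), then a full scan over the qualifying vectors for the exact top-k.
import numpy as np

def attribute_first_vector_full_scan(vectors, attrs, lo, hi, query, k):
    order = np.argsort(attrs)                       # sorted attribute column
    sorted_attrs = attrs[order]
    # Binary search for the attribute range [lo, hi] (works when data fits in memory).
    start = np.searchsorted(sorted_attrs, lo, side="left")
    end = np.searchsorted(sorted_attrs, hi, side="right")
    candidates = order[start:end]                   # entity IDs passing the constraint
    if candidates.size == 0:
        return np.array([], dtype=int)
    # Exact top-k by scanning every qualifying vector.
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

vectors = np.random.rand(10_000, 64).astype("float32")
sizes = np.random.randint(500, 5000, 10_000)
top = attribute_first_vector_full_scan(vectors, sizes, 1000, 2000,
                                       np.random.rand(64).astype("float32"), k=10)
```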
Strategy B: attribute-first-vector-search
After obtaining the relevant entities according to the attribute constraint, produce a bitmap of the qualifying entity IDs
Conduct normal vector query processing, checking the bitmap whenever a vector is encountered
Only vectors that pass the bitmap check are included in the top-k results
Limitation: suitable when the attribute constraint or the vector query constraint is moderately selective
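A toy sketch of Strategy B: the attribute constraint becomes a bitmap of qualifying IDs, and a (toy, hand-rolled) IVF-style search checks the bitmap for every vector it encounters before letting it into the top-k heap. Illustrative only, not Milvus internals.

```python
# Toy Strategy B: bitmap from the attribute constraint + per-candidate bitmap check
# during a simple IVF-style search (probe the nprobe closest clusters).
import heapq
import numpy as np

def attribute_first_vector_search(vectors, attrs, lo, hi, centroids, assignments,
                                  query, k, nprobe):
    bitmap = (attrs >= lo) & (attrs <= hi)          # bitmap of qualifying entity IDs
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    heap = []                                       # max-heap of (-dist, id), size <= k
    for cluster in probe:
        for idx in np.flatnonzero(assignments == cluster):
            if not bitmap[idx]:                     # bitmap check per encountered vector
                continue
            d = np.linalg.norm(vectors[idx] - query)
            if len(heap) < k:
                heapq.heappush(heap, (-d, idx))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, idx))
    return [idx for _, idx in sorted(heap, reverse=True)]   # nearest first

vectors = np.random.rand(5_000, 16).astype("float32")
attrs = np.random.randint(0, 100, 5_000)
centroids = vectors[np.random.choice(5_000, 32, replace=False)]
assignments = np.argmin(((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
hits = attribute_first_vector_search(vectors, attrs, 20, 60, centroids, assignments,
                                     np.random.rand(16).astype("float32"), k=10, nprobe=4)
```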
Strategy C: vector-search-attribute-full-scan
Perform the vector similarity search first, then fully scan the attributes of the resulting candidates to enforce the constraint
Strategy D: cost-based
Estimates the costs of A, B, and C and picks the one with the lowest cost (a rough sketch follows the strategy list)
Strategy E: partition-based
Partitions the dataset based on the frequently searched attribute and applies the cost-based approach (D) within each partition
Partitioning is done offline; the number of partitions is controlled by users
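A rough sketch of the cost-based idea in Strategy D: estimate the cost of plans A, B, and C from the selectivity of the attribute constraint and pick the cheapest. The cost formulas below are simplistic placeholders, not the model from the paper.

```python
# Rough cost-based plan selection (Strategy D). The formulas are toy placeholders
# meant only to show the shape of the decision, not the paper's cost model.
def pick_strategy(n, selectivity, index_search_cost, dist_cost):
    passing = selectivity * n
    costs = {
        # A: attribute lookup, then exact distance computation on every qualifying vector.
        "A_attr_first_full_scan": passing * dist_cost,
        # B: attribute lookup to build a bitmap, then one index-based vector search.
        "B_attr_first_vector_search": passing * 0.01 + index_search_cost,
        # C: vector search with an enlarged k to survive post-filtering, then attribute scan.
        "C_vector_search_attr_scan": index_search_cost / max(selectivity, 1e-6),
    }
    best = min(costs, key=costs.get)
    return best, costs

best, costs = pick_strategy(n=1_000_000, selectivity=0.001,
                            index_search_cost=5_000, dist_cost=1.0)
print(best, costs)
```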
Multi-vector Queries
Vector fusion
Iterative merging
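One of the two approaches above, vector fusion, applies when the aggregate similarity decomposes over the vector fields (e.g. a weighted sum of inner products): the weights can be folded into the query and the per-field vectors concatenated, so a single top-k search suffices. A toy sketch (field names and weights are illustrative):

```python
# Toy vector fusion: score = 0.7*<q_img, v_img> + 0.3*<q_txt, v_txt>
#                          = <concat(0.7*q_img, 0.3*q_txt), concat(v_img, v_txt)>,
# so one top-k inner-product search over the fused vectors gives the same ranking.
import numpy as np

def fuse_entities(field_vectors):
    # field_vectors: list of (n, d_i) arrays, one per vector field of the entity.
    return np.concatenate(field_vectors, axis=1)             # (n, sum_i d_i)

def fuse_query(field_queries, weights):
    return np.concatenate([w * q for w, q in zip(weights, field_queries)])

def top_k_inner_product(fused_db, fused_query, k):
    scores = fused_db @ fused_query
    return np.argsort(-scores)[:k]

n = 10_000
img_vecs, txt_vecs = np.random.rand(n, 64), np.random.rand(n, 32)
db = fuse_entities([img_vecs, txt_vecs])
q = fuse_query([np.random.rand(64), np.random.rand(32)], weights=[0.7, 0.3])
print(top_k_inner_product(db, q, k=10))
```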