Milvus: A Purpose-Built Vector Data Management System

https://www.cs.purdue.edu/homes/csjgwang/pubs/SIGMOD21_Milvus.pdf

Limitations for existing management systems & algorithms for high-dim vector data:
- Performance issue when handling large-scale and dynamic vector data
- Limited functionalities that cannot meet the requirements of versatile applications
This paper: Milvus - purpose-built data management system to efficiently store and search large-scale vector data for DS and AI applications
- easy-to-use application interfaces (SDKs, RESTful APIs)
- optimize for heterogeneous computing platform with modern CPU / GPU
- enable advanced query processing
- handle dynamic data (fast update)
- distribute data across nodes for scalability and availability

Trend: growth of unstructured data and ML that can transform data into vectors for analytics
Unique requirements
- Fast query processing on large-scale vector data
- Efficient handling of dynamic vector data
- Advanced query processing
  - E.x. attribute filtering, multi-vector query processing
Existing solutions
- Poor performance: large-scale, dynamic data
- Limited functionalities: for advanced query processing
- Two categories
  - Algorithms
    Open-source implementation libraries: Faiss, SPTAG
    Limitations
    Not full-fledged system that manages vector data (make assumption on in-memory storage, cannot span multiple machines)
    Cannot easily handle dynamic data while ensuring fast search
    Do not support advanced query processing
    Not optimized for heterogeneous computing (with CPU & GPU)
  - Systems
    E.x. Alibaba AnalyticDB-V, Alibaba PASE
    Limitations
    Legacy database components (e.g. optimizer and storage engine) prevent fine-tuned optimizations for vectors
    Do not support advanced query processing
    E.x. Vearch
    Not efficient on large-scale data and multi-vector query processing

Query engine
- Types: vector query, attribute filtering, multi-vector query
Indexing
- Quantization based: IVF (Flat, SQ, PQ)
- Graph based: HNSW, RNSG
Dynamic data management
- LSM-tree
- How it works:
  - Inserted entities stored as MemTable
  - Accumulated size > threshold, immutable, and flush as a new segment
  - Smaller segments merged into larger ones for fast sequential access
- Build indexes only for large segments by default
- Segment: basic unit of searching, scheduling, and buffering
- Snapshot isolation
Storage management
- Columnar fashion
- Vector, attribute storage, bufferpool (caching unit is a segment, LRU-based), multi-storage

Applications: finding similar houses (vector data) whose sizes are within a specific range (non-vector data)

Strategy A: attribute-first-vector-full-scan
- Use attribute constraint to obtain relevant entities via index search
  - Binary search if data fits in Mem
  - O/W, use skip pointers for fast search
- All entities scanned to obtain top-k results
- Outcome: produce exact result
- Limitations: suitable when constraint is highly selective.
Strategy B: attribute-first-vector-search
- After obtaining relevant entities according to attribute constraint, it produces a bitmap of resultant entity IDs
- Conduct normal query processing, checks bitmaps whenever a vector is encountered
- Only vectors that pass bitmap passing are included in top-k results
- Limitations: suitable when attribute constraint or normal vector query constraint are moderately selective
Strategy C: vector-search-attribute-full-scan
Strategy D: cost-based
- Estimate cost of A, B, C, picks up the one with the least cost
Strategy E: partition-based
- partitions the dataset based on the frequently searched attribute and applies the cost-based approach (D) for each partition
- Offline partition, # of partitions controlled by users

Last updated 4 years ago

Was this helpful?