Milvus: A Purpose-Built Vector Data Management System

https://www.cs.purdue.edu/homes/csjgwang/pubs/SIGMOD21_Milvus.pdf

Abstract

  • Limitations for existing management systems & algorithms for high-dim vector data:

    • Performance issue when handling large-scale and dynamic vector data

    • Limited functionalities that cannot meet the requirements of versatile applications

  • This paper: Milvus - purpose-built data management system to efficiently store and search large-scale vector data for DS and AI applications

    • easy-to-use application interfaces (SDKs, RESTful APIs)

    • optimize for heterogeneous computing platform with modern CPU / GPU

    • enable advanced query processing

    • handle dynamic data (fast update)

    • distribute data across nodes for scalability and availability

Motivation

  • Trend: growth of unstructured data and ML that can transform data into vectors for analytics

  • Unique requirements

    • Fast query processing on large-scale vector data

    • Efficient handling of dynamic vector data

    • Advanced query processing

      • E.x. attribute filtering, multi-vector query processing

  • Existing solutions

    • Poor performance: large-scale, dynamic data

    • Limited functionalities: for advanced query processing

    • Two categories

      • Algorithms

        • Open-source implementation libraries: Faiss, SPTAG

        • Limitations

          • Not full-fledged system that manages vector data (make assumption on in-memory storage, cannot span multiple machines)

          • Cannot easily handle dynamic data while ensuring fast search

          • Do not support advanced query processing

          • Not optimized for heterogeneous computing (with CPU & GPU)

      • Systems

        • E.x. Alibaba AnalyticDB-V, Alibaba PASE

        • Limitations

          • Legacy database components (e.g. optimizer and storage engine) prevent fine-tuned optimizations for vectors

          • Do not support advanced query processing

        • E.x. Vearch

          • Not efficient on large-scale data and multi-vector query processing

System Design

  • Query engine

    • Types: vector query, attribute filtering, multi-vector query

  • Indexing

    • Quantization based: IVF (Flat, SQ, PQ)

    • Graph based: HNSW, RNSG

  • Dynamic data management

    • LSM-tree

    • How it works:

      • Inserted entities stored as MemTable

      • Accumulated size > threshold, immutable, and flush as a new segment

      • Smaller segments merged into larger ones for fast sequential access

    • Build indexes only for large segments by default

    • Segment: basic unit of searching, scheduling, and buffering

    • Snapshot isolation

  • Storage management

    • Columnar fashion

    • Vector, attribute storage, bufferpool (caching unit is a segment, LRU-based), multi-storage

Advanced Query Processing

Attribute Filtering

  • Applications: finding similar houses (vector data) whose sizes are within a specific range (non-vector data)

  • Strategy A: attribute-first-vector-full-scan

    • Use attribute constraint to obtain relevant entities via index search

      • Binary search if data fits in Mem

      • O/W, use skip pointers for fast search

    • All entities scanned to obtain top-k results

    • Outcome: produce exact result

    • Limitations: suitable when constraint is highly selective.

  • Strategy B: attribute-first-vector-search

    • After obtaining relevant entities according to attribute constraint, it produces a bitmap of resultant entity IDs

    • Conduct normal query processing, checks bitmaps whenever a vector is encountered

    • Only vectors that pass bitmap passing are included in top-k results

    • Limitations: suitable when attribute constraint or normal vector query constraint are moderately selective

  • Strategy C: vector-search-attribute-full-scan

  • Strategy D: cost-based

    • Estimate cost of A, B, C, picks up the one with the least cost

  • Strategy E: partition-based

    • partitions the dataset based on the frequently searched attribute and applies the cost-based approach (D) for each partition

    • Offline partition, # of partitions controlled by users

Multi-vector Queries

  • Vector fusion

  • Iterative merging

Last updated