# Milvus: A Purpose-Built Vector Data Management System

### Abstract&#x20;

* &#x20;Limitations for existing management systems & algorithms for high-dim vector data:
  * Performance issue when handling large-scale and dynamic vector data&#x20;
  * Limited functionalities that cannot meet the requirements of versatile applications&#x20;
* This paper: Milvus - purpose-built data management system to efficiently store and search large-scale vector data for DS and AI applications&#x20;
  * easy-to-use application interfaces (SDKs, RESTful APIs)&#x20;
  * optimize for heterogeneous computing platform with modern CPU / GPU&#x20;
  * enable advanced query processing&#x20;
  * handle dynamic data (fast update)&#x20;
  * distribute data across nodes for scalability and availability

### Motivation &#x20;

* Trend: growth of unstructured data and ML that can transform data into vectors for analytics&#x20;
* Unique requirements&#x20;
  * Fast query processing on large-scale vector data&#x20;
  * Efficient handling of dynamic vector data&#x20;
  * Advanced query processing&#x20;
    * E.x. attribute filtering, multi-vector query processing
* Existing solutions&#x20;
  * **Poor performance**: large-scale, dynamic data&#x20;
  * **Limited functionalities**: for advanced query processing&#x20;
  * Two categories
    * *Algorithms*&#x20;
      * Open-source implementation libraries: Faiss, SPTAG&#x20;
      * Limitations&#x20;
        * Not full-fledged system that manages vector data (make assumption on in-memory storage, cannot span multiple machines)&#x20;
        * Cannot easily handle dynamic data while ensuring fast search&#x20;
        * Do not support advanced query processing&#x20;
        * Not optimized for heterogeneous computing (with CPU & GPU)&#x20;
    * *Systems*&#x20;
      * E.x. Alibaba AnalyticDB-V, Alibaba PASE&#x20;
      * Limitations&#x20;
        * Legacy database components (e.g. optimizer and storage engine) prevent fine-tuned optimizations for vectors&#x20;
        * Do not support advanced query processing&#x20;
      * E.x. Vearch&#x20;
        * Not efficient on large-scale data and multi-vector query processing&#x20;

### System Design&#x20;

* Query engine&#x20;
  * Types: vector query, attribute filtering, multi-vector query&#x20;
* Indexing&#x20;
  * Quantization based: IVF (Flat, SQ, PQ)
  * Graph based: HNSW, RNSG&#x20;
* Dynamic data management&#x20;
  * LSM-tree&#x20;
  * How it works:&#x20;
    * Inserted entities stored as MemTable
    * Accumulated size > threshold, immutable, and flush as a new segment&#x20;
    * Smaller segments merged into larger ones for fast sequential access&#x20;
  * Build indexes only for large segments by default&#x20;
  * Segment: basic unit of searching, scheduling, and buffering&#x20;
  * Snapshot isolation&#x20;
* Storage management&#x20;
  * Columnar fashion&#x20;
  * Vector, attribute storage, bufferpool (caching unit is a segment, LRU-based), multi-storage&#x20;

### Advanced Query Processing&#x20;

#### Attribute Filtering&#x20;

* Applications: finding similar houses (vector data) whose sizes are within a specific range (non-vector data)&#x20;

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-Mf_ty1BwSjI_TrXBoAt%2F-Mfaaqlu7CrpeUt_rwBc%2Fimage.png?alt=media\&token=8cf59c29-92fc-4f6c-8fe8-9c8ec6e665fc)

* Strategy A: attribute-first-vector-full-scan&#x20;
  * Use attribute constraint to obtain relevant entities via index search&#x20;
    * Binary search if data fits in Mem&#x20;
    * O/W, use skip pointers for fast search&#x20;
  * All entities scanned to obtain top-k results&#x20;
  * Outcome: produce exact result&#x20;
  * Limitations: suitable when constraint is highly selective.
* Strategy B: attribute-first-vector-search&#x20;
  * After obtaining relevant entities according to attribute constraint, it produces a bitmap of resultant entity IDs
  * Conduct normal query processing, checks bitmaps whenever a vector is encountered&#x20;
  * Only vectors that pass bitmap passing are included in top-k results&#x20;
  * Limitations: suitable when attribute constraint or normal vector query constraint are moderately selective&#x20;
* Strategy C: vector-search-attribute-full-scan&#x20;
* Strategy D: cost-based
  * Estimate cost of A, B, C, picks up the one with the least cost&#x20;
* Strategy E: partition-based&#x20;
  * partitions the dataset based on the frequently searched attribute and applies the cost-based approach (D) for each partition
  * Offline partition, # of partitions controlled by users&#x20;

#### Multi-vector Queries&#x20;

* Vector fusion&#x20;
* Iterative merging&#x20;
