# Query2box: Reasoning over Knowledge Graphs in Vector Space using Box Embeddings

### Abstract&#x20;

* Task: answering **complex** **logical queries** on large-scale incomplete knowledge graphs (KGs)&#x20;
* Motivation:&#x20;
  * Prior work models queries as single points in the vector space: problematic because a complex query represents a large set of its answer entities&#x20;
  * Prior work only handles queries with conjunctions and existential quantifiers, not disjunctions&#x20;
* Key insight: embed queries as boxes --> a set of points inside the box corresponds to a set of answer entities of the query&#x20;
* Able to: handles queries that use conjunctions, existential quantifiers, and logical disjunctions in massive and incomplete KGs&#x20;

### Introduction&#x20;

* First-order logical queries: can be represented as DAGs&#x20;
  * Con&#x20;
    * *Limited scalability*: computational complexity of subgraph matching (exp in query size)
    * *Correctness*: cannot correctly answer queries with missing relations&#x20;
      * Imputation: make the graph denser, and limit scalability&#x20;
* Alternative: embed logical queries and KG entities to low-dim vector space&#x20;
  * Pro&#x20;
    * Robustly handle missing relations&#x20;
    * Orders of magnitude faster&#x20;
  * Con: single point in vector space
    * Answer logical query requires modeling a set of active entities while traversing the KG&#x20;
    * Unnatural to define logical operators of points in the vector space&#x20;
    * Can only handle conjunctive queries&#x20;
* This paper: embedding-based framework for reasoning over KGs that is capable of handling **EPFO logical queries** in **scalable** manner&#x20;
  * Model a set of entities using a closed region instead of a single point (i.e. box)
    * Box naturally model sets of entities they enclose&#x20;
    * Logical operators (e.g. set intersection) can be similarly defined over boxes&#x20;
    * Operations are closed&#x20;
  * EPFO --> DNF&#x20;
    * Represent any EPFO query as a set of individual boxes, where each box is obtained for each conjunctive query in the DNF&#x20;
    * Return NN to any of the boxes as the answers to the query&#x20;
    * Then take the union of the answer entities&#x20;

### Query2Box

* Objective function&#x20;
  * Learn embeddings of entities in the KG
  * Learn parameterized geometric logical operators over boxes&#x20;
* Query answer process: arbitrary EPFO query q&#x20;
  * Identify its computation graph
  * embed the query by executing a set of geometric operators over boxes&#x20;
  * entities that are enclosed in the final box embedding are returned as answers to the query&#x20;

![Conjunctive queries: Where did Canadian citizens with Turing Award Graduate?](https://lh6.googleusercontent.com/QkUBr4kAHYOd4qZjiIUgWL_SmkvxzUfi-t8NWeL4RuteF5C4eNKLB6yAF295HVQLIzu673kM52Ymk3Jo8IQ9IP6z8yycHoWiwI8Pwfj42U5UBZi_eKviXfGESCVPSDHp7d75V-IPtuk)

#### 3.1 Knowledge graphs and conjunctive queries&#x20;

* Conjunctive queries: subclass of first-order logical queries that uses existential and conjunction operations&#x20;
* Dependency graph
  * Nodes: variable or non-variable entities in q&#x20;
  * Edges: relations in q&#x20;
  * In order for the query to be valid, the corresponding dependency graph needs to be a DAG&#x20;
    * Anchor entities: source nodes&#x20;
    * Query target: unique sink node&#x20;
* Computation graph: projection, intersection&#x20;

#### 3.2 Reasoning over sets of entities using box embeddings&#x20;

* Intuitions: decompose complex query into a sequence of logical operations, and then execute these operations in the vector space&#x20;
  * This way we will obtain the embedding of the query
  * Answers to the query will be entities that are enclosed in the final query embedding box&#x20;
* Two methodological advances&#x20;
  * The use of box embeddings to model and reason over set of entities&#x20;
  * How to tractably handle disjunction operators&#x20;
* **Box embeddings**&#x20;
  * p = (Cen(p), Off(p))
    * Cen(p) is the center of the box&#x20;
    * Off(p) is the positive offset of the box, modeling the size of the box&#x20;
  * *Initial boxes for source nodes:* (v, 0), where v is the anchor entity vector and 0 is a d-dim all-zero vector&#x20;
  * *geometric projection* operator
    * relation embedding: r = (Cen(r), Off(r))&#x20;
    * Given input box embedding p, projection is modeled by  p + r: sum the centers and sum the offsets&#x20;
      * translated center, larger offset&#x20;
  * *geometric intersection operator*&#x20;
    * p(inter) = (Cen(p(inter)), Off(p(inter)))
    * Calculated by: performing attention over the box centers, and shrinking the box offset using the sigmoid function&#x20;
  * *Entity-to-box distance*&#x20;
  * Training objective
    * Learn entity embeddings and geometric projection and intersection operators&#x20;

![Entity to box distance ](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MgYDe1ZgHtK4quQ8yZq%2F-MgYLm4cBZOWONmViXXB%2Fimage.png?alt=media\&token=0a69b92a-2a75-41f8-ac3d-84d8afdf19c1)

![Entity to box distance (2) ](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MgYDe1ZgHtK4quQ8yZq%2F-MgYLvak7kKBpFieaVgF%2Fimage.png?alt=media\&token=d3bceefc-513f-4ce8-a11a-1f9d6841798a)

#### 3.3 Tractable handling of disjunction using disjunctive normal form&#x20;

* Can define union, but union operations over boxes is not closed&#x20;
* Key idea: transform a given EPFO query into a Disjunctive Normal Form (DNF) (i.e. disjunction of conjunctive queries, and union operations only appear at the last steps)&#x20;
* **Transformation to DNF**&#x20;
  * See the paper (calculate computation graphs and combine)&#x20;
* **Aggregation**&#x20;
  * Distance between the given EPFO query q and an entity v&#x20;
    * N > 1: minimum distance to the closest box as the distance to an entity&#x20;
* **Computational complexity**&#x20;
  * Complexity of answering an EPFO query = answering the N conjunctive queries&#x20;
    * N is not large, and the computation can be parallelized&#x20;
    * Answering each conjunctive query: execute a sequence of box operations, and then perform a range search&#x20;

### Experiment&#x20;

#### KGs and Query Generation&#x20;

* E.x. FB15k, FB15k-237, NELL995

![Query structures considered in the experiments ](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MgcCAoEy6LpeVlT6Z36%2F-MgcJR8ekwZqYULcxSoQ%2Fimage.png?alt=media\&token=3741f73e-5955-4262-98be-d42fbf163af6)

* 'p': projection&#x20;
* 'i': intersection&#x20;
* 'u': union&#x20;

### Take-away

* Workloads
  * Originally: more like structured embeddings, model individual entities and study pairwise relations&#x20;
  * Here (box embeddings): model sets of entities and reason over those sets &#x20;
