|
|
@@ -1,15 +1,24 @@
|
|
|
-# K-nearest neighbor search
|
|
|
+# K-nearest neighbor vector search
|
|
|
|
|
|
-<!-- example KNN -->
|
|
|
+Manticore Search supports adding embeddings generated by your machine learning models to each document and then running a nearest-neighbor search on them. This lets you build features such as similarity search, recommendations, semantic search, and NLP-based relevance ranking, as well as image, video, and sound search.
|
|
|
+
|
|
|
+## What is an embedding?
|
|
|
+
|
|
|
+An embedding is a method of representing data, such as text, images, or sound, as vectors in a high-dimensional space. These vectors are constructed so that the distance between them reflects the similarity of the data they represent. Embeddings are typically produced by models such as Word2Vec or BERT for text, or by neural networks for images. The high-dimensional nature of the vector space, with many components per vector, allows complex and nuanced relationships between items to be represented. Similarity is gauged by the distance between vectors, often measured with Euclidean distance or cosine similarity.
|
|
|
+
|
|
|
+Manticore Search enables k-nearest neighbor (KNN) vector searches using the HNSW library. This functionality is part of the [Manticore Columnar Library](https://github.com/manticoresoftware/columnar).
|
|
|
|
|
|
-Manticore can perform k-nearest neighbor (KNN) searches using the HNSW library.
|
|
|
+<!-- example KNN -->
|
|
|
|
|
|
-The KNN search functionality is provided by the [Manticore Columnar Library](https://github.com/manticoresoftware/columnar)
|
|
|
+### Configuring a table for KNN search
|
|
|
|
|
|
To run KNN searches, you must first configure your table. It needs to have at least one float_vector attribute, which serves as a data vector. You need to specify the following properties:
|
|
|
* `knn_type`: A mandatory setting; currently, only `hnsw` is supported.
|
|
|
* `knn_dims`: A mandatory setting that specifies the dimensions of the vectors being indexed.
|
|
|
-* `hnsw_similarity`: A mandatory setting that specifies the distance function used by the HNSW index. Acceptable values are `L2`, `IP`, and `COSINE`.
|
|
|
+* `hnsw_similarity`: A mandatory setting that specifies the distance function used by the HNSW index. Acceptable values are:
|
|
|
+ - `L2` - Squared L2
|
|
|
+ - `IP` - Inner product
|
|
|
+ - `COSINE` - Cosine similarity
|
|
|
* `hnsw_m`: An optional setting that defines the maximum number of outgoing connections in the graph. The default is 16.
|
|
|
* `hnsw_ef_construction`: An optional setting that defines a construction time/accuracy trade-off.
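
As a rough illustration of the three `hnsw_similarity` options, the sketch below computes the textbook definitions of squared L2 distance, inner product, and cosine similarity in plain Python. The function names and sample vectors are made up for this example, and it does not show how Manticore itself converts these measures into the distances it reports.

```python
import math

def squared_l2(a, b):
    """Sum of squared component differences (smaller means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product(a, b):
    """Dot product of the two vectors (larger means more similar)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product of the vectors after normalizing each to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

# Hypothetical 4-dimensional vectors; the second is the first scaled by 2.
doc_vector = [1.0, 2.0, 0.0, 2.0]
query_vector = [2.0, 4.0, 0.0, 4.0]

print(squared_l2(doc_vector, query_vector))         # 9.0
print(inner_product(doc_vector, query_vector))      # 18.0
print(cosine_similarity(doc_vector, query_vector))  # 1.0
```

Because the second vector is a scaled copy of the first, cosine similarity treats the two as pointing in the same direction, while squared L2 still reports a non-zero distance. Whether vector length should matter for your data is one consideration when choosing between the measures.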
|
|
|
|
|
|
@@ -28,6 +37,10 @@ Query OK, 0 rows affected (0.01 sec)
|
|
|
```
|
|
|
<!-- end -->
|
|
|
|
|
|
+<!-- example knn_insert -->
|
|
|
+
|
|
|
+### Inserting vector data
|
|
|
+
|
|
|
After creating the table, you need to insert your vector data, ensuring it matches the dimensions you specified when creating the table.
|
|
|
|
|
|
<!-- intro -->
|
|
|
@@ -38,7 +51,7 @@ After creating the table, you need to insert your vector data, ensuring it match
|
|
|
```sql
|
|
|
insert into test values ( 1, 'yellow bag', (0.653448,0.192478,0.017971,0.339821) ), ( 2, 'white bag', (-0.148894,0.748278,0.091892,-0.095406) );
|
|
|
```
|
|
|
-<!-- response -->
|
|
|
+<!-- response SQL -->
|
|
|
|
|
|
```sql
|
|
|
Query OK, 2 rows affected (0.00 sec)
|
|
|
@@ -87,8 +100,33 @@ POST /insert
|
|
|
|
|
|
<!-- end -->
|
|
|
|
|
|
+<!-- example knn_search -->
|
|
|
|
|
|
-Now, you can initiate the KNN search using the `knn()` clause.
|
|
|
+### KNN vector search
|
|
|
+
|
|
|
+Now you can perform a KNN search using the `knn` clause in either SQL or JSON format. Both interfaces support the same essential parameters:
|
|
|
+
|
|
|
+- SQL: `select ... from <table name> where knn ( <field>, <k>, <query vector> )`
|
|
|
+- JSON:
|
|
|
+ ```
|
|
|
+ POST /search
|
|
|
+ {
|
|
|
+ "index": "<table name>",
|
|
|
+ "knn":
|
|
|
+ {
|
|
|
+ "field": "<field>",
|
|
|
+ "query_vector": [<query vector>],
|
|
|
+ "k": <k>
|
|
|
+ }
|
|
|
+ }
|
|
|
+ ```
|
|
|
+
|
|
|
+The parameters are:
|
|
|
+* `field`: This is the name of the float vector attribute containing vector data.
|
|
|
+* `k`: This represents the number of documents to return. It indicates how many documents a single Hierarchical Navigable Small World (HNSW) index will return. The actual result may include more documents than `k` (e.g., if each disk chunk in a real-time table returns `k` documents, the total would be `num_chunks * k` documents). Conversely, the result might contain fewer than `k` documents if, for example, you request `k` documents and subsequently filter them by some attribute.
|
|
|
+* `query_vector`: This is the search vector.
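
If you build the JSON request from application code, the three parameters map directly onto a small dictionary. The following is a minimal sketch in Python; `build_knn_request` is a hypothetical helper written for this example, not part of any Manticore client, and the table and attribute names are the ones used in the SQL examples in this document.

```python
import json

def build_knn_request(table, field, query_vector, k):
    """Return a /search request body with the knn parameters described above."""
    return {
        "index": table,
        "knn": {
            "field": field,                      # float vector attribute name
            "query_vector": list(query_vector),  # the search vector
            "k": k,                              # documents requested per HNSW index
        },
    }

body = build_knn_request(
    "test", "image_vector", [0.286569, -0.031816, 0.066684, 0.032926], 5
)
# Serialize to JSON text; sending it to POST /search is left to your HTTP client.
print(json.dumps(body, indent=2))
```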
|
|
|
+
|
|
|
+Documents are always sorted by their distance to the search vector. Any additional sorting criteria you specify are applied after this primary sort condition. To retrieve the distance, use the built-in function [knn_dist()](../Functions/Other_functions.md#KNN_DIST%28%29).
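
The note on `k` above can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not Manticore's implementation: it uses a brute-force scan instead of HNSW, represents each disk chunk as a plain Python list, and invents the sample documents. It only shows why taking `k` results per chunk can yield more than `k` documents in total, and why a later filter can leave fewer than `k`.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_per_chunk(chunks, query, k):
    """Take the k nearest docs from each chunk, then merge and sort by distance."""
    merged = []
    for chunk in chunks:
        scored = [(squared_l2(doc["vec"], query), doc) for doc in chunk]
        scored.sort(key=lambda pair: pair[0])
        merged.extend(scored[:k])  # each chunk contributes at most k docs
    merged.sort(key=lambda pair: pair[0])
    return merged

# Two hypothetical chunks with three documents each.
chunks = [
    [{"id": 1, "vec": [0.0, 0.0]}, {"id": 2, "vec": [1.0, 0.0]}, {"id": 3, "vec": [5.0, 5.0]}],
    [{"id": 4, "vec": [0.0, 1.0]}, {"id": 5, "vec": [2.0, 2.0]}, {"id": 6, "vec": [9.0, 9.0]}],
]

results = knn_per_chunk(chunks, query=[0.0, 0.0], k=2)
ids = [doc["id"] for _, doc in results]
filtered = [doc["id"] for _, doc in results if doc["id"] < 2]

print(ids)       # [1, 2, 4, 5] -> num_chunks * k = 4 documents
print(filtered)  # [1]          -> fewer than k after an attribute filter
```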
|
|
|
|
|
|
<!-- intro -->
|
|
|
##### SQL:
|
|
|
@@ -98,7 +136,7 @@ Now, you can initiate the KNN search using the `knn()` clause.
|
|
|
```sql
|
|
|
select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926) );
|
|
|
```
|
|
|
-<!-- response -->
|
|
|
+<!-- response SQL -->
|
|
|
|
|
|
```sql
|
|
|
+------+------------+
|
|
|
@@ -167,12 +205,10 @@ POST /search
|
|
|
|
|
|
<!-- end -->
|
|
|
|
|
|
-The `knn()` clause has the following syntax: `knn ( <attribute_name>, <k>, <query_vector> )`
|
|
|
-* `attribute_name` is the name of the float vector attribute that contains vector data.
|
|
|
-* `k` is the number of documents to return. Note that this represents how many documents a single HNSW index will return. The result may have more documents than `k` (e.g., each disk chunk in a real-time table returns `k` docs, resulting in `num_chunks*k` docs), or it may have fewer than `k` docs (e.g., you request `k` docs and filter them by some attribute).
|
|
|
-* `<query_vector>` is the search vector.
|
|
|
|
|
|
-Documents are always sorted by the distance to the search vector. Any additional sorting that you may specify will be performed after that primary sort condition. There is a built-in function `knn_dist()` that can be used to retrieve the distance.
|
|
|
+<!-- example knn_filtering -->
|
|
|
+
|
|
|
+### Filtering KNN vector search results
|
|
|
|
|
|
Manticore also supports additional filtering of documents returned by the KNN search, either by full-text matching, attribute filters, or both.
|
|
|
|
|
|
@@ -182,9 +218,9 @@ Manticore also supports additional filtering of documents returned by the KNN se
|
|
|
<!-- request SQL -->
|
|
|
|
|
|
```sql
|
|
|
-select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926) ) and match('white') and id<10;
|
|
|
+select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926) ) and match('white') and id < 10;
|
|
|
```
|
|
|
-<!-- response -->
|
|
|
+<!-- response SQL -->
|
|
|
|
|
|
```sql
|
|
|
+------+------------+
|