
ci: CLT test of the jina integration

Related PR where it was removed in the past: https://github.com/manticoresoftware/manticoresearch/pull/3871

Related issue: https://github.com/manticoresoftware/manticoresearch/issues/3870
Sergey Nikolaev, 5 days ago
parent
commit
6084155165
1 file changed, with 120 additions and 0 deletions

test/clt-tests/mcl/auto-embeddings-jina-remote.rec

@@ -0,0 +1,120 @@
+Test remote JinaAI embeddings API integration with auto-generation from text fields. Tests API key handling, model specification, error cases, and vector generation consistency using jina/ model prefix.
+
+––– comment –––
+Start Manticore Search
+––– block: ../base/start-searchd –––
+––– block: ./base –––
+––– comment –––
+Test error case: Invalid JinaAI model name
+––– input –––
+mysql -h0 -P9306 -e "CREATE TABLE test_invalid_model (title TEXT, embedding FLOAT_VECTOR KNN_TYPE='hnsw' HNSW_SIMILARITY='l2' MODEL_NAME = 'jina/invalid-model-name-12345' FROM = 'title') " 2>&1
+––– output –––
+#!/.*ERROR.*(model.*not.*found|invalid.*model|model.*does.*not.*exist|configuration).*/!#
+––– comment –––
+Test that table creation fails when no API_KEY is passed
+––– input –––
+mysql -h0 -P9306 -e "CREATE TABLE test_valid_model_no_api_key (title TEXT, embedding FLOAT_VECTOR KNN_TYPE='hnsw' HNSW_SIMILARITY='l2' MODEL_NAME = 'jina/jina-embeddings-v2-base-en' FROM = 'title') " 2>&1
+––– output –––
+ERROR 1064 (42000) at line 1: error adding table 'test_valid_model_no_api_key': prealloc: Invalid API key for remote model
+––– comment –––
+Test table creation with valid JinaAI model and KEY
+––– input –––
+mysql -h0 -P9306 -e "CREATE TABLE test_jina_remote (title TEXT, content TEXT, description TEXT, embedding FLOAT_VECTOR KNN_TYPE='hnsw' HNSW_SIMILARITY='l2' MODEL_NAME = 'jina/jina-embeddings-v4' FROM = 'title, content' API_KEY='${JINA_API_KEY}') "; echo $?
+––– output –––
+0
+––– comment –––
+Check that the table was created and that SHOW CREATE TABLE does not display API_KEY
+––– input –––
+mysql -h0 -P9306 -E -e "SHOW CREATE TABLE test_jina_remote"
+––– output –––
+*************************** 1. row ***************************
+       Table: test_jina_remote
+Create Table: CREATE TABLE test_jina_remote (
+id bigint,
+title text,
+content text,
+description text,
+embedding float_vector knn_type='hnsw' hnsw_similarity='L2' model_name='jina/jina-embeddings-v4' FROM='title, content'
+)
+––– comment –––
+Test data insertion with auto-generated vectors
+––– input –––
+mysql -h0 -P9306 -e "INSERT INTO test_jina_remote (id, title, content, description) VALUES(1, 'machine learning algorithms', 'deep neural networks and artificial intelligence', 'advanced AI research')"; echo $?
+––– output –––
+0
+––– comment –––
+Verify data was inserted and vectors were generated
+––– input –––
+mysql -h0 -P9306 -e "SELECT COUNT(*) as record_count FROM test_jina_remote WHERE id=1"
+––– output –––
++--------------+
+| record_count |
++--------------+
+|            1 |
++--------------+
+––– comment –––
+Check consistency with cosine similarity instead of an exact match: the vectors returned by the remote API are not deterministic, and a similarity threshold avoids floating-point precision issues.
+––– input –––
+mysql -h0 -P9306 -e "INSERT INTO test_jina_remote (id, title, content, description) VALUES(2, 'machine learning algorithms', 'deep neural networks and artificial intelligence', 'different description')"
+
+mysql -h0 -P9306 -e "SELECT embedding FROM test_jina_remote WHERE id=1" | \
+    grep -v embedding | \
+    sed 's/-\?[0-9]\+\(\.[0-9]\+\)\?/\n&\n/g' | \
+    grep -E '^-?[0-9]+(\.[0-9]+)?$' | \
+    awk '{printf "%.5f\n", $1}' > /tmp/vector1.txt
+
+mysql -h0 -P9306 -e "SELECT embedding FROM test_jina_remote WHERE id=2" | \
+    grep -v embedding | \
+    sed 's/-\?[0-9]\+\(\.[0-9]\+\)\?/\n&\n/g' | \
+    grep -E '^-?[0-9]+(\.[0-9]+)?$' | \
+    awk '{printf "%.5f\n", $1}' > /tmp/vector2.txt
+
+SIMILARITY=$(cosine_similarity /tmp/vector1.txt /tmp/vector2.txt)
+
+echo "Cosine similarity: $SIMILARITY"
+
+RESULT=$(awk -v sim="$SIMILARITY" 'BEGIN {
+    if (sim > 0.99)
+        print "SUCCESS: Same FROM fields produce similar vectors (similarity: " sim ")"
+    else
+        print "FAIL: Different vectors (FROM does not include description field and should not change generated vector value) (similarity: " sim ")"
+}')
+
+echo "$RESULT"
+
+rm -f /tmp/vector1.txt /tmp/vector2.txt
+––– output –––
+Cosine similarity: #!/(1|0\.[0-9]+)/!#
+SUCCESS: Same FROM fields produce similar vectors (similarity: #!/(1|0\.[0-9]+)/!#)
+––– comment –––
+Test that different FROM field combinations produce different vectors
+––– input –––
+mysql -h0 -P9306 -e "CREATE TABLE test_jina_title_only (title TEXT, content TEXT, embedding FLOAT_VECTOR KNN_TYPE='hnsw' HNSW_SIMILARITY='l2' MODEL_NAME = 'jina/jina-embeddings-v4' FROM = 'title' API_KEY='${JINA_API_KEY}') "; mysql -h0 -P9306 -e "INSERT INTO test_jina_title_only (id, title, content) VALUES(1, 'machine learning algorithms', 'completely different content here')"; MD5_MULTI=$(mysql -h0 -P9306 -e "SELECT embedding FROM test_jina_remote WHERE id=1" | grep -v embedding | md5sum | awk '{print $1}'); MD5_SINGLE=$(mysql -h0 -P9306 -e "SELECT embedding FROM test_jina_title_only WHERE id=1" | grep -v embedding | md5sum | awk '{print $1}'); echo "multi_field_md5: $MD5_MULTI"; echo "single_field_md5: $MD5_SINGLE"; if [ "$MD5_MULTI" != "$MD5_SINGLE" ]; then echo "SUCCESS: Different FROM specifications produce different vectors"; else echo "INFO: FROM field comparison result"; fi
+––– output –––
+multi_field_md5: #!/[0-9a-f]{32}/!#
+single_field_md5: #!/[0-9a-f]{32}/!#
+SUCCESS: Different FROM specifications produce different vectors
+––– comment –––
+Test error case: FROM referencing non-existent field with JinaAI model
+––– input –––
+mysql -h0 -P9306 -e "CREATE TABLE test_jina_invalid_field (title TEXT, embedding FLOAT_VECTOR KNN_TYPE='hnsw' HNSW_SIMILARITY='l2' MODEL_NAME = 'jina/text-embedding-ada-002' FROM = 'nonexistent_field') " 2>&1
+––– output –––
+#!/.*ERROR.*(field|column|nonexistent).*/!#
+––– comment –––
+Test manual vector insertion without auto-generation
+––– input –––
+if mysql -h0 -P9306 -e "SHOW TABLES LIKE 'test_jina_no_from'" | grep -q test_jina_no_from; then mysql -h0 -P9306 -e "INSERT INTO test_jina_no_from (id, title, embedding) VALUES(1, 'test title', '(0.1, 0.2, 0.3, 0.4, 0.5)')"; echo "insert_result: $?"; else echo "insert_result: skipped (table not created)"; fi
+––– output –––
+#!/(insert_result: (0|ERROR.*dimension.*mismatch.*)|insert_result: skipped \(table not created\))/!#
+––– comment –––
+Show table structure to verify JinaAI model configuration
+––– input –––
+if mysql -h0 -P9306 -e "SHOW TABLES LIKE 'test_jina_no_from'" | grep -q test_jina_no_from; then mysql -h0 -P9306 -e "SHOW CREATE TABLE test_jina_no_from"; else echo "table_structure: skipped (table not created)"; fi
+––– output –––
+#!/(.*CREATE TABLE test_jina_no_from.*model_name='jina\/text-embedding-ada-002'.*engine='columnar'.*|table_structure: skipped \(table not created\))/!#
+––– comment –––
+Test API key environment variable handling
+––– input –––
+if [ -n "$JINA_API_KEY" ] && [ "$JINA_API_KEY" != "dummy_key_for_testing" ]; then echo "API key is available for testing"; else echo "API key not available - using dummy for error testing"; fi
+––– output –––
+#!/(API key is available for testing|API key not available - using dummy for error testing)/!#
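
Note on the consistency check: the test calls a `cosine_similarity` helper that this commit does not define, so it is presumably provided by the CLT test environment. For readers who want to reproduce the check locally, the function below is a hypothetical stand-in and not the helper's actual implementation. It assumes each input file holds one float per line and that both files have the same number of lines, which is the layout the `awk '{printf "%.5f\n", $1}'` step in the test writes.

```shell
# Hypothetical stand-in for the cosine_similarity helper used in the test.
# Assumption: $1 and $2 are files with one float per line, equal length.
cosine_similarity() {
    # paste joins line i of both files, so $1 and $2 inside awk are the
    # i-th components of the two vectors.
    paste "$1" "$2" | awk '
        { dot += $1 * $2; na += $1 * $1; nb += $2 * $2 }
        END {
            # A zero-length vector has no direction; report 0 instead of
            # dividing by zero.
            if (na == 0 || nb == 0) { print "0.000000"; exit }
            printf "%.6f\n", dot / (sqrt(na) * sqrt(nb))
        }'
}
```

With a function like this in scope, `cosine_similarity /tmp/vector1.txt /tmp/vector2.txt` prints a single number that the `sim > 0.99` threshold in the test can consume.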