Additional functionality
------------------------
.. _build_excerpts:

BuildExcerpts
~~~~~~~~~~~~~

**Prototype:** function BuildExcerpts ( $docs, $index, $words,
$opts=array() )

Excerpts (snippets) builder function. Connects to ``searchd``, asks it
to generate excerpts (snippets) from given documents, and returns the
results.

``$docs`` is a plain array of strings that carry the documents'
contents. ``$index`` is an index name string. Settings from the given
index (such as charset, morphology, wordforms) will be used.
``$words`` is a string that contains the keywords to highlight. They
will be processed with respect to index settings. For instance, if
English stemming is enabled in the index, ``shoes`` will be highlighted
even if the keyword is ``shoe``. Keywords can contain wildcards, which
work similarly to the star-syntax available in queries. ``$opts`` is a
hash that contains additional optional highlighting parameters:

-  ``before_match``:
   A string to insert before a keyword match. A ``%PASSAGE_ID%`` macro can
   be used in this string. The first match of the macro is replaced with
   an incrementing passage number within the current snippet. Numbering
   starts at 1 by default but can be overridden with the
   ``start_passage_id`` option. In a multi-document call, ``%PASSAGE_ID%``
   restarts at every given document. Default is ``<b>``.

-  ``after_match``:
   A string to insert after a keyword match. Starting with version
   1.10-beta, a ``%PASSAGE_ID%`` macro can be used in this string. Default
   is ``</b>``.

-  ``chunk_separator``:
   A string to insert between snippet chunks (passages). Default is ``…``.

-  ``field_separator``:
   A string to insert between fields. Default is ``|``.

-  ``limit``:
   Maximum snippet size, in symbols (codepoints). Integer, default is
   256.

-  ``around``:
   How many words to pick around each matching keyword block. Integer,
   default is 5.

-  ``exact_phrase``:
   Whether to highlight exact query phrase matches only instead of
   individual keywords. Boolean, default is false.

-  ``use_boundaries``:
   Whether to additionally break passages by phrase boundary characters,
   as configured in index settings with
   :ref:`phrase_boundary <phrase_boundary>`
   directive. Boolean, default is false.

-  ``weight_order``:
   Whether to sort the extracted passages in order of relevance
   (decreasing weight), or in order of appearance in the document
   (increasing position). Boolean, default is false.

-  ``query_mode``:
   Whether to handle ``$words`` as a query in :ref:`extended
   syntax <extended_query_syntax>`, or as a bag of words
   (default behavior). For instance, in query mode
   ``"one two" | "three four"`` will only highlight and include
   occurrences of ``one two`` or ``three four`` where the two words of each
   pair are adjacent to each other. In default mode, any single
   occurrence of ``one``, ``two``, ``three``, or ``four`` would be
   highlighted. Boolean, default is false.

-  ``force_all_words``:
   Ignores the snippet length limit until it includes all the keywords.
   Boolean, default is false.

-  ``limit_passages``:
   Limits the maximum number of passages that can be included into the
   snippet. Integer, default is 0 (no limit).

-  ``limit_words``:
   Limits the maximum number of words that can be included into the
   snippet. Note the limit applies to any words, and not just the
   matched keywords to highlight. For example, if we are highlighting
   ``Mary`` and a passage ``Mary had a little lamb`` is selected, then it
   contributes 5 words to this limit, not just 1. Integer, default is 0
   (no limit).

-  ``start_passage_id``:
   Specifies the starting value of ``%PASSAGE_ID%`` macro (that gets
   detected and expanded in ``before_match``, ``after_match`` strings).
   Integer, default is 1.

-  ``load_files``:
   Whether to handle ``$docs`` as data to extract snippets from (default
   behavior), or to treat it as file names and load data from the
   specified files on the server side. When this flag is enabled, up to
   :ref:`dist_threads <dist_threads>`
   worker threads per request will be created to parallelize the work.
   Boolean, default is false.

   Snippet building can also be parallelized between remote agents. To
   do so, set the
   :ref:`dist_threads <dist_threads>`
   parameter in the config to a value greater than 1, and then request
   snippets over a distributed index that contains only one
   :ref:`local <local>` agent and several remote agents.

   The
   :ref:`snippets_file_prefix <snippets_file_prefix>`
   option also applies: the final file name is the prefix concatenated
   with the given name. In other words, when ``snippets_file_prefix`` is
   ``/var/data`` and the file name is ``text.txt``, ``searchd`` will try
   to generate snippets from the file ``/var/datatext.txt``, which is
   exactly ``/var/data`` + ``text.txt``.

-  ``load_files_scattered``:
   Works only with distributed snippet generation with remote
   agents. The source files for snippets can be distributed among
   different agents, and the main daemon will merge all
   non-erroneous results. So, if one agent of the distributed index has
   ``file1.txt``, another has ``file2.txt``, and you request snippets
   for both files, ``searchd`` will merge the results from the agents,
   and you will get snippets from both ``file1.txt`` and
   ``file2.txt``. Boolean, default is false.

   If ``load_files`` is also set, the request will return an error
   if any of the files is not available anywhere. Otherwise (if
   ``load_files`` is not set) it will just return empty strings for
   all absent files. The master instance resets this flag when it
   distributes the snippets among agents. So, for agents the absence of
   a file is not a critical error, but for the master it might be. If
   you want to be sure that all snippets are actually created, set both
   ``load_files_scattered`` and ``load_files``. If missing snippets
   caused by some agents are not critical for you, set just
   ``load_files_scattered`` and leave ``load_files`` unset.

-  ``html_strip_mode``:
   HTML stripping mode setting. Defaults to ``index``, which means that
   index settings will be used. The other values are ``none`` and ``strip``,
   which forcibly skip or apply stripping regardless of index settings;
   and ``retain``, which retains HTML markup and protects it from
   highlighting. The ``retain`` mode can only be used when highlighting
   full documents and thus requires that no snippet size limits are set.
   String, allowed values are ``none``, ``strip``, ``index``, and ``retain``.

-  ``allow_empty``:
   Allows empty string to be returned as highlighting result when a
   snippet could not be generated (no keywords match, or no passages fit
   the limit). By default, the beginning of original text would be
   returned instead of an empty string. Boolean, default is false.

-  ``passage_boundary``:
   Ensures that passages do not cross a sentence, paragraph, or zone
   boundary (when used with an index that has the respective indexing
   settings enabled). String, allowed values are ``sentence``,
   ``paragraph``, and ``zone``.

-  ``emit_zones``:
   Emits an HTML tag with an enclosing zone name before each passage.
   Boolean, default is false.

-  ``force_passages``:
   Whether to generate passages for the snippet even if the limits allow
   highlighting the whole text. Boolean, default is false.
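
The file name rule described for ``load_files`` is plain string
concatenation, with no path separator inserted between the prefix and
the name. The sketch below only illustrates that rule;
``resolveSnippetFile()`` is a hypothetical helper written for this
example and is not part of the API.

.. code-block:: php

    <?php
    // Hypothetical helper: mirrors the documented rule that the final
    // file name is the prefix concatenated with the given name.
    function resolveSnippetFile(string $prefix, string $name): string
    {
        // No directory separator is added between the two parts.
        return $prefix . $name;
    }

    echo resolveSnippetFile('/var/data', 'text.txt'), "\n";   // /var/datatext.txt
    echo resolveSnippetFile('/var/data/', 'text.txt'), "\n";  // /var/data/text.txt

If the files live inside a directory, the prefix therefore most likely
needs to end with a trailing slash.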

The snippet extraction algorithm currently favors better passages (with
closer phrase matches), and then passages with keywords not yet in the
snippet. Generally, it will try to highlight the best match for the
query, and it will also try to highlight all the query keywords, as far
as the limits allow. If the document does not match the query, the
beginning of the document, trimmed down according to the limits, is
returned by default. You can return an empty snippet instead in that
case by setting the ``allow_empty`` option to true.

Returns false on failure. Returns a plain array of strings with excerpts
(snippets) on success.
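
A call might look like the sketch below. The index name ``test1``, the
sample documents, and the option values are assumptions chosen for
illustration. ``$cl`` is assumed to be an already configured client
instance; the call is skipped when it is not defined, so the fragment
can be read and run on its own.

.. code-block:: php

    <?php
    // Sample inputs (illustrative only).
    $docs = array(
        "this is my test text to be highlighted",
        "another document that mentions a test",
    );
    $words = "test text";

    // Only options documented in the list above are used here.
    $opts = array(
        "before_match"    => "<b>",
        "after_match"     => "</b>",
        "chunk_separator" => " ... ",
        "limit"           => 60,
        "around"          => 3,
        "allow_empty"     => true,
    );

    // $cl is assumed to be a configured client object; skip otherwise.
    if (isset($cl)) {
        $res = $cl->BuildExcerpts($docs, "test1", $words, $opts);
        if ($res === false) {
            // false signals a failure, as described above.
            print "ERROR: " . $cl->GetLastError() . "\n";
        } else {
            // On success $res is a plain array of excerpt strings.
            foreach ($res as $i => $excerpt) {
                print "excerpt $i: $excerpt\n";
            }
        }
    }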

.. _build_keywords:

BuildKeywords
~~~~~~~~~~~~~

**Prototype:** function BuildKeywords ( $query, $index, $hits )

Extracts keywords from query using tokenizer settings for given index,
optionally with per-keyword occurrence statistics. Returns an array of
hashes with per-keyword information.

``$query`` is a query to extract keywords from. ``$index`` is a name of
the index to get tokenizing settings and keyword occurrence statistics
from. ``$hits`` is a boolean flag that indicates whether keyword
occurrence statistics are required.

Usage example:

.. code-block:: php


    $keywords = $cl->BuildKeywords ( "this.is.my query", "test1", false );
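
The returned array can then be walked entry by entry. The sketch below
uses a hand-written stand-in for the result instead of a live call. The
key names (``tokenized``, ``normalized``, ``docs``, ``hits``) are an
assumption about the PHP client's result layout and the numbers are made
up, so verify both against your client version. ``formatKeywords()`` is
a hypothetical helper.

.. code-block:: php

    <?php
    // Stand-in for a BuildKeywords() result requested with $hits = true.
    // Key names are assumed; the statistics are invented sample values.
    $keywords = array(
        array("tokenized" => "this",  "normalized" => "this",  "docs" => 12, "hits" => 30),
        array("tokenized" => "query", "normalized" => "query", "docs" => 4,  "hits" => 5),
    );

    // Hypothetical helper: renders one line per keyword.
    function formatKeywords(array $keywords): array
    {
        $lines = array();
        foreach ($keywords as $kw) {
            $line = $kw["tokenized"] . " -> " . $kw["normalized"];
            // Statistics are only present when they were requested.
            if (isset($kw["docs"], $kw["hits"])) {
                $line .= " (docs=" . $kw["docs"] . ", hits=" . $kw["hits"] . ")";
            }
            $lines[] = $line;
        }
        return $lines;
    }

    print join("\n", formatKeywords($keywords)) . "\n";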


.. _escape_string:

EscapeString
~~~~~~~~~~~~

**Prototype:** function EscapeString ( $string )

Escapes characters that are treated as special operators by the query
language parser. Returns an escaped string.

``$string`` is a string to escape.

This function might seem redundant because it's trivial to implement in
any calling application. However, as the set of special characters might
change over time, it makes sense to have an API call that is guaranteed
to escape all such characters at all times.

Usage example:

.. code-block:: php


    $escaped = $cl->EscapeString ( "escaping-sample@query/string" );


.. _flush_attributes:

FlushAttributes
~~~~~~~~~~~~~~~

**Prototype:** function FlushAttributes ()

Forces ``searchd`` to flush pending attribute updates to disk, and
blocks until completion. Returns a non-negative internal ``flush tag`` on
success. Returns -1 and sets an error message on error.

Attribute values updated using the
:ref:`UpdateAttributes() <update_attributes>`
API call are kept in a memory-mapped file, which means that the OS
decides when the updates are actually written to disk.
A FlushAttributes() call lets you enforce a flush, which writes all the
changes to disk. The call blocks
until ``searchd`` finishes writing the data to disk, which might take
seconds or even minutes depending on the total data size (.spa file
size). All the currently updated indexes will be flushed.

The flush tag should be treated as an ever-growing magic number that
does not mean anything. It is guaranteed to be non-negative. It is
guaranteed to grow over time, though not necessarily in a sequential
fashion; for instance, two calls that return 10 and then 1000
respectively are a valid situation. If two calls to FlushAttributes()
return the same tag, it means that there were no actual attribute
updates in between them, and therefore the current flushed state
remained the same (for all indexes).

Usage example:

.. code-block:: php


    $status = $cl->FlushAttributes ();
    if ( $status<0 )
        print "ERROR: " . $cl->GetLastError();
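
The only meaningful operation on flush tags is an equality check between
two of them. A minimal sketch of that check follows;
``updatesHappenedBetween()`` is a hypothetical helper, not part of the
API, and it takes the two tags as plain integers so it can be tried
without a running ``searchd``.

.. code-block:: php

    <?php
    // Hypothetical helper: interprets two flush tags returned by
    // consecutive FlushAttributes() calls, following the rules above.
    function updatesHappenedBetween(int $firstTag, int $secondTag): bool
    {
        if ($firstTag < 0 || $secondTag < 0) {
            // A negative return value signals an error, not a tag.
            throw new InvalidArgumentException("not a valid flush tag");
        }
        // Equal tags mean no attribute updates happened in between.
        return $secondTag !== $firstTag;
    }

    var_dump(updatesHappenedBetween(10, 1000));   // bool(true)
    var_dump(updatesHappenedBetween(1000, 1000)); // bool(false)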


.. _Status:

Status
~~~~~~

**Prototype:** function Status ()

Queries searchd status, and returns an array of status variable name and
value pairs.

Usage example:

.. code-block:: php


    $status = $cl->Status ();
    foreach ( $status as $row )
        print join ( ": ", $row ) . "\n";


.. _update_attributes:

UpdateAttributes
~~~~~~~~~~~~~~~~

**Prototype:** function UpdateAttributes ( $index, $attrs, $values,
$type=SPH_UPDATE_INT, $ignorenonexistent=false )

Instantly updates given attribute values in given documents. Returns
number of actually updated documents (0 or more) on success, or -1 on
failure.

``$index`` is a name of the index (or indexes) to be updated. ``$attrs``
is a plain array with string attribute names, listing attributes that
are updated.

.. warning::
  Note that document ``id`` attribute cannot be updated.

``$values`` is a hash with document IDs as keys and new attribute values
as values; the exact layout depends on ``$type``, as described below.

The optional ``$type`` parameter can have the following values:

1. ``SPH_UPDATE_INT``. This is the default value. The ``$values`` hash
   holds document IDs as keys and plain arrays of new attribute values.

2. ``SPH_UPDATE_MVA``. Indicates that MVA attributes are being updated.
   In this case ``$values`` must be a hash with document IDs as keys and
   arrays of arrays of int values (new MVA attribute values).

3. ``SPH_UPDATE_STRING``. Indicates that string attributes are being
   updated. ``$values`` must be a hash with document IDs as keys and
   arrays of strings as values.

4. ``SPH_UPDATE_JSON``. Works the same as ``SPH_UPDATE_STRING``, but for
   JSON attribute updates.

The optional boolean parameter ``$ignorenonexistent`` indicates that the
update should silently ignore any warnings about trying to update a
column that does not exist in the current index schema.

``$index`` can be either a single index name or a list, like in
``Query()``. Unlike ``Query()``, wildcard is not allowed and all the
indexes to update must be specified explicitly. The list of indexes can
include distributed index names. Updates on distributed indexes will be
pushed to all agents.

Usage example:

.. code-block:: php


    $cl->UpdateAttributes ( "test1", array("group_id"), array(1=>array(456)) );
    $cl->UpdateAttributes ( "products", array ( "price", "amount_in_stock" ),
        array ( 1001=>array(123,5), 1002=>array(37,11), 1003=>array(25,129) ) );

The first sample statement will update document 1 in index ``test1``,
setting ``group_id`` to 456. The second one will update documents 1001,
1002 and 1003 in index ``products``. For document 1001, the new price will
be set to 123 and the new amount in stock to 5; for document 1002, the
new price will be 37 and the new amount will be 11; etc.
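
The non-default ``$type`` values expect differently shaped ``$values``
hashes. The sketch below shows one possible layout for MVA and string
updates. The attribute names ``tags`` and ``title``, the document IDs,
and the values are assumptions made up for illustration. ``$cl`` is
assumed to be a configured client object, and the calls are skipped when
it is not defined.

.. code-block:: php

    <?php
    // SPH_UPDATE_MVA: each document ID maps to an array of arrays of
    // ints, one inner array per attribute listed in $mvaAttrs.
    $mvaAttrs  = array("tags");
    $mvaValues = array(
        1001 => array(array(3, 7, 9)),
        1002 => array(array(5)),
    );

    // SPH_UPDATE_STRING: each document ID maps to an array of strings,
    // one string per attribute listed in $strAttrs.
    $strAttrs  = array("title");
    $strValues = array(
        1001 => array("red shoes"),
        1002 => array("blue hat"),
    );

    // $cl is assumed to be a configured client object; skip otherwise.
    if (isset($cl)) {
        $n = $cl->UpdateAttributes("products", $mvaAttrs, $mvaValues, SPH_UPDATE_MVA);
        if ($n < 0) {
            print "ERROR: " . $cl->GetLastError() . "\n";
        }
        $n = $cl->UpdateAttributes("products", $strAttrs, $strValues, SPH_UPDATE_STRING);
        if ($n < 0) {
            print "ERROR: " . $cl->GetLastError() . "\n";
        }
    }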