Additional functionality
------------------------

.. _build_excerpts:

BuildExcerpts
~~~~~~~~~~~~~

**Prototype:** function BuildExcerpts ( $docs, $index, $words,
$opts=array() )

Excerpts (snippets) builder function. Connects to ``searchd``, asks it
to generate excerpts (snippets) from given documents, and returns the
results.

``$docs`` is a plain array of strings that carry the documents'
contents. ``$index`` is an index name string. Settings from the given
index (such as charset, morphology, and wordforms) will be used.
``$words`` is a string that contains the keywords to highlight. They
will be processed with respect to index settings. For instance, if
English stemming is enabled in the index, “shoes” will be highlighted
even if the keyword is “shoe”. Keywords can contain wildcards, which work
similarly to the star-syntax available in queries. ``$opts`` is a hash that
contains additional optional highlighting parameters:

- “before_match”:

  - A string to insert before a keyword match. A %PASSAGE_ID% macro can
    be used in this string. The first match of the macro is replaced with
    an incrementing passage number within the current snippet. Numbering
    starts at 1 by default but can be overridden with the
    “start_passage_id” option. In a multi-document call, %PASSAGE_ID%
    restarts at every given document. Default is ``<b>``.

- “after_match”:

  - A string to insert after a keyword match. Starting with version
    1.10-beta, a %PASSAGE_ID% macro can be used in this string. Default
    is ``</b>``.

- “chunk_separator”:

  - A string to insert between snippet chunks (passages). Default is
    ``" ... "``.

- “limit”:

  - Maximum snippet size, in symbols (codepoints). Integer, default is
    256.

- “around”:

  - How many words to pick around each matching keyword block. Integer,
    default is 5.

- “exact_phrase”:

  - Whether to highlight exact query phrase matches only instead of
    individual keywords. Boolean, default is false.

- “use_boundaries”:

  - Whether to additionally break passages by phrase boundary characters,
    as configured in index settings with the
    :ref:`phrase_boundary <phrase_boundary>`
    directive. Boolean, default is false.

- “weight_order”:

  - Whether to sort the extracted passages in order of relevance
    (decreasing weight), or in order of appearance in the document
    (increasing position). Boolean, default is false.

- “query_mode”:

  - Whether to handle ``$words`` as a query in :ref:`extended
    syntax <extended_query_syntax>`, or as a bag of words
    (default behavior). For instance, in query mode (“one two” \| “three
    four”) will only highlight and include occurrences of “one two” or
    “three four” where the two words of each pair are adjacent to each
    other. In default mode, any single occurrence of “one”, “two”,
    “three”, or “four” would be highlighted. Boolean, default is false.

- “force_all_words”:

  - Ignores the snippet length limit until it includes all the keywords.
    Boolean, default is false.

- “limit_passages”:

  - Limits the maximum number of passages that can be included in the
    snippet. Integer, default is 0 (no limit).

- “limit_words”:

  - Limits the maximum number of words that can be included in the
    snippet. Note that the limit applies to all words, not just the
    matched keywords to highlight. For example, if we are highlighting
    “Mary” and a passage “Mary had a little lamb” is selected, then it
    contributes 5 words to this limit, not just 1. Integer, default is 0
    (no limit).

- “start_passage_id”:

  - Specifies the starting value of the %PASSAGE_ID% macro (which is
    detected and expanded in the ``before_match`` and ``after_match``
    strings). Integer, default is 1.

- “load_files”:

  - Whether to handle ``$docs`` as data to extract snippets from (default
    behavior), or to treat it as file names and load data from the
    specified files on the server side. Up to
    :ref:`dist_threads <dist_threads>`
    worker threads per request will be created to parallelize the work
    when this flag is enabled. Boolean, default is false. Snippet
    building can also be parallelized across remote agents: set the
    :ref:`dist_threads <dist_threads>`
    parameter in the config to a value greater than 1, and then invoke
    snippet generation over a distributed index that contains only
    one(!) :ref:`local <local>` agent
    and several remote ones. The
    :ref:`snippets_file_prefix <snippets_file_prefix>`
    option also applies: the final file name is the prefix concatenated
    with the given name. In other words, when
    snippets_file_prefix is ‘/var/data’ and the file name is ‘text.txt’,
    Sphinx will try to generate the snippets from the file
    ‘/var/datatext.txt’, which is exactly ‘/var/data’ + ‘text.txt’.

- “load_files_scattered”:

  - Works only with distributed snippet generation with remote
    agents. The source files for snippets can be distributed among
    different agents, and the main daemon will merge together all
    non-erroneous results. So, if one agent of the distributed index has
    ‘file1.txt’, another has ‘file2.txt’, and you request snippets
    for both files, Sphinx will merge the results from the agents,
    and you will get the snippets from both ‘file1.txt’ and
    ‘file2.txt’. Boolean, default is false.

    If “load_files” is also set, the request returns an error
    if any of the files is not available anywhere. Otherwise (if
    “load_files” is not set) it just returns empty strings for
    all absent files. The master instance resets this flag when it
    distributes the snippets among agents, so for agents the absence of
    a file is not a critical error, but for the master it might be. If
    you want to be sure that all snippets are actually created, set both
    “load_files_scattered” and “load_files”. If the absence of some
    snippets caused by some agents is not critical for you, set just
    “load_files_scattered” and leave “load_files” unset.

- “html_strip_mode”:

  - HTML stripping mode setting. Defaults to “index”, which means that
    index settings will be used. The other values are “none” and “strip”,
    which forcibly skip or apply stripping regardless of index settings,
    and “retain”, which retains HTML markup and protects it from
    highlighting. The “retain” mode can only be used when highlighting
    full documents and thus requires that no snippet size limits are set.
    String, allowed values are “none”, “strip”, “index”, and “retain”.

- “allow_empty”:

  - Allows an empty string to be returned as the highlighting result when
    a snippet could not be generated (no keywords match, or no passages
    fit the limit). By default, the beginning of the original text is
    returned instead of an empty string. Boolean, default is false.

- “passage_boundary”:

  - Ensures that passages do not cross a sentence, paragraph, or zone
    boundary (when used with an index that has the respective indexing
    settings enabled). String, allowed values are “sentence”,
    “paragraph”, and “zone”.

- “emit_zones”:

  - Emits an HTML tag with an enclosing zone name before each passage.
    Boolean, default is false.

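
As a rough illustration of the options above, the following sketch assembles
an ``$opts`` hash from the documented option names and shows how the file
name used by “load_files” is derived from snippets_file_prefix by plain
concatenation. The helper name ``snippetFilePath`` and all sample values are
assumptions made for this example only; they are not part of the API.

::

    <?php
    // Hypothetical helper (not part of the API): mirrors the documented rule
    // that the final file name is the prefix concatenated with the given
    // name, with no path separator inserted.
    function snippetFilePath(string $prefix, string $name): string
    {
        return $prefix . $name;
    }

    // Sample options hash; every key is one of the options listed above,
    // the values are arbitrary examples.
    $opts = array(
        "before_match"     => "<mark id=\"p%PASSAGE_ID%\">",
        "after_match"      => "</mark>",
        "chunk_separator"  => " ... ",
        "limit"            => 400,
        "around"           => 8,
        "query_mode"       => true,
        "start_passage_id" => 1,
        "allow_empty"      => true,
    );

    print snippetFilePath("/var/data", "text.txt") . "\n";  // /var/datatext.txt
    print snippetFilePath("/var/data/", "text.txt") . "\n"; // /var/data/text.txt

Since nothing is inserted between the two parts, a prefix that is meant to
name a directory needs its own trailing slash.
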
The snippet extraction algorithm currently favors better passages (with
closer phrase matches), and then passages with keywords not yet in the
snippet. Generally, it will try to highlight the best match with the
query, and it will also try to highlight all the query keywords, as far
as the limits allow. If the document does not match the query, the
beginning of the document, trimmed down according to the limits, is
returned by default. You can return an empty snippet instead by
setting the “allow_empty” option to true.

Returns false on failure. Returns a plain array of strings with excerpts
(snippets) on success.

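
The return convention above can be wrapped as in the following minimal
sketch. It assumes ``$cl`` is a client object that exposes
``BuildExcerpts()`` and ``GetLastError()`` as described in this section;
the wrapper name ``buildSnippetsOrFail`` is hypothetical.

::

    <?php
    // Hypothetical wrapper: turns the documented "false on failure" result
    // into an exception so callers cannot silently ignore it.
    function buildSnippetsOrFail($cl, array $docs, string $index,
                                 string $words, array $opts = array()): array
    {
        $res = $cl->BuildExcerpts($docs, $index, $words, $opts);
        if ($res === false) {
            throw new RuntimeException(
                "BuildExcerpts failed: " . $cl->GetLastError());
        }
        // On success: a plain array of excerpt strings.
        return $res;
    }

    // Possible call, assuming a connected client and an index named "test1":
    // $snippets = buildSnippetsOrFail($cl, array("Mary had a little lamb"),
    //     "test1", "Mary", array("limit" => 100));
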
.. _build_keywords:

BuildKeywords
~~~~~~~~~~~~~

**Prototype:** function BuildKeywords ( $query, $index, $hits )

Extracts keywords from query using tokenizer settings for given index,
optionally with per-keyword occurrence statistics. Returns an array of
hashes with per-keyword information.

``$query`` is a query to extract keywords from. ``$index`` is a name of
the index to get tokenizing settings and keyword occurrence statistics
from. ``$hits`` is a boolean flag that indicates whether keyword
occurrence statistics are required.

Usage example:

::

    $keywords = $cl->BuildKeywords ( "this.is.my query", "test1", false );

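
The returned array of hashes could be consumed as in the sketch below. The
key names ``tokenized``, ``normalized``, ``docs``, and ``hits`` are an
assumption that is not stated in the text above, so verify them against your
client library. The helper name ``formatKeywords`` and the sample data are
made up for illustration.

::

    <?php
    // Assumption: each per-keyword hash carries "tokenized" and "normalized"
    // entries, plus "docs" and "hits" when statistics were requested.
    function formatKeywords(array $keywords): array
    {
        $lines = array();
        foreach ($keywords as $kw) {
            $line = $kw["tokenized"] . " -> " . $kw["normalized"];
            if (isset($kw["docs"], $kw["hits"])) {
                $line .= " (docs=" . $kw["docs"] . ", hits=" . $kw["hits"] . ")";
            }
            $lines[] = $line;
        }
        return $lines;
    }

    // Hand-written sample data, only to show the shape being assumed.
    $sample = array(
        array("tokenized" => "this", "normalized" => "this",
              "docs" => 3, "hits" => 7),
        array("tokenized" => "query", "normalized" => "query"),
    );
    print implode("\n", formatKeywords($sample)) . "\n";
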
.. _escape_string:

EscapeString
~~~~~~~~~~~~

**Prototype:** function EscapeString ( $string )

Escapes characters that are treated as special operators by the query
language parser. Returns an escaped string.

``$string`` is a string to escape.

This function might seem redundant because it's trivial to implement in
any calling application. However, as the set of special characters might
change over time, it makes sense to have an API call that is guaranteed
to escape all such characters at all times.

Usage example:

::

    $escaped = $cl->EscapeString ( "escaping-sample@query/string" );

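
To make the idea of escaping concrete, here is a minimal stand-alone sketch
that prefixes each special character with a backslash. The character set
used here is an assumption chosen for illustration and may not match what
the library escapes, which is exactly why the API call is preferable in real
code. The function name ``escapeQuerySketch`` is hypothetical.

::

    <?php
    // Illustrative only: the default $special set is an assumption, not the
    // library's authoritative list of operator characters.
    function escapeQuerySketch(string $s,
                               string $special = '\\()|-!@~"&/'): string
    {
        $map = array();
        foreach (str_split($special) as $ch) {
            $map[$ch] = '\\' . $ch;
        }
        // strtr() with an array makes a single pass, so the backslashes it
        // inserts are never escaped a second time.
        return strtr($s, $map);
    }

    print escapeQuerySketch("escaping-sample@query/string") . "\n";
    // prints: escaping\-sample\@query\/string
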
.. _flush_attributes:

FlushAttributes
~~~~~~~~~~~~~~~

**Prototype:** function FlushAttributes ()

Forces ``searchd`` to flush pending attribute updates to disk, and
blocks until completion. Returns a non-negative internal “flush tag” on
success. Returns -1 and sets an error message on error.

Attribute values updated using the
:ref:`UpdateAttributes() <update_attributes>`
API call are only kept in RAM until a so-called flush (which writes the
current, possibly updated attribute values back to disk). The
FlushAttributes() call lets you enforce a flush. The call will block
until ``searchd`` finishes writing the data to disk, which might take
seconds or even minutes depending on the total data size (.spa file
size). All the currently updated indexes will be flushed.

The flush tag should be treated as an ever-growing magic number that
does not mean anything. It is guaranteed to be non-negative. It is
guaranteed to grow over time, though not necessarily in a sequential
fashion; for instance, two calls that return 10 and then 1000
respectively are a valid situation. If two calls to FlushAttributes()
return the same tag, it means that there were no actual attribute updates
in between them, and therefore the current flushed state remained the
same (for all indexes).

Usage example:

::

    $status = $cl->FlushAttributes ();
    if ( $status<0 )
        print "ERROR: " . $cl->GetLastError();

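
The flush tag semantics described above can be used to detect whether any
attribute updates happened between two flushes. The sketch below assumes
``$cl`` exposes ``FlushAttributes()`` and ``GetLastError()`` as documented
here; the helper name ``flushAndCompare`` is hypothetical.

::

    <?php
    // Hypothetical helper: flushes, then compares the new tag with the tag
    // from an earlier flush (pass null on the first call).
    function flushAndCompare($cl, ?int $previousTag): array
    {
        $tag = (int) $cl->FlushAttributes();
        if ($tag < 0) {
            throw new RuntimeException(
                "flush failed: " . $cl->GetLastError());
        }
        // Equal tags mean no attribute updates happened between the two
        // calls; the tag value itself carries no other meaning.
        $changed = ($previousTag === null) ? null : ($tag !== $previousTag);
        return array("tag" => $tag, "changed" => $changed);
    }

    // Possible use, assuming a connected client:
    // $first  = flushAndCompare($cl, null);
    // $second = flushAndCompare($cl, $first["tag"]);
    // if ($second["changed"]) { /* updates were flushed in between */ }
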
.. _Status:

Status
~~~~~~

**Prototype:** function Status ()

Queries searchd status, and returns an array of status variable name and
value pairs.

Usage example:

::

    $status = $cl->Status ();
    foreach ( $status as $row )
        print join ( ": ", $row ) . "\n";

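
For lookups by variable name, the list of pairs can be folded into a hash,
as in this small sketch. It assumes each row is a two-element array (name,
then value), which is what the ``join()`` call in the usage example
suggests. The helper name ``statusToMap`` and the sample rows are
hypothetical.

::

    <?php
    // Sketch: turn the list of (name, value) rows into a name => value map.
    // Rows that do not look like a pair are skipped.
    function statusToMap(array $rows): array
    {
        $map = array();
        foreach ($rows as $row) {
            if (is_array($row) && count($row) >= 2) {
                $map[$row[0]] = $row[1];
            }
        }
        return $map;
    }

    // Hand-written sample rows; real variable names depend on the server.
    $rows = array(array("uptime", "100"), array("connections", "5"));
    $map = statusToMap($rows);
    print $map["uptime"] . "\n"; // prints: 100
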
.. _update_attributes:

UpdateAttributes
~~~~~~~~~~~~~~~~

**Prototype:** function UpdateAttributes ( $index, $attrs, $values,
$mva=false, $ignorenonexistent=false )

Instantly updates given attribute values in given documents. Returns the
number of actually updated documents (0 or more) on success, or -1 on
failure.

``$index`` is the name of the index (or indexes) to be updated.
``$attrs`` is a plain array of string attribute names, listing the
attributes that are updated. ``$values`` is a hash where the key is a
document ID and the value is a plain array of new attribute values. The
optional boolean parameter ``$mva`` indicates that MVA attributes are
being updated. In this case ``$values`` must be a hash where the key is a
document ID and the value is a plain array of arrays of integers (the new
MVA attribute values). The optional boolean parameter
``$ignorenonexistent`` makes the update silently ignore any warnings
about trying to update a column that does not exist in the current index
schema.

``$index`` can be either a single index name or a list, like in
``Query()``. Unlike ``Query()``, wildcards are not allowed and all the
indexes to update must be specified explicitly. The list of indexes can
include distributed index names. Updates on distributed indexes will be
pushed to all agents.

Updates only work with the ``docinfo=extern`` storage strategy. They are
very fast because they work fully in RAM, but they can also be
made persistent: updates are saved to disk on a clean ``searchd``
shutdown initiated by the SIGTERM signal. With additional restrictions,
updates are also possible on MVA attributes; refer to the
:ref:`mva_updates_pool <mva_updates_pool>`
directive for details.

Usage example:

::

    $cl->UpdateAttributes ( "test1", array("group_id"), array(1=>array(456)) );
    $cl->UpdateAttributes ( "products", array ( "price", "amount_in_stock" ),
        array ( 1001=>array(123,5), 1002=>array(37,11), 1003=>array(25,129) ) );

The first sample statement will update document 1 in index “test1”,
setting “group_id” to 456. The second one will update documents 1001,
1002 and 1003 in index “products”. For document 1001, the new price will
be set to 123 and the new amount in stock to 5; for document 1002, the
new price will be 37 and the new amount will be 11; etc.

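
Because each row in ``$values`` must line up with ``$attrs``, a quick shape
check before the call can catch mistakes early. The following is a sketch
under that reading of the parameter description; the helper name
``updateShapeIsValid`` and the attribute name ``tags`` are hypothetical.

::

    <?php
    // Hypothetical pre-flight check (not part of the API): every document
    // row must supply exactly one value per attribute name.
    function updateShapeIsValid(array $attrs, array $values,
                                bool $mva = false): bool
    {
        foreach ($values as $row) {
            if (!is_array($row) || count($row) !== count($attrs)) {
                return false;
            }
            foreach ($row as $value) {
                // MVA updates need an array per attribute; plain updates
                // must not pass one.
                if (is_array($value) !== $mva) {
                    return false;
                }
            }
        }
        return true;
    }

    // Sample MVA payload: document 1001 gets the value list (1, 2, 3).
    $attrs  = array("tags");
    $values = array(1001 => array(array(1, 2, 3)));

    if (updateShapeIsValid($attrs, $values, true)) {
        // Possible call, assuming a connected client and an index "products":
        // $n = $cl->UpdateAttributes("products", $attrs, $values, true);
        // if ($n < 0) print "ERROR: " . $cl->GetLastError();
    }
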