<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="buffers-language-script-and-direction">
  <title>Buffers, language, script and direction</title>
  <para>
    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
    buffer. In this chapter, we'll look at how to set up a buffer with
    the text that we want and how to customize the properties of the
    buffer. We'll also look at a piece of lower-level machinery that
    you will need to understand before proceeding: the functions that
    HarfBuzz uses to retrieve Unicode information.
  </para>
  <para>
    After shaping is complete, HarfBuzz puts its output back
    into the buffer. But getting that output requires setting up a
    face and a font first, so we will look at that in the next chapter
    instead of here.
  </para>
  <section id="creating-and-destroying-buffers">
    <title>Creating and destroying buffers</title>
    <para>
      As we saw in our <emphasis>Getting Started</emphasis> example, a
      buffer is created and
      initialized with <function>hb_buffer_create()</function>. This
      produces a new, empty buffer object, instantiated with some
      default values and ready to accept your Unicode strings.
    </para>
    <para>
      HarfBuzz manages the memory of objects (such as buffers) that it
      creates, so you don't have to. When you have finished working on
      a buffer, you can call <function>hb_buffer_destroy()</function>:
    </para>
    <programlisting language="C">
      hb_buffer_t *buf = hb_buffer_create();
      ...
      hb_buffer_destroy(buf);
    </programlisting>
    <para>
      This will destroy the object and free its associated memory,
      unless some other part of the program holds a reference to this
      buffer. If you acquire a HarfBuzz buffer from another subsystem
      and want to ensure that it is not freed when someone else
      destroys it, you should increase its reference count:
    </para>
    <programlisting language="C">
      void somefunc(hb_buffer_t *buf) {
        buf = hb_buffer_reference(buf);
        ...
    </programlisting>
    <para>
      And then decrease it once you're done with it:
    </para>
    <programlisting language="C">
        hb_buffer_destroy(buf);
      }
    </programlisting>
    <para>
      While we are on the subject of reference-counting buffers, it is
      worth noting that an individual buffer can only meaningfully be
      used by one thread at a time.
    </para>
    <para>
      To throw away all the data in your buffer and start from scratch,
      call <function>hb_buffer_reset(buf)</function>. If you want to
      throw away the string in the buffer but keep settings such as
      the Unicode functions and the replacement code point, you can
      instead call <function>hb_buffer_clear_contents(buf)</function>.
    </para>
  </section>
  <section id="adding-text-to-the-buffer">
    <title>Adding text to the buffer</title>
    <para>
      Now we have a brand new HarfBuzz buffer. Let's start filling it
      with text! From HarfBuzz's perspective, a buffer is just a stream
      of Unicode code points, but your input string is probably in one of
      the standard Unicode character encodings (UTF-8, UTF-16, or
      UTF-32). HarfBuzz provides convenience functions that accept
      each of these encodings:
      <function>hb_buffer_add_utf8()</function>,
      <function>hb_buffer_add_utf16()</function>, and
      <function>hb_buffer_add_utf32()</function>. Other than the
      character encoding they accept, they function identically.
    </para>
    <para>
      You can add UTF-8 text to a buffer by passing in the text array,
      the array's length, an offset into the array for the first
      character to add, and the length of the segment to add:
    </para>
    <programlisting language="C">
      void
      hb_buffer_add_utf8 (hb_buffer_t *buf,
                          const char *text,
                          int text_length,
                          unsigned int item_offset,
                          int item_length);
    </programlisting>
    <para>
      So, in practice, you can say:
    </para>
    <programlisting language="C">
      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
    </programlisting>
    <para>
      This will append your new characters to
      <parameter>buf</parameter>, not replace its existing
      contents. Also, note that you can use <literal>-1</literal> in
      place of the first instance of <function>strlen(text)</function>
      if your text array is null-terminated. Similarly, you can use
      <literal>-1</literal> as the final argument if you want to add
      everything from the offset to the end of the text.
    </para>
    <para>
      Whatever <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> you provide, HarfBuzz will also
      attempt to grab the five characters <emphasis>before</emphasis>
      the offset point and the five characters
      <emphasis>after</emphasis> the designated end. These are the
      before and after "context" segments, which HarfBuzz uses
      internally to make shaping decisions. They will not be part of
      the final output, but they ensure that HarfBuzz's
      script-specific shaping operations are correct. If there are
      fewer than five characters available for the before or after
      contexts, HarfBuzz will just grab what is there.
    </para>
    <para>
      For longer text runs, such as full paragraphs, it might be
      tempting to add only smaller sub-segments to a buffer and
      shape them piecemeal. This is generally not a good idea,
      because many shaping decisions depend on this context
      information. For example, in Arabic
      and other connected scripts, HarfBuzz needs to know the code
      points before and after each character in order to correctly
      determine which glyph to return.
    </para>
    <para>
      The safest approach is to pass in all of the text available (even
      if your text contains a mix of scripts, directions, languages
      and fonts), then use <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> to indicate which characters you
      want shaped (which must all have the same script, direction,
      language and font), so that HarfBuzz has access to any context.
    </para>
    <para>
      You can also add Unicode code points directly with
      <function>hb_buffer_add_codepoints()</function>. The arguments
      to this function are the same as those for the UTF
      functions, except that the text is an array of code points.
      It is particularly important to note, however, that this
      function does not check the validity of the code points it is
      given. The UTF functions replace invalid sequences with the
      replacement code point, but any deeper sanity checking is up
      to you.
    </para>
  </section>
  <section id="setting-buffer-properties">
    <title>Setting buffer properties</title>
    <para>
      Buffers containing input characters still need several
      properties set before HarfBuzz can shape their text correctly.
    </para>
    <para>
      Initially, all buffers are set to the
      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
      type. After adding text, the buffer should be set to
      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
      indicates that it contains un-shaped input
      characters. After shaping, the buffer will have the
      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
    </para>
    <para>
      <function>hb_buffer_add_utf8()</function> and the
      other UTF functions set the content type of their buffer
      automatically. But if you are reusing a buffer you may want to
      check its state with
      <function>hb_buffer_get_content_type(buffer)</function>. If
      necessary you can set the content type with
    </para>
    <programlisting language="C">
      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
    </programlisting>
    <para>
      to prepare for shaping.
    </para>
    <para>
      Buffers also need to carry information about the script,
      language, and text direction of their contents. You can set
      these properties individually:
    </para>
    <programlisting language="C">
      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
    </programlisting>
    <para>
      However, since these properties are often repeated for
      multiple text runs, you can also save them in a
      <literal>hb_segment_properties_t</literal> for reuse:
    </para>
    <programlisting language="C">
      /* The structure is owned by the caller; HarfBuzz fills it in. */
      hb_segment_properties_t savedprops;
      hb_buffer_get_segment_properties (buf, &amp;savedprops);
      ...
      hb_buffer_set_segment_properties (buf2, &amp;savedprops);
    </programlisting>
    <para>
      HarfBuzz also provides getter functions to retrieve a buffer's
      direction, script, and language properties individually.
    </para>
    <para>
      HarfBuzz recognizes four text directions in
      <type>hb_direction_t</type>: left-to-right
      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
      script property, HarfBuzz uses identifiers based on the
      <ulink
      url="https://unicode.org/iso15924/">ISO 15924
      standard</ulink>. For languages, HarfBuzz uses tags based on the
      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
    </para>
    <para>
      Helper functions are provided to convert character strings into
      the necessary script and language tag types.
    </para>
    <para>
      Two additional buffer properties to be aware of are the
      "invisible glyph" and the replacement code point. The
      replacement code point is inserted into buffer output in place of
      any invalid code points encountered in the input. By default, it
      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
      point, <literal>U+FFFD</literal> "&#xFFFD;". You can change this with
    </para>
    <programlisting language="C">
      hb_buffer_set_replacement_codepoint(buf, replacement);
    </programlisting>
    <para>
      passing in the replacement Unicode code point as the
      <parameter>replacement</parameter> parameter.
    </para>
    <para>
      The invisible glyph is used to replace all output glyphs that
      are invisible. By default, the standard space character
      <literal>U+0020</literal> is used; you can replace this (for
      example, when using a font that provides script-specific
      spaces) with
    </para>
    <programlisting language="C">
      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
    </programlisting>
    <para>
      Do note that in the <parameter>replacement_glyph</parameter>
      parameter, you must provide the glyph ID of the replacement you
      wish to use, not the Unicode code point.
    </para>
    <para>
      HarfBuzz supports a few additional flags you might want to set
      on your buffer under certain circumstances. The
      <literal>HB_BUFFER_FLAG_BOT</literal> and
      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
      that the buffer represents the beginning or end (respectively)
      of a text element (such as a paragraph or other block). Knowing
      this allows HarfBuzz to apply certain contextual font features
      when shaping, such as initial or final variants in connected
      scripts.
    </para>
    <para>
      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
      tells HarfBuzz not to hide glyphs with the
      <literal>Default_Ignorable</literal> property in Unicode. This
      property designates control characters and other non-printing
      code points, such as joiners and variation selectors. Normally
      HarfBuzz replaces them in the output buffer with zero-width
      space glyphs (using the "invisible glyph" property discussed
      above); setting this flag causes them to be printed, which can
      be helpful for troubleshooting.
    </para>
    <para>
      Conversely, setting the
      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
      glyphs from the output buffer entirely. Finally, setting the
      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
      flag tells HarfBuzz not to insert the dotted-circle glyph
      (<literal>U+25CC</literal>, "&#x25CC;"), which is normally
      inserted into buffer output when broken character sequences are
      encountered (such as combining marks that are not attached to a
      base character).
    </para>
  </section>
  <section id="customizing-unicode-functions">
    <title>Customizing Unicode functions</title>
    <para>
      HarfBuzz requires some simple functions for accessing
      information from the Unicode Character Database (such as the
      <literal>General_Category</literal> (gc) and
      <literal>Script</literal> (sc) properties) that is useful
      for shaping, as well as some useful operations like composing and
      decomposing code points.
    </para>
    <para>
      HarfBuzz includes its own internal, lightweight set of Unicode
      functions. At build time, it is also possible to compile support
      for some other options, such as the Unicode functions provided
      by GLib or the International Components for Unicode (ICU)
      library. Generally, this option is only of interest for client
      programs that have specific integration requirements or that do
      a significant amount of customization.
    </para>
    <para>
      If your program has access to other Unicode functions, however,
      such as through a system library or application framework, you
      might prefer to use those instead of the built-in
      options. HarfBuzz supports this by implementing its Unicode
      functions as a set of virtual methods that you can replace
      without otherwise affecting HarfBuzz's functionality.
    </para>
    <para>
      The Unicode functions are specified in a structure called
      <literal>unicode_funcs</literal> which is attached to each
      buffer. But even though <literal>unicode_funcs</literal> is
      associated with a <type>hb_buffer_t</type>, the functions
      themselves are called by other HarfBuzz APIs that access
      buffers, so it would be unwise for you to hook different
      functions into different buffers.
    </para>
    <para>
      In addition, you can mark your <literal>unicode_funcs</literal>
      as immutable by calling
      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
      This is especially useful if your code is a
      library or framework that will have its own client programs. By
      marking your Unicode function choices as immutable, you prevent
      your own client programs from changing the
      <literal>unicode_funcs</literal> configuration and introducing
      inconsistencies and errors downstream.
    </para>
    <para>
      You can retrieve the Unicode-functions configuration for
      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
    </para>
    <programlisting language="C">
      hb_unicode_funcs_t *ufunctions;
      ufunctions = hb_buffer_get_unicode_funcs(buf);
    </programlisting>
    <para>
      The current version of <literal>unicode_funcs</literal> uses six functions:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          <function>hb_unicode_combining_class_func_t</function>:
          returns the Canonical Combining Class of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_general_category_func_t</function>:
          returns the General Category (gc) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_mirroring_func_t</function>: returns
          the Mirroring Glyph code point (for bi-directional
          replacement) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_script_func_t</function>: returns the
          Script (sc) property of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_compose_func_t</function>: returns the
          canonical composition of a sequence of two code points.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_decompose_func_t</function>: returns
          the canonical decomposition of a code point.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      Note, however, that future HarfBuzz releases may alter this set.
    </para>
    <para>
      Each Unicode function has a corresponding setter, with which you
      can install your replacement function as a callback. For example,
      to replace
      <function>hb_unicode_general_category_func_t</function>, you can call
    </para>
    <programlisting language="C">
      hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
    </programlisting>
    <para>
      Virtualizing this set of Unicode functions is primarily intended
      to improve portability. There is no need for every client
      program to make the effort to replace the default options, so if
      you are unsure, do not feel any pressure to customize
      <literal>unicode_funcs</literal>.
    </para>
  </section>
</chapter>