- <?xml version="1.0"?>
- <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
- "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
- <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
- <!ENTITY version SYSTEM "version.xml">
- ]>
- <chapter id="buffers-language-script-and-direction">
- <title>Buffers, language, script and direction</title>
- <para>
- The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
- buffer. In this chapter, we'll look at how to set up a buffer with
- the text that we want and how to customize the properties of the
- buffer. We'll also look at a piece of lower-level machinery that
- you will need to understand before proceeding: the functions that
- HarfBuzz uses to retrieve Unicode information.
- </para>
- <para>
- After shaping is complete, HarfBuzz puts its output back
- into the buffer. But getting that output requires setting up a
- face and a font first, so we will look at that in the next chapter
- instead of here.
- </para>
- <section id="creating-and-destroying-buffers">
- <title>Creating and destroying buffers</title>
- <para>
- As we saw in our <emphasis>Getting Started</emphasis> example, a
- buffer is created and
- initialized with <function>hb_buffer_create()</function>. This
- produces a new, empty buffer object, instantiated with some
- default values and ready to accept your Unicode strings.
- </para>
- <para>
- HarfBuzz manages the memory of objects (such as buffers) that it
- creates, so you don't have to. When you have finished working on
- a buffer, you can call <function>hb_buffer_destroy()</function>:
- </para>
- <programlisting language="C">
- hb_buffer_t *buf = hb_buffer_create();
- ...
- hb_buffer_destroy(buf);
- </programlisting>
- <para>
- This decreases the buffer's reference count and, once no other
- part of the program holds a reference to the buffer, destroys the
- object and frees its associated memory. If you acquire a HarfBuzz
- buffer from another subsystem and want to ensure that it is not
- freed when someone else destroys it, you should increase its
- reference count:
- </para>
- <programlisting language="C">
- void somefunc(hb_buffer_t *buf) {
- buf = hb_buffer_reference(buf);
- ...
- </programlisting>
- <para>
- And then decrease it once you're done with it:
- </para>
- <programlisting language="C">
- hb_buffer_destroy(buf);
- }
- </programlisting>
- <para>
- While we are on the subject of reference-counting buffers, it is
- worth noting that an individual buffer can only meaningfully be
- used by one thread at a time.
- </para>
- <para>
- To throw away all the data in your buffer and start from scratch,
- call <function>hb_buffer_reset(buf)</function>. If you want to
- throw away the string in the buffer but keep the options, you can
- instead call <function>hb_buffer_clear_contents(buf)</function>.
- </para>
- </section>
-
- <section id="adding-text-to-the-buffer">
- <title>Adding text to the buffer</title>
- <para>
- Now we have a brand new HarfBuzz buffer. Let's start filling it
- with text! From HarfBuzz's perspective, a buffer is just a stream
- of Unicode code points, but your input string is probably in one of
- the standard Unicode character encodings (UTF-8, UTF-16, or
- UTF-32). HarfBuzz provides convenience functions that accept
- each of these encodings:
- <function>hb_buffer_add_utf8()</function>,
- <function>hb_buffer_add_utf16()</function>, and
- <function>hb_buffer_add_utf32()</function>. Other than the
- character encoding they accept, they function identically.
- </para>
- <para>
- You can add UTF-8 text to a buffer by passing in the text array,
- the array's length, an offset into the array for the first
- character to add, and the length of the segment to add:
- </para>
- <programlisting language="C">
- void hb_buffer_add_utf8 (hb_buffer_t *buf,
- const char *text,
- int text_length,
- unsigned int item_offset,
- int item_length);
- </programlisting>
- <para>
- So, in practice, you can say:
- </para>
- <programlisting language="C">
- hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
- </programlisting>
- <para>
- This will append your new characters to
- <parameter>buf</parameter>, not replace its existing
- contents. Also, note that you can use <literal>-1</literal> in
- place of the first instance of <function>strlen(text)</function>
- if your text array is NUL-terminated. Similarly, you can use
- <literal>-1</literal> as the final argument if you want to add
- everything from <parameter>item_offset</parameter> to the end of
- the text.
- </para>
- <para>
- Whatever <parameter>item_offset</parameter> and
- <parameter>item_length</parameter> you provide, HarfBuzz will also
- attempt to grab the five characters <emphasis>before</emphasis>
- the offset point and the five characters
- <emphasis>after</emphasis> the designated end. These are the
- before and after "context" segments, which are used internally
- for HarfBuzz to make shaping decisions. They will not be part of
- the final output, but they ensure that HarfBuzz's
- script-specific shaping operations are correct. If there are
- fewer than five characters available for the before or after
- contexts, HarfBuzz will just grab what is there.
- </para>
- <para>
- For longer text runs, such as full paragraphs, it might be
- tempting to only add smaller sub-segments to a buffer and
- shape them in piecemeal fashion. Generally, this is not a good
- idea, however, because a lot of shaping decisions are
- dependent on this context information. For example, in Arabic
- and other connected scripts, HarfBuzz needs to know the code
- points before and after each character in order to correctly
- determine which glyph to return.
- </para>
- <para>
- The safest approach is to add all of the text available (even
- if your text contains a mix of scripts, directions, languages
- and fonts), then use <parameter>item_offset</parameter> and
- <parameter>item_length</parameter> to indicate which characters you
- want shaped (which must all have the same script, direction,
- language and font), so that HarfBuzz has access to any context.
- </para>
- <para>
- You can also add Unicode code points directly with
- <function>hb_buffer_add_codepoints()</function>. The arguments
- to this function are the same as those for the UTF
- encodings, except that the text is an array of
- <type>hb_codepoint_t</type> values. It is particularly important
- to note that this function does not do validity checking on the
- code points that are added to a buffer. The UTF functions replace
- invalid input with the replacement code point, but here it is up
- to you to do any sanity checking necessary.
- </para>
-
- </section>
-
- <section id="setting-buffer-properties">
- <title>Setting buffer properties</title>
- <para>
- Buffers containing input characters still need several
- properties set before HarfBuzz can shape their text correctly.
- </para>
- <para>
- Initially, all buffers are set to the
- <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
- type. After adding text, the buffer should be set to
- <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
- indicates that it contains un-shaped input
- characters. After shaping, the buffer will have the
- <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
- </para>
- <para>
- <function>hb_buffer_add_utf8()</function> and the
- other UTF functions set the content type of their buffer
- automatically. But if you are reusing a buffer, you may want to
- check its state with
- <function>hb_buffer_get_content_type(buf)</function>. If
- necessary, you can set the content type with
- </para>
- <programlisting language="C">
- hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
- </programlisting>
- <para>
- to prepare for shaping.
- </para>
- <para>
- Buffers also need to carry information about the script,
- language, and text direction of their contents. You can set
- these properties individually:
- </para>
- <programlisting language="C">
- hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
- hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
- hb_buffer_set_language(buf, hb_language_from_string("en", -1));
- </programlisting>
- <para>
- However, since these properties are often repeated for
- multiple text runs, you can also save them in a
- <literal>hb_segment_properties_t</literal> for reuse:
- </para>
- <programlisting language="C">
- hb_segment_properties_t savedprops;
- hb_buffer_get_segment_properties (buf, &amp;savedprops);
- ...
- hb_buffer_set_segment_properties (buf2, &amp;savedprops);
- </programlisting>
- <para>
- HarfBuzz also provides getter functions to retrieve a buffer's
- direction, script, and language properties individually.
- </para>
- <para>
- HarfBuzz recognizes four text directions in
- <type>hb_direction_t</type>: left-to-right
- (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
- top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
- bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
- script property, HarfBuzz uses identifiers based on the
- <ulink
- url="https://unicode.org/iso15924/">ISO 15924
- standard</ulink>. For languages, HarfBuzz uses tags based on the
- <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
- </para>
- <para>
- Helper functions are provided to convert character strings into
- the necessary script and language tag types.
- </para>
- <para>
- Two additional buffer properties to be aware of are the
- "invisible glyph" and the replacement code point. The
- replacement code point is inserted into buffer output in place of
- any invalid code points encountered in the input. By default, it
- is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
- point, <literal>U+FFFD</literal> "�". You can change this with
- </para>
- <programlisting language="C">
- hb_buffer_set_replacement_codepoint(buf, replacement);
- </programlisting>
- <para>
- passing in the replacement Unicode code point as the
- <parameter>replacement</parameter> parameter.
- </para>
- <para>
- The invisible glyph is used to replace all output glyphs that
- are invisible. By default, the standard space character
- <literal>U+0020</literal> is used; you can replace this (for
- example, when using a font that provides script-specific
- spaces) with
- </para>
- <programlisting language="C">
- hb_buffer_set_invisible_glyph(buf, replacement_glyph);
- </programlisting>
- <para>
- Do note that in the <parameter>replacement_glyph</parameter>
- parameter, you must provide the glyph ID of the replacement you
- wish to use, not the Unicode code point.
- </para>
- <para>
- HarfBuzz supports a few additional flags you might want to set
- on your buffer under certain circumstances. The
- <literal>HB_BUFFER_FLAG_BOT</literal> and
- <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
- that the buffer represents the beginning or end (respectively)
- of a text element (such as a paragraph or other block). Knowing
- this allows HarfBuzz to apply certain contextual font features
- when shaping, such as initial or final variants in connected
- scripts.
- </para>
- <para>
- <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
- tells HarfBuzz not to hide glyphs with the
- <literal>Default_Ignorable</literal> property in Unicode. This
- property designates control characters and other non-printing
- code points, such as joiners and variation selectors. Normally
- HarfBuzz replaces them in the output buffer with zero-width
- space glyphs (using the "invisible glyph" property discussed
- above); setting this flag causes them to be printed, which can
- be helpful for troubleshooting.
- </para>
- <para>
- Conversely, setting the
- <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
- tells HarfBuzz to remove <literal>Default_Ignorable</literal>
- glyphs from the output buffer entirely. Finally, setting the
- <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
- flag tells HarfBuzz not to insert the dotted-circle glyph
- (<literal>U+25CC</literal>, "◌"), which is normally
- inserted into buffer output when broken character sequences are
- encountered (such as combining marks that are not attached to a
- base character).
- </para>
- </section>
-
- <section id="customizing-unicode-functions">
- <title>Customizing Unicode functions</title>
- <para>
- HarfBuzz requires some simple functions for accessing
- information from the Unicode Character Database that is useful
- for shaping (such as the <literal>General_Category</literal> (gc)
- and <literal>Script</literal> (sc) properties), as well as for
- operations like composing and decomposing code points.
- </para>
- <para>
- HarfBuzz includes its own internal, lightweight set of Unicode
- functions. At build time, it is also possible to compile support
- for some other options, such as the Unicode functions provided
- by GLib or the International Components for Unicode (ICU)
- library. Generally, this option is only of interest for client
- programs that have specific integration requirements or that do
- a significant amount of customization.
- </para>
- <para>
- If your program has access to other Unicode functions, however,
- such as through a system library or application framework, you
- might prefer to use those instead of the built-in
- options. HarfBuzz supports this by implementing its Unicode
- functions as a set of virtual methods that you can replace
- without otherwise affecting HarfBuzz's functionality.
- </para>
- <para>
- The Unicode functions are specified in a structure called
- <literal>unicode_funcs</literal> which is attached to each
- buffer. But even though <literal>unicode_funcs</literal> is
- associated with a <type>hb_buffer_t</type>, the functions
- themselves are called by other HarfBuzz APIs that access
- buffers, so it would be unwise for you to hook different
- functions into different buffers.
- </para>
- <para>
- In addition, you can mark your <literal>unicode_funcs</literal>
- as immutable by calling
- <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
- This is especially useful if your code is a
- library or framework that will have its own client programs. By
- marking your Unicode function choices as immutable, you prevent
- your own client programs from changing the
- <literal>unicode_funcs</literal> configuration and introducing
- inconsistencies and errors downstream.
- </para>
- <para>
- You can retrieve the Unicode-functions configuration for
- your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
- </para>
- <programlisting language="C">
- hb_unicode_funcs_t *ufunctions;
- ufunctions = hb_buffer_get_unicode_funcs(buf);
- </programlisting>
- <para>
- The current version of <literal>unicode_funcs</literal> uses six functions:
- </para>
- <itemizedlist>
- <listitem>
- <para>
- <function>hb_unicode_combining_class_func_t</function>:
- returns the Canonical Combining Class of a code point.
- </para>
- </listitem>
- <listitem>
- <para>
- <function>hb_unicode_general_category_func_t</function>:
- returns the General Category (gc) of a code point.
- </para>
- </listitem>
- <listitem>
- <para>
- <function>hb_unicode_mirroring_func_t</function>: returns
- the Mirroring Glyph code point (for bi-directional
- replacement) of a code point.
- </para>
- </listitem>
- <listitem>
- <para>
- <function>hb_unicode_script_func_t</function>: returns the
- Script (sc) property of a code point.
- </para>
- </listitem>
- <listitem>
- <para>
- <function>hb_unicode_compose_func_t</function>: returns the
- canonical composition of a sequence of two code points.
- </para>
- </listitem>
- <listitem>
- <para>
- <function>hb_unicode_decompose_func_t</function>: returns
- the canonical decomposition of a code point.
- </para>
- </listitem>
- </itemizedlist>
- <para>
- Note, however, that future HarfBuzz releases may alter this set.
- </para>
- <para>
- Each Unicode function has a corresponding setter, with which you
- can install your replacement function as the callback. For example,
- to replace
- <function>hb_unicode_general_category_func_t</function>, you can call
- </para>
- <programlisting language="C">
- hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
- </programlisting>
- <para>
- Virtualizing this set of Unicode functions is primarily intended
- to improve portability. There is no need for every client
- program to make the effort to replace the default options, so if
- you are unsure, do not feel any pressure to customize
- <literal>unicode_funcs</literal>.
- </para>
- </section>
-
- </chapter>
|