- <?xml version="1.0"?>
- <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
- "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
- <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
- <!ENTITY version SYSTEM "version.xml">
- ]>
- <chapter id="clusters">
- <title>Clusters</title>
- <section id="clusters-and-shaping">
- <title>Clusters and shaping</title>
- <para>
- In text shaping, a <emphasis>cluster</emphasis> is a sequence of
- characters that needs to be treated as a single, indivisible
- unit. A single letter or symbol can be a cluster of its
- own. Other clusters correspond to longer subsequences of the
- input code points — such as a ligature or conjunct form
- — and require the shaper to ensure that the cluster is not
- broken during the shaping process.
- </para>
- <para>
- A cluster is distinct from a <emphasis>grapheme</emphasis>,
- which is the smallest functional unit of a writing system or
- script.
- </para>
- <para>
- The definitions of the two terms are similar. However, clusters
- are only relevant for script shaping and glyph layout. In
- contrast, graphemes are a property of the underlying script, and
- are of interest when client programs implement orthographic
- or linguistic functionality.
- </para>
- <para>
- For example, two individual letters are often two separate
- graphemes. When two letters form a ligature, however, they
- combine into a single glyph. They are then part of the same
- cluster and are treated as a unit by the shaping engine —
- even though the two original, underlying letters remain separate
- graphemes.
- </para>
- <para>
- HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
- with graphemes — although client programs using HarfBuzz
- may still care about graphemes for other reasons from time to time.
- </para>
- <para>
- During the shaping process, there are several shaping operations
- that may merge adjacent characters (for example, when two code
- points form a ligature or a conjunct form and are replaced by a
- single glyph) or split one character into several (for example,
- when decomposing a code point through the
- <literal>ccmp</literal> feature). Operations like these alter
- clusters; HarfBuzz tracks the changes to ensure that no clusters
- get lost or broken during shaping.
- </para>
- <para>
- HarfBuzz records cluster information independently from how
- shaping operations affect the individual glyphs returned in an
- output buffer. Consequently, a client program using HarfBuzz can
- utilize the cluster information to implement features such as:
- </para>
- <itemizedlist>
- <listitem>
- <para>
- Correctly positioning the cursor within a shaped text run,
- even when characters have formed ligatures, composed or
- decomposed, reordered, or undergone other shaping operations.
- </para>
- </listitem>
- <listitem>
- <para>
- Correctly highlighting a text selection that includes some,
- but not all, of the characters in a word.
- </para>
- </listitem>
- <listitem>
- <para>
- Applying text attributes (such as color or underlining) to
- part, but not all, of a word.
- </para>
- </listitem>
- <listitem>
- <para>
- Generating output document formats (such as PDF) with
- embedded text that can be fully extracted.
- </para>
- </listitem>
- <listitem>
- <para>
- Determining the mapping between input characters and output
- glyphs, such as which glyphs are ligatures.
- </para>
- </listitem>
- <listitem>
- <para>
- Performing line-breaking, justification, and other
- line-level or paragraph-level operations that must be done
- after shaping is complete, but which require examining
- character-level properties.
- </para>
- </listitem>
- </itemizedlist>
- </section>
- <section id="working-with-harfbuzz-clusters">
- <title>Working with HarfBuzz clusters</title>
- <para>
- When you add text to a HarfBuzz buffer, each code point must be
- assigned a <emphasis>cluster value</emphasis>.
- </para>
- <para>
- This cluster value is an arbitrary number; HarfBuzz uses it only
- to distinguish between clusters. Many client programs will use
- the index of each code point in the input text stream as the
- cluster value. This is for the sake of convenience; the actual
- value does not matter.
- </para>
- <para>
- Some of the shaping operations performed by HarfBuzz —
- such as reordering, composition, decomposition, and substitution
- — may alter the cluster values of some characters. The
- final cluster values in the buffer at the end of the shaping
- process will indicate to client programs which subsequences of
- glyphs represent a cluster and, therefore, must not be
- separated.
- </para>
- <para>
- In addition, client programs can query the final cluster values
- to discern other potentially important information about the
- glyphs in the output buffer (such as whether or not a ligature
- was formed).
- </para>
- <para>
- For example, if the initial sequence of cluster values was:
- </para>
- <programlisting>
- 0,1,2,3,4
- </programlisting>
- <para>
- and the final sequence of cluster values is:
- </para>
- <programlisting>
- 0,0,3,3
- </programlisting>
- <para>
- then there are two clusters in the output buffer: the first
- cluster includes the first two glyphs, and the second cluster
- includes the third and fourth glyphs. It is also evident that a
- ligature or conjunct has been formed, because there are fewer
- glyphs in the output buffer (four) than there were code points
- in the input buffer (five).
- </para>
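- <para>
- The grouping described above can be sketched in a few lines of
- Python. This is only an illustration that works on plain lists
- of cluster values; <literal>group_by_cluster</literal> is a
- hypothetical helper, not part of the HarfBuzz API.
- </para>

```python
from itertools import groupby


def group_by_cluster(clusters):
    """Return (cluster_value, [glyph indices]) for each run of equal values."""
    runs = []
    for value, run in groupby(enumerate(clusters), key=lambda pair: pair[1]):
        runs.append((value, [index for index, _ in run]))
    return runs


initial = [0, 1, 2, 3, 4]  # one cluster value per input code point
final = [0, 0, 3, 3]       # one cluster value per output glyph

print(group_by_cluster(final))     # [(0, [0, 1]), (3, [2, 3])]
print(len(initial) - len(final))   # 1: one fewer glyph than code points
```
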
- <para>
- Although client programs using HarfBuzz are free to assign
- initial cluster values in any manner they choose to, HarfBuzz
- does offer some useful guarantees if the cluster values are
- assigned in a monotonic (either non-decreasing or non-increasing)
- order.
- </para>
- <para>
- For buffers in the left-to-right (LTR)
- or top-to-bottom (TTB) text flow direction,
- HarfBuzz will preserve the monotonic property: client programs
- are guaranteed that monotonically increasing initial cluster
- values will be returned as monotonically increasing final
- cluster values.
- </para>
- <para>
- For buffers in the right-to-left (RTL)
- or bottom-to-top (BTT) text flow direction,
- the directionality of the buffer itself is reversed for final
- output as a matter of design. Therefore, HarfBuzz inverts the
- monotonic property: client programs are guaranteed that
- monotonically increasing initial cluster values will be
- returned as monotonically <emphasis>decreasing</emphasis> final
- cluster values.
- </para>
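- <para>
- A client program that wants to check this property on the
- values it gets back could use something like the following
- Python sketch. The helper name
- <literal>cluster_order</literal> and the sample sequences are
- assumptions made for illustration; they are not taken from
- HarfBuzz itself.
- </para>

```python
def cluster_order(clusters):
    """Classify a sequence of cluster values (hypothetical helper)."""
    pairs = list(zip(clusters, clusters[1:]))
    if all(nxt >= cur for cur, nxt in pairs):
        return "non-decreasing"
    if all(cur >= nxt for cur, nxt in pairs):
        return "non-increasing"
    return "not monotonic"


# Sample final cluster values, chosen only for illustration.
print(cluster_order([0, 0, 3, 3]))     # non-decreasing, as expected for LTR or TTB
print(cluster_order([3, 3, 0, 0]))     # non-increasing, as expected for RTL or BTT
print(cluster_order([0, 3, 1, 2, 4]))  # not monotonic
```
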
- <para>
- Client programs can adjust how HarfBuzz handles clusters during
- shaping by setting the
- <literal>cluster_level</literal> of the
- buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
- clustering support for this property:
- </para>
- <itemizedlist>
- <listitem>
- <para><emphasis>Level 0</emphasis> is the default.
- </para>
- <para>
- The distinguishing feature of level 0 behavior is that, at
- the beginning of processing the buffer, all code points that
- are categorized as <emphasis>marks</emphasis>,
- <emphasis>modifier symbols</emphasis>, or
- <emphasis>Emoji extended pictographic</emphasis> modifiers,
- as well as the <emphasis>Zero Width Joiner</emphasis> and
- <emphasis>Zero Width Non-Joiner</emphasis> code points, are
- assigned the cluster value of the closest preceding code
- point from a <emphasis>different</emphasis> category.
- </para>
- <para>
- In essence, whenever a base character is followed by a mark
- character or a sequence of mark characters, those marks are
- reassigned to the same initial cluster value as the base
- character. This reassignment is referred to as
- "merging" the affected clusters. This behavior is based on
- the Grapheme Cluster Boundary specification in <ulink
- url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
- Standard Annex #29</ulink>.
- </para>
- <para>
- This cluster level is suitable for code that also uses
- HarfBuzz cluster values as an approximation of Unicode
- grapheme cluster boundaries.
- </para>
- <para>
- Client programs can specify level 0 behavior for a buffer by
- setting its <literal>cluster_level</literal> to
- <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>.
- </para>
- </listitem>
- <listitem>
- <para>
- <emphasis>Level 1</emphasis> tweaks the level 0 behavior
- slightly to produce better results. Level 1 clustering is
- therefore recommended for code that does not need to remain
- backward compatible with the old HarfBuzz codebase.
- </para>
- <para>
- <emphasis>Level 1</emphasis> differs from level 0 by not merging the
- clusters of marks and other modifier code points with the
- preceding "base" code point's cluster. By preserving the
- separate cluster values of these marks and modifier code
- points, level 1 lets client programs perform additional
- operations that might lead to improved results (for example,
- coloring mark glyphs differently from their base).
- </para>
- <para>
- Client programs can specify level 1 behavior for a buffer by
- setting its <literal>cluster_level</literal> to
- <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>.
- </para>
- </listitem>
- <listitem>
- <para>
- <emphasis>Level 2</emphasis> differs significantly in how it
- treats cluster values. In level 2, HarfBuzz never merges
- clusters.
- </para>
- <para>
- This difference can be seen most clearly when HarfBuzz processes
- ligature substitutions and glyph decompositions. In level 0
- and level 1, ligatures and glyph decomposition both involve
- merging clusters; in level 2, neither of these operations
- triggers a merge.
- </para>
- <para>
- Client programs can specify level 2 behavior for a buffer by
- setting its <literal>cluster_level</literal> to
- <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>.
- </para>
- </listitem>
- </itemizedlist>
- <para>
- As mentioned earlier, client programs using HarfBuzz often
- assign initial cluster values in a buffer by reusing the indices
- of the code points in the input text. This gives a sequence of
- cluster values that is monotonically increasing (for example,
- 0,1,2,3,4).
- </para>
- <para>
- It is not <emphasis>required</emphasis> that the cluster values
- in a buffer be monotonically increasing. However, if the initial
- cluster values in a buffer are monotonic and the buffer is
- configured to use cluster level 0 or 1, then HarfBuzz
- guarantees that the final cluster values in the shaped buffer
- will also be monotonic. No such guarantee is made for cluster
- level 2.
- </para>
- <para>
- In levels 0 and 1, HarfBuzz implements the following conceptual
- model for cluster values:
- </para>
- <itemizedlist spacing="compact">
- <listitem>
- <para>
- If the sequence of input cluster values is monotonic, the
- sequence of cluster values will remain monotonic.
- </para>
- </listitem>
- <listitem>
- <para>
- Each cluster value represents a single cluster.
- </para>
- </listitem>
- <listitem>
- <para>
- Each cluster contains one or more glyphs and one or more
- characters.
- </para>
- </listitem>
- </itemizedlist>
- <para>
- In practice, this model offers several benefits. Assuming that
- the initial cluster values were monotonically increasing
- and distinct before shaping began, then, in the final output:
- </para>
- <itemizedlist spacing="compact">
- <listitem>
- <para>
- All adjacent glyphs having the same final cluster
- value belong to the same cluster.
- </para>
- </listitem>
- <listitem>
- <para>
- Each character belongs to the cluster that has the highest
- cluster value <emphasis>not larger than</emphasis> its
- initial cluster value.
- </para>
- </listitem>
- </itemizedlist>
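- <para>
- The second rule can be turned into a lookup. The Python sketch
- below assumes that the initial cluster values were distinct and
- monotonically increasing and that level 0 or level 1 was used;
- <literal>cluster_of_character</literal> is a hypothetical
- helper, not a HarfBuzz function.
- </para>

```python
from bisect import bisect_right


def cluster_of_character(initial_value, final_clusters):
    """Find the highest final cluster value not larger than initial_value."""
    values = sorted(set(final_clusters))
    position = bisect_right(values, initial_value)
    if position == 0:
        raise ValueError("no final cluster value is small enough")
    return values[position - 1]


final = [0, 0, 3, 3]  # final cluster values, one per glyph
mapping = {ch: cluster_of_character(ch, final) for ch in range(5)}
print(mapping)  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

- <para>
- With these sample values, the characters that started with
- cluster values 0, 1, and 2 all belong to cluster 0, and those
- that started with 3 and 4 belong to cluster 3.
- </para>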
- </section>
- <section id="a-clustering-example-for-levels-0-and-1">
- <title>A clustering example for levels 0 and 1</title>
- <para>
- The basic shaping operations affect clusters in a predictable
- manner when using level 0 or level 1:
- </para>
- <itemizedlist>
- <listitem>
- <para>
- When two or more clusters <emphasis>merge</emphasis>, the
- resulting merged cluster takes as its cluster value the
- <emphasis>minimum</emphasis> of the incoming cluster values.
- </para>
- </listitem>
- <listitem>
- <para>
- When a cluster <emphasis>decomposes</emphasis>, all of the
- resulting child clusters inherit as their cluster value the
- cluster value of the parent cluster.
- </para>
- </listitem>
- <listitem>
- <para>
- When a character is <emphasis>reordered</emphasis>, the
- reordered character and all clusters that the character
- moves past as part of the reordering are merged into one cluster.
- </para>
- </listitem>
- </itemizedlist>
- <para>
- The functionality, guarantees, and benefits of level 0 and level
- 1 behavior can be seen with some examples. First, let us examine
- what happens with cluster values when shaping involves cluster
- merging with ligatures and decomposition.
- </para>
- <para>
- Let's say we start with the following character sequence (top row) and
- initial cluster values (bottom row):
- </para>
- <programlisting>
- A,B,C,D,E
- 0,1,2,3,4
- </programlisting>
- <para>
- During shaping, HarfBuzz maps these characters to glyphs from
- the font. For simplicity, let us assume that each character maps
- to the corresponding, identical-looking glyph:
- </para>
- <programlisting>
- A,B,C,D,E
- 0,1,2,3,4
- </programlisting>
- <para>
- Now if, for example, <literal>B</literal> and <literal>C</literal>
- form a ligature, then the clusters to which they belong
- "merge". This merged cluster takes for its cluster
- value the minimum of all the cluster values of the clusters that
- went into the ligature. In this case, we get:
- </para>
- <programlisting>
- A,BC,D,E
- 0,1 ,3,4
- </programlisting>
- <para>
- because 1 is the minimum of the set {1,2}, which were the
- cluster values of <literal>B</literal> and
- <literal>C</literal>.
- </para>
- <para>
- Next, let us say that the <literal>BC</literal> ligature glyph
- decomposes into three components, and <literal>D</literal> also
- decomposes into two components. Whenever a cluster decomposes,
- its components each inherit the cluster value of their parent:
- </para>
- <programlisting>
- A,BC0,BC1,BC2,D0,D1,E
- 0,1  ,1  ,1  ,3 ,3 ,4
- </programlisting>
- <para>
- Next, if <literal>BC2</literal> and <literal>D0</literal> form a
- ligature, then their clusters (cluster values 1 and 3) merge into
- <literal>min(1,3) = 1</literal>:
- </para>
- <programlisting>
- A,BC0,BC1,BC2D0,D1,E
- 0,1  ,1  ,1    ,1 ,4
- </programlisting>
- <para>
- Note that the entirety of cluster 3 merges into cluster 1, not
- just the <literal>D0</literal> glyph. This reflects the fact
- that the cluster <emphasis>must</emphasis> be treated as an
- indivisible unit.
- </para>
- <para>
- At this point, cluster 1 means: the character sequence
- <literal>BCD</literal> is represented by glyphs
- <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
- further.
- </para>
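- <para>
- The whole walk-through can be replayed with a toy model in
- Python. This is a sketch of the rules as described in this
- section, not HarfBuzz's actual implementation; the functions
- <literal>merge_clusters</literal>, <literal>ligate</literal>,
- and <literal>decompose</literal> are hypothetical helpers that
- operate on a list of (glyph name, cluster value) pairs.
- </para>

```python
def merge_clusters(glyphs, start, end):
    """Give every cluster touched by glyphs[start:end] the minimum value."""
    touched = {cluster for _, cluster in glyphs[start:end]}
    lowest = min(touched)
    return [(name, lowest if cluster in touched else cluster)
            for name, cluster in glyphs]


def ligate(glyphs, start, end):
    """Replace glyphs[start:end] with one ligature glyph, merging clusters."""
    merged = merge_clusters(glyphs, start, end)
    name = "".join(n for n, _ in merged[start:end])
    return merged[:start] + [(name, merged[start][1])] + merged[end:]


def decompose(glyphs, index, count):
    """Split one glyph into `count` parts that inherit its cluster value."""
    name, cluster = glyphs[index]
    parts = [(f"{name}{i}", cluster) for i in range(count)]
    return glyphs[:index] + parts + glyphs[index + 1:]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
glyphs = ligate(glyphs, 1, 3)     # B and C form the BC ligature
glyphs = decompose(glyphs, 1, 3)  # BC becomes BC0, BC1, BC2
glyphs = decompose(glyphs, 4, 2)  # D becomes D0, D1
glyphs = ligate(glyphs, 3, 5)     # BC2 and D0 form a ligature

print([name for name, _ in glyphs])
# ['A', 'BC0', 'BC1', 'BC2D0', 'D1', 'E']
print([cluster for _, cluster in glyphs])
# [0, 1, 1, 1, 1, 4]
```
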
- </section>
- <section id="reordering-in-levels-0-and-1">
- <title>Reordering in levels 0 and 1</title>
- <para>
- Another common operation in some shapers is glyph
- reordering. In order to maintain a monotonic cluster sequence
- when glyph reordering takes place, HarfBuzz merges the clusters
- of everything in the reordering sequence.
- </para>
- <para>
- For example, let us again start with the character sequence (top
- row) and initial cluster values (bottom row):
- </para>
- <programlisting>
- A,B,C,D,E
- 0,1,2,3,4
- </programlisting>
- <para>
- If <literal>D</literal> is reordered to the position immediately
- before <literal>B</literal>, then HarfBuzz merges the
- <literal>B</literal>, <literal>C</literal>, and
- <literal>D</literal> clusters — all the clusters between
- the final position of the reordered glyph and its original
- position. This means that we get:
- </para>
- <programlisting>
- A,D,B,C,E
- 0,1,1,1,4
- </programlisting>
- <para>
- as the final cluster sequence.
- </para>
- <para>
- Merging this many clusters is not ideal, but it is the only
- sensible way for HarfBuzz to maintain the guarantee that the
- sequence of cluster values remains monotonic and to retain the
- true relationship between glyphs and characters.
- </para>
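- <para>
- The same kind of toy model illustrates the reordering rule. In
- the Python sketch below, <literal>reorder</literal> is a
- hypothetical helper that merges every cluster between the old
- and new positions before moving the glyph; it mirrors the
- description above rather than HarfBuzz's internal code.
- </para>

```python
def reorder(glyphs, source, target):
    """Move glyphs[source] to index target, merging all clusters it passes."""
    low, high = min(source, target), max(source, target)
    touched = {cluster for _, cluster in glyphs[low:high + 1]}
    lowest = min(touched)
    merged = [(name, lowest if cluster in touched else cluster)
              for name, cluster in glyphs]
    moved = merged.pop(source)
    merged.insert(target, moved)
    return merged


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
result = reorder(glyphs, 3, 1)  # D moves to just before B

print([name for name, _ in result])        # ['A', 'D', 'B', 'C', 'E']
print([cluster for _, cluster in result])  # [0, 1, 1, 1, 4]
```
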
- </section>
- <section id="the-distinction-between-levels-0-and-1">
- <title>The distinction between levels 0 and 1</title>
- <para>
- The preceding examples demonstrate the main effects of using
- cluster levels 0 and 1. The only difference between the two
- levels is this: in level 0, at the very beginning of the shaping
- process, HarfBuzz merges the cluster of each base character
- with the clusters of all Unicode marks (combining or not) and
- modifiers that follow it.
- </para>
- <para>
- For example, let us start with the following character sequence
- (top row) and accompanying initial cluster values (bottom row):
- </para>
- <programlisting>
- A,acute,B
- 0,1    ,2
- </programlisting>
- <para>
- The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
- using cluster level 0 on this sequence, then the
- <literal>A</literal> and <literal>acute</literal> clusters will
- merge, and the result will become:
- </para>
- <programlisting>
- A,acute,B
- 0,0    ,2
- </programlisting>
- <para>
- This merger is performed before any other script-shaping
- steps.
- </para>
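- <para>
- A simplified version of this initial merging step can be
- sketched in Python with the standard
- <literal>unicodedata</literal> module. The sketch treats only
- code points in a mark category and the two zero-width joiner
- controls as cluster-continuing, which is an assumption made to
- keep the example short; the rules HarfBuzz applies cover more
- cases. The function names are hypothetical.
- </para>

```python
import unicodedata

# Zero Width Non-Joiner and Zero Width Joiner.
ZERO_WIDTH = {"\u200c", "\u200d"}


def continues_cluster(ch):
    """True for marks (categories Mn, Mc, Me) and ZWNJ/ZWJ. Simplified."""
    return unicodedata.category(ch).startswith("M") or ch in ZERO_WIDTH


def level0_initial_clusters(text):
    """Assign code point indices, then fold marks into the preceding cluster."""
    clusters = []
    for index, ch in enumerate(text):
        if clusters and continues_cluster(ch):
            clusters.append(clusters[-1])
        else:
            clusters.append(index)
    return clusters


text = "A\u0301B"  # A, COMBINING ACUTE ACCENT, B
print(level0_initial_clusters(text))  # [0, 0, 2]
print(list(range(len(text))))         # [0, 1, 2], as left untouched by level 1
```
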
- <para>
- This initial cluster merging is the default behavior of the
- Windows shaping engine, and the old HarfBuzz codebase copied
- that behavior to maintain compatibility. Consequently, it has
- remained the default behavior in the new HarfBuzz codebase.
- </para>
- <para>
- But this initial cluster-merging behavior makes it impossible
- for client programs to implement some features (such as
- coloring diacritic marks differently from their base
- characters). That is why, in level 1, HarfBuzz does not perform
- the initial merging step.
- </para>
- <para>
- For client programs that rely on HarfBuzz cluster values to
- perform cursor positioning, level 0 is more convenient. But
- relying on cluster boundaries for cursor positioning is wrong: cursor
- positions should be determined based on Unicode grapheme
- boundaries, not on shaping-cluster boundaries. As such, using
- level 1 clustering behavior is recommended.
- </para>
- <para>
- One final facet of levels 0 and 1 is worth noting. HarfBuzz
- currently does not allow any
- <emphasis>multiple-substitution</emphasis> GSUB lookups to
- replace a glyph with zero glyphs (in other words, to delete a
- glyph).
- </para>
- <para>
- But, in some other situations, glyphs can be deleted. In
- those cases, if the glyph being deleted is the last glyph of its
- cluster, HarfBuzz makes sure to merge the deleted glyph's
- cluster with a neighboring cluster.
- </para>
- <para>
- This is done primarily to make sure that the starting cluster of the
- text always has the cluster index pointing to the start of the text
- for the run; more than one client program currently relies on this
- guarantee.
- </para>
- <para>
- Incidentally, Apple's CoreText does something different to
- maintain the same promise: it inserts a glyph with id 65535 at
- the beginning of the glyph string if the glyph corresponding to
- the first character in the run was deleted. HarfBuzz might do
- something similar in the future.
- </para>
- </section>
- <section id="level-2">
- <title>Level 2</title>
- <para>
- HarfBuzz's level 2 cluster behavior uses a significantly
- different model from that of level 0 and level 1.
- </para>
- <para>
- The level 2 behavior is easy to describe, but it may be
- difficult to understand in practical terms. In brief, level 2
- performs no merging of clusters whatsoever.
- </para>
- <para>
- This means that there is no initial base-and-mark merging step
- (as is done in level 0), and it means that reordering moves and
- ligature substitutions do not trigger a cluster merge.
- </para>
- <para>
- Only one shaping operation directly affects clusters when using
- level 2:
- </para>
- <itemizedlist>
- <listitem>
- <para>
- When a cluster <emphasis>decomposes</emphasis>, all of the
- resulting child clusters inherit as their cluster value the
- cluster value of the parent cluster.
- </para>
- </listitem>
- </itemizedlist>
- <para>
- When glyphs do form a ligature (or when some other feature
- substitutes multiple glyphs with one glyph) the cluster value
- of the first glyph is retained as the cluster value for the
- resulting ligature.
- </para>
- <para>
- This occurrence sounds similar to a cluster merge, but it is
- different. In particular, no subsequent characters —
- including marks and modifiers — are affected. They retain
- their previous cluster values.
- </para>
- <para>
- Level 2 cluster behavior is ultimately less complex than level 0
- or level 1, but there are several cases for which processing
- cluster values produced at level 2 may be tricky.
- </para>
- <section id="ligatures-with-combining-marks-in-level-2">
- <title>Ligatures with combining marks in level 2</title>
- <para>
- The first example of how HarfBuzz's level 2 cluster behavior
- can be tricky is when the text to be shaped includes combining
- marks attached to ligatures.
- </para>
- <para>
- Let us start with an input sequence with the following
- characters (top row) and initial cluster values (bottom row):
- </para>
- <programlisting>
- A,acute,B,breve,C,circumflex
- 0,1    ,2,3    ,4,5
- </programlisting>
- <para>
- If the sequence <literal>A,B,C</literal> forms a ligature,
- then these are the cluster values HarfBuzz will return under
- the various cluster levels:
- </para>
- <para>
- Level 0:
- </para>
- <programlisting>
- ABC,acute,breve,circumflex
- 0  ,0    ,0    ,0
- </programlisting>
- <para>
- Level 1:
- </para>
- <programlisting>
- ABC,acute,breve,circumflex
- 0  ,0    ,0    ,5
- </programlisting>
- <para>
- Level 2:
- </para>
- <programlisting>
- ABC,acute,breve,circumflex
- 0  ,1    ,3    ,5
- </programlisting>
- <para>
- Making sense of the level 2 result is the hardest for a client
- program, because there is nothing in the cluster values that
- indicates that <literal>B</literal> and <literal>C</literal>
- formed a ligature with <literal>A</literal>.
- </para>
- <para>
- In contrast, the "merged" cluster values of the mark glyphs
- that are seen in the level 0 and level 1 output are evidence
- that a ligature substitution took place.
- </para>
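- <para>
- The three results above can be reproduced with a small Python
- model. It is a sketch under stated assumptions: marks are
- identified by name, the level 0 pre-merge folds each mark into
- the preceding cluster, and the ligature merge in levels 0 and 1
- covers every cluster between the first and last component.
- <literal>ligate_bases</literal> and
- <literal>initial_clusters</literal> are hypothetical helpers,
- not HarfBuzz functions.
- </para>

```python
MARKS = {"acute", "breve", "circumflex"}


def initial_clusters(names, level):
    """Index-based cluster values; level 0 folds marks into the base."""
    clusters = []
    for index, name in enumerate(names):
        if level == 0 and name in MARKS and clusters:
            clusters.append(clusters[-1])
        else:
            clusters.append(index)
    return clusters


def ligate_bases(names, level):
    """Form one ligature from all base characters; marks follow it."""
    clusters = initial_clusters(names, level)
    bases = [i for i, name in enumerate(names) if name not in MARKS]
    first, last = bases[0], bases[-1]
    if level != 2:
        # Levels 0 and 1: merge every cluster touched by the ligature span.
        touched = set(clusters[first:last + 1])
        lowest = min(touched)
        clusters = [lowest if c in touched else c for c in clusters]
    ligature = ("".join(names[i] for i in bases), clusters[first])
    marks = [(names[i], clusters[i])
             for i in range(len(names)) if names[i] in MARKS]
    return [ligature] + marks


names = ["A", "acute", "B", "breve", "C", "circumflex"]
for level in (0, 1, 2):
    print(level, [cluster for _, cluster in ligate_bases(names, level)])
# 0 [0, 0, 0, 0]
# 1 [0, 0, 0, 5]
# 2 [0, 1, 3, 5]
```
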
- </section>
- <section id="reordering-in-level-2">
- <title>Reordering in level 2</title>
- <para>
- Another example of how HarfBuzz's level 2 cluster behavior
- can be tricky is when glyphs reorder. Consider an input sequence
- with the following characters (top row) and initial cluster
- values (bottom row):
- </para>
- <programlisting>
- A,B,C,D,E
- 0,1,2,3,4
- </programlisting>
- <para>
- Now imagine <literal>D</literal> moves before
- <literal>B</literal> in a reordering operation. The cluster
- values will then be:
- </para>
- <programlisting>
- A,D,B,C,E
- 0,3,1,2,4
- </programlisting>
- <para>
- Next, if <literal>D</literal> forms a ligature with
- <literal>B</literal>, the output is:
- </para>
- <programlisting>
- A,DB,C,E
- 0,3 ,2,4
- </programlisting>
- <para>
- However, in a different scenario, in which the shaping rules
- of the script instead caused <literal>A</literal> and
- <literal>B</literal> to form a ligature
- <emphasis>before</emphasis> the <literal>D</literal> reordered, the
- result would be:
- </para>
- <programlisting>
- AB,D,C,E
- 0 ,3,2,4
- </programlisting>
- <para>
- There is no way for a client program to differentiate between
- these two scenarios based on the cluster values
- alone. Consequently, client programs that use level 2 might
- need to undertake additional work in order to manage cursor
- positioning, text attributes, or other desired features.
- </para>
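- <para>
- The ambiguity can be demonstrated with a short Python model of
- the level 2 rules as described above: moves leave cluster values
- untouched, and a ligature keeps the cluster value of its first
- glyph. The helpers <literal>move</literal> and
- <literal>ligate</literal> are hypothetical and exist only for
- this illustration.
- </para>

```python
def move(glyphs, source, target):
    """Move one glyph to index target without changing any cluster value."""
    result = list(glyphs)
    result.insert(target, result.pop(source))
    return result


def ligate(glyphs, start, end):
    """Join glyphs[start:end]; the ligature keeps the first glyph's cluster."""
    name = "".join(n for n, _ in glyphs[start:end])
    return glyphs[:start] + [(name, glyphs[start][1])] + glyphs[end:]


start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Scenario one: D moves before B, then D and B form a ligature.
scenario_one = ligate(move(start, 3, 1), 1, 3)
# Scenario two: A and B form a ligature, then D moves before C.
scenario_two = move(ligate(start, 0, 2), 2, 1)

print(scenario_one)  # [('A', 0), ('DB', 3), ('C', 2), ('E', 4)]
print(scenario_two)  # [('AB', 0), ('D', 3), ('C', 2), ('E', 4)]

clusters_one = [cluster for _, cluster in scenario_one]
clusters_two = [cluster for _, cluster in scenario_two]
print(clusters_one == clusters_two)  # True: both are [0, 3, 2, 4]
```
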
- </section>
- <section id="other-considerations-in-level-2">
- <title>Other considerations in level 2</title>
- <para>
- There may be other problems encountered with ligatures under
- level 2, such as if the direction of the text is forced to
- the opposite of its natural direction (for example, Arabic text
- that is forced into left-to-right directionality). But,
- generally speaking, these other scenarios are minor corner
- cases that are too obscure for most client programs to need to
- worry about.
- </para>
- </section>
- </section>
- </chapter>