<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
                  "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position by the
      correct amount for each successive glyph.
    </para>
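    <para>
      The simple case can be sketched in a few lines of Python. This
      is only an illustration, not HarfBuzz code: the
      <literal>ADVANCES</literal> table and the
      <literal>layout_simple</literal> function are hypothetical, and
      the widths are made-up values standing in for metrics that a
      real shaper would read from the font.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical advance widths in font units. A real shaper would read
# these from the font rather than from a hard-coded table.
ADVANCES = {"H": 722, "i": 278, "!": 333}


def layout_simple(glyphs, advances):
    """Place glyphs left to right; return (positions, total_width)."""
    positions = []
    pen_x = 0
    for glyph in glyphs:
        positions.append((glyph, pen_x))
        pen_x += advances[glyph]  # move the pen past this glyph
    return positions, pen_x


positions, total_width = layout_simple("Hi!", ADVANCES)
print(positions)    # [('H', 0), ('i', 722), ('!', 1000)]
print(total_width)  # 1333
```
]]></programlisting>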
    <para>
      For other scripts (often unceremoniously called
      <emphasis>complex scripts</emphasis>), however, any combination
      of several shaping operations may be required, and the rules for
      how and when they are applied vary from script to script.
      HarfBuzz and other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>
  <section id="script-specific-shaping">
    <title>Script-specific shaping</title>
    <para>
      In many scripts, transforming the input sequence into the final
      layout requires some combination of operations, such as
      context-dependent substitutions, context-dependent mark
      positioning, glyph-to-glyph joining, glyph reordering, or glyph
      stacking.
    </para>
    <para>
      In some scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Still other scripts do not require these
      operations. However, correctly shaping a text run in
      any script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features.
    </para>
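    <para>
      As a small illustration of the normalization step, Python's
      standard <literal>unicodedata</literal> module can convert
      between the composed and decomposed forms of the same text. This
      sketch shows only the two forms. Which form a shaping engine
      prefers for a given font is not covered here.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT
decomposed = "e\u0301"

# NFC composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# NFD splits it back into base letter plus combining mark.
round_trip = unicodedata.normalize("NFD", composed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True
print(round_trip == decomposed)        # True
```
]]></programlisting>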
  </section>
  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the script shaping models.
    </para>
    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given script might involve
          more than one reordering step.
        </para>
      </listitem>
      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>
      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each
          performed in a separate step.
        </para>
      </listitem>
      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          adjusts the horizontal and/or vertical position of a
          glyph. This adjustment is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>
    </itemizedlist>
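    <para>
      The reordering and substitution operations described above can
      be sketched on plain lists of glyph names. Everything in this
      example is hypothetical: the glyph names, the rule tables, and
      the three helper functions are invented for illustration and do
      not correspond to any font format or to the HarfBuzz API. The
      reordering rule is also much simpler than a real shaping model,
      which would move a pre-base glyph relative to a whole syllable
      rather than to a single preceding glyph.
    </para>
    <programlisting language="python"><![CDATA[
```python
def reorder_prebase(glyphs, prebase):
    """Move each pre-base glyph in front of the glyph that precedes it."""
    out = []
    for glyph in glyphs:
        if glyph in prebase and out:
            out.insert(len(out) - 1, glyph)  # logical -> visual position
        else:
            out.append(glyph)
    return out


def substitute_ligatures(glyphs, ligatures):
    """Replace each listed pair of glyphs with one ligature glyph."""
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in ligatures:
            out.append(ligatures[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


def substitute_initial(glyphs, initial_forms):
    """Swap in an alternate form only when the target starts the run."""
    result = list(glyphs)
    if result and result[0] in initial_forms:
        result[0] = initial_forms[result[0]]
    return result


print(reorder_prebase(["ka", "i_matra", "ta"], {"i_matra"}))
print(substitute_ligatures(["f", "i", "n"], {("f", "i"): "f_i"}))
print(substitute_initial(["s", "u", "s"], {"s": "s.init"}))
```
]]></programlisting>
    <para>
      Each helper returns a new list, so the steps can be chained in
      whatever order a given shaping model calls for.
    </para>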
  </section>
  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
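    <para>
      The two-letter category codes can be inspected with Python's
      standard <literal>unicodedata</literal> module, where the first
      letter of each code is the Major category. The sample codepoints
      below were chosen for illustration, and
      <literal>general_category</literal> is a hypothetical helper
      rather than part of any shaping library.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata

# One sample each for Lu, Lm, Mn, Mc, Me, and Nd.
samples = ["A", "\u02b0", "\u0301", "\u093e", "\u20dd", "5"]


def general_category(ch):
    """Return (major, minor), where minor is the two-letter UGC code."""
    minor = unicodedata.category(ch)
    return minor[0], minor


for ch in samples:
    major, minor = general_category(ch)
    print(f"U+{ord(ch):04X}  major={major}  minor={minor}  {unicodedata.name(ch)}")

categories = [unicodedata.category(ch) for ch in samples]
print(categories)  # ['Lu', 'Lm', 'Mn', 'Mc', 'Me', 'Nd']
```
]]></programlisting>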
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the relative visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified in regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>
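    <para>
      The idea of a syllable regular expression can be sketched as
      follows. This is a deliberately tiny approximation, not the
      syllable grammar of any real shaping model: the
      <literal>CLASSES</literal> table is hand-labelled for a handful
      of Devanagari codepoints instead of being derived from the UISC
      and UIPC data, and the pattern recognizes only consonant
      clusters with an optional vowel sign.
    </para>
    <programlisting language="python"><![CDATA[
```python
import re

# Hand-labelled classes for a few Devanagari codepoints (illustrative only).
# C = consonant, H = virama, M = dependent vowel sign, V = independent vowel
CLASSES = {
    "\u0915": "C",  # KA
    "\u0924": "C",  # TA
    "\u0928": "C",  # NA
    "\u092e": "C",  # MA
    "\u0938": "C",  # SA
    "\u094d": "H",  # SIGN VIRAMA
    "\u093e": "M",  # VOWEL SIGN AA
    "\u093f": "M",  # VOWEL SIGN I
    "\u0947": "M",  # VOWEL SIGN E
    "\u0905": "V",  # LETTER A
}

# A consonant, any number of virama+consonant pairs, then an optional
# trailing virama and vowel sign; or a lone independent vowel; or any
# single unclassified character.
SYLLABLE = re.compile(r"C(?:HC)*H?M?|V|.")


def split_syllables(text):
    """Split text into syllables by matching on per-character classes."""
    classes = "".join(CLASSES.get(ch, "X") for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


word = "\u0928\u092e\u0938\u094d\u0924\u0947"  # NA MA SA VIRAMA TA E
syllables = split_syllables(word)
print([len(s) for s in syllables])  # [1, 1, 4]
```
]]></programlisting>
    <para>
      Matching on the class string instead of on the raw text keeps
      the pattern short, which is the same reason the real syllable
      definitions are written in terms of character categories.
    </para>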
  </section>
  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
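    <para>
      A rough sketch of segmentation by direction alone is shown
      below, using the bidirectional class reported by Python's
      standard <literal>unicodedata</literal> module. It is a
      simplification under stated assumptions: the hypothetical
      <literal>direction_runs</literal> helper attaches spaces,
      digits, and punctuation to whichever run is current, whereas a
      complete implementation would follow the Unicode Bidirectional
      Algorithm and would also split on script, language, and font
      changes.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata


def direction_runs(text, default="ltr"):
    """Split text into (direction, substring) runs of uniform direction."""
    runs = []
    current = default
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            current = "ltr"
        elif bidi_class in ("R", "AL"):
            current = "rtl"
        # Any other class (spaces, digits, punctuation) stays in the
        # current run; this is the simplification noted above.
        if runs and runs[-1][0] == current:
            runs[-1][1] += ch
        else:
            runs.append([current, ch])
    return [(direction, chunk) for direction, chunk in runs]


sample = "abc \u05d0\u05d1\u05d2 def"  # Latin, Hebrew, Latin
for direction, chunk in direction_runs(sample):
    print(direction, repr(chunk))
```
]]></programlisting>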
  </section>
  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides shaping models for the following scripts:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          scripts with no script-specific shaping model, and may also
          be used as a fallback for handling unrecognized scripts.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, and Telugu.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu (for example, <literal>dev2</literal> in place of
          <literal>deva</literal> for Devanagari). For the sake of
          clarity, the term "Indic2" is sometimes used to refer to the
          current, revised shaping model.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>
      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports scripts not covered by one of
          the script-specific shaping models above, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>
      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>
    </itemizedlist>
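    <para>
      The list above amounts to a lookup from script to shaping
      model, which can be sketched as a table. The names
      <literal>SHAPING_MODELS</literal>,
      <literal>INDIC2_TAGS</literal>, and
      <literal>shaping_model</literal> are hypothetical, the table
      covers only the scripts named in this section, and a real
      shaping engine selects its shaper from the script property of
      the run rather than from English script names. The tag pairs
      give the original and revised OpenType script tags for the nine
      Indic scripts.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Scripts named in this section, grouped by shaping model.
SHAPING_MODELS = {
    "indic": ["Bengali", "Devanagari", "Gujarati", "Gurmukhi", "Kannada",
              "Malayalam", "Oriya", "Tamil", "Telugu"],
    "arabic": ["Arabic", "Mongolian", "N'Ko", "Syriac"],
    "thai/lao": ["Thai", "Lao"],
    "khmer": ["Khmer"],
    "myanmar": ["Myanmar"],
    "tibetan": ["Tibetan"],
    "hangul": ["Hangul"],
    "hebrew": ["Hebrew"],
    "use": ["Javanese", "Balinese", "Buginese", "Batak", "Chakma",
            "Lepcha", "Modi", "Phags-pa", "Tagalog", "Siddham",
            "Sundanese", "Tai Le", "Tai Tham", "Tai Viet"],
}

# (original tag, revised "Indic2" tag) for each Indic script.
INDIC2_TAGS = {
    "Bengali": ("beng", "bng2"),
    "Devanagari": ("deva", "dev2"),
    "Gujarati": ("gujr", "gjr2"),
    "Gurmukhi": ("guru", "gur2"),
    "Kannada": ("knda", "knd2"),
    "Malayalam": ("mlym", "mlm2"),
    "Oriya": ("orya", "ory2"),
    "Tamil": ("taml", "tml2"),
    "Telugu": ("telu", "tel2"),
}

# Invert the grouping into a direct script -> model lookup.
MODEL_FOR_SCRIPT = {
    script: model
    for model, scripts in SHAPING_MODELS.items()
    for script in scripts
}


def shaping_model(script):
    """Return the model for a script, falling back to the default model."""
    return MODEL_FOR_SCRIPT.get(script, "default")


for name in ["Devanagari", "Syriac", "Lao", "Javanese", "Latin"]:
    print(name, "->", shaping_model(name))
```
]]></programlisting>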
  </section>
  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines that match
      sequences of codepoints and select the shaping operations to
      perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>
  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines that match
      glyph sequences and select the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
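    <para>
      To give a feel for how a finite-state machine can pair a glyph
      sequence with an operation, here is a toy state table that forms
      a ligature from the glyphs <literal>f</literal> and
      <literal>i</literal>. It is a sketch only: the state names, the
      <literal>TRANSITIONS</literal> table, and the
      <literal>run_state_machine</literal> function are hypothetical
      and do not reflect the actual table formats used by AAT or
      Graphite fonts.
    </para>
    <programlisting language="python"><![CDATA[
```python
# (state, glyph) -> (next state, action). Any pair that is not listed
# returns to the start state and emits the glyph unchanged.
TRANSITIONS = {
    ("start", "f"): ("saw_f", "emit"),
    ("saw_f", "f"): ("saw_f", "emit"),
    ("saw_f", "i"): ("start", "ligate"),
}


def run_state_machine(glyphs):
    """Walk the glyph sequence once, applying the action of each transition."""
    out = []
    state = "start"
    for glyph in glyphs:
        state, action = TRANSITIONS.get((state, glyph), ("start", "emit"))
        if action == "ligate":
            out[-1] = "f_i"  # replace the pending f with the ligature glyph
        else:
            out.append(glyph)
    return out


print(run_state_machine(["o", "f", "f", "i", "c", "e"]))
# ['o', 'f', 'f_i', 'c', 'e']
```
]]></programlisting>
    <para>
      Because the machine operates on glyph names rather than on
      codepoints, the same table works regardless of which characters
      produced those glyphs.
    </para>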
  </section>
</chapter>