12 years ago · f6e1c76aa8
--- a/rtl/objpas/unicodedata.pas
+++ b/rtl/objpas/unicodedata.pas
@@ -17,6 +17,69 @@
 
				     This program is distributed in the hope that it will be useful,
			
 
				     but WITHOUT ANY WARRANTY; without even the implied warranty of
			
 
				     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
			
 
				+
			
 
				+-------------------------------------------------------------------------------
			
 
				+
			
 
				+  Overview of the Unicode Collation Algorithm (UCA) data layout :
			
 
				+  ============================================================
			
 
				+
			
 
				+    The UCA data (see “TUCA_DataBook”) are organized into index data
			
 
				+    (see the “TUCA_DataBook” fields “BMP_Table1”, “BMP_Table2”,
			
 
				+    “OBMP_Table1” and “OBMP_Table2”) and actual properties data (see
			
 
				+    the “Props” field of “TUCA_DataBook”). The index is a 3-level
			
 
				+    table designed to minimize the overall data size. The
			
 
				+    properties data contain the UCA properties that are actually
			
 
				+    used for the customized code points (or sequences of code
			
 
				+    points); see “TUCA_PropItemRec”.
			
 
				+    To get the properties record of a code point, one goes
			
 
				+    through the index data to get its offset into the “Props”
			
 
				+    serialized data; see the “GetPropUCA” procedure.
			
 
				+    The “TUCA_PropItemRec” record, which represents the actual
			
 
				+    properties, contains a fixed part and a variable part. The
			
 
				+    fixed part is directly expressed as fields of the record :
			
 
				+      “WeightLength”, “ChildCount”, “Size”, “Flags”. The
			
 
				+    variable part depends on some values of the fixed part. For
			
 
				+    example, “WeightLength” specifies the number of weight[1] items,
			
 
				+    which can be zero or more. The “Flags” field contains bit
			
 
				+    states that indicate, for example, whether the record’s owner,
			
 
				+    that is the target code point, is present (it is not always
			
 
				+    necessary to store the code point, since you must already have
			
 
				+    this information in order to get the
			
 
				+    “TUCA_PropItemRec” record).
			
 
				+
			
 
				+    The data, as it is organized now, is as follows for each code point :
			
 
				+      * the fixed part is serialized,
			
 
				+      * if there are weight items, they are serialized
			
 
				+          (see “WeightLength”),
			
 
				+      * the code point is serialized (if needed),
			
 
				+      * the context[2] array is serialized,
			
 
				+      * the children[3] records are serialized.
			
 
				+
			
 
				+    The “Size” field represents the size of the whole record, including
			
 
				+    its child records (see [3]). “GetSelfOnlySize” returns the size
			
 
				+    of the queried record, excluding the size of its children.
			
 
				+
			
 
				+
			
 
				+    Notes :
			
 
				+
			
 
				+    [1] : A weight item is an array of 3 words. A code point/sequence of code
			
 
				+          points may have zero or multiple items.
			
 
				+
			
 
				+    [2] :  There are characters (mostly Japanese ones) that do not have their
			
 
				+           own weights; they inherit the weights of the preceding character
			
 
				+           in the string being evaluated.
			
 
				+    [3] :  Some Unicode characters are expressed using more than one code point.
			
 
				+           In that case the properties records are serialized as a trie. The
			
 
				+           trie data structure is useful when many characters’ expressions have
			
 
				+           the same starting code point(s).
			
 
				+
			
 
				+    [4] TUCA_PropItemRec serialization :
			
 
				+            TUCA_PropItemRec :
			
 
				+              WeightLength, ChildCount, Size, Flags [weight item array]
			
 
				+              [Code Point] [Context data]
			
 
				+              [Child 0] [Child 1] .. [Child n]
			
 
				+
			
 
				+        each [Child k] is a TUCA_PropItemRec.
			
 
				 }
			
 
				 
			
 
				 unit unicodedata;
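
The serialization order described in the comment above (fixed part, weight items, optional code point, then children, with “Size” covering the children) can be illustrated with a small stand-alone Free Pascal sketch. It is not code from unicodedata.pas: the field widths of `TFixedPart`, the `FLAG_HAS_CODEPOINT` bit, the 4-byte code point, and the helpers `SelfOnlySize` and `PutItem` are all assumptions made for the illustration, and context data is left out.

```pascal
program PropItemSketch;

{$mode objfpc}

type
  // Assumed fixed part; the real TUCA_PropItemRec may use other field widths.
  TFixedPart = packed record
    WeightLength : Byte;  // number of weight items
    ChildCount   : Byte;  // number of child records
    Size         : Word;  // bytes used by this record plus all its children
    Flags        : Byte;  // bit states
  end;
  PFixedPart = ^TFixedPart;

const
  FLAG_HAS_CODEPOINT = $01;              // hypothetical flag bit
  WEIGHT_ITEM_BYTES  = 3 * SizeOf(Word); // a weight item is 3 words (note [1])
  CODEPOINT_BYTES    = 4;                // hypothetical stored code point width

// Size of one record without its children (context data omitted here).
function SelfOnlySize(const AItem : PFixedPart) : Integer;
begin
  Result := SizeOf(TFixedPart) + AItem^.WeightLength * WEIGHT_ITEM_BYTES;
  if (AItem^.Flags and FLAG_HAS_CODEPOINT) <> 0 then
    Result := Result + CODEPOINT_BYTES;
end;

// Writes a record header at AOffset and returns its self-only size.
function PutItem(var ABuffer : array of Byte; AOffset : Integer;
  AWeightLength, AChildCount, AFlags : Byte) : Integer;
var
  p : PFixedPart;
begin
  p := PFixedPart(@ABuffer[AOffset]);
  p^.WeightLength := AWeightLength;
  p^.ChildCount := AChildCount;
  p^.Flags := AFlags;
  Result := SelfOnlySize(p);
  p^.Size := Result;  // correct for a childless record; parents are patched later
end;

var
  buffer : array[0..63] of Byte;
  parent, child : PFixedPart;
  offset, i : Integer;
begin
  FillChar(buffer, SizeOf(buffer), 0);
  // Parent: 1 weight item, 2 children, code point not stored.
  offset := PutItem(buffer, 0, 1, 2, 0);
  // Child 0: 1 weight item, code point stored.
  offset := offset + PutItem(buffer, offset, 1, 0, FLAG_HAS_CODEPOINT);
  // Child 1: 2 weight items, code point stored.
  offset := offset + PutItem(buffer, offset, 2, 0, FLAG_HAS_CODEPOINT);
  // The parent's Size covers the whole record, children included.
  parent := PFixedPart(@buffer[0]);
  parent^.Size := offset;

  WriteLn('parent size (with children): ', parent^.Size);
  WriteLn('parent size (self only)    : ', SelfOnlySize(parent));

  // The children start right after the parent's own data; each child's
  // Size gives the distance to the next sibling.
  offset := SelfOnlySize(parent);
  for i := 0 to parent^.ChildCount - 1 do begin
    child := PFixedPart(@buffer[offset]);
    WriteLn('child ', i, ' at offset ', offset, ', size ', child^.Size);
    offset := offset + child^.Size;
  end;
end.
```

Because each record stores the size of its whole subtree, a reader can skip an entire child, including its own descendants, with a single addition. This is presumably why the format keeps both “Size” and a self-only size.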