XmlSchemaInferenceDesign.txt 9.8 KB

  1. * INCOMPLETE
  2. * XML Schema Inference Rules
  3. ** Requirements
  4. XmlReader:
  5. <ul>
  6. - that does not expose EntityReference.
  7. - that does not contain xsd:* elements.
  8. </ul>
  9. XmlSchemaSet: only one that was generated by this utility class. See
  10. the particle inference section described later.
  11. Actually, the MS implementation does not check this input sufficiently,
  12. so it accepts more than it is meant to handle.
  13. *** Allowed schema components
  14. Before inferring particles merged with the particles already present in
  15. the XmlSchemaSet, we have to know what is expected and what is not:
  16. <ul>
  17. - facets are not supported. [a014.xsd]
  18. - xs:all is not supported. [a003.xsd]
  19. - xs:group (ref) is not supported. [a004.xsd]
  20. - xs:choice that does not contain xs:sequence is not
  21. supported [a005.xsd].
  22. - xs:any is not supported. Only xs:element are expected
  23. to be contained in xs:sequence. [a011.xsd]
  24. - particles with the same name that are still not ambiguous
  25. are computed into invalid particles. This looks
  26. like an unintended bug in MS's implementation. [a010.xsd]
  27. - attributeGroup does not seem to be expected there (MS has a
  28. bug around here). [a006.xsd]
  29. - anyAttribute is not regarded as a valid particle, and
  30. the output complexType definition just strips it out.
  31. [a013.xsd]
  32. - but substitutionGroup is not rejected and it will remain
  33. in the output. [a001.xsd]
  34. -> It must be rejected. It breaks choice compatibility.
  35. </ul>
  36. ** Processing model
  37. First, the XmlSchemaSet parameter is compiled[*1] and interpreted into
  38. an internal schema representation that is used to examine the
  39. XmlReader input. The resulting XmlSchemaSet is the same
  40. as the input XmlSchemaSet.
  41. [*1] FIXME: this design might change.
  42. The XmlSchemaSet is compiled because 1) it might contain
  43. XmlSchemaInclude items. So it won't be possible to process inference
  44. inside the input schema set. However, reusing the input avoids
  45. some annoyances, such as having to preserve elementFormDefault etc.
  46. Second, the XmlReader is moved to content (the document element) and
  47. "element inference" starts from there (described later).
  48. The resulting XmlSchemaSet keeps the original XmlSchemas in itself.
  49. For example, it keeps elementFormDefault and attributeFormDefault.
  50. Basically, it processes the XmlReader against the existing XmlSchemaSet
  51. and does not "merge" two XmlSchemaSets, one of which is newly inferred
  52. from this XmlReader, because the XmlReader will have to
  53. infer sequential nodes (siblings) anyway.
  54. Once the element definition is determined (or created), any other
  55. branches in the schema are ignored.
  56. ** Attributes
  57. *** attribute component definitions and references.
  58. **** ignored attributes
  59. xsi:type, xsi:schemaLocation and xsi:noNamespaceSchemaLocation
  60. attributes are ignored.
  61. **** special attributes
  62. If xsi:nil exists, then the element's content is not handled, while its
  63. attributes are handled.
  64. The xml:* schema is predetermined; that namespace has a fixed schema.
  65. **** namespaced attributes
  66. Miscellaneous attributes that reside in a certain namespace are
  67. referenced as <attribute ref="qualified-name" />
  68. **** local attributes
  69. Miscellaneous attributes are represented as <attribute name="blah" />
  70. *** attribute occurrence
  71. When defining a complexType for a newly created element, the attribute
  72. can be set as "required". Otherwise, it must be set as "optional".
  73. For every occurrence of an element instance, all attributes are tested
  74. for existence, and any attribute that is missing must be set as "optional".
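
The occurrence rule above can be sketched as follows. This is a minimal illustration in Python, not the .NET implementation; the function name `merge_attribute_use` and the dict-based representation of attribute uses are assumptions made only for this example.

```python
def merge_attribute_use(uses, instance_attrs, is_new):
    """Update attribute uses for one element instance.

    uses           -- dict mapping attribute name to "required" or "optional"
                      (hypothetical representation, not an XmlSchema type)
    instance_attrs -- set of attribute names present on this instance
    is_new         -- True when the element definition was just created
    """
    # An attribute seen for the first time may be "required" only if the
    # element definition itself is new; otherwise earlier instances lacked it.
    for name in instance_attrs:
        if name not in uses:
            uses[name] = "required" if is_new else "optional"
    # Any known attribute missing from this instance becomes "optional".
    for name in uses:
        if name not in instance_attrs:
            uses[name] = "optional"
    return uses
```

Feeding successive instances of the same element through this function only ever moves an attribute from "required" to "optional", never back.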
  75. *** attribute value types
  76. FIXME: need to describe the relaxation of attribute value types.
  77. ** Content model inference
  78. *** inference processing model
  79. A content model consists of two parts:
  80. - content type : empty | elementOnly | textOnly | mixed
  81. - particle : sequence | choice | all | groupRef
  82. On processing reader.Read(), the node is first "tested" against
  83. current schema content model. If the current node on the XmlReader
  84. is not acceptable, then "content model expansion" happens.
  85. <ul>
  86. - If the current node is text content, then process the
  87. text node according to "evaluating text content".
  88. - If the current node is an element, then process it
  89. in accordance with "evaluating particle".
  90. </ul>
  91. *** evaluating element
  92. When an element occurs, it must be accepted as a particle.
  93. First, content type must be examined:
  94. <ul>
  95. - If the content type was simpleType, then it is changed
  96. into complexType with complexContent and mixed='true'.
  97. The inferred content particle must be optional.
  98. - If the content type was empty, then it is changed into
  99. complexType with complexContent (it is not mixed, unlike
  100. above). The inferred content particle must be optional.
  101. - If the content type was elementOnly or mixed, no need
  102. to change.
  103. </ul>
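
The content type changes listed above amount to a small lookup table. The sketch below is in Python and uses the content type names from "inference processing model" (with "textOnly" standing in for the simpleType case, which is an assumption of this example); the names `content_type_after_element` and `FORCES_OPTIONAL_PARTICLE` are hypothetical.

```python
# Content type after a child element is observed.
ELEMENT_TRANSITIONS = {
    "textOnly": "mixed",           # simpleType -> complexContent, mixed='true'
    "empty": "elementOnly",        # complexContent, not mixed
    "elementOnly": "elementOnly",  # no need to change
    "mixed": "mixed",              # no need to change
}

# Content types for which the newly inferred particle must be optional,
# since earlier instances of the element had no child element at all.
FORCES_OPTIONAL_PARTICLE = {"textOnly", "empty"}


def content_type_after_element(content_type):
    """Return (new content type, whether the new particle must be optional)."""
    new_type = ELEMENT_TRANSITIONS[content_type]
    return new_type, content_type in FORCES_OPTIONAL_PARTICLE
```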
  104. Next, the content particle must be evaluated.
  105. According to the input XmlSchemaSet limitations, there will be
  106. only these patterns listed here:
  107. - empty content
  108. - simple content
  109. - sequence (of element particles)
  110. - choice of sequences
  111. **** Reader progress
  112. Every element is tested against current element candidates.
  113. <ul>
  114. - When the target element is a document element, then all
  115. the global elements in XmlSchemaSet are the candidates.
  116. <ul>
  117. - If there is a matching name, then that element
  118. definition is used as the context element for
  119. the node's content, and the current particle position
  120. is in front of the first particle.
  121. - If there isn't, then the inference engine creates
  122. a new element definition, and its content is none
  123. (none != empty).
  124. </ul>
  125. - When the target element is inferred in a new element
  126. definition, then
  127. </ul>
  128. **** Particle inference
  129. IMPORTANT: Here I tried to formalize the inference, but these
  130. notes are incomplete.
  131. Target {particle} to add:
  132.     isNew -> <xs:element name={name}> ... </xs:element>
  133.     !isNew -> <xs:element name={name} minOccurs="0"> ... </xs:element>
  134. no definition
  135.     // define complexType and add {particle} to .Particle
  136.     toComplexType()
  137.     processContent(ct.Particle, isNew)
  138. simpleType
  139.     makeComplexContent()
  140. complexType
  141.     empty definition (no content model, no particle)
  142.         // -> add xs:element name={name} minOccurs="0" to .Particle
  143.         -> processContent(ct.Particle, isNew)
  144.     simple content
  145.         -> makeComplexContent()
  146.     complex content / extension
  147.         -> processContent(cce.Particle, isNew)
  148.     complex content / restriction
  149.         -> processContent(ccr.Particle, isNew)
  150.     .Particle
  151.         -> processContent(ct.Particle, isNew)
  152. makeComplexContent()
  153.     change to complexType which has complex content mixed="true" and
  154.     extension. Discard simple type information. Add {particle} to
  155.     extension's .Particle.
  156. processContent(Particle particle, isNew)
  157.     if particle is either empty or sequence
  158.         processSequential(particle, 0, false, isNew)
  159.     else if particle is sequence of choices
  160.         processLax(particle, 0)
  161.     else
  162.         error.
  163. processSequential(Sequence particle, int index, bool consumed, bool isNew)
  164.     particle.Count <= index
  165.         -> appendSequential(particle, isNew)
  166.     sequence
  167.         if (particle[index] has the same name)
  168.             -> if (consumed) then sequence[index].maxOccurs = inf.
  169.                InferElement (sequence[index])
  170.                processSequential(particle, index, true, isNew)
  171.         else
  172.             -> if (!consumed)
  173.                    sequence[index].minOccurs = 0.
  174.                processSequential(particle, index+1, false, isNew)
  175.     else
  176.         particle = toSequenceOfChoice(particle)
  177.         processLax(particle, index)
  178. processLax(choice, index)
  179.     foreach (element el in choice.Items)
  180.         if (el has the same name)
  181.             InferElement (el)
  182.             processLax(choice, index + 1)
  183.             return;
  184.     appendLax(choice)
  185. appendSequential(particle)
  186.     if (particle is empty)
  187.         make particle as sequence
  188.     sequence.Items.Add(InferElement(null))
  189. appendLax(choice)
  190.     choice.Items.Add(InferElement(null))
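
As an illustration of the sequential path only, here is a runnable Python sketch. It is not the actual implementation: `ElementParticle` and `infer_sequence` are hypothetical names, child elements are reduced to their names, the recursion of processSequential is unrolled into a loop, and the lax (choice) fallback is left out. Particles after the last matched position are left untouched, because the notes do not say how they are handled.

```python
from dataclasses import dataclass

UNBOUNDED = float("inf")


@dataclass
class ElementParticle:
    name: str
    min_occurs: int = 1
    max_occurs: float = 1


def infer_sequence(sequence, child_names, is_new):
    """Widen `sequence` (a list of ElementParticle) in place so that it
    accepts the children named in `child_names`, in order."""
    index, consumed = 0, False
    for name in child_names:
        # Move past particles that do not match; a particle that was never
        # consumed by this instance becomes optional.
        while index < len(sequence) and sequence[index].name != name:
            if not consumed:
                sequence[index].min_occurs = 0
            index, consumed = index + 1, False
        if index == len(sequence):
            # appendSequential: required only inside a brand-new definition.
            sequence.append(ElementParticle(name, 1 if is_new else 0))
            consumed = True
        elif consumed:
            # Same name again at the same position: allow repetition.
            sequence[index].max_occurs = UNBOUNDED
        else:
            consumed = True
    return sequence
```

For example, reading the children `a a b` into a new definition yields `a` with maxOccurs unbounded followed by a required `b`; reading `b c` against an existing sequence `a, b` makes `a` optional and appends an optional `c`.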
  191. *** evaluating text content
  192. When text content occurs, it must be accepted as simple content.
  193. <ul>
  194. - If the content type was textOnly, then "type relaxation"
  195. happens (described later).
  196. - If the content type was already mixed, then it is skipped.
  197. - If the content type was elementOnly, then the content type
  198. becomes mixed and then skipped.
  199. - If the content type was empty, then its content type
  200. becomes text and then skipped. The type is xs:string (no
  201. type promotion will happen, since an empty value cannot be
  202. accepted as any of the other types handled in this design).
  203. </ul>
  204. (Actually, inference is done from pre-compilation information, not
  205. post-compilation information.) Note that type relaxation happens only
  206. when the content is inferred as textOnly, and then it always occurs.
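
The text content rules can be summarized as a lookup table. This Python sketch is illustrative only; the name `content_type_after_text` is hypothetical, and "textOnly" is used for the content type the list calls "text".

```python
# Maps the current content type to (new content type, run type relaxation?).
TEXT_TRANSITIONS = {
    "textOnly": ("textOnly", True),   # the only case where relaxation runs
    "mixed": ("mixed", False),        # text is skipped
    "elementOnly": ("mixed", False),  # becomes mixed, text is skipped
    "empty": ("textOnly", False),     # becomes text, typed as xs:string
}


def content_type_after_text(content_type):
    """Return (new content type, whether type relaxation happens)."""
    return TEXT_TRANSITIONS[content_type]
```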
  207. ** Type inference
  208. All data types are inferred from a string value: either element content
  209. or an attribute value.
  210. *** primitive type inference
  211. When a string is evaluated as an xs:blahblah typed value, it is
  212. tried against several types.
  213. <ul>
  214. - First, it is evaluated as xs:boolean; true, false<del>, 1 or 0</del>.
  215. - Next, its integer value is computed. If that is
  216. successful, then its value range is examined to see whether it
  217. matches unsignedByte, byte, unsignedShort, short,
  218. unsignedInt, int, unsignedLong, long, and integer.
  219. - If it was not an integer, then it is evaluated as a float
  220. number, as a double number, and then as a decimal number
  221. as well.
  222. - Next, it is examined as xs:dateTime, xs:duration and
  223. related schema types.
  224. - If it did not match any of the predefined types, then
  225. xs:string is inferred. No other string-based types (such
  226. as xs:token) are inferred.
  227. </ul>
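
The first two steps of the list above (boolean, then the integer family) can be sketched in Python as follows. The function name `infer_boolean_or_integer` is hypothetical, and the later steps (float/double/decimal, date-related types, xs:string fallback) are deliberately left out, so the function returns `None` when neither step applies. The value ranges are the standard ones for the XML Schema built-in integer types.

```python
import re

# Candidate integer types in the order given in the notes.
INTEGER_RANGES = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -(2**7), 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -(2**15), 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -(2**31), 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -(2**63), 2**63 - 1),
]


def infer_boolean_or_integer(text):
    """Return the inferred type name, or None if the value is neither a
    boolean nor an integer (the remaining steps are not sketched here)."""
    value = text.strip()
    # Only "true" and "false" count as boolean; "1" and "0" do not.
    if value in ("true", "false"):
        return "xs:boolean"
    if re.fullmatch(r"[+-]?[0-9]+", value):
        number = int(value)
        for name, low, high in INTEGER_RANGES:
            if low <= number <= high:
                return name
        return "xs:integer"
    return None
```

With this ordering, "12345" is inferred as xs:unsignedShort and "1" as xs:unsignedByte rather than xs:boolean.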
  228. *** type relaxation
  229. When a string value has to be accepted by an existing type, the type
  230. might have to change to accept it.
  231. For example:
  232. <ul>
  233. - xs:int cannot accept "abc"
  234. - <del>string with maxLength="3" cannot accept "abcd"</del>
  235. facets are not created anyway and are thus not supported
  236. by this inference engine.
  237. - 12345 is not acceptable for xs:unsignedByte, but acceptable
  238. for unsignedShort
  239. </ul>
  240. Here, the new string value is inferred into a simpleType, and then
  241. the processor will compute the most specific common type between
  242. the existing type and the newly inferred type.
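
The notes do not define how the "most specific common type" is computed, so the following Python sketch rests on two loudly stated assumptions: within the integer family, the common type is the first candidate (in the order listed under "primitive type inference") whose value range covers both types, and any other pair of differing types relaxes to xs:string. The name `common_type` is hypothetical.

```python
INF = float("inf")

# Integer family in candidate order; xs:integer is treated as unbounded.
INTEGER_RANGES = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -(2**7), 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -(2**15), 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -(2**31), 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -(2**63), 2**63 - 1),
    ("xs:integer", -INF, INF),
]
RANGE_OF = {name: (low, high) for name, low, high in INTEGER_RANGES}


def common_type(existing, inferred):
    """Return a type that accepts values of both `existing` and `inferred`."""
    if existing == inferred:
        return existing
    if existing in RANGE_OF and inferred in RANGE_OF:
        low = min(RANGE_OF[existing][0], RANGE_OF[inferred][0])
        high = max(RANGE_OF[existing][1], RANGE_OF[inferred][1])
        # xs:integer is unbounded, so this loop always returns.
        for name, type_low, type_high in INTEGER_RANGES:
            if type_low <= low and high <= type_high:
                return name
    # Assumption: everything else falls back to xs:string.
    return "xs:string"
```

Under these assumptions, an existing xs:unsignedByte that meets "12345" (inferred as xs:unsignedShort) relaxes to xs:unsignedShort, and an existing xs:int that meets "abc" relaxes to xs:string.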