  1. <!-- Creator : groff version 1.22.4 -->
  2. <!-- CreationDate: Tue Jul 18 07:11:06 2023 -->
  3. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  4. "http://www.w3.org/TR/html4/loose.dtd">
  5. <html>
  6. <head>
  7. <meta name="generator" content="groff -Thtml, see www.gnu.org">
  8. <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
  9. <meta name="Content-Style" content="text/css">
  10. <style type="text/css">
  11. p { margin-top: 0; margin-bottom: 0; vertical-align: top }
  12. pre { margin-top: 0; margin-bottom: 0; vertical-align: top }
  13. table { margin-top: 0; margin-bottom: 0; vertical-align: top }
  14. h1 { text-align: center }
  15. </style>
  16. <title></title>
  17. </head>
  18. <body>
  19. <hr>
  20. <p>LIBARCHIVE_INTERNALS(3) BSD Library Functions Manual
  21. LIBARCHIVE_INTERNALS(3)</p>
  22. <p style="margin-top: 1em"><b>NAME</b></p>
  23. <p style="margin-left:6%;"><b>libarchive_internals</b>
  24. &mdash; description of libarchive internal interfaces</p>
  25. <p style="margin-top: 1em"><b>OVERVIEW</b></p>
  26. <p style="margin-left:6%;">The <b>libarchive</b> library
  27. provides a flexible interface for reading and writing
  28. streaming archive files such as tar and cpio. Internally, it
  29. follows a modular layered design that should make it easy to
  30. add new archive and compression formats.</p>
  31. <p style="margin-top: 1em"><b>GENERAL ARCHITECTURE</b></p>
  32. <p style="margin-left:6%;">Externally, libarchive exposes
  33. most operations through an opaque, object-style interface.
  34. The archive_entry(3) objects store information about a
  35. single filesystem object. The rest of the library provides
  36. facilities to write archive_entry(3) objects to archive
  37. files, read them from archive files, and write them to disk.
  38. (There are plans to add a facility to read archive_entry(3)
  39. objects from disk as well.)</p>
  40. <p style="margin-left:6%; margin-top: 1em">The read and
  41. write APIs each have four layers: a public API layer, a
  42. format layer that understands the archive file format, a
  43. compression layer, and an I/O layer. The I/O layer is
  44. completely exposed to clients who can replace it entirely
  45. with their own functions.</p>
  46. <p style="margin-left:6%; margin-top: 1em">In order to
  47. provide as much consistency as possible for clients, some
  48. public functions are virtualized. Eventually, it should be
  49. possible for clients to open an archive or disk writer, and
  50. then use a single set of code to select and write entries,
  51. regardless of the target.</p>
  52. <p style="margin-top: 1em"><b>READ ARCHITECTURE</b></p>
  53. <p style="margin-left:6%;">From the outside, clients use
  54. the archive_read(3) API to manipulate an <b>archive</b>
  55. object to read entries and bodies from an archive stream.
  56. Internally, the <b>archive</b> object is cast to an
  57. <b>archive_read</b> object, which holds all read-specific
  58. data. The API has four layers: The lowest layer is the I/O
  59. layer. This layer can be overridden by clients, but most
  60. clients use the packaged I/O callbacks provided, for
  61. example, by archive_read_open_memory(3), and
  62. archive_read_open_fd(3). The compression layer calls the I/O
  63. layer to read bytes and decompresses them for the format
  64. layer. The format layer unpacks a stream of uncompressed
  65. bytes and creates <b>archive_entry</b> objects from the
  66. incoming data. The API layer tracks overall state (for
  67. example, it prevents clients from reading data before
  68. reading a header) and invokes the format and compression
  69. layer operations through registered function pointers. In
  70. particular, the API layer drives the format-detection
  71. process: When opening the archive, it reads an initial block
  72. of data and offers it to each registered compression
  73. handler. The one with the highest bid is initialized with
  74. the first block. Similarly, the format handlers are polled
  75. to see which handler is the best for each archive. (Prior to
  76. 2.4.0, the format bidders were invoked for each entry, but
  77. this design hindered error recovery.)</p>
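<p style="margin-left:6%; margin-top: 1em">The bidding step described above can be sketched roughly as follows. This is only an illustration: <b>struct bidder</b>, the two toy bid functions, and <b>pick_best_bidder</b>() are hypothetical names invented for this sketch and are not libarchive's internal interfaces.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler record; libarchive's real structures differ. */
struct bidder {
    const char *name;
    int (*bid)(const void *buff, size_t len);
};

/* Toy bidder: recognizes nothing, so it always bids zero. */
int
bid_never(const void *buff, size_t len)
{
    (void)buff;
    (void)len;
    return 0;
}

/* Toy bidder: bids 8 (one byte checked) if the first byte is 'A'. */
int
bid_first_byte_a(const void *buff, size_t len)
{
    const unsigned char *p = buff;

    return (len >= 1 && p[0] == 'A') ? 8 : 0;
}

/* Return the handler with the highest positive bid, or NULL if all bid zero. */
const struct bidder *
pick_best_bidder(const struct bidder *bidders, size_t count,
    const void *buff, size_t len)
{
    const struct bidder *best = NULL;
    int best_bid = 0;
    size_t i;

    assert(bidders != NULL || count == 0);
    for (i = 0; i < count; i++) {
        int b = bidders[i].bid(buff, len);

        if (b > best_bid) {
            best_bid = b;
            best = &bidders[i];
        }
    }
    return best;
}
```

<p style="margin-left:6%; margin-top: 1em">A handler that bids zero is never selected, which matches the rule that a zero bid keeps a handler from being invoked.</p>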
  78. <p style="margin-left:6%; margin-top: 1em"><b>I/O Layer and
  79. Client Callbacks</b> <br>
  80. The read API goes to some lengths to accommodate clients. As
  81. a result, there are few restrictions on the behavior of the
  82. client callbacks.</p>
  83. <p style="margin-left:6%; margin-top: 1em">The client read
  84. callback is expected to provide a block of data on each
  85. call. A zero-length return does indicate end of file, but
  86. otherwise blocks may be as small as one byte or as large as
  87. the entire file. In particular, blocks may be of different
  88. sizes.</p>
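<p style="margin-left:6%; margin-top: 1em">As an illustration of these rules, here is a minimal sketch of a read callback that hands out an in-memory buffer in blocks of varying size and returns zero at end of file. The callback is modeled on the general shape of the public read callback, but <b>struct mem_source</b> and <b>mem_read</b>() are hypothetical, and the sketch uses <i>long</i> as the return type to stay self-contained.</p>

```c
#include <assert.h>
#include <stddef.h>

struct archive;  /* opaque here; only a pointer is passed through */

/* Hypothetical client state: an in-memory source handed out in chunks. */
struct mem_source {
    const unsigned char *data;
    size_t size;
    size_t pos;
    size_t chunk;   /* maximum number of bytes to hand out per call */
};

/*
 * Read callback sketch: point *buff at the next block and return its
 * length. A return of 0 signals end of file.
 */
long
mem_read(struct archive *a, void *client_data, const void **buff)
{
    struct mem_source *src = client_data;
    size_t n;

    (void)a;
    assert(src != NULL && src->chunk > 0);
    n = src->size - src->pos;
    if (n > src->chunk)
        n = src->chunk;          /* the last block may be shorter */
    *buff = src->data + src->pos;
    src->pos += n;
    return (long)n;
}
```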
  89. <p style="margin-left:6%; margin-top: 1em">The client skip
  90. callback returns the number of bytes actually skipped, which
  91. may be much smaller than the skip requested. The only
  92. requirement is that the skip not exceed the request. In particular,
  93. clients are allowed to return zero for any skip that they
  94. don&rsquo;t want to handle. The skip callback must never be
  95. invoked with a negative value.</p>
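<p style="margin-left:6%; margin-top: 1em">A skip callback that follows these rules might look like the sketch below. It never skips more than requested, may skip less, and returns zero when it declines to skip. <b>struct skip_source</b>, its <i>max_skip</i> field, and <b>capped_skip</b>() are assumptions made for this example only.</p>

```c
#include <assert.h>
#include <stdint.h>

struct archive;  /* opaque here; only a pointer is passed through */

/* Hypothetical client state for a seekable source. */
struct skip_source {
    int64_t size;
    int64_t pos;
    int64_t max_skip;   /* largest skip honored per call; 0 means never skip */
};

/* Skip callback sketch: return the number of bytes actually skipped. */
int64_t
capped_skip(struct archive *a, void *client_data, int64_t request)
{
    struct skip_source *src = client_data;
    int64_t n = request;

    (void)a;
    assert(request >= 0);        /* negative requests are never issued */
    if (n > src->max_skip)
        n = src->max_skip;
    if (n > src->size - src->pos)
        n = src->size - src->pos;
    src->pos += n;
    return n;
}
```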
  96. <p style="margin-left:6%; margin-top: 1em">Keep in mind
  97. that not all clients are reading from disk: clients reading
  98. from networks may provide different-sized blocks on every
  99. request and cannot skip at all; advanced clients may use
  100. mmap(2) to read the entire file into memory at once and
  101. return the entire file to libarchive as a single block;
  102. other clients may begin asynchronous I/O operations for the
  103. next block on each request.</p>
  104. <p style="margin-left:6%; margin-top: 1em"><b>Decompression
  105. Layer</b> <br>
  106. The decompression layer not only handles decompression but
  107. also buffers data so that the format handlers see a much
  108. simpler I/O model. The decompression API is a two-stage
  109. peek/consume model. A read_ahead request specifies a minimum
  110. read amount; the decompression layer must provide a pointer
  111. to at least that much data. If more data is immediately
  112. available, it should return more: the format layer handles
  113. bulk data reads by asking for a minimum of one byte and then
  114. copying as much data as is available.</p>
  115. <p style="margin-left:6%; margin-top: 1em">A subsequent
  116. call to the <b>consume</b>() function advances the read
  117. pointer. Note that data returned from a <b>read_ahead</b>()
  118. call is guaranteed to remain in place until the next call to
  119. <b>read_ahead</b>(). Intervening calls to <b>consume</b>()
  120. should not cause the data to move.</p>
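<p style="margin-left:6%; margin-top: 1em">The peek/consume model can be illustrated with a small sketch over data that is already in memory. The names <b>struct peek_stream</b>, <b>stream_read_ahead</b>(), and <b>stream_consume</b>() are hypothetical and do not reproduce the actual internal signatures; the sketch only shows that peeking does not advance the read pointer and that consuming does not move the data.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffered stream over data that is already in memory. */
struct peek_stream {
    const unsigned char *data;
    size_t size;
    size_t pos;     /* read pointer */
};

/*
 * Peek: return a pointer to at least `min` bytes without advancing, and
 * report in *avail how many bytes are actually available. Returns NULL
 * if fewer than `min` bytes remain.
 */
const void *
stream_read_ahead(struct peek_stream *s, size_t min, size_t *avail)
{
    size_t remaining = s->size - s->pos;

    *avail = remaining;
    if (remaining < min)
        return NULL;
    return s->data + s->pos;
}

/* Consume: advance the read pointer; the bytes themselves do not move. */
void
stream_consume(struct peek_stream *s, size_t n)
{
    assert(n <= s->size - s->pos);
    s->pos += n;
}
```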
  121. <p style="margin-left:6%; margin-top: 1em">Skip requests
  122. must always be handled exactly. Decompression handlers that
  123. cannot seek forward should not register a skip handler; the
  124. API layer fills in a generic skip handler that reads and
  125. discards data.</p>
  126. <p style="margin-left:6%; margin-top: 1em">A decompression
  127. handler has a specific lifecycle:</p>
  128. <p>Registration/Configuration</p>
  129. <p style="margin-left:17%;">When the client invokes the
  130. public support function, the decompression handler invokes
  131. the internal <b>__archive_read_register_compression</b>()
  132. function to provide bid and initialization functions. This
  133. function returns <b>NULL</b> on error or else a pointer to a
  134. <b>struct decompressor_t</b>. This structure contains a
  135. <i>void * config</i> slot that can be used for storing any
  136. customization information.</p>
  137. <p>Bid</p>
  138. <p style="margin-left:17%; margin-top: 1em">The bid
  139. function is invoked with a pointer and size of a block of
  140. data. The decompressor can access its config data through
  141. the <i>decompressor</i> element of the <b>archive_read</b>
  142. object. The bid function is otherwise stateless. In
  143. particular, it must not perform any I/O operations.</p>
  144. <p style="margin-left:17%; margin-top: 1em">The value
  145. returned by the bid function indicates its suitability for
  146. handling this data stream. A bid of zero will ensure that
  147. this decompressor is never invoked. Return zero if magic
  148. number checks fail. Otherwise, your initial implementation
  149. should return the number of bits actually checked. For
  150. example, if you verify two full bytes and three bits of
  151. another byte, bid 19. Note that the initial block may be
  152. very short; be careful to only inspect the data you are
  153. given. (The current decompressors require two bytes for
  154. correct bidding.)</p>
  155. <p>Initialize</p>
  156. <p style="margin-left:17%;">The winning bidder will have
  157. its init function called. This function should initialize
  158. the remaining slots of the <i>struct decompressor_t</i>
  159. object pointed to by the <i>decompressor</i> element of the
  160. <i>archive_read</i> object. In particular, it should
  161. allocate any working data it needs in the <i>data</i> slot
  162. of that structure. The init function is called with the
  163. block of data that was used for tasting. At this point, the
  164. decompressor is responsible for all I/O requests to the
  165. client callbacks. The decompressor is free to read more data
  166. as and when necessary.</p>
  167. <p>Satisfy I/O requests</p>
  168. <p style="margin-left:17%;">The format handler will invoke
  169. the <i>read_ahead</i>, <i>consume</i>, and <i>skip</i>
  170. functions as needed.</p>
  171. <p>Finish</p>
  172. <p style="margin-left:17%; margin-top: 1em">The finish
  173. method is called only once when the archive is closed. It
  174. should release anything stored in the <i>data</i> and
  175. <i>config</i> slots of the <i>decompressor</i> object. It
  176. should not invoke the client close callback.</p>
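<p style="margin-left:6%; margin-top: 1em">The bid rule from the lifecycle above (return zero on a failed magic check, otherwise the number of bits verified) can be sketched for a gzip-like stream, whose first two bytes are 0x1f 0x8b and whose third byte is the compression method 8. The function name <b>gzip_like_bid</b>() is hypothetical, and the actual gzip bidder may check different or additional fields.</p>

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bid sketch: inspect only the bytes we were given, return 0 if any
 * check fails, otherwise the number of bits that were verified.
 */
int
gzip_like_bid(const void *buff, size_t len)
{
    const unsigned char *p = buff;
    int bits = 0;

    assert(buff != NULL || len == 0);
    if (len < 2)
        return 0;               /* too short to check the magic number */
    if (p[0] != 0x1f || p[1] != 0x8b)
        return 0;
    bits += 16;                 /* two full bytes verified */
    if (len >= 3) {
        if (p[2] != 0x08)
            return 0;
        bits += 8;              /* method byte verified as well */
    }
    return bits;
}
```

<p style="margin-left:6%; margin-top: 1em">Because the bid grows with the number of bits checked, a handler that verifies more of the header outbids one that matched only a short prefix.</p>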
  177. <p style="margin-left:6%; margin-top: 1em"><b>Format
  178. Layer</b> <br>
  179. The read formats have a similar lifecycle to the
  180. decompression handlers:</p>
  181. <p>Registration</p>
  182. <p style="margin-left:17%;">Allocate your private data and
  183. initialize your pointers.</p>
  184. <p>Bid</p>
  185. <p style="margin-left:17%; margin-top: 1em">Formats bid by
  186. invoking the <b>read_ahead</b>() decompression method but
  187. not calling the <b>consume</b>() method. This allows each
  188. bidder to look ahead in the input stream. Bidders should not
  189. look further ahead than necessary, as long look-aheads put
  190. pressure on the decompression layer to buffer lots of data.
  191. Most formats only require a few hundred bytes of look-ahead;
  192. look-aheads of a few kilobytes are reasonable. (The ISO9660
  193. reader sometimes looks ahead by 48k, which should be
  194. considered an upper limit.)</p>
  195. <p>Read header</p>
  196. <p style="margin-left:17%;">The header read is usually the
  197. most complex part of any format. There are a few strategies
  198. worth mentioning: For formats such as tar or cpio, reading
  199. and parsing the header is straightforward since headers
  200. alternate with data. For formats that store all header data
  201. at the beginning of the file, the first header read request
  202. may have to read all headers into memory and store that
  203. data, sorted by the location of the file data. Subsequent
  204. header read requests will skip forward to the beginning of
  205. the file data and return the corresponding header.</p>
  206. <p>Read Data</p>
  207. <p style="margin-left:17%;">The read data interface
  208. supports sparse files; this requires that each call return a
  209. block of data specifying the file offset and size. This may
  210. require you to carefully track the location so that you can
  211. return accurate file offsets for each read. Remember that
  212. the decompressor will return as much data as it has.
  213. Generally, you will want to request one byte, examine the
  214. return value to see how much data is available, and possibly
  215. trim that to the amount you can use. You should invoke
  216. consume for each block just before you return it.</p>
  217. <p>Skip All Data</p>
  218. <p style="margin-left:17%;">The skip data call should skip
  219. over all file data and trailing padding. This is called
  220. automatically by the API layer just before each header read.
  221. It is also called in response to the client calling the
  222. public <b>data_skip</b>() function.</p>
  223. <p>Cleanup</p>
  224. <p style="margin-left:17%;">On cleanup, the format should
  225. release all of its allocated memory.</p>
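<p style="margin-left:6%; margin-top: 1em">The read-data step above (see what is available, trim it to what the entry can use, track the file offset, and consume just before returning) might be sketched as follows. <b>struct stream</b> stands in for the decompression layer, and <b>struct entry_state</b> and <b>read_data_block</b>() are hypothetical names; a real format handler would also deal with padding and sparse regions.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the decompression layer: data in memory. */
struct stream {
    const unsigned char *data;
    size_t size;
    size_t pos;
};

/* Hypothetical per-entry state kept by a format reader. */
struct entry_state {
    size_t remaining;   /* bytes of the entry body not yet returned */
    int64_t offset;     /* file offset of the next block */
};

/*
 * Return the next block of the entry body. Sets *buff, *size, and
 * *offset and returns 1, or returns 0 when the body is exhausted.
 */
int
read_data_block(struct stream *s, struct entry_state *e,
    const void **buff, size_t *size, int64_t *offset)
{
    size_t avail;

    assert(s->pos <= s->size);
    avail = s->size - s->pos;      /* what a one-byte peek would report */
    if (e->remaining == 0 || avail == 0)
        return 0;
    if (avail > e->remaining)
        avail = e->remaining;      /* trim to what this entry can use */
    *buff = s->data + s->pos;
    *size = avail;
    *offset = e->offset;
    s->pos += avail;               /* consume just before returning */
    e->remaining -= avail;
    e->offset += (int64_t)avail;
    return 1;
}
```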
  226. <p style="margin-left:6%; margin-top: 1em"><b>API Layer</b>
  227. <br>
  228. XXX to do XXX</p>
  229. <p style="margin-top: 1em"><b>WRITE ARCHITECTURE</b></p>
  230. <p style="margin-left:6%;">The write API has a similar set
  231. of four layers: an API layer, a format layer, a compression
  232. layer, and an I/O layer. The registration here is much
  233. simpler because only one format and one compression can be
  234. registered at a time.</p>
  235. <p style="margin-left:6%; margin-top: 1em"><b>I/O Layer and
  236. Client Callbacks</b> <br>
  237. XXX To be written XXX</p>
  238. <p style="margin-left:6%; margin-top: 1em"><b>Compression
  239. Layer</b> <br>
  240. XXX To be written XXX</p>
  241. <p style="margin-left:6%; margin-top: 1em"><b>Format
  242. Layer</b> <br>
  243. XXX To be written XXX</p>
  244. <p style="margin-left:6%; margin-top: 1em"><b>API Layer</b>
  245. <br>
  246. XXX To be written XXX</p>
  247. <p style="margin-top: 1em"><b>WRITE_DISK
  248. ARCHITECTURE</b></p>
  249. <p style="margin-left:6%;">The write_disk API is intended
  250. to look just like the write API to clients. Since it does
  251. not handle multiple formats or compression, it is not
  252. layered internally.</p>
  253. <p style="margin-top: 1em"><b>GENERAL SERVICES</b></p>
  254. <p style="margin-left:6%;">The <b>archive_read</b>,
  255. <b>archive_write</b>, and <b>archive_write_disk</b> objects
  256. all contain an initial <b>archive</b> object which provides
  257. common support for a set of standard services. (Recall that
  258. ANSI/ISO C90 guarantees that you can cast freely between a
  259. pointer to a structure and a pointer to the first element of
  260. that structure.) The <b>archive</b> object has a magic value
  261. that indicates which API this object is associated with,
  262. slots for storing error information, and function pointers
  263. for virtualized API functions.</p>
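<p style="margin-left:6%; margin-top: 1em">The first-member cast and the magic-value check can be illustrated with the sketch below. The structures, the <i>DEMO_READ_MAGIC</i> value, and <b>reader_from_base</b>() are made up for this example and do not reflect the layout of the real <b>archive</b> or <b>archive_read</b> objects.</p>

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_READ_MAGIC 0x1234U   /* made-up value for this sketch */

/* Hypothetical base object shared by all APIs. */
struct base {
    unsigned int magic;     /* identifies which API owns this object */
    int last_error;
};

/* Hypothetical reader object; the base object must be the first member. */
struct reader {
    struct base base;
    size_t bytes_read;
};

/* Recover the reader from a base pointer after checking the magic value. */
struct reader *
reader_from_base(struct base *b)
{
    /* The cast is only valid because `base` sits at offset zero. */
    assert(offsetof(struct reader, base) == 0);
    if (b == NULL || b->magic != DEMO_READ_MAGIC)
        return NULL;
    return (struct reader *)b;
}
```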
  264. <p style="margin-top: 1em"><b>MISCELLANEOUS NOTES</b></p>
  265. <p style="margin-left:6%;">Connecting existing archiving
  266. libraries into libarchive is generally quite difficult. In
  267. particular, many existing libraries strongly assume that you
  268. are reading from a file; they seek forwards and backwards as
  269. necessary to locate various pieces of information. In
  270. contrast, libarchive never seeks backwards in its input,
  271. which sometimes requires very different approaches.</p>
  272. <p style="margin-left:6%; margin-top: 1em">For example,
  273. libarchive&rsquo;s ISO9660 support operates very differently
  274. from most ISO9660 readers. The libarchive support utilizes a
  275. work-queue design that keeps a list of known entries sorted
  276. by their location in the input. Whenever libarchive&rsquo;s
  277. ISO9660 implementation is asked for the next header, it checks
  278. this list to find the next item on the disk. Directories are
  279. parsed when they are encountered and new items are added to
  280. the list. This design relies heavily on the ISO9660 image
  281. being optimized so that directories always occur earlier on
  282. the disk than the files they describe.</p>
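<p style="margin-left:6%; margin-top: 1em">The work-queue idea can be sketched with a small list kept sorted by input offset, so that the next item to visit is always the one nearest the start of the input and no backward seek is needed. This is an illustration only: <b>struct work_queue</b>, <b>queue_add</b>(), and <b>queue_pop</b>() are hypothetical, and the fixed-size sorted array is chosen for brevity rather than to mirror the data structure the ISO9660 reader actually uses.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 16

/* Hypothetical record for an entry whose location is known. */
struct pending {
    int64_t offset;     /* location in the input */
    int is_dir;         /* nonzero if this item is a directory */
};

struct work_queue {
    struct pending items[QUEUE_CAP];
    size_t count;
};

/* Insert an item, keeping the list sorted by ascending offset. */
int
queue_add(struct work_queue *q, int64_t offset, int is_dir)
{
    size_t i;

    if (q->count == QUEUE_CAP)
        return 0;                       /* queue is full */
    i = q->count;
    while (i > 0 && q->items[i - 1].offset > offset) {
        q->items[i] = q->items[i - 1];  /* shift later items up */
        i--;
    }
    q->items[i].offset = offset;
    q->items[i].is_dir = is_dir;
    q->count++;
    return 1;
}

/* Remove and return the item nearest the start of the input. */
struct pending
queue_pop(struct work_queue *q)
{
    struct pending first;
    size_t i;

    assert(q->count > 0);
    first = q->items[0];
    for (i = 1; i < q->count; i++)
        q->items[i - 1] = q->items[i];
    q->count--;
    return first;
}
```

<p style="margin-left:6%; margin-top: 1em">Items discovered while parsing a directory are added with <b>queue_add</b>(); as long as every new item lies at or beyond the current position, popping the lowest offset never requires seeking backwards.</p>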
  283. <p style="margin-left:6%; margin-top: 1em">Depending on the
  284. specific format, such approaches may not be possible. The
  285. ZIP format specification, for example, allows archivers to
  286. store key information only at the end of the file. In
  287. theory, it is possible to create ZIP archives that cannot be
  288. read without seeking. Fortunately, such archives are very
  289. rare, and libarchive can read most ZIP archives, though it
  290. cannot always extract as much information as a dedicated ZIP
  291. program.</p>
  292. <p style="margin-top: 1em"><b>SEE ALSO</b></p>
  293. <p style="margin-left:6%;">archive_entry(3),
  294. archive_read(3), archive_write(3), archive_write_disk(3),
  295. libarchive(3)</p>
  296. <p style="margin-top: 1em"><b>HISTORY</b></p>
  297. <p style="margin-left:6%;">The <b>libarchive</b> library
  298. first appeared in FreeBSD&nbsp;5.3.</p>
  299. <p style="margin-top: 1em"><b>AUTHORS</b></p>
  300. <p style="margin-left:6%;">The <b>libarchive</b> library
  301. was written by Tim Kientzle &lt;[email protected]&gt;.</p>
  302. <p style="margin-left:6%; margin-top: 1em">BSD
  303. January&nbsp;26, 2011 BSD</p>
  304. <hr>
  305. </body>
  306. </html>