| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120 |
- LZ4 Format Description
- Last revised: 2012-02-27
- Author : Y. Collet
- This small specification intents to provide enough information
- to anyone willing to produce LZ4-compatible compressed data blocks
- using any programming language.
- LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
- The most important design principle behind LZ4 is simplicity.
- It helps to create an easy to read and maintain source code.
- It also helps later on for optimisations, compactness, and speed.
- There is no entropy encoder backend nor framing layer.
- The latter is assumed to be handled by other parts of the system.
- This document only describes the format,
- not how the LZ4 compressor nor decompressor actually work.
- The correctness of the decompressor should not depend
- on implementation details of the compressor, and vice versa.
- -- Compressed block format --
- An LZ4 compressed block is composed of sequences.
- Schematically, a sequence is a suite of literals, followed by a match copy.
- Each sequence starts with a token.
- The token is a one byte value, separated into two 4-bits fields.
- Therefore each field ranges from 0 to 15.
- The first field uses the 4 high-bits of the token.
- It provides the length of literals to follow.
- (Note : a literal is a not-compressed byte).
- If the field value is 0, then there is no literal.
- If it is 15, then we need to add some more bytes to indicate the full length.
- Each additionnal byte then represent a value from 0 to 255,
- which is added to the previous value to produce a total length.
- When the byte value is 255, another byte is output.
- There can be any number of bytes following the token. There is no "size limit".
- (Sidenote this is why a not-compressible input block is expanded by 0.4%).
- Example 1 : A length of 48 will be represented as :
- - 15 : value for the 4-bits High field
- - 33 : (=48-15) remaining length to reach 48
- Example 2 : A length of 280 will be represented as :
- - 15 : value for the 4-bits High field
- - 255 : following byte is maxed, since 280-15 >= 255
- - 10 : (=280 - 15 - 255) ) remaining length to reach 280
- Example 3 : A length of 15 will be represented as :
- - 15 : value for the 4-bits High field
- - 0 : (=15-15) yes, the zero must be output
- Following the token and optional length bytes, are the literals themselves.
- They are exactly as numerous as previously decoded (length of literals).
- It's possible that there are zero literal.
- Following the literals is the match copy operation.
- It starts by the offset.
- This is a 2 bytes value, in little endian format.
- The offset represents the position of the match to be copied from.
- 1 means "current position - 1 byte".
- The maximum offset value is 65535, 65536 cannot be coded.
- Note that 0 is an invalid value, not used.
- Then we need to extract the match length.
- For this, we use the second token field, the low 4-bits.
- Value, obviously, ranges from 0 to 15.
- However here, 0 means that the copy operation will be minimal.
- The minimum length of a match, called minmatch, is 4.
- As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
- Similar to literal length, on reaching the highest possible value (15),
- we output additional bytes, one at a time, with values ranging from 0 to 255.
- They are added to total to provide the final match length.
- A 255 value means there is another byte to read and add.
- There is no limit to the number of optional bytes that can be output this way.
- (This points towards a maximum achievable compression ratio of ~250).
- With the offset and the matchlength,
- the decoder can now proceed to copy the data from the already decoded buffer.
- On decoding the matchlength, we reach the end of the compressed sequence,
- and therefore start another one.
- -- Parsing restrictions --
- There are specific parsing rules to respect in order to remain compatible
- with assumptions made by the decoder :
- 1) The last 5 bytes are always literals
- 2) The last match must start at least 12 bytes before end of block
- Consequently, a block with less than 13 bytes cannot be compressed.
- These rules are in place to ensure that the decoder
- will never read beyond the input buffer, nor write beyond the output buffer.
- Note that the last sequence is also incomplete,
- and stops right after literals.
- -- Additional notes --
- There is no assumption nor limits to the way the compressor
- searches and selects matches within the source data block.
- It could be a fast scan, a multi-probe, a full search using BST,
- standard hash chains or MMC, well whatever.
- Advanced parsing strategies can also be implemented, such as lazy match,
- or full optimal parsing.
- All these trade-off offer distinctive speed/memory/compression advantages.
- Whatever the method used by the compressor, its result will be decodable
- by any LZ4 decoder if it follows the format specification described above.
|