Then there is mkv/webm, where strictly speaking you need to implement at least part of an EBML parser to distinguish them. That is possibly why no other file format has adopted EBML; everything just recognizes such a file as either mkv or webm based on dodgy heuristics.
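For illustration, here is a minimal sketch of the DocType sniffing this requires. It is not a real EBML parser: it assumes a well-formed EBML header within the first 4 KiB and the standard 2-byte DocType element ID, and all function names are mine.

```python
def read_vint(buf, pos):
    # EBML variable-length integer: the number of leading zero bits in the
    # first byte, plus one, gives the total length; sizes clear the marker bit.
    length = 8 - buf[pos].bit_length() + 1
    value = buf[pos] & (0xFF >> length)
    for b in buf[pos + 1 : pos + length]:
        value = (value << 8) | b
    return value, pos + length

def read_id(buf, pos):
    # Element IDs use the same length encoding but keep their marker bit,
    # so we return the raw bytes (DocType is b"\x42\x82").
    length = 8 - buf[pos].bit_length() + 1
    return buf[pos : pos + length], pos + length

def sniff_ebml_doctype(path):
    with open(path, "rb") as f:
        head = f.read(4096)
    if head[:4] != b"\x1a\x45\xdf\xa3":   # EBML header element ID
        return None
    size, pos = read_vint(head, 4)        # payload size of the EBML header
    end = pos + size
    while pos < end:                      # walk the header's child elements
        eid, pos = read_id(head, pos)
        esize, pos = read_vint(head, pos)
        if eid == b"\x42\x82":            # DocType
            return head[pos : pos + esize].decode("ascii")
        pos += esize
    return None

# sniff_ebml_doctype(...) returns "matroska" for .mkv and "webm" for .webm.
```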
Git interprets a zero byte as an unconditional sign that a file is a binary file [0]. With other “nonprintable” characters (including the high-bit ones) it depends on their frequency. Other tools look for high bits, or whether it’s valid UTF-8. PDF files usually have a comment with high-bit characters on the second line for similar reasons.
These recommended rules cover various common ways to check for text vs. binary, while also aiming to ensure that no genuine text file would ever accidentally match the magic number. The zero-byte recommendation largely achieves the latter (if one ignores double/quad-byte encodings like UTF-16/32).
[0] https://github.com/git/git/blob/683c54c999c301c2cd6f715c4114...
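As a rough illustration, git's check boils down to scanning an initial window of the file for a NUL byte. A minimal Python equivalent (the 8000-byte window mirrors git's buffer_is_binary, but treat the exact constant as an assumption here):

```python
def looks_binary(path, window=8000):
    # Binary if a NUL byte appears in the first `window` bytes.
    with open(path, "rb") as f:
        return b"\x00" in f.read(window)
```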
I think the idea of all of these is to make sure the file is not recognised as text (which doesn't allow nulls), as ASCII (which doesn't use the high bit), or as UTF-8 (which doesn't allow invalid byte sequences).
Basically so that no valid file in this binary format will be incorrectly misidentified as a text file.
The author explains their reasoning in the next post: https://hackers.town/@zwol/114155807716413069
Well, the 3rd point follows from the second: all sequences without the high bit set are valid ASCII, and all valid ASCII sequences are valid UTF-8.
The high-bit rule is pretty ancient by now; I don't think we have transmission methods that are not 8-bit-clean anymore. And if your file detector checks for "generic text" before any more specialized signatures (like "GIF87a"), and thus treats everything that starts with ASCII bytes as generic text, then sorry, but your detector is badly broken.
There's no reason for the high-bit "rule" in 2025.
I would argue the same goes for the zero-byte rule. If you use strcmp() in your magic-byte detector, then you're doing it wrong.
> I don't think we have transmission methods that are not 8-bit-clean anymore.
I dealt with a 7N1 serial link just yesterday; they still exist. Granted, nobody really uses them for truly arbitrary data exchange, but still.
The zero byte rule has nothing to do with strcmp(). Text files never contain 0-bytes, so having one is a strong sign the file is binary. Many detectors check for this.
Why is ELF a good example?
- MUST be the very first N bytes in the file -> check
- MUST be at least four bytes long, eight is better -> check, but only four
- MUST include at least one byte with the high bit set -> nope
- MUST include a byte sequence that is invalid UTF-8 -> nope
- SHOULD include a zero byte -> nope
So, just 1.5 out of 5. Not good.
By the way, does anyone know the reason it starts with DEL (7F) specifically?
I think what the author likes is the fact that the first 4 bytes are defined as 0x7F followed by the name "ELF" in ASCII, which makes it a quite robust identifier.
And to be fair, including the four bytes following the magic number makes the ELF format qualify for at least 3 out of the 4 'MUST' requirements (a small reader sketch follows below):
- 0x00~0x03: 7F 45 4C 46 (the magic number)
- 0x04: Either 01 or 02 (defines 32bit or 64bit)
- 0x05: Either 01 or 02 (defines Little Endian or Big Endian)
- 0x06: Set to 01 (ELF-version)
- 0x07: 00~12 (Target OS ABI)
Still not a shiny example though...
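A Python sketch of reading those eight identification bytes (the field values follow the ELF spec; the function name and return format are mine):

```python
def describe_elf_ident(path):
    with open(path, "rb") as f:
        ident = f.read(8)
    if ident[:4] != b"\x7fELF":           # the magic number itself
        return None
    ei_class, ei_data, ei_version, ei_osabi = ident[4:8]
    return {
        "class":  {1: "32-bit", 2: "64-bit"}.get(ei_class, "unknown"),
        "endian": {1: "little", 2: "big"}.get(ei_data, "unknown"),
        "version": ei_version,            # always 1 for current ELF
        "osabi":   ei_osabi,              # 0x00..0x12, e.g. 0x00 = System V
    }
```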
Maybe, yes. There are certainly worse offenders than ELF, but I still don't see how it satisfies 3 out of the 4 MUSTs. There is no byte with the high bit set and it is a valid ASCII sequence and therefore also valid UTF-8.
When it comes to the "eight is better" requirement, at least Linux does not care what comes after the fourth byte for identification purposes, so I think that does not count either.
- MUST be the very first N bytes in the file -> check
- MUST be at least four bytes long, eight is better -> check, eight bytes
- MUST include at least one byte with the high bit set -> nope
- MUST include a byte sequence that is invalid UTF-8 -> check, bytes 0x04 0x05 0x06 do not create valid UTF-8
- SHOULD include a zero byte -> nope
> When it comes to the "eight is better" requirement, at least Linux does not care what comes after the fourth byte for identification purposes, so I think that does not count either.
This describes the behavior of a system (Linux), not the file format.
I agree that it isn’t a particularly good example, especially with reference to the stated rules. Many binary-detection routines will treat DEL as a regular ASCII character.
It's (7F) ELF
Hmm, I would expect that to be E1F, if it stood for "ELF" in correct Hexspeak.
Is anyone able to break down why those requirements are desirable?
Off the top of my head, most are to make it as clear as possible that the file is binary and NOT text (a quick checker sketch follows the list):
> MUST be the very first N bytes in the file
For every system to be able to parse it without loading the entire file
> MUST be at least four bytes long, eight is better
To reduce risk of two different binary files on the same system having the same magic number
> MUST include at least one byte with the high bit set
To avoid wrongful identification as an ASCII file (ASCII doesn't use the high bit)
> MUST include a byte sequence that is invalid UTF-8
To avoid wrongful identification as UTF-8 text file
> SHOULD include a zero byte
To avoid wrongful identification as ANY text file
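To make the interplay concrete, here is a hypothetical checker for those rules. The rule texts are from the post; the validation approach is a sketch, and note that decoding only the magic bytes can flag a truncated multi-byte sequence that a longer file would complete.

```python
def check_magic(magic: bytes):
    problems = []
    if len(magic) < 4:
        problems.append("MUST be at least four bytes long")
    if not any(b & 0x80 for b in magic):
        problems.append("MUST include a byte with the high bit set")
    try:
        magic.decode("utf-8")             # decodes cleanly -> looks like text
        problems.append("MUST include a byte sequence that is invalid UTF-8")
    except UnicodeDecodeError:
        pass
    if 0 not in magic:
        problems.append("SHOULD include a zero byte")
    return problems

# check_magic(b"\x7fELF") reports three misses, matching the ELF debate above.
```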
>> MUST be the very first N bytes in the file
> For every system to be able to parse it without loading the entire file
It also solves an ambiguity problem: zip files have their magic numbers at the end, while most other formats, like PDF, have them at the beginning, so you can have a file that is both a PDF and a zip file.
For ZIP files this is a design GOAL, to allow things like self-extracting archives.
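A sketch of why such polyglots pass both checks: PDF readers look for a signature near the start, while zip readers locate the end-of-central-directory record near the end. Offsets are simplified; the 1024-byte PDF window and the zip comment allowance are the usual limits, assumed here.

```python
def is_pdf(path):
    # Many PDF readers accept the header anywhere in the first 1024 bytes.
    with open(path, "rb") as f:
        return b"%PDF-" in f.read(1024)

def is_zip(path):
    # The end-of-central-directory record (PK\x05\x06) occupies the last
    # 22 bytes, plus up to 64 KiB of trailing zip comment.
    with open(path, "rb") as f:
        f.seek(0, 2)                      # seek to end of file
        size = f.tell()
        f.seek(max(0, size - (22 + 65535)))
        return b"PK\x05\x06" in f.read()

# A single file can satisfy both checks, which is exactly the polyglot problem.
```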
Isn't a bigger risk wrongly identifying a text file as the specific type of binary you're trying to process?
First eight bytes of the file:
0xDC 0xDF X X x x (0x01 0x00 | 0x00 0x01)
0xDC 0xDF are bytes with the high bit set. Together with the next two bytes, they form a four-byte sequence that cannot appear in any valid ASCII, UTF-8, Corrected UTF-8, or UTF-16 (regardless of endianness) text document. This is not a perfectly bulletproof declaration that the file does not contain text, but it should be strong enough except maybe for formats like PDF that can't decide if they're structured text or binary.
X X x x: Four ASCII alphanumeric characters naming your file format. Make them clearly related to your recommended file name extension. I'm giving you four characters because we're running out of three-letter acronyms. If you don't need four characters, pad at the end with 0x1A (aka ^Z).
The first two of these (the uppercase Xes) must not have their high bits set, lest the "this is not text" declaration be weakened. For the other two (lowercase xes), use of ASCII alphanumerics is just a strong recommendation.
0x01 0x00 or 0x00 0x01: This is to be understood as a 16-bit unsigned integer in your choice of little- or big-endian order. It serves three functions. In descending order of importance:

It includes a zero byte, reinforcing the declaration that this is not a text file.

It demonstrates which byte ordering will be used throughout the file. It does not matter which order you choose, but you need to consciously choose either big- or little-endian and then use that byte order consistently throughout the file. Yes, I have seen cases where people didn't do that.

It's an escape hatch. If one day you discover that you need to alter the structure of the rest of the file in a totally incompatible way, and yet it is still meaningfully the same format, so you don't want to change the name characters, you can change the 0x01 to 0x02. We both hope that day will never come, but we both know it might.
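A quick sketch of assembling that recommended header (the layout and ^Z padding are from the post; the helper itself is illustrative):

```python
def make_header(name: str, little_endian: bool = True) -> bytes:
    tag = name.encode("ascii").ljust(4, b"\x1a")   # pad with ^Z as suggested
    assert len(tag) == 4 and tag[:2].isalnum()     # uppercase Xes stay ASCII
    version = (1).to_bytes(2, "little" if little_endian else "big")
    return b"\xdc\xdf" + tag + version

# make_header("FOO") == b"\xdc\xdfFOO\x1a\x01\x00"
```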
All according to the very next post in the thread, which people who actually read it will have found: https://hackers.town/@zwol/114155807716413069

Solve this forever by choosing a header that adheres to these properties, then add a UUID for the actual format.
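A sketch of that suggestion: mint a UUID once, hard-code the result, and append it to a generic rule-satisfying header (names here are illustrative):

```python
import uuid

# Run uuid.uuid4() once, then paste the generated constant into your code;
# 16 random bytes make accidental collisions a non-issue.
FORMAT_ID = uuid.uuid4().bytes

def full_magic(header: bytes) -> bytes:
    # 8-byte generic header (satisfying the rules above) + 16-byte format UUID.
    return header + FORMAT_ID
```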