Then there is mkv/webm, where strictly speaking you need to implement at least part of an EBML parser to distinguish them. That is possibly why no other file format has adopted EBML; everything just recognizes such a file as either mkv or webm based on dodgy heuristics.
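For illustration, here is a minimal sketch of the DocType sniffing this requires. It is not a real EBML parser: it assumes a well-formed EBML header within the first 4 KiB and the standard 2-byte DocType element ID, and all function names are mine.

```python
def read_vint(buf, pos):
    # EBML variable-length integer: the number of leading zero bits in the
    # first byte, plus one, gives the total length; sizes clear the marker bit.
    length = 8 - buf[pos].bit_length() + 1
    value = buf[pos] & (0xFF >> length)
    for b in buf[pos + 1 : pos + length]:
        value = (value << 8) | b
    return value, pos + length

def read_id(buf, pos):
    # Element IDs use the same length encoding but keep their marker bit,
    # so we return the raw bytes (DocType is b"\x42\x82").
    length = 8 - buf[pos].bit_length() + 1
    return buf[pos : pos + length], pos + length

def sniff_ebml_doctype(path):
    with open(path, "rb") as f:
        head = f.read(4096)
    if head[:4] != b"\x1a\x45\xdf\xa3":   # EBML header element ID
        return None
    size, pos = read_vint(head, 4)        # payload size of the EBML header
    end = pos + size
    while pos < end:                      # walk the header's child elements
        eid, pos = read_id(head, pos)
        esize, pos = read_vint(head, pos)
        if eid == b"\x42\x82":            # DocType
            return head[pos : pos + esize].decode("ascii")
        pos += esize
    return None

# sniff_ebml_doctype(...) returns "matroska" for .mkv and "webm" for .webm.
```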
Git interprets a zero byte as an unconditional sign that a file is a binary file [0]. With other “nonprintable” characters (including the high-bit ones) it depends on their frequency. Other tools look for high bits, or whether it’s valid UTF-8. PDF files usually have a comment with high-bit characters on the second line for similar reasons.
These recommended rules cover various common ways to check for text vs. binary, while also aiming to ensure that no genuine text file would ever accidentally match the magic number. The zero-byte recommendation largely achieves the latter (if one ignores double/quad-byte encodings like UTF-16/32).
[0] https://github.com/git/git/blob/683c54c999c301c2cd6f715c4114...
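As a rough illustration, git's check boils down to scanning an initial window of the file for a NUL byte. A minimal Python equivalent (the 8000-byte window mirrors git's buffer_is_binary, but treat the exact constant as an assumption here):

```python
def looks_binary(path, window=8000):
    # Binary if a NUL byte appears in the first `window` bytes.
    with open(path, "rb") as f:
        return b"\x00" in f.read(window)
```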
I think the idea of all of these is to make sure the file is not recognised as text (which doesn't allow nulls), as ASCII (which doesn't use the high bit), or as UTF-8 (which doesn't allow invalid byte sequences).
Basically so that no valid file in this binary format will be incorrectly misidentified as a text file.
The author explains their reasoning in the next post: https://hackers.town/@zwol/114155807716413069
Well, the 3rd point follows from the second: all sequences without the high bit set are valid ASCII, and all valid ASCII sequences are valid UTF-8.
The high-bit rule is pretty ancient by now; I don't think we have transmission methods that are not 8-bit-clean anymore. And if your file detector checks for "generic text" before any more specialized signatures (like "GIF87a"), and thus treats everything that starts with ASCII bytes as generic text, then sorry, but your detector is badly broken.
There's no reason for the high-bit "rule" in 2025.
I would argue the same goes for the zero-byte rule. If you use strcmp() in your magic-byte detector, then you're doing it wrong.
> I don't think we have transmission methods that are not 8-bit-clean anymore.
I dealt with a 7N1 serial link just yesterday; they still exist. Granted, nobody really uses them for truly arbitrary data exchange, but still.
The zero byte rule has nothing to do with strcmp(). Text files never contain 0-bytes, so having one is a strong sign the file is binary. Many detectors check for this.
Why is ELF a good example?
- MUST be the very first N bytes in the file -> check
- MUST be at least four bytes long, eight is better -> check, but only four
- MUST include at least one byte with the high bit set -> nope
- MUST include a byte sequence that is invalid UTF-8 -> nope
- SHOULD include a zero byte -> nope
So, just 1.5 out of 5. Not good.
By the way, does anyone know the reason it starts with DEL (7F) specifically?
I think what the author likes is the fact that the first 4 bytes are defined as 0x7F followed by the name "ELF" in ASCII, which makes it a quite robust identifier.
And to be fair, including the four bytes following the magic number makes the ELF format qualify for at least 3 out of the 4 'MUST' requirements (a small reader sketch follows below):
- 0x00~0x03: 7F 45 4C 46 (the magic number)
- 0x04: Either 01 or 02 (defines 32bit or 64bit)
- 0x05: Either 01 or 02 (defines Little Endian or Big Endian)
- 0x06: Set to 01 (ELF-version)
- 0x07: 00~12 (Target OS ABI)
Still not a shiny example though...
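A Python sketch of reading those eight identification bytes (the field values follow the ELF spec; the function name and return format are mine):

```python
def describe_elf_ident(path):
    with open(path, "rb") as f:
        ident = f.read(8)
    if ident[:4] != b"\x7fELF":           # the magic number itself
        return None
    ei_class, ei_data, ei_version, ei_osabi = ident[4:8]
    return {
        "class":  {1: "32-bit", 2: "64-bit"}.get(ei_class, "unknown"),
        "endian": {1: "little", 2: "big"}.get(ei_data, "unknown"),
        "version": ei_version,            # always 1 for current ELF
        "osabi":   ei_osabi,              # 0x00..0x12, e.g. 0x00 = System V
    }
```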
Maybe, yes. There are certainly worse offenders than ELF, but I still don't see how it satisfies 3 out of the 4 MUSTs. There is no byte with the high bit set and it is a valid ASCII sequence and therefore also valid UTF-8.
When it comes to the "eight is better" requirement, at least Linux does not care what comes after the fourth byte for identification purposes, so I think that does not count either.
- MUST be the very first N bytes in the file -> check
- MUST be at least four bytes long, eight is better -> check, eight bytes
- MUST include at least one byte with the high bit set -> nope
- MUST include a byte sequence that is invalid UTF-8 -> check, bytes 0x04 0x05 0x06 do not create valid UTF-8
- SHOULD include a zero byte -> nope
> When it comes to the "eight is better" requirement, at least Linux does not care what comes after the fourth byte for identification purposes, so I think that does not count either.
This describes the behavior of a system (Linux), not the file format.
I agree that it isn’t a particularly good example, especially with reference to the stated rules. Many binary-detection routines will treat DEL as a regular ASCII character.
It's (7F) ELF
Hmm, I would expect that to be E1F, if it stood for "ELF" in correct Hexspeak.
Is anyone able to break down why those requirements are desirable?
Off the top of my head, most are to make it as clear as possible that the file is binary and NOT text (a quick checker sketch follows the list):
> MUST be the very first N bytes in the file
For every system to be able to parse it without loading the entire file
> MUST be at least four bytes long, eight is better
To reduce risk of two different binary files on the same system having the same magic number
> MUST include at least one byte with the high bit set
To avoid wrongful identification as an ASCII file (ASCII doesn't use the high bit)
> MUST include a byte sequence that is invalid UTF-8
To avoid wrongful identification as UTF-8 text file
> SHOULD include a zero byte
To avoid wrongful identification as ANY text file
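To make the interplay concrete, here is a hypothetical checker for those rules. The rule texts are from the post; the validation approach is a sketch, and note that decoding only the magic bytes can flag a truncated multi-byte sequence that a longer file would complete.

```python
def check_magic(magic: bytes):
    problems = []
    if len(magic) < 4:
        problems.append("MUST be at least four bytes long")
    if not any(b & 0x80 for b in magic):
        problems.append("MUST include a byte with the high bit set")
    try:
        magic.decode("utf-8")             # decodes cleanly -> looks like text
        problems.append("MUST include a byte sequence that is invalid UTF-8")
    except UnicodeDecodeError:
        pass
    if 0 not in magic:
        problems.append("SHOULD include a zero byte")
    return problems

# check_magic(b"\x7fELF") reports three misses, matching the ELF debate above.
```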
>> MUST be the very first N bytes in the file
> For every system to be able to parse it without loading the entire file
It also solves an ambiguity problem: zip files have their magic numbers at the end, while most other formats, like PDF, have them at the beginning, so you can have a file that is both a PDF and a zip file.
For ZIP files this is a design GOAL, to allow things like self-extracting archives.
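A sketch of why such polyglots pass both checks: PDF readers look for a signature near the start, while zip readers locate the end-of-central-directory record near the end. Offsets are simplified; the 1024-byte PDF window and the zip comment allowance are the usual limits, assumed here.

```python
def is_pdf(path):
    # Many PDF readers accept the header anywhere in the first 1024 bytes.
    with open(path, "rb") as f:
        return b"%PDF-" in f.read(1024)

def is_zip(path):
    # The end-of-central-directory record (PK\x05\x06) occupies the last
    # 22 bytes, plus up to 64 KiB of trailing zip comment.
    with open(path, "rb") as f:
        f.seek(0, 2)                      # seek to end of file
        size = f.tell()
        f.seek(max(0, size - (22 + 65535)))
        return b"PK\x05\x06" in f.read()

# A single file can satisfy both checks, which is exactly the polyglot problem.
```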
Isn't a bigger risk wrongly identifying a text file as the specific type of binary you're trying to process?
First eight bytes of the file:
0xDC 0xDF X X x x (0x01 0x00 | 0x00 0x01)
0xDC 0xDF are bytes with the high bit set. Together with the next two bytes, they form a four-byte sequence that cannot appear in any valid ASCII, UTF-8, Corrected UTF-8, or UTF-16 (regardless of endianness) text document. This is not a perfectly bulletproof declaration that the file does not contain text, but it should be strong enough except maybe for formats like PDF that can't decide if they're structured text or binary.
X X x x: Four ASCII alphanumeric characters naming your file format. Make them clearly related to your recommended file name extension. I'm giving you four characters because we're running out of three-letter acronyms. If you don't need four characters, pad at the end with 0x1A (aka ^Z).
The first two of these (the uppercase Xes) must not have their high bits set, lest the "this is not text" declaration be weakened. For the other two (lowercase xes), use of ASCII alphanumerics is just a strong recommendation.
0x01 0x00 or 0x00 0x01: This is to be understood as a 16-bit unsigned integer in your choice of little- or big-endian order. It serves three functions. In descending order of importance:

It includes a zero byte, reinforcing the declaration that this is not a text file.

It demonstrates which byte ordering will be used throughout the file. It does not matter which order you choose, but you need to consciously choose either big- or little-endian and then use that byte order consistently throughout the file. Yes, I have seen cases where people didn't do that.

It's an escape hatch. If one day you discover that you need to alter the structure of the rest of the file in a totally incompatible way, and yet it is still meaningfully the same format, so you don't want to change the name characters, you can change the 0x01 to 0x02. We both hope that day will never come, but we both know it might.
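A quick sketch of assembling that recommended header (the layout and ^Z padding are from the post; the helper itself is illustrative):

```python
def make_header(name: str, little_endian: bool = True) -> bytes:
    tag = name.encode("ascii").ljust(4, b"\x1a")   # pad with ^Z as suggested
    assert len(tag) == 4 and tag[:2].isalnum()     # uppercase Xes stay ASCII
    version = (1).to_bytes(2, "little" if little_endian else "big")
    return b"\xdc\xdf" + tag + version

# make_header("FOO") == b"\xdc\xdfFOO\x1a\x01\x00"
```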
All according to the very next post in the thread, which people who actually read it will have found: https://hackers.town/@zwol/114155807716413069

Solve this forever by choosing a header that adheres to these properties, then add a UUID for the actual format.
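A sketch of that suggestion: mint a UUID once, hard-code the result, and append it to a generic rule-satisfying header (names here are illustrative):

```python
import uuid

# Run uuid.uuid4() once, then paste the generated constant into your code;
# 16 random bytes make accidental collisions a non-issue.
FORMAT_ID = uuid.uuid4().bytes

def full_magic(header: bytes) -> bytes:
    # 8-byte generic header (satisfying the rules above) + 16-byte format UUID.
    return header + FORMAT_ID
```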