Nix Archive (NAR) Format and Merkle Graph Content Addressing

Examples of such serialisations are the ZIP and TAR file formats.
However, for our purposes these formats have three problems:

- They do not have a canonical serialisation, meaning that given an FSO, there can be many different serialisations.
  For instance, TAR files can have variable amounts of padding between archive members; and some archive formats leave the order of directory entries undefined.
  This would be bad because we use serialisation to compute cryptographic hashes over file system objects, and for those hashes to be useful as a content address or for integrity checking, the serialisation must be unique.
  Otherwise, two serialisations of the same FSO would hash differently, so integrity checks would report false mismatches and the store would fail to find the content.

- They store more information than we have in our notion of FSOs, such as time stamps.
  This can cause FSOs that Nix should consider equal to hash to different values on different machines, just because the dates differ.

- As a practical consideration, the TAR format is the only truly universal format in the Unix environment.
  It has many problems, such as an inability to deal with long file names and files larger than 2^33 bytes (8 GiB).
  Current implementations such as GNU Tar work around these limitations in various ways.

For these reasons, Nix has its very own archive format: the Nix Archive (NAR) format, which is carefully designed to avoid the problems described above.

The exact specification of the Nix Archive format is [given here](../../protocols/nix-archive.md).

## Content addressing File System Objects beyond a single serialisation pass

Serialising the entire tree and then hashing that binary string is not the only option for content addressing, however.
Another technique is that of a [Merkle graph](https://en.wikipedia.org/wiki/Merkle_tree), where previously computed hashes are included in subsequent byte strings to be hashed.

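
As a minimal sketch of the Merkle idea, the Python below hashes a toy file system object bottom-up. The encoding (`file\0`, `dir\0`, newline-terminated entries) and the name `merkle_hash` are made up for illustration and are neither the NAR format nor Git's; the sketch also assumes entry names contain no NUL or newline bytes.

```python
import hashlib


def merkle_hash(node) -> str:
    """Hash a toy file system object bottom-up.

    A file is a `bytes` value; a directory is a `dict` mapping entry
    names to child nodes. This encoding is hypothetical, chosen only
    to show hashes of children feeding into the hash of the parent.
    """
    if isinstance(node, bytes):
        payload = b"file\0" + node
    elif isinstance(node, dict):
        parts = [b"dir\0"]
        # Sort entries so that the same directory always produces the
        # same byte string, whatever order the entries were added in.
        for name in sorted(node):
            child_digest = merkle_hash(node[name])
            # Only the child's digest enters the parent's byte string,
            # not the child's contents.
            parts.append(name.encode() + b"\0" + child_digest.encode() + b"\n")
        payload = b"".join(parts)
    else:
        raise TypeError(f"unsupported node type: {type(node).__name__}")
    return hashlib.sha256(payload).hexdigest()


tree_a = {"README": b"hi\n", "bin": {"hello": b"#!/bin/sh\n"}}
tree_b = {"bin": {"hello": b"#!/bin/sh\n"}, "README": b"hi\n"}
tree_c = {"README": b"hi\n", "bin": {"hello": b"#!/bin/bash\n"}}

print(merkle_hash(tree_a) == merkle_hash(tree_b))  # same content, same hash
print(merkle_hash(tree_a) == merkle_hash(tree_c))  # a changed leaf changes the root
```

Because a directory's byte string mentions its children only through their digests, an unchanged subtree keeps its hash and does not need to be serialised again when a sibling changes.
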
In particular, the Merkle graphs can match the original graph structure of file system objects:
we first hash the (serialised) child file system objects, and then include those hashes, in place of the children themselves, in the serialisation of the parent file system object, which is hashed in turn.

Currently, there is one such Merkle DAG content addressing method supported.

### Git ([experimental][xp-feature-git-hashing]) { #git }

> **Warning**
>
> This method is part of the [`git-hashing`][xp-feature-git-hashing] experimental feature.

Git's file system model is very close to Nix's, and so Git's content addressing method is a pretty good fit.
Just as with regular Git, files and symlinks are hashed as git "blobs", and directories are hashed as git "trees".

However, one difference between Nix's and Git's file system model needs special treatment.
In Git, plain files, executable files, and symlinks are not differentiated as distinctly addressable objects, but by their context: by the directory entry that refers to them.
That means so long as the root object is a directory, there is no problem: every non-directory object is owned by a parent directory, and the entry that refers to it provides the missing information.
However, if the root object is not a directory, then we have no way of knowing which one of an executable file, non-executable file, or symlink it is supposed to be.

In response to this, we have decided to treat a bare file as a non-executable file.
This is similar to what we do with [flat serialisation](#serial-flat), which also lacks this information.
To avoid an address collision, attempts to hash a bare executable file or symlink will result in an error (just as would happen for flat serialisation).

Thus, Git can encode some, but not all of Nix's "File System Objects", and this sort of content-addressing is likewise partial.

In the future, we may support a Git-like hash for such file system objects, or we may adopt another Merkle DAG format which is capable of representing all Nix file system objects.

This text explains why standard archive formats like ZIP and TAR are unsuitable for Nix's content addressing due to their non-canonical serialization and inclusion of irrelevant metadata (like timestamps). To solve this, Nix uses its own highly deterministic Nix Archive (NAR) format. Beyond single-pass serialization, Nix also supports content addressing via Merkle graphs, where hashes of child objects are embedded in parent objects' hashes. An experimental method, 'Git hashing', applies Git's blob/tree model for files/symlinks and directories respectively, though it has limitations when addressing bare root objects that are executable files or symlinks, leading to errors in such cases to prevent address collisions.