Content-Addressed Storage from First Principles -- T34ch Tech

Every storage system you have used has the same basic interface: put data somewhere, give it a name, get it back later by that name. A file path, a database key, a URL -- these are all names that point to locations where data lives. The name and the data are independent. You can change the data without changing the name. You can change the name without changing the data. The name tells you where to look. It tells you nothing about what you will find.

Content-addressed storage inverts this relationship. Instead of choosing a name and assigning it to data, you compute the name from the data. The address of a piece of content is its cryptographic hash. If you have the address, you can verify the content. If you have the content, you can compute the address. The two are bound together by mathematics.

This single change -- deriving addresses from content rather than assigning them -- produces a cascade of useful properties. Deduplication becomes automatic. Integrity verification is built into every read. Immutability is enforced by the addressing scheme itself. Caching becomes trivial. Distribution becomes trustless. And the system works the same way whether the storage backend is a local filesystem, a distributed cluster, or a peer-to-peer network spanning continents.

This article builds CAS from first principles. We start with hash functions, derive the core properties, examine the data structures that make CAS practical at scale, survey real implementations, and work through the problems that CAS does not solve -- because every abstraction has a cost.

Location-Addressed vs. Content-Addressed

In a location-addressed system, an address identifies a place. /var/log/syslog means "the file at this path on this filesystem on this machine." The address encodes the storage topology: a machine, a filesystem, a directory hierarchy, a filename. If you move the file, the address breaks. If you copy the file to another machine, it gets a different address. Two identical files at different paths have different addresses. The address tells you nothing about what is in the file.

In a content-addressed system, an address identifies the content. blake3:af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9 (shortened here; a full BLAKE3 digest is 64 hex characters) means "the data that, when hashed with BLAKE3, produces this digest." The address does not encode where the data is stored. It encodes what the data is. You can store it anywhere. You can copy it to any number of machines. Every copy has the same address because every copy has the same content. The address tells you exactly what you should find -- and lets you verify that you found it.

Fig. 01 -- Location-addressed vs. content-addressed

The fundamental difference. In location-addressed storage, the name is assigned independently of the content. In content-addressed storage, the name is computed from the content. This single difference produces all the properties that make CAS useful.

Hash Functions as Addresses

The entire CAS model depends on cryptographic hash functions. A hash function takes arbitrary-length input and produces a fixed-length output called a digest. For CAS to work, the hash function needs three properties.

Deterministic. The same input always produces the same output. Hash the same file a million times, on a million different machines, and you get the same digest every time. Without determinism, addresses would be unstable. The same content would have different addresses depending on where and when you hashed it.

Collision-resistant. It should be computationally infeasible to find two different inputs that produce the same output. If collisions were easy to find, an attacker could create a malicious file with the same hash as a legitimate one. You would request the legitimate file by its address, receive the malicious file, verify the hash, and accept it as authentic. Collision resistance is what makes CAS addresses trustworthy.

Fast. Because every write requires hashing the content and every read can optionally verify the hash, the hash function is in the critical path of every I/O operation. A slow hash function turns CAS into a bottleneck. This is why many newer CAS systems prefer BLAKE3 over SHA-256: in software, BLAKE3 is several times faster on a single core, and its internal tree structure lets it parallelize across SIMD lanes and cores, while SHA-256 must process a message sequentially.

SHA-256 produces a 256-bit (32-byte) digest. Because of the birthday bound, finding a collision by brute force takes on the order of 2^128 hash evaluations (about 3.4 x 10^38), far beyond any foreseeable computing capacity. Equivalently, a store would need to hold on the order of 2^128 distinct objects before an accidental collision became likely. BLAKE3 also produces a 256-bit digest by default (though it can produce longer ones) with equivalent collision resistance. Both are considered safe for CAS addressing for any foreseeable future.
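To put numbers on the collision risk, the sketch below evaluates the standard birthday approximation for a store holding n objects. The function name and the choice of 2^64 objects are illustrative assumptions, not figures from any particular system.

```python
import math

def collision_probability(num_objects: int, digest_bits: int) -> float:
    """Birthday approximation: p = 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -(num_objects * (num_objects - 1)) / 2 ** (digest_bits + 1)
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(exponent)

# 2^64 objects is far more than any real store is likely to hold.
n = 2 ** 64
print(f"128-bit digest: {collision_probability(n, 128):.3f}")
print(f"256-bit digest: {collision_probability(n, 256):.3e}")
```

With a 128-bit digest, a store of that size would be more likely than not to approach a collision; with a 256-bit digest the probability stays vanishingly small.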

Why not MD5 or SHA-1? MD5 has been broken since 2004 -- practical collision attacks exist that can produce two different files with the same MD5 hash. SHA-1 was broken in 2017 (the SHAttered attack by Google and CWI Amsterdam). Both are still widely used for non-security checksums, but neither is safe as a CAS address. An attacker who can produce collisions can prepare a benign file and a malicious file that share an address, get the benign one accepted, and later substitute the malicious one, which still passes hash verification. Git has used SHA-1 for its object store since 2005 and still does by default, now with a collision-detecting SHA-1 implementation. Experimental support for SHA-256 repositories arrived in Git 2.29 (2020), and the migration is still ongoing.

The Core Insight

Here is the idea that makes everything else work: if the address is the hash of the content, then verifying the address verifies the content.

In a location-addressed system, when you read /var/data/report.pdf, you trust the filesystem to return what was last written to that path. If the disk has a silent corruption (a bit flip in a sector), you get the wrong data and have no way to detect it unless you kept a separate checksum. If someone modified the file, you see the modification only if you remember what was there before.

In a content-addressed system, when you fetch the data for blake3:af1349b9..., you hash the bytes you received and compare the result to the address you requested. If they match, the data is correct. If they do not match, the data is corrupt or tampered with. It does not matter who gave you the data, what network it traveled over, or what storage medium it lived on. The verification is self-contained.
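The read path described above can be sketched in a few lines. This is a minimal illustration using SHA-256 from the Python standard library as a stand-in for BLAKE3 (which is not in the standard library); `fetch_verified`, `IntegrityError`, and the two in-memory "peers" are hypothetical names made up for the example.

```python
import hashlib

class IntegrityError(Exception):
    """Raised when fetched bytes do not hash to the requested address."""

def fetch_verified(address: str, fetch) -> bytes:
    """Fetch bytes from any (untrusted) source and verify them locally."""
    data = fetch(address)
    if hashlib.sha256(data).hexdigest() != address:
        raise IntegrityError(f"content does not match address {address}")
    return data

content = b"quarterly report"
address = hashlib.sha256(content).hexdigest()

# Two sources claiming to hold the same address; one is lying.
honest_peer = {address: content}
malicious_peer = {address: b"quarterly report (altered)"}

print(fetch_verified(address, honest_peer.__getitem__))
try:
    fetch_verified(address, malicious_peer.__getitem__)
except IntegrityError as err:
    print("rejected:", err)
```

The caller never needs to know which peer is honest; the comparison against the requested address is the only check.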

This means you can fetch content-addressed data from an untrusted source and verify it yourself. You do not need to trust the server, the CDN, the peer-to-peer swarm, or the USB stick someone handed you. You only need to trust the hash function. This is why content-addressed systems are called "trustless" -- not because no trust is involved, but because the trust is placed in mathematics rather than in infrastructure operators.

Deduplication

Because identical content produces identical hashes, two users who store the same file will produce the same address. The storage system can recognize this and store the data only once, regardless of how many users reference it.

In a location-addressed system, deduplication is expensive. You have to hash every file, maintain an index of hashes, and check each incoming file against the index. The deduplication logic is a separate system bolted onto the storage layer. It can be turned off, bypassed, or broken without affecting the storage itself.

In a content-addressed system, deduplication is structural. The address is the hash. Looking up the address is checking for duplicates. If the address already exists in the store, the content is already present. No additional index, no separate system, no opt-in behavior. Deduplication is a consequence of the addressing scheme.
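A toy in-memory store makes the point concrete: the write path and the duplicate check are the same dictionary lookup. The `MemoryCAS` class is an illustrative assumption, and SHA-256 stands in for whatever hash a real system would use.

```python
import hashlib

class MemoryCAS:
    """Toy content-addressed store backed by a dict."""

    def __init__(self):
        self._blocks = {}  # address -> bytes

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # If the address is already present, the content is too: nothing to write.
        self._blocks.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blocks[address]

    def __len__(self) -> int:
        return len(self._blocks)

store = MemoryCAS()
library = b"\x7fELF...shared library bytes..."
addr_alice = store.put(library)        # first user stores the file
addr_bob = store.put(library)          # second user stores identical bytes
store.put(b"something else entirely")

print(addr_alice == addr_bob)  # same content, same address
print(len(store))              # two blocks stored, not three
```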

The savings can be significant. In a backup system where daily snapshots differ by only a few percent, CAS stores only the changed blocks. In a software distribution system where many packages share common libraries, CAS stores each library once regardless of how many packages reference it. In a version control system, CAS stores each version of each file once, even across branches that share most of their content.

Immutability

If you change the content, the hash changes. A different hash means a different address. So "modifying" content-addressed data does not overwrite the original -- it creates new data with a new address. The old data, at the old address, is unchanged.

This makes CAS naturally append-only. You can add new content. You can stop referencing old content. But you cannot change existing content without creating a new address for the result. In a system that only stores content-addressed data, there is no concept of "update in place." Every change is an addition.

Immutability simplifies concurrency. If data cannot change after it is written, there are no race conditions on read. Two processes reading the same address will always get the same bytes. No locks required. No read-after-write consistency problems. The data is frozen at the moment of its creation.

Immutability also simplifies caching. A cache entry for a content-addressed block never goes stale. The address is the content. If the address is in the cache, the data is correct. There is no invalidation problem. The cache can grow until it runs out of space, and any eviction policy works because any evicted entry can be re-fetched and will return the same bytes.

Merkle Trees and DAGs

Individual content-addressed blocks are useful, but most real data has structure. A file is a sequence of blocks. A directory is a collection of files. A repository is a tree of directories. CAS represents structure using Merkle trees -- trees where each node's address is the hash of its contents, and those contents include the addresses of its children.

A Merkle tree works like this: start at the leaves, which are the actual data blocks. Hash each leaf to get its address. Then create parent nodes that contain the addresses of their children. Hash each parent to get its address. Continue up to the root. The root hash is a single fixed-size value that uniquely identifies the entire tree -- every block of data in every leaf, and the entire structure that organizes them.
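The bottom-up construction can be sketched as follows. Two details here are assumptions of this sketch rather than anything the description above prescribes: leaves and internal nodes are hashed with different one-byte prefixes so a leaf can never be mistaken for a node, and an unpaired node at the end of a level is carried up unchanged.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash leaves, then repeatedly hash adjacent pairs until one node remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [h(b"\x00" + block) for block in blocks]      # leaf hashes
    while len(level) > 1:
        parents = [h(b"\x01" + level[i] + level[i + 1])   # parent = hash of two children
                   for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            parents.append(level[-1])                     # odd node carried up as-is
        level = parents
    return level[0]

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
root = merkle_root(blocks)
tampered = merkle_root([b"block-0", b"block-1", b"block-2", b"block-X"])

print(root.hex())
print(root != tampered)  # changing any leaf changes the root
```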

Fig. 02 -- Merkle tree structure

A Merkle tree of four data blocks. Each internal node's hash covers its children's hashes. The root hash is a commitment to the entire tree -- every leaf and every structural relationship. Changing any single byte in any leaf produces a different root hash.

The power of a Merkle tree is that the root hash is a commitment to the entire dataset. If you have the root hash from a trusted source, you can verify any piece of the tree by fetching the data along with a proof path -- the sibling hashes at each level from the leaf to the root. This is how lightweight blockchain clients work, how certificate transparency logs work, and how Git verifies repository integrity.
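A proof path can be generated and checked with a small amount of code. This sketch uses the same assumed conventions as a simple binary tree (prefix bytes for leaves and nodes, an unpaired node carried up unchanged); `build_levels`, `make_proof`, and `verify` are hypothetical helper names, and real systems define their own tree layouts.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    """Return every level of the tree, from leaf hashes up to [root]."""
    levels = [[h(b"\x00" + b) for b in blocks]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        nxt = [h(b"\x01" + cur[i] + cur[i + 1]) for i in range(0, len(cur) - 1, 2)]
        if len(cur) % 2:
            nxt.append(cur[-1])
        levels.append(nxt)
    return levels

def make_proof(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):          # an unpaired node has no sibling here
            proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(block, proof, root):
    """Recompute the path to the root using only the block and its proof."""
    node = h(b"\x00" + block)
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = h(b"\x01" + pair)
    return node == root

blocks = [b"a", b"b", b"c", b"d", b"e"]
levels = build_levels(blocks)
root = levels[-1][0]
proof = make_proof(levels, 2)

print(verify(b"c", proof, root))   # genuine block
print(verify(b"C", proof, root))   # altered block fails
```

The verifier needs only the block, a logarithmic number of sibling hashes, and the trusted root; it never sees the other leaves.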

Git's internal object store is a content-addressed DAG (directed acyclic graph) rather than a strict tree. A commit object contains the hash of a tree object (the root of the directory structure at that point in time), the hash of the parent commit(s), the author, and the commit message. The tree object contains the hashes of blob objects (file contents) and other tree objects (subdirectories). Every object is stored by its SHA-1 hash (or by its SHA-256 hash in repositories created with the SHA-256 object format). The commit hash at the tip of a branch is a commitment to the entire history -- every file, every directory, every previous commit.

IPFS extends this further with IPLD (InterPlanetary Linked Data), a data model where any data structure can be represented as a DAG of content-addressed nodes. Each node is a block with a CID (Content Identifier) that encodes the hash function, the codec (how to interpret the bytes), and the digest. CIDs are self-describing -- you can determine the hash function and data format from the CID itself, which allows the system to evolve without breaking existing addresses.

Chunking Strategies

Real files are too large to store as single content-addressed blocks. A 10 GB video file with a single hash offers no deduplication against a slightly different version of the same file. The entire file is one block, so any change produces a completely different hash. To get meaningful deduplication and efficient transfer, files must be split into chunks, with each chunk stored as a separate content-addressed block.

The question is how to split.

Fixed-Size Chunking

The simplest approach: split the file into chunks of a fixed size -- say, 256 KB. Block 0 is bytes 0 through 262143. Block 1 is bytes 262144 through 524287. And so on. Each block is hashed independently and stored by its hash.

Fixed-size chunking is fast and predictable. But it has a fatal flaw for deduplication: if you insert a single byte at the beginning of a file, every subsequent block boundary shifts by one byte. Every block after the insertion is different, even though the content has barely changed. Two files that differ by a single inserted byte share zero blocks under fixed-size chunking.
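The boundary-shift problem is easy to reproduce. The sketch below chunks pseudo-random data at a fixed size, inserts one byte at the front, and counts how many chunk addresses survive. The 64-byte chunk size and the helper names are arbitrary choices for the demonstration.

```python
import hashlib
import random

def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def chunk_addresses(chunks) -> set[str]:
    return {hashlib.sha256(c).hexdigest() for c in chunks}

rng = random.Random(0)  # fixed seed so the run is reproducible
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"X" + original  # a single byte inserted at the beginning

before = chunk_addresses(fixed_chunks(original, 64))
after = chunk_addresses(fixed_chunks(edited, 64))
print(len(before), "chunks before;", len(before & after), "shared after the insert")
```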

Content-Defined Chunking

Content-defined chunking (CDC) solves the boundary-shift problem by choosing split points based on the content rather than the position. The most common technique uses a rolling hash -- typically a Rabin fingerprint or a Buzhash -- that slides a window over the data. When the rolling hash meets a predetermined condition (for example, the low 13 bits are all zeros), that position becomes a chunk boundary.

Because the boundaries depend on the content within the rolling hash window, not on the absolute position in the file, an insertion shifts only the boundaries near the insertion point. Chunks far from the insertion are unchanged. Two files that differ by a single inserted byte will share almost all of their chunks.

The expected chunk size is controlled by the boundary condition. If you split when the low 13 bits of the rolling hash are zero, the expected chunk size is 2^13 = 8192 bytes. Tighter conditions produce larger chunks. Looser conditions produce smaller chunks. Most implementations also enforce minimum and maximum chunk sizes to prevent pathological cases -- a minimum of 2 KB avoids tiny chunks from repetitive data, and a maximum of 1 MB avoids enormous chunks from data that never triggers the boundary condition.

Rabin fingerprints in practice

The Rabin fingerprint is a polynomial hash over GF(2) -- the field with two elements. It can be computed incrementally: as the window slides one byte forward, the contribution of the outgoing byte is removed and the contribution of the incoming byte is added, both in constant time. This makes the rolling computation fast -- a few XOR and table-lookup operations per byte. The FastCDC algorithm (Xia et al., 2016) further optimizes this by replacing the Rabin fingerprint with a simpler gear hash and by skipping the minimum chunk size before checking the boundary condition. Its authors report chunking roughly 10x faster than Rabin-based CDC.
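A small content-defined chunker shows the effect. To keep it short, this sketch uses a gear-style rolling hash rather than a true Rabin fingerprint, and it tests the top bits of the hash instead of the low bits so that the decision depends on a 64-byte window. The table seed, the 8-bit condition, and the size limits are all illustrative assumptions; the expected chunk size is roughly min_size + 2^bits bytes.

```python
import hashlib
import random

_table_rng = random.Random(1234)   # fixed seed: same table on every run
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def cdc_chunks(data: bytes, bits: int = 8, min_size: int = 64,
               max_size: int = 2048) -> list[bytes]:
    """Cut where the top `bits` bits of a gear rolling hash are all zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each shift ages out old input: after 64 steps a byte no longer
        # affects h, so h depends only on the most recent 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue
        if (h >> (64 - bits)) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks

def chunk_addresses(chunks) -> set[str]:
    return {hashlib.sha256(c).hexdigest() for c in chunks}

data_rng = random.Random(7)
original = bytes(data_rng.getrandbits(8) for _ in range(65536))
edited = original[:40000] + b"!" + original[40000:]   # one inserted byte

before = chunk_addresses(cdc_chunks(original))
after = chunk_addresses(cdc_chunks(edited))
print(f"{len(before & after)} of {len(before)} chunks shared after the insert")
```

Chunks before the edit are untouched by construction, and the boundaries after it realign once the rolling window has moved past the inserted byte, so only the chunks near the edit get new addresses.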

Fig. 03 -- Fixed-size vs. content-defined chunking

Fixed-size chunking shifts every boundary after an insertion, destroying all deduplication. Content-defined chunking places boundaries based on the data itself, so chunks away from the edit point are unchanged. The difference in deduplication efficiency is dramatic for files that are similar but not identical.

Practical Implementations

CAS is not a theoretical exercise. It is the storage model behind some of the most widely deployed systems in the world.

Git

Git stores every file version, every directory tree, and every commit as a content-addressed object in .git/objects/. A blob object holds a file's contents, and its address is the hash of those contents prefixed with a type-and-length header. A tree object holds a sorted list of entries, where each entry is a mode, a filename, and the hash of the corresponding blob or subtree. A commit object holds the hash of the root tree, the parent commit hashes, the author, the committer, and the message. Trees and commits are hashed with the same kind of header.
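Git's blob addressing is simple enough to reproduce by hand. The sketch below computes the SHA-1 object ID for file contents the way `git hash-object` does for a blob in a SHA-1 repository; the function name `git_blob_id` is made up for the example.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """SHA-1 over the header 'blob <size>\\0' followed by the raw contents."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b""))               # the well-known empty-blob ID
print(git_blob_id(b"hello world\n"))  # matches: echo "hello world" | git hash-object --stdin
```

Because the header is part of the hashed bytes, a blob ID is not the same as the plain SHA-1 of the file.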

When you run git add, Git hashes the file, stores the blob, and updates the index. When you run git commit, Git creates a tree from the index, creates a commit pointing to that tree and the previous commit, and updates the branch ref. Refs (branch and tag names, plus HEAD) and the index are the mutable state in Git -- everything in the object store is content-addressed and immutable.

Git's deduplication is a direct consequence of CAS. If two branches contain the same file, they reference the same blob. If two commits have the same tree (because no files changed), they reference the same tree object. The repository grows proportionally to the unique content, not proportionally to the number of branches or tags that reference it.

Git also uses pack files for efficient storage and transfer. Loose objects are individually compressed with zlib. When the number of loose objects grows large, Git packs them into a single file with delta compression -- each object is stored as a base or as a delta against another object. Pack files are themselves content-addressed (the pack index maps hashes to offsets within the pack). This gives Git both the integrity guarantees of CAS and the storage efficiency of delta compression.

IPFS

IPFS (InterPlanetary File System) is a peer-to-peer content-addressed storage network. Files added to IPFS are split into blocks (fixed-size chunks by default, with content-defined chunkers available as an option), organized into a Merkle DAG, and addressed by CIDs. Any node in the network that has a block can serve it. Fetching a file by its CID means asking the network "who has this block?" and downloading it from whoever responds.

IPFS uses a DHT (Distributed Hash Table) for content routing -- mapping CIDs to the network addresses of nodes that have the corresponding blocks. When you add a file, your node announces to the DHT that it has those CIDs. When someone requests a CID, the DHT resolves it to your node's address, and the requester downloads the block directly.

The CAS properties translate directly to the P2P context. Because the CID is the hash of the content, you can download a block from any peer and verify it yourself. A malicious peer cannot serve you bad data for a CID you requested -- the hash will not match. This eliminates the need for trusted servers and makes content distribution inherently resilient to tampering.

Iroh and iroh-blobs

Iroh (from n0, formerly number0) is a networking library for building distributed systems. Its blob storage layer, iroh-blobs, is a content-addressed store that uses BLAKE3 for hashing. BLAKE3 is a tree hash internally: it splits the input into 1 KiB chunks, arranges them as the leaves of a binary Merkle tree, and returns the root of that tree as the digest, which serves as the content address. This means that verified streaming can be built directly on the hash function: you can verify each chunk as it arrives without waiting for the entire file, because the hash tree provides a proof for each chunk.

Iroh uses QUIC for transport, which gives it connection migration, multiplexing, and built-in encryption. Content discovery and transfer happen over the same QUIC connections. The combination of BLAKE3's streaming verification and QUIC's multiplexed transport makes iroh-blobs well-suited for real-time content synchronization across nodes.

Object Storage Patterns

Even traditional cloud storage systems incorporate CAS concepts. Amazon S3 supports content-MD5 headers for upload verification (though this is verification, not addressing). Backblaze B2 stores content SHA-1 hashes alongside objects. Many backup tools (restic, Borg, Duplicacy) use CAS internally -- they chunk files, hash each chunk, store chunks by hash, and maintain a separate index that maps file paths to the sequence of chunk hashes that compose them.

Restic is a good case study. It uses CDC with a Rabin fingerprint to chunk files into variable-size blobs (target size about 1 MiB). Each blob is identified by the SHA-256 hash of its plaintext, then compressed (in newer repository formats), encrypted, and bundled into pack files that are themselves named by the SHA-256 hash of their contents. The repository's master key is protected by a key derived from the user's password, so the storage backend cannot read the data. Because blob IDs are computed before encryption, deduplication works inside a repository even though the encryption is randomized. It does not work across repositories: each one has its own keys, so two users backing up the same file upload different ciphertexts that the backend cannot match.

Garbage Collection

Content-addressed storage is append-only by nature. You add blocks but never overwrite them. Over time, blocks accumulate. Some are no longer referenced by anything -- old file versions that have been superseded, temporary data that was part of a computation, blocks from deleted entries. These unreferenced blocks waste storage. Garbage collection reclaims them.

There are two standard approaches.

Reference Counting

Each block maintains a count of how many other objects reference it. When a reference is added (a new tree node includes this block's hash), the count increments. When a reference is removed (the tree node is itself garbage collected), the count decrements. When the count reaches zero, the block can be deleted.

Reference counting is simple and incremental -- you pay a small cost on each add and remove, and garbage collection happens naturally. The classic problem is cycles. If block A references block B and block B references block A, both have a reference count of 1 but neither is reachable from any root. In a pure CAS, cycles cannot occur: a block can only embed the hash of a block that already exists, so references always point backward and the structure is a DAG (directed acyclic graph). But if the system allows links through mutable names, cycles become possible, and reference counting alone is insufficient.
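A minimal reference-counted store might look like the sketch below. The class and method names are hypothetical, the node encoding (length-prefixed payload followed by child digests) is an assumption of the example, and roots are modeled as ordinary references added with `add_root`.

```python
import hashlib

class RefCountedStore:
    def __init__(self):
        self.nodes = {}     # address -> (payload, child addresses)
        self.refcount = {}  # address -> number of incoming references

    def put(self, payload: bytes, children: tuple = ()) -> str:
        digest = hashlib.sha256(len(payload).to_bytes(8, "big") + payload)
        for child in children:
            digest.update(bytes.fromhex(child))  # children are part of the content
        address = digest.hexdigest()
        if address not in self.nodes:            # new node: count its outgoing links
            self.nodes[address] = (payload, tuple(children))
            self.refcount[address] = 0
            for child in children:
                self.refcount[child] += 1
        return address

    def add_root(self, address: str) -> None:
        self.refcount[address] += 1

    def release(self, address: str) -> None:
        self.refcount[address] -= 1
        if self.refcount[address] == 0:          # unreferenced: delete and cascade
            _, children = self.nodes.pop(address)
            del self.refcount[address]
            for child in children:
                self.release(child)

store = RefCountedStore()
shared = store.put(b"common library")
pkg_a = store.put(b"package a", (shared,))
pkg_b = store.put(b"package b", (shared,))
store.add_root(pkg_a)
store.add_root(pkg_b)

store.release(pkg_a)
print(shared in store.nodes)   # still referenced by package b
store.release(pkg_b)
print(len(store.nodes))        # everything reclaimed
```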

Mark-and-Sweep

Start from a set of known roots (branch tips in Git, pinned CIDs in IPFS, named snapshots in a backup tool). Traverse the DAG from each root, marking every reachable block. After traversal, any block that is not marked is unreachable and can be deleted.

Mark-and-sweep handles cycles correctly and does not require maintaining counts on every write. The cost is that it requires a full traversal of the reachable set, which is proportional to the total data size. For large stores (terabytes of blocks), a full GC pass can take hours. Git mitigates this with incremental GC -- it runs a quick check frequently and a full GC rarely. IPFS mitigates it by requiring explicit pinning -- only pinned content and its descendants are protected from GC.
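The mark and sweep phases fit in one short function. In this sketch the store is reduced to a dictionary from address to child addresses, and short labels stand in for real hashes; both are simplifications for illustration.

```python
def mark_and_sweep(nodes: dict, roots) -> set:
    """Delete every node not reachable from `roots`; return what was deleted.

    `nodes` maps an address to the list of addresses it references.
    """
    marked = set()
    stack = list(roots)
    while stack:                       # mark: walk the DAG from each root
        address = stack.pop()
        if address in marked or address not in nodes:
            continue
        marked.add(address)
        stack.extend(nodes[address])
    garbage = set(nodes) - marked      # sweep: anything unmarked is unreachable
    for address in garbage:
        del nodes[address]
    return garbage

nodes = {
    "commit2": ["tree2"],
    "tree2": ["blob-a", "blob-b2"],
    "commit1": ["tree1"],              # superseded snapshot, no longer a root
    "tree1": ["blob-a", "blob-b1"],
    "blob-a": [], "blob-b1": [], "blob-b2": [],
}
deleted = mark_and_sweep(nodes, roots=["commit2"])
print(sorted(deleted))
print(sorted(nodes))
```

Note that blob-a survives because the live snapshot still references it, even though the deleted snapshot referenced it too.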

The danger of premature GC

In a distributed CAS system, garbage collection is dangerous. Node A might be in the process of building a new tree that references a block on Node B. If Node B runs GC before Node A finishes and publishes the new root, Node B might delete a block that Node A's new tree needs. This is the "GC race" and it affects every distributed CAS system. Solutions include GC locks (pause GC during writes -- simple but blocks), GC grace periods (do not delete blocks newer than N hours -- trades space for safety), and two-phase GC (mark, wait, sweep -- safe but slow).

The Naming Problem

Content addresses are not human-readable. blake3:af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9 is not a name anyone will remember, type, or share verbally. Worse, content addresses change when content changes. If you publish a document and then fix a typo, the corrected version has a different address. Anyone who bookmarked the old address still gets the old version.

This is fundamental. Content addresses are immutable because the content they address is immutable. But humans need mutable names. The file is called "report.pdf" and the latest version should always be at the same name, even if the content has changed.

Every CAS system solves this by adding a mutable pointer layer on top of the immutable content layer.

Git refs. A branch name like "main" is a file in .git/refs/heads/main that contains a commit hash. Updating the branch means overwriting that file with a new hash. The name "main" is mutable. The commit it points to is not.

IPNS. The InterPlanetary Name System maps a public key to a CID. The owner of the private key can update the mapping by signing a new record. The name (the public key hash) stays the same. The content it resolves to can change.

DNS. Conventional DNS can map human-readable names to content addresses. A DNSLink entry, which is a TXT record published under the _dnslink subdomain, can contain a content path that includes a CID. Updating the DNS record changes what CID the name resolves to. This bridges the entire existing DNS infrastructure to the CAS world.

The pattern is always the same: a small amount of mutable state (the pointer) that references a large amount of immutable state (the content-addressed data). The mutable layer is where all the complexity of consistency, conflict resolution, and access control lives. The immutable layer is simple by comparison.
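The two layers can be sketched as two dictionaries. The compare-and-swap update below is one common way to keep the pointer layer safe under concurrent writers; it is an assumption of this sketch, not a description of how any particular system implements its refs.

```python
import hashlib

blocks = {}  # immutable layer: address -> content
refs = {}    # mutable layer: name -> address

def put(data: bytes) -> str:
    address = hashlib.sha256(data).hexdigest()
    blocks[address] = data
    return address

def update_ref(name: str, new_address: str, expected) -> bool:
    """Compare-and-swap: move the pointer only if it still holds `expected`."""
    if refs.get(name) != expected:
        return False  # someone else moved the pointer first
    refs[name] = new_address
    return True

v1 = put(b"report with a tpyo")
update_ref("report.pdf", v1, expected=None)
v2 = put(b"report with a typo fixed")
update_ref("report.pdf", v2, expected=v1)

print(blocks[refs["report.pdf"]])  # the name now resolves to the new content
print(blocks[v1])                  # the old address still returns the old content
```

All of the coordination lives in `update_ref`; the block store never needs a lock.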

Security Properties

CAS provides integrity verification for free. But integrity is not the only security property that matters. It is worth being precise about what CAS gives you and what it does not.

Integrity: yes. If the bytes you received hash to the address you asked for, they are the bytes that address names. No MITM, no silent corruption, no substitution attack can defeat this as long as the hash function is collision-resistant and the address itself came from a source you trust.

Authenticity: no. A content address tells you what the data is. It does not tell you who created it. To verify authorship, you need signatures -- the author signs the content (or the root hash of a tree) with their private key, and you verify the signature with their public key. CAS and digital signatures are complementary. CAS gives you "this is the data." Signatures give you "this person created the data."

Confidentiality: no. A content address is derived from the content. If you know (or can guess) the content, you can compute the address. This means CAS leaks information: an observer who knows a content address can check whether a specific piece of data exists in the store by computing its hash and checking. This is called a confirmation attack. If confidentiality matters, encrypt the data before hashing. The CAS address then corresponds to the ciphertext, not the plaintext, and the confirmation attack no longer works (assuming the encryption is randomized).
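A confirmation attack needs nothing more than the hash function and a list of guesses. In this sketch the observer can see addresses but not contents; the "salary" records and the candidate range are invented for the example.

```python
import hashlib

def address_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Addresses an observer can see in some store (the contents are not visible).
visible_addresses = {address_of(b"salary: 85000")}

# The observer enumerates plausible plaintexts and checks each guess.
candidates = [f"salary: {amount}".encode() for amount in range(50000, 150001, 5000)]
confirmed = [c for c in candidates if address_of(c) in visible_addresses]

print(confirmed)  # the guess that matches reveals the stored content
```

The attack works whenever the space of plausible contents is small enough to enumerate, which is why low-entropy data is the most exposed.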

Availability: no. CAS prevents undetected modification but does not prevent deletion. A storage node can delete any block it wants. The block is gone. You will know it is gone (a request for that address returns nothing), but knowing does not help if you needed the data. Availability requires redundancy -- storing copies on multiple independent nodes -- which is an infrastructure problem, not a cryptographic one.

Performance Considerations

Hashing Cost

Every write requires hashing the data. Every verified read requires hashing the data again and comparing. For BLAKE3, the hashing throughput on a modern CPU (AVX-512) is roughly 5 GB/s per core. For SHA-256, it is roughly 500 MB/s per core without hardware acceleration, or 2-3 GB/s with SHA-NI instructions (available on AMD CPUs since Zen in 2017 and on mainstream Intel cores since Ice Lake in 2019). These figures are rough and vary with the CPU and the implementation. For most workloads, hashing is not the bottleneck -- disk I/O is. But for high-throughput ingest (bulk imports, backup of large datasets), the hash function's speed matters, and BLAKE3's parallelism across cores gives it a significant advantage.

Lookup Patterns

CAS addresses are uniformly distributed (because hash outputs are uniformly distributed). This means that hash-based indexes -- hash tables, LSM trees keyed by hash -- work well. There is no locality of reference in the key space (nearby addresses do not correspond to related data), so range queries are meaningless. But point lookups are fast, and the uniform distribution avoids hot spots.

The lack of locality is a cost for sequential access patterns. If you need to read a file sequentially, you need to follow the Merkle tree from the root to each leaf in order. Each leaf lookup is a random access in the store. This is fine for SSD-backed stores but expensive for spinning disks. Practical implementations mitigate this by co-locating chunks of the same file in the storage backend, trading some of the purity of content addressing for sequential read performance.

Caching

Content-addressed data is the ideal caching candidate. The cache key is the hash. The cached value never goes stale (the content is immutable). Cache hit rates are high because popular content is requested by the same address everywhere. Cache invalidation -- the famously hard problem -- does not exist for CAS data. An entry in the cache is valid forever, or until you evict it for space. This is why CDNs work so well with content-addressed data and why CAS-based systems can aggressively cache at every layer.

When to Use CAS

CAS is not always the right choice. It adds complexity (hashing, chunk management, GC, naming layer) and it changes the access patterns (no in-place updates, no range queries on addresses). The trade-off is worth it when the properties CAS provides are properties you actually need.

Use CAS when integrity matters. Backup systems, software distribution, version control, archival storage, audit logs -- anywhere that undetected modification of data is a serious problem. CAS makes corruption detectable by default.

Use CAS when deduplication matters. Backup systems (again), container image registries, document stores with many similar versions, any system where the same data appears in multiple places. CAS deduplication is free and automatic.

Use CAS when you want trustless distribution. CDNs, peer-to-peer networks, mirroring, caching layers. If the consumer can verify the data by checking the hash, the source does not need to be trusted. This opens up distribution strategies that are impossible with location-addressed data.

Use CAS when immutability is a feature. Audit trails, regulatory archives, scientific data repositories, legal evidence stores. If the data must never change after creation, CAS enforces that structurally rather than by policy.

Do not use CAS when you need frequent small updates to large objects. A database row that changes every second is a poor fit for CAS. Each change creates a new version, a new hash, and eventually a GC burden. Traditional mutable storage with journaling or WAL is better for this access pattern.

Do not use CAS when human-readable naming is the primary access pattern. A filesystem where users navigate by path is location-addressed by nature. You can build CAS underneath (and many filesystems do, for snapshots and deduplication), but the user-facing interface is still paths and names. CAS adds value in the storage layer, not the naming layer.

Content-addressed storage is built on a single idea: derive the address from the data. From that idea, you get integrity verification without a separate checksum, deduplication without a separate index, immutability without a separate enforcement mechanism, and trustless distribution without a separate trust model. The cost is that you lose in-place mutation and human-readable names, both of which must be rebuilt on top. Every CAS system in production -- Git, IPFS, Iroh, restic, container registries -- is an exercise in managing those trade-offs. The ones that succeed are the ones that make the mutable pointer layer simple enough that users never think about the immutable content layer underneath.

The concept is old. Merkle filed his patent on the hash tree in 1979. Git shipped in 2005. IPFS launched in 2015. The implementations keep getting faster and the use cases keep expanding, but the core idea has not changed in more than four decades. When the address is the content, the storage system tells the truth by construction. That property does not go out of date.