Xet is Hugging Face’s content-addressable storage (CAS) protocol. Instead of uploading files as opaque blobs, Xet splits each file into variable-size chunks, deduplicates at the chunk level, groups chunks into compressed bundles called xorbs, and tracks reconstruction metadata in binary files called shards. HugBucket implements this protocol in pure Python so data it writes is fully compatible with huggingface_hub and the HF web UI.
Content-defined chunking
HugBucket splits files with content-defined chunking (CDC) driven by a Gearhash rolling hash, the same algorithm used by xet-core. A 64-byte sliding window rolls over the data; when the hash matches a boundary condition, a chunk boundary is emitted. The chunk size constraints come from hugbucket/config.py:
hugbucket/config.py
The boundary condition is set in chunker.py so that, on average, a boundary occurs roughly every 64 KiB:
hugbucket/xet/chunker.py
The chunk_data function returns a list of Chunk objects:
hugbucket/xet/chunker.py
Because each boundary depends only on the bytes inside the rolling window, CDC tends to reproduce the same chunk boundaries even when data is inserted elsewhere in the file. Prepending or appending bytes typically changes only the chunks at the edit site; the remaining chunks stay identical and are not re-uploaded.
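The boundary rule described above can be sketched in a few lines. This is a minimal illustration, not the code in hugbucket/xet/chunker.py: the gear table is generated from a fixed seed (an assumption, so the boundaries will not match xet-core), and the names `cdc_boundaries` and `chunk_data_sketch` are hypothetical. The defaults mirror the limits quoted on this page (min 8 KiB, roughly 64 KiB average, max 128 KiB).

```python
import random

MASK64 = (1 << 64) - 1

# Hypothetical gear table: 256 pseudo-random 64-bit values from a fixed seed.
# The table used by xet-core is different, so cut points will not match it.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def cdc_boundaries(data: bytes, min_size: int = 8 * 1024,
                   mask_bits: int = 16, max_size: int = 128 * 1024) -> list[int]:
    """Return the end offset of every chunk in `data`."""
    # Testing the top `mask_bits` bits fires with probability 2**-mask_bits
    # per byte, i.e. about once every 64 KiB for 16 bits.
    mask = ((1 << mask_bits) - 1) << (64 - mask_bits)
    cuts: list[int] = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # The state shifts left one bit per byte, so a byte stops influencing
        # the hash after 64 steps: this is the 64-byte sliding window.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i + 1 - start
        if size < min_size:
            continue  # never cut before the minimum chunk size
        if (h & mask) == 0 or size >= max_size:
            cuts.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        cuts.append(len(data))  # trailing chunk may be shorter than min_size
    return cuts


def chunk_data_sketch(data: bytes, **params) -> list[bytes]:
    """Split `data` at the boundaries found by cdc_boundaries."""
    chunks, prev = [], 0
    for end in cdc_boundaries(data, **params):
        chunks.append(data[prev:end])
        prev = end
    return chunks
```

Running the sketch on a buffer and on the same buffer with a few bytes prepended should yield mostly identical chunks after the first one or two, which is the property that makes chunk-level deduplication effective.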
Hashing
All hashing uses BLAKE3 with separate 32-byte keyed contexts for each hash type. There are four hash functions in hugbucket/xet/hasher.py:
| Function | Input | Key | Purpose |
|---|---|---|---|
| chunk_hash | chunk bytes | DATA_KEY | Unique ID for a single chunk |
| xorb_hash | chunk hashes + sizes | INTERNAL_NODE_KEY (Merkle tree) | Unique ID for a xorb |
| file_hash | chunk hashes + sizes | Merkle root then FILE_KEY (32 zero bytes) | Unique ID for a file; stored as xetHash |
| verification_hash | chunk hashes concatenated | VERIFICATION_KEY | Per-term integrity check in the shard |
hugbucket/xet/hasher.py
The hash_to_hex function uses a non-standard encoding: each 8-byte group within the 32-byte hash is treated as a little-endian u64 and formatted as 16 hex characters. This matches the encoding used by xet-core and huggingface_hub.
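The encoding is easy to get wrong, so a small sketch may help. It follows the description above literally (four little-endian u64 words, 16 hex characters each). The function names carry a `_sketch` suffix to make clear they are illustrative and not the code in hugbucket/xet/hasher.py.

```python
import struct


def hash_to_hex_sketch(digest: bytes) -> str:
    """Encode a 32-byte hash as four little-endian u64 words in hex."""
    if len(digest) != 32:
        raise ValueError("expected a 32-byte hash")
    words = struct.unpack("<4Q", digest)  # four little-endian unsigned 64-bit ints
    return "".join(f"{word:016x}" for word in words)


def hex_to_hash_sketch(text: str) -> bytes:
    """Inverse of hash_to_hex_sketch."""
    if len(text) != 64:
        raise ValueError("expected 64 hex characters")
    words = [int(text[i:i + 16], 16) for i in range(0, 64, 16)]
    return struct.pack("<4Q", *words)
```

In effect, each 8-byte group appears byte-reversed relative to a plain `digest.hex()`, so the two encodings differ for almost every hash and must not be mixed.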
Xorbs
A xorb is a binary container of LZ4-compressed chunks. Each chunk is prefixed by an 8-byte header:
hugbucket/xet/xorb.py
serialize_xorb compresses each chunk with LZ4 frame format (falling back to uncompressed if LZ4 does not shrink the data) and returns the serialized bytes plus the byte offset of each chunk within the xorb:
hugbucket/xet/xorb.py
deserialize_xorb is the inverse — it reads the binary blob and returns a list of ChunkEntry objects with the decompressed bytes. It handles both LZ4 frame format (written by xet-core and HugBucket) and the older LZ4 block format:
hugbucket/xet/xorb.py
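The following is a rough sketch of the serialize and deserialize round trip described in this section. It is not the code from hugbucket/xet/xorb.py. Two assumptions are worth stating loudly: the header layout (1-byte version, 3-byte little-endian stored length, 1-byte compression scheme, 3-byte little-endian uncompressed length) is taken from the publicly documented Xet chunk format and is not spelled out on this page, and zlib stands in for LZ4 so the example runs on the standard library alone.

```python
import zlib

VERSION = 0             # assumed header version byte
SCHEME_NONE = 0         # assumed: payload stored uncompressed
SCHEME_COMPRESSED = 1   # assumed: payload compressed (LZ4 in the real format, zlib here)


def pack_header(stored_len: int, scheme: int, raw_len: int) -> bytes:
    """Build the assumed 8-byte chunk header."""
    return (bytes([VERSION]) + stored_len.to_bytes(3, "little")
            + bytes([scheme]) + raw_len.to_bytes(3, "little"))


def serialize_xorb_sketch(chunks: list[bytes]) -> tuple[bytes, list[int]]:
    """Return the serialized blob and the byte offset of each chunk in it."""
    blob = bytearray()
    offsets: list[int] = []
    for raw in chunks:
        packed = zlib.compress(raw)
        # Fall back to the raw bytes when compression does not shrink the chunk.
        if len(packed) < len(raw):
            scheme, payload = SCHEME_COMPRESSED, packed
        else:
            scheme, payload = SCHEME_NONE, raw
        offsets.append(len(blob))
        blob += pack_header(len(payload), scheme, len(raw))
        blob += payload
    return bytes(blob), offsets


def deserialize_xorb_sketch(blob: bytes) -> list[bytes]:
    """Walk the blob header by header and return the decompressed chunks."""
    chunks, pos = [], 0
    while pos < len(blob):
        header = blob[pos:pos + 8]
        stored_len = int.from_bytes(header[1:4], "little")
        scheme = header[4]
        raw_len = int.from_bytes(header[5:8], "little")
        payload = blob[pos + 8:pos + 8 + stored_len]
        raw = zlib.decompress(payload) if scheme == SCHEME_COMPRESSED else payload
        if len(raw) != raw_len:
            raise ValueError("chunk length does not match its header")
        chunks.append(raw)
        pos += 8 + stored_len
    return chunks
```

Because every header records the stored length, a reader can hop from chunk to chunk without decompressing anything, which is what makes fetching a chunk range from a xorb practical.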
Shards
A shard is a binary metadata file that carries file reconstruction information and xorb metadata. The CAS uses shards to answer reconstruction queries. build_shard in hugbucket/xet/shard.py assembles the complete binary blob:
hugbucket/xet/shard.py
Upload flow
All CPU-bound work (chunking, hashing, LZ4 compression) is offloaded to a thread via asyncio.to_thread so the async event loop stays responsive during uploads.
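A minimal sketch of two mechanics in this flow: offloading CPU-bound work with asyncio.to_thread, and retrying a failed upload with exponential backoff (the steps below cite up to 3 attempts and a 1 s base delay). Everything named here is hypothetical: `cpu_bound_prepare` uses SHA-256 and zlib as stand-ins for chunking, BLAKE3, and LZ4; `post` is a caller-supplied coroutine function; and both the doubling schedule and the choice to retry only on ConnectionError are assumptions, since this page does not specify them.

```python
import asyncio
import hashlib
import zlib


def cpu_bound_prepare(data: bytes) -> tuple[str, bytes]:
    """Stand-in for the chunk, hash, and compress work (assumed, simplified)."""
    return hashlib.sha256(data).hexdigest(), zlib.compress(data)


async def with_retry(send, *, attempts: int = 3, base_delay: float = 1.0):
    """Await send(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await send()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Assumed schedule: base_delay, then 2x, then 4x, ...
            await asyncio.sleep(base_delay * 2 ** attempt)


async def upload_sketch(data: bytes, post, *, base_delay: float = 1.0) -> str:
    """Prepare `data` off the event loop, then send it with retries."""
    # to_thread runs the blocking function in a worker thread so other
    # coroutines keep making progress while it executes.
    digest, payload = await asyncio.to_thread(cpu_bound_prepare, data)
    await with_retry(lambda: post(digest, payload), base_delay=base_delay)
    return digest
```

The design point is that only the awaitable network call sits on the event loop; anything that burns CPU for more than a few milliseconds is pushed to a worker thread first.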
CDC chunk the data
chunk_data(data) splits the file into variable-size chunks using the Gearhash rolling hash (min 8 KiB, target 64 KiB, max 128 KiB).
Hash all chunks
Each chunk’s bytes are hashed with chunk_hash(c.data) using BLAKE3 keyed with DATA_KEY. Chunk sizes are also recorded.
Compute the file hash
file_hash(c_hashes, c_sizes) builds a Merkle tree over all chunk hashes (mean branching factor 4), then applies one final BLAKE3 pass with FILE_KEY. The resulting hex string is stored as xetHash on HF Hub.
Group chunks into xorbs and compress
Chunks are batched into xorbs (up to 1 024 chunks or half of 64 MiB, whichever comes first). serialize_xorb(xorb_chunks) LZ4-compresses each chunk and packs them with 8-byte headers. xorb_hash is computed over the xorb’s chunk hashes.
Upload xorbs to CAS
A write token is obtained from HF Hub (GET /api/buckets/{bucket_id}/xet-write-token). Each xorb is POSTed to {cas_url}/v1/xorbs/default/{xorb_hash} with exponential-backoff retry (up to 3 attempts, base delay 1 s).
Build and upload the shard
build_shard([file_info], xorb_infos) assembles the binary shard blob. It is POSTed to {cas_url}/v1/shards. Building the shard is also CPU-bound and runs in a thread.
Download flow
Get file metadata
hub.get_paths_info(bucket_id, [key]) returns a BucketFile with xet_hash and size. If the file is zero bytes, an empty iterator is returned immediately.
Get a read token
_get_read_token(bucket_id) returns a cached XetConnectionInfo (refreshed 60 seconds before expiry) containing cas_url and access_token.
Get the reconstruction plan
cas.get_reconstruction(conn, file_hash) calls GET {cas_url}/v1/reconstructions/{file_hash}. The response contains a list of ReconstructionTerm objects (each referencing a xorb hash and a chunk range) and a fetch_info map of presigned CDN URLs. Results are cached for 5 minutes (up to 1 024 entries).
Skip terms outside the requested byte range
For each ReconstructionTerm, the cumulative byte boundaries are pre-computed. Terms that fall entirely before byte_range.start are skipped with continue; terms past byte_range.end break the loop. This makes random-access seeks O(relevant terms) instead of O(all terms).
Fetch xorb ranges from CDN
cas.fetch_xorb_range(fetch) issues an HTTP range request (Range: bytes=start-end) to the presigned CDN URL. The compressed xorb bytes are returned.
The xorb cache key is {xorb_hash}:{range_start}:{range_end}. A cache hit means the CDN fetch and decompression are both skipped entirely for that xorb range.
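The term-skipping step and the cache key can be sketched as follows. This is an illustration under stated assumptions, not HugBucket's code: `Term` is a hypothetical stand-in for ReconstructionTerm with only the fields the example needs, the requested range is treated as half-open `[start, end)` (the page does not say whether byte_range.end is inclusive), and `select_terms` is an invented name.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Term:
    """Hypothetical stand-in for ReconstructionTerm."""
    xorb_hash: str
    unpacked_length: int  # bytes this term contributes to the file


def select_terms(terms: list[Term], start: int, end: int):
    """Return (term, skip, take) for each term overlapping [start, end)."""
    # Pre-compute cumulative end offsets so each term's span is known up front.
    ends = list(accumulate(t.unpacked_length for t in terms))
    selected = []
    for term, term_end in zip(terms, ends):
        term_start = term_end - term.unpacked_length
        if term_end <= start:
            continue  # entirely before the requested range
        if term_start >= end:
            break     # entirely after it; later terms are too
        skip = max(start - term_start, 0)             # bytes to drop at the front
        take = min(end, term_end) - term_start - skip  # bytes to keep
        selected.append((term, skip, take))
    return selected


def xorb_cache_key(xorb_hash: str, range_start: int, range_end: int) -> str:
    """Build the cache key in the format quoted above."""
    return f"{xorb_hash}:{range_start}:{range_end}"
```

For three terms of 10, 20, and 30 bytes, requesting bytes 15 to 35 touches only the second and third terms, so the first xorb is never fetched. Keying the cache on the hash plus the byte range means two requests share an entry only when they need exactly the same slice of the same xorb.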