feat(hash): rapidhash is the default; mh seed hardening; NOTICE attributions #225
Merged
…tency benchmark

Medium tier of mbo/hash/TODO.md:
- SipHash (hash_siphash.h): constexpr-safe SipHash-c-d (canonical 2-4 via GetHash64, 1-3 via GetHash64Sip13) with the reference 128-bit key API; verified against reference vectors (empty-input value matches the SipHash paper's table). The keyed, hash-flooding-resistant option the library lacked; Algorithm derives the key from the seed as (seed, Fmix64(seed)), documented as requiring a secret seed for adversarial protection.
- rapidhash V3 (hash_rapidhash.h): constexpr-safe transcription of the reference FAST variant (wyhash family; small-key latency champion, SMHasher3-clean); verified against vectors generated by compiling the official header across every dispatch branch. hash_internal gains Mult128 (full 64x64->128, both halves).
- Differential test (hash_differential_test.cc): our xxh64/xxh3 vs the actual reference library over 40k+ randomized (input, seed, length) cases incl. block boundaries and 100KB buffers - bit-for-bit agreement, proving the scalar transcription matches even the reference's SIMD path. Reference xxHash 0.8.3 pulled as a test-only http_archive (the BCR module's BUILD predates Bazel 9's rule-autoload removal).
- Mixed-length latency benchmark (BmHash64Latency): unpredictable key sizes with a serialized hash->index dependency chain. Instantly instructive: the ranking flips vs hot-loop throughput (rapidhash 11 ns < xxh3 12.2 ns < mh 16.6 ns at 0..16B mixes).

TODO.md: medium tier cleared except the one-off SMHasher3 run; XXH3-128 added as a future item; MD5/SHA recorded as a non-goal (use BoringSSL; our fast file-identity answer is XXH3).
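To illustrate what a constexpr-safe "full 64x64->128, both halves" multiply can look like, here is a minimal sketch built from 32-bit limbs. The name Mult128 comes from the commit message; the U128 result type, the signature, and the limb-based approach are assumptions for illustration and may differ from what hash_internal actually does.

```cpp
#include <cstdint>

// Hypothetical result type: the commit message only says that Mult128
// produces both halves of the 128-bit product.
struct U128 {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Portable 64x64->128 multiply using four 32x32->64 partial products.
// No compiler intrinsics or __int128, so it is usable in constant expressions.
constexpr U128 Mult128(std::uint64_t a, std::uint64_t b) {
  const std::uint64_t a_lo = a & 0xFFFFFFFFu;
  const std::uint64_t a_hi = a >> 32;
  const std::uint64_t b_lo = b & 0xFFFFFFFFu;
  const std::uint64_t b_hi = b >> 32;

  const std::uint64_t ll = a_lo * b_lo;
  const std::uint64_t lh = a_lo * b_hi;
  const std::uint64_t hl = a_hi * b_lo;
  const std::uint64_t hh = a_hi * b_hi;

  // Middle column: three values below 2^32 each, so the sum cannot overflow.
  const std::uint64_t mid =
      (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

  return U128{
      (mid << 32) | (ll & 0xFFFFFFFFu),            // low 64 bits
      hh + (lh >> 32) + (hl >> 32) + (mid >> 32),  // high 64 bits
  };
}

// (2^64 - 1)^2 = 2^128 - 2^65 + 1, i.e. hi = 2^64 - 2 and lo = 1.
static_assert(Mult128(~0ull, ~0ull).hi == 0xFFFFFFFFFFFFFFFEull);
static_assert(Mult128(~0ull, ~0ull).lo == 1);
```

wyhash-family mixers typically XOR the two halves of such a product together, which is why both halves are needed rather than only the low 64 bits.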
…ion over crypto-lib dependency BoringSSL is live-at-head, unversioned, and has a history of non-reproducible archives - an unverifiable supply chain for exactly the code that most needs verifying. Digests are spec-frozen pure functions; if interop is ever needed, transcribe + pin against NIST vectors in-repo.
…erately not 'crypto')
The 32-bit counterpart of Hash128To64: all 64 bits contribute. Plain truncation would be sound for the strong algorithms (full-avalanche finalizers), but the fold is the correct default for every algorithm - FNV-1a's low bits are biased and XOR-folding is that algorithm's official shrinking recommendation.
Adds the HasGetHash32 concept, Hasher<Algo>::GetHash32, and the top-level GetHash32<Algo>(data, seed). Algorithms may provide a native 32-bit variant (e.g. a future canonical XXH32 / FNV-1a-32 / murmur3-x86); everything else synthesizes the XOR-fold of the 64-bit hash (Hash64To32) - the per-algorithm correct approach, selected automatically.
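The selection described above can be sketched as follows. Hash64To32 and the top-level GetHash32<Algo>(data, seed) are names from the commit messages; the fold body, the two toy algorithms, and the parameter types are assumptions. The PR uses a C++20 concept for HasGetHash32; this sketch substitutes a C++17 detection trait so it compiles on older toolchains.

```cpp
#include <cstdint>
#include <string_view>
#include <type_traits>

// XOR-fold: each of the 64 input bits contributes to the 32-bit result.
constexpr std::uint32_t Hash64To32(std::uint64_t h) {
  return static_cast<std::uint32_t>(h ^ (h >> 32));
}

// Hypothetical toy algorithms, for illustration only (not real hashes).
struct Only64 {
  static constexpr std::uint64_t GetHash64(std::string_view,
                                           std::uint64_t seed) {
    return 0x0123456789ABCDEFull ^ seed;
  }
};
struct Native32 : Only64 {
  static constexpr std::uint32_t GetHash32(std::string_view, std::uint64_t) {
    return 42;
  }
};

// C++17 stand-in for the HasGetHash32 concept: true when the algorithm
// provides its own GetHash32(data, seed).
template <typename Algo, typename = void>
struct HasGetHash32 : std::false_type {};
template <typename Algo>
struct HasGetHash32<Algo, std::void_t<decltype(Algo::GetHash32(
                              std::string_view{}, std::uint64_t{}))>>
    : std::true_type {};

// Prefer the native 32-bit variant; otherwise fold the 64-bit hash.
template <typename Algo>
constexpr std::uint32_t GetHash32(std::string_view data,
                                  std::uint64_t seed = 0) {
  if constexpr (HasGetHash32<Algo>::value) {
    return Algo::GetHash32(data, seed);
  } else {
    return Hash64To32(Algo::GetHash64(data, seed));
  }
}

// 0x01234567 ^ 0x89ABCDEF == 0x88888888
static_assert(GetHash32<Only64>("x") == 0x88888888u);
static_assert(GetHash32<Native32>("x") == 42u);
```

The caller never chooses between truncation and folding; the algorithm type decides, which matches the "selected automatically" wording above.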
…ty BUILD file

Fixes four design problems with the differential test's reference archive:
- The BUILD file is a real file (third_party/xxhash/xxhash.BUILD.bazel), not an inline build_file_content string. third_party/ holds per-dependency directories for external BUILD files and patches; the files are named <dep>.BUILD.bazel so they never load as helly25_mbo packages. (A nested MODULE.bazel boundary was tried first: Bazel 9 does not treat it as a package-traversal boundary - the BUILD file loaded as a main-repo package.)
- The library is testonly.
- Visibility is restricted to //mbo/hash.
- The http_archive moved to bazelmod/dev.MODULE.bazel with dev_dependency = True: consumers of helly25_mbo never fetch it. The differential test is tagged manual so consumer wildcards never analyze a target whose dependency does not exist for them; our CI runs it explicitly.
Adds the 128-bit XXH3 variant to hash_xxh3.h, transcribed from the reference implementation v0.8.2: all dispatch classes (0/1-3/4-8/9-16 with the mult128 mixers, 17-128 and 129-240 via Mix32B, and the long path reusing the existing stripe/scramble machinery with the dual mergeAccs). xxh3::Algorithm is now 128-bit native, so the typed framework and benchmark cover the 128-bit side automatically. Verified three ways: 34 reference vectors across every class boundary and seed, a 40k-case differential test against libxxhash (XXH3_128bits_withSeed, including its SIMD path on 100KB buffers), and the structural property that h1 equals GetHash64 for >240-byte and 1-3-byte inputs by construction.
Adds the streaming contract designed as the shared foundation for the future mbo/digest library: algorithms MAY provide StreamState/StreamInit/StreamUpdate/StreamFinalize with the guarantee that any chunking of the input produces exactly the one-shot GetHash64 value. The HasStreaming concept detects support; Streamer<Algo> is the object-style wrapper (chainable Update, non-destructive Finalize, fully constexpr).

Implementations:
- siphash: native mode of the block ARX construction (8-byte buffer).
- xxh64: the reference streaming semantics (4 accumulators, 32-byte buffer; the one-shot tail is refactored into a shared FinishTail so the paths cannot drift).
- mh: own definition matching the one-shot tier dispatch exactly - buffers up to 64 bytes until the stripe tier is decided, then consumes stripes as the one-shot loop would (including the finalize-time drain of a final full stripe), replaying the <32-byte rest through the 2-lane block/tail code.
- rapidhash/fnv1a/simple/xxh3/murmur3: no streaming (rapidhash has no canonical form; xxh3/murmur3 deferred) - absence is detected honestly.

Tested: typed chunked==one-shot property test (random split points incl. empty chunks, byte-at-a-time, lengths crossing every dispatch tier), non-destructive peek-Finalize, and constexpr streaming (static_assert).
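A minimal sketch of the shape of this contract, under stated assumptions: StreamState/StreamInit/StreamUpdate/StreamFinalize, GetHash64 and Streamer<Algo> are names from the commit message, but every signature here is a guess, and ToyAlgo is a hypothetical byte-at-a-time hash (FNV-1a-style constants) used only because it is trivially streamable. It is not the library's fnv1a, which per the list above has no streaming support.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical stand-in algorithm; not one of the library's algorithms.
struct ToyAlgo {
  struct StreamState {
    std::uint64_t h;
  };

  static constexpr StreamState StreamInit(std::uint64_t seed) {
    return StreamState{0xcbf29ce484222325ull ^ seed};
  }
  static constexpr void StreamUpdate(StreamState& s, std::string_view chunk) {
    for (const char c : chunk) {
      s.h ^= static_cast<unsigned char>(c);
      s.h *= 0x100000001b3ull;
    }
  }
  // Takes the state by const reference, so finalizing never consumes it.
  static constexpr std::uint64_t StreamFinalize(const StreamState& s) {
    return s.h;
  }

  // One-shot form; for this toy it is simply the stream run over one chunk.
  static constexpr std::uint64_t GetHash64(std::string_view data,
                                           std::uint64_t seed) {
    StreamState s = StreamInit(seed);
    StreamUpdate(s, data);
    return StreamFinalize(s);
  }
};

// Object-style wrapper: chainable Update, non-destructive Finalize.
template <typename Algo>
class Streamer {
 public:
  constexpr explicit Streamer(std::uint64_t seed = 0)
      : state_(Algo::StreamInit(seed)) {}

  constexpr Streamer& Update(std::string_view chunk) {
    Algo::StreamUpdate(state_, chunk);
    return *this;
  }

  constexpr std::uint64_t Finalize() const {
    return Algo::StreamFinalize(state_);
  }

 private:
  typename Algo::StreamState state_;
};

// Chunked input (including an empty chunk) must equal the one-shot value.
constexpr std::uint64_t ChunkedDemo() {
  Streamer<ToyAlgo> s(7);
  s.Update("hel").Update("").Update("lo");
  return s.Finalize();
}
static_assert(ChunkedDemo() == ToyAlgo::GetHash64("hello", 7));
```

For block-oriented algorithms such as xxh64 or mh the state additionally carries a partial-block buffer, which is where the chunked==one-shot guarantee becomes non-trivial and worth a property test.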
The one-off credibility run (methodology in SMHASHER3.md): mh fails the research battery in two clusters - near-raw seed handling (SeedBlockLen/Offset/BIC/Bitflip) and core-round diffusion limits (BIC, Sparse 9/4, Bitflip, PerlinNoise). Honest result recorded; the default-algorithm decision (keep-with-caveat / harden / switch to rapidhash) is now a TODO.
…butions

Decisions (b)+(c) from the SMHasher3 evaluation:
- DefaultHashAlgorithm = rapidhash (SMHasher3-clean, canonical, best mixed-length latency); GetHash128 defaults to the new Default128HashAlgorithm = xxh3 (SMHasher3-clean and 128-bit *native* - rapidhash has no 128-bit form and a synthesized default would be a silent quality downgrade). mh remains available; its header and README no longer call it the default.
- mh seed hardening (values change): the seed is finalized through Fmix64 before deriving any lane (one-shot both tiers, and the streaming state), targeting SMHasher3's SeedBlockLen/SeedBlockOffset/SeedBIC/SeedBitflip failure families. Core-round hardening (BIC/Sparse/Bitflip/PerlinNoise) stays a TODO.

License diligence for the transcriptions (rapidhash is MIT - no obstacle to making it the default): new repository-root NOTICE file reproduces the upstream notices verbatim (rapidhash MIT, xxHash BSD-2 full two-clause text, MurmurHash3/SipHash/FNV public-domain/CC0 attributions); file headers carry a short attribution pointing at NOTICE; README links it. LICENSE stays pure Apache-2.0.
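The idea behind the seed hardening can be sketched as below. This assumes Fmix64 is MurmurHash3's 64-bit finalizer, which is what the name conventionally refers to; the Lanes struct, the lane constants, and both derivation functions are hypothetical and are not mh's actual lane setup.

```cpp
#include <cstdint>

// MurmurHash3's 64-bit finalizer (assumed to be what Fmix64 denotes here).
constexpr std::uint64_t Fmix64(std::uint64_t h) {
  h ^= h >> 33;
  h *= 0xff51afd7ed558ccdull;
  h ^= h >> 33;
  h *= 0xc4ceb9fe1a85ec53ull;
  h ^= h >> 33;
  return h;
}

// Hypothetical two-lane state and constants, for illustration only.
struct Lanes {
  std::uint64_t a;
  std::uint64_t b;
};
constexpr std::uint64_t kLaneA = 0x9e3779b97f4a7c15ull;
constexpr std::uint64_t kLaneB = 0xc2b2ae3d27d4eb4full;

// Near-raw seeding: flipping one seed bit flips exactly one bit per lane.
constexpr Lanes DeriveLanesRaw(std::uint64_t seed) {
  return Lanes{seed ^ kLaneA, seed ^ kLaneB};
}

// Hardened seeding: the seed is finalized before any lane is derived,
// so the lanes see an avalanched value instead of the raw seed bits.
constexpr Lanes DeriveLanesHardened(std::uint64_t seed) {
  const std::uint64_t s = Fmix64(seed);
  return Lanes{s ^ kLaneA, s ^ kLaneB};
}

// With raw seeding, seeds 0 and 1 yield lanes that differ in a single bit.
static_assert((DeriveLanesRaw(0).a ^ DeriveLanesRaw(1).a) == 1);
```

Because the derived lanes change for every seed, hash values change too, which is why the commit message flags the hardening as "values change".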
The MIT/BSD-2 obligations are satisfied by shipping NOTICE; a flag would flip the default algorithm per build configuration (the footgun class rejected for auto-MBO_HASH_MANGLE). The algorithm headers are self-contained, so notice-free consumers can include only the attribution-free set; per-algorithm targets remain addable compatibly.
Seed finalization cleared every Seed* family and PerlinNoise (147 -> 183). The remaining five failures (BIC, Sparse 9/4, Bitflip) are the core-round diffusion cluster, tracked in TODO.md.
mh may be made faster, have its rounds strengthened, be replaced entirely, or be renamed; consumers must not depend on the mh namespace or its values beyond a single build.
Fab-Cat
approved these changes
Jul 4, 2026
# Conflicts:
#   CHANGELOG.md
#   mbo/hash/hash_differential_test.cc
#   mbo/hash/hash_test.cc
The digest scope (SHA-1/224/256 and the wider SHA-2 family, SHA-3, BLAKE2b/3, HMAC, MD5 for legacy interop) and the positioning - spec-based, fast, constexpr-compatible, Apache-licensed, no-nonsense implementations verified against official vectors, with the explicit stance against BoringSSL-style dependencies (live-at-head, unversioned, historically non-reproducible archives) - now live in mbo/digest/README.md. mbo/hash/README.md states the same principles for the hash side (canonical-or-honest, constexpr single path, NOTICE-attributed). The stale TODO non-goal paragraph (which wrongly implied MD5/SHA would never exist anywhere) is reduced to pointers; MD5's per-algorithm guidance (fine for accidental corruption, collision-broken against adversaries) is in the digest README.
# Conflicts:
#   CHANGELOG.md
#   mbo/hash/TODO.md
#   mbo/hash/hash.h
#   mbo/hash/hash_mh.h
#   mbo/hash/hash_test.cc
#   mbo/hash/hash_xxh3.h
Top of the stack (base: hash/streaming). Merge order: #219 → #221 → #222 → #224 → this.

Implements the decisions from the SMHasher3 evaluation (#224): (b) harden mh + (c) rapidhash as default.

(c) Default switch
- DefaultHashAlgorithm = rapidhash::Algorithm: SMHasher3-clean, canonical, best mixed-length latency in our benchmarks.
- GetHash128 defaults to the new Default128HashAlgorithm = xxh3::Algorithm: SMHasher3-clean and 128-bit native; rapidhash has no 128-bit form, and defaulting to the synthesized two-pass fallback would have been a silent quality downgrade.
- mh stays available (fast hot-loop throughput, streaming-capable); docs no longer call it the default.

(b) mh seed hardening (values change)
The seed is finalized through Fmix64 before deriving any lane, in the one-shot path (both tiers) and the streaming state alike, targeting the SeedBlockLen/SeedBlockOffset/SeedBIC/SeedBitflip failure families. An SMHasher3 re-run of the hardened mh is in flight; results will be appended to SMHASHER3.md. Core-round hardening (BIC/Sparse/Bitflip/PerlinNoise) remains a TODO.

License diligence (the question that gated (c))
rapidhash is verbatim MIT (© 2025 Nicolas De Carli, based on public-domain wyhash), fully Apache-2.0-compatible, no obstacle. To satisfy notice-preservation obligations properly across all transcriptions:
- NOTICE file (Apache convention) reproducing upstream notices verbatim: rapidhash MIT full text, xxHash BSD-2 full two-clause text (the previous one-line attribution was not strictly sufficient), MurmurHash3/SipHash/FNV attributions.
- LICENSE stays pure Apache-2.0.