
feat(hash): rapidhash is the default; mh seed hardening; NOTICE attributions #225

helly25 merged 23 commits into main from hash/rapidhash-default on Jul 4, 2026

helly25 (Owner) commented Jul 4, 2026

Top of the stack (base: hash/streaming). Merge order: #219 → #221 → #222 → #224 → this.

Implements the decisions from the SMHasher3 evaluation (#224): (b) harden mh + (c) rapidhash as default.

(c) Default switch

  • DefaultHashAlgorithm = rapidhash::Algorithm — SMHasher3-clean, canonical, best mixed-length latency in our benchmarks.
  • GetHash128 defaults to the new Default128HashAlgorithm = xxh3::Algorithm — SMHasher3-clean and 128-bit native; rapidhash has no 128-bit form, and defaulting to the synthesized two-pass fallback would have been a silent quality downgrade.
  • mh stays available (fast hot-loop throughput, streaming-capable); docs no longer call it the default.

(b) mh seed hardening (values change)

The seed is finalized through Fmix64 before deriving any lane — one-shot (both tiers) and streaming state alike — targeting the SeedBlockLen/SeedBlockOffset/SeedBIC/SeedBitflip failure families. An SMHasher3 re-run of the hardened mh is in flight; results will be appended to SMHASHER3.md. Core-round hardening (BIC/Sparse/Bitflip/PerlinNoise) remains a TODO.
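A minimal sketch of the idea, not the actual mh code. It assumes that Fmix64 is the well-known MurmurHash3 64-bit finalizer (the PR only names it), and the two-lane `Lanes` struct, the lane constants, and both `DeriveLanes*` helpers are hypothetical names made up for illustration.

```cpp
#include <cstdint>

// MurmurHash3's 64-bit finalizer. Assumption: the library's Fmix64 is this
// function; the PR text does not spell out its definition.
constexpr std::uint64_t Fmix64(std::uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}

// Hypothetical two-lane state; the real mh lane count and constants differ.
struct Lanes {
  std::uint64_t a;
  std::uint64_t b;
};

// Arbitrary odd constants, for illustration only.
inline constexpr std::uint64_t kLaneA = 0x9e3779b97f4a7c15ULL;
inline constexpr std::uint64_t kLaneB = 0xbf58476d1ce4e5b9ULL;

// Near-raw seed handling: a one-bit seed change is a one-bit lane change.
constexpr Lanes DeriveLanesRaw(std::uint64_t seed) {
  return {seed ^ kLaneA, seed ^ kLaneB};
}

// Hardened: the seed is finalized before any lane is derived from it.
constexpr Lanes DeriveLanesHardened(std::uint64_t seed) {
  const std::uint64_t s = Fmix64(seed);
  return {s ^ kLaneA, s ^ kLaneB};
}

// Number of differing bits between two words.
constexpr int BitDiff(std::uint64_t x, std::uint64_t y) {
  int n = 0;
  for (std::uint64_t d = x ^ y; d != 0; d &= d - 1) ++n;
  return n;
}
```

With the raw derivation, seeds 0 and 1 yield lanes that differ in exactly one bit, which is the kind of near-raw seed handling the Seed* families probe. After finalization, the one-bit seed change no longer maps to a one-bit lane change.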

License diligence (the question that gated (c))

rapidhash is verbatim MIT (© 2025 Nicolas De Carli, based on public-domain wyhash) — fully Apache-2.0-compatible, no obstacle. To satisfy notice-preservation obligations properly across all transcriptions:

  • New repository-root NOTICE file (Apache convention) reproducing upstream notices verbatim: rapidhash MIT full text, xxHash BSD-2 full two-clause text (the previous one-line attribution was not strictly sufficient), MurmurHash3/SipHash/FNV attributions.
  • File headers carry short attributions pointing at NOTICE; README links it; LICENSE stays pure Apache-2.0.

helly25 and others added 15 commits July 4, 2026 11:49
…tency benchmark

Medium tier of mbo/hash/TODO.md:

- SipHash (hash_siphash.h): constexpr-safe SipHash-c-d (canonical 2-4 via
  GetHash64, 1-3 via GetHash64Sip13) with the reference 128-bit key API;
  verified against reference vectors (empty-input value matches the SipHash
  paper's table). The keyed, hash-flooding-resistant option the library
  lacked; Algorithm derives the key from the seed as (seed, Fmix64(seed)),
  documented as requiring a secret seed for adversarial protection.

- rapidhash V3 (hash_rapidhash.h): constexpr-safe transcription of the
  reference FAST variant (wyhash family; small-key latency champion,
  SMHasher3-clean); verified against vectors generated by compiling the
  official header across every dispatch branch. hash_internal gains Mult128
  (full 64x64->128, both halves).

- Differential test (hash_differential_test.cc): our xxh64/xxh3 vs the actual
  reference library over 40k+ randomized (input, seed, length) cases incl.
  block boundaries and 100KB buffers - bit-for-bit agreement, proving the
  scalar transcription matches even the reference's SIMD path. Reference
  xxHash 0.8.3 pulled as a test-only http_archive (the BCR module's BUILD
  predates Bazel 9's rule-autoload removal).

- Mixed-length latency benchmark (BmHash64Latency): unpredictable key sizes
  with a serialized hash->index dependency chain. Instantly instructive: the
  ranking flips vs hot-loop throughput (rapidhash 11 ns < xxh3 12.2 ns < mh 16.6 ns
  at 0..16B mixes).
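
The Mult128 helper mentioned in the rapidhash item (full 64x64->128, both halves) could look roughly like the portable sketch below. The `U128` struct, the signature, and the `Mix` helper are assumptions for illustration; the actual hash_internal code may use a compiler intrinsic or a different return type.

```cpp
#include <cstdint>

// Hypothetical result type: both halves of the 128-bit product.
struct U128 {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Portable, constexpr-safe 64x64->128 multiply using 32-bit limbs.
constexpr U128 Mult128(std::uint64_t a, std::uint64_t b) {
  constexpr std::uint64_t kMask = 0xffffffffULL;
  const std::uint64_t a_lo = a & kMask;
  const std::uint64_t a_hi = a >> 32;
  const std::uint64_t b_lo = b & kMask;
  const std::uint64_t b_hi = b >> 32;

  const std::uint64_t p0 = a_lo * b_lo;  // contributes to bits 0..63
  const std::uint64_t p1 = a_lo * b_hi;  // contributes to bits 32..95
  const std::uint64_t p2 = a_hi * b_lo;  // contributes to bits 32..95
  const std::uint64_t p3 = a_hi * b_hi;  // contributes to bits 64..127

  // Sum of three values below 2^32 each, so this cannot overflow 64 bits.
  const std::uint64_t mid = (p0 >> 32) + (p1 & kMask) + (p2 & kMask);

  return {(mid << 32) | (p0 & kMask),
          p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32)};
}

// wyhash-family style mixer: fold the two product halves together.
constexpr std::uint64_t Mix(std::uint64_t a, std::uint64_t b) {
  const U128 p = Mult128(a, b);
  return p.lo ^ p.hi;
}
```

Keeping the helper free of intrinsics is what allows it to be evaluated in constant expressions; a production version might dispatch to `unsigned __int128` where available.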

TODO.md: medium tier cleared except the one-off SMHasher3 run; XXH3-128 added
as a future item; MD5/SHA recorded as a non-goal (use BoringSSL; our fast
file-identity answer is XXH3).
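
The serialized hash->index dependency chain used by the latency benchmark can be sketched as follows. This is only an illustration of the technique: `ToyHash` (FNV-1a 64) stands in for the algorithm under test, `kKeys` and `ChainedWalk` are hypothetical names, and the timing harness is omitted.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Stand-in hash (FNV-1a 64); a real benchmark would plug in the algorithm
// being measured.
constexpr std::uint64_t ToyHash(std::string_view s) {
  std::uint64_t h = 0xcbf29ce484222325ULL;
  for (char c : s) {
    h ^= static_cast<unsigned char>(c);
    h *= 0x100000001b3ULL;
  }
  return h;
}

// Keys of mixed length in the 0..16 byte range.
inline constexpr std::array<std::string_view, 8> kKeys = {
    "",         "a",            "abc",              "abcdefg",
    "abcdefgh", "0123456789ab", "0123456789abcdef", "xy"};

// Each step hashes the key selected by the previous hash, so the next key
// (and its length) is unknown until the current hash has finished. The CPU
// cannot overlap consecutive hashes, which exposes latency, not throughput.
constexpr std::size_t ChainedWalk(std::size_t steps) {
  std::size_t index = 0;
  for (std::size_t i = 0; i < steps; ++i) {
    index = static_cast<std::size_t>((ToyHash(kKeys[index]) + i) % kKeys.size());
  }
  return index;
}
```

A hot-loop throughput benchmark hashes independent inputs back to back, which lets out-of-order execution hide per-call latency; the chain above removes that overlap, which is one plausible reason the ranking can flip.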
…ion over crypto-lib dependency

BoringSSL is live-at-head, unversioned, and has a history of
non-reproducible archives - an unverifiable supply chain for exactly the
code that most needs verifying. Digests are spec-frozen pure functions;
if interop is ever needed, transcribe + pin against NIST vectors in-repo.

Hash64To32 is the 32-bit counterpart of Hash128To64: an XOR-fold in which
all 64 bits contribute. Plain truncation would be sound for the strong
algorithms (full-avalanche finalizers), but the fold is the correct default
for every algorithm - FNV-1a's low bits are biased and XOR-folding is that
algorithm's official shrinking recommendation.

Adds the HasGetHash32 concept, Hasher<Algo>::GetHash32, and the top-level
GetHash32<Algo>(data, seed). Algorithms may provide a native 32-bit variant
(e.g. a future canonical XXH32 / FNV-1a-32 / murmur3-x86); everything else
synthesizes the XOR-fold of the 64-bit hash (Hash64To32) - the per-algorithm
correct approach, selected automatically.
…ty BUILD file

Fixes four design problems with the differential test's reference archive:

- The BUILD file is a real file (third_party/xxhash/xxhash.BUILD.bazel), not
  an inline build_file_content string. third_party/ holds per-dependency
  directories for external BUILD files and patches; the files are named
  <dep>.BUILD.bazel so they never load as helly25_mbo packages. (A nested
  MODULE.bazel boundary was tried first: Bazel 9 does not treat it as a
  package-traversal boundary - the BUILD file loaded as a main-repo package.)
- The library is testonly.
- Visibility is restricted to //mbo/hash.
- The http_archive moved to bazelmod/dev.MODULE.bazel with
  dev_dependency = True: consumers of helly25_mbo never fetch it. The
  differential test is tagged manual so consumer wildcards never analyze a
  target whose dependency does not exist for them; our CI runs it explicitly.

Adds the 128-bit XXH3 variant to hash_xxh3.h, transcribed from the reference
implementation v0.8.2: all dispatch classes (0/1-3/4-8/9-16 with the mult128
mixers, 17-128 and 129-240 via Mix32B, and the long path reusing the existing
stripe/scramble machinery with the dual mergeAccs). xxh3::Algorithm is now
128-bit native, so the typed framework and benchmark cover the 128-bit side
automatically.

Verified three ways: 34 reference vectors across every class boundary and
seed, a 40k-case differential test against libxxhash (XXH3_128bits_withSeed,
including its SIMD path on 100KB buffers), and the structural property that
h1 equals GetHash64 for >240-byte and 1-3-byte inputs by construction.

Adds the streaming contract designed as the shared foundation for the future
mbo/digest library: algorithms MAY provide StreamState/StreamInit/StreamUpdate/
StreamFinalize with the guarantee that any chunking of the input produces
exactly the one-shot GetHash64 value. The HasStreaming concept detects
support; Streamer<Algo> is the object-style wrapper (chainable Update,
non-destructive Finalize, fully constexpr).

Implementations:
- siphash: native mode of the block ARX construction (8-byte buffer).
- xxh64: the reference streaming semantics (4 accumulators, 32-byte buffer;
  the one-shot tail is refactored into a shared FinishTail so the paths
  cannot drift).
- mh: own definition matching the one-shot tier dispatch exactly - buffers up
  to 64 bytes until the stripe tier is decided, then consumes stripes as the
  one-shot loop would (including the finalize-time drain of a final full
  stripe), replaying the <32-byte rest through the 2-lane block/tail code.
- rapidhash/fnv1a/simple/xxh3/murmur3: no streaming (rapidhash has no
  canonical form; xxh3/murmur3 deferred) - absence is detected honestly.

Tested: typed chunked==one-shot property test (random split points incl.
empty chunks, byte-at-a-time, lengths crossing every dispatch tier),
non-destructive peek-Finalize, and constexpr streaming (static_assert).
The one-off credibility run (methodology in SMHASHER3.md): mh fails the
research battery in two clusters - near-raw seed handling (SeedBlockLen/
Offset/BIC/Bitflip) and core-round diffusion limits (BIC, Sparse 9/4,
Bitflip, PerlinNoise). Honest result recorded; the default-algorithm
decision (keep-with-caveat / harden / switch to rapidhash) is now a TODO.
…butions

Decisions (b)+(c) from the SMHasher3 evaluation:

- DefaultHashAlgorithm = rapidhash (SMHasher3-clean, canonical, best
  mixed-length latency); GetHash128 defaults to the new
  Default128HashAlgorithm = xxh3 (SMHasher3-clean and 128-bit *native* -
  rapidhash has no 128-bit form and a synthesized default would be a silent
  quality downgrade). mh remains available; its header and README no longer
  call it the default.

- mh seed hardening (values change): the seed is finalized through Fmix64
  before deriving any lane (one-shot both tiers, and the streaming state),
  targeting SMHasher3's SeedBlockLen/SeedBlockOffset/SeedBIC/SeedBitflip
  failure families. Core-round hardening (BIC/Sparse/Bitflip/PerlinNoise)
  stays a TODO.

License diligence for the transcriptions (rapidhash is MIT - no obstacle to
making it the default): new repository-root NOTICE file reproduces the
upstream notices verbatim (rapidhash MIT, xxHash BSD-2 full two-clause text,
MurmurHash3/SipHash/FNV public-domain/CC0 attributions); file headers carry a
short attribution pointing at NOTICE; README links it. LICENSE stays pure
Apache-2.0.

The MIT/BSD-2 obligations are satisfied by shipping NOTICE; a flag would
flip the default algorithm per build configuration (the footgun class
rejected for auto-MBO_HASH_MANGLE). The algorithm headers are
self-contained, so notice-free consumers can include only the
attribution-free set; per-algorithm targets remain addable compatibly.

Seed finalization cleared every Seed* family and PerlinNoise (147 -> 183).
The remaining five failures (BIC, Sparse 9/4, Bitflip) are the core-round
diffusion cluster, tracked in TODO.md.

mh may be made faster, have its rounds strengthened, be replaced entirely,
or be renamed; consumers must not depend on the mh namespace or its values
beyond a single build.
helly25 requested a review from Fab-Cat July 4, 2026 13:41
helly25 added 7 commits July 4, 2026 15:25
# Conflicts:
#	CHANGELOG.md
#	mbo/hash/hash_differential_test.cc
#	mbo/hash/hash_test.cc
The digest scope (SHA-1/224/256 and the wider SHA-2 family, SHA-3, BLAKE2b/3,
HMAC, MD5 for legacy interop) and the positioning - spec-based, fast,
constexpr-compatible, Apache-licensed, no-nonsense implementations verified
against official vectors, with the explicit stance against BoringSSL-style
dependencies (live-at-head, unversioned, historically non-reproducible
archives) - now live in mbo/digest/README.md. mbo/hash/README.md states the
same principles for the hash side (canonical-or-honest, constexpr single
path, NOTICE-attributed). The stale TODO non-goal paragraph (which wrongly
implied MD5/SHA would never exist anywhere) is reduced to pointers; MD5's
per-algorithm guidance (fine for accidental corruption, collision-broken
against adversaries) is in the digest README.
Base automatically changed from hash/streaming to main July 4, 2026 14:49
# Conflicts:
#	CHANGELOG.md
#	mbo/hash/TODO.md
#	mbo/hash/hash.h
#	mbo/hash/hash_mh.h
#	mbo/hash/hash_test.cc
#	mbo/hash/hash_xxh3.h
helly25 merged commit 3f4d0ed into main Jul 4, 2026
4 checks passed
helly25 deleted the hash/rapidhash-default branch July 4, 2026 14:54