docs(hash): SMHasher3 evaluation of mh - FAIL 147/188, recorded honestly #224
Closed
helly25 wants to merge 11 commits into
…tency benchmark

Medium tier of mbo/hash/TODO.md:
- SipHash (hash_siphash.h): constexpr-safe SipHash-c-d (canonical 2-4 via GetHash64, 1-3 via GetHash64Sip13) with the reference 128-bit key API; verified against reference vectors (empty-input value matches the SipHash paper's table). The keyed, hash-flooding-resistant option the library lacked; Algorithm derives the key from the seed as (seed, Fmix64(seed)), documented as requiring a secret seed for adversarial protection.
- rapidhash V3 (hash_rapidhash.h): constexpr-safe transcription of the reference FAST variant (wyhash family; small-key latency champion, SMHasher3-clean); verified against vectors generated by compiling the official header across every dispatch branch. hash_internal gains Mult128 (full 64x64->128, both halves).
- Differential test (hash_differential_test.cc): our xxh64/xxh3 vs the actual reference library over 40k+ randomized (input, seed, length) cases incl. block boundaries and 100KB buffers - bit-for-bit agreement, proving the scalar transcription matches even the reference's SIMD path. Reference xxHash 0.8.3 pulled as a test-only http_archive (the BCR module's BUILD predates Bazel 9's rule-autoload removal).
- Mixed-length latency benchmark (BmHash64Latency): unpredictable key sizes with a serialized hash->index dependency chain. Instantly instructive: the ranking flips vs hot-loop throughput (rapidhash 11ns < xxh3 12.2 < mh 16.6 at 0..16B mixes).

TODO.md: medium tier cleared except the one-off SMHasher3 run; XXH3-128 added as a future item; MD5/SHA recorded as a non-goal (use BoringSSL; our fast file-identity answer is XXH3).
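The rapidhash item above mentions a `Mult128` helper that returns both halves of a full 64x64->128 product. A minimal sketch of one portable, constexpr-friendly way to do that follows. The names `U128Sketch` and `Mult128Sketch` are hypothetical, and the actual signature and implementation in hash_internal may differ (for example, it may use a compiler 128-bit integer type where available).

```cpp
#include <cstdint>

// Hypothetical result type: the two 64-bit halves of a 128-bit product.
struct U128Sketch {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Portable 64x64 -> 128 multiply built from four 32x32 -> 64 partial products.
// Illustrative only; not the library's Mult128.
constexpr U128Sketch Mult128Sketch(std::uint64_t a, std::uint64_t b) {
  const std::uint64_t a_lo = a & 0xFFFFFFFFu;
  const std::uint64_t a_hi = a >> 32;
  const std::uint64_t b_lo = b & 0xFFFFFFFFu;
  const std::uint64_t b_hi = b >> 32;

  const std::uint64_t ll = a_lo * b_lo;  // contributes to bits 0..63
  const std::uint64_t lh = a_lo * b_hi;  // contributes to bits 32..95
  const std::uint64_t hl = a_hi * b_lo;  // contributes to bits 32..95
  const std::uint64_t hh = a_hi * b_hi;  // contributes to bits 64..127

  // Middle column: three values below 2^32 each, so the sum fits in 64 bits.
  const std::uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

  return U128Sketch{
      (mid << 32) | (ll & 0xFFFFFFFFu),
      hh + (lh >> 32) + (hl >> 32) + (mid >> 32),
  };
}

// Usable in constant expressions, which is what a constexpr-safe hash needs.
static_assert(Mult128Sketch(3, 5).lo == 15, "small products stay in the low half");
static_assert(Mult128Sketch(1ull << 32, 1ull << 32).hi == 1, "2^32 * 2^32 == 2^64");
```

The split into 32-bit limbs avoids any dependency on `unsigned __int128`, at the cost of four multiplies instead of one widening multiply.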
…ion over crypto-lib dependency

BoringSSL is live-at-head, unversioned, and has a history of non-reproducible archives - an unverifiable supply chain for exactly the code that most needs verifying. Digests are spec-frozen pure functions; if interop is ever needed, transcribe + pin against NIST vectors in-repo.
…erately not 'crypto')
The 32-bit counterpart of Hash128To64: all 64 bits contribute. Plain truncation would be sound for the strong algorithms (full-avalanche finalizers), but the fold is the correct default for every algorithm - FNV-1a's low bits are biased and XOR-folding is that algorithm's official shrinking recommendation.
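The fold described above can be sketched in a few lines. `Hash64To32Sketch` is a hypothetical stand-in; the library's `Hash64To32` may have a different signature, but the XOR-fold itself is the standard operation of combining the upper and lower 32-bit halves.

```cpp
#include <cstdint>

// XOR-fold a 64-bit hash into 32 bits so that every input bit contributes.
// Illustrative only; not the library's Hash64To32.
constexpr std::uint32_t Hash64To32Sketch(std::uint64_t hash) {
  return static_cast<std::uint32_t>(hash ^ (hash >> 32));
}

// Plain truncation would discard the upper half entirely; the fold keeps it.
static_assert(Hash64To32Sketch(0x1234567800000000ull) == 0x12345678u,
              "upper half survives the fold");
static_assert(Hash64To32Sketch(0xFFFFFFFF00000001ull) == 0xFFFFFFFEu,
              "both halves are combined");
```

With truncation, the first example would return 0, which is the kind of information loss the fold avoids for algorithms whose low bits are weak.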
Adds the HasGetHash32 concept, Hasher<Algo>::GetHash32, and the top-level GetHash32<Algo>(data, seed). Algorithms may provide a native 32-bit variant (e.g. a future canonical XXH32 / FNV-1a-32 / murmur3-x86); everything else synthesizes the XOR-fold of the 64-bit hash (Hash64To32) - the per-algorithm correct approach, selected automatically.
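The automatic selection described above might look roughly like the following. This is a sketch under several assumptions: the commit uses a `HasGetHash32` concept, whereas the sketch uses a C++17 detection trait as a stand-in so that it also compiles without C++20; `HasGetHash32Sketch`, `GetHash32Sketch`, `ToyWideOnly`, and `ToyNative32` are all hypothetical names, and the toy hash is an FNV-1a-style loop with the seed XORed into the offset basis, not any of the library's algorithms.

```cpp
#include <cstdint>
#include <string_view>
#include <type_traits>

// Detection trait (stand-in for the concept): true when Algo exposes a
// static GetHash32(data, seed).
template <typename Algo, typename = void>
struct HasGetHash32Sketch : std::false_type {};

template <typename Algo>
struct HasGetHash32Sketch<
    Algo, std::void_t<decltype(Algo::GetHash32(std::string_view{}, std::uint64_t{}))>>
    : std::true_type {};

// Toy algorithm with only a 64-bit hash (FNV-1a-style, seed mixed into the basis).
struct ToyWideOnly {
  static std::uint64_t GetHash64(std::string_view data, std::uint64_t seed) {
    std::uint64_t h = seed ^ 0xcbf29ce484222325ull;
    for (char c : data) {
      h ^= static_cast<unsigned char>(c);
      h *= 0x100000001b3ull;
    }
    return h;
  }
};

// Toy algorithm that additionally provides a native 32-bit variant.
struct ToyNative32 {
  static std::uint64_t GetHash64(std::string_view data, std::uint64_t seed) {
    return ToyWideOnly::GetHash64(data, seed);
  }
  static std::uint32_t GetHash32(std::string_view, std::uint64_t) {
    return 0xABCD1234u;  // Placeholder value so the dispatch is observable.
  }
};

// Use the native 32-bit hash when present, otherwise XOR-fold the 64-bit hash.
template <typename Algo>
std::uint32_t GetHash32Sketch(std::string_view data, std::uint64_t seed = 0) {
  if constexpr (HasGetHash32Sketch<Algo>::value) {
    return Algo::GetHash32(data, seed);
  } else {
    const std::uint64_t h = Algo::GetHash64(data, seed);
    return static_cast<std::uint32_t>(h ^ (h >> 32));
  }
}
```

Callers always write `GetHash32Sketch<Algo>(data, seed)`; whether the result comes from a native variant or from the fold is decided at compile time per algorithm.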
…ty BUILD file

Fixes four design problems with the differential test's reference archive:
- The BUILD file is a real file (third_party/xxhash/xxhash.BUILD.bazel), not an inline build_file_content string. third_party/ holds per-dependency directories for external BUILD files and patches; the files are named <dep>.BUILD.bazel so they never load as helly25_mbo packages. (A nested MODULE.bazel boundary was tried first: Bazel 9 does not treat it as a package-traversal boundary - the BUILD file loaded as a main-repo package.)
- The library is testonly.
- Visibility is restricted to //mbo/hash.
- The http_archive moved to bazelmod/dev.MODULE.bazel with dev_dependency = True: consumers of helly25_mbo never fetch it. The differential test is tagged manual so consumer wildcards never analyze a target whose dependency does not exist for them; our CI runs it explicitly.
The one-off credibility run (methodology in SMHASHER3.md): mh fails the research battery in two clusters - near-raw seed handling (SeedBlockLen/Offset/BIC/Bitflip) and core-round diffusion limits (BIC, Sparse 9/4, Bitflip, PerlinNoise). Honest result recorded; the default-algorithm decision (keep-with-caveat / harden / switch to rapidhash) is now a TODO.
Fab-Cat
approved these changes
Jul 4, 2026
# Conflicts: # mbo/hash/TODO.md
# Conflicts: # mbo/hash/TODO.md
Superseded: #225 already carried this branch's content (merged into it during stacking), and merged to main with the final SMHASHER3.md (including the 183/188 re-run postscript) plus the decision outcome. Merging this PR now would only regress SMHASHER3.md/TODO.md to their pre-decision text and strip the newer attribution notices. Nothing is lost.
Stacked on #219. This is the one-off SMHasher3 run from the roadmap (the credibility bar for a novel construction), and the honest answer is FAIL: 147/188.
Failures cluster in two areas (full analysis in `mbo/hash/SMHASHER3.md`): near-raw seed handling and core-round diffusion limits.

Context: `rapidhash` and `xxh3` are both SMHasher3-clean upstream, both canonical, and both already in this library, and rapidhash also wins our mixed-length latency benchmark.

Decision now queued in TODO.md: keep `mh` as default with the caveat documented, harden it (seed fix cheap, rounds not), or promote `rapidhash` to `DefaultHashAlgorithm`. Deliberately not decided in this PR.