docs(hash): SMHasher3 evaluation of mh - FAIL 147/188, recorded honestly #224

Closed
helly25 wants to merge 11 commits into main from hash/smhasher3

helly25 commented Jul 4, 2026

Owner

Stacked on #219. The one-off SMHasher3 run from the roadmap (the credibility bar for a novel construction) — and the honest answer is FAIL: 147/188.

Failures cluster in two areas (full analysis in mbo/hash/SMHASHER3.md):

  1. Seed handling (the bulk — SeedBlockLen [8..31], SeedBlockOffset, SeedBIC, SeedBitflip): the seed enters the lanes nearly raw. A cheap candidate fix exists (Fmix64 the seed before lane derivation).
  2. Core-round diffusion (BIC, Sparse 9/4, Bitflip, PerlinNoise): one rotate-multiply per lane per absorb doesn't reach full bit independence. This is the research-grade version of what our structured-key test caught in #217 (feat(hash): seed-avalanche + structured-key tests; fix sparse-key collisions in mh).
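
To make the candidate seed fix from item 1 concrete, here is a minimal sketch. It assumes `Fmix64` is the MurmurHash3 64-bit finalizer; the lane count, the `kLaneInit` constants, and the `DeriveLanesRaw` / `DeriveLanesMixed` helpers are hypothetical stand-ins, not mh's actual lane setup.

```cpp
#include <array>
#include <cstdint>

// MurmurHash3's 64-bit finalizer. Assumption: the library's Fmix64 matches it.
// Note that Fmix64(0) == 0, so a seed of zero still maps to zero.
constexpr std::uint64_t Fmix64(std::uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}

using Lanes = std::array<std::uint64_t, 4>;

// Hypothetical per-lane constants (arbitrary odd values, not mh's real ones).
constexpr Lanes kLaneInit = {0x9E3779B185EBCA87ULL, 0xC2B2AE3D27D4EB4FULL,
                             0x165667B19E3779F9ULL, 0x85EBCA77C2B2AE63ULL};

// "Nearly raw" seeding: every lane is just the seed XOR a constant, so
// flipping one seed bit flips exactly one bit in each lane.
constexpr Lanes DeriveLanesRaw(std::uint64_t seed) {
  return {seed ^ kLaneInit[0], seed ^ kLaneInit[1],
          seed ^ kLaneInit[2], seed ^ kLaneInit[3]};
}

// Candidate fix: avalanche the seed first, then derive the lanes as before.
constexpr Lanes DeriveLanesMixed(std::uint64_t seed) {
  return DeriveLanesRaw(Fmix64(seed));
}

// Counts the bit positions in which a and b differ.
constexpr int CountDifferingBits(std::uint64_t a, std::uint64_t b) {
  int count = 0;
  for (std::uint64_t x = a ^ b; x != 0; x &= x - 1) ++count;
  return count;
}
```

Comparing `CountDifferingBits(DeriveLanesRaw(0)[0], DeriveLanesRaw(1)[0])` with the same call on `DeriveLanesMixed` shows the difference in seed diffusion that the Seed* tests probe. Whether pre-mixing alone clears those tests for mh is not established by this sketch.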

Context: rapidhash and xxh3 are both SMHasher3-clean upstream, both canonical, and both already in this library; rapidhash also wins our mixed-length latency benchmark.

Decision now queued in TODO.md: keep mh as default with the caveat documented, harden it (seed fix cheap, rounds not), or promote rapidhash to DefaultHashAlgorithm. Deliberately not decided in this PR.

helly25 and others added 8 commits July 4, 2026 11:49
…tency benchmark

Medium tier of mbo/hash/TODO.md:

- SipHash (hash_siphash.h): constexpr-safe SipHash-c-d (canonical 2-4 via
  GetHash64, 1-3 via GetHash64Sip13) with the reference 128-bit key API;
  verified against reference vectors (empty-input value matches the SipHash
  paper's table). The keyed, hash-flooding-resistant option the library
  lacked; Algorithm derives the key from the seed as (seed, Fmix64(seed)),
  documented as requiring a secret seed for adversarial protection.

- rapidhash V3 (hash_rapidhash.h): constexpr-safe transcription of the
  reference FAST variant (wyhash family; small-key latency champion,
  SMHasher3-clean); verified against vectors generated by compiling the
  official header across every dispatch branch. hash_internal gains Mult128
  (full 64x64->128, both halves).

- Differential test (hash_differential_test.cc): our xxh64/xxh3 vs the actual
  reference library over 40k+ randomized (input, seed, length) cases incl.
  block boundaries and 100KB buffers - bit-for-bit agreement, proving the
  scalar transcription matches even the reference's SIMD path. Reference
  xxHash 0.8.3 pulled as a test-only http_archive (the BCR module's BUILD
  predates Bazel 9's rule-autoload removal).

- Mixed-length latency benchmark (BmHash64Latency): unpredictable key sizes
  with a serialized hash->index dependency chain. Instantly instructive: the
  ranking flips vs hot-loop throughput (rapidhash 11 ns < xxh3 12.2 ns < mh
  16.6 ns at 0..16B mixes).
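
The serialized hash->index dependency chain from the last bullet can be sketched roughly as below. This is an illustration, not the code of BmHash64Latency: `StandInHash` (FNV-1a 64), `MakeMixedKeys`, and `ChainedHashes` are hypothetical names, and the key generator is an arbitrary LCG.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Stand-in hash (FNV-1a 64) so the sketch compiles without the library.
inline std::uint64_t StandInHash(std::string_view data, std::uint64_t seed) {
  std::uint64_t h = 0xcbf29ce484222325ULL ^ seed;
  for (char c : data) {
    h ^= static_cast<unsigned char>(c);
    h *= 0x100000001b3ULL;
  }
  return h;
}

// Builds `count` keys whose lengths vary between 0 and 16 bytes.
inline std::vector<std::string> MakeMixedKeys(std::size_t count) {
  std::vector<std::string> keys;
  keys.reserve(count);
  std::uint64_t state = 0x9E3779B97F4A7C15ULL;
  for (std::size_t i = 0; i < count; ++i) {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;  // LCG step
    const std::size_t len = static_cast<std::size_t>((state >> 59) % 17);
    keys.emplace_back(len, static_cast<char>('a' + (state >> 33) % 26));
  }
  return keys;
}

// Each hash result selects the next key and seeds the next call, so call
// i+1 cannot start before call i finishes. That measures latency rather
// than the overlapped throughput of a hot loop over independent inputs.
template <typename HashFn>
std::uint64_t ChainedHashes(const std::vector<std::string>& keys,
                            std::size_t steps, HashFn hash) {
  if (keys.empty()) return 0;
  std::uint64_t h = 0;
  std::size_t index = 0;
  for (std::size_t i = 0; i < steps; ++i) {
    h = hash(keys[index], h);
    index = static_cast<std::size_t>(h % keys.size());
  }
  return h;
}
```

In a real benchmark the loop body would sit inside the timing loop and the final value would be kept alive so the compiler cannot discard the chain.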

TODO.md: medium tier cleared except the one-off SMHasher3 run; XXH3-128 added
as a future item; MD5/SHA recorded as a non-goal (use BoringSSL; our fast
file-identity answer is XXH3).
…ion over crypto-lib dependency

BoringSSL is live-at-head, unversioned, and has a history of
non-reproducible archives - an unverifiable supply chain for exactly the
code that most needs verifying. Digests are spec-frozen pure functions;
if interop is ever needed, transcribe + pin against NIST vectors in-repo.

The 32-bit counterpart of Hash128To64: all 64 bits contribute. Plain
truncation would be sound for the strong algorithms (full-avalanche
finalizers), but the fold is the correct default for every algorithm -
FNV-1a's low bits are biased and XOR-folding is that algorithm's official
shrinking recommendation.
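
As an illustration of the fold versus plain truncation, a minimal sketch follows. The signature of `Hash64To32` is assumed here, and `Truncate64To32` is a hypothetical helper added only for contrast.

```cpp
#include <cstdint>

// XOR-fold: the high 32 bits are mixed into the low 32 bits, so every
// input bit influences the result. (Assumed signature.)
constexpr std::uint32_t Hash64To32(std::uint64_t h) {
  return static_cast<std::uint32_t>(h ^ (h >> 32));
}

// Plain truncation, for contrast: the high 32 bits are discarded.
constexpr std::uint32_t Truncate64To32(std::uint64_t h) {
  return static_cast<std::uint32_t>(h);
}

// Two values that differ only in their high halves collide under
// truncation but stay distinct under the fold.
static_assert(Truncate64To32(0x0000000100000000ULL) ==
              Truncate64To32(0x0000000200000000ULL));
static_assert(Hash64To32(0x0000000100000000ULL) !=
              Hash64To32(0x0000000200000000ULL));
static_assert(Hash64To32(0x0123456789ABCDEFULL) == 0x88888888u);
```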

Adds the HasGetHash32 concept, Hasher<Algo>::GetHash32, and the top-level
GetHash32<Algo>(data, seed). Algorithms may provide a native 32-bit variant
(e.g. a future canonical XXH32 / FNV-1a-32 / murmur3-x86); everything else
synthesizes the XOR-fold of the 64-bit hash (Hash64To32) - the per-algorithm
correct approach, selected automatically.
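
The native-or-fold selection described above might look roughly like this. It is a sketch under assumptions: it lives in a hypothetical `sketch` namespace, it uses a C++17 detection trait (`HasNativeHash32`) where the library uses the HasGetHash32 concept, and `FoldOnlyAlgo` / `NativeAlgo` are made-up algorithms built on FNV-1a.

```cpp
#include <cstdint>
#include <string_view>
#include <type_traits>
#include <utility>

namespace sketch {

constexpr std::uint32_t Hash64To32(std::uint64_t h) {
  return static_cast<std::uint32_t>(h ^ (h >> 32));
}

// Detects a static GetHash32(data, seed) member. Stand-in for the concept.
template <typename Algo, typename = void>
struct HasNativeHash32 : std::false_type {};

template <typename Algo>
struct HasNativeHash32<
    Algo, std::void_t<decltype(Algo::GetHash32(
              std::declval<std::string_view>(), std::uint64_t{}))>>
    : std::true_type {};

// Hypothetical algorithm with only a 64-bit hash (FNV-1a 64).
struct FoldOnlyAlgo {
  static constexpr std::uint64_t GetHash64(std::string_view data,
                                           std::uint64_t seed) {
    std::uint64_t h = 0xcbf29ce484222325ULL ^ seed;
    for (char c : data) {
      h ^= static_cast<unsigned char>(c);
      h *= 0x100000001b3ULL;
    }
    return h;
  }
};

// Hypothetical algorithm that also ships a native 32-bit variant (FNV-1a 32).
struct NativeAlgo {
  static constexpr std::uint64_t GetHash64(std::string_view data,
                                           std::uint64_t seed) {
    return FoldOnlyAlgo::GetHash64(data, seed);
  }
  static constexpr std::uint32_t GetHash32(std::string_view data,
                                           std::uint64_t seed) {
    std::uint32_t h = 0x811c9dc5u ^ static_cast<std::uint32_t>(seed);
    for (char c : data) {
      h ^= static_cast<unsigned char>(c);
      h *= 0x01000193u;
    }
    return h;
  }
};

// Uses the native 32-bit hash when present, otherwise folds the 64-bit one.
template <typename Algo>
constexpr std::uint32_t GetHash32(std::string_view data,
                                  std::uint64_t seed = 0) {
  if constexpr (HasNativeHash32<Algo>::value) {
    return Algo::GetHash32(data, seed);
  } else {
    return Hash64To32(Algo::GetHash64(data, seed));
  }
}

static_assert(!HasNativeHash32<FoldOnlyAlgo>::value);
static_assert(HasNativeHash32<NativeAlgo>::value);

}  // namespace sketch
```

Callers write `sketch::GetHash32<Algo>(data, seed)` either way; the branch is resolved at compile time, so the fallback adds no runtime dispatch.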
…ty BUILD file

Fixes four design problems with the differential test's reference archive:

- The BUILD file is a real file (third_party/xxhash/xxhash.BUILD.bazel), not
  an inline build_file_content string. third_party/ holds per-dependency
  directories for external BUILD files and patches; the files are named
  <dep>.BUILD.bazel so they never load as helly25_mbo packages. (A nested
  MODULE.bazel boundary was tried first: Bazel 9 does not treat it as a
  package-traversal boundary - the BUILD file loaded as a main-repo package.)
- The library is testonly.
- Visibility is restricted to //mbo/hash.
- The http_archive moved to bazelmod/dev.MODULE.bazel with
  dev_dependency = True: consumers of helly25_mbo never fetch it. The
  differential test is tagged manual so consumer wildcards never analyze a
  target whose dependency does not exist for them; our CI runs it explicitly.

The one-off credibility run (methodology in SMHASHER3.md): mh fails the
research battery in two clusters - near-raw seed handling (SeedBlockLen/
Offset/BIC/Bitflip) and core-round diffusion limits (BIC, Sparse 9/4,
Bitflip, PerlinNoise). Honest result recorded; the default-algorithm
decision (keep-with-caveat / harden / switch to rapidhash) is now a TODO.
Base automatically changed from hash/medium-batch to main July 4, 2026 14:23

helly25 commented Jul 4, 2026

Owner Author

Superseded: #225 already carried this branch's content (merged into it during stacking) and has been merged to main with the final SMHASHER3.md (including the 183/188 re-run postscript) plus the decision outcome. Merging this PR now would only regress SMHASHER3.md/TODO.md to their pre-decision text and strip the newer attribution notices. Nothing is lost.

helly25 closed this Jul 4, 2026
helly25 deleted the hash/smhasher3 branch July 4, 2026 14:56