[SPARK-57029][SQL][TESTS] Add byte-level visibility golden for ICU collation sort keys #56096
Open
yaooqinn wants to merge 1 commit into
### What changes were proposed in this pull request?

Add a test-only visibility golden suite for ICU collation sort keys:

- New test: `sql/core/src/test/scala/org/apache/spark/sql/ICUCollationSortKeyGoldenSuite.scala`
- New golden: `sql/core/src/test/resources/collations/ICU-collations-sort-keys.md` (38 cells, ~1900 bytes)

The suite snapshots `(collation, input) -> hex(CollationKey)` for 14 dimensions covering the ICU surface Spark uses: UCA primary / tertiary case / secondary diacritic; NFC vs NFD canonical equivalence; combining-mark reorder visibility; SMP surrogate path; BMP precomposed Hangul; ASCII punct / space at primary; Turkish locale tailoring (en_USA + tr); CJK Han implicit weighting; empty string boundary; U+FFFD; C0 controls; variation selectors.

Each test asserts a contract on the recorded bytes: row existence, non-empty hex, level segmentation for NON_IGNORABLE alternate handling, prefix-share invariants for Turkish tailoring, and the ICU compressed-sortkey lead byte for CJK implicit weights. Drift-prone dimensions fire with named-condition messages if Spark's ICU configuration or library version changes the semantics; stable dimensions fire if a regression silently drops or folds a cell.

The pattern mirrors `ICUCollationsMapSuite` (which lists the ICU locale surface) and is scoped to ICU-backed collations only. `UTF8_LCASE` is out of scope -- it does not go through `com.ibm.icu.text.Collator.getCollationKey()` and is already covered by `CollationFactorySuite`.

### Why are the changes needed?

icu4j upgrades silently change `ORDER BY ... COLLATE` semantics across Spark versions. Past upgrade PRs (e.g. SPARK-50189, SPARK-52038, SPARK-54447, SPARK-55308, SPARK-56397) touch only the dependency file and benchmark results -- they ship no byte-level regression test on sort output, so a CLDR re-weighting can land in master without any reviewer signal.

Empirical evidence from a local cross-version probe (icu4j 72.1 through 78.3, 33 test cells covering Latin / Turkish / zh_CN):

- The icu4j 75 → 76 transition altered 23/33 cell sort keys (UCA primary base shift, e.g. `en_US 'a': 0x2a → 0x2b`).
- 77.1 → 78.3 (Spark 4.1 → master, SPARK-52038 → SPARK-56397) silently altered 4/33 cells.

None of these drifts surfaced in PR review.

This suite makes such drift visible during ICU upgrade review: any change to the recorded bytes shows up as a golden diff that a reviewer must explicitly accept. It is **not** a stability contract -- the disclaimer at golden line 1, the `GOLDEN_DISCLAIMER` constant (and the line-1 assert that pins it), and the suite scaladoc all state that downstream consumers MUST NOT rely on byte equality across Spark versions. The file is a review-trigger snapshot, nothing more.

Reviewer note: when this golden file changes on a PR that does not bump `icu4j`, please request a revert -- regeneration belongs in the ICU upgrade PR.

### Does this PR introduce _any_ user-facing change?

No. Test-only; no SQLConf, no public API, no production code path.

### How was this patch tested?

- New suite `ICUCollationSortKeyGoldenSuite` (16 tests). Local 16/16 PASS on master, two-pass deterministic: regenerate the golden, then re-run from disk -- bytes match.
- Regenerate the golden with `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.ICUCollationSortKeyGoldenSuite"`; the suite enforces idempotency and checks that the on-disk bytes match the regenerated output.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
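
For readers unfamiliar with the "level segmentation" contract mentioned in the description, here is a minimal sketch of what such a check could look like. It relies on the ICU sort-key convention that collation levels are separated by a `0x01` byte and the key is terminated by `0x00`. It is not the code in `ICUCollationSortKeyGoldenSuite`: the object and method names (`SortKeyLevels`, `parseHex`, `toHex`, `splitLevels`) are hypothetical, and the sample key in the usage note is made up for illustration rather than copied from the golden file.

```scala
// Illustrative sketch only; names are hypothetical and not taken from the suite.
object SortKeyLevels {

  /** Parse a hex string such as "2a 01 05 01 05 00" (whitespace optional) into bytes. */
  def parseHex(hex: String): Array[Byte] = {
    val clean = hex.filterNot(_.isWhitespace)
    require(clean.length % 2 == 0, s"odd-length hex string: $hex")
    clean.grouped(2).map(h => Integer.parseInt(h, 16).toByte).toArray
  }

  /** Render bytes as lowercase hex without separators. */
  def toHex(bytes: Iterable[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString

  /**
   * Split a sort key into its levels (primary, secondary, tertiary, ...).
   * Assumes the ICU convention: 0x01 separates levels, 0x00 terminates the key.
   * Empty levels are preserved, so a key of "01 01 00" yields three empty levels.
   */
  def splitLevels(key: Array[Byte]): Vector[Vector[Byte]] = {
    // Drop the trailing 0x00 terminator if present.
    val body = if (key.nonEmpty && key.last == 0.toByte) key.init else key
    body.foldLeft(Vector(Vector.empty[Byte])) { (levels, b) =>
      if (b == 1.toByte) levels :+ Vector.empty[Byte] // start a new level
      else levels.init :+ (levels.last :+ b)          // extend the current level
    }
  }
}
```

With the made-up key `2a 01 05 01 05 00`, `SortKeyLevels.splitLevels(SortKeyLevels.parseHex("2a 01 05 01 05 00"))` yields three levels whose hex forms are `2a`, `05`, and `05`. A contract of the "level segmentation" kind could then assert on the number of levels or on the first byte of the primary level, which is where a primary base shift such as the `0x2a → 0x2b` change reported above would appear.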
yaooqinn force-pushed the branch from b0fe8cc to 8c38b17.
Member, Author:
Hi @dongjoon-hyun, I shared the same concern until Copilot showed me this sibling file. If you think we should use txt, like the SQL golden files, I can switch to a txt-based golden.
Member:
Got it. Never mind my previous comment. cc @cloud-fan