Sparse Dictionary Canonicalize #7841
Status: Open
gatesn wants to merge 14 commits into develop from ngates/sparse-dict
Changes from all commits (14 commits):
All commits authored by gatesn:

- baceda2 Sparse Dictionary Canonicalize
- 90bc59c Document sparse dictionary cardinality estimation
- 4137b92 Document sparse cardinality estimators
- 4d65cf1 Share dictionary referenced-values mask
- c3e209d Merge remote-tracking branch 'origin/develop' into ngates/sparse-dict
- 4ab2dc5 Skip sparse dict compaction for referenced dictionaries
- 13ae594 Avoid sparse dict overhead on dense paths
- 6a07393 Merge remote-tracking branch 'origin/develop' into ngates/sparse-dict
- fd59336 Gate sparse dict sampling by shape
- 58bb1d4 Keep sparse dict helpers cold
- cc38823 Merge remote-tracking branch 'origin/develop' into ngates/sparse-dict
- d6e1096 Avoid sparse dict compaction over filters
- 54b8dba Skip sparse dict gate for filter values
- 22be09b Deduplicate cardinality code
New file, +135 lines:

```rust
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors

//! Sampling-based cardinality estimation for dictionary codes.
//!
//! This module is used only as a cheap gate before the exact sparse-dictionary remap pass and as a
//! routing hint for downstream exporters. The estimate may be conservative or noisy, but
//! correctness does not depend on it: callers must still collect the exact unique code set and
//! re-check the sparse threshold before compacting.

use vortex_mask::Mask;

use crate::arrays::PrimitiveArray;
use crate::dtype::IntegerPType;

const SAMPLE_SIZE: usize = 128;
const REPEATED_CODE_PROBE_SIZE: usize = 16;

/// Return whether a small deterministic probe observes a repeated non-null code.
///
/// Sparse canonicalization always has a cheap worst-case gate before it samples. This probe is the
/// next, cheaper filter for cases that are not sparse by row count alone: dense dictionaries should
/// not pay the full estimator cost unless the code stream first shows evidence of repeated codes.
/// A `true` result only means "run the estimator"; it is not enough to compact by itself.
pub fn has_repeated_code_sample<I: IntegerPType>(
    codes: &PrimitiveArray,
    validity_mask: &Mask,
) -> bool {
    let sample_count = codes.len().min(REPEATED_CODE_PROBE_SIZE);
    let mut observed_codes = Vec::<usize>::with_capacity(sample_count);

    for sample_idx in 0..sample_count {
        let idx = sample_index(sample_idx, codes.len(), sample_count);
        if !validity_mask.value(idx) {
            continue;
        }

        let code: usize = codes.as_slice::<I>()[idx].as_();
        if observed_codes.contains(&code) {
            return true;
        }
        observed_codes.push(code);
    }

    false
}

/// Estimate the number of distinct non-null dictionary codes.
///
/// The estimator samples deterministic bucket midpoints so repeated executions make the same
/// compaction decision for the same input. Returning `None` means no valid sampled codes were seen.
/// A returned value should only be used to decide whether an exact pass is worth attempting.
pub fn estimate_code_cardinality<I: IntegerPType>(
    codes: &PrimitiveArray,
    validity_mask: &Mask,
) -> Option<usize> {
    let sample_count = codes.len().min(SAMPLE_SIZE);
    let mut observed_codes = Vec::<(usize, usize)>::new();

    // Sample deterministic bucket midpoints instead of using randomness. The estimate only gates
    // whether to run the exact pass; correctness never depends on the sample.
    for sample_idx in 0..sample_count {
        let idx = sample_index(sample_idx, codes.len(), sample_count);
        if !validity_mask.value(idx) {
            continue;
        }

        let code: usize = codes.as_slice::<I>()[idx].as_();
        if let Some((_, count)) = observed_codes
            .iter_mut()
            .find(|(observed, _)| *observed == code)
        {
            *count += 1;
        } else {
            observed_codes.push((code, 1));
        }
    }

    estimate_cardinality_from_observations(&observed_codes)
}

/// Estimate total cardinality from `(code, observed_count)` sample observations.
///
/// The correction is Chao1-style: singleton-heavy samples imply more unseen codes, while repeated
/// observations imply the code stream is likely low-cardinality.
fn estimate_cardinality_from_observations(observed_codes: &[(usize, usize)]) -> Option<usize> {
    if observed_codes.is_empty() {
        return None;
    }

    let unique_count = observed_codes.len();
    let singleton_count = observed_codes
        .iter()
        .filter(|(_, count)| *count == 1)
        .count();
    let doubleton_count = observed_codes
        .iter()
        .filter(|(_, count)| *count == 2)
        .count();

    // Chao1-style lower-bias estimate for unseen codes. Repeated samples keep the estimate small
    // for low-cardinality code streams; many singleton samples make dense streams look expensive.
    let unseen_estimate = if doubleton_count == 0 {
        singleton_count.saturating_mul(singleton_count.saturating_sub(1)) / 2
    } else {
        div_ceil(
            singleton_count.saturating_mul(singleton_count),
            2 * doubleton_count,
        )
    };

    Some(unique_count.saturating_add(unseen_estimate))
}

/// Return the midpoint index for one deterministic sampling bucket.
///
/// Splitting the full code range into buckets avoids clustering all samples near the start while
/// avoiding RNG state in a hot execution path.
fn sample_index(sample_idx: usize, len: usize, sample_count: usize) -> usize {
    debug_assert!(len > 0);
    debug_assert!(sample_count > 0);

    let sample_idx = sample_idx as u128;
    let len = len as u128;
    let sample_count = sample_count as u128;
    let bucket_start = sample_idx * len / sample_count;
    let bucket_end = (sample_idx + 1) * len / sample_count;

    ((bucket_start + bucket_end) / 2).min(len - 1) as usize
}

fn div_ceil(numerator: usize, denominator: usize) -> usize {
    debug_assert!(denominator > 0);
    numerator / denominator + usize::from(!numerator.is_multiple_of(denominator))
}
```
Review comment: can't you return `values` in this code path? Am I missing something very obvious?