Conversation
Signed-off-by: Nicholas Gates <nick@nickgates.com>
I think you want this idea: vortex/vortex-array/src/arrays/dict/array.rs, line 144 in 74d803b
Signed-off-by: Nicholas Gates <nick@nickgates.com>
# Conflicts:
#   vortex-array/src/arrays/dict/vtable/mod.rs
#   vortex-duckdb/src/exporter/dict.rs
Merging this PR will improve performance by 24.89%

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| ⚡ Simulation | bench_many_codes_few_values[1024] | 397.5 µs | 325.1 µs | +22.27% |
| ⚡ Simulation | bench_many_codes_few_values[2048] | 368.5 µs | 295.9 µs | +24.55% |
| ⚡ Simulation | bench_many_codes_few_values[4096] | 374.3 µs | 301.7 µs | +24.08% |
| ⚡ Simulation | bench_many_nulls[0.5] | 361.8 µs | 317.5 µs | +13.96% |
| ⚡ Simulation | bench_many_nulls[0.9] | 537.1 µs | 456.3 µs | +17.72% |
| ⚡ Simulation | bench_sparse_coverage[0.01] | 366.4 µs | 293.7 µs | +24.75% |
| ⚡ Simulation | bench_sparse_coverage[0.1] | 364.8 µs | 292.1 µs | +24.89% |
| ⚡ Simulation | bench_sparse_coverage[0.5] | 364.8 µs | 292.1 µs | +24.89% |

Comparing ngates/sparse-dict (22be09b) with develop (f3d5f09)
Polar Signals Profiling Results (Latest Run)

Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.005x ➖
datafusion / vortex-file-compressed (1.005x ➖, 0↑ 0↓)

File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.990x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.010x ➖, 0↑ 0↓)
datafusion / parquet (0.982x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (1.040x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.008x ➖, 0↑ 0↓)
duckdb / parquet (0.987x ➖, 1↑ 1↓)
Full attributed analysis
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.009x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.010x ➖, 0↑ 1↓)
datafusion / parquet (0.998x ➖, 2↑ 1↓)
datafusion / arrow (1.018x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.141x ❌, 0↑ 16↓)
duckdb / vortex-compact (1.067x ➖, 0↑ 1↓)
duckdb / parquet (1.098x ➖, 0↑ 12↓)
duckdb / duckdb (1.035x ➖, 0↑ 0↓)
Full attributed analysis
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.

Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.030x ➖, 0↑ 3↓)
datafusion / vortex-compact (1.030x ➖, 2↑ 4↓)
datafusion / parquet (1.020x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.067x ➖, 0↑ 26↓)
duckdb / vortex-compact (1.093x ➖, 0↑ 40↓)
duckdb / parquet (1.019x ➖, 1↑ 5↓)
duckdb / duckdb (1.059x ➖, 0↑ 18↓)
Full attributed analysis
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.

Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.021x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.045x ➖, 0↑ 0↓)
datafusion / parquet (0.984x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.001x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.037x ➖, 0↑ 0↓)
duckdb / parquet (1.017x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.979x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.979x ➖, 1↑ 0↓)
duckdb / parquet (1.007x ➖, 0↑ 0↓)
Full attributed analysis
File Sizes: Statistical and Population Genetics
No file size changes detected.

Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.996x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.011x ➖, 0↑ 0↓)
datafusion / parquet (0.965x ➖, 0↑ 0↓)
datafusion / arrow (0.971x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.986x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.994x ➖, 0↑ 0↓)
duckdb / parquet (1.008x ➖, 0↑ 0↓)
duckdb / duckdb (0.992x ➖, 0↑ 0↓)
Full attributed analysis
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.

Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.019x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.989x ➖, 1↑ 1↓)
datafusion / parquet (1.115x ➖, 0↑ 4↓)
duckdb / vortex-file-compressed (1.056x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.090x ➖, 0↑ 0↓)
duckdb / parquet (1.025x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.075x ➖, 0↑ 13↓)
datafusion / parquet (1.069x ➖, 0↑ 10↓)
duckdb / vortex-file-compressed (1.121x ❌, 0↑ 26↓)
duckdb / parquet (1.038x ➖, 0↑ 3↓)
duckdb / duckdb (1.045x ➖, 0↑ 3↓)
Full attributed analysis
File Sizes: Clickbench on NVME
File size changes (1 file changed, -0.0% overall, 0↑ 1↓)

Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.100x ➖, 0↑ 4↓)
datafusion / vortex-compact (1.011x ➖, 1↑ 0↓)
datafusion / parquet (1.020x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.000x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.016x ➖, 0↑ 0↓)
duckdb / parquet (1.099x ➖, 0↑ 0↓)
Full attributed analysis
The problem is, we now run a rather expensive check on every canonicalize. We need to think about how to constrain this further; it may involve storing the approximate unique count of codes as an array statistic.
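One way to keep that check cheap is a strided-sample lower bound on the distinct-code count, computed once and stored as a statistic. A minimal sketch of the idea in plain Rust over slices; `approx_unique_codes`, the `sample` parameter, and the stride scheme are all hypothetical here, not the PR's implementation:

```rust
use std::collections::HashSet;

/// Cheap lower bound on the number of distinct codes: scan at most `sample`
/// evenly strided positions instead of the whole codes buffer.
/// Hypothetical helper, not an actual Vortex statistic.
fn approx_unique_codes(codes: &[u32], sample: usize) -> usize {
    if codes.len() <= sample {
        // Small inputs are cheap enough to count exactly.
        return codes.iter().collect::<HashSet<_>>().len();
    }
    let stride = codes.len() / sample;
    // Distinct values seen in the strided sample; a lower bound on the true count.
    codes
        .iter()
        .step_by(stride)
        .copied()
        .collect::<HashSet<u32>>()
        .len()
}
```

A canonicalize path could consult such a stored bound and skip the sparse rewrite whenever the bound already exceeds the sparsity threshold, so the full scan never runs on the hot path.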
robert3005 left a comment:

Beyond the exporter changes, which I'm unclear why they're in this PR, I think the rest of the code is fine. I mostly left some questions, since it looks like too much LLM code.
```rust
// TODO(joe): if the range is size 1 replace with a constant array
if let Some(code) = sliced_code.as_opt::<Constant>() {
    let code = code.scalar().as_primitive().as_::<usize>();
    return if let Some(code) = code {
        let values = array.values().slice(code..code + 1)?;
        // SAFETY: the only dictionary value is referenced by every non-null code when the
        // slice is non-empty. An empty code stream cannot reference a non-empty values
        // array.
        let sliced = unsafe {
            DictArray::new_unchecked(
                ConstantArray::new(0u8, sliced_code.len()).into_array(),
                values,
            )
        };
        if sliced_code.is_empty() {
            Ok(Some(sliced.into_array()))
        } else {
            Ok(Some(unsafe {
                sliced.set_all_values_referenced(true).into_array()
            }))
        }
```
Can't you return values in this code path? Am I missing something very obvious?
```rust
if !array.has_all_values_referenced()
// `take(FilterArray(...))` is also represented as a dictionary. Compacting that shape
```
```rust
pub struct ConversionCache {
    pub dict_cache: DashMap<usize, (ArrayRef, ReusableDict)>,
    pub values_cache: DashMap<usize, (ArrayRef, Arc<Mutex<Vector>>)>,
    pub canonical_cache: DashMap<usize, (ArrayRef, Canonical)>,
```
I'm a little bit confused why this would be removed here.
```rust
use crate::executor::ExecutionCtx;
use crate::validity::Validity;

// TODO: Replace this fixed sparse-dictionary threshold with a cost model that accounts for values
```
This should reference an issue number.
```rust
    array: &DictArray,
    ctx: &mut ExecutionCtx,
) -> VortexResult<Option<Canonical>> {
    let codes = array.codes().as_::<Primitive>().into_owned();
```
This looks wrong and will panic in practice.
This will be fine, but it's unclear from here that we have previously canonicalized the codes.
```rust
    ctx: &mut ExecutionCtx,
) -> VortexResult<Option<Canonical>> {
    let codes = array.codes().as_::<Primitive>().into_owned();
    let Some(sparse_codes) = collect_sparse_codes(&codes, array.values().len(), ctx)? else {
```
It's weird that you're passing this by reference here when you made the value owned on the line before.
```rust
// Sampling is only a preflight for cases that are not sparse by row count alone. Keep it away
// from tiny dictionary domains and near-dense slices where the estimator overhead dominates.
values_len >= SPARSE_CANONICALIZE_MIN_SAMPLED_VALUES_LEN
    && codes_len.saturating_mul(SPARSE_CANONICALIZE_SAMPLED_CODES_PER_VALUE_THRESHOLD)
```
Part of me would want to remove the SPARSE_CANONICALIZE prefix so they're different.
```rust
unique_codes: PrimitiveArray::new(Buffer::from_iter(unique_codes), Validity::NonNullable),
remapped_codes: PrimitiveArray::new(Buffer::from_iter(remapped_codes), validity),
```
Why are you copying here?
When the number of unique codes is much, much smaller than the number of values, we should collect the unique codes, filter the values, and remap the codes into the smaller dictionary's value space.
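The compaction described above can be sketched in plain Rust; `compact_dict` and its slice-based signature are illustrative assumptions, not the actual Vortex array types or API:

```rust
use std::collections::HashMap;

/// Sketch of sparse-dictionary compaction: keep only the values some code
/// actually references, and remap every code into the smaller dictionary's
/// value space. Illustrative only.
fn compact_dict(codes: &[usize], values: &[i64]) -> (Vec<usize>, Vec<i64>) {
    let mut remap: HashMap<usize, usize> = HashMap::new();
    let mut compact_values = Vec::new();
    let remapped = codes
        .iter()
        .map(|&code| {
            // The first time we see a code, pull its value into the compact
            // dictionary and assign it the next available slot.
            *remap.entry(code).or_insert_with(|| {
                compact_values.push(values[code]);
                compact_values.len() - 1
            })
        })
        .collect();
    (remapped, compact_values)
}
```

The real implementation would additionally carry validity through the remapped codes, as the `remapped_codes: PrimitiveArray::new(..., validity)` snippet above suggests.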
I don't expect this PR to have any immediate impact, but I have some other changes that propagate filter masks deeper into the tree so they actually hit the DictLayout.