1 change: 1 addition & 0 deletions Cargo.lock

1 change: 1 addition & 0 deletions encodings/fsst/Cargo.toml
@@ -33,6 +33,7 @@ _test-harness = ["dep:rand", "vortex-array/_test-harness"]
divan = { workspace = true }
rand = { workspace = true }
rstest = { workspace = true }
test-with = { workspace = true }
vortex-array = { workspace = true, features = ["_test-harness"] }

[[bench]]
2 changes: 2 additions & 0 deletions encodings/fsst/src/lib.rs
@@ -24,6 +24,8 @@ mod slice;
pub mod test_utils;
#[cfg(test)]
mod tests;
#[cfg(test)]
mod tests_large;

pub use array::*;
pub use compress::*;
74 changes: 74 additions & 0 deletions encodings/fsst/src/tests_large.rs
@@ -0,0 +1,74 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors

//! Stress regression tests for FSST compression at the i32-offset boundary.
//!
//! Gated to CI runs (collected but skipped when `CI` is unset; opt out with
//! `VORTEX_SKIP_SLOW_TESTS=1`) because of the multi-GiB memory footprint.

use rand::SeedableRng;
use rand::rngs::StdRng;
use rand::seq::IndexedRandom;
use vortex_array::LEGACY_SESSION;
use vortex_array::VortexSessionExecute;
use vortex_array::arrays::varbin::builder::VarBinBuilder;
use vortex_array::dtype::DType;
use vortex_array::dtype::Nullability;

use crate::fsst_compress;
use crate::fsst_train_compressor;

/// Regression for #7833: `fsst_compress` must accept inputs whose cumulative
/// compressed bytes exceed `i32::MAX`. Today this panics in
/// `vortex-array/src/arrays/varbin/builder.rs:62` because `fsst_compress_iter`
/// (`encodings/fsst/src/compress.rs:72`) hardcodes `VarBinBuilder::<i32>` for
/// the FSST output buffer regardless of input size.
///
/// The input is built with `VarBinBuilder::<i64>` to confirm that widening the
/// input alone does not help — the overflow is on the FSST output side.
///
/// `#[should_panic]` captures today's behavior; when the underlying bug is
/// fixed, drop the `#[should_panic]` so the trailing `assert_eq!` becomes the
/// regression assertion.
///
/// Allocates ~2.5 GiB for the input plus ~2.5 GiB for the FSST output.
#[test_with::env(CI)]
#[test_with::no_env(VORTEX_SKIP_SLOW_TESTS)]
#[should_panic(expected = "to offset of type i32")]
fn fsst_compress_offsets_overflow_i32() {
    // High-entropy ASCII strings sliced from a random pool. FSST is a
    // symbol-table compressor; pseudo-random data with no recurring byte
    // sequences resists compression, so the compressed output stays close
    // to input size and crosses the i32 boundary.
    const STRING_LEN: usize = 64 * 1024;
    const TOTAL_BYTES: usize = (1usize << 31) + (512 << 20); // ~2.5 GiB
    const N: usize = TOTAL_BYTES / STRING_LEN;
    const POOL_LEN: usize = 64 * 1024 * 1024;

    // Printable ASCII alphabet so the result is valid UTF-8.
    const ALPHABET: &[u8; 95] =
        b" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~";

    let mut rng = StdRng::seed_from_u64(0xC0DE_C011_B711);
    let pool: Vec<u8> = (0..POOL_LEN)
        .map(|_| *ALPHABET.choose(&mut rng).unwrap())
        .collect();
Comment on lines +53 to +55
Contributor
Do you need this, vs. using a single char (so the test runs faster)?

Author
Tried first: FSST collapses repetitive bytes, so the output never crosses 2 GiB.
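
A back-of-the-envelope check of that reply, as a sketch rather than a measurement: FSST replaces symbols of up to 8 bytes with 1-byte codes, so a single repeated character can shrink by up to 8x. The helper name `best_case_compressed_bytes` is hypothetical, and the bound assumes the trained table actually contains an 8-byte symbol for the repeated character.

```rust
/// Longest symbol a single FSST code can stand for, in bytes.
const MAX_SYMBOL_LEN: u64 = 8;

/// Best-case compressed size, assuming every 1-byte code expands to a full
/// 8-byte symbol (hypothetical helper, not part of the `vortex` crates).
fn best_case_compressed_bytes(input_bytes: u64) -> u64 {
    input_bytes.div_ceil(MAX_SYMBOL_LEN)
}

/// True if the cumulative compressed size would no longer fit in an `i32` offset.
fn crosses_i32_boundary(compressed_bytes: u64) -> bool {
    compressed_bytes > i32::MAX as u64
}
```

For the test's `TOTAL_BYTES` (2 GiB + 512 MiB), the best case is 320 MiB, far below `i32::MAX`, which is consistent with a single-character input never reaching the panicking path.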


    let mut builder = VarBinBuilder::<i64>::with_capacity(N);
    for i in 0..N {
        let off = (i.wrapping_mul(31337)) % (POOL_LEN - STRING_LEN);
        builder.append_value(&pool[off..off + STRING_LEN]);
    }
    let array = builder.finish(DType::Utf8(Nullability::NonNullable));

    let compressor = fsst_train_compressor(&array);
    let len = array.len();
    let dtype = array.dtype().clone();
    let mut ctx = LEGACY_SESSION.create_execution_ctx();
Comment on lines +57 to +67
Contributor
could we directly build the fsst array to save some time?

Author
Bug is in fsst_compress_iter (compress.rs:72); constructing the array directly skips the panicking path.


    // Pre-fix: panics in `VarBinBuilder::<i32>::append_value` once cumulative
    // compressed bytes pass `i32::MAX`. Post-fix: must succeed with the row
    // count preserved.
    let compressed = fsst_compress(array, len, &dtype, &compressor, &mut ctx);
    assert_eq!(compressed.len(), len);
}
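
The overflow this test targets can be reproduced without allocating any data. The following is a minimal std-only sketch, not the `VarBinBuilder` implementation: `push_offset_i32` and `first_overflowing_row` are hypothetical names, and it assumes each compressed row is about as large as its 64 KiB input (the incompressible case the test constructs).

```rust
/// Appends a row of `len` bytes to a list of running `i32` end offsets,
/// mimicking a builder with `i32` offsets. Returns `None` when the new end
/// offset does not fit in `i32`. (Hypothetical helper for illustration.)
fn push_offset_i32(offsets: &mut Vec<i32>, len: usize) -> Option<()> {
    let last = i64::from(*offsets.last()?);
    let next = last.checked_add(i64::try_from(len).ok()?)?;
    offsets.push(i32::try_from(next).ok()?);
    Some(())
}

/// Index of the first row whose end offset no longer fits in `i32`, for `n`
/// rows of `row_len` bytes each. Only offsets are tracked, no row data.
fn first_overflowing_row(n: usize, row_len: usize) -> Option<usize> {
    let mut offsets = vec![0i32];
    (0..n).find(|_| push_offset_i32(&mut offsets, row_len).is_none())
}
```

With the test's constants (`N = 40960` rows of `64 * 1024` bytes), the first row that fails is index 32767, where the end offset would reach exactly 2^31. Widening the offset type to `i64` in this sketch removes the failure, which mirrors why widening only the input builder does not help when the output side stays `i32`.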