feat: add lance_dataset_drop_columns for metadata-only column removal #42

Open
LuciferYang wants to merge 1 commit into lance-format:main from LuciferYang:feat/dataset-drop-columns

@LuciferYang
Contributor

Summary

First of three PRs against #41 (schema evolution). Exposes upstream's drop_columns — a metadata-only manifest commit that removes the named columns from the schema without rewriting any data files. Materializing the projection is left to a later _compact_files (and a future cleanup operation, once exposed, removes the old version's files).

Mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-drop snapshot view via the existing Arc clone-on-write, same as _delete / _update / _compact_files.
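As a rough C++ analogue of the snapshot behaviour described above (not lance-c's actual Rust code), the sketch below uses std::shared_ptr in place of Arc. The Dataset class, the Schema alias, and both method names are hypothetical. Unlike Rust's Arc::make_mut, which clones only when other references exist, this version always copies before mutating, which keeps the example short.

```cpp
#include <algorithm>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in: a schema is just an ordered list of column names.
using Schema = std::vector<std::string>;

class Dataset {
public:
    explicit Dataset(Schema s)
        : schema_(std::make_shared<const Schema>(std::move(s))) {}

    // A scanner takes a reference-counted snapshot; later drops cannot
    // change what this pointer refers to.
    std::shared_ptr<const Schema> snapshot() const {
        std::lock_guard<std::mutex> lock(mu_);
        return schema_;
    }

    // The writer holds the lock exclusively, copies the schema, edits the
    // copy, and swaps the pointer. Existing snapshots keep the old copy.
    void drop_column(const std::string& name) {
        std::lock_guard<std::mutex> lock(mu_);
        Schema next(*schema_);
        next.erase(std::remove(next.begin(), next.end(), name), next.end());
        schema_ = std::make_shared<const Schema>(std::move(next));
    }

private:
    mutable std::mutex mu_;
    std::shared_ptr<const Schema> schema_;
};
```

A snapshot taken before drop_column still lists every original column, while a snapshot taken afterwards reflects the drop.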

Surface

int32_t lance_dataset_drop_columns(
    LanceDataset* dataset,
    const char* const* columns,
    size_t num_columns
);

Inputs are validated up front with per-index error messages so the precise cause is observable from lance_last_error_message(). NULL handle, NULL pointer array, zero count, NULL or empty-string entries, and non-UTF-8 names all return LANCE_ERR_INVALID_ARGUMENT; upstream's own rejections (unknown column, attempt to drop every column) map to the same code.
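The up-front validation can be pictured with the self-contained sketch below. It is an illustration under assumptions, not the PR's Rust implementation. FakeDataset, kInvalidArgument, g_last_error, and validate_drop_columns_args are hypothetical stand-ins, and the numeric error value is made up. Only the two messages "num_columns must be > 0" and "columns must not be NULL" come from this PR; the other message texts are invented for illustration. The zero-count check is placed before the NULL-array check because the wrapper note in this PR implies that ordering. The non-UTF-8 check is omitted for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real code value and handle type are not shown here.
constexpr int32_t kOk = 0;
constexpr int32_t kInvalidArgument = 1;  // stand-in for LANCE_ERR_INVALID_ARGUMENT
struct FakeDataset {};                   // stand-in for the opaque LanceDataset

// Stand-in for the storage behind lance_last_error_message().
static std::string g_last_error;

static int32_t fail(std::string message) {
    g_last_error = std::move(message);
    return kInvalidArgument;
}

// Checks arguments in a fixed order and records which one was rejected.
int32_t validate_drop_columns_args(const FakeDataset* dataset,
                                   const char* const* columns,
                                   std::size_t num_columns) {
    if (dataset == nullptr) return fail("dataset must not be NULL");
    if (num_columns == 0) return fail("num_columns must be > 0");
    if (columns == nullptr) return fail("columns must not be NULL");
    for (std::size_t i = 0; i < num_columns; ++i) {
        const std::string index = "columns[" + std::to_string(i) + "]";
        if (columns[i] == nullptr) return fail(index + " must not be NULL");
        if (columns[i][0] == '\0') return fail(index + " must not be empty");
    }
    g_last_error.clear();
    return kOk;
}
```

Every rejection returns the same code, so the recorded message is the only way a caller can tell which argument, and which index, was at fault.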

The C++ wrapper takes const std::vector<std::string>& and, following the update / merge_insert sibling convention, passes col_ptrs.data() unconditionally. An empty vector therefore reaches the Rust-side num_columns == 0 guard, so the error message says "num_columns must be > 0" rather than the misleading "columns must not be NULL".
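A minimal sketch of that wrapper shape follows, assuming a free function that returns the C status code; the real wrapper's name, class, and error handling are not shown in this PR. stub_drop_columns is a stand-in for lance_dataset_drop_columns so the example compiles on its own, and it mimics only the two guards relevant here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct LanceDataset {};  // stand-in; the real handle is opaque

// Stand-in for the message behind lance_last_error_message().
static const char* g_last_message = "";

// Stand-in for the C entry point: zero count is checked before the pointer.
static int32_t stub_drop_columns(LanceDataset* /*dataset*/,
                                 const char* const* columns,
                                 std::size_t num_columns) {
    if (num_columns == 0) { g_last_message = "num_columns must be > 0"; return 1; }
    if (columns == nullptr) { g_last_message = "columns must not be NULL"; return 1; }
    g_last_message = "";
    return 0;
}

// Hypothetical wrapper: converts owned strings to a C pointer array.
int32_t drop_columns(LanceDataset* dataset,
                     const std::vector<std::string>& columns) {
    std::vector<const char*> col_ptrs;
    col_ptrs.reserve(columns.size());
    for (const std::string& name : columns) {
        col_ptrs.push_back(name.c_str());  // valid while `columns` is alive
    }
    // Passed unconditionally: for an empty vector data() may or may not be
    // null, and the callee's zero-count guard fires first in either case.
    return stub_drop_columns(dataset, col_ptrs.data(), col_ptrs.size());
}
```

Because the wrapper does not special-case the empty vector, the caller sees the zero-count message whether or not data() happens to return a null pointer.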

Tests

Eleven new Rust integration tests covering single-drop, multi-drop, version bump, data preservation (downcasts the surviving Arrow columns and checks the actual values, not just shape), and the full rejection surface (NULL dataset / NULL array / zero count / NULL entry / empty-string entry / unknown column / drop-all). C and C++ smoke tests snapshot ArrowSchema.n_children pre/post drop, exercise the drop-last-column rejection path, and verify the version is unchanged when a drop fails. cargo test and cargo test --test compile_and_run_test -- --ignored both green.

Follow-ups

  • lance_dataset_alter_columns — rename / nullability / type change
  • lance_dataset_add_columns — SQL expressions / AllNulls / ArrowArrayStream

The README roadmap entry stays unticked until all three ship.

First of three PRs covering the schema-evolution roadmap entry. Exposes
upstream's `drop_columns` — a metadata-only manifest commit that removes
the named columns from the schema without rewriting data files.
@LuciferYang
Contributor Author

The failure in the macOS arm64 leg of consumer-smoke-test here is pre-existing on main (failing since #24, with the same unresolved _IO* symbols from sysinfo / objc2_io_kit) and is not introduced by this PR. I sent #43 as a focused fix that declares -framework IOKit in the CMake / pkg-config link line; once that merges, a rerun here should go green.

jja725 pushed a commit that referenced this pull request May 23, 2026
## Summary

The macOS arm64 consumer-smoke-test job has been failing on `main` since
#24 with a long list of unresolved `_IO*` symbols (`_IOObjectRelease`,
`_IOServiceMatching`, `_IOHIDEventSystemClientCreate`,
`_IORegistryEntryCreateCFProperty`, …) — sample run:
https://github.com/lance-format/lance-c/actions/runs/26272649710.

Root cause is plumbing, not the consumer example: `sysinfo` (pulled in
transitively via the lance crates) calls IOKit on macOS for disk
enumeration, CPU frequency, and thermal sensors, and `objc2_io_kit`
declares the binding. Cargo's `rustc-link-lib=framework=IOKit` is
honored when this repo builds, but a downstream consumer linking against
the installed `liblance_c.a` via `find_package(LanceC)` (or pkg-config)
only sees the frameworks we declare in our config files — and IOKit was
missing.

Add `-framework IOKit` next to the existing `CoreFoundation` /
`Security` / `SystemConfiguration` entries in all three mirroring
places:

- `CMakeLists.txt` — build-tree `LanceC_platform_deps` interface library
- `cmake/LanceCConfig.cmake.in` — installed `find_package(LanceC)`
consumers
- `CMakeLists.txt` — pkg-config `Libs.private`

## Verification

Same `cmake --install` → `examples/cmake-consumer` build path the CI
runs, on arm64 macOS (15.0 SDK, AppleClang 17):

```
$ cmake --install build --prefix _install
$ cmake -S examples/cmake-consumer -B consumer-build -DCMAKE_PREFIX_PATH="$PWD/_install"
$ cmake --build consumer-build
…
[100%] Built target consumer
$ consumer-build/consumer
usage: consumer <dataset_uri>
$ echo $?
2
```

Before the patch the same sequence dies at link with `Undefined symbols
for architecture arm64`. After it, the link succeeds and the binary
exits 2 (usage error) as the CI step expects.

## After this lands

Unblocks the consumer-smoke macOS leg for every open PR — #42
(schema-evolution drop_columns) hits this exact failure on its CI run.