cylf-dev/chonkle

chonkle

A Python host for Wasm codec pipelines. Pipelines are directed acyclic graphs (DAGs) of codec steps defined in JSON. The orchestrator parses the DAG, validates wiring against codec signatures, and executes the pipeline via Wasmtime.
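As a rough illustration of the parse-and-validate step, the sketch below loads a small pipeline from JSON, checks that every referenced input is a declared step, and derives an execution order with the standard library's `graphlib`. The JSON layout (`steps`, `codec`, `inputs`) and the step names are hypothetical, chosen only for this example; chonkle's actual pipeline schema may differ.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical pipeline layout: each step names a codec and the steps it reads from.
PIPELINE_JSON = """
{
  "steps": {
    "unzip":     {"codec": "zlib",    "inputs": []},
    "unshuffle": {"codec": "shuffle", "inputs": ["unzip"]},
    "cast":      {"codec": "astype",  "inputs": ["unshuffle"]}
  }
}
"""


def execution_order(pipeline_text: str) -> list[str]:
    """Return step names in an order that respects every dependency."""
    steps = json.loads(pipeline_text)["steps"]
    # Wiring check: every referenced input must be a declared step.
    for name, step in steps.items():
        for dep in step["inputs"]:
            if dep not in steps:
                raise ValueError(f"step {name!r} reads from unknown step {dep!r}")
    graph = {name: set(step["inputs"]) for name, step in steps.items()}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PIPELINE_JSON)
print(order)
```

A real orchestrator would also compare each edge against the codec signatures; this sketch only covers the graph structure.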

Status: proof of concept.

Codec backends

chonkle supports three codec backends. Each implements the same Codec ABC (call(direction, port_map) and signature()), so backends can be mixed freely within a single pipeline.
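To make the shared interface concrete, here is a minimal sketch of what such an ABC could look like. Only the method names `call(direction, port_map)` and `signature()` come from the text above; the argument and return types, the `ReverseCodec` toy class, and the `run_chain` helper are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Codec(ABC):
    """Assumed shape of the shared interface; the types here are guesses."""

    @abstractmethod
    def call(self, direction: str, port_map: dict[str, bytes]) -> dict[str, bytes]:
        """Run the codec in the given direction on named input ports."""

    @abstractmethod
    def signature(self) -> dict:
        """Describe the codec's input and output ports."""


class ReverseCodec(Codec):
    """Toy codec that reverses bytes; it is its own inverse, so direction is ignored."""

    def call(self, direction, port_map):
        return {"bytes": port_map["bytes"][::-1]}

    def signature(self):
        return {"inputs": ["bytes"], "outputs": ["bytes"]}


def run_chain(codecs, direction, data):
    """Feed one step's output port map into the next, whatever the backend."""
    port_map = {"bytes": data}
    for codec in codecs:
        port_map = codec.call(direction, port_map)
    return port_map["bytes"]


result = run_chain([ReverseCodec(), ReverseCodec()], "decode", b"chunk")
```

Because the caller only sees `call` and `signature`, a Wasm-backed and a native codec could sit next to each other in the same chain.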

Component Model Wasm: .wasm components implementing the chonkle:codec/transform@0.1.0 WIT interface. Any language with a Component Model toolchain (Rust, C, Python via componentize-py) can produce a conforming component. Data transfer uses the canonical ABI. The Wasmtime sandbox isolates each component from the host.

Core Wasm: wasm32-wasi reactor modules using a binary port-map wire format via Memory.read/Memory.write. When two consecutive pipeline steps are both Core Wasm, data moves between their linear memories with ctypes.memmove (a single copy, with no serialization round-trip).
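The single-copy idea can be sketched with plain `ctypes`. In the example below two `bytearray` objects stand in for the linear memories of two Wasm instances; the `copy_between` helper and the offsets are hypothetical, and how chonkle actually obtains pointers into Wasmtime memory is not shown here.

```python
import ctypes


def copy_between(src: bytearray, src_off: int, dst: bytearray, dst_off: int, n: int) -> None:
    """Copy n bytes from src[src_off:] to dst[dst_off:] with a single memmove."""
    if src_off < 0 or dst_off < 0 or n < 0:
        raise ValueError("offsets and length must be non-negative")
    if src_off + n > len(src) or dst_off + n > len(dst):
        raise ValueError("copy would run past the end of a buffer")
    # from_buffer() creates views that share memory with the bytearrays (no copy).
    src_view = (ctypes.c_ubyte * n).from_buffer(src, src_off)
    dst_view = (ctypes.c_ubyte * n).from_buffer(dst, dst_off)
    # The only copy: bytes go straight from one buffer to the other.
    ctypes.memmove(dst_view, src_view, n)


# Stand-ins for the producer's and consumer's linear memories.
producer_memory = bytearray(b"\x00\x00\x00\x00payload")
consumer_memory = bytearray(16)
copy_between(producer_memory, 4, consumer_memory, 2, 7)
```

The point of the pattern is that the payload never passes through an intermediate Python `bytes` object.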

Native (numcodecs): Python codecs from the numcodecs library. No Wasm overhead. numcodecs and numpy are optional dependencies, imported lazily. Adding a new numcodecs codec requires only adding a signature file.
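Lazy import of an optional dependency usually follows a pattern like the one below. This is a generic sketch, not chonkle's code: `load_optional` and `native_available` are hypothetical helper names.

```python
import importlib


def load_optional(module_name: str):
    """Import a module on first use; return None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def native_available() -> bool:
    """Report whether both optional dependencies of the native backend import."""
    return load_optional("numcodecs") is not None and load_optional("numpy") is not None


print("native backend available:", native_available())
```

Deferring the import this way keeps the Wasm-only install working when numcodecs and numpy are absent.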

The Resolver selects among available implementations using a configurable backend preference list. The default preference order is ["native", "core", "component"].
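Selection by preference list can be pictured as a first-match scan. The sketch below assumes the resolver knows which backends have an implementation for a given codec; the `pick_backend` function is a hypothetical stand-in, and only the default order is taken from the text above.

```python
DEFAULT_PREFERENCE = ("native", "core", "component")


def pick_backend(available: set[str], preference=DEFAULT_PREFERENCE) -> str:
    """Return the first backend in the preference order that is available."""
    for backend in preference:
        if backend in available:
            return backend
    raise LookupError(f"no backend in {list(preference)} is available")


# A codec with only Wasm builds falls through "native" to "core".
chosen_default = pick_backend({"core", "component"})

# A custom order, as passed with --preference on the CLI, changes the outcome.
chosen_custom = pick_backend({"native", "component"}, ["core", "component", "native"])
```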

Usage

CLI

# Run a pipeline
chonkle run pipeline.json --input bytes=chunk.bin --output bytes=out.bin

# With resolver options
chonkle run pipeline.json --input bytes=chunk.bin \
  --direction decode \
  --codec-store ./codec/ \
  --preference core,component,native \
  --override zlib=zlib-rs \
  --source zlib=https://example.com/zlib.wasm

# List installed codecs
chonkle codecs

# Show details for a specific codec
chonkle codecs zlib

# Embed a signature into a .wasm binary (build-time tool)
chonkle embed-signature codec.wasm signature.json

Python API

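from pathlib import Path

from chonkle.pipeline import prepare
from chonkle.executor import run

# Read one encoded chunk from disk (the same chunk.bin used in the CLI examples)
chunk_bytes = Path("chunk.bin").read_bytes()

prepared = prepare("pipeline.json", direction="decode")
outputs = run(prepared, {"bytes": chunk_bytes})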

Format drivers

The executor is format-agnostic. It accepts a pipeline DAG and chunk data, runs the codecs, and returns the result. It has no knowledge of Zarr, Parquet, COG, ORC, or any other file format.

A format driver is the layer above the executor that bridges a specific file format and the pipeline executor. It reads format-specific metadata, translates it into a pipeline DAG, supplies metadata-derived inputs, and manages chunk I/O. Format drivers are outside the scope of this repository.

Documentation

See docs/README.md for the full index.

Development

  • Package manager: uv
  • Build backend: hatchling
  • Python: >= 3.13
  • Linting/formatting: ruff
  • Type checking: mypy
  • Testing: pytest
  • Pre-commit: ruff check, ruff format, mypy, yaml/toml validation
  • CI: GitHub Actions (lint on 3.14, test on 3.13 and 3.14)

# Install dependencies
uv sync

# Include native (numcodecs) backend
uv sync --extra native

# Run tests
uv run pytest

# Run linter
uv run ruff check

# Network tests (downloads codecs from OCI registries)
uv run pytest --run-network

Acknowledgements

Partially supported by the NASA-IMPACT VEDA project.

About

Codec pipeline library for chunked array data (Zarr, COG) — mix Python codecs with WebAssembly codecs that run at near-native speed in a portable sandbox.
