-
Notifications
You must be signed in to change notification settings - Fork 51
feat: Seed dataset plugins #191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
71b2030 to
e78bbe5
Compare
| plugin_manager = PluginManager() | ||
|
|
||
| _SeedSourceT: TypeAlias = LocalFileSeedSource | HuggingFaceSeedSource | DataFrameSeedSource | ||
| _SeedSourceT = plugin_manager.inject_into_seed_source_type_union(_SeedSourceT) | ||
|
|
||
| SeedSourceT = Annotated[_SeedSourceT, Field(discriminator="seed_type")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type alias needs to live in a separate module instead of config/seed_source.py because:
- The config classes for seed dataset plugins are expected to inherit from
SeedSource - Since
configcode is always save to import/load for plugins, we do so as part of pydantic validation of eachPlugininstance to validate the config discriminator is set properly - If the code highlighted here is defined in the same module as
SeedSource, we run into a deadlock: the plugin registry is actively discovering/loading plugins and so has grabbed a lock, and to instantiate and load a seed dataset plugin we need to load the module containing the baseSeedSourceclass, and when doing so we hit thePluginManager()call, which initializes a (singleton) registry, which tries to grab the lock that is already taken.
We didn't see this before* for column generator plugins because the base classes to inherit from are already defined in a different module than the ColumnConfigT type alias union (config.column_configs and config.column_types respectively)
*or maybe we did see this before and these two modules were set up the way they are now for this very reason; I don't know the history
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*or maybe we did see this before and these two modules were set up the way they are now for this very reason; I don't know the history
haha yes indeed that's why we split them up
src/data_designer/plugins/plugin.py
Outdated
|
|
||
| class PluginType(str, Enum): | ||
| COLUMN_GENERATOR = "column-generator" | ||
| SEED_DATASET = "seed-dataset" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should this be SEED_SOURCE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe SEED_READER?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Probably SEED_READER I guess, since the existing plugin type is essentially aligned with the impl class (which makes sense).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed in 7ee33ec
| @echo "🧹 Cleaning e2e test environment..." | ||
| rm -rf e2e_tests/uv.lock e2e_tests/.pycache e2e_tests/.venv | ||
| @echo "🧪 Running e2e tests..." | ||
| uv run --no-cache --refresh --directory e2e_tests pytest -s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need --group dev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe uv run automatically uses the dev group? Seems to be working fine as-is, but I'm not opposed to adding it explicitly
| John,Coltrane | ||
| Miles,Davis | ||
| Bill,Evans |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dream trio?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing bass and drums to fill out the quintet 😂 In all seriousness, I'm happy to change this to anything else if we want to have a "house style" for dummy data in tests like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Astronomy and jazz legends" sounds like a sweet house style to me!
johnnygreco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛸
da1c8d1 to
e1c842c
Compare
Adds support for seed dataset plugins. Plugin authors can register new kinds of seed datasets by defining a plugin with a
SeedSourceand a correspondingSeedReader. Like column generation plugins, seed dataset plugins are automatically wired in so they Just Work with the config builder and the data designer interface. The mechanics are a little different from column generation plugins, since readers are passed in rather than registered in a purely backend registry, but the effect in the library is the same; in NMP, the service will instantiate the readers it can use (HuggingFace + a custom plugin it defines for NMP Files service) and pass them tocreate_resource_provideritself. Note: I think we could consider changing the seed reader mechanics to be more similar to column generators in the future, if we want.Also adds a harness for really high-level e2e tests, with two initial tests added for the two plugin types. These tests prove the plugin registry system works all the way at the level of registering plugins in a
pyproject.toml. I'm not sure how many additional tests we'd want to put at this high a level vs. keep as traditional unit tests, but there's room if we want it, and I really like the confidence it gives for plugins in particular.