feat: integrate Hugging Face Hub functionality with dataset push/pull #127
base: main
Conversation
… capabilities

- Added `HuggingFaceHubMixin` to facilitate pushing and pulling datasets to/from Hugging Face Hub.
- Implemented `pull_from_hub` method in `DatasetCreationResults` for loading datasets and artifacts from the hub.
- Created `HubDatasetResults` class to encapsulate results from hub operations, including datasets, analysis, and metadata.
- Developed integration tests for verifying push and pull operations with Hugging Face Hub.
- Introduced dataset card generation for datasets pushed to the hub, enhancing documentation and usability.
All contributors have signed the DCO ✍️ ✅

I have read the DCO document and I hereby sign the DCO.
- Introduced `_sanitize_metadata_file_paths` method to convert local file paths in metadata to remote paths suitable for Hugging Face Hub.
- Updated existing metadata handling to utilize the new sanitization method, ensuring consistent file path formatting.
- Enhanced comments for clarity on metadata processing and artifact uploading.
nabinchha left a comment
@davidberenstein1957 Thank you for your contribution in opening Data Designer's first non-trivial PR from the community! 🎉 This is exactly the kind of feature that will make Data Designer more useful for everyone, and I really appreciate the effort you've put into this.
I left a few comments in the PR with some specific feedback. The core functionality you've implemented is looking great, and I can see you've thought through the user experience. Before we merge this in, I'd love for us to collaborate on refining the implementation to align with Data Designer's architecture patterns. This will help ensure the integration is maintainable long-term and sets us up well for future third-party integrations.
Let's work through this together! I've broken down my thoughts into three areas: Code Architecture, Code Best Practices, and the Feature PR Process.
Code Architecture
- Anticipating possible additional third-party integrations in the future, let's create a dedicated space within the repo at `src/data_designer/integrations`. This specific integration can live in `src/data_designer/integrations/huggingface`.
- Let's make sure we build components in Data Designer as loosely coupled as possible. For this HuggingFace integration, instead of using the mixin pattern to add push/pull-to-HuggingFace functionality to `DatasetCreationResults`, let's create a separate class (maybe named `HuggingFaceHubClient`?) to encapsulate that functionality. `DatasetCreationResults` can then simply compose over that object to provide the same functionality without any possibility of side-effects due to collisions through multiple inheritance as the class evolves over time (a rough sketch follows this list).
- The approach above will keep the HuggingFace integration scoped out in a separate module that can be tested in isolation without necessarily worrying about how it integrates with the rest of the package. This means everything in `src/data_designer/integrations/huggingface` should have an equivalent `tests/integrations/huggingface` test module for unit tests.
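To make the composition idea concrete, here is a minimal sketch; the class and method names are illustrative placeholders, not the actual Data Designer API:

```python
class HuggingFaceHubClient:
    """Owns all Hugging Face Hub interactions (push, pull, card generation)."""

    def push_dataset(self, repo_id: str, dataset_path: str) -> None:
        ...  # upload parquet files, dataset card, and other artifacts

    def pull_dataset(self, repo_id: str) -> str:
        ...  # download dataset and artifacts, return the local path


class DatasetCreationResults:
    def __init__(self, dataset_path: str, hub_client: HuggingFaceHubClient | None = None):
        self._dataset_path = dataset_path
        # Composition: hold a client instead of inheriting hub behavior via a mixin.
        self._hub_client = hub_client or HuggingFaceHubClient()

    def push_to_hub(self, repo_id: str) -> None:
        # Delegate to the client; no risk of method collisions from multiple inheritance.
        self._hub_client.push_dataset(repo_id, self._dataset_path)
```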
Code Best Practices
- Prefer smaller, scoped methods over long implementations. This not only helps with readability, but also with testability and maintainability. Smaller methods are easier to understand, debug, and modify. They can be tested independently with focused unit tests, and when bugs arise, they're easier to isolate and fix.
- Write unit tests before integration tests. Both are necessary, but there should be more unit tests than integration tests. Ideally, all public surfaces exposed by a module/class/method should have unit tests. Unit tests are faster to run, easier to debug when they fail, and help ensure each component works correctly in isolation before testing how they work together.
- Import dependencies at the top of the file. Unless it is necessary for performance reasons (e.g., expensive imports that are rarely used), we should import dependencies at the top. This makes dependencies explicit and easier to track, improves code readability, and aligns with Python conventions and the project's style guidelines.
- Work off of concrete types instead of dictionary objects. Using concrete types (classes, dataclasses, Pydantic models) instead of dictionaries provides better type safety, IDE autocomplete, and clearer APIs. It catches errors at development time rather than runtime and makes the code self-documenting (see the small sketch after this list).
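As a small illustration of the last point (the type below is hypothetical, not something in this PR):

```python
from dataclasses import dataclass


@dataclass
class HubPushResult:
    """Hypothetical result type; a typed object instead of a raw dict."""

    repo_id: str
    commit_hash: str
    num_records: int


def summarize(result: HubPushResult) -> str:
    # Attribute access is explicit and IDE/type-checker friendly,
    # unlike result["num_records"] on an untyped dict.
    return f"Pushed {result.num_records} records to {result.repo_id} ({result.commit_hash[:7]})"
```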
Feature PR Process
Given the scope of changes, for this particular PR (and possibly future PRs of the same nature), I'd highly recommend breaking the larger problem into smaller pieces we can execute on and deliver incrementally. This is a common practice for large features—it makes review easier, allows us to iterate on design, and means we can start shipping value sooner. We can make use of GitHub Discussions to hash out the initial design before execution. For this PR, it might be good to break it up into several smaller PRs we can merge into Data Designer:
- PR for the HuggingFace dataset card: Just the class, the template, and unit tests.
- PR for the new class that encapsulates this integration (name TBD): This may include several classes and module methods, along with unit tests.
- PR to make use of the integration class above in `DatasetCreationResults`: Should include unit tests assuming the integration class is fully tested.
I'm happy to help guide this process and answer questions you may have!
src/data_designer/interface/huggingface/dataset_card_template.md
````
### Example {{ idx + 1 }}

```json
{{ sample_records[idx] | tojson(indent=2) }}
```
````
Not sure if this can be handled differently, as I think that part of the value lies in the exact and complete column names. We can go overboard with formatting specification, but I'd be happy to hear if you have a concrete suggestion.
- Introduced `HuggingFaceHubClient` for managing dataset interactions with Hugging Face Hub, including pushing and pulling datasets.
- Replaced `HuggingFaceHubMixin` with the new client in `DatasetCreationResults` for improved structure and clarity.
- Added `DataDesignerDatasetCard` class to create custom dataset cards for datasets pushed to the hub.
- Implemented methods for reconstructing dataset creation results from hub data, enhancing usability and integration.
- Updated tests to cover new functionality and ensure robust integration with Hugging Face Hub.
- Simplified the retrieval of the `config_builder` property in the `WithRecordSamplerMixin` class.
- Removed the try-except block for accessing `config_builder`, directly using the property for improved clarity and efficiency.
Hi @nabinchha, thanks for the review comments. I agree with most of them, although I won't have the time or energy to wrap up three separate PRs during/before the holiday period, as I will be taking some vacation during and immediately after. If you prefer to split it up, I expect to have more time again around the end of January.
…egration files

- Reformatted dictionary comprehension for better readability in `HuggingFaceHubClient`.
- Consolidated multiple lines into a single line for downloading processor datasets and artifacts.
- Streamlined logging message formatting in `_load_column_config` function.
- Removed unnecessary blank lines in various files to enhance code cleanliness.
Very nice to see this integration, and especially happy to see the auto-generated dataset card being included! One suggestion/question to see if this could work better for large datasets (and leverage the `datasets` library). The current implementation goes through pandas for both push and pull:

```python
# Push (lines 177-178)
dataset_df = self._dataset_provider.load_dataset()
hf_dataset = Dataset.from_pandas(dataset_df)

# Pull (line 760)
dataset_df = hf_dataset.to_pandas()
```

For large synthetic datasets, this could OOM since it requires loading the entire dataset into memory? Do you think there is an alternative approach? Since the data is already persisted as parquet, something like this could avoid the round trip:

```python
# Push - read parquet directly into Arrow, avoiding pandas
from datasets import Dataset

parquet_path = artifact_storage.final_dataset_path
hf_dataset = Dataset.from_parquet(str(parquet_path / "*.parquet"))
hf_dataset.push_to_hub(repo_id, token=token)
```

You can still update the card, etc. This would also give you the benefits of the `datasets` library.

For pull, you could consider returning the `Dataset` directly rather than converting to pandas. I'm not familiar enough with the design philosophy of Data Designer, though.
@davanstrien thanks for the catch and the call-out! The current implementation will indeed run into OOM for large datasets. In addition, there's also a discrepancy between the partitions generated by data designer and what's ultimately uploaded to Hugging Face Hub and referenced in the metadata. Here's an example of artifact partitions generated by data designer, as reported in the HF card metadata. When I simulated multiple partitions, they all got coalesced into one and uploaded to HF Hub. +1 to your suggestion above, thank you!
pandas is our current backend abstraction, but it's designed in a way that we can swap it out later. While we use pandas, we work in batches and incrementally persist batch results to disk as parquet partitions, so we shouldn't run into OOM at runtime as long as data designer runs in an environment with a reasonable (4/8 GB?) memory allocation.
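For illustration, a minimal sketch of that batch-then-persist flow (helper name and paths are made up, not the actual implementation):

```python
from collections.abc import Iterable
from pathlib import Path

import pandas as pd


def persist_batches(batches: Iterable[pd.DataFrame], output_dir: Path) -> list[Path]:
    """Write each generated batch to its own parquet partition so the full
    dataset never has to sit in memory at once."""
    output_dir.mkdir(parents=True, exist_ok=True)
    partition_paths = []
    for i, batch_df in enumerate(batches):
        path = output_dir / f"part-{i:05d}.parquet"
        batch_df.to_parquet(path, index=False)  # requires pyarrow or fastparquet
        partition_paths.append(path)
    return partition_paths
```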
@nabinchha sg! In case it is useful, we have some (quite high-level) pointers on integrating data tools with the Hub here: https://huggingface.co/docs/hub/datasets-libraries#integrating-data-libraries-and-tools-with-the-hub. One other option to throw into the mix is to use non-blocking uploads via
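Assuming the non-blocking option referred to is `huggingface_hub`'s `run_as_future` flag, a rough sketch could look like this (repo id and folder are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()

# Schedule the upload in a background thread; returns a concurrent.futures.Future.
future = api.upload_folder(
    repo_id="org/my-synthetic-dataset",  # placeholder repo
    repo_type="dataset",
    folder_path="artifacts/",            # placeholder local folder
    run_as_future=True,
)

# ... keep doing other work while the upload runs ...

commit_info = future.result()  # block only when the result is actually needed
```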
@davanstrien, thanks for the pointer! It looks like we can simplify this integration significantly by leveraging the Hugging Face Hub more directly. Here is what I am proposing:

Concretely, here's an example artifact folder with the main dataset in it. Here's a screenshot of how this could live on Hugging Face as a dataset with subsets.

@davidberenstein1957, what do you think about this simplified approach? Would you be open to limiting the scope of this PR to just adding the dataset card (template and class)? The template you added can serve as the foundation for building the remaining pieces incrementally. We are happy to open follow-up PRs to contribute to this effort shortly afterward if you expect to be busy through the end of January.
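A rough sketch of what the subsets layout could look like on the push side, assuming the artifacts can also be expressed as parquet (repo id and paths are placeholders):

```python
from datasets import Dataset

repo_id = "org/my-synthetic-dataset"  # placeholder

# Main synthetic dataset as the default subset, read straight from parquet.
Dataset.from_parquet("artifacts/dataset/*.parquet").push_to_hub(repo_id, config_name="default")

# Tabular artifacts (e.g. analysis) as additional subsets of the same repo.
Dataset.from_parquet("artifacts/analysis/*.parquet").push_to_hub(repo_id, config_name="analysis")
```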
```markdown
# Dataset Card

This dataset was generated using **NeMo Data Designer**, a comprehensive framework for creating high-quality synthetic datasets from scratch or using seed data.

## About NeMo Data Designer
```
@kirit93 brought up a good point about making some room for a custom user description that folks can provide about the dataset... perhaps when they call `<>.push_to_hub(repo_id, description)`?
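A hypothetical sketch of how a user-provided description could flow into the card template (the parameter and template variable names are assumptions, not the PR's actual API):

```python
from jinja2 import Template

# Hypothetical fragment of the card template with a slot for a user description.
CARD_TEMPLATE = Template(
    "# Dataset Card\n\n"
    "{{ description }}\n\n"
    "This dataset was generated using **NeMo Data Designer** and pushed to `{{ repo_id }}`."
)


def render_card(repo_id: str, description: str | None = None) -> str:
    # Fall back to a generic blurb when the caller does not supply a description.
    return CARD_TEMPLATE.render(
        repo_id=repo_id,
        description=description or "A synthetic dataset.",
    )
```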
```python
from data_designer.interface.results import DatasetCreationResults

# Load dataset with all artifacts (analysis, configs, etc.)
results = DatasetCreationResults.pull_from_hub("{{ repo_id }}")

# Access the dataset
df = results.load_dataset()

# Access the analysis
analysis = results.load_analysis()

# Access the config builder
config_builder = results.config_builder
```
Based on my suggestion to focus on `push_to_hub`, let's leave this bit out for now. WDYT?
````
## Metadata

```json
{{ metadata | tojson(indent=2) }}
```
````
This will be in a different PR, but in the metadata that gets rendered, let's make sure it is sanitized for Hugging Face. For example, `file_paths` in the JSON show the local paths to the files on the file system where this SDG pipeline ran.
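For illustration, a minimal sketch of that sanitization step (the helper name and metadata shape are assumptions, not the PR's actual implementation):

```python
from pathlib import PurePath


def sanitize_file_paths(metadata: dict, local_root: str) -> dict:
    """Replace absolute local paths with repo-relative paths so the rendered
    card never leaks the filesystem layout of the machine that ran the pipeline."""
    sanitized = dict(metadata)
    for key, value in metadata.items():
        if isinstance(value, str) and value.startswith(local_root):
            sanitized[key] = str(PurePath(value).relative_to(local_root))
    return sanitized


# Example (assuming local_root="/home/user/run"):
# {"file_paths": "/home/user/run/artifacts/data.parquet"} -> {"file_paths": "artifacts/data.parquet"}
```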






Closes #7
- `HuggingFaceHubClient` to facilitate pushing and pulling datasets to/from Hugging Face Hub.
- `pull_from_hub` method in `DatasetCreationResults` for loading datasets and artifacts from the hub.
- `HubDatasetResults` class to encapsulate results from hub operations, including datasets, analysis, and metadata.

Example dataset