
Conversation


@davidberenstein1957 davidberenstein1957 commented Dec 12, 2025

Closes #7

  • Added HuggingFaceHubClient to facilitate pushing and pulling datasets to/from Hugging Face Hub.
  • Implemented pull_from_hub method in DatasetCreationResults for loading datasets and artifacts from the hub.
  • Created HubDatasetResults class to encapsulate results from hub operations, including datasets, analysis, and metadata.
  • Developed integration tests for verifying push and pull operations with Hugging Face Hub.
  • Introduced dataset card generation for datasets pushed to the hub, enhancing documentation and usability.

Example dataset

import os
import sys

from data_designer.config.sampler_params import UniformSamplerParams
from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    SamplerColumnConfig,
    SamplerType,
)


def main() -> None:
    """Create a dataset and push it to Hugging Face Hub."""
    # Check for token
    token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN")
    if not token:
        print(
            "⚠️  Warning: HF_TOKEN or HUGGINGFACE_HUB_TOKEN not set. "
            "The push may fail if you're not logged in to Hugging Face Hub."
        )

    # Create a simple dataset configuration
    config_builder = DataDesignerConfigBuilder()

    # Add a product category
    config_builder.add_column(
        SamplerColumnConfig(
            name="product_category",
            sampler_type=SamplerType.CATEGORY,
            params=CategorySamplerParams(
                values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
            ),
        )
    )

    # Add a simple numeric column
    config_builder.add_column(
        SamplerColumnConfig(
            name="rating",
            sampler_type=SamplerType.UNIFORM,
            params=UniformSamplerParams(low=1, high=5),
        )
    )

    # Initialize DataDesigner and create a dataset
    data_designer = DataDesigner()
    num_records = 1
    print(f"🎨 Creating dataset with {num_records} records...")
    results = data_designer.create(config_builder=config_builder, num_records=num_records)

    # Verify dataset was created
    df = results.load_dataset()
    print(f"✅ Dataset created with {len(df)} records and {len(df.columns)} columns")

    # Push to Hugging Face Hub
    repo_id = "davidberenstein1957/datadesigner-test"
    print(f"🚀 Pushing dataset to {repo_id}...")
    try:
        results.push_to_hub(repo_id, token=token, generate_card=True)
        print(f"✅ Successfully pushed dataset to {repo_id}!")
        print(f"📊 View it at: https://huggingface.co/datasets/{repo_id}")

        # Demonstrate loading the dataset back from the hub
        print(f"\n🔄 Loading dataset back from {repo_id}...")
        from data_designer.interface.results import DatasetCreationResults

        loaded_results = DatasetCreationResults.pull_from_hub(repo_id, token=token)
        loaded_df = loaded_results.load_dataset()
        loaded_analysis = loaded_results.load_analysis()

        print("✅ Successfully loaded dataset from hub!")
        print(f"   - Records: {len(loaded_df)}")
        print(f"   - Columns: {len(loaded_df.columns)}")
        print(f"   - Analysis available: {loaded_analysis is not None}")
        if loaded_analysis:
            print(f"   - Completion: {loaded_analysis.percent_complete:.1f}%")
    except Exception as e:
        print(f"❌ Error pushing to hub: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()

… capabilities

- Added `HuggingFaceHubMixin` to facilitate pushing and pulling datasets to/from Hugging Face Hub.
- Implemented `pull_from_hub` method in `DatasetCreationResults` for loading datasets and artifacts from the hub.
- Created `HubDatasetResults` class to encapsulate results from hub operations, including datasets, analysis, and metadata.
- Developed integration tests for verifying push and pull operations with Hugging Face Hub.
- Introduced dataset card generation for datasets pushed to the hub, enhancing documentation and usability.

github-actions bot commented Dec 12, 2025

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@davidberenstein1957

I have read the DCO document and I hereby sign the DCO.

@davidberenstein1957 changed the title from "feat: integrate Hugging Face Hub functionality with dataset push/pull capabilities" to "feat: integrate Hugging Face Hub functionality with dataset push/pull" on Dec 12, 2025
- Introduced `_sanitize_metadata_file_paths` method to convert local file paths in metadata to remote paths suitable for Hugging Face Hub.
- Updated existing metadata handling to utilize the new sanitization method, ensuring consistent file path formatting.
- Enhanced comments for clarity on metadata processing and artifact uploading.

@nabinchha nabinchha left a comment


@davidberenstein1957 Thank you for your contribution in opening Data Designer's first non-trivial PR from the community! 🎉 This is exactly the kind of feature that will make Data Designer more useful for everyone, and I really appreciate the effort you've put into this.

I left a few comments in the PR with some specific feedback. The core functionality you've implemented is looking great, and I can see you've thought through the user experience. Before we merge this in, I'd love for us to collaborate on refining the implementation to align with Data Designer's architecture patterns. This will help ensure the integration is maintainable long-term and sets us up well for future third-party integrations.

Let's work through this together! I've broken down my thoughts into three areas: Code Architecture, Code Best Practices, and the Feature PR Process.

Code Architecture

  • Anticipating possible additional third-party integrations in the future, let's create a dedicated space within the repo at src/data_designer/integrations. This specific integration can live in src/data_designer/integrations/huggingface.

  • Let's make sure we build components in Data Designer as loosely coupled as possible. For this HuggingFace integration, instead of using the mixin pattern to add push/pull-to-HuggingFace functionality to DatasetCreationResults, let's create a separate class (maybe named HuggingFaceHubClient?) to encapsulate that functionality. DatasetCreationResults can then simply compose over that object to provide the same functionality without any possibility of side effects due to collisions through multiple inheritance as the class evolves over time (a minimal sketch of this composition follows the list below).

  • The approach above will keep the HuggingFace integration scoped out in a separate module that can be tested in isolation without necessarily worrying about how it integrates with the rest of the package. This means everything in src/data_designer/integrations/huggingface should have an equivalent tests/integrations/huggingface test module for unit tests.
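To make the composition suggestion concrete, here is a minimal sketch. Apart from the names HuggingFaceHubClient, DatasetCreationResults, and push_to_hub already used in this thread, everything below (constructor arguments, method bodies) is an illustrative assumption rather than the PR's actual API.

```python
# Hypothetical sketch of composition over a mixin; details are illustrative.
from pathlib import Path


class HuggingFaceHubClient:
    """Encapsulates all Hugging Face Hub I/O; testable in isolation."""

    def __init__(self, token: str | None = None) -> None:
        self._token = token

    def push_dataset(self, dataset_path: Path, repo_id: str) -> None:
        ...  # upload parquet files, dataset card, and metadata to the Hub


class DatasetCreationResults:
    """Composes over the client instead of inheriting from a mixin."""

    def __init__(self, dataset_path: Path, hub_client: HuggingFaceHubClient | None = None) -> None:
        self._dataset_path = dataset_path
        self._hub_client = hub_client or HuggingFaceHubClient()

    def push_to_hub(self, repo_id: str) -> None:
        # Delegation keeps the Hub integration decoupled from results handling.
        self._hub_client.push_dataset(self._dataset_path, repo_id)
```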

Code Best Practices

  • Prefer smaller, scoped methods over long implementations. This not only helps with readability, but also with testability and maintainability. Smaller methods are easier to understand, debug, and modify. They can be tested independently with focused unit tests, and when bugs arise, they're easier to isolate and fix.

  • Write unit tests before integration tests. Both are necessary, but there should be more unit tests than integration tests. Ideally, all public surfaces exposed by module/class/method should have unit tests. Unit tests are faster to run, easier to debug when they fail, and help ensure each component works correctly in isolation before testing how they work together.

  • Import dependencies at the top of the file. Unless it is necessary for performance reasons (e.g., expensive imports that are rarely used), we should import dependencies at the top. This makes dependencies explicit and easier to track, improves code readability, and aligns with Python conventions and the project's style guidelines.

  • Work off of concrete types instead of dictionary objects. Using concrete types (classes, dataclasses, Pydantic models) instead of dictionaries provides better type safety, IDE autocomplete, and clearer APIs. It catches errors at development time rather than runtime and makes the code self-documenting (a small sketch follows below).
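As a small illustration of the last point, a sketch assuming Pydantic v2; the model and field names are made up for the example and are not Data Designer's actual schema.

```python
# Illustrative only: a typed model instead of a raw dict for Hub upload options.
from pydantic import BaseModel


class HubUploadOptions(BaseModel):
    repo_id: str
    private: bool = False
    generate_card: bool = True


# A dict would accept typos silently; the model fails fast with a clear error
# and gives IDE autocomplete for free.
options = HubUploadOptions(repo_id="org/my-dataset", private=True)
print(options.model_dump())
```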

Feature PR Process

Given the scope of changes, for this particular PR (and possibly future PRs of the same nature), I'd highly recommend breaking the larger problem into smaller pieces we can execute on and deliver incrementally. This is a common practice for large features—it makes review easier, allows us to iterate on design, and means we can start shipping value sooner. We can make use of GitHub Discussions to hash out the initial design before execution. For this PR, it might be good to break it up into several smaller PRs we can merge into Data Designer:

  1. PR for the HuggingFace dataset card: Just the class, the template, and unit tests.

  2. PR for the new class that encapsulates this integration (name TBD): This may include several classes and module methods, along with unit tests.

  3. PR to make use of the integration class above in DatasetCreationResults: Should include unit tests assuming the integration class is fully tested.


I'm happy to help guide this process and answer questions you may have!

cc @johnnygreco @eric-tramel @andreatgretel

### Example {{ idx + 1 }}

```json
{{ sample_records[idx] | tojson(indent=2) }}
```

nit: When key values are long, this does not work well as a quick preview. Could there be another, more user-friendly way to render it?

[Screenshot: rendered examples section with long values]


Not sure if this can be handled differently, as I think part of the value lies in the exact and complete column names. We could go overboard with formatting specifications, but I'd be happy to hear if you have a concrete suggestion.


Hugging Face already provides a nice UI for exploring the data in the dataset viewer: [Screenshot: Hub dataset viewer]

WDYT about excluding the examples section since the above accomplishes the same?

- Introduced `HuggingFaceHubClient` for managing dataset interactions with Hugging Face Hub, including pushing and pulling datasets.
- Replaced `HuggingFaceHubMixin` with the new client in `DatasetCreationResults` for improved structure and clarity.
- Added `DataDesignerDatasetCard` class to create custom dataset cards for datasets pushed to the hub.
- Implemented methods for reconstructing dataset creation results from hub data, enhancing usability and integration.
- Updated tests to cover new functionality and ensure robust integration with Hugging Face Hub.
- Simplified the retrieval of the config_builder property in the WithRecordSamplerMixin class.
- Removed the try-except block for accessing config_builder, directly using the property for improved clarity and efficiency.
@davidberenstein1957

Hi @nabinchha,

Thanks for the review comments. I agree with most of them, but I won't have the time or energy to wrap up three separate PRs before or during the holiday period, as I will be taking some vacation during and immediately after it. If you prefer to split it up, I expect to have more time again around the end of January.

…egration files

- Reformatted dictionary comprehension for better readability in `HuggingFaceHubClient`.
- Consolidated multiple lines into a single line for downloading processor datasets and artifacts.
- Streamlined logging message formatting in `_load_column_config` function.
- Removed unnecessary blank lines in various files to enhance code cleanliness.
@davanstrien

Very nice to see this integration and especially happy to see the auto-generated dataset card being included!

One suggestion/question on whether this could work better for large datasets (and leverage datasets more for some of the logic). It's possible that pandas is always the target for DataDesigner, in which case ignore the comment!

The current implementation goes through pandas for both push and pull:

# Push (lines 177-178)
dataset_df = self._dataset_provider.load_dataset()
hf_dataset = Dataset.from_pandas(dataset_df)

# Pull (line 760)
dataset_df = hf_dataset.to_pandas()

For large synthetic datasets, this could OOM since it requires loading the entire dataset into memory?

Do you think there is an alternative approach?

Since DataDesigner already stores data as parquet files in artifact_storage.final_dataset_path, you could potentially bypass the pandas-based load_dataset() for the Hub integration and use Dataset.from_parquet() directly, i.e. something like:

# Push - read parquet directly into Arrow, avoiding pandas
from datasets import Dataset

parquet_path = artifact_storage.final_dataset_path
hf_dataset = Dataset.from_parquet(str(parquet_path / "*.parquet"))
hf_dataset.push_to_hub(repo_id, token=token)

You can still update the card, etc.

This would also give you the benefits of the datasets library's push_to_hub() for free, including:

  • Automatic sharding for large datasets
  • Parquet CDC (Change Data Capture) support, which works well with Xet storage for efficient incremental updates. This would be useful for iterating on the same dataset a lot since you only push a small diff in this case.
  • The dataset viewer on the Hub should always work because row group sizes in Parquet are managed for you, etc.

For pull, you could consider returning the Dataset object directly rather than converting to pandas - users can call .to_pandas() themselves if needed.

I'm not familiar enough with the design philosophy of DataDesigner, so it's possible the idea is to keep everything pandas-centric, and users are expected to only work with datasets that always fit in memory? If so, the current approach makes sense for consistency with the rest of the library. Just wanted to flag the potential scaling limitation in case large dataset support is a goal.


nabinchha commented Dec 17, 2025

> [Quoted @davanstrien's comment above in full.]

@davanstrien thanks for the catch and the callout! The current implementation will indeed run into OOM for large datasets. In addition, there's a discrepancy between the partitions generated by Data Designer and what's ultimately uploaded to the Hugging Face Hub and referenced in metadata.

Here's an example of the artifact partitions generated by Data Designer, as reported in the HF card metadata. To simulate multiple partitions, I called data_designer.set_buffer_size(2) on the data designer object before calling data_designer.create(config_builder, num_records=20).
[Screenshot: artifact partitions as reported in the HF card metadata]

All these partitions got coalesced into one and uploaded to HF hub.
[Screenshot: the coalesced dataset as uploaded to the HF Hub]

+1 to your suggestion above, thank you!

> I'm not familiar enough with the design philosophy of DataDesigner, so it's possible the idea is to keep everything pandas-centric, and users are expected to only work with datasets that always fit in memory? If so, the current approach makes sense for consistency with the rest of the library. Just wanted to flag the potential scaling limitation in case large dataset support is a goal.

pandas is our current backend abstraction, but it's designed in a way that we can swap it out later. While we use pandas, we work in batches and incrementally persist batch results to disk as parquet partitions, so we shouldn't run into OOM at runtime as long as Data Designer runs in an environment with a reasonable (4/8 GB?) memory allocation.
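For illustration, a rough sketch of that batch-and-persist pattern; this is not Data Designer's actual code, and the function and path names are hypothetical.

```python
# Illustrative sketch: each batch is written as its own parquet partition as it
# is produced, so memory usage is bounded by the batch size, not the dataset.
from pathlib import Path
from typing import Iterable

import pandas as pd


def persist_in_batches(batches: Iterable[pd.DataFrame], out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, batch_df in enumerate(batches):
        batch_df.to_parquet(out_dir / f"part-{i:05d}.parquet", index=False)
```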

@davanstrien

@nabinchha sg! In case it is useful, we have some (quite high-level) pointers on integrating data tools with the Hub here: https://huggingface.co/docs/hub/datasets-libraries#integrating-data-libraries-and-tools-with-the-hub.

One other option to throw into the mix is to use non-blocking uploads via huggingface_hub https://huggingface.co/docs/hub/datasets-libraries#integrating-data-libraries-and-tools-with-the-hub to push in the background during generation, but this might be overkill for a first integration -- IMO probably just doing a push at the end is a good start.
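For reference, a minimal sketch of what a non-blocking push could look like via huggingface_hub's run_as_future option (assuming a recent huggingface_hub version; the repo and folder names are placeholders).

```python
# Sketch of a non-blocking (background) upload via huggingface_hub.
from huggingface_hub import HfApi

api = HfApi()
future = api.upload_folder(
    folder_path="artifacts/",             # placeholder path
    repo_id="user/my-synthetic-dataset",  # placeholder repo
    repo_type="dataset",
    run_as_future=True,  # returns a concurrent.futures.Future immediately
)
# ... keep generating data while the upload runs in the background ...
future.result()  # block only when the upload must have finished
```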

@nabinchha

@davanstrien, thanks for the pointer! It looks like we can simplify this integration significantly by leveraging the Hugging Face upload_folder API (as shown in the links you shared) and making a few adjustments to how Data Designer generates metadata.

Here is what I am proposing:

  1. Streamline Data Designer metadata generation: Currently, we generate column_configs.json, model_configs.json, and metadata.json. Ideally, we should collapse these into a single metadata.json (sanitized for Hugging Face) and a new sdg.json file (or a different name). The latter would capture the serialized version of the entire Data Designer SDG pipeline, allowing anyone to recreate it using the DataDesignerConfigBuilder.from_config(...) API. This approach allows us to focus initially on the push_to_hub integration, as "re-hydrating" a DatasetCreationResults object via pull_from_hub seems a bit more complex. I can create a GitHub issue and a PR for this at the beginning of next year.
  2. Use the upload_folder API for artifact uploads: We can upload the entire artifact directory to Hugging Face with a few folder renames to create a dataset on Hugging Face with subsets. The default subset would point to the contents of parquet-files. Since @andreatgretel recently added support for processors to Data Designer, any auxiliary datasets they generate can be uploaded as additional subsets. This approach reduces the Hugging Face client class in Data Designer to a simple wrapper around upload_folder, with the underlying HfApi handling most of the complexities (a minimal sketch follows the screenshots below). This also automatically addresses multiple parquet partition files being present in the generated artifact folder.

Concretely, here's an example artifact folder with the main dataset in parquet-files in addition to auxiliary datasets generated by processors:
[Screenshot: example artifact folder]

Here's a screenshot of how this could live on Hugging Face as a dataset with subsets.
[Screenshot: dataset with subsets on Hugging Face]
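To make the upload_folder idea concrete, here is a minimal sketch assuming the artifact folder layout shown above. Only HfApi, create_repo, and upload_folder are real huggingface_hub APIs; the wrapper function and its arguments are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of the proposed upload_folder-based push. The wrapper name and
# arguments are assumptions; HfApi.create_repo and HfApi.upload_folder are the
# real huggingface_hub calls doing the work.
from pathlib import Path

from huggingface_hub import HfApi


def push_artifacts_to_hub(artifact_dir: Path, repo_id: str, token: str | None = None) -> None:
    api = HfApi(token=token)
    # Create the dataset repo if it does not exist yet.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    # Upload the whole artifact folder in one call; HfApi handles commits,
    # large files, and retries.
    api.upload_folder(
        folder_path=str(artifact_dir),
        repo_id=repo_id,
        repo_type="dataset",
    )
```

Any folder renames needed to map parquet-files and processor outputs to Hub subsets would happen before this call.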

@davidberenstein1957, what do you think about this simplified approach? Would you be open to limiting the scope of this PR to just adding the dataset card (template and class)? The template you added can serve as the foundation for building the remaining pieces incrementally. We are happy to open follow-up PRs to contribute to this effort shortly afterward if you expect to be busy through the end of January.

Comment on lines +9 to +13
# Dataset Card

This dataset was generated using **NeMo Data Designer**, a comprehensive framework for creating high-quality synthetic datasets from scratch or using seed data.

## About NeMo Data Designer

@kirit93 brought up a good point about making room for a custom, user-provided description of the dataset... perhaps when they call <>.push_to_hub(repo_id, description)?
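One possible shape for threading a user-supplied description into the card, as a rough sketch only: DatasetCard and DatasetCardData are real huggingface_hub classes (assuming a recent version with template_str support), but the template string, tags, and parameter names are hypothetical and not the template in this PR.

```python
# Hypothetical sketch: an optional description rendered into the dataset card.
from huggingface_hub import DatasetCard, DatasetCardData

CARD_TEMPLATE = """\
---
{{ card_data }}
---

# Dataset Card

{% if description %}{{ description }}{% else %}This dataset was generated using NeMo Data Designer.{% endif %}
"""


def build_card(description: str | None = None) -> DatasetCard:
    card_data = DatasetCardData(tags=["synthetic", "data-designer"])
    return DatasetCard.from_template(
        card_data,
        template_str=CARD_TEMPLATE,
        description=description or "",
    )
```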

Comment on lines +39 to +53
```python
from data_designer.interface.results import DatasetCreationResults

# Load dataset with all artifacts (analysis, configs, etc.)
results = DatasetCreationResults.pull_from_hub("{{ repo_id }}")

# Access the dataset
df = results.load_dataset()

# Access the analysis
analysis = results.load_analysis()

# Access the config builder
config_builder = results.config_builder
```

Based on my suggestion to focus on push_to_hub, let's leave this bit out for now. WDYT?

## Metadata

```json
{{ metadata | tojson(indent=2) }}
```

This will be in a different PR, but in the metadata that gets rendered, let's make sure it is sanitized for Hugging Face. For example, file_paths in the JSON currently show the local paths to the files on the file system where this SDG pipeline ran.
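For reference, one possible shape of such a sanitizer; this is illustrative only and not the PR's actual _sanitize_metadata_file_paths implementation.

```python
# Illustrative sketch: rewrite absolute local paths under local_root as
# repo-relative paths before rendering metadata into the card.
from pathlib import Path
from typing import Any


def sanitize_file_paths(metadata: dict[str, Any], local_root: Path) -> dict[str, Any]:
    def _fix(value: Any) -> Any:
        if isinstance(value, str) and value.startswith(str(local_root)):
            return str(Path(value).relative_to(local_root))
        if isinstance(value, dict):
            return {k: _fix(v) for k, v in value.items()}
        if isinstance(value, list):
            return [_fix(v) for v in value]
        return value

    return _fix(metadata)
```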
