diff --git a/docs/wiki-guide/Digital-Product-Lifecycle.md b/docs/wiki-guide/Digital-Product-Lifecycle.md index 73001b7..9f6529b 100644 --- a/docs/wiki-guide/Digital-Product-Lifecycle.md +++ b/docs/wiki-guide/Digital-Product-Lifecycle.md @@ -19,7 +19,7 @@ The following adds additional context and direction to supplement the diagram, o * **Datasets:** Hugging Face Dataset Repository ([Data checklist](Data-Checklist.md)). * For already published data usage, see the [Metadata Checklist](Metadata-Checklist.md). * **ML Models:** Hugging Face Model Repository ([Model checklist](Model-Checklist.md)). -* Though alternative storage options may be discussed, **Google Drive is not an acceptable storage location for research data, models, or code**. Folder activity does not include actual file additions or deletions, so content can be changed or removed without a record of when or by whom. All research, data, models, and code must be stored in **a version controlled repository, preferably in more than one location** to ensure preservation and full provenance tracking. +* Though alternative storage options may be discussed, **Google Drive, OneDrive, and other user-tied institutional locations are not acceptable storage locations for research data, models, or code**. Folder activity does not include actual file additions or deletions, so content can be changed or removed without a record of when or by whom. All research, data, models, and code must be stored in **a version-controlled repository, preferably in more than one location** to ensure preservation and full provenance tracking.
### Exploration Phase diff --git a/docs/wiki-guide/GitHub-Repo-Guide.md b/docs/wiki-guide/GitHub-Repo-Guide.md index 8231805..38e3051 100644 --- a/docs/wiki-guide/GitHub-Repo-Guide.md +++ b/docs/wiki-guide/GitHub-Repo-Guide.md @@ -101,66 +101,119 @@ For more information on managing these environments and generating such files pr ### CITATION -Make it easier for people to cite your project by including a [CITATION.cff file](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files); you can copy-paste the template below. +Make it easier for people to cite your project by including a [CITATION.cff file](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files); you can copy-paste the [template below](#citation-templates). As with journal publications, we expect to be cited when someone uses our code. To facilitate proper attribution, GitHub will automatically read a [CITATION.cff file](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files) and display a link to "cite this repository". This file is also used to populate metadata fields in a [Zenodo](https://zenodo.org/) record when [auto-generating a DOI](DOI-Generation.md#2-generate-a-doi-with-zenodo). As with any other component of your project, this file may change over the project's lifespan (see [Digital Product Life Cycle](Digital-Product-Lifecycle.md) for details), but it should be present and updated before any release. Providing this file is as simple as copying the below example and filling in your information before uploading it to your repo. 
More examples and information about the Citation File Format can be found on the [citation-file-format repo](https://github.com/citation-file-format/citation-file-format), including helpful [related tools](https://github.com/citation-file-format/citation-file-format#tools-to-work-with-citationcff-files-wrench). -You can check your `CITATION.cff` file prior to upload using this [validator tool](https://www.yamllint.com/). +#### Citation Templates -!!! note "Note" - - When adding a DOI to your citation (`doi`), be sure to use the version-agnostic DOI from Zenodo. Since the DOI is not generated until _after_ the release, this ensures there will never be an "incorrect" DOI associated to the release—correct version reference is ensured through the `version` key, which should always be updated _**before**_ generating a new release. - - Subcategories of `preferred-citation` do not get bullet points, but the first subcategory of `references` must be bulleted (as below). - - This is generally intended as a reference for your code. Preferred citation can be used for the paper, though it is better to ask in the `README` that someone cites _both_ and provide the paper reference there (only the `preferred-citation` will show up to be copied from the citation box if it is included). - -```yaml { py linenums="1" } -abstract: "" -authors: -- family-names: - given-names: "" - orcid: "https://orcid.org/" -cff-version: 1.2.0 -date-released: "YYYY-MM-DD" -identifiers: - - description: "The GitHub release URL of tag ." - type: url - value: "https://github.com/ABC-Center//releases/tag/" - - description: "The GitHub URL of the commit tagged with ." - type: url - value: "https://github.com/ABC-Center//tree/" -keywords: - - biodiversity -license: -message: "If you find this software helpful in your research, please cite both the software and our paper." 
-repository-code: "https://github.com/ABC-Center/" -title: "" -version: -doi: # version agnostic DOI -type: software -preferred-citation: - type: article - authors: +- When adding a DOI to your citation (`doi`), be sure to use the version-agnostic DOI from Zenodo. Since the DOI is not generated until _after_ the release, this ensures there will never be an "incorrect" DOI associated with the release—correct version reference is ensured through the `version` key, which should always be updated _**before**_ generating a new release. +- A `CITATION.cff` is intended as a reference for your code; ask in the `README` that someone cites _both_ the repo and your paper, then provide the paper BibTeX there. +- Formatted display can be checked on a branch before merging to `main`. + +=== "Standard Citation File (Recommended)" + + !!! tip + Pair this citation file with a [`.zenodo.json`](#zenodo-metadata) for easier DOI metadata tracking (grants, references, associated papers). + + ```yaml { py linenums="1" } + abstract: "" + authors: - family-names: - given-names: "" + orcid: "https://orcid.org/" + cff-version: 1.2.0 + date-released: "YYYY-MM-DD" + identifiers: + - description: "The GitHub release URL of tag ." + type: url + value: "https://github.com/ABC-Center//releases/tag/" + - description: "The GitHub URL of the commit tagged with ." + type: url + value: "https://github.com/ABC-Center//tree/" # Update on release + keywords: + - imageomics + license: + message: "If you find this software helpful in your research, please cite both the software and our paper." + repository-code: "https://github.com/ABC-Center/" + title: "" + version: + #doi: + type: software + ``` + +=== "Extended `CFF` (References and Citation redirect)" + + !!! warning + This is generally intended as a **reference for your code**.
Preferred citation can be used for the paper, though it is better to ask in the `README` that someone cites _both_ and provide the paper reference there (only the `preferred-citation` will show up to be copied from the citation box if it is included). + + !!! success "Simplify version tracking for your code" + Pair the [standard citation file](#__tabbed_1_1) with a [.zenodo.json file](#zenodo-metadata), which can track references, associated papers, and grant information. + + !!! info + - Subcategories of `preferred-citation` do not get bullet points, but the first subcategory of `references` must be bulleted (as below). + - If including `references` or setting a `preferred-citation`, see this [bibtex to cff crosswalk](https://docs.ropensci.org/cffr/articles/bibtex-cff.html#fieldskey-crosswalk) for help in translating a BibTeX citation to the proper `CFF` format. + + ```yaml { py linenums="1" } + abstract: "" + authors: - family-names: - given-names: - title: - year: - journal: - doi: -references: - - authors: - - family-names: - given-names: - - family-names: - given-names: - title: - version: - type: - doi: - date-released: -``` + given-names: "" + orcid: "https://orcid.org/" + cff-version: 1.2.0 + date-released: "YYYY-MM-DD" + identifiers: + - description: "The GitHub release URL of tag ." + type: url + value: "https://github.com/ABC-Center//releases/tag/" + - description: "The GitHub URL of the commit tagged with ." + type: url + value: "https://github.com/ABC-Center//tree/" # Update on release + keywords: + - imageomics + license: + message: "If you find this software helpful in your research, please cite both the software and our paper."
+ repository-code: "https://github.com/ABC-Center/" + title: "" + version: + #doi: + type: software + # Only include the following if you want to present the paper citation instead of code on sidebar, + # Better to include paper citation in README + preferred-citation: + type: conference-paper + authors: + - family-names: + given-names: + - family-names: + given-names: + collection-title: # "Proceedings of the ..." + collection-type: proceedings + conference: + name: # Name of conference, e.g., "ICLR 2025" + pages: #"-" + start: # First page, int + end: # Last page, int + title: # Paper title + year: + doi: + # url: use only if DOI not available + # References can be added here, but will only be read from the .zenodo.json file + # Below example set to reference code repo, `preferred-citation` types apply + references: + - authors: + - family-names: + given-names: + - family-names: + given-names: + title: + version: + type: + doi: + date-released: + ``` ## Recommended Files @@ -186,7 +239,7 @@ A `.zenodo.json` can be created by applying [cffconvert](https://github.com/cita "creators": [ { "name": "family-names, given-names", - "orcid": "", + "orcid": "", // Just the ORCID number, not the URL "affiliation": "" }, { @@ -207,7 +260,10 @@ A `.zenodo.json` can be created by applying [cffconvert](https://github.com/cita { "id": "021nxhr62::2330423" // ABC NSF grant, NSERC requires manual update } - ] + ], + "references": [ + // list of references as strings in APA or similar format + ] } ``` diff --git a/docs/wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md b/docs/wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md index a16b312..49ff0f2 100644 --- a/docs/wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md +++ b/docs/wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md @@ -1,21 +1,50 @@ # Hugging Face Dataset Guide -## Create a New Dataset Repository +[Hugging Face](https://hf.co/) offers numerous methods for interacting with and creating datasets. 
This page provides a basic overview with some recommendations specifically targeting image dataset uploads, though the principles are transferable to other data types. We list these options—in order of increasing complexity—with some guidance, recommendations, and links out to the appropriate parts of the Hugging Face docs for the most up-to-date information available. -When creating a new dataset repository, you can make the dataset **Public** (accessible to anyone on the internet) or **Private** (accessible only to members of the organization). +1. [Web interface (UI)](#upload-a-dataset-with-the-web-interface): For smaller, simpler uploads. +2. [Hugging Face Command Line Interface (CLI)](#upload-a-dataset-with-the-hugging-face-cli): For most use cases; easy access from a cluster. +3. [Hugging Face API (python package)](#upload-a-dataset-with-hfapi): For when more fine-grained control than is achievable with the CLI is needed. +4. [Git/Git LFS](#upload-a-dataset-with-git): Main use case is when multiple PRs lead to merge conflicts—Hugging Face provides no other means for resolution. -![New dataset repository interface](images/HF-dataset-upload/346972860-ed0feb0e-529b-4021-b44f-41ac96680bc3.png){ loading=lazy, width=800 } -/// caption -/// +!!! info + Some sections of the Hugging Face docs, such as those for `huggingface_hub`, have only version-specific links for stable versions. If such a link directs to an older version, a banner will alert you that a newer version is available, so keep an eye out for it. + +Most of the content below is covered in various parts of [Hugging Face's Upload Guide](https://huggingface.co/docs/huggingface_hub/en/guides/upload); this page is provided as a summary reference mainly to determine which method might be best and link to the appropriate docs.
Additionally, we include an [integrity check](#integrity-check) to help you ensure that your repo contains all the desired files after uploading through any of these methods. + +See also [HF tips and tricks for large uploads](https://huggingface.co/docs/huggingface_hub/en/guides/upload#tips-and-tricks-for-large-uploads). + +## Note on Authentication + +All of these methods require authentication to edit datasets, ranging from passwords to tokens to SSH authentication, and all support editing **Public** (accessible to anyone on the internet) or **Private** (accessible only to members of the organization) repos. Two key notes on authentication: + +1. Private repositories are only visible if you are authenticated. +2. If using tokens for access, be sure to create a [fine-grained token](https://huggingface.co/docs/hub/en/security-tokens#what-are-user-access-tokens) scoped specifically to your needs. ## Upload a Dataset with the Web Interface -In the Files and versions tab of the Dataset card, you can choose to add file in the hugging web interface. +In the Files and Versions tab of the repository, you can select "Contribute" to add or create files or start a pull request directly from the web interface. ![Dataset repository Add file button](images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png){ loading=lazy } +This method is fine for smaller files (<100MB) or data uploads from distributed sources, where the repo has a relatively flat structure with few directories and/or few files. If you are uploading existing files, navigate to the target folder first. + +## Upload a Dataset with the Hugging Face CLI + +Hugging Face provides a comprehensive Command Line Interface (CLI) and corresponding [docs](https://huggingface.co/docs/huggingface_hub/en/guides/cli). Note that this is installed with the `huggingface_hub` Python package, but can also be installed directly, then called with `hf `.
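Putting the install and upload steps together, a minimal sketch of a CLI dataset upload (the repo name and folder path are placeholders; command names follow the current `hf` CLI, so check `hf --help` if your `huggingface_hub` version predates it):

```shell
# Install (or update) the CLI, which ships with the huggingface_hub package
pip install -U huggingface_hub

# Authenticate once; paste a fine-grained token when prompted
hf auth login

# Upload a local folder to a dataset repo; --repo-type is required for datasets
hf upload ABC-Center/my-dataset ./local-folder --repo-type=dataset
```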
The Hugging Face CLI is the ideal method for uploads that are large in volume, have more than a few files, and/or have a folder structure with many or nested directories. It works directly from HPC clusters, such as OSC. Under the hood, [`hf upload`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-upload) uses the same upload functions described below, under [Upload a Dataset with HfApi](#upload-a-dataset-with-hfapi). Review [Hugging Face's guidance on large folder uploads](https://huggingface.co/docs/huggingface_hub/v1.10.1/guides/upload#upload-a-large-folder) before selecting a method for uploading large folders to a non-empty repository. + +When uploading to a dataset, note that the repo type must be specified (`--repo-type=dataset`); this is also the case for Spaces, since Hugging Face treats models as the default. + +There are specific [`hf datasets`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-datasets) and [`hf repo`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-repo) commands for more general queries and repo initialization. + ## Upload a Dataset with HfApi +When more complex dataset structures are involved or more fine-grained control (not exposed in the CLI) over how a repo will be organized on Hugging Face is needed, the Hugging Face API may be the answer. For instance, if a glob pattern cannot sufficiently clarify necessary exclusions of subfolders or files, [`HfApi`](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api) is likely the preferred choice. This is a class, accessible through the [`huggingface_hub` package](https://huggingface.co/docs/huggingface_hub/index), that acts as a Python wrapper for the API. + +Please see the Hugging Face API docs for the most up-to-date guidance.
For quick reference, to [upload by file](https://huggingface.co/docs/huggingface_hub/v1.10.1/package_reference/hf_api#huggingface_hub.HfApi.upload_file) or [upload by folder (structure maintained)](https://huggingface.co/docs/huggingface_hub/v1.10.1/package_reference/hf_api#huggingface_hub.HfApi.upload_folder). + ``` py linenums="1" from huggingface_hub import login @@ -24,70 +53,70 @@ login() from huggingface_hub import HfApi api = HfApi() +repo_id = "ABC-Center/" -api.upload_file ( +# Upload by file +api.upload_file( path_or_fileobj = , path_in_repo = , - repo_id = , + repo_id = repo_id, repo_type = 'dataset' ) -``` - -## Upload a Dataset with Git - -### If the Dataset is Less Than 5GB - -Navigate to the folder for the repository: +# Upload by folder (maintain structure) +api.upload_folder( + folder_path="/path/to/local/folder", # should end with folder containing data + path_in_repo="path/to/folder/", # path desired for folder in repo + repo_id=repo_id, + repo_type="dataset", + token="paste-token-here" # only if you're not logged in; HF does **not** recommend this method +) ``` -# Clone the repository -git clone https://huggingface.co/datasets/username/repo-name -# Add, commit, and push the files -git add -git commit -m 'comments' -git push +Repos can also be created through the Hugging Face API using the [create_repo method](https://huggingface.co/docs/huggingface_hub/v1.10.1/en/package_reference/hf_api#huggingface_hub.HfApi.create_repo) with the following parameters: +```py linenums="1" +repo_id = "ABC-Center/" +repo_type = "dataset" +private = True # if you want the repo private ``` -### If the Dataset is Larger Than 5GB +See also instructions using the [datasets package](https://huggingface.co/docs/datasets/create_dataset).
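Before reaching for `HfApi`, it can help to preview which files a glob pattern would actually select. A minimal sketch (the `preview_upload` helper is hypothetical, and only approximates the fnmatch-style filtering applied by `upload_folder`'s `allow_patterns`/`ignore_patterns`):

```python
import fnmatch
from pathlib import Path

def preview_upload(folder, allow_patterns=None, ignore_patterns=None):
    """List relative file paths the given glob patterns would select.

    Hypothetical helper: approximates (does not exactly replicate) the
    allow/ignore pattern filtering used by Hugging Face folder uploads.
    """
    selected = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(folder).as_posix()
        # Keep only files matching at least one allow pattern, if given
        if allow_patterns is not None and not any(
            fnmatch.fnmatch(rel, pat) for pat in allow_patterns
        ):
            continue
        # Drop files matching any ignore pattern, if given
        if ignore_patterns is not None and any(
            fnmatch.fnmatch(rel, pat) for pat in ignore_patterns
        ):
            continue
        selected.append(rel)
    return selected
```

Running this on the folder you plan to upload and checking the list (or just its length) is a cheap sanity check before committing to a long upload.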
- -#### Install Git LFS +## Upload a Dataset with Git -Follow instructions at +Using Git to interact with Hugging Face requires installing [Git LFS](https://git-lfs.com/) and the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli), then enabling large file uploads for the repo. -#### Install the Hugging Face CLI +Hugging Face provides details on [git vs http](https://huggingface.co/docs/huggingface_hub/en/concepts/git_vs_http), that is, on using Git versus `HfApi`. -``` -brew install huggingface-cli -pip install -U "huggingface_hub[cli]" -``` +Hugging Face has moved away from Git LFS, instead using [Xet](https://huggingface.co/docs/hub/en/xet/index) for data storage and version control; this is [backwards compatible with LFS](https://huggingface.co/docs/hub/en/xet/legacy-git-lfs). -#### Enable the repository to upload large files +## Other Repo Considerations -``` -huggingface-cli lfs-enable-largefiles -``` +One case where `git` may be needed is when one encounters a [merge conflict](https://discuss.huggingface.co/t/how-to-fix-merge-conflicts-in-prs/160090). Unlike GitHub, Hugging Face ***does not*** have conflict resolution UI tools, nor does it provide merge conflict resolution capabilities in the CLI or HfApi. The only means for resolving merge conflicts is to manually update the pull request in a [local clone](The-Hugging-Face-Workflow.md#hugging-face-pull-requests-with-local-edits), pulling `main` into your PR branch and resolving the conflicts. -#### Initialize Git LFS +## Integrity Check -``` -git lfs install -``` +Sometimes uploads fail partway through, leaving one or more files un-uploaded. Unfortunately, there is no easy way to be alerted to these issues when not uploading through the UI. Additionally, using a glob pattern to set an upload without a dry run[^1] (in `git` terms, this would be running `git status` after adding files) can also lead to accidental exclusion.
To catch these issues, we recommend the following integrity check after uploading a dataset[^1]. -#### Track large files (e.g., .csv files) [^1]: The Hugging Face CLI does have a [dry-run mode](https://huggingface.co/docs/huggingface_hub/en/guides/cli#dry-run-mode) for *downloading* datasets. Additionally, if working with Git LFS, there is a [preupload LFS](https://huggingface.co/docs/huggingface_hub/en/guides/upload#preupload-lfs-files-before-commit) option to ensure all files are properly present and organized before committing. There are additional considerations for sharding noted in the Hugging Face docs. -``` -# Adds a line to .gitattributes, which Git uses to determine files managed by LFS -git lfs track "*.csv" -git add .gitattributes -git commit -m "Track large files with Git LFS" -``` ```python import pandas as pd from huggingface_hub import HfApi -#### Add, commit, and push the files api = HfApi() +repo_id = "ABC-Center/" +repo_type = "dataset" +file_list = api.list_repo_files(repo_id=repo_id, repo_type=repo_type) +file_df = pd.DataFrame(data = {"filepath": file_list}) +metadata = pd.read_csv("path/to/metadata/file") + +# assuming you use the same filepath in your system as in the repo +df = pd.merge(file_df, metadata, how = "inner", on = "filepath") +df.shape[0] # this should match the number of expected images ``` -git add -git commit -m 'comments' -git push + +!!! tip "Pro tip" + If you don't have a metadata file for your images, use the [sum-buddy package](Helpful-Tools-for-your-Workflow.md#sum-buddy) to generate one in your local file system. This can also be used as a metadata file for the dataset viewer as needed (see [image datasets docs](https://huggingface.co/docs/hub/en/datasets-image) for more information on setting this up). Similar options are available for [audio](https://huggingface.co/docs/hub/en/datasets-audio) and [video](https://huggingface.co/docs/hub/en/datasets-video) datasets.
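If the inner-merge count from the integrity check above comes up short, an outer merge with `indicator=True` can show which side each file is missing from. A minimal sketch with made-up file lists standing in for the repo listing (`file_df`) and local metadata used above:

```python
import pandas as pd

# Made-up inputs standing in for the repo listing and local metadata
file_df = pd.DataFrame({"filepath": ["images/a.jpg", "images/b.jpg"]})
metadata = pd.DataFrame({"filepath": ["images/a.jpg", "images/c.jpg"]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
check = pd.merge(file_df, metadata, how="outer", on="filepath", indicator=True)

# Expected locally but missing from the repo: re-upload these
missing_from_repo = check.loc[check["_merge"] == "right_only", "filepath"].tolist()
# In the repo but absent from the metadata: possible strays or metadata gaps
only_in_repo = check.loc[check["_merge"] == "left_only", "filepath"].tolist()
```

This keeps the quick length check but also produces actionable lists of exactly which files to re-upload or investigate.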
diff --git a/docs/wiki-guide/images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png b/docs/wiki-guide/images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png index 265e540..d3380cb 100644 Binary files a/docs/wiki-guide/images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png and b/docs/wiki-guide/images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png differ diff --git a/mkdocs.yaml b/mkdocs.yaml index f87c5a3..9ab28fd 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -92,6 +92,8 @@ markdown_extensions: - pymdownx.inlinehilite - pymdownx.snippets - pymdownx.superfences + - pymdownx.tabbed: + alternate_style: true - pymdownx.tasklist - pymdownx.tilde - pymdownx.keys