Skip to content

Add support for datasets stored remotely in S3 buckets#1085

Merged
amisevsk merged 9 commits intokitops-ml:mainfrom
amisevsk:remote-datasets
Mar 11, 2026
Merged

Add support for datasets stored remotely in S3 buckets#1085
amisevsk merged 9 commits intokitops-ml:mainfrom
amisevsk:remote-datasets

Conversation

@amisevsk
Copy link
Copy Markdown
Contributor

@amisevsk amisevsk commented Feb 10, 2026

Description

This PR includes the following changes:

  • Validation for Kitfiles
    • We now check that manifestVersion is 1.0.0 in Kitfiles and print a warning if it is not. This is a potentially breaking change, as previously manifestVersion was completely unused. Hopefully it does not impact too many people, as this is the value used by kit init. Validating manifestVersion will allow us to more safely introduce Kitfile changes in the future
    • We now validate paths in the Kitfile to ensure they are as expected (e.g. no URLs, modelkit references only for model components, etc.)
  • Support for remote datasets, with data stored in an S3 bucket:
    • Kitfile datasets include two new fields:
      datasets:
      name: my-dataset
      path: some/local/path
      remotePath: s3://<bucket-name>/<bucket-key>
      remoteHash: <ETag for remote object>
    • Datasets with remotePath are not packed into ModelKits; instead, Kit verifies that the remoteHash for that dataset corresponds to the ETag for the bucket object
    • Remote datasets can be unpacked by specifying the --include-remote flag for kit unpack. If specified, Kit will download files from the S3 bucket, verifying ETag
    • S3 support is provided by the AWS Go SDK (v2) and includes environment support as specified there. In addition, Kit explicitly supports the following environment variables:
      • S3_API_ENDPOINT - API endpoint for S3-compatible services
      • S3_REGION - Region for S3
      • S3_ACCESS_KEY_ID - Access key ID
      • S3_SECRET_KEY - Secret access Key

I've tested these changes against AWS S3 and Cloudflare R2 and was able to pack and unpack ModelKits containing remote datasets.

Open questions for the implementation

  • We're currently only checking ETags, as they are widely supported; some services also support computing actual digests for contents (or potentially in some cases) -- should we also support better checksums
    • Note that ETags can be used to verify that the object hasn't been changed, but they are not straightforward digests of the content; for example, the ETag for blobs uploaded in multiple parts depend on the size of each part
  • I'm not sure if we should incorporate S3_REGION into the s3:// reference somehow, as it's often required (and different datasets may be stored in different regions). I don't see a clean way to include it in the remotePath and don't want to overpollute the dataset struct

Linked issues

Closes #887

AI-Assisted Code

  • This PR contains AI-generated code that I have reviewed and tested
  • I take full responsibility for all code in this PR, regardless of how it was created

@amisevsk amisevsk requested review from Copilot and gorkem February 10, 2026 01:03
@amisevsk amisevsk force-pushed the remote-datasets branch 2 times, most recently from ca9837d to e3e1251 Compare February 10, 2026 01:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class support for datasets stored remotely in S3 (verified by ETag) and strengthens Kitfile validation to enforce the current manifest schema/version, while refactoring common ModelKit reference utilities into the artifact package for consistent use across commands.

Changes:

  • Enforce Kitfile manifestVersion: 1.0.0 and validate component paths (local vs ModelKit refs vs URLs), including remote dataset fields (remotePath, remoteHash).
  • Add S3 integration (AWS SDK v2) to verify remote datasets during pack and optionally download them during unpack (--include-remote).
  • Centralize ModelKit reference parsing/formatting helpers under pkg/artifact and update commands/callers accordingly.

Reviewed changes

Copilot reviewed 60 out of 65 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pkg/lib/repo/util/reference.go Removes reference parsing helpers from repo util (leaving OCI-manifest utilities).
pkg/lib/repo/util/kitfile.go Removes ModelKit reference detection helper from this package.
pkg/lib/repo/local/repository.go Switches digest detection helper to artifact.ReferenceIsDigest.
pkg/lib/repo/local/pull.go Switches digest detection helper to artifact.ReferenceIsDigest.
pkg/lib/kitfile/resolve.go Uses artifact.ParseReference / artifact.IsModelKitReference for resolution flow.
pkg/lib/filesystem/unpack/store.go Uses artifact.DefaultRegistry when deciding local vs remote store.
pkg/lib/filesystem/unpack/options.go Adds IncludeRemote option to library unpack options.
pkg/lib/filesystem/unpack/core.go Implements remote dataset handling during unpack (optional download via S3).
pkg/lib/filesystem/local-storage.go Skips packing remote datasets; verifies remote S3 objects exist/ETag matches.
pkg/lib/external/s3api/env.go Defines explicit S3-related environment variables supported by Kit.
pkg/lib/external/s3api/api_test.go Adds unit tests for S3 reference parsing and HEAD/ETag verification.
pkg/lib/external/s3api/api.go Adds S3 URL parsing, object verification, download, and client setup.
pkg/lib/external/hf/list.go Moves HF listing logic under pkg/lib/external/hf.
pkg/lib/external/hf/download.go Moves HF download logic under pkg/lib/external/hf with concurrency/progress.
pkg/lib/external/git/clone.go Moves git clone/LFS pull logic under pkg/lib/external/git.
pkg/lib/external/git/cleanup.go Adds helper to strip git metadata from cloned repos.
pkg/lib/completion/modelkits.go Uses artifact.FormatRepositoryForDisplay for completion output.
pkg/cmd/unpack/cmd.go Adds --include-remote and wires it through to library unpack options.
pkg/cmd/tag/cmd.go Uses artifact.ParseReference for tag command parsing.
pkg/cmd/remove/remove.go Uses artifact.FormatRepositoryForDisplay for output/display.
pkg/cmd/remove/remote.go Uses artifact.ReferenceIsDigest/FormatRepositoryForDisplay in remote remove flow.
pkg/cmd/remove/cmd.go Uses artifact.ParseReference/FormatRepositoryForDisplay.
pkg/cmd/push/cmd.go Uses artifact.ParseReference for push source/target parsing.
pkg/cmd/pull/pull.go Uses artifact.ParseReference and artifact.IsModelKitReference for parent pulls.
pkg/cmd/pull/cmd.go Uses artifact.ParseReference for pull argument parsing.
pkg/cmd/pack/pack.go Uses artifact.IsModelKitReference and display formatting for referenced base kits.
pkg/cmd/pack/cmd.go Uses artifact.ParseReference and artifact.DefaultReference.
pkg/cmd/list/list.go Uses artifact.FormatRepositoryForDisplay for listing output.
pkg/cmd/list/cmd.go Uses artifact.ParseReference for list argument parsing.
pkg/cmd/kitimport/hfimport.go Switches import to new pkg/lib/external/hf package location.
pkg/cmd/kitimport/gitimport.go Switches import to new pkg/lib/external/git package location.
pkg/cmd/kitimport/cmd.go Uses artifact.ParseReference and new external git package.
pkg/cmd/inspect/cmd.go Uses artifact.ParseReference/FormatRepositoryForDisplay/DefaultRegistry.
pkg/cmd/info/cmd.go Uses artifact.ParseReference/FormatRepositoryForDisplay/DefaultRegistry.
pkg/cmd/diff/cmd.go Uses artifact.ParseReference for both refs.
pkg/cmd/dev/opts.go Uses artifact.ParseReference to detect ModelKit references in dev start.
pkg/cmd/dev/dev.go Uses artifact.IsModelKitReference when resolving referenced base kits.
pkg/artifact/util_test.go Moves reference parsing tests into artifact package.
pkg/artifact/util.go New home for ModelKit reference utilities (parse/format/defaults).
pkg/artifact/testdata/validation/s3-path-for-prompts.yaml Validation fixture: disallow s3:// in prompts paths.
pkg/artifact/testdata/validation/s3-path-for-modelpart.yaml Validation fixture: disallow s3:// in model part paths.
pkg/artifact/testdata/validation/s3-path-for-model.yaml Validation fixture: disallow s3:// in model path (except ModelKit refs).
pkg/artifact/testdata/validation/s3-path-for-docs.yaml Validation fixture: disallow s3:// in docs paths.
pkg/artifact/testdata/validation/s3-path-for-datasets.yaml Validation fixture: disallow s3:// in dataset local path.
pkg/artifact/testdata/validation/s3-path-for-code.yaml Validation fixture: disallow s3:// in code paths.
pkg/artifact/testdata/validation/remote-datasets-supported.yaml Validation fixture: allow dataset remotePath + remoteHash.
pkg/artifact/testdata/validation/remote-dataset-requires-hash.yaml Validation fixture: require remoteHash when remotePath is set.
pkg/artifact/testdata/validation/remote-dataset-must-use-s3.yaml Validation fixture: remote dataset must be s3:// scheme.
pkg/artifact/testdata/validation/modelkit-ref-for-prompts.yaml Validation fixture: disallow ModelKit refs in prompts paths.
pkg/artifact/testdata/validation/modelkit-ref-for-modelpart.yaml Validation fixture: disallow ModelKit refs in model part paths.
pkg/artifact/testdata/validation/modelkit-ref-for-model.yaml Validation fixture: allow ModelKit refs for model path.
pkg/artifact/testdata/validation/modelkit-ref-for-docs.yaml Validation fixture: disallow ModelKit refs in docs paths.
pkg/artifact/testdata/validation/modelkit-ref-for-datasets.yaml Validation fixture: disallow ModelKit refs in dataset paths.
pkg/artifact/testdata/validation/modelkit-ref-for-code.yaml Validation fixture: disallow ModelKit refs in code paths.
pkg/artifact/testdata/validation/http-path-for-prompts.yaml Validation fixture: disallow HTTP(S) URLs in prompts paths.
pkg/artifact/testdata/validation/http-path-for-modelpart.yaml Validation fixture: disallow HTTP(S) URLs in model part paths.
pkg/artifact/testdata/validation/http-path-for-model.yaml Validation fixture: disallow HTTP(S) URLs in model path.
pkg/artifact/testdata/validation/http-path-for-docs.yaml Validation fixture: disallow HTTP(S) URLs in docs paths.
pkg/artifact/testdata/validation/http-path-for-datasets.yaml Validation fixture: disallow HTTP(S) URLs in dataset paths.
pkg/artifact/testdata/validation/http-path-for-code.yaml Validation fixture: disallow HTTP(S) URLs in code paths.
pkg/artifact/testdata/validation/bad-manifest-version.yaml Validation fixture: enforce manifestVersion equals 1.0.0.
pkg/artifact/kitfile_test.go Adds table-driven validation tests loading fixtures from testdata/validation.
pkg/artifact/kitfile.go Adds dataset remote fields, Kitfile validation, and URL/path classification helpers.
go.mod Adds AWS SDK v2 config + S3 dependencies.
go.sum Updates dependency checksums for added AWS SDK modules.
Comments suppressed due to low confidence (1)

pkg/lib/filesystem/unpack/core.go:272

  • datasetIdx is now used as an index into config.DataSets (skipping remote entries), so logging it as “Unpacked %d dataset layers” can be incorrect when remote datasets exist. Track a separate counter for unpacked dataset layers (increment when a dataset layer is actually processed).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/lib/external/s3api/api.go
Comment thread pkg/lib/filesystem/unpack/core.go Outdated
Comment thread pkg/lib/filesystem/unpack/core.go Outdated
Comment thread pkg/artifact/kitfile.go
Comment on lines +297 to +316
func GetPathType(path string) (PathType, error) {
if IsModelKitReference(path) {
return ModelReferencePathType, nil
}

// Treat things that don't parse as URLs as local paths
parsed, err := url.Parse(path)
if err != nil || (parsed.Scheme == "" && parsed.Host == "") {
return LocalPathType, nil
}

switch parsed.Scheme {
case "s3":
return S3PathType, nil
case "http", "https":
return UnknownPathType, fmt.Errorf("HTTP urls are not supported in paths (%s)", path)
}

return UnknownPathType, fmt.Errorf("unrecognized URL in path: %s", path)
}
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetPathType() uses url.Parse to classify paths; on Windows, drive-letter paths like C:\data\file parse as scheme c and will be rejected as an “unrecognized URL”. Since the project ships Windows builds, this validation should treat Windows absolute/drive paths as LocalPathType (e.g., by checking filepath.VolumeName/path patterns before url.Parse).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm considering leaving this as-is; our path requirements are a little shaky at the moment and I believe we assume they're defined unix-style even on Windows (where paths are converted to Windows-style paths before accessing disk). I don't have access to a Windows machine anymore, so I can't easily test these flows.

In any case, Kitfile paths must be relative, so drive-letter paths are not applicable.

Comment thread pkg/artifact/kitfile.go
Comment thread pkg/lib/external/s3api/api.go
Comment thread pkg/artifact/kitfile.go Outdated
Comment thread pkg/lib/filesystem/unpack/core.go Outdated
datasetIdx = idx + 1
break
}
if !shouldUnpackLayer(entry, opts.FilterConfs) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all datasets are remote. we would call shouldUnpackLayer with zero-value entry. I think we should update shouldUnpackLayer to explicitly handle empty entry (not because it has empty Name and Path) or not call shouldUnpackLayer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never happen, since the entry into this block is when we encounter a dataset layer in the manifest (i.e. the Kitfile has at least one non-remote dataset). I'll add a guard against this regardless.

clientOpts = append(clientOpts, func(o *s3.Options) { o.BaseEndpoint = aws.String(endpoint) })
// For now, use path style URLs (endpoint.com/bucket/key) instead of virtual-hosted style (bucket.endpoint.com/key) when
// an alternate endpoint is specified; this is used/supported by most S3-compatible APIs
clientOpts = append(clientOpts, func(o *s3.Options) { o.UsePathStyle = true })
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appearently use ARN region setting enables cross-region for the bucket urls too.

Suggested change
clientOpts = append(clientOpts, func(o *s3.Options) { o.UsePathStyle = true })
clientOpts = append(clientOpts, func(o *s3.Options) {
o.UseARNRegion = true
})

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets try it, I'm not sure how all these settings play together.

@gorkem
Copy link
Copy Markdown
Member

gorkem commented Feb 10, 2026

if the use ARN config solves the cross region issue for AWS S3, I do not think we need to add region to s3 url somehow. I think this region problem is on AWS only the other compatible may not even have "regions".

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 60 out of 67 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/lib/external/s3api/api.go
Comment thread pkg/lib/external/s3api/api.go
Comment thread pkg/lib/external/s3api/api.go
Comment thread pkg/artifact/kitfile.go
}
defer obj.Body.Close()

outfile, err := os.Create(outputPath)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that this works different than saveContentLayer which writes to a temp file, then os.Rename to outputPath.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the equivalent for this function would be unpackLayer -- DownloadObject is called only when we want a local copy of the blob from S3. During pack (where we use saveContentLayer) we only verify the object exists and its ETag matches the remoteHash.

Comment thread pkg/artifact/kitfile.go
Copy link
Copy Markdown
Contributor Author

@amisevsk amisevsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, GitHub is interpreting my responses here as part of a review and not letting me submit them separately (unless I delete all of them, I assume)

Comment thread pkg/artifact/kitfile.go
}
defer obj.Body.Close()

outfile, err := os.Create(outputPath)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the equivalent for this function would be unpackLayer -- DownloadObject is called only when we want a local copy of the blob from S3. During pack (where we use saveContentLayer) we only verify the object exists and its ETag matches the remoteHash.

amisevsk added 3 commits March 6, 2026 17:43
Move standalone utility functions for parsing and validating OCI
references into the artifact package to support validating Kitfiles.

Signed-off-by: Angel Misevski <amisevsk@gmail.com>
Add utility functions and validation for Kitfiles. To support remote
dataset storage (e.g. in an S3 bucket), add field "Hash" to DataSet
struct; if a DataSet path is remote, a hash is required.

Signed-off-by: Angel Misevski <amisevsk@gmail.com>
If a Kitfile contains a dataset with an `s3://` path, verify that the
hash on that object matches the ETag of the object in S3 while packing.

S3 functionality should work with any S3-compatible storage, and
respects configuration via env vars

  * S3_API_ENDPOINT
  * S3_REGION
  * S3_ACCESS_KEY_ID
  * S3_SECRET_KEY

For AWS S3 storage, standard default configuration is retrieved from the
system.

Signed-off-by: Angel Misevski <amisevsk@gmail.com>
@amisevsk
Copy link
Copy Markdown
Contributor Author

amisevsk commented Mar 6, 2026

I've downgraded manifestVersion errors to (for now) printed warnings. I don't love the idea of including the output package in artifact but this was the cleanest way to add warnings without putting them throughout the repository

I've also remembered we did have some Kitfile validation code elsewhere; I've merged it into the current validate method (see commit 0ec8f95). In this process, I also included a bugfix for #1109.

if !opts.IncludeRemote {
output.Logf(output.LogLevelWarn, "ModelKit contains remote datasets. To unpack, specify the --include-remote flag")
} else {
client, err := s3api.SetUpClient(ctx)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this call shouldUnpackLayer() at some point to check if --filter is set?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I forgot this because originally path: contained the remote ref, but now that they're separate fields that makes sense.

Comment thread pkg/artifact/kitfile.go Outdated
return buf.Bytes(), nil
}

func (kf *KitFile) Validate() (err error, warnings []string) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return error as the last value to satisfy the Golang gods.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good catch -- I forgot about the Golang gods

Add flag `--include-remote` to unpack command; if specified, remote
datasets will be downloaded from S3 while unpacking.

S3 functionality should work with any S3-compatible storage, and
respects configuration via env vars

  * S3_API_ENDPOINT
  * S3_REGION
  * S3_ACCESS_KEY_ID
  * S3_SECRET_KEY

For AWS S3 storage, standard default configuration is retrieved from the
system.

Signed-off-by: Angel Misevski <amisevsk@gmail.com>
Signed-off-by: Angel Misevski <amisevsk@gmail.com>
Move libraries for communicating with Huggingface, S3, and git
repositories under pkg/lib/external/ to avoid overfilling pkg/lib

Signed-off-by: Angel Misevski <amisevsk@gmail.com>
Signed-off-by: Angel Misevski <amisevsk@gmail.com>
* Merge split Kitfile validation (previously in
  pkg/lib/kitfile/validate.go and pkg/artifact/kitfile.go) into the
  artifact package
* Fix handling of duplicate paths by checking against cleaned filepaths
  instead of raw strings
* Add test cases for validation covering added validation processes

Signed-off-by: Angel Misevski <amisevsk@gmail.com>
In order to preserve compatibility with older ModelKits, treat any
unrecognized manifestVersion as 1.0.0 and print a warning instead of
refusing to process the Kitfile

Signed-off-by: Angel Misevski <amisevsk@gmail.com>
@amisevsk amisevsk requested a review from gorkem March 10, 2026 16:51
@amisevsk amisevsk merged commit 903cf61 into kitops-ml:main Mar 11, 2026
8 checks passed
@amisevsk amisevsk deleted the remote-datasets branch March 11, 2026 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Kit misses identical paths in validation if they are formatted differently Support for storing very large datasets remotely in cheaper blob storage

3 participants