Add support for datasets stored remotely in S3 buckets#1085
Add support for datasets stored remotely in S3 buckets#1085amisevsk merged 9 commits intokitops-ml:mainfrom
Conversation
ca9837d to
e3e1251
Compare
There was a problem hiding this comment.
Pull request overview
Adds first-class support for datasets stored remotely in S3 (verified by ETag) and strengthens Kitfile validation to enforce the current manifest schema/version, while refactoring common ModelKit reference utilities into the artifact package for consistent use across commands.
Changes:
- Enforce Kitfile
manifestVersion: 1.0.0and validate component paths (local vs ModelKit refs vs URLs), including remote dataset fields (remotePath,remoteHash). - Add S3 integration (AWS SDK v2) to verify remote datasets during pack and optionally download them during unpack (
--include-remote). - Centralize ModelKit reference parsing/formatting helpers under
pkg/artifactand update commands/callers accordingly.
Reviewed changes
Copilot reviewed 60 out of 65 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/lib/repo/util/reference.go | Removes reference parsing helpers from repo util (leaving OCI-manifest utilities). |
| pkg/lib/repo/util/kitfile.go | Removes ModelKit reference detection helper from this package. |
| pkg/lib/repo/local/repository.go | Switches digest detection helper to artifact.ReferenceIsDigest. |
| pkg/lib/repo/local/pull.go | Switches digest detection helper to artifact.ReferenceIsDigest. |
| pkg/lib/kitfile/resolve.go | Uses artifact.ParseReference / artifact.IsModelKitReference for resolution flow. |
| pkg/lib/filesystem/unpack/store.go | Uses artifact.DefaultRegistry when deciding local vs remote store. |
| pkg/lib/filesystem/unpack/options.go | Adds IncludeRemote option to library unpack options. |
| pkg/lib/filesystem/unpack/core.go | Implements remote dataset handling during unpack (optional download via S3). |
| pkg/lib/filesystem/local-storage.go | Skips packing remote datasets; verifies remote S3 objects exist/ETag matches. |
| pkg/lib/external/s3api/env.go | Defines explicit S3-related environment variables supported by Kit. |
| pkg/lib/external/s3api/api_test.go | Adds unit tests for S3 reference parsing and HEAD/ETag verification. |
| pkg/lib/external/s3api/api.go | Adds S3 URL parsing, object verification, download, and client setup. |
| pkg/lib/external/hf/list.go | Moves HF listing logic under pkg/lib/external/hf. |
| pkg/lib/external/hf/download.go | Moves HF download logic under pkg/lib/external/hf with concurrency/progress. |
| pkg/lib/external/git/clone.go | Moves git clone/LFS pull logic under pkg/lib/external/git. |
| pkg/lib/external/git/cleanup.go | Adds helper to strip git metadata from cloned repos. |
| pkg/lib/completion/modelkits.go | Uses artifact.FormatRepositoryForDisplay for completion output. |
| pkg/cmd/unpack/cmd.go | Adds --include-remote and wires it through to library unpack options. |
| pkg/cmd/tag/cmd.go | Uses artifact.ParseReference for tag command parsing. |
| pkg/cmd/remove/remove.go | Uses artifact.FormatRepositoryForDisplay for output/display. |
| pkg/cmd/remove/remote.go | Uses artifact.ReferenceIsDigest/FormatRepositoryForDisplay in remote remove flow. |
| pkg/cmd/remove/cmd.go | Uses artifact.ParseReference/FormatRepositoryForDisplay. |
| pkg/cmd/push/cmd.go | Uses artifact.ParseReference for push source/target parsing. |
| pkg/cmd/pull/pull.go | Uses artifact.ParseReference and artifact.IsModelKitReference for parent pulls. |
| pkg/cmd/pull/cmd.go | Uses artifact.ParseReference for pull argument parsing. |
| pkg/cmd/pack/pack.go | Uses artifact.IsModelKitReference and display formatting for referenced base kits. |
| pkg/cmd/pack/cmd.go | Uses artifact.ParseReference and artifact.DefaultReference. |
| pkg/cmd/list/list.go | Uses artifact.FormatRepositoryForDisplay for listing output. |
| pkg/cmd/list/cmd.go | Uses artifact.ParseReference for list argument parsing. |
| pkg/cmd/kitimport/hfimport.go | Switches import to new pkg/lib/external/hf package location. |
| pkg/cmd/kitimport/gitimport.go | Switches import to new pkg/lib/external/git package location. |
| pkg/cmd/kitimport/cmd.go | Uses artifact.ParseReference and new external git package. |
| pkg/cmd/inspect/cmd.go | Uses artifact.ParseReference/FormatRepositoryForDisplay/DefaultRegistry. |
| pkg/cmd/info/cmd.go | Uses artifact.ParseReference/FormatRepositoryForDisplay/DefaultRegistry. |
| pkg/cmd/diff/cmd.go | Uses artifact.ParseReference for both refs. |
| pkg/cmd/dev/opts.go | Uses artifact.ParseReference to detect ModelKit references in dev start. |
| pkg/cmd/dev/dev.go | Uses artifact.IsModelKitReference when resolving referenced base kits. |
| pkg/artifact/util_test.go | Moves reference parsing tests into artifact package. |
| pkg/artifact/util.go | New home for ModelKit reference utilities (parse/format/defaults). |
| pkg/artifact/testdata/validation/s3-path-for-prompts.yaml | Validation fixture: disallow s3:// in prompts paths. |
| pkg/artifact/testdata/validation/s3-path-for-modelpart.yaml | Validation fixture: disallow s3:// in model part paths. |
| pkg/artifact/testdata/validation/s3-path-for-model.yaml | Validation fixture: disallow s3:// in model path (except ModelKit refs). |
| pkg/artifact/testdata/validation/s3-path-for-docs.yaml | Validation fixture: disallow s3:// in docs paths. |
| pkg/artifact/testdata/validation/s3-path-for-datasets.yaml | Validation fixture: disallow s3:// in dataset local path. |
| pkg/artifact/testdata/validation/s3-path-for-code.yaml | Validation fixture: disallow s3:// in code paths. |
| pkg/artifact/testdata/validation/remote-datasets-supported.yaml | Validation fixture: allow dataset remotePath + remoteHash. |
| pkg/artifact/testdata/validation/remote-dataset-requires-hash.yaml | Validation fixture: require remoteHash when remotePath is set. |
| pkg/artifact/testdata/validation/remote-dataset-must-use-s3.yaml | Validation fixture: remote dataset must be s3:// scheme. |
| pkg/artifact/testdata/validation/modelkit-ref-for-prompts.yaml | Validation fixture: disallow ModelKit refs in prompts paths. |
| pkg/artifact/testdata/validation/modelkit-ref-for-modelpart.yaml | Validation fixture: disallow ModelKit refs in model part paths. |
| pkg/artifact/testdata/validation/modelkit-ref-for-model.yaml | Validation fixture: allow ModelKit refs for model path. |
| pkg/artifact/testdata/validation/modelkit-ref-for-docs.yaml | Validation fixture: disallow ModelKit refs in docs paths. |
| pkg/artifact/testdata/validation/modelkit-ref-for-datasets.yaml | Validation fixture: disallow ModelKit refs in dataset paths. |
| pkg/artifact/testdata/validation/modelkit-ref-for-code.yaml | Validation fixture: disallow ModelKit refs in code paths. |
| pkg/artifact/testdata/validation/http-path-for-prompts.yaml | Validation fixture: disallow HTTP(S) URLs in prompts paths. |
| pkg/artifact/testdata/validation/http-path-for-modelpart.yaml | Validation fixture: disallow HTTP(S) URLs in model part paths. |
| pkg/artifact/testdata/validation/http-path-for-model.yaml | Validation fixture: disallow HTTP(S) URLs in model path. |
| pkg/artifact/testdata/validation/http-path-for-docs.yaml | Validation fixture: disallow HTTP(S) URLs in docs paths. |
| pkg/artifact/testdata/validation/http-path-for-datasets.yaml | Validation fixture: disallow HTTP(S) URLs in dataset paths. |
| pkg/artifact/testdata/validation/http-path-for-code.yaml | Validation fixture: disallow HTTP(S) URLs in code paths. |
| pkg/artifact/testdata/validation/bad-manifest-version.yaml | Validation fixture: enforce manifestVersion equals 1.0.0. |
| pkg/artifact/kitfile_test.go | Adds table-driven validation tests loading fixtures from testdata/validation. |
| pkg/artifact/kitfile.go | Adds dataset remote fields, Kitfile validation, and URL/path classification helpers. |
| go.mod | Adds AWS SDK v2 config + S3 dependencies. |
| go.sum | Updates dependency checksums for added AWS SDK modules. |
Comments suppressed due to low confidence (1)
pkg/lib/filesystem/unpack/core.go:272
- datasetIdx is now used as an index into config.DataSets (skipping remote entries), so logging it as “Unpacked %d dataset layers” can be incorrect when remote datasets exist. Track a separate counter for unpacked dataset layers (increment when a dataset layer is actually processed).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func GetPathType(path string) (PathType, error) { | ||
| if IsModelKitReference(path) { | ||
| return ModelReferencePathType, nil | ||
| } | ||
|
|
||
| // Treat things that don't parse as URLs as local paths | ||
| parsed, err := url.Parse(path) | ||
| if err != nil || (parsed.Scheme == "" && parsed.Host == "") { | ||
| return LocalPathType, nil | ||
| } | ||
|
|
||
| switch parsed.Scheme { | ||
| case "s3": | ||
| return S3PathType, nil | ||
| case "http", "https": | ||
| return UnknownPathType, fmt.Errorf("HTTP urls are not supported in paths (%s)", path) | ||
| } | ||
|
|
||
| return UnknownPathType, fmt.Errorf("unrecognized URL in path: %s", path) | ||
| } |
There was a problem hiding this comment.
GetPathType() uses url.Parse to classify paths; on Windows, drive-letter paths like C:\data\file parse as scheme c and will be rejected as an “unrecognized URL”. Since the project ships Windows builds, this validation should treat Windows absolute/drive paths as LocalPathType (e.g., by checking filepath.VolumeName/path patterns before url.Parse).
There was a problem hiding this comment.
I'm considering leaving this as-is; our path requirements are a little shaky at the moment and I believe we assume they're defined unix-style even on Windows (where paths are converted to Windows-style paths before accessing disk). I don't have access to a Windows machine anymore, so I can't easily test these flows.
In any case, Kitfile paths must be relative, so drive-letter paths are not applicable.
| datasetIdx = idx + 1 | ||
| break | ||
| } | ||
| if !shouldUnpackLayer(entry, opts.FilterConfs) { |
There was a problem hiding this comment.
If all datasets are remote. we would call shouldUnpackLayer with zero-value entry. I think we should update shouldUnpackLayer to explicitly handle empty entry (not because it has empty Name and Path) or not call shouldUnpackLayer.
There was a problem hiding this comment.
This should never happen, since the entry into this block is when we encounter a dataset layer in the manifest (i.e. the Kitfile has at least one non-remote dataset). I'll add a guard against this regardless.
| clientOpts = append(clientOpts, func(o *s3.Options) { o.BaseEndpoint = aws.String(endpoint) }) | ||
| // For now, use path style URLs (endpoint.com/bucket/key) instead of virtual-hosted style (bucket.endpoint.com/key) when | ||
| // an alternate endpoint is specified; this is used/supported by most S3-compatible APIs | ||
| clientOpts = append(clientOpts, func(o *s3.Options) { o.UsePathStyle = true }) |
There was a problem hiding this comment.
Appearently use ARN region setting enables cross-region for the bucket urls too.
| clientOpts = append(clientOpts, func(o *s3.Options) { o.UsePathStyle = true }) | |
| clientOpts = append(clientOpts, func(o *s3.Options) { | |
| o.UseARNRegion = true | |
| }) |
There was a problem hiding this comment.
Lets try it, I'm not sure how all these settings play together.
|
if the use ARN config solves the cross region issue for AWS S3, I do not think we need to add region to s3 url somehow. I think this region problem is on AWS only the other compatible may not even have "regions". |
e3e1251 to
dd522bd
Compare
dd522bd to
6f30e20
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 60 out of 67 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6f30e20 to
49d438f
Compare
| } | ||
| defer obj.Body.Close() | ||
|
|
||
| outfile, err := os.Create(outputPath) |
There was a problem hiding this comment.
Noticed that this works different than saveContentLayer which writes to a temp file, then os.Rename to outputPath.
There was a problem hiding this comment.
I think the equivalent for this function would be unpackLayer -- DownloadObject is called only when we want a local copy of the blob from S3. During pack (where we use saveContentLayer) we only verify the object exists and its ETag matches the remoteHash.
amisevsk
left a comment
There was a problem hiding this comment.
For some reason, GitHub is interpreting my responses here as part of a review and not letting me submit them separately (unless I delete all of them, I assume)
| } | ||
| defer obj.Body.Close() | ||
|
|
||
| outfile, err := os.Create(outputPath) |
There was a problem hiding this comment.
I think the equivalent for this function would be unpackLayer -- DownloadObject is called only when we want a local copy of the blob from S3. During pack (where we use saveContentLayer) we only verify the object exists and its ETag matches the remoteHash.
Move standalone utility functions for parsing and validating OCI references into the artifact package to support validating Kitfiles. Signed-off-by: Angel Misevski <amisevsk@gmail.com>
Add utility functions and validation for Kitfiles. To support remote dataset storage (e.g. in an S3 bucket), add field "Hash" to DataSet struct; if a DataSet path is remote, a hash is required. Signed-off-by: Angel Misevski <amisevsk@gmail.com>
If a Kitfile contains a dataset with an `s3://` path, verify that the hash on that object matches the ETag of the object in S3 while packing. S3 functionality should work with any S3-compatible storage, and respects configuration via env vars * S3_API_ENDPOINT * S3_REGION * S3_ACCESS_KEY_ID * S3_SECRET_KEY For AWS S3 storage, standard default configuration is retrieved from the system. Signed-off-by: Angel Misevski <amisevsk@gmail.com>
49d438f to
cccdc99
Compare
0979b8e to
6536b4f
Compare
|
I've downgraded manifestVersion errors to (for now) printed warnings. I don't love the idea of including the output package in artifact but this was the cleanest way to add warnings without putting them throughout the repository I've also remembered we did have some Kitfile validation code elsewhere; I've merged it into the current validate method (see commit 0ec8f95). In this process, I also included a bugfix for #1109. |
| if !opts.IncludeRemote { | ||
| output.Logf(output.LogLevelWarn, "ModelKit contains remote datasets. To unpack, specify the --include-remote flag") | ||
| } else { | ||
| client, err := s3api.SetUpClient(ctx) |
There was a problem hiding this comment.
Should this call shouldUnpackLayer() at some point to check if --filter is set?
There was a problem hiding this comment.
Good point. I forgot this because originally path: contained the remote ref, but now that they're separate fields that makes sense.
| return buf.Bytes(), nil | ||
| } | ||
|
|
||
| func (kf *KitFile) Validate() (err error, warnings []string) { |
There was a problem hiding this comment.
Should this return error as the last value to satisfy the Golang gods.
There was a problem hiding this comment.
Yeah, good catch -- I forgot about the Golang gods
Add flag `--include-remote` to unpack command; if specified, remote datasets will be downloaded from S3 while unpacking. S3 functionality should work with any S3-compatible storage, and respects configuration via env vars * S3_API_ENDPOINT * S3_REGION * S3_ACCESS_KEY_ID * S3_SECRET_KEY For AWS S3 storage, standard default configuration is retrieved from the system. Signed-off-by: Angel Misevski <amisevsk@gmail.com>
Signed-off-by: Angel Misevski <amisevsk@gmail.com>
Move libraries for communicating with Huggingface, S3, and git repositories under pkg/lib/external/ to avoid overfilling pkg/lib Signed-off-by: Angel Misevski <amisevsk@gmail.com>
Signed-off-by: Angel Misevski <amisevsk@gmail.com>
* Merge split Kitfile validation (previously in pkg/lib/kitfile/validate.go and pkg/artifact/kitfile.go) into the artifact package * Fix handling of duplicate paths by checking against cleaned filepaths instead of raw strings * Add test cases for validation covering added validation processes Signed-off-by: Angel Misevski <amisevsk@gmail.com>
In order to preserve compatibility with older ModelKits, treat any unrecognized manifestVersion as 1.0.0 and print a warning instead of refusing to process the Kitfile Signed-off-by: Angel Misevski <amisevsk@gmail.com>
6536b4f to
95006e1
Compare
Description
This PR includes the following changes:
manifestVersionis1.0.0in Kitfiles and print a warning if it is not.This is a potentially breaking change, as previously manifestVersion was completely unused. Hopefully it does not impact too many people, as this is the value used bykit init. ValidatingmanifestVersionwill allow us to more safely introduce Kitfile changes in the futuremodelcomponents, etc.)remotePathare not packed into ModelKits; instead, Kit verifies that the remoteHash for that dataset corresponds to the ETag for the bucket object--include-remoteflag forkit unpack. If specified, Kit will download files from the S3 bucket, verifying ETagI've tested these changes against AWS S3 and Cloudflare R2 and was able to pack and unpack ModelKits containing remote datasets.
Open questions for the implementation
S3_REGIONinto thes3://reference somehow, as it's often required (and different datasets may be stored in different regions). I don't see a clean way to include it in theremotePathand don't want to overpollute the dataset structLinked issues
Closes #887
AI-Assisted Code