propose redacted R stored in KVS as R_redacted #523
Problem: The optional `scheduling` key in RFC 20 R objects can be very large (megabytes of scheduler graph data). Storing and shipping the full R to every consumer, including one job shell per allocated rank, wastes overlay network bandwidth. Most consumers need only the execution portion of R and never use `scheduling`.

Add an `R_redacted` sibling KVS key alongside the existing `job.<id>.R` key to hold a copy of R with the optional `scheduling` key omitted. The complete R (including `scheduling`) is preserved unchanged in `job.<id>.R` for the consumers that require it (the scheduler on hello/replay and the subinstance resource module), while high-volume consumers such as job shells may read the smaller `R_redacted`.

Document this convention in RFC 16 (KVS schema), RFC 20 (R format), RFC 27 (alloc commit requirement), and RFC 28 (acquire protocol clarification).

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
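As a rough illustration of the redaction described above, the following Python sketch drops the top-level `scheduling` key from a decoded Rv1 object and leaves everything else untouched. The function name `redact_R` and the sample values are hypothetical, not taken from flux-core or the RFCs; the only things assumed from RFC 20 are the top-level `version`, `execution`, and optional `scheduling` keys.

```python
import json


def redact_R(R):
    """Return a shallow copy of a decoded Rv1 dict without "scheduling".

    Hypothetical helper for illustration only. The input dict is not
    modified, so the full R can still be stored under job.<id>.R.
    """
    return {key: value for key, value in R.items() if key != "scheduling"}


# Sample Rv1 object; the values are made up for illustration.
full_R = {
    "version": 1,
    "execution": {
        "R_lite": [{"rank": "0", "children": {"core": "0-3"}}],
        "nodelist": ["node0"],
        "starttime": 0,
        "expiration": 0,
    },
    # In practice this may be megabytes of scheduler graph data.
    "scheduling": {"graph": {"nodes": [], "edges": []}},
}

redacted_R = redact_R(full_R)

# Serialized forms, as they might be written to the two sibling KVS keys.
full_json = json.dumps(full_R)
redacted_json = json.dumps(redacted_R)
```

Because only the top-level key is removed, a consumer that reads just the `execution` section should see the same data from either copy.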
Problem: The RFC 27 Alloc Success section lists only the OPTIONAL `annotations` key as an additional field of the SUCCESS response, but the implementation requires an `R` key carrying the allocated resource set (RFC 20) and the job manager treats its absence as an error.

Add `R` as a REQUIRED key of the SUCCESS response and specify that the scheduler MAY omit the OPTIONAL `scheduling` key from it, since the job manager needs only the execution portion.

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
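The requirement above could be checked on the receiving side roughly as follows. This is a minimal sketch, not the job manager's actual code: `check_alloc_success` is a hypothetical name, and the sketch assumes the SUCCESS payload has already been decoded into a Python dict with `R` as a nested object.

```python
def check_alloc_success(payload):
    """Validate the keys of a decoded alloc SUCCESS payload.

    Hypothetical helper for illustration only. "R" is REQUIRED,
    "annotations" is OPTIONAL, and R's "scheduling" key may be absent.
    Returns a tuple (R, annotations), where annotations may be None.
    """
    if "R" not in payload:
        raise ValueError("alloc SUCCESS response is missing required key 'R'")
    R = payload["R"]
    if not isinstance(R, dict) or "execution" not in R:
        raise ValueError("R must be an object with an 'execution' section")
    # The scheduler MAY omit "scheduling", so its absence is not an error.
    return R, payload.get("annotations")
```

A response whose `R` carries only `version` and `execution` passes this check, while a response with no `R` at all is rejected.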
Just brainstorming, but since not all schedulers need this, would it make sense to flip it so Alternatively, and more in keeping with the original design, add a way for the |
As I commented in flux-framework/flux-core#7615, this won't work because it would break backwards compatibility: the scheduler and resource modules of subinstances fetch the |
I like this suggestion if we could figure out a path to avoid breaking inter-version compatibility. Maybe job-info could assemble the full R by default unless a new flag is used? Then RPCs from older versions of Flux would continue to work, and the job shell and other same-version job-info callers could add the flag. I wonder if it would be simpler to add a new optional |
This PR updates relevant RFCs to require that an Rv1 object with any `scheduling` key be stored in the KVS as `R_redacted` co-located with any full `R`. This is to solve the issue of a large scheduling key being unnecessarily fetched from the KVS for consumers that only need the execution section. See flux-framework/flux-core#7615

A commit is also included that adds a missing requirement to the resource acquisition protocol that R be sent in the alloc response (de-facto requirement in the code now). The scheduling key is made OPTIONAL in this response in keeping with the spirit of this PR.