resumable: clarify handling of conflicts and retry policies #3319
danielresnick wants to merge 7 commits into httpwg:main
See httpwg#3275 for discussion.
guoye-zhang left a comment
Thanks for opening this PR, the text looks good. Left a few minor comments.
- For a `5xx (Server Error)` status code, the client MAY automatically attempt upload resumption by retrieving the current offset ({{offset-retrieving}}).

If no final response was received at all due to connectivity issues, the client MAY automatically attempt upload resumption by retrieving the current offset ({{offset-retrieving}}).
- For a `409 (Conflict)` status code, the client SHOULD attempt to resume the upload by using the offset from the `Upload-Offset` response header field.
I think it's better to keep this flow the same as for 5xx: recommending an offset retrieval before resuming. Otherwise, this creates a new code path, and the text in "Upload Append" about picking the offset is also inconsistent with the recommendation here.
Updated. How does this look? It feels wrong to completely omit the option of using the `Upload-Offset` value from the response, since the text just below says the server MUST return it.
3. **Optimistic Concurrency Control**: By maintaining the upload's state in a strongly consistent datastore, the server can atomically check if the `Upload-Offset` in a request matches the resource's current state. If they do not match, the server can reject the request with a `409 (Conflict)` status code.

Since implementing this approach is not always technically possible or feasible, other measures can be considered as well. A simpler approach is that the server only processes a new request to retrieve the offset ({{offset-retrieving}}), append representation data ({{upload-appending}}), or cancellation ({{upload-cancellation}}) once all previous requests have been processed. This effectively implements exclusive access to the upload resource through an access lock. However, since network interruptions can occur in ways that cause the request to hang from the server's perspective, it might take the server significant time to realize the interruption and time out the request. During this period, the client will be unable to access the resource and resume the upload, causing friction for the end users. Therefore, the recommended approach is to terminate previous requests to enable quick resumption of uploads.
Servers SHOULD choose a strategy that best fits their architecture while fulfilling the requirements of this section. Regardless of the chosen strategy, clients MUST be prepared to handle a `409 (Conflict)` response as a recoverable error, as described in {{upload-appending}}.
All retry behaviors are MAY/SHOULD; do we need a MUST here?
Good catch. Fixed to be consistent (SHOULD).
- Addressed feedback around inconsistencies (HAVE vs MUST).
- Removed example 409 conflict recovery.
See discussion on httpwg#3319

2. **Pessimistic Locking**: The server processes requests for a given upload resource sequentially, effectively creating an exclusive lock on that resource. A new request is only processed after the previous one completes. This can be simpler to implement but may lead to delays if a request hangs from the server's perspective.

3. **Optimistic Concurrency Control**: By maintaining the upload's state in a strongly consistent datastore, the server can atomically check if the `Upload-Offset` in a request matches the resource's current state. If they do not match, the server can reject the request with a `409 (Conflict)` status code.
I'm not sure if this is enough to guarantee issue-free uploads. What if the server receives two requests for the same (correct) offset? They both match, but which content is appended to the upload resource? Did you mean to say that the content should be appended in one transaction together with the offset check? But that would make this the same as the pessimistic locking, I think.
I agree with Acconut here. To me, approaches 2 and 3 seem very similar in how they would work in practice. The server would receive two requests with the same offset (why though, really?), and one would be slightly faster than the other, i.e., it would get the exclusive lock and thus the second request would fail. I think the wording of point 2 hints that the requests are queued and processed one after another, which I guess is not the case (the offset would most likely never match for the second request).
I had a go at rewording this to be clearer, as it was both too vague and overly specific in different areas. Is this any better? The key distinction between 2 and 3 is that servers don't need to acquire an exclusive lock on the resource, which means they can achieve higher availability during network partitions.

2. **Pessimistic Locking**: The server processes requests for a given upload resource sequentially, effectively creating an exclusive lock on that resource. A new request is only processed after the previous one completes. This can be simpler to implement but may lead to delays if a request hangs from the server's perspective.

3. **Optimistic Concurrency Control**: The server detects concurrent modifications by atomically checking the resource's state before applying changes. If a conflict is detected, the server rejects the request with a 409 (Conflict) status code, ensuring that only one of multiple parallel requests can succeed.
I'm wondering how this would be achieved in practice on the server side. Does the server read the entire request content first and then try to apply it in one atomic operation that checks the offset and appends the content? The offset would then be validated after the request content is received, not when the headers are sent. Not a problem, but a different angle on the task.
Small suggestion:
Current: 3. **Optimistic Concurrency Control**: The server detects concurrent modifications by atomically checking the resource's state before applying changes. If a conflict is detected, the server rejects the request with a 409 (Conflict) status code, ensuring that only one of multiple parallel requests can succeed.
Suggested: 3. **Optimistic Concurrency Control**: The server handles concurrent modifications by checking the resource's state and applying changes together in one atomic operation. If a conflict is detected, the server rejects the request with a 409 (Conflict) status code, ensuring that only one of multiple parallel requests can succeed. Upload cancellation is implemented through a similar atomic operation.
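A minimal Python sketch of the reading in this suggestion: the server buffers the full request content, then checks the offset and appends in one atomic step. `ConditionalStore`, `append_if_offset`, and `AppendResult` are hypothetical names, and the in-process `threading.Lock` is only an assumption standing in for a strongly consistent datastore's conditional write. The lock covers the compare-and-append alone, not the transfer of the request content.

```python
import threading
from dataclasses import dataclass
from typing import Dict


@dataclass
class AppendResult:
    accepted: bool       # False: a server could answer 409 (Conflict)
    current_offset: int  # offset a server could report in Upload-Offset


class ConditionalStore:
    """Hypothetical stand-in for a strongly consistent datastore.

    Only the compare-and-append step is atomic. The caller is expected to
    have read the request content completely before calling it, so no lock
    is held while a (possibly hanging) transfer is still in progress.
    """

    def __init__(self) -> None:
        self._data: Dict[str, bytearray] = {}
        self._mutex = threading.Lock()  # models the store's atomic operation

    def create(self, upload_id: str) -> None:
        with self._mutex:
            self._data.setdefault(upload_id, bytearray())

    def content(self, upload_id: str) -> bytes:
        with self._mutex:
            return bytes(self._data[upload_id])

    def append_if_offset(self, upload_id: str, expected_offset: int,
                         content: bytes) -> AppendResult:
        # Check the offset and append in one atomic step, so two requests
        # that both claim the same (correct) offset cannot both succeed.
        with self._mutex:
            data = self._data[upload_id]
            if len(data) != expected_offset:
                return AppendResult(False, len(data))
            data.extend(content)
            return AppendResult(True, len(data))
```

Two requests that both claim offset 0 race on `append_if_offset`; exactly one is accepted, and the other learns the new offset, which a server could return in `Upload-Offset` alongside a `409 (Conflict)`. This is where it differs from pessimistic locking: nothing is held while a slow request's content is still arriving.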
3. **Optimistic Concurrency Control**: The server detects concurrent modifications by atomically checking the resource's state before applying changes. If a conflict is detected, the server rejects the request with a 409 (Conflict) status code, ensuring that only one of multiple parallel requests can succeed.

Since implementing this approach is not always technically possible or feasible, other measures can be considered as well. A simpler approach is that the server only processes a new request to retrieve the offset ({{offset-retrieving}}), append representation data ({{upload-appending}}), or cancellation ({{upload-cancellation}}) once all previous requests have been processed. This effectively implements exclusive access to the upload resource through an access lock. However, since network interruptions can occur in ways that cause the request to hang from the server's perspective, it might take the server significant time to realize the interruption and time out the request. During this period, the client will be unable to access the resource and resume the upload, causing friction for the end users. Therefore, the recommended approach is to terminate previous requests to enable quick resumption of uploads.
Servers SHOULD choose a strategy that best fits their architecture while fulfilling the requirements of this section. Regardless of the chosen strategy, clients SHOULD be prepared to handle a `409 (Conflict)` response as a recoverable error, as described in {{upload-appending}}.
I think this paragraph isn't necessary as it restates normative language that appeared previously in the section without adding new information.
Did you try to push changes to this PR? I'm not seeing any yet :)
The RECOMMENDED approach is as follows: If an upload resource receives a new request to retrieve the offset ({{offset-retrieving}}), append representation data ({{upload-appending}}), or cancel the upload ({{upload-cancellation}}) while a previous request for creating the upload ({{upload-creation}}) or appending representation data ({{upload-appending}}) is still ongoing, the resource SHOULD prevent race conditions, data loss, and corruption by terminating the previous request before processing the new request. Due to network delay and reordering, the resource might still be receiving representation data from an ongoing transfer for the same upload resource, which in the client's perspective has failed. Since the client is not allowed to perform multiple transfers in parallel, the upload resource can assume that the previous attempt has already failed. Therefore, the server MAY abruptly terminate the previous HTTP connection or stream.
To meet these requirements, a server can use various strategies. Three common approaches are:

1. **Preemptive Cancellation**: If the server receives a new request for an upload resource while a previous request for the same resource is in-flight, it abruptly terminates the previous HTTP connection or stream before processing the new request.
I think we should keep more context from the original wording that explains why this works:
> Since the client is not allowed to perform multiple transfers in parallel, the upload resource can assume that the previous attempt has already failed. Therefore, the server MAY abruptly terminate the previous HTTP connection or stream.
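The preemptive-cancellation strategy, and the reasoning quoted above, can be sketched with Python's `asyncio`, assuming each request is handled as an `asyncio` task. `UploadRegistry`, `handle`, and `demo_preemptive` are hypothetical names, and cancelling a task only approximates abruptly terminating the previous HTTP connection or stream; a real server would also reset the underlying transport.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple


class UploadRegistry:
    """Hypothetical tracker of the in-flight request for each upload resource."""

    def __init__(self) -> None:
        self._inflight: Dict[str, "asyncio.Task[Any]"] = {}

    async def handle(self, upload_id: str,
                     work: Callable[[], Awaitable[Any]]) -> Any:
        previous = self._inflight.get(upload_id)
        if previous is not None and not previous.done():
            # The client does not send parallel transfers, so an earlier
            # request still in flight is assumed to have already failed.
            previous.cancel()
            # Wait until it has actually stopped before touching the upload.
            await asyncio.wait({previous})
        task = asyncio.ensure_future(work())
        self._inflight[upload_id] = task
        try:
            return await task
        finally:
            # Only clear the entry if a newer request has not replaced it.
            if self._inflight.get(upload_id) is task:
                del self._inflight[upload_id]


async def demo_preemptive() -> Tuple[Any, List[str]]:
    registry = UploadRegistry()
    log: List[str] = []

    async def hanging_append() -> None:
        try:
            await asyncio.sleep(10)  # stands in for a stalled transfer
        except asyncio.CancelledError:
            log.append("old request cancelled")
            raise

    async def new_request() -> str:
        log.append("new request ran")
        return "ok"

    old = asyncio.ensure_future(registry.handle("u1", hanging_append))
    await asyncio.sleep(0.01)  # let the old request start
    result = await registry.handle("u1", new_request)
    try:
        await old
    except asyncio.CancelledError:
        log.append("old handler terminated")
    return result, log
```

`demo_preemptive()` starts a request that stalls, then sends a second one for the same upload: the stalled request is cancelled and the new one runs right away instead of waiting for a timeout.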

1. **Preemptive Cancellation**: If the server receives a new request for an upload resource while a previous request for the same resource is in-flight, it abruptly terminates the previous HTTP connection or stream before processing the new request.

2. **Pessimistic Locking**: The server processes requests for a given upload resource sequentially, effectively creating an exclusive lock on that resource. A new request is only processed after the previous one completes. This can be simpler to implement but may lead to delays if a request hangs from the server's perspective.
Similarly, the text would benefit from emphasizing the downsides of this (simple) approach:
> However, since network interruptions can occur in ways that cause the request to hang from the server's perspective, it might take the server significant time to realize the interruption and time out the request. During this period, the client will be unable to access the resource and resume the upload, causing friction for the end users.
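For comparison, a minimal `asyncio` sketch of pessimistic locking, assuming one `asyncio.Lock` per upload resource (`LockingUploadStore` and `demo_locking` are hypothetical names). The demo reproduces the downside quoted above in miniature: the second request cannot start until the stalled first request finishes.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable, DefaultDict, List


class LockingUploadStore:
    """Hypothetical server that serializes requests per upload resource."""

    def __init__(self) -> None:
        # One lock per upload; a real server would also evict unused entries.
        self._locks: DefaultDict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def handle(self, upload_id: str,
                     work: Callable[[], Awaitable[Any]]) -> Any:
        # A new request waits here until the previous request for the same
        # upload has released the lock (completed, failed, or timed out).
        async with self._locks[upload_id]:
            return await work()


async def demo_locking() -> List[str]:
    store = LockingUploadStore()
    log: List[str] = []

    async def stalled_append() -> None:
        await asyncio.sleep(0.05)  # stands in for a request that hangs
        log.append("first request finished")

    async def resumed_append() -> None:
        log.append("second request ran")

    first = asyncio.ensure_future(store.handle("u1", stalled_append))
    await asyncio.sleep(0)  # let the first request take the lock
    await store.handle("u1", resumed_append)
    await first
    return log
```

In a real deployment the stall would last until the server notices the interruption and times out the request, which is the friction the quoted text describes.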
- For a `5xx (Server Error)` status code, the client MAY automatically attempt upload resumption by retrieving the current offset ({{offset-retrieving}}).

If no final response was received at all due to connectivity issues, the client MAY automatically attempt upload resumption by retrieving the current offset ({{offset-retrieving}}).
- For a `409 (Conflict)` status code, the client SHOULD attempt to resume the upload by either retrieving the current offset ({{offset-retrieving}}) or using the value from the response's `Upload-Offset` header field.
See my concern here: #3275 (comment)
But I think that's an underlying design concern rather than a concern with your specific change towards that design, so feel free to resolve this as off-topic.
I think it's sufficient to tie the retries to the presence of `Upload-Complete: ?0` in this case. It would indicate that the 409 comes from the resumable upload layer.
Current: - For a `409 (Conflict)` status code, the client SHOULD attempt to resume the upload by either retrieving the current offset ({{offset-retrieving}}) or using the value from the response's `Upload-Offset` header field.
Suggested: - For a `409 (Conflict)` status code with the `Upload-Complete: ?0` header field, the client SHOULD attempt to resume the upload by either retrieving the current offset ({{offset-retrieving}}) or using the value from the response's `Upload-Offset` header field.
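The client-side retry rules discussed in this thread, including the suggestion to tie 409 retries to `Upload-Complete: ?0`, can be sketched as a small decision function. This is a hypothetical Python sketch (`decide_retry`, `NextStep`, and `Decision` are illustrative names): header names are assumed to be lower-cased by the caller, and the `Upload-Complete` check is a plain string comparison rather than a full Structured Fields parser.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Mapping, Optional


class NextStep(Enum):
    RETRIEVE_OFFSET = auto()      # resume by first retrieving the offset
    USE_RESPONSE_OFFSET = auto()  # resume from the 409's Upload-Offset value
    GIVE_UP = auto()              # not treated as resumable here


@dataclass
class Decision:
    step: NextStep
    offset: Optional[int] = None


def _parse_offset(raw: Optional[str]) -> Optional[int]:
    """Accept only a non-negative ASCII integer, else None."""
    if raw is None:
        return None
    raw = raw.strip()
    if not (raw.isascii() and raw.isdigit()):
        return None
    return int(raw)


def decide_retry(status: Optional[int], headers: Mapping[str, str],
                 prefer_response_offset: bool = True) -> Decision:
    """Choose how to react to a failed append.

    `status` is None when no final response arrived (connectivity issue).
    """
    if status is None or 500 <= status <= 599:
        # No final response, or 5xx: the client MAY resume after
        # retrieving the current offset.
        return Decision(NextStep.RETRIEVE_OFFSET)
    if status == 409 and headers.get("upload-complete", "").strip() == "?0":
        # 409 from the resumable-upload layer: the client SHOULD resume,
        # either via offset retrieval or via the response's Upload-Offset.
        offset = _parse_offset(headers.get("upload-offset"))
        if prefer_response_offset and offset is not None:
            return Decision(NextStep.USE_RESPONSE_OFFSET, offset)
        return Decision(NextStep.RETRIEVE_OFFSET)
    # Other statuses, including a 409 without Upload-Complete: ?0.
    return Decision(NextStep.GIVE_UP)
```

Whether the client prefers the response's `Upload-Offset` or an extra offset retrieval is left to the `prefer_response_offset` flag, since the discussion allows either.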