Support the meta delete with tombstone to prevent dirty writes#73
Conversation
8175c0f to
ca371ac
Compare
oosidat
left a comment
There was a problem hiding this comment.
Went through some of pair-review's suggestions, looks good overall, just wanted to highlight some potential edge cases.
Review assisted by pair-review
bb0b019 to
59dc053
Compare
mrattle
left a comment
There was a problem hiding this comment.
Looks good, I did some extra digging and tophatting to confirm the behaviour of the zero-byte value case and that seems fine without any additional handling.
mrattle
left a comment
There was a problem hiding this comment.
Commented on a couple of things that are extensions of latent behaviours (which don't seem to be blowing up anywhere as far as we know) and one telemetry-related thing
| unless attributes.frozen? | ||
| attributes['value_bytesize'] = total_value_bytesize | ||
| attributes['hit_count'] = results.count { |_key, result| result.hit? } | ||
| attributes['miss_count'] = results.count { |_key, result| result.miss? } |
There was a problem hiding this comment.
Two things here:
- we're treating stale hits as
miss=falseandresult.hit?will returntrue. Do we want to be including them in the hit count when they're not actually returning usable data? In the current setup where we simply do a hard delete the following reads would be treated as misses, so it feels like we're artificially inflating our hitrate with garbage data here. - as this is written,
hit_count+miss_countwill sum tokeys.length. We're doing two passes of the keys here when we could just be doing a single pass for the misses and then subtracting that number from the length of the array of keys to get the number of hits for a minor performance optimization. Further, this can probably also be wrapped into thekeys.eachblock above.
There was a problem hiding this comment.
Good point — agreed that API/result semantics and hit-rate metrics should be separated here. A stale tombstone can still be CacheResult#hit? because memcached returned an item, but counting it as a metrics hit would inflate hit rate compared to the previous hard-delete behavior.
I changed the metrics semantics to:
hit_count: fresh hits onlymiss_count: true misses + stale tombstones / other non-fresh resultsstale_count: stale tombstones, so we can observe them separately
For read_multi_with_status, the counts are computed while filling results rather than doing extra passes over the result hash.
| routing_suffix = RequestFormatter.routing_tokens(**routing_token_kwargs(req_options)) | ||
| post_get_req = optimized_for_raw ? "v k q#{routing_suffix}\r\n" : "v f k q#{routing_suffix}\r\n" | ||
| keys.each do |key| | ||
| @connection_manager.write("mg #{key} #{post_get_req}") |
There was a problem hiding this comment.
In this method we don't pass our key throughKeyRegularizer.encode as we do in other methods. This is carried through from the existing read_multi_req method, which has no documented reasoning for skipping that normalization and I think it was an error by ommission.
The end result is that you can potentially have a key with whitespace in it (which is forbidden by the server and will return an error), or you can have injected CRLF which can allow arbitrary commands to be sent via key injection. I don't think that's actually a plausible vector for any use-cases that we currently have, but we still leverage this sanitization in (almost) every other method + the header parsing so it should probably be present here too.
There was a problem hiding this comment.
Good catch — agreed. I updated this path to run keys through KeyRegularizer.encode and include the b flag when needed, rather than interpolating the raw key into the meta command. Created a helper method write_read_multi_get to process the keys.
I also applied the same fix to the existing read_multi_req path since it had the same omission. Response parsing now goes through response_processor.key_from_tokens, so base64 response keys are decoded back to the original key before namespace stripping / miss filling.
Added coverage for Unicode, whitespace, and CRLF-containing keys for both get_multi and get_multi_with_status.
There was a problem hiding this comment.
Also added unit tests to cover the cases.
| @connection_manager.flush | ||
| result = response_processor.meta_get_with_status | ||
| unless attributes.frozen? | ||
| attributes['value_bytesize'] = result.value.nil? ? 0 : result.value.bytesize |
There was a problem hiding this comment.
Going to preface this one as another "bug" that exists elsewhere (in the existing get method) but does not seem to have any impact under our use conditions, so it can probably be safely ignored.
pair-review is calling out the potential for result.value to be a non-String type that doesn't respond to .bytesize, which would raise a NoMethodError.
There are a set of conditions that need to be met for this to happen, and I don't think that we really encounter them:
- Need to be using OTel middleware so the above
attributes.frozen?is false - Need to be using Dalli directly, using Rails cache will protect against this
- Need to be caching non-String types without first serializing them
To handle this cleanly, you could capture the raw byte count inside meta_get_with_status (where the wire read already happens via read_data) and return it alongside the CacheResult:
def meta_get_with_status
tokens = error_on_unexpected!([VA, EN, HD])
return [::Dalli::CacheResult.new(value: nil, miss: true), 0] if tokens.first == EN
if tokens.first == VA
raw_body = read_data(tokens[1].to_i)
value = @value_marshaller.retrieve(raw_body, bitflags_from_tokens(tokens))
[::Dalli::CacheResult.new(value: value, stale: stale_from_tokens(tokens)), raw_body.bytesize]
else
[::Dalli::CacheResult.new(value: nil, miss: true), 0]
end
endThen in get_with_status:
result, raw_bytesize = response_processor.meta_get_with_status
unless attributes.frozen?
attributes['value_bytesize'] = raw_bytesize
attributes['hit_count'] = result.miss? ? 0 : 1
attributes['miss_count'] = result.miss? ? 1 : 0
end
resultThere was a problem hiding this comment.
Agreed — this is a better place to measure the size. I updated meta_get_with_status to return both the CacheResult and the raw response body bytesize captured before deserialization, then get_with_status uses that raw bytesize for the OTel value_bytesize attribute.
I kept the updated stale metrics semantics from the earlier change (hit_count = fresh hits only, miss_count = non-fresh, stale_count separately reported).
Also added coverage for a non-String value through get_with_status with OTel enabled, which would have raised on result.value.bytesize before this change.
cc6e261 to
7182f37
Compare
Support the meta delete with tombstone to prevent dirty writes.