feat(sqs): live-queue reaper enumerates partitioned keyspace (Phase 3.D PR 6b)#736
Follow-up to PR 6a (#735). The tombstone-driven reaper now sweeps partitioned data / vis / byage / dedup / group records on DeleteQueue / PurgeQueue, but the live-queue retention reaper still walked only the legacy keyspace, so:

- retention-expired messages on partitioned queues leaked their data / vis / byage / group rows forever (`reapQueue`'s byage walk used `sqsMsgByAgePrefixAllGenerations` only),
- expired dedup records on partitioned FIFO queues leaked forever (`reapExpiredDedup`'s prefix scan used `SqsMsgDedupPrefix` only — empty for partitioned queues, since `sqsMsgDedupKeyDispatch` routes their writes under `SqsPartitionedMsgDedupPrefix`).

Closes the live-queue half of the Codex P2 from PR #732 round 0 ("Reap partitioned dedup records to prevent key growth"); PR 6a covered the tombstoned-cohort half.

## Scope (the second slice of §11 PR 6 from the split-queue-FIFO design doc)

- `reapQueue`: legacy byage walk extracted as `reapQueueLegacy`, behaviour byte-identical to pre-PR-6b for non-partitioned queues. Adds a new `reapQueuePartition` step that runs once per partition for `PartitionCount > 1` queues. Each partition gets its own per-partition budget per the §6 design ("partitions × budget per cycle"); a 32-partition queue thus allows up to 32 × `sqsReaperPerQueueBudget` records per tick, comfortably within the 30s reaper interval.
- `reapPartitionedPage`: partitioned twin of `reapPage`. Same live-vs-orphan classification (live when `parsed.SendTimestampMs > cutoff` under the live generation, unconditional reap under older generations, defensive skip when `parsed.Generation > currentGen`) but parses each entry with `parseSqsPartitionedMsgByAgeKey` and routes the dispatch through `reapOneRecordPartitioned`.
- `classifyPartitionedByAgeEntry`: helper extracted from `reapPartitionedPage` so the loop body stays under the cyclop ceiling. Returns `(parsedKey, reapable bool)`.
- `reapExpiredDedup`: now takes `*sqsQueueMeta` and routes by `PartitionCount`. Legacy meta (`PartitionCount <= 1`) → `reapExpiredDedupLegacy` (byte-identical to the pre-PR-6b walk). Partitioned meta (`PartitionCount > 1`) → `reapExpiredDedupPartitioned` (new), which iterates each partition's partitioned dedup prefix under its own per-partition budget and uses the existing `reapDedupPage` to apply the value-based `ExpiresAtMillis` filter.

## Caller audit

- `reapQueue`: one production caller (`reapAllQueues` line 85). No signature change. Behaviour for non-partitioned queues byte-identical; partitioned queues get the additional per-partition pass.
- `reapExpiredDedup`: signature changed to take `*sqsQueueMeta`. One production caller (`reapAllQueues` line 88), updated. No test files invoked this helper directly. Legacy meta routes to the byte-identical legacy walk; partitioned meta routes to the new walk.
- `reapQueueLegacy` / `reapQueuePartition` / `reapExpiredDedupLegacy` / `reapExpiredDedupPartitioned` / `classifyPartitionedByAgeEntry`: each has exactly one caller in the new live-queue reap path.
- `reapOneRecordPartitioned`: existing helper from PR 6a. Its previous caller was `reapDeadByAgePartitionPage` (tombstone path); it is now also called from `reapPartitionedPage` (live-queue path). Same dispatch semantics — synthetic meta carrying `PartitionCount > 1` to flip the dispatch helper branch.

## Tests

- New `TestSQSServer_PartitionedFIFO_LiveQueueDedupReaperPartitions` (wire-level): create a 4-partition FIFO queue, send across 6 distinct groups, backdate every partitioned dedup record's `ExpiresAtMillis`, run `reapAllQueues`, assert every partitioned dedup row across `[0, 4)` is gone. The pre-PR-6b reaper would have left every row in place — the test fails on the legacy code path.

## Self-review (CLAUDE.md)

1. **Data loss** — Closes the live-queue dedup leak and the partitioned retention-expired-message leak. Legacy queues unchanged: `reapQueue`, `reapExpiredDedup`, the byage walks, and the dispatch helpers all keep their byte-identical pre-PR-6b paths for `PartitionCount <= 1`.
2. **Concurrency / distributed failures** — The reaper still runs only on the leader. Each partition's pass is sequential, and the per-partition budget bounds the pass. Existing OCC dispatch semantics on each per-record reap are unchanged.
3. **Performance** — Per-tick partitioned-queue cost grows from O(1) walks to O(partition_count) walks on both byage and dedup. Each partition is bounded by `sqsReaperPerQueueBudget`; the 30s tick interval comfortably absorbs 32 partitions × per-queue budget per the design.
4. **Data consistency** — Live-vs-orphan classification on partitioned byage mirrors the legacy branch exactly (`reapPage` / `reapPartitionedPage` share the rules through `classifyPartitionedByAgeEntry`). `PartitionCount` immutability means the meta-driven iteration bound matches the on-disk keys for any cohort.
5. **Test coverage** — One new wire-level integration test for the partitioned dedup walk; the partitioned byage walk reuses parsing / dispatch helpers already tested by PR 6a's tombstone-reap integration test.
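The per-partition budget policy above ("partitions × budget per cycle") can be sketched as follows. This is a minimal illustration under assumed names: `perQueueBudget` and `reapPartition` are hypothetical stand-ins for `sqsReaperPerQueueBudget` and the real partition walk, which dispatches OCC reaps rather than counting strings.

```go
package main

import "fmt"

// perQueueBudget is a hypothetical stand-in for sqsReaperPerQueueBudget.
const perQueueBudget = 4

// reapPartition sweeps one partition's records, stopping once the
// per-partition budget is exhausted; unswept records wait for the next tick.
func reapPartition(records []string, budget int) (reaped int) {
	for range records {
		if reaped >= budget {
			break
		}
		// The real reaper would dispatch a per-record OCC reap here.
		reaped++
	}
	return reaped
}

func main() {
	// A PartitionCount > 1 queue gets one budgeted pass per partition,
	// so total work per tick is bounded by partitions × budget.
	partitions := [][]string{
		{"m1", "m2", "m3", "m4", "m5", "m6"}, // more records than the budget
		{"m7", "m8"},
	}
	total := 0
	for _, recs := range partitions {
		total += reapPartition(recs, perQueueBudget)
	}
	fmt.Println(total) // 4 (capped) + 2 = 6
}
```

Under this policy a 32-partition queue touches at most 32 × budget records per tick, which is the bound the 30s reaper interval is sized against.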
@claude review
📝 Walkthrough — The pull request extends the SQS reaper to handle partitioned queues by splitting both byage and dedup expiry reaping logic into legacy and partition-aware paths. Queue metadata is threaded through the reaper pipeline to enable partition iteration and per-partition budget management. A comprehensive integration test validates that dedup reaping correctly removes expired dedup records across all partitions.
Claude finished @bootjp's task in 3m 30s — Review
The PR correctly closes the live-queue half of the partitioned-FIFO reaper gap, and the architecture is coherent. Two items need addressing before merge; one stylistic observation below them. Issue 1 — Stale comment in `reapPage`.
Code Review
This pull request implements partitioned reaping for SQS queues, updating the reaper to handle both legacy and partitioned keyspaces for message expiration and deduplication record cleanup. A new test case verifies that expired deduplication records are correctly reaped from partitioned FIFO queues. The reviewer suggested that reapExpiredDedup should perform both legacy and partitioned scans for partitioned queues to maintain consistency with reapQueue and prevent potential record leaks.
```go
if meta != nil && meta.PartitionCount > 1 {
	return s.reapExpiredDedupPartitioned(ctx, queueName, meta.PartitionCount, readTS)
}
return s.reapExpiredDedupLegacy(ctx, queueName, readTS)
```
The `reapExpiredDedup` function uses if/else logic to choose between legacy and partitioned dedup reaping. In contrast, `reapQueue` (lines 433-451) performs both legacy and partitioned scans for partitioned queues to be defensive against leaked records. To maintain consistency and ensure no records are leaked if a queue's state were ever inconsistent, consider performing both scans here as well when `meta.PartitionCount > 1`.
```go
if meta != nil && meta.PartitionCount > 1 {
	if err := s.reapExpiredDedupPartitioned(ctx, queueName, meta.PartitionCount, readTS); err != nil {
		return err
	}
}
return s.reapExpiredDedupLegacy(ctx, queueName, readTS)
```
🧹 Nitpick comments (1)
adapter/sqs_partitioned_dispatch_test.go (1)
715-723: 💤 Low value — Consider adding `ReadKeys` for OCC consistency. The backdate write lacks `ReadKeys`, which differs from the pattern used elsewhere in this file (e.g., lines 58-65). While this test runs in isolation and conflicts are unlikely, including the key in `ReadKeys` would match the established OCC transaction pattern.

Suggested fix
```diff
 req := &kv.OperationGroup[kv.OP]{
 	IsTxn:   true,
 	StartTS: readTS,
+	ReadKeys: [][]byte{bytes.Clone(row.Key)},
 	Elems: []*kv.Elem[kv.OP]{
 		{Op: kv.Put, Key: bytes.Clone(row.Key), Value: body},
 	},
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/sqs_partitioned_dispatch_test.go` around lines 715 - 723, The txn request `req` created as a *kv.OperationGroup[kv.OP]* for the backdated write is missing the OCC `ReadKeys` field; update the request construction (the `req` passed to `srv.coordinator.Dispatch`) to include the transaction's read key(s) (the same key used in the Put element) in `ReadKeys` so the operation follows the established OCC pattern used elsewhere in this file.
ℹ️ Review info

⚙️ Run configuration — Configuration used: defaults · Review profile: CHILL · Plan: Pro · Run ID: c9072fae-a2cd-465c-8f46-dd99affc49eb

📒 Files selected for processing (2): `adapter/sqs_partitioned_dispatch_test.go`, `adapter/sqs_reaper.go`
Two findings from Claude on PR #736 plus a medium-severity Gemini suggestion in the same review pass:

1. (Claude, must-fix) Stale forward-reference comment in `reapPage` was written in PR 6a pointing at PR 6b. PR 6b is now this branch, so the "is a follow-up to PR 6a" wording was misleading. Trimmed to a one-line note that `reapPage` covers the legacy keyspace and `reapPartitionedPage` is the partitioned twin.
2. (Claude, should-fix) `classifyPartitionedByAgeEntry` had no direct coverage of its retention-cutoff branch (live gen, `sendTs <= cutoff` → reapable) or its future-gen guard (`parsed.gen > currentGen` → not reapable). The integration tests exercise the orphan-cohort and dedup-expiry paths only. Added `TestClassifyPartitionedByAgeEntry`: a table-driven unit test that pins every branch (within retention, past retention, exact-cutoff boundary, orphan generation, future generation, wrong partition, wrong queue prefix, legacy key) against deterministic key inputs built via `sqsPartitionedMsgByAgeKey`. No store / coordinator dependency, so the test is fast and immune to retention timing.
3. (Gemini, medium) `reapExpiredDedup` used an if/else to pick between legacy and partitioned dedup reaping for partitioned queues, while `reapQueue` runs both for symmetry and defensive coverage of leaked legacy entries. Mirrored that policy: the legacy scan now always runs (cheap on an empty prefix today), and the partitioned scan additionally runs for partitioned queues.

Caller audit: `reapExpiredDedup` has a single caller (`reapAllQueues`) and the error contract is unchanged, so no semantic ripple.

Refs: PR #736 review thread; Claude review at run 25327783108.
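The table-driven shape of that unit test can be sketched like this. The `reapable` helper below is a hypothetical restatement of `classifyPartitionedByAgeEntry`'s decision rule, so the table covers only the generation/cutoff branches; the real helper also parses the partitioned byage key and checks the queue prefix and partition.

```go
package main

import "fmt"

// reapable restates the live-vs-orphan rule: future generations are
// skipped defensively, older (orphan) generations are always reapable,
// and the live generation is reapable only once the send timestamp has
// fallen to or past the retention cutoff.
func reapable(gen, currentGen uint64, sendMs, cutoffMs int64) bool {
	if gen > currentGen {
		return false // future-generation guard
	}
	if gen < currentGen {
		return true // orphan cohort: unconditional reap
	}
	return sendMs <= cutoffMs // live gen: retention cutoff
}

func main() {
	cases := []struct {
		name             string
		gen, currentGen  uint64
		sendMs, cutoffMs int64
		want             bool
	}{
		{"within retention", 2, 2, 100, 50, false},
		{"past retention", 2, 2, 40, 50, true},
		{"exact-cutoff boundary", 2, 2, 50, 50, true},
		{"orphan generation", 1, 2, 100, 50, true},
		{"future generation", 3, 2, 40, 50, false},
	}
	for _, c := range cases {
		if got := reapable(c.gen, c.currentGen, c.sendMs, c.cutoffMs); got != c.want {
			panic(fmt.Sprintf("%s: got %v, want %v", c.name, got, c.want))
		}
	}
	fmt.Println("all branches pinned")
}
```

Because the rule is pure, the table needs no store or coordinator fixture, which is what makes the real test fast and immune to retention timing.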
@claude review
Claude finished @bootjp's task in 2m 58s — Review
This is the round-2 pass. All items from the previous Claude review and the Gemini concern are resolved. Resolved since round 1: the stale comment in `reapPage` was trimmed, `TestClassifyPartitionedByAgeEntry` now pins every classification branch, and `reapExpiredDedup` runs both the legacy and partitioned scans.
One remaining nit (no blocker)
Correctness notes (no action needed)
The architecture is sound and test coverage is now complete for all new decision paths. Ready to merge pending CI.
## Summary
Phase 3.D PR 7a — adds the
`elastickv_sqs_partition_messages_total{queue, partition, action}`
Prometheus counter so dashboards and alerts can spot uneven
`MessageGroupId` distributions across partitioned-FIFO queues.
Metrics-only: the Jepsen HT-FIFO workload (PR 7b) ships separately so
the two review loops do not cross.
- `monitoring/sqs.go` (new): `SQSMetrics` with cardinality cap
(`sqsMaxTrackedQueues = 512`, overflow collapses to `_other`) mirroring
`DynamoDBMetrics`. Drops empty queue names and unknown action labels so
a future call-site bug cannot pollute the series space dashboards have
to learn about. Nil-receiver-safe so adapter call sites do not need to
nil-guard.
- `adapter/sqs.go`: `SQSPartitionObserver` interface +
`WithSQSPartitionObserver` option. Re-declared in `adapter` so it
doesn't import `monitoring` at the package boundary (matches the
DynamoDB/Redis observer pattern). Action constants
(`send`/`receive`/`delete`) re-declared on the adapter side and
validated at runtime by the monitoring side — drift between the two
surfaces as a dropped observation, not a wedge.
- `adapter/sqs_fifo.go`, `adapter/sqs_messages.go`: emit the counter on
the **partitioned** commit branch only (`PartitionCount > 1`) for send /
receive / delete. Legacy single-partition queues stay off the metric
since partition is always 0 and the cardinality cost would buy nothing.
- `monitoring/registry.go`, `main_sqs.go`, `main.go`: wire the
registry's `SQSPartitionObserver()` into `startSQSServer` so the SQS
server picks up the production observer on cluster boot. Test fixtures
and CLI tools that build `SQSServer` without a registry pass `nil` and
the metric stays at zero.
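A minimal sketch of the cap-and-collapse behaviour described above, under assumed names: this toy `SQSMetrics` counts into a plain map rather than a Prometheus `CounterVec`, and `maxTrackedQueues` is set to 2 here purely for illustration (the PR uses `sqsMaxTrackedQueues = 512`).

```go
package main

import "fmt"

// maxTrackedQueues is a toy stand-in for sqsMaxTrackedQueues (512 in the PR).
const maxTrackedQueues = 2

var validActions = map[string]bool{"send": true, "receive": true, "delete": true}

type SQSMetrics struct {
	tracked map[string]bool   // queue labels admitted before the cap
	counts  map[[3]string]int // (queue, partition, action) -> count
}

func (m *SQSMetrics) ObservePartitionMessage(queue string, partition int, action string) {
	if m == nil || queue == "" || !validActions[action] {
		return // nil-receiver-safe; drop bad labels instead of recording them
	}
	if !m.tracked[queue] {
		if len(m.tracked) >= maxTrackedQueues {
			queue = "_other" // cap reached: collapse to the overflow label
		} else {
			m.tracked[queue] = true
		}
	}
	m.counts[[3]string{queue, fmt.Sprint(partition), action}]++
}

func main() {
	m := &SQSMetrics{tracked: map[string]bool{}, counts: map[[3]string]int{}}
	m.ObservePartitionMessage("q1", 0, "send")
	m.ObservePartitionMessage("q2", 1, "receive")
	m.ObservePartitionMessage("q3", 0, "send") // over the cap -> "_other"
	m.ObservePartitionMessage("q1", 0, "sned") // typo'd action dropped
	m.ObservePartitionMessage("", 0, "send")   // empty queue dropped
	var nilM *SQSMetrics
	nilM.ObservePartitionMessage("q1", 0, "send") // safe no-op
	fmt.Println(m.counts[[3]string{"q1", "0", "send"}],
		m.counts[[3]string{"_other", "0", "send"}]) // prints "1 1"
}
```

The key property is that a misbehaving caller can grow the series space only up to the cap plus the single `_other` overflow series, and invalid observations vanish instead of minting new labels.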
## Tests
`monitoring/sqs_test.go` (new, 6 cases):
- `TestSQSMetrics_ObservePartitionMessage_IncrementsByLabelTriple` — pin
the `(queue, partition, action)` counter contract.
- `TestSQSMetrics_ObservePartitionMessage_DropsInvalidAction` — pin the
typo guard against future drift between adapter and monitoring
constants.
- `TestSQSMetrics_ObservePartitionMessage_DropsEmptyQueue` — pin that an
empty queue name does not collapse with valid observations onto a shared
series.
- `TestSQSMetrics_NilReceiverIsSafe` — pin the nil-receiver
short-circuit the adapter relies on.
- `TestSQSMetrics_QueueLabelOverflow` — pin the cap-and-collapse so a
misbehaving caller cannot exhaust the Prometheus series budget.
- `TestSQSMetrics_RegistryWiring` — pin that the public `Registry`
exposes the metric under the documented name.
## Self-review (5 lenses)
1. **Data loss** — N/A; metrics-only, no storage / Raft / FSM touch.
2. **Concurrency** — counter increments are atomic via Prometheus; the
`trackedQueues` map is only consulted from the dispatch-success path
under the SQS server's existing concurrency model. No new locks.
3. **Performance** — one map lookup + one `CounterVec` lookup per
partitioned send/receive/delete on the success branch. Legacy queues
skip the call entirely. Cardinality bounded at 512 queue × 32 partition
(`htfifoMaxPartitions`) × 3 action ≈ 49k series worst case; in practice
a 32-partition queue yields 96 series, so the budget is plenty for the
SLO panels.
4. **Data consistency** — the metric is observed AFTER OCC dispatch
succeeds, so the counter reflects committed state. Receive/delete
branches that return on retryable errors deliberately do not increment
(the retry path will observe on the eventual success).
5. **Test coverage** — 6 unit tests in `monitoring/`, plus the
adapter-side nil-observer path is exercised by all existing
partitioned-FIFO tests in `adapter/sqs_partitioned_dispatch_test.go`
(they pass `nil` observer through the test fixture).
## Test plan
- [x] `go test -race -count=1 ./monitoring/...`
- [x] `go test -race -count=1 -run 'TestSQS' ./adapter/...`
- [x] `go test -race -count=1 ./...` (full suite)
- [x] `golangci-lint --config=.golangci.yaml run ./...` (full repo)
- [ ] Jepsen HT-FIFO workload — deferred to PR 7b
## Refs
- `docs/design/2026_05_01_partial_split_queue_fifo.md` §11 PR 7
- Builds on PR 5b-3 (#734) capability gate, PR 6a (#735) tombstone
reaper, PR 6b (#736 in flight) live-queue reaper.
make lint— 0 issuesSummary by CodeRabbit
Bug Fixes
Tests