NUTCH-3162 Latency metrics to properly merge data from all threads and tasks #906

Open
lewismc wants to merge 4 commits into apache:master from lewismc:NUTCH-3162

@lewismc lewismc commented Mar 11, 2026

PR for NUTCH-3162, which addresses shortcomings in job-level latency percentiles (p50, p95, p99) for Fetcher, ParseSegment, and Indexer. It merges TDigest data from all map tasks and threads and writes the counters in a single reducer (or, for Indexer, in a dedicated merge job). It should fix the cases where per-task percentile counters were summed instead of the underlying digests being merged.
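
To illustrate why summing per-task percentile counters gives a meaningless job-level value, here is a minimal sketch. It is not the code in this PR: it uses exact sorted samples as a stand-in for TDigest, and the class and method names (`LatencyMergeSketch`, `percentile`, `merge`, `summedPercentile`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LatencyMergeSketch {

    /** Nearest-rank percentile over exact samples; a stand-in for a digest quantile query. */
    public static long percentile(List<Long> samples, double p) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    /** Merging keeps every sample from every task, analogous to merging per-task digests. */
    public static List<Long> merge(List<List<Long>> perTask) {
        List<Long> all = new ArrayList<>();
        for (List<Long> task : perTask) {
            all.addAll(task);
        }
        return all;
    }

    /** The broken approach: per-task percentile values added up as if they were counters. */
    public static long summedPercentile(List<List<Long>> perTask, double p) {
        long sum = 0;
        for (List<Long> task : perTask) {
            sum += percentile(task, p);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two hypothetical map tasks with latencies in milliseconds.
        List<List<Long>> perTask = List.of(
            List.of(10L, 20L, 30L, 40L),
            List.of(100L, 200L, 300L, 400L));
        System.out.println("summed p50 = " + summedPercentile(perTask, 50)); // prints 220
        System.out.println("merged p50 = " + percentile(merge(perTask), 50)); // prints 40
    }
}
```

The summed value grows with the number of tasks, while the merged value reflects the combined latency distribution, which is what a job-level percentile should report.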

This patch touches the following jobs:

  • Fetcher: Per-thread latency merged in mapper; single reducer merges TDigests and sets job-level p50/p95/p99.
  • ParseSegment:
    • Mapper emits latency digest under LATENCY_KEY
    • Custom partitioner sends LATENCY_KEY to partition 0 so one reducer merges all TDigests
    • Reducer merges and sets correct percentile counters.
  • Indexer:
    • Reducer writes TDigest to side output
    • IndexingJob runs a new “Indexer Latency Merge” job which merges the reducers' digests and sets percentile counters. On merge failure, only LOG.error and driver-level ErrorTracker categorization are run.
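
The ParseSegment routing described above can be sketched in plain Java as follows. This is an illustration under assumptions, not the partitioner from this PR: the value of `LATENCY_KEY` and the names `LatencyPartitionSketch` and `partitionFor` are hypothetical, and the method stands in for the `getPartition` call of a Hadoop `Partitioner`. The fallback uses the same formula as Hadoop's default `HashPartitioner`.

```java
public class LatencyPartitionSketch {

    /** Hypothetical reserved key; the PR names it LATENCY_KEY but its value is not shown here. */
    public static final String LATENCY_KEY = "__latency__";

    /**
     * Picks the reduce partition for a record key.
     * Latency digests always go to partition 0 so that a single reducer sees all of them;
     * every other key is spread by hash, as Hadoop's HashPartitioner does.
     */
    public static int partitionFor(String key, int numPartitions) {
        if (LATENCY_KEY.equals(key)) {
            return 0;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        System.out.println("latency -> " + partitionFor(LATENCY_KEY, numPartitions));
        System.out.println("url     -> " + partitionFor("https://example.org/", numPartitions));
    }
}
```

Ordinary keys may still hash to partition 0; the only guarantee is that every latency record lands there, so the reducer for that partition has to handle both kinds of input.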

I think this fixes the issues. Arguably, it is more complex than logging to a file and performing some ETL to extract metrics from the logs; however, this solution sticks with convention by keeping metrics within the Hadoop ecosystem.

Finally, the PR is complemented with unit tests. This allowed me to think more about how we can add metrics validation in Nutch, but that will come in a separate issue/PR under NUTCH-3131.

Thanks for any review.

@lewismc lewismc requested a review from sebastian-nagel March 11, 2026 13:30
@lewismc lewismc self-assigned this Mar 11, 2026
+ RANDOM.nextInt());

FileOutputFormat.setOutputPath(job, tmp);
// Driver-level error tracking: categorization + LOG.error only (no job counters; see ErrorTracker Javadoc).
@lewismc lewismc Mar 11, 2026

I've put this comment in here for the time being. Ultimately I think it is fine to track driver-level errors in memory (via errorTracker) even though they are not written to the MapReduce counter(s). I've documented this behavior in ErrorTracker.java.

lewismc commented Mar 11, 2026

Refactored tests and introduced LatencyTestUtil.java to centralize boilerplate for latency-generating test code.

@sonarqubecloud

Quality Gate failed

Failed conditions
43.2% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud
