Skip to content

Core: Fix background thread leak in ScanTaskIterable#16768

Open
sejal-gupta-ksolves wants to merge 1 commit into
apache:mainfrom
sejal-gupta-ksolves:fix/scan-task-iterable-thread-leak
Open

Core: Fix background thread leak in ScanTaskIterable#16768
sejal-gupta-ksolves wants to merge 1 commit into
apache:mainfrom
sejal-gupta-ksolves:fix/scan-task-iterable-thread-leak

Conversation

@sejal-gupta-ksolves

Copy link
Copy Markdown

Closes: #16758

Problem

When downstream query engines (such as StarRocks, Trino, or Spark) cancel or abort a REST table scan early due to client disconnects, timeouts, or query limits, they trigger the cleanup sequence on the outer execution container.

In Apache Iceberg, ScanTaskIterable.close() was implemented as an empty no-op method. Because this outer close() call failed to cascade the shutdown signal to the underlying data structures:

  • The internal shutdown state atomic flag remained false.
  • Background PlanTaskWorker threads continued running indefinitely.
  • Once the internal taskQueue reached its 1000 item capacity limit, all active worker threads became permanently deadlocked inside offerWithTimeout(), leading to thread pool exhaustion on the engine coordinator side.

Solution

  1. Implemented State Tracking and Cleanup: Added thread-safe execution barriers inside ScanTaskIterable.close() utilizing shutdown.compareAndSet(false, true).
  2. Queue Eviction Matrix: Updated the close block to explicitly flush taskQueue, planTasks, and initialFileScanTasks lists upon termination. This allows background threads stuck in an offer wait cycle to instantly unblock, evaluate the flipped shutdown state, and exit gracefully.
  3. Decoupled Iterator Lifecycle: Refactored the internal ScanTasksIterator.close() block to eliminate redundant code duplication, rewriting it to delegate its cleanup tasks straight up to ScanTaskIterable.this.close(). This ensures unified thread termination safety across all potential entry points.
  4. Regression Test Addition: Designed and integrated TestScanTaskIterableLeak under the org.apache.iceberg.rest test package, proving that active planning thread allocations successfully scale back down to 0 upon premature termination.

Verification Testing

# 1. Clean format code using Spotless rules
./gradlew spotlessApply

# 2. Run static quality analysis lint checks on modified packages
./gradlew :iceberg-core:compileJava :iceberg-core:compileTestJava

# 3. Verify the core build pass and execute the regression test case
./gradlew :iceberg-core:test --tests "org.apache.iceberg.rest.TestScanTaskIterableLeak" --info

Fixes an issue where background PlanTaskWorker threads remain indefinitely blocked in offerWithTimeout() when a query is cancelled or abandoned early because the outer ScanTaskIterable.close() method was a no-op.
@github-actions github-actions Bot added the core label Jun 11, 2026
@singhpk234 singhpk234 self-requested a review June 11, 2026 15:18
Comment on lines +110 to +121
public void close() throws IOException {
if (shutdown.compareAndSet(false, true)) {
LOG.info(
"ScanTaskIterable is closing. Clearing {} queued tasks, {} plan tasks, and {} initial file scan tasks.",
taskQueue.size(),
planTasks.size(),
initialFileScanTasks.size());
taskQueue.clear();
planTasks.clear();
initialFileScanTasks.clear();
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not call ScanTasksIterator.close() instead ?

Comment on lines +111 to +120
if (shutdown.compareAndSet(false, true)) {
LOG.info(
"ScanTaskIterable is closing. Clearing {} queued tasks, {} plan tasks, and {} initial file scan tasks.",
taskQueue.size(),
planTasks.size(),
initialFileScanTasks.size());
taskQueue.clear();
planTasks.clear();
initialFileScanTasks.clear();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make a helper function and reuse it in both the places where close is called ?

}

@Override
public void close() throws IOException {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can register the iterable in the closeable group and then close it as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

REST ScanTaskIterable: background workers not stopped on iterable close; offerWithTimeout can block forever when consumer aborts

2 participants