Prevent OutOfMemoryError in ForkPDFLayoutTextStripper for zero-height whitespace by bkoragan · Pull Request #5830 · spring-projects/spring-ai

bkoragan · 2026-04-19T22:31:44Z

Fixes: #5829

Summary

ForkPDFLayoutTextStripper.getNumberOfNewLinesFromPreviousTextPosition now guards against non-positive and non-finite TextPosition.getHeight(), and caps the resulting line count at 500. Removes a division-by-zero path that produced Double.POSITIVE_INFINITY → Integer.MAX_VALUE → ~400 GiB of TextLine allocations.
Narrow in scope: one method in one file + a regression test. Default behavior on well-formed input is unchanged.

Root cause
Two consecutive whitespace TextPositions with height == 0.0, separated by more than 5.5 points along the Y axis, hit the division-by-zero branch in getNumberOfNewLinesFromPreviousTextPosition:

Math.floor(delta) / 0.0 → Double.POSITIVE_INFINITY
(int) POSITIVE_INFINITY → Integer.MAX_VALUE
createNewEmptyNewLines(~2.1B) allocates a TextLine per iteration, each with a char[] sized by the page width — hundreds of GB of attempted allocation → heap OOM.
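
The arithmetic in the chain above can be reproduced in isolation. The following is a minimal sketch, not the project's code: the class `InfinityCastDemo` and the helper `naiveLineCount` are hypothetical names that only mirror the unguarded division and cast described here.

```java
final class InfinityCastDemo {

    // Hypothetical helper (an assumption for illustration): the same
    // unguarded "floor(delta) / height, then cast to int" arithmetic.
    static int naiveLineCount(double yDelta, double height) {
        // Floating-point division by zero does not throw; for a positive
        // numerator it evaluates to Double.POSITIVE_INFINITY.
        double ratio = Math.floor(yDelta) / height;
        // A narrowing cast from double to int saturates, so +Infinity
        // becomes Integer.MAX_VALUE instead of raising an error.
        return (int) ratio;
    }

    public static void main(String[] args) {
        System.out.println(Math.floor(12.0) / 0.0);
        System.out.println(naiveLineCount(12.0, 0.0) == Integer.MAX_VALUE);
        // A tiny but positive height yields a large finite count instead.
        System.out.println(naiveLineCount(12.0, 0.001));
    }
}
```

Neither step raises an exception, which is why the failure only surfaces later as a heap OOM in the allocation loop.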

The pattern occurs naturally in published PDFs (the reporter hit it on the authors section of an open-access journal article). That makes this reachable from untrusted PDF input, so it is a potential denial-of-service vector.

Fix

Primary guard: if !Double.isFinite(height) || height <= 0.0, return 1 (matches the existing Math.max(1, ...) floor — a Y-delta > 5.5 points still warrants at least one line break; we just can't compute an exact count without a valid height).
Defensive cap: clamp numberOfLines at MAX_NEW_LINES_PER_POSITION_GAP = 500, covering legitimate-but-tiny positive heights (e.g. 0.001) that would otherwise produce thousands of spurious blank lines. Any real PDF layout stays well below 500.

No API change, no autoconfig change. Well-formed input path is unchanged.
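
The two guards described above can be sketched as a standalone helper. This is an illustration under assumptions, not the patched method: the real code takes a TextPosition, and the class `NewLineCountSketch`, the helper `newLinesForGap`, and its exact rounding are hypothetical. Only the constant name and the value 500 come from this description.

```java
final class NewLineCountSketch {

    // Cap value as stated in the description; treated here as a plain constant.
    static final int MAX_NEW_LINES_PER_POSITION_GAP = 500;

    // Hypothetical standalone version of the line-count computation.
    // yDelta: vertical distance between two consecutive text positions.
    // height: the value that TextPosition.getHeight() would report.
    static int newLinesForGap(double yDelta, double height) {
        // Primary guard: NaN, infinite, zero, or negative heights cannot
        // produce a meaningful count, so fall back to a single line break.
        if (!Double.isFinite(height) || height <= 0.0) {
            return 1;
        }
        int numberOfLines = (int) (Math.floor(yDelta) / height);
        // Keep the existing floor of one line break.
        numberOfLines = Math.max(1, numberOfLines);
        // Defensive cap: tiny positive heights still yield a bounded count.
        return Math.min(numberOfLines, MAX_NEW_LINES_PER_POSITION_GAP);
    }
}
```

With these guards a zero or NaN height yields 1, a height of 0.001 is clamped to the cap, and an ordinary case such as a 36-point gap with a 12-point height still yields 3.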

Validation

Added PagePdfDocumentReaderOomTests with a deterministic PDF reproducer (adapted from the reporter's minimal example).
Before the fix on main, with a 512 MB heap cap:
java.lang.OutOfMemoryError: Java heap space
at TextLine.<init>(TextLine.java:41)
at ForkPDFLayoutTextStripper.addNewLine(ForkPDFLayoutTextStripper.java:185)
at ForkPDFLayoutTextStripper.createNewEmptyNewLines(ForkPDFLayoutTextStripper.java:157)
BUILD FAILURE (21.9 s)

After the fix, normal heap:
Tests run: 1, Failures: 0, Errors: 0 — 0.194 s

Test plan

./mvnw -pl document-readers/pdf-reader -Dtest=PagePdfDocumentReaderOomTests test — passes in 194 ms; reproduces OOM on main under -Xmx512m.
./mvnw -pl document-readers/pdf-reader -am verify: spring-javaformat:validate and checkstyle are clean.
Cross-check against the reporter's original PDF (https://www.clinical-lung-cancer.com/action/showPdf?pii=S1525-7304%2822%2900115-2) — not run locally; the synthesized reproducer test exercises the same code path.

Backport candidates
Happy to open backport PRs for previous versions as required, once this lands on main.

… whitespace

- ForkPDFLayoutTextStripper.getNumberOfNewLinesFromPreviousTextPosition: guard against non-positive or non-finite TextPosition.getHeight(). Previously, two consecutive whitespace TextPositions with height == 0.0 separated by more than 5.5 points along the Y axis produced Math.floor(delta) / 0.0 == POSITIVE_INFINITY, which cast to Integer.MAX_VALUE and drove createNewEmptyNewLines to allocate up to ~400 GiB worth of TextLine char[] buffers, reliably OOM-ing the JVM. Fall back to the existing minimum count (1) when height is degenerate.
- Apply a defensive upper bound (MAX_NEW_LINES_PER_POSITION_GAP = 500) on computed line counts to cover legitimate but tiny positive heights that would otherwise yield very large but finite values.
- Add PagePdfDocumentReaderOomTests with a deterministic reproducer (adapted from the reporter's minimal example) that synthesizes a PDF whose authors section triggers the pathological y-offset; on main it OOMs under a 512m heap, with the fix it completes in under 200 ms.

See gh-5829.

Signed-off-by: Bapuji Koraganti <bapuk.2008@gmail.com>

bkoragan · 2026-04-19T22:33:42Z

@maintainers Please review the changes in this PR and let me know if you have any feedback. This fixes #5829. Thanks!

asw12 · 2026-04-20T21:32:59Z

+	 * consecutive {@link TextPosition}s. Any real PDF layout produces a value well below
+	 * this; anything higher indicates a malformed document (see gh-5829).
+	 */
+	static final int MAX_NEW_LINES_PER_POSITION_GAP = 500;


Thanks for the prompt pull request! It does solve my problem; my only comment is on the use of 500 here. In actuality, wouldn't 1 or 2 new lines suffice?

If the main purpose of ForkPDFLayoutTextStripper is to feed text into the *PdfDocumentReader classes (and of course, there's nothing that guarantees these classes will be tightly coupled together in perpetuity), then in fact one of the very next things we do is feed the extracted text into this ExtractedTextFormatter regex (catastrophically backtracking! #2247) that strips out consecutive newlines anyway:

spring-ai/spring-ai-commons/src/main/java/org/springframework/ai/reader/ExtractedTextFormatter.java

Line 87 in c66a3c5

return pageText.replaceAll("(?m)(^ *\n)", "\n").replaceAll("(?m)^$([\r\n]+?)(^$[\r\n]+?^)+", "$1");

So 500 new lines still seems like a waste of ~100 kB to me, iff we are certain that the ExtractedTextFormatter is guaranteed to be set upon its output.

bkoragan · 2026-04-21

@asw12 Good catch on #2247. I just looked into it: the regex is a textbook catastrophic-backtracking case ((^$[\r\n]+?^)+ overlaps on itself), with the added wrinkle that the JDK matcher's recursion blows the stack before the exponential blow-up even kicks in. That is actually a useful constraint for this cap: the stripper should never emit more blank lines than the downstream formatter can safely collapse.

Using 20 as the cap keeps us roughly one order of magnitude below the #2247 threshold while still covering any realistic paragraph or section break and bounding per-gap allocation to ~8 kB. I'll update the PR with 20 shortly. I hope that works as a safe cap.

#2247 seems like a separate fix: either rewrite the collapse with an atomic group or possessive quantifier, or just replace the regex with a two-pass loop. Happy to tackle it as a follow-up PR once this one lands.
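
For illustration, a loop-based collapse along the lines mentioned above could look like the sketch below. It is an assumption-laden example, not the project's ExtractedTextFormatter code and not a drop-in replacement for the existing regex: the class `BlankLineCollapser` and the method `collapseBlankLines` are hypothetical, and the sketch simply reduces every run of whitespace-only lines to one empty line in a single linear pass, with no regex backtracking or recursion involved.

```java
import java.util.ArrayList;
import java.util.List;

final class BlankLineCollapser {

    // Hypothetical helper: collapses each run of whitespace-only lines
    // into a single empty line. Runs in time linear in the input length.
    static String collapseBlankLines(String text) {
        // Limit -1 keeps trailing empty segments, so a trailing newline survives.
        String[] lines = text.split("\n", -1);
        List<String> kept = new ArrayList<>();
        boolean previousBlank = false;
        for (String line : lines) {
            boolean blank = line.trim().isEmpty();
            if (blank && previousBlank) {
                continue; // skip the 2nd, 3rd, ... blank line of a run
            }
            // Normalize a whitespace-only line to an empty one.
            kept.add(blank ? "" : line);
            previousBlank = blank;
        }
        return String.join("\n", kept);
    }
}
```

Because the loop keeps only one boolean of state, its stack depth does not depend on how many blank lines the input contains.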

bkoragan · 2026-04-21

Updated the PR with the cap tightened to 20; the Javadoc now calls out the #2247 stack-overflow constraint so the rationale survives in code. Per-gap allocation drops from ~100 kB to ~8 kB. Happy to tackle #2247 itself in a separate PR. Please review again. Thanks!

Review feedback on gh-5829: 500 is far above any realistic gap. Any legitimate paragraph or section break fits well within 20, and the downstream ExtractedTextFormatter.trimAdjacentBlankLines regex becomes unstable (StackOverflowError, gh-2247) around ~150 blank lines. Keeping the cap at 20 stays ~1 order of magnitude below that threshold, drops per-gap allocation from ~100 kB to ~8 kB, and preserves layout fidelity for standalone consumers of ForkPDFLayoutTextStripper.

Signed-off-by: Bapuji Koraganti <bapuk.2008@gmail.com>

asw12 reviewed Apr 20, 2026

bkoragan force-pushed the gh-5829 branch from f12cadc to 00ddc64 Compare April 21, 2026 01:35

bkoragan requested a review from asw12 April 21, 2026 01:37

asw12 approved these changes Apr 21, 2026

destitutus mentioned this pull request May 14, 2026

Stack overflow exception in trimAdjacentBlankLines() method due to inefficient regex in ExtractedTextFormatter #2247

Open

spring-projects-issues added the status: waiting-for-triage label Jun 12, 2026

bkoragan wants to merge 2 commits into spring-projects:main from bkoragan:gh-5829

3 participants
