Skip to content

Prevent OutOfMemoryError in ForkPDFLayoutTextStripper for zero-height whitespace#5830

Open
bkoragan wants to merge 2 commits into
spring-projects:mainfrom
bkoragan:gh-5829
Open

Prevent OutOfMemoryError in ForkPDFLayoutTextStripper for zero-height whitespace#5830
bkoragan wants to merge 2 commits into
spring-projects:mainfrom
bkoragan:gh-5829

Conversation

@bkoragan

@bkoragan bkoragan commented Apr 19, 2026

Copy link
Copy Markdown

Fixes: #5829

Summary

  • ForkPDFLayoutTextStripper.getNumberOfNewLinesFromPreviousTextPosition now guards against non-positive and non-finite TextPosition.getHeight(), and caps the resulting line count at 500. Removes a division-by-zero path that produced Double.POSITIVE_INFINITY → Integer.MAX_VALUE → ~400 GiB of TextLine allocations.
  • Narrow in scope: one method in one file + a regression test. Default behavior on well-formed input is unchanged.

Root cause
Two consecutive whitespace TextPositions with height == 0.0, separated by more than 5.5 points along the Y axis, hit the division-by-zero branch in getNumberOfNewLinesFromPreviousTextPosition:

  1. Math.floor(delta) / 0.0Double.POSITIVE_INFINITY
  2. (int) POSITIVE_INFINITYInteger.MAX_VALUE
  3. createNewEmptyNewLines(~2.1B) allocates a TextLine per iteration, each with a char[] sized by the page width — hundreds of GB of attempted allocation → heap OOM.

The pattern occurs naturally in published PDFs (the reporter hit it on the authors section of an open-access journal article). That makes this reachable from untrusted PDF input — DoS-flavored..

Fix

  • Primary guard: if !Double.isFinite(height) || height <= 0.0, return 1 (matches the existing Math.max(1, ...) floor — a Y-delta > 5.5 points still warrants at least one line break; we just can't compute an exact count without a valid height).
  • Defensive cap: clamp numberOfLines at MAX_NEW_LINES_PER_POSITION_GAP = 500, covering legitimate-but-tiny positive heights (e.g. 0.001) that would otherwise produce thousands of spurious blank lines.. Any real PDF layout is well below 500.

No API change, no autoconfig change. Well-formed input path is unchanged.

Validation

Added PagePdfDocumentReaderOomTests with a deterministic PDF reproducer (adapted from the reporter's minimal example).
Before the fix on main, with a 512 MB heap cap:
java.lang.OutOfMemoryError: Java heap space
at TextLine.(TextLine.java:41)
at ForkPDFLayoutTextStripper.addNewLine(ForkPDFLayoutTextStripper.java:185)
at ForkPDFLayoutTextStripper.createNewEmptyNewLines(ForkPDFLayoutTextStripper.java:157)
BUILD FAILURE (21.9 s)

After the fix, normal heap:
Tests run: 1, Failures: 0, Errors: 0 — 0.194 s

Test plan

  • ./mvnw -pl document-readers/pdf-reader -Dtest=PagePdfDocumentReaderOomTests test — passes in 194 ms; reproduces OOM on main under -Xmx512m.
  • ./mvnw -pl document-readers/pdf-reader -am verify — spring-javaformat:validate, checkstyle clean..
  • Cross-check against the reporter's original PDF (https://www.clinical-lung-cancer.com/action/showPdf?pii=S1525-7304%2822%2900115-2) — not run locally; the synthesized reproducer test exercises the same code path.

Backport candidates
Happy to open backport PRs as required for previous versions - once this lands on main.

… whitespace

- ForkPDFLayoutTextStripper.getNumberOfNewLinesFromPreviousTextPosition:
  guard against non-positive or non-finite TextPosition.getHeight().
  Previously, two consecutive whitespace TextPositions with height == 0.0
  separated by more than 5.5 points along the Y axis produced
  Math.floor(delta) / 0.0 == POSITIVE_INFINITY, which cast to
  Integer.MAX_VALUE and drove createNewEmptyNewLines to allocate up to
  ~400 GiB worth of TextLine char[] buffers, reliably OOM-ing the JVM.
  Fall back to the existing minimum count (1) when height is degenerate.
- Apply a defensive upper bound (MAX_NEW_LINES_PER_POSITION_GAP = 500)
  on computed line counts to cover legitimate but tiny positive heights
  that would otherwise yield very large but finite values.
- Add PagePdfDocumentReaderOomTests with a deterministic reproducer
  (adapted from the reporter's minimal example) that synthesizes a PDF
  whose authors section triggers the pathological y-offset; on main it
  OOMs under a 512m heap, with the fix it completes in under 200 ms.

See spring-projectsgh-5829.

Signed-off-by: Bapuji Koraganti <bapuk.2008@gmail.com>
@bkoragan

Copy link
Copy Markdown
Author

@maintainers Please review this PR changes, and let me know if any feedback. This fixes #5829. Thanks!

* consecutive {@link TextPosition}s. Any real PDF layout produces a value well below
* this; anything higher indicates a malformed document (see gh-5829).
*/
static final int MAX_NEW_LINES_PER_POSITION_GAP = 500;

@asw12 asw12 Apr 20, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt pull request! It does solve my problem; my only comment is on the use of 500 here. In actuality, wouldn't 1 or 2 new lines suffice?

If the main purpose of ForkPDFLayoutTextStripper is to feed text into the *PdfDocumentReader classes (and of course, there's nothing that guarantees these classes will be tightly coupled together in perpetuity), then in fact one of the very next things we do is feed the extracted text into this ExtractedTextFormatter regex (catastrophically backtracking! #2247) that strips out consecutive newlines anyway:

return pageText.replaceAll("(?m)(^ *\n)", "\n").replaceAll("(?m)^$([\r\n]+?)(^$[\r\n]+?^)+", "$1");

So 500 new lines still seems like a waste of ~100 kB to me, iff we are certain that the ExtractedTextFormatter is guaranteed to be set upon its output.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asw12 Good catch on #2247 — I just looked into it, the regex is a textbook catastrophic-backtracking case ((^$[\r\n]+?^)+ overlaps on itself), with the added wrinkle that the JDK matcher's recursion blows the stack before the exponential blow-up even kicks in.. That is actually a useful constraint for this cap: the stripper should never emit more blank lines than the downstream formatter can safely collapse..

Using 20 as the cap keeps us (1) order of magnitude below the #2247 threshold while still covering any realistic paragraph/section break and bounding per-gap allocation to ~8 KB. I'll update the PR with 20 shortly. Hope that works with safe cap.

#2247 seems a separate fix — either rewrite the collapse with an atomic group / possessive quantifier, or just replace the regex with a two-pass loop. Happy to tackle it as a follow-up PR once this one lands..

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR with cap tightened to 20, with the Javadoc now calling out the #2247 stack-overflow constraint so the rationale survives in code.. Per-gap allocation drops from ~100 kB to ~8 kB. Happy to tackle #2247 itself in a separate PR. Please review again. Thanks!

Review feedback on spring-projectsgh-5829: 500 is far above any realistic gap. Any
legitimate paragraph or section break fits well within 20, and the
downstream ExtractedTextFormatter.trimAdjacentBlankLines regex
becomes unstable (StackOverflowError, spring-projectsgh-2247) around ~150 blank
lines. Keeping the cap at 20 stays ~1 order of magnitude below that
threshold, drops per-gap allocation from ~100 kB to ~8 kB, and
preserves layout fidelity for standalone consumers of
ForkPDFLayoutTextStripper.

Signed-off-by: Bapuji Koraganti <bapuk.2008@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants