Prevent OutOfMemoryError in ForkPDFLayoutTextStripper for zero-height whitespace#5830
Prevent OutOfMemoryError in ForkPDFLayoutTextStripper for zero-height whitespace#5830bkoragan wants to merge 2 commits into
Conversation
… whitespace - ForkPDFLayoutTextStripper.getNumberOfNewLinesFromPreviousTextPosition: guard against non-positive or non-finite TextPosition.getHeight(). Previously, two consecutive whitespace TextPositions with height == 0.0 separated by more than 5.5 points along the Y axis produced Math.floor(delta) / 0.0 == POSITIVE_INFINITY, which cast to Integer.MAX_VALUE and drove createNewEmptyNewLines to allocate up to ~400 GiB worth of TextLine char[] buffers, reliably OOM-ing the JVM. Fall back to the existing minimum count (1) when height is degenerate. - Apply a defensive upper bound (MAX_NEW_LINES_PER_POSITION_GAP = 500) on computed line counts to cover legitimate but tiny positive heights that would otherwise yield very large but finite values. - Add PagePdfDocumentReaderOomTests with a deterministic reproducer (adapted from the reporter's minimal example) that synthesizes a PDF whose authors section triggers the pathological y-offset; on main it OOMs under a 512m heap, with the fix it completes in under 200 ms. See spring-projectsgh-5829. Signed-off-by: Bapuji Koraganti <bapuk.2008@gmail.com>
|
@maintainers Please review this PR changes, and let me know if any feedback. This fixes #5829. Thanks! |
| * consecutive {@link TextPosition}s. Any real PDF layout produces a value well below | ||
| * this; anything higher indicates a malformed document (see gh-5829). | ||
| */ | ||
| static final int MAX_NEW_LINES_PER_POSITION_GAP = 500; |
There was a problem hiding this comment.
Thanks for the prompt pull request! It does solve my problem; my only comment is on the use of 500 here. In actuality, wouldn't 1 or 2 new lines suffice?
If the main purpose of ForkPDFLayoutTextStripper is to feed text into the *PdfDocumentReader classes (and of course, there's nothing that guarantees these classes will be tightly coupled together in perpetuity), then in fact one of the very next things we do is feed the extracted text into this ExtractedTextFormatter regex (catastrophically backtracking! #2247) that strips out consecutive newlines anyway:
So 500 new lines still seems like a waste of ~100 kB to me, iff we are certain that the ExtractedTextFormatter is guaranteed to be set upon its output.
There was a problem hiding this comment.
@asw12 Good catch on #2247 — I just looked into it, the regex is a textbook catastrophic-backtracking case ((^$[\r\n]+?^)+ overlaps on itself), with the added wrinkle that the JDK matcher's recursion blows the stack before the exponential blow-up even kicks in.. That is actually a useful constraint for this cap: the stripper should never emit more blank lines than the downstream formatter can safely collapse..
Using 20 as the cap keeps us (1) order of magnitude below the #2247 threshold while still covering any realistic paragraph/section break and bounding per-gap allocation to ~8 KB. I'll update the PR with 20 shortly. Hope that works with safe cap.
#2247 seems a separate fix — either rewrite the collapse with an atomic group / possessive quantifier, or just replace the regex with a two-pass loop. Happy to tackle it as a follow-up PR once this one lands..
There was a problem hiding this comment.
Review feedback on spring-projectsgh-5829: 500 is far above any realistic gap. Any legitimate paragraph or section break fits well within 20, and the downstream ExtractedTextFormatter.trimAdjacentBlankLines regex becomes unstable (StackOverflowError, spring-projectsgh-2247) around ~150 blank lines. Keeping the cap at 20 stays ~1 order of magnitude below that threshold, drops per-gap allocation from ~100 kB to ~8 kB, and preserves layout fidelity for standalone consumers of ForkPDFLayoutTextStripper. Signed-off-by: Bapuji Koraganti <bapuk.2008@gmail.com>
Fixes: #5829
Summary
ForkPDFLayoutTextStripper.getNumberOfNewLinesFromPreviousTextPositionnow guards against non-positive and non-finiteTextPosition.getHeight(), and caps the resulting line count at 500. Removes a division-by-zero path that producedDouble.POSITIVE_INFINITY → Integer.MAX_VALUE → ~400 GiBofTextLineallocations.Root cause
Two consecutive whitespace
TextPositions withheight == 0.0, separated by more than 5.5 points along the Y axis, hit the division-by-zero branch ingetNumberOfNewLinesFromPreviousTextPosition:Math.floor(delta) / 0.0→Double.POSITIVE_INFINITY(int) POSITIVE_INFINITY→Integer.MAX_VALUEcreateNewEmptyNewLines(~2.1B)allocates aTextLineper iteration, each with achar[]sized by the page width — hundreds of GB of attempted allocation → heap OOM.The pattern occurs naturally in published PDFs (the reporter hit it on the authors section of an open-access journal article). That makes this reachable from untrusted PDF input — DoS-flavored..
Fix
!Double.isFinite(height) || height <= 0.0, return1(matches the existingMath.max(1, ...)floor — a Y-delta > 5.5 points still warrants at least one line break; we just can't compute an exact count without a valid height).numberOfLinesatMAX_NEW_LINES_PER_POSITION_GAP = 500, covering legitimate-but-tiny positive heights (e.g. 0.001) that would otherwise produce thousands of spurious blank lines.. Any real PDF layout is well below 500.No API change, no autoconfig change. Well-formed input path is unchanged.
Validation
Added
PagePdfDocumentReaderOomTestswith a deterministic PDF reproducer (adapted from the reporter's minimal example).Before the fix on
main, with a 512 MB heap cap:java.lang.OutOfMemoryError: Java heap space
at TextLine.(TextLine.java:41)
at ForkPDFLayoutTextStripper.addNewLine(ForkPDFLayoutTextStripper.java:185)
at ForkPDFLayoutTextStripper.createNewEmptyNewLines(ForkPDFLayoutTextStripper.java:157)
BUILD FAILURE (21.9 s)
After the fix, normal heap:
Tests run: 1, Failures: 0, Errors: 0 — 0.194 s
Test plan
./mvnw -pl document-readers/pdf-reader -Dtest=PagePdfDocumentReaderOomTests test— passes in 194 ms; reproduces OOM onmainunder-Xmx512m../mvnw -pl document-readers/pdf-reader -am verify— spring-javaformat:validate, checkstyle clean..Backport candidates
Happy to open backport PRs as required for previous versions - once this lands on
main.