WIP: title generation use font size spans instead of blocks#475
Draft
amahuli03 wants to merge 3 commits intoCodeForPhilly:developfrom
Draft
WIP: title generation use font size spans instead of blocks#475amahuli03 wants to merge 3 commits intoCodeForPhilly:developfrom
amahuli03 wants to merge 3 commits intoCodeForPhilly:developfrom
Conversation
The old "scan first couple pages" logic used get_text("blocks") and picked the first
block matching a title regex, which frequently selected preambles,
journal names, and article headers instead of the actual title.
The new approach uses get_text("dict") to find the largest font size
across the first few pages and collects contiguous runs of text at
that size, since research paper titles are typically the
largest font.
marks, apostrophes, and non-breaking spaces in titles.
Refactor test helpers to use get_text("dict") structure instead of
get_text("blocks"). Add tests for multi-span joining, short span
filtering, regex rejection, and multi-page title detection.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR replaces block-position title extraction with font-size-based approach. The old logic used
get_text("blocks")and picked the first block matching a title regex, which frequently selected preambles, journal names, and article headers instead of the actual title. The new approach usesget_text("dict")to find the largest font size across the first few pages and collects contiguous runs of text at that size.Loosens the title validation regex to allow years, question marks, exclamation marks, apostrophes (including
'), and non-breaking spaces (\xa0) in titles — all of which appeared in real PDF titles and were incorrectly rejected.Updates and expands tests for the new
get_text("dict")mock structure, including tests for multi-span joining, short span filtering, regex rejection, and multi-page title detection.TODO: Not sure about the regex changes. Need to know more about the reasoning behind the original regex.
How it works
Title extraction follows a 3-tier fallback:
Related Issue
Fixes #469
Manual Tests
Before making these changes, I tested 50 file uploads
Every wrong title generated was through the second path with the text blocks, so I proceeded implementing my fix which would replace that path.
After making these changes and getting unit tests passing, I end-to-end manually tested the files that had wrong titles with the original approach. As shown in the spreadsheet, all correctly extracted the title with my changes.
I need to test a little more before I can feel confident with this approach.
TODO: manually test some more files with new approach. I haven't yet tested any of the ones that were passing with the original implementation, and I want to make sure I didn't break something that was working.
Automated Tests
it's a lot but writing it all here to document it
make_page_dict,make_mock_doc) to build mocks usingget_text("dict")structure (blocks -> lines -> spans with text and size fields) instead of the oldget_text("blocks")tuple format, matching the new extraction implementation.test_falls_back_to_font_size_if_metadata_title_is_empty,test_falls_back_to_font_size_if_metadata_title_does_not_match_regex) to provide font-size-annotated span data so the font-size extractor can find a title. Previously these tests relied on block-position matching.test_font_size_returns_none_when_no_regex_match-- verifies that largest-font text that doesn't match the title regex (e.g., "Psychiatry Research" with only 2 words) causes the function to return None, falling through to the GPT fallback.test_font_size_joins_adjacent_spans_in_same_block-- verifies that a title split across multiple spans (ex: "Advances in Mood Disorder" + "Pharmacotherapy") is joined into a single candidate.test_font_size_finds_title_on_later_page-- verifies that a title on page 2 is found when it has a larger font size than page 1 text.Reviewers
@sahilds1
Notes
Known limitations: