Skip to content

WIP: title generation use font size spans instead of blocks#475

Draft
amahuli03 wants to merge 3 commits intoCodeForPhilly:developfrom
amahuli03:title-generation-font-size
Draft

WIP: title generation use font size spans instead of blocks#475
amahuli03 wants to merge 3 commits intoCodeForPhilly:developfrom
amahuli03:title-generation-font-size

Conversation

@amahuli03
Copy link
Collaborator

@amahuli03 amahuli03 commented Mar 6, 2026

Description

This PR replaces block-position title extraction with font-size-based approach. The old logic used get_text("blocks") and picked the first block matching a title regex, which frequently selected preambles, journal names, and article headers instead of the actual title. The new approach uses get_text("dict") to find the largest font size across the first few pages and collects contiguous runs of text at that size.
Loosens the title validation regex to allow years, question marks, exclamation marks, apostrophes (including '), and non-breaking spaces (\xa0) in titles — all of which appeared in real PDF titles and were incorrectly rejected.
Updates and expands tests for the new get_text("dict") mock structure, including tests for multi-span joining, short span filtering, regex rejection, and multi-page title detection.

TODO: Not sure about the regex changes. Need to know more about the reasoning behind the original regex.

How it works

Title extraction follows a 3-tier fallback:

  1. Metadata — use the PDF's embedded title if it passes the regex
  2. Font size (new) — scan the first 3 pages, find the max font size, collect contiguous runs of text at that size, pick the longest regex-matching candidate
  3. GPT-4 fallback — summarize the first page if no title found

Related Issue

Fixes #469

Manual Tests

Before making these changes, I tested 50 file uploads
Every wrong title generated was through the second path with the text blocks, so I proceeded implementing my fix which would replace that path.
After making these changes and getting unit tests passing, I end-to-end manually tested the files that had wrong titles with the original approach. As shown in the spreadsheet, all correctly extracted the title with my changes.

I need to test a little more before I can feel confident with this approach.
TODO: manually test some more files with new approach. I haven't yet tested any of the ones that were passing with the original implementation, and I want to make sure I didn't break something that was working.

Automated Tests

it's a lot but writing it all here to document it

  • Refactored test helpers (make_page_dict, make_mock_doc) to build mocks using get_text("dict") structure (blocks -> lines -> spans with text and size fields) instead of the old get_text("blocks") tuple format, matching the new extraction implementation.
  • updated existing tests (test_falls_back_to_font_size_if_metadata_title_is_empty, test_falls_back_to_font_size_if_metadata_title_does_not_match_regex) to provide font-size-annotated span data so the font-size extractor can find a title. Previously these tests relied on block-position matching.
  • Added test_font_size_returns_none_when_no_regex_match-- verifies that largest-font text that doesn't match the title regex (e.g., "Psychiatry Research" with only 2 words) causes the function to return None, falling through to the GPT fallback.
  • Added test_font_size_joins_adjacent_spans_in_same_block -- verifies that a title split across multiple spans (ex: "Advances in Mood Disorder" + "Pharmacotherapy") is joined into a single candidate.
  • Added test_font_size_finds_title_on_later_page-- verifies that a title on page 2 is found when it has a larger font size than page 1 text.

Reviewers

@sahilds1

Notes

Known limitations:

  • if something like a journal header is larger than the title, this won't work. We'd have to manually edit the title
  • I broadened the regex because I found that some valid titles were being rejected by it.

The old "scan first couple pages" logic used get_text("blocks") and picked the first
block matching a title regex, which frequently selected preambles,
journal names, and article headers instead of the actual title.

The new approach uses get_text("dict") to find the largest font size
across the first few pages and collects contiguous runs of text at
that size, since research paper titles are typically the
largest font.
marks, apostrophes, and non-breaking spaces in titles.
Refactor test helpers to use get_text("dict") structure instead of
get_text("blocks"). Add tests for multi-span joining, short span
filtering, regex rejection, and multi-page title detection.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PDF title extraction in FileUpload picks up partial text or wrong content

1 participant