WIP: title generation use font size spans instead of blocks by amahuli03 · Pull Request #475 · CodeForPhilly/balancer-main

amahuli03 · 2026-03-06T03:50:10Z

Description

This PR replaces block-position title extraction with a font-size-based approach. The old logic used get_text("blocks") and picked the first block matching a title regex, which frequently selected preambles, journal names, and article headers instead of the actual title. The new approach uses get_text("dict") to find the largest font size across the first three pages and collects contiguous runs of text at that size.
Loosens the title validation regex to allow years, question marks, exclamation marks, apostrophes (including '), and non-breaking spaces (\xa0) in titles, all of which appeared in real PDF titles and were incorrectly rejected.
Updates and expands tests for the new get_text("dict") mock structure, including tests for multi-span joining, short span filtering, regex rejection, and multi-page title detection.
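
To make the regex loosening concrete, here is a minimal sketch of a pattern that accepts the characters listed above. It is not the project's actual regex: the names WORD, SEP, and LOOSE_TITLE_RE, the three-word minimum, and the allowance for colons, commas, and hyphens are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for illustration only; the real title regex in the
# codebase is not shown in this PR description and may differ.
# A "word" starts with a letter or digit (so years like 2020 pass) and may
# contain straight or curly apostrophes, ?, !, and a few common punctuation marks.
WORD = r"[A-Za-z0-9][A-Za-z0-9'\u2019:,\-?!]*"

# Words may be separated by regular spaces or non-breaking spaces (\xa0).
SEP = r"[ \xa0]+"

# Assumption: a title needs at least three words.
LOOSE_TITLE_RE = re.compile(rf"^{WORD}(?:{SEP}{WORD}){{2,}}$")

for candidate in (
    "Lithium in 2020: What's Next?",
    "Mood\xa0Disorders Revisited",
    "Psychiatry Research",
):
    print(repr(candidate), bool(LOOSE_TITLE_RE.match(candidate)))
```

Under these assumptions the first two candidates match and the two-word journal name does not, which is the behavior the tests in this PR describe for "Psychiatry Research".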

TODO: Not sure about the regex changes. Need to know more about the reasoning behind the original regex.

How it works

Title extraction follows a 3-tier fallback:

Metadata: use the PDF's embedded title if it passes the regex
Font size (new): scan the first 3 pages, find the max font size, collect contiguous runs of text at that size, and pick the longest regex-matching candidate
GPT-4 fallback: summarize the first page if no title is found
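
The font-size tier can be sketched roughly as follows. This is not the code in the PR: it works on plain dictionaries shaped like PyMuPDF's page.get_text("dict") output (blocks -> lines -> spans with text and size), and the names iter_spans, title_from_font_size, TITLE_RE, the size tolerance, and the rule that a run ends at a block boundary are all assumptions.

```python
import re
from typing import Optional

# Hypothetical stand-in for the project's title regex: at least three words.
TITLE_RE = re.compile(r"^\S+(?:\s+\S+){2,}$")


def iter_spans(page_dicts):
    """Yield ((page_no, block_no), size, text) for each non-empty text span."""
    for page_no, page in enumerate(page_dicts):
        for block_no, block in enumerate(page.get("blocks", [])):
            # Image blocks carry no "lines" key, so they are skipped here.
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text = span.get("text", "").strip()
                    if text:
                        yield (page_no, block_no), span["size"], text


def title_from_font_size(page_dicts, max_pages=3, tol=0.1) -> Optional[str]:
    """Return the longest regex-matching run of largest-font text, or None."""
    spans = list(iter_spans(page_dicts[:max_pages]))
    if not spans:
        return None
    max_size = max(size for _, size, _ in spans)

    candidates, run, run_block = [], [], None
    for block_key, size, text in spans:
        at_max = abs(size - max_size) <= tol
        if at_max and (not run or block_key == run_block):
            # Extend the current run (or start one) within the same block.
            run.append(text)
            run_block = block_key
        else:
            # A smaller span or a new block ends the current run.
            if run:
                candidates.append(" ".join(run))
            run = [text] if at_max else []
            run_block = block_key if at_max else None
    if run:
        candidates.append(" ".join(run))

    matching = [c for c in candidates if TITLE_RE.match(c)]
    return max(matching, key=len) if matching else None
```

Returning None when nothing matches lets the caller fall through to the GPT-4 tier. With a real PDF, the input could be built as [page.get_text("dict") for page in doc], assuming doc is an open PyMuPDF document.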

Related Issue

Fixes #469

Manual Tests

Before making these changes, I tested 50 file uploads.
Every wrong title came from the second path (the text-block path), so I went ahead with a fix that replaces that path.
After making these changes and getting the unit tests passing, I manually tested, end to end, the files that had wrong titles under the original approach. As shown in the spreadsheet, the new approach extracted the correct title for all of them.

I need to test a little more before I can feel confident with this approach.
TODO: manually test some more files with new approach. I haven't yet tested any of the ones that were passing with the original implementation, and I want to make sure I didn't break something that was working.

Automated Tests

It's a lot, but I'm writing it all here to document it.

Refactored test helpers (make_page_dict, make_mock_doc) to build mocks using the get_text("dict") structure (blocks -> lines -> spans with text and size fields) instead of the old get_text("blocks") tuple format, matching the new extraction implementation.
Updated existing tests (test_falls_back_to_font_size_if_metadata_title_is_empty, test_falls_back_to_font_size_if_metadata_title_does_not_match_regex) to provide font-size-annotated span data so the font-size extractor can find a title. Previously these tests relied on block-position matching.
Added test_font_size_returns_none_when_no_regex_match -- verifies that largest-font text that doesn't match the title regex (e.g., "Psychiatry Research", with only 2 words) causes the function to return None, falling through to the GPT fallback.
Added test_font_size_joins_adjacent_spans_in_same_block -- verifies that a title split across multiple spans (e.g., "Advances in Mood Disorder" + "Pharmacotherapy") is joined into a single candidate.
Added test_font_size_finds_title_on_later_page -- verifies that a title on page 2 is found when it has a larger font size than the page 1 text.
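
For readers unfamiliar with the get_text("dict") shape, the refactored helpers might look something like the sketch below. The helper names come from this PR, but their signatures and bodies here are guesses; the actual helpers in the test file may take different arguments.

```python
from unittest.mock import MagicMock


def make_page_dict(spans_by_block):
    """Build a dict shaped like page.get_text("dict").

    spans_by_block is a list of blocks, each a list of (text, size) tuples.
    Hypothetical signature: every block gets a single line holding its spans.
    """
    return {
        "blocks": [
            {
                "type": 0,  # 0 marks a text block in PyMuPDF's dict output
                "lines": [
                    {"spans": [{"text": text, "size": size} for text, size in block]}
                ],
            }
            for block in spans_by_block
        ]
    }


def make_mock_doc(page_dicts, metadata_title=""):
    """Build a mock document whose pages return the given dicts from get_text."""
    pages = []
    for page_dict in page_dicts:
        page = MagicMock()
        page.get_text.return_value = page_dict
        pages.append(page)

    doc = MagicMock()
    doc.metadata = {"title": metadata_title}
    doc.__len__.return_value = len(pages)
    doc.__getitem__.side_effect = lambda index: pages[index]
    return doc


# Usage: a title split across two spans in the same block.
doc = make_mock_doc(
    [make_page_dict([[("Advances in Mood Disorder", 18.0), ("Pharmacotherapy", 18.0)]])]
)
spans = doc[0].get_text("dict")["blocks"][0]["lines"][0]["spans"]
print([span["text"] for span in spans])
```

Note that this mock returns the same dict for any get_text argument, which is enough for tests that only call get_text("dict").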

Reviewers

@sahilds1

Notes

Known limitations:

If something like a journal header is set in a larger font than the title, this approach won't work, and we'd have to edit the title manually.
I broadened the regex because I found that some valid titles were being rejected by it.

The old "scan first couple pages" logic used get_text("blocks") and picked the first block matching a title regex, which frequently selected preambles, journal names, and article headers instead of the actual title. The new approach uses get_text("dict") to find the largest font size across the first few pages and collects contiguous runs of text at that size, since research paper titles are typically the largest font.


amahuli03 added 3 commits March 5, 2026 22:25

2a822f6: loosens the title regex to allow years, question marks, apostrophes, and non-breaking spaces in titles.

edf1eb6: Update tests for font-size-based title extraction. Refactor test helpers to use get_text("dict") structure instead of get_text("blocks"). Add tests for multi-span joining, short span filtering, regex rejection, and multi-page title detection.

amahuli03 wants to merge 3 commits into CodeForPhilly:develop from amahuli03:title-generation-font-size
