Bug: AttributeError when enabling formula enrichment
Description
When do_formula_enrichment=True is set in PdfPipelineOptions, Docling fails with:
AttributeError: 'dict' object has no attribute 'model_type'
Error Details
Full Traceback:
  File "/.../docling/pipeline/standard_pdf_pipeline.py", line 456, in _init_models
    CodeFormulaModel(
  File "/.../docling/models/code_formula_model.py", line 108, in __init__
    self._processor = AutoProcessor.from_pretrained(
  File "/.../transformers/tokenization_utils_base.py", line 2419, in _from_pretrained
    if _is_local and _config.model_type not in [
AttributeError: 'dict' object has no attribute 'model_type'
Root Cause
The issue occurs in docling/models/code_formula_model.py line 108:
self._processor = AutoProcessor.from_pretrained(artifacts_path,)
When artifacts_path (a Path object) is passed to AutoProcessor.from_pretrained(), transformers loads the tokenizer config from JSON as a dict, but then tries to access _config.model_type as an object attribute (line 2419 in transformers/tokenization_utils_base.py).
Why it fails:
- Loading from a local Path: config is loaded as a dict → AttributeError when accessing .model_type
- Loading from a model name: config is properly converted to a config object → works correctly
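The dict-versus-object distinction above can be reproduced in isolation, without transformers installed. This is a minimal sketch rather than the library's code: RAW_CONFIG, read_model_type, and the "example" value are hypothetical stand-ins for a parsed tokenizer config and the attribute access in the traceback.

```python
import json
from types import SimpleNamespace

# Hypothetical config content; the real file holds many more keys.
RAW_CONFIG = '{"model_type": "example"}'


def read_model_type(config):
    """Mimic attribute-style access such as `_config.model_type`."""
    return config.model_type


# JSON parsing yields a plain dict, which has no `.model_type` attribute.
config_as_dict = json.loads(RAW_CONFIG)
# Wrapping the same data in an object makes attribute access work.
config_as_object = SimpleNamespace(**config_as_dict)

try:
    read_model_type(config_as_dict)
    dict_error = None
except AttributeError as exc:
    dict_error = str(exc)

print(dict_error)
print(read_model_type(config_as_object))
```

The first print shows the same AttributeError message as the traceback; the second succeeds because the config is an object rather than a dict.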
Environment
- Docling version: 2.63.0 (latest)
- Transformers version: 4.57.2
- Python version: 3.12
- OS: macOS (darwin 24.6.0)
- Device: MPS (Apple Silicon)
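To compare another environment against the versions listed above, the installed package versions can be collected with the standard library. A small sketch, assuming the distributions are named docling and transformers on the local index; installed_versions is a hypothetical helper, not part of either package.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["docling", "transformers"])
report["python"] = platform.python_version()
print(report)
```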
Steps to Reproduce
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
import os
import subprocess
# Set TESSDATA_PREFIX if needed
if 'TESSDATA_PREFIX' not in os.environ:
    tesseract_prefix = subprocess.run(
        ['brew', '--prefix', 'tesseract'],
        capture_output=True, text=True, check=True
    ).stdout.strip()
    if tesseract_prefix:
        os.environ['TESSDATA_PREFIX'] = f'{tesseract_prefix}/share/tessdata'
# This configuration fails
pipeline_options = PdfPipelineOptions(
    ocr_options=TesseractOcrOptions(lang=['eng'], force_full_page_ocr=True),
    do_formula_enrichment=True,  # ← This triggers the error
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
# Fails here
result = converter.convert('document.pdf')
Expected Behavior
Formula enrichment should initialize successfully and extract formulas from the document.
Actual Behavior
Initialization fails with AttributeError before any document processing occurs.
Workaround
A monkey patch can work around the issue by intercepting AutoProcessor.from_pretrained() and converting Path objects to model names:
from transformers import AutoProcessor
from pathlib import Path
original_from_pretrained = AutoProcessor.from_pretrained
def patched_from_pretrained(model_name_or_path, **kwargs):
    if isinstance(model_name_or_path, Path):
        path_str = str(model_name_or_path)
        if 'CodeFormulaV2' in path_str:
            return original_from_pretrained('docling-project/CodeFormulaV2', **kwargs)
    return original_from_pretrained(model_name_or_path, **kwargs)
AutoProcessor.from_pretrained = patched_from_pretrained
Suggested Fix
In docling/models/code_formula_model.py line 108, change:
# Current (fails):
self._processor = AutoProcessor.from_pretrained(artifacts_path,)
# Suggested fix:
self._processor = AutoProcessor.from_pretrained('docling-project/CodeFormulaV2',)
The transformers library will automatically use the cached model, so there's no need to pass the local path. This avoids the transformers bug while maintaining the same functionality.
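If loading from the local artifacts directory should stay the first choice, one possible variant of the fix is to fall back to the model name only when the local load raises the AttributeError. The following is a rough sketch under stated assumptions, not Docling's implementation: load_with_fallback is a hypothetical helper, loader stands in for any callable shaped like AutoProcessor.from_pretrained, and fake_loader merely imitates the reported behavior so the example runs without transformers.

```python
from pathlib import Path

MODEL_NAME = "docling-project/CodeFormulaV2"  # model name taken from the issue


def load_with_fallback(loader, artifacts_path, model_name=MODEL_NAME):
    """Try the local path first; retry with the model name on AttributeError.

    `loader` is any callable with the shape of `AutoProcessor.from_pretrained`.
    """
    try:
        return loader(artifacts_path)
    except AttributeError:
        return loader(model_name)


def fake_loader(source):
    """Stand-in loader that fails on Path inputs, imitating the reported bug."""
    if isinstance(source, Path):
        raise AttributeError("'dict' object has no attribute 'model_type'")
    return f"processor:{source}"


result = load_with_fallback(fake_loader, Path("/tmp/artifacts"))
print(result)
```

With transformers installed, loader would be AutoProcessor.from_pretrained. Catching only AttributeError keeps unrelated failures, such as a missing directory, visible to the caller.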
Additional Context
- The model docling-project/CodeFormulaV2 loads successfully when using the model name directly
- The issue appears to be a compatibility problem between how Docling passes paths and how transformers 4.57.2 handles local path loading
- Related issues: Bug Report: LaTeX Formula Spacing Issue with do_formula_enrichment=True #2374, Bug: do_formula_enrichment=True produces garbled text (e.g., /C0 apod) and generate_picture_images=True creates empty folders & `` placeholders #2568 (formula enrichment issues, but may be different bugs)