feat: improve PDF image extraction with caption-based labeling and fallback matching

- Enhance pdf_image_extractor with caption text extraction near images/tables
- Add figure/table type correction based on caption content
- Implement sequential numbering fallback for unmatched items
- Improve figure linking in pages with manifest ID matching and fallback strategies
- Remove docling dependency, add dev dependency group
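
For illustration only, the caption-based labeling, type correction, and sequential fallback described above might look roughly like the sketch below. The diff for pdf_image_extractor is not shown here, so every name in it (`Item`, `CAPTION_RE`, `label_items`, the `figure_N` / `table_N` label format) is a hypothetical stand-in, not the code from this commit.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed caption shape: "Figure 3", "Fig. 3" or "Table 2" at the start of the text.
CAPTION_RE = re.compile(r"^\s*(fig(?:ure)?\.?|table)\s*(\d+)", re.IGNORECASE)


@dataclass
class Item:
    kind: str                    # type guessed by the extractor: "figure" or "table"
    caption: str = ""            # text found near the image/table, may be empty
    label: Optional[str] = None  # filled in by label_items


def label_items(items: List[Item]) -> List[Item]:
    """Label items from their captions; number the rest sequentially."""
    used = set()
    unmatched = []
    for item in items:
        match = CAPTION_RE.match(item.caption)
        if match:
            # Type correction: the caption keyword overrides the detected type.
            item.kind = "table" if match.group(1).lower() == "table" else "figure"
            label = f"{item.kind}_{int(match.group(2))}"
            if label not in used:
                item.label = label
                used.add(label)
                continue
        unmatched.append(item)

    # Fallback: give each unmatched item the next free number for its type.
    counters = {"figure": 0, "table": 0}
    for item in unmatched:
        while True:
            counters[item.kind] += 1
            label = f"{item.kind}_{counters[item.kind]}"
            if label not in used:
                break
        item.label = label
        used.add(label)
    return items
```

In this sketch an item detected as a figure but captioned "Table 2: ..." ends up labeled `table_2`, and items without a usable caption take the lowest numbers not already claimed by a caption.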

Rain&Bus

2026-06-09 14:07:21 +08:00

parent 32978b3fc5

commit 18f44ac244

4 changed files with 343 additions and 1593 deletions

uv.lock

Generated

+12 -1495

File diff suppressed because it is too large