daily-paper

Files


Rain-Bus b42e9149e5 feat: improve PDF extraction with image clustering, find_tables() integration, and JPEG output

- Add subfigure clustering in _find_figure_top(): collect all images near caption, cluster by Y proximity, use largest cluster's min y
- Add _find_figure_horizontal(): determine crop range from caption + embedded image union
- Refactor _find_table_region() to use page.find_tables() as primary method with segment merging, fallback to block-based detection
- Extract _scan_blocks_direction() for bidirectional block scanning with table data density awareness
- Add _TABLE_DATA_GAP_THRESHOLD for denser gap tolerance after table data blocks
- Fix caption regex to use (?-i:[A-Z]) so the capital letter is matched case-sensitively inside the otherwise case-insensitive pattern
- Switch image output from PNG to JPEG (5-10x smaller for web delivery)
- Update cleanup and filter to handle both .png and .jpg formats
- Reformat imports and conditional expressions in pages.py
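
The subfigure clustering step described above could look roughly like the sketch below. This is not the code from the commit: the names `cluster_by_y` and `figure_top`, the `max_gap` value, and the choice to measure the "largest" cluster by image count are all assumptions made for illustration. Rectangles are plain `(x0, y0, x1, y1)` tuples with y increasing downward, as in PDF page coordinates.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1); y grows downward, as in typical PDF page coordinates.
Rect = Tuple[float, float, float, float]


def cluster_by_y(rects: List[Rect], max_gap: float) -> List[List[Rect]]:
    """Group rectangles into clusters by vertical proximity.

    A rectangle joins the current cluster when its top edge lies no more
    than max_gap below the lowest bottom edge already in that cluster.
    """
    clusters: List[List[Rect]] = []
    for rect in sorted(rects, key=lambda r: r[1]):
        if clusters and rect[1] - max(r[3] for r in clusters[-1]) <= max_gap:
            clusters[-1].append(rect)
        else:
            clusters.append([rect])
    return clusters


def figure_top(rects: List[Rect], max_gap: float = 20.0) -> Optional[float]:
    """Return the top edge (min y0) of the cluster with the most images.

    Hypothetical helper: "largest" is assumed to mean "most members".
    Returns None when no candidate images are given.
    """
    if not rects:
        return None
    largest = max(cluster_by_y(rects, max_gap), key=len)
    return min(r[1] for r in largest)


# Three subfigures stacked close together, plus one unrelated image far below.
images = [
    (50, 100, 150, 200),
    (160, 105, 260, 200),
    (50, 210, 150, 300),
    (300, 600, 350, 640),
]
top = figure_top(images)
```

With these sample rectangles the three nearby subfigures form one cluster and the stray image forms another, so the crop would start at the top of the three-image group rather than at the stray image.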

2026-06-10 23:17:03 +08:00

routes

feat: improve PDF extraction with image clustering, find_tables() integration, and JPEG output

2026-06-10 23:17:03 +08:00

services

feat: improve PDF extraction with image clustering, find_tables() integration, and JPEG output

2026-06-10 23:17:03 +08:00

static

fix: PDF extraction bbox compatibility, update date formats, and bump max retries

2026-06-09 18:30:04 +08:00

templates

feat: enhance PDF extraction with section-based figure routing and improved caption detection