Files
daily-paper/app
Rain-Bus b42e9149e5 feat: improve PDF extraction with image clustering, find_tables() integration, and JPEG output
- Add subfigure clustering in _find_figure_top(): collect all images near caption, cluster by Y proximity, use largest cluster's min y
- Add _find_figure_horizontal(): determine crop range from caption + embedded image union
- Refactor _find_table_region() to use page.find_tables() as primary method with segment merging, fallback to block-based detection
- Extract _scan_blocks_direction() for bidirectional block scanning with table data density awareness
- Add _TABLE_DATA_GAP_THRESHOLD for denser gap tolerance after table data blocks
- Fix caption regex to use (?-i:[A-Z]) for correct case-insensitive matching
- Switch image output from PNG to JPEG (5-10x smaller for web delivery)
- Update cleanup and filter to handle both .png and .jpg formats
- Reformat imports and conditional expressions in pages.py
2026-06-10 23:17:03 +08:00
..
2026-06-05 21:56:40 +08:00