daily-paper

Author	SHA1	Message	Date
Rain-Bus	1ccac1f29a	refactor: replace Phase 2 label matching with PDF text-stream caption pairing - Extract captions from PDF text dict instead of DocLayout caption boxes - Use _CaptionBlock dataclass to carry authoritative ID, kind, text, bbox - Pair captions to content boxes with directional preference (figure below, table above) - Filter out uncaptioned boxes (Algorithm pseudo-code, unnumbered appendix tables, false positives) - Remove label_images_by_summary and Phase 2 rename pipeline entirely - Update tests to cover text-based caption pairing and filtering	2026-06-15 01:09:29 +08:00
Rain-Bus	29fb20828e	feat: add concurrency safety, caption detection, admin enhancements, and performance improvements	2026-06-14 22:20:02 +08:00
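Rain-Bus	8f13c31991	refactor: clean up redundant code and outdated configuration	2026-06-14 12:56:02 +08:00
Rain-Bus	90fe705e8f	refactor: migrate layout detection model from PicoDet to DocLayout-YOLO - Core changes: - app/services/layout_detector.py: rewrite the layout detector, migrating from PicoDet-S_layout_3cls to DocLayout-YOLO (DocStructBench, imgsz=1024) - Support multi-device inference (CPU/CUDA/DirectML/OpenVINO, etc.) with automatic detection of the best device - Switch preprocessing to letterbox (aspect-preserving resize + gray padding); restore coordinates with the formula (model_coord - padding) / ratio - Post-processing parses the YOLOv10 end-to-end output [N,6]=[x1,y1,x2,y2,conf,cls] - Class mapping now matches dynamically by class name (figure/figure_group→picture, table/table_group→table) - New files: - scripts/export_doclayout_yolo_onnx.py: DocLayout-YOLO ONNX export script (runs in a separate venv) - tests/test_layout_detector.py: full layout detector tests (35 cases) - Configuration updates: - .env.example: update layout detection settings (add LAYOUT_IMGSZ, LAYOUT_DEVICE, LAYOUT_DEVICE_ID) - app/config.py: corresponding fields in the Settings class - pyproject.toml: add export dependency group (torch, doclayout-yolo, onnx, etc.) - Removed old files: - scripts/export_picodet_onnx.py: old PicoDet export script - Documentation updates: - README.md: update environment variable descriptions - Update comments in related services (pdf_image_extractor.py, summary_persister.py, reextract_images.py) This refactor follows the project's early-development conventions: data models are changed boldly, with no backward compatibility required.	2026-06-14 10:41:44 +08:00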
Rain-Bus	21f16e6756	feat: refactor summarizer and PDF extraction pipeline - Split summarizer into summary_generator and summary_persister modules - Refactor pdf_image_extractor to two-phase pipeline with PicoDet layout detection - Add layout_detector service for PicoDet-S_layout_3cls integration - Add exceptions module with ConflictError and NotFoundError - Improve admin dashboard with better statistics and task management - Add design review document with system optimization suggestions - Add new tests for crawler, pdf_downloader, pipeline, and summary_utils - Update dependencies and configuration - Clean up dead code and improve error handling	2026-06-13 13:16:47 +08:00
Rain-Bus	e2f0e1a8be	feat: add claude backend, refactor summary utilities, improve batch worker pattern, add pymupdf4llm	2026-06-12 22:25:57 +08:00
Rain-Bus	b42e9149e5	feat: improve PDF extraction with image clustering, find_tables() integration, and JPEG output - Add subfigure clustering in _find_figure_top(): collect all images near caption, cluster by Y proximity, use largest cluster's min y - Add _find_figure_horizontal(): determine crop range from caption + embedded image union - Refactor _find_table_region() to use page.find_tables() as primary method with segment merging, fallback to block-based detection - Extract _scan_blocks_direction() for bidirectional block scanning with table data density awareness - Add _TABLE_DATA_GAP_THRESHOLD for denser gap tolerance after table data blocks - Fix caption regex to use (?-i:[A-Z]) for correct case-insensitive matching - Switch image output from PNG to JPEG (5-10x smaller for web delivery) - Update cleanup and filter to handle both .png and .jpg formats - Reformat imports and conditional expressions in pages.py	2026-06-10 23:17:03 +08:00
Rain-Bus	a1e0962820	feat: enhance PDF extraction with section-based figure routing and improved caption detection	2026-06-10 02:05:30 +08:00
Rain-Bus	c94ff48254	fix: PDF extraction bbox compatibility, update date formats, and bump max retries - Fix bbox format detection in pdf_image_extractor (support Rect and tuple) - Update date display format to include year (%Y-%m-%d) across templates - Increase SUMMARY_MAX_RETRIES from 1 to 2 for better error recovery - Widen date input field for better usability	2026-06-09 18:30:04 +08:00
Rain-Bus	1fc6303e09	feat: refactor PDF extraction to caption-based screenshots, add upvote refresh, clean up UI - PDF extractor: rewrite from embedded bitmap extraction to caption-based page region screenshots. Finds Figure/Table captions via regex, captures the page region above/below, handles compound figures and vector graphics. - Upvote refresh: new crawler.refresh_upvotes() re-fetches upvotes for recent N days without inserting new papers. Scheduler runs daily 30min after pipeline. - Admin: add /admin/refresh-upvotes endpoint and dashboard button. - UI: remove date quick nav, show upvote update time on detail/card pages, clean up CSS date-chip styles. - Utils: add recent_date_strs() helper.	2026-06-09 18:01:01 +08:00
Rain-Bus	18f44ac244	feat: improve PDF image extraction with caption-based labeling and fallback matching - Enhance pdf_image_extractor with caption text extraction near images/tables - Add figure/table type correction based on caption content - Implement sequential numbering fallback for unmatched items - Improve figure linking in pages with manifest ID matching and fallback strategies - Remove docling dependency, add dev dependency group	2026-06-09 14:07:21 +08:00
Rain-Bus	32978b3fc5	feat: add admin dashboard, pipeline service, lightbox, and update dependencies	2026-06-09 09:32:10 +08:00
Rain-Bus	0d293422ac	feat: enhance UI, refactor services, improve templates and tests - Replace image_extractor with pdf_image_extractor service - Enhance pi_client with expanded API capabilities - Improve summarizer service with additional features - Update admin routes with more endpoints - Add login page template - Enhance detail page with comprehensive layout - Improve search and trends pages - Update base template with additional elements - Refactor tests for better coverage - Add validate_summary script - Update project configuration and dependencies	2026-06-07 19:38:58 +08:00

13 Commits