Commit Graph

13 Commits

Author SHA1 Message Date
Rain-Bus 1ccac1f29a refactor: replace Phase 2 label matching with PDF text-stream caption pairing
- Extract captions from PDF text dict instead of DocLayout caption boxes
- Use _CaptionBlock dataclass to carry authoritative ID, kind, text, bbox
- Pair captions to content boxes with directional preference (figure below, table above)
- Filter out uncaptioned boxes (Algorithm pseudo-code, unnumbered appendix tables, false positives)
- Remove label_images_by_summary and Phase 2 rename pipeline entirely
- Update tests to cover text-based caption pairing and filtering
2026-06-15 01:09:29 +08:00
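
The directional pairing described in this commit (caption below a figure, above a table, uncaptioned boxes dropped) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `CaptionBlock` stands in for the commit's `_CaptionBlock`, and `pair_captions`, `wrong_side_penalty`, and `max_gap` are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


@dataclass
class CaptionBlock:
    """Hypothetical stand-in for the commit's _CaptionBlock."""
    id: str    # authoritative ID taken from the caption text, e.g. "Figure 1"
    kind: str  # "figure" or "table"
    text: str
    bbox: BBox


def _overlaps_horizontally(a: BBox, b: BBox) -> bool:
    return min(a[2], b[2]) - max(a[0], b[0]) > 0


def _vertical_gap(caption_bbox: BBox, box: BBox) -> tuple[float, bool]:
    """Return (gap, caption_below) for a caption relative to a content box."""
    _, cy0, _, cy1 = caption_bbox
    _, by0, _, by1 = box
    caption_below = (cy0 + cy1) / 2 > (by0 + by1) / 2
    gap = cy0 - by1 if caption_below else by0 - cy1
    return max(gap, 0.0), caption_below


def pair_captions(
    captions: list[CaptionBlock],
    boxes: list[BBox],
    wrong_side_penalty: float = 200.0,  # assumed value
    max_gap: float = 150.0,             # assumed value
) -> dict[str, int]:
    """Map each caption ID to the index of its best content box.

    Boxes that no caption claims are simply absent from the result,
    which is how uncaptioned detections get filtered out.
    """
    pairs: dict[str, int] = {}
    used: set[int] = set()
    for cap in captions:
        best, best_cost = None, float("inf")
        for i, box in enumerate(boxes):
            if i in used or not _overlaps_horizontally(cap.bbox, box):
                continue
            gap, caption_below = _vertical_gap(cap.bbox, box)
            if gap > max_gap:
                continue
            # Figures prefer the caption below, tables prefer it above.
            preferred = caption_below if cap.kind == "figure" else not caption_below
            cost = gap + (0.0 if preferred else wrong_side_penalty)
            if cost < best_cost:
                best, best_cost = i, cost
        if best is not None:
            pairs[cap.id] = best
            used.add(best)
    return pairs
```

The greedy loop is the simplest option; a real implementation might resolve conflicts globally rather than in caption order.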
Rain-Bus 29fb20828e feat: add concurrency safety, caption detection, admin enhancements, and performance improvements 2026-06-14 22:20:02 +08:00
Rain-Bus 8f13c31991 refactor: clean up redundant code and outdated configuration 2026-06-14 12:56:02 +08:00
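Rain-Bus 90fe705e8f refactor: migrate layout detection model from PicoDet to DocLayout-YOLO
- Core changes:
  - app/services/layout_detector.py: rewrite the layout detector, migrating from PicoDet-S_layout_3cls to DocLayout-YOLO (DocStructBench, imgsz=1024)
  - Support multi-device inference (CPU/CUDA/DirectML/OpenVINO, etc.) with automatic detection of the best available device
  - Switch preprocessing to letterbox (aspect-preserving resize plus gray-border padding); restore coordinates with (model_coord - padding) / ratio
  - Post-processing parses the YOLOv10 end-to-end output [N,6]=[x1,y1,x2,y2,conf,cls]
  - Class mapping now matches dynamically by class name (figure/figure_group→picture, table/table_group→table)

- New files:
  - scripts/export_doclayout_yolo_onnx.py: DocLayout-YOLO ONNX export script (runs in a separate venv)
  - tests/test_layout_detector.py: full test suite for the layout detector (35 test cases)

- Configuration updates:
  - .env.example: update layout detection settings (add LAYOUT_IMGSZ, LAYOUT_DEVICE, LAYOUT_DEVICE_ID)
  - app/config.py: corresponding fields in the Settings class
  - pyproject.toml: add an export dependency group (torch, doclayout-yolo, onnx, etc.)

- Removed files:
  - scripts/export_picodet_onnx.py: old PicoDet export script

- Documentation updates:
  - README.md: update the environment variable documentation
  - Update comments in related services (pdf_image_extractor.py, summary_persister.py, reextract_images.py)

This refactor follows the project's early-development conventions: the data model is changed freely, with no backward compatibility required.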
2026-06-14 10:41:44 +08:00
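
The letterbox coordinate restore and the [N,6] output parsing named in this commit can be illustrated with plain arithmetic. The sketch below is an assumption-laden example rather than the code in layout_detector.py: the function names, the centered padding, the `names` lookup, and the `conf_threshold` default are all hypothetical. Only the formula (model_coord - padding) / ratio, the row layout, and the class-name mapping come from the commit message.

```python
# Class-name mapping quoted from the commit message.
CLASS_MAP = {
    "figure": "picture",
    "figure_group": "picture",
    "table": "table",
    "table_group": "table",
}


def letterbox_params(src_w: int, src_h: int, imgsz: int = 1024):
    """Compute the scale ratio and padding of an aspect-preserving resize.

    Assumes the resized image is centered on the square canvas; some
    letterbox variants pad only the bottom and right edges instead.
    """
    ratio = min(imgsz / src_w, imgsz / src_h)
    new_w, new_h = round(src_w * ratio), round(src_h * ratio)
    pad_x = (imgsz - new_w) / 2
    pad_y = (imgsz - new_h) / 2
    return ratio, pad_x, pad_y


def restore_box(box, ratio, pad_x, pad_y):
    """Map a box from model space to source space: (model_coord - padding) / ratio."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / ratio,
        (y1 - pad_y) / ratio,
        (x2 - pad_x) / ratio,
        (y2 - pad_y) / ratio,
    )


def parse_detections(rows, names, ratio, pad_x, pad_y, conf_threshold=0.25):
    """Parse end-to-end rows [x1, y1, x2, y2, conf, cls] into labeled boxes.

    `names` maps class index to class name; classes outside CLASS_MAP
    and rows below `conf_threshold` (an assumed default) are skipped.
    """
    results = []
    for x1, y1, x2, y2, conf, cls in rows:
        label = CLASS_MAP.get(names.get(int(cls), ""))
        if label is None or conf < conf_threshold:
            continue
        results.append((label, conf, restore_box((x1, y1, x2, y2), ratio, pad_x, pad_y)))
    return results
```

For a 2048x1024 page rendered into a 1024x1024 input, the ratio is 0.5 with 256 pixels of padding above and below, so a model-space y of 356 maps back to (356 - 256) / 0.5 = 200 in the source image.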
Rain-Bus 21f16e6756 feat: refactor summarizer and PDF extraction pipeline
- Split summarizer into summary_generator and summary_persister modules
- Refactor pdf_image_extractor to two-phase pipeline with PicoDet layout detection
- Add layout_detector service for PicoDet-S_layout_3cls integration
- Add exceptions module with ConflictError and NotFoundError
- Improve admin dashboard with better statistics and task management
- Add design review document with system optimization suggestions
- Add new tests for crawler, pdf_downloader, pipeline, and summary_utils
- Update dependencies and configuration
- Clean up dead code and improve error handling
2026-06-13 13:16:47 +08:00
Rain-Bus e2f0e1a8be feat: add claude backend, refactor summary utilities, improve batch worker pattern, add pymupdf4llm 2026-06-12 22:25:57 +08:00
Rain-Bus b42e9149e5 feat: improve PDF extraction with image clustering, find_tables() integration, and JPEG output
- Add subfigure clustering in _find_figure_top(): collect all images near caption, cluster by Y proximity, use largest cluster's min y
- Add _find_figure_horizontal(): determine crop range from caption + embedded image union
- Refactor _find_table_region() to use page.find_tables() as primary method with segment merging, fallback to block-based detection
- Extract _scan_blocks_direction() for bidirectional block scanning with table data density awareness
- Add _TABLE_DATA_GAP_THRESHOLD for denser gap tolerance after table data blocks
- Fix caption regex to use (?-i:[A-Z]) so the capital-letter check stays case-sensitive inside the otherwise case-insensitive pattern
- Switch image output from PNG to JPEG (5-10x smaller for web delivery)
- Update cleanup and filter to handle both .png and .jpg formats
- Reformat imports and conditional expressions in pages.py
2026-06-10 23:17:03 +08:00
Rain-Bus a1e0962820 feat: enhance PDF extraction with section-based figure routing and improved caption detection 2026-06-10 02:05:30 +08:00
Rain-Bus c94ff48254 fix: PDF extraction bbox compatibility, update date formats, and bump max retries
- Fix bbox format detection in pdf_image_extractor (support Rect and tuple)
- Update date display format to include year (%Y-%m-%d) across templates
- Increase SUMMARY_MAX_RETRIES from 1 to 2 for better error recovery
- Widen date input field for better usability
2026-06-09 18:30:04 +08:00
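
One plausible shape for the bbox compatibility fix is a small normalizer that accepts either a Rect-like object or a plain 4-sequence. This is a sketch under assumptions: `normalize_bbox` is a hypothetical name, and `FakeRect` is a stand-in used so the example runs without PyMuPDF (whose `Rect` exposes `x0`, `y0`, `x1`, `y1` attributes).

```python
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass
class FakeRect:
    """Stand-in for a Rect-like object with x0/y0/x1/y1 attributes."""
    x0: float
    y0: float
    x1: float
    y1: float


def normalize_bbox(bbox) -> tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) as floats from a Rect-like object or a 4-sequence."""
    if all(hasattr(bbox, attr) for attr in ("x0", "y0", "x1", "y1")):
        coords = (bbox.x0, bbox.y0, bbox.x1, bbox.y1)
    elif (
        isinstance(bbox, Sequence)
        and not isinstance(bbox, (str, bytes))
        and len(bbox) == 4
    ):
        coords = tuple(bbox)
    else:
        raise TypeError(f"unsupported bbox format: {bbox!r}")
    return tuple(float(c) for c in coords)
```

Checking for attributes first means the helper does not need to import the PDF library just to recognize its rectangle type.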
Rain-Bus 1fc6303e09 feat: refactor PDF extraction to caption-based screenshots, add upvote refresh, clean up UI
- PDF extractor: rewrite from embedded bitmap extraction to caption-based
  page region screenshots. Finds Figure/Table captions via regex, captures the
  page region above/below, handles compound figures and vector graphics.
- Upvote refresh: new crawler.refresh_upvotes() re-fetches upvotes for recent
  N days without inserting new papers. Scheduler runs daily 30min after pipeline.
- Admin: add /admin/refresh-upvotes endpoint and dashboard button.
- UI: remove date quick nav, show upvote update time on detail/card pages,
  clean up CSS date-chip styles.
- Utils: add recent_date_strs() helper.
2026-06-09 18:01:01 +08:00
Rain-Bus 18f44ac244 feat: improve PDF image extraction with caption-based labeling and fallback matching
- Enhance pdf_image_extractor with caption text extraction near images/tables
- Add figure/table type correction based on caption content
- Implement sequential numbering fallback for unmatched items
- Improve figure linking in pages with manifest ID matching and fallback strategies
- Remove docling dependency, add dev dependency group
2026-06-09 14:07:21 +08:00
Rain-Bus 32978b3fc5 feat: add admin dashboard, pipeline service, lightbox, and update dependencies 2026-06-09 09:32:10 +08:00
Rain-Bus 0d293422ac feat: enhance UI, refactor services, improve templates and tests
- Replace image_extractor with pdf_image_extractor service
- Enhance pi_client with expanded API capabilities
- Improve summarizer service with additional features
- Update admin routes with more endpoints
- Add login page template
- Enhance detail page with comprehensive layout
- Improve search and trends pages
- Update base template with additional elements
- Refactor tests for better coverage
- Add validate_summary script
- Update project configuration and dependencies
2026-06-07 19:38:58 +08:00