daily-paper

Author	SHA1	Message	Date
Rain-Bus	29fb20828e	feat: add concurrency safety, caption detection, admin enhancements, and performance improvements	2026-06-14 22:20:02 +08:00
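Rain-Bus	90fe705e8f	refactor: migrate layout detection model from PicoDet to DocLayout-YOLO - Core changes: - app/services/layout_detector.py: rewrite the layout detector, migrating from PicoDet-S_layout_3cls to DocLayout-YOLO (DocStructBench, imgsz=1024) - Support multi-device inference (CPU/CUDA/DirectML/OpenVINO, etc.) with automatic detection of the best device - Switch preprocessing to letterbox (aspect-ratio-preserving resize + gray padding); restore coordinates with the formula (model_coord - padding) / ratio - Post-processing parses the YOLOv10 end-to-end output [N,6]=[x1,y1,x2,y2,conf,cls] - Class mapping now matches dynamically by class name (figure/figure_group→picture, table/table_group→table) - New files: - scripts/export_doclayout_yolo_onnx.py: DocLayout-YOLO ONNX export script (runs in a separate venv) - tests/test_layout_detector.py: full test suite for the layout detector (35 cases) - Config updates: - .env.example: update layout detection settings (add LAYOUT_IMGSZ, LAYOUT_DEVICE, LAYOUT_DEVICE_ID) - app/config.py: corresponding fields in the Settings class - pyproject.toml: add export dependency group (torch, doclayout-yolo, onnx, etc.) - Removed files: - scripts/export_picodet_onnx.py: old PicoDet export script - Documentation updates: - README.md: update environment variable descriptions - Update comments in related services (pdf_image_extractor.py, summary_persister.py, reextract_images.py) This refactor follows the project's early-development-stage convention: data models are changed boldly, with no backward compatibility required.	2026-06-14 10:41:44 +08:00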
Rain-Bus	743d69efd0	refactor: extract admin business logic to services, introduce job queue, add derived index helpers - Move DB operations from routes/admin.py to services/admin.py (get_logs_context, query_summary_statuses, retry_failed, delete/reset operations) - Add services/jobs.py with Job/JobEvent-based async job queue (create_job, run_job, enqueue_job) - Add services/derived.py with FTS5 reindex and paper index deletion helpers - Refactor scheduler to use job queue instead of direct pipeline calls - Add heartbeat_at/expires_at to TaskLock for lock health tracking - Remove DESIGN_REVIEW.md - Update tests: remove redundant integration tests, add unit tests for new services	2026-06-13 18:31:43 +08:00
Rain-Bus	21f16e6756	feat: refactor summarizer and PDF extraction pipeline - Split summarizer into summary_generator and summary_persister modules - Refactor pdf_image_extractor to two-phase pipeline with PicoDet layout detection - Add layout_detector service for PicoDet-S_layout_3cls integration - Add exceptions module with ConflictError and NotFoundError - Improve admin dashboard with better statistics and task management - Add design review document with system optimization suggestions - Add new tests for crawler, pdf_downloader, pipeline, and summary_utils - Update dependencies and configuration - Clean up dead code and improve error handling	2026-06-13 13:16:47 +08:00
Rain-Bus	b42e9149e5	feat: improve PDF extraction with image clustering, find_tables() integration, and JPEG output - Add subfigure clustering in _find_figure_top(): collect all images near caption, cluster by Y proximity, use largest cluster's min y - Add _find_figure_horizontal(): determine crop range from caption + embedded image union - Refactor _find_table_region() to use page.find_tables() as primary method with segment merging, fallback to block-based detection - Extract _scan_blocks_direction() for bidirectional block scanning with table data density awareness - Add _TABLE_DATA_GAP_THRESHOLD for denser gap tolerance after table data blocks - Fix caption regex to use (?-i:[A-Z]) so the capital letter is matched case-sensitively inside an otherwise case-insensitive pattern - Switch image output from PNG to JPEG (5-10x smaller for web delivery) - Update cleanup and filter to handle both .png and .jpg formats - Reformat imports and conditional expressions in pages.py	2026-06-10 23:17:03 +08:00
Rain-Bus	a1e0962820	feat: enhance PDF extraction with section-based figure routing and improved caption detection	2026-06-10 02:05:30 +08:00
Rain-Bus	1fc6303e09	feat: refactor PDF extraction to caption-based screenshots, add upvote refresh, clean up UI - PDF extractor: rewrite from embedded bitmap extraction to caption-based page region screenshots. Finds Figure/Table captions via regex, captures the page region above/below, handles compound figures and vector graphics. - Upvote refresh: new crawler.refresh_upvotes() re-fetches upvotes for recent N days without inserting new papers. Scheduler runs daily 30min after pipeline. - Admin: add /admin/refresh-upvotes endpoint and dashboard button. - UI: remove date quick nav, show upvote update time on detail/card pages, clean up CSS date-chip styles. - Utils: add recent_date_strs() helper.	2026-06-09 18:01:01 +08:00
Rain-Bus	18f44ac244	feat: improve PDF image extraction with caption-based labeling and fallback matching - Enhance pdf_image_extractor with caption text extraction near images/tables - Add figure/table type correction based on caption content - Implement sequential numbering fallback for unmatched items - Improve figure linking in pages with manifest ID matching and fallback strategies - Remove docling dependency, add dev dependency group	2026-06-09 14:07:21 +08:00
Rain-Bus	32978b3fc5	feat: add admin dashboard, pipeline service, lightbox, and update dependencies	2026-06-09 09:32:10 +08:00
Rain-Bus	0d293422ac	feat: enhance UI, refactor services, improve templates and tests - Replace image_extractor with pdf_image_extractor service - Enhance pi_client with expanded API capabilities - Improve summarizer service with additional features - Update admin routes with more endpoints - Add login page template - Enhance detail page with comprehensive layout - Improve search and trends pages - Update base template with additional elements - Refactor tests for better coverage - Add validate_summary script - Update project configuration and dependencies	2026-06-07 19:38:58 +08:00
Rain-Bus	4072a05460	chore: update project config, services, templates and styling	2026-06-06 00:49:45 +08:00
Rain-Bus	904eec392e	feat: overhaul UI styling, improve templates, enhance services and tests	2026-06-06 00:38:56 +08:00
Rain-Bus	85c4cfb9e8	refactor: restructure services and add image/pdf extraction utilities - Add image_extractor, pdf_downloader, pi_client, trends services - Add shared utils module - Refactor summarizer, embedder, routes for cleaner separation - Update tests to match new service structure	2026-06-06 00:00:55 +08:00
Rain-Bus	ba9afa212c	feat: add compare, trends routes, embedder service, and phase5 tests	2026-06-05 23:32:06 +08:00
Rain-Bus	2cfd1a8a9f	feat: add admin crawl, cleanup, delete, logs endpoints with scheduler and tests - Add POST /admin/crawl with TaskLock-based reentrancy guard - Add POST /admin/cleanup (tmp files older than 24h) with CrawlLog - Add POST /admin/delete with date range and 'DELETE' confirm token - Add GET /admin/logs (paginated CrawlLog + DataDeleteJob viewer) - Add app/services/cleaner.py (cleanup_tmp, delete_papers_by_date_range) - Add app/services/scheduler.py (APScheduler daily crawl/cleanup jobs) - Wire scheduler startup/shutdown hooks in app/main.py - Add admin nav link in base.html and APP_HOST security warning - Add apscheduler>=3.10 dependency - Add tests/test_admin_phase4.py covering the new endpoints	2026-06-05 23:07:45 +08:00
Rain-Bus	1538d564f6	feat: add search and user data routes, services, and tests	2026-06-05 22:53:27 +08:00
Rain-Bus	29e6797c12	feat: add admin routes, summarizer service, and CLI summarize command - Add /admin routes for manual trigger and status inspection - Add summarizer service with batch/single summary support - Add summarize CLI command (single arxiv_id or batch pending) - Register admin router in main app - Add tests for summarizer	2026-06-05 22:29:33 +08:00
Rain-Bus	f1be24ab83	feat: initial project structure - Add FastAPI app with paper browsing UI and REST API - Add crawler service and database models - Add scripts for DB init and manual crawl - Add docs (api-and-ui, data-model, services) - Add requirements and project config	2026-06-05 21:56:40 +08:00
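
Commit 90fe705e8f describes letterbox preprocessing and restoring detections with (model_coord - padding) / ratio. The following is a minimal sketch of that arithmetic, not the code in app/services/layout_detector.py: the function names are hypothetical, and centered padding (equal borders on both sides) is an assumption, since the commit message does not say where the padding is placed.

```python
def letterbox_params(src_w, src_h, imgsz=1024):
    """Return (ratio, pad_x, pad_y) for an aspect-preserving resize into imgsz x imgsz.

    Assumption: padding is split evenly on both sides (centered letterbox).
    """
    ratio = min(imgsz / src_w, imgsz / src_h)
    new_w, new_h = round(src_w * ratio), round(src_h * ratio)
    pad_x = (imgsz - new_w) / 2
    pad_y = (imgsz - new_h) / 2
    return ratio, pad_x, pad_y


def restore_box(box, ratio, pad_x, pad_y):
    """Map an [x1, y1, x2, y2] box from model space back to source-image space
    using (model_coord - padding) / ratio on each coordinate."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / ratio,
        (y1 - pad_y) / ratio,
        (x2 - pad_x) / ratio,
        (y2 - pad_y) / ratio,
    )


# Usage: a 2048x1024 page scaled into a 1024x1024 model input.
ratio, pad_x, pad_y = letterbox_params(2048, 1024)
print(restore_box((100, 356, 200, 456), ratio, pad_x, pad_y))
```

For a 2048x1024 source the ratio is 0.5 and only the vertical axis is padded (256 px top and bottom), so a box at y=356 in model space maps back to y=200 in the source image.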

18 Commits
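
Commit b42e9149e5 mentions the scoped flag group (?-i:[A-Z]) in the caption regex. The sketch below illustrates how such a group behaves in Python's re module; the pattern itself is hypothetical and is not taken from the repository. It assumes captions look like "Figure 3: Text" or "Table 2. Text".

```python
import re

# Hypothetical caption pattern: the keyword is matched case-insensitively,
# but (?-i:[A-Z]) switches IGNORECASE off locally, so the caption text
# must start with a real uppercase letter.
CAPTION_RE = re.compile(
    r"^(?:figure|fig\.?|table)\s+(\d+)\s*[.:]\s*(?-i:[A-Z])",
    re.IGNORECASE,
)


def is_caption(line):
    """Return True if the line looks like a figure/table caption."""
    return CAPTION_RE.match(line) is not None


print(is_caption("Figure 3: Overview of the pipeline"))
print(is_caption("figure 3: overview of the pipeline"))
```

Without the scoped group, [A-Z] under re.IGNORECASE would also accept lowercase letters, so in-text mentions such as "figure 3: overview ..." could be mistaken for captions.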