feat: refactor summarizer and PDF extraction pipeline

- Split summarizer into summary_generator and summary_persister modules
- Refactor pdf_image_extractor into a two-phase pipeline with PicoDet layout detection
- Add layout_detector service for PicoDet-S_layout_3cls integration
- Add exceptions module with ConflictError and NotFoundError
- Improve admin dashboard with better statistics and task management
- Add design review document with system optimization suggestions
- Add new tests for crawler, pdf_downloader, pipeline, and summary_utils
- Update dependencies and configuration
- Clean up dead code and improve error handling
2026-06-13 13:16:47 +08:00
parent e2f0e1a8be
commit 21f16e6756
43 changed files with 3304 additions and 1494 deletions
+8 -5
@@ -207,11 +207,14 @@ async def delete_papers_by_date_range(
 completed_at=utc_now(),
 papers_found=total,
 papers_new=deleted,
-details_json=json.dumps({
-    "total_before": total,
-    "deleted": deleted,
-    "failed": len(failed_items),
-}, ensure_ascii=False),
+details_json=json.dumps(
+    {
+        "total_before": total,
+        "deleted": deleted,
+        "failed": len(failed_items),
+    },
+    ensure_ascii=False,
+),
 error=job_error,
 )
 db.add(log_entry)
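
The hunk above appears to be a formatting-only change: the `json.dumps` call is re-wrapped, and its arguments stay the same. A minimal sketch of how that could be checked follows. The helper names `details_before` and `details_after` and the sample values are hypothetical and not part of the commit; only `json.dumps` and `ensure_ascii=False` come from the diff.

```python
import json


def details_before(total, deleted, failed_items):
    # Call shape as it appears on the removed side of the hunk.
    return json.dumps({
        "total_before": total,
        "deleted": deleted,
        "failed": len(failed_items),
    }, ensure_ascii=False)


def details_after(total, deleted, failed_items):
    # Call shape as it appears on the added side of the hunk.
    return json.dumps(
        {
            "total_before": total,
            "deleted": deleted,
            "failed": len(failed_items),
        },
        ensure_ascii=False,
    )


# Hypothetical sample values; the real ones come from the deletion job.
sample = (12, 10, ["paper-a", "paper-b"])
old = details_before(*sample)
new = details_after(*sample)
print(old == new)  # both layouts serialize to the same string

# ensure_ascii=False keeps non-ASCII text readable in the stored JSON,
# while the default escapes it as \uXXXX sequences.
readable = json.dumps({"title": "论文"}, ensure_ascii=False)
escaped = json.dumps({"title": "论文"})
print(readable)
print(escaped)
```

Because both helpers pass the same dictionary and the same keyword argument, the value stored in `details_json` should be byte-for-byte identical before and after this hunk.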