Commit Graph

21 Commits

Author SHA1 Message Date
Rain-Bus b42e9149e5 feat: improve PDF extraction with image clustering, find_tables() integration, and JPEG output
- Add subfigure clustering in _find_figure_top(): collect all images near caption, cluster by Y proximity, use largest cluster's min y
- Add _find_figure_horizontal(): determine crop range from caption + embedded image union
- Refactor _find_table_region() to use page.find_tables() as primary method with segment merging, fallback to block-based detection
- Extract _scan_blocks_direction() for bidirectional block scanning with table data density awareness
- Add _TABLE_DATA_GAP_THRESHOLD for denser gap tolerance after table data blocks
- Fix caption regex to use (?-i:[A-Z]) for correct case-insensitive matching
- Switch image output from PNG to JPEG (5-10x smaller for web delivery)
- Update cleanup and filter to handle both .png and .jpg formats
- Reformat imports and conditional expressions in pages.py
2026-06-10 23:17:03 +08:00
Rain-Bus 9aa0102e95 chore: add test artifacts to .gitignore 2026-06-10 02:05:59 +08:00
Rain-Bus a1e0962820 feat: enhance PDF extraction with section-based figure routing and improved caption detection 2026-06-10 02:05:30 +08:00
Rain-Bus c94ff48254 fix: PDF extraction bbox compatibility, update date formats, and bump max retries
- Fix bbox format detection in pdf_image_extractor (support Rect and tuple)
- Update date display format to include year (%Y-%m-%d) across templates
- Increase SUMMARY_MAX_RETRIES from 1 to 2 for better error recovery
- Widen date input field for better usability
2026-06-09 18:30:04 +08:00
Rain-Bus 1fc6303e09 feat: refactor PDF extraction to caption-based screenshots, add upvote refresh, clean up UI
- PDF extractor: rewrite from embedded bitmap extraction to caption-based
  page region screenshots. Finds Figure/Table captions via regex,截取上方/下方
  page region, handles compound figures and vector graphics.
- Upvote refresh: new crawler.refresh_upvotes() re-fetches upvotes for recent
  N days without inserting new papers. Scheduler runs daily 30min after pipeline.
- Admin: add /admin/refresh-upvotes endpoint and dashboard button.
- UI: remove date quick nav, show upvote update time on detail/card pages,
  clean up CSS date-chip styles.
- Utils: add recent_date_strs() helper.
2026-06-09 18:01:01 +08:00
Rain-Bus b72b5a31bb feat: enhance UI styling, add date picker, and clean up inline CSS 2026-06-09 15:12:58 +08:00
Rain-Bus 18f44ac244 feat: improve PDF image extraction with caption-based labeling and fallback matching
- Enhance pdf_image_extractor with caption text extraction near images/tables
- Add figure/table type correction based on caption content
- Implement sequential numbering fallback for unmatched items
- Improve figure linking in pages with manifest ID matching and fallback strategies
- Remove docling dependency, add dev dependency group
2026-06-09 14:07:21 +08:00
Rain-Bus 32978b3fc5 feat: add admin dashboard, pipeline service, lightbox, and update dependencies 2026-06-09 09:32:10 +08:00
Rain-Bus 0d293422ac feat: enhance UI, refactor services, improve templates and tests
- Replace image_extractor with pdf_image_extractor service
- Enhance pi_client with expanded API capabilities
- Improve summarizer service with additional features
- Update admin routes with more endpoints
- Add login page template
- Enhance detail page with comprehensive layout
- Improve search and trends pages
- Update base template with additional elements
- Refactor tests for better coverage
- Add validate_summary script
- Update project configuration and dependencies
2026-06-07 19:38:58 +08:00
Rain-Bus 4a72c35452 chore: update .env.example defaults and add uv.lock 2026-06-06 01:02:47 +08:00
Rain-Bus 4072a05460 chore: update project config, services, templates and styling 2026-06-06 00:49:45 +08:00
Rain-Bus 5d4de50c8b chore: update README and remove deprecated documentation 2026-06-06 00:44:23 +08:00
Rain-Bus 904eec392e feat: overhaul UI styling, improve templates, enhance services and tests 2026-06-06 00:38:56 +08:00
Rain-Bus f7f1a4c0cb refactor: split monolithic phase tests into per-module test files
- rename test_admin_phase4.py -> test_admin.py, test_search.py -> test_searcher.py
- split test_phase5.py into test_cleaner, test_embedder, test_image_extractor, test_pages
- move schema tests from test_summarizer.py into dedicated test_schemas.py
- add sample_papers_range and sample_papers_with_summary fixtures in conftest
- update .gitignore to exclude all of data/
2026-06-06 00:34:30 +08:00
Rain-Bus 85c4cfb9e8 refactor: restructure services and add image/pdf extraction utilities
- Add image_extractor, pdf_downloader, pi_client, trends services
- Add shared utils module
- Refactor summarizer, embedder, routes for cleaner separation
- Update tests to match new service structure
2026-06-06 00:00:55 +08:00
Rain-Bus ba9afa212c feat: add compare, trends routes, embedder service, and phase5 tests 2026-06-05 23:32:06 +08:00
Rain-Bus 2cfd1a8a9f feat: add admin crawl, cleanup, delete, logs endpoints with scheduler and tests
- Add POST /admin/crawl with TaskLock-based reentrancy guard
- Add POST /admin/cleanup (tmp files older than 24h) with CrawlLog
- Add POST /admin/delete with date range and 'DELETE' confirm token
- Add GET /admin/logs (paginated CrawlLog + DataDeleteJob viewer)
- Add app/services/cleaner.py (cleanup_tmp, delete_papers_by_date_range)
- Add app/services/scheduler.py (APScheduler daily crawl/cleanup jobs)
- Wire scheduler startup/shutdown hooks in app/main.py
- Add admin nav link in base.html and APP_HOST security warning
- Add apscheduler>=3.10 dependency
- Add tests/test_admin_phase4.py covering the new endpoints
2026-06-05 23:07:45 +08:00
Rain-Bus 1538d564f6 feat: add search and user data routes, services, and tests 2026-06-05 22:53:27 +08:00
Rain-Bus 29e6797c12 feat: add admin routes, summarizer service, and CLI summarize command
- Add /admin routes for manual trigger and status inspection
- Add summarizer service with batch/single summary support
- Add summarize CLI command (single arxiv_id or batch pending)
- Register admin router in main app
- Add tests for summarizer
2026-06-05 22:29:33 +08:00
Rain-Bus d69df2be10 docs: add README.md 2026-06-05 21:58:06 +08:00
Rain-Bus f1be24ab83 feat: initial project structure
- Add FastAPI app with paper browsing UI and REST API
- Add crawler service and database models
- Add scripts for DB init and manual crawl
- Add docs (api-and-ui, data-model, services)
- Add requirements and project config
2026-06-05 21:56:40 +08:00