feat: initial project structure
- Add FastAPI app with paper browsing UI and REST API
- Add crawler service and database models
- Add scripts for DB init and manual crawl
- Add docs (api-and-ui, data-model, services)
- Add requirements and project config
@@ -0,0 +1,41 @@
# ─── Application ─────────────────────────
APP_HOST=127.0.0.1
APP_PORT=8000
APP_DEBUG=false
BASE_URL=http://127.0.0.1:8000
APP_TIMEZONE=Asia/Shanghai

# ─── Security ────────────────────────────
ADMIN_TOKEN=change-me

# ─── HuggingFace / arXiv ────────────────
HF_API_BASE=https://huggingface.co/api
HF_PROXY=
TOP_N=20
HTTP_TIMEOUT_SECONDS=30
HTTP_MAX_RETRIES=3
HTTP_USER_AGENT=hf-daily-papers-local/0.1

# ─── AI summarization (used in Phase 2) ──
PI_BIN=/home/rainbus/.local/share/mise/installs/pi/latest/pi
SUMMARY_SKILL=daily-paper-summary
SUMMARY_CONCURRENCY=3
SUMMARY_TIMEOUT_SECONDS=300
SUMMARY_MAX_RETRIES=1

# ─── Scheduling (used in Phase 4) ────────
SCHEDULER_ENABLED=false
SCHEDULE_HOUR=8
SCHEDULE_MINUTE=0
APP_WORKERS=1

# ─── Database ────────────────────────────
DATABASE_URL=sqlite:///data/db/papers.db

# ─── Semantic search (Phase 5; strings may stay empty, EMBED_DIMENSIONS is parsed as an int) ─
CHROMA_ENABLED=false
CHROMA_DIR=data/chroma
EMBED_API_BASE=
EMBED_API_KEY=
EMBED_MODEL=
EMBED_DIMENSIONS=0
+15
@@ -0,0 +1,15 @@
.env
__pycache__/
*.pyc
*.pyo
data/db/*.db
data/papers/
data/tmp/
data/chroma/
logs/*.log
.venv/
venv/
*.egg-info/
dist/
build/
.DS_Store
+236
@@ -0,0 +1,236 @@
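# HF Daily Papers: A Chinese-Language Paper Guide

> A local web app that fetches trending papers from HuggingFace Daily Papers every day, generates structured Chinese-language summaries, and supports browsing, search, bookmarks, and administration.

---

## Documentation Index

| Document | Contents |
|------|------|
| [services.md](docs/services.md) | Service modules: crawler, AI summarization, search, cleanup, scheduling, security, and more |
| [data-model.md](docs/data-model.md) | SQLite table schemas, the summary.json schema, indexes, and validation strategy |
| [api-and-ui.md](docs/api-and-ui.md) | Routes, pages, user flows, acceptance criteria |

---

## 1. Product Scope

### Current Goals

Build a locally run paper guide site that:

1. Fetches HuggingFace Daily Papers by date.
2. Extracts the necessary metadata and writes it to SQLite.
3. During the summarization stage, downloads PDFs on demand, calls the pi CLI to generate a structured Chinese summary for each paper, and cleans up the downloaded files afterwards.
4. Displays the home page, per-date lists, paper details, search results, and the reading list.
5. Supports bookmarks, reading status, and personal notes.
6. Provides secured admin endpoints for manual crawling, summarization, cleanup, and log viewing.

### Out of Scope for Now

- No Docker / Docker Compose.
- No automatic archiving.
- Downloaded files are not kept as long-term assets: PDFs and source archives are used only for parsing and summarization and are cleaned up once the pipeline finishes.
- No fallback image extraction from PDFs.
- No multi-user account system.
- No design for public internet exposure; deployment defaults to local or intranet.

---

## 2. Tech Stack

| Layer | Choice | Notes |
|----|------|------|
| Backend framework | FastAPI | Page routes, JSON API, admin endpoints |
| Templates | Jinja2 | Server-side rendered HTML |
| Frontend interaction | HTMX + a small amount of vanilla JS | Bookmarks, status, search, partial refresh |
| Styling | Custom CSS, modeled on the kami style | kami is only a visual and typographic reference; the kami build pipeline is not invoked |
| Database | SQLite + SQLAlchemy | Single file, local, low maintenance |
| Full-text search | SQLite FTS5 | Keyword search over titles, abstracts, summaries, authors, and tags |
| Semantic search | ChromaDB (optional enhancement) | Added after the MVP; vectors are generated by an online embedding service |
| AI summarization | pi CLI | One pi invocation per paper |
| Scheduling | APScheduler | Scheduler embedded in a single process; running multiple workers is disallowed to avoid duplicate runs |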
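
To make the full-text search row concrete, here is a minimal sketch using Python's standard `sqlite3` module. It assumes the bundled SQLite build includes FTS5; the table name `demo_fts` and the two sample rows are invented for illustration and are not the project's real schema or data.

```python
import sqlite3

# Illustrative only: `demo_fts` and the rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE demo_fts USING fts5(title, abstract, tokenize='unicode61')"
)
conn.executemany(
    "INSERT INTO demo_fts(rowid, title, abstract) VALUES (?, ?, ?)",
    [
        (1, "Scaling Diffusion Transformers", "We study diffusion models at scale."),
        (2, "Efficient Attention Kernels", "A faster attention implementation."),
    ],
)


def search(query: str) -> list[int]:
    """Return rowids of matching documents, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM demo_fts WHERE demo_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


print(search("diffusion"))
```

Keeping `rowid` equal to the paper's primary key, as this sketch does, is one simple way to join search hits back to ordinary tables.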

---

## 3. Project Structure

```text
paper/
├── README.md
├── REQUIREMENTS.md
├── docs/
│   ├── services.md
│   ├── data-model.md
│   └── api-and-ui.md
├── .env
├── .env.example
├── pyproject.toml
│
├── app/
│   ├── main.py
│   ├── config.py
│   ├── database.py
│   ├── models.py
│   ├── security.py
│   ├── cli.py
│   │
│   ├── routes/
│   │   ├── pages.py
│   │   ├── api.py
│   │   ├── search.py
│   │   ├── user.py
│   │   └── admin.py
│   │
│   ├── services/
│   │   ├── crawler.py
│   │   ├── summarizer.py
│   │   ├── searcher.py
│   │   ├── cleaner.py
│   │   ├── user_data.py
│   │   └── scheduler.py
│   │
│   ├── templates/
│   │   ├── base.html
│   │   ├── index.html
│   │   ├── detail.html
│   │   ├── search.html
│   │   ├── reading_list.html
│   │   ├── admin_logs.html
│   │   └── partials/
│   │       ├── paper_card.html
│   │       ├── date_nav.html
│   │       └── search_bar.html
│   │
│   └── static/
│       ├── css/style.css
│       └── js/app.js
│
├── data/
│   ├── db/papers.db
│   ├── papers/{arxiv_id}/
│   │   ├── meta.json
│   │   ├── summary.json
│   │   └── raw_output.txt
│   ├── tmp/{arxiv_id}/
│   │   ├── paper.pdf
│   │   └── source/
│   └── chroma/
│
├── logs/
├── tests/
└── scripts/
    ├── init_db.py
    └── manual_crawl.py
```

`data/tmp/` is the temporary file directory. Downloaded files such as PDFs and LaTeX sources are fetched on demand only during the summarization stage and deleted once parsing and summarization finish; the database, `meta.json`, `summary.json`, and `raw_output.txt` can be kept long term.
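
The temporary-file lifecycle described above could be sketched roughly as follows. This is an assumption about how cleanup might be structured, not the project's actual cleaner: the helper name `paper_workdir` and the sample arXiv ID are hypothetical, and a throwaway directory stands in for `data/tmp/`.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def paper_workdir(tmp_root: Path, arxiv_id: str):
    """Yield a per-paper scratch directory and always remove it afterwards."""
    workdir = tmp_root / arxiv_id
    workdir.mkdir(parents=True, exist_ok=True)
    try:
        yield workdir
    finally:
        # Runs on success and on failure, so downloads never outlive the run.
        shutil.rmtree(workdir, ignore_errors=True)


# Hypothetical usage: a throwaway root stands in for data/tmp/.
demo_root = Path(tempfile.mkdtemp())
with paper_workdir(demo_root, "2401.00001") as wd:
    (wd / "paper.pdf").write_bytes(b"%PDF-1.4 placeholder")
print((demo_root / "2401.00001").exists())
```

Putting the removal in a `finally` block means a failed summary cleans up the same way as a successful one.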

---

## 4. Configuration

```bash
# Application
APP_HOST=127.0.0.1
APP_PORT=8000
APP_DEBUG=false
BASE_URL=http://127.0.0.1:8000
APP_TIMEZONE=Asia/Shanghai

# Security
ADMIN_TOKEN=change-me

# HuggingFace / arXiv
HF_API_BASE=https://huggingface.co/api
HF_PROXY=
TOP_N=20
HTTP_TIMEOUT_SECONDS=30
HTTP_MAX_RETRIES=3
HTTP_USER_AGENT=hf-daily-papers-local/0.1

# AI summarization
PI_BIN=/home/rainbus/.local/share/mise/installs/pi/latest/pi
SUMMARY_SKILL=daily-paper-summary
SUMMARY_CONCURRENCY=3
SUMMARY_TIMEOUT_SECONDS=300
SUMMARY_MAX_RETRIES=1

# Scheduling
SCHEDULER_ENABLED=true
SCHEDULE_HOUR=8
SCHEDULE_MINUTE=0
APP_WORKERS=1

# Database
DATABASE_URL=sqlite:///data/db/papers.db

# Semantic search (later enhancement; string values may be left empty, EMBED_DIMENSIONS is parsed as an int)
CHROMA_ENABLED=false
CHROMA_DIR=data/chroma
EMBED_API_BASE=
EMBED_API_KEY=
EMBED_MODEL=
EMBED_DIMENSIONS=0
```

---

## 5. Milestones

### Phase 1: MVP (crawl, store, browse)

- [ ] FastAPI + SQLite + SQLAlchemy project skeleton.
- [ ] Data tables, the FTS5 table, and basic migrations or an initialization script.
- [ ] HF Daily Papers crawling: supports dates, TOP_N, deduplication, retries, and empty dates.
- [ ] The crawl stage stores metadata only; PDFs are not kept long term.
- [ ] Home page `/day/{date}` and paper detail page `/paper/{arxiv_id}`.
- [ ] CLI: manually crawl a given date.

### Phase 2: AI Summarization

- [ ] pi CLI integration: one invocation per paper.
- [ ] Download PDFs on demand during summarization and clean up temporary files whether the run succeeds or fails.
- [ ] summary.json schema validation, degraded display, and retry on failure.
- [ ] Summary status tracking.
- [ ] Save raw_output.txt and allow re-runs from the admin panel.
- [ ] After summarization completes, update `papers`, `paper_summaries`, and FTS5.
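
As a rough illustration of "one invocation per paper" with bounded concurrency (compare `SUMMARY_CONCURRENCY=3`), the sketch below runs summaries behind a semaphore and records a per-paper outcome so one failure does not stop the others. `summarize_one` is a hypothetical stand-in for the real pi CLI call, and the `"done"` / `"failed"` labels are illustrative rather than the project's status values.

```python
import asyncio


async def summarize_one(arxiv_id: str) -> str:
    """Hypothetical stand-in for one pi CLI invocation on one paper."""
    await asyncio.sleep(0)
    if arxiv_id.endswith("bad"):
        raise RuntimeError(f"summary failed for {arxiv_id}")
    return f"summary of {arxiv_id}"


async def summarize_all(arxiv_ids: list[str], concurrency: int = 3) -> dict[str, str]:
    """Run summaries with bounded concurrency; return an outcome per paper."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(arxiv_id: str) -> str:
        async with semaphore:
            return await summarize_one(arxiv_id)

    # return_exceptions=True keeps one failure from cancelling the rest.
    results = await asyncio.gather(
        *(guarded(a) for a in arxiv_ids), return_exceptions=True
    )
    return {
        arxiv_id: "failed" if isinstance(result, Exception) else "done"
        for arxiv_id, result in zip(arxiv_ids, results)
    }


statuses = asyncio.run(
    summarize_all(["2401.00001", "2401.00002-bad", "2401.00003"])
)
print(statuses)
```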

### Phase 3: Search and Personalization

- [ ] FTS5 keyword search.
- [ ] Bookmarks, reading status, personal notes.
- [ ] Reading list page.
- [ ] RSS feed.

### Phase 4: Administration and Automation

- [ ] Daily automatic crawl and summarization via APScheduler.
- [ ] Token authentication for admin endpoints.
- [ ] Admin log viewer.
- [ ] Manual deletion of data within a given date range.
- [ ] Temporary file cleanup task.
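
### Phase 5: Later Enhancements

- [ ] ChromaDB semantic search.
- [ ] Similar paper recommendations.
- [ ] Trends dashboard.
- [ ] Paper comparison.
- [ ] LaTeX figure extraction.

---

## 6. Core Acceptance Criteria

1. Crawling the same day repeatedly does not create duplicate rows.
2. Failed HuggingFace or arXiv requests are covered by timeouts, retries, and logging.
3. A failed summary for one paper does not block other papers.
4. The home page can display four states: not summarized, summarizing, summary failed, and summary complete.
5. When no summary exists, the detail page shows the English title, abstract, authors, links, and a manual summarization entry point.
6. Search matches at least titles, abstracts, authors, tags, and the body of Chinese summaries.
7. Without a token, admin endpoints cannot trigger write operations such as crawling, summarization, or deletion.
8. Temporary PDF and source files are cleaned up after the pipeline finishes.
9. After manually deleting a date range, pages, the search index, user data, and local files stay consistent.
10. With a single worker, the scheduler triggers the daily task only once.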
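
Criterion 10 can be backed by the partial unique index on `task_locks` that this commit creates in `app/models.py`. The sketch below reproduces that index on a cut-down table with the standard `sqlite3` module; the helper names `try_acquire` and `release` and the `'released'` status value are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task_locks (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        task TEXT NOT NULL,
        lock_key TEXT NOT NULL,
        status TEXT NOT NULL
    );
    CREATE UNIQUE INDEX uq_task_locks_running
        ON task_locks(task, lock_key) WHERE status = 'running';
    """
)


def try_acquire(task: str, lock_key: str) -> bool:
    """Insert a 'running' row; return False if one already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO task_locks(task, lock_key, status) "
                "VALUES (?, ?, 'running')",
                (task, lock_key),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def release(task: str, lock_key: str) -> None:
    """Mark the running row as released so the key can be taken again."""
    with conn:
        conn.execute(
            "UPDATE task_locks SET status = 'released' "
            "WHERE task = ? AND lock_key = ? AND status = 'running'",
            (task, lock_key),
        )
```

Because the index only covers rows whose status is `'running'`, finished rows can accumulate as history without blocking the next run.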
+66
@@ -0,0 +1,66 @@
"""CLI tool for manually crawling papers."""

import asyncio
import os
from datetime import date
from typing import Optional

import typer
from dotenv import load_dotenv

# Load .env before importing any app module
load_dotenv()

cli_app = typer.Typer(help="HF Daily Papers management CLI")


@cli_app.command()
def crawl(
    date_str: Optional[str] = typer.Argument(
        None,
        help="Date to crawl (YYYY-MM-DD); defaults to today",
    ),
    top_n: Optional[int] = typer.Option(None, "--top", "-n", help="Take the top N papers"),
):
    """Manually crawl HuggingFace Daily Papers for a given date."""
    from app.config import settings
    from app.database import SessionLocal, engine
    from app.models import init_db as _init
    from app.services.crawler import crawl_daily

    target = date_str or date.today().isoformat()

    # Make sure the database directory and tables exist
    os.makedirs(settings.db_path.parent, exist_ok=True)
    _init(engine)
    typer.echo(f"📡 Crawling {target} ...")

    db = SessionLocal()
    try:
        result = asyncio.run(crawl_daily(db, target, top_n))
        if result["status"] == "success":
            typer.echo(
                f"✅ Crawl finished: found {result['found']} papers, {result['new']} new"
            )
        else:
            typer.echo(f"❌ Crawl failed: {result['error']}", err=True)
            raise typer.Exit(code=1)
    finally:
        db.close()


@cli_app.command()
def init_db():
    """Initialize the database tables."""
    from app.config import settings
    from app.database import engine
    from app.models import init_db as _init

    os.makedirs(settings.db_path.parent, exist_ok=True)
    _init(engine)
    typer.echo(f"✅ Database initialized: {settings.db_path}")


if __name__ == "__main__":
    cli_app()
@@ -0,0 +1,73 @@
"""Application configuration, loaded from .env / environment variables."""

from pathlib import Path

from pydantic_settings import BaseSettings

BASE_DIR = Path(__file__).resolve().parent.parent


class Settings(BaseSettings):
    # Application
    APP_HOST: str = "127.0.0.1"
    APP_PORT: int = 8000
    APP_DEBUG: bool = False
    BASE_URL: str = "http://127.0.0.1:8000"
    APP_TIMEZONE: str = "Asia/Shanghai"

    # Security
    ADMIN_TOKEN: str = "change-me"

    # HuggingFace / arXiv
    HF_API_BASE: str = "https://huggingface.co/api"
    HF_PROXY: str = ""
    TOP_N: int = 20
    HTTP_TIMEOUT_SECONDS: int = 30
    HTTP_MAX_RETRIES: int = 3
    HTTP_USER_AGENT: str = "hf-daily-papers-local/0.1"

    # AI summarization (Phase 2)
    PI_BIN: str = ""
    SUMMARY_SKILL: str = "daily-paper-summary"
    SUMMARY_CONCURRENCY: int = 3
    SUMMARY_TIMEOUT_SECONDS: int = 300
    SUMMARY_MAX_RETRIES: int = 1

    # Scheduling (Phase 4)
    SCHEDULER_ENABLED: bool = False
    SCHEDULE_HOUR: int = 8
    SCHEDULE_MINUTE: int = 0
    APP_WORKERS: int = 1

    # Database
    DATABASE_URL: str = "sqlite:///data/db/papers.db"

    # Semantic search (Phase 5)
    CHROMA_ENABLED: bool = False
    CHROMA_DIR: str = "data/chroma"
    EMBED_API_BASE: str = ""
    EMBED_API_KEY: str = ""
    EMBED_MODEL: str = ""
    EMBED_DIMENSIONS: int = 0

    model_config = {
        "env_file": str(BASE_DIR / ".env"),
        "env_file_encoding": "utf-8",
        "extra": "ignore",
    }

    @property
    def db_path(self) -> Path:
        """Resolve the SQLite file path from DATABASE_URL."""
        # sqlite:///data/db/papers.db → data/db/papers.db
        url = self.DATABASE_URL
        if url.startswith("sqlite:///"):
            return BASE_DIR / url[len("sqlite:///"):]
        raise ValueError(f"Unsupported DATABASE_URL: {url}")

    @property
    def http_proxy(self) -> str | None:
        return self.HF_PROXY or None


settings = Settings()
@@ -0,0 +1,41 @@
"""Database engine, session factory, and initialization."""

from sqlalchemy import event, create_engine
from sqlalchemy.orm import DeclarativeBase, sessionmaker

from app.config import settings


class Base(DeclarativeBase):
    pass


def _make_engine():
    """Create the SQLite engine with foreign_keys enabled."""
    engine = create_engine(
        settings.DATABASE_URL,
        echo=settings.APP_DEBUG,
        connect_args={"check_same_thread": False},
    )

    @event.listens_for(engine, "connect")
    def _set_sqlite_pragma(dbapi_connection, _connection_record):
        cursor = dbapi_connection.cursor()
        cursor.execute("PRAGMA foreign_keys=ON")
        cursor.execute("PRAGMA journal_mode=WAL")
        cursor.close()

    return engine


engine = _make_engine()
SessionLocal = sessionmaker(bind=engine, autoflush=False, autocommit=False)


def get_db():
    """FastAPI dependency: yield a database session and close it afterwards."""
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()
+59
@@ -0,0 +1,59 @@
"""FastAPI application entry point."""

import logging
import os

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

from app.config import settings
from app.database import engine
from app.models import init_db
from app.routes.pages import router as pages_router

logging.basicConfig(
    level=logging.DEBUG if settings.APP_DEBUG else logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)


def create_app() -> FastAPI:
    app = FastAPI(
        title="HF Daily Papers",
        description="HuggingFace Daily Papers: a Chinese-language paper guide",
        version="0.1.0",
    )

    # Make sure the data directory exists
    os.makedirs(settings.db_path.parent, exist_ok=True)

    # Initialize the database
    init_db(engine)
    logger.info("Database initialized at %s", settings.db_path)

    # Security warning
    if settings.ADMIN_TOKEN == "change-me":
        logger.warning("⚠️ ADMIN_TOKEN is the default value 'change-me'. Please change it in .env!")

    # Static files
    app.mount("/static", StaticFiles(directory="app/static"), name="static")

    # Routes
    app.include_router(pages_router)

    return app


app = create_app()


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "app.main:app",
        host=settings.APP_HOST,
        port=settings.APP_PORT,
        reload=settings.APP_DEBUG,
    )
+235
@@ -0,0 +1,235 @@
"""SQLAlchemy ORM models: papers, authors, tags, summaries, FTS5, logs, locks, user data."""

from datetime import date, datetime

from sqlalchemy import (
    Boolean,
    Column,
    Date,
    DateTime,
    ForeignKey,
    Index,
    Integer,
    String,
    Text,
    UniqueConstraint,
    text,
)
from sqlalchemy.orm import relationship

from app.database import Base


# ── papers ──────────────────────────────────────────────────────────────
class Paper(Base):
    __tablename__ = "papers"

    id = Column(Integer, primary_key=True, autoincrement=True)
    arxiv_id = Column(String, unique=True, nullable=False, index=True)
    title_en = Column(String, nullable=False)
    title_zh = Column(String)
    abstract = Column(Text)
    published_at = Column(Date)
    paper_date = Column(Date, nullable=False, index=True)
    crawled_at = Column(DateTime, nullable=False)
    upvotes = Column(Integer, default=0)
    hf_url = Column(String)
    arxiv_url = Column(String)
    pdf_url = Column(String)
    source_url = Column(String)
    asset_status = Column(String, default="not_downloaded")
    asset_error = Column(String)
    meta_path = Column(String)
    summary_path = Column(String)
    raw_output_path = Column(String)
    summary_quality = Column(String)

    authors = relationship("PaperAuthor", back_populates="paper", cascade="all, delete-orphan")
    tags = relationship("PaperTag", back_populates="paper", cascade="all, delete-orphan")
    summary = relationship("PaperSummary", back_populates="paper", uselist=False, cascade="all, delete-orphan")
    summary_status = relationship("SummaryStatus", back_populates="paper", uselist=False, cascade="all, delete-orphan")
    bookmark = relationship("UserBookmark", back_populates="paper", uselist=False, cascade="all, delete-orphan")
    reading_status = relationship("UserReadingStatus", back_populates="paper", uselist=False, cascade="all, delete-orphan")
    note = relationship("UserNote", back_populates="paper", uselist=False, cascade="all, delete-orphan")


# ── paper_authors ───────────────────────────────────────────────────────
class PaperAuthor(Base):
    __tablename__ = "paper_authors"
    __table_args__ = (UniqueConstraint("paper_id", "name"),)

    id = Column(Integer, primary_key=True, autoincrement=True)
    paper_id = Column(Integer, ForeignKey("papers.id", ondelete="CASCADE"), nullable=False)
    name = Column(String, nullable=False)
    position = Column(Integer, default=0)

    paper = relationship("Paper", back_populates="authors")


# ── paper_tags ──────────────────────────────────────────────────────────
class PaperTag(Base):
    __tablename__ = "paper_tags"
    __table_args__ = (UniqueConstraint("paper_id", "tag", "source"),)

    id = Column(Integer, primary_key=True, autoincrement=True)
    paper_id = Column(Integer, ForeignKey("papers.id", ondelete="CASCADE"), nullable=False)
    tag = Column(String, nullable=False)
    source = Column(String, default="hf")

    paper = relationship("Paper", back_populates="tags")


# ── paper_summaries ─────────────────────────────────────────────────────
class PaperSummary(Base):
    __tablename__ = "paper_summaries"

    paper_id = Column(Integer, ForeignKey("papers.id", ondelete="CASCADE"), primary_key=True)
    one_line = Column(Text)
    difficulty = Column(String)
    prerequisites_json = Column(Text)
    motivation_problem = Column(Text)
    motivation_goal = Column(Text)
    motivation_gap = Column(Text)
    method_overview = Column(Text)
    method_key_idea = Column(Text)
    method_steps_json = Column(Text)
    method_novelty = Column(Text)
    results_main_json = Column(Text)
    results_benchmarks_json = Column(Text)
    limitations_json = Column(Text)
    weaknesses_json = Column(Text)
    future_work_json = Column(Text)
    reproducibility = Column(String)
    full_json = Column(Text, nullable=False)
    updated_at = Column(DateTime, nullable=False)

    paper = relationship("Paper", back_populates="summary")


# ── summary_status ──────────────────────────────────────────────────────
class SummaryStatus(Base):
    __tablename__ = "summary_status"
    __table_args__ = (UniqueConstraint("paper_id"),)

    id = Column(Integer, primary_key=True, autoincrement=True)
    paper_id = Column(Integer, ForeignKey("papers.id", ondelete="CASCADE"), nullable=False)
    status = Column(String, nullable=False, default="pending")
    quality = Column(String)
    error_type = Column(String)
    error = Column(Text)
    retry_count = Column(Integer, default=0)
    raw_output_saved = Column(Boolean, default=False)
    started_at = Column(DateTime)
    completed_at = Column(DateTime)

    paper = relationship("Paper", back_populates="summary_status")


# ── crawl_logs ──────────────────────────────────────────────────────────
class CrawlLog(Base):
    __tablename__ = "crawl_logs"

    id = Column(Integer, primary_key=True, autoincrement=True)
    task = Column(String, nullable=False)
    status = Column(String, nullable=False)
    date = Column(Date)
    papers_found = Column(Integer)
    papers_new = Column(Integer)
    error = Column(Text)
    started_at = Column(DateTime, nullable=False)
    completed_at = Column(DateTime)


# ── task_locks ──────────────────────────────────────────────────────────
class TaskLock(Base):
    __tablename__ = "task_locks"

    id = Column(Integer, primary_key=True, autoincrement=True)
    task = Column(String, nullable=False)
    lock_key = Column(String, nullable=False)
    status = Column(String, nullable=False)
    owner = Column(String)
    acquired_at = Column(DateTime, nullable=False)
    released_at = Column(DateTime)


# ── user data ──────────────────────────────────────────────────────────
class UserBookmark(Base):
    __tablename__ = "user_bookmarks"
    __table_args__ = (UniqueConstraint("paper_id"),)

    id = Column(Integer, primary_key=True, autoincrement=True)
    paper_id = Column(Integer, ForeignKey("papers.id", ondelete="CASCADE"), nullable=False)
    note = Column(Text)
    created_at = Column(DateTime, nullable=False)

    paper = relationship("Paper", back_populates="bookmark")


class UserReadingStatus(Base):
    __tablename__ = "user_reading_status"
    __table_args__ = (UniqueConstraint("paper_id"),)

    id = Column(Integer, primary_key=True, autoincrement=True)
    paper_id = Column(Integer, ForeignKey("papers.id", ondelete="CASCADE"), nullable=False)
    status = Column(String, nullable=False, default="unread")
    updated_at = Column(DateTime, nullable=False)

    paper = relationship("Paper", back_populates="reading_status")


class UserNote(Base):
    __tablename__ = "user_notes"
    __table_args__ = (UniqueConstraint("paper_id"),)

    id = Column(Integer, primary_key=True, autoincrement=True)
    paper_id = Column(Integer, ForeignKey("papers.id", ondelete="CASCADE"), nullable=False)
    content = Column(Text, nullable=False)
    created_at = Column(DateTime, nullable=False)
    updated_at = Column(DateTime, nullable=False)

    paper = relationship("Paper", back_populates="note")


# ── data_delete_jobs ───────────────────────────────────────────────────
class DataDeleteJob(Base):
    __tablename__ = "data_delete_jobs"

    id = Column(Integer, primary_key=True, autoincrement=True)
    date_start = Column(Date, nullable=False)
    date_end = Column(Date, nullable=False)
    include_notes = Column(Boolean, default=True)
    paper_count = Column(Integer, default=0)
    status = Column(String, nullable=False)
    error = Column(Text)
    started_at = Column(DateTime, nullable=False)
    completed_at = Column(DateTime)


# ── FTS5 index initialization SQL (plain virtual table, maintained by the application layer) ──
FTS5_CREATE_SQL = """
CREATE VIRTUAL TABLE IF NOT EXISTS papers_fts USING fts5(
    title_en,
    title_zh,
    abstract,
    authors,
    tags,
    summary_text,
    tokenize='unicode61'
);
"""

# Despite living next to the FTS5 SQL, this statement is unrelated to FTS5:
# it creates a partial unique index so only one 'running' lock exists per key.
TASK_LOCK_INDEX_SQL = """
-- partial index for task_locks running
CREATE UNIQUE INDEX IF NOT EXISTS uq_task_locks_running
ON task_locks(task, lock_key) WHERE status = 'running';
"""


def init_db(engine):
    """Create all ORM tables, the FTS5 virtual table, and the task-lock index."""
    Base.metadata.create_all(engine)
    with engine.connect() as conn:
        conn.execute(text(FTS5_CREATE_SQL))
        conn.execute(text(TASK_LOCK_INDEX_SQL))
        conn.commit()
@@ -0,0 +1,109 @@
"""Page routes: home page, per-date page, paper detail."""

from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

from fastapi import APIRouter, Depends, HTTPException, Request
from fastapi.responses import RedirectResponse
from fastapi.templating import Jinja2Templates
from sqlalchemy.orm import Session, joinedload

from app.config import settings
from app.database import get_db
from app.models import Paper

router = APIRouter()
templates = Jinja2Templates(directory="app/templates")


def _today() -> str:
    tz = ZoneInfo(settings.APP_TIMEZONE)
    return datetime.now(tz).strftime("%Y-%m-%d")


@router.get("/")
def index(request: Request):
    """Redirect to /day/{today}."""
    return RedirectResponse(url=f"/day/{_today()}")


@router.get("/day/{date_str}")
def day_page(date_str: str, request: Request, db: Session = Depends(get_db)):
    """Paper list for a given date."""
    try:
        target = date.fromisoformat(date_str)
    except ValueError:
        raise HTTPException(status_code=404, detail="Invalid date format")

    prev_day = (target - timedelta(days=1)).isoformat()
    next_day = (target + timedelta(days=1)).isoformat()
    today_str = _today()

    papers = (
        db.query(Paper)
        # Compare against the parsed date object: SQLite's Date type rejects plain strings
        .filter(Paper.paper_date == target)
        .options(
            joinedload(Paper.authors),
            joinedload(Paper.tags),
            joinedload(Paper.summary_status),
            joinedload(Paper.bookmark),
        )
        .order_by(Paper.upvotes.desc())
        .all()
    )

    dates_raw = (
        db.query(Paper.paper_date)
        .distinct()
        .order_by(Paper.paper_date.desc())
        .limit(30)
        .all()
    )
    available_dates = [d[0].isoformat() if isinstance(d[0], date) else str(d[0]) for d in dates_raw]

    return templates.TemplateResponse(
        request, "index.html",
        {
            "papers": papers,
            "current_date": date_str,
            "prev_day": prev_day,
            "next_day": next_day,
            "today": today_str,
            "available_dates": available_dates,
            "page_title": f"Papers for {date_str}",
        },
    )


@router.get("/paper/{arxiv_id}")
def paper_detail(arxiv_id: str, request: Request, db: Session = Depends(get_db)):
    """Paper detail page."""
    paper = (
        db.query(Paper)
        .filter(Paper.arxiv_id == arxiv_id)
        .options(
            joinedload(Paper.authors),
            joinedload(Paper.tags),
            joinedload(Paper.summary),
            joinedload(Paper.summary_status),
            joinedload(Paper.bookmark),
            joinedload(Paper.reading_status),
            joinedload(Paper.note),
        )
        .first()
    )
    if not paper:
        raise HTTPException(status_code=404, detail="Paper not found")

    summary_state = "none"
    if paper.summary_status:
        summary_state = paper.summary_status.status

    return templates.TemplateResponse(
        request, "detail.html",
        {
            "paper": paper,
            "summary_state": summary_state,
            "page_title": paper.title_zh or paper.title_en,
        },
    )
@@ -0,0 +1,182 @@
|
||||
"""爬虫服务 — 从 HuggingFace Daily Papers API 抓取论文元数据。"""
|
||||
|
||||
import logging
|
||||
from datetime import date as date_type
|
||||
from datetime import datetime, timezone
|
||||
|
||||
import httpx
|
||||
from sqlalchemy import select, text
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from app.config import settings
|
||||
from app.models import (
|
||||
CrawlLog,
|
||||
Paper,
|
||||
PaperAuthor,
|
||||
PaperTag,
|
||||
SummaryStatus,
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
async def fetch_daily(target_date: str, top_n: int | None = None) -> list[dict]:
|
||||
"""从 HF Daily Papers API 获取指定日期的论文列表。
|
||||
|
||||
Args:
|
||||
target_date: YYYY-MM-DD 格式
|
||||
top_n: 取前 N 篇,默认使用 settings.TOP_N
|
||||
|
||||
Returns:
|
||||
论文元数据列表
|
||||
"""
|
||||
top_n = top_n or settings.TOP_N
|
||||
url = f"{settings.HF_API_BASE}/daily_papers"
|
||||
params = {"date": target_date}
|
||||
|
||||
transport = None
|
||||
if settings.http_proxy:
|
||||
transport = httpx.AsyncHTTPTransport(proxy=settings.http_proxy)
|
||||
|
||||
async with httpx.AsyncClient(
|
||||
timeout=settings.HTTP_TIMEOUT_SECONDS,
|
||||
headers={"User-Agent": settings.HTTP_USER_AGENT},
|
||||
transport=transport,
|
||||
) as client:
|
||||
for attempt in range(1, settings.HTTP_MAX_RETRIES + 1):
|
||||
try:
|
||||
logger.info("Fetching HF Daily Papers: date=%s attempt=%d", target_date, attempt)
|
||||
resp = await client.get(url, params=params)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
break
|
||||
except (httpx.HTTPError, httpx.HTTPStatusError) as exc:
|
||||
logger.warning("Fetch failed (attempt %d/%d): %s", attempt, settings.HTTP_MAX_RETRIES, exc)
|
||||
if attempt == settings.HTTP_MAX_RETRIES:
|
||||
raise
|
||||
else:
|
||||
data = []
|
||||
|
||||
papers = data[:top_n]
|
||||
logger.info("Fetched %d papers for %s (raw=%d)", len(papers), target_date, len(data))
|
||||
return papers
|
||||
|
||||
|
||||
def _parse_paper(item: dict) -> dict:
|
||||
"""从 HF API 响应中提取论文元数据。"""
|
||||
paper_info = item.get("paper", item)
|
||||
arxiv_id = paper_info.get("id", "")
|
||||
published_raw = paper_info.get("publishedAt", "")
|
||||
published_at = None
|
||||
if published_raw:
|
||||
try:
|
||||
published_at = date_type.fromisoformat(published_raw[:10])
|
||||
except ValueError:
|
||||
pass
|
||||
return {
|
||||
"arxiv_id": arxiv_id,
|
||||
"title_en": paper_info.get("title", ""),
|
||||
"abstract": paper_info.get("abstract", ""),
|
||||
"published_at": published_at,
|
||||
"upvotes": item.get("paper", {}).get("upvotes", 0) or item.get("upvotes", 0),
|
||||
"hf_url": f"https://huggingface.co/papers/{arxiv_id}" if arxiv_id else "",
|
||||
"arxiv_url": f"https://arxiv.org/abs/{arxiv_id}" if arxiv_id else "",
|
||||
"pdf_url": f"https://arxiv.org/pdf/{arxiv_id}.pdf" if arxiv_id else "",
|
||||
"authors": [a.get("name", a) if isinstance(a, dict) else a for a in paper_info.get("authors", [])],
|
||||
"tags": [t.get("name", t) if isinstance(t, dict) else t for t in (paper_info.get("tags") or [])],
|
||||
}
|
||||
|
||||
|
||||
def upsert_papers(db: Session, papers_raw: list[dict], paper_date: str) -> list[Paper]:
    """Write paper metadata to the database.

    Existing papers only get their mutable fields (upvotes, crawled_at) updated;
    they are never inserted twice. Returns the newly inserted papers.
    """
    now = datetime.now(timezone.utc)
    paper_date_obj = date_type.fromisoformat(paper_date)
    new_papers: list[Paper] = []

    for item in papers_raw:
        meta = _parse_paper(item)
        arxiv_id = meta["arxiv_id"]
        if not arxiv_id:
            continue

        existing = db.execute(
            select(Paper).where(Paper.arxiv_id == arxiv_id)
        ).scalar_one_or_none()

        if existing:
            existing.upvotes = meta["upvotes"]
            existing.crawled_at = now
            logger.debug("Updated existing paper: %s", arxiv_id)
        else:
            paper = Paper(
                arxiv_id=arxiv_id,
                title_en=meta["title_en"],
                abstract=meta["abstract"],
                published_at=meta["published_at"],
                paper_date=paper_date_obj,
                crawled_at=now,
                upvotes=meta["upvotes"],
                hf_url=meta["hf_url"],
                arxiv_url=meta["arxiv_url"],
                pdf_url=meta["pdf_url"],
            )
            db.add(paper)
            db.flush()  # assigns paper.id, which the child rows below need

            for idx, name in enumerate(meta["authors"]):
                if name:
                    db.add(PaperAuthor(paper_id=paper.id, name=name, position=idx))

            for tag_name in meta["tags"]:
                if tag_name:
                    db.add(PaperTag(paper_id=paper.id, tag=tag_name, source="hf"))

            db.add(SummaryStatus(paper_id=paper.id, status="pending"))

            # Skip empty names/tags here too, matching the row inserts above.
            authors_text = ", ".join(name for name in meta["authors"] if name)
            tags_text = ", ".join(tag for tag in meta["tags"] if tag)
            db.execute(
                text(
                    "INSERT INTO papers_fts(rowid, title_en, abstract, authors, tags) "
                    "VALUES (:id, :title, :abstract, :authors, :tags)"
                ),
                {"id": paper.id, "title": meta["title_en"], "abstract": meta["abstract"] or "",
                 "authors": authors_text, "tags": tags_text},
            )

            new_papers.append(paper)
            logger.debug("Inserted new paper: %s", arxiv_id)

    db.commit()
    logger.info("Upserted %d papers (%d new) for %s", len(papers_raw), len(new_papers), paper_date)
    return new_papers
|
||||
|
||||
|
||||
async def crawl_daily(db: Session, target_date: str, top_n: int | None = None) -> dict:
    """Run the full crawl pipeline: fetch, persist, and write a log entry."""
    now = datetime.now(timezone.utc)
    log_entry = CrawlLog(
        task="crawl",
        status="running",
        date=date_type.fromisoformat(target_date),
        started_at=now,
    )
    db.add(log_entry)
    db.commit()

    try:
        raw_papers = await fetch_daily(target_date, top_n)
        new_papers = upsert_papers(db, raw_papers, target_date)
        log_entry.status = "success"
        log_entry.papers_found = len(raw_papers)
        log_entry.papers_new = len(new_papers)
        log_entry.completed_at = datetime.now(timezone.utc)
        db.commit()
        return {"found": len(raw_papers), "new": len(new_papers), "status": "success", "error": None}
    except Exception as exc:
        logger.exception("Crawl failed for %s", target_date)
        # Discard any half-written rows so the session can still record the failure.
        db.rollback()
        log_entry.status = "failed"
        log_entry.error = str(exc)
        log_entry.completed_at = datetime.now(timezone.utc)
        db.commit()
        return {"found": 0, "new": 0, "status": "failed", "error": str(exc)}
|
||||
@@ -0,0 +1,338 @@
|
||||
/* ── Style reference "kami": paper texture, generous whitespace, ink-blue accent ── */
|
||||
:root {
|
||||
--bg: #faf8f5;
|
||||
--surface: #ffffff;
|
||||
--ink: #1a1a2e;
|
||||
--ink-light: #4a4a6a;
|
||||
--accent: #2d5f8a;
|
||||
--accent-hover: #1d4a6f;
|
||||
--border: #e8e4df;
|
||||
--shadow: rgba(0, 0, 0, 0.06);
|
||||
--radius: 8px;
|
||||
--font-body: "Noto Serif SC", "Georgia", serif;
|
||||
--font-sans: "Inter", "Noto Sans SC", system-ui, sans-serif;
|
||||
--max-width: 960px;
|
||||
}
|
||||
|
||||
*, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
|
||||
|
||||
body {
|
||||
font-family: var(--font-sans);
|
||||
background: var(--bg);
|
||||
color: var(--ink);
|
||||
line-height: 1.7;
|
||||
-webkit-font-smoothing: antialiased;
|
||||
}
|
||||
|
||||
a { color: var(--accent); text-decoration: none; }
|
||||
a:hover { color: var(--accent-hover); text-decoration: underline; }
|
||||
|
||||
/* ── Header ─────────────────────────────────────────────────────── */
|
||||
.site-header {
|
||||
background: var(--surface);
|
||||
border-bottom: 1px solid var(--border);
|
||||
position: sticky;
|
||||
top: 0;
|
||||
z-index: 100;
|
||||
}
|
||||
|
||||
.nav-bar {
|
||||
max-width: var(--max-width);
|
||||
margin: 0 auto;
|
||||
padding: 12px 24px;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 24px;
|
||||
}
|
||||
|
||||
.nav-brand {
|
||||
font-family: var(--font-body);
|
||||
font-size: 1.2rem;
|
||||
font-weight: 700;
|
||||
color: var(--ink);
|
||||
}
|
||||
|
||||
.nav-links { display: flex; gap: 16px; margin-left: auto; }
|
||||
.nav-links a { font-size: 0.9rem; color: var(--ink-light); }
|
||||
.nav-links a:hover { color: var(--accent); }
|
||||
|
||||
/* ── Container ──────────────────────────────────────────────────── */
|
||||
.container {
|
||||
max-width: var(--max-width);
|
||||
margin: 0 auto;
|
||||
padding: 24px;
|
||||
}
|
||||
|
||||
/* ── Date Navigation ────────────────────────────────────────────── */
|
||||
.date-nav {
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 16px;
|
||||
margin-bottom: 24px;
|
||||
flex-wrap: wrap;
|
||||
}
|
||||
|
||||
.date-title {
|
||||
font-family: var(--font-body);
|
||||
font-size: 1.5rem;
|
||||
font-weight: 700;
|
||||
}
|
||||
|
||||
.date-nav-btn {
|
||||
display: inline-block;
|
||||
padding: 6px 14px;
|
||||
background: var(--surface);
|
||||
border: 1px solid var(--border);
|
||||
border-radius: var(--radius);
|
||||
font-size: 0.85rem;
|
||||
color: var(--ink-light);
|
||||
transition: all 0.2s;
|
||||
}
|
||||
.date-nav-btn:hover { border-color: var(--accent); color: var(--accent); text-decoration: none; }
|
||||
|
||||
/* ── Date Chips ─────────────────────────────────────────────────── */
|
||||
.date-quick-nav {
|
||||
margin-top: 32px;
|
||||
padding-top: 16px;
|
||||
border-top: 1px solid var(--border);
|
||||
font-size: 0.85rem;
|
||||
color: var(--ink-light);
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 8px;
|
||||
flex-wrap: wrap;
|
||||
}
|
||||
|
||||
.date-chip {
|
||||
padding: 4px 10px;
|
||||
background: var(--surface);
|
||||
border: 1px solid var(--border);
|
||||
border-radius: 4px;
|
||||
font-size: 0.8rem;
|
||||
color: var(--ink-light);
|
||||
}
|
||||
.date-chip:hover { border-color: var(--accent); color: var(--accent); text-decoration: none; }
|
||||
.date-chip.active { background: var(--accent); color: #fff; border-color: var(--accent); }
|
||||
|
||||
/* ── Paper Card ─────────────────────────────────────────────────── */
|
||||
.paper-list { display: flex; flex-direction: column; gap: 16px; }
|
||||
|
||||
.paper-card {
|
||||
background: var(--surface);
|
||||
border: 1px solid var(--border);
|
||||
border-radius: var(--radius);
|
||||
padding: 20px 24px;
|
||||
transition: box-shadow 0.2s;
|
||||
}
|
||||
.paper-card:hover { box-shadow: 0 2px 12px var(--shadow); }
|
||||
|
||||
.paper-card-header {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: flex-start;
|
||||
gap: 12px;
|
||||
}
|
||||
|
||||
.paper-title {
|
||||
font-family: var(--font-body);
|
||||
font-size: 1.1rem;
|
||||
font-weight: 600;
|
||||
line-height: 1.5;
|
||||
flex: 1;
|
||||
}
|
||||
.paper-title a { color: var(--ink); }
|
||||
.paper-title a:hover { color: var(--accent); }
|
||||
|
||||
.paper-upvotes {
|
||||
font-size: 0.85rem;
|
||||
color: var(--ink-light);
|
||||
white-space: nowrap;
|
||||
}
|
||||
|
||||
.paper-one-line, .paper-abstract-preview {
|
||||
margin-top: 8px;
|
||||
color: var(--ink-light);
|
||||
font-size: 0.92rem;
|
||||
line-height: 1.6;
|
||||
}
|
||||
|
||||
.paper-meta {
|
||||
margin-top: 8px;
|
||||
font-size: 0.82rem;
|
||||
color: var(--ink-light);
|
||||
}
|
||||
|
||||
.paper-tags {
|
||||
margin-top: 8px;
|
||||
display: flex;
|
||||
gap: 6px;
|
||||
flex-wrap: wrap;
|
||||
}
|
||||
|
||||
.tag {
|
||||
display: inline-block;
|
||||
padding: 2px 8px;
|
||||
background: #eef3f8;
|
||||
color: var(--accent);
|
||||
border-radius: 3px;
|
||||
font-size: 0.75rem;
|
||||
font-weight: 500;
|
||||
}
|
||||
|
||||
.paper-footer {
|
||||
margin-top: 12px;
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
}
|
||||
|
||||
.summary-badge {
|
||||
font-size: 0.8rem;
|
||||
padding: 2px 8px;
|
||||
border-radius: 3px;
|
||||
}
|
||||
.summary-none { background: #f0f0f0; color: #888; }
|
||||
.summary-pending { background: #fff3e0; color: #e67e22; }
|
||||
.summary-processing { background: #e3f2fd; color: #1976d2; }
|
||||
.summary-done { background: #e8f5e9; color: #388e3c; }
|
||||
.summary-failed, .summary-permanent_failure { background: #fce4ec; color: #c62828; }
|
||||
|
||||
.btn-detail {
|
||||
font-size: 0.85rem;
|
||||
color: var(--accent);
|
||||
font-weight: 500;
|
||||
}
|
||||
|
||||
/* ── Empty State ────────────────────────────────────────────────── */
|
||||
.empty-state {
|
||||
text-align: center;
|
||||
padding: 60px 20px;
|
||||
color: var(--ink-light);
|
||||
}
|
||||
.empty-state p:first-child { font-size: 1.2rem; }
|
||||
.hint { font-size: 0.85rem; margin-top: 8px; }
|
||||
|
||||
/* ── Paper Detail ───────────────────────────────────────────────── */
|
||||
.paper-detail { max-width: 780px; margin: 0 auto; }
|
||||
|
||||
.back-link {
|
||||
display: inline-block;
|
||||
margin-bottom: 16px;
|
||||
font-size: 0.85rem;
|
||||
color: var(--ink-light);
|
||||
}
|
||||
|
||||
.detail-title {
|
||||
font-family: var(--font-body);
|
||||
font-size: 1.6rem;
|
||||
font-weight: 700;
|
||||
line-height: 1.4;
|
||||
margin-bottom: 12px;
|
||||
}
|
||||
.detail-title .title-en {
|
||||
display: block;
|
||||
font-size: 1rem;
|
||||
font-weight: 400;
|
||||
color: var(--ink-light);
|
||||
margin-top: 4px;
|
||||
}
|
||||
|
||||
.detail-meta {
|
||||
display: flex;
|
||||
gap: 16px;
|
||||
flex-wrap: wrap;
|
||||
font-size: 0.88rem;
|
||||
color: var(--ink-light);
|
||||
margin-bottom: 12px;
|
||||
}
|
||||
|
||||
.detail-tags { margin-bottom: 12px; display: flex; gap: 6px; flex-wrap: wrap; }
|
||||
|
||||
.detail-links {
|
||||
display: flex;
|
||||
gap: 12px;
|
||||
margin-bottom: 24px;
|
||||
}
|
||||
.ext-link {
|
||||
padding: 6px 14px;
|
||||
background: var(--surface);
|
||||
border: 1px solid var(--border);
|
||||
border-radius: var(--radius);
|
||||
font-size: 0.85rem;
|
||||
color: var(--ink-light);
|
||||
}
|
||||
.ext-link:hover { border-color: var(--accent); color: var(--accent); text-decoration: none; }
|
||||
|
||||
/* ── Summary Sections ───────────────────────────────────────────── */
|
||||
.summary-section {
|
||||
margin-bottom: 24px;
|
||||
padding: 20px;
|
||||
background: var(--surface);
|
||||
border: 1px solid var(--border);
|
||||
border-radius: var(--radius);
|
||||
}
|
||||
|
||||
.summary-section h2 {
|
||||
font-family: var(--font-body);
|
||||
font-size: 1.05rem;
|
||||
font-weight: 600;
|
||||
margin-bottom: 8px;
|
||||
color: var(--accent);
|
||||
}
|
||||
|
||||
.summary-section p {
|
||||
font-size: 0.92rem;
|
||||
line-height: 1.8;
|
||||
}
|
||||
|
||||
.one-line {
|
||||
font-size: 1rem;
|
||||
font-weight: 500;
|
||||
line-height: 1.6;
|
||||
}
|
||||
|
||||
.abstract-section { background: var(--bg); }
|
||||
.abstract-en { font-size: 0.9rem; color: var(--ink-light); font-style: italic; }
|
||||
|
||||
/* ── Summary Placeholders ───────────────────────────────────────── */
|
||||
.summary-placeholder {
|
||||
padding: 24px;
|
||||
text-align: center;
|
||||
border-radius: var(--radius);
|
||||
margin-bottom: 24px;
|
||||
}
|
||||
.summary-placeholder.processing { background: #e3f2fd; }
|
||||
.summary-placeholder.failed { background: #fce4ec; }
|
||||
.summary-placeholder.none { background: #f5f5f5; }
|
||||
.error-detail { font-size: 0.85rem; color: #c62828; margin-top: 8px; }
|
||||
|
||||
.quality-warning {
|
||||
padding: 10px 16px;
|
||||
background: #fff8e1;
|
||||
border: 1px solid #ffe082;
|
||||
border-radius: var(--radius);
|
||||
font-size: 0.85rem;
|
||||
color: #f57f17;
|
||||
margin-bottom: 16px;
|
||||
}
|
||||
|
||||
/* ── Footer ─────────────────────────────────────────────────────── */
|
||||
.site-footer {
|
||||
margin-top: 48px;
|
||||
padding: 20px;
|
||||
text-align: center;
|
||||
font-size: 0.8rem;
|
||||
color: var(--ink-light);
|
||||
border-top: 1px solid var(--border);
|
||||
}
|
||||
|
||||
/* ── Responsive ─────────────────────────────────────────────────── */
|
||||
@media (max-width: 640px) {
|
||||
.container { padding: 16px; }
|
||||
.nav-bar { padding: 10px 16px; }
|
||||
.date-nav { gap: 8px; }
|
||||
.date-title { font-size: 1.2rem; }
|
||||
.paper-card { padding: 14px 16px; }
|
||||
.detail-title { font-size: 1.3rem; }
|
||||
.detail-meta { flex-direction: column; gap: 4px; }
|
||||
}
|
||||
@@ -0,0 +1 @@
|
||||
/* app.js: basic front-end interactions (HTMX enhancements to follow) */
|
||||
@@ -0,0 +1,32 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="zh-CN">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>{% block title %}HF Daily Papers{% endblock %}</title>
|
||||
<link rel="stylesheet" href="/static/css/style.css">
|
||||
</head>
|
||||
<body>
|
||||
<header class="site-header">
|
||||
<nav class="nav-bar">
|
||||
<a href="/" class="nav-brand">📚 HF Daily Papers</a>
|
||||
<div class="nav-links">
|
||||
<a href="/day/{{ today }}">今日</a>
|
||||
<a href="/search">搜索</a>
|
||||
<a href="/reading-list">阅读列表</a>
|
||||
</div>
|
||||
</nav>
|
||||
</header>
|
||||
|
||||
<main class="container">
|
||||
{% block content %}{% endblock %}
|
||||
</main>
|
||||
|
||||
<footer class="site-footer">
|
||||
<p>HF Daily Papers — 中文论文导览站 · 数据来源于 <a href="https://huggingface.co/papers" target="_blank">HuggingFace</a></p>
|
||||
</footer>
|
||||
|
||||
<script src="/static/js/app.js"></script>
|
||||
{% block scripts %}{% endblock %}
|
||||
</body>
|
||||
</html>
|
||||
@@ -0,0 +1,121 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}{{ page_title }} — HF Daily Papers{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<article class="paper-detail">
|
||||
<a href="/day/{{ paper.paper_date.isoformat() }}" class="back-link">← 返回 {{ paper.paper_date.isoformat() }}</a>
|
||||
|
||||
{# Title #}
|
||||
<h1 class="detail-title">
|
||||
{{ paper.title_zh or paper.title_en }}
|
||||
{% if paper.title_zh and paper.title_en != paper.title_zh %}
|
||||
<small class="title-en">{{ paper.title_en }}</small>
|
||||
{% endif %}
|
||||
</h1>
|
||||
|
||||
{# Metadata #}
|
||||
<div class="detail-meta">
|
||||
<span class="detail-authors">{{ paper.authors|map(attribute='name')|join(', ') }}</span>
|
||||
<span class="detail-date">📅 {{ paper.published_at or paper.paper_date }}</span>
|
||||
<span class="detail-upvotes">👍 {{ paper.upvotes }}</span>
|
||||
</div>
|
||||
|
||||
{# Tags #}
|
||||
{% if paper.tags %}
|
||||
<div class="detail-tags">
|
||||
{% for tag in paper.tags %}
|
||||
<span class="tag">{{ tag.tag }}</span>
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{# Links #}
|
||||
<div class="detail-links">
|
||||
{% if paper.arxiv_url %}<a href="{{ paper.arxiv_url }}" target="_blank" class="ext-link">arXiv</a>{% endif %}
|
||||
{% if paper.hf_url %}<a href="{{ paper.hf_url }}" target="_blank" class="ext-link">HuggingFace</a>{% endif %}
|
||||
{% if paper.pdf_url %}<a href="{{ paper.pdf_url }}" target="_blank" class="ext-link">PDF</a>{% endif %}
|
||||
</div>
|
||||
|
||||
{# Summary content, degrading gracefully by status #}
|
||||
{% if summary_state == 'done' and paper.summary %}
|
||||
{% if paper.summary_status and paper.summary_status.quality == 'low' %}
|
||||
<div class="quality-warning">⚠️ AI 总结质量较低,仅供参考</div>
|
||||
{% elif paper.summary_status and paper.summary_status.quality == 'degraded' %}
|
||||
<div class="quality-warning">📝 总结部分字段不完整</div>
|
||||
{% endif %}
|
||||
|
||||
{% if paper.summary.one_line %}
|
||||
<section class="summary-section">
|
||||
<h2>一句话摘要</h2>
|
||||
<p class="one-line">{{ paper.summary.one_line }}</p>
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
{% if paper.summary.difficulty %}
|
||||
<section class="summary-section">
|
||||
<h2>难度</h2>
|
||||
<p>{{ paper.summary.difficulty }}</p>
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
{% if paper.summary.motivation_problem %}
|
||||
<section class="summary-section">
|
||||
<h2>研究动机</h2>
|
||||
{% if paper.summary.motivation_problem %}<p><strong>问题:</strong>{{ paper.summary.motivation_problem }}</p>{% endif %}
|
||||
{% if paper.summary.motivation_goal %}<p><strong>目标:</strong>{{ paper.summary.motivation_goal }}</p>{% endif %}
|
||||
{% if paper.summary.motivation_gap %}<p><strong>差距:</strong>{{ paper.summary.motivation_gap }}</p>{% endif %}
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
{% if paper.summary.method_key_idea %}
|
||||
<section class="summary-section">
|
||||
<h2>核心方法</h2>
|
||||
{% if paper.summary.method_overview %}<p>{{ paper.summary.method_overview }}</p>{% endif %}
|
||||
<p><strong>关键思路:</strong>{{ paper.summary.method_key_idea }}</p>
|
||||
{% if paper.summary.method_novelty %}<p><strong>新颖性:</strong>{{ paper.summary.method_novelty }}</p>{% endif %}
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
{% if paper.summary.results_main_json %}
|
||||
<section class="summary-section">
|
||||
<h2>实验结果</h2>
|
||||
<p>{{ paper.summary.results_main_json }}</p>
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
{% if paper.summary.limitations_json %}
|
||||
<section class="summary-section">
|
||||
<h2>局限与改进</h2>
|
||||
<p>{{ paper.summary.limitations_json }}</p>
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
{% elif summary_state == 'processing' %}
|
||||
<div class="summary-placeholder processing">
|
||||
<p>🔄 正在生成 AI 总结,请稍后刷新页面</p>
|
||||
</div>
|
||||
|
||||
{% elif summary_state in ('failed', 'permanent_failure') %}
|
||||
<div class="summary-placeholder failed">
|
||||
<p>❌ 总结生成失败{% if paper.summary_status and paper.summary_status.error_type %}({{ paper.summary_status.error_type }}){% endif %}</p>
|
||||
{% if paper.summary_status and paper.summary_status.error %}
|
||||
<p class="error-detail">{{ paper.summary_status.error }}</p>
|
||||
{% endif %}
|
||||
</div>
|
||||
|
||||
{% else %}
|
||||
<div class="summary-placeholder none">
|
||||
<p>📝 AI 总结尚未生成</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{# English abstract, always shown #}
|
||||
{% if paper.abstract %}
|
||||
<section class="summary-section abstract-section">
|
||||
<h2>Abstract</h2>
|
||||
<p class="abstract-en">{{ paper.abstract }}</p>
|
||||
</section>
|
||||
{% endif %}
|
||||
</article>
|
||||
{% endblock %}
|
||||
@@ -0,0 +1,36 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}{{ page_title }} — HF Daily Papers{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<div class="date-nav">
|
||||
{% if prev_day %}
|
||||
<a href="/day/{{ prev_day }}" class="date-nav-btn">← 前一天</a>
|
||||
{% endif %}
|
||||
<h1 class="date-title">{{ current_date }}</h1>
|
||||
{% if next_day <= today %}
|
||||
<a href="/day/{{ next_day }}" class="date-nav-btn">后一天 →</a>
|
||||
{% endif %}
|
||||
<a href="/day/{{ today }}" class="date-nav-btn">今日</a>
|
||||
</div>
|
||||
|
||||
{% if papers %}
|
||||
<div class="paper-list">
|
||||
{% for paper in papers %}
|
||||
{% include "partials/paper_card.html" %}
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% else %}
|
||||
<div class="empty-state">
|
||||
<p>📭 当天暂无论文数据</p>
|
||||
<p class="hint">试试浏览其他日期,或使用管理接口抓取数据</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<div class="date-quick-nav">
|
||||
<span>有数据的日期:</span>
|
||||
{% for d in available_dates[:10] %}
|
||||
<a href="/day/{{ d }}" class="date-chip {% if d == current_date %}active{% endif %}">{{ d }}</a>
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% endblock %}
|
||||
@@ -0,0 +1,44 @@
|
||||
{# Paper card partial; the `paper` variable must be present in the template context #}
|
||||
<article class="paper-card" data-arxiv="{{ paper.arxiv_id }}">
|
||||
<div class="paper-card-header">
|
||||
<h2 class="paper-title">
|
||||
<a href="/paper/{{ paper.arxiv_id }}">
|
||||
{{ paper.title_zh or paper.title_en }}
|
||||
</a>
|
||||
</h2>
|
||||
<span class="paper-upvotes">👍 {{ paper.upvotes }}</span>
|
||||
</div>
|
||||
|
||||
{% if paper.summary and paper.summary.one_line %}
|
||||
<p class="paper-one-line">{{ paper.summary.one_line }}</p>
|
||||
{% elif paper.abstract %}
|
||||
<p class="paper-abstract-preview">{{ paper.abstract[:200] }}{% if paper.abstract|length > 200 %}…{% endif %}</p>
|
||||
{% endif %}
|
||||
|
||||
<div class="paper-meta">
|
||||
<span class="paper-authors">
|
||||
{{ paper.authors|map(attribute='name')|join(', ')|truncate(80) }}
|
||||
</span>
|
||||
</div>
|
||||
|
||||
<div class="paper-tags">
|
||||
{% for tag in paper.tags[:5] %}
|
||||
<span class="tag">{{ tag.tag }}</span>
|
||||
{% endfor %}
|
||||
</div>
|
||||
|
||||
<div class="paper-footer">
|
||||
<span class="summary-badge summary-{{ paper.summary_status.status if paper.summary_status else 'none' }}">
|
||||
{% if not paper.summary_status or paper.summary_status.status == 'pending' %}
|
||||
未总结
|
||||
{% elif paper.summary_status.status == 'processing' %}
|
||||
🔄 总结中
|
||||
{% elif paper.summary_status.status == 'failed' or paper.summary_status.status == 'permanent_failure' %}
|
||||
❌ 总结失败
|
||||
{% elif paper.summary_status.status == 'done' %}
|
||||
✅ 已总结
|
||||
{% endif %}
|
||||
</span>
|
||||
<a href="/paper/{{ paper.arxiv_id }}" class="btn-detail">详情 →</a>
|
||||
</div>
|
||||
</article>
|
||||
@@ -0,0 +1,224 @@
|
||||
# API Routes and Page Design

> This document defines the page routes, JSON API, admin endpoints, user flows, and acceptance criteria.

---

## 1. Page Routes

| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | Redirects to `/day/{today}` |
| GET | `/day/{date}` | Paper list for the given date |
| GET | `/paper/{arxiv_id}` | Paper detail |
| GET | `/search` | Search page and search results |
| GET | `/reading-list` | Bookmarks and reading list |
| GET | `/admin/logs` | Admin log page; requires a token |
| GET | `/rss.xml` | RSS feed |

Planned enhancements:

- `/trends`
- `/compare?ids=id1,id2`
- `/similar/{arxiv_id}`
|
||||
|
||||
---
|
||||
|
||||
## 2. Data API

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/papers?date=&tag=&q=` | Paper list |
| GET | `/api/paper/{arxiv_id}` | Single paper detail |
| GET | `/api/dates` | Dates that have data |
| GET | `/api/tags` | Tags with counts |
| GET | `/api/stats` | Statistics |
| GET | `/api/search?q=&tag=` | FTS5 search |
|
||||
|
||||
---
|
||||
|
||||
## 3. User Data API

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/bookmark/{arxiv_id}` | Toggle bookmark |
| POST | `/api/reading-status/{arxiv_id}` | Update reading status |
| GET | `/api/note/{arxiv_id}` | Get note |
| POST | `/api/note/{arxiv_id}` | Save note |

Requests and responses use JSON. There is no account system; data is written to the local SQLite database.

Security boundary:

- With the default `APP_HOST=127.0.0.1`, the user data API only serves requests from the local machine.
- When the app is bound to a non-local address, the user data write endpoints must enable a same-origin check or a token.
|
||||
|
||||
---
|
||||
|
||||
## 4. Admin Endpoints

Every admin endpoint requires:

```text
Authorization: Bearer <ADMIN_TOKEN>
```

| Method | Path | Description |
|--------|------|-------------|
| POST | `/admin/crawl` | Manually crawl a given date; defaults to today |
| POST | `/admin/summarize/{arxiv_id}` | Manually summarize or re-run a single paper |
| POST | `/admin/summarize` | Batch-summarize pending papers |
| POST | `/admin/cleanup` | Clean up temporary files |
| POST | `/admin/delete` | Delete data within a date range |
| GET | `/admin/logs` | View task logs |

### `/admin/delete` request body

```json
{
  "date_start": "2026-06-01",
  "date_end": "2026-06-05",
  "include_notes": true,
  "confirm": "DELETE"
}
```

`confirm` must be exactly `DELETE`; otherwise the request is refused.
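
The two guards above (bearer token, then `confirm`) could be sketched as a plain function. This is a minimal illustration, not the project's handler: the function name `check_delete_request` is hypothetical, and the 400 status for a bad `confirm` is an assumption, since the document only says the request is refused.

```python
from __future__ import annotations

import hmac


def check_delete_request(auth_header: str | None, body: dict, admin_token: str) -> int:
    """Return the HTTP status a hypothetical /admin/delete handler would answer with."""
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return 401  # missing or malformed Authorization header
    supplied = auth_header[len(prefix):]
    # compare_digest keeps the comparison time independent of where the strings differ.
    if not hmac.compare_digest(supplied.encode(), admin_token.encode()):
        return 401  # wrong token
    if body.get("confirm") != "DELETE":
        return 400  # assumption: the source only states the request is refused
    return 200


print(check_delete_request("Bearer secret", {"confirm": "DELETE"}, "secret"))
```

Checking the token before the body means an unauthenticated caller learns nothing about what the endpoint expects.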
|
||||
|
||||
---
|
||||
|
||||
## 5. Page States

### Home / Date Page

Each paper card shows:

- The Chinese title; falls back to the English title when there is no summary.
- The one-line summary; falls back to a truncated English abstract when there is no summary.
- Tags, authors, upvotes, and difficulty.
- Summary status: not summarized, in progress, failed, or done.
- Bookmark button, reading-status control, and detail link.

### Detail Page

The detail page degrades according to the summary state:

| State | Display |
|-------|---------|
| No summary | English title, authors, abstract, HF/arXiv links, manual summarize button |
| processing | Metadata + "Generating summary" |
| failed | Metadata + error type + manual re-run button |
| done/normal | Full structured Chinese walkthrough |
| done/degraded | Shows the available content; missing modules are marked incomplete |
| done/low | Quality notice at the top + available content |

Detail modules:

- One-line summary
- Prerequisites
- Motivation
- Core method
- Experimental results
- Limitations and directions for improvement
- Links to the original paper
- Bookmark, reading status, personal notes

### Search Page

The MVP offers keyword search only:

- Search box.
- Tag filter.
- Results ordered by relevance and date.
- Highlighted matching snippets.

Semantic search is a later enhancement; for now the UI does not show a mode switch.

### Reading List

Filters:

- All bookmarks.
- Unread.
- Summary read.
- Full paper read.
- Has notes.
- Tag.
|
||||
|
||||
---
|
||||
|
||||
## 6. User Flows

```text
Visit /
  -> /day/{today}
  -> browse paper cards
  -> click a paper to open /paper/{arxiv_id}
  -> bookmark / change reading status / write a note
  -> search via /search?q=...
  -> reading list at /reading-list
```

Admin flow:

```text
POST /admin/crawl
  -> crawl papers and store them
  -> POST /admin/summarize
  -> generate summaries
  -> POST /admin/cleanup
  -> review /admin/logs
```

Deletion flow:

```text
POST /admin/delete
  -> validate token and confirm
  -> delete papers, indexes, user data, and local files in the date range
  -> write the deletion record and log
```
|
||||
|
||||
---
|
||||
|
||||
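## 7. MVP Acceptance Criteria

### Crawling

- The top N HF Daily Papers can be crawled for a given date.
- Crawling the same day again does not insert duplicates.
- A date with no papers yields a success status and a log entry with 0 papers.
- Network failures are covered by a timeout, retries, and an error log.

### Summarization

- A failure on one paper does not affect the others.
- A missing required field triggers one automatic retry.
- If the retry also fails, the paper is marked `permanent_failure`.
- After a successful summary, the page, the FTS index, and summary.json are updated together.
- Temporary PDF/source files are cleaned up after both success and failure.

### Pages

- The home page can show the not-summarized, in-progress, failed, and done states.
- The detail page still shows the English metadata when there is no summary.
- degraded/low summaries carry a clear notice.
- Main content does not overflow horizontally on mobile.

### Search

- Titles, abstracts, authors, tags, and Chinese summaries are searchable.
- A deleted paper no longer appears in search results.

### Admin

- Admin endpoints cannot be called without a token.
- A wrong token returns 401.
- The delete endpoint refuses to run without `confirm=DELETE`.
- After deleting a date range, pages, the search index, user data, and local files stay consistent.

### Scheduling

- With a single worker, the daily job runs exactly once.
- When a multi-worker or non-local host configuration is risky, the app emits a clear warning at startup or refuses to start.
- Both the "today" used by `/` and the daily schedule date are computed in `APP_TIMEZONE`.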
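
To illustrate the last criterion, here is a minimal sketch of deriving "today" from a time zone name such as the `APP_TIMEZONE` value in the sample environment file. The helper name `local_today` is hypothetical, and the sketch assumes the system ships IANA time zone data.

```python
from __future__ import annotations

from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo


def local_today(tz_name: str, now_utc: datetime | None = None) -> date:
    """Return the calendar date in tz_name; now_utc is injectable for testing."""
    now_utc = now_utc or datetime.now(timezone.utc)
    return now_utc.astimezone(ZoneInfo(tz_name)).date()


# 17:00 UTC on June 4 is already 01:00 on June 5 in Asia/Shanghai (UTC+8).
late_utc = datetime(2026, 6, 4, 17, 0, tzinfo=timezone.utc)
print(local_today("Asia/Shanghai", late_utc))
print(late_utc.date())
```

Routing both the redirect on `/` and the scheduler through one helper like this keeps the two from disagreeing around midnight.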
|
||||
@@ -0,0 +1,394 @@
|
||||
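# Data Model

> This document defines the SQLite tables, the summary.json schema, index synchronization, validation, and the deletion policy.

---

## 1. Design Principles

1. SQLite is the primary store; pages and the API read from SQLite first.
2. Downloaded files such as PDFs and LaTeX sources are temporary assets and are cleaned up once parsing and summarization finish.
3. `meta.json`, `summary.json`, and `raw_output.txt` may be kept under `data/papers/{arxiv_id}/` as human-readable backups.
4. Authors and tags live in normalized tables, which avoids the aggregation problems of JSON strings.
5. FTS5 is maintained in a separate index table that is updated whenever a paper is inserted, updated, or deleted.
6. ChromaDB is a later enhancement and must not become a hard dependency for rendering MVP pages.
7. Every SQLite connection must run `PRAGMA foreign_keys=ON` so that cascading deletes take effect.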
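
Principle 7 matters because SQLite leaves foreign key enforcement off by default on each new connection. The sketch below uses the standard-library `sqlite3` module and a cut-down, illustrative pair of tables (not the full schema) to show the cascade taking effect once the pragma is set; the `connect` helper is hypothetical.

```python
import sqlite3


def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection with foreign key enforcement switched on."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys=ON")  # per-connection setting, off by default
    return conn


conn = connect()
conn.executescript(
    """
    CREATE TABLE papers (id INTEGER PRIMARY KEY, arxiv_id TEXT UNIQUE NOT NULL);
    CREATE TABLE paper_tags (
        id INTEGER PRIMARY KEY,
        paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
        tag TEXT NOT NULL
    );
    INSERT INTO papers (id, arxiv_id) VALUES (1, '0000.00000');
    INSERT INTO paper_tags (paper_id, tag) VALUES (1, 'llm');
    """
)
conn.execute("DELETE FROM papers WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM paper_tags").fetchone()[0]
print(remaining)  # the tag row is removed together with its paper
```

With an ORM connection pool, the same pragma would typically be issued from a hook that runs on every new connection rather than once at startup.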
|
||||
|
||||
---
|
||||
|
||||
## 2. Database Tables

### papers: main paper table
|
||||
|
||||
```sql
|
||||
CREATE TABLE papers (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
arxiv_id TEXT UNIQUE NOT NULL,
|
||||
title_en TEXT NOT NULL,
|
||||
title_zh TEXT,
|
||||
abstract TEXT,
|
||||
published_at DATE,
|
||||
paper_date DATE NOT NULL,
|
||||
crawled_at DATETIME NOT NULL,
|
||||
upvotes INTEGER DEFAULT 0,
|
||||
hf_url TEXT,
|
||||
arxiv_url TEXT,
|
||||
pdf_url TEXT,
|
||||
source_url TEXT,
|
||||
asset_status TEXT DEFAULT 'not_downloaded', -- not_downloaded / ready / failed / cleaned
|
||||
asset_error TEXT,
|
||||
meta_path TEXT,
|
||||
summary_path TEXT,
|
||||
raw_output_path TEXT,
|
||||
summary_quality TEXT -- normal / degraded / low
|
||||
);
|
||||
```
|
||||
|
||||
Manual deletion is a physical (hard) delete. The deletion audit trail is written to `data_delete_jobs` and `crawl_logs`.

### paper_authors: authors
|
||||
|
||||
```sql
|
||||
CREATE TABLE paper_authors (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
|
||||
name TEXT NOT NULL,
|
||||
position INTEGER DEFAULT 0,
|
||||
UNIQUE(paper_id, name)
|
||||
);
|
||||
```
|
||||
|
||||
### paper_tags: tags
|
||||
|
||||
```sql
|
||||
CREATE TABLE paper_tags (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
|
||||
tag TEXT NOT NULL,
|
||||
source TEXT DEFAULT 'hf', -- hf / ai / user
|
||||
UNIQUE(paper_id, tag, source)
|
||||
);
|
||||
```
|
||||
|
||||
### paper_summaries: structured summaries
|
||||
|
||||
```sql
|
||||
CREATE TABLE paper_summaries (
|
||||
paper_id INTEGER PRIMARY KEY REFERENCES papers(id) ON DELETE CASCADE,
|
||||
one_line TEXT,
|
||||
difficulty TEXT,
|
||||
prerequisites_json TEXT,
|
||||
motivation_problem TEXT,
|
||||
motivation_goal TEXT,
|
||||
motivation_gap TEXT,
|
||||
method_overview TEXT,
|
||||
method_key_idea TEXT,
|
||||
method_steps_json TEXT,
|
||||
method_novelty TEXT,
|
||||
results_main_json TEXT,
|
||||
results_benchmarks_json TEXT,
|
||||
limitations_json TEXT,
|
||||
weaknesses_json TEXT,
|
||||
future_work_json TEXT,
|
||||
reproducibility TEXT,
|
||||
full_json TEXT NOT NULL,
|
||||
updated_at DATETIME NOT NULL
|
||||
);
|
||||
```
|
||||
|
||||
The structured columns serve pages, comparison, search, and sorting; `full_json` keeps the complete original structure.

### summary_status: summary status
|
||||
|
||||
```sql
|
||||
CREATE TABLE summary_status (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
|
||||
status TEXT NOT NULL, -- pending / processing / done / failed / permanent_failure
|
||||
quality TEXT, -- normal / degraded / low
|
||||
error_type TEXT, -- pdf_download_failed / timeout / process_error / json_not_found / json_invalid / field_missing / schema_error / unknown
|
||||
error TEXT,
|
||||
retry_count INTEGER DEFAULT 0,
|
||||
raw_output_saved BOOLEAN DEFAULT FALSE,
|
||||
started_at DATETIME,
|
||||
completed_at DATETIME,
|
||||
UNIQUE(paper_id)
|
||||
);
|
||||
```
|
||||
|
||||
### papers_fts: full-text search index
|
||||
|
||||
```sql
|
||||
CREATE VIRTUAL TABLE papers_fts USING fts5(
|
||||
title_en,
|
||||
title_zh,
|
||||
abstract,
|
||||
authors,
|
||||
tags,
|
||||
summary_text,
|
||||
tokenize='unicode61'
|
||||
);
|
||||
```
|
||||
|
||||
This is a regular FTS5 table maintained explicitly by the application layer. A regular FTS5 table stores its own copy of the indexed text; at this data volume that is acceptable, and in exchange the update and delete semantics stay simple and reliable:

- New paper: insert the title, abstract, authors, and tags.
- Summary finished: update the Chinese title and `summary_text`.
- Bookmark/note changes: not indexed, so personal notes do not pollute paper search.
- Paper deleted: delete the matching FTS row in the same operation.

Writes must use `papers.id` as the FTS rowid:

```sql
INSERT INTO papers_fts(rowid, title_en, title_zh, abstract, authors, tags, summary_text)
VALUES (:paper_id, :title_en, :title_zh, :abstract, :authors, :tags, :summary_text);
```

Updates can use a plain `UPDATE`, or delete by rowid and insert again. When a paper is deleted, run:

```sql
DELETE FROM papers_fts WHERE rowid = :paper_id;
```
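
The index lifecycle described above (insert on crawl, update after summarization, delete with the paper) can be sketched with the standard-library `sqlite3` module. The helper names `index_paper`, `index_summary`, `unindex_paper`, and `search` are hypothetical, and the sketch assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE papers_fts USING fts5("
    "title_en, title_zh, abstract, authors, tags, summary_text, tokenize='unicode61')"
)


def index_paper(conn, paper_id, title_en, abstract, authors, tags):
    """Insert the crawl-time fields; papers.id doubles as the FTS rowid."""
    conn.execute(
        "INSERT INTO papers_fts(rowid, title_en, title_zh, abstract, authors, tags, summary_text) "
        "VALUES (?, ?, '', ?, ?, ?, '')",
        (paper_id, title_en, abstract, authors, tags),
    )


def index_summary(conn, paper_id, title_zh, summary_text):
    """Fill in the summary-derived columns once summarization has finished."""
    conn.execute(
        "UPDATE papers_fts SET title_zh = ?, summary_text = ? WHERE rowid = ?",
        (title_zh, summary_text, paper_id),
    )


def unindex_paper(conn, paper_id):
    """Drop the FTS row when its paper is deleted."""
    conn.execute("DELETE FROM papers_fts WHERE rowid = ?", (paper_id,))


def search(conn, query):
    """Return matching paper ids, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM papers_fts WHERE papers_fts MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]


index_paper(conn, 7, "Fast Diffusion Sampling", "We study samplers.", "A. Author", "diffusion")
print(search(conn, "diffusion"))
```

Because the rowid equals `papers.id`, search hits can be joined back to `papers` without a separate mapping table. Raw user input would still need escaping before being passed to `MATCH`, which this sketch does not attempt.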
|
||||
|
||||
### crawl_logs: task log
|
||||
|
||||
```sql
|
||||
CREATE TABLE crawl_logs (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
task TEXT NOT NULL, -- crawl / summarize / cleanup / delete / scheduler
|
||||
status TEXT NOT NULL, -- running / success / failed
|
||||
date DATE,
|
||||
papers_found INTEGER,
|
||||
papers_new INTEGER,
|
||||
error TEXT,
|
||||
started_at DATETIME NOT NULL,
|
||||
completed_at DATETIME
|
||||
);
|
||||
```
|
||||
|
||||
### task_locks: task locks
|
||||
|
||||
```sql
|
||||
CREATE TABLE task_locks (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
task TEXT NOT NULL,
|
||||
lock_key TEXT NOT NULL, -- usually a date, e.g. 2026-06-05
|
||||
status TEXT NOT NULL, -- running / finished / failed
|
||||
owner TEXT,
|
||||
acquired_at DATETIME NOT NULL,
|
||||
released_at DATETIME
|
||||
);
|
||||
|
||||
CREATE UNIQUE INDEX uq_task_locks_running
|
||||
ON task_locks(task, lock_key)
|
||||
WHERE status = 'running';
|
||||
```
|
||||
|
||||
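Re-entrancy rule: before starting a task, insert a lock row with `status='running'`. If the insert fails, the same task is already running, so skip it or return 409. When the task ends, update the row to `finished` or `failed`.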
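
A minimal sketch of that rule with the standard-library `sqlite3` module follows. The schema is copied from the table definition above; the helper names `acquire_lock` and `release_lock` are hypothetical, and mapping a refused lock to HTTP 409 is left to the caller.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE task_locks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task TEXT NOT NULL,
    lock_key TEXT NOT NULL,
    status TEXT NOT NULL,
    owner TEXT,
    acquired_at DATETIME NOT NULL,
    released_at DATETIME
);
CREATE UNIQUE INDEX uq_task_locks_running
    ON task_locks(task, lock_key) WHERE status = 'running';
"""


def acquire_lock(conn, task, lock_key, owner):
    """Return the lock row id, or None when the same task is already running."""
    now = datetime.now(timezone.utc).isoformat()
    try:
        cur = conn.execute(
            "INSERT INTO task_locks(task, lock_key, status, owner, acquired_at) "
            "VALUES (?, ?, 'running', ?, ?)",
            (task, lock_key, owner, now),
        )
    except sqlite3.IntegrityError:
        conn.rollback()  # the partial unique index rejected a second running row
        return None
    conn.commit()
    return cur.lastrowid


def release_lock(conn, lock_id, ok=True):
    """Mark the lock finished or failed so the same key can be locked again."""
    conn.execute(
        "UPDATE task_locks SET status = ?, released_at = ? WHERE id = ?",
        ("finished" if ok else "failed", datetime.now(timezone.utc).isoformat(), lock_id),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
lock_id = acquire_lock(conn, "crawl", "2026-06-05", "worker-a")
print(acquire_lock(conn, "crawl", "2026-06-05", "worker-b"))  # refused while running
release_lock(conn, lock_id)
```

Since the unique index only covers rows with `status = 'running'`, finished and failed rows accumulate as history without blocking later runs for the same key.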
|
||||
|
||||
### user_bookmarks: bookmarks
|
||||
|
||||
```sql
|
||||
CREATE TABLE user_bookmarks (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
|
||||
note TEXT,
|
||||
created_at DATETIME NOT NULL,
|
||||
UNIQUE(paper_id)
|
||||
);
|
||||
```
|
||||
|
||||
### user_reading_status: reading status
|
||||
|
||||
```sql
|
||||
CREATE TABLE user_reading_status (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
|
||||
status TEXT NOT NULL, -- unread / skimmed / read_summary / read_full
|
||||
updated_at DATETIME NOT NULL,
|
||||
UNIQUE(paper_id)
|
||||
);
|
||||
```
|
||||
|
||||
### user_notes: personal notes
|
||||
|
||||
```sql
|
||||
CREATE TABLE user_notes (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
|
||||
content TEXT NOT NULL,
|
||||
created_at DATETIME NOT NULL,
|
||||
updated_at DATETIME NOT NULL,
|
||||
UNIQUE(paper_id)
|
||||
);
|
||||
```
|
||||
|
||||
### data_delete_jobs: manual deletion records
|
||||
|
||||
```sql
|
||||
CREATE TABLE data_delete_jobs (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
date_start DATE NOT NULL,
|
||||
date_end DATE NOT NULL,
|
||||
include_notes BOOLEAN DEFAULT TRUE,
|
||||
paper_count INTEGER DEFAULT 0,
|
||||
status TEXT NOT NULL, -- running / success / failed
|
||||
error TEXT,
|
||||
started_at DATETIME NOT NULL,
|
||||
completed_at DATETIME
|
||||
);
|
||||
```
|
||||
|
||||
---

## 3. summary.json Schema

```python
from pydantic import BaseModel, Field, field_validator


class Prerequisites(BaseModel):
    concepts: list[str] = Field(default_factory=list)
    level: str = ""


class Motivation(BaseModel):
    problem: str  # required
    goal: str = ""
    gap: str = ""


class Method(BaseModel):
    overview: str = ""
    key_idea: str  # required
    steps: list[str] = Field(default_factory=list)
    novelty: str = ""


class Results(BaseModel):
    main_findings: list[str] = Field(default_factory=list)
    benchmarks: list[dict] = Field(default_factory=list)
    limitations: list[str] = Field(default_factory=list)


class Improvements(BaseModel):
    weaknesses: list[str] = Field(default_factory=list)
    future_work: list[str] = Field(default_factory=list)
    reproducibility: str = ""


class SummarySchema(BaseModel):
    title_zh: str
    one_line: str
    tags: list[str]
    difficulty: str = ""
    paper_date: str | None = None
    prerequisites: Prerequisites = Field(default_factory=Prerequisites)
    motivation: Motivation
    method: Method
    results: Results = Field(default_factory=Results)
    improvements: Improvements = Field(default_factory=Improvements)

    # Reject empty or whitespace-only text and store the stripped value.
    @field_validator("title_zh", "one_line")
    @classmethod
    def non_empty_text(cls, value: str) -> str:
        if not value or not value.strip():
            raise ValueError("field cannot be empty")
        return value.strip()

    # Drop blank tags; at least one tag must remain.
    @field_validator("tags")
    @classmethod
    def non_empty_tags(cls, value: list[str]) -> list[str]:
        tags = [tag.strip() for tag in value if tag and tag.strip()]
        if not tags:
            raise ValueError("tags cannot be empty")
        return tags
```

The actual implementation must also apply the same non-empty validation to `Motivation.problem` and `Method.key_idea`; an empty string is treated as `field_missing`.

### Field levels

| Level | Fields | Handling |
|-------|--------|----------|
| Required | `title_zh`, `one_line`, `tags`, `motivation.problem`, `method.key_idea` | If missing, fail and retry |
| Important | `motivation.goal`, `motivation.gap`, `method.overview`, `results.main_findings` | May be stored when missing; marked `degraded` |
| Optional | `benchmarks`, `limitations`, `improvements`, `prerequisites` | Defaults are used when missing |

---

## 4. Validation and error handling

### State transitions

```text
pending -> processing -> done
                      └-> failed -> pending retry -> processing
                                 └-> permanent_failure
```

### Error types

| error_type | Scenario | Auto retry |
|------------|----------|------------|
| timeout | pi timed out | Yes |
| pdf_download_failed | PDF download failed or file unreadable | Yes |
| process_error | pi process exited with a non-zero code | Yes |
| json_not_found | No JSON found in the output | Yes |
| json_invalid | JSON parsing failed | Yes |
| field_missing | Required field missing | Yes |
| schema_error | Invalid field type | Yes |
| unknown | Unclassified exception | Yes |

The maximum number of automatic retries is 1. If the retry also fails, the paper is marked `permanent_failure` and can be re-run manually from the admin panel.

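The retry rule can be expressed as a small pure function. This is a sketch under the assumption that each paper carries a `retry_count` counter; that name is hypothetical and does not come from the schema shown in this document.

```python
MAX_AUTO_RETRIES = 1  # the documented maximum number of automatic retries


def next_status_after_failure(
    retry_count: int, max_retries: int = MAX_AUTO_RETRIES
) -> tuple[str, int]:
    """Return (new_status, new_retry_count) after one failed attempt."""
    if retry_count < max_retries:
        # Still has retry budget: go back to the pending queue.
        return "pending", retry_count + 1
    # Budget exhausted: only a manual re-run can revive the paper.
    return "permanent_failure", retry_count
```

Keeping the decision free of I/O makes it easy to unit-test separately from the database update that persists the new status.
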
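### Quality levels

| quality | Condition | Page behavior |
|---------|-----------|---------------|
| normal | Required and important fields are complete | Full display |
| degraded | Required fields complete, some important fields missing | Missing modules show "incomplete" (`不完整`) |
| low | Fields are present but the content is clearly hollow | Banner at the top: "AI summary quality is low" (`AI 总结质量较低`) |
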
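A minimal sketch of how `normal` and `degraded` might be told apart, assuming the summary has already passed required-field validation and is available as a plain dict. The `low` level is deliberately left out because the document does not define how "hollow" content is detected. The helper names are hypothetical.

```python
# Important fields from the field-level table, as (section, key) pairs.
IMPORTANT_FIELDS = [
    ("motivation", "goal"),
    ("motivation", "gap"),
    ("method", "overview"),
    ("results", "main_findings"),
]


def _present(value) -> bool:
    """Treat blank strings and empty containers as missing."""
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, dict)):
        return len(value) > 0
    return value is not None


def missing_important(summary: dict) -> list[str]:
    """List the important fields that are absent or empty."""
    missing = []
    for section, key in IMPORTANT_FIELDS:
        section_data = summary.get(section) or {}
        if not _present(section_data.get(key)):
            missing.append(f"{section}.{key}")
    return missing


def classify_quality(summary: dict) -> str:
    """Return 'degraded' if any important field is missing, else 'normal'."""
    return "degraded" if missing_important(summary) else "normal"
```

The list returned by `missing_important` could also drive which page modules render the "incomplete" placeholder.
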
---

## 5. Deletion and cleanup policy

### Temporary file cleanup

After each paper is processed, delete:

- `data/tmp/{arxiv_id}/paper.pdf`
- `data/tmp/{arxiv_id}/source/`
- Any other intermediate download files

When summarization fails, the downloaded files should still be cleaned up, but `raw_output.txt` and the error logs are kept.

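One way to implement this cleanup is sketched below. It assumes `raw_output.txt` lives inside `data/tmp/{arxiv_id}/` and that error logs are stored elsewhere; neither location is confirmed by this document. The function name and the `failed` flag are hypothetical.

```python
import shutil
from pathlib import Path

# Files preserved when summarization fails (assumed to sit in the tmp dir).
KEEP_ON_FAILURE = {"raw_output.txt"}


def cleanup_tmp(tmp_root: Path, arxiv_id: str, failed: bool) -> None:
    """Remove downloaded files for one paper, keeping raw output on failure."""
    paper_dir = tmp_root / arxiv_id
    if not paper_dir.is_dir():
        return
    if not failed:
        # Success: nothing in the temporary directory is needed any more.
        shutil.rmtree(paper_dir)
        return
    for entry in paper_dir.iterdir():
        if entry.name in KEEP_ON_FAILURE:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)  # e.g. source/
        else:
            entry.unlink()  # e.g. paper.pdf
```
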
### Manual deletion by date range

An administrator can delete data whose `paper_date` falls within a given range. The deletion flow:

1. Query the target papers.
2. Delete user bookmarks, reading status, and notes.
3. Delete summary/status/authors/tags.
4. Delete the FTS5 index entries.
5. Delete `data/papers/{arxiv_id}/` and `data/tmp/{arxiv_id}/`.
6. Physically delete the `papers` rows.
7. Write to `data_delete_jobs` and `crawl_logs`.

If recoverable deletion is needed later, introduce a `deleted_at` soft-delete column at that point; the MVP does not implement it.

---

## 6. ChromaDB enhancement design

ChromaDB is not part of the MVP. When it is integrated, index only the high-signal fields from `paper_summaries`:

- Chinese title
- English title
- Tags
- One-line summary
- `motivation_problem`
- `method_key_idea`

The vector dimension must match `EMBED_MODEL`. Validate the embedding length before writing; on a mismatch, skip semantic indexing and log it, without affecting regular pages or FTS search.
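The pre-write dimension check could look like the following sketch. The function name and logger name are hypothetical, and `expected_dim` is assumed to come from the `EMBED_DIMENSIONS` setting.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger("semantic_index")  # hypothetical logger name


def checked_embedding(
    arxiv_id: str, vector: Sequence[float], expected_dim: int
) -> Optional[list[float]]:
    """Return the vector if its length matches, otherwise log and return None."""
    if len(vector) != expected_dim:
        logger.warning(
            "skip semantic index for %s: got %d dims, expected %d",
            arxiv_id, len(vector), expected_dim,
        )
        return None
    return list(vector)
```

A caller would write to the vector store only when the result is not `None`, so a mismatch never blocks the summary from being stored.
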
@@ -0,0 +1,269 @@
# Service Modules in Detail

> This document describes each service module's responsibilities, inputs and outputs, failure handling, and implementation constraints.

---

## 1. Crawler service

**Responsibility**: fetch the paper list from HuggingFace Daily Papers and write the metadata. PDFs are not kept long-term during the crawl stage.

### Data sources

- Daily Papers API: `GET https://huggingface.co/api/daily_papers?date=YYYY-MM-DD`
- PDF: `https://arxiv.org/pdf/{arxiv_id}.pdf` (downloaded on demand during summarization)
- Source (later enhancement): `https://arxiv.org/e-print/{arxiv_id}`

The official HuggingFace Hub API documentation states that `/api/daily_papers` supports a `date` query parameter.

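### Rules

- `arxiv_id` is the unique key.
- When the same day is crawled again, existing papers only have mutable metadata such as upvotes and tags updated; they are not inserted twice.
- Network requests must set a timeout, a User-Agent, and a retry count.
- When the API returns an empty list, log it as a success; it is not a failure.
- The crawl stage does not download PDFs. If a PDF download fails during summarization, set `asset_status=failed` and `summary_status.error_type=pdf_download_failed`.
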
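The "update, do not re-insert" rule maps naturally onto SQLite's `INSERT ... ON CONFLICT DO UPDATE`. The sketch below uses a deliberately reduced, hypothetical `papers` table (only `arxiv_id`, `title`, `upvotes`); the project's real table and its SQLAlchemy models are not shown in this section.

```python
import sqlite3

# Hypothetical cut-down table; the real papers schema has more columns.
PAPERS_SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    arxiv_id TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    upvotes INTEGER NOT NULL DEFAULT 0
)
"""


def upsert_paper(conn: sqlite3.Connection, arxiv_id: str, title: str, upvotes: int) -> None:
    """Insert a new paper, or refresh only its mutable metadata if it exists."""
    with conn:
        conn.execute(
            """
            INSERT INTO papers (arxiv_id, title, upvotes)
            VALUES (?, ?, ?)
            ON CONFLICT(arxiv_id) DO UPDATE SET upvotes = excluded.upvotes
            """,
            (arxiv_id, title, upvotes),
        )
```

Because only `upvotes` appears in the `DO UPDATE` clause, a second crawl of the same day leaves the stored title and row id untouched.
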
### Interface

```python
async def fetch_daily(date: str, top_n: int) -> list[PaperMeta]: ...
async def upsert_papers(papers: list[PaperMeta]) -> list[Paper]: ...
```

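---

## 2. AI summary service

**Responsibility**: call the pi CLI to turn a single paper into a structured Chinese summary.

### Calling principles

- One pi call per paper.
- Concurrency is controlled by `SUMMARY_CONCURRENCY` (default 3).
- The per-paper timeout is controlled by `SUMMARY_TIMEOUT_SECONDS` (default 300 seconds).
- The pi path is configured via `PI_BIN`. A host-machine path is fine for now; abstract the deployment method once the pipeline works end to end.
- Before summarization starts, the PDF is downloaded on demand to `data/tmp/{arxiv_id}/paper.pdf` and cleaned up after the summary succeeds or fails.
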
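The concurrency and timeout limits can be sketched with `asyncio.Semaphore` and `asyncio.wait_for`. The `summarize_one` callable is a stand-in for whatever coroutine eventually invokes pi; it and the function name `summarize_all` are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def summarize_all(
    arxiv_ids: Iterable[str],
    summarize_one: Callable[[str], Awaitable[None]],
    concurrency: int = 3,             # SUMMARY_CONCURRENCY
    timeout_seconds: float = 300.0,   # SUMMARY_TIMEOUT_SECONDS
) -> dict[str, str]:
    """Run one summary per paper, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run(arxiv_id: str) -> tuple[str, str]:
        async with semaphore:
            try:
                await asyncio.wait_for(summarize_one(arxiv_id), timeout=timeout_seconds)
                return arxiv_id, "done"
            except asyncio.TimeoutError:
                # Maps to error_type=timeout in the error table.
                return arxiv_id, "timeout"

    results = await asyncio.gather(*(run(a) for a in arxiv_ids))
    return dict(results)
```

The timeout is applied inside the semaphore so that time spent waiting for a free slot does not count against a paper's own budget.
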
### Call example

```bash
# Prompt (kept in Chinese): "Give an in-depth reading of the following paper
# and output it according to the specified JSON schema:"
pi -p --skill daily-paper-summary \
  "请深度解读以下论文,并按指定 JSON schema 输出:
@data/papers/2401.12345/meta.json
@data/tmp/2401.12345/paper.pdf"
```

### Flow

```text
Take a pending paper
  -> download PDF to data/tmp/{arxiv_id}/paper.pdf
  -> status=processing
  -> call pi
  -> extract JSON
  -> Pydantic validation
  -> write summary.json
  -> write paper_summaries / paper_tags / papers_fts
  -> status=done
  -> clean up PDF/source temporary files
```

On failure, save the raw output, update `summary_status`, and clean up the downloaded files.

If the PDF download fails, pi is not called; record `pdf_download_failed` directly and enter the retry flow.

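The "extract JSON" step is not specified in detail. One possible approach, shown here purely as a sketch, scans the raw pi output for the first position where a JSON object decodes successfully, using the standard library's `json.JSONDecoder.raw_decode`. Mapping the two failure cases to `json_not_found` and `json_invalid` is an assumption about how those error types would be assigned.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Return the first decodable JSON object embedded in `raw`."""
    decoder = json.JSONDecoder()
    saw_brace = False
    for index, char in enumerate(raw):
        if char != "{":
            continue
        saw_brace = True
        try:
            obj, _end = decoder.raw_decode(raw, index)
        except json.JSONDecodeError:
            continue  # not valid from this brace; try the next one
        if isinstance(obj, dict):
            return obj
    # No brace at all: nothing that looks like JSON was produced.
    raise ValueError("json_invalid" if saw_brace else "json_not_found")
```

The returned dict would then be handed to the Pydantic schema for validation.
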
---

## 3. Search service

**Responsibility**: the MVP provides FTS5 keyword search; ChromaDB semantic search is added later.

### FTS5 search

Indexed fields:

- English title
- Chinese title
- English abstract
- Authors
- Tags
- Chinese summary body

The application layer is responsible for keeping FTS in sync:

```python
def build_fts_document(paper: Paper, summary: PaperSummary | None) -> FtsDocument:
    # Concatenate the summary fields that should be searchable.
    summary_text = ""
    if summary:
        summary_text = " ".join([
            summary.one_line or "",
            summary.motivation_problem or "",
            summary.motivation_goal or "",
            summary.method_overview or "",
            summary.method_key_idea or "",
            " ".join(summary.results_main or []),
        ])
    # Constructor arguments are elided in the original design.
    return FtsDocument(...)
```

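### ChromaDB semantic search (later)

Requirements when integrating:

- Initialize only when `CHROMA_ENABLED=true`.
- An embedding API failure must not affect storing the summary.
- When the embedding dimension does not match the configuration, log it and skip.
- Re-check the query and filter syntax against the current official ChromaDB API before implementing.
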
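---

## 4. Page rendering service

**Responsibility**: read data from SQLite and render Jinja2 templates.

kami serves only as a style reference:

- Borrow its paper texture, whitespace, typographic hierarchy, and ink-blue accent color.
- Do not call kami or depend on it to generate pages.
- CSS lives in `app/static/css/style.css` and is maintained according to this project's actual page structure.

Pages must support degraded states:

- No summary: show the English metadata and "AI summary not yet generated" (`AI 总结尚未生成`).
- Summary failed: show the error type and a manual re-run entry point.
- degraded/low: show a notice but still display the existing content.
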
---

## 5. User data service

**Responsibility**: local personalization data, with no account system.

Features:

- Bookmark / remove bookmark.
- Reading status: `unread`, `skimmed`, `read_summary`, `read_full`.
- Personal Markdown notes.
- Reading list: filter by bookmark, status, tag, and date.

All user data is deleted together with the paper it belongs to.

---

## 6. Cleanup and deletion service

**Responsibility**: clean up temporary files and let an administrator manually delete data within a date range.

### Temporary file cleanup

Triggers:

- After a single summary succeeds.
- After a single summary fails.
- A fallback scan of `data/tmp/` after the daily job ends.

### Manual deletion

Interface:

```python
from datetime import date


async def delete_papers_by_date_range(
    date_start: date,
    date_end: date,
    include_notes: bool = True,
) -> DeleteResult: ...
```

Requirements:

- Count the target papers before deleting.
- Delete DB records, FTS index entries, and local files.
- When a deletion fails, record the specific arXiv ID and the error.
- The date range must be bounded to avoid deleting all data by mistake; the admin endpoint requires a second confirmation parameter.

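The bounded range and confirmation requirement could be enforced before any row is touched. Everything specific in this sketch is an assumption: the 31-day cap, the `confirm` parameter, and its `start..end` format are illustrative choices, not values taken from the design.

```python
from datetime import date

MAX_RANGE_DAYS = 31  # hypothetical cap; the design does not fix a number


def validate_delete_request(
    date_start: date,
    date_end: date,
    confirm: str,
    max_days: int = MAX_RANGE_DAYS,
) -> int:
    """Validate a delete request and return the number of days it covers."""
    if date_end < date_start:
        raise ValueError("date_end is before date_start")
    days = (date_end - date_start).days + 1  # inclusive range
    if days > max_days:
        raise ValueError(f"range covers {days} days, limit is {max_days}")
    # The caller must echo the exact range back as a second confirmation.
    expected = f"{date_start.isoformat()}..{date_end.isoformat()}"
    if confirm != expected:
        raise ValueError("confirmation does not match the requested range")
    return days
```

Requiring the caller to repeat the range makes an accidental request with default or stale parameters fail fast.
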
---

## 7. Scheduler service

**Responsibility**: run the daily crawl and summarization automatically.

### Constraints

- The application runs with a single worker.
- Either `APP_WORKERS` is 1 or `SCHEDULER_ENABLED=false`.
- On startup, check for running tasks to avoid duplicate execution.
- For the same date and the same task, use a database lock or log status to prevent re-entry.
- The `task_locks` table is recommended; when the lock cannot be acquired, automatic tasks skip the run and admin endpoints return 409.

### Daily flow

```text
08:00
  -> compute today in APP_TIMEZONE
  -> crawl(date=today)
  -> summarize pending papers
  -> cleanup tmp files
  -> write logs
```

Manual triggers:

- CLI: `python -m app.cli crawl --date YYYY-MM-DD`
- API: `POST /admin/crawl`

---

## 8. Admin and security service

**Responsibility**: protect every admin operation that has side effects.

### Authentication

Admin endpoints must require `ADMIN_TOKEN`:

```text
Authorization: Bearer <ADMIN_TOKEN>
```

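A framework-independent sketch of the header check is shown below. The function name is hypothetical; in the FastAPI app this logic would presumably sit inside a dependency. `hmac.compare_digest` is used so the comparison does not leak timing information about the token.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], admin_token: str) -> bool:
    """Check an `Authorization: Bearer <token>` header against ADMIN_TOKEN."""
    if not admin_token or not authorization_header:
        # An unset token must never authorize anything.
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(
        presented.strip().encode("utf-8"), admin_token.encode("utf-8")
    )
```
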
Protected endpoints:

- `POST /admin/crawl`
- `POST /admin/summarize/{arxiv_id}`
- `POST /admin/summarize`
- `POST /admin/cleanup`
- `POST /admin/delete`
- `GET /admin/logs`

If `ADMIN_TOKEN` is empty or still the default `change-me`, the application should warn at startup; if `APP_HOST` is also not `127.0.0.1`, it should refuse to start or require explicit confirmation.

User data endpoints are intended for local use by default. When `APP_HOST=127.0.0.1`, the bookmark, reading status, and note endpoints require no extra token. When bound to a non-local address, enable a same-origin check or require `ADMIN_TOKEN`, so that other people on the internal network cannot modify local notes.

---

## 9. RSS service

**Responsibility**: output an RSS feed of recent papers.

The MVP implements only `/rss.xml`:

- Defaults to the last 7 days.
- Supports `?tag=RAG`.
- Uses the Chinese title when available, otherwise the English title.
- Detail links point to this site's `/paper/{arxiv_id}`.

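The title fallback and link rules can be illustrated with `xml.etree.ElementTree`. This is a sketch only: papers are represented as plain dicts with hypothetical keys (`arxiv_id`, `title`, `title_zh`), the channel metadata is placeholder text, and date or tag filtering is assumed to happen in the query that produces the list.

```python
import xml.etree.ElementTree as ET


def build_rss(papers: list[dict], base_url: str) -> str:
    """Render a minimal RSS 2.0 document for the given papers."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "HF Daily Papers"
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = "Recent papers"
    for paper in papers:
        item = ET.SubElement(channel, "item")
        # Prefer the Chinese title, fall back to the English one.
        ET.SubElement(item, "title").text = paper.get("title_zh") or paper["title"]
        ET.SubElement(item, "link").text = f"{base_url}/paper/{paper['arxiv_id']}"
    return ET.tostring(rss, encoding="unicode")
```
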
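Atom and JSON Feed are left as later enhancements.

---

## 10. Later enhancement services

These capabilities are not part of the MVP for now:

- LaTeX figure extraction.
- ChromaDB semantic search.
- Similar-paper recommendations.
- Trend dashboard.
- Paper comparison page.

Before implementing any of them, re-evaluate data volume, API cost, page complexity, and acceptance criteria.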
@@ -0,0 +1,30 @@
[project]
name = "hf-daily-papers"
version = "0.1.0"
description = "HuggingFace Daily Papers: a Chinese-language paper guide site"
requires-python = ">=3.12"
dependencies = [
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
    "sqlalchemy>=2.0",
    "httpx>=0.28",
    "jinja2>=3.1",
    "python-multipart>=0.0.18",
    "pydantic>=2.0",
    "pydantic-settings>=2.0",
    "typer>=0.15",
    "python-dotenv>=1.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.24",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["app"]
@@ -0,0 +1,5 @@
"""Shortcut script: initialize the database."""

if __name__ == "__main__":
    from app.cli import cli_app
    cli_app(["init-db"])
@@ -0,0 +1,6 @@
"""Shortcut script: manually crawl a given date. Usage: python scripts/manual_crawl.py [YYYY-MM-DD] [--top N]"""

if __name__ == "__main__":
    import sys
    from app.cli import cli_app
    cli_app(["crawl"] + sys.argv[1:])