# Data Model

> This document defines the SQLite tables, the summary.json schema, index synchronization, validation, and the deletion policy.

---

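## 1. Design Principles

1. SQLite is the primary store; pages and the API read from SQLite first.
2. Downloaded files such as PDFs and LaTeX sources are temporary assets and are cleaned up once parsing and summarization finish.
3. `meta.json`, `summary.json`, and `raw_output.txt` may be kept under `data/papers/{arxiv_id}/` as human-readable backups.
4. Authors and tags live in normalized tables, which avoids the difficulty of aggregating over JSON strings.
5. FTS5 is maintained as a separate index table and is updated in step whenever a paper is inserted, updated, or deleted.
6. ChromaDB is a later enhancement and must not become a required dependency for rendering MVP pages.
7. Every SQLite connection must run `PRAGMA foreign_keys=ON` so that cascading deletes take effect.
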
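Principle 7 is easy to miss because SQLite applies `PRAGMA foreign_keys` per connection, and most builds leave it off by default. A minimal sketch of a connection helper, assuming Python's standard `sqlite3` module; the name `open_connection` and the `db_path` parameter are hypothetical:

```python
import sqlite3


def open_connection(db_path: str) -> sqlite3.Connection:
    """Open a SQLite connection with foreign-key enforcement enabled."""
    conn = sqlite3.connect(db_path)
    # The pragma is scoped to this connection, so it has to run on every
    # new connection, before any transaction is opened.
    conn.execute("PRAGMA foreign_keys = ON")
    return conn


# Quick check: the pragma reads back as 1 when enforcement is active.
conn = open_connection(":memory:")
enabled = conn.execute("PRAGMA foreign_keys").fetchone()[0]
```

Routing every connection through one helper like this keeps the `ON DELETE CASCADE` clauses in the tables below from being silently ignored.
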
---

## 2. Database Tables

### papers: main paper table

```sql
CREATE TABLE papers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    arxiv_id TEXT UNIQUE NOT NULL,
    title_en TEXT NOT NULL,
    title_zh TEXT,
    abstract TEXT,
    published_at DATE,
    paper_date DATE NOT NULL,
    crawled_at DATETIME NOT NULL,
    upvotes INTEGER DEFAULT 0,
    hf_url TEXT,
    arxiv_url TEXT,
    pdf_url TEXT,
    source_url TEXT,
    asset_status TEXT DEFAULT 'not_downloaded', -- not_downloaded / ready / failed / cleaned
    asset_error TEXT,
    meta_path TEXT,
    summary_path TEXT,
    raw_output_path TEXT,
    summary_quality TEXT -- normal / degraded / low
);
```

Manual deletion is a physical (hard) delete. The deletion audit trail is written to `data_delete_jobs` and `crawl_logs`.

### paper_authors: authors table

```sql
CREATE TABLE paper_authors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    name TEXT NOT NULL,
    position INTEGER DEFAULT 0,
    UNIQUE(paper_id, name)
);
```

### paper_tags: tags table

```sql
CREATE TABLE paper_tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    tag TEXT NOT NULL,
    source TEXT DEFAULT 'hf', -- hf / ai / user
    UNIQUE(paper_id, tag, source)
);
```

### paper_summaries: structured summary table

```sql
CREATE TABLE paper_summaries (
    paper_id INTEGER PRIMARY KEY REFERENCES papers(id) ON DELETE CASCADE,
    one_line TEXT,
    difficulty TEXT,
    prerequisites_json TEXT,
    motivation_problem TEXT,
    motivation_goal TEXT,
    motivation_gap TEXT,
    method_overview TEXT,
    method_key_idea TEXT,
    method_steps_json TEXT,
    method_novelty TEXT,
    results_main_json TEXT,
    results_benchmarks_json TEXT,
    limitations_json TEXT,
    weaknesses_json TEXT,
    future_work_json TEXT,
    reproducibility TEXT,
    full_json TEXT NOT NULL,
    updated_at DATETIME NOT NULL
);
```

The structured columns serve page rendering, comparison, search, and sorting; `full_json` keeps the complete original structure.

### summary_status: summary status

```sql
CREATE TABLE summary_status (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    status TEXT NOT NULL, -- pending / processing / done / failed / permanent_failure
    quality TEXT, -- normal / degraded / low
    error_type TEXT, -- pdf_download_failed / timeout / process_error / json_not_found / json_invalid / field_missing / schema_error / unknown
    error TEXT,
    retry_count INTEGER DEFAULT 0,
    raw_output_saved BOOLEAN DEFAULT FALSE,
    started_at DATETIME,
    completed_at DATETIME,
    UNIQUE(paper_id)
);
```

### papers_fts: full-text search index

```sql
CREATE VIRTUAL TABLE papers_fts USING fts5(
    title_en,
    title_zh,
    abstract,
    authors,
    tags,
    summary_text,
    tokenize='unicode61'
);
```

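This is a regular FTS5 table maintained explicitly by the application layer. A regular FTS5 table stores its own copy of the indexed text; the extra data volume is acceptable and buys simple, reliable update and delete semantics:

- New paper: insert the title, abstract, authors, and tags.
- Summary completed: update the Chinese title and `summary_text`.
- Bookmark or note changes: not indexed in FTS, so personal notes do not pollute paper search.
- Paper deleted: delete the matching FTS row at the same time.
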
Writes must use `papers.id` as the FTS rowid:

```sql
INSERT INTO papers_fts(rowid, title_en, title_zh, abstract, authors, tags, summary_text)
VALUES (:paper_id, :title_en, :title_zh, :abstract, :authors, :tags, :summary_text);
```

Updates can use a plain `UPDATE`, or delete by rowid and re-insert. When a paper is deleted, run:

```sql
DELETE FROM papers_fts WHERE rowid = :paper_id;
```

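The index maintenance rules for `papers_fts` can be sketched in Python with the standard `sqlite3` module. The helper names (`fts_upsert`, `fts_delete`, `fts_search`) are hypothetical, and the sketch assumes the SQLite build has FTS5 enabled. It uses the delete-then-insert variant, so one function covers both new papers and later summary updates; the caller is expected to pass the full set of column values each time.

```python
import sqlite3

FTS_COLUMNS = ("title_en", "title_zh", "abstract", "authors", "tags", "summary_text")


def fts_upsert(conn: sqlite3.Connection, paper_id: int, fields: dict) -> None:
    """Replace the FTS row for a paper, using papers.id as the rowid."""
    values = [fields.get(col) or "" for col in FTS_COLUMNS]
    conn.execute("DELETE FROM papers_fts WHERE rowid = ?", (paper_id,))
    conn.execute(
        "INSERT INTO papers_fts(rowid, title_en, title_zh, abstract, authors, tags, summary_text) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (paper_id, *values),
    )


def fts_delete(conn: sqlite3.Connection, paper_id: int) -> None:
    """Remove the FTS row when the paper itself is deleted."""
    conn.execute("DELETE FROM papers_fts WHERE rowid = ?", (paper_id,))


def fts_search(conn: sqlite3.Connection, query: str) -> list:
    """Return matching papers.id values, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM papers_fts WHERE papers_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Because the rowid equals `papers.id`, search results can be joined back to `papers` without a separate mapping column.
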
### crawl_logs: task log

```sql
CREATE TABLE crawl_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task TEXT NOT NULL, -- crawl / summarize / cleanup / delete / scheduler
    status TEXT NOT NULL, -- running / success / failed
    date DATE,
    papers_found INTEGER,
    papers_new INTEGER,
    error TEXT,
    started_at DATETIME NOT NULL,
    completed_at DATETIME
);
```

### task_locks: task locks

```sql
CREATE TABLE task_locks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task TEXT NOT NULL,
    lock_key TEXT NOT NULL, -- usually a date, e.g. 2026-06-05
    status TEXT NOT NULL, -- running / finished / failed
    owner TEXT,
    acquired_at DATETIME NOT NULL,
    released_at DATETIME
);

CREATE UNIQUE INDEX uq_task_locks_running
ON task_locks(task, lock_key)
WHERE status = 'running';
```

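Re-entrancy guard: before starting a task, insert a lock row with `status='running'`. If the insert fails, the same task is already running, so skip it or return 409. When the task ends, update the row to `finished` or `failed`.
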
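The guard works because the partial unique index rejects a second `running` row for the same `(task, lock_key)`. A minimal sketch with the standard `sqlite3` module; `acquire_lock` and `release_lock` are hypothetical names, and storing timestamps as UTC ISO strings is an assumption:

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def _now() -> str:
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def acquire_lock(conn: sqlite3.Connection, task: str, lock_key: str,
                 owner: Optional[str] = None) -> Optional[int]:
    """Return the new lock row id, or None if the task is already running."""
    try:
        cur = conn.execute(
            "INSERT INTO task_locks (task, lock_key, status, owner, acquired_at) "
            "VALUES (?, ?, 'running', ?, ?)",
            (task, lock_key, owner, _now()),
        )
    except sqlite3.IntegrityError:
        # uq_task_locks_running rejected a second 'running' row.
        conn.rollback()
        return None
    conn.commit()
    return cur.lastrowid


def release_lock(conn: sqlite3.Connection, lock_id: int, ok: bool) -> None:
    """Mark the lock as finished or failed so the task can run again."""
    conn.execute(
        "UPDATE task_locks SET status = ?, released_at = ? WHERE id = ?",
        ("finished" if ok else "failed", _now(), lock_id),
    )
    conn.commit()
```

A caller that gets `None` back would skip the run or answer with HTTP 409, as described above.
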
### user_bookmarks: bookmarks

```sql
CREATE TABLE user_bookmarks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    note TEXT,
    created_at DATETIME NOT NULL,
    UNIQUE(paper_id)
);
```

### user_reading_status: reading status

```sql
CREATE TABLE user_reading_status (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    status TEXT NOT NULL, -- unread / skimmed / read_summary / read_full
    updated_at DATETIME NOT NULL,
    UNIQUE(paper_id)
);
```

### user_notes: personal notes

```sql
CREATE TABLE user_notes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,
    UNIQUE(paper_id)
);
```

### data_delete_jobs: manual deletion records

```sql
CREATE TABLE data_delete_jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    date_start DATE NOT NULL,
    date_end DATE NOT NULL,
    include_notes BOOLEAN DEFAULT TRUE,
    paper_count INTEGER DEFAULT 0,
    status TEXT NOT NULL, -- running / success / failed
    error TEXT,
    started_at DATETIME NOT NULL,
    completed_at DATETIME
);
```

---

## 3. summary.json Schema

```python
from pydantic import BaseModel, Field, field_validator


class Prerequisites(BaseModel):
    concepts: list[str] = Field(default_factory=list)
    level: str = ""


class Motivation(BaseModel):
    problem: str
    goal: str = ""
    gap: str = ""


class Method(BaseModel):
    overview: str = ""
    key_idea: str
    steps: list[str] = Field(default_factory=list)
    novelty: str = ""


class Results(BaseModel):
    main_findings: list[str] = Field(default_factory=list)
    benchmarks: list[dict] = Field(default_factory=list)
    limitations: list[str] = Field(default_factory=list)


class Improvements(BaseModel):
    weaknesses: list[str] = Field(default_factory=list)
    future_work: list[str] = Field(default_factory=list)
    reproducibility: str = ""


class SummarySchema(BaseModel):
    title_zh: str
    one_line: str
    tags: list[str]
    difficulty: str = ""
    paper_date: str | None = None
    prerequisites: Prerequisites = Field(default_factory=Prerequisites)
    motivation: Motivation
    method: Method
    results: Results = Field(default_factory=Results)
    improvements: Improvements = Field(default_factory=Improvements)

    @field_validator("title_zh", "one_line")
    @classmethod
    def non_empty_text(cls, value: str) -> str:
        if not value or not value.strip():
            raise ValueError("field cannot be empty")
        return value.strip()

    @field_validator("tags")
    @classmethod
    def non_empty_tags(cls, value: list[str]) -> list[str]:
        tags = [tag.strip() for tag in value if tag and tag.strip()]
        if not tags:
            raise ValueError("tags cannot be empty")
        return tags
```

The actual implementation must also apply the same non-empty validation to `Motivation.problem` and `Method.key_idea`; an empty string is treated as `field_missing`.

### Field tiers

| Tier | Fields | Handling |
|------|------|------|
| Required | `title_zh`, `one_line`, `tags`, `motivation.problem`, `method.key_idea` | If missing, fail and retry |
| Important | `motivation.goal`, `motivation.gap`, `method.overview`, `results.main_findings` | If missing, store anyway and mark `degraded` |
| Optional | `benchmarks`, `limitations`, `improvements`, `prerequisites` | If missing, use defaults |

---

## 4. Validation and Error Handling

### State transitions

```text
pending -> processing -> done
                      └-> failed -> pending retry -> processing
                                 └-> permanent_failure
```

### Error classes

| error_type | Scenario | Auto retry |
|------------|------|----------|
| timeout | pi timed out | Yes |
| pdf_download_failed | PDF download failed or file unreadable | Yes |
| process_error | pi process exited with a non-zero status | Yes |
| json_not_found | No JSON found in the output | Yes |
| json_invalid | JSON parsing failed | Yes |
| field_missing | Required field missing | Yes |
| schema_error | Field has an invalid type | Yes |
| unknown | Unclassified exception | Yes |

The maximum number of automatic retries is 1. If the retry also fails, the paper is marked `permanent_failure` and can be rerun manually from the admin console.

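The retry rule reduces to a single comparison against `retry_count`. A small sketch, assuming `retry_count` counts the automatic retries already consumed; the function name `status_after_failure` is hypothetical, and the sketch collapses the transient `failed` state into the status that follows it:

```python
MAX_AUTO_RETRIES = 1  # "the maximum number of automatic retries is 1"


def status_after_failure(retry_count: int, max_retries: int = MAX_AUTO_RETRIES) -> str:
    """Return the next summary_status.status after a failed attempt.

    retry_count is the number of automatic retries already used; the caller
    is expected to increment it when the paper goes back to 'pending'.
    """
    if retry_count < max_retries:
        return "pending"
    return "permanent_failure"
```

For example, a first failure (`retry_count == 0`) goes back to `pending`, and a failure of that retry (`retry_count == 1`) ends in `permanent_failure`.
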
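### Quality grades

| quality | Condition | Page behavior |
|---------|------|----------|
| normal | Required and important fields are complete | Full display |
| degraded | Required fields complete, some important fields missing | Missing modules show "incomplete" |
| low | Fields are present but the content is obviously hollow | Banner at the top: "AI summary quality is low" |
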
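The `normal` and `degraded` grades follow mechanically from the field tiers. A sketch over a plain parsed `summary.json` dict, using only the standard library; `grade_quality` and its helpers are hypothetical names. It does not attempt the `low` grade, because this document does not define how "obviously hollow" content is detected.

```python
REQUIRED = ("title_zh", "one_line", "tags", "motivation.problem", "method.key_idea")
IMPORTANT = ("motivation.goal", "motivation.gap", "method.overview", "results.main_findings")


def _get(summary: dict, dotted: str):
    """Look up a dotted path such as 'motivation.problem'; None if absent."""
    node = summary
    for key in dotted.split("."):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node


def _is_blank(value) -> bool:
    """Treat None, whitespace-only strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def grade_quality(summary: dict) -> str:
    """Return 'normal' or 'degraded'; raise ValueError if a required field is missing."""
    missing = [name for name in REQUIRED if _is_blank(_get(summary, name))]
    if missing:
        raise ValueError("field_missing: " + ", ".join(missing))
    if any(_is_blank(_get(summary, name)) for name in IMPORTANT):
        return "degraded"
    return "normal"
```

The returned grade would be written to `summary_status.quality`, while a `ValueError` maps to the `field_missing` error type.
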
---

## 5. Deletion and Cleanup Policy

### Temporary file cleanup

After each paper is processed, delete:

- `data/tmp/{arxiv_id}/paper.pdf`
- `data/tmp/{arxiv_id}/source/`
- Any other intermediate download files

If summarization fails, the downloaded files should still be cleaned up, but `raw_output.txt` and the error log are kept.

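Since every temporary asset lives under `data/tmp/{arxiv_id}/`, cleanup can remove that directory as a whole. A minimal sketch with the standard library; the function name `cleanup_tmp_assets` and the `data_root` parameter are hypothetical, and it assumes `raw_output.txt` is stored under `data/papers/{arxiv_id}/` so it is left untouched:

```python
import shutil
from pathlib import Path


def cleanup_tmp_assets(arxiv_id: str, data_root: Path = Path("data")) -> bool:
    """Delete data/tmp/{arxiv_id}/ and report whether anything was removed."""
    tmp_dir = data_root / "tmp" / arxiv_id
    if not tmp_dir.is_dir():
        return False  # nothing was downloaded, or it was already cleaned
    # Removes paper.pdf, source/, and any other intermediate files in one call.
    shutil.rmtree(tmp_dir)
    return True
```

Calling it on both the success and the failure path keeps the behavior identical in the two cases; the caller could then set `papers.asset_status` to `cleaned`.
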
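### Manually deleting a date range

An administrator can delete data whose `paper_date` falls within a given range. The deletion flow:

1. Query the target papers.
2. Delete user bookmarks, reading status, and notes.
3. Delete the summary, status, authors, and tags rows.
4. Delete the FTS5 index rows.
5. Delete `data/papers/{arxiv_id}/` and `data/tmp/{arxiv_id}/`.
6. Physically delete the `papers` rows.
7. Write records to `data_delete_jobs` and `crawl_logs`.

If recoverable deletion is needed later, introduce a `deleted_at` soft-delete column; the MVP does not implement it.
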
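The database part of this flow (steps 1, 4, and 6) can be sketched as one transaction. This is an illustration under assumptions, not the project's implementation: `delete_papers_in_range` is a hypothetical name, the date range is treated as inclusive, and instead of deleting child rows explicitly (steps 2 and 3) the sketch relies on `ON DELETE CASCADE` with `PRAGMA foreign_keys=ON` already set on the connection. File removal and the audit records are left to the caller.

```python
import sqlite3


def delete_papers_in_range(conn: sqlite3.Connection, date_start: str, date_end: str) -> list:
    """Delete papers with paper_date in [date_start, date_end]; return their arxiv_ids."""
    rows = conn.execute(
        "SELECT id, arxiv_id FROM papers WHERE paper_date BETWEEN ? AND ?",
        (date_start, date_end),
    ).fetchall()
    with conn:  # one transaction: commit on success, roll back on error
        for paper_id, _arxiv_id in rows:
            # FTS rows are not covered by foreign keys, so remove them explicitly.
            conn.execute("DELETE FROM papers_fts WHERE rowid = ?", (paper_id,))
            # Child tables declare ON DELETE CASCADE, so their rows go with the paper.
            conn.execute("DELETE FROM papers WHERE id = ?", (paper_id,))
    return [arxiv_id for _paper_id, arxiv_id in rows]
```

The returned `arxiv_id` list gives the caller what it needs to remove the per-paper directories and to fill `paper_count` in `data_delete_jobs`.
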
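---

## 6. ChromaDB Enhancement Design

ChromaDB is not part of the MVP. When it is integrated, index only the high-signal fields from `paper_summaries`:

- Chinese title
- English title
- Tags
- One-line summary
- `motivation_problem`
- `method_key_idea`

The vector dimension must match `EMBED_MODEL`. Validate the embedding length before writing; on a mismatch, skip semantic indexing and log it, without affecting normal pages or FTS search.
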
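The length check can be a small gate in front of whatever code writes to the vector store. A sketch using only the standard library; `checked_embedding` and the `expected_dim` parameter are hypothetical, and how the expected dimension is derived from `EMBED_MODEL` is left open here:

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


def checked_embedding(vector: Sequence[float], expected_dim: int,
                      arxiv_id: str) -> Optional[list]:
    """Return the vector as a list if its length matches, else log and return None."""
    if len(vector) != expected_dim:
        logger.warning(
            "skip semantic index for %s: embedding dim %d != expected %d",
            arxiv_id, len(vector), expected_dim,
        )
        return None
    return list(vector)
```

A caller that receives `None` simply skips the semantic index write, so SQLite pages and FTS search continue to work.
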