Rain-Bus f1be24ab83 feat: initial project structure
- Add FastAPI app with paper browsing UI and REST API
- Add crawler service and database models
- Add scripts for DB init and manual crawl
- Add docs (api-and-ui, data-model, services)
- Add requirements and project config
2026-06-05 21:56:40 +08:00

# Data Model
> This document defines the SQLite tables, the summary.json schema, index synchronization, validation, and the deletion policy.
---
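## 1. Design Principles
1. SQLite is the primary store; pages and the API read from SQLite first.
2. Downloaded files such as PDFs and LaTeX sources are temporary assets and are cleaned up once parsing and summarization finish.
3. `meta.json`, `summary.json`, and `raw_output.txt` may be kept under `data/papers/{arxiv_id}/` as human-readable backups.
4. Authors and tags live in normalized tables, because JSON strings are hard to aggregate.
5. FTS5 is maintained in a separate index table that is updated in step whenever a paper is inserted, updated, or deleted.
6. ChromaDB is a later enhancement and must not become a required dependency for rendering MVP pages.
7. Every SQLite connection must run `PRAGMA foreign_keys=ON` so that cascading deletes take effect.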
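
Principle 7 matters because SQLite leaves foreign-key enforcement off by default for each new connection. The following is a minimal sketch, not the project's actual code: the `connect` helper, the trimmed two-table schema, and the placeholder `arxiv_id` are assumptions made for illustration. It shows a child row disappearing through `ON DELETE CASCADE` once the pragma is set.

```python
import sqlite3


def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection with foreign-key enforcement switched on (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # Must run once per connection, before any transaction is open.
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


conn = connect()
conn.executescript(
    """
    CREATE TABLE papers (id INTEGER PRIMARY KEY, arxiv_id TEXT UNIQUE NOT NULL);
    CREATE TABLE paper_tags (
        id INTEGER PRIMARY KEY,
        paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
        tag TEXT NOT NULL
    );
    INSERT INTO papers (id, arxiv_id) VALUES (1, '0000.00000');
    INSERT INTO paper_tags (paper_id, tag) VALUES (1, 'nlp');
    """
)

# Deleting the parent row removes the tag row because the pragma is active.
conn.execute("DELETE FROM papers WHERE id = 1")
remaining_tags = conn.execute("SELECT COUNT(*) FROM paper_tags").fetchone()[0]
```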
---
## 2. Database Tables
### papers: main paper table
```sql
CREATE TABLE papers (
id INTEGER PRIMARY KEY AUTOINCREMENT,
arxiv_id TEXT UNIQUE NOT NULL,
title_en TEXT NOT NULL,
title_zh TEXT,
abstract TEXT,
published_at DATE,
paper_date DATE NOT NULL,
crawled_at DATETIME NOT NULL,
upvotes INTEGER DEFAULT 0,
hf_url TEXT,
arxiv_url TEXT,
pdf_url TEXT,
source_url TEXT,
asset_status TEXT DEFAULT 'not_downloaded', -- not_downloaded / ready / failed / cleaned
asset_error TEXT,
meta_path TEXT,
summary_path TEXT,
raw_output_path TEXT,
summary_quality TEXT -- normal / degraded / low
);
```
Manual deletion is a physical delete. The deletion audit trail is written to `data_delete_jobs` and `crawl_logs`.
### paper_authors: authors
```sql
CREATE TABLE paper_authors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
name TEXT NOT NULL,
position INTEGER DEFAULT 0,
UNIQUE(paper_id, name)
);
```
### paper_tags: tags
```sql
CREATE TABLE paper_tags (
id INTEGER PRIMARY KEY AUTOINCREMENT,
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
tag TEXT NOT NULL,
source TEXT DEFAULT 'hf', -- hf / ai / user
UNIQUE(paper_id, tag, source)
);
```
### paper_summaries: structured summaries
```sql
CREATE TABLE paper_summaries (
paper_id INTEGER PRIMARY KEY REFERENCES papers(id) ON DELETE CASCADE,
one_line TEXT,
difficulty TEXT,
prerequisites_json TEXT,
motivation_problem TEXT,
motivation_goal TEXT,
motivation_gap TEXT,
method_overview TEXT,
method_key_idea TEXT,
method_steps_json TEXT,
method_novelty TEXT,
results_main_json TEXT,
results_benchmarks_json TEXT,
limitations_json TEXT,
weaknesses_json TEXT,
future_work_json TEXT,
reproducibility TEXT,
full_json TEXT NOT NULL,
updated_at DATETIME NOT NULL
);
```
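The structured columns serve page rendering, comparison, search, and sorting; `full_json` keeps the complete original structure.
### summary_status: summarization status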
```sql
CREATE TABLE summary_status (
id INTEGER PRIMARY KEY AUTOINCREMENT,
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
status TEXT NOT NULL, -- pending / processing / done / failed / permanent_failure
quality TEXT, -- normal / degraded / low
error_type TEXT, -- pdf_download_failed / timeout / process_error / json_not_found / json_invalid / field_missing / schema_error / unknown
error TEXT,
retry_count INTEGER DEFAULT 0,
raw_output_saved BOOLEAN DEFAULT FALSE,
started_at DATETIME,
completed_at DATETIME,
UNIQUE(paper_id)
);
```
### papers_fts: full-text search index
```sql
CREATE VIRTUAL TABLE papers_fts USING fts5(
title_en,
title_zh,
abstract,
authors,
tags,
summary_text,
tokenize='unicode61'
);
```
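This is a regular FTS5 table maintained explicitly by the application layer. A regular FTS5 table stores its own copy of the indexed text; the data volume is acceptable, and in exchange the update and delete semantics stay simple and reliable:
- New paper: insert the title, abstract, authors, and tags.
- Summary completed: update the Chinese title and `summary_text`.
- Bookmark or note changes: not indexed in FTS, so personal notes do not pollute paper search.
- Paper deleted: delete the corresponding FTS row at the same time.

Writes must use `papers.id` as the FTS rowid: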
```sql
INSERT INTO papers_fts(rowid, title_en, title_zh, abstract, authors, tags, summary_text)
VALUES (:paper_id, :title_en, :title_zh, :abstract, :authors, :tags, :summary_text);
```
Updates can use a plain `UPDATE`, or delete by rowid and insert again. When deleting a paper, run:
```sql
DELETE FROM papers_fts WHERE rowid = :paper_id;
```
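
The maintenance rules above can be sketched with Python's built-in `sqlite3` module, assuming the local SQLite build includes FTS5. The function names (`index_paper`, `index_summary`, `unindex_paper`, `search`), the sample data, and the choice to join authors and tags with spaces are illustrative assumptions, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE papers_fts USING fts5("
    "title_en, title_zh, abstract, authors, tags, summary_text, "
    "tokenize='unicode61')"
)


def index_paper(conn, paper_id, title_en, abstract, authors, tags):
    """New paper: insert the FTS row keyed by papers.id."""
    conn.execute(
        "INSERT INTO papers_fts(rowid, title_en, title_zh, abstract, authors, tags, summary_text) "
        "VALUES (?, ?, '', ?, ?, ?, '')",
        (paper_id, title_en, abstract, " ".join(authors), " ".join(tags)),
    )


def index_summary(conn, paper_id, title_zh, summary_text):
    """Summary completed: fill in the Chinese title and summary text."""
    conn.execute(
        "UPDATE papers_fts SET title_zh = ?, summary_text = ? WHERE rowid = ?",
        (title_zh, summary_text, paper_id),
    )


def unindex_paper(conn, paper_id):
    """Paper deleted: drop the matching FTS row."""
    conn.execute("DELETE FROM papers_fts WHERE rowid = ?", (paper_id,))


def search(conn, query):
    """Return matching papers.id values, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM papers_fts WHERE papers_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


index_paper(conn, 7, "Sparse attention for long documents", "We study long inputs.",
            ["A. Author"], ["attention"])
index_summary(conn, 7, "长文档稀疏注意力", "uses block sparse kernels")
hits_before = search(conn, "kernels")
unindex_paper(conn, 7)
hits_after = search(conn, "kernels")
```

Because the rowid equals `papers.id`, search hits can be joined back to `papers` without a separate mapping column.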
### crawl_logs: task log
```sql
CREATE TABLE crawl_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task TEXT NOT NULL, -- crawl / summarize / cleanup / delete / scheduler
status TEXT NOT NULL, -- running / success / failed
date DATE,
papers_found INTEGER,
papers_new INTEGER,
error TEXT,
started_at DATETIME NOT NULL,
completed_at DATETIME
);
```
### task_locks: task locks
```sql
CREATE TABLE task_locks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task TEXT NOT NULL,
lock_key TEXT NOT NULL, -- usually a date, e.g. 2026-06-05
status TEXT NOT NULL, -- running / finished / failed
owner TEXT,
acquired_at DATETIME NOT NULL,
released_at DATETIME
);
CREATE UNIQUE INDEX uq_task_locks_running
ON task_locks(task, lock_key)
WHERE status = 'running';
```
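Re-entrancy rule: before starting a task, insert a lock row with `status='running'`. If the insert fails, the same task is already running, so skip it or return 409. When the task ends, update the row to `finished` or `failed`.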
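
One way to realize this rule is to let the partial unique index reject the second insert and catch the resulting `sqlite3.IntegrityError`. The sketch below is only an illustration: `acquire_lock`, `release_lock`, and the worker names are hypothetical, and timestamps are stored as ISO 8601 strings by assumption.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task_locks (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        task TEXT NOT NULL,
        lock_key TEXT NOT NULL,
        status TEXT NOT NULL,
        owner TEXT,
        acquired_at DATETIME NOT NULL,
        released_at DATETIME
    );
    CREATE UNIQUE INDEX uq_task_locks_running
        ON task_locks(task, lock_key) WHERE status = 'running';
    """
)


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def acquire_lock(conn, task, lock_key, owner):
    """Return the lock id, or None if the same task is already running."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO task_locks (task, lock_key, status, owner, acquired_at) "
                "VALUES (?, ?, 'running', ?, ?)",
                (task, lock_key, owner, _now()),
            )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None  # caller skips the run or answers with HTTP 409


def release_lock(conn, lock_id, ok):
    """Mark the lock finished or failed so the key can be locked again."""
    with conn:
        conn.execute(
            "UPDATE task_locks SET status = ?, released_at = ? WHERE id = ?",
            ("finished" if ok else "failed", _now(), lock_id),
        )


first = acquire_lock(conn, "crawl", "2026-06-05", "worker-a")
second = acquire_lock(conn, "crawl", "2026-06-05", "worker-b")  # rejected while running
release_lock(conn, first, ok=True)
third = acquire_lock(conn, "crawl", "2026-06-05", "worker-b")  # allowed after release
```

Since only rows with `status = 'running'` are covered by the index, finished and failed rows accumulate as history without blocking later runs.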
### user_bookmarks: bookmarks
```sql
CREATE TABLE user_bookmarks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
note TEXT,
created_at DATETIME NOT NULL,
UNIQUE(paper_id)
);
```
### user_reading_status: reading status
```sql
CREATE TABLE user_reading_status (
id INTEGER PRIMARY KEY AUTOINCREMENT,
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
status TEXT NOT NULL, -- unread / skimmed / read_summary / read_full
updated_at DATETIME NOT NULL,
UNIQUE(paper_id)
);
```
### user_notes: personal notes
```sql
CREATE TABLE user_notes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
content TEXT NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL,
UNIQUE(paper_id)
);
```
### data_delete_jobs: manual deletion records
```sql
CREATE TABLE data_delete_jobs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
date_start DATE NOT NULL,
date_end DATE NOT NULL,
include_notes BOOLEAN DEFAULT TRUE,
paper_count INTEGER DEFAULT 0,
status TEXT NOT NULL, -- running / success / failed
error TEXT,
started_at DATETIME NOT NULL,
completed_at DATETIME
);
```
---
## 3. summary.json Schema
```python
from pydantic import BaseModel, Field, field_validator


class Prerequisites(BaseModel):
    concepts: list[str] = Field(default_factory=list)
    level: str = ""


class Motivation(BaseModel):
    problem: str
    goal: str = ""
    gap: str = ""


class Method(BaseModel):
    overview: str = ""
    key_idea: str
    steps: list[str] = Field(default_factory=list)
    novelty: str = ""


class Results(BaseModel):
    main_findings: list[str] = Field(default_factory=list)
    benchmarks: list[dict] = Field(default_factory=list)
    limitations: list[str] = Field(default_factory=list)


class Improvements(BaseModel):
    weaknesses: list[str] = Field(default_factory=list)
    future_work: list[str] = Field(default_factory=list)
    reproducibility: str = ""


class SummarySchema(BaseModel):
    title_zh: str
    one_line: str
    tags: list[str]
    difficulty: str = ""
    paper_date: str | None = None
    prerequisites: Prerequisites = Field(default_factory=Prerequisites)
    motivation: Motivation
    method: Method
    results: Results = Field(default_factory=Results)
    improvements: Improvements = Field(default_factory=Improvements)

    @field_validator("title_zh", "one_line")
    @classmethod
    def non_empty_text(cls, value: str) -> str:
        if not value or not value.strip():
            raise ValueError("field cannot be empty")
        return value.strip()

    @field_validator("tags")
    @classmethod
    def non_empty_tags(cls, value: list[str]) -> list[str]:
        tags = [tag.strip() for tag in value if tag and tag.strip()]
        if not tags:
            raise ValueError("tags cannot be empty")
        return tags
```
The actual implementation must also apply the same non-empty validation to `Motivation.problem` and `Method.key_idea`; an empty string counts as `field_missing`.
### Field Tiers
| Tier | Fields | Handling |
|------|------|------|
| Required | `title_zh`, `one_line`, `tags`, `motivation.problem`, `method.key_idea` | Fail and retry if missing |
| Important | `motivation.goal`, `motivation.gap`, `method.overview`, `results.main_findings` | May be stored if missing; marked `degraded` |
| Optional | `benchmarks`, `limitations`, `improvements`, `prerequisites` | Defaults are used if missing |
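
The tiers above can be turned into a small check that runs on the parsed `summary.json` dictionary. This is a sketch under assumptions: the helper names (`get_path`, `is_blank`, `grade`) are hypothetical, blank strings and empty lists are treated as missing, and the `low` tier is left out because the document does not define how "hollow" content is detected.

```python
REQUIRED = ["title_zh", "one_line", "tags", "motivation.problem", "method.key_idea"]
IMPORTANT = ["motivation.goal", "motivation.gap", "method.overview", "results.main_findings"]


def get_path(data: dict, dotted: str):
    """Walk a dotted path such as 'motivation.problem'; return None if absent."""
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def is_blank(value) -> bool:
    """Treat None, whitespace-only strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def grade(data: dict):
    """Return (outcome, offending paths): field_missing, degraded, or normal."""
    missing_required = [p for p in REQUIRED if is_blank(get_path(data, p))]
    if missing_required:
        return "field_missing", missing_required
    missing_important = [p for p in IMPORTANT if is_blank(get_path(data, p))]
    if missing_important:
        return "degraded", missing_important
    return "normal", []


# Sample input with one important field left empty.
sample = {
    "title_zh": "示例标题",
    "one_line": "One-sentence summary.",
    "tags": ["nlp"],
    "motivation": {"problem": "Long inputs are slow.", "goal": "", "gap": "Prior work is dense."},
    "method": {"key_idea": "Sparse attention.", "overview": "Overview."},
    "results": {"main_findings": ["Faster inference."]},
}
outcome = grade(sample)
```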
---
## 4. Validation and Error Handling
### State Transitions
```text
pending -> processing -> done
                      └-> failed -> pending retry -> processing
                                 └-> permanent_failure
```
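### Error Classification
| error_type | Scenario | Auto-retry |
|------------|------|----------|
| timeout | pi timed out | Yes |
| pdf_download_failed | PDF download failed or the file is unreadable | Yes |
| process_error | pi process exited with a non-zero code | Yes |
| json_not_found | No JSON found in the output | Yes |
| json_invalid | JSON failed to parse | Yes |
| field_missing | A required field is missing | Yes |
| schema_error | A field has an invalid type | Yes |
| unknown | Unclassified exception | Yes |

The maximum number of automatic retries is 1. If the retry also fails, the paper is marked `permanent_failure` and can be rerun manually from the admin console.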
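
The retry rule can be expressed as a small pure function over the `summary_status` fields. The sketch below assumes `retry_count` counts retries already scheduled; the `SummaryStatus` dataclass and `record_failure` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MAX_AUTO_RETRIES = 1  # from the rule stated above


@dataclass
class SummaryStatus:
    status: str = "pending"
    retry_count: int = 0
    error_type: Optional[str] = None


def record_failure(state: SummaryStatus, error_type: str) -> SummaryStatus:
    """Apply one failed attempt: requeue if a retry is left, otherwise give up."""
    if state.retry_count < MAX_AUTO_RETRIES:
        return SummaryStatus("pending", state.retry_count + 1, error_type)
    return SummaryStatus("permanent_failure", state.retry_count, error_type)


state = SummaryStatus(status="processing")
state = record_failure(state, "timeout")       # first failure: requeued for one retry
after_first = state.status
state = record_failure(state, "json_invalid")  # retry failed too: no more attempts
after_second = state.status
```

A manual rerun from the admin console would reset the row to `pending`; whether it also resets `retry_count` is not specified in this document.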
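### Quality Tiers
| quality | Condition | Page behavior |
|---------|------|----------|
| normal | Required and important fields are all present | Shown in full |
| degraded | Required fields are complete; some important fields are missing | Missing modules are shown as "incomplete" |
| low | Fields are present but the content is clearly hollow | A banner at the top reads "AI summary quality is low" |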
---
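## 5. Deletion and Cleanup Policy
### Temporary File Cleanup
After each paper finishes processing, delete:
- `data/tmp/{arxiv_id}/paper.pdf`
- `data/tmp/{arxiv_id}/source/`
- Any other intermediate download files

Downloaded files should also be cleaned up when summarization fails, but `raw_output.txt` and the error log are kept.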
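
A minimal sketch of this cleanup follows. It assumes that removing the whole `data/tmp/{arxiv_id}/` directory is acceptable and that `raw_output.txt` lives under `data/papers/{arxiv_id}/` as described in the design principles. The `cleanup_downloads` helper and the placeholder `arxiv_id` are hypothetical, and the demonstration runs in a throwaway directory.

```python
import shutil
import tempfile
from pathlib import Path


def cleanup_downloads(data_root: Path, arxiv_id: str) -> None:
    """Delete data/tmp/{arxiv_id}/ and leave data/papers/{arxiv_id}/ untouched."""
    # ignore_errors=True keeps the call safe when the directory is already gone.
    shutil.rmtree(data_root / "tmp" / arxiv_id, ignore_errors=True)


# Build a fake layout: temporary downloads plus the kept backup file.
root = Path(tempfile.mkdtemp())
tmp_dir = root / "tmp" / "0000.00000"
(tmp_dir / "source").mkdir(parents=True)
(tmp_dir / "paper.pdf").write_bytes(b"%PDF-")
paper_dir = root / "papers" / "0000.00000"
paper_dir.mkdir(parents=True)
(paper_dir / "raw_output.txt").write_text("model output kept for debugging")

cleanup_downloads(root, "0000.00000")
```

Calling the helper from a `finally` block would cover both the success and the failure path with one code path.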
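### Manually Deleting a Date Range
An administrator can delete all data whose `paper_date` falls within a given range. The deletion flow:
1. Query the target papers.
2. Delete user bookmarks, reading status, and notes.
3. Delete the summary/status/authors/tags rows.
4. Delete the FTS5 index rows.
5. Delete `data/papers/{arxiv_id}/` and `data/tmp/{arxiv_id}/`.
6. Physically delete the `papers` rows.
7. Write to `data_delete_jobs` and `crawl_logs`.

If recoverable deletion is needed later, a `deleted_at` soft-delete column can be introduced; the MVP does not implement it.
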
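
The database part of this flow might look like the sketch below. It is a simplification under stated assumptions: child rows are removed through `ON DELETE CASCADE` rather than with explicit deletes, file removal (step 5) and the `crawl_logs` entry are omitted, and the schema is trimmed to the columns the example needs. `delete_range` and the sample rows are hypothetical.

```python
import sqlite3


def delete_range(conn: sqlite3.Connection, start: str, end: str) -> int:
    """Physically delete papers with paper_date in [start, end]; return the count."""
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM papers WHERE paper_date BETWEEN ? AND ?", (start, end))]
    with conn:  # one transaction: everything is deleted, or nothing is
        for paper_id in ids:
            conn.execute("DELETE FROM papers_fts WHERE rowid = ?", (paper_id,))
            # Bookmarks, notes, tags and similar child rows follow via cascade.
            conn.execute("DELETE FROM papers WHERE id = ?", (paper_id,))
        conn.execute(
            "INSERT INTO data_delete_jobs "
            "(date_start, date_end, paper_count, status, started_at, completed_at) "
            "VALUES (?, ?, ?, 'success', datetime('now'), datetime('now'))",
            (start, end, len(ids)),
        )
    return len(ids)


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys=ON")
conn.executescript(
    """
    CREATE TABLE papers (
        id INTEGER PRIMARY KEY, arxiv_id TEXT UNIQUE NOT NULL, paper_date DATE NOT NULL);
    CREATE TABLE user_notes (
        id INTEGER PRIMARY KEY,
        paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
        content TEXT NOT NULL);
    CREATE VIRTUAL TABLE papers_fts USING fts5(title_en);
    CREATE TABLE data_delete_jobs (
        id INTEGER PRIMARY KEY, date_start DATE NOT NULL, date_end DATE NOT NULL,
        paper_count INTEGER DEFAULT 0, status TEXT NOT NULL,
        started_at DATETIME NOT NULL, completed_at DATETIME);
    INSERT INTO papers VALUES (1, '0000.00001', '2026-06-01'), (2, '0000.00002', '2026-06-09');
    INSERT INTO user_notes (paper_id, content) VALUES (1, 'note');
    INSERT INTO papers_fts (rowid, title_en) VALUES (1, 'first'), (2, 'second');
    """
)
deleted = delete_range(conn, "2026-06-01", "2026-06-05")
```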
---
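## 6. ChromaDB Enhancement Design
ChromaDB is not part of the MVP. When it is integrated, index only the high-signal fields from `paper_summaries`:
- Chinese title
- English title
- Tags
- One-line summary
- `motivation_problem`
- `method_key_idea`

The vector dimension must match `EMBED_MODEL`. Validate the embedding length before writing; on a mismatch, skip semantic indexing and log it, without affecting regular pages or FTS search.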
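
The length check can be kept independent of ChromaDB itself. The sketch below assumes the expected dimension is known from configuration; `checked_embedding`, the logger name, and the dimension of 4 are hypothetical and chosen only to keep the example short.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger("semantic_index")


def checked_embedding(
    vector: Sequence[float], expected_dim: int, paper_id: int
) -> Optional[list]:
    """Return the vector as a list if its length matches, else log and return None."""
    if len(vector) != expected_dim:
        logger.warning(
            "skip semantic index for paper %s: got %d dims, expected %d",
            paper_id, len(vector), expected_dim,
        )
        return None
    return list(vector)


EXPECTED_DIM = 4  # hypothetical; in practice derived from the configured EMBED_MODEL
ok = checked_embedding([0.1, 0.2, 0.3, 0.4], EXPECTED_DIM, paper_id=1)
bad = checked_embedding([0.1, 0.2], EXPECTED_DIM, paper_id=2)
```

A caller would write to the vector store only when the result is not `None`, so a misconfigured model degrades semantic search without touching the SQLite path.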