Rain-Bus f1be24ab83 feat: initial project structure
- Add FastAPI app with paper browsing UI and REST API
- Add crawler service and database models
- Add scripts for DB init and manual crawl
- Add docs (api-and-ui, data-model, services)
- Add requirements and project config
2026-06-05 21:56:40 +08:00


Data Model

This document defines the SQLite tables, the summary.json schema, index synchronization, validation, and the deletion policy.


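1. Design Principles

  1. SQLite is the primary store; pages and the API read from SQLite first.
  2. Downloaded files such as PDFs and LaTeX sources are temporary assets and are cleaned up once parsing and summarization finish.
  3. meta.json, summary.json, and raw_output.txt may be kept as human-readable backups under data/papers/{arxiv_id}/.
  4. Authors and tags use normalized tables, which avoids the difficulty of aggregating over JSON strings.
  5. FTS5 is maintained as a separate index table, updated in sync whenever a paper is inserted, updated, or deleted.
  6. ChromaDB is a later enhancement and must not become a required dependency for rendering MVP pages.
  7. Every SQLite connection must execute PRAGMA foreign_keys=ON so that cascading deletes take effect.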
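
Principle 7 is easy to miss because SQLite normally leaves foreign-key enforcement off for each new connection. The following is a minimal sketch, assuming the standard-library sqlite3 module; the helper name `connect`, the trimmed-down tables, and the placeholder arxiv_id are illustrative and not part of the project.

```python
import sqlite3


def connect(db_path: str) -> sqlite3.Connection:
    """Open a SQLite connection with foreign-key enforcement turned on."""
    conn = sqlite3.connect(db_path)
    # The pragma is per connection; without it ON DELETE CASCADE is ignored.
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


conn = connect(":memory:")
conn.executescript(
    """
    -- Trimmed-down stand-ins for the real tables, for illustration only.
    CREATE TABLE papers (id INTEGER PRIMARY KEY, arxiv_id TEXT UNIQUE NOT NULL);
    CREATE TABLE paper_tags (
        id INTEGER PRIMARY KEY,
        paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
        tag TEXT NOT NULL
    );
    INSERT INTO papers (id, arxiv_id) VALUES (1, 'demo-id');
    INSERT INTO paper_tags (paper_id, tag) VALUES (1, 'demo');
    """
)

# Deleting the paper should remove its tag rows through the cascade.
conn.execute("DELETE FROM papers WHERE id = 1")
remaining_tags = conn.execute("SELECT COUNT(*) FROM paper_tags").fetchone()[0]
```

Routing every connection through one helper like this keeps the pragma from being forgotten in scripts or background tasks.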

2. Database Tables

papers: main paper table

CREATE TABLE papers (
    id                 INTEGER PRIMARY KEY AUTOINCREMENT,
    arxiv_id           TEXT UNIQUE NOT NULL,
    title_en           TEXT NOT NULL,
    title_zh           TEXT,
    abstract           TEXT,
    published_at       DATE,
    paper_date         DATE NOT NULL,
    crawled_at         DATETIME NOT NULL,
    upvotes            INTEGER DEFAULT 0,
    hf_url             TEXT,
    arxiv_url          TEXT,
    pdf_url            TEXT,
    source_url         TEXT,
    asset_status       TEXT DEFAULT 'not_downloaded', -- not_downloaded / ready / failed / cleaned
    asset_error        TEXT,
    meta_path          TEXT,
    summary_path       TEXT,
    raw_output_path    TEXT,
    summary_quality    TEXT       -- normal / degraded / low
);

Manual deletion is a physical (hard) delete. The deletion audit trail is written to data_delete_jobs and crawl_logs.

paper_authors: authors table

CREATE TABLE paper_authors (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id    INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    name        TEXT NOT NULL,
    position    INTEGER DEFAULT 0,
    UNIQUE(paper_id, name)
);

paper_tags: tags table

CREATE TABLE paper_tags (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id    INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    tag         TEXT NOT NULL,
    source      TEXT DEFAULT 'hf', -- hf / ai / user
    UNIQUE(paper_id, tag, source)
);

paper_summaries: structured summary table

CREATE TABLE paper_summaries (
    paper_id                 INTEGER PRIMARY KEY REFERENCES papers(id) ON DELETE CASCADE,
    one_line                 TEXT,
    difficulty               TEXT,
    prerequisites_json       TEXT,
    motivation_problem       TEXT,
    motivation_goal          TEXT,
    motivation_gap           TEXT,
    method_overview          TEXT,
    method_key_idea          TEXT,
    method_steps_json        TEXT,
    method_novelty           TEXT,
    results_main_json        TEXT,
    results_benchmarks_json  TEXT,
    limitations_json         TEXT,
    weaknesses_json          TEXT,
    future_work_json         TEXT,
    reproducibility          TEXT,
    full_json                TEXT NOT NULL,
    updated_at               DATETIME NOT NULL
);

The structured columns are used for page rendering, comparison, search, and sorting; full_json keeps the complete original structure.

summary_status: summary status

CREATE TABLE summary_status (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id          INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    status            TEXT NOT NULL, -- pending / processing / done / failed / permanent_failure
    quality           TEXT,          -- normal / degraded / low
    error_type        TEXT,          -- pdf_download_failed / timeout / process_error / json_not_found / json_invalid / field_missing / schema_error / unknown
    error             TEXT,
    retry_count       INTEGER DEFAULT 0,
    raw_output_saved  BOOLEAN DEFAULT FALSE,
    started_at        DATETIME,
    completed_at      DATETIME,
    UNIQUE(paper_id)
);

papers_fts: full-text search index

CREATE VIRTUAL TABLE papers_fts USING fts5(
    title_en,
    title_zh,
    abstract,
    authors,
    tags,
    summary_text,
    tokenize='unicode61'
);

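This is a regular FTS5 table, maintained explicitly by the application layer. A regular FTS5 table stores its own copy of the indexed text; the extra data volume is acceptable in exchange for simple, reliable update and delete semantics:

  • New paper: insert the title, abstract, authors, and tags.
  • Summary completed: update the Chinese title and summary_text.
  • Bookmark or note changes: not written to FTS, so personal notes do not pollute paper search.
  • Paper deleted: delete the corresponding FTS row at the same time.

Writes must use papers.id as the FTS rowid: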

INSERT INTO papers_fts(rowid, title_en, title_zh, abstract, authors, tags, summary_text)
VALUES (:paper_id, :title_en, :title_zh, :abstract, :authors, :tags, :summary_text);

An update can use a plain UPDATE, or delete by rowid and insert again. When deleting a paper, run:

DELETE FROM papers_fts WHERE rowid = :paper_id;
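
The index lifecycle described above can be sketched in Python as follows. This is a sketch under assumptions, not the project's implementation: the function names are hypothetical, authors and tags are assumed to be flattened into space-separated strings before indexing, the sample values are made up, and the local SQLite build is assumed to include FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE papers_fts USING fts5("
    "title_en, title_zh, abstract, authors, tags, summary_text, "
    "tokenize='unicode61')"
)


def index_paper(conn, paper_id, title_en, abstract, authors, tags):
    """New paper: index the crawl-time fields, keyed by papers.id as rowid."""
    conn.execute(
        "INSERT INTO papers_fts("
        "rowid, title_en, title_zh, abstract, authors, tags, summary_text) "
        "VALUES (?, ?, '', ?, ?, ?, '')",
        (paper_id, title_en, abstract, authors, tags),
    )


def index_summary(conn, paper_id, title_zh, summary_text):
    """Summary completed: fill in the Chinese title and the summary text."""
    conn.execute(
        "UPDATE papers_fts SET title_zh = ?, summary_text = ? WHERE rowid = ?",
        (title_zh, summary_text, paper_id),
    )


def unindex_paper(conn, paper_id):
    """Paper deleted: drop its FTS row."""
    conn.execute("DELETE FROM papers_fts WHERE rowid = ?", (paper_id,))


def search(conn, query):
    """Return matching papers.id values, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM papers_fts WHERE papers_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


index_paper(conn, 7, "Demo Attention Paper", "We study attention.", "A. Author", "nlp")
hits_after_insert = search(conn, "attention")

index_summary(conn, 7, "示例标题", "a sparse variant is proposed")
hits_after_summary = search(conn, "sparse")

unindex_paper(conn, 7)
hits_after_delete = search(conn, "attention")
```

Because the rowid equals papers.id, search results can be joined straight back to the papers table without a separate mapping column.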

crawl_logs: task log

CREATE TABLE crawl_logs (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    task            TEXT NOT NULL, -- crawl / summarize / cleanup / delete / scheduler
    status          TEXT NOT NULL, -- running / success / failed
    date            DATE,
    papers_found    INTEGER,
    papers_new      INTEGER,
    error           TEXT,
    started_at      DATETIME NOT NULL,
    completed_at    DATETIME
);

task_locks: task locks

CREATE TABLE task_locks (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    task           TEXT NOT NULL,
    lock_key       TEXT NOT NULL, -- usually a date, e.g. 2026-06-05
    status         TEXT NOT NULL, -- running / finished / failed
    owner          TEXT,
    acquired_at    DATETIME NOT NULL,
    released_at    DATETIME
);

CREATE UNIQUE INDEX uq_task_locks_running
ON task_locks(task, lock_key)
WHERE status = 'running';

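Re-entrancy rule: before starting a task, insert a lock row with status='running'. If the insert fails, the same task is already running, so skip it or return 409. When the task ends, update the row to finished or failed.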
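
The re-entrancy rule can be sketched with the partial unique index doing the work. The helper names `acquire_lock` and `release_lock` and the owner strings are assumptions for illustration; only the table and index definitions come from this document.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task_locks (
        id             INTEGER PRIMARY KEY AUTOINCREMENT,
        task           TEXT NOT NULL,
        lock_key       TEXT NOT NULL,
        status         TEXT NOT NULL,
        owner          TEXT,
        acquired_at    DATETIME NOT NULL,
        released_at    DATETIME
    );
    CREATE UNIQUE INDEX uq_task_locks_running
    ON task_locks(task, lock_key)
    WHERE status = 'running';
    """
)


def _now() -> str:
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def acquire_lock(conn, task, lock_key, owner):
    """Return the new lock id, or None if the same task/key is already running."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO task_locks(task, lock_key, status, owner, acquired_at) "
                "VALUES (?, ?, 'running', ?, ?)",
                (task, lock_key, owner, _now()),
            )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # The partial unique index rejected a second 'running' row.
        return None


def release_lock(conn, lock_id, ok):
    """Mark the lock finished or failed so the same key can run again."""
    with conn:
        conn.execute(
            "UPDATE task_locks SET status = ?, released_at = ? WHERE id = ?",
            ("finished" if ok else "failed", _now(), lock_id),
        )


first = acquire_lock(conn, "crawl", "2026-06-05", "worker-a")
second = acquire_lock(conn, "crawl", "2026-06-05", "worker-b")  # rejected
release_lock(conn, first, ok=True)
third = acquire_lock(conn, "crawl", "2026-06-05", "worker-b")  # allowed again
```

A caller that gets None back would skip the run or answer with HTTP 409, as the rule states. Finished and failed rows stay in the table as history because the index only covers running rows.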

user_bookmarks: bookmarks

CREATE TABLE user_bookmarks (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id    INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    note        TEXT,
    created_at  DATETIME NOT NULL,
    UNIQUE(paper_id)
);

user_reading_status: reading status

CREATE TABLE user_reading_status (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id    INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    status      TEXT NOT NULL, -- unread / skimmed / read_summary / read_full
    updated_at  DATETIME NOT NULL,
    UNIQUE(paper_id)
);

user_notes: personal notes

CREATE TABLE user_notes (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id    INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
    content     TEXT NOT NULL,
    created_at  DATETIME NOT NULL,
    updated_at  DATETIME NOT NULL,
    UNIQUE(paper_id)
);

data_delete_jobs: manual deletion records

CREATE TABLE data_delete_jobs (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    date_start      DATE NOT NULL,
    date_end        DATE NOT NULL,
    include_notes   BOOLEAN DEFAULT TRUE,
    paper_count     INTEGER DEFAULT 0,
    status          TEXT NOT NULL, -- running / success / failed
    error           TEXT,
    started_at      DATETIME NOT NULL,
    completed_at    DATETIME
);

3. summary.json Schema

from pydantic import BaseModel, Field, field_validator


class Prerequisites(BaseModel):
    concepts: list[str] = Field(default_factory=list)
    level: str = ""


class Motivation(BaseModel):
    problem: str
    goal: str = ""
    gap: str = ""


class Method(BaseModel):
    overview: str = ""
    key_idea: str
    steps: list[str] = Field(default_factory=list)
    novelty: str = ""


class Results(BaseModel):
    main_findings: list[str] = Field(default_factory=list)
    benchmarks: list[dict] = Field(default_factory=list)
    limitations: list[str] = Field(default_factory=list)


class Improvements(BaseModel):
    weaknesses: list[str] = Field(default_factory=list)
    future_work: list[str] = Field(default_factory=list)
    reproducibility: str = ""


class SummarySchema(BaseModel):
    title_zh: str
    one_line: str
    tags: list[str]
    difficulty: str = ""
    paper_date: str | None = None
    prerequisites: Prerequisites = Field(default_factory=Prerequisites)
    motivation: Motivation
    method: Method
    results: Results = Field(default_factory=Results)
    improvements: Improvements = Field(default_factory=Improvements)

    @field_validator("title_zh", "one_line")
    @classmethod
    def non_empty_text(cls, value: str) -> str:
        if not value or not value.strip():
            raise ValueError("field cannot be empty")
        return value.strip()

    @field_validator("tags")
    @classmethod
    def non_empty_tags(cls, value: list[str]) -> list[str]:
        tags = [tag.strip() for tag in value if tag and tag.strip()]
        if not tags:
            raise ValueError("tags cannot be empty")
        return tags

The actual implementation should also apply the same non-empty validation to Motivation.problem and Method.key_idea; an empty string is treated as field_missing.

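Field levels

| Level | Fields | Handling |
| --- | --- | --- |
| Required | title_zh, one_line, tags, motivation.problem, method.key_idea | Fail and retry if missing |
| Important | motivation.goal, motivation.gap, method.overview, results.main_findings | May be stored if missing; marked degraded |
| Optional | benchmarks, limitations, improvements, prerequisites | Use default values if missing |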
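
The field levels can be checked on a plain parsed dict before or alongside the Pydantic model. The following is a rough sketch only: the function and helper names are hypothetical, and it treats None, blank strings, and empty lists as missing, which is an assumption about how "missing" should be read.

```python
REQUIRED = ("title_zh", "one_line", "tags", "motivation.problem", "method.key_idea")
IMPORTANT = (
    "motivation.goal",
    "motivation.gap",
    "method.overview",
    "results.main_findings",
)


def _get(data: dict, dotted: str):
    """Look up a dotted path such as 'motivation.problem'; None if absent."""
    value = data
    for key in dotted.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def _is_blank(value) -> bool:
    """Treat None, whitespace-only strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def grade_fields(summary: dict) -> tuple[str, list[str]]:
    """Return ('field_missing' | 'degraded' | 'normal', offending field paths)."""
    missing_required = [f for f in REQUIRED if _is_blank(_get(summary, f))]
    if missing_required:
        return "field_missing", missing_required
    missing_important = [f for f in IMPORTANT if _is_blank(_get(summary, f))]
    if missing_important:
        return "degraded", missing_important
    return "normal", []


complete = {
    "title_zh": "示例标题",
    "one_line": "One sentence.",
    "tags": ["demo"],
    "motivation": {"problem": "p", "goal": "g", "gap": "x"},
    "method": {"key_idea": "k", "overview": "o"},
    "results": {"main_findings": ["f"]},
}
no_goal = {**complete, "motivation": {"problem": "p", "goal": "", "gap": "x"}}
no_key_idea = {**complete, "method": {"key_idea": "  ", "overview": "o"}}

grade_complete = grade_fields(complete)
grade_no_goal = grade_fields(no_goal)
grade_no_key_idea = grade_fields(no_key_idea)
```

Returning the list of offending paths lets the caller store a precise error message for field_missing, or decide which page modules to mark as incomplete for degraded summaries.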

4. Validation and Error Handling

State transitions

pending -> processing -> done
                    └-> failed -> pending retry -> processing
                    └-> permanent_failure

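Error classification

| error_type | Scenario | Auto retry |
| --- | --- | --- |
| timeout | pi timed out | |
| pdf_download_failed | PDF download failed or the file is unreadable | |
| process_error | pi process exited with a non-zero code | |
| json_not_found | No JSON found in the output | |
| json_invalid | JSON parsing failed | |
| field_missing | A required field is missing | |
| schema_error | A field has an invalid type | |
| unknown | Uncategorized exception | |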

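The maximum number of automatic retries is 1. If the retry also fails, the paper is marked permanent_failure, and it can be rerun manually from the admin console.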
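
The retry limit can be expressed as a small pure function. This sketch assumes every error_type is handled the same way; any per-type exceptions are outside its scope, and the names `MAX_AUTO_RETRIES` and `next_status` are hypothetical.

```python
MAX_AUTO_RETRIES = 1


def next_status(retry_count: int) -> tuple[str, int]:
    """Decide where a failed summary goes next.

    Returns the new status and the retry_count to store with it.
    """
    if retry_count < MAX_AUTO_RETRIES:
        # Queue one automatic retry and record that it was used.
        return "pending", retry_count + 1
    # The automatic retry is spent; only a manual rerun can revive it.
    return "permanent_failure", retry_count


after_first_failure = next_status(0)
after_second_failure = next_status(1)
```

Keeping the decision free of database access makes it straightforward to unit test separately from the summary_status update.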

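Quality levels

| quality | Condition | Page behavior |
| --- | --- | --- |
| normal | Required and important fields are all complete | Full display |
| degraded | Required fields complete, some important fields missing | Missing modules show "incomplete" |
| low | Fields are present but the content is clearly hollow | Banner at the top: "AI summary quality is low" |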

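5. Deletion and Cleanup Policy

Temporary file cleanup

After each paper is processed, delete:

  • data/tmp/{arxiv_id}/paper.pdf
  • data/tmp/{arxiv_id}/source/
  • any other intermediate download files

Downloaded files should also be cleaned up when summarization fails, but raw_output.txt and the error log are kept.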
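
A minimal sketch of the cleanup step, assuming the per-paper temporary directory can be removed as a whole. The helper name `cleanup_tmp`, the `data_root` parameter, and the "demo-id" placeholder are illustrative; the demo runs against a throwaway directory rather than the real data/ tree.

```python
import shutil
import tempfile
from pathlib import Path


def cleanup_tmp(data_root: Path, arxiv_id: str) -> bool:
    """Remove data/tmp/{arxiv_id}/ and its contents; report whether it existed."""
    tmp_dir = data_root / "tmp" / arxiv_id
    if not tmp_dir.is_dir():
        return False
    shutil.rmtree(tmp_dir)
    return True


# Demo against a temporary directory that mimics the data/ layout.
with tempfile.TemporaryDirectory() as root:
    data_root = Path(root)
    (data_root / "tmp" / "demo-id" / "source").mkdir(parents=True)
    (data_root / "tmp" / "demo-id" / "paper.pdf").write_bytes(b"%PDF-")
    (data_root / "papers" / "demo-id").mkdir(parents=True)
    (data_root / "papers" / "demo-id" / "raw_output.txt").write_text("raw output")

    removed = cleanup_tmp(data_root, "demo-id")
    tmp_gone = not (data_root / "tmp" / "demo-id").exists()
    backup_kept = (data_root / "papers" / "demo-id" / "raw_output.txt").exists()
    second_call = cleanup_tmp(data_root, "demo-id")  # nothing left to remove
```

Since raw_output.txt lives under data/papers/{arxiv_id}/ and not under data/tmp/, removing the temporary directory on the failure path leaves it in place.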

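Manually deleting a date range

An administrator can delete the data whose paper_date falls within a given range. The deletion flow:

  1. Query the target papers.
  2. Delete user bookmarks, reading status, and notes.
  3. Delete summary/status/authors/tags.
  4. Delete the FTS5 index rows.
  5. Delete data/papers/{arxiv_id}/ and data/tmp/{arxiv_id}/.
  6. Physically delete the papers records.
  7. Write to data_delete_jobs and crawl_logs.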
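
The database part of this flow (steps 1 to 4 and 6) might look like the sketch below. It is illustrative only: it assumes child rows are removed by ON DELETE CASCADE instead of explicit deletes, uses trimmed-down stand-in tables and placeholder ids, assumes an FTS5-enabled SQLite build, and leaves file removal and the audit rows (steps 5 and 7) to the caller. The function name `delete_date_range` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys=ON")  # required for ON DELETE CASCADE
conn.executescript(
    """
    -- Trimmed-down stand-ins for the real tables, for illustration only.
    CREATE TABLE papers (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        arxiv_id TEXT UNIQUE NOT NULL,
        paper_date DATE NOT NULL
    );
    CREATE TABLE user_notes (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        paper_id INTEGER NOT NULL REFERENCES papers(id) ON DELETE CASCADE,
        content TEXT NOT NULL
    );
    CREATE VIRTUAL TABLE papers_fts USING fts5(title_en);
    """
)


def delete_date_range(conn, date_start, date_end):
    """Hard-delete papers in [date_start, date_end]; return their arxiv_ids."""
    with conn:  # one transaction: either everything is deleted or nothing
        rows = conn.execute(
            "SELECT id, arxiv_id FROM papers WHERE paper_date BETWEEN ? AND ?",
            (date_start, date_end),
        ).fetchall()
        for paper_id, _ in rows:
            # FTS rows are not covered by foreign keys, so remove them by hand.
            conn.execute("DELETE FROM papers_fts WHERE rowid = ?", (paper_id,))
            # Child tables follow through the cascade.
            conn.execute("DELETE FROM papers WHERE id = ?", (paper_id,))
    return [arxiv_id for _, arxiv_id in rows]


seed = [(1, "demo-a", "2026-06-04"), (2, "demo-b", "2026-06-05")]
for paper_id, arxiv_id, paper_date in seed:
    conn.execute("INSERT INTO papers VALUES (?, ?, ?)", (paper_id, arxiv_id, paper_date))
    conn.execute("INSERT INTO user_notes(paper_id, content) VALUES (?, 'note')", (paper_id,))
    conn.execute("INSERT INTO papers_fts(rowid, title_en) VALUES (?, 'demo title')", (paper_id,))
conn.commit()

deleted = delete_date_range(conn, "2026-06-05", "2026-06-05")
```

The returned arxiv_ids give the caller what it needs to remove the matching directories and to fill in paper_count when recording the job.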

If recoverable deletion is needed later, a deleted_at soft-delete column can be introduced at that point; the MVP does not implement it.


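6. ChromaDB Enhancement Design

ChromaDB is not part of the MVP. When it is integrated, index only the high-signal fields from paper_summaries:

  • Chinese title
  • English title
  • tags
  • one-line summary
  • motivation_problem
  • method_key_idea

The vector dimension must match EMBED_MODEL. Validate the embedding length before writing; on a mismatch, skip semantic indexing and log the event, without affecting normal pages or FTS search.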
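
The length check can be kept independent of any vector-store client. A minimal sketch follows; the function name, the logger name, and the `expected_dim` parameter (standing for whatever dimension EMBED_MODEL produces) are assumptions, and no ChromaDB call is shown.

```python
import logging

logger = logging.getLogger("semantic_index")


def embedding_is_valid(embedding, expected_dim: int, arxiv_id: str) -> bool:
    """Return True if the vector can be written; log and return False otherwise."""
    if len(embedding) != expected_dim:
        logger.warning(
            "skipping semantic index for %s: got %d dimensions, expected %d",
            arxiv_id,
            len(embedding),
            expected_dim,
        )
        return False
    return True


ok = embedding_is_valid([0.1, 0.2, 0.3], expected_dim=3, arxiv_id="demo-id")
mismatch = embedding_is_valid([0.1, 0.2], expected_dim=3, arxiv_id="demo-id")
```

The caller would write to the vector store only when the function returns True, so a mismatch costs a log line and nothing else.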