Theory 08 — Backup and Restore

The portal uses SQLite's online backup API (Connection.backup) to produce consistent snapshots while the process keeps running. Recovery objectives: RPO ≤ 24 h and RTO ≤ 15 min on a single machine, with the procedure verified quarterly. Off-machine replication is out of scope and is deferred to a future production phase.

Why SQLite online backup (not cp)

The portal runs SQLite in WAL mode (the usual choice for a FastAPI + SQLModel deployment that wants concurrent readers during writes; it is not SQLite's default journal mode and has to be enabled, for example with PRAGMA journal_mode=WAL). In WAL mode, a committed write goes into the *.db-wal sidecar file first; it is only folded back into the main *.db file at the next checkpoint. Two consequences:

  1. A naive cp portal.db snapshot.db captures the main file without the WAL. Any transactions committed since the last checkpoint are missing from the copy. At restore time you would silently lose the most recent minutes of writes — exactly the data you wanted to back up.
  2. If a checkpoint runs while cp is mid-stream, the copy can contain pages from two distinct states. The resulting file may open cleanly in sqlite3 and still fail PRAGMA integrity_check; worse, it may pass at first and only surface the corruption later under write pressure.
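The first consequence can be reproduced in a few lines. The sketch below is illustrative only: the table name `journal` and the file names are made up, and automatic checkpoints are disabled so the outcome does not depend on timing. It commits three rows that stay in the WAL, copies only the main file the way `cp` would, and counts rows in both.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path
from typing import Tuple


def naive_copy_loses_wal_rows() -> Tuple[int, int]:
    """Return (rows in the live DB, rows in a file-level copy of the main file)."""
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "portal.db"
        copy = Path(tmp) / "snapshot.db"

        # isolation_level=None puts the connection in autocommit mode.
        conn = sqlite3.connect(live, isolation_level=None)
        conn.execute("PRAGMA journal_mode=WAL").fetchall()
        conn.execute("PRAGMA wal_autocheckpoint=0").fetchall()  # no automatic checkpoints
        conn.execute("CREATE TABLE journal (entry TEXT)")
        # Fold the schema into the main file so the copy at least has the table.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()

        # These committed rows now exist only in portal.db-wal.
        conn.executemany("INSERT INTO journal VALUES (?)", [("a",), ("b",), ("c",)])

        shutil.copyfile(live, copy)  # equivalent of `cp portal.db snapshot.db`

        live_rows = conn.execute("SELECT COUNT(*) FROM journal").fetchone()[0]
        stale = sqlite3.connect(copy)
        copied_rows = stale.execute("SELECT COUNT(*) FROM journal").fetchone()[0]
        stale.close()
        conn.close()
        return live_rows, copied_rows
```

Under these assumptions the live database reports three rows and the copy reports none, with no error raised on either side.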

The SQLite online backup API (Connection.backup in Python's sqlite3 module) walks the database one page at a time while holding the appropriate locks. The result is a single file that represents a consistent snapshot at the moment the backup completed, including pending WAL data folded in. This is safe to run while the portal serves traffic.
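A minimal sketch of such a snapshot, assuming nothing about how scripts/portal_backup.py is actually written. The function name `online_backup` and its parameters are illustrative; only `sqlite3.Connection.backup` is the real standard-library call.

```python
import sqlite3
from pathlib import Path


def online_backup(source: Path, dest: Path) -> None:
    """Copy a live SQLite database to `dest` through the online backup API."""
    src = sqlite3.connect(source)
    dst = sqlite3.connect(dest)
    try:
        # With the default pages=-1 the whole database is copied in one step,
        # including committed data that is still sitting in the WAL.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

A call such as `online_backup(Path("portal.db"), Path("var/snapshots/2026-05-23-031500-portal.db"))` would produce one snapshot file; the destination path here only mirrors the naming shown in this document.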

After writing each snapshot, the backup script reopens it read-only and runs PRAGMA integrity_check. This is cheap (single-digit milliseconds for portal-sized DBs) and catches the rare cases where the destination disk returned success on a partial write.
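The verification step might look like the following sketch. The helper name `verify_snapshot` is hypothetical; it relies on SQLite's URI syntax (`mode=ro`) to open the file read-only and treats an unreadable file as a failed check.

```python
import sqlite3
from pathlib import Path


def verify_snapshot(path: Path) -> bool:
    """Return True only if PRAGMA integrity_check reports a single 'ok' row."""
    # mode=ro opens the snapshot read-only, so the check cannot alter it.
    conn = sqlite3.connect(f"{path.resolve().as_uri()}?mode=ro", uri=True)
    try:
        return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a SQLite database at all.
        return False
    finally:
        conn.close()
```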

What lives in a snapshot

var/snapshots/
├── 2026-05-23-031500-portal.db     ← portal DB online backup
├── 2026-05-23-031500-vault.db      ← vault DB online backup
├── 2026-05-22-031500-portal.db
├── 2026-05-22-031500-vault.db
├── .last_ok                        ← sentinel: { "label": ..., "ts": ... }
└── pre-restore-2026-05-23-040000-portal.db   ← created by restore (never rotated)

The portal and vault snapshots share a timestamp prefix. The rotate script treats them as a pair: keeping N pairs keeps both files for the same moment in time, so the vault never drifts ahead of the portal it protects.
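Pair-wise rotation can be sketched as below. This is not the code of scripts/portal_snapshot_rotate.py; the function name and the regular expression are assumptions derived from the file names in the listing above. Files that do not match the pattern, such as the pre-restore backups and the sentinel, are never selected.

```python
import re
from collections import defaultdict
from typing import Iterable, List

# Matches e.g. "2026-05-23-031500-portal.db"; the prefix is the pair's timestamp.
_SNAPSHOT = re.compile(r"^(\d{4}-\d{2}-\d{2}-\d{6})-(portal|vault)\.db$")


def files_to_rotate(names: Iterable[str], keep: int) -> List[str]:
    """Return the snapshot files to delete so that the newest `keep` pairs remain."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    pairs = defaultdict(list)
    for name in names:
        match = _SNAPSHOT.match(name)
        if match:  # pre-restore-* files and .last_ok do not match
            pairs[match.group(1)].append(name)
    # For this timestamp format, lexicographic order is chronological order.
    stamps = sorted(pairs, reverse=True)
    return sorted(name for stamp in stamps[keep:] for name in pairs[stamp])
```

Grouping by timestamp prefix before counting is what keeps a portal file and its vault file together: either both survive or both are listed for deletion.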

The .last_ok sentinel is a JSON file with two fields:

{"label": "cron", "ts": "2026-05-23T03:15:00+00:00"}

The last_snapshot_age_seconds Prometheus gauge reads this file on every /metrics scrape. If the file is missing or older than the cron interval, the dashboard turns red.
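How the age could be derived from the sentinel is sketched below. The function names and the 24-hour threshold are illustrative assumptions, not the probe in src/miniportal/obs_extended/service.py; the sentinel text is passed in as a string, with `None` standing for a missing file.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def snapshot_age_seconds(sentinel_text: Optional[str], now: datetime) -> Optional[float]:
    """Seconds since the last successful snapshot, or None if there is no sentinel."""
    if sentinel_text is None:
        return None
    ts = datetime.fromisoformat(json.loads(sentinel_text)["ts"])
    return (now - ts).total_seconds()


def is_stale(age: Optional[float], max_age_seconds: float = 24 * 3600) -> bool:
    """A missing sentinel counts as stale, as does one older than the interval."""
    return age is None or age > max_age_seconds


example = '{"label": "cron", "ts": "2026-05-23T03:15:00+00:00"}'
one_hour_later = datetime(2026, 5, 23, 4, 15, tzinfo=timezone.utc)
age = snapshot_age_seconds(example, one_hour_later)  # 3600.0
```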

Schedule

Recommended cron:

# /etc/cron.d/miniportal-snapshot
# cron has no backslash line continuation, so the whole entry stays on one line.
15 3 * * *   portaluser   cd /opt/lynx-cortex && uv run python scripts/portal_backup.py --label cron && uv run python scripts/portal_snapshot_rotate.py --keep 14

Nightly at 03:15 UTC, far from peak portal traffic, which is daytime for the learner cohort. Cron evaluates the schedule in the host's time zone, so this assumes the host clock is set to UTC. The rotate step runs immediately after the backup so the disk footprint stays bounded at 14 pairs ≈ 14 × (portal_size + vault_size). For the §A13 grammar-tutor portal that is a fraction of a megabyte, so keeping 14 days is generous.

Manual snapshots are available to admins through the /admin/obs dashboard ("snapshot now" button → POST /admin/obs/snapshot-now). Manual snapshots are stamped with --label manual and counted in rotation alongside cron snapshots.

Restore procedure

  1. Stop the portal. The restore script refuses to overwrite a DB that another process holds open (it acquires an exclusive flock). Treat this as a safety net rather than a guarantee: a forced overwrite of a live DB would corrupt in-flight writes, so stopping the portal first is still required.
  2. Pick a snapshot. List files in var/snapshots/; pick the most recent pair whose timestamp predates the incident.
  3. Dry-run first.
uv run python scripts/portal_restore.py \
    --snapshot var/snapshots/2026-05-22-031500-portal.db \
    --target portal --dry-run

The dry-run runs PRAGMA integrity_check against the snapshot and prints what would be done. It never modifies anything on disk.

  4. Real restore.

uv run python scripts/portal_restore.py \
    --snapshot var/snapshots/2026-05-22-031500-portal.db \
    --target portal --i-know-what-im-doing

Before overwriting, the script writes a pre-restore backup of the current live DB to var/snapshots/pre-restore-{ts}-portal.db. These are never rotated automatically; they are your "undo". If after the restore you realise you picked the wrong snapshot, you can restore from the pre-restore backup.

  5. Restart the portal. Verify the admin obs dashboard shows the expected state (snapshot age resets only on the next successful backup; restore does not reset the sentinel).
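The core of steps 3 and 4 can be sketched as follows. This is an assumption-laden outline, not scripts/portal_restore.py: the function names and parameters are hypothetical, the exclusive lock from step 1 is left out, and the caller is expected to have stopped the portal already.

```python
import os
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def integrity_ok(path: Path) -> bool:
    """True if PRAGMA integrity_check reports a single 'ok' row."""
    conn = sqlite3.connect(f"{path.resolve().as_uri()}?mode=ro", uri=True)
    try:
        return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]
    except sqlite3.DatabaseError:
        return False
    finally:
        conn.close()


def restore(snapshot: Path, live: Path, snapshots_dir: Path,
            dry_run: bool = True) -> Optional[Path]:
    """Restore `snapshot` over `live`; return the pre-restore backup path."""
    if not integrity_ok(snapshot):
        raise ValueError(f"{snapshot} failed integrity_check")
    if dry_run:
        print(f"would restore {snapshot} over {live}")
        return None

    # 1. Save the current live DB through the online backup API (the "undo").
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
    undo = snapshots_dir / f"pre-restore-{ts}-{live.stem}.db"
    src, dst = sqlite3.connect(live), sqlite3.connect(undo)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # 2. Remove stale sidecars so old WAL frames are not replayed on the new file.
    for suffix in ("-wal", "-shm"):
        Path(f"{live}{suffix}").unlink(missing_ok=True)

    # 3. Copy next to the target, then swap it in with an atomic rename.
    staging = live.with_name(live.name + ".restoring")
    shutil.copyfile(snapshot, staging)
    os.replace(staging, live)
    return undo
```

Defaulting `dry_run` to True mirrors the intent of the explicit --i-know-what-im-doing flag: the destructive path has to be asked for.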

RPO / RTO targets

| Metric | Target | How achieved |
|---|---|---|
| RPO (Recovery Point Objective) | ≤ 24 h | Nightly cron snapshot. Worst case: the incident occurs at 03:14 UTC, just before the 03:15 cron, so we lose 23h 59m of writes. |
| RTO (Recovery Time Objective) | ≤ 15 min | Stop portal (1 min), dry-run + restore (2 min), restart and smoke-check (5 min); the remaining 7 min is "operator reads the incident notes". |

These are single-machine numbers. Recovering from a hardware-level loss (disk failure, machine compromise) is out of scope for Phase 41 — that requires off-machine snapshot replication (S3, restic to a remote, etc.), which is deferred to a production deployment phase.
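The two targets are simple arithmetic, which the short check below spells out. The dates are examples chosen to match the snapshot names used in this document; the step durations are the ones from the RTO row.

```python
from datetime import datetime, timedelta, timezone

# Worst-case RPO: the incident lands one minute before the next nightly snapshot.
last_snapshot = datetime(2026, 5, 22, 3, 15, tzinfo=timezone.utc)
incident = datetime(2026, 5, 23, 3, 14, tzinfo=timezone.utc)
worst_case_loss = incident - last_snapshot  # 23 h 59 min of writes

# RTO budget in minutes, step by step.
rto_steps_min = {
    "stop portal": 1,
    "dry-run + restore": 2,
    "restart and smoke-check": 5,
    "operator reads the incident notes": 7,
}
rto_total_min = sum(rto_steps_min.values())  # 15
```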

What is NOT in scope

  • Off-machine replication. See above; deferred.
  • Encryption-at-rest for snapshots. The vault snapshot inherits the vault's column-level encryption; the portal snapshot does not. In Phase 41 the portal DB does not contain plaintext passwords or session keys (those live in the vault) — so a leaked snapshot leaks per-student metadata but not credentials. If the threat model later requires encrypted snapshots, that is a follow-up phase.
  • Point-in-time recovery (PITR). SQLite has no built-in WAL archiving. Daily granularity is the contract.

Test restore quarterly

A snapshot you have never restored is not a backup; it is a hopeful file. Schedule a calendar reminder:

  1. On the first Monday of each quarter, copy yesterday's snapshot pair into a scratch directory.
  2. Run portal_restore.py --dry-run against it.
  3. Spin up a throwaway portal instance pointed at the restored DBs.
  4. Log in as the seeded admin; verify three random student journal entries render; verify the audit log contains rows.
  5. Write a one-line entry in experiments/41-portal-dr-drill/<date>.md confirming the drill ran.

The quarterly drill is the artefact that lets you claim a 15-minute RTO in good faith — without it, the number is aspirational.
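Part of step 4 of the drill can be scripted. The sketch below counts rows in a restored database opened read-only; the helper name is hypothetical, and the table names are passed in by the caller because the portal's actual schema is not described here.

```python
import sqlite3
from pathlib import Path
from typing import Dict, Iterable


def drill_row_counts(db_path: Path, tables: Iterable[str]) -> Dict[str, int]:
    """Count rows per table in a restored DB, opened read-only."""
    conn = sqlite3.connect(f"{db_path.resolve().as_uri()}?mode=ro", uri=True)
    try:
        counts = {}
        for table in tables:
            # Quote the identifier; table names cannot be bound as parameters.
            quoted = '"' + table.replace('"', '""') + '"'
            counts[table] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()
```

A drill could call it with whatever the audit-log table is actually called and fail if any count is zero. It complements the manual login and rendering checks and does not replace them.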

Cross-references

  • docs/phase-41-learner-portal/theory/07-observability.md — the last_snapshot_age_seconds gauge surfaces the freshness of these snapshots on the dashboard.
  • src/miniportal/obs_extended/service.py — the probes that update that gauge on every /metrics scrape.
  • scripts/portal_backup.py, scripts/portal_restore.py, scripts/portal_snapshot_rotate.py — the three operator-facing entry points.

Common pitfalls

  • Using shutil.copy on a live SQLite file. Described above. Use the online backup API.
  • Forgetting the vault. A portal restore without a matching vault restore can leave password-set tokens unredeemable. Restore the pair.
  • Trusting an un-verified snapshot. Always run PRAGMA integrity_check on the snapshot before treating it as the source of truth. Both portal_backup.py and portal_restore.py do this for you; do not skip it when restoring manually.
  • Rotating away your only good snapshot. Set --keep to at least 14 on a daily schedule, so a week of bad backups still leaves you a week of good ones to fall back to.