Theory 08 — Backup and Restore
The portal uses SQLite's online backup API (`Connection.backup`) to produce consistent snapshots while the process keeps running. Recovery objectives: RPO ≤ 24 h and RTO ≤ 15 min on a single machine, with the procedure verified quarterly. Off-machine replication is out of scope; it is deferred to a future production phase.
Why SQLite online backup (not cp)
The portal runs SQLite in WAL mode, the usual choice for a FastAPI + SQLModel
deployment that wants concurrent readers during writes. WAL is not SQLite's
default journal mode (the default is DELETE), so it has to be enabled
explicitly with PRAGMA journal_mode=WAL. In WAL mode, a committed write goes
into the *.db-wal sidecar file first; it is only folded back into the main
*.db file at the next checkpoint. Two consequences:
- A naive `cp portal.db snapshot.db` captures the main file without the WAL. Any transactions committed since the last checkpoint are missing from the copy. At restore time you would silently lose the most recent minutes of writes, which is exactly the data you wanted to back up.
- If a checkpoint runs while `cp` is mid-stream, the copy can contain pages from two distinct states. The resulting file may open cleanly with `sqlite3 .open` yet fail `PRAGMA integrity_check`, or worse, pass at first and surface corruption later under write pressure.
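The first consequence can be reproduced in a few lines. The sketch below is illustrative only: the table name `journal` and the file names are assumptions, and auto-checkpointing is switched off so that the committed rows are guaranteed to still be sitting in the WAL when the copy is taken.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def demo_naive_copy_loses_wal() -> tuple:
    """Return (rows visible in the live DB, rows visible in a naive file copy)."""
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "portal.db"      # hypothetical live DB
        copy = Path(tmp) / "snapshot.db"    # hypothetical naive copy

        conn = sqlite3.connect(live)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA wal_autocheckpoint=0")  # keep commits in the WAL
        conn.execute("CREATE TABLE journal (entry TEXT)")
        conn.commit()
        # Fold the schema into the main file so the copy at least has the table.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()

        conn.executemany("INSERT INTO journal VALUES (?)", [("a",), ("b",), ("c",)])
        conn.commit()  # committed, but so far only in portal.db-wal

        shutil.copy(live, copy)  # equivalent to `cp portal.db snapshot.db`

        live_rows = conn.execute("SELECT count(*) FROM journal").fetchone()[0]
        stale = sqlite3.connect(copy)
        copied_rows = stale.execute("SELECT count(*) FROM journal").fetchone()[0]
        stale.close()
        conn.close()
        return live_rows, copied_rows
```

Under these assumptions the live database reports three rows while the copy reports none, even though every insert was committed before the copy started.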
The SQLite online backup API (Connection.backup in Python's sqlite3
module) walks the database one page at a time while holding the appropriate
locks. The result is a single file that represents a consistent snapshot at
the moment the backup completed, including pending WAL data folded in.
This is safe to run while the portal serves traffic.
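A minimal sketch of such a backup, using only the standard `sqlite3` module, might look like the following. The function name `online_backup` and its parameters are hypothetical; this is not the contents of `scripts/portal_backup.py`, only an illustration of the `Connection.backup` call it is described as using.

```python
import sqlite3
from pathlib import Path


def online_backup(live_db: Path, snapshot: Path) -> None:
    """Copy live_db into snapshot with SQLite's online backup API."""
    snapshot.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(snapshot)
    try:
        # With the default pages=-1 the whole database is copied in one step.
        # Committed data that is still in the WAL is read through `src`,
        # so it ends up in the snapshot as well.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Passing a positive `pages` value to `backup` copies in smaller batches, which lets writers interleave on large databases; for a portal-sized DB the single-step default is likely sufficient.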
After writing each snapshot, the backup script reopens it read-only and runs
PRAGMA integrity_check. This is cheap (single-digit milliseconds for
portal-sized DBs) and catches the rare cases where the destination disk
returned success on a partial write.
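The verification step can be sketched as below. The helper name `snapshot_is_intact` is an assumption, and the read-only open relies on SQLite's URI syntax (`mode=ro`), which Python exposes through `sqlite3.connect(..., uri=True)`.

```python
import sqlite3
from pathlib import Path


def snapshot_is_intact(snapshot: Path) -> bool:
    """Open the snapshot read-only and report whether integrity_check passes."""
    uri = f"{snapshot.resolve().as_uri()}?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return False  # e.g. the file does not exist
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        return False  # e.g. "file is not a database"
    finally:
        conn.close()
    # A healthy database yields exactly one row containing "ok".
    return rows == [("ok",)]
```

Treating any exception as a failed check keeps the caller simple: a snapshot is either provably good or it is not used.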
What lives in a snapshot
var/snapshots/
├── 2026-05-23-031500-portal.db ← portal DB online backup
├── 2026-05-23-031500-vault.db ← vault DB online backup
├── 2026-05-22-031500-portal.db
├── 2026-05-22-031500-vault.db
├── .last_ok ← sentinel: { "label": ..., "ts": ... }
└── pre-restore-2026-05-23-040000-portal.db ← created by restore (never rotated)
The portal and vault snapshots share a timestamp prefix. The rotate script treats them as a pair: keeping N pairs keeps both files for the same moment in time, so the vault never drifts ahead of the portal it protects.
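Pair-aware rotation can be sketched as follows. This is an assumption about how such a script could work, not the actual `portal_snapshot_rotate.py`: the regular expression is inferred from the file names in the tree above, and `rotate_pairs` is a hypothetical name.

```python
import re
from pathlib import Path

# Matches e.g. "2026-05-23-031500-portal.db". Files named "pre-restore-..."
# and the ".last_ok" sentinel do not match, so they are never rotated.
SNAPSHOT_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}-\d{6})-(portal|vault)\.db$")


def rotate_pairs(snapshot_dir: Path, keep: int) -> list:
    """Delete all but the newest `keep` timestamp groups; return removed paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    groups = {}
    for path in snapshot_dir.iterdir():
        match = SNAPSHOT_RE.match(path.name)
        if match:
            groups.setdefault(match.group(1), []).append(path)
    # The timestamp prefix sorts lexicographically in chronological order.
    stale_stamps = sorted(groups)[:-keep]
    removed = []
    for stamp in stale_stamps:
        for path in groups[stamp]:
            path.unlink()
            removed.append(path)
    return sorted(removed)
```

Grouping by prefix before deleting is what keeps the portal and vault files for one moment in time together: a group is either kept whole or removed whole.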
The .last_ok sentinel is a JSON file with two fields, `label` and `ts` (shown in the tree above).
The last_snapshot_age_seconds Prometheus gauge reads this file on every
/metrics scrape. If the file is missing or older than the cron interval,
the dashboard turns red.
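One way the gauge value could be derived from the sentinel is sketched below. The format of `ts` is not specified in this document; the sketch assumes Unix epoch seconds, and both function names are hypothetical rather than taken from `src/miniportal/obs_extended/service.py`.

```python
import json
import time
from pathlib import Path
from typing import Optional


def last_snapshot_age_seconds(sentinel: Path, now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the last good snapshot, or None if it is unknown."""
    try:
        payload = json.loads(sentinel.read_text(encoding="utf-8"))
        ts = float(payload["ts"])  # ASSUMPTION: ts is Unix epoch seconds
    except (OSError, ValueError, KeyError, TypeError):
        return None  # missing, unreadable, or malformed sentinel
    current = time.time() if now is None else now
    return max(0.0, current - ts)


def snapshot_is_stale(age: Optional[float], max_age_seconds: float = 24 * 3600) -> bool:
    """A missing sentinel counts as stale, matching the 'dashboard turns red' rule."""
    return age is None or age > max_age_seconds
```

Returning `None` for a missing or malformed file, instead of raising, keeps a broken sentinel from failing the whole `/metrics` scrape while still being reported as stale.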
Schedule
Recommended cron:
# /etc/cron.d/miniportal-snapshot
# A cron entry must stay on one line: crontab files do not support "\" line continuations.
15 3 * * * portaluser cd /opt/lynx-cortex && uv run python scripts/portal_backup.py --label cron && uv run python scripts/portal_snapshot_rotate.py --keep 14
Nightly at 03:15 UTC (cron fires in the host's time zone, so this assumes the host clock is set to UTC), well away from peak portal traffic, which is daytime for the learner cohort. The rotate step runs immediately after the backup so the disk footprint stays bounded at 14 pairs ≈ 14 × (portal_size + vault_size). For the §A13 grammar-tutor portal that is a fraction of a megabyte, so keeping 14 days is generous.
Manual snapshots are available to admins through the /admin/obs dashboard
("snapshot now" button → POST /admin/obs/snapshot-now). Manual snapshots
are stamped with --label manual and counted in rotation alongside cron
snapshots.
Restore procedure
1. Stop the portal. The restore script refuses to overwrite a DB that
another process holds open (it acquires an exclusive `flock`). This is a
safety net, not a substitute for stopping the portal: a forced overwrite of
a live DB would corrupt in-flight writes.
2. Pick a snapshot. List the files in `var/snapshots/` and pick the most
recent pair whose timestamp predates the incident.
3. Dry-run first.
uv run python scripts/portal_restore.py \
--snapshot var/snapshots/2026-05-22-031500-portal.db \
--target portal --dry-run
The dry-run verifies the snapshot's PRAGMA integrity_check and prints
what would be done. It never modifies the live DB.
4. Real restore.
uv run python scripts/portal_restore.py \
--snapshot var/snapshots/2026-05-22-031500-portal.db \
--target portal --i-know-what-im-doing
Before overwriting, the script writes a pre-restore backup of the
current live DB to var/snapshots/pre-restore-{ts}-portal.db. These are
never rotated automatically — they are your "undo". If after restore
you realise you picked the wrong snapshot, you can restore from the
pre-restore backup.
5. Restart the portal. Verify the admin obs dashboard shows the
expected state (snapshot age resets only on the next successful
backup; restore does not reset the sentinel).
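The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the real `portal_restore.py`: the function and helper names are hypothetical, the undo copy is written next to the snapshot, and `fcntl.flock` limits the sketch to Unix-like hosts.

```python
import fcntl
import sqlite3
import time
from pathlib import Path
from typing import Optional


def _integrity_ok(db: Path) -> bool:
    conn = sqlite3.connect(db)
    try:
        return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]
    except sqlite3.DatabaseError:
        return False
    finally:
        conn.close()


def _backup(src_path: Path, dst_path: Path) -> None:
    src, dst = sqlite3.connect(src_path), sqlite3.connect(dst_path)
    try:
        src.backup(dst)  # overwrites the destination database
    finally:
        dst.close()
        src.close()


def restore(snapshot: Path, live_db: Path, target: str, dry_run: bool = True) -> Optional[Path]:
    """Restore snapshot over live_db; return the pre-restore undo file, if any."""
    if not _integrity_ok(snapshot):
        raise RuntimeError(f"{snapshot} failed PRAGMA integrity_check")
    if dry_run:
        print(f"would restore {snapshot} over {live_db}")
        return None
    with open(live_db, "rb") as handle:
        try:
            # Non-blocking exclusive lock: fail fast if someone else holds it.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise RuntimeError("live DB is locked; stop the portal first") from exc
        stamp = time.strftime("%Y-%m-%d-%H%M%S", time.gmtime())
        undo = snapshot.parent / f"pre-restore-{stamp}-{target}.db"
        _backup(live_db, undo)      # the "undo" copy of the current live DB
        _backup(snapshot, live_db)  # the actual restore
    return undo
```

Defaulting `dry_run` to `True` mirrors the procedure's ordering: the destructive path has to be requested explicitly. Note that `flock` is advisory, so it only detects processes that cooperate by taking the same lock; stopping the portal first remains the real safeguard.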
RPO / RTO targets
| Metric | Target | How achieved |
|---|---|---|
| RPO (Recovery Point Objective) | ≤ 24 h | Nightly cron snapshot. Worst case: incident occurs at 03:14 UTC, just before the 03:15 cron — we lose 23h 59m of writes. |
| RTO (Recovery Time Objective) | ≤ 15 min | Stop portal (1 min), dry-run + restore (2 min), restart and smoke-check (5 min); the remaining 7 min is "operator reads the incident notes". |
These are single-machine numbers. Recovering from a hardware-level loss (disk failure, machine compromise) is out of scope for Phase 41 — that requires off-machine snapshot replication (S3, restic to a remote, etc.), which is deferred to a production deployment phase.
What is NOT in scope
- Off-machine replication. See above; deferred.
- Encryption-at-rest for snapshots. The vault snapshot inherits the vault's column-level encryption; the portal snapshot does not. In Phase 41 the portal DB does not contain plaintext passwords or session keys (those live in the vault) — so a leaked snapshot leaks per-student metadata but not credentials. If the threat model later requires encrypted snapshots, that is a follow-up phase.
- Point-in-time recovery (PITR). SQLite has no built-in WAL archiving. Daily granularity is the contract.
Test restore quarterly
A snapshot you have never restored is not a backup; it is a hopeful file. Schedule a calendar reminder:
- On the first Monday of each quarter, copy yesterday's snapshot pair into a scratch directory.
- Run `portal_restore.py --dry-run` against it.
- Spin up a throwaway portal instance pointed at the restored DBs.
- Log in as the seeded admin; verify three random student journal entries render; verify the audit log contains rows.
- Write a one-line entry in `experiments/41-portal-dr-drill/<date>.md` confirming the drill ran.
The quarterly drill is the artefact that lets you claim a 15-minute RTO in good faith — without it, the number is aspirational.
Cross-references
- `docs/phase-41-learner-portal/theory/07-observability.md`: the `last_snapshot_age_seconds` gauge surfaces the freshness of these snapshots on the dashboard.
- `src/miniportal/obs_extended/service.py`: the probes that update that gauge on every `/metrics` scrape.
- `scripts/portal_backup.py`, `scripts/portal_restore.py`, `scripts/portal_snapshot_rotate.py`: the three operator-facing entry points.
Common pitfalls
- Using `shutil.copy` on a live SQLite file. It has the same problem as `cp`, described above. Use the online backup API.
- Forgetting the vault. A portal restore without a matching vault restore can leave password-set tokens unredeemable. Restore the pair.
- Trusting an un-verified snapshot. Always run `PRAGMA integrity_check` on the snapshot before treating it as the source of truth. Both `portal_backup.py` and `portal_restore.py` do this for you; do not skip it when restoring manually.
- Rotating away your only good snapshot. Set `--keep` to at least 14 on a daily schedule, so a week of bad backups still leaves you a week of good ones to fall back to.