Lab 06 — Deploy on a single VPS, daily backups, disaster-recovery drill¶
Live portal on a single server: Caddy in front, uvicorn behind, SQLite in /var/lib/lynx-portal/. Daily backup with BorgBackup to an off-site location. The real test: drop the database, restore it from yesterday's backup, and check that nothing is missing. The Phase 41 exit gate: an external person signs up, gets a question wrong, and the next day that question comes back.
Goal¶
Deploy the portal on a single VPS in a way Borja can re-execute in an afternoon. Caddy terminates TLS and reverse-proxies to uvicorn; SQLite lives on local disk under /var/lib/lynx-portal/. BorgBackup (or restic) snapshots the DB nightly to an off-site, encrypted destination. The exit gate is a disaster-recovery drill + a real external user who completes the loop end-to-end.
Why this lab exists¶
A portal that runs only on Borja's laptop is not a portal. Lab 06 is what makes Phase 41's effort durable: someone other than Borja, on a different network, on a different day, can use the system. The single-VPS pattern is chosen because (a) it is what an actual classroom of 3–20 students needs, (b) it has zero managed-service lock-in, and (c) the disaster-recovery drill is tractable on it: you cannot easily drill RDS, but you can absolutely drill a SQLite file.
The exit gate (real external user) is the only honest test that the lab-01..05 chain is coherent. Internal tests can pass while the deployed system is unusable; the external sign-up + fail-a-question + see-it-tomorrow flow proves the whole loop.
Prerequisites¶
- Labs 00–05 done; all tests green.
- A VPS is provisioned (any small instance — 1 vCPU, 1 GB RAM, 10 GB disk is enough for ≤ 50 learners).
- An off-site backup target exists (a second VPS, S3-compatible bucket, or Borja's home NAS reachable over SSH).
- Phase 37 secret-handling rules are internalized (no secrets in git; vault env file under 0600).
Deliverables¶
- deploy/Caddyfile: reverse-proxy + automatic Let's Encrypt config.
- deploy/systemd/lynx-portal.service: uvicorn under systemd, restart policy, user/group.
- deploy/install.sh: idempotent first-time install on a fresh VPS (apt deps, user creation, dir layout, systemd enable).
- deploy/backup/borg-backup.sh: daily snapshot script.
- deploy/backup/borg-restore.sh: restore-from-snapshot script.
- deploy/backup/crontab: fragment installed by install.sh.
- deploy/README.md: runbook (sign-off checklist for a new deploy).
- tests/integration/test_disaster_recovery.py: drops + restores + verifies.
- experiments/41-deploy/dr-drill-2026-XX-XX.md: log of the actual drill run.
- experiments/41-deploy/external-user-walkthrough.md: the exit-gate transcript.
Step 1 — The VPS layout¶
/var/lib/lynx-portal/
    portal.db               # SQLite, WAL mode
    portal.db-wal           # write-ahead log
    portal.db-shm           # shared memory
    audit/YYYY-MM-DD.log    # JSONL audit
    uploads/                # if any (notes attachments; out of scope for lab 06)
/etc/lynx-portal/
    portal.env              # 0600 root:portal; secrets (pepper, session_secret)
    borg.env                # borg repo passphrase, kept apart from portal.env (see Step 6)
/opt/lynx-portal/
    app/                    # checkout of the repo OR a uv-built venv slug
    venv/                   # uv-managed
/var/log/lynx-portal/
    portal.log              # if not using systemd-journald
    backup.log              # appended by the backup script (see Step 6)
Ownership: portal:portal for everything under /var/lib/lynx-portal/ and /opt/lynx-portal/. Mode 0750 on directories, 0640 on regular files. The portal system user is a no-login account (/usr/sbin/nologin).
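The mode rules above are easy to drift from after a few manual fixes on the server. A minimal sketch of an audit helper follows; the function name `find_permission_violations` is hypothetical and not part of the lab's deliverables, and it checks modes only, not ownership.

```python
import stat
from pathlib import Path


def find_permission_violations(root, dir_mode=0o750, file_mode=0o640):
    """Return (path, actual_mode) pairs under root that deviate from the expected modes."""
    violations = []
    root = Path(root)
    # Check the root directory itself, then everything below it in a stable order.
    for path in [root, *sorted(root.rglob("*"))]:
        st = path.lstat()
        if stat.S_ISLNK(st.st_mode):
            continue  # symlink permission bits are not meaningful on Linux
        actual = stat.S_IMODE(st.st_mode)
        expected = dir_mode if stat.S_ISDIR(st.st_mode) else file_mode
        if actual != expected:
            violations.append((str(path), oct(actual)))
    return violations
```

Run against /var/lib/lynx-portal, an empty list means the tree matches the 0750/0640 rule; anything else is a list of paths to fix before sign-off.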
Step 2 — install.sh¶
#!/usr/bin/env bash
# deploy/install.sh
# Idempotent. Safe to re-run.
set -euo pipefail
# Lab 06 step 2: implement the following actions, each guarded by an "is it already done?" check.
#
# 1. apt-get install -y caddy python3 sqlite3 borgbackup curl
# 2. id -u portal || useradd --system --home /var/lib/lynx-portal --shell /usr/sbin/nologin portal
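# 3. mkdir -p /var/lib/lynx-portal /var/log/lynx-portal /etc/lynx-portal /opt/lynx-portal
# 4. chown -R portal:portal /var/lib/lynx-portal /var/log/lynx-portal /opt/lynx-portal
#    chmod 0750 /var/lib/lynx-portal /var/log/lynx-portal
#    # The unit's ReadWritePaths= names both directories; systemd refuses to start it if one is missing.
# 5. [ -e /etc/lynx-portal/portal.env ] || \
#      install -m 0600 -o root -g portal portal.env.template /etc/lynx-portal/portal.env
#    # Never overwrite an existing file: the operator fills in real secrets after this script exits.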
# 6. install -m 0644 deploy/Caddyfile /etc/caddy/Caddyfile
# 7. install -m 0644 deploy/systemd/lynx-portal.service /etc/systemd/system/
# 8. systemctl daemon-reload
# systemctl enable --now lynx-portal
# systemctl reload caddy
# 9. install -m 0755 deploy/backup/borg-backup.sh /usr/local/bin/lynx-backup
# crontab -u portal deploy/backup/crontab
echo "Lab 06 step 2 — implement the install script."
exit 1
The script ends with a printed checklist of manual steps Borja still has to do (fill in /etc/lynx-portal/portal.env, run borg init against the remote repo, point the domain's DNS at the VPS). It does not try to do those itself — those are credentials-bearing actions and belong to the operator.
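The "is it already done?" guard matters most for the secrets template: a naive re-run would overwrite the values the operator filled in. The real install.sh is bash; the sketch below shows the same guard in Python so it can be exercised in a unit test. The helper name `install_once` is an assumption made for illustration.

```python
import os
import shutil
from pathlib import Path


def install_once(template, dest, mode=0o600):
    """Copy template to dest with the given mode unless dest already exists.

    Returns True if the file was installed, False if it was left untouched.
    """
    dest = Path(dest)
    if dest.exists():
        return False  # already done: never overwrite operator-edited secrets
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, dest)
    os.chmod(dest, mode)
    return True
```

Calling it twice is safe by construction: the second call returns False and leaves the file byte-for-byte as the operator left it.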
Step 3 — Caddyfile¶
# deploy/Caddyfile
portal.example.com {
    encode zstd gzip
    log {
        output file /var/log/caddy/access.log
        format console
    }
    @healthz path /healthz
    handle @healthz {
        reverse_proxy 127.0.0.1:8001
    }
    handle /static/* {
        root * /opt/lynx-portal/app/src/miniportal
        file_server
    }
    handle {
        reverse_proxy 127.0.0.1:8001
    }
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains"
        X-Content-Type-Options "nosniff"
        Referrer-Policy "strict-origin-when-cross-origin"
        # CSP is set per-route by the app (some pages need inline event handlers; locked down in lab 03).
    }
}
Replace portal.example.com at deploy time; document the choice in deploy/README.md so the next deployer does not forget.
Step 4 — systemd unit¶
# deploy/systemd/lynx-portal.service
[Unit]
Description=lynx-cortex learner portal
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=portal
Group=portal
WorkingDirectory=/opt/lynx-portal/app
EnvironmentFile=/etc/lynx-portal/portal.env
ExecStart=/opt/lynx-portal/venv/bin/uvicorn miniportal.app:make_app --factory --host 127.0.0.1 --port 8001
Restart=on-failure
RestartSec=3
# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/lib/lynx-portal /var/log/lynx-portal
ProtectHome=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
LockPersonality=true
[Install]
WantedBy=multi-user.target
The hardening lines come from Phase 37's hardening checklist; they are non-negotiable for production. The integration test in experiments/41-deploy/ runs systemd-analyze security lynx-portal.service and expects an exposure score ≤ 3.0 (the scale runs from 0 to 10; lower is better).
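One way to turn the score check into an assertion is to parse the summary line that systemd-analyze security prints last. The sketch below assumes that line reads "Overall exposure level for UNIT: N.N LABEL", which is the format on the systemd versions seen at the time of writing; verify it on the target VPS, since the wording may differ between releases. Both function names are hypothetical.

```python
import re

# Assumed summary-line format; check it against the systemd version on the VPS.
_EXPOSURE_RE = re.compile(r"Overall exposure level for (\S+): (\d+(?:\.\d+)?)")


def exposure_score(report):
    """Extract the overall exposure level from `systemd-analyze security` output."""
    match = _EXPOSURE_RE.search(report)
    if match is None:
        raise ValueError("no 'Overall exposure level' line found")
    return float(match.group(2))


def passes_gate(report, threshold=3.0):
    """True when the unit's exposure score is at or below the lab's threshold."""
    return exposure_score(report) <= threshold
```

In the integration test, `report` would be the captured stdout of the command; failing loudly on a missing summary line keeps a format change from silently passing the gate.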
Step 5 — SQLite + WAL¶
SQLite is a single file. That is the strength (trivial backups) and the weakness (one bad write corrupts everything). Four non-negotiable settings:
-- Configured in miniportal.app on startup
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL; -- WAL + NORMAL is the standard recommendation
PRAGMA foreign_keys = ON;
PRAGMA busy_timeout = 5000;
Do not host the DB on an NFS/SMB mount. SQLite + network filesystems is a documented anti-pattern; the flock/fcntl semantics required for WAL are not reliable across many network filesystems. The VPS's local disk is the deployment target.
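Of these settings, only journal_mode is stored in the database file. synchronous, foreign_keys, and busy_timeout are per-connection and revert to their defaults on every new connection, so "on startup" in practice means "on every connection the app opens". A minimal sketch using the standard-library sqlite3 module follows; the helper name `connect` is an assumption, not the actual code in miniportal.app.

```python
import sqlite3


def connect(db_path):
    """Open a connection with the lab's four PRAGMAs applied."""
    conn = sqlite3.connect(db_path)
    # journal_mode is persisted in the file; the PRAGMA returns the mode in effect.
    mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"could not enable WAL, journal_mode is {mode!r}")
    # The remaining three must be re-applied on every new connection.
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("PRAGMA busy_timeout = 5000")
    return conn
```

Routing every connection through one helper like this is what keeps foreign-key enforcement from silently disappearing in a code path that opens its own connection.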
Step 6 — Backup script¶
#!/usr/bin/env bash
# deploy/backup/borg-backup.sh
# Runs as the portal user via cron, 02:30 daily.
set -euo pipefail
# Lab 06 step 6: implement the following pipeline, each step error-checked.
#
# 1. ts=$(date -u +%Y%m%dT%H%M%SZ)
# 2. sqlite3 /var/lib/lynx-portal/portal.db ".backup '/var/lib/lynx-portal/portal.snapshot.${ts}.db'"
#    # SQLite's .backup is online-safe and produces a consistent snapshot.
# 3. borg create \
#      --stats --compression zstd \
#      ssh://borg@offsite.example.com/./repo::lynx-portal-${ts} \
#      /var/lib/lynx-portal/portal.snapshot.${ts}.db \
#      /var/lib/lynx-portal/audit
# 4. rm /var/lib/lynx-portal/portal.snapshot.${ts}.db
# 5. borg prune --keep-daily=14 --keep-weekly=8 --keep-monthly=12 \
#      ssh://borg@offsite.example.com/./repo
#    # On borg >= 1.2, prune only marks data as deleted; run `borg compact` on the repo to free the space.
# 6. Append a line to /var/log/lynx-portal/backup.log with the timestamp + borg's stats.
echo "Lab 06 step 6 — implement the backup pipeline."
exit 1
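The .backup SQLite command is the right way to snapshot a live DB; cp portal.db backup.db is not safe under WAL, because recently committed pages may still sit in portal.db-wal. The script never reads portal.env: the borg passphrase comes from a separate env file at /etc/lynx-portal/borg.env, mode 0400 root:portal, loaded by the cron wrapper. Note that a root-owned 0400 file is unreadable to the portal user, so the wrapper that loads it must run as root, or the file needs mode 0440.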
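The same online-backup mechanism is available from Python through `sqlite3.Connection.backup`, which is handy for the integration test or for a dry run without the sqlite3 CLI. The sketch below mirrors steps 1 and 2 of the stub; the function name `snapshot` is hypothetical.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def snapshot(db_path, out_dir):
    """Write a consistent copy of a live SQLite DB and return its path."""
    # Same UTC timestamp format as the backup script's ${ts}.
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(out_dir) / f"portal.snapshot.{ts}.db"
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)  # SQLite online backup API; includes pages still in the WAL
    finally:
        dst.close()
        src.close()
    return target
```

The resulting file is a standalone database with no -wal or -shm sidecar, which is why it is the snapshot file, not the live portal.db, that gets handed to borg.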
Step 7 — Restore script + DR drill¶
#!/usr/bin/env bash
# deploy/backup/borg-restore.sh
set -euo pipefail
# Lab 06 step 7: implement the restore pipeline.
#
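# 1. systemctl stop lynx-portal
# 2. stamp=$(date -u +%s)
#    for f in portal.db portal.db-wal portal.db-shm; do
#      if [ -e "/var/lib/lynx-portal/$f" ]; then
#        mv "/var/lib/lynx-portal/$f" "/var/lib/lynx-portal/$f.broken.$stamp"
#      fi
#    done
#    # A stale -wal/-shm left next to a restored DB can corrupt it, so move all three aside.
# 3. cd /var/lib/lynx-portal
#    borg extract --strip-components 3 ssh://borg@offsite.example.com/./repo::"$1"
#    # borg extract always writes into the current directory (it has no -C flag);
#    # stripping var/lib/lynx-portal (3 components) keeps the file names and audit/.
# 4. mv /var/lib/lynx-portal/portal.snapshot.*.db /var/lib/lynx-portal/portal.db
# 5. chown portal:portal /var/lib/lynx-portal/portal.db
# 6. systemctl start lynx-portal
# 7. curl -fsS http://127.0.0.1:8001/healthz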
echo "Lab 06 step 7 — implement the restore pipeline."
exit 1
The DR drill is run once before the exit gate:
- Seed three test students, two journal entries each, one quiz attempt each.
- Take a fresh backup (run lynx-backup manually) so that the snapshot contains the seeded data.
- Drop the DB (rm /var/lib/lynx-portal/portal.db*).
- Restore from that snapshot (borg-restore.sh lynx-portal-YYYYMMDDTHHMMSSZ). In a real incident this would be last night's archive.
- Verify: three students still exist, journal entries are present, quiz attempts are present.
- Log everything in experiments/41-deploy/dr-drill-2026-XX-XX.md.
# tests/integration/test_disaster_recovery.py
import pytest


@pytest.mark.slow
def test_dr_drill_end_to_end():
    """Run inside a docker-compose harness:

    1. Start the portal container with a seeded DB.
    2. Run the backup script against a local borg repo.
    3. `rm` the DB file.
    4. Run the restore script.
    5. Assert /healthz is green and the three seeded students are queryable.

    This test is slow (~30 s); marked @pytest.mark.slow and skipped from default `just test`.
    Run as `just test-integration` or in CI nightly.
    """
    raise NotImplementedError("Lab 06 step 7: wire the docker-compose harness; run the five steps.")
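The "verify" step of the drill is easier to trust when it compares numbers recorded before the drop against the restored file, rather than relying on a visual check. A minimal sketch follows. The table names `students`, `journal_entries`, and `quiz_attempts` are assumptions made for illustration; substitute the portal's real schema.

```python
import sqlite3

# Hypothetical table names; substitute the portal's real schema.
DRILL_TABLES = ("students", "journal_entries", "quiz_attempts")


def table_counts(db_path, tables=DRILL_TABLES):
    """Return {table: row_count} for each drill table in the given DB."""
    conn = sqlite3.connect(db_path)
    try:
        return {
            # Table names come from a fixed tuple, never from user input.
            t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
            for t in tables
        }
    finally:
        conn.close()


def verify_restore(expected, restored_db_path):
    """Return a list of human-readable mismatches; empty means the drill passed."""
    actual = table_counts(restored_db_path, tuple(expected))
    return [
        f"{t}: expected {n}, found {actual[t]}"
        for t, n in expected.items()
        if actual[t] != n
    ]
```

Record `table_counts(...)` right after seeding, paste it into the drill log, and call `verify_restore` once the restore script has finished; row counts are a coarse check, so spot-checking one journal entry's content is still worthwhile.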
Step 8 — The exit gate¶
The Phase 41 exit gate is one external user completing the loop:
- Borja shares the portal URL and a username with someone outside the project (a friend, a student).
- They visit /login, type the username, get redirected to /set-password.
- They set a password.
- They take the phase-0 quiz.
- They deliberately get one question wrong.
- The next day, they log in and the wrong-question card is in their /review queue.
Borja captures the walkthrough in experiments/41-deploy/external-user-walkthrough.md — including the actual times of each step, any friction the user reported, and the screenshot or trace ID showing the review card surfaced on day two.
If the user does not see the review card on day two, the SM-2 wiring (lab 04) or the timezone handling (lab 06) has a bug — fix in the originating phase, not in lab 06.
What "done" looks like¶
- deploy/install.sh brings a fresh VPS to a healthy portal at https://portal.example.com/healthz in one run.
- systemctl status lynx-portal shows active (running); systemd-analyze security score ≤ 3.0.
- Caddyfile terminates TLS via Let's Encrypt; HSTS header present.
- SQLite is in WAL mode; PRAGMA foreign_keys is ON.
- Daily borg backups run via cron at 02:30 UTC; the off-site repo holds at least one snapshot.
- borg-restore.sh succeeds in the DR drill; data integrity verified post-restore.
- tests/integration/test_disaster_recovery.py is green when invoked.
- One external user completed the sign-up → set-password → quiz → fail → next-day review loop; walkthrough committed.
- deploy/README.md runbook is complete enough that a second deployer can repeat without Borja's help.
Common pitfalls¶
- SQLite on NFS. Documented anti-pattern; corruption is a question of when, not if. Local disk on the VPS.
- cp portal.db instead of .backup. WAL mode means the file on disk is not a consistent snapshot. Use SQLite's .backup command.
- Backup passphrase in git. Even in a private repo. The borg passphrase lives in /etc/lynx-portal/borg.env, mode 0400.
- No DR drill. Backups that are never restored are not backups. The drill is the contract.
- Letting install.sh write secrets to git-tracked files. The script copies a template; the operator fills in real values out-of-band.
- Forgetting to point DNS. Caddy will silently fail to acquire a certificate; debug with journalctl -u caddy and look for ACME errors.
- Time-zone ambiguity in SM-2 due_on. If the portal stores due_on in UTC but the external user is in a UTC+2 timezone, the card may surface a day "late" by their clock. Document the convention (UTC dates throughout) and add a per-user tz field if real complaints arise.
- Trusting the deploy without the exit gate. Internal tests can pass on a broken deploy. The external-user walkthrough is the only honest acceptance test.
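The time-zone pitfall is easiest to see with concrete timestamps. The sketch below assumes the "UTC dates throughout" convention and a hypothetical `is_due` helper; it uses a fixed UTC+2 offset instead of a named zone so it runs without a tz database.

```python
from datetime import date, datetime, timedelta, timezone


def is_due(due_on, now):
    """A card is due once the UTC calendar date of `now` reaches due_on."""
    return now.astimezone(timezone.utc).date() >= due_on


# Fixed UTC+2 offset for illustration; a real per-user tz field would use zoneinfo.
user_tz = timezone(timedelta(hours=2))
due_on = date(2026, 3, 2)

# 00:30 on 2 March by the user's clock is still 22:30 UTC on 1 March.
early = datetime(2026, 3, 2, 0, 30, tzinfo=user_tz)
# 02:00 local is exactly 00:00 UTC on 2 March.
later = datetime(2026, 3, 2, 2, 0, tzinfo=user_tz)

print(is_due(due_on, early))  # False
print(is_due(due_on, later))  # True
```

For the exit-gate walkthrough this means an external user who logs in just after their local midnight may not see the card yet; recording the actual login times in the transcript makes that case distinguishable from a genuine SM-2 bug.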
End of Phase 41 lab series. Next: PHASE_41_REPORT.md once the deploy is live and the external user has completed the loop.