Lab 06 — Deploy on a single VPS, daily backups, disaster-recovery drill

Live portal on a single server: Caddy in front, uvicorn behind, SQLite in /var/lib/lynx-portal/. A daily BorgBackup copy goes to another site. The real test: drop the database, restore it from yesterday's copy, and check that nothing is missing. The Phase 41 exit gate: an outside person signs up, gets a question wrong, and the next day that question comes back.

Goal

Deploy the portal on a single VPS in a way Borja can re-execute in an afternoon. Caddy terminates TLS and reverse-proxies to uvicorn; SQLite lives on local disk under /var/lib/lynx-portal/. BorgBackup (or restic) snapshots the DB nightly to an off-site, encrypted destination. The exit gate has two parts: a disaster-recovery drill, and a real external user who completes the loop end-to-end.

Why this lab exists

A portal that runs only on Borja's laptop is not a portal. Lab 06 is what makes Phase 41's effort durable: someone other than Borja, on a different network, on a different day, can use the system. The single-VPS pattern is chosen because (a) it is what an actual classroom of 3–20 students needs, (b) it has zero managed-service lock-in, and (c) the disaster-recovery drill is tractable on it: you cannot easily drill RDS, but you can absolutely drill a SQLite file.

The exit gate (real external user) is the only honest test that the lab-01..05 chain is coherent. Internal tests can pass while the deployed system is unusable; the external sign-up + fail-a-question + see-it-tomorrow flow proves the whole loop.

Prerequisites

  • Labs 00–05 done; all tests green.
  • A VPS is provisioned (any small instance — 1 vCPU, 1 GB RAM, 10 GB disk is enough for ≤ 50 learners).
  • An off-site backup target exists (a second VPS, S3-compatible bucket, or Borja's home NAS reachable over SSH).
  • Phase 37 secret-handling rules are internalized (no secrets in git; vault env file under 0600).

Deliverables

  • deploy/Caddyfile — reverse-proxy + automatic Let's Encrypt config.
  • deploy/systemd/lynx-portal.service — uvicorn under systemd, restart policy, user/group.
  • deploy/install.sh — idempotent first-time install on a fresh VPS (apt deps, user creation, dir layout, systemd enable).
  • deploy/backup/borg-backup.sh — daily snapshot script.
  • deploy/backup/borg-restore.sh — restore-from-snapshot script.
  • deploy/backup/crontab — fragment installed by install.sh.
  • deploy/README.md — runbook (sign-off checklist for a new deploy).
  • tests/integration/test_disaster_recovery.py — drops + restores + verifies.
  • experiments/41-deploy/dr-drill-2026-XX-XX.md — log of the actual drill run.
  • experiments/41-deploy/external-user-walkthrough.md — the exit-gate transcript.

Step 1 — The VPS layout

/var/lib/lynx-portal/
  portal.db                # SQLite, WAL mode
  portal.db-wal            # write-ahead log
  portal.db-shm            # shared memory
  audit/YYYY-MM-DD.log     # JSONL audit
  uploads/                 # if any (notes attachments — out of scope for lab 06)
/etc/lynx-portal/
  portal.env               # 0600 root:portal: app secrets (pepper, session_secret)
  borg.env                 # borg repo passphrase, kept separate from portal.env (see Step 6)
/opt/lynx-portal/
  app/                     # checkout of the repo OR a uv-built venv slug
  venv/                    # uv-managed
/var/log/lynx-portal/
  portal.log               # if not using systemd-journald

Ownership: portal:portal for everything under /var/lib/lynx-portal/ and /opt/lynx-portal/. Mode 0750 on directories, 0640 on regular files. The portal system user is a no-login account (/usr/sbin/nologin).
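The mode convention above is easy to drift from after a manual fix on the server. As a minimal sketch (the function name and constants are illustrative, not part of the repo), a small audit can list every entry whose mode deviates from 0750 for directories and 0640 for regular files. It checks modes only; ownership is left out to keep the example short.

```python
import stat
from pathlib import Path

DIR_MODE = 0o750   # directories, per the layout above
FILE_MODE = 0o640  # regular files, per the layout above


def find_mode_violations(root: Path) -> list[tuple[Path, int, int]]:
    """Return (path, actual_mode, expected_mode) for each entry that deviates."""
    violations = []
    for path in [root, *sorted(root.rglob("*"))]:
        st = path.lstat()  # lstat: inspect the entry itself, not a symlink target
        if stat.S_ISDIR(st.st_mode):
            expected = DIR_MODE
        elif stat.S_ISREG(st.st_mode):
            expected = FILE_MODE
        else:
            continue  # symlinks, sockets, etc. are out of scope for this sketch
        actual = stat.S_IMODE(st.st_mode)
        if actual != expected:
            violations.append((path, actual, expected))
    return violations
```

Run as root on the VPS, `find_mode_violations(Path("/var/lib/lynx-portal"))` should return an empty list on a healthy install.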

Step 2 — install.sh

#!/usr/bin/env bash
# deploy/install.sh
# Idempotent. Safe to re-run.

set -euo pipefail

# Lab 06 step 2: implement the following actions, each guarded by an "is it already done?" check.
#
#   1. apt-get install -y caddy python3 sqlite3 borgbackup curl
#   2. id -u portal || useradd --system --home /var/lib/lynx-portal --shell /usr/sbin/nologin portal
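#   3. mkdir -p /var/lib/lynx-portal /var/log/lynx-portal /etc/lynx-portal /opt/lynx-portal
#      # /var/log/lynx-portal must exist: the systemd unit lists it in ReadWritePaths.
#   4. chown -R portal:portal /var/lib/lynx-portal /var/log/lynx-portal /opt/lynx-portal
#      chmod 0750 /var/lib/lynx-portal /var/log/lynx-portal /opt/lynx-portal
#   5. [ -e /etc/lynx-portal/portal.env ] || \
#        install -m 0600 -o root -g portal portal.env.template /etc/lynx-portal/portal.env
#      # Never overwrite an existing portal.env on re-run.
#      # Operator fills in real secrets after this script exits.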
#   6. install -m 0644 deploy/Caddyfile /etc/caddy/Caddyfile
#   7. install -m 0644 deploy/systemd/lynx-portal.service /etc/systemd/system/
#   8. systemctl daemon-reload
#      systemctl enable --now lynx-portal
#      systemctl reload caddy
#   9. install -m 0755 deploy/backup/borg-backup.sh /usr/local/bin/lynx-backup
#      crontab -u portal deploy/backup/crontab
echo "Lab 06 step 2 — implement the install script."
exit 1

The script ends with a printed checklist of manual steps Borja still has to do (fill in /etc/lynx-portal/portal.env, run borg init against the remote repo, point the domain's DNS at the VPS). It does not try to do those itself — those are credentials-bearing actions and belong to the operator.

Step 3 — Caddyfile

# deploy/Caddyfile
portal.example.com {
    encode zstd gzip
    log {
        output file /var/log/caddy/access.log
        format console
    }

    @healthz path /healthz
    handle @healthz {
        reverse_proxy 127.0.0.1:8001
    }

    handle /static/* {
        root * /opt/lynx-portal/app/src/miniportal
        file_server
    }

    handle {
        reverse_proxy 127.0.0.1:8001
    }

    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains"
        X-Content-Type-Options "nosniff"
        Referrer-Policy "strict-origin-when-cross-origin"
        # CSP is set per-route by the app (some pages need inline event handlers; locked down in lab 03).
    }
}

Replace portal.example.com at deploy time; document the choice in deploy/README.md so the next deployer does not forget.
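Once the site is live, the response headers configured in the Caddyfile can be verified from any machine. The following is a minimal sketch, assuming the three header values shown in the Caddyfile above; `missing_security_headers` and `fetch_headers` are illustrative names, not part of the repo.

```python
import urllib.request

# Header name (lowercase) -> substring that must appear in its value.
REQUIRED_HEADERS = {
    "strict-transport-security": "max-age=31536000",
    "x-content-type-options": "nosniff",
    "referrer-policy": "strict-origin-when-cross-origin",
}


def missing_security_headers(headers: dict[str, str]) -> list[str]:
    """Return the required header names that are absent or have the wrong value."""
    normalized = {name.lower(): value for name, value in headers.items()}
    return [
        name
        for name, needle in REQUIRED_HEADERS.items()
        if needle not in normalized.get(name, "")
    ]


def fetch_headers(url: str) -> dict[str, str]:
    """Fetch a URL and return its response headers as a plain dict."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return dict(resp.headers.items())
```

For example, `missing_security_headers(fetch_headers("https://portal.example.com/healthz"))` should be empty after a correct deploy.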

Step 4 — systemd unit

# deploy/systemd/lynx-portal.service
[Unit]
Description=lynx-cortex learner portal
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=portal
Group=portal
WorkingDirectory=/opt/lynx-portal/app
EnvironmentFile=/etc/lynx-portal/portal.env
ExecStart=/opt/lynx-portal/venv/bin/uvicorn miniportal.app:make_app --factory --host 127.0.0.1 --port 8001
Restart=on-failure
RestartSec=3
# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/lib/lynx-portal /var/log/lynx-portal
ProtectHome=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
LockPersonality=true

[Install]
WantedBy=multi-user.target

The hardening lines come from Phase 37's hardening checklist; they are non-negotiable for production. The integration test in experiments/41-deploy/ runs systemd-analyze security lynx-portal.service and expects an exposure score ≤ 3.0 (the scale runs from 0 to 10; lower is better).
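The score check can be scripted. The sketch below is one possible approach, assuming the summary line printed by systemd-analyze security has the form "Overall exposure level for <unit>: <score> <rating>"; confirm that against the output of the systemd version on the VPS before relying on it. The function names are illustrative.

```python
import re
import subprocess

# Assumed summary-line format; verify against your systemd version.
SCORE_RE = re.compile(r"Overall exposure level for \S+: (\d+(?:\.\d+)?)")


def parse_exposure_score(output: str) -> float:
    """Extract the overall exposure score from systemd-analyze security output."""
    match = SCORE_RE.search(output)
    if match is None:
        raise ValueError("no 'Overall exposure level' line found")
    return float(match.group(1))


def unit_exposure_score(unit: str) -> float:
    """Run systemd-analyze security for one unit and return its score."""
    result = subprocess.run(
        ["systemd-analyze", "security", "--no-pager", unit],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_exposure_score(result.stdout)
```

A test can then assert `unit_exposure_score("lynx-portal.service") <= 3.0`.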

Step 5 — SQLite + WAL

SQLite is a single file. That is the strength (trivial backups) and the weakness (one bad write corrupts everything). Four settings are non-negotiable:

-- Configured in miniportal.app on startup.
-- journal_mode is stored in the DB file; the other three apply per connection.
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;     -- WAL + NORMAL is the standard recommendation
PRAGMA foreign_keys = ON;
PRAGMA busy_timeout = 5000;

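Do not host the DB on an NFS/SMB mount. SQLite on network filesystems is a documented anti-pattern: WAL mode coordinates readers and writers through the shared-memory file (portal.db-shm), which only works when every process runs on the same host, and the flock/fcntl locking SQLite relies on is unreliable on many network filesystems. The VPS's local disk is the deployment target.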
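Because three of the four settings are per-connection, they have to be applied every time a connection is opened, not once at install time. A minimal sketch using Python's standard sqlite3 module follows; `connect` and `read_pragmas` are illustrative names, and how miniportal.app actually opens its connections is not shown in this lab.

```python
import sqlite3


def connect(db_path: str) -> sqlite3.Connection:
    """Open a connection and apply the four settings from this step."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode = WAL")    # persisted in the DB file
    conn.execute("PRAGMA synchronous = NORMAL")  # per connection
    conn.execute("PRAGMA foreign_keys = ON")     # per connection
    conn.execute("PRAGMA busy_timeout = 5000")   # per connection, milliseconds
    return conn


def read_pragmas(conn: sqlite3.Connection) -> dict:
    """Read the settings back so a health check or test can assert on them."""
    names = ("journal_mode", "synchronous", "foreign_keys", "busy_timeout")
    return {name: conn.execute(f"PRAGMA {name}").fetchone()[0] for name in names}
```

On a file-backed database, `read_pragmas(connect(path))` reports journal_mode as "wal", synchronous as 1 (NORMAL), foreign_keys as 1 and busy_timeout as 5000. An in-memory database cannot use WAL, so tests should use a temporary file.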

Step 6 — Backup script

#!/usr/bin/env bash
# deploy/backup/borg-backup.sh
# Runs as the portal user via cron, 02:30 UTC daily (cron uses the system time zone, so keep the VPS on UTC).

set -euo pipefail

# Lab 06 step 6: implement the following pipeline, each step error-checked.
#
#   1. ts=$(date -u +%Y%m%dT%H%M%SZ)
#   2. sqlite3 /var/lib/lynx-portal/portal.db ".backup '/var/lib/lynx-portal/portal.snapshot.${ts}.db'"
#      # SQLite's .backup is online-safe and produces a consistent snapshot.
#   3. borg create \
#        --stats --compression zstd \
#        ssh://borg@offsite.example.com/./repo::lynx-portal-${ts} \
#        /var/lib/lynx-portal/portal.snapshot.${ts}.db \
#        /var/lib/lynx-portal/audit
#   4. rm /var/lib/lynx-portal/portal.snapshot.${ts}.db
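#   5. borg prune --keep-daily=14 --keep-weekly=8 --keep-monthly=12 \
#        ssh://borg@offsite.example.com/./repo
#      # On Borg 1.2+, prune only marks data as deleted; run `borg compact` on the repo to free the space.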
#   6. Append a line to /var/log/lynx-portal/backup.log with the timestamp + borg's stats.
echo "Lab 06 step 6 — implement the backup pipeline."
exit 1

The .backup SQLite command is the right way to snapshot a live DB; cp portal.db backup.db is not safe under WAL, because recent commits may still sit in portal.db-wal. The script never reads portal.env: the borg passphrase comes from a separate env file at /etc/lynx-portal/borg.env, mode 0440 root:portal (the job runs as the portal user, so the group needs read access), loaded by the cron wrapper.
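The same online backup mechanism is available from Python through sqlite3.Connection.backup, which is useful for the integration test or for a harness that has no sqlite3 CLI installed. This is a minimal sketch, not the backup script itself; `snapshot` is an illustrative name.

```python
import sqlite3


def snapshot(live_path: str, snapshot_path: str) -> None:
    """Copy a live SQLite DB to snapshot_path using the online backup API.

    Unlike a plain file copy, this reads through the WAL, so commits that
    have not been checkpointed into the main file are included.
    """
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(snapshot_path)
    try:
        src.backup(dst)  # copies all pages in one pass by default
    finally:
        dst.close()
        src.close()
```

The resulting file is a standalone database with no -wal or -shm companions to carry along, which is why the backup script hands the snapshot, not the live file, to borg.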

Step 7 — Restore script + DR drill

#!/usr/bin/env bash
# deploy/backup/borg-restore.sh

set -euo pipefail

# Lab 06 step 7: implement the restore pipeline.
#
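#   1. systemctl stop lynx-portal
#   2. Move any existing portal.db, portal.db-wal and portal.db-shm aside with a
#      .broken.$(date -u +%s) suffix; skip files that no longer exist.
#      # A stale -wal or -shm left next to the restored DB risks corrupting it.
#   3. cd /var/lib/lynx-portal    # borg extract always writes into the current directory
#      borg extract --strip-components 3 ssh://borg@offsite.example.com/./repo::"$1"
#      # Archived paths look like var/lib/lynx-portal/<file>, so 3 components are stripped.
#   4. mv /var/lib/lynx-portal/portal.snapshot.*.db /var/lib/lynx-portal/portal.db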
#   5. chown portal:portal /var/lib/lynx-portal/portal.db
#   6. systemctl start lynx-portal
#   7. curl -fsS http://127.0.0.1:8001/healthz
echo "Lab 06 step 7 — implement the restore pipeline."
exit 1
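The final health check in the restore pipeline can race with service start-up: uvicorn may need a moment before it accepts connections. One way to handle that is to poll with a bounded number of retries. The following is a sketch with an illustrative function name and default values, not the script's required behaviour.

```python
import time
import urllib.request


def wait_for_health(url: str, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll url until it answers 200, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts and HTTP error statuses
            # (urllib's URLError and HTTPError are OSError subclasses).
            pass
        time.sleep(delay)
    return False
```

After `systemctl start lynx-portal`, a call such as `wait_for_health("http://127.0.0.1:8001/healthz")` gives the service about ten seconds to come up before the restore is declared failed.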

The DR drill is run once before the exit gate:

  1. Seed three test students, two journal entries each, one quiz attempt each.
  2. Take a fresh backup (run lynx-backup manually) and note the archive name.
  3. Drop the DB (rm /var/lib/lynx-portal/portal.db*).
  4. Restore from the snapshot taken in step 2 (borg-restore.sh lynx-portal-YYYYMMDDTHHMMSSZ).
  5. Verify: three students still exist, journal entries are present, quiz attempts are present.
  6. Log everything in experiments/41-deploy/dr-drill-2026-XX-XX.md.

# tests/integration/test_disaster_recovery.py
import pytest


@pytest.mark.slow
def test_dr_drill_end_to_end():
    """Run inside a docker-compose harness:

      1. Start the portal container with a seeded DB.
      2. Run the backup script against a local borg repo.
      3. `rm` the DB file.
      4. Run the restore script.
      5. Assert /healthz is green and the three seeded students are queryable.

    This test is slow (~30 s), so it carries the `slow` marker and is skipped by the default `just test`.
    Run as `just test-integration` or in CI nightly.
    """
    raise NotImplementedError("Lab 06 step 7: wire the docker-compose harness; run the five steps.")
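The "verify" step of the drill can be made mechanical. The sketch below runs SQLite's own consistency checks and then compares row counts against what was seeded. The table names in the usage note are hypothetical, since the portal schema is defined in earlier labs and not shown here.

```python
import sqlite3


def verify_restored_db(db_path: str, expected_counts: dict[str, int]) -> list[str]:
    """Return a list of problems found in the restored DB (empty means OK)."""
    problems = []
    conn = sqlite3.connect(db_path)
    try:
        # Structural check: a healthy DB returns the single row ("ok",).
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        if rows != [("ok",)]:
            problems.append(f"integrity_check failed: {rows!r}")
        # Referential check: any returned row is a foreign-key violation.
        if conn.execute("PRAGMA foreign_key_check").fetchall():
            problems.append("foreign_key_check reported violations")
        # Content check: table names come from the caller's constants,
        # never from user input, so quoting them inline is acceptable here.
        for table, minimum in expected_counts.items():
            try:
                count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            except sqlite3.OperationalError as exc:
                problems.append(f"{table}: {exc}")
                continue
            if count < minimum:
                problems.append(f"{table}: expected >= {minimum} rows, found {count}")
    finally:
        conn.close()
    return problems
```

For the drill's seed data, a call might look like `verify_restored_db("/var/lib/lynx-portal/portal.db", {"students": 3, "journal_entries": 6, "quiz_attempts": 3})`, with the table names replaced by the real ones from the schema.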

Step 8 — The exit gate

The Phase 41 exit gate is one external user completing the loop:

  1. Borja shares the portal URL and a username with someone outside the project (a friend, a student).
  2. They visit /login, type the username, get redirected to /set-password.
  3. They set a password.
  4. They take the phase-0 quiz.
  5. They deliberately get one question wrong.
  6. The next day, they log in and the wrong-question card is in their /review queue.

Borja captures the walkthrough in experiments/41-deploy/external-user-walkthrough.md — including the actual times of each step, any friction the user reported, and the screenshot or trace ID showing the review card surfaced on day two.

If the user does not see the review card on day two, either the SM-2 wiring (lab 04) or the time-zone handling of due_on has a bug. Fix it in the lab where that logic lives; do not patch around it in the lab 06 deploy scripts.
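The time-zone failure mode is easiest to see with concrete clock times. The sketch below assumes due_on is stored as a UTC calendar date, as recommended in the pitfalls; `is_due` is an illustrative name and the dates are made up for the example.

```python
from datetime import date, datetime, timedelta, timezone


def is_due(due_on: date, now: datetime) -> bool:
    """A card is due once the current UTC date has reached due_on."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    return now.astimezone(timezone.utc).date() >= due_on


UTC_PLUS_2 = timezone(timedelta(hours=2))
due_on = date(2026, 6, 2)

# 00:30 local time on 2 June is still 22:30 UTC on 1 June: not due yet.
just_after_local_midnight = datetime(2026, 6, 2, 0, 30, tzinfo=UTC_PLUS_2)

# 02:00 local time is 00:00 UTC on 2 June: the card surfaces.
at_utc_midnight = datetime(2026, 6, 2, 2, 0, tzinfo=UTC_PLUS_2)

print(is_due(due_on, just_after_local_midnight))  # False
print(is_due(due_on, at_utc_midnight))            # True
```

If the external user logs in during that two-hour window, the missing card is the documented convention at work, not an SM-2 bug; note the login time in the walkthrough so the two cases can be told apart.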

What "done" looks like

  • deploy/install.sh brings a fresh VPS to a healthy portal at https://portal.example.com/healthz in one run.
  • systemctl status lynx-portal shows active (running); systemd-analyze security score ≤ 3.0.
  • Caddyfile terminates TLS via Let's Encrypt; HSTS header present.
  • SQLite is in WAL mode; PRAGMA foreign_keys is ON.
  • Daily borg backups run via cron at 02:30 UTC; the off-site repo holds at least one snapshot.
  • borg-restore.sh succeeds in the DR drill; data integrity verified post-restore.
  • tests/integration/test_disaster_recovery.py is green when invoked.
  • One external user completed the sign-up → set-password → quiz → fail → next-day review loop; walkthrough committed.
  • deploy/README.md runbook is complete enough that a second deployer can repeat without Borja's help.

Common pitfalls

  1. SQLite on NFS. Documented anti-pattern; corruption is a question of when, not if. Local disk on the VPS.
  2. cp portal.db instead of .backup. In WAL mode recent commits may still live in portal.db-wal, so the main file alone is not a consistent snapshot. Use SQLite's .backup command.
  3. Backup passphrase in git. Even in a private repo. The borg passphrase lives in /etc/lynx-portal/borg.env, mode 0440 root:portal, so the portal cron job can read it and nothing else can.
  4. No DR drill. Backups that are never restored are not backups. The drill is the contract.
  5. Letting install.sh write secrets to git-tracked files. The script copies a template; the operator fills in real values out-of-band.
  6. Forgetting to point DNS. Caddy cannot obtain a certificate until the domain resolves to the VPS, and the failure only shows up in the logs; run journalctl -u caddy and look for ACME errors.
  7. Time-zone ambiguity in SM-2 due_on. If the portal stores due_on as a UTC date but the external user is in a UTC+2 timezone, the UTC day starts at 02:00 on their clock, so a card can look "late" to someone who checks just after local midnight. Document the convention (UTC dates throughout) and add a per-user tz field if real complaints arise.
  8. Trusting the deploy without the exit gate. Internal tests can pass on a broken deploy. The external-user walkthrough is the only honest acceptance test.

End of Phase 41 lab series. Next: PHASE_41_REPORT.md once the deploy is live and the external user has completed the loop.