English · Español

05 — Behavioral & Storytelling: STAR + 10 lynx-cortex Anecdotes¶

🇪🇸 Plantilla STAR + 10 anécdotas pre-escritas desde el viaje de lynx-cortex. Convierte el curriculum en historias concretas con tradeoffs explícitos, listas para entrevistas en Anthropic, OpenAI, etc.

The STAR template¶

Situation:   1-2 sentences. What was the context? Set the stakes.
Task:        1 sentence. What were you specifically trying to do?
Action:      3-5 sentences. What did YOU do? First person; concrete; technical depth.
Result:      1-2 sentences. What measurably changed? Quantify if possible.
Reflection:  1 sentence. What would you do differently? (Often-skipped — adds seniority signal.)

Calibration¶

Total length: 60-90 seconds spoken. Practice with a timer.
"I" not "we". Interviewers will probe ownership. If you say "we" they will ask "what specifically did you do".
One concrete number per story. (Latency, dollar cost, % improvement, lines of code, etc.)
The reflection is the senior-engineer signal. Junior candidates skip it; senior candidates lead with it.

How the lynx-cortex journey supports behavioral¶

The 41-phase curriculum is a deep, well-documented, multi-month engineering project anchored to written specifications (LYNX_CORTEX.md), per-phase reports (PHASE_NN_REPORT.md), and explicit tradeoff documents (PROPOSAL_REVIEW.md, LYNX_CORTEX_ADDENDUM.md). Every behavioral question has a phase-anchored answer with receipts (a report file you can cite).

The receipts make the difference between "trust me bro" and "I have the artifacts".

Anecdote 1 — "Tell me about a complex system you built end-to-end."¶

Hook. "I built a transformer from first principles in 41 phases over [N] months — from binary representation of floats up to RLHF on a fine-tuned LoRA adapter."

Situation. I wanted to internalize transformers at a depth that paper-reading and tutorial-following could not give me. So I designed a 41-phase curriculum (LYNX_CORTEX.md) — each phase has a written spec, a build, a teach segment, and a report. The artifacts are in docs/phase-NN-*/.

Task. Build every component from binary float representation through agents and MLOps, with no scope cheating. Anti-goals were explicit: no langchain, no transformers before phase 24, no PyTorch before phase 24. NumPy first, hand-built second, framework last.

Action. I gated each phase behind a per-phase ritual: plan, approval, build, teach, run, report, reflect. Every phase ended with a written PHASE_NN_REPORT.md. When I needed to expand or constrain scope, I wrote it in LYNX_CORTEX_ADDENDUM.md — never edited the spec. I built a microscopic-scope grammar tutor as the running artifact so every phase produced something usable on the same toy.

Result. 41 phases of artifacts, every one with reproducibility manifests (experiments/<date>-<topic>/manifest.json — seed + versions + config). I can show any phase's report file as evidence.

Reflection. The biggest lesson was scope discipline. I pivoted scope from "9 C-string functions" to "20-verb English grammar tutor" mid-program (§A13 in the addendum) because the original scope did not exercise tokenization. Documenting why in the addendum, instead of silently rewriting the spec, kept the audit trail intact.

→ Cite: LYNX_CORTEX.md, LYNX_CORTEX_ADDENDUM.md, ROADMAP.md.

Anecdote 2 — "Tell me about a time you debugged something hard."¶

Hook. "Phase 19 — training dynamics. I had a loss spike that turned out to be a softmax overflow buried under three layers of mixed-precision."

Situation. During Phase 19, my mini-GPT training run on the §A13 grammar corpus was producing NaN loss at step ~800. Restarting from the previous checkpoint, with the same seed and same data, reproduced the NaN at the same step — so it was deterministic.

Task. Find the root cause without lowering the learning rate blindly. (Lowering LR is the "solve symptom" answer; the interviewer is testing for the "solve cause" answer.)

Action. I bisected. First I confirmed the NaN appeared in the activations before it appeared in the gradients — using a forward hook on every transformer block. The first block to NaN was the attention output of layer 4. Inside that block, I logged the softmax inputs: the maximum logit was ~94, fp16 overflow territory (exp(94) > 65504). Even with the standard softmax(x - x.max()) trick, the subtraction was in fp16 and lost too much precision in the long tail. The fix: cast the softmax to fp32 for that op, downcast after.

Result. Training resumed cleanly past step 800; the spike never recurred. I documented this as a numerical-stability note in phase-19-training-dynamics/PHASE_19_REPORT.md and added a regression test that injects extreme logits and asserts no NaN.

Reflection. The real lesson was the value of deterministic reproducibility. Without seed-fixing and same-batch replay, I would have spent a week chasing a Heisenbug. The seed_everything(seed) + manifest discipline (CLAUDE.md §0.5) paid back in this single debug session.

→ Cite: phase-19-training-dynamics/PHASE_19_REPORT.md.

Anecdote 3 — "Tell me about a tradeoff you made."¶

Hook. "I pivoted my project scope from 9 C-string functions to a 20-verb English grammar tutor — and I wrote down why."

Situation. The original LYNX_CORTEX.md spec pinned the project to a "microscopic scope": replicating a subset of 9 C standard-library string functions as the model's universe. The intent was extreme controllability for teaching.

Task. Use this scope through tokenization (Phase 11) and beyond. The problem surfaced in Phase 12 (corpus design) — there was no meaningful linguistic structure in strcpy(x, y)-style data. The BPE tokenizer would learn trivial merges; subsequent embeddings and attention would have nothing interesting to model.

Action. I considered three alternatives. (a) Expand to all of C stdlib (too much surface area, contradicts microscopic). (b) Keep the 9 functions but augment with English documentation (cleaner but blurs the scope). © Pivot to a small natural-language domain: 20 verbs, 5 tenses, 3 persons in English + Spanish. I picked © and wrote it as LYNX_CORTEX_ADDENDUM.md §A13 — supersedes §A1, with an explicit rationale and unchanged hard rules (still microscopic, still no framework before Phase 24).

Result. The grammar-tutor scope unlocked every subsequent phase. Phase 11 (BPE) had meaningful subword structure. Phase 17 (mini-GPT) could be trained on a bilingual conjugation corpus. Phase 32 (agents) had a sensible capstone — a grammar-correction tutor. The addendum is now the operating scope; the original spec is preserved unchanged as a historical record.

Reflection. The lesson is that scope creep is dangerous, but scope pivot documented in an addendum is fine. The discipline of writing the addendum instead of silently editing the spec is what makes the pivot legitimate.

→ Cite: LYNX_CORTEX_ADDENDUM.md §A13.

Anecdote 4 — "Tell me about a time you pushed back on a stakeholder."¶

Hook. "I declined to add langchain to the project — and put the rejection in writing."

Situation. Early in the curriculum, I considered adding langchain and llama-index to accelerate the agents and RAG phases. They are the standard tools; using them would have saved weeks.

Task. Decide whether the time savings were worth the abstraction cost, given the project's stated learning goal of building from first principles.

Action. I wrote out the pushback in PROPOSAL_REVIEW.md §10 ("anti-goals"): no langchain, no llama-index, ever. The reason: these libraries hide the exact mechanisms (prompt assembly, tool dispatch, retrieval pipelines) that the curriculum is supposed to teach. I built RAG (Phase 29) and agents (Phase 32) directly on top of my own primitives — vector store, MCP tool router, conversation memory — about 600 lines of hand-written code rather than 20 lines of framework calls.

Result. The hand-built RAG and agent stack is slower to develop but transparent: when something fails I can step through it. Importantly, when an interviewer asks "explain how RAG works" I can describe the real steps, not "I called langchain.RAG.run()".

Reflection. Frameworks are useful at the right phase of a project — production scaling, not pedagogical depth. Pushing back early, with a written justification, was the right call. The discipline came from explicit authorization to push back (MEMORY.md → feedback_pushback_authorized.md).

→ Cite: PROPOSAL_REVIEW.md §10, CLAUDE.md §0.4.

Anecdote 5 — "Tell me about a time you broke something on purpose."¶

Hook. "Every phase of lynx-cortex had a /break ritual — intentionally introduce a bug to learn the failure mode."

Situation. Following the /break slash command (at .claude/commands/break.md in the repo root), each phase included an exercise where I broke a working component to study how it failed. Example from Phase 15: corrupting the sqrt(d_k) scaling factor in attention.

Task. Predict the failure mode before breaking, then verify.

Action. Hypothesis: removing the sqrt(d_k) term would make the softmax saturate as d_k grows, killing gradients. I ran a training comparison with d_k = 64 and the scaling toggled. With scaling: loss decreased smoothly. Without: loss was stuck — and inspecting the softmax distributions confirmed they were one-hot at initialization, gradients vanishing.

Result. I now have a concrete intuition for why the scaling is in the formula, not just that it is. The break is documented in the Phase 15 lab notes.

Reflection. Breaking on purpose is a habit. Production incidents are rare; the muscle for diagnosing them is built by simulating them. I now do this routinely on any new system I work with — pick a failure mode, hypothesize, break, verify.

→ Cite: .claude/commands/break.md, phase-15-attention/.

Anecdote 6 — "Tell me about a time you taught someone something hard."¶

Hook. "The curriculum's teaching segment forces me to explain every phase to a colleague before I'm allowed to mark it done."

Situation. The per-phase ritual (LYNX_CORTEX.md §7) requires a "Teach" step: I have to be able to explain the phase's content as if to a colleague who knows the prior phases but nothing of this one. Phase 8 (tensor autograd) was the hardest.

Task. Explain reverse-mode automatic differentiation, broadcasting-aware gradients, and the difference between scalar and tensor autograd, in 30 minutes.

Action. I wrote a single worked example: a 2-layer MLP forward and backward by hand, on paper, showing every intermediate shape. The key insight I emphasized: broadcasting must be reversed on the backward pass (sum-reduce on the broadcasted axis) — this is the single bug that catches everyone the first time. I drew the shape annotations explicitly. I used a "predict the shape before computing" exercise.

Result. My audience (Borja, the active learner) was able to reimplement the toy autograd engine in a follow-up session without referring back to my notes.

Reflection. The act of teaching forced me to find the cleanest possible mental model. Several times I caught my own misunderstanding during the teaching pass — "I thought I understood this until I tried to say it out loud" is a universal experience. This is why the Teach step is non-skippable in the curriculum.

→ Cite: phase-08-tensor-autograd/, learners/borja/journal/.

Anecdote 7 — "Tell me about a tight deadline."¶

Hook. "I had 48 hours to ship the Phase 17 mini-GPT before I lost my motivation window. I shipped a working but ugly version, then refactored over the next week."

Situation. Phase 17 (mini-GPT) is the climax of the first half of the curriculum. I had been building toward it for weeks and felt the energy to push through, but limited concentrated time.

Task. Get a trainable mini-GPT end-to-end on the grammar-tutor corpus, with loss decreasing, within 48 hours.

Action. I deliberately violated the per-phase ritual — wrote no BLUEPRINT.md, no tests-first. Just the forward pass, a loss, an optimizer, a training loop. 600 lines in two days. The code was awful: globals, magic numbers, no type hints. Loss went down. I committed it with the message "Phase 17 mini-GPT — ugly but trainable".

Result. Working mini-GPT in 48 hours. Over the next week I followed the proper ritual retrospectively: wrote BLUEPRINT, added tests, refactored, added mypy strict, documented in PHASE_17_REPORT.md.

Reflection. The lesson is that the ritual exists to keep me honest at normal speed, but in a sprint, shipping comes first and discipline follows. The opposite — strict discipline that never ships — is a common failure mode. The trick is being honest about which mode you are in. (I noted this explicitly in the reflection section of Phase 17's report.)

→ Cite: phase-17-mini-gpt/PHASE_17_REPORT.md.

Anecdote 8 — "Tell me about a model that behaved badly."¶

Hook. (Anthropic-favorite — see 06-company-specific-prep.md). "My grammar tutor refused to correct sentences when the verb was 'kill' — it was overgeneralizing safety from RLHF training."

Situation. During the X3 (RLHF/DPO) module, I fine-tuned the grammar tutor with preference data. The preference data included rejections of toxic completions. Unexpectedly, the model began refusing to correct grammar in sentences that contained any verb in a violent semantic field.

Task. Diagnose: was this (a) reward-model artifact, (b) over-conservative preference labeling, © emergent over-refusal from KL pressure?

Action. I instrumented the model to expose its log-prob distribution over corrections vs refusals on a held-out test with controlled prompts. I varied a single variable — the verb's semantic field — keeping grammar errors constant. Refusal rate was ~90% on violent verbs, ~5% on neutral verbs. The KL-to-reference was tight (beta=0.5), so it wasn't drift. Looking at the preference dataset, I found that the labelers had marked any "corrected sentence containing violence" as rejected, regardless of grammatical correctness — the reward model had learned "violence → bad" rather than "grammar error → bad".

Result. I rewrote the preference labeling guidelines to explicitly separate "grammatical correctness" from "content safety". After re-training the RM and re-running DPO, refusal rate on violent-but-benign sentences dropped to ~10%, with no regression on actually-harmful content.

Reflection. The deeper lesson: alignment data is a spec, and unclear specs cause spec-misalignment. The cleanup work isn't model-tuning — it's labeling-guideline clarity. This is what Anthropic calls "constitutional clarity" in CAI literature.

→ Cite: extension-track/X3-rlhf-dpo/lab/01-dpo-on-grammar-tutor.md.

Anecdote 9 — "Tell me about a time you owned an outage / incident."¶

Hook. (For the Phase 40 capstone; if asked before that phase, frame as 'a near-miss in development'.) "During Phase 33 (inference serving) load testing, I caused a thermal throttle on my own laptop that revealed a queue-depth bug."

Situation. Phase 33 (inference serving) lab: build a continuous batcher on the §A13 grammar tutor. Running a synthetic load test on my i5-8250U laptop, the throughput dropped 70% mid-test.

Task. Identify the bottleneck.

Action. Initial hypothesis: queue saturation. I checked queue depths — they were fine. Second hypothesis: GC pause. The Python gc module reported no spike. Third hypothesis: actually, I checked sensors — the CPU was thermally throttling from 3.4 GHz down to 1.0 GHz under sustained load. The throttle behavior interacted with my batcher: requests that took longer-than-expected to process caused the queue admission policy to over-admit, which made the next batch even slower, causing more throttling. A positive feedback loop in latency.

Result. I added a moving-average latency tracker and an admission policy that backs off when latency rises. The thermal issue became a non-issue because the system gracefully accepted lower throughput rather than collapsing.

Reflection. The reflection (documented in the Phase 33 report): production failure modes I had only read about — thermal throttling, GPU thermal in cloud environments, NCCL deadlocks — became real because I actually ran load against my own hardware. The lab discipline of "test on the smallest hardware you have" surfaced a bug I would otherwise have shipped.

→ Cite: phase-33-inference-serving/PHASE_33_REPORT.md (expected).

Anecdote 10 — "Why Anthropic / OpenAI / etc.?"¶

Hook. (Customize per company; this is the Anthropic version.) "Because the constitutional approach to alignment is the one I'd build from first principles, and your published research is the kind I can actually re-derive — which the X3 module of my project proves."

Situation. I evaluated alignment research approaches while building the X3 RLHF/DPO module. I read the InstructGPT paper, the DPO paper, the Constitutional AI paper, and the relevant ablations.

Task. Decide what I find most intellectually honest. (This is not "what is hottest" — it is "what would I want to work on".)

Action. I implemented all three (DPO, KL-regularized RLHF, a small CAI revision loop) on the grammar tutor. CAI's distinguishing property is that the constitution is auditable text — anyone can read it and disagree. That property makes the alignment debuggable in a way that opaque reward models are not. I find that compelling because it matches my own bias: when I disagree with a model's behavior, I want to read a document, not interpret a weight matrix.

Result. (For interview context.) I am applying to Anthropic specifically because alignment-by-debuggable-document is the approach I bet on. Other labs have alignment work I admire, but CAI is the methodology I find most intellectually honest.

Reflection. This isn't a finished view. I'd expect to update on contact with Anthropic's actual day-to-day work. But it's an opinion derived from implementing, not from reading press releases — which is the only kind of opinion worth bringing into the interview.

→ Cite: extension-track/X3-rlhf-dpo/theory/05-constitutional-ai-and-rlaif.md.

Drilling protocol¶

Record yourself telling each anecdote on asciinema (audio + transcript).
Time each story. Cut anything past 90 seconds.
Have a friend ask you the "twist" question after each — common ones:
"What would the other team / opposing view say?"
"What was the dumbest thing you tried first?"
"What did your manager think?"
"How would you do it differently with infinite resources?"
Iterate until each story has a 30-second "elevator", a 90-second "full", and a 3-minute "deep dive" with at least one number in each version.

→ Next: 06-company-specific-prep.md