Two-model architecture: Haiku for chat, Sonnet for assessment
Same site, two AI features, two models. The decision wasn't about cost — it was about matching the model to the task. Here's the framework.
The lazy default for any project that uses an LLM is to pick one model and use it everywhere. There's something tidy about it — the same client, the same prompt-shape, the same cost ceiling. For most personal projects, this is fine.
For projects with more than one AI feature, the default leaves performance on the table. Different features have different cognitive shapes, and matching the model to the shape is one of the higher-leverage decisions you can make. This site has two AI features, and they ended up on two different models for reasons that are worth working through.
The two features
/ask-ai is an open chat. A user asks a question. The model reads my site corpus and answers from it. Sample questions: "what was your role at WB Games?", "what's your strongest PM skill?", "what would you say is your biggest gap?" The good answers are short, direct, and faithful to the corpus. The bad answers are speculative, long, or stylistically over-flowery.
/fit-check is a structured analysis. A user pastes a job description. The model reads my skills matrix (a 200-row spreadsheet of self-rated skills with priority tags), reads the JD, and returns a structured assessment: where I'm strong, where I'm moderate, where I'm a gap. The good output is opinionated, weighted by JD emphasis, and willing to call out misfits. The bad output is generic flattery or a token-matching exercise.
These look like the same shape on the surface — model-with-a-corpus answers a query about me — but they are not the same task underneath.
Retrieval vs. synthesis
/ask-ai is mostly retrieval-shaped. The user is asking a factual question that has an answer somewhere in the corpus. The model's job is to find it, surface it, and stop. Reasoning across multiple sections is rare. Most questions can be answered by a single fact already stated somewhere in the corpus, restated in conversational tone.
/fit-check is mostly synthesis-shaped. There's no place in the skills matrix that says "this candidate is a great fit for this VP role." The model has to read the whole JD, weight which capabilities it's emphasizing, look at the corresponding rows of the matrix, weigh self-ratings against the JD's emphasis, and produce an integrated assessment that makes a senior judgment call. This is reasoning across multiple inputs, not look-up.
Retrieval is what smaller models do well. Synthesis is what larger models do well. Match accordingly.
Why Haiku 4.5 for /ask-ai
I tried Haiku 4.5 and Sonnet 4.6 on the same chat prompts. The Haiku output was indistinguishable for retrieval-shaped questions. The Sonnet output was, if anything, worse — it tended to over-elaborate, add commentary, and produce paragraphs where a sentence would do. The corpus is the same. The system prompt asks for concise answers. Haiku followed the instruction more cleanly than Sonnet did, which surprised me.
I think this is a real pattern: when the task doesn't need synthesis, the larger model uses its extra capacity to produce extra content, which often works against the instruction. The smaller model has less capacity to "embellish," so it stays closer to the brief.
Plus: Haiku is cheaper per token, faster to first token, and qualifies for prompt caching at a reasonable threshold. For a feature that runs many times per visit (multi-turn chat), all three matter. The cost story is secondary to the quality story, but it's not zero.
Why Sonnet 4.6 for /fit-check
I tried both models on real job descriptions. Haiku's output was technically correct but flat — it identified matching keywords from the matrix but didn't integrate the JD against my profile in a way that read like a senior assessment. Asked about a JD that emphasized data-platform PM work, Haiku surfaced a few rows of the matrix and called it a fit. Sonnet, on the same JD, weighed the JD's emphasis on platform vs. application work, recognized that my matrix's strongest signals are on the application side, and produced a "moderate fit, here's why" answer that I genuinely agreed with.
That's the synthesis gap. It's not that Haiku was wrong. It's that Sonnet was better at the actual task, which is integrating multiple signals into a single assessment.
For a feature that runs once per JD paste — much less frequently than chat — the cost difference is negligible and the quality difference is the entire feature. Sonnet wins. Easily.
A framework for routing tasks to models
Looking back across both features, a framework emerges. For each AI feature, ask:
- Is the task retrieval or synthesis? Retrieval favors smaller models; synthesis favors larger ones.
- How often does the feature run per session? Frequent invocations amplify cost differences and latency differences.
- Does the corpus fit in a prompt? If yes, the smaller model with caching is usually fine. If no, you're past the small-corpus regime entirely and the model choice is downstream of architecture.
- What's the failure mode of "almost good enough"? A chat that's slightly worse is a minor annoyance. A fit assessment that's slightly worse is a feature that doesn't differentiate.
- Does the user perceive latency? Streaming hides a lot. A synchronous request that takes 8 seconds with a spinner is unpleasant; the same request with streaming feels alive.
For each yes/no, you get a different recommendation. The grid isn't memorizable; the questions are.
When to escalate, when to downgrade
The reverse decisions also exist:
- Escalate from a smaller model when the smaller model's output starts feeling shallow on the actual user task — not on artificial benchmarks. The trigger is "I keep wishing it had reasoned harder."
- Downgrade from a larger model when you notice the larger model is adding content the user doesn't want — extra paragraphs, qualifications, stylistic flourishes. The trigger is "this is too much."
The temptation is to escalate everything. Resist. Larger models on retrieval tasks often produce worse output by being too generative relative to the brief.
Operational notes
A few small things I learned while running both:
- Different cache thresholds. Haiku 4.5's prompt-cache minimum is around 4096 tokens; Sonnet 4.6's is similar but check current docs. If your corpus is small, your cacheable prefix may need padding to qualify.
- Different streaming feel. Haiku's streaming is faster per token, which matters for the chat feel. Sonnet's streaming is slightly slower per token but the words are weightier — the perceived information rate is comparable.
- Different rate limits. Tier-dependent, but Haiku's are usually higher than Sonnet's. If you're load-testing a chat feature, this matters.
- Different cost trajectories. Cost differences between Haiku and Sonnet are roughly an order of magnitude on input tokens. For features that run many times per session, that compounds; for features that run once, it doesn't.
The principle
The decision isn't "which model is best." It's "which model fits this task." The which model is best framing is a distraction borrowed from benchmarks; it ignores that good benchmarks measure capability, not appropriateness.
Match the model to the task. Use the smaller model when the task is retrieval-shaped, frequent, and doesn't reward extra reasoning. Use the larger model when the task is synthesis-shaped, infrequent, and rewards integrated judgment. For a personal site with two features that look similar but aren't, this gave me one feature that's fast and cheap and another that's thoughtful and opinionated — the right combination for what each is supposed to do.
The two-model pattern generalizes well. Most non-trivial AI products end up with multiple inference paths inside them; my site is just a small instance of the same thing. If you have one AI feature today and you're planning a second, the question to ask isn't "do I reuse the same model?" — it's "does this new feature have the same shape as the existing one?" If the shapes are the same, reuse. If they aren't, route.