Why evals should come before everything else in LLM development, not after

Evals First. Everything Else Second.

Most engineers I know treat evals like cleanup work. Something you wire up after the “real” engineering is done, right before you push to production and hope nobody notices the weird edge cases. I have done this. You have probably done this too.

It is the wrong order of operations, and I want to explain why I finally changed my mind.

The Moment That Reframed This For Me

Aakash Gupta ran Karpathy’s autoresearch repo against a Claude Code skill and let it run eval cycles overnight. Four rounds. When he woke up, the skill had moved from 41% to 92% on the benchmark.

That number is not impressive because it is big. It is impressive because of what had to be true for it to happen. The system had a clear definition of what “good” looked like, a mechanism for measuring distance from that target, and a loop that could act on the gap. Remove any one of those three things and you do not get improvement. You get drift.
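As a sketch of that three-part structure, here is an entirely hypothetical harness: a fixed set of cases defining "good", a score measuring distance from the target, and a revision step that survives only if the measured score improves. `propose_revision` stands in for whatever change mechanism you use (prompt edits, tool tweaks, and so on).

```python
def score(outputs, expected):
    """Measurement: fraction of cases where the output matches the target."""
    return sum(o == e for o, e in zip(outputs, expected)) / len(expected)

def improvement_loop(system, propose_revision, cases, expected,
                     rounds=4, target=0.9):
    """Run a few eval rounds, keeping a revision only if it scores higher."""
    best = score([system(c) for c in cases], expected)
    for _ in range(rounds):
        if best >= target:
            break  # close enough to the definition of "good"
        candidate = propose_revision(system)  # act on the gap
        candidate_score = score([candidate(c) for c in cases], expected)
        if candidate_score > best:  # keep only measured wins
            system, best = candidate, candidate_score
    return system, best
```

Remove any one piece, the `expected` targets, the `score`, or the comparison that gates the revision, and the loop degenerates into exactly the unmeasured drift described above.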

Most LLM projects are missing all three before they ship.

Why Vibes Are Not a Feedback Loop

When you deploy without evals, you are not running a feedback loop. You are collecting anecdotes. Users complain, you tweak a prompt, things feel better, you move on. That is not iteration. That is noise management.

Real improvement requires a stable measurement surface. You need to know what score you started at so you can tell whether a change helped or hurt. Without that, every prompt tweak is a guess, and you have no way to distinguish a genuine gain from regression on something you were not looking at.
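One hypothetical way to make that concrete: compare per-case pass/fail results between a recorded baseline run and a candidate run, because a flat aggregate score can hide a regression on cases you were not looking at.

```python
def diff_runs(baseline, candidate):
    """Compare two eval runs, each a dict of case_id -> passed (bool).

    Returns the cases the candidate newly passes and the ones it newly
    fails; the aggregate score can stay flat while both lists are non-empty.
    """
    gains = [c for c in candidate if candidate[c] and not baseline[c]]
    regressions = [c for c in candidate if baseline[c] and not candidate[c]]
    return gains, regressions
```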

The worst part is that late-stage evals tend to be weak. When you write evals after building the system, you unconsciously write them around the system you built. They reflect what your system already does well. They are not probing the failure modes you have not seen yet.

Evals Written First Are Different

When you write evals before you write the system, you are forced to define what success actually means before you have anything to rationalize. That is uncomfortable. It surfaces disagreements early: between you and your team, or between you and the stakeholder who asked for the feature. Those disagreements are worth having now instead of in a post-incident review.

This is the same logic behind test-driven development, and yes, I know that comparison is not new. But TDD in traditional software is about correctness. Evals in LLM development are about something harder: behavioral specification. You are not just asking whether the code runs. You are asking whether the model does what a person would consider reasonable across a distribution of inputs you cannot fully enumerate.

That requires thinking, and the time to do that thinking is before you have committed to an architecture.

The Cost of Getting This Backwards

If your system can go from 41% to 92% through structured iteration, then shipping at 41% is not a launch. It is a waste of your users’ time and a drain on your credibility. Worse, without evals in place, you do not even know you are at 41%. You just know it “feels about right.”

I have watched teams spend weeks on prompt engineering with no baseline. They argue about which version is better with nothing but their own judgment and a handful of cherry-picked examples. You cannot optimize something you are not measuring, and you cannot trust your intuition about model behavior at scale. Models are too inconsistent and input distributions are too wide.

Anthropic surveyed nearly 81,000 Claude users about how they use AI and what they fear about it. The consistent theme was that people want systems they can trust and predict. Evals are how you build that trust, not just for users but for yourself.

What “Evals First” Actually Looks Like

Before writing a single prompt, write down five to ten scenarios that would represent success. Then write five to ten that would represent failure. Turn those into a test harness. Keep it simple at first. A spreadsheet with expected outputs and a grading rubric is a legitimate eval.
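The spreadsheet version of that harness fits in a few lines. This is a hypothetical sketch: the rows, the `grade` rubric (a bare substring check here), and the pass-rate scoring are all placeholders for your own scenarios and grading criteria.

```python
def grade(output, expected):
    """Toy rubric: pass if the expected phrase appears in the output."""
    return expected.lower() in output.lower()

def run_evals(rows, system):
    """rows: (prompt, expected) pairs, like spreadsheet rows. Returns pass rate."""
    results = [grade(system(prompt), expected) for prompt, expected in rows]
    return sum(results) / len(results)
```

Failure scenarios slot in the same way: a row whose expected output is a refusal or a correction is just another (prompt, expected) pair.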

Then build toward it. Your eval score is the north star, not user sentiment, not your own read of the outputs, not whether the demo looked good in the standup.

When you do this, iteration becomes legible. You can point to a number and say whether you moved it. That is when you stop flying blind.

The engineers who get this right are not doing more work. They are doing work in the right order, and the difference in outcomes is not marginal.

Sources & Further Reading

#LLMDevelopment #AIEngineering #MachineLearning #LLMOps #PromptEngineering

Watch the full breakdown on YouTube
