How to write effective evals
We talk a lot about vibe coding and how it lets us build out product ideas quickly. However, when we deploy products built on existing AI models, we need to ensure the AI’s quality is high, consistent and scalable. Evals give us a system for measuring quality at scale.
It’s best to start with manual evaluations, which you can then automate. First, we provide the context, i.e. the input data to be evaluated:
Customer input: Where is my order? It’s been delayed for a week
Assistant output:
Hi there, I sincerely apologise for the delay with your order. I understand it’s been a week since your expected delivery date. We typically process orders within 2–3 business days, with standard shipping taking 5–7 business days, but we’re currently experiencing some delays due to higher than usual demand.
As a token of our appreciation for your patience, I’d like to offer you a 10% discount on your next purchase. We value you as a customer and want to ensure you have the best possible experience with Flying Bamboo.
Would you mind providing your order number? This will help me track your package’s exact location and provide you with an updated delivery timeline.
Next, we need to define expected outcomes, being clear about the output that needs to be generated. I learned to use a 1-to-5 Likert scale to apply more granular scoring of output quality in the evaluation rubric. For example, you could rate how the above AI-generated response performs against each of your criteria, such as brand voice alignment, problem resolution and accuracy.
However, I’m now learning to simplify things by applying a binary pass/fail judgement. The downside of more granular scoring is that it can become subjective and inconsistent very quickly. In our Flying Bamboo example, we want to evaluate whether the assistant effectively addressed the customer’s question, and that can be a simple pass or fail evaluation.
Especially when you’re starting to create evals, putting in detailed evaluator comments is key. The comment should be detailed enough to feed into a few-shot prompt for an LLM to review. A sample evaluator comment could sound something like this:
“The response does reflect the Flying Bamboo brand voice. I like how it is customer-centric and empathetic (“I sincerely apologise for the delay with your order. I understand it’s been a week since your expected delivery date”).
I know that the Flying Bamboo brand voice guidelines recommend including educational tidbits in customer correspondence, but the sentences about the positive impact of the customer’s purchase (While we work to get your order to you as quickly as possible, I want to share that your purchase is making a positive impact.) feel irrelevant, as they don’t help to explain or resolve the issue that the customer is experiencing.”
Using the LLM as your judge
Starting with manual evaluation isn’t just about learning your criteria; it’s also about building the foundation for scalable automation. The number of human-annotated examples you need to reliably validate your automated evaluators will depend on the complexity of your use case. Your dataset doesn’t have to be large to begin with; you just want to ensure it represents real-world scenarios and edge cases.
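For illustration, a starter dataset can be as small as a handful of annotated records. Here is a minimal sketch in Python (the field names are just an assumption, not a required schema); the pass example reuses the Flying Bamboo response above, and the fail example reuses the “cold” response from the rubric further down:

eval_dataset = [
    {
        "customer_input": "Where is my order? It's been delayed for a week",
        "assistant_output": "Hi there, I sincerely apologise for the delay with your order...",
        "label": "pass",
        # Evaluator comment, detailed enough to reuse later as a few-shot example
        "evaluator_comment": "Reflects the Flying Bamboo brand voice: customer-centric and "
                             "empathetic, acknowledges the delay and asks for the order "
                             "number to move things forward.",
    },
    {
        "customer_input": "Where is my order? It's been delayed for a week",
        "assistant_output": "Your order is delayed due to operational issues.",
        "label": "fail",
        "evaluator_comment": "Transactional and cold; does not acknowledge the customer's "
                             "frustration or offer a clear next step.",
    },
]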
Your detailed evaluator comments become the foundation for few-shot prompts when you transition to LLM judges. Take our Flying Bamboo example — those detailed comments about brand voice and empathy become training examples for an automated evaluator. The automated prompt for the LLM could look something like this:
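(This is a sketch rather than a production prompt: the example verdict is adapted from the manual evaluator comment above, and you would add more annotated examples as your dataset grows.)

You are evaluating AI-generated customer service responses for Flying Bamboo. Judge whether each response aligns with the Flying Bamboo brand voice, using the annotated example below as guidance.

Example response: “Hi there, I sincerely apologise for the delay with your order. I understand it’s been a week since your expected delivery date…”
Example evaluation: Pass. The response reflects the Flying Bamboo brand voice; it is customer-centric and empathetic, acknowledging the delay and the customer’s frustration before asking for the order number.

Now evaluate this response: {ai_response}
Answer with pass or fail, followed by a short explanation.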
The beauty of using an LLM as a judge is that it can evaluate aspects of your AI output that traditional automated metrics simply can’t capture. Think about it — how do you programmatically measure empathy, brand voice consistency, or whether a response feels genuinely helpful? In our Flying Bamboo example, the judge LLM doesn’t just check whether the response contains an apology; it evaluates whether that apology feels sincere and aligns with the brand’s customer-centric values. This makes the LLM as a judge particularly valuable for evaluating chatbot conversations, screening for harmful content, assessing the relevance of information in RAG systems, or even checking code for correctness and style adherence.
Evaluation prompt design
To write an effective evaluation prompt for the LLM, there are several steps to go through:
Content to evaluate — Tell the LLM judge about the type of content you want it to evaluate. Example: AI-generated response to customer queries.
Define assessment criteria — Your evaluation criteria need to align with the goals of your product or feature. Do you want the LLM to judge outputs on accuracy, tone of voice, helpfulness, etc.? Example: evaluate the response based on adherence to the company’s brand guidelines.
Score outputs — This part of the prompt sets the evaluation strategy for the LLM judge. You can choose the scoring method based on the kind of insight that you need. Example: rate this response on a 1–5 scale for each quality aspect.
Provide a scoring rubric — A rubric describes what “good” looks like across different score levels. For example, what counts as a 3 on brand voice compared to a 1. Adding examples for each score helps to really improve the reliability of the LLM’s output.
Define output format — Natural language is the right format for output that will be reviewed by humans. If the evaluation results are going to be consumed by automated pipelines, a structured output format like JSON is going to be more helpful.
Here’s what a complete evaluation prompt looks like in practice, optimised for automated systems that need structured data:
You are evaluating AI-generated customer service responses for Flying Bamboo, an e-commerce company. Each response addresses customer enquiries about order status, delivery issues, or product questions.
Evaluate each response based on three criteria:
Brand voice alignment — Does it reflect Flying Bamboo’s customer-centric and empathetic tone?
Problem resolution — Does it acknowledge the issue and provide a clear path forward?
Accuracy — Does it provide correct information about policies and timelines?
Scoring Rubric:
Brand voice alignment:
Pass: Response uses empathetic language, acknowledges customer feelings, and maintains a warm, professional tone (e.g., “I sincerely apologise for the delay with your order. I understand it’s been a week since your expected delivery date”)
Fail: Response is transactional, cold, or dismissive (e.g., “Your order is delayed due to operational issues”)
Problem resolution:
Pass: Clearly acknowledges the specific issue, explains the situation, and provides actionable next steps
Fail: Ignores the customer’s concern, provides vague explanations, or fails to offer a solution
Accuracy:
Pass: All stated policies, timelines, and procedures are correct per Flying Bamboo’s guidelines
Fail: Contains incorrect information about delivery times, return policies, or company procedures
Customer Input: {customer_message}
AI Response: {ai_response}
Provide your evaluation in the following JSON format:
{
  "brand_voice": {"score": "pass/fail", "reasoning": "explanation"},
  "problem_resolution": {"score": "pass/fail", "reasoning": "explanation"},
  "accuracy": {"score": "pass/fail", "reasoning": "explanation"},
  "overall_recommendation": "approve/revise"
}
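Once the judge returns structured JSON, it can slot straight into an automated pipeline. Here is a minimal sketch in Python, assuming the OpenAI Python SDK (any LLM API that returns text works the same way); the model name and variable names are placeholders:

import json
from openai import OpenAI  # assumption: using the OpenAI Python SDK; any chat-style client works similarly

client = OpenAI()

def judge_response(eval_prompt: str, customer_message: str, ai_response: str) -> dict:
    # Fill in the placeholders with str.replace rather than str.format,
    # because the prompt itself contains literal JSON braces.
    prompt = (
        eval_prompt
        .replace("{customer_message}", customer_message)
        .replace("{ai_response}", ai_response)
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as consistent as possible across runs
    )
    # Assumes the model returns bare JSON; in practice you may need to strip
    # code fences or retry on parse errors.
    return json.loads(completion.choices[0].message.content)

# Usage sketch: flying_bamboo_prompt holds the full evaluation prompt above (with the
# {customer_message} and {ai_response} placeholders left in), and candidate_reply is
# the assistant output you want to check.
verdict = judge_response(flying_bamboo_prompt, "Where is my order? It's been delayed for a week", candidate_reply)
if verdict["overall_recommendation"] == "revise":
    print("Flag for human review:", verdict)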
Main learning point: Starting with manual evaluations and detailed comments creates the foundation for scaling your AI quality checks. By transitioning from granular scoring to binary pass/fail judgements and then to LLM judges, you can build consistent, automated evals that maintain quality as you scale. The key is ensuring your evaluation prompts are specific, include clear rubrics with examples, and produce outputs in the right format for your use case.
Related links for further learning:
- https://platform.openai.com/docs/guides/evals
- https://weval.org/
- https://docs.anthropic.com/en/docs/test-and-evaluate/develop-tests
- https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- https://creatoreconomy.so/p/complete-ai-course-on-prompting-evals-rag-fine-tuning-adam-loving
- https://www.producttalk.org/interview-coach-evals-q-a/
- https://www.mindtheproduct.com/how-to-implement-effective-ai-evaluations/
- https://www.braintrust.dev/
- https://towardsdatascience.com/llm-as-a-judge-a-practical-guide/
- https://huggingface.co/learn/cookbook/en/llm_judge
- https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
- https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete
