Measure the quality of your AI outputs with custom and built-in evaluators.
Evaluation lets you systematically score your model’s outputs against a dataset. This is essential for catching regressions, comparing model versions, and building confidence before deploying changes.
Dataset — A JSON file containing test cases. Each case has an input, an optional output (the model’s response), an optional reference (expected answer), and optional context (retrieved documents for RAG).
Evaluator — A function that scores a test case and returns a numeric, boolean, or string score.
Eval run — The result of running a dataset through one or more evaluators.
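To make these three terms concrete, here is a minimal TypeScript sketch that does not use the Genkit API. The names `DataPoint`, `Score`, `EvalResult`, `Evaluator`, `wordCountEvaluator`, and `runEval` are hypothetical and chosen only for illustration; Genkit's own types and registration calls differ.

```typescript
// Hypothetical shapes for illustration only; not Genkit's actual types.
interface DataPoint {
  testCaseId?: string;
  input: unknown;
  output?: unknown;
  reference?: unknown;
  context?: unknown[];
}

// A score may be numeric, boolean, or a string label.
type Score = number | boolean | string;

interface EvalResult {
  testCaseId: string;
  evaluator: string;
  score: Score;
}

interface Evaluator {
  name: string;
  score: (dp: DataPoint) => Score | Promise<Score>;
}

// A simple programmatic evaluator: counts whitespace-separated words in the output.
const wordCountEvaluator: Evaluator = {
  name: 'wordCount',
  score: (dp) =>
    String(dp.output ?? '')
      .trim()
      .split(/\s+/)
      .filter(Boolean).length,
};

// An "eval run" in this sketch: every data point scored by every evaluator.
async function runEval(dataset: DataPoint[], evaluators: Evaluator[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (let i = 0; i < dataset.length; i++) {
    const dp = dataset[i];
    for (const ev of evaluators) {
      results.push({
        // Fall back to a positional ID when the data point has none (an assumption of this sketch).
        testCaseId: dp.testCaseId ?? `case-${i}`,
        evaluator: ev.name,
        score: await ev.score(dp),
      });
    }
  }
  return results;
}
```

Calling `runEval(dataset, [wordCountEvaluator])` would yield one result per data point per evaluator, which is roughly the information an eval run records.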
Evaluators can themselves use an LLM to score outputs — commonly called “LLM-as-judge”. This is useful for qualities that are hard to measure programmatically, such as coherence, factuality, or helpfulness:
```ts
const coherenceEvaluator = ai.defineEvaluator(
  {
    name: 'coherence',
    displayName: 'Coherence',
    definition: 'Rates how logically coherent and well-structured the response is.',
    isBilled: true, // flag as a billed evaluator (uses an LLM)
  },
  async (datapoint) => {
    const response = await ai.generate({
      prompt: [
        { text: 'You are an expert evaluator. Rate the following text for coherence on a scale of 1-5.' },
        { text: `Text to evaluate:\n${datapoint.output}` },
        { text: 'Respond with JSON: { "score": <number>, "reasoning": "<explanation>" }' },
      ],
      output: {
        schema: z.object({ score: z.number(), reasoning: z.string() }),
      },
    });
    const result = response.output!;
    return {
      testCaseId: datapoint.testCaseId,
      evaluation: {
        score: result.score,
        status: result.score >= 4 ? 'PASS' : 'FAIL',
        details: { reasoning: result.reasoning },
      },
    };
  }
);
```
Datasets are JSON files containing an array of data points. Each data point matches the BaseDataPoint schema:
```json
[
  {
    "testCaseId": "case-001",
    "input": "Summarize the water cycle in two sentences.",
    "output": "Water evaporates from oceans and lakes, rises as vapor, condenses into clouds, and falls back to Earth as precipitation. This cycle continuously distributes fresh water across the planet.",
    "reference": "The water cycle involves evaporation, condensation, and precipitation."
  },
  {
    "testCaseId": "case-002",
    "input": "What is photosynthesis?",
    "output": "Photosynthesis is the process plants use to convert sunlight, water, and carbon dioxide into glucose and oxygen."
  }
]
```
| Field | Required | Description |
| --- | --- | --- |
| `input` | Yes | The input given to the model. |
| `output` | No | The model’s response. If omitted, `eval:run` will generate one. |
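The field rules above (a required `input`, an optional `output`) can be checked before starting a run. The following is a minimal sketch of such a pre-flight check; `RawDataPoint` and `checkDataset` are hypothetical names and are not part of Genkit, which performs its own validation.

```typescript
// Hypothetical pre-flight check for a parsed dataset file; not a Genkit API.
interface RawDataPoint {
  testCaseId?: string;
  input?: unknown;
  output?: unknown;
  reference?: unknown;
  context?: unknown[];
}

// Splits data points into those that already carry an output and those that do not.
// Throws if the dataset is not an array or if any data point lacks "input".
function checkDataset(data: unknown): { ready: RawDataPoint[]; needsOutput: RawDataPoint[] } {
  if (!Array.isArray(data)) {
    throw new Error('Dataset must be a JSON array of data points.');
  }
  const ready: RawDataPoint[] = [];
  const needsOutput: RawDataPoint[] = [];
  data.forEach((item, index) => {
    if (item === null || typeof item !== 'object' || (item as RawDataPoint).input === undefined) {
      throw new Error(`Data point at index ${index} is missing the required "input" field.`);
    }
    const dp = item as RawDataPoint;
    (dp.output === undefined ? needsOutput : ready).push(dp);
  });
  return { ready, needsOutput };
}

// Example: one case with an output, one without.
const parsed: unknown = JSON.parse(`[
  { "testCaseId": "case-001", "input": "What is photosynthesis?", "output": "Plants convert light into chemical energy." },
  { "testCaseId": "case-002", "input": "Summarize the water cycle in two sentences." }
]`);
const { ready, needsOutput } = checkDataset(parsed);
console.log(`${ready.length} with output, ${needsOutput.length} without`);
// prints "1 with output, 1 without"
```

Cases that land in `needsOutput` are the ones for which an output would still have to be generated.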
Use genkit eval:run to run a dataset through all registered evaluators:
```bash
# Start your app in dev mode first
genkit start -- npx tsx src/index.ts

# In another terminal, run the evaluation
genkit eval:run dataset.json

# Run with specific evaluators only
genkit eval:run dataset.json --evaluators wordCount,coherence

# Save results to a file
genkit eval:run dataset.json --output results.json

# Use parallel batching for speed
genkit eval:run dataset.json --batchSize 4
```
After the run completes, the CLI prints a link to view results in the Dev UI:
View the evaluation results at: http://localhost:4000/evaluate/eval-run-id
Evaluators marked isBilled: true use LLM calls and may incur API charges. The CLI prompts you to confirm before running billed evaluators. Use --force to skip the confirmation.
For Retrieval-Augmented Generation (RAG) pipelines, the @genkit-ai/evaluators package includes RAGAS-based metrics that measure both retrieval and generation quality:
```ts
import { genkitEval, GenkitMetric } from '@genkit-ai/evaluators';

const ai = genkit({
  plugins: [
    googleAI(),
    genkitEval({
      judge: googleAI.model('gemini-2.0-flash'),
      metrics: [
        GenkitMetric.CONTEXT_RECALL, // Did retrieved docs contain the answer?
        GenkitMetric.CONTEXT_PRECISION, // Was retrieval precise (low noise)?
        GenkitMetric.FAITHFULNESS, // Does the answer only use retrieved facts?
        GenkitMetric.ANSWER_RELEVANCY, // Is the answer on-topic?
      ],
    }),
  ],
});
```
For RAG evaluations, include the retrieved context in your dataset:
```json
[
  {
    "input": "What is the return policy?",
    "output": "You can return items within 30 days of purchase.",
    "context": [
      "Our return policy allows returns within 30 days for unused items.",
      "Items must be in original packaging."
    ],
    "reference": "30-day return window for unused items in original packaging."
  }
]
```
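If you collect RAG test cases from a running pipeline, a small helper can assemble data points in this shape. The sketch below is one possible approach, not a Genkit API: `RetrievedDoc`, `RagDataPoint`, and `toRagDataPoint` are hypothetical names, and it assumes your retriever returns documents with a plain `text` field.

```typescript
// Hypothetical shapes; adapt to whatever your retriever actually returns.
interface RetrievedDoc {
  text: string;
}

interface RagDataPoint {
  input: string;
  output: string;
  context: string[];
  reference?: string;
}

// Builds one data point with the input/output/context/reference fields used in RAG datasets.
function toRagDataPoint(
  question: string,
  docs: RetrievedDoc[],
  answer: string,
  reference?: string
): RagDataPoint {
  const dp: RagDataPoint = {
    input: question,
    output: answer,
    context: docs.map((d) => d.text), // keep only the document text
  };
  if (reference !== undefined) {
    dp.reference = reference; // omit the field entirely when no expected answer exists
  }
  return dp;
}

const ragDataset: RagDataPoint[] = [
  toRagDataPoint(
    'What is the return policy?',
    [
      { text: 'Our return policy allows returns within 30 days for unused items.' },
      { text: 'Items must be in original packaging.' },
    ],
    'You can return items within 30 days of purchase.',
    '30-day return window for unused items in original packaging.'
  ),
];

// Serialize to the JSON array format expected in a dataset file.
console.log(JSON.stringify(ragDataset, null, 2));
```

Writing the serialized array to a file such as `dataset.json` gives you a dataset you can pass to `genkit eval:run`.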