Custom Evaluators
By default, CT uses an LLM-as-judge strategy to evaluate candidate prompts during optimization. This tutorial shows how to configure the judge model, customize evaluation criteria, and understand how scores determine the winning prompt.
Default Evaluator
When you trigger optimization without specifying an evaluator, CT runs its built-in LLM-as-judge evaluator:
The LLM-as-judge evaluator runs 3 evaluations per example, randomizing the response ordering on each run to prevent position bias. The final score is the average of the 3 judgements.
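The multi-run scheme above can be sketched as follows. This is an illustration of the concept, not CT's internal implementation — `score_example` and `judge_fn` are hypothetical stand-ins, where `judge_fn` returns a 0–1 score for the candidate regardless of presentation order:

```python
import random
import statistics

def score_example(judge_fn, candidate: str, baseline: str, n_runs: int = 3) -> float:
    """Run the judge n_runs times, shuffling the order the two responses
    are presented in on each run, and average the resulting 0-1 scores."""
    scores = []
    for _ in range(n_runs):
        pair = [candidate, baseline]
        random.shuffle(pair)  # randomized ordering to counter position bias
        scores.append(judge_fn(pair[0], pair[1], candidate))
    return statistics.mean(scores)
```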
The default judge model is gpt-4o-mini and the default teacher model (used to generate candidate prompts) is gpt-4o. Both are configurable.
Configuring Judge Criteria
Pass evaluator_config and model overrides when creating a task:
```python
from kaizen_sdk import CTClient

client = CTClient()

task = client.create_task(
    name="summarize_ticket",
    description="Summarize support tickets into one or two sentences",
    feedback_threshold=50,
    evaluator_config={
        "criteria": "accuracy, completeness, conciseness",
        "scale": "0-1",
    },
    judge_model="gpt-4o",         # upgrade judge for higher-quality evaluation
    teacher_model="gpt-4o-mini",  # cheaper model for generating candidates
)
```

evaluator_config fields
| Field | Type | Description |
|---|---|---|
| `criteria` | str | Comma-separated evaluation criteria (e.g. `"accuracy, conciseness"`) |
| `scale` | str | Score scale description shown to the judge (e.g. `"0-1"`, `"1-5"`) |
The judge model receives a prompt like:
“Score the following output on a scale of 0-1 for: accuracy, completeness, conciseness.”
Changing criteria shifts what the judge optimizes for — be explicit about what “good” means for your task.
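As an illustration, the judge prompt above could be assembled from the config fields roughly like this. `build_judge_prompt` is a hypothetical reconstruction, not CT's actual template:

```python
def build_judge_prompt(criteria: str, scale: str, output: str) -> str:
    """Assemble an LLM-as-judge prompt from evaluator_config fields."""
    return (
        f"Score the following output on a scale of {scale} for: {criteria}.\n\n"
        f"Output:\n{output}"
    )
```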
Updating an Existing Task
You can update evaluator settings on an existing task via the API:
```python
import httpx

httpx.patch(
    f"{client._base_url}/api/v1/tasks/{task_id}",
    headers={"X-API-Key": client._api_key},
    json={
        "evaluator_config": {
            "criteria": "accuracy, tone, brevity",
            "scale": "0-1",
        },
        "judge_model": "gpt-4o",
    },
)
```

Changes take effect on the next optimization run.
Composite Evaluators
For complex tasks, list multiple criteria in evaluator_config.criteria. The judge evaluates all criteria in a single call and returns a combined score. This is more cost-effective than running separate evaluations per criterion.
```python
task = client.create_task(
    name="customer_response",
    evaluator_config={
        "criteria": "empathy, accuracy, resolution_clarity, professional_tone",
        "scale": "0-1",
    },
    judge_model="gpt-4o",
)
```

Use 2–5 criteria for best results. Too many criteria in a single prompt can reduce judge reliability. If you need more, consider splitting into separate tasks.
Using Deterministic Metrics Alongside LLM-as-Judge
For tasks with verifiable outputs (code generation, structured extraction, classification), combine the LLM judge with deterministic metrics by post-processing feedback before logging.
Exact match
```python
def evaluate_and_log(task_id: str, inputs: dict, output: str, expected: str) -> None:
    """Score by exact match, then log to CT."""
    score = 1.0 if output.strip() == expected.strip() else 0.0
    client.log_feedback(
        task_id=task_id,
        inputs=inputs,
        output=output,
        score=score,
        source="deterministic",
    )
```

Custom scoring function
```python
def rouge_score(hypothesis: str, reference: str) -> float:
    """Simple unigram recall as a proxy for ROUGE-1."""
    hyp_tokens = set(hypothesis.lower().split())
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(hyp_tokens & ref_tokens) / len(ref_tokens)

def evaluate_summary(task_id: str, text: str, output: str, reference: str) -> None:
    score = rouge_score(output, reference)
    client.log_feedback(
        task_id=task_id,
        inputs={"text": text},
        output=output,
        score=score,
        source="rouge",
    )
```

Blended scoring
Combine LLM and deterministic scores for robust evaluation:
```python
def blended_score(llm_score: float, deterministic_score: float, weight: float = 0.5) -> float:
    """Blend LLM judge and deterministic metric scores."""
    return llm_score * weight + deterministic_score * (1 - weight)
```

CT does not enforce how you compute scores — `log_feedback(score=...)` accepts any float in 0–1. Use whatever evaluation logic fits your task.
Choosing the Right Model Pair
| Use case | teacher_model | judge_model | Notes |
|---|---|---|---|
| Fast / low-cost iteration | gpt-4o-mini | gpt-4o-mini | ~5× cheaper, less accurate |
| Balanced (default) | gpt-4o | gpt-4o-mini | Best cost/quality tradeoff |
| High-stakes prompts | gpt-4o | gpt-4o | Highest quality, higher cost |
| Anthropic models | claude-3-5-sonnet-20241022 | claude-3-haiku-20240307 | Via LiteLLM pass-through |
Set cost_budget to cap spending per optimization run:
```python
task = client.create_task(
    name="summarize_ticket",
    evaluator_config={"criteria": "accuracy, conciseness"},
    judge_model="gpt-4o",
    teacher_model="gpt-4o",
    cost_budget=2.50,  # hard cap in USD
)
```

Next Steps
- Trigger an optimization: Integrate with an Existing App
- Seed training data: Seed Data for Cold Start
- Auto-PR delivery: Auto-PR with GitHub