
Custom Evaluators

By default, CT uses an LLM-as-judge strategy to evaluate candidate prompts during optimization. This tutorial shows how to configure the judge model, customize evaluation criteria, and understand how scores determine the winning prompt.


Default Evaluator

When you trigger optimization without specifying an evaluator, CT runs its built-in LLM-as-judge evaluator:

The LLM-as-judge evaluator runs three evaluations per example, randomizing response ordering each time to prevent position bias. The final score is the average of the three judgements.
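The aggregation step can be sketched as follows. This is an illustrative sketch, not the SDK's actual code, and `judge_once` is a hypothetical callable standing in for a single judge-model call:

```python
import random
import statistics

def judge_example(candidate: str, baseline: str, judge_once, n_runs: int = 3) -> float:
    """Average n_runs judge calls, shuffling response order each time
    (illustrative sketch of the built-in evaluator, not the SDK's actual code)."""
    scores = []
    for _ in range(n_runs):
        pair = [candidate, baseline]
        random.shuffle(pair)  # randomize presentation order to counter position bias
        score = judge_once(pair[0], pair[1])  # judge scores the first response
        if pair[0] != candidate:
            score = 1.0 - score  # flip so the score always refers to the candidate
        scores.append(score)
    return statistics.mean(scores)
```

Because the score is flipped back to the candidate's perspective after each shuffle, the average is stable regardless of presentation order.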

The default judge model is gpt-4o-mini and the default teacher model (used to generate candidate prompts) is gpt-4o. Both are configurable.


Configuring Judge Criteria

Pass evaluator_config and model overrides when creating a task:

```python
from kaizen_sdk import CTClient

client = CTClient()

task = client.create_task(
    name="summarize_ticket",
    description="Summarize support tickets into one or two sentences",
    feedback_threshold=50,
    evaluator_config={
        "criteria": "accuracy, completeness, conciseness",
        "scale": "0-1",
    },
    judge_model="gpt-4o",         # upgrade judge for higher-quality evaluation
    teacher_model="gpt-4o-mini",  # cheaper model for generating candidates
)
```

evaluator_config fields

| Field | Type | Description |
| --- | --- | --- |
| criteria | str | Comma-separated evaluation criteria (e.g. "accuracy, conciseness") |
| scale | str | Score scale description shown to the judge (e.g. "0-1", "1-5") |
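Since both fields are free-form strings, a small client-side sanity check can catch typos before a config reaches the API. The helper below is our own, not part of the SDK:

```python
def validate_evaluator_config(cfg: dict) -> None:
    """Reject unknown keys and non-string values before sending to the API.
    (Our own helper, not part of kaizen_sdk.)"""
    allowed = {"criteria", "scale"}
    unknown = set(cfg) - allowed
    if unknown:
        raise ValueError(f"unknown evaluator_config keys: {sorted(unknown)}")
    for key in allowed & set(cfg):
        if not isinstance(cfg[key], str):
            raise TypeError(f"{key} must be a str")
```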

The judge model receives a prompt like:

“Score the following output on a scale of 0-1 for: accuracy, completeness, conciseness.”

Changing criteria shifts what the judge optimizes for — be explicit about what “good” means for your task.
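Concretely, the prompt might be assembled like this. This is an illustrative sketch; CT's actual template is not shown in this tutorial:

```python
def build_judge_prompt(evaluator_config: dict, output: str) -> str:
    """Assemble a judge prompt from evaluator_config (illustrative sketch;
    CT's actual template may differ)."""
    criteria = evaluator_config.get("criteria", "overall quality")
    scale = evaluator_config.get("scale", "0-1")
    return (
        f"Score the following output on a scale of {scale} for: {criteria}.\n\n"
        f"Output:\n{output}"
    )
```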


Updating an Existing Task

You can update evaluator settings on an existing task via the API:

```python
import httpx

httpx.patch(
    f"{client._base_url}/api/v1/tasks/{task_id}",
    headers={"X-API-Key": client._api_key},
    json={
        "evaluator_config": {
            "criteria": "accuracy, tone, brevity",
            "scale": "0-1",
        },
        "judge_model": "gpt-4o",
    },
)
```

Changes take effect on the next optimization run.


Composite Evaluators

For complex tasks, list multiple criteria in evaluator_config.criteria. The judge evaluates all criteria in a single call and returns a combined score. This is more cost-effective than running separate evaluations per criterion.

```python
task = client.create_task(
    name="customer_response",
    evaluator_config={
        "criteria": "empathy, accuracy, resolution_clarity, professional_tone",
        "scale": "0-1",
    },
    judge_model="gpt-4o",
)
```

Use 2–5 criteria for best results. Too many criteria in a single prompt can reduce judge reliability. If you need more, consider splitting into separate tasks.


Using Deterministic Metrics Alongside LLM-as-Judge

For tasks with verifiable outputs (code generation, structured extraction, classification), combine the LLM judge with deterministic metrics by post-processing feedback before logging.

Exact match

```python
def evaluate_and_log(task_id: str, inputs: dict, output: str, expected: str) -> None:
    """Score by exact match, then log to CT."""
    score = 1.0 if output.strip() == expected.strip() else 0.0
    client.log_feedback(
        task_id=task_id,
        inputs=inputs,
        output=output,
        score=score,
        source="deterministic",
    )
```

Custom scoring function

```python
def rouge_score(hypothesis: str, reference: str) -> float:
    """Simple unigram recall as a proxy for ROUGE-1."""
    hyp_tokens = set(hypothesis.lower().split())
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(hyp_tokens & ref_tokens) / len(ref_tokens)


def evaluate_summary(task_id: str, text: str, output: str, reference: str) -> None:
    score = rouge_score(output, reference)
    client.log_feedback(
        task_id=task_id,
        inputs={"text": text},
        output=output,
        score=score,
        source="rouge",
    )
```

Blended scoring

Combine LLM and deterministic scores for robust evaluation:

```python
def blended_score(llm_score: float, deterministic_score: float, weight: float = 0.5) -> float:
    """Blend LLM judge and deterministic metric scores."""
    return llm_score * weight + deterministic_score * (1 - weight)
```

CT does not enforce how you compute scores — log_feedback(score=...) accepts any float in 0–1. Use whatever evaluation logic fits your task.
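For example, you might blend a judge score with an exact-match check before logging. The snippet below restates `blended_score` so it runs standalone; `exact_match` is our own helper:

```python
def blended_score(llm_score: float, deterministic_score: float, weight: float = 0.5) -> float:
    return llm_score * weight + deterministic_score * (1 - weight)

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

# The judge liked the output (0.8) and it matches the expected answer exactly (1.0),
# so the blended score at equal weights is 0.9.
final = blended_score(0.8, exact_match("42 ", "42"))
```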


Choosing the Right Model Pair

| Use case | teacher_model | judge_model | Notes |
| --- | --- | --- | --- |
| Fast / low-cost iteration | gpt-4o-mini | gpt-4o-mini | ~5× cheaper, less accurate |
| Balanced (default) | gpt-4o | gpt-4o-mini | Best cost/quality tradeoff |
| High-stakes prompts | gpt-4o | gpt-4o | Highest quality, higher cost |
| Anthropic models | claude-3-5-sonnet-20241022 | claude-3-haiku-20240307 | Via LiteLLM pass-through |
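If you switch between these pairings often, the table can be captured as a small lookup helper. The preset names below are our own, not part of the SDK:

```python
# Hypothetical presets mirroring the table above; the names are our own.
MODEL_PAIRS = {
    "cheap": ("gpt-4o-mini", "gpt-4o-mini"),
    "balanced": ("gpt-4o", "gpt-4o-mini"),
    "high_stakes": ("gpt-4o", "gpt-4o"),
}

def model_pair(preset: str) -> tuple[str, str]:
    """Return (teacher_model, judge_model) for a named preset."""
    return MODEL_PAIRS[preset]
```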

Set cost_budget to cap spending per optimization run:

```python
task = client.create_task(
    name="summarize_ticket",
    evaluator_config={"criteria": "accuracy, conciseness"},
    judge_model="gpt-4o",
    teacher_model="gpt-4o",
    cost_budget=2.50,  # hard cap in USD
)
```
