Evaluations
Evaluations is currently in beta. We'd love to hear your feedback as we develop this feature.
Evaluations automatically assess the quality of your LLM generations and return a pass/fail result with reasoning. PostHog supports two types of evaluations:
- LLM-as-a-judge – Uses an LLM to score each generation against a prompt you define. Great for nuanced, subjective checks like tone, helpfulness, or hallucination detection.
- Code-based (Hog) – Runs deterministic code you write against each generation. Great for rule-based checks like format validation, keyword detection, or length limits. Free to run with no LLM cost.
Why use evaluations?
- Monitor output quality at scale – Automatically check if generations are helpful, relevant, or safe without manual review.
- Detect problematic content – Catch hallucinations, toxicity, or jailbreak attempts before they reach users.
- Track quality trends – See pass rates across models, prompts, or user segments over time.
- Debug with reasoning – Each evaluation provides an explanation for its decision, making it easy to understand failures.
Choosing an evaluation type
| | LLM-as-a-judge | Code-based (Hog) |
|---|---|---|
| Best for | Subjective quality checks (tone, helpfulness, hallucination) | Deterministic rule-based checks (format, keywords, length) |
| Cost | LLM API call per evaluation | Free |
| Speed | Seconds | Milliseconds |
| Consistency | May vary between runs | Deterministic — same input always produces same result |
| Setup | Write a prompt | Write Hog code |
LLM-as-a-judge evaluations
How they work
When a generation is captured, PostHog samples it based on your configured rate (0.1% to 100%). If sampled, the generation's input and output are sent to an LLM judge with your evaluation prompt. The judge returns a boolean pass/fail result plus reasoning, which is stored and linked to the original generation.
You can optionally filter which generations get evaluated using event properties or person properties. For example, only evaluate generations from production, from a specific model, or above a certain cost threshold. You can also use person properties to exclude internal users or target specific user segments.
Built-in templates
PostHog provides five pre-built evaluation templates to get you started:
| Template | What it checks | Best for |
|---|---|---|
| Relevance | Whether the output addresses the user's input | Customer support bots, Q&A systems |
| Helpfulness | Whether the response is useful and actionable | Chat assistants, productivity tools |
| Jailbreak | Attempts to bypass safety guardrails | Security-sensitive applications |
| Hallucination | Made-up facts or unsupported claims | RAG systems, knowledge bases |
| Toxicity | Harmful, offensive, or inappropriate content | User-facing applications |
Creating an LLM judge evaluation
- Navigate to LLM analytics > Evaluations
- Click New evaluation
- Select LLM-as-a-judge as the evaluation type
- Choose a template or start from scratch
- Configure the evaluation:
- Name: A descriptive name for the evaluation
- Prompt: The instructions for the LLM judge (templates provide sensible defaults)
- Sampling rate: Percentage of generations to evaluate (0.1% – 100%)
- Property filters (optional): Narrow which generations to evaluate using event or person properties
- Enable the evaluation and click Save
Writing custom prompts
When creating a custom evaluation, your prompt should instruct the LLM judge to return `true` (pass) or `false` (fail) along with reasoning. The judge receives the generation's input and output for context.
Tips for effective evaluation prompts:
- Be specific about what constitutes a pass or fail
- Include examples of edge cases when relevant
- Keep the prompt concise but comprehensive
Example custom prompt:
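The prompt below is one illustrative way to phrase a hallucination-style check; it is not a built-in template, so adapt the wording and pass/fail criteria to your use case:

```
You are evaluating an AI assistant's response for factual accuracy.

Return true (pass) if every claim in the response is supported by the
provided input or is widely accepted general knowledge.
Return false (fail) if the response invents facts, cites nonexistent
sources, or contradicts the input.

Explain your reasoning in one or two sentences.
```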
Code-based evaluations (Hog)
Code-based evaluations run Hog code you write against each generation. They execute in milliseconds with zero LLM cost, making them ideal for high-volume, deterministic checks.
How they work
- You write Hog code that inspects the generation's input and output.
- On save, PostHog compiles your code to bytecode.
- When a generation is sampled, the code runs against it in PostHog's HogVM.
- Your code must return `true` (pass) or `false` (fail). Use `print()` to add reasoning.
- If `allows_na` is enabled, returning `null` marks the result as N/A (not applicable).
Code-based evaluations share the same sampling rate and property filter options as LLM judge evaluations.
Available globals
Your Hog code has access to these variables:
| Variable | Type | Description |
|---|---|---|
| `input` | string or object | The LLM input (prompt or messages array) |
| `output` | string or object | The LLM output (response or choices) |
| `properties` | object | All event properties from the generation |
| `event.uuid` | string | The event UUID |
| `event.event` | string | The event name |
| `event.distinct_id` | string | The distinct ID of the user |
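As a quick sketch of how these globals fit together (the `$ai_model` property name is an assumption based on PostHog's LLM analytics event schema; check the properties on your own generation events):

```hog
// Illustrative only: inspect the globals available to evaluation code.
let model := ifNull(properties['$ai_model'], 'unknown')
print(f'event {event.uuid} ({event.event}) from {event.distinct_id}')
print(f'model: {model}, output length: {length(toString(output))}')
return true
```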
Creating a code-based evaluation
- Navigate to LLM analytics > Evaluations
- Click New evaluation
- Select Code-based (Hog) as the evaluation type
- Write your Hog code in the editor
- Click Test on sample to run your code against recent generations and verify it works
- Configure:
- Name: A descriptive name for the evaluation
- Sampling rate: Percentage of generations to evaluate (0.1% – 100%)
- Allows N/A: Whether your code can return `null` to skip inapplicable generations
- Property filters (optional): Narrow which generations to evaluate
- Enable the evaluation and click Save
Writing Hog evaluation code
Your code must return a boolean: `true` for pass, `false` for fail. Use `print()` statements to provide reasoning — the output is captured and stored alongside the result.
Hog tip: Use single quotes for strings (`'hello'`), `length()` instead of `len()`, and wrap property access with `ifNull()` to avoid null comparison errors.
Check output length:
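A minimal sketch; the 20- and 4000-character thresholds are illustrative, so tune them for your product:

```hog
let text := toString(output)
if (length(text) < 20) {
    print(f'Output too short: {length(text)} characters')
    return false
}
if (length(text) > 4000) {
    print(f'Output too long: {length(text)} characters')
    return false
}
print('Output length within bounds')
return true
```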
Check for required keywords:
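One possible sketch, assuming a support bot that must mention at least one policy keyword. The keyword list is illustrative, and note that Hog arrays are 1-indexed:

```hog
let text := lower(toString(output))
let keywords := ['refund', 'exchange', 'return policy']
for (let i := 1; i <= length(keywords); i := i + 1) {
    let kw := keywords[i]
    // position() returns 0 when the needle is absent
    if (position(text, kw) > 0) {
        print(f'Found required keyword: {kw}')
        return true
    }
}
print('No required keywords found')
return false
```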
Check model and cost thresholds:
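A sketch that fails expensive generations from a particular model family. The `$ai_model` and `$ai_total_cost_usd` property names are assumptions based on PostHog's LLM analytics schema; verify them against your own events, and pick a threshold that matches your budget:

```hog
// Assumed property names -- confirm against your generation events
let model := ifNull(properties['$ai_model'], '')
let cost := ifNull(properties['$ai_total_cost_usd'], 0)
if (model like 'gpt-4%' and cost > 0.05) {
    print(f'High-cost generation: {model} cost {cost} USD')
    return false
}
print('Within cost threshold')
return true
```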
Return N/A for non-applicable generations (requires `allows_na` enabled):
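For instance, a check that only applies to JSON-mode outputs might skip everything else. This is a sketch: the leading-brace heuristic and the required `"status"` field are deliberately simple placeholders:

```hog
let text := toString(output)
if (not (text like '{%')) {
    print('Not a JSON response; skipping')
    return null
}
if (text like '%"status"%') {
    print('JSON contains a status field')
    return true
}
print('JSON missing status field')
return false
```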
Testing on sample data
Before saving, use the Test on sample button to run your code against recent generations. This shows:
- The input and output from each sampled generation
- Whether your code returned pass, fail, or N/A
- Any `print()` output (reasoning)
- Any errors in your code
Testing does not create evaluation events or affect your data — it runs entirely in preview mode.
Using AI to generate evaluations
Click Generate with AI in the code editor to open Max, PostHog's AI assistant, with your evaluation context pre-loaded. Max can help you:
- Write Hog evaluation code from a description of what you want to check
- Debug errors in your existing code
- Iterate on the logic by testing and refining
Viewing results
The Evaluations page shows all your evaluations with their pass rates and recent activity. Click an evaluation to see its run history, including individual pass/fail results and the reasoning from the evaluation.
You can also filter generations by evaluation results or create insights based on evaluation data to build quality monitoring dashboards.
Pricing
Each evaluation run counts as one LLM analytics event toward your quota.
LLM judge evaluations use an LLM to score your generations. Your first 100 evaluation runs are on us so you can try the feature right away. After that, add your own API key from OpenAI, Google Gemini, Anthropic, OpenRouter, or Fireworks in Settings > LLM analytics to keep running evaluations.
If a provider API key becomes invalid or encounters an error, PostHog displays a warning banner on the evaluations page so you can take action quickly. Update or replace the key in Settings > LLM analytics.
Code-based evaluations have no LLM cost — they run your Hog code directly with no external API calls.
Use sampling rates strategically to balance coverage and cost – 5-10% sampling often provides sufficient signal for quality monitoring.