Evaluating AI work
Running 18 AI agents across an organization sounds impressive until you ask: how do you know they are doing a good job?
We have been building FlatNine Ensemble - a system where AI agents handle everything from security monitoring to content creation, from SEO analysis to customer support. Each agent learns, proposes work, and executes tasks autonomously. But autonomy without accountability is just chaos.
So we built an automated quality scoring system. Here is how it works and what we learned from scoring 100 real agent interactions.
The Problem
When you have AI agents responding to users, processing data, and making decisions 24/7, you cannot manually review every output. Traditional QA does not scale. You need the AI to evaluate itself - but in a way that actually catches problems.
The failure modes we were seeing:
- Hallucinated data - agents confidently stating incorrect facts, URLs, or prices
- Non-answers - verbose responses that sound helpful but do not actually answer the question
- Wrong context - responding to what the agent thinks you asked rather than what you actually asked
- Overkill - three paragraphs when a sentence would do
The Solution: Automated Quality Scoring
Every agent response now goes through a lightweight quality check. Here is the architecture:
1. Agent responds to a user message (via Telegram, API, etc.)
2. Background worker fires asynchronously - does not slow down response delivery
3. A cheap, fast model (GPT-4.1-nano, costing roughly $0.001 per score) evaluates the response on a 1-10 scale
4. Score and issues are logged to a database table
5. Low scores (below 6) trigger self-correction - feedback is written directly to the agent's memory file
The key insight: the scoring model is different from the responding model. We use the cheapest possible model to judge the most expensive one. It does not need to be a genius to spot hallucinated URLs or non-answers.
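Here is a minimal sketch of that pipeline in Python, assuming the official OpenAI client. The `handle_message`, `log_score`, and `write_feedback` names are placeholders for internal plumbing, not our production code; the point is that scoring runs off the request path and uses the cheap judge model.

```python
import asyncio
import json

from openai import AsyncOpenAI  # assumes the official openai client

client = AsyncOpenAI()
SCORER_MODEL = "gpt-4.1-nano"  # the cheap judge model

async def handle_message(agent, user_msg: str) -> str:
    """Respond first, score later: the scorer never delays delivery."""
    reply = await agent.respond(user_msg)  # the expensive responding model does the real work
    asyncio.create_task(score_response(agent, user_msg, reply))  # fire-and-forget background worker
    return reply

async def score_response(agent, user_msg: str, reply: str) -> None:
    """Ask the judge model for a 1-10 score and a list of issues."""
    prompt = (
        "Score this agent response from 1 to 10 for accuracy, relevance, "
        "conciseness, tone, and completeness. Return JSON like "
        '{"score": 7, "issues": ["..."]}.\n\n'
        f"Question: {user_msg}\n\nResponse: {reply}"
    )
    result = await client.chat.completions.create(
        model=SCORER_MODEL,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(result.choices[0].message.content)
    log_score(agent.name, verdict)                     # placeholder: insert into the scores table
    if verdict["score"] < 6:
        write_feedback(agent.name, user_msg, verdict)  # placeholder: append to the agent's memory file
```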
What the Scorer Checks
The quality scorer evaluates each response against five criteria:
- Accuracy - Did the agent hallucinate any URLs, names, prices, or facts?
- Relevance - Did it actually answer what was asked?
- Conciseness - Is the response appropriately sized for the question?
- Tone - Does it match the channel (Telegram vs. email vs. API)?
- Completeness - Did it provide everything needed, or cut off early?
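As a rough illustration, the whole rubric fits in a single prompt template for the judge model. The wording below is an assumption, not our production prompt; `{channel}`, `{question}`, and `{response}` are filled in per interaction.

```python
# Illustrative rubric prompt for the judge model; wording is an assumption, not the production prompt.
SCORING_PROMPT = """You are a strict quality reviewer for an AI agent's response.

Score the response from 1 (unusable) to 10 (excellent) against five criteria:
1. Accuracy: no hallucinated URLs, names, prices, or facts.
2. Relevance: it answers the question that was actually asked.
3. Conciseness: the length fits the question.
4. Tone: it suits the channel ({channel}).
5. Completeness: nothing essential is missing or cut off.

Question:
{question}

Response:
{response}

Return JSON: {{"score": <1-10>, "issues": ["<short description of each problem>"]}}
"""
```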
Results: Scoring 100 Real Interactions
We ran the scorer against our last 100 agent interactions. The results were humbling:
| Metric | Value |
|---|---|
| Average Score | 6.5/10 |
| Low Scores (below 6) | 32% |
| Hallucinations | 12% of responses |
| Non-answers | 8% of responses |
What Went Wrong
The biggest category of failures was not answering the actual question. When a user says "you back?" the agent should say "yes" - not dump a technical status report. When a user says "smoke test" the agent should run a smoke test - not return random RSS content.
The second category was hallucinated details. The agent would confidently describe task completion statuses, file locations, and specific metrics that simply did not exist.
What Went Right
When the agent was given clear, specific questions and had access to the right tools, quality scores averaged 8-9/10. The system works well for:
- Structured queries (calendar, tasks, database lookups)
- Technical work (code changes, system administration)
- Analysis with data (usage reports, trend analysis)
The Self-Correction Loop
This is where it gets interesting. When a response scores below 6, the system automatically writes feedback to the agent's memory:
```
## Feedback - 2026-04-02 (score: 4)
Question: what is the weather?
Issues: Does not provide actual weather information
Action: Avoid these issues in future responses.
```
Next time the agent runs, it reads its memory directory - including this feedback file. The agent literally learns from its own mistakes, without any human review process.
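A minimal sketch of the feedback write, assuming one memory directory per agent holding markdown files (the `memory/<agent>/quality-feedback.md` path is an assumption):

```python
from datetime import date
from pathlib import Path

def write_feedback(agent_name: str, question: str, verdict: dict) -> None:
    """Append a low-score entry to the agent's quality-feedback memory file."""
    # Assumed layout: one memory directory per agent, read in full on the agent's next run.
    memory_file = Path("memory") / agent_name / "quality-feedback.md"
    memory_file.parent.mkdir(parents=True, exist_ok=True)
    entry = (
        f"\n## Feedback - {date.today().isoformat()} (score: {verdict['score']})\n"
        f"Question: {question}\n"
        f"Issues: {'; '.join(verdict['issues'])}\n"
        "Action: Avoid these issues in future responses.\n"
    )
    with memory_file.open("a", encoding="utf-8") as f:
        f.write(entry)
```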
Over time, the quality-feedback file becomes a personalized improvement guide for each agent. Agent Wayne learns to be more concise. Agent Carla learns to personalize outreach emails. Agent Betty learns to handle edge cases in billing queries.
Cost
The entire system costs roughly $0.001 per scored response using GPT-4.1-nano. For 18 agents handling maybe 200 interactions per day combined, that works out to 200 × $0.001 × 30 days, or about $6 per month. Compare that to the cost of a single bad response reaching a customer.
What We Would Do Differently
If we were starting over:
- Score from day one. We had months of unscored interactions. The feedback loop compounds - earlier is better.
- Start with harsher scoring. Our initial scores were too generous. A 6/10 response is not "okay" - it is a response that failed the user in some way.
- Add channel-specific scoring. A Telegram response has different quality standards than an API response. We now adjust expectations per channel.
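For the channel-specific part, a small expectations table injected into the judge prompt goes a long way. The values below are illustrative, not our production settings:

```python
# Illustrative per-channel quality expectations, injected into the scoring prompt.
CHANNEL_EXPECTATIONS = {
    "telegram": "Casual tone, a few sentences at most, emoji acceptable.",
    "email": "Professional tone, greeting and sign-off, complete details.",
    "api": "Structured data only, no conversational filler.",
}

def channel_hint(channel: str) -> str:
    """Return the quality bar for a channel, with a neutral default for unknown channels."""
    return CHANNEL_EXPECTATIONS.get(channel, "Clear, correct, and appropriately brief.")
```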
The Bigger Picture
This is not just about catching bad responses. It is about building AI systems that improve themselves without human intervention. The learning phase feeds the pitching phase. The pitching phase feeds the working phase. And the quality scorer feeds back into the learning loop.
We call it the agent lifecycle: Learn, Pitch, Work, Score, Improve. Repeat daily.
The agents that score consistently high get more autonomy. The ones that score low get more guardrails. It is natural selection for AI workers.
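One simple way to operationalise that last step, sketched under the assumption that each agent's recent scores are queryable (the thresholds are illustrative):

```python
from statistics import mean

def autonomy_level(recent_scores: list[int]) -> str:
    """Map an agent's rolling score history to an autonomy tier."""
    if not recent_scores:
        return "supervised"     # no track record yet
    avg = mean(recent_scores)
    if avg >= 8:
        return "autonomous"     # consistently high: fewer approvals required
    if avg >= 6:
        return "supervised"     # mixed: a human reviews risky actions
    return "guardrailed"        # consistently low: every action needs approval

# Example: an agent averaging 8.6 over its recent work earns more autonomy.
print(autonomy_level([9, 8, 9, 8, 9]))  # -> "autonomous"
```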