Evaluating AI work
Running 18 AI agents across an organization sounds impressive until you ask: how do you know they are doing a good job?
We have been building FlatNine Ensemble - a system where AI agents handle everything from security monitoring to content creation, from SEO analysis to customer support. Each agent learns, proposes work, and executes tasks autonomously. But autonomy without accountability is just chaos.
So we built an automated quality scoring system. Here is how it works and what we learned from scoring 100 real agent interactions.
The Problem
When you have AI agents responding to users, processing data, and making decisions 24/7, you cannot manually review every output. Traditional QA does not scale. You need the AI to evaluate itself - but in a way that actually catches problems.
The failure modes we were seeing:
- Hallucinated data - agents confidently stating incorrect facts, URLs, or prices
- Non-answers - verbose responses that sound helpful but do not actually answer the question
- Wrong context - responding to what the agent thinks you asked rather than what you actually asked
- Overkill - three paragraphs when a sentence would do
The Solution: Automated Quality Scoring
Every agent response now goes through a lightweight quality check. Here is the architecture:
1. Agent responds to a user message (via Telegram, API, etc.)
2. Background worker fires asynchronously - does not slow down response delivery
3. A cheap, fast model (GPT-4.1-nano, costing roughly $0.001 per score) evaluates the response on a 1-10 scale
4. Score and issues are logged to a database table
5. Low scores (below 6) trigger self-correction - feedback is written directly to the agent's memory file
The key insight: the scoring model is different from the responding model. We use the cheapest possible model to judge the most expensive one. It does not need to be a genius to spot hallucinated URLs or non-answers.
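Here is a minimal sketch of that pipeline in Python, assuming the official OpenAI client. The `handle_message`, `log_score`, and `write_feedback` names are placeholders for internal plumbing, not our production code; the point is that scoring runs off the request path and uses the cheap judge model.

```python
import asyncio
import json

from openai import AsyncOpenAI  # assumes the official openai client

client = AsyncOpenAI()
SCORER_MODEL = "gpt-4.1-nano"  # the cheap judge model

async def handle_message(agent, user_msg: str) -> str:
    """Respond first, score later: the scorer never delays delivery."""
    reply = await agent.respond(user_msg)  # the expensive responding model does the real work
    asyncio.create_task(score_response(agent, user_msg, reply))  # fire-and-forget background worker
    return reply

async def score_response(agent, user_msg: str, reply: str) -> None:
    """Ask the judge model for a 1-10 score and a list of issues."""
    prompt = (
        "Score this agent response from 1 to 10 for accuracy, relevance, "
        "conciseness, tone, and completeness. Return JSON like "
        '{"score": 7, "issues": ["..."]}.\n\n'
        f"Question: {user_msg}\n\nResponse: {reply}"
    )
    result = await client.chat.completions.create(
        model=SCORER_MODEL,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(result.choices[0].message.content)
    log_score(agent.name, verdict)                     # placeholder: insert into the scores table
    if verdict["score"] < 6:
        write_feedback(agent.name, user_msg, verdict)  # placeholder: append to the agent's memory file
```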
What the Scorer Checks
The quality scorer evaluates each response against five criteria:
- Accuracy - Did the agent hallucinate any URLs, names, prices, or facts?
- Relevance - Did it actually answer what was asked?
- Conciseness - Is the response appropriately sized for the question?
- Tone - Does it match the channel (Telegram vs. email vs. API)?
- Completeness - Did it provide everything needed, or cut off early?
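As a rough illustration, the whole rubric fits in a single prompt template for the judge model. The wording below is an assumption, not our production prompt; `{channel}`, `{question}`, and `{response}` are filled in per interaction.

```python
# Illustrative rubric prompt for the judge model; wording is an assumption, not the production prompt.
SCORING_PROMPT = """You are a strict quality reviewer for an AI agent's response.

Score the response from 1 (unusable) to 10 (excellent) against five criteria:
1. Accuracy: no hallucinated URLs, names, prices, or facts.
2. Relevance: it answers the question that was actually asked.
3. Conciseness: the length fits the question.
4. Tone: it suits the channel ({channel}).
5. Completeness: nothing essential is missing or cut off.

Question:
{question}

Response:
{response}

Return JSON: {{"score": <1-10>, "issues": ["<short description of each problem>"]}}
"""
```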
Results: Scoring 100 Real Interactions
We ran the scorer against our last 100 agent interactions. The results were humbling:
| Metric | Value |
|---|---|
| Average Score | 6.5/10 |
| Low Scores (below 6) | 32% |
| Hallucinations | 12% of responses |
| Non-answers | 8% of responses |
What Went Wrong
The biggest category of failures was not answering the actual question. When a user says "you back?" the agent should say "yes" - not dump a technical status report. When a user says "smoke test" the agent should run a smoke test - not return random RSS content.
The second category was hallucinated details. The agent would confidently describe task completion statuses, file locations, and specific metrics that simply did not exist.
What Went Right
When the agent was given clear, specific questions and had access to the right tools, quality scores averaged 8-9/10. The system works well for:
- Structured queries (calendar, tasks, database lookups)
- Technical work (code changes, system administration)
- Analysis with data (usage reports, trend analysis)
The Self-Correction Loop
This is where it gets interesting. When a response scores below 6, the system automatically writes feedback to the agent's memory:
```
## Feedback - 2026-04-02 (score: 4)
Question: what is the weather?
Issues: Does not provide actual weather information
Action: Avoid these issues in future responses.
```
Next time the agent runs, it reads its memory directory - including this feedback file. The agent literally learns from its own mistakes, without any human review process.
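A minimal sketch of the feedback write, assuming one memory directory per agent holding markdown files (the `memory/<agent>/quality-feedback.md` path is an assumption):

```python
from datetime import date
from pathlib import Path

def write_feedback(agent_name: str, question: str, verdict: dict) -> None:
    """Append a low-score entry to the agent's quality-feedback memory file."""
    # Assumed layout: one memory directory per agent, read in full on the agent's next run.
    memory_file = Path("memory") / agent_name / "quality-feedback.md"
    memory_file.parent.mkdir(parents=True, exist_ok=True)
    entry = (
        f"\n## Feedback - {date.today().isoformat()} (score: {verdict['score']})\n"
        f"Question: {question}\n"
        f"Issues: {'; '.join(verdict['issues'])}\n"
        "Action: Avoid these issues in future responses.\n"
    )
    with memory_file.open("a", encoding="utf-8") as f:
        f.write(entry)
```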
Over time, the quality-feedback file becomes a personalized improvement guide for each agent. Agent Wayne learns to be more concise. Agent Carla learns to personalize outreach emails. Agent Betty learns to handle edge cases in billing queries.
Cost
The entire system costs roughly $0.001 per scored response using GPT-4.1-nano. For 18 agents handling maybe 200 interactions per day combined, that works out to 200 × $0.001 × 30 days, or about $6 per month. Compare that to the cost of a single bad response reaching a customer.
What We Would Do Differently
If we were starting over:
- Score from day one. We had months of unscored interactions. The feedback loop compounds - earlier is better.
- Start with harsher scoring. Our initial scores were too generous. A 6/10 response is not "okay" - it is a response that failed the user in some way.
- Add channel-specific scoring. A Telegram response has different quality standards than an API response. We now adjust expectations per channel.
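For the channel-specific part, a small expectations table injected into the judge prompt goes a long way. The values below are illustrative, not our production settings:

```python
# Illustrative per-channel quality expectations, injected into the scoring prompt.
CHANNEL_EXPECTATIONS = {
    "telegram": "Casual tone, a few sentences at most, emoji acceptable.",
    "email": "Professional tone, greeting and sign-off, complete details.",
    "api": "Structured data only, no conversational filler.",
}

def channel_hint(channel: str) -> str:
    """Return the quality bar for a channel, with a neutral default for unknown channels."""
    return CHANNEL_EXPECTATIONS.get(channel, "Clear, correct, and appropriately brief.")
```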
The Bigger Picture
This is not just about catching bad responses. It is about building AI systems that improve themselves without human intervention. The learning phase feeds the pitching phase. The pitching phase feeds the working phase. And the quality scorer feeds back into the learning loop.
We call it the agent lifecycle: Learn, Pitch, Work, Score, Improve. Repeat daily.
The agents that score consistently high get more autonomy. The ones that score low get more guardrails. It is natural selection for AI workers.
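One simple way to operationalise that last step, sketched under the assumption that each agent's recent scores are queryable (the thresholds are illustrative):

```python
from statistics import mean

def autonomy_level(recent_scores: list[int]) -> str:
    """Map an agent's rolling score history to an autonomy tier."""
    if not recent_scores:
        return "supervised"     # no track record yet
    avg = mean(recent_scores)
    if avg >= 8:
        return "autonomous"     # consistently high: fewer approvals required
    if avg >= 6:
        return "supervised"     # mixed: a human reviews risky actions
    return "guardrailed"        # consistently low: every action needs approval

# Example: an agent averaging 8.6 over its recent work earns more autonomy.
print(autonomy_level([9, 8, 9, 8, 9]))  # -> "autonomous"
```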