Overview
This project focuses on evaluation infrastructure for AI agents, especially workflows that use language models to judge the quality of another system’s behavior. The goal is to turn evaluation from a manual bottleneck into a process that is measurable, repeatable, and useful in product development.
The work spans both methodology and implementation, from defining evaluation criteria to building the tooling needed to run those evaluations consistently.
Key Features
LLM as a Judge
Implements automated evaluation workflows in which an LLM grades the output quality of the system under test.
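A minimal sketch of what such a judge workflow can look like. Everything here is illustrative rather than the project's actual API: `call_model` is a stub standing in for a real LLM client, and the `Score: <n>` rubric format is one common convention for making judgments machine-parseable.

```python
# Hypothetical LLM-as-a-judge sketch. `call_model` is a placeholder for a
# real model API call; here it returns a canned response so the flow runs.
import re

RUBRIC = (
    "Rate the response from 1 (poor) to 5 (excellent) for correctness "
    "and helpfulness. Reply with 'Score: <n>'.\n\n"
    "Task: {task}\nResponse: {response}"
)

def call_model(prompt: str) -> str:
    # Placeholder: swap in an actual LLM client call here.
    return "Score: 4"

def judge(task: str, response: str) -> int:
    """Ask the judge model to grade a response against the rubric."""
    raw = call_model(RUBRIC.format(task=task, response=response))
    match = re.search(r"Score:\s*([1-5])", raw)
    if match is None:
        raise ValueError(f"Unparseable judgment: {raw!r}")
    return int(match.group(1))

print(judge("Summarize the report.", "A concise summary."))  # → 4
```

Parsing the judgment into a structured score, with an explicit failure path for unparseable output, is what lets the judgment feed into automated pipelines downstream.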
Standardization
Builds repeatable evaluation methodology for comparing agent behavior across tasks.
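One way to illustrate what standardization can mean here: a single case schema and a single scoring entry point applied uniformly to every task, so that two runs are directly comparable. The names (`EvalCase`, `run_suite`) and the toy system are hypothetical, not taken from the project.

```python
# Sketch of a standardized evaluation suite: one schema, one scorer,
# applied identically across tasks. All names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    task_id: str
    prompt: str     # input given to the system under test
    reference: str  # gold answer (or rubric notes) for scoring

def run_suite(cases, system, scorer):
    """Apply one scoring function uniformly so runs are comparable."""
    return {c.task_id: scorer(c, system(c.prompt)) for c in cases}

# Toy system and scorer, just to show the shape of a run.
cases = [EvalCase("t1", "2+2?", "4"), EvalCase("t2", "Capital of France?", "Paris")]
system = lambda p: "4" if "2+2" in p else "Paris"
scorer = lambda c, out: 1.0 if out == c.reference else 0.0
print(run_suite(cases, system, scorer))  # → {'t1': 1.0, 't2': 1.0}
```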
Multi-Domain Coverage
Targets evaluation patterns that can transfer across different product and research contexts.
Implementation
Turns the methodology into working infrastructure that can be run in real development workflows.
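As a hedged sketch of how such infrastructure might hook into a development workflow, an evaluation run can act as a regression gate: fail the run when the aggregate score drops below a threshold. The `gate` helper and the 0.8 threshold are assumptions for illustration.

```python
# Illustrative regression gate for an eval run in a dev workflow.
def gate(scores: dict, threshold: float = 0.8) -> bool:
    """Return True when the mean score meets the threshold."""
    mean = sum(scores.values()) / len(scores)
    return mean >= threshold

results = {"t1": 1.0, "t2": 0.5}  # stand-in scores from an eval run
print("pass" if gate(results) else "fail")  # mean 0.75 < 0.8 → "fail"
```

In practice a gate like this is what turns an evaluation from a report into an enforceable check in CI.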
Technologies Used
Challenges Overcome
- Ensuring the reliability and consistency of LLM-based judgments.
- Defining metrics that remain useful across diverse AI agent tasks.
- Reducing bias in automated evaluation pipelines.
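Two common mitigations for the consistency and bias challenges above can be sketched with simulated judges (the helper names are hypothetical): query the judge several times and take the median to damp sampling noise, and run pairwise comparisons in both orders so a judge's position bias cancels out.

```python
# Illustrative reliability helpers for LLM-based judging.
import statistics

def stable_score(judge_once, k=5):
    """Query the judge k times and take the median to damp sampling noise."""
    return statistics.median(judge_once() for _ in range(k))

def pairwise_unbiased(compare, a, b):
    """Run a pairwise comparison in both orders to cancel position bias.
    `compare(x, y)` returns 1 if x wins, 0 if y wins."""
    forward = compare(a, b)
    backward = 1 - compare(b, a)
    return (forward + backward) / 2  # 1.0, 0.0, or 0.5 on disagreement

# Simulated judges: a noisy scorer, and a comparator that always
# favors whichever candidate appears in the first position.
scores = iter([4, 3, 4, 5, 4])
print(stable_score(lambda: next(scores)))           # → 4
print(pairwise_unbiased(lambda x, y: 1, "A", "B"))  # → 0.5
```

A 0.5 result from the order-swapped comparison flags that the judge's preference flipped with position, which is exactly the bias signal the third challenge is about.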
Outcomes & Impact
- Established a practical framework for AI agent assessment.
- Improved iteration speed in AI development workflows through automation.
- Contributed to standardizing evaluation practices for agent systems.