Agent Evaluations
Agent evaluation is about understanding how well your agent actually performed — whether it understood the task, took the right steps, used tools when needed, and produced a useful result. Use it when your system behaves like an agent (reasoning, tool use, multi-step workflows).
Floeval supports three modes for agent evaluation:
| Mode | Description | Use when |
|---|---|---|
| Pre-captured traces | You provide traces in the dataset | You already have conversation logs |
| Local agent | Floeval runs your agent in your Python environment and captures traces | Your agent runs locally (LangChain, custom) |
| FloTorch hosted | Floeval calls your agent on the FloTorch gateway | Your agent is deployed in the FloTorch Console |
Available Metrics
| Metric | Provider | What it measures |
|---|---|---|
| goal_achievement | builtin | Did the agent achieve the goal? (LLM-as-judge) |
| response_coherence | builtin | Is the final response consistent with the trace? |
| ragas:agent_goal_accuracy | RAGAS | Agent output vs. expected outcome |
| ragas:tool_call_accuracy | RAGAS | Were tool calls correct? (needs reference_tool_calls) |
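The builtin judge metrics prompt an LLM with the trace and ask for a verdict. As a rough illustration of that pattern only — with a stubbed judge call, and none of these function names being floeval APIs:

```python
# Toy sketch of the LLM-as-judge pattern behind metrics like goal_achievement.
# `call_judge_llm` is a stand-in for a real chat-completion request; nothing
# here is floeval's actual implementation.

def call_judge_llm(prompt: str) -> str:
    # Stub: a real judge would send `prompt` to a chat model and return its reply.
    return "score: 1.0" if "sunny" in prompt.lower() else "score: 0.0"

def goal_achievement_sketch(user_input: str, final_response: str) -> float:
    """Ask a judge model whether the final response achieves the user's goal."""
    prompt = (
        f"User goal: {user_input}\n"
        f"Agent response: {final_response}\n"
        "Did the agent achieve the goal? Reply as 'score: <0.0-1.0>'."
    )
    reply = call_judge_llm(prompt)
    return float(reply.split("score:")[1].strip())
```

The real metrics send the full trace (messages, tool calls, final response) to the judge, but the scoring loop follows the same prompt-then-parse shape.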
Agent Dataset Formats
Agent datasets use a `samples` array, similar to LLM/RAG datasets. The key difference is the `trace` field, which contains the full conversation log, including tool calls. Always use `"name"` (not `"tool"`) as the key in `reference_tool_calls`.
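A quick stdlib check can catch the `"tool"`-instead-of-`"name"` mistake before a run; the helper below is illustrative only, not part of floeval:

```python
# Illustrative helper (not part of floeval): verify that every
# reference_tool_calls entry uses the "name" key rather than "tool".

def check_tool_call_keys(dataset: dict) -> list[str]:
    """Return a list of problems found in reference_tool_calls entries."""
    problems = []
    for i, sample in enumerate(dataset.get("samples", [])):
        for call in sample.get("reference_tool_calls", []):
            if "tool" in call and "name" not in call:
                problems.append(f"sample {i}: uses 'tool' instead of 'name'")
    return problems

good = {"samples": [{"reference_tool_calls": [{"name": "get_weather", "args": {}}]}]}
bad = {"samples": [{"reference_tool_calls": [{"tool": "get_weather", "args": {}}]}]}
```

Running `check_tool_call_keys` on a dataset loaded with `json.load` before evaluation makes the error surface early instead of mid-run.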
Full dataset — pre-captured traces
Use a full dataset when you already have recorded agent conversations (from your app logs, LangChain callbacks, or a manual export). Each sample includes the complete trace with messages, tool calls, and the final response:
```json
{
  "samples": [
    {
      "user_input": "Get the weather for London",
      "trace": {
        "messages": [
          {"role": "human", "content": "Get the weather for London"},
          {"role": "ai", "content": "", "tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]},
          {"role": "tool", "content": "Sunny, 22°C", "tool_name": "get_weather", "tool_call_id": "call_1"},
          {"role": "ai", "content": "The weather in London is sunny with 22°C.", "tool_calls": []}
        ],
        "final_response": "The weather in London is sunny with 22°C.",
        "metadata": {}
      },
      "reference_outcome": "Provides London weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]
    }
  ]
}
```

Partial dataset — no trace (agent runs at evaluation time)
Use a partial dataset when you have test cases but no recorded traces. Floeval runs your agent for each sample and captures the trace automatically:
```json
{
  "samples": [
    {
      "user_input": "Get the weather for London",
      "reference_outcome": "Provides London weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]
    }
  ]
}
```

JSONL is also supported (one sample per line).
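If your samples already live in a full JSON file, converting the `samples` array to JSONL takes a few lines of stdlib code (the file name here is just an example):

```python
import json

def samples_to_jsonl(samples: list[dict]) -> str:
    """Serialize a list of samples as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)

samples = [
    {"user_input": "Get the weather for London", "reference_outcome": "Provides London weather"},
]
jsonl_text = samples_to_jsonl(samples)
# Write it out with e.g. Path("agent_partial.jsonl").write_text(jsonl_text, encoding="utf-8")
```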
Mode 1: Pre-Captured Traces
Use this mode when you already have conversation logs from your app, LangChain callbacks, or a manual export. No agent needs to run during evaluation; Floeval scores the traces directly.
From the command line
Create a config file with your LLM credentials and the agent metrics you want to run. Pass `--agent` to tell the CLI this is an agent evaluation:
```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini

evaluation_config:
  default_provider: "builtin"
  metrics:
    - goal_achievement
    - response_coherence
```

```shell
floeval evaluate --agent -c agent_config.yaml -d agent_full.json -o results.json
```

From code
Use the `AgentEvaluation` class to run agent evaluations from code. Load your pre-captured traces from a JSON file and pass the metrics you want to score:
```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = AgentDataset.from_file("agent_full.json")

evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence"],
    default_provider="builtin",
)

results = evaluation.run()
print(results.summary["aggregate_scores"])
```

Mode 2: Local Agent (LangChain)
Use this mode when your agent runs locally in your Python environment. Floeval wraps your agent with `wrap_langchain_agent`, runs it for each test case, captures the trace automatically, and then scores it. Requires `langchain` and `langchain-openai`.
Step 1: Create a partial dataset file
Save your test cases as a JSONL file (one sample per line). Each sample includes the question, the expected outcome, and optionally the expected tool calls:
```json
{"user_input": "What's the weather in Paris?", "reference_outcome": "Provides Paris weather", "reference_tool_calls": [{"name": "get_weather", "args": {"city": "Paris"}}]}
{"user_input": "What's the weather in London?", "reference_outcome": "Provides London weather", "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]}
```

Step 2: Run evaluation
```python
import os
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
from langchain_core.tools import tool

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.utils.agent_trace import wrap_langchain_agent

@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    db = {"paris": "18°C", "tokyo": "22°C", "london": "14°C"}
    return db.get(city.lower(), f"{city}: No data")

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
)

llm = ChatOpenAI(model=llm_config.chat_model, api_key=llm_config.api_key, base_url=llm_config.base_url)
agent = create_agent(model=llm, tools=[get_weather], system_prompt="Use tools when needed.")

dataset = AgentDataset.from_file(Path("agent_partial.jsonl"))

evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:tool_call_accuracy"],
    default_provider="builtin",
    agent=wrap_langchain_agent(agent),
)

results = evaluation.run()
print(results.summary["aggregate_scores"])
```

Custom agents with @capture_trace
For agents that don’t use LangChain, decorate your agent function with `@capture_trace` and use `log_turn` and `log_tool_result` to record what happens during execution. Floeval builds the trace from these logs:
```python
from floeval.utils.agent_trace import capture_trace, log_turn, log_tool_result
from floeval.config.schemas.io.agent_dataset import ToolCall

@capture_trace
def my_agent(user_input: str) -> str:
    search_result = f"Mock search for: {user_input}"
    log_tool_result("search", search_result)
    final = f"Answer based on: {search_result}"
    log_turn(output=final, tool_calls=[ToolCall(name="search", args={"query": user_input})])
    return final
```

Pass `agent=my_agent` to `AgentEvaluation`.
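To see why this decorator-plus-loggers shape works, here is a self-contained toy version of the pattern. It is NOT floeval's implementation — the `_sketch` names are invented for illustration — but it shows how logging calls made inside the agent can assemble a trace that an evaluator reads afterwards:

```python
# Conceptual sketch (NOT floeval's implementation) of a trace-capturing
# decorator: logging helpers append to a trace that is active only while
# the decorated agent function runs.
from functools import wraps

_current_trace = None  # active trace while the decorated agent runs

def capture_trace_sketch(fn):
    """Wrap an agent function and collect logged events into a trace dict."""
    @wraps(fn)
    def wrapper(user_input):
        global _current_trace
        _current_trace = {"messages": [{"role": "human", "content": user_input}]}
        final = fn(user_input)
        _current_trace["final_response"] = final
        wrapper.last_trace = _current_trace  # captured trace for the evaluator
        _current_trace = None
        return final
    return wrapper

def log_tool_result_sketch(tool_name, content):
    _current_trace["messages"].append(
        {"role": "tool", "tool_name": tool_name, "content": content}
    )

def log_turn_sketch(output, tool_calls):
    _current_trace["messages"].append(
        {"role": "ai", "content": output, "tool_calls": tool_calls}
    )

@capture_trace_sketch
def toy_agent(user_input):
    result = f"Mock search for: {user_input}"
    log_tool_result_sketch("search", result)
    final = f"Answer based on: {result}"
    log_turn_sketch(final, [{"name": "search", "args": {"query": user_input}}])
    return final
```

After `toy_agent("hello")` runs, `toy_agent.last_trace` holds the human message, the logged tool result, the AI turn, and the final response — the same shape as the `trace` field in a full dataset.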
Mode 3: FloTorch Hosted Agent
Use this mode when your agent is deployed in the FloTorch Console. Floeval sends each test case to the gateway, the hosted agent processes it, and Floeval captures the trace and scores the result. Requires `pip install floeval[flotorch]`.
Step 1: Create a partial dataset
Prepare test cases with `user_input` and, optionally, `reference_outcome` and `reference_tool_calls`:
```json
{
  "samples": [
    {
      "user_input": "What is the weather in Tokyo?",
      "reference_outcome": "Provides Tokyo weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "Tokyo"}}]
    }
  ]
}
```

Step 2: Run evaluation
Use `create_flotorch_runner` to connect to your deployed agent. Pass the runner as `agent_runner` to `AgentEvaluation`:
```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import create_flotorch_runner

llm_config = OpenAIProviderConfig(
    base_url="https://your-gateway/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

runner = create_flotorch_runner("my-agent", llm_config=llm_config)
dataset = AgentDataset.from_file("agent_partial.json")

evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:tool_call_accuracy"],
    default_provider="builtin",
    agent_runner=runner,
)

results = evaluation.run()
print(results.summary["aggregate_scores"])
```

Command line with FloTorch
For command-line evaluation against FloTorch-hosted agents, add `agent_name` to `evaluation_config` to specify which deployed agent to call:
```yaml
llm_config:
  base_url: "https://gateway.flotorch.cloud/openai/v1"
  api_key: "your-flotorch-api-key"
  chat_model: flotorch/turbo

evaluation_config:
  agent_name: "my-agent"
  default_provider: "builtin"
  metrics:
    - goal_achievement
    - response_coherence
```

```shell
floeval evaluate --agent -c agent_config_flotorch.yaml -d agent_partial.json -o results.json
```

Deploy agents in the FloTorch Agent Builder. Create API keys in Settings > API Keys.
Next Steps
- Workflow Evaluations — evaluate agentic workflows (multi-agent DAGs)
- LLM Evaluations — evaluate standalone LLM quality
- RAG Evaluations — evaluate retrieval-augmented generation