
Workflow Evaluations

Workflow evaluations validate multi-agent pipelines where multiple agents work together in a DAG (directed acyclic graph). Floeval measures whether all agents responded correctly, evaluates each agent’s tool calls and outputs, and checks whether the overall workflow output is correct. Use this when you have multi-step workflows deployed on the FloTorch gateway.

Prerequisites: Create your agents in the FloTorch Console, link them in your workflow DAG, and create an API key for authentication. You can create any number of agents in the FloTorch Console and orchestrate them as sequential, parallel, or combined workflows. Requires pip install floeval[flotorch].


How it works:

  1. You define a DAG config with nodes (START, AGENT, END) and edges between them.
  2. You create an agent dataset with test cases and expected outcomes.
  3. WorkflowRunner executes the workflow by calling each agent node according to the DAG.
  4. Floeval scores each agent’s behavior and the overall workflow output.

The DAG config specifies the workflow structure as a graph of nodes and edges. Each AGENT node references a deployed agent by name (e.g. agent1:latest, agent2:latest). You can create sequential flows (agent1 → agent2 → agent3), parallel branches, or a combination of both.

In a sequential workflow, one agent runs after another; the output of each agent flows to the next:

{
  "uid": "sequential-workflow-001",
  "name": "Sequential Workflow",
  "nodes": [
    {"id": "start", "type": "START", "label": "Start"},
    {"id": "agent1", "type": "AGENT", "label": "Agent 1", "agentName": "agent1:latest"},
    {"id": "agent2", "type": "AGENT", "label": "Agent 2", "agentName": "agent2:latest"},
    {"id": "end", "type": "END", "label": "End"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "agent1"},
    {"sourceNodeId": "agent1", "targetNodeId": "agent2"},
    {"sourceNodeId": "agent2", "targetNodeId": "end"}
  ]
}
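Before pointing Floeval at a config like this, it can help to sanity-check that every edge references a declared node and that the graph is actually acyclic. The helper below is an illustrative stdlib sketch, not part of the Floeval API:

```python
import json

# The sequential DAG from above, inlined so the example is self-contained.
DAG = """{
  "uid": "sequential-workflow-001",
  "nodes": [
    {"id": "start", "type": "START"},
    {"id": "agent1", "type": "AGENT", "agentName": "agent1:latest"},
    {"id": "agent2", "type": "AGENT", "agentName": "agent2:latest"},
    {"id": "end", "type": "END"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "agent1"},
    {"sourceNodeId": "agent1", "targetNodeId": "agent2"},
    {"sourceNodeId": "agent2", "targetNodeId": "end"}
  ]
}"""

def validate_dag(config: dict) -> None:
    node_ids = {n["id"] for n in config["nodes"]}
    # Every edge must connect two declared nodes.
    for e in config["edges"]:
        assert e["sourceNodeId"] in node_ids, e
        assert e["targetNodeId"] in node_ids, e
    # Kahn's algorithm: if a topological ordering cannot consume every
    # node, the graph contains a cycle and is not a valid DAG.
    indegree = {n: 0 for n in node_ids}
    for e in config["edges"]:
        indegree[e["targetNodeId"]] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    ordered = 0
    while ready:
        node = ready.pop()
        ordered += 1
        for e in config["edges"]:
            if e["sourceNodeId"] == node:
                indegree[e["targetNodeId"]] -= 1
                if indegree[e["targetNodeId"]] == 0:
                    ready.append(e["targetNodeId"])
    assert ordered == len(node_ids), "cycle detected"

validate_dag(json.loads(DAG))
```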

In a parallel workflow, multiple agents run concurrently from the same starting point, and their outputs can converge on a single node:

{
  "uid": "parallel-workflow-001",
  "name": "Parallel Workflow",
  "nodes": [
    {"id": "start", "type": "START", "label": "Start"},
    {"id": "agent1", "type": "AGENT", "label": "Agent 1", "agentName": "agent1:latest"},
    {"id": "agent2", "type": "AGENT", "label": "Agent 2", "agentName": "agent2:latest"},
    {"id": "agent3", "type": "AGENT", "label": "Agent 3", "agentName": "agent3:latest"},
    {"id": "end", "type": "END", "label": "End"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "agent1"},
    {"sourceNodeId": "start", "targetNodeId": "agent2"},
    {"sourceNodeId": "start", "targetNodeId": "agent3"},
    {"sourceNodeId": "agent1", "targetNodeId": "end"},
    {"sourceNodeId": "agent2", "targetNodeId": "end"},
    {"sourceNodeId": "agent3", "targetNodeId": "end"}
  ]
}
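One way to see the parallelism implied by the edges is to group nodes into execution levels: nodes in the same level have no path between them, so they can run concurrently. This is an illustrative sketch of that grouping (not a Floeval function):

```python
from collections import defaultdict

# The parallel workflow above, reduced to its node ids and edges.
nodes = {"start", "agent1", "agent2", "agent3", "end"}
edges = [
    ("start", "agent1"), ("start", "agent2"), ("start", "agent3"),
    ("agent1", "end"), ("agent2", "end"), ("agent3", "end"),
]

def execution_levels(nodes, edges):
    """Group DAG nodes into levels; each level can execute in parallel."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for src, dst in edges:
        indegree[dst] += 1
        children[src].append(dst)
    level = sorted(n for n, d in indegree.items() if d == 0)
    levels = []
    while level:
        levels.append(level)
        nxt = []
        for n in level:
            for c in children[n]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    nxt.append(c)
        level = sorted(nxt)
    return levels

print(execution_levels(nodes, edges))
# → [['start'], ['agent1', 'agent2', 'agent3'], ['end']]
```

The three AGENT nodes land in the same level, which is exactly the fan-out the edges describe.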

Each sample represents a test case for the full workflow. Include user_input (what the user sends to the workflow) and reference_outcome (what you expect the workflow to produce). The workflow runner sends the input through all agent nodes in the DAG according to the edges you defined:

{
  "samples": [
    {
      "user_input": "My order has not arrived after two weeks.",
      "reference_outcome": "An apology and a case escalation to the shipping team."
    },
    {
      "user_input": "What is the status of order #12345?",
      "reference_outcome": "The order is shipped and arriving tomorrow."
    }
  ]
}
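Loading and checking the dataset before a run catches missing fields early. A minimal stdlib sketch, using the field names from the sample schema above (the validation helper itself is an illustration, not a Floeval API):

```python
import json

# The dataset from above, inlined so the example is self-contained.
raw = """{
  "samples": [
    {"user_input": "My order has not arrived after two weeks.",
     "reference_outcome": "An apology and a case escalation to the shipping team."},
    {"user_input": "What is the status of order #12345?",
     "reference_outcome": "The order is shipped and arriving tomorrow."}
  ]
}"""

def load_samples(text: str) -> list[dict]:
    """Parse the dataset and reject samples with missing or empty fields."""
    samples = json.loads(text)["samples"]
    for i, s in enumerate(samples):
        for field in ("user_input", "reference_outcome"):
            if not s.get(field, "").strip():
                raise ValueError(f"sample {i} is missing {field!r}")
    return samples

samples = load_samples(raw)
print(len(samples), "samples ready")
```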

The config includes llm_config for LLM credentials, evaluation_config for metrics, and agent_workflow_config for the DAG definition. The agent_workflow_config.config section contains the same DAG structure from Step 1:

workflow_config.yaml
llm_config:
  base_url: "https://gateway.flotorch.cloud/openai/v1"
  api_key: "your-gateway-key"
  chat_model: gpt-4o-mini
evaluation_config:
  metrics:
    - goal_achievement
    - ragas:agent_goal_accuracy
agent_workflow_config:
  dataset_url: "https://your-storage/agent_dataset.json"
  config:
    uid: "workflow-001"
    name: "Sequential Workflow"
    nodes:
      - {id: "start", type: "START", label: "Start"}
      - {id: "agent1", type: "AGENT", label: "Agent 1", agentName: "agent1:latest"}
      - {id: "agent2", type: "AGENT", label: "Agent 2", agentName: "agent2:latest"}
      - {id: "end", type: "END", label: "End"}
    edges:
      - {sourceNodeId: "start", targetNodeId: "agent1"}
      - {sourceNodeId: "agent1", targetNodeId: "agent2"}
      - {sourceNodeId: "agent2", targetNodeId: "end"}

Use --agent with the workflow config. Floeval reads the DAG definition, runs the workflow for each sample, and scores the results:

floeval evaluate --agent -c workflow_config.yaml -d agent_dataset.json -o workflow_results.json

Create a WorkflowRunner from the DAG config and pass it as agent_runner to AgentEvaluation. The runner executes the full DAG for each sample. Floeval evaluates each agent’s tool calls and outputs, and scores the overall workflow output:

import json

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import WorkflowRunner

llm_config = OpenAIProviderConfig(
    base_url="https://gateway.flotorch.cloud/openai/v1",
    api_key="your-gateway-key",
    chat_model="gpt-4o-mini",
)

# Load the DAG definition from Step 1 and build the workflow runner.
with open("workflow_config.json") as f:
    dag_config = json.load(f)
runner = WorkflowRunner(dag_config=dag_config, llm_config=llm_config)

dataset = AgentDataset(samples=[
    PartialAgentSample(
        user_input="My order has not arrived after two weeks.",
        reference_outcome="An apology and a case escalation to the shipping team.",
    ),
])

evaluation = AgentEvaluation(
    dataset=dataset,
    agent_runner=runner,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:agent_goal_accuracy"],
    default_provider="builtin",
)

results = evaluation.run()
print("Summary:", results.summary)
for row in results.sample_results:
    print("Final response:", row.get("final_response"))
    print("Agent traces:", len(row.get("agent_traces", [])), "nodes")
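A common follow-up is flagging samples that scored below a threshold for manual review. The row layout below is an assumption for illustration (a dict carrying a per-metric "scores" mapping); adapt the keys to the actual results schema your Floeval version produces:

```python
# Hypothetical rows shaped like per-sample results; the "scores"
# key and its contents are assumptions for illustration only.
sample_results = [
    {"final_response": "Apology sent and case escalated.",
     "scores": {"goal_achievement": 0.9, "response_coherence": 0.8}},
    {"final_response": "Order status unknown.",
     "scores": {"goal_achievement": 0.3, "response_coherence": 0.7}},
]

def failing_samples(rows, metric, threshold=0.5):
    """Return rows whose score for `metric` falls below `threshold`."""
    return [r for r in rows if r["scores"].get(metric, 0.0) < threshold]

for row in failing_samples(sample_results, "goal_achievement"):
    print("Needs review:", row["final_response"])
```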

Floeval evaluates at both the workflow level and the individual agent level:

Metric                       What it measures
goal_achievement             Did the workflow achieve the intended goal?
response_coherence           Is the final response consistent with the workflow trace?
ragas:agent_goal_accuracy    How closely the workflow output matches the expected outcome
ragas:tool_call_accuracy     Were each agent’s tool calls correct?

Results include per-agent traces so you can see which agents responded and how each contributed to the overall output.