
Workflow Evaluations

Workflow evaluations validate multi-agent pipelines where multiple agents work together in a DAG (directed acyclic graph). Floeval measures whether all agents responded correctly, evaluates each agent’s tool calls and outputs, and checks whether the overall workflow output is correct. Use this when you have multi-step workflows deployed on the FloTorch gateway.

Prerequisites: Create your agents in the FloTorch Console, link them in your workflow DAG, and create an API key for authentication. You can create any number of agents in the FloTorch Console and orchestrate them as sequential, parallel, or combined workflows. Requires pip install floeval[flotorch].


How it works:

  1. You define a DAG config with nodes (START, AGENT, END) and edges between them.
  2. You create an agent dataset with test cases and expected outcomes.
  3. WorkflowRunner executes the workflow by calling each agent node according to the DAG.
  4. Floeval scores each agent’s behavior and the overall workflow output.

The DAG config specifies the workflow structure as a graph of nodes and edges. Each AGENT node references a deployed agent by name (e.g. agent1:latest, agent2:latest). You can create sequential flows (agent1 → agent2 → agent3), parallel branches, or a combination of both.

In a sequential workflow, one agent runs after another; the output of each agent flows to the next:

{
  "uid": "sequential-workflow-001",
  "name": "Sequential Workflow",
  "nodes": [
    {"id": "start", "type": "START", "label": "Start"},
    {"id": "agent1", "type": "AGENT", "label": "Agent 1", "agentName": "agent1:latest"},
    {"id": "agent2", "type": "AGENT", "label": "Agent 2", "agentName": "agent2:latest"},
    {"id": "end", "type": "END", "label": "End"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "agent1"},
    {"sourceNodeId": "agent1", "targetNodeId": "agent2"},
    {"sourceNodeId": "agent2", "targetNodeId": "end"}
  ]
}
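Before pointing Floeval at a config like this, it can help to sanity-check that every edge references a declared node and that the graph is actually acyclic. The helper below is an illustrative stdlib sketch, not part of the Floeval API:

```python
import json

# The sequential DAG from above, inlined so the example is self-contained.
DAG = """{
  "uid": "sequential-workflow-001",
  "nodes": [
    {"id": "start", "type": "START"},
    {"id": "agent1", "type": "AGENT", "agentName": "agent1:latest"},
    {"id": "agent2", "type": "AGENT", "agentName": "agent2:latest"},
    {"id": "end", "type": "END"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "agent1"},
    {"sourceNodeId": "agent1", "targetNodeId": "agent2"},
    {"sourceNodeId": "agent2", "targetNodeId": "end"}
  ]
}"""

def validate_dag(config: dict) -> None:
    node_ids = {n["id"] for n in config["nodes"]}
    # Every edge must connect two declared nodes.
    for e in config["edges"]:
        assert e["sourceNodeId"] in node_ids, e
        assert e["targetNodeId"] in node_ids, e
    # Kahn's algorithm: if a topological ordering cannot consume every
    # node, the graph contains a cycle and is not a valid DAG.
    indegree = {n: 0 for n in node_ids}
    for e in config["edges"]:
        indegree[e["targetNodeId"]] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    ordered = 0
    while ready:
        node = ready.pop()
        ordered += 1
        for e in config["edges"]:
            if e["sourceNodeId"] == node:
                indegree[e["targetNodeId"]] -= 1
                if indegree[e["targetNodeId"]] == 0:
                    ready.append(e["targetNodeId"])
    assert ordered == len(node_ids), "cycle detected"

validate_dag(json.loads(DAG))
```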

In a parallel workflow, multiple agents run concurrently from the same starting point, and their outputs can converge on a single node:

{
  "uid": "parallel-workflow-001",
  "name": "Parallel Workflow",
  "nodes": [
    {"id": "start", "type": "START", "label": "Start"},
    {"id": "agent1", "type": "AGENT", "label": "Agent 1", "agentName": "agent1:latest"},
    {"id": "agent2", "type": "AGENT", "label": "Agent 2", "agentName": "agent2:latest"},
    {"id": "agent3", "type": "AGENT", "label": "Agent 3", "agentName": "agent3:latest"},
    {"id": "end", "type": "END", "label": "End"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "agent1"},
    {"sourceNodeId": "start", "targetNodeId": "agent2"},
    {"sourceNodeId": "start", "targetNodeId": "agent3"},
    {"sourceNodeId": "agent1", "targetNodeId": "end"},
    {"sourceNodeId": "agent2", "targetNodeId": "end"},
    {"sourceNodeId": "agent3", "targetNodeId": "end"}
  ]
}
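One way to see the parallelism implied by the edges is to group nodes into execution levels: nodes in the same level have no path between them, so they can run concurrently. This is an illustrative sketch of that grouping (not a Floeval function):

```python
from collections import defaultdict

# The parallel workflow above, reduced to its node ids and edges.
nodes = {"start", "agent1", "agent2", "agent3", "end"}
edges = [
    ("start", "agent1"), ("start", "agent2"), ("start", "agent3"),
    ("agent1", "end"), ("agent2", "end"), ("agent3", "end"),
]

def execution_levels(nodes, edges):
    """Group DAG nodes into levels; each level can execute in parallel."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for src, dst in edges:
        indegree[dst] += 1
        children[src].append(dst)
    level = sorted(n for n, d in indegree.items() if d == 0)
    levels = []
    while level:
        levels.append(level)
        nxt = []
        for n in level:
            for c in children[n]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    nxt.append(c)
        level = sorted(nxt)
    return levels

print(execution_levels(nodes, edges))
# → [['start'], ['agent1', 'agent2', 'agent3'], ['end']]
```

The three AGENT nodes land in the same level, which is exactly the fan-out the edges describe.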

Each sample represents a test case for the full workflow. Include user_input (what the user sends to the workflow) and reference_outcome (what you expect the workflow to produce). The workflow runner sends the input through all agent nodes in the DAG according to the edges you defined:

{
  "samples": [
    {
      "user_input": "My order has not arrived after two weeks.",
      "reference_outcome": "An apology and a case escalation to the shipping team."
    },
    {
      "user_input": "What is the status of order #12345?",
      "reference_outcome": "The order is shipped and arriving tomorrow."
    }
  ]
}
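Loading and checking the dataset before a run catches missing fields early. A minimal stdlib sketch, using the field names from the sample schema above (the validation helper itself is an illustration, not a Floeval API):

```python
import json

# The dataset from above, inlined so the example is self-contained.
raw = """{
  "samples": [
    {"user_input": "My order has not arrived after two weeks.",
     "reference_outcome": "An apology and a case escalation to the shipping team."},
    {"user_input": "What is the status of order #12345?",
     "reference_outcome": "The order is shipped and arriving tomorrow."}
  ]
}"""

def load_samples(text: str) -> list[dict]:
    """Parse the dataset and reject samples with missing or empty fields."""
    samples = json.loads(text)["samples"]
    for i, s in enumerate(samples):
        for field in ("user_input", "reference_outcome"):
            if not s.get(field, "").strip():
                raise ValueError(f"sample {i} is missing {field!r}")
    return samples

samples = load_samples(raw)
print(len(samples), "samples ready")
```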

The config includes llm_config for LLM credentials, evaluation_config for metrics, and agent_workflow_config for the DAG definition. The agent_workflow_config.config section contains the same DAG structure from Step 1:

workflow_config.yaml
llm_config:
  base_url: "https://gateway.flotorch.cloud/openai/v1"
  api_key: "your-gateway-key"
  chat_model: gpt-4o-mini
evaluation_config:
  metrics:
    - goal_achievement
    - ragas:agent_goal_accuracy
agent_workflow_config:
  dataset_url: "https://your-storage/agent_dataset.json"
  config:
    uid: "workflow-001"
    name: "Sequential Workflow"
    nodes:
      - {id: "start", type: "START", label: "Start"}
      - {id: "agent1", type: "AGENT", label: "Agent 1", agentName: "agent1:latest"}
      - {id: "agent2", type: "AGENT", label: "Agent 2", agentName: "agent2:latest"}
      - {id: "end", type: "END", label: "End"}
    edges:
      - {sourceNodeId: "start", targetNodeId: "agent1"}
      - {sourceNodeId: "agent1", targetNodeId: "agent2"}
      - {sourceNodeId: "agent2", targetNodeId: "end"}

Use --agent with the workflow config. Floeval reads the DAG definition, runs the workflow for each sample, and scores the results:

floeval evaluate --agent -c workflow_config.yaml -d agent_dataset.json -o workflow_results.json

Create a WorkflowRunner from the DAG config and pass it as agent_runner to AgentEvaluation. The runner executes the full DAG for each sample. Floeval evaluates each agent’s tool calls and outputs, and scores the overall workflow output:

import json

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import WorkflowRunner

llm_config = OpenAIProviderConfig(
    base_url="https://gateway.flotorch.cloud/openai/v1",
    api_key="your-gateway-key",
    chat_model="gpt-4o-mini",
)

# Load the DAG definition from Step 1 and build the workflow runner.
with open("workflow_config.json") as f:
    dag_config = json.load(f)
runner = WorkflowRunner(dag_config=dag_config, llm_config=llm_config)

dataset = AgentDataset(samples=[
    PartialAgentSample(
        user_input="My order has not arrived after two weeks.",
        reference_outcome="An apology and a case escalation to the shipping team.",
    ),
])

evaluation = AgentEvaluation(
    dataset=dataset,
    agent_runner=runner,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:agent_goal_accuracy"],
    default_provider="builtin",
)

results = evaluation.run()
print("Summary:", results.summary)
for row in results.sample_results:
    print("Final response:", row.get("final_response"))
    print("Agent traces:", len(row.get("agent_traces", [])), "nodes")
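A common follow-up is flagging samples that scored below a threshold for manual review. The row layout below is an assumption for illustration (a dict carrying a per-metric "scores" mapping); adapt the keys to the actual results schema your Floeval version produces:

```python
# Hypothetical rows shaped like per-sample results; the "scores"
# key and its contents are assumptions for illustration only.
sample_results = [
    {"final_response": "Apology sent and case escalated.",
     "scores": {"goal_achievement": 0.9, "response_coherence": 0.8}},
    {"final_response": "Order status unknown.",
     "scores": {"goal_achievement": 0.3, "response_coherence": 0.7}},
]

def failing_samples(rows, metric, threshold=0.5):
    """Return rows whose score for `metric` falls below `threshold`."""
    return [r for r in rows if r["scores"].get(metric, 0.0) < threshold]

for row in failing_samples(sample_results, "goal_achievement"):
    print("Needs review:", row["final_response"])
```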

Floeval evaluates at both the workflow level and the individual agent level:

Metric                       What it measures
goal_achievement             Did the workflow achieve the intended goal?
response_coherence           Is the final response consistent with the workflow trace?
ragas:agent_goal_accuracy    How closely the workflow output matches the expected outcome
ragas:tool_call_accuracy     Were each agent’s tool calls correct?

Results include per-agent traces so you can see which agents responded and how each contributed to the overall output.