# Floeval Overview
Floeval is a multi-backend evaluation framework for LLM and RAG systems. It measures how well your models, pipelines, and agents perform using quantitative metrics so you can iterate with confidence.
Floeval supports five evaluation types: LLM, RAG, Prompt, Agent (single), and Agentic Workflow (multi-agent DAG). Each maps to a different dataset shape and set of metrics.
## What Can You Evaluate?

Floeval supports five evaluation types. Each is designed for a specific scenario.
LLM evaluation — Validate and compare LLMs on your dataset. You provide questions and (optionally) model responses; Floeval scores how well each model answers. Use this to decide which LLM best suits your use case.
RAG evaluation — Validate retrieval-augmented generation pipelines. You provide questions and retrieved contexts; Floeval scores how well the retrieval and generation work together. Use this when your pipeline fetches documents and passes them to an LLM to produce answers.
Prompt evaluation — Try different instructions (e.g. “Be concise” vs “Be detailed”) on the same questions. Floeval runs each question with each instruction, scores the responses, and shows you which instruction makes your LLM perform better.
Agent evaluation — Understand how well your agent performed: whether it understood the task, took the right steps, used tools when needed, and produced a useful result. Use it when your system behaves like an agent (reasoning, tool use, multi-step workflows).
Workflow evaluation — Validate multi-agent pipelines where multiple agents work together in a DAG. Floeval evaluates each agent’s tool calls and outputs, and scores the overall workflow output. Use this when you have sequential, parallel, or combined workflows deployed on the FloTorch gateway.
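The eval types above differ mainly in dataset shape. As an illustrative sketch using plain dicts (the field names match the metric tables on this page, but Floeval's concrete sample and dataset classes may differ), an LLM-eval sample needs only a question and a response, while a RAG sample adds retrieval fields:

```python
# Illustrative dataset samples as plain dicts. Field names (user_input,
# llm_response, contexts, ground_truth) follow the metric requirements listed
# on this page; Floeval's actual sample classes may wrap them differently.

llm_sample = {
    "user_input": "What is the capital of France?",
    "llm_response": "The capital of France is Paris.",
}

rag_sample = {
    "user_input": "What is the capital of France?",
    "llm_response": "The capital of France is Paris.",
    # Retrieved documents that were passed to the generator
    "contexts": [
        "Paris is the capital and most populous city of France.",
    ],
    # Reference answer, needed by context_precision / context_recall
    "ground_truth": "Paris",
}

# A RAG sample is a superset of an LLM sample: same core fields,
# plus the retrieval-side fields the RAG metrics require.
assert set(llm_sample) <= set(rag_sample)
```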
## When to Use What

| I want to… | Use |
|---|---|
| Score LLM answers quickly | LLM evaluation |
| Score RAG answers with retrieval context | RAG evaluation |
| Compare system prompt variants | Prompt evaluation |
| Evaluate agent traces I already have | Agent evaluation (pre-captured traces) |
| Run my agent and evaluate it | Agent evaluation (local or FloTorch hosted) |
| Evaluate a multi-agent workflow | Workflow evaluation |
| Create my own scoring check | Custom metrics |
## Two Ways to Use Floeval

Command line — Run evaluations from config files and dataset files. No code required. Best for quick runs, CI pipelines, and batch evaluation.
From code — Import Floeval into your application. Full control over dataset construction, metric selection, and result processing. Best for integration into existing systems.
Both flows support all five evaluation types.
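For the command-line flow, a config file typically names the eval type, the dataset file, and the metrics to run. The sketch below is hypothetical — every key name is an assumption for illustration; the real schema is documented with each eval type:

```yaml
# Hypothetical config sketch for a RAG evaluation run.
# All key names here are assumptions; see the eval-type pages for the real schema.
eval_type: rag
dataset: data/rag_samples.jsonl
metrics:
  - faithfulness
  - context_precision
  - context_recall
```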
## Eval Type Quick Reference

| Eval type | Key metrics | Dataset needs |
|---|---|---|
| LLM | answer_relevancy | user_input + llm_response |
| RAG | faithfulness, context_precision, context_recall | user_input + llm_response + contexts |
| Prompt | answer_relevancy (or RAG metrics with contexts) | Partial dataset + prompts_file with prompt_ids |
| Agent | goal_achievement, response_coherence, tool_call_accuracy | AgentDataset (full or partial) |
| Agentic workflow | Same agent metrics | AgentDataset + DAG config |
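The prompt-eval row above pairs a partial dataset with a `prompts_file` keyed by `prompt_ids`. Conceptually, that expansion is a cross product: every question is run under every instruction. A minimal sketch (the id-to-instruction mapping shown is an assumption about the file's shape):

```python
# Sketch of prompt-aware expansion: each question is paired with each
# candidate instruction. The prompts_file shape shown here
# (prompt_id -> instruction text) is an assumption for illustration.

prompts_file = {
    "concise": "Be concise.",
    "detailed": "Be detailed.",
}

questions = ["What is RAG?", "What is an agent?"]

# Cross product: every question runs under every prompt_id,
# so the scored results can be compared per instruction.
runs = [
    {"prompt_id": pid, "system_prompt": text, "user_input": q}
    for pid, text in prompts_file.items()
    for q in questions
]

assert len(runs) == len(prompts_file) * len(questions)
```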
## Core Metrics at a Glance

| Metric | What it measures | Required fields |
|---|---|---|
| answer_relevancy | How well the response answers the question | user_input, llm_response |
| faithfulness | Whether the response stays grounded in the retrieved context | llm_response, contexts |
| context_precision | Whether relevant documents are ranked above irrelevant ones | contexts, ground_truth |
| context_recall | How much of the reference information the retrieved context covers | contexts, ground_truth |
| goal_achievement | Whether the agent completed the intended goal | user_input, agent trace |
| response_coherence | Whether the agent's final response is consistent with its trace | agent trace |
| tool_call_accuracy | Whether the agent called the right tools with the right arguments | agent trace, reference_tool_calls |
You can extend any evaluation with custom metrics (function-based scoring) and LLM-as-judge criteria (natural language evaluation).
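Floeval's exact custom-metric registration API is covered elsewhere, but a function-based metric is conceptually just a callable from a sample to a score. A hedged sketch (the `(sample: dict) -> float` signature is an assumption; Floeval's real API may wrap it differently):

```python
# Illustrative function-based metric: exact match against ground_truth.
# The (sample: dict) -> float signature is an assumption for illustration;
# Floeval's actual registration API may differ.

def exact_match(sample: dict) -> float:
    """Return 1.0 if the response equals the reference answer, else 0.0."""
    response = sample.get("llm_response", "").strip().lower()
    reference = sample.get("ground_truth", "").strip().lower()
    return 1.0 if response == reference else 0.0

score = exact_match({"llm_response": "Paris", "ground_truth": "paris"})
```

Deterministic checks like this complement LLM-as-judge criteria, which instead describe the desired behavior in natural language and let a judge model score it.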
## Key Features

- Command line and code — run evaluations from config files or integrate directly into your application
- Multi-provider metrics — mix RAGAS, DeepEval, builtin, and custom metrics in one run
- Prompt-aware generation — expand partial samples using `prompt_ids` and a `prompts_file`
- Agent evaluation — score pre-captured traces or collect traces at runtime from LangChain agents or FloTorch-hosted agents
- Custom metrics — define function-based metrics or LLM-as-judge criteria for your domain
- FloTorch integration — use the FloTorch Gateway for LLMs and agents
## Related Documentation

- Onboarding — sign up, sign in, and create a FloTorch organization
- API Keys — create and manage API keys in the FloTorch Console
- API Documentation — gateway URL and model types for FloTorch
- Agent Builder — deploy agents in the FloTorch Console
## Next Steps

- Installation — install Floeval and configure credentials
- LLM Evaluations — evaluate raw LLM output quality
- RAG Evaluations — evaluate retrieval-augmented generation
- Prompt Evaluations — compare prompts using `prompts_file`
- Agent Evaluations — evaluate tool-using agents
- Workflow Evaluations — evaluate agentic workflows