# Floeval Overview
Floeval is a multi-backend evaluation framework for LLM and RAG systems. It measures how well your models, pipelines, and agents perform using quantitative metrics so you can iterate with confidence.
Floeval supports five evaluation types: LLM, RAG, Prompt, Agent (single), and Agentic Workflow (multi-agent DAG). Each maps to a different dataset shape and set of metrics.
## What Can You Evaluate?

Floeval supports five evaluation types. Each is designed for a specific scenario.
LLM evaluation — Validate and compare LLMs on your dataset. You provide questions and (optionally) model responses; Floeval scores how well each model answers. Use this to decide which LLM best suits your use case.
RAG evaluation — Validate retrieval-augmented generation pipelines. You provide questions and retrieved contexts; Floeval scores how well the retrieval and generation work together. Use this when your pipeline fetches documents and passes them to an LLM to produce answers.
Prompt evaluation — Try different instructions (e.g. “Be concise” vs “Be detailed”) on the same questions. Floeval runs each question with each instruction, scores the responses, and shows you which instruction makes your LLM perform better.
Agent evaluation — Understand how well your agent performed: whether it understood the task, took the right steps, used tools when needed, and produced a useful result. Use it when your system behaves like an agent (reasoning, tool use, multi-step workflows).
Workflow evaluation — Validate multi-agent pipelines where multiple agents work together in a DAG. Floeval evaluates each agent’s tool calls and outputs, and scores the overall workflow output. Use this when you have sequential, parallel, or combined workflows deployed on the FloTorch gateway.
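The eval types above differ mainly in dataset shape. As an illustrative sketch using plain dicts (the field names match the metric tables on this page, but Floeval's concrete sample and dataset classes may differ), an LLM-eval sample needs only a question and a response, while a RAG sample adds retrieval fields:

```python
# Illustrative dataset samples as plain dicts. Field names (user_input,
# llm_response, contexts, ground_truth) follow the metric requirements listed
# on this page; Floeval's actual sample classes may wrap them differently.

llm_sample = {
    "user_input": "What is the capital of France?",
    "llm_response": "The capital of France is Paris.",
}

rag_sample = {
    "user_input": "What is the capital of France?",
    "llm_response": "The capital of France is Paris.",
    # Retrieved documents that were passed to the generator
    "contexts": [
        "Paris is the capital and most populous city of France.",
    ],
    # Reference answer, needed by context_precision / context_recall
    "ground_truth": "Paris",
}

# A RAG sample is a superset of an LLM sample: same core fields,
# plus the retrieval-side fields the RAG metrics require.
assert set(llm_sample) <= set(rag_sample)
```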
## When to Use What

| I want to… | Use |
|---|---|
| Score LLM answers quickly | LLM evaluation |
| Score RAG answers with retrieval context | RAG evaluation |
| Compare system prompt variants | Prompt evaluation |
| Evaluate agent traces I already have | Agent evaluation (pre-captured traces) |
| Run my agent and evaluate it | Agent evaluation (local or FloTorch hosted) |
| Evaluate a multi-agent workflow | Workflow evaluation |
| Create my own scoring check | Custom metrics |
## Two Ways to Use Floeval

Command line — Run evaluations from config files and dataset files. No code required. Best for quick runs, CI pipelines, and batch evaluation.
From code — Import Floeval into your application. Full control over dataset construction, metric selection, and result processing. Best for integration into existing systems.
Both flows support all five evaluation types.
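For the command-line flow, a config file typically names the eval type, the dataset file, and the metrics to run. The sketch below is hypothetical — every key name is an assumption for illustration; the real schema is documented with each eval type:

```yaml
# Hypothetical config sketch for a RAG evaluation run.
# All key names here are assumptions; see the eval-type pages for the real schema.
eval_type: rag
dataset: data/rag_samples.jsonl
metrics:
  - faithfulness
  - context_precision
  - context_recall
```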
## Eval Type Quick Reference

| Eval type | Key metrics | Dataset needs |
|---|---|---|
| LLM | answer_relevancy | user_input + llm_response |
| RAG | faithfulness, context_precision, context_recall | user_input + llm_response + contexts |
| Prompt | answer_relevancy (or RAG metrics with contexts) | Partial dataset + prompts_file with prompt_ids |
| Agent | goal_achievement, response_coherence, tool_call_accuracy | AgentDataset (full or partial) |
| Agentic workflow | Same agent metrics | AgentDataset + DAG config |
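The prompt-eval row above pairs a partial dataset with a `prompts_file` keyed by `prompt_ids`. Conceptually, that expansion is a cross product: every question is run under every instruction. A minimal sketch (the id-to-instruction mapping shown is an assumption about the file's shape):

```python
# Sketch of prompt-aware expansion: each question is paired with each
# candidate instruction. The prompts_file shape shown here
# (prompt_id -> instruction text) is an assumption for illustration.

prompts_file = {
    "concise": "Be concise.",
    "detailed": "Be detailed.",
}

questions = ["What is RAG?", "What is an agent?"]

# Cross product: every question runs under every prompt_id,
# so the scored results can be compared per instruction.
runs = [
    {"prompt_id": pid, "system_prompt": text, "user_input": q}
    for pid, text in prompts_file.items()
    for q in questions
]

assert len(runs) == len(prompts_file) * len(questions)
```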
## Core Metrics at a Glance

| Metric | What it measures | Required fields |
|---|---|---|
| answer_relevancy | How well the response answers the question | user_input, llm_response |
| faithfulness | Whether the response stays grounded in the retrieved context | llm_response, contexts |
| context_precision | Whether relevant documents are ranked above irrelevant ones | contexts, ground_truth |
| context_recall | How much of the reference information the retrieved context covers | contexts, ground_truth |
| goal_achievement | Whether the agent completed the intended goal | user_input, agent trace |
| response_coherence | Whether the agent's final response is consistent with its trace | agent trace |
| tool_call_accuracy | Whether the agent called the right tools with the right arguments | agent trace, reference_tool_calls |
You can extend any evaluation with custom metrics (function-based scoring) and LLM-as-judge criteria (natural language evaluation).
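Floeval's exact custom-metric registration API is covered elsewhere, but a function-based metric is conceptually just a callable from a sample to a score. A hedged sketch (the `(sample: dict) -> float` signature is an assumption; Floeval's real API may wrap it differently):

```python
# Illustrative function-based metric: exact match against ground_truth.
# The (sample: dict) -> float signature is an assumption for illustration;
# Floeval's actual registration API may differ.

def exact_match(sample: dict) -> float:
    """Return 1.0 if the response equals the reference answer, else 0.0."""
    response = sample.get("llm_response", "").strip().lower()
    reference = sample.get("ground_truth", "").strip().lower()
    return 1.0 if response == reference else 0.0

score = exact_match({"llm_response": "Paris", "ground_truth": "paris"})
```

Deterministic checks like this complement LLM-as-judge criteria, which instead describe the desired behavior in natural language and let a judge model score it.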
## Key Features

- Command line and code — run evaluations from config files or integrate directly into your application
- Multi-provider metrics — mix RAGAS, DeepEval, builtin, and custom metrics in one run
- Prompt-aware generation — expand partial samples using `prompt_ids` and a `prompts_file`
- Agent evaluation — score pre-captured traces or collect traces at runtime from LangChain agents or FloTorch-hosted agents
- Custom metrics — define function-based metrics or LLM-as-judge criteria for your domain
- FloTorch integration — use the FloTorch Gateway for LLMs and agents
## Related Documentation

- Onboarding — sign up, sign in, and create a FloTorch organization
- API Keys — create and manage API keys in the FloTorch Console
- API Documentation — gateway URL and model types for FloTorch
- Agent Builder — deploy agents in the FloTorch Console
## Next Steps

- Installation — install Floeval and configure credentials
- LLM Evaluations — evaluate raw LLM output quality
- RAG Evaluations — evaluate retrieval-augmented generation
- Prompt Evaluations — compare prompts using `prompts_file`
- Agent Evaluations — evaluate tool-using agents
- Workflow Evaluations — evaluate agentic workflows