
Floeval Overview

Floeval is a multi-backend evaluation framework for LLM and RAG systems. It measures how well your models, pipelines, and agents perform using quantitative metrics so you can iterate with confidence.

Floeval supports five evaluation types: LLM, RAG, Prompt, Agent (single), and Agentic Workflow (multi-agent DAG). Each maps to a different dataset shape and set of metrics.


Each evaluation type is designed for a specific scenario.

LLM evaluation — Validate and compare LLMs on your dataset. You provide questions and (optionally) model responses; Floeval scores how well each model answers. Use this to decide which LLM best suits your use case.

RAG evaluation — Validate retrieval-augmented generation pipelines. You provide questions and retrieved contexts; Floeval scores how well the retrieval and generation work together. Use this when your pipeline fetches documents and passes them to an LLM to produce answers.

Prompt evaluation — Try different instructions (e.g. “Be concise” vs “Be detailed”) on the same questions. Floeval runs each question with each instruction, scores the responses, and shows you which instruction makes your LLM perform better.
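The expansion step behind prompt evaluation can be sketched as a simple cross product of questions and instruction variants. The names below are illustrative, not Floeval's actual API:

```python
# Illustrative sketch: prompt evaluation pairs every question with every
# instruction variant. These names are hypothetical, not Floeval's API.
questions = ["What is RAG?", "Define faithfulness."]
prompts = {
    "concise": "Be concise.",
    "detailed": "Be detailed.",
}

# One sample per (question, prompt) pair; each sample is scored separately,
# so results can be grouped by prompt_id to compare instructions.
samples = [
    {"prompt_id": pid, "system_prompt": text, "user_input": q}
    for q in questions
    for pid, text in prompts.items()
]

print(len(samples))  # 2 questions x 2 prompts = 4 samples
```

Grouping the per-sample scores by `prompt_id` is what lets you say which instruction performs better overall.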

Agent evaluation — Understand how well your agent performed: whether it understood the task, took the right steps, used tools when needed, and produced a useful result. Use it when your system behaves like an agent (reasoning, tool use, multi-step workflows).

Workflow evaluation — Validate multi-agent pipelines where multiple agents work together in a DAG. Floeval evaluates each agent’s tool calls and outputs, and scores the overall workflow output. Use this when you have sequential, parallel, or combined workflows deployed on the FloTorch gateway.


| I want to… | Use |
| --- | --- |
| Score LLM answers quickly | LLM evaluation |
| Score RAG answers with retrieval context | RAG evaluation |
| Compare system prompt variants | Prompt evaluation |
| Evaluate agent traces I already have | Agent evaluation (pre-captured traces) |
| Run my agent and evaluate it | Agent evaluation (local or FloTorch hosted) |
| Evaluate a multi-agent workflow | Workflow evaluation |
| Create my own scoring check | Custom metrics |

Command line — Run evaluations from config files and dataset files. No code required. Best for quick runs, CI pipelines, and batch evaluation.

From code — Import Floeval into your application. Full control over dataset construction, metric selection, and result processing. Best for integration into existing systems.

Both flows support all five evaluation types.
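As an illustration of the command-line flow, a run might be described by a config file along these lines. Every key here is a hypothetical sketch, not Floeval's documented schema; consult the CLI reference for the actual format:

```yaml
# Hypothetical config sketch for a RAG evaluation run (keys are assumptions).
evaluation_type: rag
dataset: data/rag_samples.jsonl
metrics:
  - faithfulness
  - context_precision
  - context_recall
output: results/rag_run.json
```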


| Eval type | Key metrics | Dataset needs |
| --- | --- | --- |
| LLM | answer_relevancy | user_input + llm_response |
| RAG | faithfulness, context_precision, context_recall | user_input + llm_response + contexts |
| Prompt | answer_relevancy (or RAG metrics with contexts) | Partial dataset + prompts_file with prompt_ids |
| Agent | goal_achievement, response_coherence, tool_call_accuracy | AgentDataset (full or partial) |
| Agentic workflow | Same agent metrics | AgentDataset + DAG config |
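The dataset needs above can be pictured as plain per-sample records. The field names come from this page; the record structure itself is an assumption for illustration:

```python
# Sketch of per-sample dataset shapes implied by the table above.
# Field names are from the docs; the dict structure is an assumption.
llm_sample = {
    "user_input": "What is retrieval-augmented generation?",
    "llm_response": "RAG combines document retrieval with LLM generation.",
}

rag_sample = {
    **llm_sample,
    # RAG evaluation additionally needs the retrieved contexts...
    "contexts": [
        "RAG pipelines retrieve documents and pass them to an LLM.",
    ],
    # ...and a ground truth for context_precision / context_recall.
    "ground_truth": "RAG retrieves documents and feeds them to an LLM.",
}

required_rag_fields = {"user_input", "llm_response", "contexts", "ground_truth"}
assert required_rag_fields <= set(rag_sample)
```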

| Metric | What it measures | Required fields |
| --- | --- | --- |
| answer_relevancy | How well the response answers the question | user_input, llm_response |
| faithfulness | Whether the response stays grounded in the retrieved context | llm_response, contexts |
| context_precision | Whether relevant documents are ranked above irrelevant ones | contexts, ground_truth |
| context_recall | How much of the reference information the retrieved context covers | contexts, ground_truth |
| goal_achievement | Whether the agent completed the intended goal | user_input, agent trace |
| response_coherence | Whether the agent's final response is consistent with its trace | agent trace |
| tool_call_accuracy | Whether the agent called the right tools with the right arguments | agent trace, reference_tool_calls |
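To make tool_call_accuracy concrete, here is a toy version of the comparison it implies: matching the tool calls in an agent trace against reference_tool_calls. The logic is purely illustrative, not Floeval's implementation:

```python
# Toy tool_call_accuracy: fraction of reference tool calls that the agent
# reproduced with the same tool name and arguments. Illustrative only.
def tool_call_accuracy(trace_calls, reference_calls):
    if not reference_calls:
        return 1.0
    matched = sum(
        1 for ref in reference_calls
        if any(call["tool"] == ref["tool"] and call["args"] == ref["args"]
               for call in trace_calls)
    )
    return matched / len(reference_calls)

trace = [{"tool": "search", "args": {"query": "weather Paris"}}]
reference = [
    {"tool": "search", "args": {"query": "weather Paris"}},
    {"tool": "calculator", "args": {"expr": "2+2"}},
]
print(tool_call_accuracy(trace, reference))  # 0.5: one of two calls matched
```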

You can extend any evaluation with custom metrics (function-based scoring) and LLM-as-judge criteria (natural language evaluation).
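A function-based custom metric is, at its core, an ordinary function from a sample to a score in [0, 1]. The sketch below shows that idea only; how such a function is registered with Floeval is not covered here:

```python
# Sketch of a function-based custom metric: a plain function mapping a
# sample to a score in [0, 1]. Registration with Floeval is not shown.
def keyword_coverage(sample, keywords=("retrieval", "generation")):
    """Fraction of required keywords that appear in the response."""
    response = sample["llm_response"].lower()
    hits = sum(1 for kw in keywords if kw in response)
    return hits / len(keywords)

sample = {"llm_response": "RAG combines retrieval with LLM generation."}
print(keyword_coverage(sample))  # 1.0: both keywords present
```

LLM-as-judge criteria cover the complementary case, where the check is easier to state in natural language than in code.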


  • Command line and code — run evaluations from config files or integrate directly into your application
  • Multi-provider metrics — mix RAGAS, DeepEval, builtin, and custom metrics in one run
  • Prompt-aware generation — expand partial samples using prompt_ids and a prompts_file
  • Agent evaluation — score pre-captured traces or collect traces at runtime from LangChain agents or FloTorch-hosted agents
  • Custom metrics — define function-based metrics or LLM-as-judge criteria for your domain
  • FloTorch integration — use the FloTorch Gateway for LLMs and agents

  • Onboarding — sign up, sign in, and create a FloTorch organization
  • API Keys — create and manage API keys in the FloTorch Console
  • API Documentation — gateway URL and model types for FloTorch
  • Agent Builder — deploy agents in the FloTorch Console