# LLM Evaluations
LLM evaluations let you validate and compare different LLMs on your dataset. You provide questions and (optionally) model responses; Floeval scores how well each model answers. Use the results to decide which LLM best suits your use case.
## Step 1: Prepare Your Dataset

`user_input` is mandatory for every sample — it is the question or prompt you’re evaluating. For full datasets, you also need `llm_response` (the model’s answer). Wrap all samples in a `"samples"` array. Floeval supports both JSON and JSONL formats.
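These rules are easy to check before you run anything. The sketch below is illustrative and not part of Floeval's API; `validate_samples` is a hypothetical helper that enforces the required fields described above:

```python
import json

def validate_samples(data: dict, partial: bool = False) -> list[str]:
    """Check a dataset dict against the field rules and return a list of problems.

    Every sample needs "user_input"; full datasets also need "llm_response".
    """
    errors = []
    samples = data.get("samples")
    if not isinstance(samples, list):
        return ['top-level "samples" array is missing']
    for i, sample in enumerate(samples):
        if "user_input" not in sample:
            errors.append(f'sample {i}: missing required "user_input"')
        if not partial and "llm_response" not in sample:
            errors.append(f'sample {i}: missing "llm_response" (required for full datasets)')
    return errors

raw = '{"samples": [{"user_input": "What is Python?"}]}'
print(validate_samples(json.loads(raw)))                # flags the missing llm_response
print(validate_samples(json.loads(raw), partial=True))  # no errors as a partial dataset
```

The same dataset can be valid as a partial dataset and invalid as a full one, which is why the full/partial distinction matters in the steps below.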
### Full dataset — you already have responses

Use a full dataset when you have pre-generated model outputs that you want to score. Each sample must include `user_input` (required) and `llm_response`:

```json
{
  "samples": [
    {
      "user_input": "What is Python?",
      "llm_response": "Python is a programming language.",
      "ground_truth": "A programming language"
    },
    {
      "user_input": "What is RAG?",
      "llm_response": "RAG stands for Retrieval-Augmented Generation.",
      "contexts": ["RAG combines retrieval with generation."]
    }
  ]
}
```

`ground_truth` and `contexts` are optional. `contexts` is required only if you use `faithfulness`.
### Partial dataset — let Floeval generate responses

Use a partial dataset when you have questions but no answers yet. Omit `llm_response` and Floeval will call your LLM at runtime to generate responses, then score them automatically:

```json
{
  "samples": [
    { "user_input": "What is Python?" },
    { "user_input": "What is RAG?" }
  ]
}
```

### Requirements for partial evaluations

To run partial evaluations, you must follow these steps:
| Step | What to do |
|---|---|
| 1. Dataset | Omit `llm_response` from every sample. Include only `user_input` (and optional fields like `ground_truth`). |
| 2. Config (CLI) | Add `dataset_generation_config` with `generator_model` — the model Floeval will use to generate responses. |
| 3. From code | Pass `partial_dataset=True` to `DatasetLoader.from_samples()` and `dataset_generator_model` to `Evaluation()`. |
| 4. LLM access | Ensure `llm_config` is valid — Floeval needs it both for generation and for metrics that call the model. |
If any of these are missing, the evaluation will fail or behave unexpectedly.
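The full/partial distinction that drives these requirements comes down to one rule. This is a sketch of a plausible detection check (a dataset is partial if any sample lacks `llm_response`), not Floeval's actual implementation:

```python
def is_partial_dataset(samples: list[dict]) -> bool:
    """Treat a dataset as partial when any sample is missing "llm_response"."""
    return any("llm_response" not in s for s in samples)

full = [{"user_input": "What is Python?", "llm_response": "A programming language."}]
partial = [{"user_input": "What is Python?"}]

print(is_partial_dataset(full))     # False: every sample already has a response
print(is_partial_dataset(partial))  # True: responses must be generated at runtime
```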
## Step 2: Create Your Config

The config file tells Floeval which LLM to use and which metrics to run. Create a `config.yaml` with your LLM credentials and metric selection:

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small

evaluation_config:
  metrics:
    - ragas:answer_relevancy
```

For partial datasets, add `dataset_generation_config`:

```yaml
dataset_generation_config:
  generator_model: gpt-4o-mini
```

## Step 3: Run
### From the command line

Point the CLI at your config and dataset files. Floeval auto-detects whether your dataset is full or partial and adjusts accordingly:

```bash
floeval evaluate -c config.yaml -d dataset.json -o results.json
```

Partial datasets: you have two options.
- **One step (generate + evaluate):** Run `floeval evaluate` directly on a partial dataset. Floeval generates responses at runtime and evaluates them in a single run. Best when you want a quick evaluation and don’t need to save the generated responses.

- **Two steps (generate, then evaluate):** First generate a full dataset from your partial one, then evaluate it. Useful when you want to audit generated responses, reuse the same dataset for multiple metric configurations, or version-control the generated data.

  ```bash
  # Step 1: Generate full dataset from partial
  floeval generate -c config.yaml -d partial_dataset.json -o complete_dataset.json

  # Step 2: Evaluate the generated dataset
  floeval evaluate -c config.yaml -d complete_dataset.json -o results.json
  ```

### From code
To integrate evaluation into your application, use the `Evaluation` class. Build your LLM config, load or construct a dataset, select metrics, and call `run()`:
```python
import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples([
    {"user_input": "What is Python?", "llm_response": "Python is a programming language."},
    {"user_input": "What is RAG?", "llm_response": "RAG stands for Retrieval-Augmented Generation."},
], partial_dataset=False)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
)

results = evaluation.run()
print(results.aggregate_scores)
```

For partial datasets, use `partial_dataset=True` and pass `dataset_generator_model="gpt-4o-mini"`.
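`aggregate_scores` summarizes per-sample results across the dataset. Assuming it averages each metric over all samples (an assumption here — inspect your own results to confirm), the roll-up looks like this in plain Python:

```python
from statistics import mean

# Per-sample scores as a metric -> list-of-floats mapping,
# e.g. collected from a results.json file (illustrative data).
per_sample_scores = {
    "answer_relevancy": [0.9, 0.7, 0.8],
    "response_length": [1.0, 0.5, 0.75],
}

# One aggregate number per metric: the mean over all samples.
aggregate = {metric: mean(scores) for metric, scores in per_sample_scores.items()}
print(aggregate)
```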
## Available Metrics

| Metric ID | Provider | What it measures |
|---|---|---|
| `ragas:answer_relevancy` | RAGAS | How relevant the answer is to the question |
| `deepeval:answer_relevancy` | DeepEval | Answer relevance (DeepEval implementation) |
### Custom metrics

You can define your own scoring functions using the `@custom_metric` decorator. The function receives the response (mapped from `llm_response`) and returns a float score between 0 and 1:

```python
from floeval.api.metrics.custom import custom_metric

@custom_metric(threshold=0.5)
def response_length(response: str) -> float:
    return min(len(response) / 100.0, 1.0)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["ragas:answer_relevancy", response_length],
)
```

With partial datasets: Custom metrics work the same way. Floeval generates the response first, then passes it to your custom metric. No extra configuration is needed — your function receives the generated `llm_response` as the `response` argument.
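As a slightly richer illustration of the same scoring pattern, here is a plain function (keyword coverage against an expected answer) written without the decorator so it runs standalone; `keyword_coverage` is a hypothetical name and its two-argument signature is not the decorator's single-`response` interface:

```python
import string

def keyword_coverage(response: str, expected: str) -> float:
    """Fraction of expected-answer words that appear in the response (0.0 to 1.0)."""
    strip = str.maketrans("", "", string.punctuation)
    expected_words = set(expected.lower().translate(strip).split())
    if not expected_words:
        return 0.0
    response_words = set(response.lower().translate(strip).split())
    return len(expected_words & response_words) / len(expected_words)

score = keyword_coverage("Python is a programming language.", "A programming language")
print(score)  # 1.0 — every expected word appears in the response
```

Like `response_length`, it returns a value in [0, 1], so a threshold such as 0.5 can classify each sample as pass or fail.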
## Next Steps

- RAG Evaluations — add context grounding with `faithfulness`
- Prompt Evaluations — compare system prompts at scale
- Agent Evaluations — evaluate tool-using agents