# LLM Evaluations
LLM evaluations let you validate and compare different LLMs on your dataset. You provide questions and (optionally) model responses; Floeval scores how well each model answers. Use the results to decide which LLM best suits your use case.
## Step 1: Prepare Your Dataset

`user_input` is mandatory for every sample — it is the question or prompt you’re evaluating. For full datasets, you also need `llm_response` (the model’s answer). Wrap all samples in a `"samples"` array. Floeval supports both JSON and JSONL formats.
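These rules are easy to check before you run anything. The sketch below is illustrative and not part of Floeval's API; `validate_samples` is a hypothetical helper that enforces the required fields described above:

```python
import json

def validate_samples(data: dict, partial: bool = False) -> list[str]:
    """Check a dataset dict against the field rules and return a list of problems.

    Every sample needs "user_input"; full datasets also need "llm_response".
    """
    errors = []
    samples = data.get("samples")
    if not isinstance(samples, list):
        return ['top-level "samples" array is missing']
    for i, sample in enumerate(samples):
        if "user_input" not in sample:
            errors.append(f'sample {i}: missing required "user_input"')
        if not partial and "llm_response" not in sample:
            errors.append(f'sample {i}: missing "llm_response" (required for full datasets)')
    return errors

raw = '{"samples": [{"user_input": "What is Python?"}]}'
print(validate_samples(json.loads(raw)))                # flags the missing llm_response
print(validate_samples(json.loads(raw), partial=True))  # no errors as a partial dataset
```

The same dataset can be valid as a partial dataset and invalid as a full one, which is why the full/partial distinction matters in the steps below.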
### Full dataset — you already have responses

Use a full dataset when you have pre-generated model outputs that you want to score. Each sample must include `user_input` (required) and `llm_response`:

```json
{
  "samples": [
    {
      "user_input": "What is Python?",
      "llm_response": "Python is a programming language.",
      "ground_truth": "A programming language"
    },
    {
      "user_input": "What is RAG?",
      "llm_response": "RAG stands for Retrieval-Augmented Generation.",
      "contexts": ["RAG combines retrieval with generation."]
    }
  ]
}
```

`ground_truth` and `contexts` are optional. `contexts` is required only if you use `faithfulness`.
### Partial dataset — let Floeval generate responses

Use a partial dataset when you have questions but no answers yet. Omit `llm_response` and Floeval will call your LLM at runtime to generate responses, then score them automatically:

```json
{
  "samples": [
    { "user_input": "What is Python?" },
    { "user_input": "What is RAG?" }
  ]
}
```

### Requirements for partial evaluations

To run partial evaluations, you must follow these steps:
| Step | What to do |
|---|---|
| 1. Dataset | Omit `llm_response` from every sample. Include only `user_input` (and optional fields like `ground_truth`). |
| 2. Config (CLI) | Add `dataset_generation_config` with `generator_model` — the model Floeval will use to generate responses. |
| 3. From code | Pass `partial_dataset=True` to `DatasetLoader.from_samples()` and `dataset_generator_model` to `Evaluation()`. |
| 4. LLM access | Ensure `llm_config` is valid — Floeval needs it both for generation and for metrics that call the model. |
If any of these are missing, the evaluation will fail or behave unexpectedly.
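The full/partial distinction that drives these requirements comes down to one rule. This is a sketch of a plausible detection check (a dataset is partial if any sample lacks `llm_response`), not Floeval's actual implementation:

```python
def is_partial_dataset(samples: list[dict]) -> bool:
    """Treat a dataset as partial when any sample is missing "llm_response"."""
    return any("llm_response" not in s for s in samples)

full = [{"user_input": "What is Python?", "llm_response": "A programming language."}]
partial = [{"user_input": "What is Python?"}]

print(is_partial_dataset(full))     # False: every sample already has a response
print(is_partial_dataset(partial))  # True: responses must be generated at runtime
```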
## Step 2: Create Your Config

The config file tells Floeval which LLM to use and which metrics to run. Create a `config.yaml` with your LLM credentials and metric selection:

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small

evaluation_config:
  metrics:
    - ragas:answer_relevancy
```

For partial datasets, add `dataset_generation_config`:

```yaml
dataset_generation_config:
  generator_model: gpt-4o-mini
```

## Step 3: Run
### From the command line

Point the CLI at your config and dataset files. Floeval auto-detects whether your dataset is full or partial and adjusts accordingly:

```bash
floeval evaluate -c config.yaml -d dataset.json -o results.json
```

Partial datasets: you have two options.
- **One step (generate + evaluate):** Run `floeval evaluate` directly on a partial dataset. Floeval generates responses at runtime and evaluates them in a single run. Best when you want a quick evaluation and don’t need to save the generated responses.

- **Two steps (generate, then evaluate):** First generate a full dataset from your partial one, then evaluate it. Useful when you want to audit generated responses, reuse the same dataset for multiple metric configurations, or version-control the generated data.

  ```bash
  # Step 1: Generate full dataset from partial
  floeval generate -c config.yaml -d partial_dataset.json -o complete_dataset.json

  # Step 2: Evaluate the generated dataset
  floeval evaluate -c config.yaml -d complete_dataset.json -o results.json
  ```

### From code
To integrate evaluation into your application, use the `Evaluation` class. Build your LLM config, load or construct a dataset, select metrics, and call `run()`:
```python
import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples([
    {"user_input": "What is Python?", "llm_response": "Python is a programming language."},
    {"user_input": "What is RAG?", "llm_response": "RAG stands for Retrieval-Augmented Generation."},
], partial_dataset=False)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
)

results = evaluation.run()
print(results.aggregate_scores)
```

For partial datasets, use `partial_dataset=True` and pass `dataset_generator_model="gpt-4o-mini"`.
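`aggregate_scores` summarizes per-sample results across the dataset. Assuming it averages each metric over all samples (an assumption here — inspect your own results to confirm), the roll-up looks like this in plain Python:

```python
from statistics import mean

# Per-sample scores as a metric -> list-of-floats mapping,
# e.g. collected from a results.json file (illustrative data).
per_sample_scores = {
    "answer_relevancy": [0.9, 0.7, 0.8],
    "response_length": [1.0, 0.5, 0.75],
}

# One aggregate number per metric: the mean over all samples.
aggregate = {metric: mean(scores) for metric, scores in per_sample_scores.items()}
print(aggregate)
```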
## Available Metrics

| Metric ID | Provider | What it measures |
|---|---|---|
| `ragas:answer_relevancy` | RAGAS | How relevant the answer is to the question |
| `deepeval:answer_relevancy` | DeepEval | Answer relevance (DeepEval implementation) |
### Custom metrics

You can define your own scoring functions using the `@custom_metric` decorator. The function receives the response (mapped from `llm_response`) and returns a float score between 0 and 1:

```python
from floeval.api.metrics.custom import custom_metric

@custom_metric(threshold=0.5)
def response_length(response: str) -> float:
    return min(len(response) / 100.0, 1.0)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["ragas:answer_relevancy", response_length],
)
```

With partial datasets: Custom metrics work the same way. Floeval generates the response first, then passes it to your custom metric. No extra configuration is needed — your function receives the generated `llm_response` as the `response` argument.
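As a slightly richer illustration of the same scoring pattern, here is a plain function (keyword coverage against an expected answer) written without the decorator so it runs standalone; `keyword_coverage` is a hypothetical name and its two-argument signature is not the decorator's single-`response` interface:

```python
import string

def keyword_coverage(response: str, expected: str) -> float:
    """Fraction of expected-answer words that appear in the response (0.0 to 1.0)."""
    strip = str.maketrans("", "", string.punctuation)
    expected_words = set(expected.lower().translate(strip).split())
    if not expected_words:
        return 0.0
    response_words = set(response.lower().translate(strip).split())
    return len(expected_words & response_words) / len(expected_words)

score = keyword_coverage("Python is a programming language.", "A programming language")
print(score)  # 1.0 — every expected word appears in the response
```

Like `response_length`, it returns a value in [0, 1], so a threshold such as 0.5 can classify each sample as pass or fail.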
## Next Steps

- RAG Evaluations — add context grounding with `faithfulness`
- Prompt Evaluations — compare system prompts at scale
- Agent Evaluations — evaluate tool-using agents