
RAG Evaluations

RAG (Retrieval-Augmented Generation) evaluations validate systems that retrieve relevant documents and use them to generate answers. You provide questions and retrieved contexts; Floeval scores how well the retrieval and generation work together. Use this when your pipeline fetches documents from a knowledge base or vector store and passes them to an LLM to produce an answer.


RAG datasets extend the LLM format with a contexts field — a list of retrieved document passages for each question. Floeval uses these contexts to evaluate whether the answer is grounded in the retrieved information and whether the right documents were retrieved.

Full dataset — you already have responses and contexts


Use a full dataset when you have both model responses and the documents that were retrieved for each question:

{
  "samples": [
    {
      "user_input": "How does photosynthesis work?",
      "llm_response": "Photosynthesis converts sunlight into energy using chlorophyll in plants.",
      "contexts": [
        "Plants use chlorophyll to capture light.",
        "Converts CO2 and water into glucose and oxygen."
      ],
      "ground_truth": "Converts light into chemical energy"
    },
    {
      "user_input": "What is machine learning?",
      "llm_response": "Machine learning is a branch of AI where systems learn from data.",
      "contexts": [
        "ML uses algorithms to find patterns in data.",
        "Common types include supervised and unsupervised learning."
      ]
    }
  ]
}

Partial dataset — provide contexts, let Floeval generate responses


Use a partial dataset when you have questions and retrieved contexts but no model responses yet. Floeval calls your LLM with the question and contexts, generates the response, then scores it:

{
  "samples": [
    {
      "user_input": "What is RAG?",
      "contexts": ["RAG combines document retrieval with language generation."]
    }
  ]
}

To run partial RAG evaluations, you must follow these steps:

| Step | What to do |
| --- | --- |
| 1. Dataset | Omit llm_response from every sample. Include user_input and contexts (required for RAG — Floeval needs them to generate the answer). Optionally add ground_truth for context metrics. |
| 2. Config (CLI) | Add dataset_generation_config with generator_model — the model Floeval will use to generate responses from question + contexts. |
| 3. From code | Pass partial_dataset=True to DatasetLoader.from_samples() and dataset_generator_model to Evaluation(). |
| 4. LLM access | Ensure llm_config is valid — Floeval needs it for generation and for scoring. |

If any of these are missing, the evaluation will fail or behave unexpectedly.


The config specifies your LLM credentials and which RAG metrics to run. Start with answer_relevancy and faithfulness for a complete picture of answer quality:

llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small
evaluation_config:
  metrics:
    - ragas:answer_relevancy
    - ragas:faithfulness

For partial datasets, add:

dataset_generation_config:
  generator_model: gpt-4o-mini

Run the evaluation by pointing the CLI at your config and dataset. The CLI auto-detects full vs partial datasets:

floeval evaluate -c config.yaml -d rag_dataset.json -o results.json

Use the Evaluation class to run RAG evaluations from code. The setup is the same as LLM evaluations, but your dataset includes contexts and you add context-aware metrics like faithfulness:

import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples([
    {
        "user_input": "What is RAG?",
        "llm_response": "RAG stands for Retrieval-Augmented Generation.",
        "contexts": ["RAG combines retrieval with generation."],
    },
    {
        "user_input": "How does photosynthesis work?",
        "llm_response": "Photosynthesis converts sunlight into energy.",
        "contexts": ["Plants use chlorophyll.", "Converts CO2 and water into glucose."],
    },
], partial_dataset=False)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy", "faithfulness"],
    default_provider="ragas",
)
results = evaluation.run()
print(results.aggregate_scores)

| Metric ID | What it measures | Key fields |
| --- | --- | --- |
| ragas:answer_relevancy | How relevant the answer is to the question | user_input, llm_response |
| ragas:faithfulness | Whether the answer is grounded in the contexts | llm_response, contexts |
| ragas:context_precision | Whether relevant contexts are ranked first | contexts, ground_truth |
| ragas:context_recall | How much reference info is covered by contexts | contexts, ground_truth |
| ragas:context_entity_recall | Entity coverage in contexts vs reference | contexts, ground_truth |
| ragas:noise_sensitivity | Sensitivity to noisy or irrelevant context | contexts, llm_response |

| Metric ID | What it measures | Key fields |
| --- | --- | --- |
| deepeval:answer_relevancy | Answer relevance | user_input, llm_response |
| deepeval:faithfulness | Answer grounded in contexts | llm_response, contexts |
| deepeval:contextual_precision | Context precision | contexts, ground_truth |
| deepeval:contextual_recall | Context recall | contexts, ground_truth |
| deepeval:contextual_relevancy | Overall context relevancy | contexts |

You can route individual metrics to different scoring backends in the same evaluation. This is useful when you want RAGAS scoring for relevancy and DeepEval scoring for faithfulness:

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["ragas:answer_relevancy", "deepeval:faithfulness"],
)

You can add custom metrics (see LLM Evaluations) to RAG evaluations. With partial datasets, your custom metric receives the generated response after Floeval produces it from the question and contexts. No extra configuration is needed.