
RAG Evaluations

RAG (Retrieval-Augmented Generation) evaluations validate systems that retrieve relevant documents and use them to generate answers. You provide questions and retrieved contexts; Floeval scores how well the retrieval and generation work together. Use this when your pipeline fetches documents from a knowledge base or vector store and passes them to an LLM to produce an answer.


RAG datasets extend the LLM format with a contexts field — a list of retrieved document passages for each question. Floeval uses these contexts to evaluate whether the answer is grounded in the retrieved information and whether the right documents were retrieved.

Full dataset — you already have responses and contexts


Use a full dataset when you have both model responses and the documents that were retrieved for each question:

{
  "samples": [
    {
      "user_input": "How does photosynthesis work?",
      "llm_response": "Photosynthesis converts sunlight into energy using chlorophyll in plants.",
      "contexts": [
        "Plants use chlorophyll to capture light.",
        "Converts CO2 and water into glucose and oxygen."
      ],
      "ground_truth": "Converts light into chemical energy"
    },
    {
      "user_input": "What is machine learning?",
      "llm_response": "Machine learning is a branch of AI where systems learn from data.",
      "contexts": [
        "ML uses algorithms to find patterns in data.",
        "Common types include supervised and unsupervised learning."
      ]
    }
  ]
}

Partial dataset — provide contexts, let Floeval generate responses


Use a partial dataset when you have questions and retrieved contexts but no model responses yet. Floeval calls your LLM with the question and contexts, generates the response, then scores it:

{
  "samples": [
    {
      "user_input": "What is RAG?",
      "contexts": ["RAG combines document retrieval with language generation."]
    }
  ]
}

To run partial RAG evaluations, you must follow these steps:

| Step | What to do |
| --- | --- |
| 1. Dataset | Omit llm_response from every sample. Include user_input and contexts (required for RAG — Floeval needs them to generate the answer). Optionally add ground_truth for context metrics. |
| 2. Config (CLI) | Add dataset_generation_config with generator_model — the model Floeval will use to generate responses from question + contexts. |
| 3. From code | Pass partial_dataset=True to DatasetLoader.from_samples() and dataset_generator_model to Evaluation(). |
| 4. LLM access | Ensure llm_config is valid — Floeval needs it for generation and for scoring. |

If any of these are missing, the evaluation will fail or behave unexpectedly.


The config specifies your LLM credentials and which RAG metrics to run. Start with answer_relevancy and faithfulness for a complete picture of answer quality:

llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small
evaluation_config:
  metrics:
    - ragas:answer_relevancy
    - ragas:faithfulness

For partial datasets, add:

dataset_generation_config:
  generator_model: gpt-4o-mini

Run the evaluation by pointing the CLI at your config and dataset. The CLI auto-detects full vs partial datasets:

floeval evaluate -c config.yaml -d rag_dataset.json -o results.json

Use the Evaluation class to run RAG evaluations from code. The setup is the same as LLM evaluations, but your dataset includes contexts and you add context-aware metrics like faithfulness:

import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples([
    {
        "user_input": "What is RAG?",
        "llm_response": "RAG stands for Retrieval-Augmented Generation.",
        "contexts": ["RAG combines retrieval with generation."],
    },
    {
        "user_input": "How does photosynthesis work?",
        "llm_response": "Photosynthesis converts sunlight into energy.",
        "contexts": ["Plants use chlorophyll.", "Converts CO2 and water into glucose."],
    },
], partial_dataset=False)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy", "faithfulness"],
    default_provider="ragas",
)
results = evaluation.run()
print(results.aggregate_scores)

| Metric ID | What it measures | Key fields |
| --- | --- | --- |
| ragas:answer_relevancy | How relevant the answer is to the question | user_input, llm_response |
| ragas:faithfulness | Whether the answer is grounded in the contexts | llm_response, contexts |
| ragas:context_precision | Whether relevant contexts are ranked first | contexts, ground_truth |
| ragas:context_recall | How much reference info is covered by contexts | contexts, ground_truth |
| ragas:context_entity_recall | Entity coverage in contexts vs reference | contexts, ground_truth |
| ragas:noise_sensitivity | Sensitivity to noisy or irrelevant context | contexts, llm_response |

| Metric ID | What it measures | Key fields |
| --- | --- | --- |
| deepeval:answer_relevancy | Answer relevance | user_input, llm_response |
| deepeval:faithfulness | Answer grounded in contexts | llm_response, contexts |
| deepeval:contextual_precision | Context precision | contexts, ground_truth |
| deepeval:contextual_recall | Context recall | contexts, ground_truth |
| deepeval:contextual_relevancy | Overall context relevancy | contexts |

You can route individual metrics to different scoring backends in the same evaluation. This is useful when you want RAGAS scoring for relevancy and DeepEval scoring for faithfulness:

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["ragas:answer_relevancy", "deepeval:faithfulness"],
)

You can add custom metrics (see LLM Evaluations) to RAG evaluations. With partial datasets, your custom metric receives the generated response after Floeval produces it from the question and contexts. No extra configuration is needed.