Installation & Setup
Prerequisites
- Python 3.11 or higher — check with `python --version`
- LLM API key — from OpenAI or any OpenAI-compatible provider
- Outbound HTTPS access — to reach your model provider
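The version requirement above can also be checked from inside Python; a minimal stdlib sketch (the helper name is just for this example):

```python
import sys

def check_python(min_version: tuple[int, int] = (3, 11)) -> bool:
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version

# False means you need to upgrade before installing Floeval.
print(check_python())
```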
Install
Copy and run this in your terminal to install Floeval:

```bash
pip install floeval
```

For FloTorch-hosted agents, install the optional extra:

```bash
pip install "floeval[flotorch]"
```

From source
```bash
git clone https://github.com/FloTorch/floeval.git
cd floeval
pip install -e .
```

Verify
Run the following to confirm Floeval is installed correctly:

```bash
floeval --version
```

You should see output similar to `floeval 0.1.0b1` (or your installed version). If the command is not found, ensure your Python environment is activated and Floeval is installed in that environment.
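If the command is not found, you can confirm from Python whether the console script is visible on your PATH; a small stdlib sketch:

```python
import shutil

def cli_on_path(name: str = "floeval") -> bool:
    """Return True if the named console script is reachable on PATH."""
    return shutil.which(name) is not None

print(cli_on_path())
```

If this prints `False` even though the package installed successfully, the environment that owns the script is likely not active in your current shell.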
Configure LLM Credentials
Floeval needs LLM credentials for metrics that call the model (answer_relevancy, faithfulness, LLM-as-judge, etc.). You provide credentials through a config file (command line) or a Python object (from code).
Config file (for CLI)
Copy the following into a file named config.yaml:

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small
  system_prompt: "You are a helpful assistant."  # optional
```

Python object
Use this in your Python code to build the config programmatically:
```python
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)
```

Load credentials from environment variables or a secrets manager — Floeval only needs the final config object.
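One way to follow that advice is to pull the values from environment variables before constructing the config object; a minimal stdlib sketch (the variable names are a convention of this example, not something Floeval requires):

```python
import os

def llm_settings_from_env() -> dict:
    """Collect LLM connection settings from environment variables.

    Expects OPENAI_API_KEY; OPENAI_BASE_URL is optional and falls back
    to OpenAI's public endpoint.
    """
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before running evaluations")
    return {
        "base_url": os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "api_key": api_key,
        "chat_model": "gpt-4o-mini",
        "embedding_model": "text-embedding-3-small",
    }

# The resulting dict can be unpacked straight into the provider config:
# llm_config = OpenAIProviderConfig(**llm_settings_from_env())
```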
Using the FloTorch Gateway
When your LLMs or agents run on the FloTorch gateway, use your workspace gateway URL and API key:
- Sign in to the FloTorch Console — see Onboarding
- Create an API key — go to Settings > API Keys — see API Keys
- Get your gateway URL — from the API Documentation page
- Set `base_url` to your gateway URL and `api_key` to your FloTorch API key

```yaml
llm_config:
  base_url: "https://gateway.flotorch.cloud/openai/v1"
  api_key: "your-flotorch-api-key"
  chat_model: flotorch/turbo
  embedding_model: text-embedding-3-small
```

Quick Validation
Confirm everything works with a quick test. You need a config file and a dataset file for the CLI, or you can use the Python example which builds the dataset inline.
Step 1: Create a config file
Create a new file named config.yaml in your project folder. Replace your-api-key with your actual API key:
```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small

evaluation_config:
  metrics:
    - ragas:answer_relevancy
```

Step 2: Create a dataset file
Create a new file named dataset.json in the same folder. Copy the following (the dataset must have a "samples" array; each sample needs user_input and llm_response):

```json
{
  "samples": [
    {
      "user_input": "What is RAG?",
      "llm_response": "RAG is Retrieval-Augmented Generation."
    }
  ]
}
```

Step 3: Run the CLI
From the same folder where you created config.yaml and dataset.json, run:

```bash
floeval evaluate -c config.yaml -d dataset.json
```

If everything is configured correctly, you should see aggregate scores printed to the terminal (e.g. `{'ragas:answer_relevancy': 0.85}`). Add `-o results.json` to save the output to a file.
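Once results are saved with `-o`, they can be post-processed from Python. The sketch below assumes the file is a flat JSON mapping of metric name to aggregate score, matching the terminal output shown above; check your actual results.json for the exact schema before relying on it:

```python
import json

def failing_metrics(path: str, threshold: float = 0.7) -> dict:
    """Return metrics whose aggregate score falls below the threshold.

    Assumes a flat {"metric_name": score} layout; adjust for the real
    schema of your results file.
    """
    with open(path) as f:
        scores = json.load(f)
    return {name: score for name, score in scores.items() if score < threshold}
```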
Alternative: Python (no files needed)
The code below builds the dataset inline, so you can validate your setup without creating any files. Copy and run this script:
```python
from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples(
    [{"user_input": "What is RAG?", "llm_response": "RAG is Retrieval-Augmented Generation."}],
    partial_dataset=False,
)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["ragas:answer_relevancy"],
)

results = evaluation.run()
print(results.aggregate_scores)
```

Next Steps
- LLM Evaluations — evaluate raw model outputs
- RAG Evaluations — evaluate retrieval-augmented generation
- Agent Evaluations — evaluate tool-using agents
- Workflow Evaluations — evaluate agentic workflows