Creating evaluations

Depending on the evaluation type, you may need:

  • A dataset in the workspace (Q&A or evaluation pairs).
  • Chat and/or embedding models registered for the workspace.
  • For RAG or prompt-with-retrieval flows: a knowledge base (vector store).
  • For Agent / Workflow: a published agent or workflow to evaluate.

Evaluation project names must be unique within the workspace and follow the console’s naming rules (alphanumeric and dashes). The UI warns if a name is already in use.


  1. Go to Evaluate → Create evaluation (or choose a type card when the list is empty).
  2. Select LLM, RAG, Prompt, Agent, or Workflow.
  3. Follow the wizard. All types use the same high-level steps:

Set the evaluation name, dataset, and type-specific resources:

  • LLM — Turn type (single-turn is supported), N-shot counts, one or more inferencing models, evaluation model (for scoring outputs), embedding model where required, optional system and user prompts.
  • RAG — RAG evaluation mode, KNN, knowledge base, inferencing and evaluation models, embeddings, prompts as applicable.
  • Prompt — One or more prompt pairs (system + user). You can run multiple prompt variants as separate experiments. Pairs may be combined with retrieval settings when the form supports it.
  • Agent — Select the agent, evaluation model or evaluator options, optional prompts, trajectory options where shown.
  • Workflow — Select the workflow, evaluation model, optional prompts.

The wizard may generate multiple experiments from your choices (for example one experiment per inferencing model or per prompt-pair index). The total count must stay within the per-project experiment limit (50).
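As a rough illustration of how the experiment count can grow, the sketch below assumes one experiment per (inferencing model, prompt pair) combination; the exact combination rules depend on the wizard, and all names here are hypothetical. Only the per-project limit of 50 comes from the documentation above.

```python
# Sketch: estimate how many experiments a wizard configuration would generate,
# assuming one experiment per (inferencing model, prompt pair) combination.
# The combination rule is an assumption for illustration; the limit of 50
# is the documented per-project experiment cap.

EXPERIMENT_LIMIT = 50

def experiment_count(inferencing_models: list[str],
                     prompt_pairs: list[tuple[str, str]]) -> int:
    # An empty list contributes a factor of 1 (the step simply isn't varied).
    count = max(1, len(inferencing_models)) * max(1, len(prompt_pairs))
    if count > EXPERIMENT_LIMIT:
        raise ValueError(
            f"{count} experiments exceeds the per-project limit of {EXPERIMENT_LIMIT}"
        )
    return count

# Example: 3 inferencing models x 2 prompt pairs -> 6 experiments
print(experiment_count(["model-a", "model-b", "model-c"],
                       [("sys-1", "user-1"), ("sys-2", "user-2")]))
```

In other words, adding a model or a prompt variant multiplies the experiment count, so large grids hit the limit quickly.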

Choose metrics from the registry for your evaluation category. Metrics are labeled by provider (for example Ragas, DeepEval, builtin). You can optionally set thresholds where the UI supports them.

Only metrics valid for the selected evaluation type are offered; invalid keys are rejected when the run is submitted.
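A minimal sketch of the check described above: metric keys are validated against the registry for the chosen evaluation type, and unknown keys cause the submission to be rejected. The registry contents and function names below are invented for illustration; only the provider labels (Ragas, DeepEval, builtin) appear in the documentation.

```python
# Sketch: reject metric keys that are not registered for the selected
# evaluation type. Registry contents here are hypothetical examples.

REGISTRY: dict[str, set[str]] = {
    "llm": {"deepeval:toxicity", "builtin:exact_match"},
    "rag": {"ragas:faithfulness", "ragas:context_precision", "builtin:exact_match"},
}

def validate_metrics(eval_type: str, metric_keys: list[str]) -> None:
    allowed = REGISTRY.get(eval_type, set())
    invalid = [k for k in metric_keys if k not in allowed]
    if invalid:
        # Mirrors the documented behavior: invalid keys are rejected at submission.
        raise ValueError(f"Invalid metric keys for {eval_type!r}: {invalid}")

validate_metrics("rag", ["ragas:faithfulness"])  # valid selection, no error
```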

Confirm dataset, models, prompts, and the list of experiments to be created. Submit to enqueue the evaluation run.


The new project appears under Evaluations with its current status. Open View (or open the project from the overview) to see the results page: the experiment table, run progress, and links to per-experiment detail.

See Results and metrics for reading scores, question-level tables, and export.