
Commit 596ae2d

chore(weave): Imperative Evaluation Docs (#4194)
1 parent 72a6493 commit 596ae2d

4 files changed: +107 −5 lines changed


docs/docs/guides/core-types/evaluations.md

+10-2
@@ -38,6 +38,12 @@ weave.init('intro-example')
asyncio.run(evaluation.evaluate(function_to_evaluate))
```

:::info Looking for a less opinionated approach?

If you prefer a more flexible evaluation framework, check out Weave's [Imperative Evaluations](../evaluation/imperative_evaluations.md). The imperative approach offers more flexibility for complex workflows, while the standard evaluation framework provides more structure and guidance.

:::

## Create an Evaluation

To systematically improve your application, it's helpful to test your changes against a consistent dataset of potential inputs so that you catch regressions and can inspect your app's behaviour under different conditions. Using the `Evaluation` class, you can be sure you're comparing apples-to-apples by keeping track of all the details you're experimenting and evaluating with.

@@ -192,6 +198,7 @@ asyncio.run(evaluation.evaluate(function_to_evaluate))

### Using `preprocess_model_input` to format dataset rows before evaluating

The `preprocess_model_input` parameter allows you to transform your dataset examples before they are passed to your evaluation function. This is useful when you need to:

- Rename fields to match your model's expected input
- Transform data into the correct format
- Add or remove fields

@@ -241,6 +248,7 @@ asyncio.run(evaluation.evaluate(function_to_evaluate))

In this example, our dataset contains examples with an `input_text` field, but our evaluation function expects a `question` parameter. The `preprocess_example` function transforms each example by renaming the field, allowing the evaluation to work correctly.

The preprocessing function (see the sketch after this list):

1. Receives the raw example from your dataset
2. Returns a dictionary with the fields your model expects
3. Is applied to each example before it's passed to your evaluation function
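
To make that flow concrete, here is a minimal sketch of wiring `preprocess_model_input` into an `Evaluation`. The dataset rows, the `preprocess_example` helper, and the placeholder `function_to_evaluate` body are illustrative assumptions (scorers are omitted for brevity), not the full example from the documentation.

```python
import asyncio

import weave

weave.init("intro-example")

# Rows use `input_text`, but the evaluated function expects a `question` argument.
dataset = [
    {"input_text": "What is the capital of France?", "expected": "Paris"},
    {"input_text": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

def preprocess_example(example: dict) -> dict:
    # Return a dict whose keys match the evaluated function's parameters.
    return {"question": example["input_text"]}

@weave.op
def function_to_evaluate(question: str) -> str:
    # Placeholder logic; a real model call would go here.
    return f"Answer to: {question}"

evaluation = weave.Evaluation(
    dataset=dataset,
    preprocess_model_input=preprocess_example,
)
asyncio.run(evaluation.evaluate(function_to_evaluate))
```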
@@ -249,8 +257,8 @@ This is particularly useful when working with external datasets that may have di

### Using HuggingFace Datasets with evaluations

We are continuously improving our integrations with third-party services and libraries.

While we work on building more seamless integrations, you can use `preprocess_model_input` as a temporary workaround for using HuggingFace Datasets in Weave evaluations.

See our [Using HuggingFace Datasets in evaluations cookbook](/reference/gen_notebooks/hf_dataset_evals) for the current approach.

docs/docs/guides/evaluation/imperative_evaluations.md

+93
@@ -0,0 +1,93 @@
# Imperative Evaluations

The `EvaluationLogger` provides a flexible way to log evaluation data directly from your Python code. You don't need deep knowledge of Weave's internal data types; simply instantiate a logger and use its methods (`log_prediction`, `log_score`, `log_summary`) to record evaluation steps.

This approach is particularly helpful in complex workflows where the entire dataset or all scorers might not be defined upfront.

In contrast to the standard `Evaluation` object, which requires a predefined `Dataset` and list of `Scorer` objects, the imperative logger allows you to log individual predictions and their associated scores incrementally as they become available.

:::info Looking for a more opinionated approach?

If you prefer a more structured evaluation framework with predefined datasets and scorers, check out Weave's standard [Evaluation framework](../core-types/evaluations.md). The standard approach provides a more declarative way to define and run evaluations, with built-in support for datasets, scorers, and comprehensive reporting.

The imperative approach described on this page offers more flexibility for complex workflows, while the standard evaluation framework provides more structure and guidance.

:::

## Basic usage

1. **Initialize the logger:** Create an instance of `EvaluationLogger`. You can optionally provide strings or dictionaries as metadata for the `Model` and `Dataset` being evaluated. If omitted, default placeholders are used.
2. **Log predictions:** For each input/output pair from your model or system, call `log_prediction`. This method returns a `ScoreLogger` object tied to that specific prediction event.
3. **Log scores:** Use the `ScoreLogger` object obtained in the previous step to log scores via the `log_score` method. You can log multiple scores from different conceptual scorers (identified by string names or `Scorer` objects) for the same prediction. Call `finish()` on the score logger when you're done logging scores for that prediction to finalize it. _Note: After calling `finish()`, the `ScoreLogger` instance cannot be used to log additional scores._
4. **Log summary:** After processing all your examples and logging their predictions and scores, call `log_summary` on the main `EvaluationLogger` instance. This action finalizes the overall evaluation. Weave automatically calculates summaries for common score types (like counts and fractions for boolean scores) and merges these with any custom summary dictionary you provide. You can include metrics not logged as row-level scores, such as total elapsed time or other aggregate measures, in this summary dictionary. A condensed sketch of these steps follows this list; a complete, runnable example appears in the next section.
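
The condensed sketch below walks through the four steps in order. The dictionary metadata, scorer names, score values, and project name are illustrative assumptions; the next section shows the same flow as a complete loop over a dataset.

```python
import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("imperative-eval-demo")  # illustrative project name

# Step 1: initialize the logger; dictionaries (or plain strings) can be
# passed as metadata for the model and dataset being evaluated.
eval_logger = EvaluationLogger(
    model={"name": "my_model", "version": "v1"},
    dataset={"name": "my_dataset", "split": "validation"},
)

# Step 2: log one input/output pair; this returns a ScoreLogger.
pred_logger = eval_logger.log_prediction(
    inputs={"prompt": "2 + 2 = ?"},
    output="4",
)

# Step 3: log multiple scores from different conceptual scorers for the
# same prediction, then finalize it.
pred_logger.log_score(scorer="correctness", score=True)
pred_logger.log_score(scorer="conciseness", score=0.9)
pred_logger.finish()  # no further scores can be logged for this prediction

# Step 4: finalize the evaluation; Weave merges auto-computed score
# summaries with this custom summary dictionary.
eval_logger.log_summary({"total_time_s": 0.42})
```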

## Example

The following example shows how to use `EvaluationLogger` to log predictions and scores inline with your existing Python code.

The `user_model` function is defined and applied to a list of inputs. For each example:

- The input and output are logged using `log_prediction`.
- A simple correctness score (`correctness_score`) is logged via `log_score`.
- `finish()` finalizes logging for that prediction.

Finally, `log_summary` records any aggregate metrics and triggers automatic score summarization in Weave.

```python
import weave
from openai import OpenAI
from weave.flow.eval_imperative import EvaluationLogger

# Start a Weave client so the logged evaluation is recorded
# (the project name here is arbitrary).
weave.init("imperative-eval-demo")

# Initialize the logger (model/dataset names are optional metadata)
eval_logger = EvaluationLogger(
    model="my_model",
    dataset="my_dataset"
)

# Example input data (this can be any data structure you want)
eval_samples = [
    {'inputs': {'a': 1, 'b': 2}, 'expected': 3},
    {'inputs': {'a': 2, 'b': 3}, 'expected': 5},
    {'inputs': {'a': 3, 'b': 4}, 'expected': 7},
]

# Example model logic. This does not have to be decorated with @weave.op,
# but if you do, it will be traced and logged.
@weave.op
def user_model(a: int, b: int) -> int:
    oai = OpenAI()
    _ = oai.chat.completions.create(
        messages=[{"role": "user", "content": f"What is {a}+{b}?"}],
        model="gpt-4o-mini",
    )
    return a + b

# Iterate through examples, predict, and log
for sample in eval_samples:
    inputs = sample["inputs"]
    model_output = user_model(**inputs)  # Pass inputs as kwargs

    # Log the prediction input and output
    pred_logger = eval_logger.log_prediction(
        inputs=inputs,
        output=model_output
    )

    # Calculate and log a score for this prediction
    expected = sample["expected"]
    correctness_score = model_output == expected
    pred_logger.log_score(
        scorer="correctness",  # Simple string name for the scorer
        score=correctness_score
    )

    # Finish logging for this specific prediction
    pred_logger.finish()

# Log a final summary for the entire evaluation.
# Weave auto-aggregates the 'correctness' scores logged above.
summary_stats = {"subjective_overall_score": 0.8}
eval_logger.log_summary(summary_stats)

print("Evaluation logging complete. View results in the Weave UI.")
```

This imperative approach lets you log traces and evaluation data step by step, integrating easily into existing Python loops or workflows without requiring you to collect all data points up front.

docs/sidebars.ts

+1
@@ -103,6 +103,7 @@ const sidebars: SidebarsConfig = {
        "guides/evaluation/scorers",
        "guides/evaluation/builtin_scorers",
        "guides/evaluation/weave_local_scorers",
        "guides/evaluation/imperative_evaluations",
      ]
    },
  ],

weave/flow/eval_imperative.py

+3-3
@@ -281,9 +281,9 @@ class EvaluationLogger(BaseModel):
    using the `log_prediction` method, and finished when the `log_summary` method
    is called.

-   Each time you log a prediction, you will get back an `ImperativePredictionLogger`
-   object. You can use this object to log scores and metadata for that specific
-   prediction (see that class for more details).
+   Each time you log a prediction, you will get back a `ScoreLogger` object.
+   You can use this object to log scores and metadata for that specific
+   prediction. For more information, see the `ScoreLogger` class.

    Example:
    ```python
