
Commit 596ae2d

chore(weave): Imperative Evaluation Docs (#4194)
1 parent 72a6493 commit 596ae2d

4 files changed: +107 −5 lines changed


docs/docs/guides/core-types/evaluations.md

+10-2
@@ -38,6 +38,12 @@ weave.init('intro-example')
asyncio.run(evaluation.evaluate(function_to_evaluate))
```

:::info Looking for a less opinionated approach?

If you prefer a more flexible evaluation framework, check out Weave's [Imperative Evaluations](../evaluation/imperative_evaluations.md). The imperative approach offers more flexibility for complex workflows, while the standard evaluation framework provides more structure and guidance.

:::

## Create an Evaluation

To systematically improve your application, it's helpful to test your changes against a consistent dataset of potential inputs so that you catch regressions and can inspect your app's behaviour under different conditions. Using the `Evaluation` class, you can be sure you're comparing apples-to-apples by keeping track of all the details you're experimenting and evaluating with.

@@ -192,6 +198,7 @@ asyncio.run(evaluation.evaluate(function_to_evaluate))

### Using `preprocess_model_input` to format dataset rows before evaluating

The `preprocess_model_input` parameter allows you to transform your dataset examples before they are passed to your evaluation function. This is useful when you need to:

- Rename fields to match your model's expected input
- Transform data into the correct format
- Add or remove fields

@@ -241,6 +248,7 @@ asyncio.run(evaluation.evaluate(function_to_evaluate))

In this example, our dataset contains examples with an `input_text` field, but our evaluation function expects a `question` parameter. The `preprocess_example` function transforms each example by renaming the field, allowing the evaluation to work correctly.

The preprocessing function (see the sketch after this list):

1. Receives the raw example from your dataset
2. Returns a dictionary with the fields your model expects
3. Is applied to each example before it's passed to your evaluation function
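
To make that flow concrete, here is a minimal sketch of wiring `preprocess_model_input` into an `Evaluation`. The dataset rows, the `preprocess_example` helper, and the placeholder `function_to_evaluate` body are illustrative assumptions (scorers are omitted for brevity), not the full example from the documentation.

```python
import asyncio

import weave

weave.init("intro-example")

# Rows use `input_text`, but the evaluated function expects a `question` argument.
dataset = [
    {"input_text": "What is the capital of France?", "expected": "Paris"},
    {"input_text": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

def preprocess_example(example: dict) -> dict:
    # Return a dict whose keys match the evaluated function's parameters.
    return {"question": example["input_text"]}

@weave.op
def function_to_evaluate(question: str) -> str:
    # Placeholder logic; a real model call would go here.
    return f"Answer to: {question}"

evaluation = weave.Evaluation(
    dataset=dataset,
    preprocess_model_input=preprocess_example,
)
asyncio.run(evaluation.evaluate(function_to_evaluate))
```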
@@ -249,8 +257,8 @@ This is particularly useful when working with external datasets that may have di

### Using HuggingFace Datasets with evaluations

We are continuously improving our integrations with third-party services and libraries.

While we work on building more seamless integrations, you can use `preprocess_model_input` as a temporary workaround for using HuggingFace Datasets in Weave evaluations.

See our [Using HuggingFace Datasets in evaluations cookbook](/reference/gen_notebooks/hf_dataset_evals) for the current approach.

docs/docs/guides/evaluation/imperative_evaluations.md

+93
@@ -0,0 +1,93 @@
# Imperative Evaluations

The `EvaluationLogger` provides a flexible way to log evaluation data directly from your Python code. You don't need deep knowledge of Weave's internal data types; simply instantiate a logger and use its methods (`log_prediction`, `log_score`, `log_summary`) to record evaluation steps.

This approach is particularly helpful in complex workflows where the entire dataset or all scorers might not be defined upfront.

In contrast to the standard `Evaluation` object, which requires a predefined `Dataset` and list of `Scorer` objects, the imperative logger allows you to log individual predictions and their associated scores incrementally as they become available.

:::info Looking for a more opinionated approach?

If you prefer a more structured evaluation framework with predefined datasets and scorers, check out Weave's standard [Evaluation framework](../core-types/evaluations.md). The standard approach provides a more declarative way to define and run evaluations, with built-in support for datasets, scorers, and comprehensive reporting.

The imperative approach described on this page offers more flexibility for complex workflows, while the standard evaluation framework provides more structure and guidance.

:::

## Basic usage

1. **Initialize the logger:** Create an instance of `EvaluationLogger`. You can optionally provide strings or dictionaries as metadata for the `Model` and `Dataset` being evaluated. If omitted, default placeholders are used.
2. **Log predictions:** For each input/output pair from your model or system, call `log_prediction`. This method returns a `ScoreLogger` object tied to that specific prediction event.
3. **Log scores:** Use the `ScoreLogger` object obtained in the previous step to log scores via the `log_score` method. You can log multiple scores from different conceptual scorers (identified by string names or `Scorer` objects) for the same prediction. Call `finish()` on the score logger when you're done logging scores for that prediction to finalize it. _Note: After calling `finish()`, the `ScoreLogger` instance cannot be used to log additional scores._
4. **Log summary:** After processing all your examples and logging their predictions and scores, call `log_summary` on the main `EvaluationLogger` instance. This action finalizes the overall evaluation. Weave automatically calculates summaries for common score types (like counts and fractions for boolean scores) and merges these with any custom summary dictionary you provide. You can include metrics not logged as row-level scores, such as total elapsed time or other aggregate measures, in this summary dictionary. A condensed sketch of these steps follows this list; a complete, runnable example appears in the next section.
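
The condensed sketch below walks through the four steps in order. The dictionary metadata, scorer names, score values, and project name are illustrative assumptions; the next section shows the same flow as a complete loop over a dataset.

```python
import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("imperative-eval-demo")  # illustrative project name

# Step 1: initialize the logger; dictionaries (or plain strings) can be
# passed as metadata for the model and dataset being evaluated.
eval_logger = EvaluationLogger(
    model={"name": "my_model", "version": "v1"},
    dataset={"name": "my_dataset", "split": "validation"},
)

# Step 2: log one input/output pair; this returns a ScoreLogger.
pred_logger = eval_logger.log_prediction(
    inputs={"prompt": "2 + 2 = ?"},
    output="4",
)

# Step 3: log multiple scores from different conceptual scorers for the
# same prediction, then finalize it.
pred_logger.log_score(scorer="correctness", score=True)
pred_logger.log_score(scorer="conciseness", score=0.9)
pred_logger.finish()  # no further scores can be logged for this prediction

# Step 4: finalize the evaluation; Weave merges auto-computed score
# summaries with this custom summary dictionary.
eval_logger.log_summary({"total_time_s": 0.42})
```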

## Example

The following example shows how to use `EvaluationLogger` to log predictions and scores inline with your existing Python code.

The `user_model` function is defined and applied to a list of inputs. For each example:

- The input and output are logged using `log_prediction`.
- A simple correctness score (`correctness_score`) is logged via `log_score`.
- `finish()` finalizes logging for that prediction.

Finally, `log_summary` records any aggregate metrics and triggers automatic score summarization in Weave.

```python
import weave
from openai import OpenAI
from weave.flow.eval_imperative import EvaluationLogger

# Start a Weave client so the logged evaluation is recorded
# (the project name here is arbitrary).
weave.init("imperative-eval-demo")

# Initialize the logger (model/dataset names are optional metadata)
eval_logger = EvaluationLogger(
    model="my_model",
    dataset="my_dataset"
)

# Example input data (this can be any data structure you want)
eval_samples = [
    {'inputs': {'a': 1, 'b': 2}, 'expected': 3},
    {'inputs': {'a': 2, 'b': 3}, 'expected': 5},
    {'inputs': {'a': 3, 'b': 4}, 'expected': 7},
]

# Example model logic. This does not have to be decorated with @weave.op,
# but if you do, it will be traced and logged.
@weave.op
def user_model(a: int, b: int) -> int:
    oai = OpenAI()
    _ = oai.chat.completions.create(
        messages=[{"role": "user", "content": f"What is {a}+{b}?"}],
        model="gpt-4o-mini",
    )
    return a + b

# Iterate through examples, predict, and log
for sample in eval_samples:
    inputs = sample["inputs"]
    model_output = user_model(**inputs)  # Pass inputs as kwargs

    # Log the prediction input and output
    pred_logger = eval_logger.log_prediction(
        inputs=inputs,
        output=model_output
    )

    # Calculate and log a score for this prediction
    expected = sample["expected"]
    correctness_score = model_output == expected
    pred_logger.log_score(
        scorer="correctness",  # Simple string name for the scorer
        score=correctness_score
    )

    # Finish logging for this specific prediction
    pred_logger.finish()

# Log a final summary for the entire evaluation.
# Weave auto-aggregates the 'correctness' scores logged above.
summary_stats = {"subjective_overall_score": 0.8}
eval_logger.log_summary(summary_stats)

print("Evaluation logging complete. View results in the Weave UI.")
```

This imperative approach lets you log traces and evaluation data step by step, integrating easily into existing Python loops or workflows without requiring you to collect all data points up front.

docs/sidebars.ts

+1
@@ -103,6 +103,7 @@ const sidebars: SidebarsConfig = {
        "guides/evaluation/scorers",
        "guides/evaluation/builtin_scorers",
        "guides/evaluation/weave_local_scorers",
        "guides/evaluation/imperative_evaluations",
      ]
    },
  ],

weave/flow/eval_imperative.py

+3-3
@@ -281,9 +281,9 @@ class EvaluationLogger(BaseModel):
    using the `log_prediction` method, and finished when the `log_summary` method
    is called.

-   Each time you log a prediction, you will get back an `ImperativePredictionLogger`
-   object. You can use this object to log scores and metadata for that specific
-   prediction (see that class for more details).
+   Each time you log a prediction, you will get back a `ScoreLogger` object.
+   You can use this object to log scores and metadata for that specific
+   prediction. For more information, see the `ScoreLogger` class.

    Example:
    ```python
