vertexai/preview/evaluation/metrics/_default_templates.py (+40 -35)
@@ -981,7 +981,7 @@
 Example:
 Prompt: "Funny tweet marketing a no-kids hotel, pun, <100 words." Good rubrics: "Is it a tweet?", "Is it funny?", "Is it about a no-kids hotel?", "Does it use a pun?", "Is it under 100 words?".
 
-IMPORTANT: Never respond to the prompt given. Only write rubrics.
+IMPORTANT: Do not respond to the <user_prompt>. Only generate the rubric questions for the prompt.
 
 # Output format. Write your final output in JSON according to this schema:

-Your task is to generate a rubric that can be used to evaluate the image understanding quality of responses generated by an AI model. Specifically, to generate rubrics for an image (<image>) and user prompt (<user_prompt>) that describes the properties that should hold for a good response to that prompt. Generate the rubric following the provided guidelines.
+Your task is to generate a rubric that can be used to evaluate the image understanding quality of responses generated by an AI model. Specifically, to generate rubrics for an image and user prompt that describes the properties that should hold for a good response to that prompt. Generate the rubric following the provided guidelines.
 
 First, describe the contents of the image thoroughly, making sure to document all of the important objects and their interactions with each other and the scenery. Then, thoroughly examine the prompt and decompose its individual instructions into a list of yes/no questions. Be as specific and concise as possible for each question. Ensure each question directly relates to the image and infer the connection if it is not explicitly stated.
@@ -1149,12 +1149,11 @@
 4. Does the response correctly display the above three properties as a properly formatted JSON list?
 ---
 
-# Output format.
-
-Write your final output in JSON according to this schema:
+Finally, translate the description and questions of your final answer into JSON format according to this schema:
 
 ```json
 {{
+"description": "...",
 "questions": [
 "question 1 ...",
 "question 2 ...",
@@ -1163,13 +1162,12 @@
 }}
 ```
 
-IMPORTANT: Never respond to the prompt given. Only write rubrics.
+IMPORTANT: Do not respond to the <user_prompt>. Only generate the rubric questions for the prompt.
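The rubric-generation template above asks the autorater to emit a JSON object with a `description` and a `questions` list inside a json-fenced block. As a hedged sketch of how such output might be consumed downstream (this `parse_rubric_output` helper is hypothetical, not part of the Vertex AI SDK):

```python
import json
import re

FENCE = "`" * 3  # three backticks, assembled so this sketch stays readable

def parse_rubric_output(raw: str) -> dict:
    """Extract the JSON rubric object from an autorater response.

    The template requests the schema inside a json-fenced block;
    fall back to parsing the whole string if no fence is present.
    """
    match = re.search(FENCE + r"json\s*(.*?)" + FENCE, raw, re.DOTALL)
    payload = match.group(1) if match else raw
    return json.loads(payload)

# A made-up autorater reply following the template's schema.
raw = (
    FENCE + "json\n"
    '{"description": "A dog catching a frisbee in a park.",\n'
    ' "questions": ["Does the response mention the dog?",\n'
    '               "Does the response mention the frisbee?"]}\n'
    + FENCE
)
rubric = parse_rubric_output(raw)
```

In practice the SDK's own parsing may be stricter; the fallback here simply assumes a bare JSON reply when the model omits the fence.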

-Your task is to evaluate the image understanding quality of responses generated by an AI model. You will be presented with an image, a user prompt, each model's response to that user prompt, and a series of questions against which the text quality of Response A and Response B will be judged.
-For each response, provide an answer [[YES]] or [[NO]] to each question. Then, display the rubric score as the sum of the number of [[YES]] answers over the total number of questions.
+Your task is to evaluate the image understanding quality of responses generated by two AI models. At the bottom of this system instruction you will be presented with an image, a text description of that image, a user prompt, and the responses of Model A and Model B to that user prompt. You will also be provided a rubric as a numbered list against which Response A and Response B will be judged. Each rubric question is a list of instructions that each response must follow in order to satisfy the user prompt.
+
+# Rubric Scoring:
+
+For each response, rephrase every rubric point as a question and answer [[YES]] or [[NO]] to each point. Then, display the rubric grade as the sum of the correct rubric points over the total number of points. Finally, score the response on a scale of 1 to 5 stars based on how enjoyable you think it is for a human to read and understand, and state your reasoning.
 
 For example, if the rubric questions are:
 [[Rubric]]
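The scoring rule in the added template (grade = [[YES]] answers over total rubric points) reduces to a small counting step. A minimal sketch, assuming the per-point verdicts have already been extracted as literal `[[YES]]`/`[[NO]]` strings (the helper name is an illustration, not SDK code):

```python
def rubric_grade(verdicts: list[str]) -> str:
    """Format the rubric grade as "yes_count/total", mirroring the
    [[Rubric Score: 2/4]] convention used by the template."""
    yes = sum(1 for v in verdicts if v == "[[YES]]")
    return f"{yes}/{len(verdicts)}"

grade = rubric_grade(["[[YES]]", "[[NO]]", "[[YES]]", "[[NO]]"])  # "2/4"
```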
@@ -1257,43 +1258,47 @@
 </question>
 
 [[Rubric Score: 2/4]]
+[[Human Enjoyment Rating: 4 stars]]
+[[Human Rating Reason: This response is accurate and has no grammatical errors but feels too verbose and formal.]]
 
 Repeat the above for Response B.
 
-Explain whether you think Response A is better or Response B is better in a paragraph starting with "SxS Rationale 0:". Ground your explanation on the competing rubric scores. When you are finished, review your rationale in the context of the prompt, the responses, and the rubric scores and correct any mistakes you may have made, including your judgment on whether Response A was better or Response B was better. Every time you do this, increase the counter after "SxS Rationale" and output a new paragraph. Do not exceed five (5) iterations.
+# Recursive Self-Refinement:
 
-Finally, state your side-by-side (SxS) Rating on whether Response A was better or Response B was better based on your scores and rationale. Your rating should be one of {{A>B, B>A, A=B}}. Do not output anything else.
+Explain whether you think Response A is better or Response B is better in a paragraph starting with "SxS Rationale 0:". Ground your explanation on the competing rubric grades as well as your justification for the human enjoyment ratings. When you are finished, review your rationale in the context of the prompt, the responses, and the rubric grades and correct any mistakes you may have made, including your judgment on whether Response A was better or Response B was better. Every time you do this, increase the counter after "SxS Rationale" and output a new paragraph. Do not exceed five (5) iterations.
+
+# Final SxS Verdict:
+
+Finally, state your side-by-side (SxS) Rating on whether Response A was better or Response B was better based on your grades and rationale. Your rating should be one of {{A>B, B>A, A=B}}. Do not output anything else.
 
 Example:
-[[SxS Rationale 0: Response B scored higher on the rubric. It correctly identified the type of cuisine and was more acceptable to a human customer.]]
+[[SxS Rationale 0: Response B scored higher on the rubric. It correctly identified the type of cuisine and was more acceptable to a human customer.]]
 
-[[SxS Rationale 1: Response B scored higher on the rubric. It correctly identified the type of cuisine as Italian. The writing style was correct and professional enough and the correctness was more preferable.]]
+[[SxS Rationale 1: Response B scored higher on the rubric and human enjoyment ratings. It correctly identified the type of cuisine as Italian. The writing style was correct and professional enough and the correctness was more preferable.]]
 
-[[SxS Rationale 2: Response B scored higher on the rubric. It correctly identified the type of cuisine as Italian, where Response A mistook the cuisine to be Chinese. The writing style was correct and professional enough and the correctness was more preferable.]]
+[[SxS Rationale 2: Response B scored higher on the rubric and human enjoyment ratings. It correctly identified the type of cuisine as Italian, where Response A mistook the cuisine to be Chinese. The writing style was correct and professional enough and the correctness was more preferable.]]
 
-[[SxS Rating: B > A]]
+[[SxS Rating: B > A]]
 
-# User Inputs, AI-generated Responses, and Rubrics
-
-## User Inputs
-
-### Image
-
-<MM_IMAGE>
-
-{image}
-
-</MM_IMAGE>
+# User Inputs, Model Responses, and Rubrics:
 
-### Prompt
-
-{prompt}
+## Image
+
+<MM_IMAGE>{image}</MM_IMAGE>
 
-## AI-generated Response
-
-### Response A
-
-{baseline_model_response}
+## Description
+
+**{description}**
 
-### Response B
-
-{response}
+## User Prompt
+
+**{prompt}**
 
-## Rubrics
-
-{rubrics}
+## Response A
+
+**{baseline_model_response}**
 
-REMEMBER: Your answer will help improve the AI model. It is important to answer the question correctly. Even answering "no" will improve the model!
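The template constrains the final verdict to one of `{{A>B, B>A, A=B}}` inside a `[[SxS Rating: ...]]` marker, which makes it easy to recover programmatically. A hedged sketch of such a parser (hypothetical helper, not the SDK's actual result handling):

```python
import re
from typing import Optional

# Accept optional whitespace around the comparator, as in "[[SxS Rating: B > A]]".
SXS_PATTERN = re.compile(r"\[\[SxS Rating:\s*(A\s*>\s*B|B\s*>\s*A|A\s*=\s*B)\]\]")

def parse_sxs_rating(raw: str) -> Optional[str]:
    """Return the normalized SxS verdict ("A>B", "B>A", or "A=B")
    from autorater output, or None if no rating marker is found."""
    match = SXS_PATTERN.search(raw)
    return match.group(1).replace(" ", "") if match else None

verdict = parse_sxs_rating("[[SxS Rating: B > A]]")  # "B>A"
```

Anchoring on the bracketed marker rather than bare `A>B` tokens avoids false matches against the rationale paragraphs that precede the rating.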