Commit 4b1b513

Merge 6fd514d into 59532c9
2 parents 59532c9 + 6fd514d commit 4b1b513

2 files changed: 17 additions and 8 deletions

.github/workflows/integration-runner.yml

Lines changed: 7 additions & 7 deletions
@@ -85,14 +85,14 @@ jobs:

       - name: Configure config.toml for testing with DeepSeek
         env:
-          LLM_MODEL: "litellm_proxy/deepseek-chat"
-          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
-          LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
+          LLM_MODEL: "deepseek/deepseek-chat"
+          LLM_API_KEY: ${{ secrets.DEEPSEEK_API_KEY }}
+          #LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
         run: |
           echo "[llm.eval]" > config.toml
           echo "model = \"$LLM_MODEL\"" >> config.toml
           echo "api_key = \"$LLM_API_KEY\"" >> config.toml
-          echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
+          #echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
           echo "temperature = 0.0" >> config.toml

       - name: Run integration test evaluation for DeepSeek
@@ -141,8 +141,8 @@ jobs:
         id: create_comment
         uses: KeisukeYamashita/create-comment@v1
         with:
-          # if triggered by PR, use PR number, otherwise use 5077 as fallback issue number for manual triggers
-          number: ${{ github.event_name == 'pull_request' && github.event.pull_request.number || 5077 }}
+          # if triggered by PR, use PR number, otherwise use 5318 as fallback issue number for manual triggers
+          number: ${{ github.event_name == 'pull_request' && github.event.pull_request.number || 9 }}
           unique: false
           comment: |
             Trigger by: ${{ github.event_name == 'pull_request' && format('Pull Request (integration-test label on PR #{0})', github.event.pull_request.number) || (github.event_name == 'workflow_dispatch' && format('Manual Trigger: {0}', github.event.inputs.reason)) || 'Nightly Scheduled Run' }}
@@ -155,4 +155,4 @@ jobs:
             DeepSeek LLM Test Results:
             ${{ env.INTEGRATION_TEST_REPORT_DEEPSEEK }}
             ---
-            Download evaluation outputs (includes both Haiku and DeepSeek results): [Download](${{ steps.upload_results_artifact.outputs.artifact-url }})
+            Download testing outputs (includes both Haiku and DeepSeek results): [Download](${{ steps.upload_results_artifact.outputs.artifact-url }})
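
The DeepSeek configuration step in this file builds `config.toml` from a series of `echo` commands, and this commit comments out the `base_url` line, presumably so that the `deepseek/deepseek-chat` model is reached without a proxy endpoint. As a rough illustration only (this is not part of the commit), the same file content could be rendered in Python as below. The helper name `render_llm_config` and the placeholder API key are hypothetical.

```python
from typing import Optional


def render_llm_config(
    model: str,
    api_key: str,
    base_url: Optional[str] = None,
    temperature: float = 0.0,
) -> str:
    """Render the [llm.eval] section that the workflow step writes with echo."""
    lines = [
        '[llm.eval]',
        f'model = "{model}"',
        f'api_key = "{api_key}"',
    ]
    # Assumption for this sketch: base_url is emitted only when one is given,
    # mirroring the commented-out echo line in the workflow.
    if base_url:
        lines.append(f'base_url = "{base_url}"')
    lines.append(f'temperature = {temperature}')
    return '\n'.join(lines) + '\n'


if __name__ == '__main__':
    # PLACEHOLDER_KEY stands in for the value of secrets.DEEPSEEK_API_KEY.
    print(render_llm_config('deepseek/deepseek-chat', 'PLACEHOLDER_KEY'), end='')
```

Making `base_url` optional keeps one code path for both the proxied and the direct setup, instead of commenting lines in and out.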

evaluation/integration_tests/run_infer.py

Lines changed: 10 additions & 1 deletion
@@ -218,6 +218,8 @@ def load_integration_tests() -> pd.DataFrame:
     )

     df = pd.read_json(output_file, lines=True, orient='records')
+
+    # record success and reason for failure for the final report
     df['success'] = df['test_result'].apply(lambda x: x['success'])
     df['reason'] = df['test_result'].apply(lambda x: x['reason'])
     logger.info('-' * 100)
@@ -231,9 +233,16 @@ def load_integration_tests() -> pd.DataFrame:
     )
     logger.info('-' * 100)

+    # record cost for each instance, with 3 decimal places
+    df['cost'] = df['metrics'].apply(lambda x: round(x['accumulated_cost'], 3))
+    logger.info(f'Total cost: USD {df["cost"].sum():.2f}')
+
     report_file = os.path.join(metadata.eval_output_dir, 'report.md')
     with open(report_file, 'w') as f:
         f.write(
             f'Success rate: {df["success"].mean():.2%} ({df["success"].sum()}/{len(df)})\n'
         )
-        f.write(df[['instance_id', 'success', 'reason']].to_markdown(index=False))
+        f.write(f'\nTotal cost: USD {df["cost"].sum():.2f}\n')
+        f.write(
+            df[['instance_id', 'success', 'reason', 'cost']].to_markdown(index=False)
+        )
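
The change to `run_infer.py` adds a per-instance `cost` column (rounded to 3 decimal places) and a total cost line to `report.md`. The sketch below walks through the same aggregation on plain dictionaries so it runs without pandas. It builds the Markdown table by hand instead of calling `DataFrame.to_markdown`, so the exact table layout will differ from the real report. The function name `build_report` and the sample rows are invented for illustration.

```python
def build_report(rows):
    """Build a report similar to report.md from a non-empty list of result rows.

    Each row mirrors one record of the evaluation output: an instance_id,
    a test_result with success/reason, and metrics with accumulated_cost.
    """
    records = []
    for row in rows:
        records.append({
            'instance_id': row['instance_id'],
            'success': row['test_result']['success'],
            'reason': row['test_result']['reason'],
            # Per-instance cost is rounded to 3 decimal places, as in the diff.
            'cost': round(row['metrics']['accumulated_cost'], 3),
        })

    passed = sum(1 for r in records if r['success'])
    total_cost = sum(r['cost'] for r in records)

    columns = ['instance_id', 'success', 'reason', 'cost']
    out = [
        f'Success rate: {passed / len(records):.2%} ({passed}/{len(records)})',
        '',
        f'Total cost: USD {total_cost:.2f}',
        '| ' + ' | '.join(columns) + ' |',
        '| ' + ' | '.join('---' for _ in columns) + ' |',
    ]
    for r in records:
        out.append('| ' + ' | '.join(str(r[c]) for c in columns) + ' |')
    return '\n'.join(out) + '\n'


if __name__ == '__main__':
    # Made-up sample data, not taken from any real evaluation run.
    sample = [
        {'instance_id': 't1',
         'test_result': {'success': True, 'reason': 'ok'},
         'metrics': {'accumulated_cost': 0.0123}},
        {'instance_id': 't2',
         'test_result': {'success': False, 'reason': 'file not found'},
         'metrics': {'accumulated_cost': 0.0456}},
    ]
    print(build_report(sample), end='')
```

Note that, as in the diff, the total is the sum of the already rounded per-instance costs, so it can differ slightly from the rounded sum of the raw `accumulated_cost` values.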
