Commit 0f1ee48

docs: add claude and sonnet
1 parent 12cfb16 commit 0f1ee48

1 file changed, +6 -0 lines changed

reports/README.md

Lines changed: 6 additions & 0 deletions
@@ -15,6 +15,8 @@ This table shows the percentage of immediate ("1️⃣") successes per task.
 |:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
 | claude-3.5-sonnet (anthropic/claude-3.5-sonnet) | 100.0% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | N/A | N/A | N/A |
 | claude-3.7-sonnet (anthropic/claude-3.7-sonnet) | 100.0% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | N/A | N/A | N/A |
+| claude-opus-4 (anthropic/claude-opus-4) | 100.0% | 40.0% | 100.0% | 90.0% | 100.0% | 100.0% | 100.0% | 60.0% | 70.0% | 50.0% | 60.0% | 100.0% | N/A | N/A | N/A |
+| claude-sonnet-4 (anthropic/claude-sonnet-4) | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | N/A | N/A | N/A |
 | deepseek-chat-v3-0324 (deepseek/deepseek-chat-v3-0324) | 100.0% | 30.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | N/A | N/A | N/A |
 | deepseek-r1 (deepseek/deepseek-r1) | 70.0% | 0.0% | 20.0% | 40.0% | 30.0% | 40.0% | 60.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | N/A | N/A | N/A |
 | gemini-2.0-flash-001 (google/gemini-2.0-flash-001) | N/A | N/A | 0.0% | 0.0% | 100.0% | 0.0% | 0.0% | N/A | N/A | N/A | N/A | N/A | 100.0% | 0.0% | 100.0% |
@@ -41,6 +43,8 @@ For each task the cell shows the percentage of trials (out of 10) that ultimatel
 |:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
 | claude-3.5-sonnet (anthropic/claude-3.5-sonnet) | 100.0% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | N/A | N/A | N/A |
 | claude-3.7-sonnet (anthropic/claude-3.7-sonnet) | 100.0% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | N/A | N/A | N/A |
+| claude-opus-4 (anthropic/claude-opus-4) | 100.0% | 40.0% | 100.0% | 90.0% | 100.0% | 100.0% | 100.0% | 60.0% | 70.0% | 50.0% | 60.0% | 100.0% | N/A | N/A | N/A |
+| claude-sonnet-4 (anthropic/claude-sonnet-4) | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | N/A | N/A | N/A |
 | deepseek-chat-v3-0324 (deepseek/deepseek-chat-v3-0324) | 100.0% | 30.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | N/A | N/A | N/A |
 | deepseek-r1 (deepseek/deepseek-r1) | 70.0% | 0.0% | 20.0% | 40.0% | 30.0% | 40.0% | 60.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | N/A | N/A | N/A |
 | gemini-2.0-flash-001 (google/gemini-2.0-flash-001) | N/A | N/A | 0.0% | 0.0% | 100.0% | 0.0% | 0.0% | N/A | N/A | N/A | N/A | N/A | 100.0% | 0.0% | 100.0% |
@@ -65,6 +69,8 @@ For each task the cell shows the percentage of trials (out of 10) that ultimatel
 |-----------|:---------------------:|:------------------:|:---------------:|:-----------------:|:----------------:|:--------------:|:--------------:|:--------------:|
 | claude-3.5-sonnet (anthropic/claude-3.5-sonnet) | 110 | 110 | 0 | 0.00% | 10 | 1.458 | 6 | 1 |
 | claude-3.7-sonnet (anthropic/claude-3.7-sonnet) | 110 | 110 | 0 | 0.00% | 10 | 1.425 | 6 | 1 |
+| claude-opus-4 (anthropic/claude-opus-4) | 97 | 97 | 0 | 0.00% | 23 | 1.050 | 2 | 1 |
+| claude-sonnet-4 (anthropic/claude-sonnet-4) | 120 | 120 | 0 | 0.00% | 0 | 1.000 | 1 | 1 |
 | deepseek-chat-v3-0324 (deepseek/deepseek-chat-v3-0324) | 113 | 113 | 0 | 0.00% | 7 | 1.175 | 5 | 1 |
 | deepseek-r1 (deepseek/deepseek-r1) | 26 | 26 | 0 | 0.00% | 94 | 1.000 | 1 | 1 |
 | gemini-2.0-flash-001 (google/gemini-2.0-flash-001) | 30 | 30 | 0 | 0.00% | 50 | 3.500 | 6 | 1 |
