Commit eb3d036

Merge pull request #19 from wrtnlabs/update-reports
docs: add CLAUDE.md

2 parents: b37841a + bfa062f

File tree: 6 files changed, +683 −57 lines
CLAUDE.md

Lines changed: 67 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Essential Commands

### Build and Run

```bash
pnpm install     # Install dependencies
pnpm run build   # Build the project using unbuild
pnpm start       # Run the benchmark (requires OPENAI_API_KEY in .env)
```

### Linting and Formatting

```bash
pnpm run lint    # Run Biome linter
pnpm run format  # Format code with Biome
```

### Environment Setup

```bash
# Create .env file with OpenRouter API key
echo "OPENAI_API_KEY=sk-..." >> .env

# Set model environment variable (required)
export OPENAI_MODEL="openai/gpt-4o"  # or another model available on OpenRouter

# Optional: set schema model (defaults to "chatgpt")
export SCHEMA_MODEL="chatgpt"
```

## Architecture Overview

This is a benchmark system for testing LLM validation capabilities across different AI models. It checks whether LLMs can generate responses that conform to TypeScript type definitions.

### Core Architecture

1. **ValidateBenchmark Engine** (`src/validate/ValidateBenchmark.ts`):
   - Central orchestrator that executes validation experiments
   - Supports concurrent execution with configurable parallelism
   - Implements retry logic for validation failures
   - Generates detailed reports of benchmark results

2. **Scenario System** (`scenarios/`):
   - Each scenario tests a specific type-validation capability
   - Scenarios range from simple object validation to complex hierarchical structures
   - All scenarios export a validation function that the benchmark engine uses
   - Scenarios leverage the `typia` library for runtime type information

3. **Report Generation** (`reports/`):
   - Results are written to `reports/validate/{vendor}/{model}/`
   - Each experiment produces JSON files with trial results
   - Reports include success/failure status and detailed validation feedback

### Key Design Patterns

1. **Type-Driven Validation**: Uses TypeScript's type system with `typia` to derive runtime validators from type definitions (see the first sketch below)
2. **Multi-Model Support**: Abstracts vendor-specific implementations through the `IBenchmarkVendor` interface
3. **OpenRouter Integration**: All LLM calls go through the OpenRouter API for unified model access
4. **Concurrent Execution**: Uses the semaphore pattern for controlled parallel execution (see the second sketch below)
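
A minimal sketch of the type-driven validation pattern (pattern 1), assuming a hypothetical `IMember` type; the real scenario types live under `scenarios/` and may differ:

```typescript
import typia, { tags } from "typia";

// Hypothetical scenario type; real types live under `scenarios/`.
interface IMember {
  email: string & tags.Format<"email">;
  age: number & tags.Minimum<19> & tags.Maximum<100>;
}

// The typia transformer turns this call into a generated validator at build time.
const llmResponseText = '{"email":"not-an-email","age":5}'; // stand-in for an LLM reply
const result: typia.IValidation<IMember> = typia.validate<IMember>(
  JSON.parse(llmResponseText),
);

if (!result.success) {
  // Each error names the failing path and the expected type; this is the
  // kind of feedback the benchmark can send back to the model on retry.
  console.log(result.errors);
}
```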
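And a compact sketch of the semaphore pattern (pattern 4); illustrative only, not the engine's actual implementation:

```typescript
// A counting semaphore that caps how many tasks run at once.
class Semaphore {
  private queue: Array<() => void> = [];
  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the permit directly to a waiting task
    else this.permits++;
  }
}

// Usage: cap concurrent LLM calls at 4.
const gate = new Semaphore(4);
async function limited<T>(task: () => Promise<T>): Promise<T> {
  await gate.acquire();
  try {
    return await task();
  } finally {
    gate.release();
  }
}
```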
### Build System

- Uses `unbuild` with Rollup for building (a hedged config sketch follows this list)
- The `@ryoppippi/unplugin-typia` plugin transforms TypeScript types into runtime validators at build time
- Scenarios are built as separate entry points alongside the main index
- Source maps are enabled for debugging
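
A hedged sketch of what a `build.config.ts` along these lines could look like; the plugin import path and entry list are assumptions, and the repository's actual config may differ:

```typescript
import { defineBuildConfig } from "unbuild";
// Assumed import path; unplugin packages usually expose per-bundler entries.
import UnpluginTypia from "@ryoppippi/unplugin-typia/rollup";

export default defineBuildConfig({
  // Main index plus each scenario as its own entry point (illustrative list).
  entries: ["src/index", "scenarios/ValidateObjectSimple"],
  sourcemap: true, // source maps enabled for debugging
  hooks: {
    // Inject the typia transformer into unbuild's Rollup pipeline.
    "rollup:options"(_ctx, options) {
      options.plugins = [UnpluginTypia(), options.plugins];
    },
  },
});
```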

README.md

Lines changed: 196 additions & 11 deletions
# LLM TypeScript Validation Benchmark

A benchmark system for testing LLM validation capabilities across AI models from different vendors, using TypeScript type definitions. It evaluates how well Large Language Models can generate responses that conform to TypeScript type schemas.

## Quick Start

```bash
# Set up environment
mv .env.example .env
echo "OPENAI_API_KEY=sk-..." >> .env

# Install dependencies
pnpm install

# Build the project
pnpm run build

# Run benchmark with a specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start
```

## What This Benchmark Tests

This benchmark evaluates LLM capabilities in:

1. **Type Schema Validation**: Understanding and generating data that conforms to TypeScript type definitions
2. **Complex Object Structures**: Handling nested objects, unions, arrays, and hierarchical data
3. **Constraint Adherence**: Respecting validation rules such as string formats, number ranges, and required fields
4. **Error Recovery**: Improving responses based on validation feedback across multiple attempts

### Benchmark Scenarios

The benchmark includes multiple validation scenarios:

- **ObjectSimple**: Basic object validation with primitive types
- **ObjectConstraint**: Objects with validation constraints (age limits, email formats, etc.)
- **ObjectHierarchical**: Nested object structures with multiple levels
- **ObjectJsonSchema**: JSON Schema-based validation
- **ObjectUnionExplicit/Implicit**: Union type handling
- **ObjectFunctionSchema**: Function parameter validation
- **Shopping Cart Scenarios**: Real-world e-commerce object validation

## How the Benchmark Works

1. **Trial Execution**: Each scenario runs 10 trials per model
2. **First Attempt**: Models receive the schema and prompt without validation feedback
3. **Retry Logic**: Failed attempts receive validation feedback, for up to 5 total attempts (see the sketch after this list)
4. **Success Metrics**:
   - **First Try Success Rate**: Percentage of immediate successes (no feedback needed)
   - **Overall Success Rate**: Percentage that eventually succeed (including after feedback)
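A hedged sketch of the trial loop described above; all names (`IScenario`, `askModel`) are illustrative, not the actual `ValidateBenchmark` API:

```typescript
// Illustrative shapes, not the repository's real interfaces.
interface IValidationResult {
  success: boolean;
  errors?: unknown[];
}
interface IScenario {
  prompt: string;
  validate: (value: unknown) => IValidationResult;
}
declare function askModel(prompt: string, feedback?: string): Promise<string>;

async function runTrial(scenario: IScenario, maxAttempts = 5) {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = await askModel(scenario.prompt, feedback); // LLM call via OpenRouter
    const result = scenario.validate(JSON.parse(text));
    if (result.success) {
      return { succeeded: true, attempts: attempt, firstTry: attempt === 1 };
    }
    feedback = JSON.stringify(result.errors); // validation feedback for the retry
  }
  return { succeeded: false, attempts: maxAttempts, firstTry: false };
}
```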
## Project Structure

```
.
├── README.md           # This file
├── CLAUDE.md           # Claude Code instructions
├── package.json        # Dependencies and scripts
├── build.config.ts     # Build configuration
├── tsconfig.json       # TypeScript configuration
├── biome.json          # Linting and formatting
├── src/                # Source code
│   ├── index.ts        # Main entry point
│   ├── constants.ts    # Configuration constants
│   ├── openai.ts       # OpenAI/OpenRouter integration
│   ├── validate/       # Validation benchmark engine
│   └── utils/          # Utility functions
├── scenarios/          # Benchmark test scenarios
│   ├── ValidateObjectSimple.ts
│   ├── ValidateObjectConstraint.ts
│   ├── ValidateShoppingCart*.ts
│   └── internal/       # Shared scenario utilities
├── reports/            # Benchmark results
│   ├── validate/       # Validation benchmark reports
│   │   ├── README.md   # Results summary tables
│   │   ├── anthropic/  # Claude model results
│   │   ├── openai/     # GPT model results
│   │   ├── google/     # Gemini model results
│   │   ├── deepseek/   # DeepSeek model results
│   │   ├── meta-llama/ # Llama model results
│   │   └── mistralai/  # Mistral model results
│   └── analyze-results.ts  # Analysis script
└── dist/               # Built output files
```
## Running the Benchmark

### Prerequisites

1. **API Key**: You need an OpenRouter API key to access the different models
2. **Environment**: Node.js 18+ and pnpm
3. **Model Selection**: Choose from the models available on OpenRouter

### Environment Setup

Create a `.env` file with your OpenRouter API key:

```bash
# Required: OpenRouter API key
OPENAI_API_KEY=sk-or-v1-...

# Required: model to benchmark (examples)
OPENAI_MODEL="openai/gpt-4o"                      # GPT-4o
OPENAI_MODEL="anthropic/claude-3.5-sonnet"        # Claude 3.5 Sonnet
OPENAI_MODEL="google/gemini-pro-1.5"              # Gemini Pro
OPENAI_MODEL="meta-llama/llama-3.3-70b-instruct"  # Llama 3.3

# Optional: schema model for validation (defaults to "chatgpt")
SCHEMA_MODEL="chatgpt"
```

### Running Benchmarks

```bash
# Run with a specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start

# Run with different models
OPENAI_MODEL="anthropic/claude-3.5-sonnet" pnpm start
OPENAI_MODEL="google/gemini-pro-1.5" pnpm start
```

### Development Commands

```bash
# Install dependencies
pnpm install

# Build the project
pnpm run build

# Lint and format code
pnpm run lint
pnpm run format

# Generate the updated results table
pnpm run analyze-reports
```

## Understanding Results

### Report Structure

Results are organized by vendor and model:

```
reports/validate/
├── openai/gpt-4o-2024-11-20/
│   ├── ObjectSimple/
│   │   ├── README.md
│   │   └── trials/
│   │       ├── 1.success.json
│   │       ├── 2.success.json
│   │       └── ...
│   └── ObjectConstraint/...
└── anthropic/claude-3.5-sonnet/...
```

### Trial Results

Each trial produces a JSON file whose name indicates the outcome:

- `X.success.json`: Trial X succeeded
- `X.failure.json`: Trial X failed validation
- `X.error.json`: Trial X encountered an error
- `X.nothing.json`: Trial X produced no response

### Metrics

The benchmark tracks the following (a sketch of the computation follows the list):

1. **First Try Success Rate**: Immediate success without feedback
2. **Overall Success Rate**: Success after retries and feedback
3. **Average Attempts**: Mean number of attempts needed
4. **Failed Tasks**: Number of tasks that never succeeded
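A small sketch, under assumed trial shapes, of how these four numbers could be derived:

```typescript
interface ITrial {
  succeeded: boolean; // did any attempt pass validation?
  attempts: number;   // attempts used (1..5)
}

function summarize(trials: ITrial[]) {
  const firstTry = trials.filter((t) => t.succeeded && t.attempts === 1).length;
  const overall = trials.filter((t) => t.succeeded).length;
  return {
    firstTrySuccessRate: firstTry / trials.length, // success with no feedback
    overallSuccessRate: overall / trials.length,   // success within all attempts
    averageAttempts:
      trials.reduce((sum, t) => sum + t.attempts, 0) / trials.length,
    failedTasks: trials.length - overall,
  };
}
```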
## Architecture

### Core Components

1. **ValidateBenchmark Engine** (`src/validate/ValidateBenchmark.ts`):
   - Orchestrates benchmark execution
   - Handles concurrency and retries
   - Generates detailed reports

2. **Scenario System** (`scenarios/`):
   - Type-driven validation scenarios
   - Uses `typia` for runtime type validation
   - Covers various complexity levels

3. **OpenRouter Integration** (`src/openai.ts`):
   - Unified API access to multiple LLM providers (see the sketch below)
   - Consistent request/response handling
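A minimal sketch of the OpenRouter pattern: the standard OpenAI SDK pointed at OpenRouter's OpenAI-compatible endpoint. The actual `src/openai.ts` may be organized differently:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1", // OpenRouter's OpenAI-compatible API
  apiKey: process.env.OPENAI_API_KEY,      // OpenRouter key from .env
});

const completion = await client.chat.completions.create({
  model: process.env.OPENAI_MODEL!, // e.g. "openai/gpt-4o"
  messages: [{ role: "user", content: "Fill in this schema..." }],
});

console.log(completion.choices[0].message.content);
```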
### Key Technologies

- **TypeScript**: Type definitions and validation
- **Typia**: Runtime type validation and schema generation
- **OpenRouter**: Multi-provider LLM API access
- **Unbuild**: Modern build system built on Rollup
## Contributing

1. **Adding Scenarios**: Create new validation scenarios in `scenarios/` (a skeleton is sketched below)
2. **Model Support**: Add new models by updating the model constants
3. **Analysis**: Enhance the analysis script for better insights
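A hypothetical skeleton for a new scenario file; the real scenario contract lives under `scenarios/internal/`, so treat the names here as assumptions:

```typescript
// scenarios/ValidateObjectExample.ts (hypothetical)
import typia, { tags } from "typia";

interface IExample {
  id: string & tags.Format<"uuid">;
  quantity: number & tags.Type<"uint32">;
}

// Each scenario exports a validation function for the benchmark engine.
export const validate = (input: unknown): typia.IValidation<IExample> =>
  typia.validate<IExample>(input);
```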
## Results

See [`reports/validate/README.md`](reports/validate/README.md) for the latest benchmark results across all tested models.

## License

[MIT License](LICENSE)

package.json

Lines changed: 1 addition & 0 deletions

@@ -9,6 +9,7 @@
     "start": "node --env-file=.env ./dist/index.mjs",
     "lint": "biome check .",
     "format": "biome check --write .",
+    "analyze-reports": "cd reports && node --experimental-strip-types analyze-results.ts",
     "prepare": "husky"
   },
   "peerDependencies": {
