
Commit bfa062f

docs: update root readme
1 parent cbe8c97 commit bfa062f

1 file changed: README.md (196 additions, 11 deletions)

# LLM TypeScript Validation Benchmark

A benchmark for testing LLM validation capabilities across different AI models using TypeScript type definitions. It evaluates how reliably large language models generate correct responses that conform to TypeScript type schemas.

## Quick Start

```bash
# Set up environment
mv .env.example .env
echo "OPENAI_API_KEY=sk-..." >> .env

# Install dependencies
pnpm install

# Build the project
pnpm run build

# Run benchmark with specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start
```

## What This Benchmark Tests

This benchmark evaluates LLM capabilities in:

1. **Type Schema Validation**: Understanding and generating data that conforms to TypeScript type definitions
2. **Complex Object Structures**: Handling nested objects, unions, arrays, and hierarchical data
3. **Constraint Adherence**: Respecting validation rules such as string formats, number ranges, and required fields (see the sketch after this list)
4. **Error Recovery**: Improving responses based on validation feedback across multiple attempts
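
As a concrete but hypothetical illustration of such constraints, here is a minimal typia-style type. The interface and field names are invented; `@format`, `@minimum`, and `@maximum` are typia's standard comment tags, and `typia.validate` is its real API (it requires typia's compile-time transformer, which this project's build is assumed to configure):

```typescript
import typia from "typia";

// Hypothetical scenario-style type; the names are illustrative only.
interface IMember {
  /** @format email */
  email: string;

  /**
   * @minimum 18
   * @maximum 120
   */
  age: number;
}

// The validator is generated at compile time from the type above.
const result = typia.validate<IMember>({ email: "not-an-email", age: 5 });
console.log(result.success); // false: both fields violate their constraints
```
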

### Benchmark Scenarios

The benchmark includes multiple validation scenarios:

- **ObjectSimple**: Basic object validation with primitive types
- **ObjectConstraint**: Objects with validation constraints (age limits, email formats, etc.)
- **ObjectHierarchical**: Nested object structures with multiple levels
- **ObjectJsonSchema**: JSON Schema-based validation
- **ObjectUnionExplicit/Implicit**: Union type handling (sketched below)
- **ObjectFunctionSchema**: Function parameter validation
- **Shopping Cart Scenarios**: Real-world e-commerce object validation
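
To illustrate the explicit/implicit distinction, a sketch with invented type names (the repo's actual scenario types may differ):

```typescript
// Explicit union: a discriminant property tells the model (and the validator)
// exactly which variant is intended.
type ShapeExplicit =
  | { type: "circle"; radius: number }
  | { type: "rectangle"; width: number; height: number };

// Implicit union: variants are distinguished only by their structure,
// which is harder for a model to satisfy unambiguously.
type ShapeImplicit =
  | { radius: number }
  | { width: number; height: number };
```
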
## How the Benchmark Works

1. **Trial Execution**: Each scenario runs 10 trials per model
2. **First Attempt**: Models receive the schema and prompt without validation feedback
3. **Retry Logic**: Failed attempts receive validation feedback, for up to 5 total attempts (see the sketch after this list)
4. **Success Metrics**:
   - **First Try Success Rate**: Percentage of immediate successes (no feedback needed)
   - **Overall Success Rate**: Percentage that eventually succeed (including after feedback)
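
A minimal sketch of that loop, under stated assumptions: `askModel` and `describeErrors` are invented stand-ins, while `IValidation` is typia's real result type; the actual engine lives in `src/validate/ValidateBenchmark.ts`:

```typescript
import { IValidation } from "typia";

const MAX_ATTEMPTS = 5; // the first attempt plus up to four feedback retries

// Hypothetical helper, standing in for the repo's actual LLM call.
async function askModel(prompt: string): Promise<unknown> {
  return {}; // call the model here
}

// Render typia's validation errors as feedback text for the model.
function describeErrors(errors: IValidation.IError[]): string {
  return errors
    .map((e) => `${e.path}: expected ${e.expected}, got ${JSON.stringify(e.value)}`)
    .join("\n");
}

async function runTrial<T>(
  prompt: string,
  validate: (input: unknown) => IValidation<T>,
): Promise<{ attempts: number; data: T } | null> {
  let currentPrompt = prompt;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const result = validate(await askModel(currentPrompt));
    if (result.success) return { attempts: attempt, data: result.data };
    // Feed the validation errors back so the model can repair its output.
    currentPrompt = `${prompt}\n\nYour previous answer was invalid:\n${describeErrors(result.errors)}`;
  }
  return null; // never succeeded within the attempt budget
}
```
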
## Project Structure

```
.
├── README.md              # This file
├── CLAUDE.md              # Claude Code instructions
├── package.json           # Dependencies and scripts
├── build.config.ts        # Build configuration
├── tsconfig.json          # TypeScript configuration
├── biome.json             # Linting and formatting
├── src/                   # Source code
│   ├── index.ts           # Main entry point
│   ├── constants.ts       # Configuration constants
│   ├── openai.ts          # OpenAI/OpenRouter integration
│   ├── validate/          # Validation benchmark engine
│   └── utils/             # Utility functions
├── scenarios/             # Benchmark test scenarios
│   ├── ValidateObjectSimple.ts
│   ├── ValidateObjectConstraint.ts
│   ├── ValidateShoppingCart*.ts
│   └── internal/          # Shared scenario utilities
├── reports/               # Benchmark results
│   ├── validate/          # Validation benchmark reports
│   │   ├── README.md      # Results summary tables
│   │   ├── anthropic/     # Claude model results
│   │   ├── openai/        # GPT model results
│   │   ├── google/        # Gemini model results
│   │   ├── deepseek/      # DeepSeek model results
│   │   ├── meta-llama/    # Llama model results
│   │   └── mistralai/     # Mistral model results
│   └── analyze-results.ts # Analysis script
└── dist/                  # Built output files
```
## Running the Benchmark

### Prerequisites

1. **API Keys**: You need an OpenRouter API key to access the different models
2. **Environment**: Node.js 18+ and pnpm
3. **Model Selection**: Choose from the available OpenRouter models

### Environment Setup

Create a `.env` file with your OpenRouter API key:

```bash
# Required: OpenRouter API key
OPENAI_API_KEY=sk-or-v1-...

# Required: model to benchmark (examples; set exactly one)
OPENAI_MODEL="openai/gpt-4o"                     # GPT-4o
OPENAI_MODEL="anthropic/claude-3.5-sonnet"       # Claude 3.5 Sonnet
OPENAI_MODEL="google/gemini-pro-1.5"             # Gemini Pro 1.5
OPENAI_MODEL="meta-llama/llama-3.3-70b-instruct" # Llama 3.3

# Optional: schema model for validation (defaults to "chatgpt")
SCHEMA_MODEL="chatgpt"
```
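
For context, a client honoring these variables might be wired up as below. The OpenAI SDK usage and the OpenRouter base URL are standard; how `src/openai.ts` actually does this is an assumption:

```typescript
import OpenAI from "openai";

// OpenRouter implements the OpenAI wire protocol, so the official SDK works
// once baseURL points at OpenRouter and the key is an OpenRouter key.
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENAI_API_KEY,
});

const completion = await client.chat.completions.create({
  model: process.env.OPENAI_MODEL ?? "openai/gpt-4o",
  messages: [{ role: "user", content: "Return a JSON object matching the schema." }],
});
console.log(completion.choices[0].message.content);
```
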
### Running Benchmarks

```bash
# Run with specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start

# Run with different models
OPENAI_MODEL="anthropic/claude-3.5-sonnet" pnpm start
OPENAI_MODEL="google/gemini-pro-1.5" pnpm start
```

### Development Commands

```bash
# Install dependencies
pnpm install

# Build the project
pnpm run build

# Lint and format code
pnpm run lint
pnpm run format

# Generate updated results table
pnpm run analyze-reports
```

## Understanding Results

### Report Structure

Results are organized by vendor and model:

```
reports/validate/
├── openai/gpt-4o-2024-11-20/
│   ├── ObjectSimple/
│   │   ├── README.md
│   │   └── trials/
│   │       ├── 1.success.json
│   │       ├── 2.success.json
│   │       └── ...
│   └── ObjectConstraint/...
└── anthropic/claude-3.5-sonnet/...
```
### Trial Results

Each trial produces a JSON file whose name indicates the outcome (modeled in the sketch below):

- `X.success.json`: Trial X succeeded
- `X.failure.json`: Trial X failed validation
- `X.error.json`: Trial X encountered an error
- `X.nothing.json`: Trial X produced no response
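
As an assumption (the report schema isn't documented here), the four outcomes can be modeled as a discriminated union when post-processing reports; only the filename convention above is taken from the source:

```typescript
// Hypothetical model of the four trial outcomes, keyed by filename suffix.
type TrialOutcome =
  | { kind: "success"; data: unknown }
  | { kind: "failure"; errors: unknown[] } // failed validation
  | { kind: "error"; message: string }     // runtime or API error
  | { kind: "nothing" };                   // no response produced

// Parse "<trial>.<outcome>.json" filenames like "1.success.json".
function outcomeFromFilename(name: string): TrialOutcome["kind"] | null {
  const m = /^\d+\.(success|failure|error|nothing)\.json$/.exec(name);
  return m ? (m[1] as TrialOutcome["kind"]) : null;
}
```
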

### Metrics

The benchmark tracks four metrics (a worked example follows the list):

1. **First Try Success Rate**: Immediate success without feedback
2. **Overall Success Rate**: Success after retries and feedback
3. **Average Attempts**: Mean number of attempts needed
4. **Failed Tasks**: Number of tasks that never succeeded
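
A worked example of the definitions, using made-up trial summaries (attempts needed to succeed, or `null` when a trial never produced a valid response):

```typescript
const trials: (number | null)[] = [1, 1, 1, 2, 3, 1, null, 1, 5, 1];

const firstTry = trials.filter((a) => a === 1).length / trials.length;       // 0.6
const overall = trials.filter((a) => a !== null).length / trials.length;     // 0.9
const succeeded = trials.filter((a): a is number => a !== null);
const avgAttempts = succeeded.reduce((s, a) => s + a, 0) / succeeded.length; // ~1.78
const failedTasks = trials.length - succeeded.length;                        // 1

console.log({ firstTry, overall, avgAttempts, failedTasks });
```
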
## Architecture

### Core Components

1. **ValidateBenchmark Engine** (`src/validate/ValidateBenchmark.ts`):
   - Orchestrates benchmark execution
   - Handles concurrency and retries
   - Generates detailed reports

2. **Scenario System** (`scenarios/`):
   - Type-driven validation scenarios
   - Uses `typia` for runtime type validation (error shape sketched below)
   - Covers various complexity levels

3. **OpenRouter Integration** (`src/openai.ts`):
   - Unified API access to multiple LLM providers
   - Consistent request/response handling
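
For reference, a sketch of the validation errors typia reports, which is the kind of feedback a failed attempt can be given (the `IProduct` type is invented; `path`, `expected`, and `value` are the real fields of typia's error objects):

```typescript
import typia from "typia";

interface IProduct {
  name: string;
  /** @minimum 0 */
  price: number;
}

// Invalid on purpose: name has the wrong type, price violates @minimum.
const res = typia.validate<IProduct>({ name: 42, price: -1 });
if (!res.success) {
  // Each error carries the path, the expected type, and the offending value,
  // e.g. { path: "$input.name", expected: "string", value: 42 }.
  for (const e of res.errors) console.log(e.path, e.expected, e.value);
}
```
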
### Key Technologies

- **TypeScript**: Type definitions and validation
- **Typia**: Runtime type validation and schema generation
- **OpenRouter**: Multi-provider LLM API access
- **Unbuild**: Modern build system with rollup

## Contributing

1. **Adding Scenarios**: Create new validation scenarios in `scenarios/`
2. **Model Support**: Add new models by updating model constants
3. **Analysis**: Enhance the analysis script for better insights

## Results
See [`reports/validate/README.md`](reports/validate/README.md) for the latest benchmark results across all tested models.
## License
[MIT License](LICENSE)
