# LLM TypeScript Validation Benchmark

A comprehensive benchmark for testing LLM validation capabilities across different AI models using TypeScript type definitions. It evaluates how well large language models generate responses that conform to TypeScript type schemas.

## Quick Start

```bash
# Set up environment
mv .env.example .env
echo "OPENAI_API_KEY=sk-..." >> .env

# Install dependencies
pnpm install

# Build the project
pnpm run build

# Run benchmark with a specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start
```

## What This Benchmark Tests

This benchmark evaluates LLM capabilities in:

1. **Type Schema Validation**: Understanding and generating data that conforms to TypeScript type definitions
2. **Complex Object Structure**: Handling nested objects, unions, arrays, and hierarchical data
3. **Constraint Adherence**: Respecting validation rules like string formats, number ranges, and required fields
4. **Error Recovery**: Improving responses based on validation feedback across multiple attempts

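As a concrete illustration of the first point, a trial boils down to showing the model a type and checking its JSON reply against that type. The names below are illustrative only, not taken from the scenario code:

```typescript
// Hypothetical target type: the model is shown this definition
// and must answer with JSON that conforms to it.
interface Member {
  id: number;
  name: string;
  tags: string[];
}

// A reply the model might produce, parsed from its JSON output.
const reply: unknown = JSON.parse('{"id": 1, "name": "Ann", "tags": ["admin"]}');

// A minimal structural check of the kind the benchmark automates.
function isMember(value: unknown): value is Member {
  const v = value as Member;
  return (
    typeof v === "object" && v !== null &&
    typeof v.id === "number" &&
    typeof v.name === "string" &&
    Array.isArray(v.tags) && v.tags.every((t) => typeof t === "string")
  );
}

console.log(isMember(reply)); // true
```

In the real benchmark this check is generated from the type itself rather than written by hand.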
### Benchmark Scenarios

The benchmark includes multiple validation scenarios:

- **ObjectSimple**: Basic object validation with primitive types
- **ObjectConstraint**: Objects with validation constraints (age limits, email formats, etc.)
- **ObjectHierarchical**: Nested object structures with multiple levels
- **ObjectJsonSchema**: JSON Schema-based validation
- **ObjectUnionExplicit/Implicit**: Union type handling
- **ObjectFunctionSchema**: Function parameter validation
- **Shopping Cart Scenarios**: Real-world e-commerce object validation

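To give a flavor of what a constraint scenario checks, here is a hand-rolled sketch of the kind of rules involved. The type and the exact rules are hypothetical; the actual scenarios attach such constraints to TypeScript types and validate them with typia:

```typescript
// Hypothetical constrained type, for illustration only.
interface Customer {
  email: string; // must look like an email address
  age: number;   // must be an integer between 0 and 120
}

// Returns the list of violations; an empty list means the value passes.
function validateCustomer(c: Customer): string[] {
  const errors: string[] = [];
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(c.email)) {
    errors.push("email: invalid format");
  }
  if (!Number.isInteger(c.age) || c.age < 0 || c.age > 120) {
    errors.push("age: out of range");
  }
  return errors;
}

console.log(validateCustomer({ email: "ann@example.com", age: 30 })); // []
console.log(validateCustomer({ email: "not-an-email", age: 999 }));
// [ 'email: invalid format', 'age: out of range' ]
```

The violation messages are what gets fed back to the model on retry attempts.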
## How the Benchmark Works

1. **Trial Execution**: Each scenario runs 10 trials per model
2. **First Attempt**: Models receive the schema and prompt without validation feedback
3. **Retry Logic**: Failed attempts receive validation feedback, for up to 5 total attempts
4. **Success Metrics**:
   - **First Try Success Rate**: Percentage of immediate successes (no feedback needed)
   - **Overall Success Rate**: Percentage that eventually succeed (including after feedback)

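The attempt loop described above can be sketched as follows. This is a simplified sketch; `askModel` and the result shape are illustrative stand-ins for the engine's internals:

```typescript
// Result of one trial: whether it succeeded, whether the first
// attempt already passed, and how many attempts were used.
interface TrialResult {
  success: boolean;
  firstTry: boolean;
  attempts: number;
}

const MAX_ATTEMPTS = 5; // per the retry policy above

// `askModel` stands in for one round-trip to the LLM; it receives the
// validation errors of the previous attempt, if any.
async function runTrial(
  askModel: (feedback: string[] | null) => Promise<unknown>,
  validate: (value: unknown) => string[], // empty array means valid
): Promise<TrialResult> {
  let feedback: string[] | null = null;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const answer = await askModel(feedback);
    const errors = validate(answer);
    if (errors.length === 0) {
      return { success: true, firstTry: attempt === 1, attempts: attempt };
    }
    feedback = errors; // fed back to the model on the next attempt
  }
  return { success: false, firstTry: false, attempts: MAX_ATTEMPTS };
}
```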
## Project Structure

```
.
├── README.md           # This file
├── CLAUDE.md           # Claude Code instructions
├── package.json        # Dependencies and scripts
├── build.config.ts     # Build configuration
├── tsconfig.json       # TypeScript configuration
├── biome.json          # Linting and formatting
├── src/                # Source code
│   ├── index.ts        # Main entry point
│   ├── constants.ts    # Configuration constants
│   ├── openai.ts       # OpenAI/OpenRouter integration
│   ├── validate/       # Validation benchmark engine
│   └── utils/          # Utility functions
├── scenarios/          # Benchmark test scenarios
│   ├── ValidateObjectSimple.ts
│   ├── ValidateObjectConstraint.ts
│   ├── ValidateShoppingCart*.ts
│   └── internal/       # Shared scenario utilities
├── reports/            # Benchmark results
│   ├── validate/       # Validation benchmark reports
│   │   ├── README.md   # Results summary tables
│   │   ├── anthropic/  # Claude model results
│   │   ├── openai/     # GPT model results
│   │   ├── google/     # Gemini model results
│   │   ├── deepseek/   # DeepSeek model results
│   │   ├── meta-llama/ # Llama model results
│   │   └── mistralai/  # Mistral model results
│   └── analyze-results.ts # Analysis script
└── dist/               # Built output files
```

## Running the Benchmark

### Prerequisites

1. **API Keys**: You need an OpenRouter API key to access the different models
2. **Environment**: Node.js 18+ and pnpm
3. **Model Selection**: Choose from the available OpenRouter models

### Environment Setup

Create a `.env` file with your OpenRouter API key:

```bash
# Required: OpenRouter API key
OPENAI_API_KEY=sk-or-v1-...

# Required: model to benchmark (pick one of these examples)
OPENAI_MODEL="openai/gpt-4o"                     # GPT-4o
OPENAI_MODEL="anthropic/claude-3.5-sonnet"       # Claude 3.5 Sonnet
OPENAI_MODEL="google/gemini-pro-1.5"             # Gemini Pro
OPENAI_MODEL="meta-llama/llama-3.3-70b-instruct" # Llama 3.3

# Optional: schema model for validation (defaults to "chatgpt")
SCHEMA_MODEL="chatgpt"
```

### Running Benchmarks

```bash
# Run with a specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start

# Run with different models
OPENAI_MODEL="anthropic/claude-3.5-sonnet" pnpm start
OPENAI_MODEL="google/gemini-pro-1.5" pnpm start
```

### Development Commands

```bash
# Install dependencies
pnpm install

# Build the project
pnpm run build

# Lint and format code
pnpm run lint
pnpm run format

# Generate updated results table
pnpm run analyze-reports
```

## Understanding Results

### Report Structure

Results are organized by vendor and model:

```
reports/validate/
├── openai/gpt-4o-2024-11-20/
│   ├── ObjectSimple/
│   │   ├── README.md
│   │   └── trials/
│   │       ├── 1.success.json
│   │       ├── 2.success.json
│   │       └── ...
│   └── ObjectConstraint/...
└── anthropic/claude-3.5-sonnet/...
```

### Trial Results

Each trial produces a JSON file whose name indicates the outcome:

- `X.success.json`: Trial X succeeded
- `X.failure.json`: Trial X failed validation
- `X.error.json`: Trial X encountered an error
- `X.nothing.json`: Trial X produced no response

### Metrics

The benchmark tracks:

1. **First Try Success Rate**: Immediate success without feedback
2. **Overall Success Rate**: Success after retries and feedback
3. **Average Attempts**: Mean number of attempts needed
4. **Failed Tasks**: Number of tasks that never succeeded
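With illustrative data, the two headline rates work out as in this sketch (the real analysis script derives them from the report files):

```typescript
// Per-trial record: final outcome plus the number of attempts used.
interface Trial {
  outcome: "success" | "failure" | "error" | "nothing";
  attempts: number;
}

// Succeeded on the very first attempt, as a percentage of all trials.
function firstTrySuccessRate(trials: Trial[]): number {
  const hits = trials.filter((t) => t.outcome === "success" && t.attempts === 1).length;
  return (100 * hits) / trials.length;
}

// Succeeded eventually, including after feedback-driven retries.
function overallSuccessRate(trials: Trial[]): number {
  const hits = trials.filter((t) => t.outcome === "success").length;
  return (100 * hits) / trials.length;
}

const trials: Trial[] = [
  { outcome: "success", attempts: 1 },
  { outcome: "success", attempts: 3 },
  { outcome: "failure", attempts: 5 },
  { outcome: "success", attempts: 1 },
];
console.log(firstTrySuccessRate(trials)); // 50
console.log(overallSuccessRate(trials));  // 75
```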
## Architecture

### Core Components

1. **ValidateBenchmark Engine** (`src/validate/ValidateBenchmark.ts`):
   - Orchestrates benchmark execution
   - Handles concurrency and retries
   - Generates detailed reports

2. **Scenario System** (`scenarios/`):
   - Type-driven validation scenarios
   - Uses `typia` for runtime type validation
   - Covers various complexity levels

3. **OpenRouter Integration** (`src/openai.ts`):
   - Unified API access to multiple LLM providers
   - Consistent request/response handling

### Key Technologies

- **TypeScript**: Type definitions and validation
- **Typia**: Runtime type validation and schema generation
- **OpenRouter**: Multi-provider LLM API access
- **Unbuild**: Modern build system with rollup

## Contributing

1. **Adding Scenarios**: Create new validation scenarios in `scenarios/`
2. **Model Support**: Add new models by updating the model constants
3. **Analysis**: Enhance the analysis script for better insights

## Results

See [`reports/validate/README.md`](reports/validate/README.md) for the latest benchmark results across all tested models.

## License

[MIT License](LICENSE)