LLM TypeScript Validation Benchmark

A comprehensive benchmark system for testing structured-output capabilities across different AI models using TypeScript type definitions. It evaluates how well Large Language Models can generate correct responses that conform to TypeScript type schemas.

Quick Start

# Set up environment
cp .env.example .env
echo "OPENAI_API_KEY=sk-or-v1-..." >> .env

# Install dependencies
pnpm install

# Build the project
pnpm run build

# Run benchmark with specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start

What This Benchmark Tests

This benchmark evaluates LLM capabilities in:

  1. Type Schema Validation: Understanding and generating data that conforms to TypeScript type definitions
  2. Complex Object Structure: Handling nested objects, unions, arrays, and hierarchical data
  3. Constraint Adherence: Respecting validation rules like string formats, number ranges, and required fields
  4. Error Recovery: Improving responses based on validation feedback across multiple attempts

Benchmark Scenarios

The benchmark includes multiple validation scenarios:

  • ObjectSimple: Basic object validation with primitive types
  • ObjectConstraint: Objects with validation constraints (age limits, email formats, etc.); see the sketch after this list
  • ObjectHierarchical: Nested object structures with multiple levels
  • ObjectJsonSchema: JSON Schema-based validation
  • ObjectUnionExplicit/Implicit: Union type handling
  • ObjectFunctionSchema: Function parameter validation
  • Shopping Cart Scenarios: Real-world e-commerce object validation
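
The actual scenario types live in scenarios/ and are not reproduced here. As a rough illustration of the ObjectConstraint style, the sketch below shows a typia-tagged type and the runtime check applied to a model's response; the Member type and its fields are hypothetical, not the repository's real scenario.

import typia, { tags } from "typia";

// Hypothetical constraint-style type; field names are illustrative only.
interface Member {
  email: string & tags.Format<"email">;
  age: number & tags.Type<"uint32"> & tags.Minimum<18> & tags.Maximum<100>;
  nickname: string & tags.MinLength<3>;
}

// typia.validate<T>() performs the runtime check. On failure it reports the
// violated paths and expected types, which is the kind of feedback a failed
// attempt receives on retry.
const llmResponseText = '{"email":"alice@example.com","age":29,"nickname":"ace"}';
const result = typia.validate<Member>(JSON.parse(llmResponseText));
if (!result.success) console.log(result.errors); // each error: path, expected, value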

How the Benchmark Works

  1. Trial Execution: Each scenario runs 10 trials per model
  2. First Attempt: Models receive schema and prompt without validation feedback
  3. Retry Logic: Failed attempts receive validation feedback and are retried, for up to 5 total attempts (see the sketch after this list)
  4. Success Metrics:
    • First Try Success Rate: Percentage of immediate successes (no feedback needed)
    • Overall Success Rate: Percentage that eventually succeed (including after feedback)
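
In pseudocode, one scenario run looks roughly like the sketch below. The actual orchestration lives in src/validate/ValidateBenchmark.ts; callLlm and validateResponse here are hypothetical placeholders, not the repository's functions.

// Illustrative sketch of the trial/retry loop described above.
declare function callLlm(schema: object, prompt: string, feedback: string | null): Promise<unknown>;
declare function validateResponse(schema: object, answer: unknown): { success: boolean; errors: unknown[] };

const TRIALS = 10;
const MAX_ATTEMPTS = 5;

async function runScenario(schema: object, prompt: string) {
  let firstTrySuccesses = 0;
  let overallSuccesses = 0;
  for (let trial = 1; trial <= TRIALS; trial++) {
    let feedback: string | null = null; // the first attempt gets no feedback
    for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      const answer = await callLlm(schema, prompt, feedback);
      const validation = validateResponse(schema, answer);
      if (validation.success) {
        if (attempt === 1) firstTrySuccesses++;
        overallSuccesses++;
        break;
      }
      feedback = JSON.stringify(validation.errors); // shown to the model on the next attempt
    }
  }
  return {
    firstTrySuccessRate: firstTrySuccesses / TRIALS, // immediate successes
    overallSuccessRate: overallSuccesses / TRIALS,   // successes within 5 attempts
  };
}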

Project Structure

.
├── README.md                 # This file
├── CLAUDE.md                # Claude Code instructions
├── package.json             # Dependencies and scripts
├── build.config.ts          # Build configuration
├── tsconfig.json           # TypeScript configuration
├── biome.json              # Linting and formatting
├── src/                    # Source code
│   ├── index.ts            # Main entry point
│   ├── constants.ts        # Configuration constants
│   ├── openai.ts          # OpenAI/OpenRouter integration
│   ├── validate/          # Validation benchmark engine
│   └── utils/             # Utility functions
├── scenarios/             # Benchmark test scenarios
│   ├── ValidateObjectSimple.ts
│   ├── ValidateObjectConstraint.ts
│   ├── ValidateShoppingCart*.ts
│   └── internal/          # Shared scenario utilities
├── reports/               # Benchmark results
│   ├── validate/          # Validation benchmark reports
│   │   ├── README.md      # Results summary tables
│   │   ├── anthropic/     # Claude model results
│   │   ├── openai/        # GPT model results
│   │   ├── google/        # Gemini model results
│   │   ├── deepseek/      # DeepSeek model results
│   │   ├── meta-llama/    # Llama model results
│   │   └── mistralai/     # Mistral model results
│   └── analyze-results.ts # Analysis script
└── dist/                 # Built output files

Running the Benchmark

Prerequisites

  1. API Keys: You need an OpenRouter API key to access different models
  2. Environment: Node.js 18+ and pnpm
  3. Model Selection: Choose from available OpenRouter models

Environment Setup

Create a .env file with your OpenRouter API key:

# Required: OpenRouter API key
OPENAI_API_KEY=sk-or-v1-...

# Required: Model to benchmark (examples)
OPENAI_MODEL="openai/gpt-4o"              # GPT-4o
OPENAI_MODEL="anthropic/claude-3.5-sonnet" # Claude 3.5 Sonnet
OPENAI_MODEL="google/gemini-pro-1.5"       # Gemini Pro
OPENAI_MODEL="meta-llama/llama-3.3-70b-instruct" # Llama 3.3

# Optional: Schema model for validation (defaults to "chatgpt")
SCHEMA_MODEL="chatgpt"

Running Benchmarks

# Run with specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start

# Run with different models
OPENAI_MODEL="anthropic/claude-3.5-sonnet" pnpm start
OPENAI_MODEL="google/gemini-pro-1.5" pnpm start

Development Commands

# Install dependencies
pnpm install

# Build the project
pnpm run build

# Lint and format code
pnpm run lint
pnpm run format

# Generate updated results table
pnpm run analyze-reports

Understanding Results

Report Structure

Results are organized by vendor and model:

reports/validate/
├── openai/gpt-4o-2024-11-20/
│   ├── ObjectSimple/
│   │   ├── README.md
│   │   └── trials/
│   │       ├── 1.success.json
│   │       ├── 2.success.json
│   │       └── ...
│   └── ObjectConstraint/...
└── anthropic/claude-3.5-sonnet/...

Trial Results

Each trial produces a JSON file indicating:

  • X.success.json: Trial X succeeded
  • X.failure.json: Trial X failed validation
  • X.error.json: Trial X encountered an error
  • X.nothing.json: Trial X produced no response

Metrics

The benchmark tracks:

  1. First Try Success Rate: Immediate success without feedback
  2. Overall Success Rate: Success after retries and feedback
  3. Average Attempts: Mean number of attempts needed
  4. Failed Tasks: Number of tasks that never succeeded
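
As a rough illustration, the trial files for one scenario can be tallied by filename suffix as shown below. The maintained analysis lives in reports/analyze-results.ts; first-try rate and average attempts additionally require the attempt history stored inside each JSON file.

import { readdirSync } from "node:fs";
import { join } from "node:path";

// Example path taken from the report layout above; adjust vendor/model/scenario as needed.
const trialsDir = join("reports", "validate", "openai", "gpt-4o-2024-11-20", "ObjectSimple", "trials");
const files = readdirSync(trialsDir);

const count = (suffix: string) => files.filter((f) => f.endsWith(suffix)).length;

console.log({
  overallSuccessRate: count(".success.json") / files.length,
  failureRate: count(".failure.json") / files.length,
  errorRate: count(".error.json") / files.length,
  noResponseRate: count(".nothing.json") / files.length,
});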

Architecture

Core Components

  1. ValidateBenchmark Engine (src/validate/ValidateBenchmark.ts):
    • Orchestrates benchmark execution
    • Handles concurrency and retries
    • Generates detailed reports
  2. Scenario System (scenarios/):
    • Type-driven validation scenarios
    • Uses typia for runtime type validation
    • Covers various complexity levels
  3. OpenRouter Integration (src/openai.ts):
    • Unified API access to multiple LLM providers
    • Consistent request/response handling
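
OpenRouter exposes an OpenAI-compatible HTTP endpoint, so pointing a standard OpenAI client at it is the usual pattern. The sketch below is a minimal illustration using the openai npm package; the repository's actual integration in src/openai.ts may differ in detail.

import OpenAI from "openai";

// OPENAI_API_KEY and OPENAI_MODEL are the same environment variables described above.
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENAI_API_KEY,
});

const completion = await client.chat.completions.create({
  model: process.env.OPENAI_MODEL ?? "openai/gpt-4o",
  messages: [{ role: "user", content: "Return a JSON object that matches the given schema." }],
});

console.log(completion.choices[0].message.content);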

Key Technologies

  • TypeScript: Type definitions and validation
  • Typia: Runtime type validation and schema generation
  • OpenRouter: Multi-provider LLM API access
  • Unbuild: Modern build system based on Rollup

Contributing

  1. Adding Scenarios: Create new validation scenarios in scenarios/ (a skeleton sketch follows this list)
  2. Model Support: Add new models by updating model constants
  3. Analysis: Enhance the analysis script for better insights
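
The scenario wiring (prompts, registration) lives in scenarios/internal/ and is not reproduced here; a new type-driven scenario might start from a skeleton like the one below, where ValidateOrder and its fields are purely hypothetical.

// scenarios/ValidateOrder.ts (hypothetical skeleton)
import typia from "typia";

export interface Order {
  id: string;
  items: Array<{ sku: string; quantity: number }>;
}

// typia.createValidate<T>() compiles a reusable runtime validator for the type.
export const validateOrder = typia.createValidate<Order>();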

Results

See reports/validate/README.md for the latest benchmark results across all tested models.

License

MIT License
