Commit eb3d036

Merge pull request #19 from wrtnlabs/update-reports
docs: add CLAUDE.md

2 parents: b37841a + bfa062f

File tree: 6 files changed, +683 −57 lines
CLAUDE.md

Lines changed: 67 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Essential Commands

### Build and Run

```bash
pnpm install     # Install dependencies
pnpm run build   # Build the project using unbuild
pnpm start       # Run the benchmark (requires OPENAI_API_KEY in .env)
```

### Linting and Formatting

```bash
pnpm run lint    # Run Biome linter
pnpm run format  # Format code with Biome
```

### Environment Setup

```bash
# Create .env file with OpenRouter API key
echo "OPENAI_API_KEY=sk-..." >> .env

# Set model environment variable (required)
export OPENAI_MODEL="openai/gpt-4o"  # or another model available on OpenRouter

# Optional: set schema model (defaults to "chatgpt")
export SCHEMA_MODEL="chatgpt"
```

## Architecture Overview

This is a benchmark system for testing LLM validation capabilities across different AI models. It checks whether LLMs can generate responses that conform to TypeScript type definitions.

### Core Architecture

1. **ValidateBenchmark Engine** (`src/validate/ValidateBenchmark.ts`):
   - Central orchestrator that executes validation experiments
   - Supports concurrent execution with configurable parallelism
   - Implements retry logic for validation failures
   - Generates detailed reports of benchmark results

2. **Scenario System** (`scenarios/`):
   - Each scenario tests a specific type-validation capability
   - Scenarios range from simple object validation to complex hierarchical structures
   - All scenarios export a validation function that the benchmark engine uses
   - Scenarios leverage the `typia` library for runtime type information

3. **Report Generation** (`reports/`):
   - Results are written to `reports/validate/{vendor}/{model}/`
   - Each experiment produces JSON files with trial results
   - Reports include success/failure status and detailed validation feedback

### Key Design Patterns

1. **Type-Driven Validation**: Uses TypeScript's type system with `typia` to derive runtime validators from type definitions (see the first sketch below)
2. **Multi-Model Support**: Abstracts vendor-specific implementations through the `IBenchmarkVendor` interface
3. **OpenRouter Integration**: All LLM calls go through the OpenRouter API for unified model access
4. **Concurrent Execution**: Uses the semaphore pattern for controlled parallel execution (see the second sketch below)
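
A minimal sketch of the type-driven validation pattern (pattern 1), assuming a hypothetical `IMember` type; the real scenario types live under `scenarios/` and may differ:

```typescript
import typia, { tags } from "typia";

// Hypothetical scenario type; real types live under `scenarios/`.
interface IMember {
  email: string & tags.Format<"email">;
  age: number & tags.Minimum<19> & tags.Maximum<100>;
}

// The typia transformer turns this call into a generated validator at build time.
const llmResponseText = '{"email":"not-an-email","age":5}'; // stand-in for an LLM reply
const result: typia.IValidation<IMember> = typia.validate<IMember>(
  JSON.parse(llmResponseText),
);

if (!result.success) {
  // Each error names the failing path and the expected type; this is the
  // kind of feedback the benchmark can send back to the model on retry.
  console.log(result.errors);
}
```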
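And a compact sketch of the semaphore pattern (pattern 4); illustrative only, not the engine's actual implementation:

```typescript
// A counting semaphore that caps how many tasks run at once.
class Semaphore {
  private queue: Array<() => void> = [];
  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the permit directly to a waiting task
    else this.permits++;
  }
}

// Usage: cap concurrent LLM calls at 4.
const gate = new Semaphore(4);
async function limited<T>(task: () => Promise<T>): Promise<T> {
  await gate.acquire();
  try {
    return await task();
  } finally {
    gate.release();
  }
}
```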
### Build System

- Uses `unbuild` with Rollup for building (a hedged config sketch follows this list)
- The `@ryoppippi/unplugin-typia` plugin transforms TypeScript types into runtime validators at build time
- Scenarios are built as separate entry points alongside the main index
- Source maps are enabled for debugging
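
A hedged sketch of what a `build.config.ts` along these lines could look like; the plugin import path and entry list are assumptions, and the repository's actual config may differ:

```typescript
import { defineBuildConfig } from "unbuild";
// Assumed import path; unplugin packages usually expose per-bundler entries.
import UnpluginTypia from "@ryoppippi/unplugin-typia/rollup";

export default defineBuildConfig({
  // Main index plus each scenario as its own entry point (illustrative list).
  entries: ["src/index", "scenarios/ValidateObjectSimple"],
  sourcemap: true, // source maps enabled for debugging
  hooks: {
    // Inject the typia transformer into unbuild's Rollup pipeline.
    "rollup:options"(_ctx, options) {
      options.plugins = [UnpluginTypia(), options.plugins];
    },
  },
});
```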

README.md

Lines changed: 196 additions & 11 deletions
# LLM TypeScript Validation Benchmark

A benchmark system for testing LLM validation capabilities across AI models from different vendors, using TypeScript type definitions. It evaluates how well Large Language Models can generate responses that conform to TypeScript type schemas.

## Quick Start

```bash
# Set up environment
mv .env.example .env
echo "OPENAI_API_KEY=sk-..." >> .env

# Install dependencies
pnpm install

# Build the project
pnpm run build

# Run benchmark with a specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start
```

## What This Benchmark Tests

This benchmark evaluates LLM capabilities in:

1. **Type Schema Validation**: Understanding and generating data that conforms to TypeScript type definitions
2. **Complex Object Structures**: Handling nested objects, unions, arrays, and hierarchical data
3. **Constraint Adherence**: Respecting validation rules such as string formats, number ranges, and required fields
4. **Error Recovery**: Improving responses based on validation feedback across multiple attempts

### Benchmark Scenarios

The benchmark includes multiple validation scenarios:

- **ObjectSimple**: Basic object validation with primitive types
- **ObjectConstraint**: Objects with validation constraints (age limits, email formats, etc.)
- **ObjectHierarchical**: Nested object structures with multiple levels
- **ObjectJsonSchema**: JSON Schema-based validation
- **ObjectUnionExplicit/Implicit**: Union type handling
- **ObjectFunctionSchema**: Function parameter validation
- **Shopping Cart Scenarios**: Real-world e-commerce object validation

## How the Benchmark Works

1. **Trial Execution**: Each scenario runs 10 trials per model
2. **First Attempt**: Models receive the schema and prompt without validation feedback
3. **Retry Logic**: Failed attempts receive validation feedback, for up to 5 total attempts (see the sketch after this list)
4. **Success Metrics**:
   - **First Try Success Rate**: Percentage of immediate successes (no feedback needed)
   - **Overall Success Rate**: Percentage that eventually succeed (including after feedback)
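A hedged sketch of the trial loop described above; all names (`IScenario`, `askModel`) are illustrative, not the actual `ValidateBenchmark` API:

```typescript
// Illustrative shapes, not the repository's real interfaces.
interface IValidationResult {
  success: boolean;
  errors?: unknown[];
}
interface IScenario {
  prompt: string;
  validate: (value: unknown) => IValidationResult;
}
declare function askModel(prompt: string, feedback?: string): Promise<string>;

async function runTrial(scenario: IScenario, maxAttempts = 5) {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = await askModel(scenario.prompt, feedback); // LLM call via OpenRouter
    const result = scenario.validate(JSON.parse(text));
    if (result.success) {
      return { succeeded: true, attempts: attempt, firstTry: attempt === 1 };
    }
    feedback = JSON.stringify(result.errors); // validation feedback for the retry
  }
  return { succeeded: false, attempts: maxAttempts, firstTry: false };
}
```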
## Project Structure

```
.
├── README.md           # This file
├── CLAUDE.md           # Claude Code instructions
├── package.json        # Dependencies and scripts
├── build.config.ts     # Build configuration
├── tsconfig.json       # TypeScript configuration
├── biome.json          # Linting and formatting
├── src/                # Source code
│   ├── index.ts        # Main entry point
│   ├── constants.ts    # Configuration constants
│   ├── openai.ts       # OpenAI/OpenRouter integration
│   ├── validate/       # Validation benchmark engine
│   └── utils/          # Utility functions
├── scenarios/          # Benchmark test scenarios
│   ├── ValidateObjectSimple.ts
│   ├── ValidateObjectConstraint.ts
│   ├── ValidateShoppingCart*.ts
│   └── internal/       # Shared scenario utilities
├── reports/            # Benchmark results
│   ├── validate/       # Validation benchmark reports
│   │   ├── README.md   # Results summary tables
│   │   ├── anthropic/  # Claude model results
│   │   ├── openai/     # GPT model results
│   │   ├── google/     # Gemini model results
│   │   ├── deepseek/   # DeepSeek model results
│   │   ├── meta-llama/ # Llama model results
│   │   └── mistralai/  # Mistral model results
│   └── analyze-results.ts  # Analysis script
└── dist/               # Built output files
```
## Running the Benchmark

### Prerequisites

1. **API Key**: You need an OpenRouter API key to access the different models
2. **Environment**: Node.js 18+ and pnpm
3. **Model Selection**: Choose from the models available on OpenRouter

### Environment Setup

Create a `.env` file with your OpenRouter API key:

```bash
# Required: OpenRouter API key
OPENAI_API_KEY=sk-or-v1-...

# Required: model to benchmark (examples)
OPENAI_MODEL="openai/gpt-4o"                      # GPT-4o
OPENAI_MODEL="anthropic/claude-3.5-sonnet"        # Claude 3.5 Sonnet
OPENAI_MODEL="google/gemini-pro-1.5"              # Gemini Pro
OPENAI_MODEL="meta-llama/llama-3.3-70b-instruct"  # Llama 3.3

# Optional: schema model for validation (defaults to "chatgpt")
SCHEMA_MODEL="chatgpt"
```

### Running Benchmarks

```bash
# Run with a specific model
OPENAI_MODEL="openai/gpt-4o" pnpm start

# Run with different models
OPENAI_MODEL="anthropic/claude-3.5-sonnet" pnpm start
OPENAI_MODEL="google/gemini-pro-1.5" pnpm start
```

### Development Commands

```bash
# Install dependencies
pnpm install

# Build the project
pnpm run build

# Lint and format code
pnpm run lint
pnpm run format

# Generate the updated results table
pnpm run analyze-reports
```

## Understanding Results

### Report Structure

Results are organized by vendor and model:

```
reports/validate/
├── openai/gpt-4o-2024-11-20/
│   ├── ObjectSimple/
│   │   ├── README.md
│   │   └── trials/
│   │       ├── 1.success.json
│   │       ├── 2.success.json
│   │       └── ...
│   └── ObjectConstraint/...
└── anthropic/claude-3.5-sonnet/...
```

### Trial Results

Each trial produces a JSON file whose name indicates the outcome:

- `X.success.json`: Trial X succeeded
- `X.failure.json`: Trial X failed validation
- `X.error.json`: Trial X encountered an error
- `X.nothing.json`: Trial X produced no response

### Metrics

The benchmark tracks the following (a sketch of the computation follows the list):

1. **First Try Success Rate**: Immediate success without feedback
2. **Overall Success Rate**: Success after retries and feedback
3. **Average Attempts**: Mean number of attempts needed
4. **Failed Tasks**: Number of tasks that never succeeded
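A small sketch, under assumed trial shapes, of how these four numbers could be derived:

```typescript
interface ITrial {
  succeeded: boolean; // did any attempt pass validation?
  attempts: number;   // attempts used (1..5)
}

function summarize(trials: ITrial[]) {
  const firstTry = trials.filter((t) => t.succeeded && t.attempts === 1).length;
  const overall = trials.filter((t) => t.succeeded).length;
  return {
    firstTrySuccessRate: firstTry / trials.length, // success with no feedback
    overallSuccessRate: overall / trials.length,   // success within all attempts
    averageAttempts:
      trials.reduce((sum, t) => sum + t.attempts, 0) / trials.length,
    failedTasks: trials.length - overall,
  };
}
```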
## Architecture

### Core Components

1. **ValidateBenchmark Engine** (`src/validate/ValidateBenchmark.ts`):
   - Orchestrates benchmark execution
   - Handles concurrency and retries
   - Generates detailed reports

2. **Scenario System** (`scenarios/`):
   - Type-driven validation scenarios
   - Uses `typia` for runtime type validation
   - Covers various complexity levels

3. **OpenRouter Integration** (`src/openai.ts`):
   - Unified API access to multiple LLM providers (see the sketch below)
   - Consistent request/response handling
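A minimal sketch of the OpenRouter pattern: the standard OpenAI SDK pointed at OpenRouter's OpenAI-compatible endpoint. The actual `src/openai.ts` may be organized differently:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1", // OpenRouter's OpenAI-compatible API
  apiKey: process.env.OPENAI_API_KEY,      // OpenRouter key from .env
});

const completion = await client.chat.completions.create({
  model: process.env.OPENAI_MODEL!, // e.g. "openai/gpt-4o"
  messages: [{ role: "user", content: "Fill in this schema..." }],
});

console.log(completion.choices[0].message.content);
```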
### Key Technologies

- **TypeScript**: Type definitions and validation
- **Typia**: Runtime type validation and schema generation
- **OpenRouter**: Multi-provider LLM API access
- **Unbuild**: Modern build system built on Rollup
## Contributing

1. **Adding Scenarios**: Create new validation scenarios in `scenarios/` (a skeleton is sketched below)
2. **Model Support**: Add new models by updating the model constants
3. **Analysis**: Enhance the analysis script for better insights
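A hypothetical skeleton for a new scenario file; the real scenario contract lives under `scenarios/internal/`, so treat the names here as assumptions:

```typescript
// scenarios/ValidateObjectExample.ts (hypothetical)
import typia, { tags } from "typia";

interface IExample {
  id: string & tags.Format<"uuid">;
  quantity: number & tags.Type<"uint32">;
}

// Each scenario exports a validation function for the benchmark engine.
export const validate = (input: unknown): typia.IValidation<IExample> =>
  typia.validate<IExample>(input);
```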
## Results

See [`reports/validate/README.md`](reports/validate/README.md) for the latest benchmark results across all tested models.

## License

[MIT License](LICENSE)

package.json

Lines changed: 1 addition & 0 deletions

@@ -9,6 +9,7 @@
     "start": "node --env-file=.env ./dist/index.mjs",
     "lint": "biome check .",
     "format": "biome check --write .",
+    "analyze-reports": "cd reports && node --experimental-strip-types analyze-results.ts",
     "prepare": "husky"
   },
   "peerDependencies": {
