HazyResearch/scaling-verification

Shrinking the Generation-Verification Gap with Weak Verifiers

📄 Paper
🤗 Datasets and Models
📝 Blog Post

Weaver is a framework for building a strong verifier by combining multiple weak, imperfect verifiers. To reduce dependence on labeled data, Weaver uses weak supervision to estimate each verifier's accuracy and combines their outputs into a unified score that better reflects true response quality. Weaver significantly improves pass@1 performance across several reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as the generator and an ensemble of smaller judge and reward models as the verifiers.
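To make the combination step concrete, here is a minimal sketch in Python. The hard-coded accuracies and the log-odds weighting are illustrative stand-ins for the quantities Weaver actually estimates via weak supervision, not the repository's estimator:

import numpy as np

# Toy illustration: combine scores from several weak verifiers into one
# score per response, weighting each verifier by its estimated accuracy.
verifier_scores = np.array([
    # rows = candidate responses, columns = weak verifiers, scores in [0, 1]
    [0.9, 0.4, 0.7],
    [0.2, 0.8, 0.6],
    [0.5, 0.5, 0.5],
])

# Weaver estimates these accuracies from unlabeled data via weak
# supervision; they are hard-coded here purely for illustration.
estimated_accuracy = np.array([0.75, 0.60, 0.55])

# Log-odds weighting gives more accurate verifiers more influence.
weights = np.log(estimated_accuracy / (1 - estimated_accuracy))
combined = verifier_scores @ weights

print("selected response:", int(np.argmax(combined)))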

Quick Start

Installation

1. Clone the repository:

git clone https://github.com/ScalingIntelligence/scaling-verification.git
cd scaling-verification

2. Create a Python environment (Python 3.10+ required):

conda create -n weaver python=3.11
conda activate weaver

3. Install Weaver and dependencies:

# Install dependencies
pip install -e .

# Install a pre-built CUDA wheel that matches your driver (cu124 shown; adjust as needed)
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu124

# [Optional] Install FlashAttention for faster GPU inference; skip it if you only need CPU or generic CUDA
pip install -e ".[cuda]"

# Install metal-ama for weak supervision (required)
git clone https://github.com/mayeechen/metal-ama.git
cd metal-ama
pip install -e .
cd ..

4. Verify installation:

python test_installation.py

5. API Keys:

While not strictly necessary, we recommend exporting these API keys:

export OPENAI_API_KEY="your-key"
export TOGETHER_API_KEY="your-key"
export WANDB_ENTITY="your-wandb-entity"

6. Set up WandB and Hugging Face

To ensure WandB works, we recommend logging in before you start:

wandb login

To authenticate with Hugging Face, run either of the following commands:

export HF_TOKEN="your_HF_token_here"
huggingface-cli login

Optional Dependencies

For development and additional features:

# Development tools
pip install -e ".[dev]"

# Jupyter notebooks and full environment
pip install -e ".[full]"

Run Weaver

Weaver consists of three main stages, each of which is described below.

We provide publicly available datasets that already contain the data from full runs of stages 1 and 2, so you can skip directly to stage 2 if you would like to reproduce our main results quickly. We also provide off-the-shelf models produced by running stage 3 on our datasets.

Check out example.sh for an example sequence of commands that runs Weaver end to end on a small set of data.

1. Generation

Generate reasoning samples, collect verifier scores for benchmark problems, and store everything in a Hugging Face dataset.

What it does: Takes benchmark datasets (MATH, GPQA, MMLU, MMLU-Pro), generates multiple reasoning responses with an LLM of your choice, then evaluates them with various reward models and LM judges (a sketch of the resulting record layout appears at the end of this subsection).

Key scripts:

  • generate_reasoning_samples.py - Generate model responses
  • unified_evaluation.py - Extract answers and check correctness
  • unified_RMs_and_LM_Judges.py - Score model responses with verifiers

See generation/README.md for detailed instructions →
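For orientation, one stage-1 record might look like the sketch below. The field names here are hypothetical; the actual schema is defined by the scripts above and documented in generation/README.md:

# Hypothetical layout of a single stage-1 record (field names illustrative)
record = {
    "problem": "What is 7 * 8?",
    "responses": ["... step-by-step reasoning ... 56",
                  "... step-by-step reasoning ... 54"],
    "is_correct": [True, False],          # from unified_evaluation.py
    "verifier_scores": {                  # from unified_RMs_and_LM_Judges.py
        "reward_model_A": [0.91, 0.33],
        "lm_judge_B": [0.80, 0.45],
    },
}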

2. Selection

Train and evaluate Weaver, along with baseline methods, for response selection.

What it does: Uses the datasets from stage 1 to train Weaver models that use weak verifier scores to select the best response from multiple candidates, comparing against baselines like majority voting and supervised methods.
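For reference, the majority-voting baseline mentioned above reduces to a few lines; this is a sketch of the baseline itself, not of the repository's run.py:

from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    # Pick the final answer that appears most often across sampled responses.
    return Counter(final_answers).most_common(1)[0][0]

print(majority_vote(["56", "54", "56", "56"]))  # -> "56"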

Key script: run.py

See selection/README.md for detailed instructions →

3. Distillation

Distill Weaver's ensemble into a compact 400M-parameter model.

What it does: Trains a lightweight cross-encoder model that captures 98.7% of Weaver's accuracy while reducing compute by 99.97%, making verification practical for deployment.
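For intuition, a distilled cross-encoder verifier is typically applied as below: the problem and a candidate response are scored jointly in a single forward pass. The checkpoint path is a placeholder, not a released model ID; see the Hugging Face collection linked below for the actual models:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path; substitute a real distilled Weaver checkpoint.
name = "path/to/distilled-weaver-400m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Cross-encoders score the (problem, response) pair jointly.
inputs = tokenizer("What is 7 * 8?", "The answer is 56.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"verification score: {score:.3f}")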

Key scripts:

  • train.py - Train distilled model
  • evaluate.py - Evaluate performance

See distillation/README.md for detailed instructions →

Pre-trained Models & Datasets

We provide ready-to-use datasets and models in our Hugging Face collection; a loading sketch follows the dataset list below:

Datasets:

  • MATH-500, GPQA, MMLU, MMLU-Pro with Llama-3.1 (70B & 8B) generations
  • Pre-scored with 15+ reward models and LM judges
  • Includes final verification scores from our best Weaver configuration
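A dataset can be pulled with the standard datasets API. The repository ID below is hypothetical; substitute one from the collection:

from datasets import load_dataset

# Hypothetical dataset ID; see the Hugging Face collection for real names.
ds = load_dataset("hazyresearch/MATH-500-with-verifier-scores", split="train")
print(ds[0].keys())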

Distilled Models:

  • Five models, distilled from a range of base models for a range of tasks

Troubleshooting

Import errors:

# Make sure you're in the right directory and environment
conda activate weaver
cd /path/to/scaling-verification
pip install -e .

CUDA/GPU issues:

# Check PyTorch CUDA installation
python -c "import torch; print(torch.cuda.is_available())"

Metal-ama installation issues:

# Re-install metal-ama
cd metal-ama
pip install -e . --force-reinstall

Citation

If you use this work, please cite:

@misc{saadfalcon2025shrinkinggenerationverificationgapweak,
      title={Shrinking the Generation-Verification Gap with Weak Verifiers}, 
      author={Jon Saad-Falcon and E. Kelly Buchanan and Mayee F. Chen and Tzu-Heng Huang and Brendan McLaughlin and Tanvir Bhathal and Shang Zhu and Ben Athiwaratkun and Frederic Sala and Scott Linderman and Azalia Mirhoseini and Christopher Ré},
      year={2025},
      eprint={2506.18203},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2506.18203}, 
}
