Commit 1cf48eb

Initial commit
1 parent 35e000e commit 1cf48eb

21 files changed: +22540 -2 lines changed

.gitignore

+175
@@ -0,0 +1,175 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# UV
+# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+#uv.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+# PyPI configuration file
+.pypirc
+
+# ruff
+.ruff_cache/
+predicts/
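To get a feel for what the wildcard patterns in this `.gitignore` cover, the sketch below checks a few bare file names against a handful of them with Python's `fnmatch` module. This is only an approximation: Git's own matching has extra rules (trailing `/` for directories, leading `/` anchoring, `**`) that `fnmatch` does not model, and the helper name `is_ignored` is made up for illustration.

```python
from fnmatch import fnmatchcase

# A few basename-only patterns copied from the .gitignore above.
# Directory patterns such as build/ or /site are deliberately left out,
# because fnmatch does not implement Git's directory or anchoring rules.
PATTERNS = ["*.py[cod]", "*$py.class", "*.so", "*.egg", ".coverage.*"]


def is_ignored(filename: str) -> bool:
    """Return True if the bare file name matches any pattern (approximation)."""
    return any(fnmatchcase(filename, pattern) for pattern in PATTERNS)


for name in ["module.pyc", "module.pyo", "module.py", "Foo$py.class", ".coverage.host1"]:
    print(f"{name}: {'ignored' if is_ignored(name) else 'tracked'}")
```

For an authoritative answer on a real checkout, `git check-ignore -v <path>` reports which pattern, if any, applies.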

.pre-commit-config.yaml

+29
@@ -0,0 +1,29 @@
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+      - id: check-ast
+      - id: check-added-large-files
+        args: ['--maxkb=25000']
+      - id: check-merge-conflict
+      - id: check-yaml
+      - id: debug-statements
+      - id: end-of-file-fixer
+      - id: requirements-txt-fixer
+      - id: trailing-whitespace
+        args: [--markdown-linebreak-ext=md]
+      - id: no-commit-to-branch
+        args: ['--branch', 'master']
+
+  - repo: https://github.com/asottile/pyupgrade
+    rev: v3.17.0
+    hooks:
+      - id: pyupgrade
+        args: [--py38-plus]
+
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.6.9
+    hooks:
+      - id: ruff
+        args: [--fix]
+      - id: ruff-format
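Two of the hooks configured above rewrite files rather than just checking them: `trailing-whitespace` (here with `--markdown-linebreak-ext=md`, which keeps Markdown hard line breaks) and `end-of-file-fixer`. The following is a simplified sketch of the kind of normalization they perform, not the hooks' actual code; both function names are hypothetical and edge cases such as CRLF line endings are ignored.

```python
def fix_trailing_whitespace(text: str, keep_markdown_breaks: bool = False) -> str:
    """Strip trailing spaces and tabs from every line.

    With keep_markdown_breaks=True, a line that has content and ends in two
    or more spaces keeps exactly two, since that marks a Markdown hard break.
    """
    fixed = []
    for line in text.split("\n"):
        stripped = line.rstrip(" \t")
        if keep_markdown_breaks and stripped and line.endswith("  "):
            stripped += "  "
        fixed.append(stripped)
    return "\n".join(fixed)


def fix_end_of_file(text: str) -> str:
    """Make the text end in exactly one newline; newline-only text becomes empty."""
    stripped = text.rstrip("\n")
    return stripped + "\n" if stripped else ""


sample = "# Title  \nSome text \t\n\n\n"
print(repr(fix_end_of_file(fix_trailing_whitespace(sample, keep_markdown_breaks=True))))
```

Running `pre-commit run --all-files`, as the Makefile's `commit` target does, applies the real hooks to every file in the repository.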

Makefile

+15
@@ -0,0 +1,15 @@
+.PHONY: commit quality style
+
+check_dirs := prm800k scripts setup.py
+
+commit:
+	pre-commit install
+	pre-commit run --all-files
+
+quality:
+	ruff check $(check_dirs)
+	ruff format --check $(check_dirs)
+
+style:
+	ruff check $(check_dirs) --fix
+	ruff format $(check_dirs)

README.md

+110 -2
@@ -1,2 +1,110 @@
-# MathUtils
-A tool for evaluating LLMs on the MATH dataset.
+# MathRuler
+
+*A light-weight tool for evaluating LLMs in rule-based ways.*
+
+## Installation
+
+We use [vLLM](https://github.com/vllm-project/vllm) to accelerate the generation.
+
+```bash
+git clone https://github.com/hiyouga/MathRuler.git
+cd MathRuler
+pip install .
+```
+
+## Datasets
+
+- [MATH](https://github.com/hendrycks/math): 500 problems.
+- [GSM8K](https://github.com/openai/grade-school-math): 1319 problems.
+- [AIME24](https://huggingface.co/datasets/HuggingFaceH4/aime_2024): 30 problems.
+- [AIME25](https://huggingface.co/datasets/math-ai/aime25): 30 problems.
+
+## Generate
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 mathruler gen Qwen/Qwen2.5-Math-7B-Instruct
+```
+
+Example output:
+
+> Processed prompts: 100%|██████| 500/500 [00:36<00:00, 13.75it/s, est. speed input: 15765.84 toks/s, output: 5299.80 toks/s]
+
+### Optional Arguments
+
+- **json_path** (str): path to the eval file, defaults to `data/math_splits/test.jsonl`
+- **save_path** (str): path to the predicted file, defaults to `predicts/test.jsonl`
+- **n_shot** (int): number of few-shot examples, defaults to `0`
+- **demo_split** (str): split to build few-shot examples, defaults to `math`
+- **system** (str): system message for generation, defaults to `Please reason step by step, and put your final answer within \boxed{}.`
+- **temperature** (float): decode temperature value, defaults to `0.0`
+- **top_p** (float): decode top p value, defaults to `1.0`
+- **max_tokens** (int): maximum number of generated tokens, defaults to `4096`
+- **sample_num** (int): number of samples generated per problem for best-of-n evaluation, defaults to `1`
+
+## Evaluate
+
+```bash
+mathruler eval predicts/test.jsonl
+```
+
+Example output:
+
+> Processing sample: 100%|██████| 500/500 [00:00<00:00, 926.32it/s]
+>
+> Accuracy: 413/500 = 82.60%.
+
+## Experimental Results
+
+### MATH Dataset
+
+| Command                                        | Measured Acc | Reported Acc |
+| ---------------------------------------------- | ------------ | ------------ |
+| mathruler gen meta-llama/Meta-Llama-3-8B       | 29.2%        | 29.1%*       |
+| mathruler gen meta-llama/Llama-3.1-8B-Instruct | 50.8%        | 51.9%*       |
+| mathruler gen meta-llama/Llama-3.2-3B-Instruct | 48.4%        | 48.0%**      |
+| mathruler gen Qwen/Qwen2.5-Math-7B-Instruct    | 82.6%        | 83.6%***     |
+
+### GSM8K Dataset
+
+> Use `--json_path data/gsm8k_splits/test.jsonl` to evaluate models on the GSM8K dataset.
+
+| Command                                        | Measured Acc | Reported Acc |
+| ---------------------------------------------- | ------------ | ------------ |
+| mathruler gen meta-llama/Meta-Llama-3-8B       | 65.3%        | 80.6%*       |
+| mathruler gen meta-llama/Llama-3.1-8B-Instruct | 81.7%        | 84.5%*       |
+| mathruler gen meta-llama/Llama-3.2-3B-Instruct | 74.6%        | 77.7%**      |
+| mathruler gen Qwen/Qwen2.5-Math-7B-Instruct    | 95.6%        | 95.2%***     |
+
+> [!NOTE]
+> For the GSM8K dataset, we evaluate all models in the zero-shot CoT setting, while the reported values for the Llama models are taken from the 8-shot CoT setting (*).
+
+- *: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
+- **: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
+- ***: https://qwenlm.github.io/blog/qwen2.5-math/
+
+## Example Use
+
+```python
+from mathruler.grader import extract_boxed_content, grade_answer
+
+grade_answer(given_answer, ground_truth)  # both arguments are str
+grade_answer(extract_boxed_content(generated_result), answer)  # generated_result and answer are str
+```
+
+## Acknowledgement
+
+- [openai/prm800k](https://github.com/openai/prm800k)
+- [openai/grade-school-math](https://github.com/openai/grade-school-math)
+- [QwenLM/Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math)
+- [vllm-project/vllm](https://github.com/vllm-project/vllm)
+
+## Citation
+
+```bibtex
+@Misc{mathruler,
+  title = {MathRuler},
+  author = {hiyouga},
+  howpublished = {\url{https://github.com/hiyouga/MathRuler}},
+  year = {2025}
+}
+```
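The README describes a rule-based flow: the model is asked to put its final answer in `\boxed{}`, the boxed content is extracted, and it is graded against the reference to produce a line such as `Accuracy: 413/500 = 82.60%.` The sketch below illustrates that flow in plain Python without importing `mathruler`. The names `last_boxed`, `is_correct`, and `report` are hypothetical, and the exact string comparison is a deliberately naive stand-in for whatever normalization `grade_answer` applies.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the brace opened by the marker
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
    return None  # braces never balanced


def is_correct(response: str, answer: str) -> bool:
    """Naive exact match after stripping whitespace (assumption, not mathruler's grader)."""
    boxed = last_boxed(response)
    return boxed is not None and boxed.strip() == answer.strip()


def report(correct: int, total: int) -> str:
    """Format the score in the style of the README's example output."""
    return f"Accuracy: {correct}/{total} = {correct / total:.2%}."


responses = [
    "2 + 2 = 4, so the answer is \\boxed{4}.",
    "Half of one is \\boxed{\\frac{1}{2}}.",
    "I am not sure about this one.",
]
answers = ["4", "\\frac{1}{2}", "7"]
correct = sum(is_correct(r, a) for r, a in zip(responses, answers))
print(report(correct, len(answers)))
```

Exact matching would mark `0.5` as wrong against `\frac{1}{2}`, which is why a dedicated grader with answer normalization is preferable for real evaluation.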
