Code for NAACL 2025 paper "Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data"
- Remember to use the newest version of vLLM
- You might need to install the Azure CLI packages if things don't work: `pip install azure-cli azure-functions azure-identity`
- Dependency: https://github.com/microsoft/controllable-safety-alignment
Make sure to correctly set up the environment variables before running any of the scripts:
- `PROJ_DIR` and `PYTHONPATH` should be the project directory of the controllable-safety-alignment repo (instead of this one)!
- `MODEL_DIR`, `DATA_DIR`: dedicated directories to store downloaded models and data
- `OUTPUT_DIR`: dedicated directory to store trained checkpoints
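
As a quick sanity check before launching any script, the variables above can be verified from Python. This is a minimal sketch and not part of the repository: the helper name `missing_env_vars` is hypothetical, and it only checks that each variable is set and non-empty, not that the paths point to the right place.

```python
import os

# Variables named in this README; all must be set before running the scripts.
REQUIRED_VARS = ["PROJ_DIR", "PYTHONPATH", "MODEL_DIR", "DATA_DIR", "OUTPUT_DIR"]


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping (the paths are placeholders).
example_env = {
    "PROJ_DIR": "/path/to/controllable-safety-alignment",
    "PYTHONPATH": "/path/to/controllable-safety-alignment",
    "MODEL_DIR": "/path/to/models",
}
print("Missing:", missing_env_vars(example_env))
```

Calling `missing_env_vars()` with no argument checks the real environment instead of the example mapping.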
- Format data into a Hugging Face dataset.
  Example: `data/nq/dev`, `data/nq/train`
Dataset({
features: ['prompt', 'reference'],
num_rows: 110865
})
- Start the vLLM server using `start_vllm.sh` (make sure port 8000 is not in use, or change to a different port! Changing the port requires modifying the last few lines of the `model_name_to_endpoints` function in `$PROJ_DIR/src/oai_inference.py`), then run `$PROJ_DIR/src/oai_inference.py` to generate responses on the training and dev data.
  Example: `run_gen_bo32.sh`
- Set up a QUIP score server (or use the existing one accessible on the internet, 'https://acc2-private-wiki.dataportraits.org/quip'). Make sure the URL in `quip_api.py` is correct.
- Use `run_metric_on_gen.py` to score responses with QUIP score. See the command-line arguments there for details.
- Use `best_of_n_to_paired_gen.py` to produce paired data for DPO. See the examples in `best_of_n_to_paired_gen.sh`. Next, convert the produced .json into a Hugging Face dataset via `data_processing/convert_paired_gens_json_to_dataset.py`. An example is available in `data_processing/convert_paired_gens_json_to_dataset.sh`.
- Add the path of the paired data (converted to an HF dataset) as a dataset in the `PAIRED_DATA_DICT` of `$PROJ_DIR/dpo/preference_datasets.py`. Search for 'qt_gemma2-9b-it-inst_bo32_dq0.10_dl0.10-concise_sysp' in the file for an example.
- Conduct DPO training! Use `$PROJ_DIR/dpo/train_qt.sh`. This part of the code is based on https://github.com/eric-mitchell/direct-preference-optimization. WANDB integration is supported. After training, convert the trained model .pt file back to Hugging Face format using `checkpoint_pt_to_hf.py`.
- Run evaluation using `run_eval_combined.sh`.
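
For intuition about the best-of-n to paired-data step above, the sketch below turns n scored responses for one prompt into a single preference pair. It is an illustration under assumptions, not the logic of `best_of_n_to_paired_gen.py`: the function name, the `chosen`/`rejected` field names, and the `min_score_gap` threshold are all hypothetical. See the script and its `.sh` example for the actual selection criteria.

```python
from typing import Dict, List, Optional


def best_of_n_to_pair(
    prompt: str,
    responses: List[str],
    scores: List[float],
    min_score_gap: float = 0.0,  # hypothetical threshold, not a flag of the real script
) -> Optional[Dict[str, str]]:
    """Build one preference pair from n scored responses to the same prompt.

    The highest-scoring response becomes `chosen` and the lowest-scoring
    one becomes `rejected`. Returns None when there are fewer than two
    responses or when the score gap does not exceed `min_score_gap`.
    """
    if len(responses) != len(scores) or len(responses) < 2:
        return None
    # Sort by score only, so ties keep their original order.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0])
    low_score, rejected = ranked[0]
    high_score, chosen = ranked[-1]
    if high_score - low_score <= min_score_gap:
        return None  # candidates are too close in score to form a useful pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy example: three candidate responses with made-up scores.
pair = best_of_n_to_pair(
    "Who wrote Hamlet?",
    ["response a", "response b", "response c"],
    [0.10, 0.42, 0.25],
    min_score_gap=0.1,
)
print(pair)
```

Requiring a minimum gap is one simple way to avoid training on pairs whose scores are nearly indistinguishable; whether and how the repository filters pairs should be checked in the script itself.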
If you find our work useful, we kindly invite you to cite it:
@misc{zhang2025verifiabledesignaligninglanguage,
title={Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data},
author={Jingyu Zhang and Marc Marone and Tianjian Li and Benjamin Van Durme and Daniel Khashabi},
year={2025},
eprint={2404.03862},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2404.03862},
}