To get started, create a conda environment and install the dependencies:

```bash
conda create -n sta python=3.11 -y
conda activate sta
pip install -r requirements.txt
cd ./TransformerLens
pip install -e .  # TransformerLens 2.4.0
cd ../trl
pip install -e .  # for SFT/DPO training
```
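Optionally, verify that the editable installs resolved in this environment with a quick import check (a minimal sketch):

```python
# Quick sanity check that the editable installs are visible in this env.
import torch
import transformer_lens
import trl

print("torch:", torch.__version__)
print("transformer_lens:", transformer_lens.__version__)  # 2.4.0 per the install comment above
print("trl:", trl.__version__)
```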
## Dataset and Steering Vector
The data for STA can be downloaded here.
### Directory Structure

```
steer-target-atoms
└── data
    ├── mmlu
    └── safety
```
If you download the data from here, you will find the steering vectors used in the paper:

- steering vector for Gemma-2-9b-it: `./data/safety/toxic_DINM_it/sae_caa_vector_it/gemma-2-9b-it_safety/act_and_fre_trim/steering_vector`
- steering vector for Gemma-2-9b-pt: `./data/safety/toxic_DINM_pt/sae_caa_vector_pt/gemma-2-9b_safety/act_and_fre_trim/steering_vector`
Then, you can skip directly to the *Steering the Behaviors of LLMs* section below.
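Once downloaded, a vector can be sanity-checked with plain PyTorch. A minimal sketch, assuming the files under `steering_vector/` are torch-saved tensors; the filename below is hypothetical, so list the directory for the actual names:

```python
import torch

# Hypothetical filename; inspect the steering_vector/ directory for the real files.
vec_path = "./data/safety/toxic_DINM_it/sae_caa_vector_it/gemma-2-9b-it_safety/act_and_fre_trim/steering_vector/layer_20.pt"
steering_vector = torch.load(vec_path, map_location="cpu")
print(steering_vector.shape)  # expected: (d_model,) for a residual-stream vector
```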
Alternatively, you can generate these steering vectors yourself with the following steps:
- Download the SAE
  - Download the SAE for Gemma-2-9b-it from here, then replace the value of `sae_paths` (in `./scripts/generate_vector/gemma/sta/run_selection_safe_gemma_it_DINM.sh`) with your own path.
  - Download the SAE for Gemma-2-9b-pt from here, then replace the value of `sae_paths` (in `./scripts/generate_vector/gemma/sta/run_selection_safe_gemma_pt_DINM.sh`) with your own path.
- Generate the steering vector (see the sketch after this list):

```bash
bash run_generate_vector.sh
```
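Conceptually, the script builds a CAA-style direction from contrastive activations and then trims it toward selected SAE atoms (the `act_and_fre_trim` directory name suggests selection by activation and frequency). The following is an illustrative sketch of that idea, not the repository's implementation; the function names, the orthonormality assumption, and the toy indices are all hypothetical:

```python
import torch

def caa_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """CAA-style direction: mean difference of residual-stream activations
    collected on positive vs. negative prompts; each input is (n, d_model)."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def trim_to_target_atoms(vec: torch.Tensor, sae_decoder: torch.Tensor,
                         atom_idx: torch.Tensor) -> torch.Tensor:
    """Keep only the components of `vec` along selected SAE atoms (rows of the
    decoder). Exact only for orthonormal decoder rows -- a sketch, not the
    repository's selection criterion."""
    coeffs = vec @ sae_decoder.T            # project onto all atoms: (n_atoms,)
    mask = torch.zeros_like(coeffs)
    mask[atom_idx] = 1.0                    # keep only the target atoms
    return (coeffs * mask) @ sae_decoder    # reconstruct the trimmed vector

# Toy usage with random tensors (shapes only; real activations come from the model):
d_model, n_atoms = 8, 32
vec = caa_vector(torch.randn(10, d_model), torch.randn(10, d_model))
trimmed = trim_to_target_atoms(vec, torch.randn(n_atoms, d_model), torch.tensor([3, 17]))
```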
## Steering the Behaviors of LLMs

You can steer the behaviors of LLMs with the steering vector:

```bash
bash run_main_table.sh
```
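Under the hood, steering amounts to adding the vector to the residual stream during generation. A minimal TransformerLens sketch; the layer, scale, and vector filename below are illustrative, not the values used by the scripts:

```python
import torch
from transformer_lens import HookedTransformer

# Substitute your own model_name_or_path here.
model = HookedTransformer.from_pretrained("gemma-2-9b-it")

# Hypothetical filename; use the actual file from the steering_vector/ directory.
steering_vector = torch.load("path/to/steering_vector/layer_20.pt")

scale = 4.0                              # illustrative steering strength
hook_name = "blocks.20.hook_resid_post"  # illustrative layer

def steer(resid, hook):
    # Add the steering direction at every token position of the residual stream.
    return resid + scale * steering_vector.to(resid.device, resid.dtype)

with model.hooks(fwd_hooks=[(hook_name, steer)]):
    print(model.generate("How can I stay safe online?", max_new_tokens=64))
```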
❗️ You should replace the value of `model_name_or_path` in the corresponding xx.sh file with your own model path.
Then evaluate the results:

```bash
bash run_eval.sh
```
This repository was developed for our STA paper. We also release EasyEdit2, a unified framework for controllable editing without retraining; it integrates multiple steering methods to ease usage and evaluation. Unlike this repository, EasyEdit2 does not depend on TransformerLens, and we recommend it for future research and applications.
## Citation

Please cite our paper if you use STA in your work:
```bibtex
@misc{wang2025STA,
  title={Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms},
  author={Mengru Wang and Ziwen Xu and Shengyu Mao and Shumin Deng and Zhaopeng Tu and Huajun Chen and Ningyu Zhang},
  year={2025},
  eprint={2505.20322},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```