To get started, create a conda environment and install the dependencies:

```bash
conda create -n sta python=3.11 -y
conda activate sta
pip install -r requirements.txt
cd ./TransformerLens
pip install -e .  # TransformerLens 2.4.0
cd ../trl
pip install -e .  # for SFT/DPO training
```
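Optionally, verify that the editable installs resolved in this environment with a quick import check (a minimal sketch):

```python
# Quick sanity check that the editable installs are visible in this env.
import torch
import transformer_lens
import trl

print("torch:", torch.__version__)
print("transformer_lens:", transformer_lens.__version__)  # 2.4.0 per the install comment above
print("trl:", trl.__version__)
```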
## Dataset and Steering Vector
The data for STA can be downloaded here.
### Directory Structure

```
steer-target-atoms
└── data
    ├── mmlu
    └── safety
```
If you download the data from here, you will find the steering vectors used in the paper:

- steering vector for Gemma-2-9b-it: `./data/safety/toxic_DINM_it/sae_caa_vector_it/gemma-2-9b-it_safety/act_and_fre_trim/steering_vector`
- steering vector for Gemma-2-9b-pt: `./data/safety/toxic_DINM_pt/sae_caa_vector_pt/gemma-2-9b_safety/act_and_fre_trim/steering_vector`
Then, you can skip directly to the *Steering the Behaviors of LLMs* section below.
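Once downloaded, a vector can be sanity-checked with plain PyTorch. A minimal sketch, assuming the files under `steering_vector/` are torch-saved tensors; the filename below is hypothetical, so list the directory for the actual names:

```python
import torch

# Hypothetical filename; inspect the steering_vector/ directory for the real files.
vec_path = "./data/safety/toxic_DINM_it/sae_caa_vector_it/gemma-2-9b-it_safety/act_and_fre_trim/steering_vector/layer_20.pt"
steering_vector = torch.load(vec_path, map_location="cpu")
print(steering_vector.shape)  # expected: (d_model,) for a residual-stream vector
```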
Alternatively, you can generate these steering vectors yourself with the following steps:
- Download the SAE
  - Download the SAE for Gemma-2-9b-it from here, then replace the value of `sae_paths` (in `./scripts/generate_vector/gemma/sta/run_selection_safe_gemma_it_DINM.sh`) with your own path.
  - Download the SAE for Gemma-2-9b-pt from here, then replace the value of `sae_paths` (in `./scripts/generate_vector/gemma/sta/run_selection_safe_gemma_pt_DINM.sh`) with your own path.
- Generate the steering vector (see the sketch after this list):

```bash
bash run_generate_vector.sh
```
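Conceptually, the script builds a CAA-style direction from contrastive activations and then trims it toward selected SAE atoms (the `act_and_fre_trim` directory name suggests selection by activation and frequency). The following is an illustrative sketch of that idea, not the repository's implementation; the function names, the orthonormality assumption, and the toy indices are all hypothetical:

```python
import torch

def caa_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """CAA-style direction: mean difference of residual-stream activations
    collected on positive vs. negative prompts; each input is (n, d_model)."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def trim_to_target_atoms(vec: torch.Tensor, sae_decoder: torch.Tensor,
                         atom_idx: torch.Tensor) -> torch.Tensor:
    """Keep only the components of `vec` along selected SAE atoms (rows of the
    decoder). Exact only for orthonormal decoder rows -- a sketch, not the
    repository's selection criterion."""
    coeffs = vec @ sae_decoder.T            # project onto all atoms: (n_atoms,)
    mask = torch.zeros_like(coeffs)
    mask[atom_idx] = 1.0                    # keep only the target atoms
    return (coeffs * mask) @ sae_decoder    # reconstruct the trimmed vector

# Toy usage with random tensors (shapes only; real activations come from the model):
d_model, n_atoms = 8, 32
vec = caa_vector(torch.randn(10, d_model), torch.randn(10, d_model))
trimmed = trim_to_target_atoms(vec, torch.randn(n_atoms, d_model), torch.tensor([3, 17]))
```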
## Steering the Behaviors of LLMs

You can steer the behaviors of LLMs with the steering vector:

```bash
bash run_main_table.sh
```
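Under the hood, steering amounts to adding the vector to the residual stream during generation. A minimal TransformerLens sketch; the layer, scale, and vector filename below are illustrative, not the values used by the scripts:

```python
import torch
from transformer_lens import HookedTransformer

# Substitute your own model_name_or_path here.
model = HookedTransformer.from_pretrained("gemma-2-9b-it")

# Hypothetical filename; use the actual file from the steering_vector/ directory.
steering_vector = torch.load("path/to/steering_vector/layer_20.pt")

scale = 4.0                              # illustrative steering strength
hook_name = "blocks.20.hook_resid_post"  # illustrative layer

def steer(resid, hook):
    # Add the steering direction at every token position of the residual stream.
    return resid + scale * steering_vector.to(resid.device, resid.dtype)

with model.hooks(fwd_hooks=[(hook_name, steer)]):
    print(model.generate("How can I stay safe online?", max_new_tokens=64))
```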
❗️ You should replace the value of `model_name_or_path` in the corresponding xx.sh file with your own model path.
Then evaluate the results:

```bash
bash run_eval.sh
```
This repository was developed for our STA paper. We also release EasyEdit2, a unified framework for controllable editing without retraining; it integrates multiple steering methods to ease usage and evaluation. Unlike this repository, EasyEdit2 does not depend on TransformerLens, and we recommend it for future research and applications.
## Citation

Please cite our paper if you use STA in your work:
```bibtex
@misc{wang2025STA,
  title={Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms},
  author={Mengru Wang and Ziwen Xu and Shengyu Mao and Shumin Deng and Zhaopeng Tu and Huajun Chen and Ningyu Zhang},
  year={2025},
  eprint={2505.20322},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```