During 2025, I plan to work on the fine-tuning of Anemoi models. This issue should work as a roadmap and info-dump, as well as a discussion ground for different users.
Current State
The current state of fine-tuning in Anemoi revolves around re-training the entire model.
While this makes sense in pre-training the dynamics with single-step rollout and then finalizing the training with multi-step rollout, it is disadvantageous when we want to fine-tune on data with different distributions. This is the naive fine-tuning approach:
1. Train model $M$ with weights $W$ on dataset (or task) $A$.
2. Train model $M'$, modifying the weights $W$ to $W'$, with a lower learning rate (and other tricks to make it work).
The disadvantage here is that the extensive pre-training of the model often captures nuances that may be "forgotten" when the whole model is retrained. This has been observed in practice and often leads to rather "finicky" training set-ups.
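To make the naive approach concrete, here is a minimal PyTorch sketch. It is illustrative only: the model is a stand-in, not an Anemoi model, and the checkpoint path is hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained model M with weights W (hypothetical, for illustration).
model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 32))
# model.load_state_dict(torch.load("pretrained.ckpt")["state_dict"])  # load W

# Naive fine-tuning: continue training *all* parameters, with a lower learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

x = torch.randn(8, 32)       # a batch from the new dataset/task
target = torch.randn(8, 32)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
optimizer.step()             # W -> W'
```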
Implementation Details
Warm starts and forking of a training run are currently implemented via several config entries in training (cf. anemoi-core/training/src/anemoi/training/config/training/default.yaml, lines 3 to 7 at d02e0bb):

```yaml
# resume or fork a training from a checkpoint last.ckpt or specified in hardware.files.warm_start
run_id: null
fork_run_id: null
transfer_learning: False # activate to perform transfer learning
load_weights_only: False # only load model weights, do not restore optimiser states etc.
```
This has implications for traceability and the automation of training pipelines. But it also has certain drawbacks: forking a run must be specified via a run_id that is expected to exist on the same system, which discourages collaboration and the sharing of checkpoints. These were design choices made early in the project, before we anticipated the breadth of adoption.
Proposed Solution
To address these limitations, I propose implementing a comprehensive fine-tuning capability in Anemoi that includes:
1. Enhanced Model Freezing
Building on PR #61, we need to extend the submodule freezing functionality to enable more granular control over which parts of the model are trainable during fine-tuning. This will allow (a minimal sketch follows the list):
Freezing arbitrary submodules at multiple levels of the model hierarchy
Partial freezing of specific parameter groups
Possibly even differential learning rates for different model components
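The sketch below is illustrative only: the submodule names (encoder, processor, decoder) and the plain PyTorch calls are assumptions, not the Anemoi freezing API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an encoder-processor-decoder model.
model = nn.ModuleDict({
    "encoder": nn.Linear(32, 64),
    "processor": nn.Linear(64, 64),
    "decoder": nn.Linear(64, 32),
})

# Freeze an arbitrary submodule: no gradients are computed for it.
model["encoder"].requires_grad_(False)

# Differential learning rates for different components via optimizer parameter groups.
optimizer = torch.optim.AdamW([
    {"params": model["processor"].parameters(), "lr": 1e-5},
    {"params": model["decoder"].parameters(), "lr": 1e-4},
])
```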
2. Integration with PEFT Library
Rather than implementing Parameter-Efficient Fine-Tuning (PEFT) methods from scratch, I propose integrating with Hugging Face's PEFT library (https://huggingface.co/docs/peft/). This will provide access to multiple state-of-the-art fine-tuning methods including:
LoRA (Low-Rank Adaptation): A technique that significantly reduces the number of trainable parameters by injecting trainable low-rank matrices into each layer of the model.
LoRA parameterizes the weight update as $\Delta W = BA$, where $B \in \mathbb{R}^{d\times r}$ and $A \in \mathbb{R}^{r\times k}$ are low-rank matrices with $r \ll \min(d,k)$. Instead of fine-tuning all parameters in $W \in \mathbb{R}^{d\times k}$, we only need to train the $r(d+k)$ parameters of $B$ and $A$, resulting in significant memory savings while maintaining model performance.
For a given matrix $W$, the output is computed as $h = Wx + BAx$, where only $BA$ is trained while $W$ remains frozen.
QLoRA: Quantized LoRA for even more memory-efficient fine-tuning
Prefix Tuning: Optimizes a small continuous task-specific vector (prefix) while keeping the model frozen
Prompt Tuning: Fine-tunes continuous prompts prepended to inputs
AdaLoRA: Adaptive budget allocation across weight matrices based on importance
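To give a sense of scale for the LoRA parameter count above: for a square weight matrix with $d = k = 1024$ and rank $r = 8$, full fine-tuning updates $1024 \times 1024 \approx 1.05$M parameters, while LoRA trains only $8 \times (1024 + 1024) = 16{,}384$, roughly 1.6% of them. The following from-scratch sketch shows the mechanics of a LoRA-wrapped linear layer; it is illustrative only and is neither the PEFT nor the Anemoi implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                  # W stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))         # B in R^{d x r}, zero-init so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = Wx + (alpha / r) * BAx
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384 = r * (d + k)
```

In practice we would rely on PEFT's LoraConfig and get_peft_model entry points rather than maintaining such wrappers ourselves; they apply this kind of wrapping to named target modules of an existing torch.nn.Module.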
3. Decoupling of Checkpoint Loading and IDs
Checkpoints should be loadable independently of run IDs and of the system they were created on; at a minimum, side-loading functionality should exist.
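As a sketch of what side-loading could look like, assuming a hypothetical explicit checkpoint path that bypasses run-ID lookup entirely (the function name and checkpoint layout below are assumptions, not existing Anemoi behaviour):

```python
import torch

def side_load_weights(model: torch.nn.Module, checkpoint_path: str) -> None:
    # Hypothetical: load weights from an explicit path (local or downloaded),
    # with no assumption that the producing run_id exists on this system.
    state = torch.load(checkpoint_path, map_location="cpu")
    # Tolerate partial overlap, e.g. when heads were added or removed.
    missing, unexpected = model.load_state_dict(state["state_dict"], strict=False)
    if missing or unexpected:
        print(f"side-load: {len(missing)} missing, {len(unexpected)} unexpected keys")
```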
4. Enhanced Configuration System
Expand the configuration system to support fine-tuning-specific settings. It would also be possible to implement an optional fine-tuning config; this will become clearer during implementation, I believe.
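Purely as a strawman, and assuming none of these keys exist yet, an optional fine-tuning config could group the relevant settings in one place:

```python
from dataclasses import dataclass, field

@dataclass
class FineTuningConfig:
    # Strawman schema for an optional fine-tuning config; all names are
    # hypothetical and subject to change during implementation.
    checkpoint_path: str = ""              # side-loaded checkpoint, no run_id needed
    frozen_submodules: list[str] = field(default_factory=list)
    peft_method: str = "lora"              # e.g. "lora", "qlora", "prefix"
    lora_rank: int = 8
    lora_alpha: float = 16.0
    learning_rate: float = 1e-5
```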
5. Integration with Training Pipelines
Collaborate with colleagues working on training pipeline automation to ensure:
Fine-tuning configurations can be versioned and tracked
Automated experimentation can be conducted across fine-tuning hyperparameters
Results from fine-tuning can be systematically compared and evaluated
6. Maintain Full Traceability of Training Settings
When implementing new configs and possible training pipelines, the different model training stages should be reflected in the provenance of the model, so that any model remains fully traceable back through its training history.
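One lightweight way to achieve this, sketched below under the assumption of a free-form checkpoint metadata dict (not a committed design), is to append one provenance record per training stage:

```python
import datetime

def record_stage(metadata: dict, stage: str, config_hash: str, parent: str | None) -> None:
    # Append one provenance record per training stage, so the full chain
    # pre-training -> rollout training -> fine-tuning is recorded in the checkpoint.
    metadata.setdefault("training_stages", []).append({
        "stage": stage,               # e.g. "pretrain", "rollout", "finetune-lora"
        "config_hash": config_hash,   # hash of the resolved training config
        "parent_checkpoint": parent,  # path or ID of the checkpoint this stage started from
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
```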
Implementation Plan
Phase 1: Foundation
Create initial integration with PEFT library for LoRA
Phase 2: Core Functionality
Complete PEFT integration with all major methods
Implement enhanced configuration system
Develop comprehensive testing suite for fine-tuning capabilities
Create documentation and examples
Phase 3: Advanced Features and Integration
Integrate with training pipeline automation
Optimize performance for large-scale fine-tuning
Create benchmarking tools for fine-tuning approaches
Success Criteria
The fine-tuning capability will be considered successful when:
Users can fine-tune Anemoi models with a single configuration change
Memory usage during fine-tuning is reduced by at least 70% compared to full fine-tuning
Fine-tuned models maintain or improve performance metrics compared to current approaches
Checkpoint sharing and collaboration becomes seamless across different systems
Documentation and examples make the system approachable for new users
Alternatives Considered
Custom Implementation of PEFT Methods
While implementing our own versions of PEFT methods would give us maximum control, it would require significant development and maintenance effort. The Hugging Face PEFT library is well-maintained, extensively tested, and continuously updated with new methods, making it a more sustainable choice.
Adapter-Based Approaches
Traditional adapter approaches insert new modules between existing layers. While effective, this can change the model architecture significantly. LoRA and similar methods preserve the original architecture while fine-tuning, which aligns better with our goals. (Although technically PEFT also uses adapters...)
Full Model Distillation
Knowledge distillation could be used to transfer knowledge from the pre-trained model to a task-specific model. However, this approach requires training a new model from scratch for each task, which is computationally expensive and doesn't leverage the efficiency gains of modern fine-tuning techniques.
Additional Context
This work on fine-tuning capabilities aligns with broader industry trends toward more efficient adaptation of large models. By implementing these capabilities in Anemoi, we'll enable users to:
Adapt global models to regional domains with minimal computational resources
Fine-tune on specific weather phenomena without degrading general performance
Build ensembles of specialized models derived from a common base
Collaborate more effectively by sharing and building upon each other's work
Very exciting! I also agree that building on top of peft is a good idea and I really like the proposed schema for checkpoint: .... I think Anemoi users will be really happy with this new schema given some of the problems we are having with loading training checkpoints.