During 2025, I plan to work on the fine-tuning of Anemoi models. This issue should work as a roadmap and info-dump, as well as a discussion ground for different users.
Current State
The current state of fine-tuning in Anemoi revolves around re-training the entire model.
While this makes sense in pre-training the dynamics with single-step rollout and then finalizing the training with multi-step rollout, it is disadvantageous when we want to fine-tune on data with different distributions. This is the naive fine-tuning approach:
1. Train model $M$ with weights $W$ on dataset (or task) $A$.
2. Train model $M'$, modifying the weights $W$ to $W'$, with a lower learning rate (and other tricks to make it work).
The disadvantage here is that the extensive pre-training of the model often captures nuances that may be "forgotten" when the whole model is retrained. This has been observed in practice and often leads to rather "finicky" training set-ups.
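To make the naive approach concrete, here is a minimal PyTorch sketch. It is illustrative only: the model is a stand-in, not an Anemoi model, and the checkpoint path is hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained model M with weights W (hypothetical, for illustration).
model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 32))
# model.load_state_dict(torch.load("pretrained.ckpt")["state_dict"])  # load W

# Naive fine-tuning: continue training *all* parameters, with a lower learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

x = torch.randn(8, 32)       # a batch from the new dataset/task
target = torch.randn(8, 32)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
optimizer.step()             # W -> W'
```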
Implementation Details
Warm starts and forking of a training run are currently implemented via several config entries in training (cf. anemoi-core/training/src/anemoi/training/config/training/default.yaml, lines 3 to 7 at d02e0bb):

```yaml
# resume or fork a training from a checkpoint last.ckpt or specified in hardware.files.warm_start
run_id: null
fork_run_id: null
transfer_learning: False # activate to perform transfer learning
load_weights_only: False # only load model weights, do not restore optimiser states etc.
```
This has implications for traceability and the automation of training pipelines. But it also has certain drawbacks: forking a run must be specified via a run_id that is expected to exist on the same system, which discourages collaboration and the sharing of checkpoints. These were design choices made early in the project, before we anticipated the breadth of adoption.
Proposed Solution
To address these limitations, I propose implementing a comprehensive fine-tuning capability in Anemoi that includes:
1. Enhanced Model Freezing
Building on PR #61, we need to extend the submodule freezing functionality to enable more granular control over which parts of the model are trainable during fine-tuning. This will allow (a minimal sketch follows the list):
Freezing arbitrary submodules at multiple levels of the model hierarchy
Partial freezing of specific parameter groups
Possibly even differential learning rates for different model components
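The sketch below is illustrative only: the submodule names (encoder, processor, decoder) and the plain PyTorch calls are assumptions, not the Anemoi freezing API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an encoder-processor-decoder model.
model = nn.ModuleDict({
    "encoder": nn.Linear(32, 64),
    "processor": nn.Linear(64, 64),
    "decoder": nn.Linear(64, 32),
})

# Freeze an arbitrary submodule: no gradients are computed for it.
model["encoder"].requires_grad_(False)

# Differential learning rates for different components via optimizer parameter groups.
optimizer = torch.optim.AdamW([
    {"params": model["processor"].parameters(), "lr": 1e-5},
    {"params": model["decoder"].parameters(), "lr": 1e-4},
])
```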
2. Integration with PEFT Library
Rather than implementing Parameter-Efficient Fine-Tuning (PEFT) methods from scratch, I propose integrating with Hugging Face's PEFT library (https://huggingface.co/docs/peft/). This will provide access to multiple state-of-the-art fine-tuning methods including:
LoRA (Low-Rank Adaptation): A technique that significantly reduces the number of trainable parameters by injecting trainable low-rank matrices into each layer of the model.
LoRA parameterizes the weight update as $\Delta W = BA$, where $B \in \mathbb{R}^{d\times r}$ and $A \in \mathbb{R}^{r\times k}$ are low-rank matrices with $r \ll \min(d,k)$. Instead of fine-tuning all parameters in $W \in \mathbb{R}^{d\times k}$, we only need to train the $r(d+k)$ parameters of $B$ and $A$, resulting in significant memory savings while maintaining model performance.
For a given matrix $W$, the output is computed as $h = Wx + BAx$, where only $BA$ is trained while $W$ remains frozen.
QLoRA: Quantized LoRA for even more memory-efficient fine-tuning
Prefix Tuning: Optimizes a small continuous task-specific vector (prefix) while keeping the model frozen
Prompt Tuning: Fine-tunes continuous prompts prepended to inputs
AdaLoRA: Adaptive budget allocation across weight matrices based on importance
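To give a sense of scale for the LoRA parameter count above: for a square weight matrix with $d = k = 1024$ and rank $r = 8$, full fine-tuning updates $1024 \times 1024 \approx 1.05$M parameters, while LoRA trains only $8 \times (1024 + 1024) = 16{,}384$, roughly 1.6% of them. The following from-scratch sketch shows the mechanics of a LoRA-wrapped linear layer; it is illustrative only and is neither the PEFT nor the Anemoi implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                  # W stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))         # B in R^{d x r}, zero-init so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = Wx + (alpha / r) * BAx
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384 = r * (d + k)
```

In practice we would rely on PEFT's LoraConfig and get_peft_model entry points rather than maintaining such wrappers ourselves; they apply this kind of wrapping to named target modules of an existing torch.nn.Module.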
3. Decoupling of Checkpoint Loading and IDs
Checkpoints should be loadable independently of run IDs and of the system they were created on; at a minimum, side-loading functionality should exist.
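As a sketch of what side-loading could look like, assuming a hypothetical explicit checkpoint path that bypasses run-ID lookup entirely (the function name and checkpoint layout below are assumptions, not existing Anemoi behaviour):

```python
import torch

def side_load_weights(model: torch.nn.Module, checkpoint_path: str) -> None:
    # Hypothetical: load weights from an explicit path (local or downloaded),
    # with no assumption that the producing run_id exists on this system.
    state = torch.load(checkpoint_path, map_location="cpu")
    # Tolerate partial overlap, e.g. when heads were added or removed.
    missing, unexpected = model.load_state_dict(state["state_dict"], strict=False)
    if missing or unexpected:
        print(f"side-load: {len(missing)} missing, {len(unexpected)} unexpected keys")
```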
4. Enhanced Configuration System
Expand the configuration system to support fine-tuning-specific settings. It would also be possible to implement an optional fine-tuning config; this will become clearer during implementation, I believe.
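Purely as a strawman, and assuming none of these keys exist yet, an optional fine-tuning config could group the relevant settings in one place:

```python
from dataclasses import dataclass, field

@dataclass
class FineTuningConfig:
    # Strawman schema for an optional fine-tuning config; all names are
    # hypothetical and subject to change during implementation.
    checkpoint_path: str = ""              # side-loaded checkpoint, no run_id needed
    frozen_submodules: list[str] = field(default_factory=list)
    peft_method: str = "lora"              # e.g. "lora", "qlora", "prefix"
    lora_rank: int = 8
    lora_alpha: float = 16.0
    learning_rate: float = 1e-5
```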
5. Integration with Training Pipelines
Collaborate with colleagues working on training pipeline automation to ensure:
Fine-tuning configurations can be versioned and tracked
Automated experimentation can be conducted across fine-tuning hyperparameters
Results from fine-tuning can be systematically compared and evaluated
6. Maintain Full Traceability of Training Settings
When implementing new configs and possible training pipelines, the different model training stages should be reflected in the provenance of the model, so that any model remains fully traceable back through its training history.
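One lightweight way to achieve this, sketched below under the assumption of a free-form checkpoint metadata dict (not a committed design), is to append one provenance record per training stage:

```python
import datetime

def record_stage(metadata: dict, stage: str, config_hash: str, parent: str | None) -> None:
    # Append one provenance record per training stage, so the full chain
    # pre-training -> rollout training -> fine-tuning is recorded in the checkpoint.
    metadata.setdefault("training_stages", []).append({
        "stage": stage,               # e.g. "pretrain", "rollout", "finetune-lora"
        "config_hash": config_hash,   # hash of the resolved training config
        "parent_checkpoint": parent,  # path or ID of the checkpoint this stage started from
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
```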
Implementation Plan
Phase 1: Foundation
Create initial integration with PEFT library for LoRA
Phase 2: Core Functionality
Complete PEFT integration with all major methods
Implement enhanced configuration system
Develop comprehensive testing suite for fine-tuning capabilities
Create documentation and examples
Phase 3: Advanced Features and Integration
Integrate with training pipeline automation
Optimize performance for large-scale fine-tuning
Create benchmarking tools for fine-tuning approaches
Success Criteria
The fine-tuning capability will be considered successful when:
Users can fine-tune Anemoi models with a single configuration change
Memory usage during fine-tuning is reduced by at least 70% compared to full fine-tuning
Fine-tuned models maintain or improve performance metrics compared to current approaches
Checkpoint sharing and collaboration becomes seamless across different systems
Documentation and examples make the system approachable for new users
Alternatives Considered
Custom Implementation of PEFT Methods
While implementing our own versions of PEFT methods would give us maximum control, it would require significant development and maintenance effort. The Hugging Face PEFT library is well-maintained, extensively tested, and continuously updated with new methods, making it a more sustainable choice.
Adapter-Based Approaches
Traditional adapter approaches insert new modules between existing layers. While effective, this can change the model architecture significantly. LoRA and similar methods preserve the original architecture while fine-tuning, which aligns better with our goals. (Although technically PEFT also uses adapters...)
Full Model Distillation
Knowledge distillation could be used to transfer knowledge from the pre-trained model to a task-specific model. However, this approach requires training a new model from scratch for each task, which is computationally expensive and doesn't leverage the efficiency gains of modern fine-tuning techniques.
Additional Context
This work on fine-tuning capabilities aligns with broader industry trends toward more efficient adaptation of large models. By implementing these capabilities in Anemoi, we'll enable users to:
Adapt global models to regional domains with minimal computational resources
Fine-tune on specific weather phenomena without degrading general performance
Build ensembles of specialized models derived from a common base
Collaborate more effectively by sharing and building upon each other's work
Very exciting! I also agree that building on top of peft is a good idea and I really like the proposed schema for checkpoint: .... I think Anemoi users will be really happy with this new schema given some of the problems we are having with loading training checkpoints.