
Anemoi [Fine-tuning, Transfer Learning, Model Freezing] Roadmap #248


Open · 9 tasks

JesperDramsch opened this issue Apr 10, 2025 · 1 comment

JesperDramsch commented Apr 10, 2025

During 2025, I plan to work on the fine-tuning of Anemoi models. This issue should work as a roadmap and info-dump, as well as a discussion ground for different users.

Current State

The current state of fine-tuning in Anemoi revolves around re-training the entire model.

While this makes sense in pre-training the dynamics with single-step rollout and then finalizing the training with multi-step rollout, it is disadvantageous when we want to fine-tune on data with different distributions. This is the naive fine-tuning approach:

  1. Train model $M$ with weights $W$ on dataset (or task) $A$
  2. Train model $M'$ modifying the weights $W$ to $W'$ with a lower learning rate (and other tricks to make it work)

The main disadvantage is that the extensive pre-training often captures nuances that can be "forgotten" when the whole model is retrained (catastrophic forgetting). We have observed this in practice, and it often leads to "finicky" training set-ups.
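A minimal, self-contained sketch of this naive approach with a toy stand-in model (the real Anemoi model, checkpoint layout and training loop differ):

import torch
import torch.nn as nn

# Toy stand-in for a pre-trained model M with weights W.
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
# In practice W would be restored from a checkpoint, e.g. (hypothetical path/keys):
# model.load_state_dict(torch.load("pretrained.ckpt", map_location="cpu")["state_dict"])

# Every parameter stays trainable; only the learning rate is lowered.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

x, target = torch.randn(8, 64), torch.randn(8, 64)   # dummy batch from the new dataset/task
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()     # gradients reach every weight in W,
optimizer.step()    # so pre-trained behaviour can be overwritten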

Implementation Details

Warm starts and forking of a training run are currently controlled via multiple config entries in training:

# resume or fork a training from a checkpoint last.ckpt or specified in hardware.files.warm_start
run_id: null
fork_run_id: null
transfer_learning: False # activate to perform transfer learning
load_weights_only: False # only load model weights, do not restore optimiser states etc.


This has implications for traceability and for the automation of training pipelines, but it also has drawbacks: forking a run has to be specified via a run_id that is expected to exist on the same system, which discourages collaboration and the sharing of checkpoints. These were design choices made early in the project, before we anticipated the breadth of adoption.
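For illustration only (the run id is a placeholder, and the exact combination of flags depends on the intended behaviour), forking an existing run for transfer learning currently looks roughly like:

run_id: null
fork_run_id: "2f3a9c1e"      # placeholder id of a run that must exist on the same system
transfer_learning: True      # treat the forked checkpoint as a transfer-learning start
load_weights_only: True      # drop optimiser state from the forked run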

Proposed Solution

To address these limitations, I propose implementing a comprehensive fine-tuning capability in Anemoi that includes:

1. Enhanced Model Freezing

Building on PR #61, we need to extend the submodule freezing functionality to enable more granular control over which parts of the model are trainable during fine-tuning (a minimal sketch follows the list below). This will allow:

  • Freezing arbitrary submodules at multiple levels of the model hierarchy
  • Partial freezing of specific parameter groups
  • Possibly even differential learning rates for different model components
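A minimal sketch of pattern-based freezing on a toy model (the helper name and module names are illustrative, not existing Anemoi API; a real Anemoi model hierarchy differs):

import re
import torch.nn as nn

def freeze_matching(model: nn.Module, patterns: list[str]) -> None:
    """Freeze every parameter whose dotted path matches one of the regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    for name, param in model.named_parameters():
        if any(rx.match(name) for rx in compiled):
            param.requires_grad = False

model = nn.ModuleDict({
    "encoder": nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8)),
    "decoder": nn.Sequential(nn.Linear(8, 8)),
})
freeze_matching(model, [r"encoder\..*"])
print([n for n, p in model.named_parameters() if p.requires_grad])  # only decoder params remain trainable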

2. Integration with PEFT Library

Rather than implementing Parameter-Efficient Fine-Tuning (PEFT) methods from scratch, I propose integrating with Hugging Face's PEFT library (https://huggingface.co/docs/peft/). This will provide access to multiple state-of-the-art fine-tuning methods including:

  • LoRA (Low-Rank Adaptation): A technique that significantly reduces the number of trainable parameters by injecting trainable low-rank matrices into each layer of the model.

LoRA parameterizes the weight updates as $\Delta W = BA$, where $B \in \mathbb{R}^{d\times r}$ and $A \in \mathbb{R}^{r\times k}$ are low-rank matrices ($r \ll \min(d,k)$). Instead of fine-tuning all parameters in $W \in \mathbb{R}^{d\times k}$, we only need to train $r(d+k)$ parameters, resulting in significant memory savings while maintaining model performance.

For a given weight matrix $W$, the output is computed as $h = Wx + BAx$, where only $B$ and $A$ are trained while $W$ remains frozen (see the library sketch after this list).

  • QLoRA: Quantized LoRA for even more memory-efficient fine-tuning
  • Prefix Tuning: Optimizes a small continuous task-specific vector (prefix) while keeping the model frozen
  • Prompt Tuning: Fine-tunes continuous prompts prepended to inputs
  • AdaLoRA: Adaptive budget allocation across weight matrices based on importance
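As a rough sketch of what the PEFT integration could look like on a plain torch module (the toy model, layer names and hyperparameters are assumptions; the actual wiring into Anemoi models and configs is exactly what this roadmap item needs to define):

import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class TinyProcessor(nn.Module):
    """Toy stand-in for a model component we might want to adapt with LoRA."""
    def __init__(self) -> None:
        super().__init__()
        self.proj_in = nn.Linear(64, 64)
        self.proj_out = nn.Linear(64, 64)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj_out(torch.relu(self.proj_in(x)))

lora_config = LoraConfig(
    r=8,                                     # rank of the update matrices B and A
    lora_alpha=16,                           # scaling factor applied to BA
    lora_dropout=0.1,
    target_modules=["proj_in", "proj_out"],  # hypothetical layer names
)
peft_model = get_peft_model(TinyProcessor(), lora_config)
peft_model.print_trainable_parameters()      # only the injected LoRA matrices are trainable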

3. Decoupling of Checkpoint Loading and IDs

Checkpoints should be loadable independently of run IDs and of the system they were created on, or at least a side-loading mechanism should exist.
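A sketch of what such side-loading could look like (the checkpoint key layout and function name are assumptions, not the current Anemoi format):

import torch
import torch.nn as nn

def side_load_weights(model: nn.Module, checkpoint_path: str, strict: bool = False):
    """Load weights from an arbitrary local path, independent of any run-id bookkeeping."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)          # tolerate raw or wrapped state dicts
    missing, unexpected = model.load_state_dict(state_dict, strict=strict)
    return missing, unexpected                         # report mismatches instead of failing hard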

4. Enhanced Configuration System

Expand the configuration system to support:

training:
  fine_tuning:
    enabled: True
    strategy: "lora"  # Options: "full", "freeze", "lora", "qlora", "prompt", "prefix", etc.
    checkpoint:
      source: "s3://anemoi-models/global-10day/v1.2.3/checkpoint.pt"  # Remote sources supported
      local_cache: "~/.anemoi/cache/"
    peft:
      rank: 8  # For LoRA-based methods
      alpha: 16  # Scaling factor
      dropout: 0.1
    freeze:
      modules: ["encoder.block.0", "encoder.block.1"]  # Explicit module paths
      patterns: ["encoder.block.[2-11]", "processor.*"]  # Regex patterns supported
    optimizer:
      differential_lr: True
      lr_groups:
        - modules: ["decoder.*"]
          lr: 1e-4
        - modules: ["peft_layers.*"]
          lr: 5e-4
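A sketch of how the freeze patterns and differential learning rates above could be applied, assuming the config entries are resolved into plain Python lists (function and variable names are illustrative):

import re
import torch

def build_param_groups(model: torch.nn.Module, lr_groups: list, default_lr: float = 1e-5) -> list:
    """Map lr_groups entries like {"modules": ["decoder.*"], "lr": 1e-4} onto optimizer groups."""
    name_to_param = dict(model.named_parameters())
    groups, assigned = [], set()
    for entry in lr_groups:
        patterns = [re.compile(p) for p in entry["modules"]]
        names = [n for n in name_to_param
                 if n not in assigned and any(rx.match(n) for rx in patterns)]
        assigned.update(names)
        groups.append({"params": [name_to_param[n] for n in names], "lr": float(entry["lr"])})
    # anything not matched by an explicit group falls back to the default learning rate
    groups.append({"params": [p for n, p in name_to_param.items() if n not in assigned],
                   "lr": default_lr})
    return groups

# e.g. optimizer = torch.optim.AdamW(build_param_groups(model, lr_groups))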

It would also be possible to implement an optional, separate fine-tuning config; this will become clearer during implementation.

5. Integration with Training Pipelines

Collaborate with colleagues working on training pipeline automation to ensure:

  • Fine-tuning configurations can be versioned and tracked
  • Automated experimentation can be conducted across fine-tuning hyperparameters
  • Results from fine-tuning can be systematically compared and evaluated

6. Maintain full traceability of training settings

When implementing new configs and possible training pipelines, the different training stages should be reflected in the model's provenance so that a model remains fully traceable.

Implementation Plan

Phase 1: Foundation

Phase 2: Core Functionality

  • Complete PEFT integration with all major methods
  • Implement enhanced configuration system
  • Develop comprehensive testing suite for fine-tuning capabilities
  • Create documentation and examples

Phase 3: Advanced Features and Integration

  • Integrate with training pipeline automation
  • Optimize performance for large-scale fine-tuning
  • Create benchmarking tools for fine-tuning approaches

Success Criteria

The fine-tuning capability will be considered successful when:

  1. Users can fine-tune Anemoi models with a single configuration change
  2. Memory usage during fine-tuning is reduced by at least 70% compared to full fine-tuning (see the worked example after this list)
  3. Fine-tuned models maintain or improve performance metrics compared to current approaches
  4. Checkpoint sharing and collaboration becomes seamless across different systems
  5. Documentation and examples make the system approachable for new users
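
As a rough illustration of criterion 2 with purely hypothetical sizes: for a single square weight matrix with $d = k = 1024$ and LoRA rank $r = 8$, full fine-tuning updates $d \times k = 1048576$ parameters, whereas LoRA trains only $r(d+k) = 16384$, about 1.6% of that. End-to-end memory savings are smaller, since the frozen weights and the activations still have to be held in memory, but this is the kind of headroom that makes the 70% target realistic.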

Alternatives Considered

Custom Implementation of PEFT Methods

While implementing our own versions of PEFT methods would give us maximum control, it would require significant development and maintenance effort. The Hugging Face PEFT library is well-maintained, extensively tested, and continuously updated with new methods, making it a more sustainable choice.

Adapter-Based Approaches

Traditional adapter approaches insert new modules between existing layers. While effective, this can change the model architecture significantly. LoRA and similar methods preserve the original architecture while fine-tuning, which aligns better with our goals. (Although technically PEFT also uses adapters...)

Full Model Distillation

Knowledge distillation could be used to transfer knowledge from the pre-trained model to a task-specific model. However, this approach requires training a new model from scratch for each task, which is computationally expensive and doesn't leverage the efficiency gains of modern fine-tuning techniques.

Additional Context

This work on fine-tuning capabilities aligns with broader industry trends toward more efficient adaptation of large models. By implementing these capabilities in Anemoi, we'll enable users to:

  1. Adapt global models to regional domains with minimal computational resources
  2. Fine-tune on specific weather phenomena without degrading general performance
  3. Build ensembles of specialized models derived from a common base
  4. Collaborate more effectively by sharing and building upon each other's work

Organization

ECMWF

@JesperDramsch JesperDramsch added the enhancement New feature or request label Apr 10, 2025
@JesperDramsch JesperDramsch self-assigned this Apr 10, 2025
@JesperDramsch JesperDramsch added this to the Fine-Tuning milestone Apr 10, 2025

JPXKQX commented Apr 11, 2025

Very exciting! I also agree that building on top of peft is a good idea and I really like the proposed schema for checkpoint: .... I think Anemoi users will be really happy with this new schema given some of the problems we are having with loading training checkpoints.

@JesperDramsch JesperDramsch changed the title Anemoi Fine-tuning Roadmap Anemoi [Fine-tuning, Transfer Learning, Model Freezing] Roadmap May 13, 2025