Integrate TLoRA (Tri-Matrix LoRA) #2533


Open
itanvir opened this issue May 6, 2025 · 1 comment

Comments

@itanvir

itanvir commented May 6, 2025

Feature request

We would like to propose integrating a novel parameter-efficient fine-tuning method called TLoRA (Tri-Matrix LoRA) into the peft library. We believe TLoRA offers significant advantages in terms of parameter efficiency, making it a valuable addition to the PEFT ecosystem.

Our method is detailed in the paper: https://arxiv.org/abs/2504.18735

What is TLoRA?

TLoRA is a variation of LoRA that introduces a tri-matrix decomposition of the weight update matrix $\Delta W$. Instead of the standard LoRA update $W + A B$, TLoRA uses $W + \alpha A B C$, where:

  • $W$ is the original pre-trained weight matrix.
  • $A$ is a fixed, non-trainable matrix (e.g., initialized randomly or using Kaiming/Xavier).
  • $B$ is the only trainable matrix.
  • $C$ is another fixed, non-trainable matrix (similar initialization as A).
  • $\alpha$ is a trainable scaling parameter.

The $\Delta W$ update is computed as the product of three matrices: a fixed input projection matrix $A$, a small trainable bottleneck matrix $B$, and a fixed output projection matrix $C$. Only the rank x rank matrix $B$ and the scalar $\alpha$ are updated during fine-tuning.

TLoRA Implementation:

The core idea can be represented in a layer similar to this (based on our implementation):

import math

import torch
import torch.nn as nn


class TLoRALayer(nn.Module):
    def __init__(self, weight, bias, rank=32):
        super().__init__()

        row, column = weight.shape  # (out_features, in_features)

        # Rebuild the original Linear layer from the pretrained weight (and bias)
        if bias is None:
            self.linear = nn.Linear(column, row, bias=False)
            self.linear.load_state_dict({"weight": weight})
        else:
            self.linear = nn.Linear(column, row)
            self.linear.load_state_dict({"weight": weight, "bias": bias})

        # Fixed (non-trainable) input projection A: (in_features, rank)
        self.random_A = nn.Parameter(torch.zeros(column, rank), requires_grad=False)
        nn.init.kaiming_normal_(self.random_A, a=math.sqrt(5))

        # Trainable bottleneck matrix B: (rank, rank)
        self.lora_B = nn.Parameter(torch.zeros(rank, rank))

        # Fixed (non-trainable) output projection C: (rank, out_features)
        self.random_C = nn.Parameter(torch.zeros(rank, row), requires_grad=False)
        nn.init.kaiming_normal_(self.random_C, a=math.sqrt(5))

        # Trainable scalar scaling (alpha) and dropout on the adapter branch
        self.lora_scaling = nn.Parameter(torch.ones(1))
        self.dropout = nn.Dropout(0.5)

    def forward(self, input):
        # Base linear transformation with the pretrained weights
        x = self.linear(input)

        # Tri-matrix low-rank update: alpha * (input @ A @ B @ C)
        y = self.lora_scaling * (input @ self.random_A @ self.lora_B @ self.random_C)
        y = self.dropout(y)

        return x + y
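
For illustration, here is a minimal usage sketch (not part of the TLoRA repo; the 768-dimensional layer and the explicit freezing of the rebuilt base weights are assumptions for this example) showing how the layer above could wrap an existing nn.Linear and how few parameters remain trainable:

import torch
import torch.nn as nn

# Hypothetical pretrained layer; in practice this comes from the base model.
base = nn.Linear(768, 768)

tlora = TLoRALayer(base.weight.data, base.bias.data, rank=32)

# The class rebuilds the base Linear as trainable, so freeze it explicitly
# to leave only B and the scaling alpha trainable.
for p in tlora.linear.parameters():
    p.requires_grad_(False)

trainable = sum(p.numel() for p in tlora.parameters() if p.requires_grad)
print(trainable)  # 32 * 32 + 1 = 1025

x = torch.randn(4, 768)
print(tlora(x).shape)  # torch.Size([4, 768])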

Full repo: https://github.com/itanvir/tlora

Motivation

  1. Extreme Parameter Efficiency: The core trainable component in TLoRA is the matrix $B$ with dimensions rank x rank (plus the scalar $\alpha$). Compared to standard LoRA's trainable matrices $A$ (input_dim x rank) and $B$ (rank x output_dim), TLoRA trains significantly fewer parameters, making it potentially one of the most parameter-efficient methods in PEFT for a given rank (a rough count is sketched after this list).
  2. Competitive Performance: The fixed matrices $A$ and $C$ can be seen as defining fixed subspaces. By training only the matrix $B$ connecting these subspaces, TLoRA might capture more focused and effective updates compared to training the full $A$ and $B$ matrices in standard LoRA. Our paper provides empirical evidence supporting its effectiveness.
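
As a rough, illustrative count for point 1 (the layer dimensions are hypothetical, chosen only for the arithmetic):

d_in, d_out, r = 768, 768, 32

# Standard LoRA trains A (d_in x r) and B (r x d_out).
lora_trainable = d_in * r + r * d_out  # 49,152

# TLoRA trains only B (r x r) plus the scalar alpha; A and C stay fixed.
tlora_trainable = r * r + 1  # 1,025

print(lora_trainable, tlora_trainable)  # 49152 1025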

Your contribution

I can give input on the design. The integration should be straightforward.

@githubnemo
Collaborator

Hey @itanvir,

thank you for the recommendation.

At first glance this looks very similar to MosLoRA (see also: #1905 and #2013). There was a bit of discussion regarding the effectiveness of that approach due to the linear nature of the formulation (composing maps N -> r -> r -> M is mathematically equivalent to a single N -> r -> M composition; a small numerical sketch of this follows below). But I think that the gradient freezing (keeping A and C fixed) justifies this formulation. If we proceed with this, it may be worthwhile to combine implementation efforts. This is probably well suited to be implemented as a LoRA variant.
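
To illustrate that equivalence with a quick numerical sketch (random matrices, not the TLoRA code): by associativity the tri-matrix product folds into an ordinary two-matrix update of rank at most r, so the expressiveness matches a rank-r LoRA while only the r x r middle factor (and the scaling) is trained.

import torch

N, r, M = 64, 8, 32
A = torch.randn(N, r)  # fixed input projection
B = torch.randn(r, r)  # trainable r x r mixer
C = torch.randn(r, M)  # fixed output projection

delta_tri = A @ (B @ C)  # tri-matrix update as applied in the forward pass
delta_two = (A @ B) @ C  # same map folded into two matrices

print(torch.allclose(delta_tri, delta_two, atol=1e-5))  # True: identical linear map
print(torch.linalg.matrix_rank(delta_tri).item())       # at most r (here 8)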

It also reminds me of VeRA, which has an even more drastic reduction of trainable parameters from what I can tell. The referenced paper reports VeRA as well, but with a higher parameter count. How was that count obtained? I'd have thought that VeRA would have a lower count since it also shares A and B across the model. Do you know?
It would also be interesting to see what the main benefit over VeRA would be.

I hope this does not sound dismissive, but as a maintainer it is important to assess whether a method is distinctive enough to be worth the added maintenance effort.
