Integrate TLoRA (Tri-Matrix LoRA) #2533


Open
itanvir opened this issue May 6, 2025 · 1 comment

Comments

@itanvir

itanvir commented May 6, 2025

Feature request

We would like to propose integrating a novel parameter-efficient fine-tuning method called TLoRA (Tri-Matrix LoRA) into the peft library. We believe TLoRA offers significant advantages in terms of parameter efficiency, making it a valuable addition to the PEFT ecosystem.

Our method is detailed in the paper: https://arxiv.org/abs/2504.18735

What is TLoRA?

TLoRA is a variation of LoRA that introduces a tri-matrix decomposition of the weight update matrix $\Delta W$. Instead of the standard LoRA update $W + A B$, TLoRA uses $W + \alpha A B C$, where:

  • $W$ is the original pre-trained weight matrix.
  • $A$ is a fixed, non-trainable matrix (e.g., initialized randomly or using Kaiming/Xavier).
  • $B$ is the only trainable matrix.
  • $C$ is another fixed, non-trainable matrix (similar initialization as A).
  • $\alpha$ is a trainable scaling parameter.

The $\Delta W$ update is computed as the product of three matrices: a fixed input projection matrix $A$, a small trainable bottleneck matrix $B$, and a fixed output projection matrix $C$. Only the rank x rank matrix $B$ and the scalar $\alpha$ are updated during fine-tuning.

TLoRA Implementation:

The core idea can be represented in a layer similar to this (based on our implementation):

import math

import torch
import torch.nn as nn


class TLoRALayer(nn.Module):
    def __init__(self, weight, bias, rank=32):
        super().__init__()

        row, column = weight.shape  # (out_features, in_features)

        # Rebuild the original Linear layer from the pretrained weight (and bias)
        if bias is None:
            self.linear = nn.Linear(column, row, bias=False)
            self.linear.load_state_dict({"weight": weight})
        else:
            self.linear = nn.Linear(column, row)
            self.linear.load_state_dict({"weight": weight, "bias": bias})

        # Fixed (non-trainable) input projection A: (in_features, rank)
        self.random_A = nn.Parameter(torch.zeros(column, rank), requires_grad=False)
        nn.init.kaiming_normal_(self.random_A, a=math.sqrt(5))

        # Trainable bottleneck matrix B: (rank, rank)
        self.lora_B = nn.Parameter(torch.zeros(rank, rank))

        # Fixed (non-trainable) output projection C: (rank, out_features)
        self.random_C = nn.Parameter(torch.zeros(rank, row), requires_grad=False)
        nn.init.kaiming_normal_(self.random_C, a=math.sqrt(5))

        # Trainable scalar scaling (alpha) and dropout on the adapter branch
        self.lora_scaling = nn.Parameter(torch.ones(1))
        self.dropout = nn.Dropout(0.5)

    def forward(self, input):
        # Base linear transformation with the pretrained weights
        x = self.linear(input)

        # Tri-matrix low-rank update: alpha * (input @ A @ B @ C)
        y = self.lora_scaling * (input @ self.random_A @ self.lora_B @ self.random_C)
        y = self.dropout(y)

        return x + y
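
For illustration, here is a minimal usage sketch (not part of the TLoRA repo; the 768-dimensional layer and the explicit freezing of the rebuilt base weights are assumptions for this example) showing how the layer above could wrap an existing nn.Linear and how few parameters remain trainable:

import torch
import torch.nn as nn

# Hypothetical pretrained layer; in practice this comes from the base model.
base = nn.Linear(768, 768)

tlora = TLoRALayer(base.weight.data, base.bias.data, rank=32)

# The class rebuilds the base Linear as trainable, so freeze it explicitly
# to leave only B and the scaling alpha trainable.
for p in tlora.linear.parameters():
    p.requires_grad_(False)

trainable = sum(p.numel() for p in tlora.parameters() if p.requires_grad)
print(trainable)  # 32 * 32 + 1 = 1025

x = torch.randn(4, 768)
print(tlora(x).shape)  # torch.Size([4, 768])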

Full repo: https://github.com/itanvir/tlora

Motivation

  1. Extreme Parameter Efficiency: The core trainable component in TLoRA is the matrix $B$ with dimensions rank x rank (plus the scalar $\alpha$). Compared to standard LoRA's trainable matrices $A$ (input_dim x rank) and $B$ (rank x output_dim), TLoRA trains significantly fewer parameters, making it potentially one of the most parameter-efficient methods in PEFT for a given rank (a rough count is sketched after this list).
  2. Competitive Performance: The fixed matrices $A$ and $C$ can be seen as defining fixed subspaces. By training only the matrix $B$ connecting these subspaces, TLoRA might capture more focused and effective updates compared to training the full $A$ and $B$ matrices in standard LoRA. Our paper provides empirical evidence supporting its effectiveness.
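
As a rough, illustrative count for point 1 (the layer dimensions are hypothetical, chosen only for the arithmetic):

d_in, d_out, r = 768, 768, 32

# Standard LoRA trains A (d_in x r) and B (r x d_out).
lora_trainable = d_in * r + r * d_out  # 49,152

# TLoRA trains only B (r x r) plus the scalar alpha; A and C stay fixed.
tlora_trainable = r * r + 1  # 1,025

print(lora_trainable, tlora_trainable)  # 49152 1025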

Your contribution

I can give input on the design. The integration should be straightforward.

@githubnemo
Collaborator

Hey @itanvir,

thank you for the recommendation.

At first glance this looks very similar to MosLoRA (see also: #1905 and #2013). There was a bit of discussion regarding the effectiveness of that approach due to the linear nature of the formulation (composing maps N -> r -> r -> M is mathematically equivalent to a single N -> r -> M composition; a small numerical sketch of this follows below). But I think that the gradient freezing (keeping A and C fixed) justifies this formulation. If we proceed with this, it may be worthwhile to combine implementation efforts. This is probably well suited to be implemented as a LoRA variant.
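
To illustrate that equivalence with a quick numerical sketch (random matrices, not the TLoRA code): by associativity the tri-matrix product folds into an ordinary two-matrix update of rank at most r, so the expressiveness matches a rank-r LoRA while only the r x r middle factor (and the scaling) is trained.

import torch

N, r, M = 64, 8, 32
A = torch.randn(N, r)  # fixed input projection
B = torch.randn(r, r)  # trainable r x r mixer
C = torch.randn(r, M)  # fixed output projection

delta_tri = A @ (B @ C)  # tri-matrix update as applied in the forward pass
delta_two = (A @ B) @ C  # same map folded into two matrices

print(torch.allclose(delta_tri, delta_two, atol=1e-5))  # True: identical linear map
print(torch.linalg.matrix_rank(delta_tri).item())       # at most r (here 8)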

It also reminds me of VeRA, which has an even more drastic reduction of trainable parameters from what I can tell. The referenced paper reports VeRA as well, but with a higher parameter count. How was that count obtained? I'd have thought that VeRA would have a lower count since it also shares A and B across the model. Do you know?
It would also be interesting to see what the main benefit over VeRA would be.

I hope this does not sound dismissive, but as a maintainer it is important to assess whether a method is distinctive enough to be worth the added maintenance effort.
