Feature request

We would like to propose integrating a novel parameter-efficient fine-tuning method called TLoRA (Tri-Matrix LoRA) into the peft library. We believe TLoRA offers significant advantages in terms of parameter efficiency, making it a valuable addition to the PEFT ecosystem. Our method is detailed in the paper: https://arxiv.org/abs/2504.18735
What is TLoRA?

TLoRA is a variation of LoRA that introduces a tri-matrix decomposition for the weight update matrix $\Delta W$. Instead of the standard $W + A B$, TLoRA uses $W + \alpha A B C$, where:
$W$ is the original pre-trained weight matrix.
$A$ is a fixed, non-trainable matrix (e.g., initialized randomly or using Kaiming/Xavier).
$B$ is the only trainable matrix.
$C$ is another fixed, non-trainable matrix (initialized similarly to $A$).
$\alpha$ is a trainable scaling parameter.
The $\Delta W$ update is computed as the product of three matrices: a fixed input projection matrix $A$, a small trainable bottleneck matrix $B$, and a fixed output projection matrix $C$. Only matrix $B$ is updated during fine-tuning.
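Written with explicit shapes (a sketch only; $d_{in}$, $d_{out}$, and $r$ are our placeholder names for the input dimension, output dimension, and rank, matching the reference layer below):

$$h = x W^\top + \alpha \, (x A)\, B\, C, \qquad A \in \mathbb{R}^{d_{in} \times r},\quad B \in \mathbb{R}^{r \times r},\quad C \in \mathbb{R}^{r \times d_{out}},$$

with only $B$ and $\alpha$ receiving gradients during fine-tuning.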
TLoRA Implementation:
The core idea can be represented in a layer similar to this (based on our implementation):
import math

import torch
import torch.nn as nn


class TLoRALayer(nn.Module):
    def __init__(self, weight, bias, rank=32):
        super(TLoRALayer, self).__init__()
        row, column = weight.shape

        # Restore the original Linear layer
        if bias is None:
            self.linear = nn.Linear(column, row, bias=False)
            self.linear.load_state_dict({"weight": weight})
        else:
            self.linear = nn.Linear(column, row)
            self.linear.load_state_dict({"weight": weight, "bias": bias})

        # Create TLoRA weights with initialization
        self.random_A = nn.Parameter(
            torch.zeros(column, rank), requires_grad=False
        )  # First matrix, non-trainable
        nn.init.kaiming_normal_(self.random_A, a=math.sqrt(5))
        self.lora_B = nn.Parameter(torch.zeros(rank, rank))  # Second matrix (trainable)
        self.random_C = nn.Parameter(
            torch.zeros(rank, row), requires_grad=False
        )  # Third matrix, non-trainable
        nn.init.kaiming_normal_(self.random_C, a=math.sqrt(5))
        self.lora_scaling = nn.Parameter(torch.ones(1))
        self.dropout = nn.Dropout(0.5)

    def forward(self, input):
        # Standard linear transformation
        x = self.linear(input)
        # Low-rank adaptation with the tri-matrix TLoRA update;
        # the trainable scaling controls the magnitude of the LoRA output
        y = self.lora_scaling * (input @ self.random_A @ self.lora_B @ self.random_C)
        y = self.dropout(y)
        return x + y

Full Repo: https://github.com/itanvir/tlora
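For illustration, a minimal usage sketch (our own example, not from the paper or repo), wrapping an existing nn.Linear and checking which parameters would actually be trained:

base = nn.Linear(768, 768)  # hypothetical base layer; any Linear in a model could be wrapped this way
tlora = TLoRALayer(base.weight.data, base.bias.data, rank=32)

print([n for n, p in tlora.named_parameters() if p.requires_grad])
# prints something like ['linear.weight', 'linear.bias', 'lora_B', 'lora_scaling'],
# so in this sketch the restored base Linear still needs to be frozen explicitly:
for p in tlora.linear.parameters():
    p.requires_grad = False

Note that random_A and random_C never receive gradients because they are created with requires_grad=False.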
Motivation

Extreme Parameter Efficiency: The core trainable component in TLoRA is the matrix $B$ with dimensions rank x rank. Compared to standard LoRA's trainable matrices $A$ (input_dim x rank) and $B$ (rank x output_dim), TLoRA's trainable parameters are significantly fewer. This makes TLoRA potentially one of the most parameter-efficient methods in PEFT for a given rank.
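As a rough back-of-the-envelope comparison (our own numbers, assuming a hypothetical square 4096 x 4096 projection and rank 32):

d_in = d_out = 4096
r = 32

lora_trainable = r * d_in + r * d_out  # LoRA trains A (d_in x r) and B (r x d_out)
tlora_trainable = r * r + 1            # TLoRA trains B (r x r) plus the scalar alpha
print(lora_trainable, tlora_trainable)  # 262144 vs. 1025 per adapted layer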
Competitive Performance: The fixed matrices $A$ and $C$ can be seen as defining fixed subspaces. By training only the matrix $B$ connecting these subspaces, TLoRA might capture more focused and effective updates compared to training the full $A$ and $B$ matrices in standard LoRA. Our paper provides empirical evidence supporting its effectiveness.
Your contribution
We can give input on the design. It should be straightforward.
At first glance this looks very similar to MosLoRA (see also: #1905 and #2013). There was a bit of a discussion regarding the effectiveness of the approach due to the linear nature of the formulation (composing N x r, r x r, and r x M matrices is mathematically equivalent to composing just an N x r and an r x M matrix). But I think that the gradient freezing of the outer matrices justifies this formulation. If we proceed with this, it may be worthwhile to combine implementation efforts. This is something that is probably well suited to be implemented as a LoRA variant.
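To illustrate the equivalence argument numerically (a standalone sanity check, not part of either implementation): if all three matrices were trainable, the tri-matrix product could always be folded into an ordinary two-matrix update of the same rank.

import torch

torch.manual_seed(0)
d_in, r, d_out = 16, 4, 8
A = torch.randn(d_in, r)   # frozen in TLoRA
B = torch.randn(r, r)      # the only trainable factor in TLoRA
C = torch.randn(r, d_out)  # frozen in TLoRA

delta_tri = A @ B @ C    # tri-matrix update
delta_two = (A @ B) @ C  # same update expressed with only two factors
print(torch.allclose(delta_tri, delta_two))  # True -- the expressivity is identical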
It also reminds me of VeRA, which has an even more drastic reduction of trainable parameters from what I can tell. The referenced paper also reports numbers for VeRA, but shows it with a higher parameter count. How was that measured? I'd have thought that VeRA would have a lower count, since it also shares A and B across the model. Do you know?
It would also be interesting to see what the main benefit over VeRA would be.
I hope this does not sound dismissive, but as a maintainer it is important to see whether a method adds enough distinctiveness to the library as a whole to be worth the added maintenance effort.