Feature request
The ability to apply LoRA (or other adapters) to experts in MoE models.
Motivation
Mixture-of-experts models with token-choice routing contain an FFN within each expert, and these are often implemented as a single batched matmul over all experts (ref from Llama4). This differs from vanilla FFNs in that the weights are stored as 3D nn.Parameters rather than as nn.Linears. However, given that it's common to apply LoRA to vanilla FFNs, it would also be useful to tune the experts of an MoE model with PEFT. (There are some challenges here, e.g. the use of nn.Parameters likely rules out doing this via a module swap directly on nn.Linears.)
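To make the shape of the problem concrete, here is a minimal sketch of the kind of grouped-experts FFN described above, plus a batched LoRA variant of it. The class names, the up_proj/down_proj attributes, and the SiLU activation are illustrative assumptions, not the actual Llama4 or torchtune implementation; routing and top-k gating are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedExperts(nn.Module):
    """Toy grouped-experts FFN: all expert weights live in 3D nn.Parameters."""

    def __init__(self, num_experts: int, dim: int, hidden_dim: int):
        super().__init__()
        # (num_experts, dim, hidden_dim) instead of num_experts separate nn.Linears
        self.up_proj = nn.Parameter(torch.randn(num_experts, dim, hidden_dim) * 0.02)
        self.down_proj = nn.Parameter(torch.randn(num_experts, hidden_dim, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_experts, tokens_per_expert, dim), tokens already grouped by expert
        h = F.silu(torch.bmm(x, self.up_proj))  # one batched matmul over all experts
        return torch.bmm(h, self.down_proj)


class LoRAGroupedExperts(nn.Module):
    """LoRA on the up projection only, with the A/B factors also batched per expert."""

    def __init__(self, base: GroupedExperts, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the base expert weights
        num_experts, dim, hidden_dim = base.up_proj.shape
        self.scaling = alpha / rank
        # one (A, B) pair per expert, stored as 3D tensors to keep the bmm pattern
        self.lora_a = nn.Parameter(torch.randn(num_experts, dim, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_experts, rank, hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.bmm(x, self.base.up_proj)
        h = h + torch.bmm(torch.bmm(x, self.lora_a), self.lora_b) * self.scaling
        return torch.bmm(F.silu(h), self.base.down_proj)


# usage: x holds tokens_per_expert routed tokens for each expert
x = torch.randn(4, 16, 32)  # (num_experts, tokens_per_expert, dim)
experts = LoRAGroupedExperts(GroupedExperts(num_experts=4, dim=32, hidden_dim=64))
print(experts(x).shape)  # torch.Size([4, 16, 32])
```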
Your contribution
Happy to help with thoughts on the design. We have a version of this in torchtune (ref) and would love to interoperate with PEFT if this is something you're interested in supporting!
Thanks for drawing attention to LoRA adapters for MoE experts :)
If I understand correctly, we would need a dedicated layer adapter for Llama4TextExperts, since there is currently no established interface for how grouped experts are implemented (in contrast to, say, multi-head attention). Targeting the whole module would also remove the need to deal with the nn.Parameters directly (although there is a way to handle nn.Parameters, as done for nn.MultiheadAttention).
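To illustrate the module-level targeting idea, below is a minimal sketch, not PEFT's actual API: a wrapper that would be swapped in for the whole experts module (e.g. Llama4TextExperts) so that the base module keeps its own forward logic, while the adapter substitutes an adapted 3D weight via torch.func.functional_call. The weight_name default ("gate_up_proj") and the assumed (num_experts, in_dim, out_dim) layout would need to be checked against the actual module.

```python
import torch
import torch.nn as nn


class ExpertsModuleLoRA(nn.Module):
    """Hypothetical module-level adapter: wraps the whole grouped-experts module
    and adds a per-expert low-rank delta to one of its 3D weights."""

    def __init__(self, base: nn.Module, weight_name: str = "gate_up_proj",
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.weight_name = weight_name
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the LoRA factors are trained
        weight = getattr(base, weight_name)  # assumed: (num_experts, in_dim, out_dim)
        num_experts, in_dim, out_dim = weight.shape
        self.scaling = alpha / rank
        self.lora_a = nn.Parameter(torch.randn(num_experts, in_dim, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_experts, rank, out_dim))

    def forward(self, *args, **kwargs):
        weight = getattr(self.base, self.weight_name)
        delta = torch.bmm(self.lora_a, self.lora_b) * self.scaling
        # run the unmodified base forward with the adapted 3D weight substituted in,
        # without mutating the base module's parameters
        return torch.func.functional_call(
            self.base, {self.weight_name: weight + delta}, args, kwargs
        )
```

Targeting the module rather than individual nn.Linears keeps the batched-matmul code path intact and sidesteps the nn.Parameter issue mentioned in the motivation; the trade-off is that the adapter has to know which 3D weights to adapt for each supported experts implementation.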