Fix for tensor parallelism with newly split sin/cos tensors #758


Merged
merged 1 commit into turboderp-org:dev on Apr 11, 2025

Conversation

MikeRoz47

Since 7b05acd, sin and cos in the device context are split by layer. However, the external functions tp_attn_forward_ and tp_attn_forward_paged_ expect a single tensor per device. I've changed the get_sin_cos method of the TPContext class to concatenate these tensors before providing them to the external functions.
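
A minimal sketch of what such a change might look like, assuming each device holds its sin/cos values as a list of per-layer tensors; the class skeleton, attribute names, and concatenation dimension below are illustrative assumptions, not the actual TPContext implementation:

```python
import torch

class TPContext:  # hypothetical skeleton, for illustration only
    def __init__(self, sin_by_device, cos_by_device):
        # Assumed layout: one list of per-layer tensors per device
        self.sin_by_device = sin_by_device
        self.cos_by_device = cos_by_device

    def get_sin_cos(self):
        # The extension functions (tp_attn_forward_, tp_attn_forward_paged_)
        # expect a single sin/cos tensor per device, so rejoin the per-layer
        # pieces into one tensor for each device before returning them.
        sin = [torch.cat(layers, dim = 0) for layers in self.sin_by_device]
        cos = [torch.cat(layers, dim = 0) for layers in self.cos_by_device]
        return sin, cos
```

The exact storage layout and concat dimension depend on how the split was introduced in 7b05acd; the point is only that the per-layer tensors are concatenated before the call into the extension.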

The error message I was receiving before making this change was:

TypeError: tp_attn_forward_(): incompatible function arguments. The following argument types are supported:
    1. (arg0: int, arg1: torch.Tensor, arg2: list[torch.Tensor], arg3: list[torch.Tensor], arg4: list[torch.Tensor], arg5: list[torch.Tensor], arg6: list[torch.Tensor], arg7: list[torch.Tensor], arg8: list[torch.Tensor], arg9: list[torch.Tensor], arg10: list[torch.Tensor], arg11: list[torch.Tensor], arg12: float, arg13: list[int], arg14: list[int], arg15: list[int], arg16: list[int], arg17: int, arg18: int, arg19: int, arg20: int, arg21: list[torch.Tensor], arg22: list[torch.Tensor], arg23: list[torch.Tensor], arg24: float) -> None

Invoked with: 1399607536, tensor([[     0.00057,      0.00150,     -0.00022,  ...,     -0.00005,     -0.00012,     -0.00014],

followed by many more lines of truncated string representations of tensors.

I've tested this change locally, which seemed to exercise the tp_attn_forward_ path. I'm not sure how I'd exercise the tp_attn_forward_paged_ path. Where inference was failing before due to this error, I'm now receiving coherent output that seems consistent with what I've seen from the model before.

@frenzybiscuit

Does this fix the massive performance loss on Windows 11?

Linux -> 21 tokens/second
Windows 11 -> 14 tokens/second

This is with 70B models and tensor parallelism. If I disable TP on Windows 11, performance jumps to around 18 tokens/second.

@MikeRoz47
Author

MikeRoz47 commented Mar 23, 2025

I don't know anything about that. This PR fixes a fatal error I received when performing inference.

EDIT: I'll also add that I received the error when using the dev branch, not master or any release.

@Ph0rk0z

Ph0rk0z commented Mar 23, 2025

I think you found it.

This looks suspiciously like the error I got when running the post-Gemma dev branch. It fails on test inference while loading the model in tabby.

Unfortunately I'm using the paged path, since these are all Ampere cards with FA.

@MikeRoz47
Author

Just so you know, the paged path should be fixed too; I just don't think my test exercised that path. But I'm running Ampere cards too, so maybe it did? I just know that my first failure was in trying to call tp_attn_forward_.

@Ph0rk0z

Ph0rk0z commented Mar 24, 2025

I couldn't see the top of the message because mosh only shows so many lines, but the tensor puke at the bottom looks about right.

In tabby it defaults to paged, and in textgen I think it mainly uses the classic engine. I should just try it when I get back.

Edit: It worked. The same model that failed before now loads.

@turboderp
Member

Sorry I missed this. (:

@turboderp merged commit 61450b4 into turboderp-org:dev on Apr 11, 2025
@EthanAndersonUSA

Huge thanks for fixing this!
