
Support new arch of GLM4 models #2991


Open: wants to merge 4 commits into main

Conversation

guoqingbao (Contributor)

The latest GLM-4 (0414 version) uses a different architecture. The existing GLM-4 implementation is not compatible with the GLM-4-0414 series. This PR adds support for the new architecture.

Test case

cargo run --example glm4_new --release --features cuda -- --weight-path /home/data/GLM-4-9B-0414 --prompt "How are you today?"
   Compiling candle-examples v0.9.1 (/home/bob/candle/candle-examples)
    Finished `release` profile [optimized] target(s) in 4.31s
     Running `target/release/examples/glm4_new --weight-path /home/data/GLM-4-9B-0414 --prompt 'How are you today?'`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 159.358088ms
loaded the model in 3.527865914s
starting the inference loop
How are you today?
I'm just a computer program, so I don't have feelings or emotions. But thank you for asking! How can I assist you today?

31 tokens generated (28.97 token/s)

greenrazer (Collaborator)

I think the new example you created should be combined with the existing glm4 example using some switching logic similar to the gemma example here.

Otherwise, it looks good!

guoqingbao (Contributor, Author)

> I think the new example you created should be combined with the existing glm4 example using some switching logic similar to the gemma example here.
>
> Otherwise, it looks good!

Thanks for the feedback, I will revise this.

guoqingbao (Contributor, Author)

> I think the new example you created should be combined with the existing glm4 example using some switching logic similar to the gemma example here.
>
> Otherwise, it looks good!

As suggested, I've integrated both the old and new GLM4 into a single example, using the `which` argument to distinguish between the two architectures. I also fixed issues related to EOS tokens and the chat template for the old GLM4, since that model uses multiple EOS tokens and still requires the chat template to produce correct generation results.
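The switching logic described above could look something like the following sketch. The variant names, CLI strings, and `Which` enum here are illustrative assumptions, not the exact identifiers used in the PR:

```rust
// Hypothetical sketch of a `--which` model selector for the combined glm4
// example. Names are illustrative; the actual PR may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Which {
    Glm4,    // original GLM-4 architecture
    Glm4New, // GLM-4-0414 series
}

impl std::str::FromStr for Which {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "glm4" => Ok(Which::Glm4),
            "glm4-new" | "glm4-0414" => Ok(Which::Glm4New),
            other => Err(format!("unknown model variant: {other}")),
        }
    }
}

fn main() {
    // In the real example this value would come from a CLI argument.
    let which: Which = "glm4-new".parse().unwrap();
    match which {
        Which::Glm4 => println!("loading the original GLM-4 model"),
        Which::Glm4New => println!("loading a GLM-4-0414 model"),
    }
}
```

A plain enum with a `FromStr` impl keeps the dispatch explicit and avoids pulling in extra argument-parsing machinery for the variant switch itself.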

greenrazer (Collaborator) left a comment

The functionality looks good overall.
I noticed this introduces a dependency on the either crate. To keep our dependency tree minimal, please avoid adding new dependencies if possible. A simple custom enum could replace the Either usage here.
Thanks for working on this PR! :)

guoqingbao requested a review from greenrazer on July 4, 2025 at 09:33
guoqingbao (Contributor, Author)

> The functionality looks good overall. I noticed this introduces a dependency on the either crate. To keep our dependency tree minimal, please avoid adding new dependencies if possible. A simple custom enum could replace the Either usage here. Thanks for working on this PR! :)

Thanks for the comments. I have removed the either crate by using a custom EosTokenId type and a matching deserialization pattern.
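A custom enum like the one below could stand in for `either::Either` here, since the `eos_token_id` config field may hold either a single id or a list of ids. This is a minimal sketch; the variant names and the example token ids are assumptions, and the actual PR would pair this with a serde `Deserialize` impl (e.g. via `#[serde(untagged)]`), which is omitted to keep the sketch dependency-free:

```rust
// Hypothetical replacement for Either<u32, Vec<u32>> when parsing
// eos_token_id from a model config; names and ids are illustrative.
#[derive(Debug, Clone, PartialEq)]
enum EosTokenId {
    Single(u32),
    Multiple(Vec<u32>),
}

impl EosTokenId {
    /// Returns true if `token` is one of the EOS token ids,
    /// regardless of which variant the config provided.
    fn contains(&self, token: u32) -> bool {
        match self {
            EosTokenId::Single(id) => *id == token,
            EosTokenId::Multiple(ids) => ids.contains(&token),
        }
    }
}

fn main() {
    // Illustrative ids only, not necessarily GLM-4's real EOS tokens.
    let eos = EosTokenId::Multiple(vec![151_329, 151_336, 151_338]);
    assert!(eos.contains(151_336));
    assert!(!eos.contains(42));
    println!("EOS membership check passed");
}
```

The generation loop can then call a single `contains` check at each step instead of branching on `Either::Left`/`Either::Right`, which keeps the stop-token logic in one place.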
