Unable to reproduce LLaMA-7B results when training from scratch #9
Hello, this is unrelated to your issue, but I was hoping you could point me in the right direction since I'm an ML beginner trying to leverage this repo for another organization. Did you try running …? Thanks.
Hi, I tried the example command in the README and it works without an issue:

```
python -m src.compress --model_name_or_path llama-7b-gist-1-recovered \
    --instruction "Name the top cities in France that should not be missed. Include the best aspects of each place as well."
```

where … Also, I would recommend opening a separate issue if there's any follow-up / new question.
Hi, give me a few days to look into this. Have you tried training the positive control model? I'm curious whether the issue affects just the gist model or all of the models.
Also, are you using the decapoda-research llama checkpoint? Can you check whether it might be due to an incorrect tokenization config with the decapoda models? See …
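(For readers following along, one quick way to compare the special-token IDs the tokenizers actually expose is sketched below. This is not code from the repo; the checkpoint paths are placeholders for wherever the decapoda-research and converted official checkpoints live.)

```python
# Hypothetical sketch, not repo code: print the special-token IDs each tokenizer
# reports, so the decapoda-research and officially converted checkpoints can be
# compared directly. Replace the paths with the actual checkpoint locations.
from transformers import LlamaTokenizer

for path in ["decapoda-research/llama-7b-hf", "./llama-7b-official-converted"]:
    tok = LlamaTokenizer.from_pretrained(path)
    print(path, "bos:", tok.bos_token_id, "eos:", tok.eos_token_id, "pad:", tok.pad_token_id)
```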
I'm guessing it worked without modification because your GPU specs match the required ones exactly, haha. Thanks!
Thank you for the reply. I haven't tried training the positive control model, but let me do that now. Just two more quick follow-ups:
This is a good point. All my previous experiments were done using the decapoda-research checkpoint, and I was actually redoing the experiments with the official checkpoint today; I'll let you know how it goes. I also have a question w.r.t. the provided gist checkpoint: its generation config seems to be

```json
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": -1,
  "transformers_version": "4.28.0.dev0"
}
```

while in the decapoda-research checkpoint it is:

```json
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.27.0.dev0"
}
```

and in the official checkpoint:

```json
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.28.0.dev0"
}
```

Could you explain the differences in your gist checkpoint's tokenization config? Thank you.
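(As an aside, the same comparison can be made programmatically; a small sketch, not part of the repo, with placeholder checkpoint paths:)

```python
# Sketch only: dump the generation-config token IDs of several checkpoints side
# by side. The names below are placeholders; substitute the gist, decapoda, and
# converted official checkpoint directories actually in use.
from transformers import GenerationConfig

for name in ["./llama-7b-gist-1", "decapoda-research/llama-7b-hf", "./llama-7b-official-converted"]:
    cfg = GenerationConfig.from_pretrained(name)
    print(name, "bos:", cfg.bos_token_id, "eos:", cfg.eos_token_id, "pad:", cfg.pad_token_id)
```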
It seems like there may be some versioning issues here... I think I wrote this code and converted the official Llama checkpoint back when there were still some outstanding tokenization issues in the huggingface repo (e.g. prior to huggingface/transformers#22402); note the pinned transformers version in this codebase's … The Llama checkpoint I used has the following configs:

which seem to be incorrect; however, in a debug session after loading the LlamaTokenizer from this checkpoint, I get

so I'm definitely a bit confused here; the pinned version of transformers I used might be overwriting the checkpoint values. In fact, I recall having to manually overwrite the generation config here: src/trainer_seq2seq.py, lines 303 to 313 at commit 297907c.
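(For readers without the code open, the kind of manual override being described usually looks something like the hypothetical sketch below. This is not the actual code at src/trainer_seq2seq.py; the function name and fallback logic are illustrative only.)

```python
# Hypothetical illustration, not the repo's implementation: build a GenerationConfig
# whose special-token IDs come from the tokenizer, overriding whatever the
# checkpoint's generation_config.json happens to contain.
from transformers import GenerationConfig, PreTrainedTokenizerBase


def generation_config_from_tokenizer(tokenizer: PreTrainedTokenizerBase) -> GenerationConfig:
    pad_id = tokenizer.pad_token_id
    if pad_id is None or pad_id < 0:
        # Fall back to EOS for padding if the checkpoint's pad id is missing or invalid.
        pad_id = tokenizer.eos_token_id
    return GenerationConfig(
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=pad_id,
    )
```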
I previously attributed this to a DeepSpeed bug (for some reason I remember not seeing this issue without DeepSpeed), but maybe it's related to tokenization. Some answers that would help clarify things:

Appreciate your help debugging this!
No, that is the only change required!
Aha, this was also the case in the original Alpaca codebase: it specified a cosine LR scheduler, but the LR wasn't actually changed, at least according to wandb. I didn't look too closely into this. So even if not entirely correct, this is expected behavior, and I observed it in my experiments as well.
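(One way to sanity-check what a cosine schedule should actually do, independent of the trainer, is the standalone sketch below; the step counts and base LR are arbitrary placeholders, not the repo's training config.)

```python
# Standalone sketch (not from the repo): print what a cosine LR schedule with warmup
# should look like, so the curve logged to wandb can be compared against it.
import torch
from transformers import get_cosine_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

for step in range(1000):
    if step % 250 == 0:
        print(step, scheduler.get_last_lr())  # rises during warmup, then decays
    optimizer.step()
    scheduler.step()
```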
Thank you for putting in the effort on such a detailed response! With regard to your questions:
I had no luck with those new experiments:

Basically, there is no significant difference between the decapoda-research checkpoint and the converted checkpoint, and both the gist and pos control experiments have worse results than those reported in the paper. Could you kindly consider rerunning the training on your side to verify whether you're able to reproduce the results presented in the paper?
Thank you for the thorough investigation. Let me run experiments with your newly provided environment first; if that doesn't work, I'll dig into the checkpoint. Just a side note: I was actually running most of my experiments on 8 A100-40GB GPUs, since I have relatively limited A100-80GB resources. I changed …
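(For context, what usually needs to stay fixed when the GPU setup changes is the effective batch size; a small arithmetic sketch with illustrative numbers that are not taken from the repo's configs:)

```python
# Illustrative arithmetic only; the concrete numbers are placeholders, not the
# repo's actual training config. Per-device batch size and gradient accumulation
# trade off against each other at a fixed effective batch size.
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    return per_device * grad_accum * num_gpus

# e.g. 4 GPUs x accumulation 8 matches 8 GPUs x accumulation 4 at the same per-device size
assert effective_batch_size(4, 8, 4) == effective_batch_size(4, 4, 8) == 128
```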
Switching to the provided Python envs indeed fixed the training! I got seen 57.39, unseen 46.53, human 25.83 for the 1 gist token setting trained using the converted official checkpoint. These are pretty much consistent with the paper results. It required a few more changes to the … on my side: I installed PyTorch via

```
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
```

and then slightly updated the …

Using this environment, I could successfully reproduce the gist results when training from scratch on my end. I think it's worth further investigating which package version mismatch caused the performance degradation in my previous experiments, but this can serve as a temporary workaround for now. Thanks again for your time looking into this!
I ran a few more experiments and located the issue: the performance discrepancy is caused not by the torch version but by the deepspeed version. I was still able to reproduce the paper results after changing the torch version back to 2.0.0, so I did some further debugging and realized that, since I did not create a new env from scratch previously, my older experiments were run with … I then conducted experiments using the same env with only the deepspeed versions differing, and verified that only the …

I haven't figured out why the up-to-date deepspeed causes such performance degradation. And I apologize for my oversight (again!): your currently listed dependencies should be all good.
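(One quick sanity check when chasing this kind of issue, sketched below without assuming any specific pinned versions, is to confirm which packages actually get imported in the training environment.)

```python
# Generic environment check: print the versions that are actually importable in
# the current environment, since a stale env can silently pick up a different
# deepspeed than the one pinned in the repo's dependency list.
import deepspeed
import torch
import transformers

print("deepspeed   :", deepspeed.__version__)
print("torch       :", torch.__version__)
print("transformers:", transformers.__version__)
```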
Apologies for the late response; I've been on vacation 😄 Glad to hear the results replicate, and thanks so much for looking so closely into this! This will help a lot with reproducibility. It's super weird that the DeepSpeed version leads to such drastic performance differences. I'll make a note of this in the repo when I get back from vacation.
Hi @Xiuyu-Li, could you please share your reproduced results? I also tried to replicate LLaMA-7B from scratch, and these are my results:

I trained the model on 4×A100 (80GB), which takes about 24 hours to finish training, and I use …
Hi @Hannibal046, the results you just reported align with the paper's, namely these lines:

and the corresponding table in the paper, so it seems like you've successfully reproduced the results. Did you have other questions?
Hi, @jayelm |
Hi @jayelm, sorry to bother. |
Hi @Hannibal046, see this comment earlier in the thread:
Aha, got it! My mistake, and I appreciate your help! Also, if I would like to help migrate the current codebase to the latest deepspeed and huggingface transformers, what would you suggest taking care of? Currently, these are on the to-do list:
Hi @Hannibal046, those are the things I'm aware of! The first step would be migrating to a newer huggingface version while keeping the DeepSpeed version the same; you should be able to exactly reproduce the results. Second, you can try debugging what's going on with DeepSpeed; I really have no clue and haven't had time to look into it myself. If you do figure out either of these, please let me know and/or submit a PR; that would be really helpful!
Hi,
I was trying to reproduce the LLaMA-7B with 1 gist token results from scratch, following the training instructions in the README. I ran the script below on 4 A100-80GB GPUs:

However, the final results after 3 epochs are much lower than those reported in the paper: I got seen 51.24, unseen 42.01, human 19.00 for ROUGE-L. I tried training for more epochs, but it didn't help with the unseen and human ROUGE-L results. I did not change anything in the training config other than the wandb account.

I also evaluated the 3 provided checkpoints (gist, pos_control, neg_control), and the results are consistent with the paper (< 0.1 difference in terms of ROUGE-L) for all of them, so the evaluation code should function normally. Could you help double-check whether the above training setup is correct, and do you have any suggestions on how to reproduce the LLaMA results in the paper?
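(For readers unfamiliar with the metric, the sketch below shows what the reported ROUGE-L numbers measure. It is a generic example using the rouge_score package, not the repo's evaluation script, and the strings are placeholders.)

```python
# Minimal, generic ROUGE-L check (not the repo's evaluation code) using the
# rouge_score package, to illustrate what the reported metric measures.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(
    "Paris is known for the Eiffel Tower and its museums.",    # reference
    "Paris is famous for the Eiffel Tower and many museums.",  # model output
)
print(score["rougeL"].fmeasure)
```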