Hi,
Thank you very much for your brilliant work on Adan!
From your paper, Adan should reach a lower loss (both train and test) than AdamW according to Figure 1. However, I got a higher training loss with Adan than with AdamW on ViT-H:
Steps | AdamW train loss | Adan train loss |
---|---|---|
200 | 6.9077 | 6.9077 |
400 | 6.9074 | 6.9075 |
600 | 6.9068 | 6.9073 |
800 | 6.9061 | 6.907 |
1000 | 6.905 | 6.9064 |
1200 | 6.9036 | 6.9056 |
1400 | 6.9014 | 6.9044 |
1600 | 6.899 | 6.9028 |
1800 | 6.8953 | 6.9003 |
2000 | 6.8911 | 6.8971 |
2200 | 6.8848 | 6.8929 |
2400 | 6.8789 | 6.8893 |
2600 | 6.8699 | 6.8843 |
2800 | 6.8626 | 6.8805 |
3000 | 6.8528 | 6.8744 |
3200 | 6.8402 | 6.868 |
3400 | 6.8293 | 6.862 |
3600 | 6.8172 | 6.8547 |
3800 | 6.7989 | 6.8465 |
4000 | 6.7913 | 6.8405 |
I used the same HPs as AdamW and only changed the betas from (0.9, 0.999) to (0.9, 0.92, 0.999).
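For context, this is roughly how I construct the optimizer. Only `betas` differs from my AdamW run; the import path and the lr / weight_decay values shown here are placeholders standing in for my actual config, so please correct me if any argument is wrong:

```python
import torch
from adan import Adan  # assumed import path for the Adan optimizer class

model = torch.nn.Linear(768, 1000)  # placeholder for the actual ViT-H model

# Only `betas` is changed vs. my AdamW config; lr and weight_decay are
# placeholder values standing in for the ones I kept from the AdamW run.
optimizer = Adan(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.92, 0.999),
    weight_decay=0.05,
)
```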
I only trained for a few steps to see the trend, but the loss gap from AdamW already seems quite big. Should I change other HPs to make better use of Adan? How can I get a lower loss than AdamW?
I also noticed that Adan prefers a large batch size on vision tasks; should we use a larger batch size?
Or should I simply train for more steps to see the trend?
Thank you!