
About the convergence trend comparison with AdamW in ViT-H #16

@haihai-00

Hi,
Thank you very much for your brilliant work on Adan!
From your paper, Figure 1 indicates that Adan should reach a lower loss (both train and test) than AdamW. However, I got a higher training loss with Adan than with AdamW on ViT-H:

| Steps | AdamW train loss | Adan train loss |
|------:|-----------------:|----------------:|
| 200   | 6.9077 | 6.9077 |
| 400   | 6.9074 | 6.9075 |
| 600   | 6.9068 | 6.9073 |
| 800   | 6.9061 | 6.9070 |
| 1000  | 6.9050 | 6.9064 |
| 1200  | 6.9036 | 6.9056 |
| 1400  | 6.9014 | 6.9044 |
| 1600  | 6.8990 | 6.9028 |
| 1800  | 6.8953 | 6.9003 |
| 2000  | 6.8911 | 6.8971 |
| 2200  | 6.8848 | 6.8929 |
| 2400  | 6.8789 | 6.8893 |
| 2600  | 6.8699 | 6.8843 |
| 2800  | 6.8626 | 6.8805 |
| 3000  | 6.8528 | 6.8744 |
| 3200  | 6.8402 | 6.8680 |
| 3400  | 6.8293 | 6.8620 |
| 3600  | 6.8172 | 6.8547 |
| 3800  | 6.7989 | 6.8465 |
| 4000  | 6.7913 | 6.8405 |

I used the same hyperparameters (HPs) as for AdamW and only changed the betas from (0.9, 0.999) to (0.9, 0.92, 0.999), roughly as in the sketch below.
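
For reference, my optimizer setup looks roughly like this minimal sketch (assuming the `Adan` class from this repo's `adan.py`; the model, `lr`, and `weight_decay` here are placeholders, not my actual ViT-H values):

```python
import torch
from adan import Adan  # the Adan optimizer class from this repo

# placeholder model standing in for ViT-H
model = torch.nn.Linear(768, 768)

# AdamW baseline: two betas
opt_adamw = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                 # placeholder; I reuse my AdamW schedule
    betas=(0.9, 0.999),
    weight_decay=0.05,       # placeholder
)

# Adan run: identical HPs, only the betas changed (Adan takes three betas)
opt_adan = Adan(
    model.parameters(),
    lr=1e-3,                 # placeholder; same as the AdamW run
    betas=(0.9, 0.92, 0.999),
    weight_decay=0.05,       # placeholder
)
```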
I only trained for a few steps to see the trend, but the loss gap from AdamW already seems quite big. Should I change other HPs to make better use of Adan? How can I get a lower loss than with AdamW?
I noticed that Adan prefers a large batch size on vision tasks; should we use a larger batch size?
Or should I train for more steps to see the trend?
Thank you!
