Description
Thank you to the authors for contributing such meaningful work to the motion generation community. After reading the paper, I have two questions:
- I noticed that the FID metric on the HumanML3D dataset converges to 0.10, whereas recent works such as MoMask and LaMP, which use masked transformers, report better results, with FID as low as 0.03. I would like to know the authors' thoughts on the relative strengths and weaknesses of the two paradigms (autoregressive vs. masked transformer). For concreteness, a rough sketch of the standard FID computation is included after these questions.
- Recent research (https://openreview.net/forum?id=UxzKcIZedp, https://openreview.net/forum?id=Oh8MuCacJW) has discussed differences between datasets. The first paper analyzes the gap between InterX and HumanML3D, while the second paper's rebuttal reports that training on Motion-X and testing on HumanML3D did not yield good results. I am therefore concerned that multi-dataset training may introduce similar issues. What are the authors' views on this potential problem?
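To make the metric I am referring to concrete, here is a minimal sketch of how FID is typically computed over motion features, assuming features have already been extracted by the benchmark's motion encoder (the feature dimension, sample counts, and random placeholder data below are assumptions for illustration, not the paper's actual setup):

```python
# Minimal FID sketch over pre-extracted motion features.
import numpy as np
from scipy import linalg

def compute_fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to real and generated features."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

if __name__ == "__main__":
    # Random placeholders standing in for encoder outputs (assumed 512-d here).
    rng = np.random.default_rng(0)
    real = rng.normal(size=(1000, 512))
    fake = rng.normal(loc=0.05, size=(1000, 512))
    print(f"FID: {compute_fid(real, fake):.4f}")
```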
Thanks in advance for your answer!