Description
Hi MONAI team,
While reading through the implementation of DiffusionModelUNetMaisi, I noticed the following logic for enabling attention at each level:
```python
with_attn = attention_levels[i] and not with_conditioning
with_cross_attn = attention_levels[i] and with_conditioning
```
This effectively means that self-attention is never used when the model is in conditioning mode (with_conditioning=True), even if attention_levels[i] is True.
Is this behavior intentional?
In other diffusion-based architectures such as Stable Diffusion, it is common practice to enable both self-attention and cross-attention simultaneously within the same layers.
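For reference, here is a minimal, purely illustrative sketch of that pattern (my own example, not the MAISI implementation): a single transformer block that applies self-attention over the image tokens and then cross-attention to the conditioning tokens.

```python
import torch
import torch.nn as nn


class IllustrativeTransformerBlock(nn.Module):
    """Illustrative sketch of the Stable-Diffusion-style block layout in which
    self-attention and cross-attention coexist within the same block."""

    def __init__(self, dim: int, context_dim: int, num_heads: int = 8) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=context_dim, vdim=context_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Self-attention over the image tokens themselves.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention from image tokens to the conditioning tokens.
        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context, need_weights=False)[0]
        # Feed-forward.
        x = x + self.ff(self.norm3(x))
        return x


# Example usage with hypothetical shapes:
# block = IllustrativeTransformerBlock(dim=320, context_dim=768)
# x = torch.randn(2, 64, 320)    # image tokens
# ctx = torch.randn(2, 77, 768)  # conditioning tokens
# out = block(x, ctx)            # both attention types applied in one block
```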
Would it be acceptable, or even recommended, to modify the logic as follows to allow both mechanisms in parallel?
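Something along these lines is what I had in mind; this is only a sketch of the flag change, not a tested patch, and the block-construction code would of course need to handle both flags being true at the same level:

```python
# Sketch of the proposed change: keep self-attention wherever the level asks
# for attention, and additionally enable cross-attention when conditioning is
# active, instead of treating the two as mutually exclusive.
with_attn = attention_levels[i]
with_cross_attn = attention_levels[i] and with_conditioning
```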
Or is there a specific reason this mutual exclusivity was enforced?
Looking forward to your insights, and thank you for the great work on this model!
Best regards,
Daniele Molino