Fix gradient shape error for DPMultiheadAttention (issue 650) (#651)
Summary:
When batch_first=True, the activations and partial gradients for each linear layer in DPMultiheadAttention still carry batch_size in the second dimension, which produces per-sample gradients of the wrong shape in optimizer.step().
Details in: #650
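A minimal sketch of the shape mismatch described above (illustrative only, using NumPy rather than Opacus internals; all names here are hypothetical): per-sample gradient machinery expects activations with the batch dimension first, but a sequence-first internal layout leaves batch in dim 1.

```python
import numpy as np

# Hypothetical shapes, chosen for illustration.
batch, seq, embed = 4, 7, 16

# With batch_first=True the module receives (batch, seq, embed) ...
x = np.zeros((batch, seq, embed))

# ... but if the internal linear layers still operate on the
# sequence-first layout (seq, batch, embed), batch lands in dim 1:
internal = x.transpose(1, 0, 2)
assert internal.shape == (seq, batch, embed)

# Per-sample gradients derived from such activations would then be
# indexed by seq instead of batch, so their shape disagrees with the
# parameter shapes expected in optimizer.step().
# The fix is to restore the batch dimension to dim 0:
fixed = internal.transpose(1, 0, 2)
assert fixed.shape == (batch, seq, embed)
```

This mirrors the repair in the commit: keeping the batch dimension first for the tensors that feed the per-sample gradient computation.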
Reviewed By: EnayatUllah
Differential Revision: D57446245