
Conversation

@EnayatUllah (Contributor)

Summary:
The bias gradient-norm calculation for three-dimensional data was incorrect.

Let `G = g g^T`, where `g` is the per-sample activation gradient of dimensions `T x d`, with `T` the number of tokens and `d` the hidden dimension.

The squared per-sample gradient norm with respect to the bias is
`vec(G)^T vec(1)`, i.e. the sum of all entries of `G` (the bias gradient is `g^T 1`, the sum over tokens, and `||g^T 1||^2 = 1^T g g^T 1`), not the erroneous `vec(G)^T vec(G)` computed before. This diff fixes it.
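For intuition, here is a minimal PyTorch sketch (illustrative only, not the Opacus source; the shapes and tensor names are assumptions) that checks the identity on random data:

```python
import torch

# Minimal check (not the Opacus source) that vec(G)^T vec(1) recovers the
# squared per-sample bias gradient norm for three-dimensional data.
B, T, d = 4, 5, 3                # batch size, tokens, hidden dim (illustrative)
g = torch.randn(B, T, d)         # per-sample activation gradients

# Reference: the bias gradient of sample n is g[n]^T 1 (sum over tokens),
# so its squared norm is ||g[n].sum over tokens||^2.
ref = g.sum(dim=1).pow(2).sum(dim=1)

G = torch.einsum("ntk,nsk->nts", g, g)    # G = g g^T per sample, shape (B, T, T)
fixed = torch.einsum("nts->n", G)         # vec(G)^T vec(1): sum of all entries of G
wrong = torch.einsum("nts,nts->n", G, G)  # the old, erroneous vec(G)^T vec(G)

assert torch.allclose(fixed, ref, atol=1e-5)  # identity holds
print((wrong - ref).abs().max())              # generally far from zero
```

Taking the square root of `vec(G)^T vec(1)` then gives the per-sample norm used for clipping.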

Reviewed By: HuanyuZhang

Differential Revision: D70823094

@facebook-github-bot added the CLA Signed label on Apr 9, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70823094

EnayatUllah added a commit to EnayatUllah/opacus that referenced this pull request Apr 9, 2025
…ta (meta-pytorch#751)


@facebook-github-bot (Contributor)

This pull request has been merged in 7264cd7.


Labels: CLA Signed · fb-exported · Merged
