Support LoRA adapters with Tensor Parallelism (TP) #789

@xzhouAxon

Description

🚀 Feature

Support seamless integration of LoRA adapters with Tensor Parallelism (TP) during fine-tuning.

Motivation

In early experiments, I combined LoRA adapters with TP as a quick sanity check. Using ~500 SQuAD samples (normally sufficient to observe convergence trends), I noticed that the training loss did not converge — fluctuating between 2.0 and 4.0 with a noisy curve.

In contrast, LoRA without TP converged cleanly, which suggests that TP may be interfering with LoRA's low-rank update path. I could not find strong prior work on combining LoRA with TP (the closest I found is lorax, which is JAX-based).

This raises the question of whether LoRA + TP should theoretically work out of the box, or whether there are caveats around how LoRA's low-rank adapters are sharded. A small sharding sketch follows below.
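
For reference, here is a minimal, self-contained sketch (plain PyTorch, no distributed calls) of the sharding I would expect to be mathematically equivalent in principle, assuming a Megatron-style column-parallel linear layer. All names, shapes, and the `scale` factor are illustrative assumptions and do not refer to this project's API: the base weight W0 and LoRA's B matrix are sharded along the output dimension, while A stays replicated, and the concatenated sharded outputs match the unsharded forward exactly.

```python
# Sketch only: simulates column-parallel TP sharding of a LoRA-augmented linear
# layer with plain tensors. Names (tp_size, lora_a, lora_b) are illustrative,
# not APIs from this repository.
import torch

torch.manual_seed(0)

batch, d_in, d_out, rank, tp_size = 4, 32, 64, 8, 2
scale = 1.0  # stands in for alpha / r

x = torch.randn(batch, d_in)
w0 = torch.randn(d_out, d_in)      # frozen base weight
lora_a = torch.randn(rank, d_in)   # LoRA "A": low-rank down-projection
lora_b = torch.randn(d_out, rank)  # LoRA "B": low-rank up-projection

# Reference (no TP): y = x (W0 + scale * B A)^T
y_ref = x @ (w0 + scale * lora_b @ lora_a).T

# Column-parallel TP: shard W0 and B along the output dimension, replicate A.
w0_shards = w0.chunk(tp_size, dim=0)
b_shards = lora_b.chunk(tp_size, dim=0)

# Each simulated rank computes its slice of the output; in a real TP run an
# all-gather along the last dimension would reassemble the full activation.
y_tp = torch.cat(
    [x @ (w0_i + scale * b_i @ lora_a).T for w0_i, b_i in zip(w0_shards, b_shards)],
    dim=-1,
)

print(torch.allclose(y_ref, y_tp, atol=1e-5))  # True: sharding B, replicating A is exact
```

For a row-parallel layer the dual scheme would presumably apply (shard A along the input dimension, replicate B, all-reduce the partial sums), but I have not verified how this interacts with the optimizer state or gradient synchronization here, which is part of what I am asking about.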

Pitch

Clarify whether LoRA + TP is expected to work seamlessly in principle.
Document any known limitations or caveats when combining TP with LoRA’s low-rank adapters.
Provide guidance on whether users should prefer FSDP + LoRA for large-context fine-tuning (e.g., 16k) over TP.

Alternatives

Additional context
