🚀 Feature
Support seamless integration of LoRA adapters with Tensor Parallelism (TP) during fine-tuning.
Motivation
In early experiments, I combined LoRA adapters with TP as a quick sanity check. Using ~500 SQuAD samples (normally enough to observe a convergence trend), the training loss did not converge: it fluctuated noisily between 2.0 and 4.0.
In contrast, LoRA without TP converged cleanly, which suggests that TP may be interfering with LoRA's low-rank update path. I could not find strong prior work on combining LoRA with TP (the closest related work is lorax, which is JAX-based).
This raises the question of whether LoRA + TP should work out of the box in principle, or whether there are caveats around how LoRA's low-rank adapters are sharded.
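To make the sharding question concrete, here is a minimal, self-contained sketch of the column-parallel case (plain PyTorch tensors, no distributed setup; all sizes and variable names are hypothetical). If the base weight W is split along its output dimension across TP ranks, the LoRA B matrix has to be split the same way while A stays replicated; the merged forward pass then matches the unsharded LoRA computation.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank, tp = 8, 8, 2, 2   # hypothetical sizes; tp = number of tensor-parallel shards

x = torch.randn(4, d_in)             # a small batch of activations
W = torch.randn(d_out, d_in)         # base weight; column-parallel => split along d_out
A = torch.randn(rank, d_in)          # LoRA down-projection (kept replicated on every rank)
B = torch.randn(d_out, rank)         # LoRA up-projection (split along d_out, like W)

# Reference: unsharded LoRA forward, y = x W^T + (x A^T) B^T
y_ref = x @ W.T + (x @ A.T) @ B.T

# "TP" version: each shard owns a slice of W and the matching slice of B.
y_shards = [
    x @ W_k.T + (x @ A.T) @ B_k.T
    for W_k, B_k in zip(W.chunk(tp, dim=0), B.chunk(tp, dim=0))
]
y_tp = torch.cat(y_shards, dim=-1)   # column-parallel outputs are concatenated, not summed

print(torch.allclose(y_ref, y_tp, atol=1e-5))  # True: sharding B alongside W preserves the math
```

If an implementation instead replicates B (or shards A) on a column-parallel layer, the per-rank outputs no longer combine into the unsharded result, which is the kind of mismatch that could plausibly show up as the noisy, non-converging loss above.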
Pitch
Clarify whether LoRA + TP is expected to work seamlessly in principle.
Document any known limitations or caveats when combining TP with LoRA's low-rank adapters (one candidate caveat, the row-parallel sharding pattern, is sketched after this list).
Provide guidance on whether users should prefer FSDP + LoRA over TP for long-context fine-tuning (e.g., 16k).
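Related to the documentation ask above: in the row-parallel case (e.g., attention output or MLP down projections), the sharding pattern flips relative to the column-parallel sketch in the Motivation section. There, A should be split along the input dimension to match the already-sharded activations, B stays replicated, and each rank's LoRA partial output folds into the same reduction as the base layer. A minimal plain-PyTorch sketch (again with hypothetical sizes and no distributed setup):

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank, tp = 8, 8, 2, 2    # hypothetical sizes; tp = number of tensor-parallel shards

x = torch.randn(4, d_in)              # activations; a row-parallel layer receives these already sharded along d_in
W = torch.randn(d_out, d_in)          # base weight; row-parallel => split along d_in
A = torch.randn(rank, d_in)           # LoRA down-projection (split along d_in, like W)
B = torch.randn(d_out, rank)          # LoRA up-projection (kept replicated on every rank)

# Reference: unsharded LoRA forward, y = x W^T + (x A^T) B^T
y_ref = x @ W.T + (x @ A.T) @ B.T

# "TP" version: each rank holds a slice of x, W, and A; the sum below stands in for
# the all-reduce a real row-parallel layer performs on its partial outputs.
partials = [
    x_k @ W_k.T + (x_k @ A_k.T) @ B.T
    for x_k, W_k, A_k in zip(x.chunk(tp, dim=-1), W.chunk(tp, dim=1), A.chunk(tp, dim=1))
]
y_tp = torch.stack(partials).sum(dim=0)

print(torch.allclose(y_ref, y_tp, atol=1e-5))  # True: LoRA partials reduce together with the base partials
```

Either convention preserves the full rank-r update; the caveat worth documenting is that sharding the LoRA factors independently of the base layer's parallel style breaks this equivalence.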