🚀 Feature
Support seamless integration of LoRA adapters with Tensor Parallelism (TP) during fine-tuning.
Motivation
In early experiments, I combined LoRA adapters with TP as a quick sanity check. Using ~500 SQuAD samples (normally enough to observe a convergence trend), the training loss did not converge: it fluctuated noisily between 2.0 and 4.0.
In contrast, LoRA without TP converged cleanly, which suggests that TP may be interfering with LoRA's low-rank update path. I could not find strong prior work on combining LoRA with TP (the closest related work is lorax, which is JAX-based).
This raises the question of whether LoRA + TP should work out of the box in principle, or whether there are caveats around how LoRA's low-rank adapters are sharded.
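To make the sharding question concrete, here is a minimal, self-contained sketch of the column-parallel case (plain PyTorch tensors, no distributed setup; all sizes and variable names are hypothetical). If the base weight W is split along its output dimension across TP ranks, the LoRA B matrix has to be split the same way while A stays replicated; the merged forward pass then matches the unsharded LoRA computation.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank, tp = 8, 8, 2, 2   # hypothetical sizes; tp = number of tensor-parallel shards

x = torch.randn(4, d_in)             # a small batch of activations
W = torch.randn(d_out, d_in)         # base weight; column-parallel => split along d_out
A = torch.randn(rank, d_in)          # LoRA down-projection (kept replicated on every rank)
B = torch.randn(d_out, rank)         # LoRA up-projection (split along d_out, like W)

# Reference: unsharded LoRA forward, y = x W^T + (x A^T) B^T
y_ref = x @ W.T + (x @ A.T) @ B.T

# "TP" version: each shard owns a slice of W and the matching slice of B.
y_shards = [
    x @ W_k.T + (x @ A.T) @ B_k.T
    for W_k, B_k in zip(W.chunk(tp, dim=0), B.chunk(tp, dim=0))
]
y_tp = torch.cat(y_shards, dim=-1)   # column-parallel outputs are concatenated, not summed

print(torch.allclose(y_ref, y_tp, atol=1e-5))  # True: sharding B alongside W preserves the math
```

If an implementation instead replicates B (or shards A) on a column-parallel layer, the per-rank outputs no longer combine into the unsharded result, which is the kind of mismatch that could plausibly show up as the noisy, non-converging loss above.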
Pitch
Clarify whether LoRA + TP is expected to work seamlessly in principle.
Document any known limitations or caveats when combining TP with LoRA's low-rank adapters (one candidate caveat, the row-parallel sharding pattern, is sketched after this list).
Provide guidance on whether users should prefer FSDP + LoRA over TP for long-context fine-tuning (e.g., 16k).
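Related to the documentation ask above: in the row-parallel case (e.g., attention output or MLP down projections), the sharding pattern flips relative to the column-parallel sketch in the Motivation section. There, A should be split along the input dimension to match the already-sharded activations, B stays replicated, and each rank's LoRA partial output folds into the same reduction as the base layer. A minimal plain-PyTorch sketch (again with hypothetical sizes and no distributed setup):

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank, tp = 8, 8, 2, 2    # hypothetical sizes; tp = number of tensor-parallel shards

x = torch.randn(4, d_in)              # activations; a row-parallel layer receives these already sharded along d_in
W = torch.randn(d_out, d_in)          # base weight; row-parallel => split along d_in
A = torch.randn(rank, d_in)           # LoRA down-projection (split along d_in, like W)
B = torch.randn(d_out, rank)          # LoRA up-projection (kept replicated on every rank)

# Reference: unsharded LoRA forward, y = x W^T + (x A^T) B^T
y_ref = x @ W.T + (x @ A.T) @ B.T

# "TP" version: each rank holds a slice of x, W, and A; the sum below stands in for
# the all-reduce a real row-parallel layer performs on its partial outputs.
partials = [
    x_k @ W_k.T + (x_k @ A_k.T) @ B.T
    for x_k, W_k, A_k in zip(x.chunk(tp, dim=-1), W.chunk(tp, dim=1), A.chunk(tp, dim=1))
]
y_tp = torch.stack(partials).sum(dim=0)

print(torch.allclose(y_ref, y_tp, atol=1e-5))  # True: LoRA partials reduce together with the base partials
```

Either convention preserves the full rank-r update; the caveat worth documenting is that sharding the LoRA factors independently of the base layer's parallel style breaks this equivalence.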