
Commit b8a1c1b

[None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
1 parent: f167b1f

1 file changed: 6 additions and 0 deletions


tensorrt_llm/_torch/pyexecutor/model_engine.py

@@ -787,6 +787,12 @@ def release_batch(result: ScheduledRequests | None):
                     f"Run generation only CUDA graph warmup for batch size={bs}, draft_len={draft_len}"
                 )
                 self.enable_spec_decode = draft_len > 0 or self.is_draft_model
+                if self.pytorch_backend_config.enable_autotuner:
+                    with self.no_cuda_graph(), autotune():
+                        self.forward(batch,
+                                     new_tensors_device=None,
+                                     resource_manager=resource_manager)
+                    torch.cuda.synchronize()
                 self.forward(batch,
                              new_tensors_device=None,
                              resource_manager=resource_manager)
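
The added block runs one eager forward pass under the autotuner, with CUDA graph capture disabled, then synchronizes before the regular warmup pass that captures the graph. This way the captured CUDA graph records the tuned TRT-LLM Gen MoE kernel configurations rather than untuned defaults.

Below is a minimal, self-contained sketch of this autotune-before-capture pattern. Only autotune(), no_cuda_graph(), and enable_autotuner are names taken from the diff; the Engine class, warmup(), and the stand-in context manager bodies are hypothetical and illustrate the control flow, not TRT-LLM's actual implementation.

    # Sketch of the autotune-before-capture warmup pattern. Hypothetical
    # names throughout, except autotune()/no_cuda_graph()/enable_autotuner,
    # which appear in the diff above.
    from contextlib import contextmanager

    import torch


    @contextmanager
    def autotune():
        # Stand-in for TRT-LLM's autotune() context: while active, kernels
        # would profile candidate configurations and cache the fastest one.
        yield


    class Engine:
        def __init__(self, enable_autotuner: bool = True):
            self.enable_autotuner = enable_autotuner
            self._cuda_graphs_enabled = True

        @contextmanager
        def no_cuda_graph(self):
            # Autotuning has to launch kernels eagerly to time them, so
            # CUDA graph capture/replay is temporarily disabled.
            prev = self._cuda_graphs_enabled
            self._cuda_graphs_enabled = False
            try:
                yield
            finally:
                self._cuda_graphs_enabled = prev

        def forward(self, batch: torch.Tensor) -> torch.Tensor:
            # Placeholder for the real model forward; with graphs enabled,
            # the real engine would capture/replay a CUDA graph here.
            return batch * 2

        def warmup(self, batch: torch.Tensor) -> torch.Tensor:
            if self.enable_autotuner:
                # Eager tuning pass, mirroring the added lines in the diff.
                with self.no_cuda_graph(), autotune():
                    self.forward(batch)
                if torch.cuda.is_available():
                    # Let tuning kernels finish before any graph capture,
                    # as the diff does with torch.cuda.synchronize().
                    torch.cuda.synchronize()
            # Regular warmup pass; this is where the graph would be
            # captured, now replaying the tuned kernel configurations.
            return self.forward(batch)


    engine = Engine()
    print(engine.warmup(torch.ones(4)))

The synchronize between the two passes is presumably there because tuning kernels are launched asynchronously; capture should not begin while tuning work is still in flight.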
