Commit b427966

[None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
1 parent f167b1f

File tree

1 file changed (+5, -0)


tensorrt_llm/_torch/pyexecutor/model_engine.py

Lines changed: 5 additions & 0 deletions
@@ -787,6 +787,11 @@ def release_batch(result: ScheduledRequests | None):
                     f"Run generation only CUDA graph warmup for batch size={bs}, draft_len={draft_len}"
                 )
                 self.enable_spec_decode = draft_len > 0 or self.is_draft_model
+                with self.no_cuda_graph(), autotune():
+                    self.forward(batch,
+                                 new_tensors_device=None,
+                                 resource_manager=resource_manager)
+                torch.cuda.synchronize()
                 self.forward(batch,
                              new_tensors_device=None,
                              resource_manager=resource_manager)
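
For context: the added lines run one eager forward pass under the autotune() context manager before the CUDA graph warmup forward. no_cuda_graph() keeps the tuning pass itself out of graph capture, and torch.cuda.synchronize() makes sure all tuning work has finished before capture begins, so the tactics selected while autotuning are the kernels that get recorded into the graph. Below is a minimal, self-contained sketch of that pattern; it is not repository code. The autotune() here is a dummy stand-in for the context manager imported in model_engine.py, and warmup_and_capture is a hypothetical helper that uses a plain nn.Module in place of ModelEngine.forward.

```python
# Sketch of the autotune-before-capture pattern applied by this commit.
# Assumptions: `autotune` is a dummy stand-in for TensorRT-LLM's autotuner
# context manager; `warmup_and_capture` is a hypothetical helper.
import contextlib

import torch


@contextlib.contextmanager
def autotune():
    # Stand-in: in TensorRT-LLM this enables tactic profiling (e.g. for
    # TRT-LLM Gen MoE kernels) during the enclosed forward pass.
    yield


def warmup_and_capture(model, static_input):
    # 1) Eager forward under autotune(), on a side stream as PyTorch's CUDA
    #    graph docs recommend for warmup, so any timing kernels launched by
    #    tuning run outside graph capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), autotune():
        model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Mirror the commit: wait for all tuning work before capturing.
    torch.cuda.synchronize()

    # 2) Capture the steady-state forward. The tactics chosen during the
    #    autotune pass are the ones baked into the replayed graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = model(static_input)
    return graph, static_output


if __name__ == "__main__":
    if torch.cuda.is_available():
        model = torch.nn.Linear(16, 16).cuda()
        x = torch.randn(8, 16, device="cuda")
        graph, out = warmup_and_capture(model, x)
        x.copy_(torch.randn_like(x))  # update the static input in place
        graph.replay()                # re-run the captured kernels
        print(out.shape)
```

As the commit title suggests, the point of the ordering is that TRT-LLM Gen MoE tactics are profiled during the eager pass, so the graph captured immediately afterwards records the tuned kernels rather than default ones.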
