
Commit fbec724

revert unintended changes, revert openai server changes

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
1 parent 4aa47fb commit fbec724

19 files changed (+482, -281 lines)

docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ Note: MoE module load balancing is handled separately by the Expert Parallel Loa
The $sol\\_tps$ represents the theoretical upper-bound throughput achievable with perfect load balancing:

```math
-sol\_time = \sum_{i=0}^{\infty} \frac{time_i}{balance\_ratio_i}
+sol\_time = \sum_{i=0}^{\infty} time_i * balance\_ratio_i
```

```math
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@

## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT-LLM)

This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It replaces the low‑latency flow from the previous guide and intentionally omits max‑throughput, Hopper, and benchmarking content.

### Prerequisites

- NVIDIA GB200 or B200 GPUs (example below assumes 8 GPUs; adjust flags for your setup)
- Fast SSD storage for model weights
- Base model weights available under a directory named `gpt-oss-120b` (example path)
- Eagle3 speculative model assets available under a directory named `eagle`

Expected directory layout on the host (example):

```
/path/to/models/
├─ gpt-oss-120b/   # base model directory
└─ eagle/          # Eagle3 speculative decoding assets
```

### Get the TensorRT-LLM Container (1.1.0rc0)

If required by your environment, log into NGC and pull the image:

```bash
# Create an API key at https://ngc.nvidia.com (if you don't have one)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>

docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
```
### Start the TensorRT-LLM Container

Run the container and bind-mount your models directory to `/config/models` inside the container:

```bash
docker run --rm --ipc=host -it \
  --ulimit stack=67108864 \
  --ulimit memlock=-1 \
  --gpus all \
  -p 8000:8000 \
  -v /path/to/models:/config/models:rw \
  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
  /bin/bash
```

Replace `/path/to/models` with the absolute path on your host.
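Optionally, confirm the bind mount is visible from inside the container. This is a quick sanity check added here for convenience, assuming the example layout above:

```bash
# The host models directory should appear under /config/models
ls /config/models
# Expected output: eagle  gpt-oss-120b
```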
### Download the Models (Base + Eagle3)

Inside the container, download the base model and the Eagle3 speculative model to the expected directories under `/config/models/`:

```bash
# Optional: authenticate if the repository requires it
# export HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# huggingface-cli login --token "$HF_TOKEN" --add-to-git-credential

pip install -q "huggingface_hub[cli]"

# Base model: openai/gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b \
  --local-dir /config/models/gpt-oss-120b \
  --repo-type model

# Eagle3 model assets
mkdir -p /config/models/eagle
huggingface-cli download nvidia/gpt-oss-120b-Eagle3 \
  --local-dir /config/models/eagle \
  --repo-type model
```

References: `https://huggingface.co/openai/gpt-oss-120b` and `https://huggingface.co/nvidia/gpt-oss-120b-Eagle3`
### Create the Eagle3 Configuration

Inside the container, create the YAML file at `/config/models/eagle/eagle.yaml` with the following content:

```bash
mkdir -p /config/models/eagle
cat > /config/models/eagle/eagle.yaml << 'EOF'
trust_remote_code: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /config/models/eagle/
cuda_graph_config:
  max_batch_size: 10
use_torch_sampler: true
moe_config:
  backend: TRTLLM
EOF
```

Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on top of tree (the latest main branch rather than the 1.1.0rc0 release), replace `use_torch_sampler: true` with `sampler_type: TorchSampler`; a one-line way to do this is sketched after this list.
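The following convenience one-liner is not part of the original guide; it simply performs the substitution from the last note in place:

```bash
# Swap the sampler field for top-of-tree builds (edits eagle.yaml in place)
sed -i 's/^use_torch_sampler: true$/sampler_type: TorchSampler/' /config/models/eagle/eagle.yaml
```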
### Launch the Server (Eagle3 Speculative Decoding)

Run the following command inside the container to start the endpoint:

```bash
TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --max_batch_size 10 \
  --tp_size 8 --ep_size 4 \
  --trust_remote_code \
  --extra_llm_api_options /config/models/eagle/eagle.yaml \
  --max_num_tokens 131072 --max_seq_len 131072
```

The server initializes, loads, and optimizes the models. After it is ready, it listens on port 8000.
### Quick Health Check

From another terminal on the host, verify that the server is healthy:

```bash
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When `Status: 200` is returned, the endpoint is ready to serve requests.
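Model loading can take a while, so it may be convenient to poll until the endpoint comes up. A minimal sketch, assuming the same host and port as above:

```bash
# Poll the health endpoint every 10 seconds until it returns HTTP 200
until curl -sf -o /dev/null "http://localhost:8000/health"; do
  echo "Waiting for the server to become ready..."
  sleep 10
done
echo "Server is ready."
```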
### Sample Chat Completions Request

Note: This Eagle3 + TensorRT-LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.

Send a simple OpenAI-compatible Chat Completions request to the running server:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."}
    ],
    "max_tokens": 128,
    "stream": false
  }'
```
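Because the endpoint is OpenAI-compatible, a streaming variant of the same request should also work. This sketch is not part of the original guide; it simply sets `stream` to `true` so tokens come back as server-sent events:

```bash
# Same request, but stream tokens back as server-sent events (SSE)
curl -N -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."}
    ],
    "max_tokens": 128,
    "stream": true
  }'
```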

docs/source/commands/trtllm-serve/trtllm-serve.rst

Lines changed: 34 additions & 30 deletions
@@ -201,56 +201,60 @@ Metrics Endpoint

.. note::

-    This endpoint is beta maturity.
+    The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the one for the TensorRT backend.

-    The statistics for the PyTorch backend are beta and not as comprehensive as those for the TensorRT backend.
+    Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.

-    Some fields, such as CPU memory usage, are not available for the PyTorch backend.
+    Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.

-    Enabling ``enable_iter_perf_stats`` in the PyTorch backend can impact performance slightly, depending on the serving configuration.
+The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details.

-The ``/metrics`` endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
-For the TensorRT backend, these statistics are enabled by default.
-However, you must explicitly enable iteration statistics logging for the PyTorch backend by setting the `enable_iter_perf_stats` field in a YAML configuration file as shown in the following example:
+For the default PyTorch backend, iteration statistics logging is enabled by setting the ``enable_iter_perf_stats`` field in a YAML file:

.. code-block:: yaml

-    # extra-llm-api-config.yml
-    pytorch_backend_config:
-      enable_iter_perf_stats: true
+    # extra_llm_config.yaml
+    enable_iter_perf_stats: true

-Then start the server and specify the ``--extra_llm_api_options`` argument with the path to the YAML file as shown in the following example:
+Start the server and specify the ``--extra_llm_api_options`` argument with the path to the YAML file:

.. code-block:: bash

-    trtllm-serve <model> \
-      --extra_llm_api_options <path-to-extra-llm-api-config.yml> \
-      [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]
+    trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --extra_llm_api_options extra_llm_config.yaml

-After at least one inference request is sent to the server, you can fetch the runtime-iteration statistics by polling the `/metrics` endpoint:
+After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint.
+Since the statistics are stored in an internal queue and removed once retrieved, it's recommended to poll the endpoint shortly after each request and store the results if needed.

.. code-block:: bash

-    curl -X GET http://<host>:<port>/metrics
+    curl -X GET http://localhost:8000/metrics

-*Example Output*
+Example output:

.. code-block:: json

-    [
-      {
-        "gpuMemUsage": 56401920000,
-        "inflightBatchingStats": {
+    [
+      {
+        "gpuMemUsage": 76665782272,
+        "iter": 154,
+        "iterLatencyMS": 7.00688362121582,
+        "kvCacheStats": {
+          "allocNewBlocks": 3126,
+          "allocTotalBlocks": 3126,
+          "cacheHitRate": 0.00128,
+          "freeNumBlocks": 101253,
+          "maxNumBlocks": 101256,
+          "missedBlocks": 3121,
+          "reusedBlocks": 4,
+          "tokensPerBlock": 32,
+          "usedNumBlocks": 3
+        },
+        "numActiveRequests": 1
        ...
-        },
-        "iter": 1,
-        "iterLatencyMS": 16.505143404006958,
-        "kvCacheStats": {
-          ...
-        },
-        "newActiveRequestsQueueLatencyMS": 0.0007503032684326172
-      }
-    ]
+      }
+    ]

Syntax
------
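As a companion to the polling recommendation in the updated `/metrics` documentation above, here is a minimal sketch (not part of the diff) that polls the endpoint after each request and appends the results to a local log; it assumes a server already listening on localhost:8000:

```bash
# The server drains its internal statistics queue every time /metrics is read,
# so poll shortly after each request and keep the results yourself.
for i in 1 2 3; do
  curl -s "http://localhost:8000/metrics" >> metrics_log.jsonl
  echo >> metrics_log.jsonl   # separate successive polls with a newline
  sleep 1
done
```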
