
Commit fbec724

revert unintended changes, revert openai server changes

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
1 parent 4aa47fb commit fbec724

19 files changed (+482, -281 lines)

docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ Note: MoE module load balancing is handled separately by the Expert Parallel Loa
The $sol\\_tps$ represents the theoretical upper-bound throughput achievable with perfect load balancing:

```math
-sol\_time = \sum_{i=0}^{\infty} \frac{time_i}{balance\_ratio_i}
+sol\_time = \sum_{i=0}^{\infty} time_i * balance\_ratio_i
```

```math
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@

## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT-LLM)

This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It replaces the low‑latency flow from the previous guide and intentionally omits max‑throughput, Hopper, and benchmarking content.

### Prerequisites

- NVIDIA GB200 or B200 GPUs (example below assumes 8 GPUs; adjust flags for your setup)
- Fast SSD storage for model weights
- Base model weights available under a directory named `gpt-oss-120b` (example path)
- Eagle3 speculative model assets available under a directory named `eagle`

Expected directory layout on the host (example):

```
/path/to/models/
├─ gpt-oss-120b/   # base model directory
└─ eagle/          # Eagle3 speculative decoding assets
```

### Get the TensorRT-LLM Container (1.1.0rc0)

If required by your environment, log into NGC and pull the image:

```bash
# Create an API key at https://ngc.nvidia.com (if you don't have one)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>

docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
```
### Start the TensorRT-LLM Container

Run the container and bind-mount your models directory to `/config/models` inside the container:

```bash
docker run --rm --ipc=host -it \
  --ulimit stack=67108864 \
  --ulimit memlock=-1 \
  --gpus all \
  -p 8000:8000 \
  -v /path/to/models:/config/models:rw \
  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
  /bin/bash
```

Replace `/path/to/models` with the absolute path on your host.
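Optionally, confirm the bind mount is visible from inside the container. This is a quick sanity check added here for convenience, assuming the example layout above:

```bash
# The host models directory should appear under /config/models
ls /config/models
# Expected output: eagle  gpt-oss-120b
```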
### Download the Models (Base + Eagle3)

Inside the container, download the base model and the Eagle3 speculative model to the expected directories under `/config/models/`:

```bash
# Optional: authenticate if the repository requires it
# export HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# huggingface-cli login --token "$HF_TOKEN" --add-to-git-credential

pip install -q "huggingface_hub[cli]"

# Base model: openai/gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b \
  --local-dir /config/models/gpt-oss-120b \
  --repo-type model

# Eagle3 model assets
mkdir -p /config/models/eagle
huggingface-cli download nvidia/gpt-oss-120b-Eagle3 \
  --local-dir /config/models/eagle \
  --repo-type model
```

References: `https://huggingface.co/openai/gpt-oss-120b` and `https://huggingface.co/nvidia/gpt-oss-120b-Eagle3`
### Create the Eagle3 Configuration

Inside the container, create the YAML file at `/config/models/eagle/eagle.yaml` with the following content:

```bash
mkdir -p /config/models/eagle
cat > /config/models/eagle/eagle.yaml << 'EOF'
trust_remote_code: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /config/models/eagle/
cuda_graph_config:
  max_batch_size: 10
use_torch_sampler: true
moe_config:
  backend: TRTLLM
EOF
```

Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on top of tree (the latest main branch rather than the 1.1.0rc0 release), replace `use_torch_sampler: true` with `sampler_type: TorchSampler`; a one-line way to do this is sketched after this list.
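The following convenience one-liner is not part of the original guide; it simply performs the substitution from the last note in place:

```bash
# Swap the sampler field for top-of-tree builds (edits eagle.yaml in place)
sed -i 's/^use_torch_sampler: true$/sampler_type: TorchSampler/' /config/models/eagle/eagle.yaml
```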
### Launch the Server (Eagle3 Speculative Decoding)

Run the following command inside the container to start the endpoint:

```bash
TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --max_batch_size 10 \
  --tp_size 8 --ep_size 4 \
  --trust_remote_code \
  --extra_llm_api_options /config/models/eagle/eagle.yaml \
  --max_num_tokens 131072 --max_seq_len 131072
```

The server initializes, loads, and optimizes the models. After it is ready, it listens on port 8000.
### Quick Health Check

From another terminal on the host, verify that the server is healthy:

```bash
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When `Status: 200` is returned, the endpoint is ready to serve requests.
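Model loading can take a while, so it may be convenient to poll until the endpoint comes up. A minimal sketch, assuming the same host and port as above:

```bash
# Poll the health endpoint every 10 seconds until it returns HTTP 200
until curl -sf -o /dev/null "http://localhost:8000/health"; do
  echo "Waiting for the server to become ready..."
  sleep 10
done
echo "Server is ready."
```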
### Sample Chat Completions Request

Note: This Eagle3 + TensorRT-LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.

Send a simple OpenAI-compatible Chat Completions request to the running server:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."}
    ],
    "max_tokens": 128,
    "stream": false
  }'
```
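Because the endpoint is OpenAI-compatible, a streaming variant of the same request should also work. This sketch is not part of the original guide; it simply sets `stream` to `true` so tokens come back as server-sent events:

```bash
# Same request, but stream tokens back as server-sent events (SSE)
curl -N -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."}
    ],
    "max_tokens": 128,
    "stream": true
  }'
```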

docs/source/commands/trtllm-serve/trtllm-serve.rst

Lines changed: 34 additions & 30 deletions
@@ -201,56 +201,60 @@ Metrics Endpoint

.. note::

-    This endpoint is beta maturity.
+    The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the one for the TensorRT backend.

-    The statistics for the PyTorch backend are beta and not as comprehensive as those for the TensorRT backend.
+    Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.

-    Some fields, such as CPU memory usage, are not available for the PyTorch backend.
+    Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.

-    Enabling ``enable_iter_perf_stats`` in the PyTorch backend can impact performance slightly, depending on the serving configuration.
+The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details.

-The ``/metrics`` endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
-For the TensorRT backend, these statistics are enabled by default.
-However, you must explicitly enable iteration statistics logging for the PyTorch backend by setting the `enable_iter_perf_stats` field in a YAML configuration file as shown in the following example:
+For the default PyTorch backend, iteration statistics logging is enabled by setting the ``enable_iter_perf_stats`` field in a YAML file:

.. code-block:: yaml

-    # extra-llm-api-config.yml
-    pytorch_backend_config:
-      enable_iter_perf_stats: true
+    # extra_llm_config.yaml
+    enable_iter_perf_stats: true

-Then start the server and specify the ``--extra_llm_api_options`` argument with the path to the YAML file as shown in the following example:
+Start the server and specify the ``--extra_llm_api_options`` argument with the path to the YAML file:

.. code-block:: bash

-    trtllm-serve <model> \
-      --extra_llm_api_options <path-to-extra-llm-api-config.yml> \
-      [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]
+    trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --extra_llm_api_options extra_llm_config.yaml

-After at least one inference request is sent to the server, you can fetch the runtime-iteration statistics by polling the `/metrics` endpoint:
+After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint.
+Since the statistics are stored in an internal queue and removed once retrieved, it's recommended to poll the endpoint shortly after each request and store the results if needed.

.. code-block:: bash

-    curl -X GET http://<host>:<port>/metrics
+    curl -X GET http://localhost:8000/metrics

-*Example Output*
+Example output:

.. code-block:: json

-    [
-      {
-        "gpuMemUsage": 56401920000,
-        "inflightBatchingStats": {
+    [
+      {
+        "gpuMemUsage": 76665782272,
+        "iter": 154,
+        "iterLatencyMS": 7.00688362121582,
+        "kvCacheStats": {
+          "allocNewBlocks": 3126,
+          "allocTotalBlocks": 3126,
+          "cacheHitRate": 0.00128,
+          "freeNumBlocks": 101253,
+          "maxNumBlocks": 101256,
+          "missedBlocks": 3121,
+          "reusedBlocks": 4,
+          "tokensPerBlock": 32,
+          "usedNumBlocks": 3
+        },
+        "numActiveRequests": 1
        ...
-        },
-        "iter": 1,
-        "iterLatencyMS": 16.505143404006958,
-        "kvCacheStats": {
-          ...
-        },
-        "newActiveRequestsQueueLatencyMS": 0.0007503032684326172
-      }
-    ]
+      }
+    ]

Syntax
------
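As a companion to the polling recommendation in the updated `/metrics` documentation above, here is a minimal sketch (not part of the diff) that polls the endpoint after each request and appends the results to a local log; it assumes a server already listening on localhost:8000:

```bash
# The server drains its internal statistics queue every time /metrics is read,
# so poll shortly after each request and keep the results yourself.
for i in 1 2 3; do
  curl -s "http://localhost:8000/metrics" >> metrics_log.jsonl
  echo >> metrics_log.jsonl   # separate successive polls with a newline
  sleep 1
done
```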
