
Commit ce18a01

re-write troubleshooting section
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
1 parent 3f9e385 commit ce18a01


examples/models/core/exaone/README.md

Lines changed: 14 additions & 2 deletions
@@ -93,15 +93,27 @@ For more information, please refer to official [docs](https://github.com/NVIDIA/
 
 Troubleshooting
 
+The following error may occur during quantization:
 ```bash
 torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
 Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
 Hint: Move the offending context manager(s) to outside the compiled region.
 Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.
 ```
-If you encounter the above log messages, it means torch.compile() may not be compatible with the HybridCache module of the transformers library. As a result, [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt) cannot perform PTQ with HybridCache.
 
-To resolve this, please use DynamicCache when creating PTQ models. Be aware that DynamicCache disables sliding windows, which may break the model behavior. By default, ModelOpt's PTQ procedure uses relatively short input lengths (less than the sliding window size of EXAONE-4.0), so this workaround is effective as long as input lengths are not increased. In our tests, the default ModelOpt settings did not hurt accuracy on MMLU or GSM8k benchmarks.
+This error may indicate an incompatibility between `torch.compile()` and the `HybridCache` module of the transformers library. As a result, [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt) cannot perform PTQ with HybridCache.
+
+Temporarily switching to `DynamicCache` when creating PTQ models can work around the issue. To do so, update the `cache_implementation` field in the `generation_config.json` file located in the model checkpoint directory, for example:
+
+```json
+// generation_config.json
+{
+  // Change "hybrid" to "dynamic" to run PTQ.
+  // Revert this to "hybrid" after quantization is complete.
+  "cache_implementation": "hybrid",
+  ...
+}
+```
+For models with sliding window attention, `DynamicCache` is less memory-efficient than `HybridCache` because it retains the entire key-value cache. However, it does not break the model's attention logic, since the cache implementation is separate from the attention computation itself. This trade-off is acceptable for PTQ, which is a one-time procedure. Our tests confirm that this workaround does not degrade accuracy on the MMLU or GSM8K benchmarks with the default ModelOpt settings.
 
 ### TRT flow
 ### Convert checkpoint and build TensorRT engine(s)
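
The switch-and-revert cycle described in the new text is easy to script. Below is a minimal sketch, assuming a local checkpoint directory (`./EXAONE-4.0-32B` here is a hypothetical path) and a hypothetical `run_ptq` placeholder standing in for the actual ModelOpt PTQ invocation:

```python
import json
from pathlib import Path


def set_cache_implementation(checkpoint_dir: str, impl: str) -> None:
    """Rewrite the `cache_implementation` field in generation_config.json."""
    config_path = Path(checkpoint_dir) / "generation_config.json"
    config = json.loads(config_path.read_text())
    config["cache_implementation"] = impl
    config_path.write_text(json.dumps(config, indent=2) + "\n")


def run_ptq(checkpoint_dir: str) -> None:
    """Hypothetical placeholder: invoke the ModelOpt PTQ flow here."""
    ...


checkpoint = "./EXAONE-4.0-32B"  # hypothetical local checkpoint path

# Switch to DynamicCache so torch.compile() does not hit the HybridCache graph break.
set_cache_implementation(checkpoint, "dynamic")
try:
    run_ptq(checkpoint)
finally:
    # Always revert to HybridCache once quantization finishes (or fails).
    set_cache_implementation(checkpoint, "hybrid")
```

The `try/finally` ensures the checkpoint is restored to `hybrid` even if quantization fails partway through.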
