#### Troubleshooting
The following error may occur during quantization:
```bash
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
Hint: Move the offending context manager(s) to outside the compiled region.
Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.
```
This error may indicate an incompatibility between `torch.compile()` and the `HybridCache` module of the `transformers` library. As a result, [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt) cannot perform PTQ with `HybridCache`.

Temporarily switching to `DynamicCache` when creating PTQ models works around the issue: update the `cache_implementation` field in the `generation_config.json` file in the model checkpoint directory, for example:
```json
// generation_config.json
{
  // Change "hybrid" to "dynamic" to run PTQ.
  // Revert this to "hybrid" after quantization is complete.
  "cache_implementation": "hybrid",
  ...
}
```
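
The edit can also be scripted. Below is a minimal sketch that flips the field before quantization; the checkpoint directory name (`EXAONE-4.0-32B` here) is a placeholder for wherever your model checkpoint lives:

```python
import json
from pathlib import Path

# Placeholder path: point this at your local model checkpoint directory.
config_path = Path("EXAONE-4.0-32B") / "generation_config.json"

# Load the generation config, swap the cache implementation, and write it back.
config = json.loads(config_path.read_text())
config["cache_implementation"] = "dynamic"  # switch from "hybrid" for PTQ
config_path.write_text(json.dumps(config, indent=2) + "\n")
```

Running the same script with `"hybrid"` after quantization restores the original behavior.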
For models with sliding window attention, `DynamicCache` is less memory-efficient than `HybridCache` because it retains the entire key-value cache. However, it does not break the model's attention logic, as the cache implementation is separate from the attention computation itself. This trade-off is acceptable for PTQ, which is a one-time procedure. Our tests confirm that the workaround does not degrade accuracy on the MMLU or GSM8K benchmarks with the default ModelOpt settings.
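
As an illustration of that separation, recent `transformers` versions also let you hand `generate()` an explicit cache object without editing the checkpoint at all. The sketch below assumes a `transformers` version that supports the checkpoint, and the model ID is illustrative; ModelOpt's calibration loop may not expose this hook, so editing `generation_config.json` remains the reliable route for PTQ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

# Illustrative checkpoint ID; substitute your local EXAONE-4.0 path or HF repo.
model_id = "LGAI-EXAONE/EXAONE-4.0-32B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, EXAONE!", return_tensors="pt")
# Passing a DynamicCache instance overrides the "hybrid" default from
# generation_config.json for this call only.
outputs = model.generate(**inputs, past_key_values=DynamicCache(), max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```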
### TRT flow
### Convert checkpoint and build TensorRT engine(s)