DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

RY Aminabadi, S Rajbhandari, AA Awan, C Li, D Li, E Zheng, O Ruwase, S Smith, M Zhang
SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022 - ieeexplore.ieee.org
The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges with (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. DeepSpeed-Inference reduces latency by 6.4× and increases throughput by 1.5× over the state-of-the-art. It enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference, and it can run inference on models 25× larger than GPU-only solutions while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
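
The multi-GPU solution described in the abstract is exposed in the DeepSpeed library through its inference entry point. The sketch below is illustrative only: the gpt2 placeholder model, the tensor-parallel degree, and argument names such as mp_size and replace_with_kernel_inject follow older public DeepSpeed examples and may differ across DeepSpeed versions; it is a minimal sketch, not the paper's evaluation setup.

```python
# Minimal sketch: tensor-parallel inference with DeepSpeed's fused kernels.
# Assumes a CUDA-capable node and a DeepSpeed version exposing init_inference
# with these argument names; launch with e.g. `deepspeed --num_gpus 2 run.py`.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper targets far larger transformers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Shard the model across the GPUs visible to this launch (tensor parallelism)
# and substitute DeepSpeed's optimized transformer kernels where supported.
engine = deepspeed.init_inference(
    model,
    mp_size=torch.cuda.device_count(),  # tensor-parallel degree
    dtype=torch.float16,                # half precision for latency/throughput
    replace_with_kernel_inject=True,    # inject fused inference kernels
)

inputs = tokenizer("DeepSpeed-Inference serves large transformers",
                   return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For models that exceed aggregate GPU memory, the paper's second solution (ZeRO-Inference-style heterogeneous offload to CPU/NVMe memory) is configured separately in DeepSpeed rather than through the arguments shown above.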