DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

RY Aminabadi, S Rajbhandari, AA Awan, C Li, D Li, E Zheng, O Ruwase, S Smith, M Zhang
SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022 - ieeexplore.ieee.org
The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges with (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. DeepSpeed-Inference reduces latency by 6.4× and increases throughput by 1.5× over the state-of-the-art. It enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference, and it can run inference on models 25× larger than GPU-only solutions while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
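
The multi-GPU solution described in the abstract is exposed in the DeepSpeed library through its inference entry point. The sketch below is illustrative only: the gpt2 placeholder model, the tensor-parallel degree, and argument names such as mp_size and replace_with_kernel_inject follow older public DeepSpeed examples and may differ across DeepSpeed versions; it is a minimal sketch, not the paper's evaluation setup.

```python
# Minimal sketch: tensor-parallel inference with DeepSpeed's fused kernels.
# Assumes a CUDA-capable node and a DeepSpeed version exposing init_inference
# with these argument names; launch with e.g. `deepspeed --num_gpus 2 run.py`.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper targets far larger transformers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Shard the model across the GPUs visible to this launch (tensor parallelism)
# and substitute DeepSpeed's optimized transformer kernels where supported.
engine = deepspeed.init_inference(
    model,
    mp_size=torch.cuda.device_count(),  # tensor-parallel degree
    dtype=torch.float16,                # half precision for latency/throughput
    replace_with_kernel_inject=True,    # inject fused inference kernels
)

inputs = tokenizer("DeepSpeed-Inference serves large transformers",
                   return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For models that exceed aggregate GPU memory, the paper's second solution (ZeRO-Inference-style heterogeneous offload to CPU/NVMe memory) is configured separately in DeepSpeed rather than through the arguments shown above.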