MCR-DL: Mix-and-Match Communication Runtime for Deep Learning

Anthony, Quentin; Awan, Ammar Ahmad; Rasley, Jeff; He, Yuxiong; Shafi, Aamir; Abduljabbar, Mustafa; Subramoni, Hari; Panda, Dhabaleswar

arXiv:2303.08374 (cs)

[Submitted on 15 Mar 2023]

Abstract: In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massive models necessitates advanced parallelism strategies to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales. Examples of models using advanced parallelism strategies include Deep Learning Recommendation Models (DLRM) and Mixture-of-Experts (MoE). Communication libraries' performance varies wildly across different communication operations, scales, and message sizes. We propose MCR-DL: an extensible DL communication framework that supports all point-to-point and collective operations while enabling users to dynamically mix-and-match communication backends for a given operation without deadlocks. MCR-DL also comes packaged with a tuning suite for dynamically selecting the best communication backend for a given input tensor. We select DeepSpeed-MoE and DLRM as candidate DL models and demonstrate a 31% improvement in DS-MoE throughput on 256 V100 GPUs on the Lassen HPC system. Further, we achieve a 20% throughput improvement in a dense Megatron-DeepSpeed model and a 25% throughput improvement in DLRM on 32 A100 GPUs with the Theta-GPU HPC system.
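
The idea of choosing a backend per operation and message size can be illustrated with a minimal Python sketch. This is not MCR-DL's API: the names `TUNING_TABLE`, `DEFAULT_BACKEND`, and `select_backend` are hypothetical, and the size thresholds and backend assignments are arbitrary placeholders, not measurements from the paper.

```python
from bisect import bisect_right

# Hypothetical tuning table (an assumption for illustration only).
# For each operation: a list of (minimum message size in bytes, backend),
# sorted by size. The numbers are placeholders, not measured results.
TUNING_TABLE = {
    "all_reduce": [(0, "mpi"), (64 * 1024, "nccl")],
    "all_to_all": [(0, "nccl"), (1024 * 1024, "mpi")],
}

# Fallback used when an operation has no tuning entry (also an assumption).
DEFAULT_BACKEND = "nccl"


def select_backend(op: str, message_bytes: int) -> str:
    """Return the backend whose size range contains message_bytes."""
    entries = TUNING_TABLE.get(op)
    if not entries:
        return DEFAULT_BACKEND
    thresholds = [size for size, _ in entries]
    # Index of the last threshold that is <= message_bytes.
    idx = bisect_right(thresholds, message_bytes) - 1
    if idx < 0:
        return DEFAULT_BACKEND
    return entries[idx][1]


if __name__ == "__main__":
    for op, size in [("all_reduce", 1024), ("all_reduce", 1 << 20), ("broadcast", 4096)]:
        print(op, size, select_backend(op, size))
```

In a real system the table would presumably be produced by benchmarking each backend on the target cluster, and the lookup would run once per communication call using the input tensor's size in bytes.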

Comments:	Accepted, to be presented at IPDPS 2023
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2303.08374 [cs.DC]
	(or arXiv:2303.08374v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2303.08374

Submission history

From: Quentin Anthony
[v1] Wed, 15 Mar 2023 05:23:42 UTC (6,270 KB)
