Accelerate PyTorch with DeepSpeed to train large language models with Intel Habana Gaudi-based DL1 EC2 instances



Training large language models (LLMs) with billions of parameters can be challenging. In addition to designing the model architecture, researchers need to set up state-of-the-art techniques for distributed training, such as mixed precision support, gradient accumulation, and checkpointing. With large models, the training setup is even more challenging because the available memory in a single accelerator device bounds the size of models that can be trained using only data parallelism, and model parallel training requires additional modifications to the training code. Libraries such as DeepSpeed (an open-source deep learning optimization library for PyTorch) address some of these challenges and can help accelerate model development and training.

In this post, we set up training on the Intel Habana Gaudi-based Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances and quantify the benefits of using a scaling framework such as DeepSpeed. We present scaling results for an encoder-type transformer model (BERT with 340 million to 1.5 billion parameters). For the 1.5-billion-parameter model, we achieved a scaling efficiency of 82.7% across 128 accelerators (16 dl1.24xlarge instances) using DeepSpeed ZeRO stage 1 optimizations. DeepSpeed partitioned the optimizer states across accelerators so that large models could be trained using the data parallel paradigm. This approach has been extended to train a 5-billion-parameter model using data parallelism. We also used Gaudi's native support of the BF16 data type for a reduced memory footprint and increased training performance compared to using the FP32 data type. As a result, we achieved pre-training (phase 1) model convergence within 16 hours (our target was to train a large model within a day) for the BERT 1.5-billion-parameter model using the wikicorpus-en dataset.

Training setup

We provisioned a managed compute cluster of 16 dl1.24xlarge instances using AWS Batch. We developed an AWS Batch workshop that illustrates the steps to set up the distributed training cluster with AWS Batch. Each dl1.24xlarge instance has eight Habana Gaudi accelerators, each with 32 GB of memory, and a full mesh RoCE network between cards with a total bi-directional interconnect bandwidth of 700 Gbps each (see Amazon EC2 DL1 instances Deep Dive for more information). The dl1.24xlarge cluster also used four AWS Elastic Fabric Adapters (EFA), with a total of 400 Gbps interconnect between nodes.

The distributed training workshop walks through the setup using AWS Batch and, in particular, its multi-node parallel jobs feature, which launches large-scale containerized training jobs on fully managed clusters. More specifically, a fully managed AWS Batch compute environment is created with DL1 instances. The containers are pulled from Amazon Elastic Container Registry (Amazon ECR) and launched automatically onto the instances in the cluster based on the multi-node parallel job definition. The workshop concludes by running a multi-node, multi-HPU data parallel training of a BERT (340 million to 1.5 billion parameters) model using PyTorch and DeepSpeed.
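
To make the multi-node parallel job definition more concrete, the following is a minimal sketch of the request payload such a definition might use. It is not taken from the workshop: the job name, image URI, command, vCPU count, and memory value are hypothetical placeholders, and device mappings or other container settings that Gaudi accelerators may need are deliberately omitted. The field names follow the AWS Batch `RegisterJobDefinition` request shape.

```python
def build_mnp_job_definition(name, image, command, num_nodes,
                             vcpus, memory_mib,
                             instance_type="dl1.24xlarge"):
    """Build a request payload for an AWS Batch multi-node parallel job.

    Only the payload is built here; nothing is sent to AWS.
    """
    return {
        "jobDefinitionName": name,
        "type": "multinode",
        "nodeProperties": {
            "numNodes": num_nodes,
            "mainNode": 0,  # index of the node that coordinates the job
            "nodeRangeProperties": [
                {
                    "targetNodes": "0:",  # apply to every node, starting at 0
                    "container": {
                        "image": image,
                        "command": command,
                        "instanceType": instance_type,
                        "resourceRequirements": [
                            {"type": "VCPU", "value": str(vcpus)},
                            {"type": "MEMORY", "value": str(memory_mib)},
                        ],
                    },
                }
            ],
        },
    }


payload = build_mnp_job_definition(
    name="bert-deepspeed-dl1",  # hypothetical name
    image="<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>",  # placeholder
    command=["bash", "run_pretraining.sh"],  # hypothetical entry point
    num_nodes=16,        # 16 dl1.24xlarge instances, as in this post
    vcpus=96,            # placeholder; size for your instance type
    memory_mib=700000,   # placeholder; size for your instance type
)
```

With credentials configured, a payload like this could be passed to `boto3.client("batch").register_job_definition(**payload)`; treat the values above as a starting point rather than the workshop's actual configuration.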

BERT 1.5B pre-training with DeepSpeed

Habana SynapseAI v1.5 and v1.6 support DeepSpeed ZeRO1 optimizations. The Habana fork of the DeepSpeed GitHub repository includes the modifications necessary to support the Gaudi accelerators. There is full support of distributed data parallel (multi-card, multi-instance), ZeRO1 optimizations, and BF16 data types.
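
To see why partitioning optimizer states matters on a 32 GB accelerator, here is a rough back-of-the-envelope sketch. It assumes the byte accounting commonly used for mixed-precision Adam in the ZeRO paper (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states); this post does not state the optimizer or the exact memory layout used on Gaudi, so the numbers are illustrative only. Activations and temporary buffers are ignored.

```python
def model_state_bytes_per_device(num_params, num_devices, zero_stage=0,
                                 param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate model-state memory per device, excluding activations.

    The byte counts are assumptions (mixed-precision Adam accounting),
    not measurements from the training runs in this post.
    """
    optimizer_state = optim_bytes * num_params
    if zero_stage >= 1:
        # ZeRO stage 1 shards only the optimizer states across devices.
        optimizer_state /= num_devices
    return (param_bytes + grad_bytes) * num_params + optimizer_state


GB = 1e9
for params in (1.5e9, 5e9):
    plain = model_state_bytes_per_device(params, 128, zero_stage=0) / GB
    zero1 = model_state_bytes_per_device(params, 128, zero_stage=1) / GB
    print(f"{params / 1e9:.1f}B params: {plain:.1f} GB without ZeRO, "
          f"{zero1:.1f} GB with ZeRO stage 1 on 128 devices")
```

Under these assumptions, the model states of a 5-billion-parameter model would not fit in 32 GB with plain data parallelism but would with ZeRO stage 1 across 128 devices, which is consistent with the data parallel approach described in this post.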

All these features are enabled in the BERT 1.5B model reference repository, which introduces a 48-layer, 1600-hidden-dimension, 25-head bi-directional encoder model derived from a BERT implementation. The repository also contains the baseline BERT Large model implementation: a 24-layer, 1024-hidden, 16-head, 340-million-parameter neural network architecture. The pre-training modeling scripts are derived from the NVIDIA Deep Learning Examples repository; they download the wikicorpus_en data, preprocess the raw data into tokens, and shard the data into smaller h5 datasets for distributed data parallel training. You can adopt this generic approach to train your custom PyTorch model architectures with your own datasets on DL1 instances.
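
As a sanity check on the two model sizes, the following sketch estimates the encoder parameter count from the layer count and hidden size. The vocabulary size (30,522), maximum sequence length (512), two token types, and a feed-forward width of four times the hidden size are standard BERT defaults assumed here; the post does not state them. The pre-training heads are left out, and the number of attention heads does not change the count.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT-style encoder.

    Default vocabulary, position, and feed-forward sizes are assumed
    standard BERT values, not figures from the reference repository.
    """
    ffn = ffn_mult * hidden
    # Word, position, and token-type embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value, and output projections (weights and biases).
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block (weights and biases).
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    # Two LayerNorms per layer, each with a scale and a bias vector.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


bert_large = bert_param_count(num_layers=24, hidden=1024)
bert_1_5b = bert_param_count(num_layers=48, hidden=1600)
print(f"BERT Large: {bert_large / 1e6:.0f}M parameters")
print(f"BERT 1.5B:  {bert_1_5b / 1e9:.2f}B parameters")
```

Both estimates land close to the nominal 340 million and 1.5 billion figures. Note that 1024/16 and 1600/25 both give a per-head dimension of 64.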

Pre-training (phase 1) scaling results

For pre-training large models at scale, we focused mainly on two aspects of the solution: training performance, as measured by the time to train, and the cost-effectiveness of arriving at a fully converged solution. Next, we dive deeper into these two metrics with BERT 1.5B pre-training as an example.

Scaling performance and time to train

We start by measuring the performance of the BERT Large implementation as a baseline for scalability. The following table lists the measured throughput in sequences per second from 1 to 8 dl1.24xlarge instances (with eight accelerator devices per instance). Using the single-instance throughput as the baseline, we measured the efficiency of scaling across multiple instances, which is an important lever for understanding the price-performance training metric.

| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
|---|---|---|---|---|
| 1 | 8 | 1,379.76 | 172.47 | 100.0% |
| 2 | 16 | 2,705.57 | 169.10 | 98.04% |
| 4 | 32 | 5,291.58 | 165.36 | 95.88% |
| 8 | 64 | 9,977.54 | 155.90 | 90.39% |
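
The scaling efficiency column can be reproduced from the throughput column. The sketch below assumes efficiency is defined as per-accelerator throughput relative to the single-instance per-accelerator throughput, which matches the reported figures; the variable and function names are illustrative.

```python
ACCELERATORS_PER_INSTANCE = 8

# Measured BERT Large throughput (sequences per second) by instance count.
measured = {1: 1379.76, 2: 2705.57, 4: 5291.58, 8: 9977.54}


def scaling_efficiency(throughput_by_instances, baseline_instances=1):
    """Return efficiency per instance count relative to the baseline.

    Every instance has the same number of accelerators, so the ratio of
    per-instance throughput equals the ratio of per-accelerator throughput.
    """
    base = throughput_by_instances[baseline_instances] / baseline_instances
    return {n: (t / n) / base for n, t in throughput_by_instances.items()}


efficiency = scaling_efficiency(measured)
for n in sorted(efficiency):
    accelerators = n * ACCELERATORS_PER_INSTANCE
    per_accel = measured[n] / accelerators
    print(f"{n} instance(s), {accelerators} accelerators: "
          f"{per_accel:.2f} seq/s per accelerator, {efficiency[n]:.2%}")
```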

The following figure illustrates the scaling efficiency.

For BERT 1.5B, we modified the hyperparameters for the model in the reference repository to ensure convergence. The effective batch size per accelerator was set to 384 (for maximum memory utilization), with micro-batches of 16 per step and 24 steps of gradient accumulation. Learning rates of 0.0015 and 0.003 were used for 8 and 16 nodes, respectively. With these configurations, we achieved convergence of the phase 1 pre-training of BERT 1.5B across 8 dl1.24xlarge instances (64 accelerators) in approximately 25 hours, and across 16 dl1.24xlarge instances (128 accelerators) in approximately 15 hours. The following figure shows the average loss as a function of the number of training epochs as we scale up the number of accelerators.
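
The batch settings above map naturally onto a DeepSpeed configuration. The following is a minimal sketch that uses upstream DeepSpeed configuration keys together with ZeRO stage 1 and BF16; the post does not show the actual configuration file, so the Habana fork or the reference repository may require different or additional settings. The helper name is hypothetical.

```python
import json


def build_deepspeed_config(micro_batch, grad_accum_steps, world_size):
    """Build a DeepSpeed-style config dict for data parallel training.

    DeepSpeed expects train_batch_size to equal
    micro batch size x gradient accumulation steps x number of devices.
    """
    return {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum_steps,
        "train_batch_size": micro_batch * grad_accum_steps * world_size,
        "zero_optimization": {"stage": 1},  # partition optimizer states only
        "bf16": {"enabled": True},
    }


# Micro-batches of 16 with 24 accumulation steps give 384 per accelerator.
config_64 = build_deepspeed_config(16, 24, world_size=64)
config_128 = build_deepspeed_config(16, 24, world_size=128)
print(json.dumps(config_128, indent=2))
```

With these values, the global batch size doubles from 24,576 sequences on 64 accelerators to 49,152 on 128 accelerators, and the learning rate reported above doubles alongside it.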

With the configuration described earlier, we obtained 85% strong scaling efficiency with 64 accelerators and 83% with 128 accelerators, from a baseline of 8 accelerators in a single instance. The following table summarizes the measurements.

| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
|---|---|---|---|---|
| 1 | 8 | 276.66 | 34.58 | 100.0% |
| 8 | 64 | 1,883.63 | 29.43 | 85.1% |
| 16 | 128 | 3,659.15 | 28.59 | 82.7% |

The following figure illustrates the scaling efficiency.


Conclusion

In this post, we evaluated support for DeepSpeed by Habana SynapseAI v1.5/v1.6 and how it helps scale LLM training on Habana Gaudi accelerators. Pre-training of a 1.5-billion-parameter BERT model took 16 hours to converge on a cluster of 128 Gaudi accelerators, with 82.7% strong scaling efficiency. We encourage you to try out the architecture demonstrated in the AWS workshop and consider adopting it to train custom PyTorch model architectures using DL1 instances.

About the authors

Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in physics-infused deep learning and in building and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from the Massachusetts Institute of Technology and has over 25 patents and publications to his credit.

RJ is an engineer on the Search M5 team, leading the efforts to build large-scale deep learning systems for training and inference. Outside of work, he explores different cuisines and plays racquet sports.

Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.

Abhinandan Patni is a Senior Software Engineer at Amazon Search. He focuses on building systems and tooling for scalable distributed deep learning training and real-time inference.

Pierre-Yves Aquilanti is Head of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry's best cloud-based ML Frameworks solutions. His background is in High Performance Computing, and prior to joining AWS, Pierre-Yves worked in the Oil & Gas industry. Pierre-Yves is originally from France and holds a Ph.D. in Computer Science from the University of Lille.