Microsoft and Google Open Sourced These Frameworks Based on Their Work Scaling Deep Learning Training

Google and Microsoft have recently released new frameworks for distributed deep learning training.



I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:


 

Microsoft and Google have been actively working on new methods for training deep neural networks. The result of that work is the release of two new frameworks: Microsoft’s PipeDream and Google’s GPipe, which follow similar principles to scale the training of deep learning models. Both projects have been detailed in their respective research papers (PipeDream, GPipe), which I will try to summarize today.

Training is one of those areas of the deep learning lifecycle that we don’t think of as challenging until models hit a certain scale. While training basic models during experimentation is relatively trivial, the complexity grows with the quality and size of the model. For example, the winner of the 2014 ImageNet visual recognition challenge was GoogLeNet, which achieved 74.8% top-1 accuracy with 4 million parameters, while just three years later the 2017 ImageNet challenge was won by Squeeze-and-Excitation Networks, which achieved 82.7% top-1 accuracy with 145.8 million (36x more) parameters. However, in the same period, GPU memory increased only by a factor of ~3.


As models scale in order to achieve higher levels of accuracy, training those models becomes increasingly challenging. The previous example shows that it is unsustainable to rely on GPU infrastructure improvements alone to achieve better training. Instead, we need distributed computing methods that parallelize training workloads across different nodes. The concept of parallelizable training might sound trivial, but it turns out to be extremely complicated in practice. If you think about it, we are talking about partitioning the knowledge-acquisition aspects of a model across different nodes and afterward recombining the pieces into a cohesive model. However, training parallelism is a must in order to scale deep learning models. To address those challenges, Microsoft and Google have devoted months of research and engineering that resulted in the release of PipeDream and GPipe, respectively.

 

Google’s GPipe

 
GPipe focuses on scaling training workloads for deep learning programs. The complexity of the training process from an infrastructure standpoint is an often-overlooked aspect of deep learning models. Training datasets are getting larger and more complex. For instance, in the health care space, it is not uncommon to encounter models that need to be trained on millions of high-resolution images. As a result, training processes often take a long time to complete and are incredibly expensive in terms of memory and CPU consumption.

An effective way to think about the parallelism of deep learning models is to divide it into data parallelism and model parallelism. The data parallelism approach employs large clusters of machines to split the input data across them. Model parallelism moves the model itself to accelerators, such as GPUs or TPUs, which have special hardware to speed up model training. At a high level, almost all training datasets can be parallelized following some logic, but the same can’t be said about models. For instance, some deep learning models are composed of parallel branches that can be trained independently. In that case, a natural strategy is to divide the computation into partitions and assign each partition to a different accelerator. However, that strategy falls short for deep learning models that stack layers sequentially, which makes it challenging to parallelize computation efficiently.
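To make the distinction concrete, here is a minimal PyTorch sketch contrasting the two approaches; the two-GPU placement and layer sizes are illustrative assumptions, not taken from either paper. Data parallelism replicates the whole model and splits each batch across GPUs, while model parallelism splits the layers themselves across GPUs.

```python
import torch.nn as nn

# Data parallelism: replicate the full model on each GPU and split every
# input batch across the replicas (illustrative two-GPU setup).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
data_parallel_model = nn.DataParallel(model, device_ids=[0, 1])

# Model parallelism: split the layers of the model itself across GPUs.
class ModelParallelNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))
```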

GPipe combines both data and model parallelism by leveraging a technique called pipelining. Conceptually, GPipe is a distributed machine learning library that uses synchronous stochastic gradient descent and pipeline parallelism for training, applicable to any DNN that consists of multiple sequential layers. GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. This design allows GPipe’s accelerators to operate in parallel, maximizing the scalability of the training process.

The following figure illustrates the GPipe model with a neural network of sequential layers partitioned across four accelerators. Fk is the composite forward computation function of the kth partition, and Bk is the corresponding backpropagation function. Bk depends on both Bk+1 from the upper layer and the intermediate activations of Fk. In the top model, we can see how the sequential nature of the network leads to the underutilization of resources. The bottom figure shows the GPipe approach, in which the input mini-batch is divided into smaller micro-batches that can be processed by the accelerators at the same time.
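The scheduling idea can be sketched in a few lines of Python. The snippet below is a conceptual illustration, not GPipe’s actual API: it splits a mini-batch into micro-batches and pushes them through a list of stages, which is the step that lets the accelerators holding different partitions work concurrently.

```python
import torch

def split_into_microbatches(minibatch, num_microbatches):
    # e.g. a mini-batch of 32 examples -> 4 micro-batches of 8 examples
    return torch.chunk(minibatch, num_microbatches, dim=0)

def pipelined_forward(stages, minibatch, num_microbatches=4):
    """Illustrative forward pass: each micro-batch flows through the stages
    (consecutive blocks of layers); in GPipe the stages live on different
    accelerators and process different micro-batches at the same time."""
    outputs = []
    for micro in split_into_microbatches(minibatch, num_microbatches):
        for stage in stages:
            micro = stage(micro)
        outputs.append(micro)
    # Gradients are accumulated across micro-batches and applied once per
    # mini-batch, preserving GPipe's synchronous SGD semantics.
    return torch.cat(outputs, dim=0)
```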

 

Microsoft’s PipeDream

 
A few months ago, Microsoft Research announced the creation of Project Fiddle, a series of research projects to streamline distributed deep learning. PipeDream is one of the first projects released from Project Fiddle, and it focuses on the parallelization of the training of deep learning models.

PipeDream takes a different approach from other methods to scale the training of deep learning models, leveraging a technique known as pipeline parallelism. This approach tries to address some of the shortcomings of data and model parallelism techniques such as the ones used in GPipe. Typically, data parallelism methods suffer from high communication costs at large scale when training on cloud infrastructure, a cost that only grows as GPU compute speeds increase. Similarly, model parallelism techniques often leverage hardware resources inefficiently and place an undue burden on programmers to determine how to split their specific model given a hardware deployment.

 

PipeDream tries to overcome some of the challenges of data and model parallelism methods by using a technique called pipeline parallelism. Conceptually, pipeline-parallel computation involves partitioning the layers of a DNN model into multiple stages, where each stage consists of a consecutive set of layers in the model. Each stage is mapped to a separate GPU that performs the forward pass (and backward pass) for all layers in that stage.
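To make that stage abstraction concrete, here is a minimal two-stage sketch in PyTorch; the layer sizes, device placement, and manual boundary exchange are assumptions for illustration, not PipeDream’s implementation. Each stage holds a consecutive block of layers on its own GPU, and only the activation and its gradient at the stage boundary move between workers.

```python
import torch
import torch.nn as nn

# Two stages, each owning a consecutive block of layers on its own GPU
# (layer sizes and devices are illustrative assumptions).
stage0 = nn.Sequential(nn.Linear(512, 256), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(256, 10)).to("cuda:1")
opt0 = torch.optim.SGD(stage0.parameters(), lr=0.01)
opt1 = torch.optim.SGD(stage1.parameters(), lr=0.01)

def train_step(x, y):
    # Forward pass: only the boundary activation is sent from stage 0 to stage 1.
    act0 = stage0(x.to("cuda:0"))
    act0_on_1 = act0.detach().to("cuda:1").requires_grad_()
    loss = nn.functional.cross_entropy(stage1(act0_on_1), y.to("cuda:1"))

    # Backward pass: only the gradient of the boundary activation travels back.
    opt1.zero_grad()
    loss.backward()
    opt1.step()
    opt0.zero_grad()
    act0.backward(act0_on_1.grad.to("cuda:0"))
    opt0.step()
    return loss.item()
```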

Given a specific deep neural network, PipeDream automatically determines how to partition the operators of the DNN based on a short profiling run performed on a single GPU, balancing computational load among the different stages while minimizing communication for the target platform. PipeDream load-balances effectively even in the presence of model diversity (computation and communication) and platform diversity (interconnect topologies and hierarchical bandwidths). PipeDream’s approach to training parallelism offers several advantages over data- and model-parallelism methods. For starters, PipeDream requires less communication between worker nodes, as each worker in a pipeline execution has to communicate only subsets of the gradients and output activations, and only to a single other worker. Also, PipeDream separates computation and communication in a way that leads to easier parallelism.
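The profiling-and-partitioning idea can be sketched as follows. PipeDream itself optimizes over profiled compute and communication costs; the greedy function below (hypothetical name and example timings) only illustrates the underlying goal of balancing profiled per-layer times across stages.

```python
def partition_layers(per_layer_time_ms, num_stages):
    """Greedily group consecutive layers into `num_stages` stages so that each
    stage's total profiled compute time is close to the overall average."""
    target = sum(per_layer_time_ms) / num_stages
    stages, current, current_time = [], [], 0.0
    for i, t in enumerate(per_layer_time_ms):
        current.append(i)
        current_time += t
        stages_left = num_stages - len(stages) - 1
        layers_left = len(per_layer_time_ms) - i - 1
        if current_time >= target and 0 < stages_left <= layers_left:
            stages.append(current)
            current, current_time = [], 0.0
    stages.append(current)
    return stages

# Example: profiled forward+backward times (ms, hypothetical) for an
# 8-layer model split into 3 stages.
print(partition_layers([4.0, 6.0, 10.0, 3.0, 7.0, 8.0, 2.0, 5.0], 3))
# -> [[0, 1, 2], [3, 4, 5], [6, 7]]
```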

 

Training parallelism is one of the key challenges in building larger and more accurate deep learning models. An active area of research within the deep learning community, training parallelism methods need to combine effective concurrent programming techniques with the nature of deep learning models. While still in their early stages, Google’s GPipe and Microsoft’s PipeDream stand on their own merits as two of the most creative approaches to training parallelism available to deep learning developers.

 
Original. Reposted with permission.
