Google AI introduces a method called Task-Level Mixture-of-Experts (TaskMoE), which takes advantage of the quality gains of model scaling while remaining efficient to serve

Scaling large language models has resulted in significant quality gains in natural language understanding (T5), generation (GPT-3), and multilingual neural machine translation (M4). A typical way to build a larger model is to increase its depth (number of layers) and width (dimensionality of layers), essentially expanding the existing dimensions of the network. Such dense models take an input sequence (divided into smaller components called tokens) and pass every token through the full network, activating every layer and parameter. Although these large, dense models have achieved state-of-the-art results on various natural language processing (NLP) tasks, their training cost increases linearly with model size.

An alternative and increasingly common approach is to build sparsely activated models based on a mixture of experts (MoE) (e.g., GShard-M4 or GLaM), in which each token passed to the network follows a separate subnetwork and thereby skips some of the model parameters. Small router networks, trained along with the rest of the model, decide how to distribute input tokens to each subnetwork (the “experts”). This allows researchers to increase model size (and therefore performance) without proportionally increasing training cost.
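As a rough illustration of the per-token routing described above, here is a minimal NumPy sketch of an MoE feedforward layer with top-k gating. It is not the GShard or GLaM implementation: the class name `TokenMoELayer`, the random weights, the ReLU experts, and the renormalised gate weights are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TokenMoELayer:
    """Toy MoE feedforward layer with per-token top-k routing (illustrative only)."""

    def __init__(self, d_model, d_hidden, num_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router: one logit per expert for every token.
        self.router = rng.normal(0, 0.1, (d_model, num_experts))
        # Each expert is an identical-shaped two-layer feedforward network.
        self.w_in = rng.normal(0, 0.1, (num_experts, d_model, d_hidden))
        self.w_out = rng.normal(0, 0.1, (num_experts, d_hidden, d_model))

    def expert(self, e, x):
        return np.maximum(x @ self.w_in[e], 0.0) @ self.w_out[e]

    def __call__(self, tokens):
        gates = softmax(tokens @ self.router)           # (n_tokens, num_experts)
        chosen = np.argsort(-gates, axis=-1)[:, :self.top_k]
        out = np.zeros_like(tokens)
        for t, token in enumerate(tokens):
            weights = gates[t, chosen[t]]
            weights = weights / weights.sum()           # renormalise over chosen experts
            for w, e in zip(weights, chosen[t]):
                out[t] += w * self.expert(int(e), token)
        return out, chosen

layer = TokenMoELayer(d_model=8, d_hidden=16, num_experts=4, top_k=2)
tokens = np.random.default_rng(1).normal(size=(5, 8))
output, chosen_experts = layer(tokens)   # each row of chosen_experts may differ
```

Only `top_k` of the experts run for any given token, which is why the parameter count can grow without a proportional increase in compute per token.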

Although this is an efficient strategy for training, sending the tokens of a long sequence to multiple experts makes inference computationally expensive, because the experts have to be distributed across a large number of accelerators. Serving the 1.2T-parameter GLaM model, for example, requires 256 TPU-v3 chips. As with dense models, the number of processors needed to serve an MoE model scales linearly with model size, which raises compute requirements and also brings significant communication overhead and added engineering complexity.

In “Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference”, Google researchers introduce an approach called Task-level Mixture-of-Experts (TaskMoE) that takes advantage of the quality gains of model scaling while remaining efficient to serve. The strategy is to train a large multi-task model and then extract from it smaller, stand-alone per-task subnetworks suitable for inference, with no loss in model quality and significantly reduced inference latency. For multilingual neural machine translation (NMT), this approach is more efficient than both previous mixture-of-experts models and models compressed via knowledge distillation.

Training Large Sparsely Activated Models with Task Information

The researchers train a sparsely activated model in which router networks learn to send the tokens of each task-specific input to the same subnetwork of the model, the one associated with that task. In the case of multilingual NMT, every token of a given language is routed to the same subnetwork. This contrasts with sparsely gated mixture-of-experts models (e.g., TokenMoE), in which router networks learn to send different tokens of an input to different subnetworks independently of the task.
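The contrast with per-token routing can be sketched in a few lines. The snippet below assumes a hypothetical `TaskRouter` that holds one row of routing logits per task; in the actual models the router is a learned network trained with the rest of the model, so the lookup table and random initialisation here are simplifications.

```python
import numpy as np

class TaskRouter:
    """Toy task-level router: every token of a task gets the same experts."""

    def __init__(self, num_tasks, num_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # One row of routing logits per task (random here; learned in practice).
        self.logits = rng.normal(size=(num_tasks, num_experts))

    def experts_for(self, task_id):
        # The decision depends only on the task identity, never on the token.
        return np.sort(np.argsort(-self.logits[task_id])[:self.top_k])

    def route(self, task_id, num_tokens):
        # Repeat the same expert assignment for every token of the input.
        return np.tile(self.experts_for(task_id), (num_tokens, 1))

router = TaskRouter(num_tasks=3, num_experts=32, top_k=2)
assignment = router.route(task_id=1, num_tokens=7)   # 7 identical rows
```

Because the assignment is a function of the task alone, the set of experts a task will ever touch is known ahead of time, which is what makes the later extraction step possible.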

Bypassing Distillation by Extracting Subnetworks

This difference in training between TaskMoE and models like TokenMoE affects how inference is approached. Because TokenMoE distributes the tokens of a single task across many experts at both training and inference time, it is computationally expensive at inference.

In TaskMoE, a smaller subnetwork is dedicated to a single task identity during both training and inference. At inference time, the subnetworks are extracted by discarding the experts that each task does not use. TaskMoE and its variants make it possible to train a single large multi-task network and then deploy a separate subnetwork for each task at inference time, without any additional compression step after training. The accompanying figure illustrates how a TaskMoE network is trained and how per-task subnetworks are then extracted for inference.
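The extraction step amounts to slicing out the experts a task uses and dropping the rest. The sketch below is a toy version under stated assumptions: the helper names (`make_experts`, `extract_subnetwork`, and so on) are hypothetical, a single MoE layer stands in for the whole model, and the experts' outputs are averaged with equal weight rather than combined with learned gate values.

```python
import numpy as np

def make_experts(num_experts, d_model, d_hidden, seed=0):
    # Random stand-in weights for one MoE layer.
    rng = np.random.default_rng(seed)
    return {
        "w_in": rng.normal(0, 0.1, (num_experts, d_model, d_hidden)),
        "w_out": rng.normal(0, 0.1, (num_experts, d_hidden, d_model)),
    }

def moe_forward(experts, expert_ids, tokens):
    """Average the outputs of the given experts for every token (assumed equal weights)."""
    outs = [
        np.maximum(tokens @ experts["w_in"][e], 0.0) @ experts["w_out"][e]
        for e in expert_ids
    ]
    return np.mean(outs, axis=0)

def extract_subnetwork(experts, expert_ids):
    """Keep only the experts a task uses; all other experts are discarded."""
    ids = list(expert_ids)
    return {name: weights[ids] for name, weights in experts.items()}

def num_params(experts):
    return sum(weights.size for weights in experts.values())

full = make_experts(num_experts=32, d_model=8, d_hidden=16)
task_experts = [3, 17]                       # hypothetical assignment for one task
sub = extract_subnetwork(full, task_experts)

tokens = np.random.default_rng(1).normal(size=(5, 8))
same = np.allclose(moe_forward(full, task_experts, tokens),
                   moe_forward(sub, [0, 1], tokens))
```

In this toy setting the extracted layer produces the same outputs as the full layer for that task while holding 2 of the 32 experts, which mirrors the claim that extraction needs no retraining.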

To demonstrate this approach, the researchers train models based on the Transformer architecture. As in GShard-M4 and GLaM, the feedforward network of every other Transformer layer is replaced by a Mixture-of-Experts (MoE) layer composed of many identical feedforward networks, the “experts”. For each task, the routing network, trained along with the rest of the model, keeps track of the task identity of all input tokens and selects a fixed number of experts per layer (two in this example) to form the task-specific subnetwork. Both TaskMoE and TokenMoE are 6 layers deep, with 32 experts in each MoE layer and a total of 533 million parameters. The models are trained on publicly available WMT datasets, with over 431 million sentences across 30 language pairs from different language families and scripts.


To highlight the benefit of TaskMoE at inference time, the researchers compare throughput, i.e., the number of tokens decoded per second. Once the subnetwork for each task has been extracted, TaskMoE is 7 times smaller than the 533M-parameter TokenMoE model and can be served on a single TPUv3 core instead of the 64 cores required for TokenMoE.

TaskMoE models have twice the peak throughput of TokenMoE models. Moreover, in the TokenMoE model, 25% of the inference time is spent on inter-device communication, while TaskMoE spends almost no time communicating.

Knowledge distillation, in which a large teacher model trains a smaller student model to match the teacher’s performance, is a common way to build a smaller network that still performs well. This approach, however, comes at the cost of the extra computation needed to train the student from the teacher. The researchers therefore also compare TaskMoE to a baseline TokenMoE model compressed with knowledge distillation. The compressed TokenMoE model is the same size as the per-task subnetwork extracted from TaskMoE.
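For readers unfamiliar with distillation, the extra training it requires typically minimises a loss like the one sketched below, where the student is pushed towards the teacher's output distribution. This is the generic soft-target formulation, not necessarily the exact setup used in the paper; the function names and the `temperature` parameter are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean cross-entropy of the student against the teacher's soft targets."""
    teacher_probs = np.exp(log_softmax(teacher_logits / temperature))
    student_logp = log_softmax(student_logits / temperature)
    return float(-(teacher_probs * student_logp).sum(axis=-1).mean())

teacher = np.array([[2.0, 0.5, -1.0]])        # made-up logits over a 3-word vocabulary
loss_matched = distillation_loss(teacher, teacher)
loss_mismatched = distillation_loss(-teacher, teacher)   # larger than loss_matched
```

Every student update requires a forward pass of the teacher (or a stored copy of its outputs), which is the additional computation that subnetwork extraction avoids.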

In addition to being a simpler method that requires no additional training, TaskMoE outperforms the distilled TokenMoE model by 2.1 BLEU on average across all languages in the multilingual translation model. Distillation retains 43% of the quality gains obtained by scaling a dense multilingual model to a TokenMoE, whereas extracting the smaller subnetwork from the TaskMoE model results in no loss of quality.
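One plausible way to read a figure such as "retains 43% of the gains" is as the fraction of the BLEU improvement over the dense baseline that survives compression. The helper below sketches that calculation; the function name is hypothetical and the scores in the usage line are placeholder values chosen for the example, not results from the paper.

```python
def retained_gain(dense_bleu, large_bleu, compressed_bleu):
    """Fraction of the (large - dense) quality gain kept by the compressed model."""
    gain = large_bleu - dense_bleu
    if gain == 0:
        raise ValueError("the large model shows no gain over the dense baseline")
    return (compressed_bleu - dense_bleu) / gain

# Placeholder scores for illustration only (not taken from the paper).
fraction = retained_gain(dense_bleu=20.0, large_bleu=30.0, compressed_bleu=24.3)
```

Under this reading, a model extracted with no loss in quality has a retained gain of 1.0, since its score equals that of the large model.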

The growing requirement to train models that can generalize to different tasks and modalities only adds to the demand for scaling models. However, serving these huge models remains a significant difficulty. Efficient deployment of large models is an important area of study, and TaskMoE is a promising step towards more inference-friendly algorithms that maintain the quality benefits of scaling.

For an in-depth read, refer to the publication or Google research paper.
