Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

FL33TW00D · January 12, 2021, 8:13am

Interesting new paper from Google improving upon T5.

ClaudeCOULOMBE · January 20, 2021, 3:37pm

Just to add to the previous post… Google Brain recently unveiled a language model with 1.6 trillion (1.6E+12) parameters that reportedly matches or beats the SOTA on several NLP tasks. That is roughly nine times the 175 billion (1.75E+11) parameters of GPT-3. This behemoth was made possible by a new attention-based architecture, the Switch Transformer, a sparse mixture-of-experts model: the parameters are split across many expert sub-networks, and a trainable gating function (the router) sends each token to a single expert, so only a fraction of the weights is active for any given token. Despite its gigantic size, the authors report that this text-to-text model pre-trains up to 7 times faster than T5 on C4 (Colossal Clean Crawled Corpus, 750 GB) using the same amount of computation. The original article: https://bit.ly/2LQzsmJ, the source code: http://bit.ly/390j0ZY
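
To make the gating idea concrete, here is a minimal pure-Python sketch of top-1 ("switch") routing. It is a toy illustration, not the authors' code: the names `switch_route`, `switch_layer` and `make_expert` are hypothetical, the experts are simple scaling functions standing in for feed-forward networks, and the capacity limit and load-balancing loss used in the paper are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(token, router_weights):
    """Pick one expert for a token; return (expert_index, gate_probability).

    router_weights holds one row per expert (hypothetical layout).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def make_expert(scale):
    """Toy expert: multiplies every feature by a constant (stand-in for an FFN)."""
    return lambda token: [scale * v for v in token]


def switch_layer(tokens, router_weights, experts):
    """Send each token to its single chosen expert and scale the output by the gate."""
    outputs = []
    for token in tokens:
        idx, gate = switch_route(token, router_weights)
        outputs.append([gate * v for v in experts[idx](token)])
    return outputs


# Toy usage: 5 random tokens, 3 experts, model width 4 (all values are made up).
random.seed(0)
d_model, n_experts = 4, 3
router_weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]
tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]
outputs = switch_layer(tokens, router_weights, experts)
```

Because each token only runs through one expert, adding more experts grows the parameter count without growing the compute per token, which is how the model can be so large and still train quickly.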
