Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Interesting new paper from Google improving upon T5.


Just to add to the previous post… Google Brain recently unveiled a language model with 1.6 trillion (1.6E+12) parameters that matches or beats the state of the art on several NLP tasks, far surpassing the 175 billion (1.75E+11) parameters of GPT-3. This behemoth was made possible by a new sparsely activated Transformer architecture, the Switch Transformer, which divides the training data and parameters among a multitude of sub-models (a mixture of experts) connected by trainable gating, so only one expert is active for each token. Despite its gigantic size, this text-to-text model is reported to pre-train up to 7 times faster than T5 on C4 (the Colossal Clean Crawled Corpus, 750 GB) using the same amount of computation. The original article: https://bit.ly/2LQzsmJ, the source code: http://bit.ly/390j0ZY
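To make the "trainable gating" idea concrete, here is a minimal NumPy sketch of top-1 ("switch") routing: a small router picks one expert per token, and the expert's output is scaled by the gate probability. All names, sizes, and the random initialization are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 5

# Trainable parameters (randomly initialized for the sketch):
# one router matrix plus one tiny linear "expert" per slot.
router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def switch_layer(x):
    """Route each token to exactly one expert (top-1 gating)."""
    logits = x @ router_w                          # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)     # softmax over experts
    chosen = probs.argmax(axis=-1)                 # one expert per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        # Scaling by the gate probability keeps the router differentiable
        # through the selected path.
        out[i] = probs[i, e] * (x[i] @ experts[e])
    return out, chosen

tokens = rng.standard_normal((n_tokens, d_model))
y, assignment = switch_layer(tokens)
print(assignment)  # which expert processed each token
```

The key point is that compute per token stays constant no matter how many experts (and hence parameters) you add, which is what lets the total parameter count scale into the trillions.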