Sharding Models for Inference

LucBar1 · November 19, 2023, 6:50pm

Is it possible to load only part of an existing transformer model, run it up to a certain intermediate layer, and output just the activation from that last loaded layer?

This is for a proof of concept around decentralized inference, where one node passes its intermediate outputs to the next node, which picks up where the previous one left off.
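To make the idea concrete, here is a minimal sketch of the hand-off I have in mind. It assumes toy layers (plain Python functions) as stand-ins for transformer blocks, and JSON as the wire format between nodes. The names `make_layer`, `run_layers` and `Node` are hypothetical and do not come from any library. In a real model the layers would presumably be the model's block list and the payload would be the hidden-state tensor.

```python
import json
import math


def make_layer(scale, shift):
    """Build a toy layer: an elementwise map over a list of floats.

    Stand-in for a transformer block; only the input/output shape
    contract (hidden state in, hidden state out) matters here.
    """
    def layer(hidden):
        return [math.tanh(scale * h + shift) for h in hidden]
    return layer


# A toy "model": six layers applied in order.
LAYERS = [make_layer(0.5 + 0.1 * i, 0.01 * i) for i in range(6)]


def run_layers(layers, hidden):
    """Apply the given layers sequentially and return the last activation."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


class Node:
    """Hypothetical inference node that holds only its own slice of layers."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, payload):
        # Deserialize the incoming activation, run the local layers,
        # and serialize the result for the next node.
        hidden = json.loads(payload)
        return json.dumps(run_layers(self.layers, hidden))


split = 3                       # node A gets layers 0-2, node B gets 3-5
node_a = Node(LAYERS[:split])
node_b = Node(LAYERS[split:])

inputs = [0.1, -0.2, 0.3]
wire = node_a.forward(json.dumps(inputs))       # intermediate activation
sharded_out = json.loads(node_b.forward(wire))  # node B finishes the pass
full_out = run_layers(LAYERS, inputs)           # single-node reference
```

The sharded result should match the single-node result, since each node only needs the previous node's last activation. What this sketch leaves out is loading just the weights for one slice from a checkpoint, and any per-layer state a real model carries between steps.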
