Sharding Models for Inference

Is it possible to partially load an existing transformer model so that the forward pass stops at a chosen intermediate layer and returns that layer's hidden activations?

This is for a proof of concept of decentralized inference, where each node runs a contiguous slice of the layers and passes its intermediate activations to the next node, which picks up where the previous one left off.
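One way to try this with `transformers` is to load a model, drop the later blocks, and run the embedding plus the first N blocks manually. This is a minimal sketch for GPT-2 specifically (other architectures name their layer list differently, e.g. `model.layers` instead of `model.h`); note it still downloads the full checkpoint before slicing, so it saves runtime memory but not download size:

```python
import torch
from transformers import AutoTokenizer, GPT2Model

tok = AutoTokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

N = 4  # hypothetical split point: keep only the first N transformer blocks
model.h = model.h[:N]  # nn.ModuleList supports slicing; later blocks are freed

ids = tok("Hello world", return_tensors="pt").input_ids
with torch.no_grad():
    pos = torch.arange(ids.size(1)).unsqueeze(0)
    hidden = model.wte(ids) + model.wpe(pos)  # token + position embeddings
    for block in model.h:
        hidden = block(hidden)[0]  # each GPT-2 block returns a tuple

# `hidden` is the intermediate activation to ship to the next node,
# shape (batch, seq_len, hidden_size) — (1, seq_len, 768) for GPT-2 small
print(hidden.shape)
```

The next node would load only the remaining blocks (plus the final layer norm and LM head) and feed `hidden` straight into its first block. A simpler but memory-hungrier alternative is calling the full model with `output_hidden_states=True` and picking `outputs.hidden_states[N]`.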
