hey @burrt there are a few threads on the forum about making transformer models smaller / faster, e.g.
my standard recommendation is to try quantization first, then export to ONNX and serve with ONNX Runtime. before doing that though, i'd try to understand what's actually causing the timeout on your endpoint - it might be unrelated to the model, and you don't want to spend a lot of time optimising the wrong thing
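if it does turn out to be the model, the quantization step is quick to try. here's a minimal sketch of dynamic quantization with PyTorch - i'm using a toy stand-in module here rather than a real transformer, so swap in your actual model:

```python
import torch
import torch.nn as nn

# toy stand-in for a transformer; in practice load your real model
# (e.g. transformers.AutoModel.from_pretrained(...))
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

# dynamic quantization: weights stored as int8, activations quantized
# on the fly at inference time - Linear-heavy models usually shrink a lot
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
with torch.no_grad():
    out = quantized(x)
print(out.shape)  # torch.Size([1, 8])
```

benchmark latency before and after on your actual inputs - the win varies a lot by model and hardware, so don't assume it'll fix the timeout without measuring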