Is it possible, when using TorchServe for inference, to improve the inference speed of T5 specifically (or transformers in general) by doing any of the following:
- `jit.trace`
- `jit.script`
- quantization
And if possible, how?
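For the quantization option, what I have in mind is something like dynamic (post-training) quantization of the `Linear` layers. An untested sketch, using `t5-small` purely as an example:

```python
import torch
from transformers import T5ForConditionalGeneration

# Load a pretrained T5 and put it in eval mode before quantizing.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

# Dynamic quantization of the Linear layers to int8 (CPU inference);
# I'm assuming this is the right entry point for transformer models.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```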
When I simply try to save/export a pretrained model using:
```python
traced_model = torch.jit.trace(model, (dummy_input_ids, dummy_attention_mask, dummy_decoder_input_ids))
torch.jit.save(traced_model, "t5_small_traced.pt")
```
I get an error message.
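For context, here is roughly the full snippet I am running. The dummy inputs are built with the `t5-small` tokenizer, and `torchscript=True` is something I added after reading the Hugging Face TorchScript docs; whether it is actually required here is part of what I'm unsure about:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# torchscript=True is supposed to clone the tied embedding weights so the
# model can be traced (my assumption from the HF TorchScript docs).
model = T5ForConditionalGeneration.from_pretrained("t5-small", torchscript=True)
model.eval()

tokenizer = T5Tokenizer.from_pretrained("t5-small")
enc = tokenizer("translate English to German: Hello, world!", return_tensors="pt")
dummy_input_ids = enc["input_ids"]
dummy_attention_mask = enc["attention_mask"]
# The decoder needs its own start token; for T5 that is the configured
# decoder_start_token_id (the pad token).
dummy_decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

traced_model = torch.jit.trace(
    model, (dummy_input_ids, dummy_attention_mask, dummy_decoder_input_ids)
)
torch.jit.save(traced_model, "t5_small_traced.pt")
```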