How can we optimize the hardware when employing the aforementioned model?

I’m training the aforementioned model for a chatbot using LangChain on a dominolab GPU, but I’d like to know how to run it locally without requiring extremely expensive hardware.



Hugging Face provides the Optimum library for optimizing HF models: 🤗 Optimum. It supports things like ONNX export (ONNX is an efficient format for storing neural networks) and quantization (rather than using 32 bits = 4 bytes to store each parameter, you can use 8 or even 4 bits), which shrinks the memory the model needs.
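To get a feel for what quantization buys you, here is a rough back-of-the-envelope sketch in Python. It only counts the memory needed to hold the weights (not activations or runtime overhead), and it assumes the model has roughly 20 billion parameters, which is my assumption for google/flan-ul2 and not something stated above. The helper name `weight_memory_gib` is just made up for illustration.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8  # 8 bits per byte
    return total_bytes / 1024**3                   # bytes -> GiB


# Assumption: roughly 20 billion parameters (adjust for your model).
NUM_PARAMS = 20_000_000_000

# Compare full precision (32-bit) against common reduced precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(NUM_PARAMS, bits):6.1f} GiB")
```

Under that assumption, the weights go from roughly 75 GiB at 32 bits to roughly 9 GiB at 4 bits, which is the difference between needing several data-center GPUs and possibly fitting on a single consumer machine. Actual usage will be higher once activations and framework overhead are included.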

There are also frameworks like TGI and ggml that make chatbot-like models run as fast as possible, even on your local laptop. Both of them seem to support the T5 architecture, which google/flan-UL2 uses.