Which model for inference on 11 GB GPU?

Hello everybody

I’ve just found the amazing Hugging Face library. It is an awesome piece of work.

I would like to train a chatbot on one or more existing datasets (e.g. the Pile). For training (or fine-tuning) the model I have no GPU memory limitations (a 48 GB GPU is available). For inference, however, I only have a GPU with 11 GB. Inference should be feasible in real time (i.e. below roughly 3 seconds per response), and the model should be adjustable, i.e. the source code should be available so I can change the model’s structure.

Which model best fits these requirements? GPT-J is probably one of the strongest candidates, but I think it needs more than 11 GB of GPU memory for inference.
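For context, here is a rough back-of-the-envelope estimate of the memory needed just to hold a model's weights (a sketch only; activations, the KV cache, and framework overhead add more on top, and the ~6.05B parameter count for GPT-J is an approximation):

```python
def inference_mem_gb(n_params_billion: float, bytes_per_param: int) -> float:
    """Approximate GPU memory (GiB) required to store the model weights alone."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# GPT-J has roughly 6.05 billion parameters.
fp32 = inference_mem_gb(6.05, 4)  # full precision: 4 bytes per parameter
fp16 = inference_mem_gb(6.05, 2)  # half precision: 2 bytes per parameter

print(f"fp32: {fp32:.1f} GiB, fp16: {fp16:.1f} GiB")
```

By this estimate, even half-precision GPT-J needs around 11.3 GiB for the weights alone, so it would not comfortably fit on an 11 GB card once activation memory is included.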

Does anybody have some input? Any input is highly appreciated.