I am trying to fine-tune Falcon-7B on a series of essays I wrote, to see how well it can generalize my writing style to new essay prompts. Doing this requires a fairly lengthy system prompt, plus a ~300-400 word essay completion for each prompt. Using a T4 GPU and QLoRA, I can train a model to a good loss in about 3 hours.

However, when I try to generate a whole essay with the fine-tuned model by setting it up in a pipeline, I can wait 20 minutes and get no output at all.

Is it typical to do inference with models like this in Colab? I can share code if that helps, but my question is more theoretical: is this even feasible? If not, I was also considering running the model locally, since I have an Apple M1 chip and I have seen people get fast inference with Falcon-7B on that hardware.

Also, this is my first time posting here, so apologies for any formatting issues or norms I haven't followed.
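For context, my inference setup looks roughly like the sketch below. This is a simplified placeholder, not my exact code: the adapter path and the generation parameters (`max_new_tokens`, `temperature`) are assumptions I've filled in for illustration.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, pipeline)
from peft import PeftModel

base_model = "tiiuae/falcon-7b"
adapter_path = "my-qlora-adapter"  # placeholder path to my saved LoRA weights

# Load the base model in 4-bit, matching the QLoRA training setup
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model,
                                             quantization_config=bnb_config,
                                             device_map="auto")
# Attach the fine-tuned LoRA adapter on top of the quantized base
model = PeftModel.from_pretrained(model, adapter_path)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "SYSTEM PROMPT...\n\nEssay prompt: ..."  # placeholder
output = generator(prompt,
                   max_new_tokens=512,   # bound the generation length
                   do_sample=True,
                   temperature=0.7,
                   return_full_text=False)
```

(One thing I'm unsure about is whether my generation-length settings are part of the slowness, which is why I've shown explicit values here.)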