I have fine-tuned a Llama model and loaded it into a model object with AutoPeftModelForCausalLM.
I now want to use this model for inference on a large number of queries. Currently I am looping (yes, a plain for loop) through each query, calling mymodel.generate and then tokenizer.decode to get each response.
Is there a smarter way to do this in Hugging Face? Can you generate in parallel or in batches? I have 4 GPUs available to me.
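For context, here is roughly what my loop looks like, plus my guess at what batching would mean. The model path, `max_new_tokens`, and the `chunk` helper are all just placeholders for illustration:

```python
# Sketch of my current per-query setup (paths/names are placeholders):
#   model = AutoPeftModelForCausalLM.from_pretrained("path/to/my-adapter")
#   tokenizer = AutoTokenizer.from_pretrained("path/to/my-adapter")
#   for query in queries:
#       inputs = tokenizer(query, return_tensors="pt").to(model.device)
#       out = model.generate(**inputs, max_new_tokens=256)
#       responses.append(tokenizer.decode(out[0], skip_special_tokens=True))
#
# Is the batched idea just to split the queries into chunks and call
# generate once per chunk of tokenized queries? Something like:

def chunk(queries, batch_size):
    """Split a list of queries into batches of at most batch_size."""
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]

print(chunk(["q1", "q2", "q3", "q4", "q5"], 2))
# -> [['q1', 'q2'], ['q3', 'q4'], ['q5']]
```

If that is the right direction, I'd also like to know how padding should be handled when tokenizing each chunk, and how to spread the chunks across the 4 GPUs.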