Hi,
I am fine-tuning Llama2 for a particular use case. During evaluation, I want to measure the model's performance on my downstream task, which means I need to call model.generate the same way I would during inference. From browsing the documentation, one approach seems to be to create a callback, something like this (pseudo code):
from transformers import TrainerCallback

class MyCallBack(TrainerCallback):
    def on_evaluate(self, args, state, control, model=None, tokenizer=None, **kwargs):
        # the Trainer passes model and tokenizer to the callback via kwargs
        tokens = tokenizer("text", return_tensors="pt").to(model.device)
        generated_ids = model.generate(tokens["input_ids"], attention_mask=tokens["attention_mask"])
        generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
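In case it is relevant, I would attach the callback roughly like this (model, tokenizer, train_dataset, and eval_dataset here are placeholders from my own setup):

from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", evaluation_strategy="steps", eval_steps=500),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    callbacks=[MyCallBack()],
)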
I have a few questions regarding this approach:
- Do I need to do any special treatment of the model before calling generate, i.e. do I need to call model.eval() so that gradients are not computed unnecessarily?
- If I have loaded the model in quantized mode, do I need to take care of anything extra when using model.generate?
- Can I override the model's default generation config here without affecting training (roughly as in the sketch below)?
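To make these questions concrete, this is roughly what I imagine the generation step would look like; generate_for_eval is just a hypothetical helper and the GenerationConfig values are placeholders:

import torch
from transformers import GenerationConfig

def generate_for_eval(model, tokenizer, prompt):
    was_training = model.training
    model.eval()  # question 1: is this switch (and torch.no_grad below) actually needed?
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    gen_config = GenerationConfig(max_new_tokens=64, do_sample=False)  # question 3: per-call config
    with torch.no_grad():
        output_ids = model.generate(**inputs, generation_config=gen_config)
    if was_training:
        model.train()  # restore training mode so the Trainer continues unchanged
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)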
Another option which could be helpful is trainer.predict, which is provided by the Trainer API. However, I am not sure whether this is the same as calling model.generate, i.e. does it generate each next token based on the model's own previously generated tokens, or based on the correct tokens from the input (teacher forcing)?
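For reference, this is roughly how I would call it (eval_dataset is a placeholder); my uncertainty is about what the returned predictions actually contain:

pred_output = trainer.predict(eval_dataset)
# Are pred_output.predictions per-position logits from a single teacher-forced
# forward pass over the labels, or token ids decoded autoregressively the way
# model.generate would produce them?
print(pred_output.metrics)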