Custom evaluation during Llama2 fine-tuning


I am fine-tuning Llama2 for my particular use case. During evaluation, I want to measure the model’s performance on my downstream task, so I need to call model.generate the same way I would at inference time. From browsing the documentation, one approach seems to be to create a callback. This would be something like (pseudo code):

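from transformers import TrainerCallback

class MyCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, model=None, **kwargs):
        # The tokenizer arrives as a keyword argument ("processing_class" in newer releases).
        tokenizer = kwargs.get("tokenizer") or kwargs.get("processing_class")
        tokens = tokenizer("text", return_tensors="pt").to(model.device)
        generated_ids = model.generate(**tokens)  # passes input_ids and attention_mask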

I have a few questions regarding this approach:

  1. Does the model need any special handling before calling generate, e.g. do I need to call model.eval() or otherwise make sure gradients are not computed unnecessarily?
  2. If the model was loaded in quantized mode, do I need to account for that when calling model.generate?
  3. Can I override the model’s default generation config here without affecting training?
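
To make questions 1 and 3 concrete, here is a minimal sketch of what I have in mind. It is not a confirmed recipe. The names eval_mode and generate_texts are hypothetical helpers of my own, and I am assuming two things: that generate already runs without gradient tracking, so only the train/eval flag needs handling, and that keyword arguments passed to generate apply to that call only and leave model.generation_config untouched.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Temporarily switch a torch.nn.Module-like model to eval mode.

    The previous train/eval state is restored on exit, so calling this
    in the middle of training should not leave the model in eval mode.
    """
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        model.train(was_training)


def generate_texts(model, tokenizer, prompts, **gen_kwargs):
    """Generate a continuation for each prompt (hypothetical helper).

    gen_kwargs (e.g. max_new_tokens=64, do_sample=False) are forwarded
    to model.generate for this call only.
    """
    texts = []
    with eval_mode(model):
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            output_ids = model.generate(**inputs, **gen_kwargs)
            # Drop the prompt tokens and keep only the newly generated ones.
            prompt_len = inputs["input_ids"].shape[1]
            new_ids = output_ids[0][prompt_len:]
            texts.append(tokenizer.decode(new_ids, skip_special_tokens=True))
    return texts
```

Inside on_evaluate this would be called as generate_texts(model, tokenizer, ["text"], max_new_tokens=64), and the decoded strings could then be scored against my task-specific references.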

Another option might be trainer.predict, which the Trainer API provides. However, I am not sure whether this is the same as calling model.generate, i.e. does it generate each next token from the model’s own previously generated tokens, or from the correct tokens in the input?
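
To spell out the distinction I am asking about, here is a toy sketch of the two decoding modes. It does not use the Trainer API at all: next_token is a hypothetical stand-in for the model (it simply predicts the last token plus one), and the function names are my own.

```python
def next_token(prefix):
    """Hypothetical stand-in for a language model: last token + 1."""
    return prefix[-1] + 1


def teacher_forced(prompt, reference):
    """Predict each position while conditioning on the reference tokens."""
    preds = []
    for i in range(len(reference)):
        # The context is the prompt plus the *correct* tokens so far.
        preds.append(next_token(prompt + reference[:i]))
    return preds


def free_running(prompt, steps):
    """Predict each position while conditioning on the model's own outputs."""
    seq = list(prompt)
    for _ in range(steps):
        # The context is the prompt plus whatever was generated so far.
        seq.append(next_token(seq))
    return seq[len(prompt):]


prompt = [1, 2]
reference = [10, 20, 30]
print(teacher_forced(prompt, reference))  # [3, 11, 21]
print(free_running(prompt, 3))            # [3, 4, 5]
```

What I need for my downstream metric is the second behaviour, which is what model.generate does at inference time.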


Hi, I am also trying to customize evaluation for a downstream task. Did you figure out what the best approach would be?