Is it possible to evaluate generations/output while fine-tuning an LLM?

Is it possible to run some prompts and generate outputs for these prompts during fine-tuning (with transformers.Trainer), e.g. on each eval step?

I’ve seen some predictions in the W&B report for this blog post, but I am not sure whether those predictions were made by loading the respective checkpoints.

If someone stumbles upon the same question: custom callbacks that run on evaluation are the place to look. A sketch of such a callback is below.
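
A minimal sketch of what such a callback could look like, assuming a causal LM fine-tuned with `transformers.Trainer`. The class name `GenerationCallback`, the example prompts, the `max_new_tokens` value, and logging via `print()` are all placeholders to adapt to your setup; the `on_evaluate` hook and the `model` keyword argument are part of the `TrainerCallback` API.

```python
import torch
from transformers import TrainerCallback


class GenerationCallback(TrainerCallback):
    """Generate from a few fixed prompts every time the Trainer evaluates."""

    def __init__(self, tokenizer, prompts, max_new_tokens=64):
        self.tokenizer = tokenizer
        self.prompts = prompts
        self.max_new_tokens = max_new_tokens

    def on_evaluate(self, args, state, control, model=None, **kwargs):
        # Called after each evaluation; the Trainer passes the model in via kwargs.
        if model is None:
            return
        model.eval()
        for prompt in self.prompts:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                output_ids = model.generate(**inputs, max_new_tokens=self.max_new_tokens)
            text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
            # Replace print() with your preferred logger (e.g. wandb.log) if needed.
            print(f"[step {state.global_step}] {text}")
```

Register it before training, e.g. `trainer.add_callback(GenerationCallback(tokenizer, ["Once upon a time"]))`, and the generations will appear on each eval step.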


Late here, but Braintrust is a great tool/platform for evaluating your LLM. We have a simple library in Python/TypeScript for running and logging evaluations, so you can use our web UI to dig into the results.

It’s free to use @ https://braintrustdata.com/