Is it possible to run some prompts and generate outputs for these prompts during fine-tuning (with transformers.Trainer), e.g. on each eval step?
I’ve seen some predictions in the W&B report for this blog post, but I am not sure whether these predictions were made during training or afterwards by loading the respective checkpoints.
If someone stumbles upon the same question: Here is where to look to implement custom callbacks on evaluation.
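For anyone who wants a starting point, here is a minimal sketch of such a callback. It assumes a decoder-only (causal LM) model and a single-process run; the class name `GenerateOnEvalCallback`, the `history` attribute, and the default `max_new_tokens` value are my own choices for illustration, not something taken from the blog or the report. It relies on `Trainer` passing the model to `on_evaluate` as a keyword argument.

```python
from transformers import TrainerCallback


class GenerateOnEvalCallback(TrainerCallback):
    """Generate completions for a fixed set of prompts after each evaluation pass.

    Hypothetical helper for illustration; assumes a causal LM whose generate()
    output starts with the prompt tokens.
    """

    def __init__(self, tokenizer, prompts, max_new_tokens=64):
        self.tokenizer = tokenizer
        self.prompts = list(prompts)
        self.max_new_tokens = max_new_tokens
        self.history = []  # (global_step, prompt, completion) tuples

    def on_evaluate(self, args, state, control, model=None, **kwargs):
        # Only generate on the main process, and only if the Trainer handed us a model.
        if model is None or not state.is_world_process_zero:
            return
        was_training = model.training
        model.eval()
        try:
            for prompt in self.prompts:
                inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
                output_ids = model.generate(**inputs, max_new_tokens=self.max_new_tokens)
                # Drop the prompt tokens so only the newly generated text is decoded.
                new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
                text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
                self.history.append((state.global_step, prompt, text))
                print(f"[step {state.global_step}] {prompt!r} -> {text!r}")
        finally:
            # Restore whatever mode the model was in before generating.
            model.train(was_training)
```

You would pass it via `Trainer(..., callbacks=[GenerateOnEvalCallback(tokenizer, prompts)])` or `trainer.add_callback(...)`, with evaluation enabled in `TrainingArguments` so that `on_evaluate` actually fires. I have not tried this under DeepSpeed or FSDP, where calling `generate` on the wrapped model may need extra care.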
Late here, but Braintrust is a great tool / platform for evaluating your LLM. We have a simple library in Python/TypeScript for running and logging evaluations, so you can use our web UI to dig into the results.
It’s free to use at https://braintrustdata.com/