Thank you for the useful example!
It's still not working as expected, though. For example, the `generate` method doesn't have a built-in way to output only the newly generated tokens.
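For context, here is the workaround I've been using: since `generate` (in the Hugging Face transformers API, as I understand it) returns the prompt tokens followed by the new tokens, you can slice the prompt off by its length. This is a minimal sketch with dummy token ids standing in for real tokenizer/model output:

```python
# Sketch: trim the prompt tokens from generate()-style output, assuming the
# output sequences echo the prompt ids first (as Hugging Face generate does
# for decoder-only models). The ids below are made-up placeholders.

def new_tokens_only(generated_ids, prompt_len):
    """Return only the tokens produced after the prompt, per sequence."""
    return [seq[prompt_len:] for seq in generated_ids]

prompt_ids = [101, 2023, 2003]               # pretend prompt token ids
generated = [[101, 2023, 2003, 7592, 999]]   # output echoes the prompt first
print(new_tokens_only(generated, len(prompt_ids)))  # → [[7592, 999]]
```

With real tensors the same idea is `output_ids[:, input_ids.shape[1]:]`, but it feels like something the API should handle directly.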
My usage is mostly research-based; are there any recommendations for that case?
Specifically, running LLMs locally on a cluster with multiple GPUs, mostly for inference but also for fine-tuning.
Are there any supporting packages or useful repos for research environments?