Can someone point me to resources on preparing custom datasets for fine-tuning NLP models?
Suppose I want to fine-tune a text-completion model on a particular author’s works so that its outputs are in their style, and I have the texts in PDF format. How do I go from a bunch of PDFs to a dataset the model will accept as input?
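
To make the question concrete, here is roughly the pipeline I imagine, as a minimal Python sketch rather than a known-good recipe. It assumes the third-party pypdf package for text extraction, and that the training code accepts JSON Lines with one `{"text": ...}` record per example. Both are assumptions on my part, and the function names and the 200-word chunk size are made up for illustration.

```python
import json
import re


def extract_pdf_text(path):
    """Return the raw text of one PDF (assumes the third-party pypdf package)."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    reader = PdfReader(path)
    # extract_text() may yield nothing useful for scanned, image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def clean_text(raw):
    """Undo common PDF extraction damage: split words and hard line wraps."""
    # Re-join words hyphenated across a line break. Caveat: this also merges
    # genuinely hyphenated words that happen to break at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # lone newline -> space
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces
    text = re.sub(r"\n{2,}", "\n\n", text)        # normalise paragraph breaks
    return text.strip()


def chunk_paragraphs(text, max_words=200):
    """Pack whole paragraphs into chunks of at most about max_words words."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def write_jsonl(chunks, out_path):
    """Write one {"text": ...} record per line (an assumed target format)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")
```

With this, something like `write_jsonl(chunk_paragraphs(clean_text(extract_pdf_text("book.pdf"))), "train.jsonl")` would turn one file into training examples (`book.pdf` and `train.jsonl` are placeholder names). Is this the right general shape, or is there a more standard way to do it?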