Preparing datasets for NLP tasks

Can someone point me to resources on how to go about preparing custom datasets for fine-tuning NLP models?

Suppose I want to fine-tune a text completion model on a particular author’s works so that its outputs are in their style, and I have access to the text in PDF format. How do I go from a bunch of PDFs to a dataset that is acceptable as input to the model?

I think you might need to convert the PDFs to a machine-readable format like JSON or CSV. Then you can clean and process the text with pandas or the datasets library provided by Hugging Face …
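To make that a bit more concrete, here is a rough sketch of one way the PDF-to-dataset step could look. It is not the only approach, and several things are assumptions on my part: that the third-party pypdf package is installed for text extraction, that one JSON object per line with a single "text" field is a reasonable format for your model, and that fixed-size word chunks are good enough as training examples. The helper names (extract_pdf_text, clean_text, chunk_words, write_jsonl) and the 200-word default are made up for illustration.

```python
import json
import re
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the text of one PDF (assumes the third-party pypdf package)."""
    from pypdf import PdfReader  # imported lazily so the other helpers work without it

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def clean_text(raw):
    """Undo common PDF extraction damage: line-break hyphens and ragged whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # rejoin words split across lines
    text = re.sub(r"\s+", " ", text)             # collapse newlines and repeated spaces
    return text.strip()


def chunk_words(text, max_words=200):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def write_jsonl(chunks, out_path):
    """Write one {"text": ...} record per line (JSON Lines)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")


def build_dataset(pdf_dir, out_path, max_words=200):
    """Convert every PDF in pdf_dir into a single JSONL file of text chunks."""
    chunks = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        chunks.extend(chunk_words(clean_text(extract_pdf_text(pdf_path)), max_words))
    write_jsonl(chunks, out_path)
    return len(chunks)
```

You would call something like build_dataset("pdfs", "author.jsonl") and then, if you are using the Hugging Face datasets library, load the result with load_dataset("json", data_files="author.jsonl"). Scanned PDFs without a text layer would need OCR first, which this sketch does not cover.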