Preparing datasets for NLP tasks

hgarg · July 28, 2021, 3:45am

Can someone point me to resources on how to go about preparing custom datasets for fine-tuning NLP models?

Suppose I want to fine-tune a text completion model on a particular author’s works so that the outputs are in their style, and I have access to the text in PDF format. How do I go from a bunch of PDFs to a dataset that the model will accept as input?

ahmedlone123 · July 28, 2021, 6:31am

I think you might need to convert the PDFs to a machine-readable format like JSON or CSV. Then you can clean and process the text with pandas or the datasets library provided by Hugging Face …
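
For what it's worth, here is a rough sketch of the "clean and convert" step, assuming the text has already been pulled out of the PDFs as plain strings (for example with a PDF library such as pypdf; that choice is an assumption, not something from this thread). The helper names `clean_text`, `chunk_words`, and `write_jsonl`, the 200-word chunk size, and the `"text"` field name are all made up for illustration. It only uses the standard library and writes one JSON object per line.

```python
import json
import re


def clean_text(raw: str) -> str:
    """Undo a few common PDF-extraction artifacts in plain text."""
    # Rejoin words that were hyphenated across a line break ("jum-\nped").
    text = re.sub(r"-\n(?=[a-z])", "", raw)
    # Turn single newlines (hard wraps) into spaces, keep blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces/tabs and runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text.strip()


def chunk_words(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def write_jsonl(chunks: list[str], path: str) -> None:
    """Write one {"text": ...} record per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Stand-in for text extracted from one PDF page.
    raw = "The quick brown fox jum-\nped over\nthe lazy dog.\n\nSecond paragraph."
    write_jsonl(chunk_words(clean_text(raw), max_words=4), "author.jsonl")
```

A file like `author.jsonl` should then be loadable with the datasets library via `load_dataset("json", data_files="author.jsonl")`. Real PDFs usually need more cleanup than this (page headers, footnotes, page numbers), so treat the regexes as a starting point.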
