If I have multiple texts in a folder and a CSV file with token classification labels, how would I merge them so that when I index the dataset, the text and its labels are at the same index (like how the imdb dataset in the examples has sentiment and text at the same index)? My understanding is that you can only pass one file type to load_dataset, and I can't figure out how to use map when the number of labels varies (it depends on the number of tokens).
What I would do is:
Read in your files
Align your labels to your tokenized text. Calling tokenizer(…, return_offsets_mapping=True) returns the character offsets of each token, which helps you align labels to tokens.
Then create a dataset object manually.
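Here is a rough sketch of those three steps, in case it helps. It is not the only way to do it, and it rests on assumptions that are not in the question: the CSV is assumed to have the columns filename, start, end and label (character spans), the texts are assumed to be .txt files, and all function names are made up for illustration. A whitespace splitter stands in for the real tokenizer so the example runs without downloading a model.

```python
import csv
import io
import re
from pathlib import Path


def read_texts(folder):
    """Step 1a: read every .txt file in a folder (file layout is an assumption)."""
    return {p.name: p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))}


def read_spans(csv_file):
    """Step 1b: group (start, end, label) character spans by file name.

    The column names filename/start/end/label are hypothetical.
    """
    spans = {}
    for row in csv.DictReader(csv_file):
        spans.setdefault(row["filename"], []).append(
            (int(row["start"]), int(row["end"]), row["label"])
        )
    return spans


def align_labels(offsets, spans, outside="O"):
    """Step 2: give each token the label of the first span overlapping it."""
    labels = []
    for tok_start, tok_end in offsets:
        label = outside
        if tok_end > tok_start:  # empty offsets (e.g. special tokens) stay outside
            for start, end, name in spans:
                if tok_start < end and tok_end > start:
                    label = name
                    break
        labels.append(label)
    return labels


def build_records(texts, spans_by_file, get_offsets):
    """Step 3: build columns where index i holds one text and its label list."""
    records = {"text": [], "labels": []}
    for name, text in texts.items():
        records["text"].append(text)
        records["labels"].append(
            align_labels(get_offsets(text), spans_by_file.get(name, []))
        )
    return records


def whitespace_offsets(text):
    """Stand-in for a real tokenizer: character offsets of whitespace-split words."""
    return [m.span() for m in re.finditer(r"\S+", text)]


# Small in-memory demo instead of real files.
texts = {"a.txt": "Alice lives in Paris", "b.txt": "Hello world"}
label_csv = io.StringIO(
    "filename,start,end,label\n"
    "a.txt,0,5,PER\n"
    "a.txt,15,20,LOC\n"
)
records = build_records(texts, read_spans(label_csv), whitespace_offsets)
```

With a fast Hugging Face tokenizer, the get_offsets argument could instead be something like lambda t: tokenizer(t, return_offsets_mapping=True)["offset_mapping"], and the resulting dict of columns can be handed to datasets.Dataset.from_dict(records). Because each row stores its own list of labels, the label lists are free to have different lengths.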