Adding labels from different files

zacharia-husain · June 3, 2025, 4:34pm

If I have multiple text files in a folder and a CSV file with token classification labels, how would I merge them so that when I index the dataset, the text and its labels are at the same index (like how the imdb dataset in the examples has sentiment and text at the same index)? My understanding is that you can only pass one file type to load_dataset, and I can't figure out how to use map when the size of the labels varies (it depends on the number of tokens).

Mdrnfox · June 3, 2025, 4:48pm

What I would do is:

Read in your files.
Align your labels to your tokenized text. Calling tokenizer(…, return_offsets_mapping=True) gives you the character offsets of each token, which helps you align labels to tokens.
Then create a dataset object manually.
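
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive recipe: the CSV layout (a `filename,labels` header with one space-separated tag per whitespace-separated word), the `label2id` mapping, and the helper names `read_labels`, `align_labels`, `build_records` and `to_dataset` are all hypothetical. Adapt them to however your labels are actually stored.

```python
import csv
import re
from pathlib import Path


def read_labels(csv_path):
    """Map each filename to its list of word-level tags.

    Assumed (hypothetical) CSV layout: header `filename,labels`, where
    `labels` holds one space-separated tag per whitespace-separated word.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["filename"]: row["labels"].split() for row in csv.DictReader(f)}


def align_labels(text, word_labels, offsets, label2id, ignore_index=-100):
    """Give each token the label of the word that contains its start offset."""
    # Character span of every whitespace-separated word in the text.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    if len(spans) != len(word_labels):
        raise ValueError("number of labels does not match number of words")
    aligned = []
    for start, end in offsets:
        label = ignore_index  # stays -100 for special tokens (empty span)
        if start != end:
            for (w_start, w_end), tag in zip(spans, word_labels):
                if w_start <= start < w_end:
                    label = label2id[tag]
                    break
        aligned.append(label)
    return aligned


def build_records(folder, csv_path, tokenizer, label2id):
    """Return one dict per text file, with text and labels side by side."""
    labels_by_file = read_labels(csv_path)
    records = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        enc = tokenizer(text, return_offsets_mapping=True)
        records.append({
            "text": text,
            "input_ids": list(enc["input_ids"]),
            "labels": align_labels(
                text, labels_by_file[path.name], enc["offset_mapping"], label2id
            ),
        })
    return records


def to_dataset(records):
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import Dataset
    return Dataset.from_list(records)
```

With a fast tokenizer (for example one loaded via `AutoTokenizer.from_pretrained`, which is needed for `return_offsets_mapping=True`), something like `to_dataset(build_records("texts/", "labels.csv", tokenizer, label2id))` should give a dataset where `ds[i]["text"]` and `ds[i]["labels"]` belong to the same file. The folder and file names here are placeholders. Because each record carries its own label list, the lists can have different lengths per example, and no `map` call is needed to merge the two sources.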

