I have a large text file that I can load with `load_dataset` from the `datasets` library. I want to train a token classification model (comma prediction), but the text file has no labels yet, so I am trying to write a function that creates the labels from the text.
I need to do the following:
- Split the text into words (splitting on whitespace is fine)
- If the word is a punctuation mark, assign label 2
- If the next word is a punctuation mark, assign label 1
- If a random number is above some threshold, delete the punctuation word and its label 2 (but keep the label 1 on the preceding word)
- Otherwise assign label 0
I was able to do step 2 with a `map` function, but it processes the text word by word. What is the most efficient way to add steps 3 and 4, and then add the word and label features to the dataset?
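For reference, here is a minimal sketch of how all the steps could be done in one pass per example, so it can be applied with a single batched `map` call. The punctuation set, the `DROP_THRESHOLD` value, the column name `"text"`, and the function name `make_labels` are all assumptions for illustration, not part of my actual setup.

```python
import random

PUNCT = {",", ".", "!", "?", ";", ":"}
DROP_THRESHOLD = 0.5  # assumed value for the deletion threshold

def make_labels(batch):
    """Turn a batch of raw texts into word lists and per-word labels."""
    all_words, all_labels = [], []
    for text in batch["text"]:
        tokens = text.split()  # step 1: split on whitespace
        words, labels = [], []
        for i, tok in enumerate(tokens):
            if tok in PUNCT:
                # step 4: randomly drop the punctuation token (and its
                # label 2); the preceding word keeps its label 1 because
                # labels are decided from the original token sequence
                if random.random() > DROP_THRESHOLD:
                    continue
                labels.append(2)  # step 2: the word is punctuation
            elif i + 1 < len(tokens) and tokens[i + 1] in PUNCT:
                labels.append(1)  # step 3: next word is punctuation
            else:
                labels.append(0)  # step 5: everything else
            words.append(tok)
        all_words.append(words)
        all_labels.append(labels)
    return {"words": all_words, "labels": all_labels}
```

With `datasets`, this would then be applied along the lines of `dataset.map(make_labels, batched=True, remove_columns=["text"])`, which adds the `words` and `labels` columns in one shot instead of word-by-word calls.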