I have a large text file that I can load with `load_dataset` from the `datasets` library. I want to train a token classification model (comma prediction), but the text file has no labels yet, so I am trying to write a function that creates the labels from the text.
I need to do the following:
- Split the text into words (splitting on whitespace is fine)
- If the word is a punctuation mark, assign label 2
- If the next word is a punctuation mark, assign label 1
- If a random number is above some threshold, delete the punctuation word and its label 2 (but keep the label 1 on the preceding word)
- Otherwise assign label 0
I was able to do step 2 with a `map` function, but it processes the text word by word. What is the most efficient way to add steps 3 and 4, and then add the word and label features to the dataset?
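For reference, here is a minimal sketch of how all the steps could be done in one pass per example, so it can be applied with a single batched `map` call. The punctuation set, the `DROP_THRESHOLD` value, the column name `"text"`, and the function name `make_labels` are all assumptions for illustration, not part of my actual setup.

```python
import random

PUNCT = {",", ".", "!", "?", ";", ":"}
DROP_THRESHOLD = 0.5  # assumed value for the deletion threshold

def make_labels(batch):
    """Turn a batch of raw texts into word lists and per-word labels."""
    all_words, all_labels = [], []
    for text in batch["text"]:
        tokens = text.split()  # step 1: split on whitespace
        words, labels = [], []
        for i, tok in enumerate(tokens):
            if tok in PUNCT:
                # step 4: randomly drop the punctuation token (and its
                # label 2); the preceding word keeps its label 1 because
                # labels are decided from the original token sequence
                if random.random() > DROP_THRESHOLD:
                    continue
                labels.append(2)  # step 2: the word is punctuation
            elif i + 1 < len(tokens) and tokens[i + 1] in PUNCT:
                labels.append(1)  # step 3: next word is punctuation
            else:
                labels.append(0)  # step 5: everything else
            words.append(tok)
        all_words.append(words)
        all_labels.append(labels)
    return {"words": all_words, "labels": all_labels}
```

With `datasets`, this would then be applied along the lines of `dataset.map(make_labels, batched=True, remove_columns=["text"])`, which adds the `words` and `labels` columns in one shot instead of word-by-word calls.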