Create a custom dataset with labels for token classification from a large text file

I have a large text file that I can load with `load_dataset`. I want to train a token classification model (comma prediction), but the text file doesn't contain labels yet, so I am writing a function to create the labels from the text.

I need to do the following:

  1. Split the text into words (splitting on whitespace is easy)
  2. If a word is a punctuation mark, assign label 2
  3. If the next word is a punctuation mark, assign label 1
  4. If a random number is above some threshold, delete the punctuation word and its label 2 (but keep the label 1 on the preceding word)
  5. Otherwise, assign label 0
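
The steps above can be sketched as a single function that labels one line of text. This is only an illustration, not tested against your data: it assumes punctuation appears as standalone whitespace-separated tokens, restricts punctuation to commas, and uses a made-up `drop_threshold` parameter for step 4.

```python
import random

PUNCT = {","}  # assumption: commas are the only punctuation to predict

def make_labels(text, drop_threshold=0.5):
    """Label one line of text for comma prediction.
    Labels: 2 = this word is punctuation, 1 = the next word is
    punctuation, 0 = everything else.  Punctuation words are randomly
    deleted (step 4) while the label 1 on the preceding word is kept."""
    raw = text.split()  # step 1: whitespace split
    words, labels = [], []
    for i, w in enumerate(raw):
        if w in PUNCT:
            if random.random() > drop_threshold:
                continue  # step 4: drop the punctuation word and its label 2
            words.append(w)
            labels.append(2)  # step 2: this word is punctuation
        elif i + 1 < len(raw) and raw[i + 1] in PUNCT:
            words.append(w)
            labels.append(1)  # step 3: next word is punctuation
        else:
            words.append(w)
            labels.append(0)  # step 5: plain word
    return words, labels
```

Note that if your file attaches commas to the preceding word (`"word,"` rather than `"word ,"`), the split in step 1 would need to separate them first.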

I was able to do step 2 using a map function, but it looks at the text word by word. I am not sure of the most efficient way to add steps 3 and 4 and then add the word and label features to the dataset.
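
One way to avoid word-by-word processing is to run the whole labeling per line inside a batched `map`, returning two new columns at once. A self-contained sketch under the same assumptions as the steps above (standalone comma tokens, a single `text` column as produced by the `text` loading script, an illustrative threshold):

```python
import random

def add_labels(batch, drop_threshold=0.5):
    """Batched map function: turns each line in batch["text"] into
    parallel `words` / `labels` lists (0 = plain word, 1 = next word
    is a comma, 2 = this word is a comma), randomly deleting commas
    while keeping the label 1 on the preceding word."""
    all_words, all_labels = [], []
    for text in batch["text"]:
        raw = text.split()
        words, labels = [], []
        for i, w in enumerate(raw):
            if w == ",":
                if random.random() > drop_threshold:
                    continue  # delete the comma and its label 2
                words.append(w)
                labels.append(2)
            elif i + 1 < len(raw) and raw[i + 1] == ",":
                words.append(w)
                labels.append(1)
            else:
                words.append(w)
                labels.append(0)
        all_words.append(words)
        all_labels.append(labels)
    return {"words": all_words, "labels": all_labels}

# With the file loaded via the "text" builder, the new columns are
# added (and the raw text column dropped) in one pass:
#     dataset = load_dataset("text", data_files="big_file.txt")
#     dataset = dataset.map(add_labels, batched=True,
#                           remove_columns=["text"])
```

Because the function returns lists of lists, `map(..., batched=True)` adds `words` and `labels` as sequence features alongside (or in place of) the original column.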