Question on splitting input sequence

Hey everyone,

I’ve trained RoBERTa for a token classification task, and unfortunately some of my input texts are just over 512 tokens long. It shouldn’t be a big problem for me to split the input in half and assess each half separately. I don’t want to just split the raw text though, in case I accidentally split a word in half, which could damage a critical entity.

My current plan is to tokenize the input text, use np.array_split() to split the token IDs into chunks of at most 510 tokens (leaving room for the sequence start and end tokens), run the model on each chunk, and then stitch the chunk outputs together into one output that lines up with the original input.
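One thing to watch with this plan: np.array_split() takes the number of sections rather than a chunk length, so the section count has to be derived from the sequence length. Here is a minimal sketch of that idea, assuming the token IDs are a flat list without special tokens. The split_ids helper and the placeholder IDs are hypothetical, not part of any library.

```python
import math
import numpy as np

def split_ids(input_ids, max_len=510):
    """Split token IDs into nearly equal chunks of at most max_len tokens."""
    n_sections = max(1, math.ceil(len(input_ids) / max_len))
    # np.array_split takes the number of sections, not the chunk size,
    # and tolerates lengths that do not divide evenly.
    parts = np.array_split(np.asarray(input_ids), n_sections)
    return [part.tolist() for part in parts]

ids = list(range(1200))  # placeholder IDs standing in for real tokenizer output
chunks = split_ids(ids)
print([len(c) for c in chunks])  # [400, 400, 400]
```

Splitting this way never cuts inside a token, but the chunks do not overlap, so a token near a boundary sees context from one side only.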

The reason I am doing this manually is that I’m doing an ONNX export of the model, so I won’t be using Hugging Face’s built-in pipelines. I guess my question is: is there a better way of doing token classification for sequences > 512 tokens? I read somewhere that there’s a chunk pipeline in Hugging Face that analyzes overlapping chunks and then averages the predictions on the overlapping segments. Is there a way for me to access the components of this chunking pipeline without using the full pipeline for inference?

I hope what I’m asking makes sense!

Did you find an answer to this? I am also looking for the same thing.

Unfortunately not. I think I’m just going to implement my own splitting/chunking logic for now and if I can figure out how the pre-process pipeline for chunking works, I’ll try to use those functions from the transformers library. I’ll reply here if I figure that out!

So I built out my own chunking pipeline and this was my logic. I can post my code here later if you’re stuck but it’s relatively straightforward. Keep in mind that I am using ONNX so I’m working with the output logits of my token classifier directly.

  1. Tokenize your input sequence using the tokenizer for your model
  2. Check if the length of the tokenized inputs is > 512. If not, you can just operate normally.
  3. For simplicity, I extract the ["input_ids"] and ["attention_mask"] entries from the tokenizer’s output dictionary into two separate lists.
  4. I use more_itertools.windowed() to iterate over each list and extract chunks of length 510 with 32 tokens of overlap between consecutive chunks. I collect the resulting chunks in two new lists, one for the input IDs and one for the attention masks.
  5. I check each chunk for the presence of the sequence start/end tokens at the beginning and at the end, and add whichever is missing. I chose chunks of length 510 so that for all the middle chunks, I can add both extra tokens and still be within the 512 limit.
  6. I take each input_id and attention_mask chunk and create a new dictionary, and append it to a new list.
    6b (optional) Instead of appending each to a list, you may choose to stack them together in a new numpy array so you can put them through the model as a batch instead of doing it one at a time.
  7. Iterate through each of the chunks (or the batch if you went with 6b) and get the output logits for each chunk. Strip the first and/or last row of scores from each chunk’s output, depending on whether you added a begin/end sequence token to that chunk.
  8. Add the output logits of the first chunk directly into a new list.
  9. For each subsequent chunk, take the last X (where X is the amount of overlap) sets of scores from the master list of scores, and the first X sets of scores from the output of this chunk, average them, and replace them in the master list. Then take the remaining chunk scores and add them to the master list. Repeat until finished going through all of the chunks.
  10. Argmax the score to get the label id of the best scoring label for each token.
  11. Convert token labels to entities and stitch together IOB labels as necessary to finish your data extraction.
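Steps 4 to 6 could be sketched roughly as follows. This is a simplified illustration rather than the exact code described above: it assumes the text was tokenized without special tokens, so every chunk gets both a start and an end token, and it uses plain slicing in place of more_itertools.windowed() so the last window is simply shorter. The make_chunks and add_special_tokens helpers are hypothetical, and the IDs 0 and 2 are the roberta-base values for the start and end tokens, so check them against your own tokenizer.

```python
BOS_ID, EOS_ID = 0, 2  # assumed <s> and </s> IDs (roberta-base values); verify for your model

def make_chunks(input_ids, chunk_len=510, overlap=32):
    """Split token IDs (without special tokens) into overlapping windows."""
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(input_ids), step):
        chunks.append(input_ids[start:start + chunk_len])
        if start + chunk_len >= len(input_ids):
            break  # this window already reaches the end of the sequence
    return chunks

def add_special_tokens(chunk):
    """Wrap one chunk in start/end tokens and build its attention mask."""
    ids = [BOS_ID] + chunk + [EOS_ID]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

ids = list(range(100, 1100))  # 1000 placeholder token IDs
chunks = make_chunks(ids)
batch = [add_special_tokens(c) for c in chunks]
print([len(c) for c in chunks])  # [510, 510, 44]
```

The last chunk is usually shorter than the others, so it would need padding before the chunks can be stacked into one batch as in step 6b.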

I hope that train of logic makes sense and helps you out!
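For anyone who wants something concrete, steps 8 to 10 could look roughly like this. It is a sketch under a few assumptions: each chunk’s logits are a NumPy array of shape (tokens, labels), the rows for the start/end tokens have already been stripped, and consecutive chunks share exactly `overlap` tokens. The stitch_logits helper is hypothetical, and the random array only stands in for real model output.

```python
import numpy as np

def stitch_logits(chunk_logits, overlap=32):
    """Merge per-chunk logits of shape (tokens, labels), averaging overlapping rows."""
    merged = chunk_logits[0].copy()
    for logits in chunk_logits[1:]:
        n = min(overlap, len(logits), len(merged))
        if n:
            # Average the tail of the master scores with the head of this chunk.
            merged[-n:] = (merged[-n:] + logits[:n]) / 2.0
        # Append the rows of this chunk that were not part of the overlap.
        merged = np.concatenate([merged, logits[n:]], axis=0)
    return merged

# Fake per-chunk outputs: overlapping slices of one array of 1000 tokens, 5 labels.
rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 5)).astype(np.float32)
chunk_logits = [full[0:510], full[478:988], full[956:1000]]

stitched = stitch_logits(chunk_logits)
label_ids = stitched.argmax(axis=-1)  # step 10: best-scoring label per token
print(stitched.shape)  # (1000, 5)
```

Averaging raw logits is one option. Averaging softmax probabilities instead would also be reasonable, and the two can pick different labels for tokens in the overlap.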
