So I built out my own chunking pipeline and this was my logic. I can post my code here later if you’re stuck but it’s relatively straightforward. Keep in mind that I am using ONNX so I’m working with the output logits of my token classifier directly.
- Tokenize your input sequence using the tokenizer for your model
- Check if the length of the tokenized input is > 512. If it isn't, you can just run the model normally.
- For simplicity, I extract the ["input_ids"] and ["attention_mask"] from the tokenizer's output dictionary into two separate lists.
- I use more_itertools.windowed() to iterate over each list and extract chunks of length 510 with 32 tokens of overlap (i.e. a step of 478). I put these lists of chunks into two new lists.
- I check each chunk for the presence of the sequence start/end tokens at the beginning and the end. I chose chunks of length 510 so that, for all the middle chunks, I can add the extra tokens and still stay within the 512 limit.
- I take each input_ids/attention_mask chunk pair, create a new dictionary from them, and append it to a new list.
- (Optional) Instead of appending each to a list, you may choose to stack them together in a new numpy array so you can put them through the model as a batch instead of doing it one at a time.
- Iterate through each of the chunks (or the batch, if you went that route) and get the output logits for each chunk. Strip the first and/or last row of scores for each chunk, depending on whether you added a begin/end sequence token there.
- Add the output logits of the first chunk directly into a new master list of scores.
- For each subsequent chunk, take the last X sets of scores from the master list (where X is the amount of overlap), average them with the first X sets of scores from that chunk's output, and write the averages back into the master list. Then append the chunk's remaining scores to the master list. Repeat until you've gone through all of the chunks.
- Argmax the scores to get the label id of the best-scoring label for each token.
- Convert token labels to entities and stitch together IOB labels as necessary to finish your data extraction.
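The chunking steps above can be sketched roughly like this. It's a minimal sketch, not my exact code: it assumes a BERT-style tokenizer where [CLS] = 101, [SEP] = 102, and the pad id is 0, and `chunk_tokens` is just a name I made up for the helper. Adjust the special token ids to your model.

```python
import numpy as np
from more_itertools import windowed

# Assumed values for illustration -- BERT-style special tokens.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
CHUNK_LEN, OVERLAP = 510, 32
STEP = CHUNK_LEN - OVERLAP  # each window advances 478 tokens

def chunk_tokens(input_ids, attention_mask):
    """Split an over-length sequence into overlapping 510-token chunks,
    re-adding [CLS]/[SEP] so each chunk is exactly 512 tokens."""
    # Strip the original [CLS]/[SEP]; they get re-added per chunk.
    ids, mask = input_ids[1:-1], attention_mask[1:-1]
    chunks = []
    for id_win, mask_win in zip(
        windowed(ids, CHUNK_LEN, step=STEP, fillvalue=PAD_ID),
        windowed(mask, CHUNK_LEN, step=STEP, fillvalue=0),
    ):
        # Note: in the last chunk the [SEP] lands after any fill padding;
        # fine for a sketch, but you may want to place it before the pads.
        chunks.append({
            "input_ids": np.array([[CLS_ID, *id_win, SEP_ID]]),
            "attention_mask": np.array([[1, *mask_win, 1]]),
        })
    return chunks
```

Each dictionary in the returned list can then be fed to the ONNX session one at a time, or stacked into a single batch.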
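The overlap-averaging steps might look something like this. Again a sketch rather than my literal code: `merge_chunk_logits` is a hypothetical name, and it assumes each chunk's logits already have the [CLS]/[SEP] rows stripped.

```python
import numpy as np

def merge_chunk_logits(chunk_logits, overlap=32):
    """Stitch per-chunk logits (each an array of shape (tokens, num_labels),
    special-token rows already stripped) into one master list, averaging
    the scores in the overlapping regions."""
    merged = list(chunk_logits[0])  # master list of per-token score rows
    for logits in chunk_logits[1:]:
        # Average the trailing `overlap` rows with this chunk's leading rows.
        for i in range(overlap):
            merged[-overlap + i] = (merged[-overlap + i] + logits[i]) / 2
        # Append the rest of this chunk's scores.
        merged.extend(logits[overlap:])
    return np.array(merged)

# Usage: label_ids = merge_chunk_logits(per_chunk_logits).argmax(axis=-1)
```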
I hope that train of logic makes sense and helps you out!