So I built out my own chunking pipeline and this was my logic. I can post my code here later if you’re stuck but it’s relatively straightforward. Keep in mind that I am using ONNX so I’m working with the output logits of my token classifier directly.
- Tokenize your input sequence using the tokenizer for your model
- Check if the length of the tokenized input is > 512. If it isn't, you can just run the model normally.
- For simplicity, I extract the ["input_ids"] and ["attention_mask"] from the tokenizer's output dictionary into two separate lists.
- I use more_itertools.windowed() to iterate over each list and extract chunks of length 510 with 32 tokens of overlap (i.e. a step of 478). I put these lists of chunks into two new lists.
- I check each chunk for the presence of the sequence start/end tokens at the beginning and the end. I chose chunks of length 510 so that, for all the middle chunks, I can add the extra tokens and still stay within the 512 limit.
- I take each input_ids/attention_mask chunk pair, create a new dictionary from them, and append it to a new list.
- (Optional) Instead of appending each to a list, you may choose to stack them together in a new numpy array so you can put them through the model as a batch instead of doing it one at a time.
- Iterate through each of the chunks (or the batch, if you went that route) and get the output logits for each chunk. Strip the first and/or last row of scores for each chunk, depending on whether you added a begin/end sequence token there.
- Add the output logits of the first chunk directly into a new master list of scores.
- For each subsequent chunk, take the last X sets of scores from the master list (where X is the amount of overlap), average them with the first X sets of scores from that chunk's output, and write the averages back into the master list. Then append the chunk's remaining scores to the master list. Repeat until you've gone through all of the chunks.
- Argmax the scores to get the label id of the best-scoring label for each token.
- Convert token labels to entities and stitch together IOB labels as necessary to finish your data extraction.
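The chunking steps above can be sketched roughly like this. It's a minimal sketch, not my exact code: it assumes a BERT-style tokenizer where [CLS] = 101, [SEP] = 102, and the pad id is 0, and `chunk_tokens` is just a name I made up for the helper. Adjust the special token ids to your model.

```python
import numpy as np
from more_itertools import windowed

# Assumed values for illustration -- BERT-style special tokens.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
CHUNK_LEN, OVERLAP = 510, 32
STEP = CHUNK_LEN - OVERLAP  # each window advances 478 tokens

def chunk_tokens(input_ids, attention_mask):
    """Split an over-length sequence into overlapping 510-token chunks,
    re-adding [CLS]/[SEP] so each chunk is exactly 512 tokens."""
    # Strip the original [CLS]/[SEP]; they get re-added per chunk.
    ids, mask = input_ids[1:-1], attention_mask[1:-1]
    chunks = []
    for id_win, mask_win in zip(
        windowed(ids, CHUNK_LEN, step=STEP, fillvalue=PAD_ID),
        windowed(mask, CHUNK_LEN, step=STEP, fillvalue=0),
    ):
        # Note: in the last chunk the [SEP] lands after any fill padding;
        # fine for a sketch, but you may want to place it before the pads.
        chunks.append({
            "input_ids": np.array([[CLS_ID, *id_win, SEP_ID]]),
            "attention_mask": np.array([[1, *mask_win, 1]]),
        })
    return chunks
```

Each dictionary in the returned list can then be fed to the ONNX session one at a time, or stacked into a single batch.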
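The overlap-averaging steps might look something like this. Again a sketch rather than my literal code: `merge_chunk_logits` is a hypothetical name, and it assumes each chunk's logits already have the [CLS]/[SEP] rows stripped.

```python
import numpy as np

def merge_chunk_logits(chunk_logits, overlap=32):
    """Stitch per-chunk logits (each an array of shape (tokens, num_labels),
    special-token rows already stripped) into one master list, averaging
    the scores in the overlapping regions."""
    merged = list(chunk_logits[0])  # master list of per-token score rows
    for logits in chunk_logits[1:]:
        # Average the trailing `overlap` rows with this chunk's leading rows.
        for i in range(overlap):
            merged[-overlap + i] = (merged[-overlap + i] + logits[i]) / 2
        # Append the rest of this chunk's scores.
        merged.extend(logits[overlap:])
    return np.array(merged)

# Usage: label_ids = merge_chunk_logits(per_chunk_logits).argmax(axis=-1)
```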
I hope that train of logic makes sense and helps you out!