I have fine-tuned a BERT model for named-entity recognition. My inputs usually involve long sentences (more than 512 tokens is very common).
At the moment, I’m working around this by not calling the TokenClassificationPipeline. Instead, I use the max_length and stride parameters of the tokenizer to split the sentences into overlapping chunks, and then aggregate the resulting token labels in post-processing. For this, I have to call the PyTorch model manually for each sub-sentence.
Is there a native (or at least more efficient) way to do this? I saw this was implemented in the QuestionAnsweringPipeline.
The reason I’m asking is that I’m trying to implement this as a serverless endpoint in Amazon SageMaker, and for that to run smoothly the inference script should be as optimized as possible.