Pretrained model with stride doesn't predict long text

My objective is to annotate long documents with bioformer-8L. I was told to use stride and truncation so I don't have to split my documents into chunks of 512 tokens myself.

In the training phase, I called the tokenizer like this:

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, stride=128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)

For the prediction I do:

model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, stride=128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first")

But it does not work: the model stops producing annotations partway through the text. To test this, I duplicated the same sentence several times, a sentence that I know contains the annotations I am looking for.

Help please.

The solution is to move stride from from_pretrained to the pipeline call. A stride passed to from_pretrained is not applied when the pipeline tokenizes the input, whereas the token-classification pipeline accepts stride directly and uses it to process long inputs in overlapping chunks.

tokenizer = AutoTokenizer.from_pretrained(model_path, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)

ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first", stride=128)
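To see what stride buys you, here is a minimal, self-contained sketch of the overlapping-window idea, independent of transformers (the function name and defaults are illustrative, not the pipeline's internals): each window holds up to model_max_length tokens, and consecutive windows overlap by stride tokens so an entity cut off at one window boundary is seen whole in the next window.

```python
def make_windows(tokens, max_length=512, stride=128):
    """Split a token list into overlapping windows.

    Each window holds up to `max_length` tokens; consecutive windows
    overlap by `stride` tokens, so content near a window boundary is
    fully covered by at least one window.
    """
    if len(tokens) <= max_length:
        return [tokens]
    step = max_length - stride  # advance by this many tokens per window
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the final window already reaches the end
    return windows


# A 1000-token document with max_length=512 and stride=128 yields
# three windows covering tokens 0-511, 384-895, and 768-999.
tokens = list(range(1000))
windows = make_windows(tokens)
print([(w[0], w[-1]) for w in windows])  # [(0, 511), (384, 895), (768, 999)]
```

Predictions from the overlapping regions then have to be merged back into one sequence, which is exactly the bookkeeping the pipeline does for you when stride is passed to it rather than to the tokenizer constructor.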