My goal is to annotate long documents with bioformer-8L. I have been told to use stride and truncation so that I don't have to split my documents into chunks of 512 tokens myself.
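For context, here is my understanding of what stride and truncation should do, sketched in plain Python. The helper name `sliding_windows` is my own invention, and the sketch ignores special tokens, so it only illustrates the idea of overlapping windows; it is not the tokenizer's actual implementation.

```python
def sliding_windows(token_ids, max_length=512, stride=128):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows overlap by `stride` tokens (my assumption of how
    the overlap is meant to work; special tokens are not accounted for).
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must be >= 0 and smaller than max_length")
    step = max_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return windows


# A document of 1000 (fake) token ids, split with the settings I use.
windows = sliding_windows(list(range(1000)), max_length=512, stride=128)
for w in windows:
    print(w[0], w[-1], len(w))
```

With this picture in mind, I would expect every token of a long document to fall into at least one window, so annotations should not stop partway through.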
In the training phase, I called the tokenizer like this:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, stride=128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)
For prediction, I do:
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, stride=128, return_overflowing_tokens=True, model_max_length=512, truncation=True, is_split_into_words=True)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first")
But it does not work: the model stops producing annotations partway through the text. To test this, I duplicated several times a sentence that I know contains the annotations I am looking for.
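A rough way to measure where the annotations stop is sketched below. It assumes the pipeline output is a list of dicts with `start` and `end` character offsets, which is what I get with an aggregation strategy set. The sentence, the `CHEM` label, the helper name `annotation_coverage`, and the fake entity list standing in for `ner(text)` are all made up for illustration.

```python
def annotation_coverage(text, entities):
    """Return (last annotated character offset, fraction of the text it reaches)."""
    if not entities or not text:
        return 0, 0.0
    last_end = max(entity["end"] for entity in entities)
    return last_end, last_end / len(text)


# Hypothetical test document: the same sentence repeated many times.
sentence = "Aspirin reduces fever. "
text = sentence * 200

# Stand-in for `ner(text)`: pretend only the first 100 copies were annotated.
fake_entities = [
    {"entity_group": "CHEM", "start": i * len(sentence), "end": i * len(sentence) + 7}
    for i in range(100)
]

last_end, fraction = annotation_coverage(text, fake_entities)
print(f"last annotation ends at character {last_end} ({fraction:.0%} of the text)")
```

In my real test I would pass the actual pipeline output instead of `fake_entities`; a fraction well below 100% is what I mean by the model stopping in the middle of the text.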
Help please.