LiLT - Token Shift/Misalignment during model inference

I have trained a LiLT model successfully, basing most of my implementation on the data and code shown in this notebook:

As I have a lot of data available, which I unfortunately cannot disclose for privacy reasons, performance is very good and training runs smoothly. Most of the documents exceed the token limit, which is why I have implemented a sliding-window approach that seems to work fine. The model thus receives individual windows as inputs, and inference on a document is usually the combination of the individual windows put back together. This is the processor (feature extractor plus tokenizer) used to prepare input documents:

    feature_extractor = LayoutLMv3FeatureExtractor(apply_ocr=False) # OCR is provided as input data
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    processor = LayoutLMv3Processor(feature_extractor, tokenizer)
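Roughly, the sliding-window splitting works like this (a minimal sketch; the window size, stride, and function name are illustrative, and the real implementation also carries the bboxes and labels along with the tokens):

```python
def make_windows(tokens, window_size=510, stride=255):
    """Split a token sequence into overlapping windows.

    510 leaves room for the two special tokens (<s> and </s>)
    that the tokenizer adds to each window.
    """
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows
```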

When running inference and plotting the results for an actual image, I mostly get very good results. However, some very confusing cases appear from time to time. The image shows that many predictions are off; they are not wrong per se, they are simply misaligned with the actual labels of the words.
When investigating further, I noticed the following when looking at the number of predicted labels, which I generate for a single window with this function:

```
def get_predictions_per_window(model, window):
    # unsqueeze because we need to keep the batch dimension for the model
    batch_samples = {
        "input_ids": window["input_ids"].unsqueeze(0),
        "attention_mask": window["attention_mask"].unsqueeze(0),
        "bbox": window["bbox"].unsqueeze(0),
        "labels": window["labels"].unsqueeze(0),
    }

    output = model(
        batch_samples["input_ids"],
        attention_mask=batch_samples["attention_mask"],
        bbox=batch_samples["bbox"],
    )

    predictions = output.logits.argmax(-1).squeeze().tolist()
    mask = batch_samples["attention_mask"].squeeze().tolist()

    # keep only positions that are not padding
    required_indices = [index for index, value in enumerate(mask) if value == 1]

    filtered_predictions = [predictions[i] for i in required_indices]
    filtered_labels = [batch_samples["labels"][0][i].item() for i in required_indices]

    # now filter out predictions where the corresponding label is -100
    final_predictions = [pred for pred, label in zip(filtered_predictions, filtered_labels) if label != -100]

    labels = [model.config.id2label[prediction] for prediction in final_predictions]

    return labels
```
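As a quick sanity check on top of that, a helper like the following (hypothetical, names assumed) makes a shift visible immediately by comparing counts per window instead of eyeballing plotted predictions:

```python
def check_alignment(predicted_labels, words):
    """Report whether word-level predictions line up with the input words."""
    if len(predicted_labels) != len(words):
        diff = len(predicted_labels) - len(words)
        print(f"misalignment: {len(predicted_labels)} predictions vs "
              f"{len(words)} words (off by {diff:+d})")
        return False
    return True
```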

The labels predicted by the model simply do not match the actual labels present in the data, in the sense that the predicted labels sometimes exceed the actual labels by ~2-3 labels. This is very puzzling to me, as I have calculated the token limit for each page and left more than enough room for each window to fit into the model. When generating the encoding used as input to the model, I use this function:

```
def generate_encoding(sample, processor=None):
    image = sample["image"]
    # flatten window attributes for encoding
    words = list(chain(*sample["words"]))
    ner_tags = list(chain(*sample["ner_tags"]))
    bboxes = list(chain(*sample["bboxes"]))
    encoding = processor(
        image,
        words,
        boxes=bboxes,
        word_labels=ner_tags,
        return_tensors="pt",
    )
    # remove pixel values, not needed for LiLT
    del encoding["pixel_values"]
    return encoding
```
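For context, my understanding is that the processor labels only the first sub-token of each word and masks continuation and special tokens with -100. The standalone sketch below (no transformers required; the mapping is a hand-written stand-in for the fast tokenizer's `word_ids()`) illustrates how that masking keeps the label count equal to the word count, and how labelling every sub-token instead would inflate it:

```python
def align_labels(word_labels, word_ids):
    """word_ids maps each token position to its source word index (None = special token)."""
    labels, previous = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(-100)               # special token: never scored
        elif wid != previous:
            labels.append(word_labels[wid])   # first sub-token carries the word label
        else:
            labels.append(-100)               # continuation sub-token: masked
        previous = wid
    return labels
```

If a word is split into several sub-tokens and each receives a real label, the filtered prediction list grows past the word list, which is exactly the kind of off-by-a-few shift described above.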

The number of labels in the input data, which in turn drives the tokenization, somehow deviates from the number of labels that the processor (the custom processor I saved after training) produces: the processor adds extra labels. I am very confused as to why this happens. When inspecting the actual predictions per text, one by one, it is evident that in some cases a single word or token is inserted in the middle, causing all subsequent predictions to shift and thus misalign. I strongly believe this is not due to the sliding windows, as windows are treated independently and the model has no knowledge of the rest of the document.

One thing I have tried is removing all whitespace characters, such as \n and \r, from the input text before tokenization. This, however, caused some of the images to show exactly the reverse behaviour: now the actual labels exceed the predicted labels by ~2-3 words, and the predicted words shift in the other direction.

Did anyone observe similar behaviour, or does anyone have experience with the model adding special tokens that are not correctly flagged and are thus predicted as words in a document? It's quite hard to debug; my only idea at this point is stepping into the actual LayoutLMv3Processor and checking how the input words are transformed and whether a token somehow gets tagged incorrectly.
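In case it helps anyone reproduce the debugging, a small helper like this (hypothetical) can locate the position where two label sequences start to diverge, i.e. roughly where the extra token gets inserted (ordinary prediction errors will also trigger it, so it only narrows down the search region):

```python
def first_divergence(predicted, reference):
    """Return the first index where the two sequences differ, or None if identical."""
    for i, (p, r) in enumerate(zip(predicted, reference)):
        if p != r:
            return i
    if len(predicted) != len(reference):
        # one sequence is a strict prefix of the other
        return min(len(predicted), len(reference))
    return None
```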