How to decode input_ids back to a string in LayoutLMv3

Hi, I am using LayoutLMv3 with the LayoutLMv3 processor. At training time I have labels along with the image, text, and boxes, but at inference time I do not have labels. My inference code is below.

import numpy as np
from transformers import AutoProcessor

# apply_ocr=False because the words and boxes are supplied by the caller
processor = AutoProcessor.from_pretrained("microsfotlmv3_repo", apply_ocr=False)
encoding = processor(
    images=resize_image,
    text=tokens,
    boxes=boxes,
    return_offsets_mapping=True,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512,
)
# The model does not accept offset_mapping, so remove it before the forward pass
offset_mapping = encoding.pop("offset_mapping")
outputs = test_model1(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
# A token whose character offset does not start at 0 is a continuation sub-token
is_subword = np.array(offset_mapping.squeeze().tolist())[:, 0] != 0
true_predictions = [id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]]

Currently, I am decoding the text like this:

cleaned_input_ids = encoding['input_ids'][encoding['attention_mask']>0]
text = processor.tokenizer.decode(cleaned_input_ids.squeeze().tolist())
text = text[4:-4]
tokens = text.split(" ")

But the number of tokens and the number of true_predictions do not match.

I am expecting the result to be:
“sun”:label,
“rises”:label,
“in”:label,
“the”:label,
“east”:label.

Currently I am not able to map them, because their lengths do not match. How can I resolve this?

Tagging @nielsr and others.

Hi,

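You can use encoding.word_ids() to find out which word each token belongs to. It returns one entry per token: the index of the word in your input list, or None for special and padding tokens.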
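As a rough sketch of how that mapping could be used: the helper below (the name `align_words_with_predictions` is made up for illustration) keeps the prediction of the first token of each word and pairs it with the original word. It assumes the processor wraps a fast tokenizer, so that `encoding.word_ids(batch_index=0)` is available, and that `predictions` is the flat list of class ids from the question. Toy values stand in for real model output here so the snippet runs on its own.

```python
from typing import Dict, List, Optional, Tuple


def align_words_with_predictions(
    word_ids: List[Optional[int]],
    predictions: List[int],
    words: List[str],
    id2label: Dict[int, str],
) -> List[Tuple[str, str]]:
    """Pair each original word with the label predicted for its first token."""
    pairs = []
    previous_word_id = None
    for word_id, pred in zip(word_ids, predictions):
        # None marks special tokens (<s>, </s>) and padding: skip them.
        # A repeated word_id marks a continuation sub-token: skip it too.
        if word_id is not None and word_id != previous_word_id:
            pairs.append((words[word_id], id2label[pred]))
        previous_word_id = word_id
    return pairs


# Toy stand-ins. In the real pipeline these would come from:
#   word_ids = encoding.word_ids(batch_index=0)
#   predictions = outputs.logits.argmax(-1).squeeze().tolist()
words = ["sun", "rises", "in", "the", "east"]
word_ids = [None, 0, 1, 1, 2, 3, 4, None, None]  # "rises" split into two tokens
predictions = [0, 1, 2, 2, 0, 0, 1, 0, 0]
id2label = {0: "O", 1: "B-X", 2: "I-X"}  # hypothetical label map

pairs = align_words_with_predictions(word_ids, predictions, words, id2label)
for word, label in pairs:
    print(f"{word}: {label}")
```

This way the words come straight from the list passed to the processor, so there is no need to decode input_ids and split on spaces. Skipping the None entries also matters: special and padding tokens have an offset of (0, 0), so a check on the offset start alone would count them as word starts. If the input is truncated at max_length, words past the cut-off simply will not appear in the result.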