TrOCR, CER metric error

I am fine-tuning TrOCR and using the Character Error Rate (CER) from jiwer as the metric.

# cer_metric is created elsewhere (not shown in this post)
def compute_cer(pred_ids, label_ids, processor):
    # Decode the predicted token ids into strings
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    # -100 marks label positions ignored by the loss; swap in the pad token id so they can be decoded
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    print(f"len of label_ids {len(label_ids)}")
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    print(f"len_pred_str={len(pred_str)}, len_label={len(label_str)}")
    cer = cer_metric.compute(predictions=pred_str, references=label_str)
    return cer

Except for the print statements, the code is a direct copy from @nielsr's tutorial. Even though len(pred_str) and len(label_str) are the same, I am getting:

ValueError: number of ground truth inputs (17) and hypothesis inputs (24) must match.

I have attached a screenshot of the error.

Please let me know if you have any clue what might be causing the issue.

I believe this was a bug that has since been fixed; see "Datasets.load_metric("cer") does not work".
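
If upgrading is not an option, one way to sanity-check the numbers is to compute CER by hand, without going through the metric wrapper at all. The following is only a minimal sketch in plain Python, not the jiwer or datasets implementation: edit_distance and simple_cer are hypothetical helper names, and the sketch assumes CER is defined as total character edits divided by total reference characters, with each prediction/reference pair compared as one whole string.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference chars into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def simple_cer(predictions, references):
    """Hypothetical helper: total character edits / total reference characters."""
    if len(predictions) != len(references):
        raise ValueError(
            f"got {len(predictions)} predictions but {len(references)} references"
        )
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    # Each pair is compared as one whole string, so spaces or sentence
    # boundaries inside a string cannot change the number of inputs.
    total_edits = sum(edit_distance(r, p) for p, r in zip(predictions, references))
    return total_edits / total_chars
```

Inside compute_cer you could call simple_cer(pred_str, label_str) next to cer_metric.compute(...) and compare the two values. If the hand-rolled version runs while the metric raises the ValueError, that points at the metric wrapper rather than at your decoding code.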
