Hi, I’m fine-tuning RoBERTa-large for a relation classification task. It works fine on datasets in the TACRED format. For this specific task, I extracted text from PDF files using OCR tools. The extracted text contains alphanumeric strings, merged tokens, and misspelled words. In other words, many of these tokens don’t appear in the English vocabulary, because the OCR tool extracts everything from the PDF files verbatim.
Whenever I run training, it throws an index-out-of-range error, as shown in the attached image.
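For context, the error looks like an embedding lookup failing because some token id exceeds the model’s vocabulary size. Here is a minimal reproduction of that failure mode in plain PyTorch (the ids are hypothetical; 50265 is RoBERTa-large’s vocabulary size, so valid ids are 0..50264):

```python
import torch

# RoBERTa-large's embedding table has 50265 rows.
vocab_size = 50265
embedding = torch.nn.Embedding(vocab_size, 8)

in_range = torch.tensor([0, 50264])       # valid ids: looked up fine
out_of_range = torch.tensor([50266])      # e.g. an id for a token the model never saw

embedding(in_range)  # works
try:
    embedding(out_of_range)
except IndexError as e:
    print("IndexError:", e)  # "index out of range in self"
```

I’m not sure whether my error comes from exactly this, but the traceback matches this pattern.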
Does RoBERTa correctly tokenize these kinds of words?
Also, is there any way in the implementation to handle new words/tokens that are not in the lookup table of the model’s vocabulary?
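What I’d hope for is something like `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))` from the transformers library, which as far as I understand just grows the embedding matrix while keeping the learned rows. A sketch of that idea in plain PyTorch, with toy sizes (everything here is illustrative, not my actual code):

```python
import torch

# Toy sizes: a 10-word vocabulary, 4-dim embeddings, 2 new tokens to add.
old_vocab, dim, n_new = 10, 4, 2
old_emb = torch.nn.Embedding(old_vocab, dim)

# Allocate a larger table and copy the learned rows over;
# the new rows stay randomly initialized and get trained later.
new_emb = torch.nn.Embedding(old_vocab + n_new, dim)
with torch.no_grad():
    new_emb.weight[:old_vocab] = old_emb.weight

print(new_emb.weight.shape)  # torch.Size([12, 4])
```

Is this the right approach here, or does RoBERTa’s byte-level BPE already cover arbitrary strings without needing new tokens?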
What might be a possible solution, so that we don’t have to worry about which words we feed to the model and training runs smoothly?
NOTE: I prepared CoNLL-format data from the same text for fine-tuning BERT, and it works smoothly on similar tokens.