RoBERTa Index out of range error on relation extraction data

ahmadaii · April 29, 2022, 9:23am

Hi, I’m fine-tuning RoBERTa-large for a relation classification task. It works fine on datasets in the TACRED format. For this specific purpose, I extracted text from PDF files using OCR tools. The extracted text contains alphanumeric strings, merged tokens, and misspelled words. In other words, many of these tokens are not real English words, because the OCR tool extracts everything from the PDF files as-is.
Whenever I run training, it throws an "Index out of range" error, as shown in the attached image.
Does RoBERTa tokenize these kinds of words correctly?
Also, is there any way in the implementation to handle new words/tokens that are not in the model's vocabulary lookup table?

What would be a possible solution, so that we don’t have to care which words we feed to the model and training runs smoothly?
NOTE: I prepared CoNLL data from the same text for fine-tuning BERT, and it works smoothly on similar tokens.
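
One way to narrow this down before training is to check the encoded batches directly. RoBERTa's tokenizer is a byte-level BPE, so noisy OCR strings are normally split into known subword pieces rather than mapped to missing ids. An "index out of range" error in an embedding lookup more commonly comes from either a token id that is not smaller than the number of rows in the embedding table (for example, after adding special tokens without resizing the embeddings) or a sequence longer than the model's position limit. The sketch below is plain Python and only illustrates that check. The function name `find_index_problems` and the numbers in the sample batch are hypothetical; in a Hugging Face setup the two limits would presumably come from `model.get_input_embeddings().num_embeddings` and `tokenizer.model_max_length`.

```python
from typing import List, Sequence


def find_index_problems(
    batch_input_ids: Sequence[Sequence[int]],
    embedding_rows: int,
    max_positions: int,
) -> List[str]:
    """Describe every example that would overflow an embedding table.

    embedding_rows: number of rows in the token embedding matrix (assumed value).
    max_positions:  longest sequence the position embeddings can index (assumed value).
    """
    problems = []
    for row, ids in enumerate(batch_input_ids):
        # Sequences longer than the position table overflow the position embedding.
        if len(ids) > max_positions:
            problems.append(
                f"example {row}: length {len(ids)} exceeds max_positions {max_positions}"
            )
        # Token ids must lie in [0, embedding_rows) to be valid lookups.
        bad = [i for i in ids if i < 0 or i >= embedding_rows]
        if bad:
            problems.append(
                f"example {row}: token ids {bad} fall outside [0, {embedding_rows})"
            )
    return problems


# Illustrative batch only: one clean example, one with an id past the table,
# and one that is too long.
batch = [
    [0, 713, 16, 2],
    [0, 50265, 2],
    list(range(600)),
]
for line in find_index_problems(batch, embedding_rows=50265, max_positions=512):
    print(line)
```

If the check reports out-of-range ids, calling `model.resize_token_embeddings(len(tokenizer))` after adding tokens is the usual remedy. If it reports over-long sequences, tokenizing with `truncation=True` and an explicit `max_length` keeps inputs within the limit.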
