IndexError while training Roberta with a custom tokenizer

Hello… It’s my first time using huggingface, so I could really use some help. I trained a BPE tokenizer using this tutorial from: colab. The BPE tokeniser seems to be loaded correctly. In the example below you see that it added a start token at the beginning of the string ([CLS] = 0) and a masked token in the middle ([MASK]= 4) and a end of sequence token at the end ([SEP] = 2) and ofcourse the padded tokens to reach the 512 length.

#### Loading Tokenizer ####
tokenizer = RobertaTokenizerFast.from_pretrained('./BPE',max_len=512)

#### Testing Tokenizer ####
string = "MEPTKIVENLYLGNIQNGIRHSNYGFDKIINLTRFNNQYGIPTVWINID<mask>SESSDLYSHLQKVTTLIHDSIE!GNKVLVHCQAGISRSATVVIAYIMRSKRY"
inputs = tokenizer(string, max_length=512, padding="max_length", truncation=True, return_tensors="pt")
inputs
{'input_ids': tensor([[   0,  142,  228, 4136,   59,  595, 1888,   86,  101,  163, 1844,  127,
         1236,   59,  140, 1672, 1847,  106,  198, 3076,   73,    4, 8914,   41,
          135,   96,  200, 7849,  192,  112,   70,   11,  216,   33,  154,  426,
         4433,  111, 4650, 1307,  307,  111,  115,   28,    2,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,

But when I try to pretrain my Roberta model I always get this IndexError. I checked my training and evaluation datasets and they have the correct shapes and the tokens within them are within the normal range [0:10001] (I have a vocab_size = 10002). This is my code:

# Model Configurations
config = RobertaConfig(
    vocab_size=10_002,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=6,
)

# Create Model
model = RobertaForMaskedLM(config=config)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# BUILD DATASETS
def encode(examples):
    """Mapping function to tokenize the sentences passed"""
    inputs = tokenizer(examples, max_length=512, padding="max_length", truncation=True, return_tensors="pt")
    inputs['labels'] = inputs.input_ids.detach().clone()
    return inputs


def tokenizing_function(dataset):
    data = pd.read_csv(dataset, header=None)
    result = []
    for i in tqmd(data[0], total=len(data)): result.append(encode(i))
    return result

train = tokenizing_function("./dataset/Train")
test = tokenizing_function("./dataset/Test")
val = tokenizing_function("./dataset/Val")

# DATA LOADER
class DataLoader(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: val[0] for key, val in self.encodings[idx].items()}

    def __len__(self):
        return len(self.encodings)

train_dataset = DataLoader(train)
test_dataset = DataLoader(test)
val_dataset = DataLoader(val)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# TRAINING WITH LAMB OPTIMIZER
optimizer = optim.Lamb(model.parameters(), lr=0.0025)
scheduler = get_polynomial_decay_schedule_with_warmup(optimizer=optimizer,
                                                      num_warmup_steps=3125,
                                                      num_training_steps=125000,
                                                      power=0.01)


def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

training_args = TrainingArguments(
    output_dir="./ROBERTA/model/",
    logging_dir='./ROBERTA/logs',
    overwrite_output_dir=True,
    evaluation_strategy="steps",
    do_train=True,
    do_eval=True,
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    save_steps=500,
    logging_steps=1000,
    eval_steps=250,
    prediction_loss_only=True,
    weight_decay=0.01,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    optimizers=(optimizer, scheduler),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

# Start Training
trainer.train()

I am working on proteins and each protein is supposed to be a different document. That is why I am creating my own DataLoader. When I am running it I get this error:

***** Running training *****
  Num examples = 5000
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1570
  0%|          | 0/1570 [00:00<?, ?it/s]Traceback (most recent call last):
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2183, in training_step
    loss = self.compute_loss(model, inputs)
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2215, in compute_loss
    outputs = model(**inputs)
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 1094, in forward
    outputs = self.roberta(
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 840, in forward
    embedding_output = self.embeddings(
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/transformers/models/roberta/modeling_roberta.py", line 133, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "./CLASSIFIER/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Any ideas why I am getting this error?
Thanks

hi.

In my case, there num_label (task’s label count.) mismatch return same error.

did you check config’s num_label is correctly same with your task’s num_label?

regards

My first thought is that the new tokenizer has more tokens than the original model accounts for, so you will have to make sure that vocab_size is the correct size.

config = RobertaConfig(
    vocab_size=len(tokenizer),  # here
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=6,
)

Or you can do it after initializing the model, by resizing the embeddings:

model.resize_token_embeddings(len(tokenizer))

Hello…
I am just pre-training my model on MLM. So I have no num_labels. I think this feature is for classification tasks.

Hello,
unfortunately, it did not work. I tried it and it did not work.
Any other ideas?

I had some error occurred case before.

I hope these errors log help to solve your error.

- dataloader make wrong embedding.
before get input to model, i checked sample batch embedding to check they contain wrong.

  • some tensor has minus value .
  • zero divide make wrong value.

- tokenizer config error.

  • vocab dictionary over index.
  • padding to max_length does not worked.
  • specials token mapping mismatch.
  • tokenizer config didn’t contain info of custom vocab size.

can you show more detail of that error?

regards.

Please be more specific when you say something does not work. What did not work? Which error did you get? Please post a MINIMAL code example, and the full error trace that you get.

Good morning everyone,
I just tracked down my error. It has to do with the position ids. So, when my example has no padding the position ids are within [2:512] which gives the IndexError. This is the function that produces the error:

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    """
    Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols
    are ignored. This is modified from fairseq's `utils.make_positions`.
    Args:
        x: torch.Tensor x:
    Returns: torch.Tensor
    """
    # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx

To solve my issue, I just truncated my sequence a little bit more, adding at least 2 padding positions.
In any case, I think this is a bug. The only way to get indices within the embeddings range is to have a negative padding_idx = -1.

1 Like