Trainer.train() giving me KeyError: [random number]

I see I’m not the first one to have this problem, but unfortunately it looks like previous users with similar troubles didn’t get responses. Here’s hoping I’ll be luckier!

I’m attempting to tune my hyperparameters while fine-tuning a BERT model (specifically, distilbert-base-uncased).

I’ve tried many versions of the code for the objective function as I’ve searched online, with and without a model_init. A few have given me KeyError: 142224. I know this number is in fact random, because it’s consistently the same number until I change my random seed. My training set has over 250,000 rows. People with smaller datasets in the forums seem to get smaller random numbers. If I shorten the training dataset, the KeyError number changes, so I strongly suspect it’s trying to access a particular row.

My indices in the dataframe were originally random numbers up into the millions (because of the way I selected the dev set from my full dataset) but the error persists even if I reset_index().
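Here’s a small sketch of that symptom on a toy frame. The data and the helper lookup_raises_keyerror are made up for illustration and are not my real set. Assuming a plain pandas DataFrame with string column names, an integer lookup with df[...] still fails after reset_index(), while df.iloc[...] does find the row:

```python
import pandas as pd

# Toy stand-in for the tokenized training frame (made-up values).
df = pd.DataFrame(
    {
        "input_ids": [[101, 1037, 102], [101, 2307, 102], [101, 1996, 102]],
        "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
        "labels": [2, 2, 1],
    },
    index=[907, 12, 5500],  # scattered index, like a sampled dev set
).reset_index(drop=True)    # index is now 0, 1, 2


def lookup_raises_keyerror(frame, idx):
    """Hypothetical helper: True if frame[idx] raises KeyError."""
    try:
        frame[idx]  # square brackets on a DataFrame select a column
    except KeyError:
        return True
    return False


# Row 1 exists, but df[1] looks for a column named 1 and fails.
integer_lookup_fails = lookup_raises_keyerror(df, 1)

# Positional row access works and gives one example as a dict.
row = df.iloc[1].to_dict()
print(integer_lookup_fails, row)
```

So the number in the KeyError may just be whichever position the shuffled sampler asks for first, which would fit it changing with the seed and the dataset length.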

I guess without further ado, here’s what I’ve got. Here’s a sample of my dataset so you can see if it’s in the correct format:

    input_ids               attention_mask     labels
0   [101, 1037, 3803...     [1, 1, 1, 1...     2
1   [101, 2307, 2326...     [1, 1, 1, 1...     2
2   [101, 1996, 2326...
3   [101, 2077, 1045...     [1, 1, 1, 1...     1
4   [101, 3083, 3319...     [1, 1, 1, 1...     1

And here’s the code leading up to the error:

def model_init(trial):
    # Define hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    num_train_epochs = trial.suggest_int("num_train_epochs", 1, 3)
    gradient_accumulation_steps = trial.suggest_int("gradient_accumulation_steps", 1, 8)
    per_device_train_batch_size = trial.suggest_int("per_device_train_batch_size", 4, 16)
    evaluation_strategy = trial.suggest_categorical("evaluation_strategy", ['steps', 'epoch'])
    per_device_eval_batch_size = trial.suggest_int("per_device_eval_batch_size", 4, 16)
    warmup_steps = trial.suggest_int("warmup_steps", 100, 500)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)

    model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased',num_labels=2)

    return model

def objective(trial):


    # Define training arguments
    training_args = TrainingArguments(
        output_dir='drive/MyDrive/BERT Sentiment/output',
        seed=42,
        logging_dir='drive/MyDrive/BERT Sentiment/output/logs',
        logging_steps=1000
    )
    print("Defined the training arguments")


    model = model_init(trial)
    print("Initialized the model")

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_set,
        eval_dataset=eval_set)
    
    print("Created the trainer")

    trainer.train()
    print("Trained the model")

    results = trainer.hyperparameter_search(model=None, direction='maximize',args=training_args,model_init=model_init)
    print(results.metrics['f1'])

study = optuna.create_study(direction='maximize')

study.optimize(objective, n_trials=1)
best_hyperparameters = study.best_params

print("Best hyperparameters" + str(best_hyperparameters))

And here’s the error all of that throws.

[I 2023-10-28 05:20:15,502] A new study created in memory with name: no-name-681288ba-c1a7-4d00-bd14-ccb4fad8cdac

Defined the training arguments

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Initialized the model
Created the trainer

[W 2023-10-28 05:20:16,477] Trial 0 failed with parameters: {'learning_rate': 3.948249070738038e-05, 'num_train_epochs': 3, 'gradient_accumulation_steps': 2, 'per_device_train_batch_size': 7, 'evaluation_strategy': 'steps', 'per_device_eval_batch_size': 6, 'warmup_steps': 333, 'weight_decay': 0.048816647569152063} because of the following error: KeyError(142224).

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "<ipython-input-18-715f9f522d66>", line 25, in objective
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1870, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 451, in __iter__
    current_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 142224
[W 2023-10-28 05:20:16,483] Trial 0 failed with value None.

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3801             try:
-> 3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:

17 frames

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 142224


The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:
-> 3804                 raise KeyError(key) from err
   3805             except TypeError:
   3806                 # If we have a listlike key, _check_indexing_error will raise

KeyError: 142224

There are any number of places I could’ve made an error, of course. I’ve tried a few different tutorials on HuggingFace, as well as asking ChatGPT to explain/fix the errors, but so far no progress on this one.

I’m working on Google Colab, if it’s relevant. Please let me know if there’s any other information I could give you that would help with diagnosis.

Thanks for reading!

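Hi! Well, apparently the data you’re training and evaluating on have to be numpy arrays instead of DataFrames. The last frames of your traceback are in pandas’ frame.py, where __getitem__ calls self.columns.get_loc(key), so the integer index coming from the DataLoader is being looked up as a column name rather than a row. So once you’ve tokenized your train_set and test_set, try this: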

tokenized_train = tokenized_train.values
tokenized_test = tokenized_test.values

I’m not sure if this would work for you, but it’s worth a try!
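If the numpy route doesn’t pan out, another option is to wrap the DataFrame so that indexing by an integer returns one row as a dict of features. This is a sketch under assumptions rather than a confirmed fix. FrameDataset is a hypothetical name, the toy data is made up, and I haven’t run it against your setup:

```python
import pandas as pd


class FrameDataset:
    """Hypothetical map-style wrapper: integer index in, dict of features out."""

    def __init__(self, frame):
        # Drop the original index so positions run 0..len-1.
        self.frame = frame.reset_index(drop=True)

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        # .iloc selects a row by position; to_dict() yields
        # {"input_ids": [...], "attention_mask": [...], "labels": ...}
        return self.frame.iloc[int(idx)].to_dict()


# Toy frame standing in for the tokenized data (made-up values).
toy = pd.DataFrame(
    {
        "input_ids": [[101, 1037, 102], [101, 2307, 102], [101, 1996, 102]],
        "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
        "labels": [2, 2, 1],
    },
    index=[907, 12, 5500],
)

train_ds = FrameDataset(toy)
first = train_ds[0]  # no KeyError: the lookup is by row position
print(len(train_ds), first)
```

You would then pass train_dataset=FrameDataset(train_set) and eval_dataset=FrameDataset(eval_set) to the Trainer. Converting with datasets.Dataset.from_pandas(train_set) should be another way to get the same row-as-dict behaviour, if you have the datasets library installed.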