KeyError: '__index_level_0__' error with datasets arrow_writer.py

Hi folks.

I've been following this tutorial notebook (Transformers-Tutorials/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub) to figure out how to use the LayoutLM model.

I was able to fine-tune the model on my desktop with an open-source dataset, but when I try to use the same script on my own dataset, I get the following error:

Traceback (most recent call last):
  File "main_v1.py", line 410, in <module>
    main(csv=True)
  File "main_v1.py", line 322, in main
    encoded_train_dataset = train_dataset.map(lambda example: encode_example(example), features=features)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2364, in map
    desc=desc,
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 532, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 499, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2757, in _map_single
    writer.finalize()
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 537, in finalize
    self.write_examples_on_file()
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 414, in write_examples_on_file
    self.write_batch(batch_examples=batch_examples)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 504, in write_batch
    col_type = features[col] if features else None
KeyError: '__index_level_0__'

So I took a look at arrow_writer.py (datasets/arrow_writer.py at master · huggingface/datasets · GitHub), and it appears I'm getting that error because the code treats '__index_level_0__' as the name of a column being passed in and doesn't find it within features. This is confusing, as I'm using the following for features:

features = Features({
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'label': ClassLabel(names=['refuted', 'entailed']),
    'image_path': Value(dtype='string'),
    'words': Sequence(feature=Value(dtype='string')),
})
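For what it's worth, the failing line in write_batch boils down to a per-column lookup of the batch against the declared features: any column that shows up in the encoded examples but isn't declared raises the KeyError. A simplified, stdlib-only sketch of that logic (illustrative names, not datasets' actual code):

```python
# Simplified sketch of the lookup in write_batch (not datasets' real code):
# every column present in the batch must be declared in features.
features = {"input_ids": "Sequence(int64)", "bbox": "Array2D", "label": "ClassLabel"}

good_batch = {"input_ids": [[1, 2]], "bbox": [[[0, 0, 0, 0]]], "label": [0]}
bad_batch = {**good_batch, "__index_level_0__": [7]}  # extra, undeclared column

def lookup_columns(batch, features):
    # Mirrors `col_type = features[col]`: an unknown column raises KeyError
    return [features[col] for col in batch]

lookup_columns(good_batch, features)  # fine
try:
    lookup_columns(bad_batch, features)
except KeyError as err:
    print(err)  # → '__index_level_0__'
```

So the question is really where the extra '__index_level_0__' column is coming from, not what is missing from features.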

Anyone have a clue as to what I'm doing wrong? I'm not sure where to look to debug this issue.

Just to give more context, the error above is raised when

return encoding

is reached in the following function:

def encode_example(example, max_seq_length=512, pad_token_box=[0, 0, 0, 0]):
  words = example['words']
  normalized_word_boxes = example['bbox']

  assert len(words) == len(normalized_word_boxes)

  token_boxes = []
  for word, box in zip(words, normalized_word_boxes):
      word_tokens = tokenizer.tokenize(word)
      token_boxes.extend([box] * len(word_tokens))
  
  # Truncation of token_boxes
  special_tokens_count = 2 
  if len(token_boxes) > max_seq_length - special_tokens_count:
      token_boxes = token_boxes[: (max_seq_length - special_tokens_count)]
  
  # add bounding boxes of cls + sep tokens
  token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]
  
  encoding = tokenizer(' '.join(words), padding='max_length', truncation=True)
  # Pad token_boxes up to the sequence length
  input_ids = tokenizer(' '.join(words), truncation=True)["input_ids"]
  padding_length = max_seq_length - len(input_ids)
  token_boxes += [pad_token_box] * padding_length
  encoding['bbox'] = token_boxes
  encoding['label'] = label2idx[example['label']]

  assert len(encoding['input_ids']) == max_seq_length
  assert len(encoding['attention_mask']) == max_seq_length
  assert len(encoding['token_type_ids']) == max_seq_length
  assert len(encoding['bbox']) == max_seq_length

  return encoding

Any help / hint would be greatly appreciated!

Are you loading your dataset from a Pandas DataFrame by any chance? If you do, it adds the index as a separate column. Unless you need the index values, you can just remove this column: dataset = dataset.remove_columns(["__index_level_0__"]).
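To see it concretely: the extra column only appears when the frame's index is no longer the default 0..n-1 range (e.g. after filtering rows), since that is when the index gets preserved as '__index_level_0__'. A minimal pandas sketch (the toy frame is made up), showing that resetting the index before building the Dataset also avoids the column:

```python
import pandas as pd

# Toy frame (made up); filtering leaves a non-default index [0, 2],
# which is what would be carried over as '__index_level_0__'.
df = pd.DataFrame({"words": [["a"], ["b"], ["c"]],
                   "label": ["entailed", "refuted", "entailed"]})
df = df[df["label"] == "entailed"]

# Fix 1: drop the stale index before building the Dataset
df = df.reset_index(drop=True)
# train_dataset = Dataset.from_pandas(df)

# Fix 2 (after the fact), as above:
# train_dataset = train_dataset.remove_columns(["__index_level_0__"])
```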