KeyError: '__index_level_0__' error with datasets arrow_writer.py

Hi folks.

I've been following this tutorial notebook (https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb) to learn how to use the LayoutLM model.

I was able to fine-tune the model on my desktop with an open-source dataset, but when I run the same script on my own dataset, I get the following error:

Traceback (most recent call last):
  File "main_v1.py", line 410, in <module>
    main(csv=True)
  File "main_v1.py", line 322, in main
    encoded_train_dataset = train_dataset.map(lambda example: encode_example(example), features=features)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2364, in map
    desc=desc,
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 532, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 499, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2757, in _map_single
    writer.finalize()
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 537, in finalize
    self.write_examples_on_file()
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 414, in write_examples_on_file
    self.write_batch(batch_examples=batch_examples)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 504, in write_batch
    col_type = features[col] if features else None
KeyError: '__index_level_0__'

I took a look at arrow_writer.py (https://github.com/huggingface/datasets/blob/master/src/datasets/arrow_writer.py), and it appears the error occurs because the writer treats '__index_level_0__' as the name of a column in the data and cannot find it in features. This is confusing, because I never declared such a column. These are the features I'm using:

from datasets import Array2D, ClassLabel, Features, Sequence, Value

features = Features({
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'bbox': Array2D(dtype='int64', shape=(512, 4)),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'label': ClassLabel(names=['refuted', 'entailed']),
    'image_path': Value(dtype='string'),
    'words': Sequence(feature=Value(dtype='string')),
})

Does anyone have a clue as to what I'm doing wrong? I'm not sure where to look to debug this issue.

For more context, the error appears to be raised when the function below reaches its final "return encoding" statement:

def encode_example(example, max_seq_length=512, pad_token_box=[0, 0, 0, 0]):
  words = example['words']
  normalized_word_boxes = example['bbox']

  assert len(words) == len(normalized_word_boxes)

  token_boxes = []
  for word, box in zip(words, normalized_word_boxes):
      word_tokens = tokenizer.tokenize(word)
      token_boxes.extend([box] * len(word_tokens))
  
  # Truncation of token_boxes
  special_tokens_count = 2 
  if len(token_boxes) > max_seq_length - special_tokens_count:
      token_boxes = token_boxes[: (max_seq_length - special_tokens_count)]
  
  # add bounding boxes of cls + sep tokens
  token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]
  
  encoding = tokenizer(' '.join(words), padding='max_length', truncation=True)
  # Pad token_boxes with pad_token_box up to the sequence length.
  input_ids = tokenizer(' '.join(words), truncation=True)["input_ids"]
  padding_length = max_seq_length - len(input_ids)
  token_boxes += [pad_token_box] * padding_length
  encoding['bbox'] = token_boxes
  encoding['label'] = label2idx[example['label']]

  assert len(encoding['input_ids']) == max_seq_length
  assert len(encoding['attention_mask']) == max_seq_length
  assert len(encoding['token_type_ids']) == max_seq_length
  assert len(encoding['bbox']) == max_seq_length

  return encoding

Any help or hints would be greatly appreciated!

Are you loading your dataset from a pandas DataFrame by any chance? If so, the DataFrame's index can end up in the dataset as a separate column named __index_level_0__. Unless you need the index values, you can simply remove that column:

dataset = dataset.remove_columns(["__index_level_0__"])
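To confirm that an extra column is the culprit before removing anything, one option is to compare the dataset's column names against the keys declared in features. The sketch below uses plain lists so it runs on its own. The helper name unexpected_columns and the example column list are hypothetical. With a real dataset you would pass dataset.column_names and the keys of your Features object instead.

```python
def unexpected_columns(column_names, feature_names):
    """Return columns present in the data but missing from the declared features."""
    declared = set(feature_names)
    return [name for name in column_names if name not in declared]


# Hypothetical column list, as it might look when a DataFrame index
# has been carried over into the dataset as a column.
column_names = ["words", "bbox", "label", "image_path", "__index_level_0__"]

# Keys of the features declared in the question.
feature_names = [
    "input_ids", "bbox", "attention_mask", "token_type_ids",
    "label", "image_path", "words",
]

extra = unexpected_columns(column_names, feature_names)
print(extra)  # ['__index_level_0__']
```

Any names reported this way are candidates to pass to dataset.remove_columns(...) before calling map with features.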


Alternatively, you can reset the index before converting the pandas DataFrame to a dataset, so that no index column is carried over. For example: df = df.reset_index(drop=True)
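As a rough illustration of why this helps: filtering or splitting a DataFrame usually leaves it with a non-default index, and that is the case in which the index tends to be carried over as a column. The sketch below uses a small made-up DataFrame (the values are assumptions for illustration only) and shows that reset_index(drop=True) restores a plain RangeIndex without adding a column.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "image_path": ["a.png", "b.png", "c.png"],
    "label": ["refuted", "entailed", "refuted"],
})

# Filtering keeps the original row labels, so the index is no longer 0..n-1.
subset = df[df["label"] == "refuted"]
print(list(subset.index))  # [0, 2]

# drop=True discards the old index instead of inserting it as a new column.
clean = subset.reset_index(drop=True)
print(list(clean.index))    # [0, 1]
print(list(clean.columns))  # ['image_path', 'label']
```

Note that reset_index returns a new DataFrame, so the result has to be assigned. Dataset.from_pandas also accepts a preserve_index argument, and passing preserve_index=False should have a similar effect at conversion time.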
