Map method to tokenize raises index error

Hi,

Unfortunately I still have this problem, even though I am using the latest datasets version, 1.8.0.
I am trying to run run_ner.py from transformers/examples/pytorch/token-classification (master branch of huggingface/transformers on GitHub) on Google Colab with a custom dataset. It works for a small subset, but when I use the full dataset I always get the following error:

  0% 0/1 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "/content/drive/MyDrive/ner/run_ner.py", line 512, in <module>
    main()
  File "/content/drive/MyDrive/ner/run_ner.py", line 359, in main
    load_from_cache_file=not data_args.overwrite_cache,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1635, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 186, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1954, in _map_single
    batch = input_dataset[i : i + batch_size]
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1484, in __getitem__
    format_kwargs=self._format_kwargs,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1471, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/usr/local/lib/python3.7/dist-packages/datasets/formatting/formatting.py", line 368, in query_table
    pa_subtable = _query_table(table, key)
  File "/usr/local/lib/python3.7/dist-packages/datasets/formatting/formatting.py", line 84, in _query_table
    return table.fast_slice(key.start, key.stop - key.start)
  File "/usr/local/lib/python3.7/dist-packages/datasets/table.py", line 129, in fast_slice
    i = _interpolation_search(self._offsets, offset)
  File "/usr/local/lib/python3.7/dist-packages/datasets/table.py", line 92, in _interpolation_search
    raise IndexError(f"Invalid query '{x}' for size {arr[-1] if len(arr) else 'none'}.")
IndexError: Invalid query '0' for size 1.
  0% 0/1 [00:00<?, ?ba/s]
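
To make the last frame easier to discuss, here is a minimal sketch of how a search over cumulative batch offsets could produce exactly this message. This is only my guess at the mechanism, not the actual datasets source: the function `interpolation_search` and the offsets `[0, 0, 1]` (a hypothetical empty first batch followed by a one-row batch) are assumptions made for illustration.

```python
from typing import List


def interpolation_search(offsets: List[int], x: int) -> int:
    """Return k such that offsets[k] <= x < offsets[k + 1].

    Hypothetical stand-in for the `_interpolation_search` frame in the
    traceback; `offsets` is assumed to hold cumulative batch lengths.
    """
    i, j = 0, len(offsets) - 1
    while i < j and offsets[i] <= x < offsets[j]:
        # Estimate the position of x by linear interpolation.
        k = i + (j - i) * (x - offsets[i]) // (offsets[j] - offsets[i])
        if offsets[k] <= x < offsets[k + 1]:
            return k
        elif offsets[k] < x:
            i = k + 1
        else:
            j = k
    raise IndexError(
        f"Invalid query '{x}' for size {offsets[-1] if len(offsets) else 'none'}."
    )


# One batch of one row: row 0 is found in batch 0.
print(interpolation_search([0, 1], 0))

# Assumed layout with an empty first batch: the interval [0, 0) can never
# contain row 0, so the search gives up and raises.
try:
    interpolation_search([0, 0, 1], 0)
except IndexError as err:
    print(err)
```

If this guess is right, the failure would depend on how the rows are split into batches rather than on the tokenizer function itself, which might explain why a small subset works.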

Do you have any idea what could cause it? Or is there a workaround for this?
Sorry, I am new to Hugging Face…

Thank you!
