What is the data file format of `run_ner.py`?

I am using the example script `run_ner.py` from huggingface/transformers/examples/pytorch/token-classification, but it is not documented what dataset format the script expects.

This script only supports CSV and JSON files, so I processed my own data into CSV files that look like this:

Both the input and output fields in my file are lists: one contains all the words in a sentence, the other contains the label for each word.

However, an error occurred:

Traceback (most recent call last):
  File "run_cls_token_level.py", line 793, in <module>
    main()
  File "run_cls_token_level.py", line 515, in main
    train_dataset = train_dataset.map(
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3380, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "run_cls_token_level.py", line 476, in tokenize_and_align_labels
    word_ids = tokenized_inputs.word_ids(batch_index=i)
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 366, in word_ids
    return self._encodings[batch_index].word_ids
IndexError: list index out of range

I don't understand this error. My guess is that it is caused by an invalid format in my data files.

Hi, were you able to figure out this error, or find out what the required format is?

Hi,

That’s indeed something that needs to be improved. The example script is shown working on the conll2003 dataset from the Hugging Face Hub, so you would need to create a dataset in the same format.

Refer to the "Load text data" page in the docs.
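
To make "the same format" a bit more concrete, here is a minimal sketch using only the Python standard library. It assumes the conll2003 column names `tokens` and `ner_tags`. The toy sentences and the string labels are made up for illustration (conll2003 itself stores integer class ids). The sketch shows one plausible cause of the `IndexError` above: a list written to CSV comes back as a single string, whereas JSON lines keeps it as a list of words.

```python
import csv
import io
import json

# Hypothetical toy data; the column names follow the conll2003 dataset.
records = [
    {"tokens": ["EU", "rejects", "German", "call"],
     "ner_tags": ["B-ORG", "O", "B-MISC", "O"]},
    {"tokens": ["Peter", "Blackburn"],
     "ner_tags": ["B-PER", "I-PER"]},
]


def csv_round_trip(rows):
    """Write rows to an in-memory CSV and read them back."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["tokens", "ner_tags"])
    writer.writeheader()
    writer.writerows(rows)  # each list is stored as its str() form
    buf.seek(0)
    return list(csv.DictReader(buf))


def jsonl_round_trip(rows):
    """Serialize rows as JSON lines (one object per line) and parse them back."""
    text = "\n".join(json.dumps(row) for row in rows)
    return [json.loads(line) for line in text.splitlines()]


csv_rows = csv_round_trip(records)
jsonl_rows = jsonl_round_trip(records)

# After CSV, "tokens" is one string, not a list of words.
print(type(csv_rows[0]["tokens"]).__name__)
# After JSON lines, "tokens" is still a list of words.
print(type(jsonl_rows[0]["tokens"]).__name__)
```

If that is what is happening, writing one JSON object per line to files with a `.json` extension and passing them via `--train_file` and `--validation_file` should give the script a list of words per example. I have not tested this against every version of the script, so treat it as a starting point.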