I am using the example script run_ner.py
from huggingface/transformers/examples/pytorch/token-classification, but there is no clear documentation of what dataset format it expects.
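For reference, I launch my local copy of the script (run_cls_token_level.py in the traceback below) roughly like this; the model name and file paths here are placeholders rather than my exact setup:

python run_cls_token_level.py \
    --model_name_or_path bert-base-cased \
    --train_file train.csv \
    --validation_file dev.csv \
    --output_dir ./output \
    --do_train \
    --do_eval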
The script only supports CSV and JSON files, so I converted my own data into CSV files. In each row, both the input field and the output field are lists: the input holds all the words of one sentence, and the output holds the corresponding word-level labels.
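To make the format concrete, here is a minimal sketch of how I build such a file (the column names and the example sentence are hypothetical, not my real data):

import csv

rows = [
    {
        "tokens": ["John", "lives", "in", "Berlin"],
        "ner_tags": ["B-PER", "O", "O", "B-LOC"],
    },
]

with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["tokens", "ner_tags"])
    writer.writeheader()
    for row in rows:
        # DictWriter stringifies list values, so each cell is stored as the
        # literal text "['John', 'lives', 'in', 'Berlin']" in the file.
        writer.writerow(row)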
However, an error occurred:
Traceback (most recent call last):
  File "run_cls_token_level.py", line 793, in <module>
    main()
  File "run_cls_token_level.py", line 515, in main
    train_dataset = train_dataset.map(
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3380, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "run_cls_token_level.py", line 476, in tokenize_and_align_labels
    word_ids = tokenized_inputs.word_ids(batch_index=i)
  File "/scratch/rml6079/anaconda3/envs/t2t/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 366, in word_ids
    return self._encodings[batch_index].word_ids
IndexError: list index out of range
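The failing call sits in tokenize_and_align_labels, which in my copy is essentially unchanged from run_ner.py. A simplified sketch of that logic, assuming the stock example script (tokenizer, text_column_name, label_column_name, and label_to_id are bound earlier in the script):

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; word_ids() below requires a fast
    # (Rust-backed) tokenizer.
    tokenized_inputs = tokenizer(
        examples[text_column_name],
        truncation=True,
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples[label_column_name]):
        # Line 476 in the traceback: map each subword to its word index.
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # ignore special tokens in the loss
            elif word_idx != previous_word_idx:
                label_ids.append(label_to_id[label[word_idx]])
            else:
                label_ids.append(-100)  # only label the first subword
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs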
I have no idea how to fix this error; my guess is that it is caused by an invalid format in my data files, but I am not sure what format the script actually expects.