I’m using `transformers==4.22.1` and `datasets==2.5.1` with Python 3.9.13, and I have come across an error while loading a dataset from three TSV files (train, validation and test sets). `list_of_codes` is a list of all the possible categories that a token may take in a token classification problem.
```python
import datasets

features = datasets.Features(
    {
        "index": datasets.Value("int32"),
        "texts": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "labels": datasets.Sequence(
            datasets.ClassLabel(num_classes=len(list_of_codes), names=list_of_codes)
        ),
        "patient_id": datasets.Value("int32"),
        "date": datasets.Value("string"),
        "type": datasets.Value("string"),
    }
)
```
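For context, the `ClassLabel` feature maps each label string to an integer id. A minimal sketch with a stand-in label list (my real `list_of_codes` is much longer):

```python
import datasets

list_of_codes = ["O", "C56", "D30.1"]  # stand-in label list
label = datasets.ClassLabel(num_classes=len(list_of_codes), names=list_of_codes)

print(label.str2int("C56"))  # 1
print(label.int2str(2))      # 'D30.1'
```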
I cannot post a realistic example of my corpus for data privacy reasons, but I can make up an example of a record in the desired dataset. It could be something like this:
"index": 567,
"texts": "The patient suffers from severe cephalalgia. She
also complaints about left arm",
"tokens": ["The", "patient", "suffers", "from" "severe", "cephalalgia", ".", "She", "also", "complaints", "about", "left", "arm"]
"labels": ["O", "O", "O", "O" "C56", "C56", "O", "O", "O", "O", "O", "D30.1", "D30.1"]
"patient_id": 145,
"date": "20181111",
"type": "examination"
Executing the following line with the features defined above raises the error below:

```python
dataset = datasets.load_dataset("csv", sep="\t", data_files=data_files, features=features)
```
File "<stdin>", line 1, in <module>
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/load.py", line 1698, in load_dataset
builder_instance.download_and_prepare(
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 807, in download_and_prepare
self._download_and_prepare(
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 898, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 1495, in _prepare_split
for key, table in logging.tqdm(
File "/home/users/user/.local/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/packaged_modules/csv/csv.py", line 182, in _generate_tables
yield (file_idx, batch_idx), self._cast_table(pa_table)
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/packaged_modules/csv/csv.py", line 160, in _cast_table
pa_table = table_cast(pa_table, schema)
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2044, in table_cast
return cast_table_to_schema(table, schema)
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2006, in cast_table_to_schema
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2006, in <listcomp>
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1716, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1716, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1889, in cast_array_to_feature
raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
string
to
Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
```
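If I am reading this right, a TSV cell can only hold text, so the `tokens` column reaches `datasets` as one string per row, and casting a string column to a sequence is what fails. A toy example (my attempt to reproduce, not a confirmed diagnosis) seems to raise the same `TypeError`:

```python
import datasets

# A TSV cell stores the str() of the list, not an actual list
ds = datasets.Dataset.from_dict({"tokens": ["['The', 'patient', 'suffers']"]})

# Raises: TypeError: Couldn't cast array of type string to
# Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ds = ds.cast_column("tokens", datasets.Sequence(datasets.Value("string")))
```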
Since `tokens` is the only field declared as a plain sequence of strings, I assume it is the one triggering the error. It was obtained by tokenizing `texts` with the tokenizer's `__call__` method, writing the result out to the TSV:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
batch_encoding = tokenizer(list_of_corpus_texts, truncation=True, add_special_tokens=False)
```
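The token lists are then written out to the TSV, roughly like this (a simplified sketch using pandas for illustration; the exact writer should not matter, since any CSV/TSV writer ends up storing a list-valued cell as plain text):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "texts": list_of_corpus_texts,
        # BatchEncoding.tokens(i) returns the token strings for row i
        "tokens": [batch_encoding.tokens(i) for i in range(len(list_of_corpus_texts))],
    }
)
# A list cell is serialized as its str() representation,
# e.g. "['The', 'patient', ...]"
df.to_csv("train.tsv", sep="\t", index=False)
```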
I have not found any similar issue on the forum. Could anybody provide some help?
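In the meantime, the workaround I am considering is to load every column as plain text and parse the serialized lists afterwards (a sketch, with `data_files` and `features` as defined above):

```python
import ast

import datasets

# Load without features so every column keeps its inferred/plain type
raw = datasets.load_dataset("csv", sep="\t", data_files=data_files)

label_feature = features["labels"].feature  # the ClassLabel defined above

def parse_lists(example):
    # The TSV stores the lists as text, e.g. "['The', 'patient', ...]"
    example["tokens"] = ast.literal_eval(example["tokens"])
    example["labels"] = label_feature.str2int(ast.literal_eval(example["labels"]))
    return example

dataset = raw.map(parse_lists)
dataset = dataset.cast(features)  # columns are real lists now, so the cast works
```

Is this a reasonable approach, or is there a way to have `load_dataset` parse the serialized lists directly?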