TypeError in load_dataset related to a sequence of strings

I’m using transformers==4.22.1 and datasets==2.5.1 with Python 3.9.13, and I have come across this error while loading a dataset from three TSV files (train, validation and test). list_of_codes is a list of all the possible categories a token can have in a token classification problem.

features = datasets.Features(
    {
        "index": datasets.Value("int32"),
        "texts": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "labels": datasets.Sequence(
            datasets.ClassLabel(num_classes=len(list_of_codes), names=list_of_codes)
        ),
        "patient_id": datasets.Value("int32"),
        "date": datasets.Value("string"),
        "type": datasets.Value("string"),
    }
)
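
For reference, data_files (used in the load_dataset call below) is just a mapping from split names to the three TSV files, along these lines (the file names here are placeholders, not my real paths):

data_files = {
    "train": "train.tsv",
    "validation": "validation.tsv",
    "test": "test.tsv",
}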

I cannot post a realistic example of my corpus for data privacy reasons, but I can make up an example of a record in the desired dataset. It could look something like this:

			"index": 567,
			"texts": "The patient suffers from severe cephalalgia. She
also complaints about left arm",
			"tokens": ["The", "patient", "suffers", "from" "severe", "cephalalgia", ".", "She", "also", "complaints", "about", "left", "arm"]
			"labels": ["O", "O", "O", "O" "C56", "C56", "O", "O", "O", "O", "O", "D30.1", "D30.1"]
			"patient_id": 145,
			"date": "20181111",
			"type": "examination"

Executing the following line with the features defined above produces this error:

dataset = datasets.load_dataset("csv", sep="\t", data_files=data_files, features=features)
  File "<stdin>", line 1, in <module>
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/load.py", line 1698, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 807, in download_and_prepare
    self._download_and_prepare(
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 898, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 1495, in _prepare_split
    for key, table in logging.tqdm(
  File "/home/users/user/.local/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/packaged_modules/csv/csv.py", line 182, in _generate_tables
    yield (file_idx, batch_idx), self._cast_table(pa_table)
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/packaged_modules/csv/csv.py", line 160, in _cast_table
    pa_table = table_cast(pa_table, schema)
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2044, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2006, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2006, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1716, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1716, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1889, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
string
to
Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)

Since it is the only field containing a sequence of strings, I assume the field responsible for the error is tokens, which I obtained by tokenizing texts with the tokenizer's __call__ method and then writing the result out to the TSV:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
batch_encoding = tokenizer(list_of_corpus_texts, truncation=True, add_special_tokens=False)
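
For context, a minimal sketch of how such a TSV row could end up containing a list serialized as a single string (the csv-writing step and the use of batch_encoding.tokens() are my assumptions, not necessarily my exact code):

import csv
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
list_of_corpus_texts = ["The patient suffers from severe cephalalgia."]  # toy example
batch_encoding = tokenizer(list_of_corpus_texts, truncation=True, add_special_tokens=False)

with open("train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["index", "texts", "tokens"])
    for i, text in enumerate(list_of_corpus_texts):
        # str() writes the Python repr of the token list, e.g. "['The', 'patient', ...]",
        # so the column is later read back as one plain string rather than a sequence
        writer.writerow([i, text, str(batch_encoding.tokens(i))])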

I have not found any similar issues in the forum. Could anybody provide some help?

Hi! Our CSV/TSV builder uses pd.read_csv internally, which means list/array columns are not handled automatically. So, to load your files successfully, replace the Sequence(...) columns with Value("string"), load the files and then cast them to the desired type:

dataset = datasets.load_dataset("csv", sep="\t", data_files=data_files, features=features_with_strings)

def string_to_list(ex):
    ex["tokens"] = ex["tokens"].split(delim)  # specify your array element delimiter
    ex["labels"] = ex["labels"].split(delim)  # specify your array element delimiter
    return ex

dataset = dataset.map(string_to_list, features=features_with_sequence)
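
The two schemas referenced above would be built roughly like this (a sketch based on your original features definition):

# Same schema as `features`, but with the list columns loaded as plain strings first
features_with_strings = datasets.Features(
    {**features, "tokens": datasets.Value("string"), "labels": datasets.Value("string")}
)
# Target schema with the Sequence columns, i.e. the original definition
features_with_sequence = features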

pd.read_csv supports passing converters to address this “limitation”. I think we can add something similar to our API.
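
For illustration, a rough sketch of what that converters route could look like today by going through pandas directly (the file name is a placeholder, and this assumes the list columns were written as Python literals):

from ast import literal_eval

import datasets
import pandas as pd

df = pd.read_csv(
    "train.tsv",
    sep="\t",
    converters={"tokens": literal_eval, "labels": literal_eval},  # parse the serialized list columns
)
train_dataset = datasets.Dataset.from_pandas(df)
# The Sequence/ClassLabel schema can then be applied, e.g. with the map(..., features=...) approach above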


Hi! I had actually guessed what was going on and implemented my own solution with literal_eval from the ast (Abstract Syntax Trees) module, which also leverages the map method:

from ast import literal_eval

def adjust_datasets(batch):
    # With batched=True, each column value is a list of serialized strings, one per example
    batch["tokens"] = [literal_eval(expression) for expression in batch["tokens"]]
    batch["labels"] = [literal_eval(expression) for expression in batch["labels"]]
    return batch

dataset = dataset.map(adjust_datasets, batched=True)

It is also valid, isn't it? Which one would you recommend in terms of efficiency?