TypeError in load_dataset related to a sequence of strings

I’m using transformers==4.22.1 and datasets==2.5.1 with Python 3.9.13, and I have come across an error while loading a dataset from three TSV files (train, validation and test). list_of_codes is a list of all the possible categories that a token may have in a token classification problem.

features = datasets.Features(
    {
        "index": datasets.Value("int32"),
        "texts": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "labels": datasets.Sequence(datasets.ClassLabel(
            num_classes=len(list_of_codes),
            names=list_of_codes)
        ),
        "patient_id": datasets.Value("int32"),
        "date": datasets.Value("string"),
        "type": datasets.Value("string"),
    }
)

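For completeness, data_files is just a mapping from split names to the TSV paths (the file names below are placeholders, not my real ones):

data_files = {
    "train": "train.tsv",
    "validation": "validation.tsv",
    "test": "test.tsv",
}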
I cannot post a realistic example of my corpus for data privacy reasons, but I can make up an example of a record in the desired dataset. It could look something like this:

			"index": 567,
			"texts": "The patient suffers from severe cephalalgia. She
also complaints about left arm",
			"tokens": ["The", "patient", "suffers", "from" "severe", "cephalalgia", ".", "She", "also", "complaints", "about", "left", "arm"]
			"labels": ["O", "O", "O", "O" "C56", "C56", "O", "O", "O", "O", "O", "D30.1", "D30.1"]
			"patient_id": 145,
			"date": "20181111",
			"type": "examination"

Executing the following line with the features defined above raises this error:

dataset = datasets.load_dataset("csv", sep="\t", data_files=data_files, features=features)
  File "<stdin>", line 1, in <module>
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/load.py", line 1698, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 807, in download_and_prepare
    self._download_and_prepare(
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 898, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/builder.py", line 1495, in _prepare_split
    for key, table in logging.tqdm(
  File "/home/users/user/.local/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/packaged_modules/csv/csv.py", line 182, in _generate_tables
    yield (file_idx, batch_idx), self._cast_table(pa_table)
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/packaged_modules/csv/csv.py", line 160, in _cast_table
    pa_table = table_cast(pa_table, schema)
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2044, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2006, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 2006, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1716, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1716, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/users/user/.local/lib/python3.9/site-packages/datasets/table.py", line 1889, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
string
to
Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)

Since tokens is the only field defined as a sequence of strings, I assume it is the field responsible for the error. It was obtained by tokenizing texts with the tokenizer's call method and writing the result out to the TSV:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
batch_encoding = tokenizer(list_of_corpus_texts, truncation=True, add_special_tokens=False)
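A simplified sketch of the writing step (the real code handles all the columns, but the key detail is that each list ends up serialized as a Python-style list literal in the TSV):

import pandas as pd

# tokens(i) returns the token strings of example i (fast tokenizers only)
rows = [
    {"texts": text, "tokens": str(batch_encoding.tokens(i))}
    for i, text in enumerate(list_of_corpus_texts)
]
# Writing the DataFrame stores each list as its string representation
pd.DataFrame(rows).to_csv("train.tsv", sep="\t", index=False)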

I have not found any similar issues in the forum. Could anybody provide some help?

Hi! Our CSV/TSV builder uses pd.read_csv internally, which means list/array columns are not handled automatically. So, to load your files successfully, replace the Sequence(...) columns with Value("string"), load the files and then cast them to the desired type:

dataset = datasets.load_dataset("csv", sep="\t", data_files=data_files, features=features_with_strings)
def string_to_list(ex):
    ex["tokens"] = ex["tokens"].split(delim) # specify your array elem delim
    ex["labels"] = ex["labels"].split(delim) # specify your array elem delim
    return ex
dataset = dataset.map(string_to_list, features=features_with_sequence)
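Here, features_with_strings and features_with_sequence would look something like this (a sketch based on your original features):

features_with_strings = datasets.Features(
    {
        "index": datasets.Value("int32"),
        "texts": datasets.Value("string"),
        "tokens": datasets.Value("string"),   # serialized list, parsed later
        "labels": datasets.Value("string"),   # serialized list, parsed later
        "patient_id": datasets.Value("int32"),
        "date": datasets.Value("string"),
        "type": datasets.Value("string"),
    }
)

features_with_sequence = datasets.Features(
    {
        **features_with_strings,
        "tokens": datasets.Sequence(datasets.Value("string")),
        "labels": datasets.Sequence(datasets.ClassLabel(names=list_of_codes)),
    }
)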

pd.read_csv supports passing converters to address this “limitation”. I think we can add something similar to our API.
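For reference, this is what converters look like when reading one of the files with pandas directly (using a placeholder file name; load_dataset does not forward this argument at the moment):

import pandas as pd
from ast import literal_eval

# Each converter is applied to the raw string of its column while the file is read
df = pd.read_csv("train.tsv", sep="\t", converters={"tokens": literal_eval, "labels": literal_eval})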

Hi! I had actually guessed what was going on and implemented my own solution with literal_eval from the ast module, which also leverages the map method:

from ast import literal_eval

def adjust_datasets(batch):
    batch["tokens"] = [literal_eval(expression) for expression in batch["tokens"]]
    batch["labels"] = [literal_eval(expression) for expression in batch["labels"]]
    return batch

dataset = dataset.map(adjust_datasets, batched=True)

It is also valid, isn’t it? Which one would you recommend in terms of efficiency?

Your approach supports the batched mode, so it should be faster than mine (fewer writes to disk).
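If you also want labels to end up encoded as ClassLabel ids rather than plain strings, you can pass the target features to map, as in my snippet above (a sketch, assuming the Features object from your first post):

dataset = dataset.map(adjust_datasets, batched=True, features=features)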

PS: I’ve opened a PR that adds support for converters here
