TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'list'>

I am trying to convert my data to the Dataset format so I can use it with a BERT tokenizer, but I get this error:

raise TypeError(
TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'list'>. Make sure provided `function` returns a variable of type `dict` to update the dataset or `None` if you are only interested in side effects.

How can I solve it?

datasets library version: 1.2.1

flaubert_tokenizer = FlaubertTokenizer.from_pretrained('flaubert/flaubert_small_cased')
model = FlaubertModel.from_pretrained('flaubert/flaubert_small_cased')


train_file = './corpusi1.xlsx'
	
Xtrain, ytrain, filename, len_labels = read_file_2(train_file)
# Xtrain, lge_size = get_flaubert_layer(Xtrain, path_to_model_lge)

data_preprocessed = make_new_traindata(Xtrain)
	
my_dict = {"verbatim": data_preprocessed[1], "label": ytrain} 
dataset = Dataset.from_dict(my_dict)

print(type(dataset))
#<class 'datasets.arrow_dataset.Dataset'>

#print(dataset)

Dataset({
    features: ['verbatim', 'label'],
    num_rows: 346
})
tokenized_dataset = dataset.map(lambda x: flaubert_tokenizer.encode(x['verbatim'], padding='max_length', truncation=True, max_length=512), batched=True)  # this line raises the error

How can I convert a datasets.arrow_dataset.Dataset to a dataset dict?


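Hi!
The function passed to a batched map should return a dict with the structure {column_name → [list of values, one per example in the batch]}. Your lambda returns the bare list produced by encode, which is why map raises the TypeError.

Can you try this instead?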

tokenized_dataset = dataset.map(
    # Wrap the tokenizer output in a dict keyed by the new column name
    lambda x: {"input_ids": flaubert_tokenizer.encode(
        x['verbatim'], padding='max_length', truncation=True, max_length=512
    )},
    batched=True
)
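
To make the expected return shape concrete, here is a minimal sketch of a batched map function. It does not use the real FlauBERT tokenizer: toy_tokenizer is a hypothetical stand-in (it encodes each word as its length) so the snippet runs without downloading a model, and the max_length of 8 is an arbitrary choice for illustration.

```python
from typing import Dict, List


def toy_tokenizer(texts: List[str], max_length: int = 8) -> Dict[str, List[List[int]]]:
    """Hypothetical stand-in for a tokenizer called on a batch of texts.

    Each word is "encoded" as its character count, then truncated and
    padded with zeros to max_length.
    """
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [len(word) for word in text.split()][:max_length]  # truncate
        pad = max_length - len(ids)
        input_ids.append(ids + [0] * pad)                 # pad with zeros
        attention_mask.append([1] * len(ids) + [0] * pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def tokenize_batch(batch: Dict[str, list]) -> Dict[str, List[List[int]]]:
    """Batched map function: receives a dict of columns (each a list) and
    returns a dict mapping new column names to one value per example."""
    return toy_tokenizer(batch["verbatim"])


# A batch as a batched map would pass it in: column name -> list of values.
batch = {"verbatim": ["bonjour le monde", "merci"], "label": [0, 1]}
out = tokenize_batch(batch)
print(out["input_ids"])
```

With a real tokenizer, a function of this shape would be passed as dataset.map(tokenize_batch, batched=True). Note that encode is designed for a single sequence, so for a batch the usual pattern is to call the tokenizer object directly on the list of texts, which returns a dict-like result with one list per example.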

What would be the best/fastest way to process a dataset if I want to have the outputs of two tokenizers as below?

{"input_a": tokenizer_a(examples), "input_b": tokenizer_b(examples)}
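
One possible approach, sketched under assumptions: run both tokenizers inside a single batched map function and flatten their outputs into prefixed columns, so the dataset is traversed only once. Whether this is the fastest option has not been measured here. The tokenizers below are hypothetical toy stand-ins (make_toy_tokenizer is not a library function), and the column naming scheme is just an illustration.

```python
from typing import Callable, Dict, List

Encoding = Dict[str, List[List[int]]]


def make_toy_tokenizer(offset: int) -> Callable[[List[str]], Encoding]:
    """Hypothetical stand-in for a real tokenizer: encodes each word as
    its length plus an offset, so the two tokenizers give different ids."""
    def tokenize(texts: List[str]) -> Encoding:
        ids = [[len(word) + offset for word in text.split()] for text in texts]
        return {
            "input_ids": ids,
            "attention_mask": [[1] * len(row) for row in ids],
        }
    return tokenize


tokenizer_a = make_toy_tokenizer(0)
tokenizer_b = make_toy_tokenizer(100)


def tokenize_with_both(batch: Dict[str, list]) -> Encoding:
    """Batched map function returning a flat dict of columns, with each
    tokenizer's outputs suffixed by "_a" or "_b"."""
    columns: Encoding = {}
    for suffix, tokenizer in (("a", tokenizer_a), ("b", tokenizer_b)):
        for key, values in tokenizer(batch["verbatim"]).items():
            columns[f"{key}_{suffix}"] = values
    return columns


batch = {"verbatim": ["bonjour le monde", "merci"]}
columns = tokenize_with_both(batch)
print(sorted(columns))
```

A function like this could be passed as dataset.map(tokenize_with_both, batched=True). Keeping the result flat (one list per column) matches the {column_name → [list of values]} structure described earlier in the thread.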