TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'list'>

emmakelo · September 24, 2021, 3:53pm

I am trying to convert my data to the Dataset format so I can use it with a BERT tokenizer, but I get this error:

raise TypeError(
TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'list'>. Make sure provided `function` returns a variable of type `dict` to update the dataset or `None` if you are only interested in side effects.

How can I solve it?

datasets library version: 1.2.1

from datasets import Dataset
from transformers import FlaubertModel, FlaubertTokenizer

flaubert_tokenizer = FlaubertTokenizer.from_pretrained('flaubert/flaubert_small_cased')
model = FlaubertModel.from_pretrained('flaubert/flaubert_small_cased')


train_file = './corpusi1.xlsx'

# read_file_2 and make_new_traindata are my own helper functions
Xtrain, ytrain, filename, len_labels = read_file_2(train_file)
# Xtrain, lge_size = get_flaubert_layer(Xtrain, path_to_model_lge)

data_preprocessed = make_new_traindata(Xtrain)
	
my_dict = {"verbatim": data_preprocessed[1], "label": ytrain} 
dataset = Dataset.from_dict(my_dict)

print(type(dataset))
# <class 'datasets.arrow_dataset.Dataset'>

print(dataset)
# Dataset({
#     features: ['verbatim', 'label'],
#     num_rows: 346
# })

# This call raises the TypeError: the lambda returns a list, not a dict
tokenized_dataset = dataset.map(
    lambda x: flaubert_tokenizer.encode(
        x['verbatim'], padding='max_length', truncation=True, max_length=512
    ),
    batched=True,
)

How can I convert the Arrow dataset to a dataset dict?

lhoestq · October 4, 2021, 9:24am

Hi!
The output of the function passed to a batched map should be a dict with the structure {column_name → [list of values per element]}, that is, one list per column with one value for each example in the batch.

Can you try this instead?

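tokenized_dataset = dataset.map(
    # With batched=True, x['verbatim'] is a list of strings,
    # so each text is encoded separately to get one list of ids per element
    lambda x: {"input_ids": [
        flaubert_tokenizer.encode(
            text, padding='max_length', truncation=True, max_length=512
        )
        for text in x['verbatim']
    ]},
    batched=True
)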
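
To make the expected return shape concrete, here is a minimal sketch that runs without any library. `toy_encode` is a hypothetical stand-in for a real tokenizer (it is not part of transformers), and `tokenize_batch` shows the kind of function a batched map expects: it receives a dict of columns and returns a dict of columns.

```python
from typing import Dict, List


def toy_encode(text: str, max_length: int = 8, pad_id: int = 0) -> List[int]:
    """Hypothetical tokenizer stand-in: one id per word (its length), padded."""
    ids = [len(word) for word in text.split()][:max_length]
    return ids + [pad_id] * (max_length - len(ids))


def tokenize_batch(batch: Dict[str, List[str]]) -> Dict[str, List[List[int]]]:
    # In batched mode each column arrives as a list, one entry per example.
    # Return a dict: column name -> list with one value per example.
    return {"input_ids": [toy_encode(text) for text in batch["verbatim"]]}


batch = {"verbatim": ["bonjour tout le monde", "merci"]}
out = tokenize_batch(batch)
print(out)
```

Returning the bare list `[toy_encode(t) for t in batch["verbatim"]]` instead of wrapping it in a dict is the situation the error message describes. With a real tokenizer, calling the tokenizer object itself on the list of texts (rather than `encode`) should also work, since it returns a dict-like object with columns such as `input_ids`.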

israfelsr · February 28, 2024, 7:32pm

What would be the best/fastest way to process a dataset if I want to have the outputs of two tokenizers as below?

{"input_a": tokenizer_a(examples), "input_b": tokenizer_b(examples)}
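
One possible approach, sketched under assumptions rather than benchmarked: run both tokenizers inside a single batched function and flatten their outputs into prefixed columns, so that every value stays a list with one entry per example. `toy_tokenizer_a`, `toy_tokenizer_b` and the `a_`/`b_` prefixes are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Columns = Dict[str, List[List[int]]]


def toy_tokenizer_a(texts: List[str]) -> Columns:
    """Hypothetical tokenizer A: one id per word (the word length)."""
    return {"input_ids": [[len(w) for w in t.split()] for t in texts]}


def toy_tokenizer_b(texts: List[str]) -> Columns:
    """Hypothetical tokenizer B: one id per character (its code point)."""
    return {"input_ids": [[ord(c) for c in t] for t in texts]}


def tokenize_both(batch: Dict[str, List[str]]) -> Columns:
    # Merge both outputs into one flat dict of columns, prefixing the
    # column names so the two tokenizers do not overwrite each other.
    tokenizers: Dict[str, Callable[[List[str]], Columns]] = {
        "a": toy_tokenizer_a,
        "b": toy_tokenizer_b,
    }
    merged: Columns = {}
    for prefix, tokenizer in tokenizers.items():
        for name, values in tokenizer(batch["verbatim"]).items():
            merged[f"{prefix}_{name}"] = values
    return merged


result = tokenize_both({"verbatim": ["le chat", "oui"]})
print(result)
```

A function shaped like `tokenize_both` could then be passed to `dataset.map(..., batched=True)`, so the dataset is traversed once instead of once per tokenizer.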
