I am trying to convert my data to the `datasets` format to use it with a BERT tokenizer, but I get this error:
raise TypeError(
TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'list'>. Make sure provided `function` returns a variable of type `dict` to update the dataset or `None` if you are only interested in side effects.
How can I solve it?

datasets library version: 1.2.1
flaubert_tokenizer = FlaubertTokenizer.from_pretrained('flaubert/flaubert_small_cased')
model = FlaubertModel.from_pretrained('flaubert/flaubert_small_cased')
train_file = './corpusi1.xlsx'
Xtrain, ytrain, filename, len_labels = read_file_2(train_file)
# Xtrain, lge_size = get_flaubert_layer(Xtrain, path_to_model_lge)
data_preprocessed = make_new_traindata(Xtrain)
my_dict = {"verbatim": data_preprocessed[1], "label": ytrain}
dataset = Dataset.from_dict(my_dict)
print(type(dataset))
# <class 'datasets.arrow_dataset.Dataset'>
print(dataset)
# Dataset({
#     features: ['verbatim', 'label'],
#     num_rows: 346
# })
tokenized_dataset = dataset.map(
    lambda x: flaubert_tokenizer.encode(x['verbatim'], padding='max_length', truncation=True, max_length=512),
    batched=True,
)  # this line raises the error
How can I convert the Arrow dataset to a dataset of dicts so that `map` works?
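For context on what I have tried: as the error message says, `Dataset.map` requires the mapped function to return a dict (column name → values) or `None`, while `tokenizer.encode` returns a plain list of token IDs, which triggers the `TypeError`. A minimal sketch of the idea with a stand-in tokenizer (`tokenize_batch` is hypothetical, used here only so the shape can be checked without the model files):

```python
# Stand-in for a tokenizer call: returns a dict of columns,
# which is the shape Dataset.map expects from the mapped function.
def tokenize_batch(batch, max_length=512):
    # toy "token IDs": the length of each word, truncated to max_length
    ids = [[len(word) for word in text.split()][:max_length]
           for text in batch["verbatim"]]
    return {"input_ids": ids}  # returning a dict lets map update the dataset

# returning `ids` directly (a list) would reproduce the TypeError above
print(tokenize_batch({"verbatim": ["bonjour le monde", "salut"]}))
```

With the real tokenizer, calling the tokenizer object directly instead of `.encode` should return such a dict (`input_ids`, `attention_mask`, ...), e.g. `dataset.map(lambda x: flaubert_tokenizer(x['verbatim'], padding='max_length', truncation=True, max_length=512), batched=True)` — note that `padding='max_length'` is a string, not a bare `max_length` variable.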