About dataset map

xiaoshengzhuzhu · August 12, 2023, 3:00pm

I don’t know how to solve this problem

Luan77777 · August 12, 2023, 5:42pm

I think you are currently trying to tokenize a whole list at once (example["translation"]["en"]). Instead, iterate over the examples in your dataset and tokenize each element of the list individually. You could adapt this code for your application:

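def preprocess_function(examples):
    # In batched mode, examples["audio"] is a list with one audio dict per row.
    audio_arrays = [x["array"] for x in examples["audio"]]

    # `processor` must already be loaded (e.g. an audio feature extractor).
    inputs_audio = processor(
        audio_arrays,
        sampling_rate=16000,
        padding=True,
        max_length=100000000,
        truncation=True,
    )

    # print(inputs_audio)

    return inputs_audio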
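
For a translation dataset, a rough sketch of the same idea could look like the following. It assumes each row has a "translation" dict with "en" and "zh" keys (the "zh" key is a guess on my part), and it uses a toy character-level tokenizer, `toy_tokenize`, as a hypothetical stand-in so the example runs on its own. You would swap in your real tokenizer.

```python
def toy_tokenize(texts):
    """Hypothetical stand-in for a real tokenizer: one id per character."""
    return {"input_ids": [[ord(ch) for ch in text] for text in texts]}


def preprocess_function(examples):
    # With batched=True, examples["translation"] is a list of dicts,
    # one per row, e.g. {"en": "...", "zh": "..."} ("zh" is an assumed key).
    en_texts = [pair["en"] for pair in examples["translation"]]
    zh_texts = [pair["zh"] for pair in examples["translation"]]

    # Tokenize the English side as the model inputs.
    model_inputs = toy_tokenize(en_texts)
    # Store the tokenized Chinese side as the labels.
    model_inputs["labels"] = toy_tokenize(zh_texts)["input_ids"]
    return model_inputs


# A batch shaped the way dataset.map(..., batched=True) passes it in.
batch = {
    "translation": [
        {"en": "Hi", "zh": "你好"},
        {"en": "Thanks", "zh": "谢谢"},
    ]
}
result = preprocess_function(batch)

# With a real Datasets object you would instead call:
# tokenized = dataset.map(preprocess_function, batched=True)
```

The key point is that the function receives a batch of rows, so you pull the "en" and "zh" strings out of each row first and then tokenize the resulting lists of strings.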

xiaoshengzhuzhu · August 13, 2023, 3:30am

I am working on an NLP task and want to use dataset.map to tokenize both the English and Chinese text in every row of the dataset. I would also like to learn how to use the map function in general.

xiaoshengzhuzhu · August 13, 2023, 4:21am

I’m a beginner, and my English is not very good. I just want to achieve the effect shown in the diagram using the map function.

xiaoshengzhuzhu · August 15, 2023, 1:26am

You are correct, now I understand what you mean. Thank you very much.

Luan77777 · August 20, 2023, 6:36pm

You’re welcome!
