I don’t know how to solve this problem
I think you are currently trying to tokenize a whole list at once (example['translation']['en']). Instead, you should iterate over the examples in your dataset and tokenize each element individually. You could adapt this code for your application:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs_audio = processor(
        audio_arrays,
        sampling_rate=16000,
        padding=True,
        max_length=100000000,
        truncation=True,
    )
    # print(inputs_audio)
    return inputs_audio
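For the translation case specifically, here is a minimal sketch of the same per-element pattern. Note the tokenizer and dataset below are stand-ins (a toy `fake_tokenize` and a plain list of rows that mimics what `datasets.Dataset.map(..., batched=True)` passes in), not the actual Hugging Face objects:

```python
def fake_tokenize(texts):
    # Stand-in for a real tokenizer call: one token-id list per input
    # string (here, just the length of each word, for determinism).
    return [[len(word) for word in text.split()] for text in texts]

def preprocess_function(examples):
    # In batched mode, examples["translation"] is a LIST of dicts,
    # so pull out each language's strings before tokenizing.
    en_texts = [pair["en"] for pair in examples["translation"]]
    zh_texts = [pair["zh"] for pair in examples["translation"]]
    return {
        "en_input_ids": fake_tokenize(en_texts),
        "zh_input_ids": fake_tokenize(zh_texts),
    }

# Toy rows shaped like a translation dataset.
rows = [
    {"translation": {"en": "hello world", "zh": "你好 世界"}},
    {"translation": {"en": "good morning", "zh": "早上 好"}},
]
# Collate the rows into one batched dict, as map(batched=True) would.
batch = {"translation": [row["translation"] for row in rows]}
out = preprocess_function(batch)
print(len(out["en_input_ids"]))  # one token list per row
```

With a real dataset you would instead call `dataset.map(preprocess_function, batched=True)` and use your actual tokenizer in place of `fake_tokenize`.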
I am working on an NLP task and would like to use dataset.map to tokenize both the English and Chinese text in every row of the dataset. I am also interested in learning how to use the map function in general.
I'm a beginner, and my English is not very good. I just want to achieve the effect shown in the diagram using the map function.
You are correct; now I understand what you mean. Thank you very much.