Hi! I'm currently using the datasets library with the Trainer API to fine-tune a pre-trained model. I have a large dataset, and I'd like to know whether it's possible to execute `dataset.map()` at runtime instead of transforming all the data at once.
```python
import librosa

def prepare_dataset(batch):
    # load the raw audio at 16 kHz
    audio = batch["audio"]
    wav, sr = librosa.load(audio, sr=16000)
    # extract model input features and tokenize the transcription
    batch["input_values"] = feature_extractor(wav, sampling_rate=sr).input_values[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = dataset.map(prepare_dataset, remove_columns=["audio", "sentence"])
```
If I run this, all the data is transformed at once, which causes memory issues. Is it possible to run the transformation at runtime, like a pipeline: take only a few rows, map them, train on them, and so on?
Any insight is helpful. Thanks!
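To make it concrete, here is a minimal, self-contained sketch of the lazy behavior I'm after: rows are transformed only when accessed, never all up front. The `LazyMapDataset` class is purely hypothetical for illustration (not part of the datasets library), and the toy `prepare` function stands in for my audio preprocessing above.

```python
class LazyMapDataset:
    """Hypothetical sketch of on-the-fly mapping: each row is
    transformed at access time, so no transformed copy of the
    full dataset ever sits in memory."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        # the transform runs here, per access, not ahead of time
        return self.transform(dict(self.rows[idx]))


def prepare(batch):
    # stand-in for the real feature extraction / tokenization
    batch["doubled"] = batch["value"] * 2
    return batch


data = [{"value": i} for i in range(5)]
lazy = LazyMapDataset(data, prepare)
print(lazy[3]["doubled"])  # → 6
```

If I understand the docs correctly, `dataset.with_transform()` (or `set_transform()`) in the datasets library provides exactly this kind of access-time transform, so that may be the mechanism I'm looking for.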