Hi! I am currently using the datasets library with the Trainer API to fine-tune a pre-trained model. I have a large dataset, and I want to know whether it is possible to execute the
dataset.map() function during runtime instead of transforming all the data at once.
```python
import librosa

def prepare_dataset(batch):
    audio = batch["audio"]
    wav, sr = librosa.load(audio, sr=16000)
    batch["input_values"] = feature_extractor(wav, sampling_rate=sr).input_values
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = dataset.map(prepare_dataset, remove_columns=["audio", "sentence"])
```
If I run this, all the data is transformed at once, and it causes memory issues. Is it possible to run the mapping during training, like a pipeline: take only a few rows, map them, train on them, and so on?
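To make the behavior I'm after concrete, here is a minimal, library-free sketch of lazy per-batch mapping (the names `lazy_map` and the toy rows are hypothetical, just for illustration; they are not part of the datasets API):

```python
def lazy_map(rows, fn, batch_size=2):
    """Yield transformed rows one batch at a time instead of
    materializing the whole mapped dataset up front."""
    batch = []
    for row in rows:
        batch.append(fn(row))
        if len(batch) == batch_size:
            # hand this batch to the consumer (e.g. a training loop),
            # then forget it so memory stays bounded
            yield from batch
            batch = []
    yield from batch  # flush any leftover partial batch

# toy "dataset": rows are dicts, fn mimics prepare_dataset
rows = [{"sentence": s} for s in ["a", "b", "c"]]
mapped = lazy_map(rows, lambda r: {**r, "labels": r["sentence"].upper()})

# nothing is transformed until the training loop pulls from `mapped`
first = next(mapped)  # → {"sentence": "a", "labels": "A"}
```

That is, the transform runs only when the next batch is requested, rather than all up front.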
Any insights are helpful. Thanks!