Thanks for the question and answer.
For those having this issue, you can try the following function to remove samples longer than 6 seconds (by default) from common_voice_test and common_voice_train.
Since I had already processed and saved the datasets, I remove the long samples just before training (it is pretty fast on an i7 with 16 GB RAM):
from datasets import Dataset

def remove_long_common_voicedata(dataset, max_seconds=6):
    # convert the pyarrow table to pandas
    dftest = dataset.to_pandas()

    # find out the length of input_values
    dftest['len'] = dftest['input_values'].apply(len)

    # for wav2vec training we already resampled to 16 kHz,
    # so remove samples longer than max_seconds (6 seconds is ideal)
    max_length = max_seconds * 16000
    dftest = dftest[dftest['len'] < max_length]
    dftest = dftest.drop(columns=['len'])

    # convert back to a pyarrow table to use in the trainer
    dataset = Dataset.from_pandas(dftest)

    # directly remove, do not wait for gc
    del dftest
    return dataset
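
For example, assuming your datasets have already been mapped to input_values (as above), you can apply it like this:

common_voice_train = remove_long_common_voicedata(common_voice_train, max_seconds=6)
common_voice_test = remove_long_common_voicedata(common_voice_test, max_seconds=6)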
Also, if a training run failed and you change something and restart training, CUDA may give an out-of-memory error. So before defining the model and trainer, you can make sure you have freed up memory:
import gc
import torch

# do the below before defining the model and trainer if you change batch size etc.
# del trainer
# del model
gc.collect()
torch.cuda.empty_cache()
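
If you want to confirm that memory was actually released, you can print the allocator stats before and after (a quick sketch, assuming a CUDA device is available):

# rough check of GPU memory currently held by PyTorch
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")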
I also needed to set group_by_length=False in TrainingArguments, as grouping by length hogged memory initially, and I reduced the batch size to 4 (RTX 2070, 8 GB).
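
For reference, here is a minimal sketch of how those settings could look in TrainingArguments; everything other than group_by_length and the batch size is a placeholder, not the exact configuration I used:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-finetuned",   # hypothetical output path
    group_by_length=False,               # grouping by length hogged memory initially
    per_device_train_batch_size=4,       # reduced to 4 to fit on 8 GB (RTX 2070)
    gradient_accumulation_steps=2,       # assumption: compensate for the smaller batch
    fp16=True,                           # assumption: mixed precision to save memory
    num_train_epochs=30,                 # placeholder
)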