Hi @omar47. I’m not sure we have the same original issue.
I can see two separate issues that may cause this:
- Passing the class `Wav2Vec2FeatureExtractor` itself to `DataCollatorForWav2Vec2Pretraining` instead of an instance. Solution: instantiate the feature extractor before passing it to the data collator:
```python
from transformers import Wav2Vec2Config, Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining

config = Wav2Vec2Config.from_pretrained(args.model_path)  # the constructor expects a config, not a path
model = Wav2Vec2ForPreTraining(config)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(args.model_path)
data_collator = DataCollatorForWav2Vec2Pretraining(  # class defined in the pretraining script
    model=model,
    feature_extractor=feature_extractor,  # an instance, not the class
)
```
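If it helps to sanity-check, below is a rough sketch of how the collator instance then gets handed to the dataloader (the batch size is a placeholder, and `vectorized_datasets` is the mapped dataset from the second point below):

```python
from torch.utils.data import DataLoader

# The collator *instance* is called on each list of examples; its
# feature_extractor must also be an instance so that padding works.
train_dataloader = DataLoader(
    vectorized_datasets["train"],
    collate_fn=data_collator,
    batch_size=8,  # placeholder value
    shuffle=True,
)
batch = next(iter(train_dataloader))  # should yield a padded "input_values" tensor
```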
- The dataset is still a dictionary:
This problem reappeared for me because I did not remove the unused columns in the preprocessing step (using `prepare_dataset`). Because of this, the data is still in dictionary form, which I think is not what the padding function in the data collator expects. Make sure you keep the line `remove_columns=raw_datasets["train"].column_names` when mapping the `prepare_dataset` function onto your dataset:
```python
vectorized_datasets = raw_datasets.map(
    prepare_dataset,
    num_proc=args.preprocessing_num_workers,
    remove_columns=raw_datasets["train"].column_names,
    cache_file_names=cache_file_names,
)
```
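For completeness, here is a minimal sketch of what my `prepare_dataset` roughly looks like, assuming the raw dataset has an `audio` column (adapt the column name to your data). The key point is that it only adds `input_values`, so the original columns stick around unless `remove_columns` drops them:

```python
def prepare_dataset(batch):
    # Minimal sketch: turn raw audio into model inputs.
    sample = batch["audio"]  # assumed column holding {"array", "sampling_rate"}
    inputs = feature_extractor(
        sample["array"],
        sampling_rate=sample["sampling_rate"],
    )
    batch["input_values"] = inputs.input_values[0]
    return batch
```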
These are the two fixes I could identify that alleviated the issue in my case. Good luck!