I’m looking at the wav2vec2 pretraining example given in the transformers repository. This is a preprocessing step:
```python
with accelerator.main_process_first():
    vectorized_datasets = raw_datasets.map(
        prepare_dataset,
        num_proc=args.preprocessing_num_workers,
        remove_columns=raw_datasets["train"].column_names,
        cache_file_names=cache_file_names,
    )
```
The prepare_dataset function is this:

```python
def prepare_dataset(batch):
    sample = batch[args.audio_column_name]
    inputs = feature_extractor(
        sample["array"],
        sampling_rate=sample["sampling_rate"],
        max_length=max_length,
        truncation=True,
    )
    batch["input_values"] = inputs.input_values
    batch["input_length"] = len(inputs.input_values)
    return batch
```

where feature_extractor is loaded earlier in the script as:

```python
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(args.model_name_or_path)
```
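For reference, this is how I reproduce that call outside the script (using facebook/wav2vec2-base as an example checkpoint and a synthetic 1-second waveform in place of the dataset's audio column; the script loads whatever args.model_name_or_path points to):

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

# Example checkpoint standing in for args.model_name_or_path
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Synthetic 1-second, 16 kHz waveform standing in for batch[args.audio_column_name]["array"]
waveform = np.random.randn(16000).astype(np.float32)

# max_length mimics max_duration_in_seconds * sampling_rate from the script (assuming 5 s here)
inputs = feature_extractor(waveform, sampling_rate=16000, max_length=80000, truncation=True)

# inputs.input_values is what prepare_dataset stores back into the cached dataset
print(type(inputs.input_values), len(inputs.input_values))
```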
Wav2Vec2FeatureExtractor uses a 1-D conv stack to extract features, as is done in wav2vec2. But why is this being done as a preprocessing step? Furthermore, during training the Wav2Vec2Model has yet another feature extractor, which seemingly does the same thing but on the already-extracted features?
```python
class Wav2Vec2Model(Wav2Vec2PreTrainedModel):
    def __init__(self, config: Wav2Vec2Config):
        super().__init__(config)
        self.config = config
        self.feature_extractor = Wav2Vec2FeatureEncoder(config)
        ...

    # ... later, in forward():
    extract_features = self.feature_extractor(input_values)
```
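And this is a minimal sketch of how I understand those input_values are consumed at training time (a randomly initialized Wav2Vec2Model just to inspect shapes, not the actual Wav2Vec2ForPreTraining setup the script uses):

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Randomly initialized base-sized model, only to look at the conv encoder's output shapes
model = Wav2Vec2Model(Wav2Vec2Config())
model.eval()

# One second of "input_values" as produced by the preprocessing step above
input_values = torch.randn(1, 16000)

with torch.no_grad():
    outputs = model(input_values)

print(outputs.extract_features.shape)   # output of model.feature_extractor (the conv stack): (1, 49, 512)
print(outputs.last_hidden_state.shape)  # output of the transformer encoder: (1, 49, 768)
```

So whatever the preprocessing step stored as input_values still gets pushed through model.feature_extractor (the conv stack) at training time.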
Shouldn’t there be a single feature-extraction step, with the 1-D conv applied only during training? What am I missing?
Edit: Wav2Vec2FeatureExtractor simply uses Wav2Vec2FeatureEncoder, so they are identical.