Wav2Vec2 pretraining: feature extraction during preprocessing as well as training

I’m looking at the wav2vec2 pretraining example in the transformers repository. This is the preprocessing step:

with accelerator.main_process_first():
    vectorized_datasets = raw_datasets.map(
        prepare_dataset,
        num_proc=args.preprocessing_num_workers,
        remove_columns=raw_datasets["train"].column_names,
    )

The prepare_dataset function is this:

def prepare_dataset(batch):
    sample = batch[args.audio_column_name]

    inputs = feature_extractor(
        sample["array"], sampling_rate=sample["sampling_rate"], max_length=max_length, truncation=True
    )
    batch["input_values"] = inputs.input_values[0]
    batch["input_length"] = len(inputs.input_values[0])

    return batch
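For intuition about what truncation=True with max_length does here: each example’s raw waveform is capped at max_length samples before being stored. A minimal pure-Python stand-in (a hypothetical helper for illustration, not the real Wav2Vec2FeatureExtractor):

```python
def truncate_waveform(array, max_length):
    """Cap a raw waveform at max_length samples, mimicking what
    truncation=True does in the feature extractor call above."""
    values = array[:max_length]   # drop samples beyond max_length
    return values, len(values)    # analogous to input_values, input_length

# a 5-sample "waveform" truncated to 3 samples
values, length = truncate_waveform([0.1, 0.2, 0.3, 0.4, 0.5], max_length=3)
# values == [0.1, 0.2, 0.3], length == 3
```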

Where feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(args.model_name_or_path)

Wav2Vec2FeatureExtractor uses a stack of 1-D convolutions to extract features, as is done in wav2vec2. But why is this being done as a preprocessing step? Furthermore, during training Wav2Vec2Model has yet another feature extractor, which seemingly does the same thing, but on the already-extracted features:

class Wav2Vec2Model(Wav2Vec2PreTrainedModel):
    def __init__(self, config: Wav2Vec2Config):
        super().__init__(config)
        self.config = config
        self.feature_extractor = Wav2Vec2FeatureEncoder(config)

    # and later, in forward():
    #     extract_features = self.feature_extractor(input_values)
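For context on what that in-model conv encoder does to sequence length: assuming the base wav2vec2 configuration (conv kernels 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2 — taken from the paper, not from this snippet), the downsampling can be computed with the standard strided-conv length formula:

```python
def conv_output_length(num_samples,
                       kernels=(10, 3, 3, 3, 3, 2, 2),
                       strides=(5, 2, 2, 2, 2, 2, 2)):
    """Sequence length after a stack of strided 1-D convs (no padding),
    applying floor((L - kernel) / stride) + 1 per layer."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

# one second of 16 kHz audio -> 49 frames (roughly one frame per 20 ms)
print(conv_output_length(16000))  # 49
```

This is why the model-internal encoder is doing real work: it turns tens of thousands of raw samples into a short sequence of latent frames for the transformer.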

Shouldn’t there be a single feature-extraction step with the 1-D convs, done only during training? What am I missing?

Edit: I was looking at a deprecated class in the modeling file with the same name — there, Wav2Vec2FeatureExtractor is simply a deprecated subclass of Wav2Vec2FeatureEncoder, so those two are identical. The class actually used during preprocessing is a different one: transformers/feature_extraction_wav2vec2.py at main · huggingface/transformers · GitHub
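In that file, the preprocessing-time Wav2Vec2FeatureExtractor does not run any convolutions; it only pads/truncates the raw waveform and (when do_normalize=True) applies zero-mean, unit-variance normalization. A pure-Python sketch of that normalization (my paraphrase of the behavior, not the library code; the eps value is an assumption):

```python
def zero_mean_unit_var(values, eps=1e-7):
    """Normalize a raw waveform to zero mean and unit variance,
    sketching what the preprocessing feature extractor does
    when do_normalize=True."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

normed = zero_mean_unit_var([0.0, 1.0, 2.0, 3.0])
# afterwards: mean ~ 0 and variance ~ 1
```

So the two same-named classes are not doing the work twice: preprocessing only normalizes the waveform, and the 1-D conv feature encoding happens once, inside the model.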