I know my question seems basic, but I am asking specifically about audio data. I know what tokenization is for text data: dividing the text into tokens (words, characters, or subwords).
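To be concrete, this is what I mean by tokenization for text (bert-base-uncased is just an arbitrary checkpoint I picked for illustration):

from transformers import AutoTokenizer

# An arbitrary text checkpoint, purely to illustrate subword tokenization
text_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(text_tokenizer.tokenize("Tokenization splits text into subwords"))
# roughly: ['token', '##ization', 'splits', 'text', 'into', 'sub', '##words']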
I was checking the Hugging Face documentation for Wav2Vec2, but I did not understand what tokenization means in the context of audio.
I also used Wav2Vec2FeatureExtractor, which normalizes the data, and I found that its output is the same as Wav2Vec2Tokenizer's output.
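As far as I can tell, the normalization is per-utterance zero-mean/unit-variance scaling of the raw waveform; here is a rough manual sketch of my understanding (the epsilon value is my guess, not something I took from the library):

import numpy as np

# Dummy 1-second waveform at 16 kHz, just for illustration
speech = np.random.randn(16000).astype(np.float32)
# What I believe the feature extractor does when do_normalize=True:
# per-utterance zero-mean, unit-variance normalization
normalized = (speech - speech.mean()) / np.sqrt(speech.var() + 1e-7)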
Here is the full example where I compared them:
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Tokenizer, Wav2Vec2Model
from datasets import load_dataset
import soundfile as sf
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
def map_to_array(batch):
    # Read the raw waveform from disk into the batch
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)
input_values = tokenizer(ds["speech"][0], return_tensors="pt").input_values  # Batch size 1
hidden_states = model(input_values).last_hidden_state  # One hidden vector per audio frame
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
i = feature_extractor(ds["speech"][0], return_tensors="pt", sampling_rate=16000)
i.input_values equals input_values in this example.
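To confirm they match, I compared the two tensors numerically (torch.allclose is just one way to do this):

import torch

# Both objects expose an input_values tensor; compare them elementwise
print(torch.allclose(i.input_values, input_values))

This prints True on my machine.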
What is the difference between Wav2Vec2FeatureExtractor and Wav2Vec2Tokenizer?