Hi, I’ve been following this guide to fine-tune a Whisper model for my language and domain: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
Now that I’ve trained it to a point I’m happy with, I want to export the model to TorchScript so I can deploy it for inference.
I’ve been trying to follow this guide: Export to TorchScript
As I understand it, I need to prepare some dummy data, create tensors from it, and send them through my model to trace it to TorchScript.
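The basic pattern I take away from that guide, reduced to a runnable toy example (the tiny Linear model here is just a stand-in for the real model):

import torch

# My reading of the guide's basic pattern: run a dummy input through the
# model once so tracing can record the graph, then save the traced module.
model = torch.nn.Linear(4, 2).eval()   # stand-in for the real model
dummy_input = torch.randn(1, 4)        # stand-in for real example inputs
traced = torch.jit.trace(model, dummy_input)
torch.jit.save(traced, "traced_model.pt")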
This is what I have put together from the first guide: I use the Whisper feature extractor to create tensors from an example sound file, load my model from a checkpoint, and trace it:
import torch
from datasets import Audio, Dataset
from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium", language="$LANGUAGE", task="transcribe")

# Load the example sound file into a datasets Audio column, resampled to 16 kHz
file_paths = [$PATH_TO_MP3_FILE]
trace_data = Dataset.from_dict({"audio": file_paths}).cast_column("audio", Audio(sampling_rate=16000))
audio_array = trace_data["audio"][0]["array"]

# Compute the log-mel input features, the same way as in training
input_features = feature_extractor(audio_array, sampling_rate=16000).input_features[0]
input_features_dict = {"input_features": input_features}
tensors = feature_extractor.pad(input_features_dict, return_tensors="pt")
# Print data shown later
print(tensors)

# Load the fine-tuned checkpoint with torchscript=True and trace it
model = WhisperForConditionalGeneration.from_pretrained($PATH_TO_CHECKPOINT, torchscript=True)
model.generation_config.language = "$LANGUAGE"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.eval()
traced_model = torch.jit.trace(model, tensors["input_features"])
torch.jit.save(traced_model, $SAVE_FILE)
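Once the export works, my plan is to load and run the traced module for inference roughly like this (just a sketch, using the same placeholders and variables as above):

# Sketch: load the saved trace and run the same input(s) through it
loaded = torch.jit.load($SAVE_FILE)
loaded.eval()
with torch.no_grad():
    outputs = loaded(tensors["input_features"])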
Going back to the trace input: the output from my print seems to show that I have created the tensors correctly:
{'input_features': tensor([[-0.7001, -0.7001, -0.7001, ..., -0.7001, -0.7001, -0.7001],
[-0.7001, -0.7001, -0.7001, ..., -0.7001, -0.7001, -0.7001],
[-0.7001, -0.7001, -0.7001, ..., -0.7001, -0.7001, -0.7001],
...,
[-0.7001, -0.7001, -0.7001, ..., -0.7001, -0.7001, -0.7001],
[-0.7001, -0.7001, -0.7001, ..., -0.7001, -0.7001, -0.7001],
[-0.7001, -0.7001, -0.7001, ..., -0.7001, -0.7001, -0.7001]])}
(I have printed unabbreviated versions too; the values are just the same, since my sound has silence at the beginning and end.)
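For reference, this is how I check the shape of what goes into the trace:

# Sanity check: the shape of the tensor passed to torch.jit.trace
print(tensors["input_features"].shape)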
I am running into the following problem:
RuntimeError: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 2 is not equal to len(dims) = 3
So my question is: what am I missing?
It seems to me that I am processing the sound in exactly the same way as in training.
Do I need tensors for the labels too, as in training?
I assumed the trace would only need the sound, since that is what's used at inference.
And if I do need the labels, does it matter that they are correct? I.e., could I just use an empty string and pad the tensors?
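In case it helps, this is the kind of thing I was considering trying if decoder inputs are required, loosely based on the seq2seq example in the Export to TorchScript guide (the wrapper class and the dummy decoder_start_token_id input are my own guesses, not something from the fine-tuning guide):

# Guesswork: add an explicit batch dimension and a dummy decoder input,
# and bind both by keyword via a small wrapper so tracing passes them
# to the right arguments of forward().
batched_features = tensors["input_features"].unsqueeze(0)  # (1, n_mels, frames)
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

class TraceWrapper(torch.nn.Module):
    def __init__(self, whisper):
        super().__init__()
        self.whisper = whisper

    def forward(self, input_features, decoder_input_ids):
        return self.whisper(input_features=input_features,
                            decoder_input_ids=decoder_input_ids)

traced_model = torch.jit.trace(TraceWrapper(model), (batched_features, decoder_input_ids))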