Problems tracing a fine-tuned Whisper model to TorchScript

Hi, I’ve been following this guide to fine-tune a Whisper model for my language and domain: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

Now that I’ve trained it to a degree that I’m happy with, I want to export the model to TorchScript in order to deploy it for inference.
I’ve been trying to follow this guide: Export to TorchScript

As I understand it, I need to create dummy input data as tensors, run them through my model, and trace it to TorchScript.

This is what I have put together, using what I understand from the first guide and the Whisper feature extractor, to create tensors from an example sound file, load my model from a checkpoint, and trace it:

import torch
from datasets import Audio, Dataset
from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium", language="$LANGUAGE", task="transcribe")

# Build a one-item dataset from the example audio file
file_paths = []
file_paths.append($PATH_TO_MP3_FILE)
trace_data = Dataset.from_dict({"audio": file_paths}).cast_column("audio", Audio())
audio_array = trace_data["audio"][0]["array"]
input_features = feature_extractor(audio_array, sampling_rate=16000).input_features[0]
input_features_dict = {"input_features": input_features}
tensors = feature_extractor.pad(input_features_dict, return_tensors="pt")

# Print output shown below
print(tensors)

model = WhisperForConditionalGeneration.from_pretrained($PATH_TO_CHECKPOINT, torchscript=True)
model.generation_config.language = "$LANGUAGE"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

model.eval()

traced_model = torch.jit.trace(model, tensors["input_features"])
torch.jit.save(traced_model, $SAVE_FILE)

The output from my print statement seems to show that I have created my tensors correctly:

{'input_features': tensor([[-0.7001, -0.7001, -0.7001,  ..., -0.7001, -0.7001, -0.7001],
        [-0.7001, -0.7001, -0.7001,  ..., -0.7001, -0.7001, -0.7001],
        [-0.7001, -0.7001, -0.7001,  ..., -0.7001, -0.7001, -0.7001],
        ...,
        [-0.7001, -0.7001, -0.7001,  ..., -0.7001, -0.7001, -0.7001],
        [-0.7001, -0.7001, -0.7001,  ..., -0.7001, -0.7001, -0.7001],
        [-0.7001, -0.7001, -0.7001,  ..., -0.7001, -0.7001, -0.7001]])}

(I have printed unabbreviated versions too; the values are just all the same because my sound has silence at the beginning and end.)
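For what it’s worth, a quick sanity check of the tensor’s dimensionality before tracing would be something like this (my own addition, untested):

# Check the shape / number of dimensions of the feature tensor before tracing
print(tensors["input_features"].shape)
print(tensors["input_features"].dim())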

I am running into the following problem:

RuntimeError: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 2 is not equal to len(dims) = 3

So my question is: what am I missing?
It seems to me I am processing the sound the same way as in training.

Do I need tensors for the labels too, as in training?
I assumed the trace would only need the audio, since that is what is used at inference.
If I do need labels, does it matter that they are correct? I.e. could I just use an empty string and pad the tensors?
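My current guess from the error is that the model’s forward pass expects a batched, 3-D input (batch, n_mels, frames) while my tensor is 2-D, so adding a batch dimension explicitly might help; this is just my assumption and I haven’t verified it:

# Add a leading batch dimension: (n_mels, frames) -> (1, n_mels, frames)
batched_features = tensors["input_features"].unsqueeze(0)
traced_model = torch.jit.trace(model, batched_features)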

I managed to get one step further by changing

input_features_dict = {"input_features": [input_features]}

But I am now running into:

ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds
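
Which makes me think the trace also needs some decoder input. Is something like this the right direction? (Just a guess on my part, assuming I can use the config’s decoder_start_token_id and the example_kwarg_inputs argument of torch.jit.trace in newer PyTorch versions; untested.)

# Guess: trace with a dummy decoder input alongside the audio features
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

traced_model = torch.jit.trace(
    model,
    example_kwarg_inputs={
        "input_features": tensors["input_features"],
        "decoder_input_ids": decoder_input_ids,
    },
)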