I’m trying to send audio to a model for speech-to-text.
I’m using the Ilyes/wav2vec2-large-xlsr-53-french model, and when I send a WAV file it works well.
When printing the data, I can see that the WAV file is converted to a bytestring before being sent to the model, so I converted my own mic recording to a bytestring too.
But when I send audio recorded from my mic, I get a “Malformed soundfile” error.
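For reference, here is a simplified sketch of the kind of request that works with the WAV file (the token and file path are placeholders; I’m posting the bytes to the hosted Inference API endpoint for the model):

import requests

API_TOKEN = "hf_..."  # placeholder for my real token
API_URL = "https://api-inference.huggingface.co/models/Ilyes/wav2vec2-large-xlsr-53-french"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# the wav file is read as a bytestring and sent as the request body
with open("sample.wav", "rb") as f:
    data = f.read()

response = requests.post(API_URL, headers=headers, data=data)
print(response.json())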
Is there a way to know exactly what input format a model expects (bytestring, headers, etc.)?
Is there a standard for sound data that I should know about?
Or is there documentation for each model?
Or is there no documentation, and should I study the training data to figure it out?
What’s the best practice for getting detailed information on the input data format for a given model?
As we can see from the transformers quickstart doc and fine-tuning doc, the model object we get from the Hugging Face AutoModel API is just a torch.nn.Module.
We can print the signature and docstring of the model’s forward method, which is what actually defines the model’s inputs.
from inspect import signature
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# the forward signature and docstring describe exactly what inputs the model accepts
print(signature(model.forward))
print(model.forward.__doc__)
output:
(input_ids: Union[torch.Tensor, NoneType] = None, attention_mask: Union[torch.Tensor, NoneType] = None, token_type_ids: Union[torch.Tensor, NoneType] = None, position_ids: Union[torch.Tensor, NoneType] = None, head_mask: Union[torch.Tensor, NoneType] = None, inputs_embeds: Union[torch.Tensor, NoneType] = None, labels: Union[torch.Tensor, NoneType] = None, output_attentions: Union[bool, NoneType] = None, output_hidden_states: Union[bool, NoneType] = None, return_dict: Union[bool, NoneType] = None) -> Union[Tuple[torch.Tensor], transformers.modeling_outputs.SequenceClassifierOutput]
The [`BertForSequenceClassification`] forward method, overrides the `__call__` special method.
<Tip>
Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
</Tip>
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
...
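The same trick works for the speech model from the question. A minimal sketch (assuming a recent transformers version where AutoModelForCTC is available and the checkpoint can be downloaded):

from inspect import signature
from transformers import AutoModelForCTC

# load the French wav2vec2 checkpoint mentioned in the question
model = AutoModelForCTC.from_pretrained("Ilyes/wav2vec2-large-xlsr-53-french")

# the signature shows the model expects input_values (a float tensor of the raw waveform),
# not raw file bytes; the docstring explains how to build that tensor
print(signature(model.forward))
print(model.forward.__doc__)

The docstring for input_values also points to the model’s processor/feature extractor, which turns a raw audio array (sampled at 16 kHz for wav2vec2 checkpoints like this one) into the tensor the model expects, so checking the forward docstring together with the model card is usually enough to pin down the input format.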