How to know the data format required by a model?

Hello,

I’m trying to send audio to a model for speech-to-text.
I’m using the Ilyes/wav2vec2-large-xlsr-53-french model, and when I send a WAV file it works well.

When printing the data, I see the WAV file is converted to a bytestring before being sent to the model, so I transformed my own mic recording into a bytestring too.
But when I send a sound recorded from my mic, I get a “Malformed soundfile” error.
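
For reference, the working call looks roughly like this (a sketch of the usual Inference API pattern; the token and file name are placeholders):

import requests

API_URL = "https://api-inference.huggingface.co/models/Ilyes/wav2vec2-large-xlsr-53-french"
API_TOKEN = "hf_xxx"  # placeholder
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Read the raw bytes of the file, container headers included
with open("sample.wav", "rb") as f:
    data = f.read()

response = requests.post(API_URL, headers=headers, data=data)
print(response.json())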

Is there a way to know exactly what input format a model expects (bytestring, headers, etc.)?
Is there a standard I should know of for sound data?
Or is there documentation for each model?
Or is there no documentation, and should I study the training data to get this info?

What’s the best practice for getting detailed information on the input data format of a given model?

Thanks
Cedric


Well, I started the datasets tutorial, which answers my question ^^
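
In case it helps others, here’s a minimal sketch of what the tutorial boils down to for this kind of model, assuming it expects raw 16 kHz mono audio run through its companion processor (the usual wav2vec2 setup):

import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_name = "Ilyes/wav2vec2-large-xlsr-53-french"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# librosa decodes the file and resamples it to 16 kHz; the processor expects
# a float waveform, not a raw bytestring
speech, sample_rate = librosa.load("recording.wav", sr=16000)
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))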
Cedric

As we can see from the transformers quickstart and fine-tuning docs, the model object we get from the Hugging Face AutoModel API is just a torch.nn.Module.
We can print the signature and docstring of the model’s forward method, which is what actually defines the model’s inputs.

from inspect import signature
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The forward signature and docstring spell out exactly which tensors the model accepts
print(signature(model.forward))
print(model.forward.__doc__)

output:

(input_ids: Union[torch.Tensor, NoneType] = None, attention_mask: Union[torch.Tensor, NoneType] = None, token_type_ids: Union[torch.Tensor, NoneType] = None, position_ids: Union[torch.Tensor, NoneType] = None, head_mask: Union[torch.Tensor, NoneType] = None, inputs_embeds: Union[torch.Tensor, NoneType] = None, labels: Union[torch.Tensor, NoneType] = None, output_attentions: Union[bool, NoneType] = None, output_hidden_states: Union[bool, NoneType] = None, return_dict: Union[bool, NoneType] = None) -> Union[Tuple[torch.Tensor], transformers.modeling_outputs.SequenceClassifierOutput]

The [`BertForSequenceClassification`] forward method, overrides the `__call__` special method.

    <Tip>

    Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
    instance afterwards instead of this since the former takes care of running the pre and post processing steps while
    the latter silently ignores them.

    </Tip>

    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)
        token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
            1]`:

            - 0 corresponds to a *sentence A* token,
            - 1 corresponds to a *sentence B* token.

            [What are token type IDs?](../glossary#token-type-ids)
...
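
And on the preprocessing side, the matching tokenizer produces exactly the tensors described in that docstring. A minimal sketch, reusing the model from the snippet above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Produces the input_ids, token_type_ids and attention_mask described above,
# already batched as PyTorch tensors
inputs = tokenizer("Hello world!", return_tensors="pt")
print(inputs.keys())  # input_ids, token_type_ids, attention_mask

outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) for num_labels=2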