Longformer for sequence classification throwing error regarding data format and shape

I’m trying to use LongformerForSequenceClassification, but it throws errors on a simple toy problem.

First, it complains about the data type of the labels, stating they should be float instead of long. After tokenization, all the tensors, that is “input_ids”, “attention_mask”, and “labels”, are of type long.

Second, there’s a size/shape mismatch between “input_ids” and “labels” if I use a batch size larger than 1.

For instance, if the batch size is 14, the labels are of shape [14, 512] but the input_ids are of shape [512].

I don’t understand why all these errors occur when using models directly from HF.
Shouldn’t each model card clearly specify the expected input shapes and types, so that we don’t run into these headaches all the time?

First, it complains about the data type of the labels, stating they should be float instead of long. After tokenization, all the tensors, that is “input_ids”, “attention_mask”, and “labels”, are of type long.

This error is typically thrown when you fine-tune a model for multi-label classification. In that case, the labels should indeed be of type float for PyTorch’s BCEWithLogitsLoss to work properly.
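For example, a minimal sketch of what the labels could look like (the values and the number of labels here are made up):

```python
import torch

# Multi-hot labels for a batch of 2 examples and 5 possible labels (made-up values).
# For multi-label classification, the model uses BCEWithLogitsLoss internally,
# which expects float targets, so cast the labels explicitly:
labels = torch.tensor(
    [[1, 0, 0, 1, 0],
     [0, 1, 0, 0, 0]]
).float()  # shape (batch_size, num_labels), dtype torch.float32
```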

Regarding the shapes, the input_ids should always be of shape (batch_size, sequence_length), e.g. (4, 512) in case you pad/truncate all inputs to a length of 512 tokens, and the labels should be of shape (batch_size, num_labels) in case of multi-label classification.
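To illustrate, a small sketch of the expected shapes (the texts and the number of labels are made up):

```python
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# A made-up batch of 4 short texts, padded/truncated to 512 tokens.
texts = ["first toy example", "second toy example", "third one", "fourth one"]
encoding = tokenizer(texts, padding="max_length", truncation=True,
                     max_length=512, return_tensors="pt")

print(encoding["input_ids"].shape)       # torch.Size([4, 512])
print(encoding["attention_mask"].shape)  # torch.Size([4, 512])

# For multi-label classification with, say, 3 labels, each example
# gets a multi-hot float vector:
labels = torch.zeros(4, 3, dtype=torch.float)  # shape (batch_size, num_labels)
```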

See my demo notebook; you can normally just replace BertForSequenceClassification by LongformerForSequenceClassification (as well as the tokenizer) and it should work: Transformers-Tutorials/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub

Thank you @nielsr for the answer, it really helps a lot.
I just wanted to clarify that I’m not trying to do fine-tuning but training from scratch.
I usually load the tokenizer with tokenizer.from_pretrained(...) and the config with config = AutoConfig.from_pretrained(...), but the actual model is loaded from the config:

AutoModelForSequenceClassification.from_config(config)

I thought loading the model this way would ensure it is trained from scratch rather than starting from pretrained weights, since I’m not loading the model as

checkpoint = "allenai/longformer-base-4096",
AutoModelForSequenceClassification.from_pretrained(checkpoint)

Am I wrong in thinking that this way ensures training from scratch?
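To make it concrete, the two loading paths I’m comparing look roughly like this (num_labels=3 is just a placeholder):

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

checkpoint = "allenai/longformer-base-4096"

# Training from scratch: only the architecture comes from the config,
# all weights are randomly initialized.
config = AutoConfig.from_pretrained(checkpoint, num_labels=3)
model_scratch = AutoModelForSequenceClassification.from_config(config)

# Fine-tuning: the pretrained weights of the base model are loaded,
# only the classification head is randomly initialized.
model_finetune = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
```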

My sequences have different lengths, but I think in this case I should pad them to 1024 given the current checkpoint?
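For example, something like this sketch (the max_length of 1024 is just my guess):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# Pad/truncate every sequence to 1024 tokens. Longformer's default
# attention_window is 512, so a multiple of 512 avoids extra padding
# inside the model itself.
encoding = tokenizer(["some toy sequence"], padding="max_length", truncation=True,
                     max_length=1024, return_tensors="pt")
print(encoding["input_ids"].shape)  # torch.Size([1, 1024])
```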

My targets usually consist of 1 or 2 tokens after tokenization, but I pad them to length=10. Is that wrong?

On another note, what are some other Hugging Face models with good performance on sequence classification that can be trained from scratch on my toy synthetic datasets?