Longformer for sequence classification throwing error regarding data format and shape

I’m trying to use LongformerForSequenceClassification, but it throws errors on a simple toy problem.

First, it complains about the data type of the labels, stating they should be float instead of long. After tokenization, all the tensors, that is “input_ids”, “attention_mask”, and “labels”, are of type long.

Second, there’s a size/shape mismatch between “input_ids” and “labels” if I use a batch size larger than 1.

For instance, if the batch size is 14, the labels are of shape [14, 512] but the input_ids are of shape [512].

I don’t understand why all these errors occur when using models directly from HF.
Shouldn’t each model card clearly specify the expected input shapes and types, so that we don’t run into these headaches all the time?

First, it complains about the data type of the labels, stating they should be float instead of long. After tokenization, all the tensors, that is “input_ids”, “attention_mask”, and “labels”, are of type long.

This error is typically thrown when you fine-tune a model for multi-label classification. In that case, the labels should indeed be of type float for PyTorch’s BCEWithLogitsLoss to work properly.
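For example, a minimal sketch of what the labels could look like (the values and the number of labels here are made up):

```python
import torch

# Multi-hot labels for a batch of 2 examples and 5 possible labels (made-up values).
# For multi-label classification, the model uses BCEWithLogitsLoss internally,
# which expects float targets, so cast the labels explicitly:
labels = torch.tensor(
    [[1, 0, 0, 1, 0],
     [0, 1, 0, 0, 0]]
).float()  # shape (batch_size, num_labels), dtype torch.float32
```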

Regarding the shapes, the input_ids should always be of shape (batch_size, sequence_length), e.g. (4, 512) in case you pad/truncate all inputs to a length of 512 tokens, and the labels should be of shape (batch_size, num_labels) in case of multi-label classification.
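To illustrate, a small sketch of the expected shapes (the texts and the number of labels are made up):

```python
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# A made-up batch of 4 short texts, padded/truncated to 512 tokens.
texts = ["first toy example", "second toy example", "third one", "fourth one"]
encoding = tokenizer(texts, padding="max_length", truncation=True,
                     max_length=512, return_tensors="pt")

print(encoding["input_ids"].shape)       # torch.Size([4, 512])
print(encoding["attention_mask"].shape)  # torch.Size([4, 512])

# For multi-label classification with, say, 3 labels, each example
# gets a multi-hot float vector:
labels = torch.zeros(4, 3, dtype=torch.float)  # shape (batch_size, num_labels)
```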

See my demo notebook; you can normally just replace BertForSequenceClassification by LongformerForSequenceClassification (as well as the tokenizer) and it should work: Transformers-Tutorials/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub

Thank you @nielsr for the answer, it really helps a lot.
I just wanted to clarify that I’m not trying to do fine-tuning but training from scratch.
I usually load the tokenizer with tokenizer.from_pretrained(...) and the config with config = AutoConfig.from_pretrained(...), but the actual model is loaded from the config:

AutoModelForSequenceClassification.from_config(config)

I thought loading the model this way would ensure it is trained from scratch rather than starting from pretrained weights, since I’m not loading the model as

checkpoint = "allenai/longformer-base-4096",
AutoModelForSequenceClassification.from_pretrained(checkpoint)

Am I wrong in thinking that this way ensures training from scratch?
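To make it concrete, the two loading paths I’m comparing look roughly like this (num_labels=3 is just a placeholder):

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

checkpoint = "allenai/longformer-base-4096"

# Training from scratch: only the architecture comes from the config,
# all weights are randomly initialized.
config = AutoConfig.from_pretrained(checkpoint, num_labels=3)
model_scratch = AutoModelForSequenceClassification.from_config(config)

# Fine-tuning: the pretrained weights of the base model are loaded,
# only the classification head is randomly initialized.
model_finetune = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
```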

My sequences have different lengths, but I think in this case I should pad them to 1024 given the current checkpoint?
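For example, something like this sketch (the max_length of 1024 is just my guess):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# Pad/truncate every sequence to 1024 tokens. Longformer's default
# attention_window is 512, so a multiple of 512 avoids extra padding
# inside the model itself.
encoding = tokenizer(["some toy sequence"], padding="max_length", truncation=True,
                     max_length=1024, return_tensors="pt")
print(encoding["input_ids"].shape)  # torch.Size([1, 1024])
```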

My targets usually consist of 1 or 2 tokens after tokenization, but I pad them to length=10. Is that wrong?

On another note, what are some other Hugging Face models with good performance on sequence classification that can be trained from scratch on my toy synthetic datasets?