Expected scalar type Long but found Float using Trainer for BertForTokenClassification

yifan · April 19, 2021, 1:18pm

Hello,

I am using Trainer with BertForTokenClassification:

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    #warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=encoded_dataset['train'],         # training dataset
    eval_dataset=encoded_dataset['validation'],             # evaluation dataset
)

and I am getting errors as following:

   ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: expected scalar type Long but found Float

Is there any option for trainer I need to set to fix this error?

Thank you.

sgugger · April 19, 2021, 3:18pm

From the very little you shared about the error message, it seems that your labels don’t have the right type. So the problem lies in the dataset you fed to the Trainer.

yifan · April 21, 2021, 8:12am

Thank you @sgugger Apparently the ‘input’ is in floats. The task I am doing is token classification, so I have input_ids, which are all ints, and corresponding labels, which are also ints. Maybe model by default would initialize the input as floats, or the trainer? I am not sure. I have found some other issue with the data itself, so I am fixing it now. I may share more information later. Thanks again.

sgugger · April 21, 2021, 12:03pm

The input to your loss functions are floats and that’s perfectly normal since those are the predictions of your model. Again, that error indicates it’s the target are floats instead of ints.

If you want additional help, please post the code you are training to run.

yifan · April 21, 2021, 4:26pm

Thank you @sgugger You are right, I checked the target, they are floats. Here is my code:

dataset = load_dataset("ptc.py")

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(
        example["tokens"],
        is_split_into_words=True, padding=True, truncation=True, 
        return_offsets_mapping=True
    )

    labels = []
    for doc, doc_labels in zip(tokenized_inputs.encodings, example["label"]):
        doc_encoded_labels = []
        i = 0
        last_word_id = None
        for word_id in doc.word_ids:
            if word_id == None:
                doc_encoded_labels.append(-100)
            elif word_id == last_word_id:
                doc_encoded_labels.append(-100)
            else:
                last_word_id = word_id
                doc_encoded_labels.append(doc_labels[i])
                i += 1
        labels.append(doc_encoded_labels)

    tokenized_inputs["label"] = labels
    return tokenized_inputs


encoded_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
)

model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=15)
model.train()

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=encoded_dataset['train'],         # training dataset
    eval_dataset=encoded_dataset['validation'],             # evaluation dataset
)

trainer.train()

Here is the dataset:

"""PTC: The Propaganda Technique Classification Dataset."""

BUILDER_CONFIGS = [
    PtcConfig(
        name="jsonl",
        version=datasets.Version("2.0.0", ""),
        description="jsonl",
    ),
]

def _info(self):
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features({
            "tokens": datasets.Sequence(datasets.Value("string")),
            "label": datasets.Sequence(
                datasets.features.ClassLabel(
                    names = [
                        "O",
                        "Appeal_to_Authority",
                        "Appeal_to_fear-prejudice",
                        "Bandwagon,Reductio_ad_hitlerum",
                        "Black-and-White_Fallacy",
                        "Causal_Oversimplification",
                        "Doubt",
                        "Exaggeration,Minimisation",
                        "Flag-Waving",
                        "Loaded_Language",
                        "Name_Calling,Labeling",
                        "Repetition",
                        "Slogans",
                        "Thought-terminating_Cliches",
                        "Whataboutism,Straw_Men,Red_Herring",
                    ]
                )
            ),
        }),
        supervised_keys=None,   # TODO find out what is this
        homepage=_HOMEPAGE,
        citation=_CITATION,
    )

def _split_generators(self, dl_manager):
    #data_dir = dl_manager.download_and_extract(_URLS)

    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "filepath": "train.jsonl",
                "split": "train",
            }
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={
                "filepath": "dev.jsonl",
                "split": "dev",
            }
        ),
    ]

def _generate_examples(self, filepath, split):
    with open(filepath, encoding='utf-8') as f:
        for id_, row in enumerate(f):
            data = json.loads(row)
            yield id_, {
                "tokens": data["tokens"],
                "label": [] if split == "test" else data["label"] ,
            }

sgugger · April 21, 2021, 6:42pm

Just make sure to convert your labels to integers then, and all will be well. For instance, you can do
doc_encoded_labels.append(int(doc_labels[i])).

yifan · April 22, 2021, 9:29am

Okay, I fixed. My labels are int, not float. My problem is that I used ‘label’ rather than ‘labels’, I think Transformers assumed it is going to be a single value rather than a vector, when testing it against int returns False, the default data collator assumed it is float. I changed the name for the feature from ‘label’ to ‘labels’, Trainer does not complain any more.

Topic		Replies	Views
RuntimeError: result type Float can't be cast to the desired output type Long 🤗Transformers	1	120	January 24, 2025
Reshaping logits when using Trainer Beginners	1	5332	May 23, 2022
Expected input batch_size (2048) to match target batch_size (4) Beginners	3	1602	May 23, 2022
Loss error for bert token classifier Beginners	11	500	December 4, 2021
RuntimeError when Training starts: expected scalar type Long but found Int Beginners	2	4775	July 5, 2023

Expected scalar type Long but found Float using Trainer for BertForTokenClassification

Related topics