Expected scalar type Long but found Float using Trainer for BertForTokenClassification

Hello,

I am using Trainer with BertForTokenClassification:

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    #warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=encoded_dataset['train'],         # training dataset
    eval_dataset=encoded_dataset['validation'],             # evaluation dataset
)

and I am getting the following error:

   ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: expected scalar type Long but found Float

Is there any option for the Trainer that I need to set to fix this error?

Thank you.

From the very little you shared about the error message, it seems that your labels don’t have the right type. So the problem lies in the dataset you fed to the Trainer.
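
For instance, a quick sanity check along these lines (a sketch assuming the encoded_dataset from the first post, with the column name used later in the thread) shows what the Trainer actually receives:

sample = encoded_dataset["train"][0]
print(encoded_dataset["train"].features)        # declared feature types
print([type(x) for x in sample["label"][:5]])   # token-level labels should all be int, not float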

Thank you @sgugger. Apparently the ‘input’ is in floats. The task I am doing is token classification, so I have input_ids, which are all ints, and corresponding labels, which are also ints. Maybe the model initializes the input as floats by default, or the Trainer does? I am not sure. I have found some other issue with the data itself, so I am fixing that now. I may share more information later. Thanks again.

The inputs to your loss function are floats, and that’s perfectly normal since those are the predictions of your model. Again, that error indicates that the targets are floats instead of ints.
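
For reference, here is a minimal standalone sketch (toy tensors, unrelated to the model above) of why that message appears: torch’s nll_loss accepts float predictions but requires Long (integer) targets.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 15)                          # fake predictions: 4 tokens, 15 classes (floats are fine here)
log_probs = F.log_softmax(logits, dim=-1)
float_targets = torch.tensor([1.0, 0.0, 3.0, 2.0])   # float targets reproduce the error
# F.nll_loss(log_probs, float_targets)               # RuntimeError: expected scalar type Long but found Float
loss = F.nll_loss(log_probs, float_targets.long())   # casting the targets to Long makes the call succeed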

If you want additional help, please post the code you are trying to run.

Thank you @sgugger. You are right: I checked the targets, and they are floats. Here is my code:

from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForTokenClassification,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("ptc.py")

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(
        example["tokens"],
        is_split_into_words=True, padding=True, truncation=True, 
        return_offsets_mapping=True
    )

    labels = []
    for doc, doc_labels in zip(tokenized_inputs.encodings, example["label"]):
        doc_encoded_labels = []
        i = 0
        last_word_id = None
        for word_id in doc.word_ids:
            if word_id is None:
                # special tokens ([CLS], [SEP], padding) are ignored by the loss
                doc_encoded_labels.append(-100)
            elif word_id == last_word_id:
                # only the first sub-token of a word gets a label; the rest are ignored
                doc_encoded_labels.append(-100)
            else:
                last_word_id = word_id
                doc_encoded_labels.append(doc_labels[i])
                i += 1
        labels.append(doc_encoded_labels)

    tokenized_inputs["label"] = labels
    return tokenized_inputs


encoded_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
)

model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=15)
model.train()

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=encoded_dataset['train'],         # training dataset
    eval_dataset=encoded_dataset['validation'],             # evaluation dataset
)

trainer.train()

Here is the dataset loading script:

"""PTC: The Propaganda Technique Classification Dataset."""

BUILDER_CONFIGS = [
    PtcConfig(
        name="jsonl",
        version=datasets.Version("2.0.0", ""),
        description="jsonl",
    ),
]

def _info(self):
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features({
            "tokens": datasets.Sequence(datasets.Value("string")),
            "label": datasets.Sequence(
                datasets.features.ClassLabel(
                    names = [
                        "O",
                        "Appeal_to_Authority",
                        "Appeal_to_fear-prejudice",
                        "Bandwagon,Reductio_ad_hitlerum",
                        "Black-and-White_Fallacy",
                        "Causal_Oversimplification",
                        "Doubt",
                        "Exaggeration,Minimisation",
                        "Flag-Waving",
                        "Loaded_Language",
                        "Name_Calling,Labeling",
                        "Repetition",
                        "Slogans",
                        "Thought-terminating_Cliches",
                        "Whataboutism,Straw_Men,Red_Herring",
                    ]
                )
            ),
        }),
        supervised_keys=None,   # TODO find out what this is
        homepage=_HOMEPAGE,
        citation=_CITATION,
    )

def _split_generators(self, dl_manager):
    #data_dir = dl_manager.download_and_extract(_URLS)

    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "filepath": "train.jsonl",
                "split": "train",
            }
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={
                "filepath": "dev.jsonl",
                "split": "dev",
            }
        ),
    ]

def _generate_examples(self, filepath, split):
    with open(filepath, encoding='utf-8') as f:
        for id_, row in enumerate(f):
            data = json.loads(row)
            yield id_, {
                "tokens": data["tokens"],
                "label": [] if split == "test" else data["label"] ,
            }

Just make sure to convert your labels to integers then, and all will be well. For instance, you can do
doc_encoded_labels.append(int(doc_labels[i])).

Okay, I fixed it. My labels are ints, not floats. My problem was that I used ‘label’ rather than ‘labels’. I think Transformers assumed it was going to be a single value rather than a vector; when testing it against int returned False, the default data collator assumed it was a float. I changed the name of the feature from ‘label’ to ‘labels’, and the Trainer does not complain any more.
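
For completeness, a minimal sketch of that fix inside tokenize_and_align_labels (the values themselves stay as the ints produced by the ClassLabel feature):

tokenized_inputs["labels"] = labels   # was: tokenized_inputs["label"] = labels

With a column named "labels", the default data collator should batch the per-token integer lists directly as a Long tensor, instead of special-casing "label" as a single per-example value and falling back to float, as described above.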