ValueError: Expected input batch_size to match target batch_size in Token Classification

Hey there! I'm trying to solve an NER-like task, where I have text tokens and a label for each token, in this form:

['i', 'want', 'to', 'fly', 'from', 'boston', 'at', '838', 'am', 'and', 'arrive', 'in', 'denver', 'at', '1110', 'in', 'the', 'morning']
['O', 'O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-depart_time.time', 'I-depart_time.time', 'O', 'O', 'O', 'B-toloc.city_name', 'O', 'B-arrive_time.time', 'O', 'O', 'B-arrive_time.period_of_day']

I've been following the TokenClassification guide from Hugging Face almost exactly, since it fits this problem perfectly. However, I get the following error when calling trainer.train():

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-e3bc539dfde3> in <cell line: 1>()
----> 1 trainer_slots.train()

10 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   3057     if size_average is not None or reduce is not None:
   3058         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3059     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
   3060 
   3061 

ValueError: Expected input batch_size (16) to match target batch_size (2048).

I've noticed that the target batch size is always a multiple of the input batch size, and if I increase the batch size, the target batch size grows accordingly.

Here’s my code in some snippets:

# Loading the tokenizer
model_slots_name = 'andgonzalez/bert-uncased-slot-filling'
tokenizer_slots = AutoTokenizer.from_pretrained(model_slots_name)

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer_slots(examples["text"], truncation=True, padding="max_length", max_length=128, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Loading the data into HF Datasets and preprocessing it
train_df_slots = train_df[["text", "slots"]]
dev_df_slots = dev_df[["text", "slots"]]
test_df_slots = test_df[["text", "slots"]]

# Rename the column 'slots' to 'labels'
train_df_slots = train_df_slots.rename(columns={"slots": "labels"})
test_df_slots = test_df_slots.rename(columns={"slots": "labels"})
dev_df_slots = dev_df_slots.rename(columns={"slots": "labels"})

# Create a dataset dictionary
dataset_dict = {
    "train": Dataset.from_pandas(train_df_slots),
    "test": Dataset.from_pandas(test_df_slots),
    "dev": Dataset.from_pandas(dev_df_slots),
}

# Turn it into HuggingFace dataset
dataset_slots = DatasetDict(dataset_dict)
print(dataset_slots)

# Transforming the labels into numerical values with label2id
def label2id_mapping_function(examples):
    examples["labels"] = [label2id_slots[i] for i in examples["labels"]]
    return examples

dataset_slots = dataset_slots.map(label2id_mapping_function)

print(f"Example of instance: {dataset_slots['train'][0]}")

# Tokenize the dataset
tokenized_dataset_slots = dataset_slots.map(tokenize_and_align_labels, batched=True, remove_columns=["text"])
print(f"Example of instance after tokenization: {tokenized_dataset_slots['train'][0]}")
print(f"Example of instance after tokenization: {tokenized_dataset_slots['dev'][0]}")
DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 4978
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 893
    })
    dev: Dataset({
        features: ['text', 'labels'],
        num_rows: 500
    })
})
Example of instance: {'text': ['i', 'want', 'to', 'fly', 'from', 'boston', 'at', '838', 'am', 'and', 'arrive', 'in', 'denver', 'at', '1110', 'in', 'the', 'morning'], 'labels': [0, 0, 0, 0, 0, 0, 1, 0, 2, 3, 0, 0, 0, 4, 0, 5, 0, 0, 6]}
Example of instance after tokenization: {'labels': [-100, 0, 0, 0, 0, 0, 0, 1, 0, -100, 2, 3, 0, 0, 0, 4, 0, -100, 5, 0, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100], 'input_ids': [101, 1045, 2215, 2000, 4875, 2013, 3731, 2012, 6640, 2620, 2572, 1998, 7180, 1999, 7573, 2012, 11118, 2692, 1999, 1996, 2851, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
Example of instance after tokenization: {'labels': [-100, 0, 0, 0, 0, 0, 1, 0, 2, -100, 3, 0, 0, 0, 4, 0, 5, -100, 0, 0, 6, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100], 'input_ids': [101, 1045, 2215, 2000, 4875, 2013, 3731, 2012, 6640, 2620, 2572, 1998, 7180, 1999, 7573, 2012, 11118, 2692, 1999, 1996, 2851, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
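
# Loading the model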
model_slots = AutoModelForSequenceClassification.from_pretrained(
    model_slots_name,
    ignore_mismatched_sizes=True,
    label2id=label2id_slots,
    id2label=id2label_slots
)
# Defining the Data Collator
from transformers import DataCollatorForTokenClassification
data_collator_slots = DataCollatorForTokenClassification(tokenizer=tokenizer_slots)

# Training
training_args_slots = TrainingArguments(
    output_dir="output_slots",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    metric_for_best_model="eval_f1",
    load_best_model_at_end=True,
)

trainer_slots = Trainer(
    model=model_slots,
    args=training_args_slots,
    train_dataset=tokenized_dataset_slots["train"],
    eval_dataset=tokenized_dataset_slots["dev"],
    tokenizer=tokenizer_slots,
    data_collator=data_collator_slots,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=3, early_stopping_threshold=0.00
        )
    ],
)

I've seen a few similar posts, but none of the proposed answers were useful for this case. Has this ever happened to you? I'd love some insight. Thanks in advance.

I've never done an NER task, but does the target batch size in the error message change if you set max_length to 256?

@Sandy1857 I get this:

ValueError: Expected input batch_size (16) to match target batch_size (4096).

So the target batch size is basically your input batch size multiplied by your max_length:

256 * 16 = 4096, 128 * 16 = 2048

Are you sure that after tokenization each sequence ends up with only one set of labels, and not max_length labels per sequence? I'm just guessing.

Could you just check len(tokenized_inputs["labels"])?
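
Something like this (just a rough sketch, reusing the tokenized_dataset_slots and data_collator_slots from your post) should show the shapes that actually reach the model:

# Build one batch with the data collator and inspect the tensor shapes
features = [tokenized_dataset_slots["train"][i] for i in range(16)]
batch = data_collator_slots(features)
print(batch["input_ids"].shape)  # expected: torch.Size([16, 128])
print(batch["labels"].shape)     # if this is also [16, 128], that's 16 * 128 = 2048 targets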

@Sandy1857 What do you mean by a set of labels? There should be one label per token, right?

When I put a print in the last line of the tokenizing function, I get this:

1000
1000
1000
1000
978
893
500

What I mean is that for a given input (text, image, tokens, etc.) there should be the same number of labels. But somehow you seem to have got more than that; it's just my guess, I haven't gone through your code thoroughly.

I've printed the length of the input_ids list alongside the length of the labels, and they always have the same length.

1000 1000
1000 1000
1000 1000
1000 1000
978 978
893 893
500 500

Hi. Does using AutoModelForTokenClassification instead of AutoModelForSequenceClassification solve the issue, since you want to do token classification?
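
If it helps, here is a rough, untested sketch of that change, reusing the model_slots_name, label2id_slots and id2label_slots you already defined:

from transformers import AutoModelForTokenClassification

# Token classification head: logits of shape (batch, seq_len, num_labels),
# which matches the per-token labels produced by tokenize_and_align_labels
model_slots = AutoModelForTokenClassification.from_pretrained(
    model_slots_name,
    num_labels=len(label2id_slots),
    label2id=label2id_slots,
    id2label=id2label_slots,
    ignore_mismatched_sizes=True,
)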
