ArrowInvalid: Column 3 named attention_mask expected length 1000 but got length 1076

I’m trying to evaluate a QA model on a custom dataset. This is how I prepared the validation features:

def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

But when I try to apply the function to my dataset:

test_features = test_dataset.map(
    prepare_validation_features,
    batched=True,
)

at a certain point (roughly 23% of the way through) it returns this error:

23% 5/22 [00:19<00:51, 3.04s/ba]
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-156-6658d0dc57be> in <module>()
      1 test_features = test_dataset.map(
      2     prepare_validation_features,
----> 3     batched=True,
      4 )

8 frames
/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 3 named attention_mask expected length 1000 but got length 1076

How can I fix it?

Hello! I have the same issue. Did you fix it?

Hi @Peppe95 @alexandra,

are you sure that each key/column in the returned batch has the same number of elements?

You can check this by inserting the line:

assert len(set(len(column_values) for column_values in returned_batch.values())) == 1, "Mismatch in the number of elements"

before the return statement.
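For example, at the end of your prepare_validation_features (a sketch with the body elided; returned_batch in my one-liner corresponds to your tokenized_examples):

def prepare_validation_features(examples):
    # ... tokenization and the offset_mapping loop from the snippet above ...

    # Sanity check: every column in the returned batch must have the same
    # number of rows, otherwise Arrow cannot assemble them into a table.
    lengths = {key: len(values) for key, values in tokenized_examples.items()}
    assert len(set(lengths.values())) == 1, f"Mismatch in the number of elements: {lengths}"

    return tokenized_examples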

Let me know if this helps.
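One more thing worth checking (an assumption on my part, since I can't see your full setup): with batched=True, map feeds your function batches of 1,000 examples by default, and return_overflowing_tokens=True makes it return more rows than it received (1,076 features out of 1,000 examples here). The new columns then no longer line up with the original ones (id, question, context, ...), which triggers exactly this ArrowInvalid. The question-answering examples in the transformers repo drop the original columns for that reason:

test_features = test_dataset.map(
    prepare_validation_features,
    batched=True,
    # Drop the original columns: they have one row per example,
    # while the new features have one row per overflowing span.
    remove_columns=test_dataset.column_names,
)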

From the Python docs I see that set() builds an unordered collection of unique elements from an iterable.
Why are you iterating through returned_batch.values(), collecting the length of each column into a set, and then checking that the set has exactly one element?
How does this help with the shape error?
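To illustrate with made-up data: set() collapses duplicates, so building a set from the column lengths and checking that it has exactly one element is a compact way of saying “all columns have the same number of rows”, which is what Arrow requires when it builds the table.

returned_batch = {
    "input_ids": [[101, 2054], [101, 2129], [101, 2073]],
    "attention_mask": [[1, 1], [1, 1], [1, 1]],
    "example_id": ["q1", "q2", "q3"],
}

lengths = [len(column_values) for column_values in returned_batch.values()]
print(lengths)            # [3, 3, 3]
print(set(lengths))       # {3}  -- duplicates collapse into one distinct length
print(len(set(lengths)))  # 1   -- every column has the same number of rows

# If one column had an extra row, the set would gain a second element:
returned_batch["attention_mask"].append([1, 1])
print(set(len(v) for v in returned_batch.values()))  # {3, 4} -> the assert fires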