ArrowInvalid: Column 1 named id expected length 512 but got length 1000

I am training on the ncbi_disease dataset using the Transformers Trainer.

Here are the features of the dataset:

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5433
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 924
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 941
    })
})

Here is a sample from the training set:

{'id': '20',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0],
 'tokens': ['For',
  'both',
  'sexes',
  'combined',
  ',',
  'the',
  'penetrances',
  'at',
  'age',
  '60',
  'years',
  'for',
  'all',
  'cancers',
  'and',
  'for',
  'colorectal',
  'cancer',
  'were',
  '0',
  '.']}

Here is my tokenization function, which produces this error:

ArrowInvalid: Column 1 named id expected length 512 but got length 1000


def tokenize_text(examples):
    return tokenizer(str(examples["tokens"]), truncation=True, max_length=512)


dataset = dataset.map(tokenize_text, batched=True)

Any clue how to solve this problem?

Hey @ghadeermobasher, this is explained in Chapter 5 of the course (The 🤗 Datasets library - Hugging Face Course). Scroll down a bit and you will find a similar error with an explanation: it appears when the function passed to map returns columns whose length differs from the number of rows in the input batch, so 🤗 Datasets cannot build a consistent Arrow table.

You need to modify your tokenize_text function like this:

def tokenize_text(examples):
    result = tokenizer(
        str(examples["tokens"]),
        truncation=True,
        max_length=512,
        return_overflowing_tokens=True,
    )
    # Each overflowing chunk records which original sample it came from,
    # so copy the old columns along that mapping to keep all row counts in sync.
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
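
If it helps, here is a minimal sketch of how you would then apply it, assuming "dataset" and "tokenizer" are the objects you already created (the name "tokenized_dataset" is just illustrative):

# The function now returns every column with one row per chunk, so the original
# columns and the new tokenizer columns line up and the Arrow length check passes.
tokenized_dataset = dataset.map(tokenize_text, batched=True)
print(tokenized_dataset)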