ArrowInvalid: Column 1 named id expected length 512 but got length 1000

I am training on the ncbi_disease dataset using the Transformers Trainer.

Here are the features of the dataset:

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5433
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 924
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 941
    })
})
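For reference, I load the dataset more or less like this (straight from the Hub):

from datasets import load_dataset

# NCBI Disease NER dataset from the Hugging Face Hub
dataset = load_dataset("ncbi_disease")
print(dataset)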

Here is a sample from the training data:

{'id': '20',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0],
 'tokens': ['For',
  'both',
  'sexes',
  'combined',
  ',',
  'the',
  'penetrances',
  'at',
  'age',
  '60',
  'years',
  'for',
  'all',
  'cancers',
  'and',
  'for',
  'colorectal',
  'cancer',
  'were',
  '0',
  '.']}

Here is my tokenization function, which raises this error:

ArrowInvalid: Column 1 named id expected length 512 but got length 1000


def tokenize_text(examples):
    return tokenizer(str(examples["tokens"]), truncation=True, max_length=512)


dataset = dataset.map(tokenize_text, batched=True)

Any clue how to solve this problem?

Hey @ghadeermobasher, this has been explained in chapter 5 of the course (The 🤗 Datasets library - Hugging Face Course). Scroll down a bit and you will find a similar error with the explanation.
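Roughly what is happening (using bert-base-cased purely as an example checkpoint): str(examples["tokens"]) collapses the whole batch into one long string, so the tokenizer returns a single sequence of at most 512 input_ids, while the other columns in the batch still have 1000 rows. Arrow refuses to build a table whose columns have different lengths:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# A toy batch shaped like what Dataset.map(batched=True) passes in
batch = {"id": [str(i) for i in range(1000)], "tokens": [["some", "tokens"]] * 1000}

# str(...) turns the whole batch into one string, so only one (truncated) sequence comes back
encoded = tokenizer(str(batch["tokens"]), truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512 -> one truncated sequence
print(len(batch["id"]))           # 1000 -> original number of rows, hence the mismatch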

You need to modify your tokenize_text function as follows:

def tokenize_text(examples):
    result = tokenizer(str(examples["tokens"]), truncation=True,
                       max_length=512, return_overflowing_tokens=True)

    # Map each overflow chunk back to the sample it came from, and replicate
    # the original columns so that every column has the same number of rows.
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
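For a standalone illustration of what return_overflowing_tokens and overflow_to_sample_mapping give you (again with bert-base-cased as a stand-in checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

batch = ["a short sentence", "a much longer text " + "word " * 2000]
encoded = tokenizer(batch, truncation=True, max_length=512,
                    return_overflowing_tokens=True)

# Each chunk points back to the sample it came from, e.g. [0, 1, 1, 1, 1]
print(encoded["overflow_to_sample_mapping"])
print(len(encoded["input_ids"]))  # number of chunks, not number of input samples

The loop over examples.items() then repeats id, tokens and ner_tags once per chunk, so every column in the mapped batch ends up with the same number of rows and the Arrow error goes away.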

I faced a similar problem and came up with this solution; hopefully it simplifies the task for arbitrary custom datasets as well:

import pandas as pd

max_input_length = 1024
max_target_length = 50

# List that the prefixed input texts get appended to
inputs = []

def preprocess_function(examples):
    # Wrap the example in a one-row DataFrame so rows with missing values can be dropped
    d = {'headlines': examples["headlines"], 'description': examples["description"]}
    text_df = pd.DataFrame([d], index=[0])
    text_df = text_df.dropna()

    # "prefix" is the task prefix, assumed to be defined elsewhere
    for doc in text_df["description"].values:
        inputs.append(prefix + str(doc[0]))

    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True,
                             return_overflowing_tokens=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            str(text_df["headlines"].values), max_length=max_target_length,
            truncation=True, return_overflowing_tokens=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
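For completeness, the snippet above assumes a seq2seq tokenizer and a task prefix are already defined, roughly along these lines (the checkpoint and prefix are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # any seq2seq checkpoint works here
prefix = "summarize: "                                 # task prefix used in preprocess_function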

At this point you call the map() method on the final_dataset DatasetDict you have created:

tokenized_datasets = final_dataset.map(preprocess_function)
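If you still get a length mismatch after this (return_overflowing_tokens can produce more rows than the original dataset has), dropping the original columns during the map call usually takes care of it:

tokenized_datasets = final_dataset.map(
    preprocess_function,
    remove_columns=final_dataset["train"].column_names,  # drop old columns so row counts may differ
)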

Still, if your problem isn't solved by the methods discussed above, then you can check this out: pyarrow.lib.ArrowInvalid: Column 1 named input_ids expected length 599 but got length 1500 · Issue #1817 · huggingface/datasets · GitHub

This seems to be the approach that worked for me.