ArrowInvalid: Column 1 named id expected length 512 but got length 1000

I am training on the ncbi_disease dataset using the Transformers Trainer.

Here are the features of the dataset:

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5433
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 924
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 941
    })
})
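For reference, I load the dataset more or less like this (straight from the Hub):

from datasets import load_dataset

# NCBI Disease NER dataset from the Hugging Face Hub
dataset = load_dataset("ncbi_disease")
print(dataset)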

Here is a sample from the training data:

{'id': '20',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0],
 'tokens': ['For',
  'both',
  'sexes',
  'combined',
  ',',
  'the',
  'penetrances',
  'at',
  'age',
  '60',
  'years',
  'for',
  'all',
  'cancers',
  'and',
  'for',
  'colorectal',
  'cancer',
  'were',
  '0',
  '.']}

Here is my tokenization function, which raises this error:

ArrowInvalid: Column 1 named id expected length 512 but got length 1000


def tokenize_text(examples):
    return tokenizer(str(examples["tokens"]), truncation=True, max_length=512)


dataset = dataset.map(tokenize_text, batched=True)

Any clue how to solve this problem?

Hey @ghadeermobasher, this has been explained in chapter 5 of the course (The 🤗 Datasets library - Hugging Face Course). Scroll down a bit and you will find a similar error with the explanation.
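Roughly what is happening (using bert-base-cased purely as an example checkpoint): str(examples["tokens"]) collapses the whole batch into one long string, so the tokenizer returns a single sequence of at most 512 input_ids, while the other columns in the batch still have 1000 rows. Arrow refuses to build a table whose columns have different lengths:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# A toy batch shaped like what Dataset.map(batched=True) passes in
batch = {"id": [str(i) for i in range(1000)], "tokens": [["some", "tokens"]] * 1000}

# str(...) turns the whole batch into one string, so only one (truncated) sequence comes back
encoded = tokenizer(str(batch["tokens"]), truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512 -> one truncated sequence
print(len(batch["id"]))           # 1000 -> original number of rows, hence the mismatch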

You need to modify your tokenize_text function as follows:

def tokenize_text(examples):
    result = tokenizer(str(examples["tokens"]), truncation=True,
                       max_length=512, return_overflowing_tokens=True)

    # Map each overflow chunk back to the sample it came from, and replicate
    # the original columns so that every column has the same number of rows.
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
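For a standalone illustration of what return_overflowing_tokens and overflow_to_sample_mapping give you (again with bert-base-cased as a stand-in checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

batch = ["a short sentence", "a much longer text " + "word " * 2000]
encoded = tokenizer(batch, truncation=True, max_length=512,
                    return_overflowing_tokens=True)

# Each chunk points back to the sample it came from, e.g. [0, 1, 1, 1, 1]
print(encoded["overflow_to_sample_mapping"])
print(len(encoded["input_ids"]))  # number of chunks, not number of input samples

The loop over examples.items() then repeats id, tokens and ner_tags once per chunk, so every column in the mapped batch ends up with the same number of rows and the Arrow error goes away.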

I faced a similar problem and came up with this solution; hopefully it simplifies the task for arbitrary custom datasets as well:

import pandas as pd

max_input_length = 1024
max_target_length = 50

# List that the prefixed input texts get appended to
inputs = []

def preprocess_function(examples):
    # Wrap the example in a one-row DataFrame so rows with missing values can be dropped
    d = {'headlines': examples["headlines"], 'description': examples["description"]}
    text_df = pd.DataFrame([d], index=[0])
    text_df = text_df.dropna()

    # "prefix" is the task prefix, assumed to be defined elsewhere
    for doc in text_df["description"].values:
        inputs.append(prefix + str(doc[0]))

    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True,
                             return_overflowing_tokens=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            str(text_df["headlines"].values), max_length=max_target_length,
            truncation=True, return_overflowing_tokens=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
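For completeness, the snippet above assumes a seq2seq tokenizer and a task prefix are already defined, roughly along these lines (the checkpoint and prefix are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # any seq2seq checkpoint works here
prefix = "summarize: "                                 # task prefix used in preprocess_function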

At this point you call the map() method on the final_dataset DatasetDict you have created:

tokenized_datasets = final_dataset.map(preprocess_function)
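If you still get a length mismatch after this (return_overflowing_tokens can produce more rows than the original dataset has), dropping the original columns during the map call usually takes care of it:

tokenized_datasets = final_dataset.map(
    preprocess_function,
    remove_columns=final_dataset["train"].column_names,  # drop old columns so row counts may differ
)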

Still, if your problem isn't solved by the methods discussed above, then you can check this out: pyarrow.lib.ArrowInvalid: Column 1 named input_ids expected length 599 but got length 1500 · Issue #1817 · huggingface/datasets · GitHub

This seems to be the approach that worked for me.