run_seq2seq_qa.py: Column 3 named labels expected length 1007 but got length 1000

I tried to run T5 training with the example code snippet provided on:

Code-Snippet:

python run_seq2seq_qa.py \
  --model_name_or_path t5-small \
  --dataset_name squad_v2 \
  --context_column context \
  --question_column question \
  --answer_column answer \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_seq2seq_squad/

When I ran the example code, I got this error:

ValueError: --answer_column' value 'answer' needs to be one of: id, title, context, question, answers

I changed "--answer_column answer" to "--answer_column answers" and ran it again. This time I got the following error while the tokenizer was running on the validation dataset:

Running tokenizer on validation dataset:   0% 0/12 [00:03<?, ?ba/s]
Traceback (most recent call last):
  File "run_seq2seq_qa.py", line 678, in <module>
    main()
  File "run_seq2seq_qa.py", line 522, in main
    desc="Running tokenizer on validation dataset",
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2110, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 411, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2486, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 458, in write_batch
    pa_table = pa.Table.from_pydict(typed_sequence_examples)
  File "pyarrow/table.pxi", line 1560, in pyarrow.lib.Table.from_pydict
  File "pyarrow/table.pxi", line 1532, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1181, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 3 named labels expected length 1007 but got length 1000

How can I fix it?

Hi, I was having the exact same issue, and it looks like an issue was posted on the HF GitHub a few days ago. Check this out: Error with run_seq2seq_qa.py official script (pyarrow.lib.ArrowInvalid: Column 4 named labels expected length 1007 but got length 1000) · Issue #15398 · huggingface/transformers · GitHub
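
For what it's worth, here is my reading of the error. This is an assumption based on the traceback and the flags, not something confirmed in this thread. With --max_seq_length 384 and --doc_stride 128, a long context is probably split into several overlapping features, so a batch of 1000 examples can produce 1007 input rows while labels still holds one entry per example. Arrow then refuses to build a table from columns of different lengths. The toy sketch below reproduces that mismatch in plain Python and shows one way to line the columns up again. The helper names (split_into_windows, preprocess) and the tiny window sizes are hypothetical and only for illustration; this is not the script's actual code.

```python
def split_into_windows(tokens, max_len, stride):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows share `stride` tokens, loosely mimicking what a
    tokenizer does when it returns overflowing tokens for long inputs.
    """
    if len(tokens) <= max_len:
        return [tokens]
    step = max_len - stride  # assumes max_len > stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


def preprocess(contexts, answers, max_len=4, stride=1):
    """Toy batched preprocessing that keeps every output column the same length."""
    input_ids = []
    sample_mapping = []  # for each window, the index of the example it came from
    for example_index, context in enumerate(contexts):
        for window in split_into_windows(context.split(), max_len, stride):
            input_ids.append(window)
            sample_mapping.append(example_index)

    # Returning `answers` directly would give one label per example and
    # reproduce the length mismatch. Repeating each label once per window
    # keeps "labels" aligned with "input_ids".
    labels = [answers[example_index] for example_index in sample_mapping]
    return {"input_ids": input_ids, "labels": labels}


batch = preprocess(["a b c", "a b c d e f g"], ["x", "y"])
print(len(batch["input_ids"]), len(batch["labels"]))
```

Here the second context is too long for one window, so two examples become three rows, and the label of the second example is repeated for both of its windows. If the real script behaves the same way, the labels built in its validation preprocessing would need the same kind of per-feature expansion; the linked issue is the place to check what the maintainers actually changed.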