I tried to run T5 training with the example code snippet provided on:
Code-Snippet:
python run_seq2seq_qa.py \
--model_name_or_path t5-small \
--dataset_name squad_v2 \
--context_column context \
--question_column question \
--answer_column answer \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_seq2seq_squad/
When I ran the example code, I got this error:
ValueError: --answer_column' value 'answer' needs to be one of: id, title, context, question, answers
I changed "--answer_column answer" to "--answer_column answers", ran it again, and got this error while the tokenizer was running on the validation dataset:
Running tokenizer on validation dataset: 0% 0/12 [00:03<?, ?ba/s]
Traceback (most recent call last):
  File "run_seq2seq_qa.py", line 678, in <module>
    main()
  File "run_seq2seq_qa.py", line 522, in main
    desc="Running tokenizer on validation dataset",
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2110, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 411, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2486, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 458, in write_batch
    pa_table = pa.Table.from_pydict(typed_sequence_examples)
  File "pyarrow/table.pxi", line 1560, in pyarrow.lib.Table.from_pydict
  File "pyarrow/table.pxi", line 1532, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1181, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 3 named labels expected length 1007 but got length 1000
How can I fix it?
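
My unconfirmed guess at what the last line of the traceback means: the batch holds 1000 examples, but with --max_seq_length 384 and --doc_stride 128 some long contexts may be split into several overlapping windows, giving 1007 input rows, while the labels column still has one row per original example. Below is a minimal plain-Python sketch of that mismatch and of one way to realign the labels. It does not use transformers or datasets, the helper names (split_into_windows, build_features, align_labels) are hypothetical, and I have not checked that this is what run_seq2seq_qa.py actually does.

```python
# Plain-Python illustration only. The helpers below are hypothetical and are
# NOT functions from transformers or datasets.

def split_into_windows(tokens, max_len, stride):
    """Split one token list into windows of at most max_len tokens,
    where consecutive windows overlap by `stride` tokens."""
    assert max_len > stride, "window must be longer than the overlap"
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


def build_features(contexts, max_len, stride):
    """Return all windows plus, for each window, the index of the example
    it came from (a window-to-example mapping)."""
    features, sample_mapping = [], []
    for example_index, tokens in enumerate(contexts):
        for window in split_into_windows(tokens, max_len, stride):
            features.append(window)
            sample_mapping.append(example_index)
    return features, sample_mapping


def align_labels(labels, sample_mapping):
    """Repeat each example's label once per window of that example."""
    return [labels[example_index] for example_index in sample_mapping]


# Three toy "tokenized" contexts; the second is too long for one window.
contexts = [list(range(5)), list(range(12)), list(range(3))]
labels = ["a0", "a1", "a2"]

features, sample_mapping = build_features(contexts, max_len=8, stride=2)
aligned_labels = align_labels(labels, sample_mapping)

# More input rows than label rows: the same kind of mismatch as 1007 vs 1000.
print(len(features), len(labels))
# After alignment both columns have the same length.
print(len(features), len(aligned_labels))
```

If this guess is right, the labels in the validation preprocessing would need to be expanded with the window-to-example mapping before the batch is written, so that every column has the same number of rows.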