Custom SQuAD2.0 dataset gives an error when using run_qa.py script

julifelipe · July 29, 2021, 7:02pm

Hello,

I am trying to follow the PyTorch Question Answering example. However, when running the run_qa.py script using my own (Dutch machine-translated) SQuAD train and test files (JSON), I get the following error: pyarrow.lib.ArrowInvalid: cannot mix list and non-list, non-null values.

I run the script with the following arguments:

python run_qa.py \
--model_name_or_path GroNLP/bert-base-dutch-cased \
--version_2_with_negative \
--do_train \
--do_eval \
--train_file "C:\Users\myname\data\squad\nl_squad_train_clean.json" \
--test_file "C:\Users\myname\data\squad\nl_squad_dev_clean.json" \
--per_device_train_batch_size 12 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps=800 \
--output_dir ../output

When I replace the train and test files with --dataset_name squad, it works fine. What could be the problem with my own SQuAD files?

Thanks in advance! Cheers!
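
One thing worth checking, sketched here under assumptions: if the custom files use the official nested SQuAD layout (data → paragraphs → qas), they will not match the flat, one-record-per-question layout that --dataset_name squad provides (id, title, context, question, answers with parallel text and answer_start lists). As far as I can tell, run_qa.py reads local JSON files from a top-level "data" field, and inconsistent types across records are a plausible source of the ArrowInvalid error. The helper names flatten_squad and convert_file below are hypothetical, and the nested input layout is an assumption about the files, not something confirmed in this post.

```python
import json


def flatten_squad(nested):
    """Turn nested SQuAD-style JSON (data -> paragraphs -> qas) into flat records.

    Assumes each article has "paragraphs", each paragraph has "context" and
    "qas", and each qa has "id", "question" and (optionally) "answers".
    """
    records = []
    for article in nested["data"]:
        title = article.get("title", "")
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Unanswerable SQuAD 2.0 questions have an empty answers list.
                answers = qa.get("answers", [])
                records.append({
                    # Cast to str so every id has the same type.
                    "id": str(qa["id"]),
                    "title": title,
                    "context": context,
                    "question": qa["question"],
                    # Always two parallel lists, empty when there is no answer,
                    # so every record has the same structure.
                    "answers": {
                        "text": [a["text"] for a in answers],
                        "answer_start": [int(a["answer_start"]) for a in answers],
                    },
                })
    return records


def convert_file(src_path, dst_path):
    """Read a nested SQuAD file and write the flat records under a "data" key."""
    with open(src_path, encoding="utf-8") as f:
        nested = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump({"data": flatten_squad(nested)}, f, ensure_ascii=False)
```

Calling convert_file on the train and dev files and pointing the script at the converted copies would show whether the layout is the cause. Casting id to a string and answer_start to an integer is a deliberate choice here, since machine-translated files can end up with mixed types in those fields.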
