How to find the wrong data from the debug log

Hello. I got an error while executing.

The debug log is below:

Running tokenizer on train dataset: 79% 15/19 [00:03<00:00, 4.72ba/s]
05/30/2022 06:02:02 - DEBUG - datasets.arrow_writer - Done writing 15000 examples in 4874189 bytes /root/.cache/huggingface/datasets/aaraki___parquet/aaraki--github-issues7-4ed446480480c542/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/tmpimfoz_hr.
Traceback (most recent call last):
File "", line 627, in
File "", line 445, in main
desc="Running tokenizer on train dataset",
File "/usr/local/lib/python3.7/dist-packages/datasets/", line 2364, in map
File "/usr/local/lib/python3.7/dist-packages/datasets/", line 532, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/", line 499, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/", line 458, in wrapper
out = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/", line 2751, in _map_single
File "/usr/local/lib/python3.7/dist-packages/datasets/", line 506, in write_batch
pa_table = pa.Table.from_arrays(arrays, schema=schema)
File "pyarrow/table.pxi", line 1702, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1314, in pyarrow.lib.Table.validate
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 2 named labels expected length 995 but got length 1000

The code is below:

with training_args.main_process_first(desc="train dataset map pre-processing"):
    train_dataset =
        load_from_cache_file=not data_args.overwrite_cache,
        desc="Running tokenizer on train dataset",  # error
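To narrow down which batch is bad, I imagine running the preprocessing function manually over the raw columns in batches of 1000 (the default batch size of map) and reporting any batch whose output columns have unequal lengths. This is only a sketch; find_mismatched_batches, buggy_preprocess, and the column names are placeholders, not my actual script:

```python
def find_mismatched_batches(columns, preprocess, batch_size=1000):
    """Return (start_index, column_lengths) for every batch whose
    preprocessed output has columns of unequal length."""
    n = len(next(iter(columns.values())))
    bad = []
    for start in range(0, n, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        out = preprocess(batch)
        lengths = {k: len(v) for k, v in out.items()}
        if len(set(lengths.values())) > 1:  # columns disagree on row count
            bad.append((start, lengths))
    return bad

# Toy preprocess that silently drops missing labels, mimicking the bug:
def buggy_preprocess(batch):
    return {
        "input_ids": [[1, 2]] * len(batch["text"]),
        "labels": [l for l in batch["label"] if l is not None],
    }

columns = {"text": ["a"] * 2000, "label": [0] * 2000}
for i in (10, 20, 30, 40, 50):  # five bad rows inside the second batch
    columns["label"][1000 + i] = None

print(find_mismatched_batches(columns, buggy_preprocess))
# → [(1000, {'input_ids': 1000, 'labels': 995})]
```

The reported start index then tells which slice of the raw dataset to inspect row by row.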

I tried to open the /root/ directory in Colab, but it didn't work.
Could you tell me how to find and fix the wrong data in the dataset?