How to find the wrong data when train_dataset.map fails in run_translation.py

Hello. I got an error while executing run_translation.py.

The debug log is below:

```
Running tokenizer on train dataset:  79% 15/19 [00:03<00:00, 4.72ba/s]
05/30/2022 06:02:02 - DEBUG - datasets.arrow_writer - Done writing 15000 examples in 4874189 bytes /root/.cache/huggingface/datasets/aaraki___parquet/aaraki--github-issues7-4ed446480480c542/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/tmpimfoz_hr.
Traceback (most recent call last):
  File "run_translation.py", line 627, in <module>
    main()
  File "run_translation.py", line 445, in main
    desc="Running tokenizer on train dataset",
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2364, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 532, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 499, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2751, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 506, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
  File "pyarrow/table.pxi", line 1702, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1314, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 2 named labels expected length 995 but got length 1000
```

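If I read the log right, 15000 examples had already been written when the crash happened, and `map` uses a batch size of 1000 by default, so the bad rows should be somewhere in 15000-15999. Would something like this sketch be the right way to confirm that? (It reuses the `train_dataset` and `preprocess_function` that run_translation.py builds earlier; untested on my side.)

```python
# Sketch: re-run preprocess_function batch by batch, the way .map()
# does with its default batch_size of 1000, and stop at the first
# batch whose output columns disagree in length. Assumes train_dataset
# and preprocess_function as defined in run_translation.py before .map().
batch_size = 1000
for start in range(0, len(train_dataset), batch_size):
    batch = train_dataset[start : start + batch_size]  # dict of column lists
    out = preprocess_function(batch)
    lengths = {key: len(col) for key, col in out.items()}
    if len(set(lengths.values())) > 1:
        print(f"Column length mismatch in rows {start}-{start + batch_size - 1}: {lengths}")
        break
```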
The code is below:

```python
with training_args.main_process_first(desc="train dataset map pre-processing"):
    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
        desc="Running tokenizer on train dataset",  # line 445 in the traceback
    )
```

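My guess is that five rows in that batch make one column come out shorter than the others (empty or None source/target texts, maybe?), but I'm not sure. To look at the raw rows in that slice I was planning to try something like this (here `source_lang`/`target_lang` stand for whatever `--source_lang`/`--target_lang` I pass, and I'm assuming the usual `{"translation": {lang: text, ...}}` layout that run_translation.py expects):

```python
# Sketch: print the rows in the suspect slice whose source or target
# text is missing or empty. source_lang/target_lang as set up in
# run_translation.py from the command-line arguments.
suspect = train_dataset.select(range(15000, min(16000, len(train_dataset))))
for i, example in enumerate(suspect):
    pair = example["translation"]
    if not pair.get(source_lang) or not pair.get(target_lang):
        print("row", 15000 + i, "->", pair)
```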
I also tried to open the /root/ directory in Colab, but it didn't work.
Could you tell me how to find and fix the wrong data in the dataset?
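And if the culprit really does turn out to be empty or missing texts, would filtering them out before the `map` call, along these lines, be an acceptable fix? (Again just a sketch, with the same `source_lang`/`target_lang` assumption as above.)

```python
# Sketch: drop rows whose source or target text is missing/empty
# before running the tokenization map.
train_dataset = train_dataset.filter(
    lambda example: bool(example["translation"].get(source_lang))
    and bool(example["translation"].get(target_lang))
)
```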