'i' format requires -2147483648 <= number <= 2147483647 error

sunwooooong · January 6, 2023, 8:41am

Hello. I received the following error while processing (tokenizing) my custom dataset:
'i' format requires -2147483648 <= number <= 2147483647

I used the following code:

        self.tokenized_datasets_training = dataset_training.map(
            tokenize_function,
            batched=True,
            batch_size=6000,
            remove_columns=["codes"],
            load_from_cache_file=self.config.cache,
            num_proc=4,
            fn_kwargs={"tokenizer": self.tokenizer},
        )

dataset_training has 2301617 rows. When I run it with num_proc=1, it works fine, but it raises the error when num_proc >= 2. The server I run it on has 23GB, and my data is about 7GB.

I can process it with num_proc=1, but what should I do if I want to set num_proc to 2 or more?
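
For reference, the wording of the message looks like what Python's struct module reports when a value does not fit in a signed 32-bit 'i' field. My guess (an assumption, I have not confirmed it from the full traceback) is that with num_proc >= 2 something sent between processes is larger than 2147483647 bytes and its length is packed with that format. The helper name fits_in_int32 below is just for illustration; the sketch only shows the 32-bit limit itself, not the datasets code path:

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error


def fits_in_int32(n):
    """Return True if n can be packed as a signed 32-bit integer."""
    try:
        # "!i" is a network-order signed 32-bit int.
        struct.pack("!i", n)
        return True
    except struct.error:
        # Raised when n is outside -2147483648..2147483647.
        return False


print(fits_in_int32(INT32_MAX))      # largest value that still fits
print(fits_in_int32(INT32_MAX + 1))  # one past the limit
```

If that guess is right, the limit would apply to the size of a single message between processes rather than to the whole 7GB dataset.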
