Load Dataset issue for custom graph dataset

seyonec · July 19, 2023, 6:04pm

I am hoping to fine-tune the graphormer model on odor prediction (see my dataset here: seyonec/goodscents_leffingwell · Datasets at Hugging Face)using a dataset of compounds and their corresponding labels (which can be 0, 1 or nan). After generating the jsonl format with the proper attributes (edge indices, attributes, num_nodes, y labels, etc) - I’m running into an issue when calling load_dataset. I was hoping to use this dataset to replicate the graphormer tutorial created by @clefourrier (Graph Classification with Transformers). Would greatly appreciate any advice, thanks!
Error details:

Downloading and preparing dataset json/seyonec--goodscents_leffingwell to /home/t-seyonec/.cache/huggingface/datasets/seyonec___json/seyonec--goodscents_leffingwell-07a9fbb3964fb885/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data: 100%|██████████| 6.38M/6.38M [00:00<00:00, 32.2MB/s]
Downloading data: 100%|██████████| 784k/784k [00:00<00:00, 12.0MB/s]]
Downloading data: 100%|██████████| 795k/795k [00:00<00:00, 12.4MB/s]]
Downloading data files: 100%|██████████| 3/3 [00:01<00:00,  2.72it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 2715.35it/s]
                                                        
---------------------------------------------------------------------------
ArrowIndexError                           Traceback (most recent call last)
File /anaconda/envs/dgllife/lib/python3.8/site-packages/datasets/builder.py:1894, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1887     writer = writer_class(
   1888         features=writer._features,
   1889         path=fpath.replace("SSSSS", f"{shard_id:05d}").replace("JJJJJ", f"{job_id:05d}"),
   (...)
   1892         embed_local_files=embed_local_files,
   1893     )
-> 1894 writer.write_table(table)
   1895 num_examples_progress_update += len(table)

File /anaconda/envs/dgllife/lib/python3.8/site-packages/datasets/arrow_writer.py:569, in ArrowWriter.write_table(self, pa_table, writer_batch_size)
    568     self._build_writer(inferred_schema=pa_table.schema)
--> 569 pa_table = pa_table.combine_chunks()
    570 pa_table = table_cast(pa_table, self._schema)

File /anaconda/envs/dgllife/lib/python3.8/site-packages/pyarrow/table.pxi:3439, in pyarrow.lib.Table.combine_chunks()

File /anaconda/envs/dgllife/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /anaconda/envs/dgllife/lib/python3.8/site-packages/pyarrow/error.pxi:127, in pyarrow.lib.check_status()

ArrowIndexError: array slice would exceed array length
...
   1911         e = e.__context__
-> 1912     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1914 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

cc @albertvillanova @lhoestq @severo @clefourrier

lhoestq · July 24, 2023, 2:02pm

It works fine on my side, can you try updating datasets and pyarrow ?

seyonec · July 24, 2023, 2:45pm

Thanks! moving back to 2.5.2 fixed that issue when I tried it out but I’ve run into issues with the trainer class using this dataset (see disucssion in HF hub here: [seyonec/goodscents_leffingwell · Load Dataset issue for custom graph dataset (huggingface.co)])(seyonec/goodscents_leffingwell · Load Dataset issue for custom graph dataset)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[10], line 1
----> 1 train_results = trainer.train()

File /anaconda/envs/dgllife/lib/python3.8/site-packages/transformers/trainer.py:1539, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1534     self.model_wrapped = self.model
   1536 inner_training_loop = find_executable_batch_size(
   1537     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1538 )
-> 1539 return inner_training_loop(
   1540     args=args,
   1541     resume_from_checkpoint=resume_from_checkpoint,
   1542     trial=trial,
   1543     ignore_keys_for_eval=ignore_keys_for_eval,
   1544 )

File /anaconda/envs/dgllife/lib/python3.8/site-packages/accelerate/utils/memory.py:136, in find_executable_batch_size..decorator(*args, **kwargs)
    134     raise RuntimeError("No executable batch size found, reached zero.")
    135 try:
--> 136     return function(batch_size, *args, **kwargs)
    137 except Exception as e:
    138     if should_reduce_batch_size(e):

File /anaconda/envs/dgllife/lib/python3.8/site-packages/transformers/trainer.py:1809, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
...
   3162 if not (target.size() == input.size()):
   3163     raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
-> 3165 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)

RuntimeError: result type Float can't be cast to the desired output type Long

Topic		Replies	Views
ArrowNotImplementedError when loading json dataset 🤗Datasets	3	1765	December 17, 2021
Loading Custom Datasets 🤗Datasets	7	10758	May 25, 2021
Cannot load dataset on Kaggle 🤗Datasets	4	3176	August 16, 2023
ArrowTypeError in load_dataset 🤗Datasets	1	631	June 12, 2023
Proprietary database load error: TypeError: Argument 'storage' has incorrect type (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray) 🤗Datasets	2	1145	January 25, 2022

Load Dataset issue for custom graph dataset

Related topics