I am hoping to fine-tune the Graphormer model on odor prediction (see my dataset here: seyonec/goodscents_leffingwell · Datasets at Hugging Face) using a dataset of compounds and their corresponding labels (which can be 0, 1, or nan). After generating the JSONL file with the proper attributes (edge indices, edge attributes, num_nodes, y labels, etc.), I'm running into an issue when calling load_dataset. I was hoping to use this dataset to replicate the Graphormer tutorial created by @clefourrier (Graph Classification with Transformers). Would greatly appreciate any advice, thanks!
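For reference, here is a minimal sketch of how I'm writing the records and loading them (the field names follow the graph format from the tutorial; the exact values here are made-up placeholders, not my real data):

```python
import json
import os
import tempfile

# One graph per line; nan labels are serialized as JSON null.
record = {
    "edge_index": [[0, 1], [1, 0]],  # COO edge list: [sources, targets]
    "edge_attr": [[0], [0]],         # one feature vector per edge
    "node_feat": [[6], [8]],         # one feature vector per node
    "num_nodes": 2,
    "y": [1, 0, None],               # multi-label targets; None = missing
}

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(record) + "\n")

# The failing call is then essentially:
# from datasets import load_dataset
# dataset = load_dataset("json", data_files={"train": path})

# Sanity-check that the JSONL round-trips as expected:
with open(path) as f:
    loaded = json.loads(f.readline())
print(loaded["num_nodes"])  # 2
```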
Error details:
Downloading and preparing dataset json/seyonec--goodscents_leffingwell to /home/t-seyonec/.cache/huggingface/datasets/seyonec___json/seyonec--goodscents_leffingwell-07a9fbb3964fb885/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data: 100%|██████████| 6.38M/6.38M [00:00<00:00, 32.2MB/s]
Downloading data: 100%|██████████| 784k/784k [00:00<00:00, 12.0MB/s]
Downloading data: 100%|██████████| 795k/795k [00:00<00:00, 12.4MB/s]
Downloading data files: 100%|██████████| 3/3 [00:01<00:00, 2.72it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 2715.35it/s]
---------------------------------------------------------------------------
ArrowIndexError Traceback (most recent call last)
File /anaconda/envs/dgllife/lib/python3.8/site-packages/datasets/builder.py:1894, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1887 writer = writer_class(
1888 features=writer._features,
1889 path=fpath.replace("SSSSS", f"{shard_id:05d}").replace("JJJJJ", f"{job_id:05d}"),
(...)
1892 embed_local_files=embed_local_files,
1893 )
-> 1894 writer.write_table(table)
1895 num_examples_progress_update += len(table)
File /anaconda/envs/dgllife/lib/python3.8/site-packages/datasets/arrow_writer.py:569, in ArrowWriter.write_table(self, pa_table, writer_batch_size)
568 self._build_writer(inferred_schema=pa_table.schema)
--> 569 pa_table = pa_table.combine_chunks()
570 pa_table = table_cast(pa_table, self._schema)
File /anaconda/envs/dgllife/lib/python3.8/site-packages/pyarrow/table.pxi:3439, in pyarrow.lib.Table.combine_chunks()
File /anaconda/envs/dgllife/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File /anaconda/envs/dgllife/lib/python3.8/site-packages/pyarrow/error.pxi:127, in pyarrow.lib.check_status()
ArrowIndexError: array slice would exceed array length
...
1911 e = e.__context__
-> 1912 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1914 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset