I have been trying to get a custom dataset repository working with SDXL. The data is a single 123 GB tar file, which I load with the loading script attached. However, the process gets killed while generating the train split. How can I make it more efficient? Here is the traceback:
```
Traceback (most recent call last):
  File "/usr/local/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/datasets/commands/datasets_cli.py", line 39, in main
    service.run()
  File "/usr/local/lib/python3.9/dist-packages/datasets/commands/test.py", line 137, in run
    builder.download_and_prepare(
  File "/usr/local/lib/python3.9/dist-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.9/dist-packages/datasets/builder.py", line 1227, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/usr/local/lib/python3.9/dist-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.9/dist-packages/datasets/builder.py", line 1210, in _prepare_split
    for key, record in logging.tqdm(
  File "/usr/local/lib/python3.9/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/make_dataset/4bbb8fc5763c15089fd4ed23fac05a06ad51d0c3652e1303dfed3c7edf82ede8/make_dataset.py", line 126, in _generate_examples
    "original_image": {"path": file_path, "bytes": file_obj.read()},
  File "/usr/lib/python3.9/tarfile.py", line 683, in read
    self.fileobj.seek(offset + (self.position - start))
  File "/usr/lib/python3.9/tarfile.py", line 515, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed
```
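For context, `tarfile` raises `StreamError: seeking backwards is not allowed` when an archive opened in streaming mode (`r|`) is asked to revisit an earlier member, and with a 123 GB tar, streaming is the only practical mode, so every member has to be read exactly once, in archive order. Below is a minimal sketch of what a strictly sequential `_generate_examples` could look like using `dl_manager.iter_archive`, which yields `(path, file_obj)` pairs in the order they appear in the archive. This is not the attached script; the builder name, archive path, and feature set are placeholders.

```python
import datasets


class MakeDataset(datasets.GeneratorBasedBuilder):
    """Hypothetical sketch: read tar members strictly in order to avoid backward seeks."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "original_image": datasets.Image(),
                    "text": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # Placeholder path; in a real repo this would point at the 123 GB tar.
        archive = dl_manager.download("data/train.tar")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # iter_archive streams (path, file_obj) tuples in archive order,
                # so no member is ever revisited and no backward seek occurs.
                gen_kwargs={"files": dl_manager.iter_archive(archive)},
            )
        ]

    def _generate_examples(self, files):
        for key, (file_path, file_obj) in enumerate(files):
            if file_path.endswith((".jpg", ".png")):
                # Each member's bytes are read once, while the stream is at it.
                yield key, {
                    "original_image": {"path": file_path, "bytes": file_obj.read()},
                    "text": "",  # caption lookup omitted in this sketch
                }
```

With this pattern, memory stays bounded by one member at a time rather than the whole archive, and nothing ever seeks backwards. Any caption or metadata pairing would have to be arranged (e.g., loaded up front from a sidecar file) so that it never requires jumping back to an earlier tar member.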