Hi all,
What I am trying to do is push a dataset I created locally (around 1.2 TB) to Hugging Face Datasets. It contains large images and some other textual data paired with them.
What I follow:
I pass a generator to Dataset.from_generator(), which reads image files (as bytes, with the help of datasets.Image().encode_example(value=some_pil_image)) and the paired textual info from local files:
dataset = Dataset.from_generator(dataset_gen, features=features)
After that I call:
dataset.push_to_hub(dataset_id)
But it limits the size of each shard to around 500 MB (roughly 15 image files or so), which results in ~2400 Parquet files to upload. This exceeds the rate limit for commits on the Hub.
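(The ~2400 number is just the total dataset size divided by the default shard size, using rough decimal sizes:)

```python
# Back-of-the-envelope shard count: total size / ~500 MB default shard limit.
dataset_bytes = 1.2e12  # ~1.2 TB (approximate)
shard_bytes = 500e6     # ~500 MB default shard size
print(int(dataset_bytes / shard_bytes))  # 2400
```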
If I push it like:
dataset.push_to_hub(dataset_id, num_shards=5)
It throws an error:
"
Map: 0%| | 0/1334 [00:06<?, ? examples/s]
Traceback (most recent call last):
File β/home/ubuntu/hf_push.pyβ, line 74, in
dataset.push_to_hub(dataset_id,
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.pyβ, line 5422, in push_to_hub
repo_id, split, uploaded_size, dataset_nbytes, repo_files, deleted_size = self._push_parquet_shards_to_hub(
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.pyβ, line 5289, in _push_parquet_shards_to_hub
first_shard = next(shards_iter)
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.pyβ, line 5271, in shards_with_embedded_external_files
shard = shard.map(
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.pyβ, line 592, in wrapper
out: Union[βDatasetβ, βDatasetDictβ] = func(self, *args, **kwargs)
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.pyβ, line 557, in wrapper
out: Union[βDatasetβ, βDatasetDictβ] = func(self, *args, **kwargs)
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.pyβ, line 3097, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.pyβ, line 3474, in _map_single
batch = apply_function_on_filtered_inputs(
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.pyβ, line 3353, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.pyβ, line 2306, in embed_table_storage
arrays = [
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.pyβ, line 2307, in
embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.pyβ, line 1831, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.pyβ, line 1831, in
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.pyβ, line 2176, in embed_array_storage
return feature.embed_storage(array)
File β/home/ubuntu/venv/lib/python3.10/site-packages/datasets/features/image.pyβ, line 276, in embed_storage
storage = pa.StructArray.from_arrays([bytes_array, path_array], [βbytesβ, βpathβ], mask=bytes_array.is_null())
File βpyarrow/array.pxiβ, line 2850, in pyarrow.lib.StructArray.from_arrays
File βpyarrow/array.pxiβ, line 3290, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
"
Any idea how I can overcome this?