Generating Croissant Metadata for Custom Image Dataset

Dear Hugging Face Team,

I am currently hosting a 3D reconstruction image dataset on the Hugging Face Hub: yinyue27/RefRef. The dataset is in Blender format and contains multiple subsets, where each subset corresponds to a scene. Each scene is further divided into three splits: train, validation, and test. Each scene includes three JSON files that store the image paths for each split.
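
The layout looks roughly like this (simplified to one scene; the JSON file names here follow the usual Blender convention, so treat them as illustrative):

image_data/
  textured_sphere_scene/
    single-convex/
      ball_sphere/
        train/r_0.png ...
        val/ ...
        test/ ...
        transforms_train.json  # image paths for the train split
        transforms_val.json
        transforms_test.json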

Since Hugging Face could not automatically recognise the dataset’s structure, I implemented a custom data loader script. However, I noticed that the Croissant metadata is not being generated automatically.

Would it be possible for someone to help review my loader script and provide guidance on properly structuring the dataset so that the Croissant metadata can be generated?

I appreciate your time and any advice you can offer.


I wonder if Croissant is not automatically generated for dataset repositories that use loading scripts…? @lhoestq

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef", "single-convex", trust_remote_code=True) # error
print(ds)

The dataset viewer automatically generates the metadata in Croissant format (JSON-LD) for every dataset on the Hugging Face Hub. It lists the dataset’s name, description, URL, and the distribution of the dataset as Parquet files, including the columns’ metadata. The Croissant metadata is available for all the datasets that can be converted to Parquet format.
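
For datasets the viewer can convert, the JSON-LD can be fetched from the Hub API; a minimal sketch (the /croissant endpoint is my reading of the Hub docs, and it returns an error for datasets that cannot be converted to Parquet, like script-based ones):

import requests

# Fetch the auto-generated Croissant (JSON-LD) metadata for a dataset.
# This fails while the viewer cannot convert the dataset to Parquet.
resp = requests.get("https://huggingface.co/api/datasets/yinyue27/RefRef/croissant")
resp.raise_for_status()
croissant = resp.json()
print(croissant.get("name"), croissant.get("description"))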


Hi John666,

Thanks for looking into my problem! I think if you modify the script a little by adding the custom 'scene' keyword argument, it will load the dataset correctly:

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef",  split="textured_sphere_scene", name="multiple-non-convex", scene="beaker", trust_remote_code=True) 
print(ds)

It worked!

from datasets import load_dataset
ds = load_dataset("yinyue27/RefRef", split="textured_sphere_scene", name="multiple-non-convex", scene="beaker", trust_remote_code=True)
print(ds)
#Generating textured_sphere_scene split: 300 examples [00:06, 47.17 examples/s]
#Generating textured_cube_scene split: 300 examples [00:05, 55.87 examples/s]
#Dataset({
#    features: ['image', 'depth', 'mask', 'transform_matrix', 'rotation'],
#    num_rows: 300
#})

#ds.push_to_hub("yinyue27/RefRef_parquet") # works with the Dataset Viewer, but only pushes part of the dataset...
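
If you want the whole dataset mirrored rather than one scene, a loop along these lines might work, assuming you enumerate the real scene names and your datasets version supports config_name in push_to_hub (the scene list below is a placeholder):

from datasets import load_dataset

# Hypothetical loop pushing each scene as its own config.
scenes = ["ball", "beaker"]  # placeholder: fill in the actual scene names
for scene in scenes:
    ds = load_dataset("yinyue27/RefRef", name="multiple-non-convex",
                      scene=scene, split="textured_sphere_scene",
                      trust_remote_code=True)
    ds.push_to_hub("yinyue27/RefRef_parquet", config_name=scene,
                   split="textured_sphere_scene")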

Yes, so I assume the data loader works for the dataset, then. Any clues on how I can generate the Croissant metadata? 🤔


The Croissant metadata on the HF Hub is generated automatically for datasets in supported formats like Parquet or ImageFolder (a folder of images plus a metadata file). If you convert your dataset to Parquet, or structure it as an ImageFolder, the Croissant metadata will be available.

There is no way to automatically get a Croissant metadata file for a dataset based on a Python script.
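
For the ImageFolder route, the key piece is a metadata file whose file_name column links each row to an image in the same folder; a sketch of generating one (the extra column here is a placeholder for your per-image fields):

import json
from pathlib import Path

# Write a metadata.jsonl next to the images so ImageFolder picks up the
# extra columns; 'rotation' is a placeholder for the real per-image data.
split_dir = Path("RefRef_imagefolder/train")
split_dir.mkdir(parents=True, exist_ok=True)
rows = [{"file_name": "r_0.png", "rotation": 0.0}]
with open(split_dir / "metadata.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")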


Hi John, I’m now trying to load my dataset and push it to a new dataset repo with push_to_hub. This is the script I’m using:

from datasets import load_dataset

dataset = load_dataset(
    # path="eztao/RefRef_test",
    path="yinyue27/RefRef",
    name="single-convex",
    scene="ball",
    split="textured_sphere_scene",
    trust_remote_code=True
)

print(dataset)  # Should show the dataset structure

dataset.push_to_hub("eztao/RefRef_parquet")

But I’m getting this error:

Dataset({
   features: ['image', 'depth', 'mask', 'transform_matrix', 'rotation'],
   num_rows: 300
})
Traceback (most recent call last):
 File "/home/u7543832/PhD/DataBuilder.py", line 14, in <module>
   dataset.push_to_hub("eztao/RefRef_parquet")
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5549, in push_to_hub
   additions, uploaded_size, dataset_nbytes = self._push_parquet_shards_to_hub(
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5349, in _push_parquet_shards_to_hub
   dataset_nbytes = self._estimate_nbytes()
                    ^^^^^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5177, in _estimate_nbytes
   table_visitor(table, extra_nbytes_visitor)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2378, in table_visitor
   _visit(table[name], feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2358, in _visit
   _visit(chunk, feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2362, in _visit
   function(array, feature)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5172, in extra_nbytes_visitor
   size = xgetsize(x["path"])
          ^^^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 769, in xgetsize
   size = fs.size(main_hop)
          ^^^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/fsspec/spec.py", line 696, in size
   return self.info(path).get("size", None)
          ^^^^^^^^^^^^^^^
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 727, in info
   _raise_file_not_found(path, None)
 File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 1136, in _raise_file_not_found
   raise FileNotFoundError(msg) from err
FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

It seems that I can load the dataset (I also plotted the images to make sure of it), and the file path looks correct, but I keep getting this error. Could you help me with it? Thanks!


(I’m assuming you’re passing the token using login() or something similar)
The version of the library in charge of serialization and uploading may be out of date.

pip install -U huggingface_hub

I used huggingface-cli login and generated a token to log in, and I’m still getting the error after using pip to update huggingface_hub (I’m already on the latest version, actually) 🄲


self.info(path).get("size", None)

It’s strange that it can’t find a size for that path… in other words, the path cannot be resolved.
I think it’s a bug, but I wonder what kind of bug it is…
It’s probably looking for a local path, so it’s not related to networking, is it…?

If it happens with .save_to_disk(), it’s definitely a bug.


Well, I do get the same error when trying to run dataset.save_to_disk("./data/RefRef_test_ball") 😧, and I printed out main_hop to confirm that it’s a remote path rather than a local one: hf://datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

 File "/home/u7543832/PhD/DataBuilder.py", line 14, in <module>
    dataset.save_to_disk("./data/RefRef_test_ball")
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 1476, in save_to_disk
    dataset_nbytes = self._estimate_nbytes()
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5177, in _estimate_nbytes
    table_visitor(table, extra_nbytes_visitor)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2378, in table_visitor
    _visit(table[name], feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2358, in _visit
    _visit(chunk, feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/table.py", line 2362, in _visit
    function(array, feature)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 5172, in extra_nbytes_visitor
    size = xgetsize(x["path"])
           ^^^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 769, in xgetsize
    size = fs.size(main_hop)
           ^^^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/fsspec/spec.py", line 696, in size
    return self.info(path).get("size", None)
           ^^^^^^^^^^^^^^^
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 727, in info
    _raise_file_not_found(path, None)
  File "/home/u7543832/anaconda3/lib/python3.12/site-packages/huggingface_hub/hf_file_system.py", line 1136, in _raise_file_not_found
    raise FileNotFoundError(msg) from err
FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

If this is a bug, I’ll have to generate the Croissant metadata manually, then. 🧐 Unfortunately, I couldn’t find any guide on doing this; any advice?


I couldn’t find documentation for writing Croissant metadata manually either…
There is the source code, though…
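
If it comes to writing the file by hand, Croissant is just JSON-LD, so a skeleton can be drafted directly. This is only my reading of the spec’s examples (the @context is abbreviated and all values are placeholders), not a validated file:

import json

# Hand-rolled Croissant skeleton: abbreviated @context, placeholder values.
croissant = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    "name": "RefRef",
    "description": "3D reconstruction image dataset with per-scene splits.",
    "url": "https://huggingface.co/datasets/yinyue27/RefRef",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "repo",
            "name": "repo",
            "contentUrl": "https://huggingface.co/datasets/yinyue27/RefRef",
            "encodingFormat": "git+https",
        }
    ],
}

with open("croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)

The mlcroissant package from the croissant repository also provides Python classes for building and validating metadata, which may be safer than hand-rolling the JSON.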

For now, I think I’ve figured out where the bug is occurring: the relative path is causing problems.

# FileNotFoundError: datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere/./train/r_0.png

# /ball_sphere/./train/r_0.png  <= the path concatenation fails here: a stray './' is left in the middle
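
If that’s right, a hypothetical fix inside the loader script would be to normalize the relative part before joining, so the './' never ends up in the stored path (base and rel below stand in for whatever the script actually concatenates):

import posixpath

# Normalize only the relative segment; normalizing the full hf:// URL
# would collapse its '//' and break the scheme.
base = "hf://datasets/yinyue27/RefRef@main/image_data/textured_sphere_scene/single-convex/ball_sphere"
rel = "./train/r_0.png"
path = posixpath.join(base, posixpath.normpath(rel))
print(path)  # ends in .../ball_sphere/train/r_0.png, without the './'

If editing the script isn’t an option, re-encoding the image column to bytes (for example, a ds.map that loads and returns each image) might let push_to_hub skip the remote size lookup entirely, though I haven’t verified that here.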