ArrowBasedBuilder versus GeneratorBasedBuilder

Could you please enumerate the pros and cons of these two dataset builder classes? I couldn’t find anything in the documentation. When would I prefer one over the other? Is ArrowBasedBuilder more performant for large datasets?

Thank you!

Yes, ArrowBasedBuilder is generally more performant, because datasets are saved in Arrow format.

Generating a dataset file from a GeneratorBasedBuilder requires converting the data to Arrow when writing to disk.

Therefore it’s a good idea to use the ArrowBasedBuilder whenever you have a big dataset and you’re able to load your data into Arrow tables using pyarrow.
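
For illustration, here’s a rough, minimal sketch of the two styles, assuming a simple CSV file at data/train.csv with text and label columns (the path, column names and class names are made up): a GeneratorBasedBuilder yields one Python dict per example and the library converts them to Arrow when writing, whereas an ArrowBasedBuilder yields pyarrow tables directly and skips that per-example conversion.

import datasets
import pyarrow.csv as pa_csv

class MyGeneratorBuilder(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string"), "label": datasets.Value("int64")})
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"path": "data/train.csv"})]

    def _generate_examples(self, path):
        # One Python dict per example; conversion to Arrow happens when writing to disk.
        with open(path, encoding="utf-8") as f:
            next(f)  # skip the header line
            for idx, line in enumerate(f):
                text, label = line.rstrip("\n").split(",")
                yield idx, {"text": text, "label": int(label)}

class MyArrowBuilder(datasets.ArrowBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string"), "label": datasets.Value("int64")})
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"path": "data/train.csv"})]

    def _generate_tables(self, path):
        # Yields pyarrow.Table objects, so no per-example conversion is needed.
        yield 0, pa_csv.read_csv(path)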

@lhoestq
I want to create a large dataset (>4M) that includes large images such as infographics. Could you advise on the best approach for this use case? It seems pyarrow struggles to handle large numbers of PIL images, and uploading to the Hub is also challenging. Currently, I’m thinking of just storing image paths as strings and loading them in the collate function; is this the best approach? Alternatively, I’m also considering using GeneratorBasedBuilder.

Hi! Is it struggling because it takes a lot of memory?

You can already store image paths and use the Image type (it decodes images on the fly):

from datasets import Dataset, Image

ds = Dataset.from_dict({"image": ["path/to/img0.png", ...]})
ds = ds.cast_column("image", Image())

and if you push_to_hub it will only load a maximum of 500MB in memory at a time to upload the Parquet files one by one.
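
For example, continuing from the snippet above (the repo id is just a placeholder, and 500MB matches the default shard size mentioned above):

# Uploads the dataset as Parquet shards of at most ~500MB each.
ds.push_to_hub("username/my-image-dataset", max_shard_size="500MB")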

@lhoestq

First of all, I want to express my sincere gratitude for your dedication to the community. I have three questions I’d like to ask:

  1. I’ve created an Arrow-format dataset with a list of images as a feature (here), using the DocStruct4M dataset, which is originally about 400GB. However, despite only changing the annotation format, the size has increased to over 800GB. Since the uncompressed images themselves take up less than 400GB, I don’t believe this is a compression issue. I’m curious about why this size increase occurs. I noticed that a similar implementation (here) only takes up around 300GB.

  2. I’m trying to process “a list of images” in an Arrow file. The ds.cast_column("image", Image()) method doesn’t seem to work in this case. Could you advise on how to handle this?
    I’m wondering if my current approach is correct (a self-contained version is sketched after this list): my_dataset = dataset_dict['validation'].cast_column("images", [datasets.Image()])

  3. Looking at datasets.Image, I notice that when encode=False, it still reads the file path using PIL. I’m wondering what advantages datasets.Image offers compared to simply storing file paths as strings.
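
For question 2, here is a minimal, self-contained sketch of what I’m attempting (the file paths are placeholders):

from datasets import Dataset, Image, Sequence

# Each row holds a list of image file paths.
ds = Dataset.from_dict({"images": [["page0.png", "page1.png"], ["page2.png"]]})
# Cast the list column so each element is decoded as an image on access.
ds = ds.cast_column("images", Sequence(Image()))
# Presumably the feature spec [Image()] is equivalent here.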

Thank you again for your time and assistance.
