ArrowBasedBuilder versus GeneratorBasedBuilder

Could you please list the pros and cons of these two dataset builder classes? I couldn’t find anything in the documentation. When would I prefer one over the other? Is ArrowBasedBuilder more performant for large datasets?

Thank you!

Yes, ArrowBasedBuilder is generally more performant, because datasets are saved in Arrow format.

Generating a dataset file from a GeneratorBasedBuilder requires converting the data to Arrow when writing to disk.

Therefore it’s a good idea to use the ArrowBasedBuilder whenever you have a big dataset and you’re able to load your data into Arrow tables using pyarrow.
