Could you please enumerate the pros and cons of both of these dataset builder classes? I couldn’t find anything in the documentation. When would I prefer one over the other? Is ArrowBasedBuilder more performant for large datasets?
Thank you!
Yes, ArrowBasedBuilder is more performant in general, because datasets are saved in Arrow format. Generating a dataset file from a GeneratorBasedBuilder requires converting the data to Arrow when writing to disk. It’s therefore a good idea to use ArrowBasedBuilder whenever you have a big dataset and you’re able to load your data into Arrow tables using pyarrow.