RecordBatch size when creating an arrow dataset

Some questions about Arrow datasets.

From the topic "Try to read arrow files get: Invalid: Not an Arrow file" I learned that Arrow datasets are created as streaming datasets. After opening one .arrow file, I saw that each RecordBatch holds 1000 rows. During training I'm using DDP (via torchrun), and a single training batch is 32 items.
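For context, this is roughly how I inspected the file (a minimal sketch; the path is hypothetical, and I'm assuming the file is in the Arrow IPC stream format, which is what the linked topic suggests):

```python
import pyarrow as pa

# Hypothetical path to one of the dataset's .arrow files.
path = "dataset.arrow"

# Streaming-format files are opened with open_stream; each RecordBatch
# reports its own row count, which is where I saw the 1000-row batches.
with pa.memory_map(path, "r") as source:
    reader = pa.ipc.open_stream(source)
    for i, batch in enumerate(reader):
        print(f"record batch {i}: {batch.num_rows} rows")
```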

I'm wondering whether the reader ends up reading all 1000 rows for every training batch it processes, or whether there's some optimization that caches the RecordBatch and only slices out 32 rows at a time.
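To make the question concrete, this is the kind of zero-copy slicing I have in mind (a toy sketch, not necessarily what the library does internally):

```python
import pyarrow as pa

# Toy RecordBatch standing in for a 1000-row batch read from the file.
batch = pa.RecordBatch.from_pydict({"x": list(range(1000))})

# Slicing is zero-copy: the 32-row view shares the parent batch's buffers.
# This is the caching + slicing behaviour I'm asking about.
mini_batch = batch.slice(offset=0, length=32)
print(mini_batch.num_rows)  # 32
```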

Additionally, I'm assuming that the rows are sharded across the 7 workers I have right now, so that each worker reads an independent set of rows. However, I'd like to know whether there are any performance implications of using the streaming format, as opposed to the non-streaming (file) format, in terms of random access to batches.
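For reference, this is the sharding behaviour I'm assuming, sketched with `Dataset.shard()` and a hypothetical dataset; my actual setup is torchrun with 7 processes:

```python
import torch.distributed as dist
from datasets import load_dataset

# Hypothetical dataset; in my case the data lives in local .arrow files.
dataset = load_dataset("imdb", split="train")

# What I assume happens conceptually under DDP: each of the 7 workers keeps
# its own shard of the rows, indexed by its rank.
rank = dist.get_rank() if dist.is_initialized() else 0
shard = dataset.shard(num_shards=7, index=rank, contiguous=True)
print(len(shard))
```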

Thanks!
