RecordBatch size when creating an arrow dataset

Some questions about Arrow datasets.

From the topic "Try to read arrow files get: Invalid: Not an Arrow file" I learned that Arrow datasets are created as streaming datasets. After opening one .arrow file, I saw that each RecordBatch holds 1000 rows. During training I'm using DDP (via torchrun), and a single training batch is 32 items.
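For context, this is roughly how I inspected the file (a minimal sketch; the path is hypothetical, and I'm assuming the file is in the Arrow IPC stream format, which is what the linked topic suggests):

```python
import pyarrow as pa

# Hypothetical path to one of the dataset's .arrow files.
path = "dataset.arrow"

# Streaming-format files are opened with open_stream; each RecordBatch
# reports its own row count, which is where I saw the 1000-row batches.
with pa.memory_map(path, "r") as source:
    reader = pa.ipc.open_stream(source)
    for i, batch in enumerate(reader):
        print(f"record batch {i}: {batch.num_rows} rows")
```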

I'm wondering whether the reader ends up reading all 1000 rows for every training batch it processes, or whether there's some optimization that caches the RecordBatch and only slices out 32 rows at a time.
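To make the question concrete, this is the kind of zero-copy slicing I have in mind (a toy sketch, not necessarily what the library does internally):

```python
import pyarrow as pa

# Toy RecordBatch standing in for a 1000-row batch read from the file.
batch = pa.RecordBatch.from_pydict({"x": list(range(1000))})

# Slicing is zero-copy: the 32-row view shares the parent batch's buffers.
# This is the caching + slicing behaviour I'm asking about.
mini_batch = batch.slice(offset=0, length=32)
print(mini_batch.num_rows)  # 32
```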

Additionally, I'm assuming that the rows are sharded across the 7 workers I have right now, so that each worker reads an independent set of rows. However, I'd like to know whether there are any performance implications of using the streaming format, as opposed to the non-streaming (file) format, in terms of random access to batches.
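For reference, this is the sharding behaviour I'm assuming, sketched with `Dataset.shard()` and a hypothetical dataset; my actual setup is torchrun with 7 processes:

```python
import torch.distributed as dist
from datasets import load_dataset

# Hypothetical dataset; in my case the data lives in local .arrow files.
dataset = load_dataset("imdb", split="train")

# What I assume happens conceptually under DDP: each of the 7 workers keeps
# its own shard of the rows, indexed by its rank.
rank = dist.get_rank() if dist.is_initialized() else 0
shard = dataset.shard(num_shards=7, index=rank, contiguous=True)
print(len(shard))
```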

Thanks!
