I have a scenario where I’ll need to run distributed training on SageMaker. A couple of questions on integration with Fast File mode, `IterableDataset`s, memory mapping, and performance:

- With `streaming=True`, is the dataset memory-mapped? I assume not, since the data is never actually on disk to map to/from. If not, is streaming less performant than loading from memory-mapped files, as indicated here?
- `FastFile` mode on SageMaker exposes S3 objects as if they were on local disk, but they are actually streamed on demand as they are accessed. If using a standard `Dataset`, I imagine each file needs to be streamed from S3 via `FastFile` in its entirety before it can be memory-mapped, is that correct? In that case, when using a standard `Dataset`, should I avoid `FastFile` to sidestep this two-step process and just download all the data upfront?
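For reference, the two input configurations I’m weighing look roughly like this (a sketch using the SageMaker Python SDK’s `TrainingInput`; the bucket and prefix are placeholders, and this is a config fragment rather than something I’ve run):

```python
from sagemaker.inputs import TrainingInput

# Option A: FastFile mode -- S3 objects appear on local disk,
# but bytes are streamed from S3 on access
fast_file_input = TrainingInput(
    s3_data="s3://my-bucket/my-prefix/",  # placeholder
    input_mode="FastFile",
)

# Option B: File mode -- everything is downloaded to the
# training instance upfront before the job starts
file_input = TrainingInput(
    s3_data="s3://my-bucket/my-prefix/",  # placeholder
    input_mode="File",
)
```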
- Is native HF dataset streaming performant compared to multi-process, byte-range fetches directly from a large file on S3 or other object storage? I see `IterableDataset` does not support multiple workers.
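By parallel byte-range fetches I mean this pattern (sketched here against a local file with stdlib threads so it’s self-contained; against S3 each read would be a ranged `GetObject` call instead):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Self-contained stand-in for a large S3 object: a 1 MiB local file
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(os.urandom(1 << 20))
tmp.close()

def read_range(start, size, path=tmp.name):
    # Against S3 this would be a GET with a Range header, e.g.
    # s3.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{start+size-1}")
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(size)

chunk = 1 << 18  # fetch in 256 KiB ranges
total = os.path.getsize(tmp.name)
offsets = range(0, total, chunk)

# Fetch all ranges concurrently, then reassemble in order
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda off: read_range(off, chunk), offsets))

assert b"".join(parts) == open(tmp.name, "rb").read()
print(len(parts))
```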
- Along the same lines: if `Dataset` is based on the Arrow format, why doesn’t `IterableDataset` allow streaming Arrow files from remote storage (according to the docs), or loading Arrow files progressively from a local file? Is there a fundamental limitation here, or would it just not provide better performance in theory over progressively loading a JSON or CSV file?
Thanks in advance. @lhoestq