I have a scenario where I’ll need to run distributed training on SageMaker. A couple of questions on integration with Fast File mode, `IterableDataset`s, memory mapping, and performance:

- With `streaming=True`, is the dataset memory-mapped? I assume not, since the data is never actually on disk to map to/from. If not, is streaming less performant than loading from memory-mapped files, as indicated here?
- `FastFile` mode on SageMaker exposes S3 objects as if they were on local disk, but they are actually streamed on demand as they are accessed. If using a standard `Dataset`, I imagine each file needs to be streamed from S3 via `FastFile` in its entirety before it can be memory-mapped, is that correct? In that case, when using a standard `Dataset`, should I avoid `FastFile` to sidestep this two-step process and just download all the data upfront?
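For reference, the two input configurations I’m weighing look roughly like this (a sketch using the SageMaker Python SDK’s `TrainingInput`; the bucket and prefix are placeholders, and this is a config fragment rather than something I’ve run):

```python
from sagemaker.inputs import TrainingInput

# Option A: FastFile mode -- S3 objects appear on local disk,
# but bytes are streamed from S3 on access
fast_file_input = TrainingInput(
    s3_data="s3://my-bucket/my-prefix/",  # placeholder
    input_mode="FastFile",
)

# Option B: File mode -- everything is downloaded to the
# training instance upfront before the job starts
file_input = TrainingInput(
    s3_data="s3://my-bucket/my-prefix/",  # placeholder
    input_mode="File",
)
```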
- Is native HF dataset streaming performant compared to multi-process, byte-range fetches directly from a large file on S3 or other object storage? I see `IterableDataset` does not support multiple workers.
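By parallel byte-range fetches I mean this pattern (sketched here against a local file with stdlib threads so it’s self-contained; against S3 each read would be a ranged `GetObject` call instead):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Self-contained stand-in for a large S3 object: a 1 MiB local file
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(os.urandom(1 << 20))
tmp.close()

def read_range(start, size, path=tmp.name):
    # Against S3 this would be a GET with a Range header, e.g.
    # s3.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{start+size-1}")
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(size)

chunk = 1 << 18  # fetch in 256 KiB ranges
total = os.path.getsize(tmp.name)
offsets = range(0, total, chunk)

# Fetch all ranges concurrently, then reassemble in order
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda off: read_range(off, chunk), offsets))

assert b"".join(parts) == open(tmp.name, "rb").read()
print(len(parts))
```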
- Along the same lines: if `Dataset` is based on the Arrow format, why doesn’t `IterableDataset` allow streaming Arrow files from remote storage (according to the docs), or loading Arrow files progressively from a local file? Is there a fundamental limitation here, or would it just not provide better performance in theory over progressively loading a JSON or CSV file?
Thanks in advance. @lhoestq