SageMaker FastFileMode, dataset streaming and memory mapping

Hey,

I have a scenario where I’ll need to run distributed training on SageMaker. A couple of questions on integration with Fast File mode, IterableDatasets, memory mapping and performance:

  1. With streaming=True, is the dataset memory-mapped, since it’s not actually on disk to map to/from? If not, is streaming less performant than loading from memory-mapped files, as indicated here?
  2. FastFileMode on SageMaker exposes S3 objects as if they were on local disk, but they are actually streamed on demand as they are accessed. If using a standard Dataset, I imagine each file needs to be streamed from S3 via FastFile in its entirety before it can be memory mapped, is that correct? In that case, if using a standard Dataset, should I skip FastFile to avoid this two-step process and just download all the data upfront?
  3. Is native HF dataset streaming performant compared to multi-process, byte-range fetches directly from a large file on S3 or other object storage? I see IterableDataset does not support multiple workers.
  4. On this same line, if Dataset is based on the Arrow format, why doesn’t IterableDataset allow streaming Arrow files (according to docs) from remote storage, or loading Arrow files progressively from a local file? Is there a fundamental limitation on this, or would it just not provide better performance in theory over progressively loading a JSON or CSV file?

Thanks in advance. @lhoestq

Couldn’t add the final 2 links: streaming docs and progressively loading a local file.

Hi!

  1. With streaming=True, the data is streamed directly from the source. So it doesn’t use memory mapping, which is the mechanism we use to load cached datasets in Arrow format.
  2. I don’t know, but I would bet that memory mapping would be quite slow with FastFile mode (similar to FUSE, for example).
  3. It’s super fast, especially from HF or HTTP URLs. There is also experimental support for S3, which uses the fsspec s3fs implementation; that is OK but not the fastest, afaik.
  4. IterableDataset does allow streaming Arrow files from local storage, from HF or from HTTP. It’s faster than reading JSON or CSV, since no parsing/deserialization is needed.
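
To make the difference in point 1 concrete, here is a minimal stdlib-only sketch of the two access patterns. It is not how the datasets library is implemented: the fixed-width record layout, the `RECORD_SIZE` constant and all function names are assumptions made up for illustration. Memory mapping gives random access to a file that is already on local disk, while streaming only ever reads forward and never needs the whole file at once.

```python
import itertools
import mmap
import os
import tempfile

RECORD_SIZE = 16  # hypothetical fixed-width record size, in bytes


def write_records(path, n):
    """Write n fixed-width records so any record can be located by offset."""
    with open(path, "wb") as f:
        for i in range(n):
            # "record-" (7 bytes) + 8 digits + newline = 16 bytes
            f.write(f"record-{i:08d}\n".encode("ascii"))


def read_record_mmap(path, index):
    """Random access: map the local file and slice out a single record.

    The OS pages in only the region that is touched, but the file has to
    be available through the local filesystem to be mapped at all.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * RECORD_SIZE
            return mm[start:start + RECORD_SIZE].decode("ascii").rstrip("\n")


def stream_records(path):
    """Sequential access: yield records one by one, the way a stream would."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD_SIZE)
            if not chunk:
                break
            yield chunk.decode("ascii").rstrip("\n")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.bin")
        write_records(path, 1000)

        # Jump straight to record 742 without reading the 741 before it.
        random_hit = read_record_mmap(path, 742)

        # Take the first three records from the stream, then stop early.
        stream = stream_records(path)
        first_three = list(itertools.islice(stream, 3))
        stream.close()  # release the file handle held by the generator

        return random_hit, first_three


if __name__ == "__main__":
    print(demo())
```

The trade-off mirrors the answers above: the mapped read can jump to any index but presupposes a local file, whereas the streamed read works on anything that can be read front to back but gives up cheap random access.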