I have a scenario where I’ll need to run distributed training on SageMaker. A couple of questions on integration with Fast File mode, `IterableDataset`s, memory mapping, and performance:

- With `streaming=True`, is the dataset memory-mapped? I assume not, since the data is never actually on disk to map to/from. If not, is streaming less performant than loading from memory-mapped files, as indicated here?
- `FastFile` mode on SageMaker exposes S3 objects as if they were on local disk, but they are actually streamed on demand as they are accessed. If using a standard `Dataset`, I imagine each file needs to be streamed from S3 via `FastFile` in its entirety before it can be memory-mapped, is that correct? In that case, when using a standard `Dataset`, should I avoid `FastFile` to sidestep this two-step process and just download all the data upfront?
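For reference, the two input configurations I’m weighing look roughly like this (a sketch using the SageMaker Python SDK’s `TrainingInput`; the bucket and prefix are placeholders, and this is a config fragment rather than something I’ve run):

```python
from sagemaker.inputs import TrainingInput

# Option A: FastFile mode -- S3 objects appear on local disk,
# but bytes are streamed from S3 on access
fast_file_input = TrainingInput(
    s3_data="s3://my-bucket/my-prefix/",  # placeholder
    input_mode="FastFile",
)

# Option B: File mode -- everything is downloaded to the
# training instance upfront before the job starts
file_input = TrainingInput(
    s3_data="s3://my-bucket/my-prefix/",  # placeholder
    input_mode="File",
)
```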
- Is native HF dataset streaming performant compared to multi-process, byte-range fetches directly from a large file on S3 or other object storage? I see `IterableDataset` does not support multiple workers.
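By parallel byte-range fetches I mean this pattern (sketched here against a local file with stdlib threads so it’s self-contained; against S3 each read would be a ranged `GetObject` call instead):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Self-contained stand-in for a large S3 object: a 1 MiB local file
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(os.urandom(1 << 20))
tmp.close()

def read_range(start, size, path=tmp.name):
    # Against S3 this would be a GET with a Range header, e.g.
    # s3.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{start+size-1}")
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(size)

chunk = 1 << 18  # fetch in 256 KiB ranges
total = os.path.getsize(tmp.name)
offsets = range(0, total, chunk)

# Fetch all ranges concurrently, then reassemble in order
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda off: read_range(off, chunk), offsets))

assert b"".join(parts) == open(tmp.name, "rb").read()
print(len(parts))
```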
- Along the same lines: if `Dataset` is based on the Arrow format, why doesn’t `IterableDataset` allow streaming Arrow files from remote storage (according to the docs), or loading Arrow files progressively from a local file? Is there a fundamental limitation here, or would it just not provide better performance in theory over progressively loading a JSON or CSV file?
Thanks in advance. @lhoestq