I processed my dataset into several shards. To load them as a single dataset I can concatenate them, but indexing all of the files takes a while. Is there a quicker way to load the dataset, such as memory-mapping it directly from the shards?