What is the difference between copy.deepcopy and flatten_indices?

lhoestq · July 20, 2021, 9:50am

Hi !
Python's copy.deepcopy does what it says: it creates a copy of the dataset object.

flatten_indices, however, has a different goal.

First, you should know that shuffling or sharding a dataset doesn’t actually shuffle or shard the Arrow data on your disk. Instead, it creates an indices mapping that maps the user’s queries (e.g. dataset[0]) to the actual positions of the examples in the Arrow data on your disk. This makes shuffling and sharding really fast, and it saves disk space because no new Arrow data has to be written.
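To make the idea of an indices mapping concrete, here is a toy sketch in plain Python. It is not how the library is implemented: ToyDataset and its methods are hypothetical names, and a Python list stands in for the Arrow data on disk. It only shows that shuffling and sharding can be done by building a list of positions while the stored data stays untouched.

```python
import random


class ToyDataset:
    """Toy model: `data` plays the role of the Arrow data on disk."""

    def __init__(self, data, indices=None):
        self.data = data          # never reordered or copied
        self.indices = indices    # None means "read data in stored order"

    def __len__(self):
        return len(self.data) if self.indices is None else len(self.indices)

    def __getitem__(self, i):
        # Translate the user's position into a position in the stored data.
        pos = i if self.indices is None else self.indices[i]
        return self.data[pos]

    def select(self, positions):
        # Compose the new positions with any existing mapping.
        if self.indices is not None:
            positions = [self.indices[p] for p in positions]
        return ToyDataset(self.data, list(positions))

    def shuffle(self, seed):
        order = list(range(len(self)))
        random.Random(seed).shuffle(order)
        return self.select(order)

    def shard(self, num_shards, index):
        # Hypothetical strided sharding, chosen only to keep the toy short.
        return self.select(range(index, len(self), num_shards))


ds = ToyDataset(["a", "b", "c", "d", "e", "f"])
shuffled = ds.shuffle(seed=0)
shard0 = shuffled.shard(num_shards=2, index=0)

print(shuffled.data is ds.data)   # True: the stored data is shared, not rewritten
print(len(shard0))                # 3
```

Every lookup on shuffled or shard0 goes through one extra hop (the indices list), which is the cost that the mapping adds to each query.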

However, a shuffled/sharded dataset is a little slower to query because of this mapping. If you want to get the optimal speed back, you have to write your dataset to a new Arrow file with flatten_indices. The resulting dataset is loaded from this new Arrow file and no longer has an indices mapping, since all the examples in the Arrow file are already in the right order.

Hope that clarifies things, let me know if you have other questions !
