What is the difference between copy.deepcopy and flatten_indices?

Hi, to be honest, I’m not sure what exactly flatten_indices does. I was working with a dataset and got an error saying that I’m working on a shallow copy and should use flatten_indices. So my question is: what exactly does flatten_indices do, and what is the difference between using flatten_indices and Python’s copy.deepcopy?

import copy
dataset2 = copy.deepcopy(dataset1)

# or

dataset2 = dataset1.flatten_indices()

Hi !
Python deepcopy does what it says: it creates a copy of the dataset object.

However flatten_indices has another goal.

First, you should know that shuffling or sharding a dataset doesn’t actually shuffle or shard the Arrow data on your disk. Instead, it creates an indices mapping that maps the user’s queries (e.g. dataset[0]) to the actual position of the examples in the Arrow data on your disk. This makes shuffling and sharding really fast, and it saves disk space because no new Arrow data has to be written.
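To make the idea concrete, here is a toy sketch in plain Python. It is not how the library is implemented: the MappedView class and its rows/indices attributes are made up for illustration. It only shows how a shuffle can leave the stored rows untouched and build a list of positions instead.

```python
import random


class MappedView:
    """Toy stand-in for a dataset: rows stay put, an optional index list reorders them."""

    def __init__(self, rows, indices=None):
        self.rows = rows          # plays the role of the Arrow data on disk
        self.indices = indices    # plays the role of the indices mapping (None = no mapping)

    def __len__(self):
        return len(self.rows) if self.indices is None else len(self.indices)

    def __getitem__(self, i):
        # With a mapping, every query goes through one extra lookup.
        if self.indices is None:
            return self.rows[i]
        return self.rows[self.indices[i]]

    def shuffle(self, seed):
        # Shuffle positions only; the rows themselves are never moved or copied.
        order = list(range(len(self)))
        random.Random(seed).shuffle(order)
        if self.indices is not None:
            order = [self.indices[j] for j in order]
        return MappedView(self.rows, order)


view = MappedView(["a", "b", "c", "d"])
shuffled_view = view.shuffle(seed=0)

print(shuffled_view.rows is view.rows)  # the underlying rows are shared
print([shuffled_view[i] for i in range(len(shuffled_view))])
```

The shuffled view is cheap to create because only the list of positions is new, but each lookup now takes one extra hop through that list.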

However, a shuffled/sharded dataset is a little slower to query because of this mapping. If you want to get the optimal speed back, you have to write your dataset to a new Arrow file with flatten_indices. The resulting dataset will be loaded from this new Arrow file and won’t have an indices mapping anymore, since all the examples in the Arrow file will already be in the right order.

Hope that clarifies things, let me know if you have other questions !
