What is the difference between copy.deepcopy and flatten_indices?

SMMousavi · July 19, 2021, 9:05am

Hi, to be honest, I’m not sure what exactly flatten_indices does. I was working with a dataset and got an error saying that I was working on a shallow copy and should use flatten_indices. So my question is: what exactly does flatten_indices do, and what is the difference between using flatten_indices and Python’s copy.deepcopy?

import copy
dataset2 = copy.deepcopy(dataset1)

# or

dataset2 = dataset1.flatten_indices()

lhoestq · July 20, 2021, 9:50am

Hi !
Python deepcopy does what it says: it creates a copy of the dataset object.

However, flatten_indices has a different goal.

First you must know that dataset shuffling/sharding doesn’t actually shuffle or shard the Arrow data on your disk. Instead, it creates an indices mapping that maps the queries of the user (e.g. dataset[0]) to the actual position of the examples in the Arrow data on your disk. This makes shuffling and sharding really fast, and it saves disk space because no new Arrow data has to be written.
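To picture the indices mapping, here is a toy sketch in plain Python. It is not how 🤗 Datasets is implemented: `ToyDataset`, `rows` and `indices` are hypothetical names made up for illustration, and a plain list stands in for the Arrow data on disk.

```python
import random


class ToyDataset:
    """Toy stand-in for a dataset; `rows` plays the role of the Arrow data on disk."""

    def __init__(self, rows, indices=None):
        self.rows = rows        # never reordered or copied
        self.indices = indices  # None means "no indices mapping"

    def __len__(self):
        return len(self.rows) if self.indices is None else len(self.indices)

    def __getitem__(self, i):
        # With a mapping, every query goes through one extra lookup.
        if self.indices is not None:
            i = self.indices[i]
        return self.rows[i]

    def shuffle(self, seed):
        # Shuffle only the positions; the underlying rows are shared as-is.
        positions = list(range(len(self)))
        random.Random(seed).shuffle(positions)
        if self.indices is not None:
            positions = [self.indices[p] for p in positions]
        return ToyDataset(self.rows, positions)


original = ToyDataset(["a", "b", "c", "d"])
shuffled = original.shuffle(seed=0)

# Same underlying data, different view of it.
print(shuffled.rows is original.rows)
print([shuffled[i] for i in range(len(shuffled))])
```

In this toy model the shuffle only builds a list of positions, which is why it is cheap no matter how large `rows` is.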

However, a shuffled/sharded dataset is a little slower to query because of this mapping. If you want to get the optimal speed back, you have to write your dataset to a new Arrow file with flatten_indices. The resulting dataset will be loaded from this new Arrow file, and won’t have an indices mapping anymore since all the examples in the Arrow file will already be in the right order.
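Conceptually, flattening amounts to something like the sketch below. Again this is only a toy in plain Python, not the library's code: the lists `rows`, `indices` and `flat_rows` are hypothetical stand-ins for the original Arrow file, the indices mapping, and the newly written Arrow file.

```python
rows = ["a", "b", "c", "d"]  # stand-in for the original Arrow file
indices = [2, 0, 3, 1]       # stand-in for the mapping left by a shuffle

# Before flattening: every query needs the extra lookup through `indices`.
before = [rows[indices[i]] for i in range(len(indices))]

# "Flattening": write the rows out once in mapped order, then drop the mapping.
flat_rows = [rows[j] for j in indices]

# After flattening: queries read positions directly, with no mapping involved.
after = [flat_rows[i] for i in range(len(flat_rows))]

print(before == after)
```

The examples come back in the same order either way; the difference is that the flattened version paid the cost of writing new data once so that later queries can skip the lookup.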

Hope that clarifies things, let me know if you have other questions !
