I loaded a dataset with load_from_disk using memory mapping (keep_in_memory=False), ran some operations on it such as oversampling, and then took a random 25% of the oversampled data. I want to move that 25% into memory.
I worked around it with Dataset.from_pandas(over-sampled_shard.to_pandas()), but oversampling repeats samples, so the resulting shard contains duplicated rows. I want to move this shard into memory efficiently, without paying the cost of multiple copies. Any ideas?
You can use .select(indices) to select samples multiple times without copying them.