I have a dataset with many duplicates. Here is a simple example.
from datasets import Dataset
ds = Dataset.from_dict({"val": [1, 1, 1, 2, 2, 2, 2]})
I want to collapse the duplicates into a form like "three 1s and four 2s" to save memory. But when loading a batch from the dataset, the library should still treat the dataset as usual: 1 would be loaded three times, and 2 four times. Is this possible with the Datasets library? Or is there some trick to achieve this?
datasets
uses the Arrow format under the hood, and it doesn't allow something like this afaik
In this case I would use GPT. Here is a solution that may help (not tested !!!):
from datasets import Dataset
import numpy as np
# Original dataset
ds = Dataset.from_dict({"val": [1, 1, 1, 2, 2, 2, 2]})
# Get the unique values and their counts
unique_vals, counts = np.unique(ds['val'], return_counts=True)
unique_dataset = Dataset.from_dict({"val": unique_vals, "count": counts})
# Custom Dataset class to simulate duplicates
class DuplicatesDataset:
    def __init__(self, unique_dataset):
        self.unique_dataset = unique_dataset
        self.vals = []
        for val, count in zip(unique_dataset['val'], unique_dataset['count']):
            self.vals.extend([val] * count)

    def __len__(self):
        return len(self.vals)

    def __getitem__(self, idx):
        return {"val": self.vals[idx]}
# Create the custom dataset
collapsed_ds = DuplicatesDataset(unique_dataset)
# Example usage
print(f"Dataset length: {len(collapsed_ds)}")
print(f"First item: {collapsed_ds[0]}")
print(f"Last item: {collapsed_ds[-1]}")
# Load a batch from the dataset
batch_size = 3
for i in range(0, len(collapsed_ds), batch_size):
    batch = [collapsed_ds[j] for j in range(i, min(i + batch_size, len(collapsed_ds)))]
    print(f"Batch {i // batch_size + 1}: {batch}")
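One caveat with the generated code: `self.vals` re-materializes every duplicate in memory, so it doesn't actually reduce memory use compared to the original list. A sketch of a run-length variant (stdlib only, names like `RunLengthDataset` are my own, not from the Datasets library) that stores each value once and expands lazily on access:

```python
import bisect
import itertools


class RunLengthDataset:
    """Stores each unique value once with its count; duplicates are
    reconstructed on the fly in __getitem__ instead of being stored."""

    def __init__(self, values, counts):
        self.values = list(values)
        # Cumulative end index of each run, e.g. counts [3, 4] -> [3, 7]
        self.cum = list(itertools.accumulate(counts))

    def __len__(self):
        return self.cum[-1] if self.cum else 0

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find which run this index falls into
        run = bisect.bisect_right(self.cum, idx)
        return {"val": self.values[run]}


collapsed_ds = RunLengthDataset([1, 2], [3, 4])
print(len(collapsed_ds))   # behaves like the 7-element dataset
print(collapsed_ds[0], collapsed_ds[6])
```

Since it implements `__len__` and `__getitem__`, it should also plug into an index-based loader the same way as the `DuplicatesDataset` above.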
Thank you. What do you mean by
I mean using ChatGPT as a code generator. Sometimes the generated code needs some light adjustments and then it works. I took your problem definition and fed it to ChatGPT. The result is above.