Collapse duplicates in a dataset and treat it as usual

I have a dataset with many duplicates. Here is a simple example.

from datasets import Dataset
ds = Dataset.from_dict({"val": [1, 1, 1, 2, 2, 2, 2]})

I want to collapse the duplicates into a form like "three 1s and four 2s" to save memory. When loading batches, the library should still treat it as the full dataset: 1 is loaded three times and 2 four times. Is this possible with the Datasets library, or is there some trick to achieve it?
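
For concreteness, the collapsed form I have in mind would be something like one row per distinct value plus a count column (just an illustration; the exact layout doesn't matter):

from datasets import Dataset

# Hypothetical collapsed representation: one row per distinct value plus its count
collapsed = Dataset.from_dict({"val": [1, 2], "count": [3, 4]})
print(collapsed[0])  # {'val': 1, 'count': 3}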

datasets uses the Arrow format under the hood, and it doesn't allow something like this afaik

In this case I would use GPT. Here is a solution that may help (not tested!!!):

from datasets import Dataset
import numpy as np

# Original dataset
ds = Dataset.from_dict({"val": [1, 1, 1, 2, 2, 2, 2]})

# Get the unique values and their counts
unique_vals, counts = np.unique(ds['val'], return_counts=True)
unique_dataset = Dataset.from_dict({"val": unique_vals, "count": counts})

# Custom dataset class that simulates the duplicates on top of the unique dataset
class DuplicatesDataset:
    def __init__(self, unique_dataset):
        self.unique_dataset = unique_dataset
        self.vals = []
        for val, count in zip(unique_dataset['val'], unique_dataset['count']):
            self.vals.extend([val] * count)

    def __len__(self):
        return len(self.vals)

    def __getitem__(self, idx):
        return {"val": self.vals[idx]}

# Create the custom dataset
collapsed_ds = DuplicatesDataset(unique_dataset)

# Example usage
print(f"Dataset length: {len(collapsed_ds)}")
print(f"First item: {collapsed_ds[0]}")
print(f"Last item: {collapsed_ds[-1]}")

# Load a batch from the dataset
batch_size = 3
for i in range(0, len(collapsed_ds), batch_size):
    batch = [collapsed_ds[j] for j in range(i, min(i + batch_size, len(collapsed_ds)))]
    print(f"Batch {i // batch_size + 1}: {batch}")

Thank you. What do you mean by "I would use GPT"?

I mean using ChatGPT as a code generator. Sometimes the generated code needs some light adjustments, and then it works. I took your problem definition and fed it to ChatGPT; the result is above.

Thank you.