I have a dataset with many duplicates. Here is a simple example.
from datasets import Dataset
ds = Dataset.from_dict({"val": [1, 1, 1, 2, 2, 2, 2]})
I want to collapse the duplicates into a form like "three 1s and four 2s" to save memory. But when loading a batch from the dataset, the library should still treat the dataset as usual: 1 would be loaded three times, and 2 four times. Is this possible with the Datasets library? Or is there some trick to achieve this?
datasets
uses the Arrow format under the hood, and it doesn't allow something like this afaik
In this case I would use GPT. Here is a solution that may help (not tested !!!):
from datasets import Dataset
import numpy as np
# Original dataset
ds = Dataset.from_dict({"val": [1, 1, 1, 2, 2, 2, 2]})
# Get the unique values and their counts
unique_vals, counts = np.unique(ds['val'], return_counts=True)
unique_dataset = Dataset.from_dict({"val": unique_vals, "count": counts})
# Custom Dataset class to simulate duplicates
class DuplicatesDataset:
    def __init__(self, unique_dataset):
        self.unique_dataset = unique_dataset
        self.vals = []
        for val, count in zip(unique_dataset['val'], unique_dataset['count']):
            self.vals.extend([val] * count)

    def __len__(self):
        return len(self.vals)

    def __getitem__(self, idx):
        return {"val": self.vals[idx]}
# Create the custom dataset
collapsed_ds = DuplicatesDataset(unique_dataset)
# Example usage
print(f"Dataset length: {len(collapsed_ds)}")
print(f"First item: {collapsed_ds[0]}")
print(f"Last item: {collapsed_ds[-1]}")
# Load a batch from the dataset
batch_size = 3
for i in range(0, len(collapsed_ds), batch_size):
    batch = [collapsed_ds[j] for j in range(i, min(i + batch_size, len(collapsed_ds)))]
    print(f"Batch {i // batch_size + 1}: {batch}")
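One caveat with the generated code: `self.vals` re-materializes every duplicate in memory, so it doesn't actually reduce memory use compared to the original list. A sketch of a run-length variant (stdlib only, names like `RunLengthDataset` are my own, not from the Datasets library) that stores each value once and expands lazily on access:

```python
import bisect
import itertools


class RunLengthDataset:
    """Stores each unique value once with its count; duplicates are
    reconstructed on the fly in __getitem__ instead of being stored."""

    def __init__(self, values, counts):
        self.values = list(values)
        # Cumulative end index of each run, e.g. counts [3, 4] -> [3, 7]
        self.cum = list(itertools.accumulate(counts))

    def __len__(self):
        return self.cum[-1] if self.cum else 0

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find which run this index falls into
        run = bisect.bisect_right(self.cum, idx)
        return {"val": self.values[run]}


collapsed_ds = RunLengthDataset([1, 2], [3, 4])
print(len(collapsed_ds))   # behaves like the 7-element dataset
print(collapsed_ds[0], collapsed_ds[6])
```

Since it implements `__len__` and `__getitem__`, it should also plug into an index-based loader the same way as the `DuplicatesDataset` above.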
Thank you. What do you mean by
I mean using ChatGPT as a code generator. Sometimes the generated code needs some light adjustments and then it works. I took your problem definition and fed it to ChatGPT. The result is above.