Prevent iterable dataset from consuming all the RAM

I've taken care of the disk space issue I was having, for now, but using an iterable dataset is completely eating my RAM (24 GB). The code I'm using for the iterable dataset:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("/kaggle/working/dataset/", streaming=True, keep_in_memory="map")
dataset = dataset.remove_columns(['__key__', '__url__', 'json'])
dataset = dataset.map(process_img, batched=True, remove_columns=["jpg", "txt"], batch_size=100)

train_hf_dataset = dataset["train"]
train_hf_dataset = train_hf_dataset.shuffle(42, buffer_size=1_000)
train_hf_dataset = train_hf_dataset.with_format("torch")
train_dataloader = DataLoader(train_hf_dataset, batch_size=batch_size)
```

That’s strange… Maybe it’s an unknown memory leak or keep_in_memory is causing it…?
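
In case it helps narrow things down: with `streaming=True`, the `shuffle(buffer_size=...)` buffer is the main thing that should be holding samples in RAM — only `buffer_size` items at a time. Here's a minimal pure-Python sketch of that idea (an illustration of how a buffered shuffle keeps memory bounded, not the actual `datasets` internals):

```python
import random
from itertools import islice

def stream_samples(n):
    # Yield samples lazily instead of materializing the whole dataset.
    for i in range(n):
        yield {"id": i}

def buffered_shuffle(iterable, buffer_size, seed=42):
    # Approximate shuffle over a bounded buffer, like
    # IterableDataset.shuffle(buffer_size=...): at most `buffer_size`
    # samples are held in memory at any point.
    rng = random.Random(seed)
    it = iter(iterable)
    buf = list(islice(it, buffer_size))
    for item in it:
        idx = rng.randrange(len(buf))
        yield buf[idx]      # emit a random buffered sample...
        buf[idx] = item     # ...and replace it with the incoming one
    rng.shuffle(buf)        # drain whatever remains in the buffer
    yield from buf

# Streams 10_000 samples while only ever buffering 1_000 of them.
ids = [s["id"] for s in buffered_shuffle(stream_samples(10_000), buffer_size=1_000)]
```

If memory stays flat with a pipeline like this but blows up with yours, the extra usage is coming from somewhere else (e.g. `keep_in_memory`, or `process_img` accumulating data across batches).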