Saving checkpoints

I am currently using the following code to save checkpoints every 1000 rows:

import datasets
from datasets import Dataset

# data (a pandas DataFrame), preprocess_data and features are defined earlier
continue_from = 49000
save_for_every = 1000

# resume from the previously saved encoded dataset
encoded_dataset = datasets.load_from_disk("encoded_rvl_cdip")
for i in range(continue_from, data.shape[0], save_for_every):
    # encode the next chunk of rows
    dataset = Dataset.from_pandas(data[i:i + save_for_every])
    encoded_subset = dataset.map(preprocess_data, remove_columns=dataset.column_names, features=features,
                                 batched=True, batch_size=2)
    # append the chunk and overwrite the full saved copy
    encoded_dataset = datasets.concatenate_datasets([encoded_dataset, encoded_subset])
    print(f"Saving to encoded_rvl_cdip_1 {i}...")
    encoded_dataset.save_to_disk("encoded_rvl_cdip_1")

print("Done")

However, the problem is that the file I am saving gets incrementally larger with each iteration. Is there a way to append to a saved file instead of having to overwrite the whole thing?

Hi! When you save a dataset, it saves the whole thing. Therefore, at each iteration it writes both the data from encoded_rvl_cdip and all the chunks of data you have loaded from data up to that point.

So this is not exactly what you want: at each iteration it rewrites a very large dataset that keeps getting bigger and bigger.

What you can do instead is save each new chunk as its own dataset, and then, when you need the full dataset, load your original dataset and each chunk and concatenate them all together (see the sketch below).
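Here is a minimal sketch of that approach, assuming the same data, preprocess_data and features objects from your snippet; the chunk directory names (encoded_rvl_cdip_chunk_{i}) are just an illustration, not a datasets convention:

import datasets
from datasets import Dataset

continue_from = 49000
save_for_every = 1000

# save each encoded chunk to its own directory instead of rewriting everything
for i in range(continue_from, data.shape[0], save_for_every):
    dataset = Dataset.from_pandas(data[i:i + save_for_every])
    encoded_subset = dataset.map(preprocess_data, remove_columns=dataset.column_names, features=features,
                                 batched=True, batch_size=2)
    print(f"Saving chunk {i}...")
    encoded_subset.save_to_disk(f"encoded_rvl_cdip_chunk_{i}")

# later: load the original dataset plus every chunk and concatenate once
chunks = [datasets.load_from_disk(f"encoded_rvl_cdip_chunk_{i}")
          for i in range(continue_from, data.shape[0], save_for_every)]
full_dataset = datasets.concatenate_datasets(
    [datasets.load_from_disk("encoded_rvl_cdip")] + chunks)

This way each save_to_disk call only writes save_for_every rows, so the cost per iteration stays constant, and since load_from_disk memory-maps the Arrow files, the final concatenation is cheap.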