I am currently using the following code to save a checkpoint every 1000 rows:
import datasets
from datasets import Dataset

continue_from = 49000
save_for_every = 1000

# Resume from the dataset encoded so far
encoded_dataset = datasets.load_from_disk("encoded_rvl_cdip")

# data, preprocess_data, and features are defined earlier
for i in range(continue_from, data.shape[0], save_for_every):
    # Encode the next chunk of 1000 rows from the source DataFrame
    dataset = Dataset.from_pandas(data[i:i + save_for_every])
    encoded_subset = dataset.map(preprocess_data, remove_columns=dataset.column_names,
                                 features=features, batched=True, batch_size=2)
    # Append the new chunk, then re-save the whole dataset as a checkpoint
    encoded_dataset = datasets.concatenate_datasets([encoded_dataset, encoded_subset])
    print(f"Saving to encoded_rvl_cdip {i}...")
    encoded_dataset.save_to_disk("encoded_rvl_cdip_1")
    print("Done")
However, the problem is that the saved file grows larger on every iteration, because each checkpoint rewrites the entire concatenated dataset so far. Is there a way to append to a saved file instead of having to overwrite the whole thing?
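For example, something along these lines is what I have in mind: inside the loop, save only the freshly encoded chunk to its own shard directory, and stitch the shards back together when loading. (The shard_dir layout and naming here are just illustrative of what I'm imagining, not an existing feature I know of.)

import os
import datasets

shard_dir = "encoded_rvl_cdip_shards"  # illustrative shard layout

# Inside the loop: persist only the new chunk instead of the full dataset
encoded_subset.save_to_disk(os.path.join(shard_dir, f"shard_{i:08d}"))  # zero-padded so names sort in order

# Later, rebuild the full dataset from the shards
shards = [datasets.load_from_disk(os.path.join(shard_dir, name))
          for name in sorted(os.listdir(shard_dir))]
full_dataset = datasets.concatenate_datasets(shards)

Would that kind of shard-per-checkpoint approach (or something built into datasets that I've missed) be the right way to do this?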