I've created a DatasetDict object out of three Pandas DataFrames, one each for "train", "validation" and "test". The three DataFrames (or rather the original JSON files) take 1.3 GB, 700 MB and 800 MB respectively.
import os
import pandas as pd
from datasets import Dataset, DatasetDict
data_dict = {ds: pd.read_json(os.path.join(data_path, f'{ds}_data.json')) for ds in ['train', 'valid', 'test']}
dataset_dict = DatasetDict({k: Dataset.from_pandas(v, split=k, preserve_index=False) for k, v in data_dict.items()})
dataset_dict = dataset_dict.remove_columns(['event_datetime'])
After creating the DatasetDict object, I called two map functions on it. The first one is to tokenize two columns:
def tokenize(batch):
    return tokenizer(batch["title"], batch["abstract"], return_tensors='pt', truncation=True, padding='max_length', max_length=MAX_LEN)
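For context, this is roughly what the tokenizer output looks like on a toy batch (a sketch, assuming a BERT-style tokenizer; the tokenizer checkpoint and MAX_LEN = 512 below are placeholders, not my actual setup):
# Sketch: inspect the tokenize() output on a toy batch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # placeholder tokenizer
MAX_LEN = 512  # placeholder
out = tokenize({'title': ['a short title'], 'abstract': ['a short abstract']})
print(list(out.keys()))         # input_ids, token_type_ids, attention_mask
print(out['input_ids'].shape)   # torch.Size([1, 512]) -- every row carries MAX_LEN token ids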
The second one pads and masks another column:
import ast
import torch
import torch.nn.functional as F

def pad_job_id_sequence(batch):
    var_job_id_sequences = [torch.tensor(ast.literal_eval(x)) for x in batch['job_id_sequence']]
    # Pad the first sequence up to MAX_JOB_ID_SEQ so that pad_sequence below pads the whole batch to that fixed length
    padding_len = MAX_JOB_ID_SEQ - var_job_id_sequences[0].numel()
    var_job_id_sequences[0] = F.pad(var_job_id_sequences[0], (0, padding_len), mode='constant', value=0)
    # Pad the sequences in the batch to the same length
    padded_job_id_sequences = torch.nn.utils.rnn.pad_sequence(var_job_id_sequences, batch_first=True, padding_value=0)
    # Create a boolean mask indicating the padded positions
    padded_job_id_sequences_padding_mask = padded_job_id_sequences.eq(0)
    return {'padded_job_id_sequences': padded_job_id_sequences, 'padded_job_id_sequences_padding_mask': padded_job_id_sequences_padding_mask}
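On a toy batch this produces the following (a sketch; MAX_JOB_ID_SEQ = 4 is a placeholder just for the demo):
MAX_JOB_ID_SEQ = 4  # placeholder
out = pad_job_id_sequence({'job_id_sequence': ['[3, 7]', '[5, 1, 2]']})
print(out['padded_job_id_sequences'])
# tensor([[3, 7, 0, 0],
#         [5, 1, 2, 0]])
print(out['padded_job_id_sequences_padding_mask'])
# tensor([[False, False,  True,  True],
#         [False, False, False,  True]])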
The two map functions are applied sequentially, each operating on different columns:
map_batch_size = 14
num_proc = 14
writer_batch_size = 500
cache_file_path = f'{proj_root_path}/tmp_data'
cache_paths = [f'{proj_root_path}/tmp_data', f'{proj_root_path}/tmp_data/tokenize', f'{proj_root_path}/tmp_data/pad_job_id_sequence']
for path in cache_paths:
    try:
        os.mkdir(path)
    except OSError as error:
        print(error)
dataset_dict = dataset_dict.map(
    tokenize,
    batched=True,
    batch_size=map_batch_size,
    num_proc=num_proc,
    cache_file_names={k: f'{cache_paths[1]}/{k}_cache.arrow' for k in dataset_dict.keys()},
    writer_batch_size=writer_batch_size,
    desc='tokenize'
)
dataset_dict = dataset_dict.map(
    pad_job_id_sequence,
    batched=True,
    batch_size=map_batch_size,
    num_proc=num_proc,
    cache_file_names={k: f'{cache_paths[2]}/{k}_cache.arrow' for k in dataset_dict.keys()},
    writer_batch_size=writer_batch_size,
    desc='pad_job_id_sequence'
)
print('Map execution completed.')
After these two map calls finished, even though I removed the few original columns used by the map functions, the cache files are still very big: around 40 to 50 GB!
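For reference, this is how the surviving columns and the Arrow cache files backing each split can be listed (standard Dataset attributes):
# List the remaining columns and the cache files behind each split
for split, ds in dataset_dict.items():
    print(split, ds.column_names)
    print(split, [f['filename'] for f in ds.cache_files])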
After that I executed dataset_dict.save_to_disk(dataset_dict_path=dataset_dict_s3_path) to save the DatasetDict, and the final saved files are still very big: 15 GB (train), 5 GB (validation) and 5 GB (test).
Going from fairly small pandas DataFrames to a very large DatasetDict: is this normal? Did I do anything wrong?
I padded and masked every row to the max length. Is that the cause of the large cache and the final saved files?
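For context, here is a rough back-of-envelope estimate of the padded columns' footprint (a sketch; it assumes the token ids and job ids end up stored as 64-bit integers, which is the Arrow default unless the features are cast to something smaller, and the MAX_LEN / MAX_JOB_ID_SEQ values are placeholders):
# Sketch: rough per-row size of the fixed-length padded columns
MAX_LEN = 512          # placeholder
MAX_JOB_ID_SEQ = 128   # placeholder
bytes_per_row = (
    2 * MAX_LEN * 8        # input_ids + attention_mask as int64 (plus token_type_ids if the tokenizer emits it)
    + MAX_JOB_ID_SEQ * 8   # padded_job_id_sequences as int64 (the boolean mask is much smaller)
)
for split, ds in dataset_dict.items():
    print(split, ds.num_rows, f'~{ds.num_rows * bytes_per_row / 1e9:.1f} GB for the padded columns alone')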