DatasetDict.map generated very big cache files from relatively small data

I’ve created a DatasetDict object out of three pandas DataFrames, one each for “train”, “validation” and “test”. The three DataFrames (or rather the original files) take 1.3 GB, 700 MB and 800 MB respectively.

import os

import pandas as pd
from datasets import Dataset, DatasetDict

data_dict = {ds: pd.read_json(os.path.join(data_path, f'{ds}_data.json')) for ds in ['train', 'valid', 'test']}
dataset_dict = DatasetDict({k: Dataset.from_pandas(v, split=k, preserve_index=False) for k, v in data_dict.items()})
dataset_dict = dataset_dict.remove_columns(['event_datetime'])
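
For reference, one way to compare the on-disk sizes above with the Arrow data that map will later rewrite is to look at the size of each split’s backing Arrow table (a quick sketch; attribute availability may vary across datasets versions):

for name, ds in dataset_dict.items():
    # ds.data is the Arrow table backing the split
    print(name, ds.num_rows, 'rows,', round(ds.data.nbytes / 1e9, 2), 'GB in Arrow format')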

After creating the DatasetDict object, I called two map functions on it. The first one tokenizes two columns:

def tokenize(batch):
    # tokenizer and MAX_LEN are defined elsewhere; every row is padded to max_length
    return tokenizer(batch["title"], batch["abstract"], return_tensors='pt', truncation=True, padding='max_length', max_length=MAX_LEN)

The other pads and masks another column:

import ast

import torch
import torch.nn.functional as F

def pad_job_id_sequence(batch):
    var_job_id_sequences = [torch.tensor(ast.literal_eval(x)) for x in batch['job_id_sequence']]

    # Pad the first sequence up to MAX_JOB_ID_SEQ so that pad_sequence below
    # brings every sequence in the batch to that fixed length
    padding_len = MAX_JOB_ID_SEQ - var_job_id_sequences[0].numel()
    var_job_id_sequences[0] = F.pad(var_job_id_sequences[0], (0, padding_len), mode='constant', value=0)
    # Pad the sequences in the batch to the same length
    padded_job_id_sequences = torch.nn.utils.rnn.pad_sequence(var_job_id_sequences, batch_first=True, padding_value=0)
    # Create a boolean mask indicating the padded positions
    padded_job_id_sequences_padding_mask = padded_job_id_sequences.eq(0)

    return {'padded_job_id_sequences': padded_job_id_sequences, 'padded_job_id_sequences_padding_mask': padded_job_id_sequences_padding_mask}
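
For illustration, here is what this produces on a tiny made-up batch (with a hypothetical MAX_JOB_ID_SEQ = 4): every row comes out MAX_JOB_ID_SEQ integers wide, plus a boolean mask of the same shape.

MAX_JOB_ID_SEQ = 4  # hypothetical value, just for this example
toy_batch = {'job_id_sequence': ['[3, 5, 7]', '[2]']}
out = pad_job_id_sequence(toy_batch)
# out['padded_job_id_sequences']              -> shape (2, 4): [[3, 5, 7, 0], [2, 0, 0, 0]]
# out['padded_job_id_sequences_padding_mask'] -> [[False, False, False, True], [False, True, True, True]]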

The two map functions are called sequentially, each on its own columns.

map_batch_size = 14
num_proc = 14
writer_batch_size = 500

cache_paths = [f'{proj_root_path}/tmp_data', f'{proj_root_path}/tmp_data/tokenize', f'{proj_root_path}/tmp_data/pad_job_id_sequence']
for path in cache_paths:
    try:
        os.mkdir(path)
    except OSError as error:
        print(error)


dataset_dict = dataset_dict.map(
    tokenize,
    batched=True,
    batch_size=map_batch_size,
    num_proc=num_proc,
    cache_file_names={k: f'{cache_paths[1]}/{k}_cache.arrow' for k in dataset_dict.keys()},
    writer_batch_size=writer_batch_size,
    desc='tokenize'
)
dataset_dict = dataset_dict.map(
    pad_job_id_sequence,
    batched=True,
    batch_size=map_batch_size,
    num_proc=num_proc,
    cache_file_names={k: f'{cache_paths[2]}/{k}_cache.arrow' for k in dataset_dict.keys()},
    writer_batch_size=writer_batch_size,
    desc='pad_job_id_sequence'
)

print('Map execution completed.')

After these two map calls finished, even though I removed the few original columns used by the maps, the cache files were still very big, around 40–50 GB!
After that, I executed dataset_dict.save_to_disk(dataset_dict_path=dataset_dict_s3_path) to save the DatasetDict, and the saved files are still very big: 15 GB (train), 5 GB (validation) and 5 GB (test).

Going from fairly small pandas DataFrames to such a large DatasetDict object, is this normal? Did I do anything wrong?

I padded and masked every row to the max length. Is this the cause of the large cache and final saved files?
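
As a rough back-of-envelope check of why max-length padding blows up the size (the numbers below are illustrative assumptions, not my actual values):

# Hypothetical figures, just to show the order of magnitude
MAX_LEN = 512          # tokenizer padding length (assumed)
n_rows = 3_000_000     # total rows across splits (assumed)
bytes_per_int = 8      # token ids stored as 64-bit integers

# input_ids + attention_mask (+ token_type_ids) each padded to MAX_LEN,
# before even counting the padded job_id_sequence column and its mask
per_row = 3 * MAX_LEN * bytes_per_int
print(per_row * n_rows / 1e9, 'GB')   # ≈ 37 GB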

Storing multiple lists of integers that are all padded to the maximum length will indeed make the dataset much bigger. Consider not padding the data when preprocessing the dataset, and instead applying padding on the fly during training using a data collator.
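
For example, something along these lines (a minimal sketch, assuming a PyTorch DataLoader; your job_id_sequence column would still need its own collate logic):

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

def tokenize(batch):
    # no padding here: each row keeps its natural length on disk
    return tokenizer(batch["title"], batch["abstract"], truncation=True, max_length=MAX_LEN)

dataset_dict = dataset_dict.map(tokenize, batched=True)

# Pads each batch to the longest sequence in that batch, at load time
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_loader = DataLoader(
    dataset_dict['train'].with_format('torch', columns=['input_ids', 'attention_mask']),
    batch_size=32,
    collate_fn=data_collator,
)

Padding to the longest item in each batch instead of to a global MAX_LEN is what keeps the stored dataset small.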

Are you referring to “dynamic padding”? Is dynamic padding the preferred or standard way to pad inputs for transformer models?

It’s pretty standard, yes, and it helps when you want to save disk space :slight_smile: