I've created a DatasetDict object out of three Pandas DataFrames, one each for "train", "validation" and "test". The three DataFrames (or rather the original JSON files) take 1.3 GB, 700 MB and 800 MB respectively.
import os
import pandas as pd
from datasets import Dataset, DatasetDict
data_dict = {ds: pd.read_json(os.path.join(data_path, f'{ds}_data.json')) for ds in ['train', 'valid', 'test']}
dataset_dict = DatasetDict({k: Dataset.from_pandas(v, split=k, preserve_index=False) for k, v in data_dict.items()})
dataset_dict = dataset_dict.remove_columns(['event_datetime'])
After creating the DatasetDict object, I called two map functions on it. The first one is to tokenize two columns:
def tokenize(batch):
    return tokenizer(batch["title"], batch["abstract"], return_tensors='pt', truncation=True, padding='max_length', max_length=MAX_LEN)
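For context, this is roughly what the tokenizer output looks like on a toy batch (a sketch, assuming a BERT-style tokenizer; the tokenizer checkpoint and MAX_LEN = 512 below are placeholders, not my actual setup):
# Sketch: inspect the tokenize() output on a toy batch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # placeholder tokenizer
MAX_LEN = 512  # placeholder
out = tokenize({'title': ['a short title'], 'abstract': ['a short abstract']})
print(list(out.keys()))         # input_ids, token_type_ids, attention_mask
print(out['input_ids'].shape)   # torch.Size([1, 512]) -- every row carries MAX_LEN token ids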
The second one pads and masks another column:
import ast
import torch
import torch.nn.functional as F

def pad_job_id_sequence(batch):
    var_job_id_sequences = [torch.tensor(ast.literal_eval(x)) for x in batch['job_id_sequence']]
    # Pad the first sequence up to MAX_JOB_ID_SEQ so that pad_sequence below pads the whole batch to that fixed length
    padding_len = MAX_JOB_ID_SEQ - var_job_id_sequences[0].numel()
    var_job_id_sequences[0] = F.pad(var_job_id_sequences[0], (0, padding_len), mode='constant', value=0)
    # Pad the sequences in the batch to the same length
    padded_job_id_sequences = torch.nn.utils.rnn.pad_sequence(var_job_id_sequences, batch_first=True, padding_value=0)
    # Create a boolean mask indicating the padded positions
    padded_job_id_sequences_padding_mask = padded_job_id_sequences.eq(0)
    return {'padded_job_id_sequences': padded_job_id_sequences, 'padded_job_id_sequences_padding_mask': padded_job_id_sequences_padding_mask}
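On a toy batch this produces the following (a sketch; MAX_JOB_ID_SEQ = 4 is a placeholder just for the demo):
MAX_JOB_ID_SEQ = 4  # placeholder
out = pad_job_id_sequence({'job_id_sequence': ['[3, 7]', '[5, 1, 2]']})
print(out['padded_job_id_sequences'])
# tensor([[3, 7, 0, 0],
#         [5, 1, 2, 0]])
print(out['padded_job_id_sequences_padding_mask'])
# tensor([[False, False,  True,  True],
#         [False, False, False,  True]])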
The two map functions are applied sequentially, each operating on different columns:
map_batch_size = 14
num_proc = 14
writer_batch_size = 500
cache_file_path = f'{proj_root_path}/tmp_data'
cache_paths = [f'{proj_root_path}/tmp_data', f'{proj_root_path}/tmp_data/tokenize', f'{proj_root_path}/tmp_data/pad_job_id_sequence']
for path in cache_paths:
    try:
        os.mkdir(path)
    except OSError as error:
        print(error)
dataset_dict = dataset_dict.map(
    tokenize,
    batched=True,
    batch_size=map_batch_size,
    num_proc=num_proc,
    cache_file_names={k: f'{cache_paths[1]}/{k}_cache.arrow' for k in dataset_dict.keys()},
    writer_batch_size=writer_batch_size,
    desc='tokenize'
)
dataset_dict = dataset_dict.map(
    pad_job_id_sequence,
    batched=True,
    batch_size=map_batch_size,
    num_proc=num_proc,
    cache_file_names={k: f'{cache_paths[2]}/{k}_cache.arrow' for k in dataset_dict.keys()},
    writer_batch_size=writer_batch_size,
    desc='pad_job_id_sequence'
)
print('Map execution completed.')
After these two map calls finished, even though I removed the few original columns used by the map functions, the cache files are still very big: around 40 to 50 GB!
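For reference, this is how the surviving columns and the Arrow cache files backing each split can be listed (standard Dataset attributes):
# List the remaining columns and the cache files behind each split
for split, ds in dataset_dict.items():
    print(split, ds.column_names)
    print(split, [f['filename'] for f in ds.cache_files])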
After that I executed dataset_dict.save_to_disk(dataset_dict_path=dataset_dict_s3_path) to save the DatasetDict, and the final saved files are still very big: 15 GB (train), 5 GB (validation) and 5 GB (test).
Going from fairly small pandas DataFrames to a very large DatasetDict: is this normal? Did I do anything wrong?
I padded and masked every row to the max length. Is that the cause of the large cache and the final saved files?
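For context, here is a rough back-of-envelope estimate of the padded columns' footprint (a sketch; it assumes the token ids and job ids end up stored as 64-bit integers, which is the Arrow default unless the features are cast to something smaller, and the MAX_LEN / MAX_JOB_ID_SEQ values are placeholders):
# Sketch: rough per-row size of the fixed-length padded columns
MAX_LEN = 512          # placeholder
MAX_JOB_ID_SEQ = 128   # placeholder
bytes_per_row = (
    2 * MAX_LEN * 8        # input_ids + attention_mask as int64 (plus token_type_ids if the tokenizer emits it)
    + MAX_JOB_ID_SEQ * 8   # padded_job_id_sequences as int64 (the boolean mask is much smaller)
)
for split, ds in dataset_dict.items():
    print(split, ds.num_rows, f'~{ds.num_rows * bytes_per_row / 1e9:.1f} GB for the padded columns alone')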