Hi,
I’ve been using 🤗 Datasets for a while and couldn’t find an elegant solution to transform list-like elements into separate rows.
I have a large source code corpus (>500 GB) that needs to be preprocessed for further language modeling. The corpus consists of source code files that are too long to be fed into non-sparse self-attention models, so I tokenize each file into chunks, which results in each row containing a list of chunks.
Tokenization looks like this:
from typing import Any, Dict
import datasets
from transformers import MBartTokenizerFast

tokenizer = MBartTokenizerFast.from_pretrained("./code_mbart_spm_60K")

def map_fn(example: Dict[str, Any]) -> Dict[str, Any]:
    # return_overflowing_tokens keeps every chunk instead of dropping the overflow
    encoding = tokenizer(example["text"], return_overflowing_tokens=True)
    del encoding["overflow_to_sample_mapping"]
    return {**encoding}

dataset = datasets.load_from_disk("./code_dataset")
dataset = dataset.map(map_fn, batched=False, remove_columns=dataset.column_names, num_proc=24)
example = dataset[0]  # {"input_ids": List[List[int]], "attention_mask": List[List[int]]}
After that, I need my dataset to be flat – i.e. each chunk should become a separate row in the dataset. So far I iterate over every example and every chunk, write the tokenized chunks into jsonlines files, then load the JSON files with 🤗 Datasets and save the result (roughly the sketch below).
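My current two-pass workaround looks roughly like this (a minimal sketch; the output filename and save path are just placeholders):

import json
import datasets

# 'dataset' is the tokenized dataset produced by the map() call above.
with open("flat_chunks.jsonl", "w") as f:
    for example in dataset:
        # Each row holds a list of chunks; write one JSON line per chunk.
        for input_ids, attention_mask in zip(example["input_ids"], example["attention_mask"]):
            f.write(json.dumps({"input_ids": input_ids, "attention_mask": attention_mask}) + "\n")

# Re-load the flattened jsonlines file and save it as an Arrow dataset again.
flat_dataset = datasets.load_dataset("json", data_files="flat_chunks.jsonl", split="train")
flat_dataset.save_to_disk("./code_dataset_flat")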
I can’t convert my tokenized dataset into a pandas DataFrame and apply the explode method due to RAM limitations.
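For reference, the in-memory route I can’t use would look roughly like this (a sketch, assuming a pandas version recent enough to explode multiple columns at once):

import datasets

# Materializes the whole tokenized table in RAM, which is exactly what doesn't fit.
df = dataset.to_pandas()
df = df.explode(["input_ids", "attention_mask"], ignore_index=True)
flat_dataset = datasets.Dataset.from_pandas(df)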
Does anybody know of an elegant workaround/solution to perform such a transformation?