Transform list-like elements to rows

Hi,

I’ve been using 🤗/datasets for a while and couldn’t find an elegant way to transform list-like elements into separate rows.

I have a large source code corpus (>500 GB) that needs to be preprocessed for language modeling. The corpus consists of source code files that are too long to be fed into non-sparse self-attention models. So I tokenize the files into chunks, which leaves each row containing a list of chunks.

Tokenization looks like this:

from typing import Any, Dict

import datasets
from transformers import MBartTokenizerFast

tokenizer = MBartTokenizerFast.from_pretrained("./code_mbart_spm_60K")

def map_fn(example: Dict[str, Any]) -> Dict[str, Any]:
    # return_overflowing_tokens=True keeps the overflow as extra chunks of the same file
    encoding = tokenizer(example["text"], return_overflowing_tokens=True)
    # Drop the chunk-to-sample index; only the token columns are kept
    del encoding["overflow_to_sample_mapping"]
    return {**encoding}

dataset = datasets.load_from_disk("./code_dataset")
dataset = dataset.map(map_fn, batched=False, remove_columns=dataset.column_names, num_proc=24)
example = dataset[0]  # type: {"input_ids": List[List[int]], "attention_mask": List[List[int]]}

After that, I need my dataset to be flat, i.e. each chunk should be a separate row in the dataset. So far I iterate over each example and each chunk and write the tokenized chunks into jsonlines files. Then I load the JSON files with 🤗/datasets and save the results.

I can’t convert my tokenized dataset into a pandas dataframe and apply the explode method due to RAM limitations.

Does anybody know if there is an elegant workaround/solution to perform such a transformation?
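For reference, the jsonlines workaround described above might look roughly like the sketch below. The helper name `write_chunks_as_jsonl` and the toy data are made up for illustration, and the example writes to an in-memory buffer instead of a real file so that it runs on its own.

```python
import io
import json
from typing import Any, Dict, Iterable, TextIO


def write_chunks_as_jsonl(examples: Iterable[Dict[str, Any]], out: TextIO) -> int:
    """Write one JSON line per chunk and return the number of lines written."""
    written = 0
    for example in examples:
        # Each example holds parallel lists with one entry per chunk.
        for ids, mask in zip(example["input_ids"], example["attention_mask"]):
            out.write(json.dumps({"input_ids": ids, "attention_mask": mask}) + "\n")
            written += 1
    return written


# Toy stand-in for the tokenized dataset: two files, three chunks in total.
toy_examples = [
    {"input_ids": [[1, 2], [3, 4]], "attention_mask": [[1, 1], [1, 1]]},
    {"input_ids": [[5, 6]], "attention_mask": [[1, 1]]},
]
buffer = io.StringIO()
n_rows = write_chunks_as_jsonl(toy_examples, buffer)
lines = buffer.getvalue().splitlines()
```

With a real file in place of the buffer, the output could then be read back with `datasets.load_dataset("json", data_files=...)`. It works, but it costs an extra pass over the data and a second copy on disk.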


What if you try with batched=True?
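To illustrate why batched mode helps: with batched=True the mapped function receives a dict of columns and may return a different number of rows than it was given, so nested chunks can be spread out into separate rows. Below is a minimal sketch of such a function run on a toy batch. The name `flatten_batch` and the toy data are assumptions for illustration, not part of the original code.

```python
from typing import Any, Dict, List


def flatten_batch(batch: Dict[str, List[List[Any]]]) -> Dict[str, List[Any]]:
    """Turn every chunk of every example into its own row, column by column."""
    return {
        column: [chunk for chunks in rows for chunk in chunks]
        for column, rows in batch.items()
    }


# Toy batch shaped like the nested output in the question:
# two files, the first with two chunks and the second with one.
nested = {
    "input_ids": [[[1, 2], [3, 4]], [[5, 6]]],
    "attention_mask": [[[1, 1], [1, 1]], [[1, 1]]],
}
flat = flatten_batch(nested)
print(flat["input_ids"])  # [[1, 2], [3, 4], [5, 6]]

# On the already tokenized dataset this could presumably be applied as:
# dataset = dataset.map(flatten_batch, batched=True,
#                       remove_columns=dataset.column_names, num_proc=24)
```

Alternatively, the tokenizer itself can be called on the whole batch, e.g. `tokenizer(batch["text"], return_overflowing_tokens=True)` inside a batched map function. In that case the chunks should already come back as one flat list per column, so no separate flattening step would be needed.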


Thanks a lot, it works! 🤗