Hi,
I’ve been using 🤗 Datasets for a while and couldn’t find an elegant solution to transform list-like elements into separate rows.
I have a large source code corpus (>500 GB) that needs to be preprocessed for further language modeling. The corpus consists of source code files that are too long to be fed into non-sparse self-attention models, so I tokenize each file into chunks, which results in each row containing a list of chunks.
Tokenization looks like this:
from typing import Any, Dict
import datasets
from transformers import MBartTokenizerFast

tokenizer = MBartTokenizerFast.from_pretrained("./code_mbart_spm_60K")

def map_fn(example: Dict[str, Any]) -> Dict[str, Any]:
    # return_overflowing_tokens keeps every chunk instead of dropping the overflow
    encoding = tokenizer(example["text"], return_overflowing_tokens=True)
    del encoding["overflow_to_sample_mapping"]
    return {**encoding}

dataset = datasets.load_from_disk("./code_dataset")
dataset = dataset.map(map_fn, batched=False, remove_columns=dataset.column_names, num_proc=24)
example = dataset[0]  # {"input_ids": List[List[int]], "attention_mask": List[List[int]]}
After that, I need my dataset to be flat – i.e. each chunk should become a separate row in the dataset. So far I iterate over every example and every chunk, write the tokenized chunks into jsonlines files, then load the JSON files with 🤗 Datasets and save the result (roughly the sketch below).
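My current two-pass workaround looks roughly like this (a minimal sketch; the output filename and save path are just placeholders):

import json
import datasets

# 'dataset' is the tokenized dataset produced by the map() call above.
with open("flat_chunks.jsonl", "w") as f:
    for example in dataset:
        # Each row holds a list of chunks; write one JSON line per chunk.
        for input_ids, attention_mask in zip(example["input_ids"], example["attention_mask"]):
            f.write(json.dumps({"input_ids": input_ids, "attention_mask": attention_mask}) + "\n")

# Re-load the flattened jsonlines file and save it as an Arrow dataset again.
flat_dataset = datasets.load_dataset("json", data_files="flat_chunks.jsonl", split="train")
flat_dataset.save_to_disk("./code_dataset_flat")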
I can’t convert my tokenized dataset into a pandas DataFrame and apply the explode method due to RAM limitations.
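For reference, the in-memory route I can’t use would look roughly like this (a sketch, assuming a pandas version recent enough to explode multiple columns at once):

import datasets

# Materializes the whole tokenized table in RAM, which is exactly what doesn't fit.
df = dataset.to_pandas()
df = df.explode(["input_ids", "attention_mask"], ignore_index=True)
flat_dataset = datasets.Dataset.from_pandas(df)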
Does anybody know of an elegant workaround/solution to perform such a transformation?