Dear community,
My objective is to have sequences of 1024 tokens ready to feed to GPT-2. As input data, I have DNA sequences of length 2e5, and I have built a BPE tokenizer on this data.
I am trying to build a function that combines two steps:
- Tokenize the full sequences (max_length of 2e5)
- Chunk the result into sequences of length 1024, which should increase the number of rows of my Hugging Face Dataset (see the sketch after this list)
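Conceptually, for a single sequence, the chunking I have in mind would look roughly like the minimal sketch below (assuming the BPE tokenizer returns a flat list of token IDs, and where one_dna_string stands for one of my raw 2e5-character sequences):

# minimal sketch of the intended chunking for one sequence
token_ids = fast_tokenizer(one_dna_string, max_length=200000, truncation=True)["input_ids"]
chunks = [token_ids[i:i + 1024] for i in range(0, len(token_ids), 1024)]
# e.g. 90,000 token IDs would become 88 chunks of up to 1024 tokens each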
For now I am trying to use this function, mapping it over the Dataset afterwards:
def tokenize_and_chunk(examples):
    chunks = []
    # tokenize the full sequence(s), truncated to a max_length of 200000
    tokenized_inputs = fast_tokenizer(
        examples['text'],
        max_length=200000,
        truncation=True
    )
    # slice the raw text into pieces of 1024 characters
    for sentence in examples['text']:
        chunks += [sentence[i:i + 1024] for i in range(0, len(sentence), 1024)]
    return {"chunks": chunks}
chunked_data = dna_data.map(tokenize_and_chunk, remove_columns=dna_data.column_names)
The problem is that I receive a list of lists just like this:
[['C', 'C', 'A', 'G', ...], ...]
where, for each example, the inner list contains 91170 single characters.
I can't figure out how I should correctly map this function. Does anyone have an idea of what the best practice for this would be?
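In case it helps clarify what I am aiming for, below is a rough sketch of the direction I suspect might be correct (chunking the token IDs from input_ids instead of the raw text, and mapping with batched=True so the number of returned rows can grow), though I am not sure whether this is the recommended practice:

def tokenize_and_chunk(examples):
    # tokenize the batch of raw DNA strings
    tokenized_inputs = fast_tokenizer(
        examples['text'],
        max_length=200000,
        truncation=True
    )
    all_chunks = []
    for input_ids in tokenized_inputs['input_ids']:
        # split the token IDs (not the characters) into pieces of 1024 tokens
        all_chunks += [input_ids[i:i + 1024] for i in range(0, len(input_ids), 1024)]
    return {"input_ids": all_chunks}

chunked_data = dna_data.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=dna_data.column_names
)

Is returning a different number of rows from a batched function like this the intended way to grow the Dataset, or is there a better pattern for this use case?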