Map tokenization followed by chunking into smaller sequences

Dear community,

My objective is to have sequences of 1024 tokens ready to feed to GPT-2. As input data, I have DNA sequences of length 2E5 (200,000 characters), and I have built a BPE tokenizer on this data.

I am trying to build a single function that combines two steps:

  1. Tokenize the full sequences (max_length of 2E5)
  2. Chunk the tokenized data into sequences of length 1024, enlarging the number of rows of my Hugging Face Dataset (see the small sketch after this list)
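
To make step 2 concrete, here is a tiny standalone sketch of the chunking I have in mind (toy numbers only; in the real pipeline the ids would come from my BPE tokenizer):

# Toy illustration of step 2: split one list of token ids into blocks of 1024.
def split_into_blocks(input_ids, block_size=1024):
    return [input_ids[i:i + block_size] for i in range(0, len(input_ids), block_size)]

ids = list(range(2500))          # pretend these are token ids from one sequence
blocks = split_into_blocks(ids)
print([len(b) for b in blocks])  # -> [1024, 1024, 452]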

For now I am trying to use the following function, mapping it over the Dataset afterwards:

def tokenize_and_chunk(examples):
    chunks = []
    tokenized_inputs = fast_tokenizer(
        examples['text'],
        max_length=200000,
        truncation=True
    )

    for sentence in examples['text']:
        chunks += [sentence[i:i + 1024] for i in range(0, len(sentence), 1024)]

    return {"chunks": chunks} 

chunked_data = dna_data.map(tokenize_and_chunk, remove_columns=dna_data.column_names)

The problem is that I receive a list of lists like this:

[['C',
'C',
'A',
'G',
…,
],…,]

where, for each example, the inner list contains 91170 single characters.

I can't figure out how to map this function correctly. Does someone have an idea of what the best practice for this would be?
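
For reference, this is roughly the direction I was considering instead (only a sketch, assuming my fast_tokenizer, a 'text' column, and batched mapping; I have not checked whether it is the recommended approach):

def tokenize_and_chunk(examples, block_size=1024):
    # Tokenize the whole batch of long sequences first.
    tokenized_inputs = fast_tokenizer(
        examples['text'],
        max_length=200000,
        truncation=True
    )

    # Split each tokenized sequence into blocks of `block_size` token ids,
    # so every block becomes its own row in the new dataset.
    chunks = []
    for input_ids in tokenized_inputs['input_ids']:
        chunks += [input_ids[i:i + block_size] for i in range(0, len(input_ids), block_size)]

    return {"input_ids": chunks}

chunked_data = dna_data.map(
    tokenize_and_chunk,
    batched=True,                          # examples['text'] is then a list of strings
    remove_columns=dna_data.column_names,  # needed because the number of rows changes
)

I am also unsure whether the last, shorter block should be dropped or padded before feeding the sequences to GPT-2, so any advice on that would be welcome as well.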