Dear community,
My objective is to have sequences of 1024 tokens ready to feed to GPT-2. As input data, I have DNA sequences of length 2e5, and I have built a BPE tokenizer on this data.
I am trying to build a function that combines two steps:
- Tokenize the full sequences (max_length of 2e5)
- Chunk the result into sequences of length 1024, which should increase the number of rows of my Hugging Face Dataset (see the sketch after this list)
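Conceptually, for a single sequence, the chunking I have in mind would look roughly like the minimal sketch below (assuming the BPE tokenizer returns a flat list of token IDs, and where one_dna_string stands for one of my raw 2e5-character sequences):

# minimal sketch of the intended chunking for one sequence
token_ids = fast_tokenizer(one_dna_string, max_length=200000, truncation=True)["input_ids"]
chunks = [token_ids[i:i + 1024] for i in range(0, len(token_ids), 1024)]
# e.g. 90,000 token IDs would become 88 chunks of up to 1024 tokens each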
For now I am trying to use this function, mapping it over the Dataset afterwards:
def tokenize_and_chunk(examples):
    chunks = []
    # tokenize the full sequence(s), truncated to a max_length of 200000
    tokenized_inputs = fast_tokenizer(
        examples['text'],
        max_length=200000,
        truncation=True
    )
    # slice the raw text into pieces of 1024 characters
    for sentence in examples['text']:
        chunks += [sentence[i:i + 1024] for i in range(0, len(sentence), 1024)]
    return {"chunks": chunks}
chunked_data = dna_data.map(tokenize_and_chunk, remove_columns=dna_data.column_names)
The problem is that I receive a list of lists just like this:
[['C', 'C', 'A', 'G', ...], ...]
where, for each example, the inner list contains 91170 single characters.
I can't figure out how I should correctly map this function. Does anyone have an idea of what the best practice for this would be?
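In case it helps clarify what I am aiming for, below is a rough sketch of the direction I suspect might be correct (chunking the token IDs from input_ids instead of the raw text, and mapping with batched=True so the number of returned rows can grow), though I am not sure whether this is the recommended practice:

def tokenize_and_chunk(examples):
    # tokenize the batch of raw DNA strings
    tokenized_inputs = fast_tokenizer(
        examples['text'],
        max_length=200000,
        truncation=True
    )
    all_chunks = []
    for input_ids in tokenized_inputs['input_ids']:
        # split the token IDs (not the characters) into pieces of 1024 tokens
        all_chunks += [input_ids[i:i + 1024] for i in range(0, len(input_ids), 1024)]
    return {"input_ids": all_chunks}

chunked_data = dna_data.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=dna_data.column_names
)

Is returning a different number of rows from a batched function like this the intended way to grow the Dataset, or is there a better pattern for this use case?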