hey guys,
I want to tokenize DNA sequences. The data looks like [[AAA GGG TTT], [AGT ATT CGC CCC AAA GTT], …].
As you can see, it contains sequences of different lengths. In general we have 64 unique codons (each element of a sequence, e.g. AAA, is a codon).
How can I tokenize at the word level, splitting only on spaces?
I also want to pad and truncate the tokenized output so it has the input_ids and attention_mask columns that every pretrained Hugging Face tokenizer produces.
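For reference, the 64 codons are just every 3-letter combination of the four bases, so the vocabulary I want the tokenizer to learn can be enumerated directly (quick sketch):

from itertools import product

# All 3-letter combinations of A/C/G/T -> 4**3 = 64 codons
codons = ["".join(p) for p in product("ACGT", repeat=3)]
print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAT']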
The code below sort of works, but it isn't what I want:
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast
# Create the tokenizer
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Define special tokens (add_special_tokens expects a list, not a dict)
special_tokens_dict = {'cls_token': '[CLS]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'mask_token': '[MASK]'}
tokenizer.add_special_tokens(list(special_tokens_dict.values()))
# Define the trainer with special tokens
trainer = WordLevelTrainer(special_tokens=list(special_tokens_dict.values()))
# Example training data (s.DNA_Codons maps my 64 codons)
training_data = s.DNA_Codons.keys()

# Save the training data to a temporary file
with open("training_data.txt", "w") as f:
    f.write("\n".join(training_data))
# Train the tokenizer
tokenizer.train(["training_data.txt"], trainer)
# Save the tokenizer (raw string so "\U" in the Windows path isn't parsed as an escape)
tokenizer.save(r"C:\Users\farsh\Downloads\tokenizer_DNA.json")
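I think the missing piece is wrapping the trained tokenizer in PreTrainedTokenizerFast (which I import but never actually use), so it can be called with padding/truncation and return input_ids and attention_mask. Here is a sketch of what I have in mind; the max_length value is just for illustration, and I'm not sure this is the right approach:

# Wrap the trained tokenizers.Tokenizer so it behaves like a Hugging Face tokenizer
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Calling the wrapper pads/truncates and returns input_ids and attention_mask
sequences = ["AAA GGG TTT", "AGT ATT CGC CCC AAA GTT"]
encoded = fast_tokenizer(
    sequences,
    padding="max_length",  # pad every sequence up to max_length
    truncation=True,       # cut longer sequences down to max_length
    max_length=8,          # illustrative value, not from my real setup
)
print(encoded["input_ids"])
print(encoded["attention_mask"])

Is this the right way to get the input_ids / attention_mask columns, or is there a cleaner approach?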