I have already tokenized my dataset in the format my problem requires, so I don’t want to tokenize it again. I’m working with nucleotides, so my tokens are single, paired, and triple nucleotide sequences. I still want to pass this data into a BERT model, though, and need to preprocess it first. I’ve looked at this link: Preprocess (particularly the video). My tokens are currently stored as a list. Is there any way I can use that list for the remaining steps mentioned in the video (i.e. converting to IDs and preparing for the model)?
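
To make the question concrete, here is a minimal sketch of the two remaining steps as I understand them, written in plain Python. The special tokens, the vocabulary, the example sequences, and the helper names `build_vocab` and `encode` are all made up for illustration; they are not taken from the tutorial or from any library.

```python
from typing import Dict, List

# Hypothetical BERT-style special tokens (assumption, for illustration only).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(token_lists: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every special token and every observed token."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int], max_length: int) -> Dict[str, List[int]]:
    """Convert pre-tokenized input to ids, add [CLS]/[SEP], then pad."""
    body = tokens[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(tok, vocab["[UNK]"]) for tok in body]
    ids += [vocab["[SEP]"]]
    attention_mask = [1] * len(ids)  # 1 = real token
    n_pad = max_length - len(ids)
    ids += [vocab["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad  # 0 = padding
    return {
        "input_ids": ids,
        "attention_mask": attention_mask,
        "token_type_ids": [0] * max_length,  # single-segment input
    }


# Made-up pre-tokenized sequences: single, paired, and triple nucleotides.
pretokenized = [["A", "CG", "TTA"], ["G", "AT"]]
vocab = build_vocab(pretokenized)
batch = [encode(tokens, vocab, max_length=6) for tokens in pretokenized]
```

If I understand the Hugging Face tokenizer API correctly, `tokenizer.convert_tokens_to_ids(tokens)` followed by `tokenizer.prepare_for_model(ids)` should cover the same two steps, provided the tokenizer's vocabulary already contains my tokens.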
Thanks!