Custom Tokenizing?

thedarkknight7 · March 19, 2024, 4:15am

I have already tokenized my dataset in the format my problem requires, so I don’t want to tokenize it again. I’m working with nucleotides, so I want single, paired, and triple sequences as tokens. However, I still want to pass this into a BERT model and need to preprocess the data for it. I’ve looked at the Preprocess guide (particularly the video). My tokens are currently stored as a list. Is there any way I can use that list for the remaining steps mentioned in the video (i.e., converting to IDs and preparing for the model)?

Thanks!
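
For reference, here is a rough sketch of what those remaining steps amount to when the tokens already exist as lists. It is plain Python rather than the library's own pipeline, and everything in it is an assumption for illustration: the example sequences, the vocabulary built on the fly, the helper names `build_vocab` and `encode`, and the `max_length` value. The special tokens follow the usual BERT convention. With a Hugging Face tokenizer whose vocabulary already contains these tokens, `convert_tokens_to_ids` should cover the lookup step.

```python
# Special tokens in the usual BERT convention; the vocabulary is hypothetical.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(tokenized_sequences):
    """Assign an integer ID to every special token and every token seen in the data."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for seq in tokenized_sequences:
        for tok in seq:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, max_length):
    """Convert one pre-tokenized sequence to padded input IDs and an attention mask."""
    # Leave room for [CLS] and [SEP], truncating the sequence if needed.
    body = tokens[: max_length - 2]
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(tok, vocab["[UNK]"]) for tok in body]
    ids.append(vocab["[SEP]"])
    # Real tokens get mask 1, padding gets mask 0.
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


# Hypothetical pre-tokenized nucleotide data: single, paired, and triple tokens.
sequences = [["A", "CG", "TTA"], ["G", "AT"]]
vocab = build_vocab(sequences)
batch = [encode(seq, vocab, max_length=6) for seq in sequences]
```

The resulting lists can be turned into tensors and passed to the model as `input_ids` and `attention_mask`. The model's vocabulary size would need to match `len(vocab)`, which in practice means training the BERT model from scratch or resizing its embeddings.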
