Training a tokenizer

Hi everyone,
I have a dataset in which sentences have been segmented into words. How do I use it to train a BPE or SentencePiece tokenizer?
Thank you

Check out this notebook from Hugging Face's GitHub, or the second step of this other notebook on how to pretrain a LM from scratch 🙂
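
Since your corpus is already segmented into words, you can also train directly with the `tokenizers` library. Here is a minimal sketch, assuming your data is a plain-text file with one sentence per line and words separated by spaces; the file name `corpus.txt`, the vocab size, and the special tokens are placeholders to adapt:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model with an explicit unknown token
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Split only on whitespace, so the existing word segmentation
# is respected and BPE merges stay within word boundaries
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # placeholder; choose a size that fits your corpus
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)

# corpus.txt is a hypothetical file: one pre-segmented sentence per line
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```

If you prefer SentencePiece, the `sentencepiece` package has a similar training entry point (again, the file name, model prefix, and vocab size are assumptions):

```python
import sentencepiece as spm

# Train a BPE-style SentencePiece model on the same corpus
spm.SentencePieceTrainer.train(
    input="corpus.txt",     # one sentence per line
    model_prefix="my_spm",  # writes my_spm.model and my_spm.vocab
    vocab_size=8000,
    model_type="bpe",
)
```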
