Hi everyone, I have a dataset in which sentences have been segmented into words. How do I use it to train a BPE or SentencePiece tokenizer? Thank you
Check out this notebook from Hugging Face's GitHub, or the second step of this other notebook about how to pretrain an LM from scratch.
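In case it helps, here is a minimal sketch of one way to do this with the Hugging Face `tokenizers` library. It is not taken from the notebooks mentioned above. The toy corpus, the `sentence_iterator` helper, the `[UNK]` token, and the `vocab_size` value are all placeholders I made up for illustration, and I am assuming your words contain no internal spaces, so each sentence can be joined with spaces and split back on whitespace.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical toy corpus: each sentence is already a list of words.
segmented_sentences = [
    ["low", "lower", "lowest"],
    ["new", "newer", "newest"],
    ["wide", "wider", "widest"],
]


def sentence_iterator(sentences):
    """Yield each pre-segmented sentence as one space-joined string."""
    for words in sentences:
        yield " ".join(words)


# BPE model with an unknown-token fallback (the token name is an assumption).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Split only on whitespace so the existing word boundaries are preserved
# and merges never cross from one word into the next.
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

# vocab_size is an arbitrary small value for this toy corpus.
trainer = trainers.BpeTrainer(vocab_size=60, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(sentence_iterator(segmented_sentences), trainer=trainer)

# Encode a sentence that is already a list of words.
encoding = tokenizer.encode(["lower", "widest"], is_pretokenized=True)
print(encoding.tokens)
```

For a real dataset you would replace `segmented_sentences` with an iterator over your own data and pick a much larger `vocab_size`. `tokenizer.save("tokenizer.json")` writes the trained tokenizer to disk.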