Chapter 6 questions

After building a custom Tokenizer, if I have more data to train on, should I call train() or train_from_iterator() again on the same saved tokenizer? Will the original trained tokenizer be overwritten?

Training is driven by the Trainer's algorithm (e.g. BPE, WordPiece), which builds the vocabulary from the token usage statistics in your training dataset.

If you use add_tokens(), the algorithm is bypassed and your specific tokens are added directly to the Tokenizer's vocabulary.
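To make the contrast concrete, here is a toy sketch (plain Python, not the 🤗 tokenizers API): "training" rebuilds the vocabulary from corpus statistics, so retraining on new data can drop old entries, while an add_tokens-style call simply appends an entry regardless of the data.

```python
from collections import Counter

def train_vocab(corpus, vocab_size):
    # "Training": keep the most frequent words seen in the corpus.
    counts = Counter(w for text in corpus for w in text.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}

vocab = train_vocab(["the cat sat", "the dog sat"], vocab_size=3)

# Retraining builds a fresh vocabulary from the new data -- entries that
# are rare or absent in the new corpus disappear.
vocab = train_vocab(["birds fly high", "birds sing"], vocab_size=3)

def add_token(vocab, token):
    # add_tokens-style: bypass the statistics, insert the token directly.
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab

add_token(vocab, "<special>")
```

The same logic applies to a real tokenizer: calling train()/train_from_iterator() again replaces the learned vocabulary with one fitted to the new data, whereas add_tokens() leaves the learned vocabulary intact and extends it.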

If I have trained a tokenizer, whether from an old tokenizer or from scratch, does that mean I have to use it with a model trained from scratch? In other words, can the newly trained tokenizer not be used with any existing model?
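One way to see the core of the problem is with toy vocabularies (illustrative only, not real tokenizer output): a pretrained model's embedding matrix is indexed by the token IDs of the tokenizer it was trained with, and retraining generally reassigns those IDs.

```python
# Toy vocabularies standing in for an "old" tokenizer (the one a
# pretrained model saw) and a "new" retrained tokenizer.
old_vocab = {"[PAD]": 0, "hello": 1, "world": 2}
new_vocab = {"[PAD]": 0, "world": 1, "hello": 2}

def encode(vocab, tokens):
    return [vocab[t] for t in tokens]

# The same text maps to different IDs, so the pretrained model's
# embedding rows no longer line up with the tokenizer's output.
print(encode(old_vocab, ["hello", "world"]))  # [1, 2]
print(encode(new_vocab, ["hello", "world"]))  # [2, 1]
```

So a newly trained tokenizer is normally paired with a model trained (or at least with embeddings re-initialised) under that tokenizer's vocabulary.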

Hello.

I’m watching the Training a new tokenizer from an old one chapter.

There is a problem when I run tokenizer.train_new_from_iterator with the code below.

Memory keeps piling up until I hit an OOM error, and I don't think the problem is where the code is placed.
How do you solve it?

from tqdm import tqdm

def get_training_corpus(dataset, batch_size):
    for start_idx in tqdm(range(0, len(dataset), batch_size)):
        yield dataset[start_idx:start_idx+batch_size]["text"]

training_corpus = get_training_corpus(dataset, batch_size=5000)
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, 30000)
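One thing worth checking is that each yielded batch can be garbage-collected as soon as the trainer has consumed it, and that the batch size is small enough for your machine. A minimal sketch of a memory-friendlier corpus generator (shown here over a plain list of strings; with a 🤗 Datasets dataset you would slice and take the "text" column as in the snippet above):

```python
def get_training_corpus(dataset, batch_size=1000):
    # Yield one batch of texts at a time. Nothing outside the loop keeps
    # a reference to a batch, so each one is eligible for garbage
    # collection right after the trainer consumes it. A smaller
    # batch_size lowers peak memory at the cost of more iterations.
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]
```

If memory still grows, the accumulation is likely inside the training itself (the trainer has to hold its growing statistics), in which case reducing the corpus or vocabulary size is the usual workaround.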

Hi there,

def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

training_corpus = get_training_corpus()

I think there's a mistake here. The function is written as if it makes the generator reusable, but as far as I know a generator is exhausted after one pass, and we need to wrap it in a list to reuse the data. I think the correct way should be like this:

def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

training_corpus = list(get_training_corpus())
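A small self-contained demo of the exhaustion behaviour being discussed (with stand-in data where the course slices raw_datasets["train"]): a generator yields its items only once, and there are two ways to get the data again, namely calling the factory function a second time to get a fresh generator, or materialising everything with list() as above.

```python
def get_training_corpus():
    # Stand-in data; the course code slices raw_datasets["train"] here.
    data = ["func one", "func two", "func three"]
    return (text for text in data)

gen = get_training_corpus()
print(list(gen))  # ['func one', 'func two', 'func three']
print(list(gen))  # [] -- the generator is exhausted after one pass

fresh = get_training_corpus()        # calling the factory again: a new generator
materialised = list(get_training_corpus())  # or load everything into memory
```

Note the trade-off: list() keeps the whole corpus in memory, while re-calling the factory streams it lazily each time.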

In the implementation of def encode_word(word, model) in the Unigram tokenization section of the Hugging Face NLP Course, why do we initialise the first index with score 1: {"start": 0, "score": 1}?