Chapter 6 questions

raptorkwok · October 10, 2023, 11:32am

After building a custom tokenizer, if I have more data to train on, should I call train() or train_from_iterator() again on the same saved tokenizer? Will the originally trained tokenizer be overwritten?

raptorkwok · October 10, 2023, 11:34am

Training follows the algorithm of the model being trained (e.g. BPE, WordPiece), which builds the vocabulary from how tokens are used in your training dataset.

If you use add_tokens(), it bypasses the algorithm and adds your specific tokens directly to your tokenizer.

Nivina · October 16, 2023, 6:48am

If I have trained a tokenizer, whether from an old tokenizer or from scratch, does that mean I have to use it with a model trained from scratch? In other words, can the newly trained tokenizer not be used to tokenize for any existing model?

saeu5407 · October 29, 2023, 8:53am

Hello.

I’m working through the "Training a new tokenizer from an old one" chapter.

I ran into a problem using tokenizer.train_new_from_iterator with the code below.

Memory usage keeps piling up until it causes an OOM, but I don’t think it’s a placement problem.
How can I solve it?

from tqdm import tqdm

def get_training_corpus(dataset, batch_size):
    # Yield one batch of texts at a time
    for start_idx in tqdm(range(0, len(dataset), batch_size)):
        yield dataset[start_idx:start_idx+batch_size]["text"]

training_corpus = get_training_corpus(dataset, batch_size=5000)
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, 30000)

furkansepetci · February 8, 2024, 12:15pm

Hi there,

def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

training_corpus = get_training_corpus()

I think there’s a mistake here. The function was written to make the generator reusable. However, as far as I know, a generator has to be wrapped in a list to be reused. I think the correct version would be:

def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

training_corpus = list(get_training_corpus())
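
For comparison, here is a minimal sketch of how the two approaches behave. A generator object can only be consumed once, but every call to the function returns a fresh one, so a list is only needed when the very same object has to be iterated twice. The names `data` and `get_batches` are made up for illustration and stand in for raw_datasets["train"] and get_training_corpus.

```python
# Hypothetical stand-in for raw_datasets["train"]["whole_func_string"]
data = [f"func_{i}" for i in range(10)]

def get_batches(batch_size=4):
    # Each call builds and returns a brand-new generator object
    return (data[i : i + batch_size] for i in range(0, len(data), batch_size))

corpus = get_batches()
first_pass = list(corpus)         # consumes the generator
second_pass = list(corpus)        # same object, already exhausted
fresh_pass = list(get_batches())  # calling the function again starts over
```

Note that wrapping the generator in list() loads every batch into memory at once, which may matter for a large dataset.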

Blannikus · April 2, 2024, 12:42pm

In the implementation of encode_word(word, model) in the "Unigram tokenization" section of the Hugging Face NLP Course, why do we initialise the first index with a score of 1? {"start": 0, "score": 1}
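
One way to explore this is the small sketch below. It is not the course's code: `best_segmentation` and `toy_model` are hypothetical, and it assumes the scores are costs (such as negative log probabilities) that are summed along a segmentation, with lower being better. Under that assumption the starting value is added to every candidate segmentation alike, so it shifts the final score without changing which segmentation wins.

```python
def best_segmentation(word, model, start_score):
    # model maps token -> cost (assumed: lower is better, costs are summed)
    best = [{"start": 0, "score": start_score}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]
    for start_idx in range(len(word)):
        score_at_start = best[start_idx]["score"]
        if score_at_start is None:
            continue  # no known segmentation ends at this position
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model:
                score = score_at_start + model[token]
                if best[end_idx]["score"] is None or score < best[end_idx]["score"]:
                    best[end_idx] = {"start": start_idx, "score": score}
    if best[-1]["score"] is None:
        return None, None  # the word cannot be segmented with this vocabulary
    # Walk back from the end of the word to recover the tokens
    tokens, end = [], len(word)
    while end > 0:
        start = best[end]["start"]
        tokens.insert(0, word[start:end])
        end = start
    return tokens, best[-1]["score"]

# Hypothetical token costs, chosen only for illustration
toy_model = {"h": 3.0, "u": 3.0, "g": 3.0, "hu": 4.0, "ug": 3.5, "hug": 9.0}
tokens_a, score_a = best_segmentation("hug", toy_model, start_score=1)
tokens_b, score_b = best_segmentation("hug", toy_model, start_score=0)
```

In this toy setup both calls pick the same tokens, and the two final scores differ only by the starting value.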

chrischang80 · July 10, 2024, 3:25am

A question-answering model tries to predict the start and end positions of the answer. Say we have a paragraph of 50 words, and the correct answer lies between positions 30 and 40. If the predicted start and end positions are 25 and 10, that doesn’t make sense, because the start should always be less than the end.
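
A common way to handle this in post-processing, sketched here under assumptions rather than taken from any specific model, is to score candidate (start, end) pairs together and skip the pairs where the end comes before the start. The function name `best_valid_span`, its parameters, and the logit values below are all hypothetical.

```python
def best_valid_span(start_logits, end_logits, max_answer_len=30, n_best=20):
    # Keep the n_best highest-scoring start and end positions
    top_starts = sorted(
        range(len(start_logits)), key=lambda i: start_logits[i], reverse=True
    )[:n_best]
    top_ends = sorted(
        range(len(end_logits)), key=lambda i: end_logits[i], reverse=True
    )[:n_best]
    best = None
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or that are too long
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best  # (start, end, score), or None if no valid span exists

# Made-up logits: the independent argmaxes are start=2 and end=0 (invalid)
start_logits = [0.1, 0.2, 5.0, 0.3, 1.0]
end_logits = [4.0, 0.1, 0.2, 3.0, 0.5]
span = best_valid_span(start_logits, end_logits)
```

Here the highest start logit is at position 2 and the highest end logit at position 0, but the pair is rejected and the best remaining valid span is chosen instead.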
