I have two questions regarding fine-tuning T5:
Is there any way to change the lm_head on T5ForConditionalGeneration so it is initialized from scratch to support a new vocabulary size?
I did it by changing the T5ForConditionalGeneration code and adding a new layer called final_layer, but I was wondering if there is an easier way (a rough sketch of the idea follows below).
Does the T5 generate method use teacher forcing or not?
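For the first question, roughly what I mean is something like this minimal sketch (illustrative only: it reassigns the head directly instead of editing the library source, and the input embeddings would also need to match the new vocabulary size):

import torch.nn as nn
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
new_vocab_size = 32200  # hypothetical target vocabulary size

# Replace the output projection with a freshly initialized one
model.set_output_embeddings(nn.Linear(model.config.d_model, new_vocab_size, bias=False))
model.config.vocab_size = new_vocab_size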
When you modify the vocab, you also need to resize the token embeddings. The right way to do this is:
1. Add the new tokens to the tokenizer:
tokenizer.add_tokens(list_of_new_tokens)
2. Resize the token embeddings:
model.resize_token_embeddings(len(tokenizer))
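Putting the two steps together, a minimal sketch (the added tokens here are just placeholders):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# 1. Add the new tokens to the tokenizer
tokenizer.add_tokens(["<new_token_1>", "<new_token_2>"])

# 2. Resize the token embeddings so the model matches the new vocabulary size
model.resize_token_embeddings(len(tokenizer))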
Teacher forcing is used during training. generate does not use teacher forcing, since it is not used in training and is meant for generation after training.
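To make the difference concrete, a rough sketch (the model and inputs are just examples): passing labels to the forward pass trains with teacher forcing, while generate decodes autoregressively from its own predictions.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: Hello", return_tensors="pt")
labels = tokenizer("Hallo", return_tensors="pt").input_ids

# Training: the decoder sees the (shifted) gold tokens at every step, i.e. teacher forcing
loss = model(input_ids=inputs.input_ids, labels=labels).loss

# Inference: generate feeds back its own predictions instead of the gold tokens
generated_ids = model.generate(inputs.input_ids, max_length=20)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))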
Thanks @valhalla for your explanation.
To confirm my understanding:
Resizing the embedding will add extra rows/columns for the new tokens, which are initialized with random weights, correct?
Seq2Seq example:
https://github.com/huggingface/transformers/blob/master/examples/seq2seq/seq2seq_trainer.py#L119
will use teacher forcing during training. Is there any way to disable teacher forcing in the library, or do I have to implement it myself by feeding the model one output at a time sequentially?
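Something like this rough sketch is what I have in mind for implementing it myself (illustrative only: greedy feedback, no caching, and the encoder is re-run at every step):

import torch
import torch.nn.functional as F

def forward_without_teacher_forcing(model, input_ids, labels):
    # Start the decoder from the model's decoder_start_token_id
    decoder_input_ids = torch.full(
        (input_ids.size(0), 1), model.config.decoder_start_token_id,
        dtype=torch.long, device=input_ids.device,
    )
    step_logits = []
    for _ in range(labels.size(1)):
        out = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, return_dict=True)
        next_logits = out.logits[:, -1, :]        # logits for the current position
        step_logits.append(next_logits)
        next_tokens = next_logits.argmax(dim=-1)  # feed back the model's own prediction
        decoder_input_ids = torch.cat([decoder_input_ids, next_tokens.unsqueeze(-1)], dim=-1)
    logits = torch.stack(step_logits, dim=1)
    # Gradients only flow through the per-step logits, not the hard argmax feedback
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))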
Here’s what I used to add some tokens:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small"
local_dir = "./cryptic_special"

special_tokens = ["<DEFN>", "<ANAG>", "<ANS>", "<INDIC>"]

# Load a tokenizer that knows about the extra special tokens, plus the pretrained model
tokenizer_special = T5Tokenizer.from_pretrained(model_name, additional_special_tokens=special_tokens)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Resize the embeddings to match the enlarged vocabulary, then save both
model.resize_token_embeddings(len(tokenizer_special))
tokenizer_special.save_pretrained(local_dir)
model.save_pretrained(local_dir)
Then you just adapt the fine_tune script to point to the local_dir (for model and tokenizer)
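Once saved, both can be reloaded from local_dir like any other pretrained checkpoint, e.g.:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("./cryptic_special")
model = T5ForConditionalGeneration.from_pretrained("./cryptic_special")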
Thanks a lot for the example.
Perfect, thanks for the explanation.
This didn’t work for me. How can you reload the model once you’ve resized the embedding?
The rest of the model resizes, but it seems the lm_head will not, e.g.:
size mismatch for lm_head.weight: copying a param with shape torch.Size([32128, 768]) from checkpoint, the shape in current model is torch.Size([32102, 768])
Disregard this, it was a bug that was fixed in this PR: huggingface:master ← patrickvonplaten:fix_t5_resize_tokens (opened 04:59 PM, 01 Dec 20 UTC)
# What does this PR do?
This PR extends the `resize_embeddings` function in PyTorch to models that have input/output embeddings that are **not** tied.
In PyTorch all models that have tied input/output embeddings by default can also untie those embeddings by setting `config.tie_word_embeddings=False`. This however requires the `_resize_token_embeddings` to be extended to also resize the `lm_head`. This PR does this extension by adding a `_get_resized_lm_head` method. Also, all models that have a `get_output_embedding()` function, now need a `set_output_embedding()` function. A test is added to make sure the new functionality works as expected. The Bart-like models currently skip this test because there is a rather weird `lm_head` behavior that I want to refactor in another PR.
In addition this PR:
- Fixes #8706: With MT5 and T5v1_1, T5 now has a configuration where input and output embeddings are not tied anymore. This PR fixes this.
- Refactors MobileBert
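To see the behaviour the PR describes, a small sketch on a model whose input/output embeddings are untied (model name assumed to be available on the hub):

from transformers import T5ForConditionalGeneration

# google/t5-v1_1-small has config.tie_word_embeddings=False, so the lm_head
# must be resized separately from the input embeddings
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-small")
print(model.get_output_embeddings().weight.shape)  # torch.Size([32128, 512])

model.resize_token_embeddings(32200)               # hypothetical new vocabulary size
print(model.get_output_embeddings().weight.shape)  # torch.Size([32200, 512]) after the fix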