I was wondering, can we fine-tune a model with JAX just like we fine-tuned with PyTorch here?
I couldn’t find any guide for that. How should I approach this?
Any suggestions would be great!
I was trying to fine-tune this VQGAN model, which was pretrained using JAX.
Gently pinging @lewtun and @sgugger for suggestions on this. Thanks!
Is there any notebook available for fine-tuning a GPT-2 model on a text-generation task (poems/songs/etc.)?
We were hoping to fine-tune our pretrained GPT-2 Bengali model; any pointers would help! Thanks!
This will be covered in chapter 7. In the meantime, you can look at the language modeling scripts and notebooks to see how to fine-tune a language model on a new corpus.
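As a concrete starting point, the `run_clm.py` example script from the transformers repo can fine-tune a causal language model such as GPT-2 on your own text file. A rough sketch of an invocation follows; the model path, file names, and hyperparameter values below are placeholders you would replace with your own.

```shell
# Illustrative invocation of the transformers run_clm.py example script.
# The model path, corpus file, and hyperparameters are placeholders.
python run_clm.py \
  --model_name_or_path path/to/your-bengali-gpt2 \
  --train_file my_corpus.txt \
  --do_train \
  --num_train_epochs 3 \
  --per_device_train_batch_size 8 \
  --output_dir ./gpt2-finetuned
```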
Hi there! I have to say, the course is amazing. It explains everything at a high enough level that you can understand all the steps perfectly without having to dive into what’s under the hood. I do have a question: how would you use a model like the one you have fine-tuned here in a pipeline? It seems like the text-classification pipeline only accepts one sentence (or a list of single sentences).
In general, I think you can use the [SEP] token in your inputs to tell the pipeline which part belongs to sentence 1 and which to sentence 2. This token differs from tokenizer to tokenizer, but [SEP] usually works for BERT-based models, while other models like RoBERTa use </s>.
@lewtun Well, I’m trying to do that with a model I have fine-tuned. It has somewhere around 95% accuracy. I’m taking 20 test samples and feeding them to the pipeline with a ‘[SEP]’ between the two sentences, and it always predicts label 0.
I also tried the same thing in the Colab notebook from the course, and it behaves the same way (except it always predicts 1, but that’s not the point).
You can see it here:
Thanks for sharing your notebook @andy13771, that really helps!
I now think I was incorrect about simply using [SEP] in the pipeline for BERT-based models on sentence-pair tasks like MRPC.
The problem is that BERT’s tokenizer relies on token_type_ids to keep track of which tokens belong to the first/second sentence, and with just a single string input like
"sentence 1 [SEP] sentence 2"
it assigns a 0 ID to every token. (You can verify this yourself by passing two sentences to a BERT tokenizer and comparing the token_type_ids against those produced by a single string.)
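To make the difference concrete, here is a toy sketch (plain Python, not the real BERT tokenizer) of how the two input styles lead to different token_type_ids:

```python
# Toy illustration of BERT-style segment IDs; the "tokenizer" here is just
# whitespace splitting, not the real transformers tokenizer.

def toy_encode_single(text):
    tokens = ["[CLS]"] + text.split() + ["[SEP]"]
    # One string in -> every token gets segment ID 0, even a literal [SEP].
    return tokens, [0] * len(tokens)

def toy_encode_pair(sent1, sent2):
    first = ["[CLS]"] + sent1.split() + ["[SEP]"]
    second = sent2.split() + ["[SEP]"]
    # True pair encoding marks the second sentence with segment ID 1.
    return first + second, [0] * len(first) + [1] * len(second)

tokens_a, ids_a = toy_encode_single("hello there [SEP] general kenobi")
tokens_b, ids_b = toy_encode_pair("hello there", "general kenobi")
# ids_a is all zeros, while ids_b switches to 1 after the first [SEP].
```

The model was fine-tuned on inputs shaped like the pair version, which is why the single-string hack confuses it.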
So it seems that for BERT models we can’t hack the pipeline for sentence-pair tasks. However, other models like RoBERTa don’t rely on token_type_ids at all! For these models the separator is </s></s>, so the following example shows we get the correct prediction for the first training example of the MRPC dataset: textattack/roberta-base-MRPC · Hugging Face.
You can use a text-classification pipeline for pairs of sentences, though it’s a bit obscure.
The key is to pass a list of pairs of sentences to the pipeline object. Taking your example in Colab:
cls = pipeline('text-classification', model='testing-pipeline')
for i in range(20):
    print(cls([[raw_datasets['test'][i]['sentence1'], raw_datasets['test'][i]['sentence2']]]))
(The double brackets here give a list containing one pair of sentences.)
In the example in Chapter 3 we use trainer.predict(tokenized_datasets["validation"]).
I can’t figure out how to get s1 and s2 into a format (i.e. the preprocessing) that would let me do what you have suggested.
I think it means I need input_ids, attention_mask, etc., but I can’t figure out how to get there the way you did with map(tokenize_function).
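In case it helps, here is a toy, dependency-free sketch of the map(tokenize_function) pattern: each row of the dataset gains input_ids and attention_mask columns, which is the format trainer.predict and the DataLoader expect. The tokenizer and vocabulary here are illustrative stand-ins, not the real BERT tokenizer.

```python
# Toy stand-in for Dataset.map(tokenize_function); no transformers/datasets
# dependency, just the shape of the pattern.

VOCAB = {}  # hypothetical toy vocabulary, grown on the fly

def toy_tokenizer(s1, s2):
    # Pair encoding: both sentences go into one sequence of IDs.
    tokens = ["[CLS]"] + s1.split() + ["[SEP]"] + s2.split() + ["[SEP]"]
    input_ids = [VOCAB.setdefault(tok, len(VOCAB)) for tok in tokens]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

def tokenize_function(example):
    return toy_tokenizer(example["sentence1"], example["sentence2"])

dataset = [
    {"sentence1": "he ate", "sentence2": "someone ate"},
    {"sentence1": "it rained", "sentence2": "rain fell"},
]
# map(tokenize_function)-style: merge the new columns into each row.
tokenized = [{**row, **tokenize_function(row)} for row in dataset]
```

With the real libraries you would call the pretrained tokenizer as tokenizer(example["sentence1"], example["sentence2"], truncation=True) inside tokenize_function and pass the function to Dataset.map.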
In the “A full training” section, we talk about using the DataLoader to break the data into batches, which are then sent to the model like so:
outputs = model(**batch)
I want to change the default loss function the model uses when comparing predictions to ground-truth labels. How can I specify a loss function different from the default in the training loop code below?
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
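One common pattern (a sketch, not the official course solution) is to pop the labels out of the batch so the model doesn’t compute its built-in loss, then compute your own loss from outputs.logits. For example, with a hypothetical label-smoothed cross-entropy:

```python
import torch
import torch.nn.functional as F

# Hypothetical custom loss; the smoothing value is an illustrative choice.
def custom_loss(logits, labels, smoothing=0.1):
    return F.cross_entropy(logits, labels, label_smoothing=smoothing)

# Inside the loop, remove the labels so the model skips its default loss:
batch = {
    "input_ids": torch.tensor([[101, 2023, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1]]),
    "labels": torch.tensor([1]),
}
labels = batch.pop("labels")
# outputs = model(**batch)              # model no longer receives labels
# loss = custom_loss(outputs.logits, labels)
logits = torch.tensor([[0.2, 1.5]])     # stand-in for outputs.logits
loss = custom_loss(logits, labels)
loss_value = loss.item()
```

Everything after `loss = ...` in the training loop (the backward pass, optimizer and scheduler steps) stays exactly the same.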
Similarly to others, I had the same issue due to the missing line in update_state of the F1_metric class. The updated code fixed it, but when I tried adding tf.keras.metrics.Precision() to metrics in the compile method, I got the same error. I’ve basically adapted your code and it works, but I wonder why tf.keras.metrics.Precision() doesn’t work as-is. I can see why the first F1_metric failed (I suppose we were comparing the true class, shape (None, 1), against class probabilities, shape (None, 2)), but I would have expected tf.keras.metrics.Precision() to handle that automatically. Does it not?
Sorry for the delay in replying! We’re actually pulling that section from the updated course: it was quite confusing, and it wasn’t really much help, since that approach could only compute the F1 metric and not more complex NLP metrics like BLEU and ROUGE. Instead, we’re working on a new Keras callback for automatically computing arbitrary metrics, which should hopefully be both simpler and much more useful than hacking F1 in as a Keras metric!
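Regarding the shape question above: tf.keras.metrics.Precision thresholds each predicted value elementwise (at 0.5 by default); it doesn’t argmax a (None, 2) class-probability matrix against (None, 1) integer labels for you, which is why it needs the same kind of adaptation as the F1 metric did. A plain-Python sketch of the conversion involved (toy numbers, no TensorFlow):

```python
# Toy illustration of the shape issue: Precision-style metrics want one
# prediction per sample, not a row of class probabilities.

y_true = [1, 0, 1]                               # labels, shape (None, 1) in the thread
y_prob = [[0.3, 0.7], [0.8, 0.2], [0.4, 0.6]]    # model output, shape (None, 2)

# The adaptation: collapse each probability row to a predicted class (argmax).
y_pred = [max(range(2), key=lambda c: row[c]) for row in y_prob]

# Precision = TP / (TP + FP) over the converted predictions.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
precision = tp / (tp + fp)
```

The adapted metric classes in the thread effectively do this argmax step inside update_state before comparing against y_true.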