Zero gradients when training a text classifier built on TFBertModel

I want to use transformers for text classification, and I want to write the model myself rather than use TFBertForSequenceClassification, so I built it with TFBertModel and tf.keras.layers.Dense. But there is no gradient descent happening in my code (the gradients are zero), and I can't find what is wrong, so I'm opening this issue to ask for help.
My code is here:

Model:
[screenshot of the model definition]
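A rough sketch of the kind of model described above (the original screenshots are not preserved; this is reconstructed from the replies below, which point out that the sequence output was flattened and fed to a Dense layer — model name, sequence length and label count are placeholders):

```python
# Sketch reconstructed from the discussion below, not the original code:
# the full sequence output is flattened and passed to a Dense classifier.
import tensorflow as tf
from transformers import TFBertModel

max_len = 128   # assumed sequence length
num_labels = 2  # assumed number of classes

input_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
bert = TFBertModel.from_pretrained("bert-base-uncased")  # model name is an assumption
sequence_output = bert(input_ids)[0]                     # (batch, seq_len, hidden)
flat = tf.keras.layers.Flatten()(sequence_output)        # flattening the whole sequence
outputs = tf.keras.layers.Dense(num_labels, activation="softmax")(flat)

model = tf.keras.Model(inputs=input_ids, outputs=outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```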

(I know the training data is the same as the test data; that's just for quick debugging.)
And when I train this model:
[screenshot of the training output]

Maybe @jplu has some ideas here.

If you could share a Google Colab exhibiting the behavior, it would be a lot better than screen caps.

Thanks for your reply. I uploaded my code to Colab; here is the link: https://colab.research.google.com/drive/1S4Y01Pr64R8uQi6cdNx53BzVx9iROZEH?usp=sharing and I have granted edit permissions.

It might be a little chaotic because I wrote it just for a test. Thank you very much for answering my question.

At first glance I would say it comes down to two things:

  1. You should not flatten the output of the BERT model
  2. You are taking the sequence output, while you should take only the pooled output.

Look at how it is done in TFBertForSequenceClassification.
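For reference, a minimal sketch of that approach (model name, sequence length and label count are placeholders), using the pooled output instead of the flattened sequence output, roughly mirroring what TFBertForSequenceClassification does internally:

```python
# Minimal sketch: a classification head on the pooled [CLS] output of TFBertModel.
import tensorflow as tf
from transformers import TFBertModel

max_len = 128
num_labels = 2

input_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

bert = TFBertModel.from_pretrained("bert-base-uncased")
bert_outputs = bert(input_ids, attention_mask=attention_mask)
pooled_output = bert_outputs[1]  # (batch, hidden): the pooled [CLS] representation

pooled_output = tf.keras.layers.Dropout(0.1)(pooled_output)
logits = tf.keras.layers.Dense(num_labels, activation="softmax")(pooled_output)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=logits)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```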

I have been reading their documentation for a whole afternoon. You mean the source code of TFBertForSequenceClassification? But I haven't been able to find the source code because there are so many nested calls… How can I find it?

Here: transformers/modeling_tf_bert.py

Thank you very much!!! I will read the source code tomorrow because it is too late now. Thanks again for your patience!!

But I still can't understand why the gradients were zero.
I looked at the documentation:
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
It is still the output of BERT after all, so why can't a Dense (FC) layer be trained on top of it with gradient descent?
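For reference, the shapes of the two outputs discussed above look like this (a quick sketch, assuming bert-base-uncased and a transformers version that returns a tuple of outputs):

```python
# Shape check for the two BERT outputs: per-token hidden states vs. the pooled output.
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("a short example sentence", return_tensors="tf")
outputs = bert(enc)

last_hidden_state = outputs[0]  # (batch_size, sequence_length, hidden_size): one vector per token
pooled_output = outputs[1]      # (batch_size, hidden_size): one vector per sequence

print(last_hidden_state.shape)  # e.g. (1, 6, 768)
print(pooled_output.shape)      # e.g. (1, 768)
```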

I solved the problem. Thanks a lot for your patient teaching.