Zero gradients when training a custom classifier built on TFBertModel

I want to use transformers for text classification, and I want to write the model myself rather than use TFBertForSequenceClassification, so I built it from TFBertModel and tf.keras.layers.Dense. But there is no gradient descent happening in my code (the gradients are zero). I have tried to find what is wrong but I can't, so I am opening this issue to ask for help.
My code is here (it was originally posted as screenshots, which are not preserved):



And I know the train data is the same as the test data; that is just for quick debugging.
And when I train this model, nothing changes (the training log was also a screenshot, not preserved here).

maybe @jplu has some idea here.

If you could share a google colab exhibiting the behavior it would be a lot better than screen caps.

Thanks for your reply. I uploaded my code to Colab; here is the link: https://colab.research.google.com/drive/1S4Y01Pr64R8uQi6cdNx53BzVx9iROZEH?usp=sharing and I have granted edit permissions.

It might be a little chaotic since I wrote it just as a test. Thank you very much for answering my question.

At first glance I would say it comes from two things:

  1. You should not flatten the output of the BERT model.
  2. You are taking the sequence output when you should take only the pooled output.

Look at how it is done in TFBertForSequenceClassification.

I have been reading their documentation for a whole afternoon. You mean the source code of TFBertForSequenceClassification? But I am not experienced enough to find the source code because there are a lot of calls... How can I find it?

Here: transformers/modeling_tf_bert.py

Thank you very much!!! I will read the source code tomorrow because it is too late now. Thanks again for your patience!!

But I still can't understand why the gradients were zero.
I see in the documentation:
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
These are BERT's outputs after all, so why can't a Dense (FC) layer connect to them and still learn through gradient descent?

I solved the problem. Thanks a lot for your patient teaching.