Unsupported value type BatchEncoding

Hi

I’m a HuggingFace newbie and I’m trying to fine-tune DistilBERT for a three-label sentiment classification task.

To do so, I am using the HuggingFace Course as a guide, so I am training my model with the following code:-

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)

model.compile(optimizer=opt, loss=loss, metrics=['accuracy', F1_metric()])

model.fit(
    encoded_train,
    np.array(y_train),
    validation_data=(encoded_val, np.array(y_val)),
    batch_size=8,
    epochs=3
)

The loss function is:-

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

The number of training steps is calculated like so:-

batch_size = 8

num_epochs = 3

num_train_steps = (len(encoded_train['input_ids']) // batch_size) * num_epochs

So far then, very much like the boiler-plate code in the course.
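For reference, the step count above can be checked by hand. The sketch below assumes the 1040 training examples shown in the tensor shapes further down, and re-implements the linear (power 1) polynomial decay formula in plain Python purely for illustration; the function name `polynomial_decay` is made up here and is not the Keras class itself.

```python
def polynomial_decay(step, initial_lr, end_lr, decay_steps, power=1.0):
    """Illustrative re-implementation of a polynomial learning-rate decay."""
    step = min(step, decay_steps)  # the rate stays at end_lr once decay_steps is reached
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr

num_examples = 1040   # assumption: first dimension of encoded_train['input_ids']
batch_size = 8
num_epochs = 3

steps_per_epoch = num_examples // batch_size      # 130
num_train_steps = steps_per_epoch * num_epochs    # 390

lr_start = polynomial_decay(0, 5e-5, 0.0, num_train_steps)
lr_mid = polynomial_decay(num_train_steps // 2, 5e-5, 0.0, num_train_steps)
lr_end = polynomial_decay(num_train_steps, 5e-5, 0.0, num_train_steps)

print(steps_per_epoch, num_train_steps)
print(lr_start, lr_mid, lr_end)
```

So with these numbers the learning rate would fall linearly from 5e-5 to 0 over 390 optimizer steps, 130 per epoch.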

My encoded training data looks like this:-

{'input_ids': <tf.Tensor: shape=(1040, 512), dtype=int32, numpy=
array([[  101,   155,  1942, ...,     0,     0,     0],
       [  101, 27900,  7641, ...,     0,     0,     0],
       [  101,   155,  1942, ...,     0,     0,     0],
       ...,
       [  101,   109,  7414, ...,     0,     0,     0],
       [  101,  2809,  1141, ...,     0,     0,     0],
       [  101,  1448,  1111, ...,     0,     0,     0]], 
dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1040, 512), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}

Printing with y_train.head() my labels look like this (though my code turns this into a numpy array):-

10     2
147    1
342    1
999    3
811    3
Name: sentiment, dtype: int64

I am receiving the following error message:-

Epoch 1/3
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-56-2902befb3adf> in <module>()
     16     validation_data=(encoded_val, np.array(y_val)),
     17     batch_size=8,
---> 18     epochs=3
     19 )

14 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/type_spec.py in __make_cmp_key(self, value)
    381     raise ValueError("Unsupported value type %s returned by "
    382                      "%s._serialize" %
--> 383                      (type(value).__name__, type(self).__name__))
    384 
    385   @staticmethod

ValueError: Unsupported value type BatchEncoding returned by IteratorSpec._serialize

My code is being run in Google Colaboratory using GPUs.

The problem lies in the training data, but you did not share how you built it, so we can’t help you see what’s wrong.

Thanks for getting back to me Sylvain.

My source data looks a bit like this:-

id created_at text sentiment
0 77522 2020-04-15 01:03:46+00:00 RT @RobertBeadles: Yo💥\nEnter to WIN 1,000 … positive
1 661634 2020-06-25 06:20:06+00:00 #SriLanka surcharge on fuel removed!\n⛽📉… negative
2 413231 2020-06-04 15:41:45+00:00 Net issuance increases to fund fiscal programs… positive
3 760262 2020-07-03 19:39:35+00:00 RT @bentboolean: How much of Amazon’s traffic … positive
4 830153 2020-07-09 14:39:14+00:00 $AMD Ryzen 4000 desktop CPUs looking ‘great… positive

From this I extract the training data:-

id created_at text
10 223041 2020-04-27 00:41:06+00:00 RT @PipsToDollars: Earnings $AMZN $TSLA $MSFT …
147 808963 2020-07-08 19:41:22+00:00 Baidu $BIDU Has A Weak #Technical Analysis Sco…

I turned the sentiment values into numbers:-

y_train.head()
10     2
147    1
342    1
999    3
811    3
Name: sentiment, dtype: int64

I encoded the data using the following code:-

checkpoint = "distilbert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequences = X_train.loc[:,"text"].tolist()

encoded_train = tokenizer(sequences, padding="max_length", truncation=True, return_tensors="tf")

I had a look at some of the data types:-

encoded_train is of the following type:  <class 'transformers.tokenization_utils_base.BatchEncoding'>
np.array(y_train) is of the following type:  <class 'numpy.ndarray'>
y_train is of the following type:  <class 'pandas.core.series.Series'>

Hopefully some of the above makes the problem a little clearer.

Keras will not accept a BatchEncoding, which is the type of your encoded_train. You need to convert it into a dictionary by adding .data at the end, i.e. passing encoded_train.data (and likewise encoded_val.data) to model.fit. I think that should solve the problem.
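To illustrate what `.data` gives you, here is a minimal sketch. It does not use the real tokenizer output: `StandInEncoding` is a hypothetical stand-in class built on `collections.UserDict`, which is also what `BatchEncoding` is built on, and the token ids are just sample values.

```python
from collections import UserDict

class StandInEncoding(UserDict):
    """Hypothetical stand-in for BatchEncoding (which also subclasses UserDict)."""

encoded = StandInEncoding({
    "input_ids": [[101, 155, 1942, 0], [101, 27900, 7641, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 0]],
})

features = encoded.data        # the plain dict held inside the wrapper
also_features = dict(encoded)  # an equivalent plain-dict copy

print(type(encoded).__name__)   # StandInEncoding
print(type(features).__name__)  # dict
```

With the objects from this thread, the call would then become something like model.fit(encoded_train.data, np.array(y_train), validation_data=(encoded_val.data, np.array(y_val)), batch_size=8, epochs=3).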

Thanks for the swift reply - I appreciate it.

It looks like your suggestion fixed the immediate source of the problem but I am still struggling to train the model:-

Some layers from the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/3
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f20a1201d70>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f20a1201d70>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function wrap at 0x7f20bc3b4170> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function wrap at 0x7f20bc3b4170> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-8197c954d134> in <module>()
     16     validation_data=(encoded_val, np.array(y_val)),
     17     batch_size=8,
---> 18     epochs=3
     19 )

9 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    984           except Exception as e:  # pylint:disable=broad-except
    985             if hasattr(e, "ag_error_metadata"):
--> 986               raise e.ag_error_metadata.to_exception(e)
    987             else:
    988               raise

ValueError: in user code:

    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:855 train_function  *
        return step_function(self, iterator)
    <ipython-input-43-112004023d94>:11 update_state  *
        self.precision.update_state(y_true, y_pred, sample_weight)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/utils/metrics_utils.py:86 decorated  **
        update_op = update_state_fn(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/metrics.py:177 update_state_fn
        return ag_update_state(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/metrics.py:1337 update_state  **
        sample_weight=sample_weight)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/utils/metrics_utils.py:366 update_confusion_matrix_variables
        y_pred.shape.assert_is_compatible_with(y_true.shape)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/tensor_shape.py:1161 assert_is_compatible_with
        raise ValueError("Shapes %s and %s are incompatible" % (self, other))

    ValueError: Shapes (8, 3) and (8, 1) are incompatible

This looks like a loss problem; make sure you are using the proper one. You did not specify the loss function in the code you pasted, so again it’s hard to help you with what went wrong.

Thanks Sylvain.

The loss function is:-

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

There are three different labels in my training data, each represented by integer values.

That is the right one. Looking at the stack trace again, the problem actually comes from one of the metrics (the failing call is self.precision.update_state inside your custom metric), sorry.
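The F1_metric code is not shown in the thread, so the following is only a guess at the mismatch: the model outputs logits of shape (8, 3), the labels arrive with shape (8, 1), and a precision/recall metric compares the two shapes directly. One way around that is to reduce the logits to a predicted class with argmax before counting. The sketch below shows the idea in NumPy rather than as a Keras metric, and the function name `macro_f1_from_logits` is made up for illustration.

```python
import numpy as np

def macro_f1_from_logits(y_true, logits, num_classes):
    """Macro-averaged F1 from integer labels and per-class logits."""
    y_true = np.asarray(y_true).reshape(-1)          # (batch, 1) -> (batch,)
    y_pred = np.argmax(np.asarray(logits), axis=-1)  # (batch, classes) -> (batch,)
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # A class with no predictions and no true examples scores 0 here
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Sample values only: 4 examples, 3 classes, labels shaped (4, 1)
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.1, 0.2, 3.0],
                   [1.0, 0.9, 0.2]])
labels = np.array([[0], [1], [2], [1]])

score = macro_f1_from_logits(labels, logits, num_classes=3)
print(round(score, 4))
```

A Keras version would do the same argmax (or one-hot encode the labels) inside update_state before handing the tensors to the precision and recall metrics.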

Don’t apologise, it is very good of you to help me out!

I have followed your lead and removed one of my metrics, so the compilation step now reads model.compile(optimizer=opt, loss=loss, metrics=['accuracy']). I will look into what was causing the problem with F1_metric() later (again, it was using the boiler-plate from the HuggingFace Course, so hopefully it will be easy to resolve). In the meantime, I have run my code again but am seeing loss: nan and accuracy: 0.0000e+00 in each epoch:-

 Downloading: 100%

354M/354M [00:11<00:00, 31.4MB/s]

Some layers from the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_projector', 'vocab_transform', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['dropout_19', 'classifier', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch 1/3
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f03baabcd70>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f03baabcd70>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function wrap at 0x7f03d5c70170> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function wrap at 0x7f03d5c70170> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
130/130 [==============================] - ETA: 0s - loss: nan - accuracy: 0.0000e+00
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
130/130 [==============================] - 157s 881ms/step - loss: nan - accuracy: 0.0000e+00 - val_loss: nan - val_accuracy: 0.0000e+00
Epoch 2/3
130/130 [==============================] - 113s 866ms/step - loss: nan - accuracy: 0.0000e+00 - val_loss: nan - val_accuracy: 0.0000e+00
Epoch 3/3
130/130 [==============================] - 113s 868ms/step - loss: nan - accuracy: 0.0000e+00 - val_loss: nan - val_accuracy: 0.0000e+00

<tensorflow.python.keras.callbacks.History at 0x7f0300473f90>

Apologies, by the way, for the rather wide output. I am struggling to get the backslashes to justify the text.
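One thing that may be worth checking, although nothing in this thread confirms it as the cause: the labels shown earlier take the values 1, 2 and 3, whereas SparseCategoricalCrossentropy with three output logits expects class indices 0, 1 and 2. An out-of-range index can surface as a nan loss when training on a GPU. A minimal sketch of shifting the labels onto a zero-based range, using a made-up helper name `remap_labels` and the five sample labels from y_train.head():

```python
import numpy as np

def remap_labels(labels):
    """Map arbitrary integer labels onto 0..K-1, keeping their sorted order."""
    flat = np.asarray(labels).reshape(-1)
    classes, remapped = np.unique(flat, return_inverse=True)
    return remapped, classes

y = np.array([2, 1, 1, 3, 3])          # sample labels as printed by y_train.head()
y_zero_based, classes = remap_labels(y)

print(y_zero_based.tolist())  # [1, 0, 0, 2, 2]
print(classes.tolist())       # [1, 2, 3]
```

The same mapping would need to be applied to the validation labels, and `classes` can be kept to translate predictions back to the original values.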

As an update: I decided to abandon this approach altogether and am now implementing the model in native PyTorch, still making extensive use of the HuggingFace library. This approach is a little more long-winded but gives greater opportunities for interrogating the source of errors.