Using XLA with TFTrainer to speed up training

Hello everyone,
I am looking for ways to speed up training time for an NER task. Right now I use a modified version of the script in examples/token-classification/, and I read in this source and also in this PR that using XLA may achieve what I am looking for.
I have tried to use it in several ways, but I am not sure if I am doing it right:

  1. Following what was done in the mentioned PR, I added tf.config.optimizer.set_jit(True) to the code. I tried this on WNUT17, following the NER examples (using DistilBERT as the base model), and the training time increased instead of decreasing (13min 21s with the method, 8min 34s without). I got Compiled cluster using XLA! This line is logged at most once for the lifetime of the process. in the log, so I think XLA is being used. But the code in that PR has since been modified and this method is no longer used, so I am not sure whether it is still the proper way to enable XLA.
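For reference, here is roughly what I did, reduced to a standalone sketch outside the training script (the function name is just mine, for illustration):

```python
import tensorflow as tf

# Turn on XLA JIT compilation globally, as done in the mentioned PR.
tf.config.optimizer.set_jit(True)

# A tiny traced function to check that compiled execution still
# produces the expected values.
@tf.function
def dense_step(x, w):
    return tf.nn.relu(tf.matmul(x, w))

x = tf.ones((2, 3))
w = tf.ones((3, 4))
print(dense_step(x, w).numpy())  # every entry is relu(1*3) = 3.0
```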
  2. Using tf.function as mentioned in this source, I added this code at the top of the script:
import tensorflow as tf

@tf.function
def main():

It does not work and shows this error:

Traceback (most recent call last):
  File "transformers/examples/token-classification/", line 257, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/", line 775, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/", line 823, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/", line 697, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/", line 2855, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/", line 3213, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/", line 3075, in _create_graph_function
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/", line 986, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/", line 600, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/", line 973, in wrapper
    raise e.ag_error_metadata.to_exception(e)
AttributeError: in user code:

    /usr/local/lib/python3.6/dist-packages/transformers/benchmark/ run_in_graph_mode  *
        return func(*args, **kwargs)
    transformers/examples/token-classification/ main  *
    /usr/local/lib/python3.6/dist-packages/transformers/ train  *
        train_ds = self.get_train_tfdataset()
    /usr/local/lib/python3.6/dist-packages/transformers/ get_train_tfdataset  *
        self.num_train_examples = self.train_dataset.reduce(tf.constant(0), lambda x, _: x + 1).numpy()

    AttributeError: 'Tensor' object has no attribute 'numpy'
  3. Using run_with_tf_optimizations from the benchmark utilities, I added this code at the top of the script:
from transformers.benchmark.benchmark_tf import run_with_tf_optimizations
@run_with_tf_optimizations(False, True)
def main():

It also fails with the same error as the previous one.
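As far as I can tell, the error comes from the fact that .numpy() only works on eager tensors, while inside a tf.function everything is a symbolic graph tensor. Here is a minimal repro of the same failure I put together (the function name is my own, not from the script):

```python
import tensorflow as tf

# Counting dataset elements with reduce(...).numpy(), like
# TFTrainer.get_train_tfdataset does, fails once the call is
# traced inside a tf.function: graph tensors have no .numpy().
@tf.function
def count_examples(ds):
    n = ds.reduce(tf.constant(0), lambda x, _: x + 1)
    return n.numpy()  # AttributeError here, as in the traceback above

try:
    count_examples(tf.data.Dataset.range(5))
except AttributeError as e:
    print("fails as in the traceback:", e)
```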

Is it possible to use TFTrainer with XLA? If so, how do I enable it? Am I doing it right?
I also noticed that this log comes out when running the script:
XLA service 0x4c88680 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
Does this mean that XLA is used by default?

Also, if there are any other tips or methods you can share to speed up training besides XLA, I would be really happy to hear them.
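One thing I have been looking at myself is Keras mixed precision, though I have not checked whether it plays well with TFTrainer. A minimal sketch of what I mean, assuming a TF version where set_global_policy is available (2.4+):

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# float16 compute with float32 variables; usually a big speed-up
# on GPUs with Tensor Cores.
mixed_precision.set_global_policy("mixed_float16")

layer = tf.keras.layers.Dense(4)
out = layer(tf.ones((2, 3)))
print(out.dtype)           # compute/output dtype: float16
print(layer.kernel.dtype)  # variable dtype stays float32
```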
Any help will be really appreciated. Thank you in advance!