Transformers v3.0.0 is out!

New tokenizer API, TensorFlow improvements, enhanced documentation & tutorials

Breaking changes since v2

  • In #4874 the language modeling BERT has been split in two: BertForMaskedLM and BertLMHeadModel. BertForMaskedLM therefore cannot do causal language modeling anymore, and cannot accept the lm_labels argument.
  • The Trainer data collator is now a callable (a function) instead of a class.
  • Directly setting a tokenizer special token attribute (e.g. tokenizer.mask_token = '<mask>') now only associates the token with that attribute of the tokenizer; it doesn’t add the token to the vocabulary if it is not already there. Tokens are only added by the tokenizer.add_special_tokens() and tokenizer.add_tokens() methods.
  • The prepare_for_model method was removed as part of the new tokenizer API.
  • The truncation method is now only_first by default.

New Tokenizer API (@anthony, thomwolf, mfuntowicz)

The tokenizers evolved quickly in version 2, with the addition of Rust tokenizers. They now have a simpler and more flexible API that is aligned between Python (slow) and Rust (fast) tokenizers. This new API lets you control truncation and padding more finely, allowing things like dynamic padding or padding to a multiple of 8.

The redesigned API is explained in detail in #4510 and here

Notable changes:

  • it’s now possible to truncate to the max input length of a model while padding the longest sequence in a batch
  • padding and truncation are decoupled and easier to control
  • it’s possible to pad to a multiple of a predefined length, e.g. 8, which can give significant speed-ups on recent NVIDIA GPUs (V100)
  • a generic wrapper using tokenizer.__call__ can be used for all cases (single sequences, pairs of sequences, groups, batches, etc.)
  • tokenizers now accept pre-tokenized inputs (when the input is already split in word strings e.g. for NER)
  • All the Rust tokenizers are now fully tested like slow tokenizers
  • A new class AddedToken can be used to have a more fine-grained control on how added tokens behave during tokenization. In particular the user can control (1) whether left and right spaces are removed around the token during tokenization (2) whether the token will be identified inside another word and (3) whether the token will be recognized in normalized forms (e.g. in lower case if the tokenizer uses lower-casing)
  • Serialization issues were fixed
  • Possibility to create NumPy tensors when using the return_tensors parameter on tokenizers.
  • Introduced a new enum TensorType to map all the possible tensor backends we support: TensorType.TENSORFLOW, TensorType.PYTORCH, TensorType.NUMPY
  • Tokenizers now accept the TensorType enum for the return_tensors parameter of the encode(...), encode_plus(...) and batch_encode_plus(...) methods.
  • The new BatchEncoding property is_fast indicates whether the BatchEncoding comes from a Python (slow) tokenizer or a Rust (fast) tokenizer.
  • Slow and Fast Tokenizers are now picklable. So is their output, the dict sub-class BatchEncoding.
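
To make the decoupling of truncation and padding concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper name pad_and_truncate and its pad_id parameter are hypothetical, and only max_length and pad_to_multiple_of are meant to echo the tokenizer arguments named above.

```python
from typing import Dict, List, Optional


def pad_and_truncate(
    batch: List[List[int]],
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    pad_id: int = 0,
) -> Dict[str, List[List[int]]]:
    """Illustrative helper (hypothetical name): truncate, then pad a batch."""
    # Step 1: truncation, decided independently of padding.
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]

    # Step 2: dynamic padding to the longest sequence left in the batch.
    target = max(len(ids) for ids in batch)

    # Optionally round the padded length up to the next multiple (e.g. 8).
    if pad_to_multiple_of is not None and target % pad_to_multiple_of != 0:
        target = (target // pad_to_multiple_of + 1) * pad_to_multiple_of

    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate(
    [[101, 7592, 102], [101, 7592, 2088, 999, 102]],
    max_length=512,
    pad_to_multiple_of=8,
)
print([len(ids) for ids in encoded["input_ids"]])  # [8, 8]
```

The longest sequence has 5 tokens, so the batch is padded to 8 rather than to the model maximum of 512. In this sketch the rounding step can exceed max_length when max_length is not itself a multiple of pad_to_multiple_of.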

Several PRs to make the API more stable have been made:

  • [tokenizers] Fix #5081 and improve backward compatibility #5125 (thomwolf)
  • Tokenizers API developments #5103 (thomwolf)
  • Clearer error message in the use-case of #5169 (thomwolf)
  • Add more tests on tokenizers serialization - fix bugs #5056 (thomwolf)
  • [Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING #5252 (thomwolf)
  • [tokenizers] Several small improvements and bug fixes #5287
  • Add pad_to_multiple_of on tokenizers (reimport) #5054 (mfuntowicz)
  • [tokenizers] Updates data processors, docstring, examples and model cards to the new API #5308

TensorFlow improvements (jplu, dzorlu, @lysandre)

Very big release for TensorFlow!

  • TensorFlow models can now compute the loss themselves, using the TFPreTrainedModel.compute_loss method. #4530
  • Can now resize token embeddings in TensorFlow #4351
  • Cleaning TensorFlow models #5229

Enhanced documentation (@sgugger)

We welcome @sgugger as a team member in New York. He already introduced a lot of very cool documentation changes:

  • Added a model summary #4789
  • Expose classes used in documentation #4808
  • Explain how to preview the docs in a PR #4795
  • Clean documentation #4849
  • Remove old doc page and add note about cache in installation #5027
  • Fix all sphinx warnings #5068 (@sgugger)
  • Update pipeline examples to doctest syntax #5030
  • Reorganize documentation #5064
  • Update installation page and add contributing to the doc #5084
  • Update glossary #5148
  • Quick tour #5145
  • Switch master/stable doc and add older releases #5193
  • Add version control menu #5222
  • Don’t recreate old docs #5243
  • Tokenization tutorial #5257
  • Remove links for all docs #5280
  • New model sharing tutorial #5323

Training & fine-tuning quickstart

  • Our own joeddav added a training & fine-tuning quickstart to the documentation #5034!


MobileBERT

MobileBERT, from MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, was added to the library for both PyTorch and TensorFlow.

A single checkpoint is added: mobilebert-uncased which is the uncased_L-24_H-128_B-512_A-4_F-4_OPT checkpoint converted to our API.

This model was first implemented in PyTorch by lonePatient, ported to the library by vshampor, then finalized and implemented in TensorFlow by @lysandre.

Eli5 examples (yjernite) #4968

  • The examples/eli5 folder contains training code for the dense retriever and to fine-tune a BART model, the jupyter notebook for the blog post, and the code for the live demo.

  • The RetriBert model implements the dense passage retriever. It is essentially a wrapper around two BERT models and projection matrices, but it does gradient checkpointing very differently from a concurrent PR, so Yacine thought it would be easier to give it its own class for now and see later whether it can be merged into the BART code.

Enhanced examples/seq2seq (@sshleifer)

  • the examples/seq2seq folder is a combination of the old examples/summarization and examples/translation folders.
  • Finetuning works well for summarization, more experiments needed for translation. Finetuning works on multi-gpu, saves rouge scores during validation, and provides --freeze_encoder and --freeze_embeds options. These options make finetuning BART 5x faster on the cnn/dailymail dataset.
  • Distilbart code is added. It only supports summarization, for now.
  • Evaluation works well for both summarization and translation.
  • New Weights & Biases shared task for collaboration on the XSUM summarization task

Distilbart (@sshleifer)

  • Distilbart models are smaller versions of bart-large-cnn and bart-large-xsum. They can be loaded using, for example, BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-xsum-12-6'). See this tweet for more info on available models and their speed/performance.
  • Commands to reproduce are available in the examples/seq2seq folder

BERT Loses Patience (JetRunner)

Add BERT Loses Patience (Patience-based Early Exit) based on the paper and the official implementation

Unifying label arguments (@sgugger) #4722

  • Deprecate every label argument that is not named labels (such as masked_lm_labels, lm_labels, etc.) in favor of labels.
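
A deprecation of this kind is commonly handled with a small shim that still accepts the old keyword, warns, and forwards the value to labels. The sketch below is only an illustration of that pattern, not the library's code: the free-standing forward function and its return value are hypothetical.

```python
import warnings
from typing import Any, Dict, List, Optional


def forward(
    input_ids: List[int],
    labels: Optional[List[int]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Toy forward pass (hypothetical) that still accepts the old keyword."""
    if "masked_lm_labels" in kwargs:
        warnings.warn(
            "The `masked_lm_labels` argument is deprecated, use `labels` instead.",
            DeprecationWarning,
        )
        # Forward the old argument to the unified `labels` argument.
        labels = kwargs.pop("masked_lm_labels")
    if kwargs:
        raise TypeError(f"Unexpected keyword arguments: {list(kwargs)}")
    return {"input_ids": input_ids, "labels": labels}


# Old call sites keep working but emit a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = forward([1, 2, 3], masked_lm_labels=[1, -100, 3])
print(result["labels"], len(caught))
```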

NumPy type in tokenizers (mfuntowicz) #4585

Introduce a new tensor type for return_tensors on tokenizer for NumPy.

  • Because more than two tensor backends are now supported, a new enum TensorType lists all the tensor types that can be created: TensorType.TENSORFLOW, TensorType.PYTORCH, TensorType.NUMPY. This might help newcomers who don’t know about “tf” and “pt”.
    Note: TensorType is compatible with the previous “tf” and “pt” strings, and now “np”, for backward compatibility (covered by unit tests).

  • NumPy is now a possible target when creating tensors. This is useful for JAX.
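
The backward compatibility between the enum and the older strings can be pictured with a string-valued Enum. The sketch below re-creates a TensorType enum locally for illustration and does not import the library; the resolve_tensor_type helper is a hypothetical name.

```python
from enum import Enum
from typing import Union


class TensorType(Enum):
    # Local re-creation for illustration; values match the short strings.
    PYTORCH = "pt"
    TENSORFLOW = "tf"
    NUMPY = "np"


def resolve_tensor_type(return_tensors: Union[str, TensorType]) -> TensorType:
    """Accept either an enum member or its legacy string (hypothetical helper)."""
    if isinstance(return_tensors, TensorType):
        return return_tensors
    # Enum lookup by value; raises ValueError for an unknown string.
    return TensorType(return_tensors)


print(resolve_tensor_type("np") is TensorType.NUMPY)  # True
print(resolve_tensor_type(TensorType.PYTORCH).value)  # pt
```

Looking members up by value is what lets existing code that passes “tf” or “pt” keep working unchanged.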

Community notebooks

Benchmarks (patrickvonplaten)

The benchmark script was consolidated and some features were added:

Adds the ability to measure the following for TF and PT (#4912):

  • TensorFlow:

    • Inference: CPU, GPU, GPU + XLA, GPU + eager mode, CPU + eager mode, TPU
  • PyTorch:

    • Inference: CPU, CPU + torchscript, GPU, GPU + torchscript, GPU + mixed precision, Torch/XLA TPU
    • Training: CPU, GPU, GPU + mixed precision, Torch/XLA TPU
  • [Benchmark] Add encoder decoder to benchmark and clean labels #4810

  • [Benchmark] add tpu and torchscript for benchmark #4850

  • [Benchmark] Extend Benchmark to all model type extensions #5241

  • [Benchmarks] improve Example Plotter #5245

Hidden states, attentions and cache

Before v3.0.0, the way to handle attentions, model hidden states, and whether to use the cache in models that have it for sequential decoding was to specify an argument in the configuration. In version v3.0.0, while we do maintain that argument for backwards compatibility, we introduce a new way of handling these through the forward and call methods.

  • Output attentions #4538 (Bharat123rox)
  • Output hidden states #4978 (drjosephliu)
  • Use cache #5194 (patrickvonplaten)
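
The pattern described above, where a call-time argument overrides a configuration default, can be sketched as follows. ToyConfig and ToyModel are hypothetical stand-ins rather than library classes, and the returned dictionary only reports which flags were resolved.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyConfig:
    # Configuration-level defaults, kept for backward compatibility.
    output_attentions: bool = False
    output_hidden_states: bool = False
    use_cache: bool = True


class ToyModel:
    def __init__(self, config: ToyConfig) -> None:
        self.config = config

    def forward(
        self,
        input_ids: List[int],
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        use_cache: Optional[bool] = None,
    ) -> Dict[str, bool]:
        cfg = self.config
        # An explicit argument wins; None falls back to the configuration.
        return {
            "output_attentions": cfg.output_attentions if output_attentions is None else output_attentions,
            "output_hidden_states": cfg.output_hidden_states if output_hidden_states is None else output_hidden_states,
            "use_cache": cfg.use_cache if use_cache is None else use_cache,
        }

    __call__ = forward


model = ToyModel(ToyConfig(output_hidden_states=True))
print(model([1, 2, 3]))
print(model([1, 2, 3], output_attentions=True, use_cache=False))
```

Using None as the default keeps the two mechanisms distinguishable: False passed at call time is an explicit choice, while None means "use whatever the configuration says".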

Revamped AutoModels (patrickvonplaten)

The AutoModelWithLMHead encompasses all models with a language modeling head, not making the distinction between causal, masked and seq2seq models. Three new auto models are added:

  • AutoModelForCausalLM for Autoregressive models
  • AutoModelForMaskedLM for Autoencoding models
  • AutoModelForSeq2SeqLM for Sequence-to-sequence models with causal LM for the decoder

New model & tokenizer architectures


  • Fixed a bug causing invalid ordering of the inputs in the underlying ONNX IR.
  • Increased logging to give the user more information about the exported variables.

Bug fixes and improvements

  • TFRobertaModelIntegrationTest requires tf #4726 (@sshleifer)
  • Cleanup glue for TPU #4621 (jysohn23)
  • [Reformer] Improved memory if input is shorter than chunk length #4720 (patrickvonplaten)
  • Pipelines: miscellanea of QoL improvements and small features #4632 (@julien-c)
  • Fix bug when changing the token for generate #4745 (patrickvonplaten)
  • never_split on slow tokenizers should not split #4723 (mfuntowicz)
  • PretrainedModel.generate: remove unused kwargs #4761 (@sshleifer)
  • Codecov is now setup differently to have better insights into code coverage #4768 (@lysandre)
  • Don’t access pad_token_id if there is no pad_token #4773 (@sgugger)
  • Removed deprecated use of Variable API from pplm example #4619 (prajjwal1)
  • Add drop_last arg for data loader #4757 #4925 (setu4993)
  • No silent error when XLNet’s d_head is already in the configuration #4747 (@lysandre)
  • MarianTokenizer: delete unused constants #4802 (@sshleifer)
  • NER: Add new WNUT’17 example #4681 (stefan-it)
  • [EncoderDecoderConfig] automatically set decoder config to decoder #4809 (patrickvonplaten)
  • Add matplotlib to known 3rd party dependencies #4800 (@sshleifer)
  • Pipelines test and new kwarg #4812 (@sshleifer)
  • Updated path “cd examples/text-generation/pplm” #4778 (Mr-Ruben)
  • [marian tests] pass device to pipeline #4815 (@sshleifer)
  • Export PretrainedBartModel from init #4819 (BramVanroy)
  • Updates args in tf squad example. #4820 (daniel-shan)
  • [Generate] beam search should generate without replacement (patrickvonplaten)
  • TFTrainer: Align how the checkpoints are managed the same way than in the PyTorch trainer. #4831 (jplu)
  • [Longformer] Remove redundant code #4839 (ZhuBaohe)
  • [cleanup] consolidate some prune_heads logic #4799 (@sshleifer)
  • Fix the getattr method in BatchEncoding #4772 (jplu)
  • Consolidate summarization examples #4837 (aretius)
  • Fix a bug in the initialization and serialization of TFRobertaClassificationHead #4884 (harkous)
  • [examples] Cleanup summarization docs #4876 (@sshleifer)
  • bug fix #4867 (@songyouwei)
  • Remove unused arguments in Multiple Choice example #4853 (@sgugger)
  • Deal with multiple choice in common tests #4886 (@sgugger)
  • Fix the CI #4903 (@sgugger)
  • [All models] fix docs after adding output attentions to all forward functions #4909 (patrickvonplaten)
  • Add more models to common tests #4910 (@sgugger)
  • [ctrl] fix pruning of MultiHeadAttention #4904 (aretius)
  • Don’t init TPU device twice #4916 (patrickvonplaten)
  • Run a single wandb instance per TPU run #4851 (@lysandre)
  • check type before logging in trainer to ensure values are scalars #4883 (m)oldey)
  • Split LMBert model in two #4874 (@sgugger)
  • Make multiple choice models work with input_embeds #4921 (@sgugger)
  • Fix resize_token_embeddings for Transformer-XL #4759 (RafaelWO)
  • [mbart] Fix fp16 testing logic #4949 (@sshleifer)
  • Hans data with newer tokenizer API #4854 (@sgugger)
  • Fix parameter ‘output_attentions’ docstring #4976 (ZhuBaohe)
  • Improve ONNX logging #4999 (mfuntowicz)
  • NER: fix construction of input examples for RoBERTa #4943 (stefan-it)
  • Possible fix to make AMP work with DDP in the trainer #4728 (BramVanroy)
  • Make DataCollator a callable #5015 (@sgugger)
  • Increase pipeline support for ONNX export. #5005 (mfuntowicz)
  • Fix importing transformers on Windows - SIGKILL not defined #4997 (mfuntowicz)
  • TFTrainer: improve logging #4946 (borisdayma)
  • Add position_ids in TFElectra models docstring #5021 (@sgugger)
  • [Bart] Question Answering Model is added to tests #5024 (patrickvonplaten)
  • Ability to pickle/unpickle BatchEncoding pickle (reimport) #5039 (mfuntowicz)
  • refactor(wandb): consolidate import #5044 (borisdayma)
  • [cleanup] Hoist ModelTester objects to top level #4939 (aretius)
  • Convert hans to Trainer #5025 (@sgugger)
  • Fix marian tokenizer save pretrained #5043 (@sshleifer)
  • [cleanup] examples test_run_squad uses tiny model #5059 (@sshleifer)
  • Add header and fix command for HANS #5082 (@sgugger)
  • [examples] SummarizationModule improvements #4951 (@sshleifer)
  • Some changes to simplify the generation function #5031 (yjernite)
  • Make default_data_collator more flexible and deprecate old behavior #5060 (@sgugger)
  • [MarianTokenizer] Switch to sacremoses for punc normalization #5092 (@sshleifer)
  • [style] add pandas to setup.cfg #5093 (@sshleifer)
  • [ElectraForQuestionAnswering] fix qa example in doc #4929 (patil-suraj)
  • Fixing TPU training by disabling gradients logging #4926 (patil-suraj)
  • [docs] fix T5 training doc #5080 (patil-suraj)
  • support local_files_only option for tf models #5116 (ogarin)
  • [cleanup] generate_beam_search comments #5115 (@sshleifer)
  • [fix] Move _adjust_logits above postprocess to fix Marian.generate #5126 (@sshleifer)
  • Pin sphinx-rtd-theme #5128 (@lysandre)
  • Add missing arg in 02-transformers notebook #5085 (pri-ax)
  • [cleanup] remove redundant code in SummarizationDataset #5119 (@sshleifer)
  • AutoTokenizer supports mbart-large-en-ro #5121 (@sshleifer)
  • Fix in Reformer Config documentation #5138 (erickrf)
  • [bart-mnli] Fix class flipping bug #5141 (@sshleifer)
  • [MobileBert] fix dropout #5150 (ZhuBaohe)
  • SummarizationPipeline: init required task name #5086 (@julien-c)
  • [examples] fixes arguments for summarization finetune scripts #5157 (ieBoytsov)
  • Fixing docs for Encoder Decoder Config #5171 (mikaelsouza)
  • fix bart doc #5132 (fuzihaofzh)
  • Added feature to move added tokens in vocabulary for Transformer-XL #4953 (RafaelWO)
  • Add support for gradient checkpointing in BERT #4659 (ibeltagy)
  • Fix for IndexError when Roberta Tokenizer is called on empty text #4209 (malteos)
  • Add TF auto model to the docs + fix sphinx warnings (again) #5187 (@sgugger)
  • Have documentation fail on warning #5189 (@lysandre)
  • Cleaner warning when loading pretrained models #4557 (thomwolf)
  • Upgrade examples to pl=0.8.1 #5146 (@sshleifer)
  • [fix] mobilebert had wrong path, causing slow test failure #5205 (@sshleifer)
  • [fix] remove unused import #5206 (@sshleifer)
  • [Reformer] Axial Pos Emb Improve mem usage reformer #5209 (patrickvonplaten)
  • [pl_examples] revert deletion of optimizer_step #5227 (@sshleifer)
  • [bart] add config.extra_pos_embeddings to facilitate reuse #5190 (@sshleifer)
  • Only put tensors on a device #5223 (@sgugger)
  • Fix PABEE division by zero error #5233 (JetRunner)
  • Use the script in utils #5224 (@sgugger)
  • Delay decay schedule until the end of warmup #4940 (amodaresi)
  • Replace pad_token with -100 for LM loss calculation #4718 (setu4993)
  • examples/seq2seq supports translation #5202 (@sshleifer)
  • Fix convert_graph_to_onnx script #5230 (n1t0)
  • Refactor Code samples; Test code samples #5036 (@lysandre)
  • [Generation] fix docs for decoder_input_ids #5306 (patrickvonplaten)
  • [pipelines] Change summarization default to distilbart-cnn-12-6 #5289 (@sshleifer)
  • Add BART-base modeling and configuration #5315 (JetRunner)
  • CircleCI stores cleaner output at test_outputs.txt #5291 (@sshleifer)
  • [pl_examples] default warmup steps=0 #5316 (@sshleifer)