We are working on a new major release that should come out at the end of next week, with cool new features that will unfortunately result in some breaking changes. There will be one last release for v3 before we start introducing those breaking changes on master, so if you're using a source installation, be prepared, or revert to v3.5.0 for a bit.
- `AutoTokenizer` and `pipeline` will switch to Fast tokenizers by default.
  => Resulting breaking change: the slow and fast tokenizers have roughly the same API, but they handle overflowing tokens differently.
  => Why are we doing this? This will greatly improve tokenization performance in pipelines and enable clearer, simpler example scripts leveraging the fast tokenizers. Overflow handling in Fast tokenizers is also a lot more powerful than its counterpart in slow tokenizers (see the first sketch after this list).
- `sentencepiece` will be removed as a required dependency. (It will still need to be installed for slow sentencepiece-based tokenizers.)
  => Resulting breaking change: some people will have to install sentencepiece explicitly when they didn't have to before, with the command `pip install transformers[sentencepiece]` (see the second sketch after this list).
  => Why are we doing this? This, in turn, will allow us to create and maintain a conda channel offering the full Hugging Face suite on conda.
- Reorganizing the internals of the library into subfolders (either one per model, or one for all `models`, one for all `tokenizers`, one for `pipelines`, one for `trainer`, etc.). With the number of models growing, the source folder is getting too hard to navigate.
  => Resulting breaking change: people directly accessing the internals will have to update the paths they use. If you only use imports from `transformers` directly, nothing will break (see the third sketch after this list).
  => Why are we doing this? The library will be more robust to scaling to more models.
- Switching the `return_dict` argument to `True`. This argument, which makes the outputs of the models self-documented, was introduced a few months ago with a default of `False` for backward compatibility.
  => Resulting breaking change: unpacking the output of a model with commands like `loss, logits = model(**inputs)` won't work anymore. The `to_tuple` method can convert a model output back to a tuple (see the last sketch after this list).
  => Why are we doing this? Outputs of the models are easier to understand when they are `ModelOutput` objects: you can index them like a dict or use auto-complete in an IDE to find all fields. This will also allow us to optimize the TensorFlow models more (tuples of varying sizes being incompatible with graph mode).
- Deprecated arguments or functions will be removed on a case-by-case basis.
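
To make the overflow difference concrete, here is a minimal sketch (the checkpoint and the parameter values are just illustrative). A fast tokenizer returns overflowing tokens as additional full encodings plus an `overflow_to_sample_mapping`, while a slow tokenizer returns a single flat `overflowing_tokens` list:

```python
from transformers import AutoTokenizer

# Fast tokenizer (the new default for AutoTokenizer)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "a fairly long sentence " * 50

# Overflowing tokens come back as extra, full-fledged encodings,
# one per chunk, plus a mapping back to the original sample:
encoded = tokenizer(
    text,
    max_length=32,
    truncation=True,
    stride=8,                       # overlap between consecutive chunks
    return_overflowing_tokens=True,
)
print(len(encoded["input_ids"]))              # several chunks, each <= 32 tokens
print(encoded["overflow_to_sample_mapping"])  # which sample each chunk came from

# A slow tokenizer instead returns the leftover ids in a single
# "overflowing_tokens" entry alongside the one truncated encoding:
slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
encoded_slow = slow(text, max_length=32, truncation=True, return_overflowing_tokens=True)
print(len(encoded_slow["overflowing_tokens"]))  # flat list of truncated ids
```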
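On the sentencepiece side, only the slow, sentencepiece-backed tokenizers are affected. A quick sketch (T5 is just one example of such a tokenizer; the exact error message when the dependency is missing may vary by version):

```python
# sentencepiece will no longer be pulled in automatically. Slow
# sentencepiece-based tokenizers (e.g. T5, ALBERT, XLNet) will need it,
# via `pip install sentencepiece` or `pip install transformers[sentencepiece]`.
from transformers import T5Tokenizer

# Fails with an error pointing at the missing sentencepiece dependency
# if it is not installed; works as before once it is:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
print(tokenizer.tokenize("Hello world"))
```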
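On the reorganization, only the top-level imports are the public API, so code like the following keeps working; the internal path shown commented out is the current pre-reorganization layout, and the new locations are not final:

```python
# Safe: top-level imports are the public API and will keep working.
from transformers import BertModel, BertTokenizerFast

# Fragile: reaching into internal modules pins you to the current layout.
# Paths like this one may move when the source is split into subfolders:
# from transformers.modeling_bert import BertModel
```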
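Finally, a sketch of what the `return_dict` switch changes in practice (the checkpoint and inputs are placeholders):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was great!", return_tensors="pt")
inputs["labels"] = torch.tensor([1])

outputs = model(**inputs)

# Old style, broken once return_dict defaults to True:
# loss, logits = model(**inputs)

# New style: a self-documented ModelOutput, indexable like a dict
# or via attributes:
loss = outputs.loss
logits = outputs["logits"]

# If you really want the old behavior, convert back to a tuple:
loss, logits = outputs.to_tuple()
```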