[RFC] Transformers Pipeline v2

Dear :hugs: community,

Late in 2019, we introduced the concept of Pipeline in transformers, providing single-line-of-code inference for downstream NLP tasks. At that time, we only supported a few tasks, such as:

  • Token Classification (ex: NER)
  • Sentence Classification (ex: Sentiment Analysis)
  • Question Answering
  • Feature Extraction (i.e. computing embeddings from pretrained model)

The initial design was very simple and served us well for quite a while. It allowed many new tasks to be integrated, among them:

  • Fill mask
  • Generation
  • Summarization
  • Translation
  • Zero-Shot

Yet, over the last couple of months, we started to spot a few areas where the current pipeline design would not allow us to keep providing all the features we would like to. In this context, the :hugs: team started discussing a major refactoring of the pipelines, with the following goals in mind:

  • Increased Flexibility: We want you to be able to customize pipelines as much as possible.
  • Performance: We want pipelines to be faster so that they can go into production more easily.
  • Training: Pipelines encapsulate some logic to make it easier to work with SOTA NLP models. Training is one step of the overall process, and pipelines should definitely support it.

Today, we would like to share with you what we have in mind for the future of Pipelines:

Main Architectural Changes:

  • Framework Specific Pipeline

Pipelines currently share the same code for both the PyTorch and TensorFlow backends. We would like to split the implementation of pipelines in a framework-specific way.

Moving in this direction will make pipelines more consistent with the rest of the transformers repository, where model implementations are done in a framework-specific fashion.

Also, by using a framework-specific approach, it will be possible to express the entire computation graph with operators provided by the framework. This is particularly appealing as it would allow us to provide training capabilities along with easier export of pipelines for inference.

  • Tokenization and model configuration / argument passing

Currently, pipelines offer little to no flexibility to change the behavior of the tokenizer, the model, or the post-processing steps. For instance, it’s nearly impossible to change the maximum answer length when using the Question Answering pipeline…
Also, it is not possible to provide any parameters when creating the tokenizer for the pipeline…

As such, we would like to introduce the notion of Configuration. A Configuration lets you specify all the elements you care about and save them as part of the pipeline object. Configurations are essentially based on key/value mappings and can tune the overall behavior of every step involved in a pipeline. Finally, many different configurations can be attached to a pipeline, allowing you to iterate on and version all the elements in the same place.

Configurations are serialized and saved along with the tokenizer and model weights, to make it as easy as possible to save training configurations and deploy such pipelines.
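To make the idea concrete, here is a minimal sketch of what a key/value configuration with JSON serialization and named registration could look like. The names SketchConfig and ConfigRegistry are hypothetical and only serve the illustration; they are not part of transformers and not the proposed API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class SketchConfig:
    """Hypothetical key/value configuration (not a transformers class)."""

    max_length: int = 512
    padding: bool = True
    truncation: bool = True

    def to_json(self) -> str:
        # Serialize every field so it can be stored next to the weights.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "SketchConfig":
        return cls(**json.loads(payload))


class ConfigRegistry:
    """Hypothetical holder for several named configurations."""

    def __init__(self) -> None:
        self._configs = {}

    def register(self, name: str, config: SketchConfig) -> None:
        self._configs[name] = config

    def resolve(self, ref: Optional[Union[str, SketchConfig]] = None) -> SketchConfig:
        # Accept nothing (default config), a registered name, or a config object.
        if ref is None:
            return SketchConfig()
        if isinstance(ref, str):
            return self._configs[ref]
        return ref


registry = ConfigRegistry()
registry.register("short", SketchConfig(max_length=128))

# Round trip: what gets saved to disk can be loaded back unchanged.
restored = SketchConfig.from_json(registry.resolve("short").to_json())
```

Resolving a configuration from either a name or an object mirrors the three call styles shown in the proposed design (default, reference, object).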

Here is an example of the proposed design:

    # Define some configs
    grouped_entities_config = TokenClassificationConfig({
        "max_length": 512,
        "padding": True,
        "truncation": True,
    })

    ungrouped_entities_config = TokenClassificationConfig({
        "max_length": 512,
        "padding": True,
        "truncation": True,
    })

    # Create the pipeline (`...` stands for the model identifier)
    nlp = TokenClassificationPipeline.from_pretrained(...)

    # Register configurations
    nlp.register_config("grouped", grouped_entities_config)
    nlp.register_config("ungrouped", ungrouped_entities_config)

    # Forward (output holds the input(s), the configuration which was used and the output(s))
    default_outputs = nlp("My name is Morgan and I live in Paris")  # Default config
    grouped_outputs = nlp("My name is Morgan and I live in Paris", "grouped")  # Config ref
    ungrouped_outputs = nlp("My name is Morgan and I live in Paris", ungrouped_entities_config)  # Config object

  • Model export

Another point where we would like to offer more flexibility is the export and serialization of pipelines. As you may have seen, PyTorch and TensorFlow are very versatile frameworks, but for production and inference workloads one can leverage more dedicated tools such as ONNX Runtime, with which we have been collaborating a lot over the past months.

By using the framework-specific approach described above, we would like to rely on the tracing mechanisms provided by both PyTorch and TensorFlow to export most of the pipeline parts as a single ONNX graph, TorchScript module, or TF Graph, making it much easier to run inference with it.

Still, not all pipelines would benefit from this feature at first, as some of them require a lot of computation that cannot be expressed with such operators, but we would like to cover more and more of them release after release.
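As a rough illustration of the tracing idea in PyTorch, the sketch below traces a toy module whose post-processing (softmax and arg-max) is written with torch operators only, so it ends up in the same TorchScript graph as the model. TinyClassifierWithPostprocessing is a hypothetical stand-in, not an actual pipeline.

```python
import torch


class TinyClassifierWithPostprocessing(torch.nn.Module):
    """Toy stand-in for a model plus its post-processing (hypothetical)."""

    def __init__(self, hidden_size: int = 8, num_labels: int = 3) -> None:
        super().__init__()
        self.classifier = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor):
        logits = self.classifier(features)
        # Post-processing written with torch operators, so it is traced too.
        scores = torch.softmax(logits, dim=-1)
        best_score, best_label = scores.max(dim=-1)
        return best_label, best_score


module = TinyClassifierWithPostprocessing().eval()
example_input = torch.randn(2, 5, 8)  # (batch, tokens, hidden)
traced = torch.jit.trace(module, example_input)

# The traced graph reproduces the eager results, post-processing included.
new_input = torch.randn(4, 5, 8)
eager_labels, eager_scores = module(new_input)
traced_labels, traced_scores = traced(new_input)
```

The same traced object could then be saved with traced.save(...) and loaded without the Python class definition, which is what makes a single exported block attractive for deployment.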

  • Training capabilities

Last but not least, we would like to provide a unified experience while using pipelines by allowing the use of a Pipeline object when training a model. This is made possible by the new framework-specific approach discussed above, as it allows us to express the overall computation graph.

As an example, the PyTorch implementation of such pipelines would be based on torch.nn.Module, hence providing all the tooling and integration required to train a model with the framework.

Along with providing better integration, we would like to propose framework-specific methods and syntax to better fit each framework’s usage. For instance, TensorFlow pipelines might benefit from a compile() method, and PyTorch ones from a state_dict() generator.

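    from torch.nn.functional import cross_entropy
    from torch.optim import Adam

    # Create the pipeline (`...` stands for the model identifier)
    nlp = TokenClassificationPipeline.from_pretrained(...)

    # Possibility to train? PytorchPipeline inherits from torch.nn.Module
    optim = Adam(nlp.parameters(), lr=0.01)
    for _ in some_data_loader:
        optim.zero_grad()
        # forward doesn't post-process the model's output(s)
        logits, some_other_tensor = nlp.forward("My name is Morgan and I live in Paris")
        loss = cross_entropy(logits, some_labels)
        # Backpropagate and update the weights
        loss.backward()
        optim.step()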

All the points detailed here would apply to both PyTorch and TensorFlow. Any other backend? (JAX? :face_with_hand_over_mouth:)

Also, do not hesitate to let us know if you see some points that would be useful to support in the new pipelines.

Morgan & :hugs: team,


I like this. Especially the focus on customizability on the one hand and having a training pipeline on the other.

Even though I agree with the direction you want to take considering separating the frameworks, I wonder: a few months back it seemed that there was a lot of effort going into unifying everything and being framework-agnostic. What changed? Does that make things too complex? Does it limit your options too much?

One thing that I am not sure about is the integration of the tokenizer in the pipeline, though. Usually we would either preprocess the data (e.g. using datasets) or tokenize and pad in a collate_fn in the dataloader, which can be parallelized. As your example is currently written, it seems that you just get the raw text input from the dataloader (i.e. batch of sentences = list of strings) and pass it to nlp.forward, which is then responsible for tokenization and forwarding through the model at every step. I think this would make everything slow and inefficient, and also un-PyTorch-y if that’s a word. The preprocessing of the data should happen a lot earlier imo.
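For reference, the collate_fn approach mentioned above could look roughly like the sketch below. It uses a toy whitespace tokenizer and vocabulary (both made up for the example) in place of a real tokenizer, so that it runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader

sentences = [
    "My name is Morgan",
    "I live in Paris",
    "Hello",
]

# Toy vocabulary (assumption for this sketch); id 0 is reserved for padding.
vocab = {"<pad>": 0}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))


def collate(batch):
    """Tokenize and pad one batch of raw strings."""
    token_ids = [[vocab[word] for word in text.split()] for text in batch]
    longest = max(len(ids) for ids in token_ids)
    input_ids = torch.zeros(len(batch), longest, dtype=torch.long)
    attention_mask = torch.zeros(len(batch), longest, dtype=torch.long)
    for row, ids in enumerate(token_ids):
        input_ids[row, : len(ids)] = torch.tensor(ids)
        attention_mask[row, : len(ids)] = 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The model would then only ever receive ready-made tensors.
loader = DataLoader(sentences, batch_size=2, collate_fn=collate)
batches = list(loader)
```

With num_workers set on the DataLoader, this tokenization would run in worker processes rather than inside the training step.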

Just my two cents!

That’s true, we recently tried to unify many aspects of transformers. Yet, for pipelines, this raises three concerns:

1. Performance

If you look at the current implementation of pipelines, we rely on a framework-specific part for inference through the model and then convert everything needed to NumPy arrays in order to post-process and extract the higher-level information from the model (QA span extraction for instance).

This works well, but it implies some internal memory copies and layout changes to go from one framework to another. This is even more true if a tensor lives on the GPU.

Also, some NumPy operators don’t support batching, whereas their PyTorch & TensorFlow alternatives do. It would be much more efficient to leverage those operators for pipelines.
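To illustrate batching with framework operators, here is a minimal sketch of QA span selection done for a whole batch at once with PyTorch operators, without leaving the framework (or the device). The best_spans helper and its max_answer_len parameter are assumptions made for this example, not the actual pipeline code.

```python
import torch


def best_spans(start_logits, end_logits, max_answer_len):
    """Return (start, end, score) of the best span for each batch item."""
    seq_len = start_logits.size(-1)
    # scores[b, i, j] = start_logits[b, i] + end_logits[b, j]
    scores = start_logits.unsqueeze(-1) + end_logits.unsqueeze(-2)
    # length[i, j] = j - i + 1; keep spans with 1 <= length <= max_answer_len.
    positions = torch.arange(seq_len, device=scores.device)
    length = positions.unsqueeze(0) - positions.unsqueeze(1) + 1
    valid = (length >= 1) & (length <= max_answer_len)
    scores = scores.masked_fill(~valid, float("-inf"))
    # One arg-max over all (start, end) pairs, for every batch item at once.
    best_score, flat_index = scores.flatten(start_dim=1).max(dim=-1)
    start = torch.div(flat_index, seq_len, rounding_mode="floor")
    end = flat_index % seq_len
    return start, end, best_score


start_logits = torch.tensor([[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 9.0]])
end_logits = torch.tensor([[0.0, 0.0, 3.0, 7.0], [0.0, 1.0, 0.0, 0.0]])
start, end, score = best_spans(start_logits, end_logits, max_answer_len=2)
```

Because every step is a tensor operation, the same function works unchanged on GPU tensors and for any batch size.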

2. Tracing

Another thing we would like to give users is an easier way to trace/export a model.

Currently, because pipelines mix framework-specific (PyTorch/TensorFlow) code with NumPy operators, we cannot really export an entire model, from the actual neural network to the post-processing steps.

TorchScript, TensorFlow Graph and ONNX all rely on the operations supported by the exporting framework to generate a static graph representation. Having NumPy operators in the middle breaks this compatibility and prevents us from exporting a monolithic block which, hopefully, should work out of the box for a wide variety of tasks.

3. Training

Last but not least, as we would like to introduce training capability to the pipelines, we need to provide some closer integration with frameworks.

Again, PyTorch & TensorFlow create a graph representation of the network during the forward pass. This graph is created using internal mechanisms specific to each framework (e.g. grad_fn in PyTorch), tightly coupled with the operators of that specific framework.

All the operations within the network requiring gradients need to be included in the graph for the chain rule to apply and autograd to compute the actual gradients for each of them. This doesn’t work if you put NumPy-specific operators in the middle.
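A small PyTorch illustration of this point, assuming a toy linear model: as long as the computation stays within torch operators, the result carries a grad_fn and gradients reach the weights; after a round trip through NumPy, the result is disconnected from the graph.

```python
import numpy as np
import torch

torch.manual_seed(0)
weights = torch.randn(4, 3, requires_grad=True)
inputs = torch.randn(2, 4)
logits = inputs @ weights

# Path 1: stay inside PyTorch. The result is linked to the graph.
log_probs = torch.log_softmax(logits, dim=-1)
stays_in_graph = log_probs.grad_fn is not None

# Path 2: post-process with NumPy. Calling .numpy() directly on a tensor that
# requires grad raises an error, so detach() is needed, which cuts the graph.
logits_np = logits.detach().numpy()
shifted = logits_np - logits_np.max(axis=-1, keepdims=True)
log_probs_np = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
back_in_torch = torch.from_numpy(log_probs_np)
left_the_graph = back_in_torch.grad_fn is None

# Only the PyTorch path can backpropagate to the weights.
loss = -log_probs[:, 0].mean()
loss.backward()
```

Both paths compute the same values; the difference is only that the NumPy result can no longer contribute gradients.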

Thanks for the detailed explanation. I agree that at some point you just have to go back to framework-specific implementations, but I was curious how the distinction will be made for the rest of the library as well as for datasets. In the case of the latter, I think that datasets also casts to NumPy arrays as intermediate states and optionally returns tensors in the format requested by the user. (This has some underlying issues, too, because there is no one-to-one dtype conversion between NumPy and torch.)

That raises the question: from a user perspective, will this get too complicated when putting all the parts together? Will it be clear to users which parts of the library are framework-specific (pipeline, models) and which ones aren’t (datasets, tokenizers, trainer (?))?


Just an aside on ONNX inference (cross-post from Supporting ONNX optimized models)

I’d be interested in an nlp = pipeline("sentiment-analysis", onnx=True) pipeline like @valhalla created, where the ONNX files are hosted on the model hub and stored in the transformers cache.

My use case is fast inference of pre-trained models on embedded applications (no network connection).