Tabular classification/regression pipeline

Hi!

I have a fine-tuned PyTorch model that builds on a pretrained RoBERTa model from Hugging Face. The PyTorch model simply adds two fully connected layers after the RoBERTa encoder, where I concatenate some additional parameters to predict a single value (i.e. a regression task).
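For concreteness, here is a minimal sketch of what I mean (names and layer sizes are illustrative, not my actual code):

import torch
import torch.nn as nn
from transformers import AutoModel

class RobertaRegressor(nn.Module):
    """RoBERTa encoder followed by two fully connected layers.

    Extra tabular features are concatenated with the [CLS] embedding
    before the regression head. Sizes below are placeholders.
    """

    def __init__(self, n_extra_features: int = 3, hidden: int = 128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("roberta-base")
        enc_dim = self.encoder.config.hidden_size  # 768 for roberta-base
        self.fc1 = nn.Linear(enc_dim + n_extra_features, hidden)
        self.fc2 = nn.Linear(hidden, 1)  # single regression output

    def forward(self, input_ids, attention_mask, extra_features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]             # [CLS] token embedding
        x = torch.cat([cls, extra_features], dim=-1)  # append tabular columns
        return self.fc2(torch.relu(self.fc1(x))).squeeze(-1)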

I would very much like this model to be available for inference on the Hub, and I believe the Tabular Classification or Tabular Regression tasks are suitable widgets, since they allow me to pass the text for the transformer in one column and the additional parameters in the other columns.

However, I’ve been trying to follow this tutorial, which links to the wine-quality example by @osanseviero, but I can’t find any pipeline.py file there and don’t really understand how to implement the functions specified in the tutorial.

Some users have successfully implemented the Tabular Regression widget for sklearn models; has anyone figured out how to do so with transformers models?

Thank you in advance! :blush:


Hey there! This is an excellent question!

The Inference API (widgets) doesn’t work out of the box at the moment for tabular tasks with transformers. The reason is that there is no pipeline in transformers for these two tasks, which leads to this issue.

The generic solution is a bit hacky and more of a proof of concept. It requires switching the model’s library to generic by setting library_name: generic in the README.md. In pipeline.py you can then load the model in __init__ and run inference in __call__. Ideally this will eventually be fixed by adding the pipeline to transformers and then just using the Inference API directly; the approach above is only a proof of concept. Note that you will also need to update the model card with an example table so the widget knows which columns are expected. Here is an example
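As a rough sketch, the metadata at the top of the README.md could look something like this (the tag, column names, and values below are placeholders, not taken from a real repo):

---
library_name: generic
tags:
- tabular-regression
widget:
- structuredData:
    text:
    - "an example input sentence"
    feature_1:
    - 0.5
    feature_2:
    - 1.2
---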

Thank you for your answer! However, I’m still a bit unsure about how to write the pipeline.py file. Currently my file simply looks like this:

from typing import Dict, List, Union
from transformers import AutoModel, AutoTokenizer

class PreTrainedPipeline():
    def __init__(self, path=""):
        # Preload all the elements you are going to need at inference,
        # for instance your model, processors, or tokenizer.
        # This function is only called once, so do all the heavy I/O here.
        self.tokenizer = AutoTokenizer.from_pretrained("StyrbjornKall/new_dummy")
        self.model = AutoModel.from_pretrained("StyrbjornKall/new_dummy")

    def __call__(
        self, inputs: Dict[str, Dict[str, List[Union[str, float]]]]
    ) -> List[Union[str, float]]:
        """
        Args:
            inputs (:obj:`dict`):
                a dictionary containing a key 'data' mapping to a dict in which
                the values represent each column.
        Return:
            A :obj:`list` of floats or strings: The classification output for each row.
        """
        encodings = self.tokenizer.batch_encode_plus(
            inputs['text'],
            padding='max_length',
            max_length=100,
            truncation=True)

        tok = encodings['input_ids']

        return tok

I thought an easy starting point would be to simply return the tokens of the text held in the “text” column of the table, but this does not seem to work… Now I also get a time-out error when trying to load the model, which I have not seen before.

Here’s my repo for reference: new_dummy

The library_name you’re using right now is sklearn. This PR should fix that


Thank you so much, that seems to have resolved some things. However, now I get the following error from the API:

The task `tabular-classification` is not recognized by api-inference-community

On another note: how are the inputs accessed? Is it table = inputs['data'] to access the data stored in the API table, and then table['text'] to access the contents of the ‘text’ column?
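In other words, something like this? (Just my guess based on the docstring in the template; the column name "text" is from my own repo.)

# sketch of __call__ on PreTrainedPipeline, unpacking the widget table
def __call__(self, inputs):
    table = inputs["data"]                    # column name -> list of cell values
    texts = table["text"]                     # the text column, fed to the tokenizer
    extra_cols = {name: values                # the remaining (numeric) columns
                  for name, values in table.items() if name != "text"}
    encodings = self.tokenizer(texts, padding=True, truncation=True)
    return encodings["input_ids"]             # placeholder output, one list per row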

Sorry about that! This PR will fix things once deployed. In the meantime you should be able to play with structured-data-classification instead, which was the old name.


Thank you! I also suggest the same be done for the tabular-regression task, which seems to produce the same error :slight_smile:

Is structured-data-classification not supported as a widget in the API? The table from the README is no longer visible, for example.

Yes, the PR will update it for both tabular-* tasks.
You’re right, the widget is now disabled for that task. Sorry for misleading you! We need to wait until the PR is merged and deployed, and this will work afterwards.

Thank you!

I am working on a regression problem and I am looking forward to using transformers for it, but before jumping into the implementation, I am curious: can you use transformers for a regression problem? I have around 90 features (floating point) and one target. I couldn’t find any paper on transformers for regression problems, so please let me know if any of you have used transformers for this purpose.

I am working on a problem with tabular data that has more than 90 features and one target; all the features are integers (continuous). I want to use a pre-trained BERT or GPT-2, but the tokenizer expects its input in text format. I can convert the integer data to text like this:

original_data = [1,2,3,4,5,…,94]
transformed_data = ["1,2,3,4,5,…,94"]

Now if I pass transformed_data to the tokenizer, it will surely work, but I want to know if someone has tried using transformers for this purpose and, if so, what the outcome was and what the results looked like.

How can I use the transformers library for this purpose? All the tokenizers are trained on text data, so I am kind of lost. Any help will be appreciated.
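For what it’s worth, here is a minimal sketch of the idea described above, assuming each row of features is serialized as a comma-separated string and fed to a standard sequence-classification head with num_labels=1, which transformers treats as regression (MSE loss). Whether serializing numbers as text works well is an empirical question; the feature values and targets below are toy placeholders.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Serialize each row of numeric features as text, as described above.
rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]            # toy stand-in for the 90+ features
texts = [",".join(str(v) for v in row) for row in rows]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=1 with problem_type="regression" gives a single-output head
# trained with MSE loss, i.e. transformers' built-in regression setup.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
targets = torch.tensor([0.7, 1.3])                    # toy regression targets

out = model(**enc, labels=targets)
print(out.loss, out.logits.squeeze(-1))               # MSE loss and predictions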