Adding a new model to Transformers with additional dependencies


Currently I am implementing a PyTorch version of the Tapas algorithm proposed by Google AI. Since the algorithm is based on BERT, I am using parts of the BERT implementation in Transformers. However, the original implementation of the Tapas algorithm (which is written in TensorFlow 1) contains operations on segmented tensors, such as:

  • tf.math.unsorted_segment_mean
  • tf.math.unsorted_segment_sum
  • tf.math.unsorted_segment_max
  • tf.math.unsorted_segment_min

These operations do not exist in PyTorch. However, there’s a small extension library called pytorch_scatter which implements these operations. So I would use this library in my implementation.
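For reference, the semantics of these segment operations can be sketched in plain Python (a minimal illustration, not the torch_scatter implementation; torch_scatter's scatter_mean provides the same operation on tensors):

```python
def unsorted_segment_mean(values, segment_ids, num_segments):
    # Mean of values grouped by segment id -- the semantics of
    # tf.math.unsorted_segment_mean. The other segment ops (sum, max, min)
    # differ only in the reduction applied per group.
    sums = [0.0] * num_segments
    counts = [0] * num_segments
    for v, s in zip(values, segment_ids):
        sums[s] += v
        counts[s] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Cells [1.0, 2.0] belong to segment 0 (e.g. row 0), [3.0, 4.0] to segment 1
print(unsorted_segment_mean([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1], 2))  # → [1.5, 3.5]
```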

Is there any chance my implementation will be added to the Transformers library if it relies on libraries other than Pytorch? I read in the templates README of the Transformers repository that “the package is also designed to be as self-consistent and with a small and reliable set of packages dependencies. In consequence, additional dependencies are usually not allowed when adding a model but can be allowed for the inclusion of a new tokenizer (recent examples of dependencies added for tokenizer specificities include sentencepiece and sacremoses).”



You should copy and paste the code you need in a utils file associated with your model. There is little chance of having your model merged if it introduces a new dependency.

Thank you for your answer. Where can I find utils files of other models in the Transformers repository?

There is `modeling_transfo_xl_utilities`, for instance.


Thank you for your reply. I have another question. I have implemented the model, but now I want to implement TapasTokenizer, which prepares the data for the model. The API would look like this:

from transformers import TapasTokenizer

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio"], 'Number of movies': ["87", "53"]}
table = pd.DataFrame.from_dict(data)
queries = ["What is the name of the first actor?", "How many movies has he played in?"]
answer_coordinates = [(0, 0), (0, 1)]
answer_text = ["Brad Pitt", "87"]

tokenizer = TapasTokenizer.from_pretrained("tapas-base-finetuned-sqa")
inputs = tokenizer(table, queries, answer_coordinates, answer_text)

The “inputs” dictionary should contain the following:

{'input_ids': [[101, 2054, 2003, 1996, 2171, 1997, 1996, 2034, 3364, 1029, 102, 3364, 2193, 1997, 5691, 8226, 15091, 5187, 14720, 4487, 6178, 9488, 6584],
               [101, 2129, 2116, 5691, 2038, 14720, 4487, 6178, 9488, 2209, 1999, 1029, 102, 3364, 2193, 1997, 5691, 8226, 15091, 5187, 14720, 4487, 6178, 9488, 6584]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 0, 0, 0, 0]],
 'label_ids': [[...], [...]],
 'answer': [[nan], ["87"]],
 'aggregation_function_id': [[0], [0]],
 'numeric_values': [...],
 'numeric_values_scale': [...]}

TapasTokenizer uses the same vocabulary as BERT, but it encodes tables and associated queries. The input ids are [CLS] query [SEP] flattened table, and it creates additional token type ids to encode the tabular structure. In case one also wants to compute the loss, it creates answer, aggregation function id, numeric values and numeric values scale lists/tensors. What is the best way to implement this? Which function of BertTokenizer should I overwrite to use TapasTokenizer as shown above? Or should I let TapasTokenizer just inherit from PreTrainedTokenizer and define the encode function?
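The [CLS] query [SEP] flattened-table layout described above could be sketched like this (illustrative only, not the actual Tapas code; `tokenize` is a stand-in for any callable that splits a string into tokens):

```python
def build_tapas_input_tokens(tokenize, question, table_rows):
    # [CLS] question [SEP], followed by the table flattened row by row.
    tokens = ["[CLS]"] + tokenize(question) + ["[SEP]"]
    for row in table_rows:
        for cell in row:
            tokens += tokenize(cell)
    return tokens

tokens = build_tapas_input_tokens(
    str.split, "How many movies?",
    [["Brad Pitt", "87"], ["Leonardo Di Caprio", "53"]])
print(tokens)
```

The real tokenizer works on wordpiece ids rather than strings, but the concatenation order is the same.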

Normally you can use the prepare_for_model to add any new keys. The token_type_ids themselves are defined in the create_token_type_ids_from_sequences function, so you could probably use the BertTokenizer and subclass it while implementing those two.
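The suggested subclassing pattern might look roughly like the following sketch. BertTokenizerStub is a hypothetical stand-in for transformers.BertTokenizer (the real method signatures are richer), and the label_ids logic is a simplified placeholder:

```python
class BertTokenizerStub:
    # Stand-in for transformers.BertTokenizer, reduced to the two hooks
    # mentioned above.
    def prepare_for_model(self, ids, **kwargs):
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}

    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
        return [0] * len(token_ids_0) + ([1] * len(token_ids_1) if token_ids_1 else [])

class TapasTokenizerSketch(BertTokenizerStub):
    def prepare_for_model(self, ids, answer_coordinates=None, **kwargs):
        encoded = super().prepare_for_model(ids, **kwargs)
        # Extra keys needed for the Tapas loss, added on top of the BERT output
        # (here a toy 0/1 mask over token positions, purely for illustration).
        if answer_coordinates is not None:
            answer_positions = set(answer_coordinates)
            encoded["label_ids"] = [1 if i in answer_positions else 0
                                    for i in range(len(ids))]
        return encoded
```

The point is that prepare_for_model is the place to attach model-specific keys, while create_token_type_ids_from_sequences owns the token type ids.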


Hey @sgugger, could you please consider running my notebook?

It showcases my current implementation of Tapas. It is still WIP, but I am interested in whether you would consider adding it to the Transformers library :slight_smile: It requires the torch-scatter library, as PyTorch currently does not include operations on segmented tensors. Would it be possible to add a check for this when initializing TapasForQuestionAnswering?

Looks like you made a lot of progress, great work! I think having torch-scatter as a soft dependency required for this model is OK. Look at what was done in RAG recently: basically, you should implement functions like is_scatter_available and requires_scatter, and call requires_scatter in the init of your models (so it fails with a nice error message if a user doesn’t have the dependency).
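The soft-dependency pattern described above could be sketched as follows (a minimal version: the helper names follow the suggestion, but the actual Transformers utilities may differ):

```python
import importlib.util

def is_scatter_available():
    # True when torch_scatter can be imported in the current environment.
    return importlib.util.find_spec("torch_scatter") is not None

def requires_scatter(obj):
    # Raise a friendly error (e.g. from a model's __init__) when the soft
    # dependency is missing, instead of a bare ModuleNotFoundError later.
    name = getattr(obj, "__name__", obj.__class__.__name__)
    if not is_scatter_available():
        raise ImportError(
            f"{name} requires the torch-scatter library but it was not found "
            "in your environment. See https://github.com/rusty1s/pytorch_scatter."
        )
```

A model's `__init__` would then start with `requires_scatter(self)` so users without the library get an actionable message.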

It looks like you only have to do the final clean-up before opening a PR. Make sure you copy what is done in other modeling files for documentation to have clean docstrings for your models.
One bit of work you have left is to remove imports from the modeling_bert file: for easy customization by researchers, we want all model files to be independent (at the price of duplicated code). We have a script that checks copies from other modeling files stay consistent (and can be automatically updated). Check what was done for an existing model as an example.

That’s great to hear :smiley: However, the main thing which should still be improved is the tokenizer. I have 2 questions about it:

  • Currently, I’ve only implemented _batch_prepare_for_model, but it is not compatible with the different padding/truncation strategies of Transformers. The way it works right now is as follows: there’s a function _get_token_budget that calculates the number of tokens left for the flattened table after tokenizing a question. This is currently set to self.model_max_length - (len(question_tokens) + 2) (+ 2 for the CLS and SEP tokens), as was done in the original implementation. There is a hyperparameter when initializing TapasTokenizer called drop_rows_to_fit which, if set to True, drops rows of the table to fit into the token budget. If it’s set to False and a table is too big, a ValueError indicating 'too many rows' is raised. Should I make TapasTokenizer fully compatible with the padding/truncation strategies of Transformers, or can I leave it like that with the hyperparameter drop_rows_to_fit? Also, padding is currently simply done up to self.model_max_length; it’s not possible to pad up to the longest sequence in a batch. Again, because of the complex way examples are created, I cannot simply overwrite .pad() (maybe I can, but then I definitely need some help)
  • The way data is prepared is a bit different compared to BERT and other models. If you provide more than 1 question related to a table as a batch to the tokenizer, then the prev_label_ids (which is one of the 7 token type ids the authors of Tapas introduce) are set to the label_ids of the previous table + question pair. These indicate which tokens of the current table were an answer to the previous question (which is useful in a conversational setup - SQA is a dataset of conversational questions over a table). So I cannot simply overwrite the create_token_type_ids_from_sequences function, because that function already expects ids. The token type ids are made based on the table and several already created token type ids.
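The budget and row-dropping behaviour described in the first bullet can be sketched as follows (illustrative only; the function names mirror the description above, not the actual code):

```python
def get_token_budget(question_tokens, model_max_length=512):
    # Tokens left for the flattened table after [CLS] question [SEP].
    return model_max_length - (len(question_tokens) + 2)

def num_rows_that_fit(row_token_counts, budget, drop_rows_to_fit=False):
    # Keep whole rows while they fit in the budget; either drop the rest
    # or raise, depending on drop_rows_to_fit.
    total, kept = 0, 0
    for count in row_token_counts:
        if total + count > budget:
            if drop_rows_to_fit:
                break
            raise ValueError("too many rows")
        total += count
        kept += 1
    return kept
```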

TLDR: the tokenizer implementation should be improved a lot and I would definitely need some assistance in doing so.

Also, adding the docstrings and removing the BERT imports at the top of the modeling file is maybe better to do at the very end, for readability/debugging/code reviewing purposes? The file is currently already quite extensive. Can I maybe open up a pull request and assist the developers of HuggingFace? I’ve been working on the model since the end of August, so I can definitely help in explaining how everything works. The original authors from Google also helped me while implementing the model (see this Github thread).

Thank you for your help!

I’m not expert enough in the tokenizers to best advise you for this model. Maybe @thomwolf or @anthony could have more insight?

Documentation is a must before opening the PR, and no it does not hinder readability, it’s actually supposed to explain what the code does :wink:
For the decoupling we can help if you need it, but it will need to be done before the PR is merged anyway.

Haha okay, I understand :slight_smile: in the meantime I’m gonna add a lot of documentation such that people can understand what the code does.

@sgugger all docstrings are added (except for the model_doc, i.e. the .rst file).

I tried replacing the bert imports with original code blocks, but I encountered an error which probably has to do with the fact that my branch is not properly synced with the master branch (it’s a lot of commits behind :sob:). Do I simply have to

git checkout master
git remote add upstream https://github.com/huggingface/transformers.git
git fetch upstream
git rebase upstream/master

? Should this be done from my local master branch as shown above (which I never worked on) or my local modeling_tapas branch? Currently when I type git remote -v I only get

origin (fetch)
origin (push)

I’m not a git pro, but I just want to make sure I don’t mess things up. Or is it maybe better to save the Python files I added (modeling_tapas, tokenization_tapas, etc.), create a new branch, properly sync and then add those files?

Just save the files you added in case something goes wrong and you need to recreate a branch. Then try the rebase with the instructions you listed and, fingers crossed, it should work.


First of all, thank you for taking the time to respond to my questions!

Secondly, I created a new branch tapas_up_to_date that is more up-to-date with upstream/master in which I copied all the files I added and made more improvements.

Currently, everything on the checklist is done except for 2 things:

  • the tokenizer should be improved (help/opinions needed)
  • tests should be written appropriately (also help needed). Of the tests I’ve currently written, 20 pass and 6 fail, but testing out TapasForQuestionAnswering would require Pandas, which obviously cannot be added to Transformers as an extra dependency. I’ve mainly tested out this model in Google Colab (such as this one) by pip installing my branch. Note that pandas is not a dependency of the model itself; it’s a dependency to test out the model.

All the rest is done (docstrings are added, including the model doc which I checked using make html; a special TableQuestionAnsweringOutput is defined; the BERT imports are replaced by original code blocks and the model still works; tapas has been added to the relevant files; the soft dependency on torch-scatter has been added in the __init__ of TapasModel; …)

You can find the branch here:

Could you maybe check out my work (or point me to someone who can) to get some feedback? I don’t really know if my code meets the quality standards of Transformers to make a PR.

Thank you!

I don’t see where you need pandas in your branch? For the tests, make sure that you pass all common tests as this is what will make sure your models work well with the library.

On the tokenizer side, pinging @thomwolf again as he may have some more insight. Or @lysandre maybe?

Hi @nielsr! Thanks for your work on the implementation of Tapas, and thanks for your help on github issues, it helps a lot!

Regarding this current work, I think it would be simpler for everybody if you opened a PR, so that we may iterate with comments directly on the code. Opening PRs even if they’re in an unfinished state is completely okay, and we’ll be more than happy to guide you. We don’t have a quality standard for open PRs, only quality standards for merged PRs :)!

Feel free to ping @sgugger and I (@LysandreJik) on the PR directly so that we may follow!