Creating a tokenizer with both custom tokens and positions

There are a lot of good articles and posts in the forums about custom tokenizers that let you train a different vocabulary/language. I cannot find information, though, on customizing the position part as well. Is there documentation on this? If not, can someone point me in the right direction? My goal is to train a transformer on a dataset where the 2-D position of a glyph in the document is just as important as the glyph itself.

Thanks!

Note: a similar question was asked before by @bengul, but there has been no response since July 2021.

Hi,
I have figured out a way to edit the position embedding. The position embedding in the Hugging Face BERT model is defined in the BertEmbeddings class; other models have a corresponding embeddings class of their own. If your task is aligned with what they already provide, you might be able to get away with changing only the embeddings class. Look here. Alternatively, if you want to supply your preferred position embedding with the input, then you have to change every class that uses the position embedding. I am not an expert by any means, but feel free to ask if you need more help on this.
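For concreteness, here is a minimal sketch of the first route (changing only the embeddings class), assuming you want extra learned x/y layout embeddings added on top of BERT's usual word, 1-D position, and token-type embeddings. Everything below (the class name Bert2DEmbeddings, the x_position_ids/y_position_ids arguments, and the max_x/max_y sizes) is a hypothetical illustration, not code from this thread:

```python
import torch
import torch.nn as nn
from transformers import BertConfig
# import path for transformers v4.x
from transformers.models.bert.modeling_bert import BertEmbeddings


class Bert2DEmbeddings(BertEmbeddings):
    """BertEmbeddings variant with extra learned x/y layout embeddings."""

    def __init__(self, config, max_x=1024, max_y=1024):
        super().__init__(config)
        # two extra lookup tables, one per spatial axis (sizes are assumptions)
        self.x_position_embeddings = nn.Embedding(max_x, config.hidden_size)
        self.y_position_embeddings = nn.Embedding(max_y, config.hidden_size)

    def forward(self, input_ids, x_position_ids, y_position_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        # standard 1-D sequence positions, same as stock BERT
        position_ids = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        embeddings = (
            self.word_embeddings(input_ids)
            + self.position_embeddings(position_ids)
            + self.token_type_embeddings(token_type_ids)
            + self.x_position_embeddings(x_position_ids)  # 2-D layout signal
            + self.y_position_embeddings(y_position_ids)
        )
        # reuse the LayerNorm/dropout modules inherited from BertEmbeddings
        embeddings = self.LayerNorm(embeddings)
        return self.dropout(embeddings)


config = BertConfig()
embeddings = Bert2DEmbeddings(config)

input_ids = torch.randint(0, config.vocab_size, (1, 16))
x_ids = torch.randint(0, 1024, (1, 16))  # quantized x coordinate of each glyph
y_ids = torch.randint(0, 1024, (1, 16))  # quantized y coordinate of each glyph
print(embeddings(input_ids, x_ids, y_ids).shape)  # torch.Size([1, 16, 768])
```

Note that BertModel.forward does not pass x_position_ids/y_position_ids through to its embeddings module, so to use this inside a full model you also have to adapt the classes that call it, which is exactly the second route described above.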

The position embedding in the Hugging Face BERT model is defined in the BertEmbeddings class; other models have a corresponding embeddings class of their own. If your task is aligned with what they already provide, you might be able to get away with changing only the embeddings class.

Thanks, @bengul, that looks like a very promising route. Will give it a try tomorrow!

Hey!
I’m trying to train a BART model with customized positional embeddings, similar to what you have been doing, and I have a few questions that you can perhaps help me with. First, say I want to change BART’s positional embeddings to sinusoidal embeddings, just like you did, @bengul: is that even possible? My intuition is that so many parts of the Transformer would have to be re-learned that it might not be worth doing, or am I wrong here? Second, assuming it actually works, i.e. that it is possible to change the positional embeddings, what kind of computing resources are necessary to make this change in how the model treats positions?

Hope you can help me with some insight into this 🙂 @bengul @cowszero

Thanks!

is that even possible?

Sure, it is possible; see the sketch at the end of this reply for one way the swap could look.

Is it worth it?

That depends on your objective.

what kind of computing resources are necessary to make this change in how the model treats positions?

My understanding is that you have to pretrain the model from scratch. That is definitely time- and resource-consuming (how much depends on the amount of data and the model complexity). I don’t think using the new positional embedding only during fine-tuning would be useful; however, I have not tested that.
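As for the swap itself, here is a minimal, untested sketch of what replacing BART’s learned positional embeddings with sinusoidal ones could look like before any (pre)training: overwrite the learned tables with fixed sinusoidal values and freeze them. The helper name sinusoidal_table is my own, and recent transformers versions keep 2 extra offset rows at the start of BartLearnedPositionalEmbedding’s table, so check the shapes against your installed version:

```python
import math
import torch
from transformers import BartForConditionalGeneration


def sinusoidal_table(num_positions: int, dim: int) -> torch.Tensor:
    """Fixed 'Attention Is All You Need' sinusoidal position encodings."""
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    table = torch.zeros(num_positions, dim)
    table[:, 0::2] = torch.sin(positions * div_term)
    table[:, 1::2] = torch.cos(positions * div_term)
    return table


model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# BART keeps separate learned position tables for the encoder and the decoder
for embed in (model.model.encoder.embed_positions,
              model.model.decoder.embed_positions):
    # size the new table from the existing weight, so any offset rows are covered
    table = sinusoidal_table(embed.weight.size(0), embed.weight.size(1))
    with torch.no_grad():
        embed.weight.copy_(table)
    embed.weight.requires_grad = False  # keep the sinusoidal values fixed
```

The swap is mechanically straightforward; whether the rest of the pretrained weights can adapt to the new position signal during fine-tuning alone is exactly the open question in this thread.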

My idea has been to fine-tune a pre-trained BART model with a length penalty in the positional embeddings of the model. However, my results so far kind of confirm the hypothesis that it is not possible to only fine-tune with new embeddings. I guess I would have to pre-train it from scratch, but I want to avoid that because of the resources such a task needs.

Thank you for your input!