Why are positional embeddings implemented as just simple embeddings?

Hello!

I can’t figure out why the positional embeddings are implemented as just a vanilla Embedding layer in both the PyTorch and TensorFlow implementations. Based on my current understanding, positional embeddings should be implemented as non-trainable sin/cos encodings or axial positional encodings (from Reformer).

Can anyone please enlighten me with this? Thank you so much!


Hi @miguelvictor! Both are valid strategies: IIRC the original Transformer paper used fixed (non-trainable) sinusoidal embeddings, while BERT learns a full vector for each of its 512 expected positions.

Currently, the Transformers library has sinusoidal embeddings in the TransfoXL model; check it out!
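
For reference, here’s a minimal PyTorch sketch of fixed sinusoidal encodings in the style of the original paper (the class name and dimensions are just illustrative, not the library’s actual implementation):

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed (non-trainable) sin/cos positional encodings, as in 'Attention Is All You Need'."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)            # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )                                                        # (d_model / 2,)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)             # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)             # odd dimensions
        self.register_buffer("pe", pe)                           # a buffer, so no gradients

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[: x.size(1)]
```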


Thanks @yjernite! Out of curiosity, what are the pros and cons of the two approaches? When should one choose trigonometric functions vs. applying an nn.Embedding to an ordinal index?


Thank you for sharing this information! Is it always guaranteed that the model will learn the expected positions given enough training data?

It’s not always clear what the “expected positions” are for a specific task, so no strong guarantees :slight_smile: In general, if your model relies on absolute positions within a fixed range, you should be fine with learned positional embeddings.

In theory, the trigonometric functions can generalize beyond the positions seen at training time. They also allow the model to rely on relative rather than absolute positions, and as such their dot products can be computed more efficiently, as shown in the Transformer-XL paper.

On the other hand, the learned index embeddings offer more parameters, which might enable the model to learn faster in some situations.

As for many other things, it really depends on your use case :wink:
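
To make the trade-off concrete, here’s a small sketch (sizes are illustrative) of why the sinusoidal form extrapolates while a learned table is capped at the size it was created with:

```python
import math
import torch
import torch.nn as nn

d_model, max_trained_len = 768, 512
seq_len = 600  # longer than anything seen at training time

# Learned absolute positions: one trainable vector per index, capped at 512 here.
learned_pos = nn.Embedding(max_trained_len, d_model)
# learned_pos(torch.arange(seq_len))  # would fail: positions >= 512 have no row in the table

# Sinusoidal encodings can be computed on the fly for any position
# (concatenating sin/cos here instead of interleaving, a common variant):
position = torch.arange(seq_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.cat([torch.sin(position * div_term), torch.cos(position * div_term)], dim=-1)
print(pe.shape)  # torch.Size([600, 768]); no fixed upper limit on the sequence length
```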


Just curious: did previous implementations of BERT from Hugging Face use any sin/cos in the positional embeddings? It sounds like the vanilla Embedding layer was added later, since it seems to have more pros in place of sin/cos positional embeddings.


In the original code https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/modeling.py#L190 it still uses a learned positional embedding layer.
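
In PyTorch terms, that learned lookup is roughly equivalent to the sketch below (simplified: the real BERT embeddings also add token-type embeddings, LayerNorm, and dropout; names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    """Illustrative sketch: learned absolute position embeddings added to token embeddings."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_position_embeddings=512):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)  # trainable

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device)  # 0, 1, ..., seq_len - 1
        return self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
```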