What is the input to BERT?

I really don't get what the input to BERT is. I have read a lot about BERT, and most of it is very confusing.

I know there are three embedding layers, and I know the intuition behind each of them. But what exactly are a token embedding, a segment embedding, and a positional embedding? What is a learned representation? Is it a representation learned during pre-training, or one learned beforehand?

Given a word, is its token embedding a token ID or an embedding of size 768? Is this embedding randomly initialized?

Are segment embeddings vectors of 0 (for the first sentence) or 1 (for the second sentence)? Is the segment embedding for a sentence a binary value for each token, or a binary vector for each token?


Hi! :slight_smile:

BERT's input is essentially subwords. For example, if I want to feed BERT the sentence "Welcome to HuggingFace Forums!", what actually gets fed in is:
['[CLS]', 'welcome', 'to', 'hugging', '##face', 'forums', '!', '[SEP]'].

Each of these tokens is mapped to an integer:
[101, 6160, 2000, 17662, 12172, 21415, 999, 102].
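
Here's a minimal sketch of that step with the `transformers` tokenizer (assuming the `bert-base-uncased` checkpoint; the printed values are what I'd expect from that checkpoint):

```python
# Minimal sketch, assuming the `transformers` library and the `bert-base-uncased` checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Welcome to HuggingFace Forums!"))
# ['welcome', 'to', 'hugging', '##face', 'forums', '!']  (the [CLS]/[SEP] specials are added below)

print(tokenizer("Welcome to HuggingFace Forums!")["input_ids"])
# [101, 6160, 2000, 17662, 12172, 21415, 999, 102]
```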

These integers are mapped to vectors using BERT's embedding matrix (which is initialized randomly and trained during pre-training); each vector is 768-dimensional.
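
A quick sketch of that lookup, assuming `BertModel` from `transformers` and PyTorch:

```python
# Sketch of the ID -> vector lookup, assuming `transformers` and PyTorch.
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# The token embedding matrix: one 768-dimensional row per vocabulary entry.
print(model.embeddings.word_embeddings.weight.shape)  # torch.Size([30522, 768])

input_ids = torch.tensor([[101, 6160, 2000, 17662, 12172, 21415, 999, 102]])
token_embeddings = model.embeddings.word_embeddings(input_ids)
print(token_embeddings.shape)  # torch.Size([1, 8, 768])
```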

The learned representation is this 768-dimensional vector for each subword in BERT’s vocabulary.

Segment IDs are zeros for the tokens in the 1st sentence and ones for the tokens in the 2nd sentence; like token IDs, each segment ID is then looked up in its own small (2 × 768) embedding table inside the model.
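
Concretely, the tokenizer gives you these segment IDs as `token_type_ids` (sketch below, again assuming `bert-base-uncased`):

```python
# Sketch: segment IDs for a sentence pair, assuming `transformers` and `bert-base-uncased`.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("How are you?", "I am fine.")
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# 0s cover [CLS] + sentence A + the first [SEP]; 1s cover sentence B + the final [SEP]
```

Inside the model those IDs are looked up in `model.embeddings.token_type_embeddings`, so the per-token value is binary but the embedding itself is a 768-dimensional vector.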


Hi! Thank you very much for your time. Your answer was of great help :slight_smile:


That's a fantastic question.

Just to add to that.
When doing, let's say, sequence classification, you typically pass input IDs, an attention mask, and a label to the BERT model.

As you probably know, computers don't know what languages are; they just crunch numbers. Hence each token (that includes the [CLS]/[SEP] tokens) is converted into a numerical representation which BERT then understands.
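
Here's a rough sketch of what that looks like in practice (purely illustrative: `bert-base-uncased`, two classes, and a made-up label):

```python
# Rough sketch of sequence-classification inputs, assuming `transformers` and PyTorch.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["Welcome to HuggingFace Forums!"], padding=True, return_tensors="pt")
labels = torch.tensor([1])  # hypothetical label for this single example

outputs = model(
    input_ids=batch["input_ids"],           # integer token IDs (incl. [CLS]/[SEP])
    attention_mask=batch["attention_mask"],  # 1 for real tokens, 0 for padding
    labels=labels,                           # one class label per sequence
)
print(outputs.loss, outputs.logits.shape)  # scalar loss, (batch_size, num_labels)
```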

Hope it helps!

Hi! I'm actually not sure this is so clear to me, so I want to make sure. Let's just focus on MLM. Say I have a sentence that is tokenized and is truly [a, b, c, d, e]. What I believe happens in BERT is that some tokens are masked with [MASK], so for example [a, b, c, d, e] → [a, [MASK], c, [MASK], e]. Then the input is x = [a, [MASK], c, [MASK], e] while the target is the original y = [a, b, c, d, e], and the goal is to predict y from just seeing x. The objective then involves a softmax for each of the two [MASK] positions. You have many more pairs (x, y), but this is just one training pair …

… Also, you might not [MASK] a token; you might corrupt it, in the sense that the input becomes x = [a, b', c, d', e] where b' is either b or another (random) token. Still, the goal is to predict y from this corrupted x, so there are still two terms in the loss for this sentence.
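
To make my mental model concrete, here's a toy sketch of the masking recipe I'm assuming (the usual 80% [MASK] / 10% random / 10% unchanged split; `mask_tokens` and the tiny vocabulary are made up for illustration, not the real implementation):

```python
# Toy sketch of BERT-style masking for one tokenized sentence (NOT the real implementation).
# Recipe assumed: pick ~15% of positions; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged.
import random

VOCAB = ["a", "b", "c", "d", "e", "f", "g"]  # made-up toy vocabulary
MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    x = list(tokens)                # corrupted input fed to the model
    targets = [None] * len(tokens)  # loss is computed only where targets[i] is not None
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok        # the model must predict the ORIGINAL token here
            r = random.random()
            if r < 0.8:
                x[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                x[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # else (10%): keep the original token unchanged
    return x, targets

y = ["a", "b", "c", "d", "e"]
x, targets = mask_tokens(y)
print(x, targets)  # e.g. ['a', '[MASK]', 'c', '[MASK]', 'e'] [None, 'b', None, 'd', None]
```

In `transformers` I believe `DataCollatorForLanguageModeling` does this for real batches, marking the non-selected positions with label -100 so they are ignored in the loss.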

Is this the right way to view it? Please let me know if something is missing. Thank you!
