What is the input to BERT?

I really don't get what the input to BERT is. I have read a lot about BERT, and most of it is very confusing.

I know there are three embedding layers, and I know the intuition behind each of them. But what exactly are a token embedding, a segment embedding, and a positional embedding? What is a learned representation? Is it a representation learned during pre-training, or one learned beforehand?

Given a word, is its token embedding a token ID or an embedding of size 768? Is this embedding randomly initialized?

Are segment embeddings vectors of 0 (for the first sentence) or 1 (for the second sentence)? Is the segment embedding for a sentence a binary value for each token, or a binary vector for each token?


Hi! :slight_smile:

BERT's input is essentially subwords. For example, if I want to feed BERT the sentence "Welcome to HuggingFace Forums!", what actually gets fed in is:
['[CLS]', 'welcome', 'to', 'hugging', '##face', 'forums', '!', '[SEP]'].

Each of these tokens is mapped to an integer:
[101, 6160, 2000, 17662, 12172, 21415, 999, 102].
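
Here's a minimal sketch of that step with the `transformers` tokenizer (assuming the `bert-base-uncased` checkpoint; the printed values are what I'd expect from that checkpoint):

```python
# Minimal sketch, assuming the `transformers` library and the `bert-base-uncased` checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Welcome to HuggingFace Forums!"))
# ['welcome', 'to', 'hugging', '##face', 'forums', '!']  (the [CLS]/[SEP] specials are added below)

print(tokenizer("Welcome to HuggingFace Forums!")["input_ids"])
# [101, 6160, 2000, 17662, 12172, 21415, 999, 102]
```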

These integers are mapped to vectors using BERT's embedding matrix (which is initialized randomly and trained during pre-training); each vector is 768-dimensional.
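
A quick sketch of that lookup, assuming `BertModel` from `transformers` and PyTorch:

```python
# Sketch of the ID -> vector lookup, assuming `transformers` and PyTorch.
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# The token embedding matrix: one 768-dimensional row per vocabulary entry.
print(model.embeddings.word_embeddings.weight.shape)  # torch.Size([30522, 768])

input_ids = torch.tensor([[101, 6160, 2000, 17662, 12172, 21415, 999, 102]])
token_embeddings = model.embeddings.word_embeddings(input_ids)
print(token_embeddings.shape)  # torch.Size([1, 8, 768])
```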

The learned representation is this 768-dimensional vector for each subword in BERT’s vocabulary.

Segment IDs are zeros for the tokens in the 1st sentence and ones for the tokens in the 2nd sentence; like token IDs, each segment ID is then looked up in its own small (2 × 768) embedding table inside the model.
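
Concretely, the tokenizer gives you these segment IDs as `token_type_ids` (sketch below, again assuming `bert-base-uncased`):

```python
# Sketch: segment IDs for a sentence pair, assuming `transformers` and `bert-base-uncased`.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("How are you?", "I am fine.")
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# 0s cover [CLS] + sentence A + the first [SEP]; 1s cover sentence B + the final [SEP]
```

Inside the model those IDs are looked up in `model.embeddings.token_type_embeddings`, so the per-token value is binary but the embedding itself is a 768-dimensional vector.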


Hi! Thank you very much for your time. Your answer was of great help :slight_smile:


That's a fantastic question.

Just to add to that.
When doing, let's say, sequence classification, you typically pass input IDs, an attention mask, and a label to the BERT model.

As you probably know, computers don't know what languages are; they just crunch numbers. Hence each token (that includes the [CLS]/[SEP] tokens) is converted into a numerical representation which BERT then understands.
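
Here's a rough sketch of what that looks like in practice (purely illustrative: `bert-base-uncased`, two classes, and a made-up label):

```python
# Rough sketch of sequence-classification inputs, assuming `transformers` and PyTorch.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["Welcome to HuggingFace Forums!"], padding=True, return_tensors="pt")
labels = torch.tensor([1])  # hypothetical label for this single example

outputs = model(
    input_ids=batch["input_ids"],           # integer token IDs (incl. [CLS]/[SEP])
    attention_mask=batch["attention_mask"],  # 1 for real tokens, 0 for padding
    labels=labels,                           # one class label per sequence
)
print(outputs.loss, outputs.logits.shape)  # scalar loss, (batch_size, num_labels)
```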

Hope it helps!

Hi! I'm actually not sure this is so clear to me, so I want to make sure. Let's just focus on MLM. Say I have a sentence that is tokenized and is truly [a, b, c, d, e]. What I believe happens in BERT is that some tokens are masked with [MASK], so for example [a, b, c, d, e] → [a, [MASK], c, [MASK], e]. Then the input is x = [a, [MASK], c, [MASK], e] while the target is the original y = [a, b, c, d, e], and the goal is to predict y from just seeing x. The objective then involves a softmax for each of the two [MASK] positions. You have many more pairs (x, y), but this is just one training pair …

… Also, you might not [MASK] a token; you might corrupt it, in the sense that the input becomes x = [a, b', c, d', e] where b' is either b or another (random) token. Still, the goal is to predict y from this corrupted x, so there are still two terms in the loss for this sentence.
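
To make my mental model concrete, here's a toy sketch of the masking recipe I'm assuming (the usual 80% [MASK] / 10% random / 10% unchanged split; `mask_tokens` and the tiny vocabulary are made up for illustration, not the real implementation):

```python
# Toy sketch of BERT-style masking for one tokenized sentence (NOT the real implementation).
# Recipe assumed: pick ~15% of positions; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged.
import random

VOCAB = ["a", "b", "c", "d", "e", "f", "g"]  # made-up toy vocabulary
MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    x = list(tokens)                # corrupted input fed to the model
    targets = [None] * len(tokens)  # loss is computed only where targets[i] is not None
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok        # the model must predict the ORIGINAL token here
            r = random.random()
            if r < 0.8:
                x[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                x[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # else (10%): keep the original token unchanged
    return x, targets

y = ["a", "b", "c", "d", "e"]
x, targets = mask_tokens(y)
print(x, targets)  # e.g. ['a', '[MASK]', 'c', '[MASK]', 'e'] [None, 'b', None, 'd', None]
```

In `transformers` I believe `DataCollatorForLanguageModeling` does this for real batches, marking the non-selected positions with label -100 so they are ignored in the loss.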

Is this the right way to view it? Please let me know if something is missing. Thank you!
