What's the input of BERT?

I really don’t get what the input of BERT is. I have read a lot about BERT and most of it is very confusing.

I know there are three embedding layers, and I know the intuition behind each of them. But what exactly is a token embedding, a segment embedding, and a positional embedding? What is a learned representation? Is it a representation learned during training, or one learned before pre-training?

Given a word, is a token embedding a token ID or an embedding of size 768? Is this embedding randomly initialized?

Are segment embeddings vectors of 0s (for the first sentence) or 1s (for the second sentence)? Is a segment embedding for a sentence a binary value for each token in the sentence, or a binary vector for each of them?


Hi! :slight_smile:

BERT’s input is essentially subwords. For example, if I want to feed BERT the sentence “Welcome to HuggingFace Forums!”, what actually gets fed in is:
['[CLS]', 'welcome', 'to', 'hugging', '##face', 'forums', '!', '[SEP]'].

Each of these tokens is mapped to an integer:
[101, 6160, 2000, 17662, 12172, 21415, 999, 102].
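To make the subword split concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting. It is not the real tokenizer: `VOCAB` is a toy vocabulary holding only the tokens and IDs quoted above (plus an assumed `[UNK]` entry), and `wordpiece` and `tokenize` are hypothetical helper names used only for illustration.

```python
import re

# Toy vocabulary: only the entries quoted in this thread, plus an assumed [UNK].
# The real BERT vocabulary has roughly 30,000 entries.
VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "!": 999,
    "to": 2000, "welcome": 6160, "##face": 12172,
    "hugging": 17662, "forums": 21415,
}

def wordpiece(word, vocab):
    """Split one word into subwords, always taking the longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched at this position
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text, vocab=VOCAB):
    # Lowercase, then separate words from punctuation.
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

tokens = tokenize("Welcome to HuggingFace Forums!")
ids = [VOCAB[t] for t in tokens]
print(tokens)
print(ids)
```

With this toy vocabulary the sketch reproduces the token list and the integer list shown above. In practice you would load a pretrained tokenizer rather than write your own.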

These integers are mapped to vectors using BERT’s embedding matrix, which is initialized randomly and trained during pre-training. Each vector is 768-dimensional (in BERT-base).

The learned representation is this 768-dimensional vector for each subword in BERT’s vocabulary.

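Segment IDs are 0 for the tokens in the 1st sentence and 1 for the tokens in the 2nd sentence. Each ID selects one of two learned segment embedding vectors, which are the same size as the token embeddings.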
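As a rough sketch of how the pieces fit together: each token ID, segment ID, and position index looks up a 768-dimensional row in its own table, and the three rows are added element-wise. The random values below are stand-ins for trained weights, the helper names (`random_vector`, `embed`) are made up for this example, and the layer normalization and dropout that BERT applies after the sum are left out.

```python
import random

HIDDEN = 768  # hidden size of BERT-base
rng = random.Random(0)

def random_vector():
    # Stand-in for a learned row; a real model would load trained weights.
    return [rng.gauss(0.0, 0.02) for _ in range(HIDDEN)]

input_ids = [101, 6160, 2000, 17662, 12172, 21415, 999, 102]
token_type_ids = [0] * len(input_ids)  # single sentence: every token is segment 0

# Three lookup tables. Only the rows needed for this example are created.
token_table = {tid: random_vector() for tid in set(input_ids)}
segment_table = [random_vector(), random_vector()]  # one row per segment ID
position_table = [random_vector() for _ in range(len(input_ids))]

def embed(input_ids, token_type_ids):
    """Return one summed 768-dimensional vector per input token."""
    output = []
    for pos, (tid, seg) in enumerate(zip(input_ids, token_type_ids)):
        tok_vec = token_table[tid]
        seg_vec = segment_table[seg]
        pos_vec = position_table[pos]
        output.append([t + s + p for t, s, p in zip(tok_vec, seg_vec, pos_vec)])
    return output

embeddings = embed(input_ids, token_type_ids)
print(len(embeddings), len(embeddings[0]))  # 8 768
```

So the segment ID itself is a single 0 or 1 per token, while the segment embedding it points to is a full vector, just like the token embedding.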


Hi! Thank you very much for your time. Your answer was of great help :slight_smile:


That’s a fantastic question.

Just to add to that:
When doing, let’s say, sequence classification, you typically pass the input IDs, an attention mask, and a label to the BERT model.
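Here is a small sketch of what those inputs can look like for a batch of two sequences. Shorter sequences are padded to a common length, and the attention mask marks real tokens with 1 and padding with 0. The pad ID of 0, the second (shorter) sequence, the labels, and the `pad_batch` helper are all assumptions made for illustration.

```python
PAD_ID = 0  # assumed ID of the [PAD] token

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

batch = [
    [101, 6160, 2000, 17662, 12172, 21415, 999, 102],  # the example sentence from this thread
    [101, 6160, 999, 102],                             # hypothetical shorter input
]
labels = [1, 0]  # hypothetical class labels, one per sequence

input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])       # [101, 6160, 999, 102, 0, 0, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

In a real pipeline these lists would be converted to tensors before being passed to the model.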

As you probably know, computers don’t understand language; they can only crunch numbers. So each token (including the [CLS] and [SEP] tokens) is converted into a numerical representation that BERT can then work with.

Hope it helps!