That's a fantastic question.
Just to add to that.
When doing, let's say, sequence classification, you typically pass `input_ids`, an `attention_mask`, and a label to the BERT model.
As you probably know, computers don't understand language; they can only crunch numbers. So the tokenizer takes each token (including the special [CLS]/[SEP] tokens) and converts it into a numerical representation (its input ID), which is what BERT actually understands.
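Here's a minimal sketch of what that looks like with the Hugging Face `transformers` library, assuming `bert-base-uncased` and a made-up binary sentiment label just for illustration:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# The tokenizer adds the [CLS]/[SEP] special tokens and maps every
# token to its numerical ID from BERT's vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

encoding = tokenizer("I loved this movie!", return_tensors="pt")
print(encoding["input_ids"])       # numerical representation of the tokens
print(encoding["attention_mask"])  # 1 for real tokens, 0 for padding

# Pass input_ids, attention_mask, and a label to the model; when labels
# are supplied, the output also includes the classification loss.
labels = torch.tensor([1])  # hypothetical label: 1 = positive sentiment
outputs = model(**encoding, labels=labels)
print(outputs.loss, outputs.logits)
```

If you print `input_ids` you'll see the IDs for [CLS] (101) and [SEP] (102) wrapping the sentence's token IDs, which is exactly the numerical form the model consumes.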
Hope it helps!