Here (see the attached picture), and in the original paper, it is stated that the last 100 byte tokens are "reused" as sentinel tokens.
However, upon inspecting the embeddings and the tokenizer, I see that there are in fact 384 tokens, around 126 of which are sentinels. Perhaps I misunderstood something, but why the discrepancy?
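For concreteness, here is my back-of-envelope check of the numbers I'm seeing. This is purely my own guess (the 256 + 3 token count and the round-up-to-128 padding are assumptions on my part, not something stated in the paper):

```python
# Guess: 256 byte values + 3 special tokens (pad/eos/unk) = 259 "real"
# tokens, and the embedding matrix might be padded up to a multiple of
# 128 for hardware efficiency -- which would land exactly on 384.
real_tokens = 256 + 3                   # bytes + special tokens (my assumption)
padded = -(-real_tokens // 128) * 128   # ceil-round 259 up to a multiple of 128
leftover = padded - real_tokens         # slots not accounted for by byte tokens

print(padded)    # 384
print(leftover)  # 125
```

If that guess is right, the 384 figure would just be padding, but then I'd expect roughly 125 spare slots rather than exactly 100 sentinels, which is the part that confuses me.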