Hi, I have some questions about using pretrained BERT.
Can I pack a chunk of words into one input token? For example, split “hi my name is Linda and today i will~” into “hi my name is Linda” and “and today i will”, turn each split into a single embedding vector (e.g. by averaging its word2vec vectors), and treat each split vector as one input token. Is it okay to feed such inputs to the existing pretrained models?
Actually, I’m forced to use phrase-wise tokens in my task, so models built for long sequences are not an option.
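To make the idea concrete, here is a minimal sketch of what I mean by collapsing each phrase into one averaged vector. The word vectors here are random stand-ins (in practice they would come from a trained word2vec model), and the 768 dimension is just BERT-base's hidden size:

```python
import numpy as np

# Hypothetical word vectors: random stand-ins for real word2vec embeddings.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=768)
         for w in "hi my name is Linda and today i will".split()}

def phrase_embedding(phrase):
    """Average the word vectors of a phrase into one 768-d vector."""
    vecs = [vocab[w] for w in phrase.split()]
    return np.mean(vecs, axis=0)

phrases = ["hi my name is Linda", "and today i will"]
# One "input token" embedding per phrase, shape (2, 768)
inputs_embeds = np.stack([phrase_embedding(p) for p in phrases])
print(inputs_embeds.shape)  # (2, 768)
```

With Hugging Face `transformers`, such vectors could be passed to `BertModel` via the `inputs_embeds=` argument (bypassing `input_ids`), but the pretrained model was never trained on averaged phrase vectors, which is exactly what I'm unsure about.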