Tokenizers: How to get representation for a single word from subwords

adamits · July 20, 2021, 9:59pm

Hi!

I am using a RoBERTa model to encode words and then fine-tune it on a downstream task in which I need to find relations between words, not subwords. I would like to represent each word either as the mean of the embeddings of its subwords or as the embedding of its first subword alone.

Given an entire sentence as input, how can I efficiently determine, from the output of RobertaModel, which subwords make up each word, so that I can then combine them in some way? Thanks!
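
For illustration, here is a minimal sketch of the two pooling options described above (mean over subwords, or first subword only). It assumes a word index is available for every subword token, with None for special tokens; fast tokenizers in transformers expose this through `word_ids()` on the encoding. The helper name `pool_subwords` and its parameters are hypothetical, and it works on plain Python lists so that it runs without a model.

```python
def pool_subwords(vectors, word_ids, mode="mean"):
    """Reduce subword vectors to one vector per word.

    vectors  : one vector (sequence of floats) per subword token
    word_ids : word index for each token, None for special tokens
    mode     : "mean" averages a word's subwords, "first" keeps the first one
    """
    if mode not in ("mean", "first"):
        raise ValueError("mode must be 'mean' or 'first'")

    # Collect the vectors that belong to each word, skipping special tokens.
    groups = {}
    for vec, wid in zip(vectors, word_ids):
        if wid is None:
            continue
        groups.setdefault(wid, []).append(vec)

    pooled = []
    for wid in sorted(groups):
        vecs = groups[wid]
        if mode == "first":
            pooled.append(list(vecs[0]))
        else:
            # Average dimension by dimension across the word's subwords.
            pooled.append([sum(col) / len(vecs) for col in zip(*vecs)])
    return pooled


# Toy input: <s>, two subwords of word 0, one subword of word 1, </s>
toy_vectors = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [2.0, 2.0], [9.0, 9.0]]
toy_word_ids = [None, 0, 0, 1, None]
mean_words = pool_subwords(toy_vectors, toy_word_ids, mode="mean")
first_words = pool_subwords(toy_vectors, toy_word_ids, mode="first")

# Assumed usage with transformers (needs a fast tokenizer; not run here):
#   enc = tokenizer(sentence, return_tensors="pt")
#   hidden = model(**enc).last_hidden_state[0]
#   words = pool_subwords(hidden.tolist(), enc.word_ids(0))
```

Converting the hidden states to lists is only for clarity; with real model outputs the same grouping could be done on tensors directly, which would also keep gradients for fine-tuning.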
