Likelihood that an input sequence came from the training set

david-waterworth · February 17, 2021, 10:06am

I’m wondering if there’s a way of using a transformer to generate some sort of metric which scores an input sequence based on how similar it is to the training data. My motivation is I’ve created my own tokeniser and trained a RoBERTa model using a moderately large corpus of IoT device descriptions. The descriptions contain lots of abbreviations, unusual ways of delimiting the text etc.

When I pre-train and then fine-tune a classifier, the performance is good on some datasets and poor on others. I assume the variation is because some datasets aren’t similar enough to the training data.

So ideally I’d like to compute P(x1, …, xn), where x1, …, xn is the input sequence, i.e. if the sequence is similar to data seen in training, P(x1, …, xn) should be higher than if it is not.

Given that the encoder produces contextual embeddings rather than probabilities, I’m not sure whether this is possible, though.
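
To make the question concrete, here is a rough sketch of one possible scoring scheme, usually called a pseudo-log-likelihood: mask each position in turn, ask the masked-LM head for the log-probability of the original token at that position, and sum the results. It is not a true P(x1, …, xn), only a proxy, and I have not confirmed it suits this use case. The `masked_log_probs` callable and the toy model below are hypothetical stand-ins; with a real RoBERTa checkpoint that still has its masked-LM head (e.g. `RobertaForMaskedLM`), that callable would wrap a forward pass with the chosen position replaced by the mask token.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not a library API): given the token
# ids and the position to mask, return log-probabilities over the vocabulary
# for that masked position.
MaskedLogProbs = Callable[[Sequence[int], int], Sequence[float]]


def pseudo_log_likelihood(token_ids: Sequence[int],
                          masked_log_probs: MaskedLogProbs) -> float:
    """Sum of log P(x_i | all other tokens), masking one position at a time."""
    total = 0.0
    for position, token in enumerate(token_ids):
        log_probs = masked_log_probs(token_ids, position)
        total += log_probs[token]
    return total


def mean_pseudo_log_likelihood(token_ids: Sequence[int],
                               masked_log_probs: MaskedLogProbs) -> float:
    """Length-normalised score, so short and long sequences are comparable."""
    if len(token_ids) == 0:
        raise ValueError("token_ids must not be empty")
    return pseudo_log_likelihood(token_ids, masked_log_probs) / len(token_ids)


def make_toy_model(counts: Sequence[int]) -> MaskedLogProbs:
    """Toy stand-in for a masked LM: ignores context and returns unigram
    log-probabilities estimated from token counts."""
    total = sum(counts)
    log_probs = [math.log(c / total) for c in counts]

    def masked_log_probs(token_ids: Sequence[int], position: int) -> Sequence[float]:
        return log_probs

    return masked_log_probs


# Token 0 is common in the toy "training data"; tokens 1 and 2 are rare.
toy_model = make_toy_model([8, 1, 1])
score_in_domain = mean_pseudo_log_likelihood([0, 0, 0, 0], toy_model)
score_out_of_domain = mean_pseudo_log_likelihood([1, 2, 1, 2], toy_model)
print(score_in_domain > score_out_of_domain)
```

A sequence made of tokens the model expects gets a higher (less negative) mean score than one made of rare tokens, which is the kind of ordering I am after. With a real model the cost is one forward pass per token, so this would be slow on long inputs unless the masked copies are batched.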
