Probability of a word within a given context / Reasonability of a sequence of words

I am looking for an NLP model that tells me how probable/reasonable a given word is within the context of some other words or word sequences.

For example:
Consider the sentence “I build a house.”, which is a reasonable sentence.

Now, what I want to know is P(“build” | “I”, “a house”) or, to put it differently, model.score(“I build a house”).

In other words, I want to know how reasonable this sentence is, and I would expect a score or probability significantly different from zero.

Please note: I do not want to predict a word; rather, I want to know whether an already existing sequence of words is reasonable, i.e., makes sense.

A negative example, i.e., a sentence/word sequence that does not make sense, would be “I build a soup”.

In this case, model.score(“I build a soup”) or P(“build” | “I”, “a soup”) should be close to zero, or at least extremely low.

Do you know of any model that can accomplish this task?

Because you want probabilities for words in the middle of a sequence, a bidirectional encoder model would suit you best. I recommend bert-base-uncased. One of the self-supervised tasks it was trained with is token masking, whereby it attempts to predict masked-out tokens in a given sentence. Details are available in its paper.
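As a minimal sketch of that masked-token lookup, assuming the Hugging Face transformers library (names like mask_idx and build_id are just illustrative):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Mask the word whose probability we want: P("build" | "I [MASK] a house.")
inputs = tokenizer("I [MASK] a house.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of [MASK] and turn its logits into a distribution over the vocabulary
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_idx].softmax(dim=-1)

# Look up the probability assigned to the original word
# (assumes "build" is a single token in the BERT vocabulary)
build_id = tokenizer.convert_tokens_to_ids("build")
print(probs[0, build_id].item())
```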

A thorough how-to is here, but the steps you’d want to take are as follows (a code sketch follows the list):

  1. Load the tokenizer and model with BertTokenizer, BertForMaskedLM, etc.
  2. Take the sentence for which you want P(x | S), where S = {X_1, …, X_n} is the surrounding context, and replace x with “[MASK]”
    e.g., “I build a house” → “I [MASK] a house”
  3. Encode the masked string and pass it to the model to get the output logits
  4. Convert the logits at the masked position into probabilities and read off the probability (or loss) of the x you had in mind
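To get something like the model.score(…) you asked for, a common extension of these steps is to mask each token in turn and sum the log-probabilities (the pseudo-log-likelihood of Salazar et al., “Masked Language Model Scoring”, 2020). A rough sketch, again assuming the transformers library; pseudo_log_likelihood is just an illustrative name:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P(token | rest of sentence), masking one token at a time."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Positions 0 and -1 hold the [CLS] and [SEP] special tokens; skip them
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        # Log-probability of the original token at the masked position
        log_probs = logits[0, i].log_softmax(dim=-1)
        total += log_probs[input_ids[i]].item()
    return total

# The reasonable sentence should score higher (less negative) than the nonsensical one
print(pseudo_log_likelihood("I build a house."))
print(pseudo_log_likelihood("I build a soup."))
```

Note that this is not a true sentence probability (BERT is not a left-to-right language model), but it does rank sentences by plausibility, which is what you’re after.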

Hope that helps.