We have built a causation-focused question answering capability on top of ALBERT xxlarge v2 fine-tuned on SQuAD 2.0, using the Transformers run_squad script. It performs very well on most corpus files when asked questions such as
"What causes X?"
"What does X cause?"
We have noticed on a few occasions that when a question is asked of a very short paragraph (really, a snippet), we still get an answer with a high score.
For example, given the question
What does increasing demand cause?
We get the following answer (post-processed):
"cause": "Increasing demand", "effect": "Changes in consumption patterns", "score": 0.9729956935388504, "context": "Impacts of increasing demand:",
The key here is that the “paragraph” is simply “Impacts of increasing demand:”
We could (and should) be filtering out these short phrases ourselves, but we fully expected the question answering model to do that for us.
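As a stopgap, such snippets can be screened out before they ever reach the model. Here is a minimal sketch of that pre-filter; the word-count threshold and the heading heuristic (snippets ending in a colon) are assumptions to tune for your corpus, not part of any Transformers API:

```python
# Hypothetical minimum word count for a context to be considered answerable.
MIN_WORDS = 8

def is_answerable_context(paragraph: str, min_words: int = MIN_WORDS) -> bool:
    """Reject snippets (e.g. headings like 'Impacts of increasing demand:')
    that are too short to actually contain an answer."""
    words = paragraph.split()
    # Headings often end with a colon and carry very few words.
    if paragraph.rstrip().endswith(":") and len(words) <= min_words:
        return False
    return len(words) >= min_words

paragraphs = [
    "Impacts of increasing demand:",
    "Increasing demand causes changes in consumption patterns across sectors.",
]
usable = [p for p in paragraphs if is_answerable_context(p)]
# The heading-only snippet is dropped; the full sentence is kept.
```

Running the QA model only over `usable` would have suppressed the example above, at the cost of possibly discarding a few legitimately short contexts.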
More surprisingly, among the ones we see (since we’re looking at the top k), the answers seem high quality and sensible.
So the question is, where are these coming from? Is there something about the language model’s exposure to a huge pretraining corpus that leads it to fill in the blank without justification in the text (like a trick question on a reading comprehension test), yet still give a sensible answer?
We’re flummoxed, and we’d like to know whether, and how, we might be able to control this behavior.
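One knob worth exploring: SQuAD 2.0 models also score a “no answer” prediction, and run_squad exposes a `--null_score_diff_threshold` flag that decides when the null beats the best span (the question-answering pipeline similarly accepts `handle_impossible_answer=True`). A sketch of the decision rule, assuming you can recover the best non-null span score and the null score from the model’s output (the threshold value is illustrative):

```python
def accept_answer(best_non_null_score: float, null_score: float,
                  null_score_diff_threshold: float = 0.0) -> bool:
    """Accept a span only when it beats the 'no answer' score by enough;
    otherwise treat the question as unanswerable for this context.
    Mirrors the SQuAD 2.0 rule: predict null when
    null_score - best_non_null_score > threshold."""
    return (null_score - best_non_null_score) <= null_score_diff_threshold

# Span clearly beats the null prediction: keep the answer.
keep = accept_answer(best_non_null_score=5.0, null_score=2.0)
# Null prediction dominates: report "no answer" instead.
drop = accept_answer(best_non_null_score=1.0, null_score=4.0)
```

Raising the threshold makes the system more willing to say “no answer” on thin contexts like the heading above; whether the null score is actually low in these cases is exactly what we would want to inspect.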
BTW - in this case there is somewhat of a “causal signal” - the word “Impacts.” In a longer paragraph I would expect “the impacts of X are Y and Z” to yield an answer, e.g. X causes Y, or X causes Y and Z, or some such.