Purely extractive Language Models?


Given a plain-text email thread, I am trying to extract the body of the most recent message.

I used to do that with rules. Now I am testing Large Language Models (LLMs) to see if they provide a less ad hoc solution.

Mistral-7B-Instruct, for instance, seems to understand the task and provides acceptable outputs most of the time.

However, in some cases, it explains the email instead of just copying the relevant chunk verbatim.

I have tried dozens of prompts, for instance:

instruction = 'Given the email thread below the dotted line, extract verbatim the body of the most recent (top) message. Remove all headers, footers and disclaimers. In your response, do not add any text that was not present in the original message'
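For context, a sketch of how such an instruction and a thread might be assembled into a single prompt, assuming the Mistral instruct template (`[INST] ... [/INST]`); the sample thread and variable names are purely illustrative:

```python
# Illustrative prompt assembly; the dotted line separates the
# instruction from the thread, as the instruction text implies.
instruction = (
    "Given the email thread below the dotted line, extract verbatim the body "
    "of the most recent (top) message. Remove all headers, footers and "
    "disclaimers. In your response, do not add any text that was not present "
    "in the original message"
)

# Toy thread: the most recent message sits on top, quoted history below.
email_thread = """Hi Bob, thanks, see you Monday.

> On Jan 3, Bob wrote:
> Can we meet next week?"""

prompt = f"[INST] {instruction}\n{'-' * 40}\n{email_thread} [/INST]"
print(prompt)
```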

And I tried to prevent hallucinations with deterministic decoding settings:

    generation_output = model.generate(
        **model_inputs,       # tokenized prompt
        do_sample=False,      # greedy decoding, no sampling
        max_new_tokens=512,
    )

However, in a few cases, the model still adds explanations and/or hallucinates a bit.

My questions are the following:

  1. Are you aware of any models that could do a better job without fine-tuning? For instance, purely extractive models (as opposed to generative ones).

  2. If generative models are the way to go, is there a way to force the model to just copy/paste?



The following approach works: number the lines of the thread and ask the model to return the line numbers of the email body rather than the text itself.

At least, there is no hallucination: the output is copied verbatim from the original text.

The model provides numbers that are fully consistent with the email body it identifies (which is not 100% correct, but that is a different question).
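A minimal sketch of this line-number scheme, assuming the model is asked to answer with a start-end range (the reply format, helper names, and sample data here are illustrative): since the final text is sliced out of the original thread, nothing outside it can appear in the output.

```python
import re

def number_lines(text: str) -> str:
    """Prefix each line of the thread with its 1-based line number."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(text.splitlines(), 1)
    )

def extract_span(original: str, model_reply: str) -> str:
    """Parse a 'start-end' range from the model's reply and copy those
    lines verbatim from the original thread, so nothing can be
    hallucinated into the extracted body."""
    match = re.search(r"(\d+)\s*-\s*(\d+)", model_reply)
    if not match:
        raise ValueError("no line range found in model reply")
    start, end = int(match.group(1)), int(match.group(2))
    lines = original.splitlines()
    return "\n".join(lines[start - 1:end])

thread = "Hi Bob,\nSee you Monday.\n> On Jan 3, Bob wrote:\n> Can we meet?"
numbered = number_lines(thread)   # this numbered version goes into the prompt
reply = "Lines 1-2"               # stand-in for the model's answer
body = extract_span(thread, reply)
print(body)                       # -> "Hi Bob,\nSee you Monday."
```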

Interestingly, not all Mistral-based models are able to provide line numbers. Mistral Instruct does, but Mistral OpenInstruct seems to always output the text itself, entirely disregarding the prompt.