Purely extractive Language Models?

Hello,

Given a plain-text email thread, I am trying to extract the body of the most recent email.

I used to do that with rules. Now I am testing Large Language Models (LLMs) to see if they provide a less ad hoc solution.

Mistral-7B-Instruct, for instance, seems to understand the task and provides acceptable outputs most of the time.

However, in some cases, it explains the email rather than just copying and pasting the relevant chunk.

I have tried dozens of prompts, for instance:

instruction = 'Given the email thread below the dotted line, extract verbatim the body of the most recent (top) message. Remove all headers, footers and disclaimers. In your response, do not add any text that was not present in the original message'
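
For context, this is roughly how I build the model input (a simplified sketch: the checkpoint name and the dotted separator are just examples, and email_thread stands for the raw thread text):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # instruction first, then a dotted line, then the raw email thread
    prompt = instruction + "\n" + "." * 40 + "\n" + email_thread
    messages = [{"role": "user", "content": prompt}]
    model_inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)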

I also tried to prevent hallucinations by setting the following:

    generation_output = model.generate(
        model_inputs,
        do_sample=True,
        temperature=0.0000001,   # near-zero temperature: the distribution becomes almost deterministic
        top_p=0.0000001,         # nucleus truncated to the single most likely token
        top_k=1,                 # only one candidate token is ever kept
        max_new_tokens=words     # token budget ('words' is defined elsewhere)
        )
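
If I read the generate() documentation correctly, this is effectively greedy decoding already (top_k=1 only ever keeps the single most likely token), so the simpler, fully deterministic equivalent should be:

    generation_output = model.generate(
        model_inputs,
        do_sample=False,        # plain greedy search: always pick the most likely next token
        max_new_tokens=words    # same budget as above
        )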

However, in a few cases, the model still adds explanations and/or hallucinates a bit.

My questions are the following:

  1. Are you aware of any models that could do a better job without fine-tuning? For instance, purely extractive models (as opposed to generative ones).

  2. If generative models are the way to go, is there a way to force the model to just copy/paste?

Best,

Ed

Update: the following approach works; at least, there is no hallucination.

If I add line numbers to the thread and ask for the line range of the body instead of the text itself, the model returns numbers that are fully consistent with the email body it identifies (the identification itself is not 100% correct, but that is a different question).
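
A minimal sketch of the idea, assuming the model is prompted to reply with just the first and last line numbers of the body (e.g. '3-17'):

    import re

    def number_lines(thread: str) -> str:
        """Prefix every line of the thread with a 1-based line number."""
        return "\n".join(f"{i}: {line}" for i, line in enumerate(thread.splitlines(), start=1))

    def extract_body(thread: str, model_reply: str) -> str:
        """Slice the original thread using the line range the model returned."""
        numbers = [int(n) for n in re.findall(r"\d+", model_reply)]  # e.g. "3-17" -> [3, 17]
        first, last = numbers[0], numbers[-1]
        lines = thread.splitlines()
        return "\n".join(lines[first - 1:last])

The model only has to produce two short numbers; the returned text is always sliced verbatim from the original thread, so it cannot contain anything that was not there.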

Interestingly, not all Mistral-based models are able to provide line numbers. Mistral Instruct does, but Mistral OpenInstruct seems to always output text, entirely disregarding the prompt.