Which model is a better fit for email parsing and data extraction?


First of all, apologies if this is not the right section of the forum (I am quite new here).

I am working on a new project, or rather a new idea, to automate a process that is currently very manual:

Lots of emails come in and are processed manually by a person, who extracts data from them and copies it into spreadsheet software (Excel or similar).

I can clearly identify the attributes/fields I want to extract from the text/emails.

I also know the types of texts/emails that come in.

The “problem” is extracting that information accurately.

Email Example:

"Hello my friends.

we have two guys arriving tomorrow 23/01/2024 around 1pm from Madrid, flight ABC123
Another one people leaving on sunday same week around 2am to Instanbul, flight CBA321

Name of the arrving ones: Jose Mateo Feliz y Ana Triste del Carmen.
Name of the leaving Matia Nodoyuna"


Basically, the required output is a JSON object with a fixed, predefined list of attributes, each one filled in (or left empty if not found).
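To make the target format concrete, here is a minimal sketch of how model output could be coerced into a fixed attribute list, with missing attributes left empty. The attribute names (`direction`, `date`, `time`, `city`, `flight`, `passengers`) and the `normalize` helper are assumptions made up for illustration, not my real field list.

```python
import json

# Hypothetical schema: attribute name -> value used when the attribute is not found.
SCHEMA = {
    "direction": "",
    "date": "",
    "time": "",
    "city": "",
    "flight": "",
    "passengers": [],
}

def normalize(raw_output):
    """Parse the model's JSON text and return a list of records with exactly the SCHEMA keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if isinstance(parsed, dict):
        parsed = [parsed]  # accept a single record as well as a list of records
    if not isinstance(parsed, list):
        return []
    records = []
    for item in parsed:
        if not isinstance(item, dict):
            continue  # skip anything that is not a JSON object
        record = {}
        for name, default in SCHEMA.items():
            value = item.get(name)
            if value is None:
                # Copy list defaults so records never share the same list object.
                value = list(default) if isinstance(default, list) else default
            record[name] = value
        records.append(record)  # keys outside SCHEMA are dropped
    return records
```

One email can describe several trips (as in the example above), so the sketch returns a list of records rather than a single object.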

I have been playing with Mistral and Llama. They are OK but a little slow (I tested different quantization levels and parameter counts). I believe my scope is quite narrow, so with the right “small” model and some fine-tuning (I have thousands of example emails), I could get a much faster and more accurate model.

Any thoughts?
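For the fine-tuning part, this is a rough sketch of how labeled emails might be turned into training data, one JSON object per line (JSONL). The `prompt`/`completion` keys, the instruction wording, and the `to_training_line` helper are assumptions for illustration; the exact format depends on the training tool used.

```python
import json

# Hypothetical instruction text placed before every email.
INSTRUCTION = "Extract the travel details from the email as JSON."

def to_training_line(email_text, labels):
    """Build one JSONL line pairing an email with its expected JSON output."""
    example = {
        "prompt": f"{INSTRUCTION}\n\nEmail:\n{email_text.strip()}\n\nJSON:",
        # sort_keys keeps the attribute order identical across all examples
        "completion": json.dumps(labels, ensure_ascii=False, sort_keys=True),
    }
    # json.dumps escapes newlines, so each example stays on a single line
    return json.dumps(example, ensure_ascii=False)

if __name__ == "__main__":
    line = to_training_line(
        "One person leaving on Sunday around 2am to Istanbul, flight CBA321",
        {"direction": "departure", "flight": "CBA321"},
    )
    print(line)
```

Keeping the attribute order and the prompt template identical in every example should make the output format easier for a small model to learn.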

Thanks in advance!

Hi, I’m working on a similar use case and am curious to know whether you have found a successful solution. I’ve tried BERT fine-tuned on SQuAD, but the results were suboptimal. GPT-3.5 is the best so far, but I’m looking for an open-source option.