In other words, is it possible to train a supervised transformer model to pull out specific information from unstructured or semi-structured text, and if so, which pretrained model would be best for this?
In the resume example, I’d want to input the text version of a person’s resume and get JSON like the following as output: {"Education": ["BS Harvard University 2010", "MS Stanford University 2012"], "Experience": ["Microsoft, 2012-2016", "Google, 2016 - Present"]}
Obviously, I’ll need to label hundreds or thousands of resumes with their relevant Education and Experience fields before I’ll have a model that is capable of the above.
Here’s another example of the kind of solution I’m talking about, although that person seems to be using GPT-3 and didn’t provide any code. Is this something that any of the Hugging Face pipelines is capable of, and if so, which pipeline would be most appropriate?
Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.
Actually, this is a common NLP problem, and transformers are a good first port of call for a problem like this.
You should google these terms:
Named entity recognition
Chunking
You’ll need labeled data, usually marking every token in your document, perhaps with IOB tags that demarcate the start and end of a coherent chunk of text.
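To make the IOB idea concrete, here is a small illustrative sketch (the tag names like EDU/EXP and the helper function are invented for this example, not a standard): B- tags open a chunk, I- tags continue it, O means outside any chunk, and a short decoder recovers the labeled spans.

```python
# Decode IOB-tagged tokens back into labeled text chunks.
# B-<TYPE> begins a chunk, I-<TYPE> continues it, O is outside any chunk.
def decode_iob(tokens, tags):
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O" (or an inconsistent I- tag) closes any open chunk
            if current_type is not None:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks

tokens = ["BS", "Harvard", "University", "2010", ",", "then", "Microsoft"]
tags   = ["B-EDU", "I-EDU", "I-EDU", "I-EDU", "O", "O", "B-EXP"]
print(decode_iob(tokens, tags))
# [('EDU', 'BS Harvard University 2010'), ('EXP', 'Microsoft')]
```

A token-classification (NER) model would be trained to predict the tag sequence; the decoding step above is the same regardless of which model produces the tags.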
The problem I’m trying to solve, in the most general sense, is that you’re given a set of documents and each document in your set has specific information that you’re trying to pull out. Examples of specific information to pull out include:
Author of the document
When was the document written
Who is the recipient of the document
etc.
And keep in mind that the information in a document may not always be stated in an obvious way. In one document, the author may be given as “Author: Joe Shmo”; in another it might say “From Jane Doe”. But let’s assume that every document has an author, and that any adult with average English comprehension can pick out the author even though it isn’t stated the same way in each document. (Let’s also assume there are countless ways of stating who the author is, so no reasonable regex pattern can pull it out.) Ditto for the other fields, like the date the document was written and its intended recipient.
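A toy illustration of why regex is brittle here (the document strings and the pattern are invented for this example): a pattern written against one phrasing silently misses the others, and every new phrasing would require another hand-written rule.

```python
import re

docs = [
    "Author: Joe Shmo\nSome report text...",
    "From Jane Doe\nAnother report...",
    "This memo was prepared by J. Doe for the board.",
]

# A pattern written against the first phrasing only.
pattern = re.compile(r"Author:\s*(.+)")

hits = [m.group(1) if (m := pattern.search(d)) else None for d in docs]
print(hits)  # ['Joe Shmo', None, None] -- only 1 of 3 phrasings matched
```

A learned model, by contrast, can generalize across phrasings it has seen enough examples of, which is the whole motivation for the supervised approach.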
I originally thought of using a Question Answering model as a basis for this task but it might be overkill. Regarding the Resume example, I might end up training the model on just two questions and their respective answers for each resume in the dataset:
What is the education?
What is the experience?
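The question-per-field idea can be sketched without committing to a specific model: ask one fixed question per field and assemble the answers into the target JSON. In the sketch below, `qa` is a stand-in for a real extractive-QA model (for example, a Hugging Face question-answering pipeline); the stub and its hard-coded answers exist only to show the plumbing.

```python
import json

# One fixed question per field we want to extract.
FIELD_QUESTIONS = {
    "Education": "What is the education?",
    "Experience": "What is the experience?",
}

def extract_fields(resume_text, qa):
    """Ask one question per field; `qa(question, context)` returns an answer string."""
    return {field: qa(question, resume_text) for field, question in FIELD_QUESTIONS.items()}

# Stand-in for a trained extractive-QA model; a real one would be something like
# transformers.pipeline("question-answering"), called with question= and context=.
def fake_qa(question, context):
    return {
        "What is the education?": "BS Harvard University 2010",
        "What is the experience?": "Microsoft, 2012-2016",
    }[question]

print(json.dumps(extract_fields("...resume text...", fake_qa)))
```

Swapping the stub for a fine-tuned QA model leaves `extract_fields` unchanged, which is one reason the QA framing is attractive despite possibly being overkill.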
I suppose training a custom NER model might be another route to take.
Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.
LayoutLM is a BERT-like model by Microsoft that adds layout information of the text tokens to improve performance on document processing tasks, such as information extraction from documents, document image classification, and document visual question answering. Its successors, LayoutLMv2 and LayoutLMv3, improve on its performance. Notebooks for all of these can be found in my GitHub repo.
Then there are also the Document Image Transformer, which is mostly suited for document image classification and layout analysis (like table detection in PDFs), and TrOCR, a Transformer-based encoder-decoder model for optical character recognition on single text-line images. Notebooks for those can also be found in my GitHub repo.
I have found deepset/tinyroberta-squad2, but it only works when the resume contains the field labels; I used it via Haystack.
I also tried impira/layoutlm-document-qa, which uses LayoutLM behind the scenes and works fine, but again: when I passed a resume that had only the values without the field labels, it could not detect them.