Is it possible to create a Résumé parser using a Huggingface model?

In other words, is it possible to train a supervised transformer model to pull out specific information from unstructured or semi-structured text, and if so, which pretrained model would be best for this?

In the resume example, I’d want to input the text version of a person’s resume and get JSON like the following as output: {"Education": ["BS Harvard University 2010", "MS Stanford University 2012"], "Experience": ["Microsoft, 2012-2016", "Google, 2016 - Present"]}

Obviously, I’ll need to label hundreds or thousands of resumes with their relevant Education and Experience fields before I’ll have a model that is capable of the above.

Here’s another example of the kind of solution I’m talking about, although this person seems to be using GPT-3 and didn’t provide any code. Is this something that any of the Hugging Face pipelines is capable of, and if so, which pipeline would be most appropriate?

Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.

I’m looking to do this with a transformer because I’ll be receiving the raw text, not images, as input.

And yes, I’m using the Resume example as a proxy for a confidential use case at my company.

Yes, we’ve seen companies using transformers for similar use cases. If you don’t have a lot of labelled data, it usually involves a mix of zero-shot classification to understand the sections and then NER to extract the right information for the right classes. Sometimes you want to add entity linking to the mix, depending on how elaborate a system you need.
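To make the two-stage idea concrete, here is a rough sketch, not a description of any particular production system. It assumes the resume has already been split into sections, and it uses the default models that the `zero-shot-classification` and `token-classification` pipelines download, which may not suit resumes well. The helper names `build_hf_pipelines` and `parse_sections` are made up for illustration.

```python
from typing import Callable, Dict, List


def build_hf_pipelines():
    """Create both pipelines (this downloads default models on first use)."""
    # Imported lazily so parse_sections below has no hard dependency.
    from transformers import pipeline

    zero_shot = pipeline("zero-shot-classification")
    ner = pipeline("token-classification", aggregation_strategy="simple")

    def classify(text: str, labels: List[str]) -> str:
        # The zero-shot pipeline returns labels sorted by descending score.
        return zero_shot(text, candidate_labels=labels)["labels"][0]

    def extract(text: str) -> List[str]:
        # With aggregation enabled, each entity is a dict with a "word" key.
        return [entity["word"] for entity in ner(text)]

    return classify, extract


def parse_sections(
    sections: List[str],
    labels: List[str],
    classify: Callable[[str, List[str]], str],
    extract: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Assign each section to a label, then collect the entities found in it."""
    parsed: Dict[str, List[str]] = {label: [] for label in labels}
    for section in sections:
        label = classify(section, labels)
        parsed[label].extend(extract(section))
    return parsed
```

With the real models you would call `classify, extract = build_hf_pipelines()` and then `parse_sections(sections, ["Education", "Experience"], classify, extract)`. Passing the two steps in as plain functions makes it easy to swap in a fine-tuned NER model later, once labelled data exists.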

Would you be open to jumping on a call to see if some of our commercial offerings could be useful, or are you looking at doing it all by yourself?


This is a common NLP problem, and transformers are a good first port of call for a problem like this.

You should google these terms:

  • Named entity recognition
  • Chunking

You’ll need to have labeled data, usually marking every token in your document with perhaps IOB tags that can demarcate the start and end of a coherent chunk of text.
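As a small illustration of what IOB labelling looks like, here is a sketch in plain Python. The token-index span format and the function names `spans_to_iob` and `iob_to_chunks` are my own assumptions for the example; annotation tools each have their own export format.

```python
from typing import Dict, List, Tuple


def spans_to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start_token, end_token_exclusive, label) spans into IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens inside the chunk
    return tags


def iob_to_chunks(tokens: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group tagged tokens back into text chunks, keyed by label."""
    chunks: Dict[str, List[str]] = {}
    label, words = None, []
    # A trailing "O" sentinel flushes a chunk that ends on the last token.
    for token, tag in zip(tokens + [""], tags + ["O"]):
        if tag.startswith("I-") and tag[2:] == label:
            words.append(token)
            continue
        if label is not None:
            chunks.setdefault(label, []).append(" ".join(words))
        if tag == "O":
            label, words = None, []
        else:
            # "B-X" starts a chunk; a stray "I-X" is treated as a start too.
            label, words = tag[2:], [token]
    return chunks


tokens = "BS Harvard University 2010 , MS Stanford University 2012".split()
tags = spans_to_iob(tokens, [(0, 4, "Education"), (5, 9, "Education")])
chunks = iob_to_chunks(tokens, tags)
```

Here `tags` begins `B-Education, I-Education, ...`, and `chunks` groups the tokens back into the two education entries. A token-classification model is trained to predict the `tags` list, and the second function is the kind of post-processing that turns its predictions into the output dictionary.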

The problem I’m trying to solve, in the most general sense, is that you’re given a set of documents and each document in your set has specific information that you’re trying to pull out. Examples of specific information to pull out include:

  • Author of the document
  • When was the document written
  • Who is the recipient of the document
  • etc.

And keep in mind information in the document may not always be stated in an obvious way. In one document, the author may be given as “Author: Joe Shmo”. In another one it might say: “From Jane Doe”. But let’s assume that every document has an author and any normal adult human with an average comprehension of English can pull out the author even though the author may not be stated in the exact same way in each document (and let’s assume there are countless ways of saying who the author is and no reasonable Regex pattern can be used to pull it out.) Ditto to the other fields like the date when it’s written and the intended recipient of the document.

I originally thought of using a Question Answering model as a basis for this task but it might be overkill. Regarding the Resume example, I might end up training the model on just two questions and their respective answers for each resume in the dataset:

  • What is the education?
  • What is the experience?

I suppose training a custom NER model might be another route to take.
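For the question answering route, the labelled resumes would need to be turned into question/context/answer records. Below is a sketch assuming a SQuAD-style layout (`answers` holding `text` and `answer_start` lists); the `QUESTIONS` mapping and `build_qa_examples` are names invented for this example.

```python
from typing import Dict, List

# Hypothetical mapping from output field to the question asked of the model.
QUESTIONS = {
    "Education": "What is the education?",
    "Experience": "What is the experience?",
}


def build_qa_examples(context: str, fields: Dict[str, List[str]]) -> List[dict]:
    """Build one SQuAD-style record per field from a labelled resume."""
    examples = []
    for field, answer_texts in fields.items():
        starts = []
        for text in answer_texts:
            start = context.find(text)  # character offset of the answer
            if start == -1:
                raise ValueError(f"Answer not found in context: {text!r}")
            starts.append(start)
        examples.append(
            {
                "question": QUESTIONS[field],
                "context": context,
                "answers": {"text": list(answer_texts), "answer_start": starts},
            }
        )
    return examples


resume = "Education: BS Harvard University 2010. Experience: Microsoft, 2012-2016."
examples = build_qa_examples(
    resume,
    {
        "Education": ["BS Harvard University 2010"],
        "Experience": ["Microsoft, 2012-2016"],
    },
)
```

One caveat: extractive QA models typically return a single span per question, so a field with several entries (two degrees, say) may need one question per entry. That limitation is part of why token classification can be the simpler fit here.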

Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.

LOL. What?


I misread. I thought he had the physical resumes and was jumping straight to:

Hi @clem would be very interested in checking out HF’s commercial offering for this. Can we chat somehow?

Hey clem. Can we jump on a call to look into your offerings regarding the same? Thanks.


We do have several models available for that. These include (at the time of writing):

LayoutLM is a BERT-like model by Microsoft that adds layout information about the text tokens to improve performance on document processing tasks such as information extraction from documents, document image classification and document visual question answering. Its successors, LayoutLMv2 and LayoutLMv3, improve on its performance. Notebooks for all of those can be found in my GitHub repo.
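One practical detail if you go the LayoutLM route: the model expects each token's bounding box as (x0, y0, x1, y1) on a 0-1000 scale rather than in raw pixels. A minimal sketch of that normalization follows, assuming an OCR step has already produced pixel boxes and the page size; the function name and the sample values are illustrative only.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], page_width: float, page_height: float) -> List[int]:
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Made-up OCR output for a page of 1000 x 2000 pixels.
words = ["Harvard", "University"]
pixel_boxes = [(100, 200, 300, 400), (320, 200, 560, 400)]
boxes = [normalize_box(box, 1000, 2000) for box in pixel_boxes]
```

This only applies when layout information is available. With raw text alone, as in the original question, a plain text model is the more natural starting point.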

There are also the Document Image Transformer, which is mostly suited for document image classification and layout analysis (like table detection in PDFs), and TrOCR, a Transformer-based encoder-decoder model for optical character recognition on single text-line images. Notebooks for those can also be found in my GitHub repo.