In other words, is it possible to train a supervised transformer model to pull out specific information from unstructured or semi-structured text, and if so, which pretrained model would be best for this?
In the resume example, I’d want to input the text version of a person’s resume and get JSON like the following as output: {"Education": ["BS Harvard University 2010", "MS Stanford University 2012"], "Experience": ["Microsoft, 2012-2016", "Google, 2016 - Present"]}
Obviously, I’ll need to label hundreds or thousands of resumes with their relevant Education and Experience fields before I’ll have a model that is capable of the above.
Here’s another example of the kind of solution I’m talking about, although that person seems to be using GPT-3 and didn’t provide any code. Is this something that any of the Hugging Face pipelines is capable of, and if so, which pipeline would be most appropriate?
Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.
Actually, this is a common NLP problem, and transformers are a good first port of call for a problem like this.
You should google these terms:
Named entity recognition
Chunking
You’ll need labeled data, usually marking every token in your document, perhaps with IOB tags that demarcate the start and end of a coherent chunk of text.
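To make the IOB idea concrete, here is a small illustrative sketch (the tag names like EDU/EXP and the helper function are invented for this example, not a standard): B- tags open a chunk, I- tags continue it, O means outside any chunk, and a short decoder recovers the labeled spans.

```python
# Decode IOB-tagged tokens back into labeled text chunks.
# B-<TYPE> begins a chunk, I-<TYPE> continues it, O is outside any chunk.
def decode_iob(tokens, tags):
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O" (or an inconsistent I- tag) closes any open chunk
            if current_type is not None:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks

tokens = ["BS", "Harvard", "University", "2010", ",", "then", "Microsoft"]
tags   = ["B-EDU", "I-EDU", "I-EDU", "I-EDU", "O", "O", "B-EXP"]
print(decode_iob(tokens, tags))
# [('EDU', 'BS Harvard University 2010'), ('EXP', 'Microsoft')]
```

A token-classification (NER) model would be trained to predict the tag sequence; the decoding step above is the same regardless of which model produces the tags.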
The problem I’m trying to solve, in the most general sense, is that you’re given a set of documents and each document in your set has specific information that you’re trying to pull out. Examples of specific information to pull out include:
Author of the document
When was the document written
Who is the recipient of the document
etc.
And keep in mind that the information in a document may not always be stated in an obvious way. In one document, the author may be given as “Author: Joe Shmo”; in another it might say “From Jane Doe”. But let’s assume that every document has an author, and that any adult with average English comprehension can pick out the author even though it isn’t stated the same way in each document. (Let’s also assume there are countless ways of stating who the author is, so no reasonable regex pattern can pull it out.) Ditto for the other fields, like the date the document was written and its intended recipient.
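A toy illustration of why regex is brittle here (the document strings and the pattern are invented for this example): a pattern written against one phrasing silently misses the others, and every new phrasing would require another hand-written rule.

```python
import re

docs = [
    "Author: Joe Shmo\nSome report text...",
    "From Jane Doe\nAnother report...",
    "This memo was prepared by J. Doe for the board.",
]

# A pattern written against the first phrasing only.
pattern = re.compile(r"Author:\s*(.+)")

hits = [m.group(1) if (m := pattern.search(d)) else None for d in docs]
print(hits)  # ['Joe Shmo', None, None] -- only 1 of 3 phrasings matched
```

A learned model, by contrast, can generalize across phrasings it has seen enough examples of, which is the whole motivation for the supervised approach.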
I originally thought of using a Question Answering model as a basis for this task but it might be overkill. Regarding the Resume example, I might end up training the model on just two questions and their respective answers for each resume in the dataset:
What is the education?
What is the experience?
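The question-per-field idea can be sketched without committing to a specific model: ask one fixed question per field and assemble the answers into the target JSON. In the sketch below, `qa` is a stand-in for a real extractive-QA model (for example, a Hugging Face question-answering pipeline); the stub and its hard-coded answers exist only to show the plumbing.

```python
import json

# One fixed question per field we want to extract.
FIELD_QUESTIONS = {
    "Education": "What is the education?",
    "Experience": "What is the experience?",
}

def extract_fields(resume_text, qa):
    """Ask one question per field; `qa(question, context)` returns an answer string."""
    return {field: qa(question, resume_text) for field, question in FIELD_QUESTIONS.items()}

# Stand-in for a trained extractive-QA model; a real one would be something like
# transformers.pipeline("question-answering"), called with question= and context=.
def fake_qa(question, context):
    return {
        "What is the education?": "BS Harvard University 2010",
        "What is the experience?": "Microsoft, 2012-2016",
    }[question]

print(json.dumps(extract_fields("...resume text...", fake_qa)))
```

Swapping the stub for a fine-tuned QA model leaves `extract_fields` unchanged, which is one reason the QA framing is attractive despite possibly being overkill.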
I suppose training a custom NER model might be another route to take.
Is there any reason you’re looking to do this with a transformer?
This is a common vision problem, and transformers aren’t usually the first port of call for a problem like this.
LayoutLM is a BERT-like model by Microsoft that adds layout information of the text tokens to improve performance on document processing tasks, such as information extraction from documents, document image classification, and document visual question answering. Its successors, LayoutLMv2 and LayoutLMv3, improve on its performance. Notebooks for all of these can be found in my GitHub repo.
Then there are also the Document Image Transformer, which is mostly suited for document image classification and layout analysis (like table detection in PDFs), and TrOCR, a Transformer-based encoder-decoder model for optical character recognition on single text-line images. Notebooks for those can also be found in my GitHub repo.
I have found deepset/tinyroberta-squad2, but it only works when the resume contains the field labels; I used it via Haystack.
I also tried impira/layoutlm-document-qa, which uses LayoutLM behind the scenes and works fine, but again: when I passed a resume that had only the values without the field labels, it could not detect them.