Custom Dataset Creation Guidance For Resume Parsing

tnavin · October 30, 2023, 7:13am

Hello, I am thinking of creating a custom dataset for resume parsing from raw text extracted from PDFs and DOCs. The goal is to train a small model that understands the context of resume entities directly from the raw text extracted from PDFs, instead of using document AI models like UniLM to extract contextual details such as tables, lists, and text blocks.

The strategy I am thinking of using is:

1. Extract the text from each resume document into a text file (a simple copy and paste), stored as “resume1.txt”, “resume2.txt”, etc.
2. Annotate the entities from each resume as one row in a CSV file.

Is it possible to create a training dataset that has one text file per resume, for example:

resume1.txt
resume2.txt

And could each of these resume text files be annotated in tabular format as one row of a CSV file?
For example, this is the CSV format of the annotated data for each resume text file:

Resume File Name,Name,Experience 1 Detail,Experience 2 Detail,Experience 3 Detail,Date of Birth,Address,Phone,Email,Top 10 Skills,Educational 1,Date 1,Educational 2,Date 2,Educational 3,Date 3,Educational 4,Date 4,Educational 5,Date 5,Educational 6,Date 6,Certifications,References,Company Name 1,Experience Title 1,Experience Period 1,Company Name 2,Experience Title 2,Experience Period 2,Company Name 3,Experience Title 3,Experience Period 3,Company Name 4,Experience Title 4,Experience Period 4,Company Name 5,Experience Title 5,Experience Period 5,Company Name 6,Experience Title 6,Experience Period 6,Company Name 7,Experience Title 7,Experience Period 7,Company Name 8,Experience Title 8,Experience Period 8
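
To make the layout concrete, here is a minimal sketch of how the text files and the CSV rows could be joined into one record per resume. It uses only the Python standard library. The function name `load_resume_records` and the record keys "file", "text" and "entities" are my own assumptions for illustration, not part of any library; the only thing taken from the format above is the "Resume File Name" column.

```python
import csv
from pathlib import Path


def load_resume_records(text_dir, csv_path, file_column="Resume File Name"):
    """Pair each annotated CSV row with the raw text of the resume it names.

    Hypothetical helper: the name and the returned keys are illustrative.
    """
    text_dir = Path(text_dir)
    records = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # skipinitialspace tolerates headers written as "Name, Phone".
        reader = csv.DictReader(handle, skipinitialspace=True)
        for row in reader:
            file_name = (row.get(file_column) or "").strip()
            text_path = text_dir / file_name
            if not file_name or not text_path.is_file():
                continue  # skip rows whose text file is missing
            # Keep only filled-in cells, so unused slots such as
            # "Company Name 8" do not appear in the record.
            entities = {
                key.strip(): value.strip()
                for key, value in row.items()
                if key and key.strip() and key.strip() != file_column
                and value and value.strip()
            }
            records.append({
                "file": file_name,
                "text": text_path.read_text(encoding="utf-8"),
                "entities": entities,
            })
    return records
```

Each record is a plain dict, so the resulting list could be written out as JSON lines or handed to whichever dataset loader is used later. Dropping empty cells is a design choice here: most resumes will not fill all of the numbered slots.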

Would this strategy work? How would I arrange the dataset, and make it trainable using Transformers? Any help and insight would be much appreciated. Thanks and regards.
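
On the "trainable" part, one common formulation is token classification, which needs a label per token rather than a value per column. Below is a rough sketch of how the CSV values might be projected back onto the raw text as B-/I-/O tags. It assumes each annotated value appears verbatim in the text, uses whitespace tokens, and only tags the first occurrence; `bio_tags` and the label naming are hypothetical choices of mine, and subword alignment for a specific tokenizer would still be a separate step.

```python
import re


def bio_tags(text, entities):
    """Tag whitespace tokens with B-/I-/O labels by exact string match.

    Hypothetical helper: assumes annotated values occur verbatim in text.
    """
    # Each token is stored as (start_offset, end_offset, token_text).
    tokens = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(tokens)
    for field, value in entities.items():
        if not value:
            continue
        start = text.find(value)  # first occurrence only
        if start == -1:
            continue  # value not found verbatim; needs manual review
        end = start + len(value)
        label = field.upper().replace(" ", "_")
        first = True
        for i, (tok_start, tok_end, _) in enumerate(tokens):
            # A token belongs to the entity if the two character spans overlap
            # and the token has not already been claimed by another field.
            if tok_start < end and tok_end > start and labels[i] == "O":
                labels[i] = ("B-" if first else "I-") + label
                first = False
    return [tok for _, _, tok in tokens], labels


tokens, labels = bio_tags(
    "Jane Doe\nEmail: jane@example.com",
    {"Name": "Jane Doe", "Email": "jane@example.com"},
)
for token, label in zip(tokens, labels):
    print(token, label)
```

Values that were paraphrased or normalized during annotation (for example a reformatted date) will not be found by exact match, so copying them exactly as they appear in the text would matter for this approach.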
