Hello, I am thinking of creating a custom dataset for resume parsing from raw text extracted from PDFs and DOCs. The reason for doing this is to train a small model that understands the context of resume entities directly from the raw extracted text, instead of using document-AI models like UniLM to extract structural details such as tables, lists, and text blocks.
The strategy I am thinking of using is:
- Extract the text from each resume document (just a simple copy and paste) and store it as a plain-text file such as “resume1.txt”, “resume2.txt”, etc.
- Annotate the entities from each resume as a row in a CSV file.
Is it possible to create a training dataset that has one text file per resume, for example:
- resume1.txt
- resume2.txt
with each of these resume .txt files annotated in tabular format as a row in a CSV file?
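To make the pairing concrete, here is a minimal sketch of how I picture loading such a dataset; the directory name, CSV file name, and column name are just my assumptions:

```python
# Minimal sketch: pair each resume .txt file with its CSV annotation row.
# Assumes a hypothetical layout where resumes/ holds the .txt files and
# annotations.csv has a "Resume File Name" column matching those file names.
from pathlib import Path

import pandas as pd

resume_dir = Path("resumes")
annotations = pd.read_csv("annotations.csv")

examples = []
for _, row in annotations.iterrows():
    text = (resume_dir / row["Resume File Name"]).read_text(encoding="utf-8")
    # Keep the raw text alongside every annotated field for that resume.
    examples.append({"text": text, **row.to_dict()})

print(f"Loaded {len(examples)} resume examples")
```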
For example, this is the CSV header for the annotated data, one row per resume .txt file (a single header line, wrapped here for readability):
```
Resume File Name, Name, Experience 1 Detail, Experience 2 Detail, Experience 3 Detail,
Date of Birth, Address, Phone, Email, Top 10 Skills,
Educational 1, Date 1, Educational 2, Date 2, Educational 3, Date 3,
Educational 4, Date 4, Educational 5, Date 5, Educational 6, Date 6,
Certifications, References,
Company Name 1, Experience Title 1, Experience Period 1,
Company Name 2, Experience Title 2, Experience Period 2,
Company Name 3, Experience Title 3, Experience Period 3,
Company Name 4, Experience Title 4, Experience Period 4,
Company Name 5, Experience Title 5, Experience Period 5,
Company Name 6, Experience Title 6, Experience Period 6,
Company Name 7, Experience Title 7, Experience Period 7,
Company Name 8, Experience Title 8, Experience Period 8
```
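If this layout makes sense, I assume each wide row would then be flattened into (start, end, label) character spans by locating the annotated values in the raw text, roughly like this (the column-to-label mapping is hypothetical and only covers a few fields):

```python
# Sketch: turn one wide CSV row into character-offset entity spans by exact
# string search in the raw text. This only finds values copied verbatim from
# the resume; fuzzy matching would be needed for anything reformatted.
import re

# Hypothetical mapping from CSV columns to entity labels.
COLUMN_TO_LABEL = {
    "Name": "NAME",
    "Email": "EMAIL",
    "Phone": "PHONE",
    "Company Name 1": "COMPANY",
    "Experience Title 1": "TITLE",
}

def row_to_spans(text: str, row: dict) -> list:
    spans = []
    for column, label in COLUMN_TO_LABEL.items():
        value = row.get(column)
        if not value or not isinstance(value, str):
            continue  # skip empty / NaN cells
        match = re.search(re.escape(value.strip()), text)
        if match:
            spans.append((match.start(), match.end(), label))
    return spans
```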
Would this strategy work? How would I arrange the dataset, and how would I make it trainable using Transformers?
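For context, this is roughly the training setup I had in mind once such spans exist, treating the task as token classification; the model checkpoint, label set, and toy example below are placeholders, not a worked-out solution:

```python
# Sketch of a token-classification setup in Hugging Face Transformers,
# assuming each example looks like {"text": ..., "spans": [(start, end, label), ...]}.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-NAME", "I-NAME", "B-EMAIL", "I-EMAIL"]  # extend per entity
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode(example):
    enc = tokenizer(example["text"], truncation=True,
                    return_offsets_mapping=True)
    token_labels = []
    for start, end in enc["offset_mapping"]:
        if start == end:              # special tokens such as [CLS] / [SEP]
            token_labels.append(-100)  # ignored by the loss
            continue
        tag = "O"
        for s, e, ent in example["spans"]:
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + ent
                break
        token_labels.append(label2id.get(tag, label2id["O"]))
    enc["labels"] = token_labels
    enc.pop("offset_mapping")
    return enc

# Toy example standing in for the real resume data.
data = Dataset.from_list([
    {"text": "Jane Doe, jane@example.com",
     "spans": [(0, 8, "NAME"), (10, 26, "EMAIL")]},
]).map(encode, remove_columns=["text", "spans"])

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

Any help and insight would be much appreciated. Thanks and regards.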