BERT for token & sentence classification

Hi everyone,
I’m trying to build a resume parser as a NER task using BERT, so it would be a token-level classification task.
Now, I have a problem with the Work Experience section of the resume.
I would like to extract (date, job title, company name, job description).
The problem is that while the first three are entities of just a few words, the job description is made up of many words, so I don’t think token-level classification fits: I would have to attach a description label to every single word of the description, which seems inefficient.
The ideal solution would be sentence classification for the description, so the entire description gets a single label.
But in this way, BERT would have to perform token and sentence classification at the same time, and I don’t know if this is feasible.
I don’t want to use two BERTs, one for tokens and the other for sentences.
Is it possible to perform the two tasks at the same time with just one network?
Many thanks in advance


BERT produces a 768-dimensional vector for each token, each contextualized with some information about every other token in the input text. It can also combine all of those vectors (for a single input text) into a single 768-dimensional vector that can be treated as a representation of the whole input text.

I believe it should be possible for you to access both the token-vectors and the whole-text-vector, without having to delve into the model code.
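
A minimal sketch of that, assuming the Hugging Face transformers library and bert-base-uncased (the example sentence is just illustrative): last_hidden_state holds the per-token vectors and pooler_output the whole-text vector.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Software engineer at Acme Corp since 2018.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state  # (1, seq_len, 768): one vector per token
text_vector = outputs.pooler_output        # (1, 768): whole-input representation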

With a bit more trouble, you could create your own vector that is a combination of only the tokens from the job description. The tricky bit is deciding how to combine the vectors. Some researchers suggest using the vectors from each of the last four layers.
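
Continuing that idea, here is a self-contained sketch (the token indices for the description span are hypothetical; in practice they would come from your NER tags):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Built internal tooling for the data team.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of 13 tensors (embeddings + 12 layers), each (1, seq_len, 768)
last_four = torch.stack(outputs.hidden_states[-4:])  # (4, 1, seq_len, 768)
token_vectors = last_four.mean(dim=0)                # average over the last four layers

# Hypothetical: tokens 1..8 belong to the job description
desc_vector = token_vectors[0, 1:9].mean(dim=0)      # one 768-d vector for the description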

What are you planning to do with the embeddings once BERT has created them? Do you want your “job description” embedding to include a small amount of context information from your date/title/name? If not, you would need to put them through a separate BERT.

Are you planning to use BERT just to produce embeddings, or are you planning to fine-tune BERT to your task?

Note that BERT will only accept a maximum of 512 tokens per text.

BERT will be fine-tuned to perform the classification of my work experience section.
About the maximum number of tokens per text, I am planning to divide the work experience text into sentences of length < 512.
This “raw” split does not allow a perfect separation of the job description from the date/title/name, so a small amount of context information from date/title/name may end up in the description embeddings.
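
A crude version of that split might look like this (hypothetical helper; it cuts at arbitrary token boundaries, which is exactly why some context bleeds between chunks):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def split_ids(text, max_tokens=510):  # leave room for [CLS] and [SEP]
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]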

I can easily access both the “pooled output” (or the hidden layers’ output, averaging the last four layers) and the “token output” of my BERT:

import torch.nn as nn
import transformers

# BASE_MODEL_PATH and loss_fn are defined elsewhere in the training script
class EntityModel(nn.Module):
    def __init__(self, num_tag):
        super(EntityModel, self).__init__()
        self.num_tag = num_tag
        self.bert = transformers.BertModel.from_pretrained(BASE_MODEL_PATH)
        self.bert_drop_1 = nn.Dropout(0.3)
        self.out_tag = nn.Linear(768, self.num_tag)

    def forward(self, ids, mask, token_type_ids, target_pos, target_tag):
        # BERT's forward returns the token-level output, the pooled output and,
        # optionally, the hidden states of every layer; return_dict=False keeps
        # the tuple form so the unpacking below works on transformers v4+
        token_output, pooled_output = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids,
            return_dict=False,
        )

        bo_tag = self.bert_drop_1(token_output)  # (batch, seq_len, 768)
        tag = self.out_tag(bo_tag)               # (batch, seq_len, num_tag)

        loss_tag = loss_fn(tag, target_tag, mask, self.num_tag)

        return tag, loss_tag
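
(For reference, a hypothetical loss_fn matching that call signature: token-level cross-entropy computed only over non-padded positions.)

import torch
import torch.nn as nn

def loss_fn(logits, target, mask, num_labels):
    lfn = nn.CrossEntropyLoss()
    active = mask.view(-1) == 1
    active_logits = logits.view(-1, num_labels)
    # Padded positions get ignore_index so they don't contribute to the loss
    active_labels = torch.where(
        active,
        target.view(-1),
        torch.full_like(target.view(-1), lfn.ignore_index),
    )
    return lfn(active_logits, active_labels)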

My original idea was to use the pooled_output when the sentence is a “job description” sentence, so that the model predicts a single label for the whole sentence, and the token_output otherwise, so that the model predicts one label for each token.
But this does not seem possible, because the model can’t know the type of the sentence before making a prediction.
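
(For what it’s worth, a single network can at least compute both heads on every forward pass and train them jointly; whether that resolves the “which head do I trust?” question at inference time is another matter. A minimal sketch with hypothetical names, where num_sent is the number of sentence-level labels:)

import torch.nn as nn
import transformers

class TwoHeadModel(nn.Module):
    # One BERT backbone, two heads: per-token tags plus a per-sentence label
    def __init__(self, num_tag, num_sent):
        super().__init__()
        self.bert = transformers.BertModel.from_pretrained("bert-base-uncased")
        self.drop = nn.Dropout(0.3)
        self.out_tag = nn.Linear(768, num_tag)    # token-level head
        self.out_sent = nn.Linear(768, num_sent)  # sentence-level head

    def forward(self, ids, mask, token_type_ids):
        token_output, pooled_output = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids,
            return_dict=False,
        )
        tag_logits = self.out_tag(self.drop(token_output))
        sent_logits = self.out_sent(self.drop(pooled_output))
        # Training would sum a token loss and a sentence loss
        return tag_logits, sent_logits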

What sort of classification labels are you using? Do you have classification labels for each 512-token text in the training data?

I’m still confused. If you just want to detect whether a text is a date [or contains a date] or not, then I think BERT is not the best method.

I’ve been working with BERT and my data texts were often longer than 512. As a compromise, I took just the first 512 tokens in each document, and assumed that the label that related to the full text would also relate to the first 512 tokens. This did allow the model to learn, but I’m not at all sure it was a valid assumption.
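
(If it helps, that truncation is a single tokenizer argument in transformers; long_text here is a stand-in for one of the documents:)

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

long_text = "..."  # stand-in for a document longer than 512 tokens
enc = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
# enc["input_ids"] now holds at most 512 token ids; the rest are discarded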

At some point I need to train a model using only data texts that are already ideal lengths, and compare that with the truncated-texts version.

I would be even more worried about a model that used the same label for several sub-sections of text.

The labels I want to predict are the following:
date, job title, company name, job description, and an O label for the tokens that are not associated with any of these.

In a few words, I would like to find a way to label the job description text with just one label (and not one label for each token of the job description text), while at the same time labelling each token of the text that is not part of the job description.
The problem is that the job description text and the job title/date/company name text are in the same block of text, and I have to use just one model.

Thinking about it, I do not think this is possible, so I will perform a NER task for the job description text too (one label for each token of the job description text).
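
(One common way to set that up is a BIO tagging scheme; the label names and the toy example below are hypothetical:)

# Hypothetical BIO label set for the four entity types plus O
labels = ["O",
          "B-DATE", "I-DATE",
          "B-TITLE", "I-TITLE",
          "B-COMPANY", "I-COMPANY",
          "B-DESC", "I-DESC"]

tokens = ["2018", "Software", "Engineer", "Acme", "Corp",
          "Built", "internal", "tools", "for", "the", "team"]
tags   = ["B-DATE", "B-TITLE", "I-TITLE", "B-COMPANY", "I-COMPANY",
          "B-DESC", "I-DESC", "I-DESC", "I-DESC", "I-DESC", "I-DESC"]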

By the way, I have a question.
Did you run any tests, or do you have a reference, showing whether a BERT model can capture the semantic information of a 512-token block well? I’m asking because I know that the longer the text, the more difficult it is to capture its semantic content, and I think 512 tokens is quite a lot.

In addition, in my case I can’t truncate because I need all the information in the Work Experience text, so I think your doubt depends on the context you are working in.

In a few words, I would like to find a way to label the job description text with just one label (and not one label for each token of the job description text), while at the same time labelling each token of the text that is not part of the job description.
The problem is that the job description text and the job title/date/company name text are in the same block of text, and I have to use just one model.

Can you clarify what you’re trying to do? Here’s what I understand from what you wrote above:

  • You have a chunk of text from a resume
  • You want to identify the span of text for the job description.
  • You want to also identify the spans of text for date, job title, and company name.

What do you mean by:

I would like to find a way to label the job description text with just one label

Do you mean that after you identify the job description span of text, you want to classify that text?

I have been looking for a way to do the same task, but without success so far… I hit the same problem of how to recognise the job description. Since entity recognition typically picks out spans of only a couple of words, it seems difficult to recognise a whole job description.