SpanBERT TACRED tokens

For the SpanBERT model fine-tuned on the TACRED dataset (i.e. mrm8488/spanbert-base-finetuned-tacred · Hugging Face), the subject and object entities (i.e. PERSON, ORGANIZATION, etc.) are substituted by unused tokens in the original code:

> replace the subject and object entities by their NER tags such as “[CLS][SUBJ-PER] was born in [OBJ-LOC], Michigan, …”

as described in their paper (https://arxiv.org/pdf/2004.14855.pdf). It can also be found in their code, https://github.com/facebookresearch/SpanBERT/blob/master/code/run_tacred.py, at lines 134 to 139:

    def get_special_token(w):
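        # Assign the next free "[unusedN]" vocab slot the first time a
        # marker string is seen; return the same token on later calls.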
        if w not in special_tokens:
            special_tokens[w] = "[unused%d]" % (len(special_tokens) + 1)
        return special_tokens[w]

    ...

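    # Note: markers are numbered in order of first use. The boundary markers
    # get stable slots only if these are the first calls; the "SUBJ=<type>" /
    # "OBJ=<type>" markers depend on which entity types appear first in the
    # data, which is why the mapping cannot be reconstructed without the
    # original TACRED files.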
    SUBJECT_START = get_special_token("SUBJ_START")
    SUBJECT_END = get_special_token("SUBJ_END")
    OBJECT_START = get_special_token("OBJ_START")
    OBJECT_END = get_special_token("OBJ_END")
    SUBJECT_NER = get_special_token("SUBJ=%s" % example.ner1)
    OBJECT_NER = get_special_token("OBJ=%s" % example.ner2)

The issue is that, to use the pre-trained model, one has to substitute those tokens before tokenizing, but there is no way to recover the originally used ones without the original data (which is not freely available). Does anyone with access to the TACRED dataset have the ability to obtain these tokens (i.e. the special_tokens dict) by running the original code and share it? It could then be added somewhere in the repo so it can be easily accessed.
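For context, here is a minimal sketch of what inference would look like once the mapping is known. Everything in SPECIAL_TOKENS below is a made-up placeholder (recovering the real values is exactly what I'm asking for), and I'm assuming the checkpoint loads as a standard sequence-classification model:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL = "mrm8488/spanbert-base-finetuned-tacred"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)

    # PLACEHOLDER values: the real assignments depend on the order in which
    # markers were first seen during the original fine-tuning run.
    SPECIAL_TOKENS = {
        "SUBJ=PERSON": "[unused5]",
        "OBJ=LOCATION": "[unused6]",
    }

    # Mirror the original preprocessing: insert the "[unusedN]" markers as
    # pre-tokenized tokens (the standard BERT tokenizer would split them on
    # the brackets) and word-piece everything else.
    words = [SPECIAL_TOKENS["SUBJ=PERSON"], "was", "born", "in",
             SPECIAL_TOKENS["OBJ=LOCATION"], ",", "Michigan", "."]
    tokens = ["[CLS]"]
    for w in words:
        tokens += [w] if w in SPECIAL_TOKENS.values() else tokenizer.tokenize(w)
    tokens.append("[SEP]")

    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        logits = model(input_ids).logits
    print(model.config.id2label[logits.argmax(-1).item()])

With the real dict, only the SPECIAL_TOKENS block above would need to change.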
Thanks!

3 Likes

Tagging model author @mrm8488 :slight_smile:

1 Like

Hi, I uploaded the model from the FB research repo to the HF model hub in order to experiment with it, but I had kind of the same problem with the tokenizer. I have to check whether I can solve it, or fine-tune the model myself and re-upload it.

2 Likes

Hi @mrm8488, thank you for your time; your support would be really beneficial for me as well! A simple dictionary that maps the unused tokens to the entities would do the job.
Thanks! :slight_smile:

1 Like