For the SpanBERT model fine-tuned on the TACRED dataset (i.e. mrm8488/spanbert-base-finetuned-tacred · Hugging Face), the subject and object entities (PERSON, ORGANIZATION, etc.) are replaced by unused tokens in the original code:
> replace the subject and object entities by their NER tags such as "[CLS][SUBJ-PER] was born in [OBJ-LOC] , Michigan, . . ."
as described in their paper (https://arxiv.org/pdf/2004.14855.pdf) and implemented in their code (https://github.com/facebookresearch/SpanBERT/blob/master/code/run_tacred.py), lines 134 to 139:
```python
def get_special_token(w):
    if w not in special_tokens:
        special_tokens[w] = "[unused%d]" % (len(special_tokens) + 1)
    return special_tokens[w]

# ...

SUBJECT_START = get_special_token("SUBJ_START")
SUBJECT_END = get_special_token("SUBJ_END")
OBJECT_START = get_special_token("OBJ_START")
OBJECT_END = get_special_token("OBJ_END")
SUBJECT_NER = get_special_token("SUBJ=%s" % example.ner1)
OBJECT_NER = get_special_token("OBJ=%s" % example.ner2)
```
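To see why the mapping cannot be reconstructed without the data, here is a minimal, self-contained sketch of the assignment logic above. The NER strings used below are hypothetical; the point is that tokens are handed out in first-encounter order, so the NER-type tokens depend on the order in which the TACRED examples were processed:

```python
# Sketch of the assignment logic: tokens are allocated in first-encounter order.
special_tokens = {}

def get_special_token(w):
    if w not in special_tokens:
        special_tokens[w] = "[unused%d]" % (len(special_tokens) + 1)
    return special_tokens[w]

# If the four boundary markers are requested first (as in the loop above),
# they get deterministic indices:
assert get_special_token("SUBJ_START") == "[unused1]"
assert get_special_token("SUBJ_END") == "[unused2]"
assert get_special_token("OBJ_START") == "[unused3]"
assert get_special_token("OBJ_END") == "[unused4]"

# The NER-type tokens ("SUBJ=PERSON", "OBJ=CITY", ...) are assigned as each
# example is read, so their numbering depends on the dataset order; this is
# exactly why the original TACRED data is needed to recover the mapping.
get_special_token("SUBJ=PERSON")  # hypothetical: would be [unused5] only if
                                  # PERSON happens to be the first subject type
print(special_tokens)
```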
The issue is that, to use the pre-trained model, one has to substitute those tokens before tokenizing, but there is no way to recover the originally assigned tokens without the original data (which is not freely available). Does anyone with access to the TACRED dataset have a way to obtain these tokens (or the special_tokens dict) by running the original code, and share it? Ideally it could be added somewhere in the repo so it is easy to access.
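For what it's worth, once someone shares the dict, the preprocessing step could look roughly like the sketch below. The mapping here is invented purely for illustration; the real [unusedN] indices must come from the original TACRED run, and the function name and span convention are my own:

```python
# ASSUMED mapping for illustration only; the real indices come from the
# shared special_tokens dict.
ASSUMED_MAP = {
    "SUBJ=PERSON": "[unused5]",  # hypothetical index
    "OBJ=CITY": "[unused6]",     # hypothetical index
}

def mask_entities(tokens, subj_span, subj_ner, obj_span, obj_ner):
    """Replace the subject/object token spans (inclusive start/end indices)
    with their NER special tokens, mirroring the substitution in the paper."""
    out = []
    i = 0
    while i < len(tokens):
        if i == subj_span[0]:
            out.append(ASSUMED_MAP["SUBJ=%s" % subj_ner])
            i = subj_span[1] + 1
        elif i == obj_span[0]:
            out.append(ASSUMED_MAP["OBJ=%s" % obj_ner])
            i = obj_span[1] + 1
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "Bill Gates was born in Seattle , Washington".split()
print(mask_entities(tokens, (0, 1), "PERSON", (5, 5), "CITY"))
# -> ['[unused5]', 'was', 'born', 'in', '[unused6]', ',', 'Washington']
```

The resulting token list would then be joined and passed to the model's tokenizer, since the [unusedN] entries already exist in the BERT vocabulary.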
Thanks!