For the SpanBERT model fine-tuned on the TACRED dataset (i.e. mrm8488/spanbert-base-finetuned-tacred · Hugging Face), the subject and object entities (PERSON, ORGANIZATION, etc.) are replaced by unused tokens in the original code:
> replace the subject and object entities by their NER tags such as "[CLS][SUBJ-PER] was born in [OBJ-LOC] , Michigan, . . ."
as described in their paper (https://arxiv.org/pdf/2004.14855.pdf) and implemented in their code (https://github.com/facebookresearch/SpanBERT/blob/master/code/run_tacred.py), lines 134 to 139:
```python
def get_special_token(w):
    if w not in special_tokens:
        special_tokens[w] = "[unused%d]" % (len(special_tokens) + 1)
    return special_tokens[w]

# ...

SUBJECT_START = get_special_token("SUBJ_START")
SUBJECT_END = get_special_token("SUBJ_END")
OBJECT_START = get_special_token("OBJ_START")
OBJECT_END = get_special_token("OBJ_END")
SUBJECT_NER = get_special_token("SUBJ=%s" % example.ner1)
OBJECT_NER = get_special_token("OBJ=%s" % example.ner2)
```
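To see why the mapping cannot be reconstructed without the data, here is a minimal, self-contained sketch of the assignment logic above. The NER strings used below are hypothetical; the point is that tokens are handed out in first-encounter order, so the NER-type tokens depend on the order in which the TACRED examples were processed:

```python
# Sketch of the assignment logic: tokens are allocated in first-encounter order.
special_tokens = {}

def get_special_token(w):
    if w not in special_tokens:
        special_tokens[w] = "[unused%d]" % (len(special_tokens) + 1)
    return special_tokens[w]

# If the four boundary markers are requested first (as in the loop above),
# they get deterministic indices:
assert get_special_token("SUBJ_START") == "[unused1]"
assert get_special_token("SUBJ_END") == "[unused2]"
assert get_special_token("OBJ_START") == "[unused3]"
assert get_special_token("OBJ_END") == "[unused4]"

# The NER-type tokens ("SUBJ=PERSON", "OBJ=CITY", ...) are assigned as each
# example is read, so their numbering depends on the dataset order; this is
# exactly why the original TACRED data is needed to recover the mapping.
get_special_token("SUBJ=PERSON")  # hypothetical: would be [unused5] only if
                                  # PERSON happens to be the first subject type
print(special_tokens)
```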
The issue is that, to use the pre-trained model, one has to substitute those tokens before tokenizing, but there is no way to recover the originally assigned tokens without the original data (which is not freely available). Does anyone with access to the TACRED dataset have a way to obtain these tokens (or the special_tokens dict) by running the original code, and share it? Ideally it could be added somewhere in the repo so it is easy to access.
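For what it's worth, once someone shares the dict, the preprocessing step could look roughly like the sketch below. The mapping here is invented purely for illustration; the real [unusedN] indices must come from the original TACRED run, and the function name and span convention are my own:

```python
# ASSUMED mapping for illustration only; the real indices come from the
# shared special_tokens dict.
ASSUMED_MAP = {
    "SUBJ=PERSON": "[unused5]",  # hypothetical index
    "OBJ=CITY": "[unused6]",     # hypothetical index
}

def mask_entities(tokens, subj_span, subj_ner, obj_span, obj_ner):
    """Replace the subject/object token spans (inclusive start/end indices)
    with their NER special tokens, mirroring the substitution in the paper."""
    out = []
    i = 0
    while i < len(tokens):
        if i == subj_span[0]:
            out.append(ASSUMED_MAP["SUBJ=%s" % subj_ner])
            i = subj_span[1] + 1
        elif i == obj_span[0]:
            out.append(ASSUMED_MAP["OBJ=%s" % obj_ner])
            i = obj_span[1] + 1
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "Bill Gates was born in Seattle , Washington".split()
print(mask_entities(tokens, (0, 1), "PERSON", (5, 5), "CITY"))
# -> ['[unused5]', 'was', 'born', 'in', '[unused6]', ',', 'Washington']
```

The resulting token list would then be joined and passed to the model's tokenizer, since the [unusedN] entries already exist in the BERT vocabulary.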
Thanks!