Hi all,
I have been looking for a model to run a NER task in French. I see there are CamemBERT and RoBERTa models for token classification, but these models are not fine-tuned for any NER task. Any suggestions? If no such model exists, is there a French dataset annotated for NER?
Thank you,
Sergul
I just asked Pedro (https://github.com/pjox); maybe he knows some good NER datasets for French that are publicly available for fine-tuning.
If someone could give me access to the FTB dataset (see the CamemBERT paper), I could fine-tune a model and upload it to the model hub.
An alternative would be to use “silver standard” datasets like WikiANN/PAN-X or WikiNER (both include French).
Thank you, @stefan-it. If I use WikiNER, do you know of a good way to convert it to CoNLL format?
The last time I worked with WikiNER, I wrote my own script that converts the dataset into a CoNLL-like format. You can find it here:
import sys

filename = sys.argv[1]


def parse_file(filename: str):
    with open(filename, "rt") as f_p:
        for line in f_p:
            line = line.rstrip()
            if not line:
                continue
            # Convert
            # L'|DET:ART|O Afghanistan|NAM|I-LOC a|VER:pres|O pour|PRP|O codes|NOM|O :|PUN|O
            # into CoNLL-like format:
            # token ner
            print(
                "\n".join(f"{' '.join(word.split('|')[::2])}" for word in line.split())
            )
(The embedded file is truncated here.)
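For readers who want something runnable end to end, here is a minimal sketch of the same WikiNER-to-CoNLL conversion. It is not the original script: the function names `parse_wikiner_line` and `to_conll` are made up for illustration, and it assumes every whitespace-separated item has the form token|POS|NER, as in the sample line from the comment above. Printing an empty line after each sentence is also an assumption, chosen because CoNLL-style readers usually expect it.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_wikiner_line(line: str) -> List[Tuple[str, str]]:
    """Split one WikiNER sentence into (token, ner_tag) pairs."""
    pairs = []
    for item in line.split():
        # Assumed item layout: token|POS|NER.
        # rsplit keeps any "|" that happens to be part of the token itself.
        token, _pos, ner = item.rsplit("|", 2)
        pairs.append((token, ner))
    return pairs


def to_conll(lines: Iterable[str]) -> Iterator[str]:
    """Yield "token ner" lines, followed by an empty line per sentence."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the input
        for token, ner in parse_wikiner_line(line):
            yield f"{token} {ner}"
        yield ""  # sentence boundary


sample = [
    "L'|DET:ART|O Afghanistan|NAM|I-LOC a|VER:pres|O pour|PRP|O codes|NOM|O :|PUN|O",
    "",
]
converted = list(to_conll(sample))
print("\n".join(converted))
```

To convert a real file, pass the open file handle instead of `sample`, for example `to_conll(open(path, encoding="utf-8"))`, and write the yielded lines to the output file.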
Excellent! Thank you so much, and please let me know if you upload a NER model for French.
Oh, one more thing: @stefan-it, do you have an evaluation script (F1, accuracy, etc.) for the CoNLL format?