Common evaluation datasets for NER

In my PhD research I developed a multi-layer perceptron (MLP) for recognizing medical terms in NIH library publications. I used the standard precision, recall, and F1 metrics for evaluation, along with NIH-provided, hand-annotated ground truth.
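
For concreteness, here is a minimal sketch of entity-level precision, recall, and F1, assuming exact-match scoring (a predicted entity counts only if its boundaries and label both match the annotation). The span tuples, label names, and the `entity_prf` helper are hypothetical illustrations, not taken from the NIH data or from my actual pipeline.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_token, end_token, label)
Span = Tuple[int, int, str]


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1."""
    # A true positive is a predicted span identical to a gold span.
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: three annotated entities, four predictions.
gold = {(0, 2, "DISEASE"), (5, 6, "DRUG"), (9, 11, "DISEASE")}
pred = {(0, 2, "DISEASE"), (5, 6, "DISEASE"), (9, 11, "DISEASE"), (13, 14, "DRUG")}

precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here the span at tokens 5 to 6 has the right boundaries but the wrong label, so under exact matching it counts as both a false positive and a false negative. Relaxed (partial-overlap) matching would score it differently.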

I’m looking for other evaluation metrics and standard benchmark datasets for NER that are widely recognized in the NLP community.

I searched the forums and couldn’t find an existing topic on this, so I’m posting it as a new one.

Suggestions would be most welcome.

