My test set and validation set each have 3 references created by humans. How can I evaluate my model during training?