How to interpret metrics for a Seq2Seq task?

I’m fine-tuning distilgpt2 to translate English sentences into regex (a specific type I implemented).

I am unsure how to interpret accuracy in this scenario and how exactly to evaluate model performance. The reported accuracy typically rises from around 60% at step 50 to around 70% at step 700, then plateaus.

But how is it computed exactly, and what does it mean when doing Seq2Seq tasks such as this? Is it simply the proportion of correctly predicted labels for a given sequence? If so, is that even informative for this task?
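
To make the question concrete, here is a minimal sketch of the two readings of "accuracy" that seem plausible. It assumes the metric compares predicted token ids against label ids position by position, and that ignored positions (padding or prompt tokens) are marked with -100, the usual ignore index for cross-entropy loss in PyTorch. The function names and the toy ids are hypothetical, not taken from any library.

```python
def token_accuracy(pred_ids, label_ids, ignore_index=-100):
    """Fraction of non-ignored token positions predicted correctly."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(pred_ids, label_ids):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue  # skip padding / masked prompt positions
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


def exact_match_accuracy(pred_ids, label_ids, ignore_index=-100):
    """Fraction of sequences where every non-ignored token is correct."""
    if not label_ids:
        return 0.0
    hits = 0
    for pred_seq, label_seq in zip(pred_ids, label_ids):
        hits += int(all(
            pred == label
            for pred, label in zip(pred_seq, label_seq)
            if label != ignore_index
        ))
    return hits / len(label_ids)


# Toy data (made up): the first sequence has one wrong token, the second is perfect.
preds = [[1, 2, 3, 4], [5, 6, 7, 8]]
labels = [[-100, 2, 3, 9], [5, 6, -100, 8]]

tok_acc = token_accuracy(preds, labels)
seq_acc = exact_match_accuracy(preds, labels)
```

On this toy data the token-level figure is 5/6 while only 1 of 2 sequences is fully correct. Since a single wrong token can make a regex invalid or change what it matches, a token-level number around 70% could coexist with very few usable outputs.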

If not, what options do I have to track performance?
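
One candidate beyond string or token comparison would be a functional check: treat a generated regex as correct if it accepts and rejects the same probe strings as the reference. The sketch below assumes the generated patterns can be compiled by Python's `re` module, which may not hold for a custom regex type without a translation step. `functional_match`, `functional_accuracy`, and the probe strings are hypothetical names and data chosen for illustration.

```python
import re


def functional_match(pred_pattern, ref_pattern, probes):
    """True if both patterns agree (full match or not) on every probe string."""
    try:
        pred = re.compile(pred_pattern)
    except re.error:
        return False  # an invalid generated regex counts as a failure
    ref = re.compile(ref_pattern)
    return all(
        bool(pred.fullmatch(s)) == bool(ref.fullmatch(s))
        for s in probes
    )


def functional_accuracy(pred_patterns, ref_patterns, probes):
    """Fraction of predictions that behave like their reference on the probes."""
    pairs = list(zip(pred_patterns, ref_patterns))
    if not pairs:
        return 0.0
    hits = sum(functional_match(p, r, probes) for p, r in pairs)
    return hits / len(pairs)


# Made-up example: three candidate outputs for a reference meaning "one or more digits".
probes = ["123", "abc", "", "4a"]
predictions = [r"[0-9]+", r"\d*", "("]
references = [r"\d+", r"\d+", r"\d+"]

score = functional_accuracy(predictions, references, probes)
```

Here `[0-9]+` differs textually from `\d+` but agrees on every probe, `\d*` is rejected because it accepts the empty string, and `(` is rejected because it does not compile. The result is only as trustworthy as the probe set, so per-example probes that include near-miss strings would matter.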

When I run inference on examples I come up with myself, the model doesn’t seem very good. I’ve seen other regex-generating models do far better than mine. Ultimately, it’s most likely an issue with the small dataset.