I’m trying to convert MathML representations of equations into Python symbolic equations using the SymPy library, e.g.
<mml:mi>h</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
  <mml:msub>
    <mml:mi>h</mml:mi>
    <mml:mi>c</mml:mi>
  </mml:msub>
  <mml:mo>+</mml:mo>
  <mml:msub>
    <mml:mi>h</mml:mi>
    <mml:mi>g</mml:mi>
  </mml:msub>
</mml:mrow>
would translate to
from sympy import *
h, h_c, h_g = symbols('h h_c h_g')
e = Eq(h, h_c + h_g)
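In case it helps frame the question: since the dataset will be synthetic, I was planning to sanity-check the target side of each pair by parsing it back with SymPy, roughly like the sketch below. validate_target is just a name I made up, and I'm relying on parse_expr's default behaviour of turning free names like h_c into Symbols:

from sympy import Eq
from sympy.parsing.sympy_parser import parse_expr

def validate_target(target: str) -> bool:
    # a malformed generation raises SyntaxError (or similar) inside parse_expr
    try:
        expr = parse_expr(target)
    except Exception:
        return False
    # Eq(lhs, rhs) over free symbols parses to an Equality instance
    return isinstance(expr, Eq)

assert validate_target("Eq(h, h_c + h_g)")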
I thought an encoder-decoder transformer like T5 would be a good fit for this task, and I was planning to fine-tune it on a synthetic dataset of MathML-to-SymPy pairs, following the Hugging Face Learn translation walkthrough. What I'm wondering is whether it would be beneficial to train a new tokenizer that recognises the nested structure of MathML, or whether the stock T5 AutoTokenizer would be good enough.
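Concretely, the comparison I had in mind looks something like this. The model name, the toy corpus, and the vocab_size are placeholder choices on my part, and as far as I can tell from the docs, train_new_from_iterator is the standard way to retrain a fast tokenizer's algorithm on new text:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
sample = "<mml:msub><mml:mi>h</mml:mi><mml:mi>c</mml:mi></mml:msub>"

# inspect how the stock vocab fragments MathML tags; anything missing
# from T5's SentencePiece vocab should come back as <unk>
print(tokenizer.tokenize(sample))

# retrain the same tokenization algorithm on MathML text; the one-string
# corpus and tiny vocab size are stand-ins for the real synthetic dataset
mathml_corpus = [sample]
new_tokenizer = tokenizer.train_new_from_iterator(mathml_corpus, vocab_size=100)

If the stock tokenizer splits every tag into single characters, sequence lengths blow up quickly on longer equations, which is partly what makes me think a custom tokenizer might pay off.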
I also considered using CodeT5, but it seems that it was trained only on code data, and therefore wouldn't have any knowledge of MathML the way T5 hopefully would.
Any input would be much appreciated!