MathML to Python translation - what model/tokenizer

I’m trying to convert MathML representations of equations into Python symbolic equations using the SymPy library, e.g.

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <mml:mi>h</mml:mi>
  <mml:mo>=</mml:mo>
  <mml:mrow>
    <mml:msub>
      <mml:mi>h</mml:mi>
      <mml:mi>c</mml:mi>
    </mml:msub>
    <mml:mo>+</mml:mo>
    <mml:msub>
      <mml:mi>h</mml:mi>
      <mml:mi>g</mml:mi>
    </mml:msub>
  </mml:mrow>
</mml:math>

would translate to

from sympy import *
h, h_g, h_c = symbols('h h_g h_c')
e = Eq(h, h_g + h_c)
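As a sanity check on the target output, here is a minimal rule-based translator for simple fragments like this, written with the standard library only. It is a sketch that handles just `mi`, `mo`, `msub`, and `mrow`; the function name `mathml_to_sympy_src` and the wrapping `<mml:math>` element are my additions, not part of any existing API:

```python
import xml.etree.ElementTree as ET

# The fragment from above, wrapped in <mml:math> so it parses as a document.
FRAGMENT = """\
<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <mml:mi>h</mml:mi>
  <mml:mo>=</mml:mo>
  <mml:mrow>
    <mml:msub><mml:mi>h</mml:mi><mml:mi>c</mml:mi></mml:msub>
    <mml:mo>+</mml:mo>
    <mml:msub><mml:mi>h</mml:mi><mml:mi>g</mml:mi></mml:msub>
  </mml:mrow>
</mml:math>"""

def local(tag):
    """Strip the namespace from an ElementTree tag like '{...}mi'."""
    return tag.rsplit('}', 1)[-1]

def emit(node):
    """Recursively flatten a MathML element into SymPy-style source text."""
    t = local(node.tag)
    if t in ('mi', 'mo'):
        return node.text.strip()
    if t == 'msub':
        base, sub = node
        return f"{emit(base)}_{emit(sub)}"
    if t in ('mrow', 'math'):
        return ' '.join(emit(child) for child in node)
    raise ValueError(f"unhandled MathML tag: {t}")

def mathml_to_sympy_src(xml_text):
    flat = emit(ET.fromstring(xml_text))          # e.g. "h = h_c + h_g"
    lhs, rhs = (s.strip() for s in flat.split('=', 1))
    names = sorted({tok for tok in flat.split() if tok not in '=+'})
    return '\n'.join([
        "from sympy import *",
        f"{', '.join(names)} = symbols('{' '.join(names)}')",
        f"e = Eq({lhs}, {rhs})",
    ])

print(mathml_to_sympy_src(FRAGMENT))
# prints:
# from sympy import *
# h, h_c, h_g = symbols('h h_c h_g')
# e = Eq(h, h_c + h_g)
```

A deterministic baseline like this is also a handy way to generate the synthetic training pairs in the first place.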

I thought an encoder-decoder transformer like T5 would suit this task, and I was planning to fine-tune the model on a synthetic dataset of MathML-to-SymPy pairs, following the Hugging Face Learn translation walkthrough. Would it be beneficial to train a new tokenizer that recognises the nested structure of MathML, or should I just use the T5 AutoTokenizer?

I also considered CodeT5, but it seems to have been trained only on code, so it wouldn’t have any prior exposure to MathML the way T5 hopefully would.

Any input would be much appreciated!


Hi @kj821,
According to this, I guess it would be better to train a new one.


Thanks for the link. That page also covers retraining the model from scratch, which I don’t believe I have enough data for. Is it feasible to train a new tokenizer and then use it to fine-tune a pretrained model?
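For what it’s worth, training the tokenizer itself is cheap and separate from training the model. A sketch with the `tokenizers` library (the corpus and vocab size here are made up for illustration):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy in-memory corpus of MathML sources; in practice you would stream
# your synthetic dataset here.
corpus = [
    "<mml:mi>h</mml:mi> <mml:mo>=</mml:mo> <mml:mi>x</mml:mi>",
    "<mml:msub> <mml:mi>h</mml:mi> <mml:mi>c</mml:mi> </mml:msub>",
] * 50

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=500,
    special_tokens=["<unk>", "<pad>", "</s>"],  # match T5's pad/eos names
)
tok.train_from_iterator(corpus, trainer=trainer)

enc = tok.encode("<mml:msub> <mml:mi>h</mml:mi> <mml:mi>g</mml:mi> </mml:msub>")
print(enc.tokens)
```

The caveat when pairing this with a pretrained model: a brand-new vocabulary means the model’s input embeddings no longer line up with the tokenizer, so you’d have to resize and effectively reinitialise the embedding matrix (e.g. `model.resize_token_embeddings(...)` in `transformers`), losing much of the pretraining benefit at the embedding layer. A gentler middle ground is often to keep the original T5 tokenizer and add the MathML tags as extra tokens with `tokenizer.add_tokens([...])`, then resize the embeddings, so only the new rows start from scratch.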
