I’m trying to convert MathML representations of equations into Python symbolic equations using the SymPy library, e.g.
<mml:mi>h</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
  <mml:msub>
    <mml:mi>h</mml:mi>
    <mml:mi>c</mml:mi>
  </mml:msub>
  <mml:mo>+</mml:mo>
  <mml:msub>
    <mml:mi>h</mml:mi>
    <mml:mi>g</mml:mi>
  </mml:msub>
</mml:mrow>
would translate to
from sympy import *
h, h_c, h_g = symbols('h h_c h_g')
e = Eq(h, h_c + h_g)
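In case it helps frame the question: since the dataset will be synthetic, I was planning to sanity-check the target side of each pair by parsing it back with SymPy, roughly like the sketch below. validate_target is just a name I made up, and I'm relying on parse_expr's default behaviour of turning free names like h_c into Symbols:

from sympy import Eq
from sympy.parsing.sympy_parser import parse_expr

def validate_target(target: str) -> bool:
    # a malformed generation raises SyntaxError (or similar) inside parse_expr
    try:
        expr = parse_expr(target)
    except Exception:
        return False
    # Eq(lhs, rhs) over free symbols parses to an Equality instance
    return isinstance(expr, Eq)

assert validate_target("Eq(h, h_c + h_g)")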
I thought an encoder-decoder transformer like T5 would be a good fit for this task, and I was planning to fine-tune it on a synthetic dataset of MathML-to-SymPy pairs, following the Hugging Face Learn translation walkthrough. What I'm wondering is whether it would be beneficial to train a new tokenizer that recognises the nested structure of MathML, or whether the stock T5 AutoTokenizer would be good enough.
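Concretely, the comparison I had in mind looks something like this. The model name, the toy corpus, and the vocab_size are placeholder choices on my part, and as far as I can tell from the docs, train_new_from_iterator is the standard way to retrain a fast tokenizer's algorithm on new text:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
sample = "<mml:msub><mml:mi>h</mml:mi><mml:mi>c</mml:mi></mml:msub>"

# inspect how the stock vocab fragments MathML tags; anything missing
# from T5's SentencePiece vocab should come back as <unk>
print(tokenizer.tokenize(sample))

# retrain the same tokenization algorithm on MathML text; the one-string
# corpus and tiny vocab size are stand-ins for the real synthetic dataset
mathml_corpus = [sample]
new_tokenizer = tokenizer.train_new_from_iterator(mathml_corpus, vocab_size=100)

If the stock tokenizer splits every tag into single characters, sequence lengths blow up quickly on longer equations, which is partly what makes me think a custom tokenizer might pay off.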
I also considered using CodeT5, but it seems that it was trained only on code data, and therefore wouldn't have any knowledge of MathML the way T5 hopefully would.
Any input would be much appreciated!