HTML Embedding processing

I am interested in creating embeddings for HTML tags on a web page. Does BERT have such a model?
Are there other works that could be relevant?

Hi,

MarkupLM is such a model. I’ll add it to HuggingFace Transformers soon.

Can I already use it?

Technically yes, as follows:

!pip install -q git+https://github.com/NielsRogge/transformers.git@modeling_markuplm

from transformers import MarkupLMTokenizer, MarkupLMModel

tokenizer = MarkupLMTokenizer.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

with open("path_to_html_file") as f:
    html_string = f.read()

encoding = tokenizer(html_string, return_tensors="pt")

# forward pass
outputs = model(**encoding)

Note that there are still some bugs and rough edges, which will be fixed soon.
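If the goal is embeddings for the HTML tokens (as in the original question), they can be read off the outputs once the forward pass succeeds. A minimal sketch, assuming the standard Transformers output attributes:

# One embedding vector per (sub)token, including the tokens that encode HTML structure
token_embeddings = outputs.last_hidden_state  # shape: (batch_size, seq_len, hidden_size)

# Map each embedding back to its token for inspection
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
for token, vector in zip(tokens, token_embeddings[0]):
    print(token, vector.shape)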

Hi. Thanks so much for working on this. I ran the code above in Colab and am hitting the following error on the encoding line:

__init__() missing 2 required positional arguments: 'merges_file' and 'tags_dict'

Do I need to clone the unilm markup repo too somehow?

I think what needs to be done is to change:

encoding = MarkupLMTokenizer(html_string, return_tensors="pt")

to

encoding = tokenizer(html_string, return_tensors="pt")

Calling the MarkupLMTokenizer class directly invokes its constructor (which is why it complains about the missing merges_file and tags_dict arguments); the tokenizer instance returned by from_pretrained is what should be called on the HTML string.

Yes, my mistake, thanks for fixing it.

@nielsr thanks again for working on MarkupLM. It was a good small learning experience trying to use your branch to get it running for fine-tuning on the WebSRC dataset per the creators’ GitHub instructions. A draft Space for QA on HTML strings is here:

https://huggingface.co/spaces/FuriouslyAsleep/markupQAdemo
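For anyone trying the same thing, here is a rough sketch of extractive QA over an HTML string. The MarkupLMForQuestionAnswering class name and the question/HTML pair call are assumptions following the usual Transformers QA pattern, so they may not match the branch exactly:

import torch
from transformers import MarkupLMTokenizer, MarkupLMForQuestionAnswering

tokenizer = MarkupLMTokenizer.from_pretrained("microsoft/markuplm-base")
# assumed class name; a fine-tuned checkpoint would replace the base weights
model = MarkupLMForQuestionAnswering.from_pretrained("microsoft/markuplm-base")

question = "What is the page about?"
encoding = tokenizer(question, html_string, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# pick the most likely start/end positions and decode the answer span
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = tokenizer.decode(encoding["input_ids"][0, start : end + 1])
print(answer)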

I know it’s not your job, but any idea why the fine-tuning was gobbling up 70+ GB of RAM before I trimmed the training set down to just one small single-domain subset? Colab just kept giving out on me.

The fine-tuning Colab is at: