HTML Embedding processing

I am interested in creating embeddings for the HTML tags of a web page. Does BERT have such a model?
Pointers to other relevant work would also be welcome.


MarkupLM is such a model. I’ll add it to HuggingFace Transformers soon.

Can I already use it?

Technically yes, as follows:

!pip install -q

from transformers import MarkupLMTokenizer, MarkupLMModel

tokenizer = MarkupLMTokenizer.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

# read the raw HTML of the page into a string
with open("path_to_html_file") as f:
    html_string = f.read()

# call the tokenizer instance loaded above, not the MarkupLMTokenizer class
encoding = tokenizer(html_string, return_tensors="pt")

# forward pass
outputs = model(**encoding)

Note that there are still some bugs and rough edges, which will be fixed soon.
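
For intuition about what such a model works with: MarkupLM pairs each text node of a page with the XPath of the element that contains it. Below is a minimal sketch of that extraction step using only the Python standard library. The names `NodeXPathExtractor` and `extract_nodes` are made up for illustration and are not part of Transformers, and the XPath format (an index on every step) is an assumption of this sketch rather than the library's exact output.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NodeXPathExtractor(HTMLParser):
    """Hypothetical helper: collect (text, xpath) pairs for non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, sibling index) for each currently open element
        self.counts = [{}]  # per-depth tally of child tags seen so far
        self.nodes = []     # extracted (text, xpath) pairs

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, seen[tag]))
        self.counts.append({})

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            xpath = "".join(f"/{t}[{i}]" for t, i in self.stack)
            self.nodes.append((text, xpath))


def extract_nodes(html_string):
    """Return a list of (text, xpath) pairs found in html_string."""
    parser = NodeXPathExtractor()
    parser.feed(html_string)
    parser.close()
    return parser.nodes


html_string = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
for text, xpath in extract_nodes(html_string):
    print(xpath, "->", text)
# /html[1]/body[1]/h1[1] -> Title
# /html[1]/body[1]/p[1] -> First
# /html[1]/body[1]/p[2] -> Second
```

This sketch does not skip script or style contents and only tolerates well-nested markup, so treat it as a way to inspect the structure of a page rather than a replacement for the library's own preprocessing.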