HTML Embedding processing

I am interested in creating embeddings for HTML tags on a web page. Does BERT have such a model?
Are there other works that could be relevant?

Hi,

MarkupLM is such a model. I’ll add it to HuggingFace Transformers soon.

Can I already use it?

Technically yes, as follows:

!pip install -q git+https://github.com/NielsRogge/transformers.git@modeling_markuplm

from transformers import MarkupLMTokenizer, MarkupLMModel

tokenizer = MarkupLMTokenizer.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

with open("path_to_html_file") as f:
    html_string = f.read()

encoding = tokenizer(html_string, return_tensors="pt")

# forward pass
outputs = model(**encoding)

Note that there are still some bugs and rough edges, which will be fixed soon.
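If the goal is embeddings for the HTML tokens (as in the original question), they can be read off the outputs once the forward pass succeeds. A minimal sketch, assuming the standard Transformers output attributes:

# One embedding vector per (sub)token, including the tokens that encode HTML structure
token_embeddings = outputs.last_hidden_state  # shape: (batch_size, seq_len, hidden_size)

# Map each embedding back to its token for inspection
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
for token, vector in zip(tokens, token_embeddings[0]):
    print(token, vector.shape)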

Hi. Thanks so much for working on this. I ran the code above in Colab and am hitting the following error on the encoding line:

__init__() missing 2 required positional arguments: 'merges_file' and 'tags_dict'

Do I need to clone the unilm markup repo too somehow?

I think what needs to be done is to change:

encoding = MarkupLMTokenizer(html_string, return_tensors="pt")

to

encoding = tokenizer(html_string, return_tensors="pt")

Calling the MarkupLMTokenizer class directly invokes its constructor (which is why it complains about the missing merges_file and tags_dict arguments); the tokenizer instance returned by from_pretrained is what should be called on the HTML string.

Yes, my mistake, thanks for fixing it.

@nielsr thanks again for working on MarkupLM. It was a good small learning experience trying to use your branch to get it running for fine-tuning on the WebSRC dataset per the creators’ GitHub instructions. A draft Space for QA on HTML strings is here:

https://huggingface.co/spaces/FuriouslyAsleep/markupQAdemo
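For anyone trying the same thing, here is a rough sketch of extractive QA over an HTML string. The MarkupLMForQuestionAnswering class name and the question/HTML pair call are assumptions following the usual Transformers QA pattern, so they may not match the branch exactly:

import torch
from transformers import MarkupLMTokenizer, MarkupLMForQuestionAnswering

tokenizer = MarkupLMTokenizer.from_pretrained("microsoft/markuplm-base")
# assumed class name; a fine-tuned checkpoint would replace the base weights
model = MarkupLMForQuestionAnswering.from_pretrained("microsoft/markuplm-base")

question = "What is the page about?"
encoding = tokenizer(question, html_string, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# pick the most likely start/end positions and decode the answer span
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = tokenizer.decode(encoding["input_ids"][0, start : end + 1])
print(answer)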

I know it’s not your job, but any idea why the fine-tuning was gobbling up 70+ GB of RAM before I trimmed the training set down to just one small single-domain subset? Colab just kept giving out on me.

The fine-tuning Colab is at: