I would like to know whether there is a model for generating embeddings of a Document Object Model (DOM). A DOM is a tree, so I suppose a model that handles trees would be a good choice.
The downstream task is to learn DOM similarity.
Given two DOM inputs, I am thinking of generating two DOM embeddings, emb_dom1 and emb_dom2, and then taking their cosine similarity for the similarity matching.
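To make the comparison step concrete, here is a minimal sketch in plain Python. It assumes the two embeddings already exist as equal-length lists of floats; the vectors below are made-up placeholders, not output from any real model, and `cosine_similarity` is just an illustrative helper name.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine is undefined for a zero vector; returning 0.0 is a
        # convention chosen for this sketch.
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder embeddings (assumption: a real model would produce these).
emb_dom1 = [0.12, 0.80, 0.35, 0.05]
emb_dom2 = [0.10, 0.75, 0.40, 0.20]

print(cosine_similarity(emb_dom1, emb_dom2))
```

A score near 1 would mean the two embeddings point in nearly the same direction; whether that corresponds to structurally similar pages depends entirely on how the embeddings were produced.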
Hey @neo-benjamin - Which approach did you take to generate the DOM embeddings? I’m working on a problem where I need to find similar pages based on their structure. I’m looking for a way to pass the structural info when creating the embedding.
@pallavJha, I’ve been trying to figure out DOM embeddings for the past couple of weeks. I looked to see if anyone had done anything similar, but the only thing I found was a project called webui, which I think is aimed more at understanding mobile user interfaces (though it’s been useful to explore). Since I didn’t find anything else, I’ve been working on a custom tokenizer that includes DOM info along with the text content, and I’ll probably try to generate embeddings from that once I get it working how I’d like. What I’ve done so far is on my GitHub. Currently it takes a Chrome DevTools Protocol (CDP) structure as input, but I purposely wrote it so it could also accept HTML via BeautifulSoup or similar (all the data I have is in the CDP structure, so that’s what I wrote first!).
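For anyone wondering what "DOM info along with the text content" could look like as a token stream, here is a rough sketch. To be clear, this is not the tokenizer from my repo: it is a simplified illustration that uses Python's standard-library `html.parser` rather than CDP or BeautifulSoup, and the `<tag@depth>` token format and the names `DomLinearizer` and `linearize_dom` are made up for this example. It also assumes reasonably well-formed HTML; unclosed tags will throw the depth counter off.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomLinearizer(HTMLParser):
    """Flatten HTML into a token list that mixes structure and text."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Record the tag name together with its depth in the tree.
        self.tokens.append(f"<{tag}@{self.depth}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Split text content on whitespace; whitespace-only runs add nothing.
        self.tokens.extend(data.split())


def linearize_dom(html):
    parser = DomLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


print(linearize_dom("<div><p>Hello world</p><br></div>"))
```

The resulting token list could then be fed to whatever sequence model produces the embedding. Dropping the text tokens and keeping only the tag tokens would give a structure-only variant, which might be closer to what's needed for comparing pages purely by layout.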