Web parsing in HuggingChat

alkibijad · October 10, 2023, 2:58pm

HuggingChat has access to web search, which is great.

However, the way it parses webpages isn’t great. AFAIU, it only looks at <p> tags.

This causes the model to fail at some tasks, because of the parser’s output. Does anyone have an idea of a better way to parse webpages? Any ideas how OpenAI is doing it?
If we just include additional tags, e.g. <span>, <div>, <li>… we can end up with a loooot of text.
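
To make the "too much text" concern concrete, here is a minimal sketch of one possible compromise: take visible text from any tag, skip obvious non-content containers, drop repeated snippets, and stop at a character budget. This is not how HuggingChat or OpenAI actually do it; the names `SKIP_TAGS`, `VisibleTextExtractor`, `extract_text` and the `max_chars` default are my own assumptions for illustration. It uses only Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Assumption: containers that rarely hold the main content of a page.
SKIP_TAGS = {"script", "style", "noscript", "nav", "footer", "header"}


class VisibleTextExtractor(HTMLParser):
    """Collect text from every tag except those nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.chunks.append(text)


def extract_text(html, max_chars=2000):
    """Return deduplicated visible text, cut off at roughly max_chars."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()

    seen = set()
    kept = []
    total = 0
    for chunk in parser.chunks:
        if chunk in seen:
            continue  # menus and labels often repeat verbatim
        seen.add(chunk)
        if total + len(chunk) > max_chars:
            break  # budget reached, stop adding text
        kept.append(chunk)
        total += len(chunk) + 1  # +1 for the joining newline
    return "\n".join(kept)
```

A hard character cap is crude, since it keeps whatever comes first in the page rather than what is most relevant to the query. Ranking chunks against the user's question before truncating would probably work better, but that is beyond this sketch.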

Here’s an example:

webpage: https://www.zara.com/us/en/viscose-blend-knit-polo-p06674305.html#:~:text=,90%20USD

The price of the item is $59.90.

ChatGPT’s response - it gets the price correctly:

The price of the VISCOSE BLEND KNIT POLO SHIRT on the provided link is $59.90 USD.
Is there anything else you would like to know?

HuggingChat’s response - it fails because it can’t extract the price as it’s inside <span>

[Screenshot of HuggingChat’s response]

Note that the model ends up hallucinating: none of the links list a price of $29.90.
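
The failure mode is easy to reproduce on a toy page. The sketch below is only an illustration: `SAMPLE_HTML` is hypothetical markup I made up to mimic a product page with the price in a <span> (it is not Zara's actual HTML), and `TagTextCollector` / `collect` are illustrative names, not part of HuggingChat.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect text that appears inside any of the given tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0  # > 0 while inside one of the wanted tags
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.texts.append(text)


def collect(html, tags):
    """Return the list of text snippets found inside `tags`."""
    parser = TagTextCollector(tags)
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<div class="product">
  <h1>VISCOSE BLEND KNIT POLO SHIRT</h1>
  <p>Knit polo shirt made of a viscose blend.</p>
  <span class="price">59.90 USD</span>
</div>
"""

p_only = collect(SAMPLE_HTML, {"p"})             # the price is missing here
p_and_span = collect(SAMPLE_HTML, {"p", "span"})  # the price is included here
print(p_only)
print(p_and_span)
```

With only <p> text in its context, the model never sees the price, so it has nothing to ground its answer on.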
