Hello,
I’ve checked 18 GB of corpora (scraped from news, blogs, forums, and random sites), and all of them contain things like:
links, ads, HTML character entities (&gt; …), MySQL connection errors, etc.
I also have a few GB of plain text from ebooks, and all of them contain things like:
indexes, author and publisher front matter, page numbers, empty lines, weird/wrong formatting, etc.
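For context, this is roughly the kind of line-level filtering I could apply to strip that noise — a minimal sketch with made-up heuristic patterns, not a real cleaning pipeline:

```python
import re

# Heuristic filters for the noise described above. The patterns are
# illustrative assumptions, not an exhaustive or production cleaner.
URL_RE = re.compile(r"https?://\S+")
HTML_ENTITY_RE = re.compile(r"&[a-zA-Z]+;|&#\d+;")   # e.g. &gt; &amp; &#39;
PAGE_NUMBER_RE = re.compile(r"^\s*\d{1,4}\s*$")       # bare page-number lines
DB_ERROR_RE = re.compile(r"mysql.*error", re.IGNORECASE)

def clean_lines(text: str) -> str:
    kept = []
    for line in text.splitlines():
        if not line.strip():
            continue  # drop empty lines
        if PAGE_NUMBER_RE.match(line):
            continue  # drop lines that are only a page number
        if DB_ERROR_RE.search(line):
            continue  # drop database error dumps
        line = URL_RE.sub("", line)            # strip inline links
        line = HTML_ENTITY_RE.sub(" ", line)   # strip entities like &gt;
        kept.append(line.strip())
    return "\n".join(kept)
```

Even with a pass like this, plenty of subtler noise (ads, boilerplate headers) would survive, which is why I’m asking how clean the data really needs to be.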
Is it normal to use data like this for pre-training?
Is it only the fine-tuning data that has to be clean and precise?