Question about noise in pre-training datasets

Hello,
I've checked 18 GB of corpora (scraped from news, blogs, forums, and random sites), and all of them contain things like:
links, ads, HTML character entities (&gt;, …), MySQL connection errors, etc.

I also have a few GB of plain text from ebooks, and all of them contain things like:
indexes, author and publisher front matter, page numbers, empty lines, weird/broken formatting, etc.

Is it normal to use these as-is for pre-training datasets?

Or is it just the fine-tuning data that has to be precise?
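
In case it helps to show what I mean by cleaning, this is the kind of rough heuristic pass I was wondering whether people bother with before pre-training. It's just a sketch; the patterns below are my own guesses based on the noise I've seen, not a validated pipeline:

```python
import html
import re

# Rough heuristics for the kinds of noise I keep seeing in my corpora.
TAG_RE = re.compile(r"<[^>]+>")          # leftover HTML tags
URL_RE = re.compile(r"https?://\S+")     # bare links
NOISE_RE = re.compile(
    r"(mysql.*(error|connect)|cookie policy|subscribe now|advertisement)",
    re.IGNORECASE,
)

def clean_line(line: str) -> str | None:
    """Return a cleaned line, or None if the line looks like pure noise."""
    line = html.unescape(line)           # turn &gt;, &amp;, ... back into characters
    line = TAG_RE.sub(" ", line)         # strip stray HTML tags
    line = URL_RE.sub(" ", line)         # strip bare URLs
    line = re.sub(r"\s+", " ", line).strip()
    if not line:
        return None                      # empty after cleaning
    if NOISE_RE.search(line):
        return None                      # ad / DB-error boilerplate
    if len(line) < 20:
        return None                      # short fragments (page numbers, headers)
    return line

if __name__ == "__main__":
    sample = [
        "Warning: mysql_connect(): Access denied for user",
        "&gt; This is an actual quoted sentence from a forum post.",
        "Page 42",
        "Read more at https://example.com/article?id=123",
    ]
    for raw in sample:
        cleaned = clean_line(raw)
        if cleaned:
            print(cleaned)
    # Only the quoted forum sentence survives this pass.
```

Is this kind of filtering worth the effort for pre-training data, or is the noise tolerated at that scale?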