Hello,
I’ve checked 18 GB of corpora (scraped from news, blogs, forums, and random sites), and all of them contain things like:
links, ads, HTML character entities (&gt; …), MySQL connection errors, etc.
I also have a few GB of plain text from ebooks, and all of them contain things like:
indexes, author and publisher front matter, page numbers, empty lines, weird/wrong formatting, etc.
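For context, this is roughly the kind of line-level filtering I could apply to strip that noise — a minimal sketch with made-up heuristic patterns, not a real cleaning pipeline:

```python
import re

# Heuristic filters for the noise described above. The patterns are
# illustrative assumptions, not an exhaustive or production cleaner.
URL_RE = re.compile(r"https?://\S+")
HTML_ENTITY_RE = re.compile(r"&[a-zA-Z]+;|&#\d+;")   # e.g. &gt; &amp; &#39;
PAGE_NUMBER_RE = re.compile(r"^\s*\d{1,4}\s*$")       # bare page-number lines
DB_ERROR_RE = re.compile(r"mysql.*error", re.IGNORECASE)

def clean_lines(text: str) -> str:
    kept = []
    for line in text.splitlines():
        if not line.strip():
            continue  # drop empty lines
        if PAGE_NUMBER_RE.match(line):
            continue  # drop lines that are only a page number
        if DB_ERROR_RE.search(line):
            continue  # drop database error dumps
        line = URL_RE.sub("", line)            # strip inline links
        line = HTML_ENTITY_RE.sub(" ", line)   # strip entities like &gt;
        kept.append(line.strip())
    return "\n".join(kept)
```

Even with a pass like this, plenty of subtler noise (ads, boilerplate headers) would survive, which is why I’m asking how clean the data really needs to be.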
Is it normal to use data like this for pre-training?
Is it only the fine-tuning data that has to be clean and precise?