Format of data during pre-training

What format should the data be in for pre-training? Can it be any raw data (e.g., news articles, in my case), with the task-specific format (e.g., labels for classification) only needed later, when I fine-tune?
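
To make the question concrete, here is a minimal sketch of the two data formats being contrasted. It assumes masked-word prediction as the pre-training objective, which is only one possible choice; the sample sentences, labels, and the helper `make_masked_example` are all hypothetical and not taken from any real dataset or library.

```python
import random

# Hypothetical raw corpus for pre-training: plain text, no labels.
pretraining_corpus = [
    "The central bank raised interest rates on Tuesday.",
    "The home team won the final match in extra time.",
]

# Hypothetical labeled data for fine-tuning on a classification task.
finetuning_examples = [
    {"text": "The central bank raised interest rates on Tuesday.", "label": "business"},
    {"text": "The home team won the final match in extra time.", "label": "sports"},
]


def make_masked_example(sentence, rng, mask_token="[MASK]"):
    """Build a self-supervised (input, target) pair by hiding one word.

    Assumption: whitespace tokenization and a single masked word,
    chosen only to keep the illustration short.
    """
    words = sentence.split()
    idx = rng.randrange(len(words))
    target = words[idx]
    masked = words[:idx] + [mask_token] + words[idx + 1:]
    return " ".join(masked), target


rng = random.Random(0)
for sentence in pretraining_corpus:
    masked, target = make_masked_example(sentence, rng)
    # The training signal comes from the text itself, not from a human label.
    print(masked, "->", target)

for example in finetuning_examples:
    # Here the target is an explicit, task-specific label.
    print(example["text"], "->", example["label"])
```

In this sketch the pre-training pairs are derived from the raw text alone, while the fine-tuning pairs need a label supplied for each example.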