How to train on a corpus of text without splitting it into Q&A JSON

If I have a huge corpus of text that I can’t split into Q&A JSON, how would I train a model on it (preferably Llama or Mistral, if that matters)?

Another question: how did OpenAI train their GPT models on all of the internet's text? (This may have the same answer as the previous question; I assume they couldn't split all of that into Q&A JSON.)
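
For context, GPT-style pretraining is next-token prediction on raw text, so no Q&A pairs are needed: the text is tokenized, cut into fixed-length blocks, and each block serves as both input and (shifted by one) label. Below is a minimal sketch of just that data-preparation step, not a training loop. The whitespace split is a stand-in assumption for a real subword tokenizer, and the helper names (`build_vocab`, `make_blocks`, `make_examples`) are hypothetical, not from any library.

```python
def build_vocab(tokens):
    # Assign each distinct token an integer id, in order of first appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def make_blocks(ids, block_size):
    # Cut the token-id stream into fixed-length blocks; the incomplete tail is dropped.
    n_blocks = len(ids) // block_size
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def make_examples(blocks):
    # For next-token prediction, the label at each position is the following token,
    # so inputs are block[:-1] and labels are block[1:].
    return [(block[:-1], block[1:]) for block in blocks]


# Toy "corpus"; a real one would be many documents concatenated together.
corpus = "the cat sat on the mat and the dog sat on the rug"
tokens = corpus.split()  # assumption: whitespace split instead of a real tokenizer
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

blocks = make_blocks(ids, block_size=4)
examples = make_examples(blocks)

for inputs, labels in examples:
    print(inputs, "->", labels)
```

The point of the sketch is that the "labels" come from the text itself, which is why a plain corpus is enough for this kind of training; Q&A-style JSON is only needed for instruction-style fine-tuning.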