Creating a dataset for custom pretraining of a speech recognition model

Hello everyone, I am creating a dataset to train a base model for speech recognition.
I'm at the stage of collecting text data, which I will then record in a studio to get labeled data for transfer learning once the base model has been trained.
My question is the following:
Does the data have to make sense from a linguistic point of view? I mean each individual sample.
Also, could someone help with training a base model for speech recognition with fairseq, using wav2vec-U?