Indeed, if you use `set_transform`, the phonemized data are computed on the fly and are neither stored nor cached. Only the original OSCAR data are stored on your disk, as an Arrow file.
And you're right about your second point regarding BART-style pretraining: you can pass `set_transform` a function that returns two fields, the original text and a randomly masked copy of it, even if your dataset has only one column.