Pipeline with custom dataset tokenizer: when to save/load manually

The idea is that you can write simple, readable code once and not worry about it redoing the downloading and pre-processing steps when you run it several times, because all of these operations are cached automatically.