So far my tests indicate that when I do that, training uses much more RAM and accuracy gets much worse. That makes me think using those tokens somehow splits the examples apart (so the two texts end up as separate instances in each batch), which isn't what I want: the model should look at both texts together when making the classification.
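To illustrate what I mean, here's a minimal sketch of the two behaviors (assuming a Hugging Face tokenizer; `bert-base-uncased` is just a placeholder checkpoint, and the texts are made up). Passing both texts to one tokenizer call gives a single joint sequence the model attends across, while passing them as a list gives two separate examples:

```python
from transformers import AutoTokenizer

# Assumption: a BERT-style tokenizer; swap in whatever checkpoint you actually use.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text_a = "First text of the pair."
text_b = "Second text of the pair."

# Passing both texts as two positional arguments yields ONE sequence:
# [CLS] text_a [SEP] text_b [SEP], with token_type_ids marking each segment,
# so the model can attend across both texts jointly.
encoded = tokenizer(text_a, text_b, truncation=True, return_tensors="pt")

print(tokenizer.decode(encoded["input_ids"][0]))
# -> [CLS] first text of the pair. [SEP] second text of the pair. [SEP]
print(encoded["token_type_ids"][0])
# -> 0s over the first segment, 1s over the second

# By contrast, passing a LIST of strings tokenizes them as a batch of two
# independent examples, which would explain the "separate instances" effect:
batched = tokenizer([text_a, text_b], padding=True, return_tensors="pt")
print(batched["input_ids"].shape)  # -> (2, seq_len): two separate sequences
```

If the separation I'm seeing comes from accidentally doing the second thing, that would at least explain the accuracy drop, though I'm not sure it accounts for the RAM increase.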