XLNet model applied to text classification

I’m a data science student. I recently read the XLNet paper and have a question about it:

Imagine we have a dataset with many categories, say 200, and 20,000 instances to train and validate the model. For example:

text: about a specific objectA
category: objectA

I thought that having so many categories could be a problem for classification, so I considered giving the categories a parent-child relationship, like:

Jeans - jeansA, jeansB, jeansC, …
Shirts - shirtA, shirtB, shirtC, …

instead of: jeansA, jeansB, jeansC, shirtA, shirtB, shirtC, …
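To make the grouping concrete, here is a minimal sketch of the mapping I have in mind. The label names come from the example above; the `PARENT_OF` dictionary and the `build_hierarchy` helper are hypothetical names I made up for illustration, not part of XLNet or any library.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping from each flat (child) label to its parent category.
PARENT_OF: Dict[str, str] = {
    "jeansA": "Jeans", "jeansB": "Jeans", "jeansC": "Jeans",
    "shirtA": "Shirts", "shirtB": "Shirts", "shirtC": "Shirts",
}


def build_hierarchy(parent_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the flat labels under their parent category."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    # Sort the children so the result does not depend on insertion order.
    return {parent: sorted(kids) for parent, kids in children.items()}


HIERARCHY = build_hierarchy(PARENT_OF)
```

With 200 real categories the dictionary would of course be much larger, but the structure would be the same: one small set of parent labels, each with its own smaller set of child labels.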

My intention is to take advantage of hierarchical classification together with the XLNet model in order to improve accuracy. But this is where my question arises:

In many examples I have seen online (for example on Kaggle), people use XLNet directly after some pre-processing, so I’m not sure my idea is needed; maybe XLNet alone is powerful enough to achieve good classification. The question is: does what I’m proposing make sense, or have I misunderstood what XLNet does, given that I haven’t seen anyone apply this approach to many categories?

Or, since it is a pretrained model, is there no way to apply this approach?
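For reference, the two-stage idea I am describing could be sketched roughly as follows. This is only an illustration under my own assumptions: `predict_hierarchical` and `keyword_scorer` are hypothetical helpers, the keywords ("bootcut", "flannel", and so on) are made up, and the toy keyword scorers merely stand in for what would really be fine-tuned classifiers (for example, one XLNet model for the parent level and one per parent for the child level).

```python
from typing import Callable, Dict, Tuple

# A scorer maps a text to a score per label. In practice this would wrap a
# fine-tuned classifier; here it is just a placeholder interface.
Scorer = Callable[[str], Dict[str, float]]


def predict_hierarchical(
    text: str,
    parent_scorer: Scorer,
    child_scorers: Dict[str, Scorer],
) -> Tuple[str, str]:
    """Pick the best parent first, then the best child within that parent."""
    parent_scores = parent_scorer(text)
    parent = max(parent_scores, key=parent_scores.get)
    child_scores = child_scorers[parent](text)
    child = max(child_scores, key=child_scores.get)
    return parent, child


def keyword_scorer(keywords: Dict[str, str]) -> Scorer:
    """Toy stand-in for a model: score 1.0 if the label's keyword occurs."""
    def score(text: str) -> Dict[str, float]:
        lowered = text.lower()
        return {label: float(word in lowered) for label, word in keywords.items()}
    return score


# Hypothetical keywords, chosen only so the example runs end to end.
parent_scorer = keyword_scorer({"Jeans": "jeans", "Shirts": "shirt"})
child_scorers = {
    "Jeans": keyword_scorer({"jeansA": "slim", "jeansB": "bootcut", "jeansC": "wide"}),
    "Shirts": keyword_scorer({"shirtA": "linen", "shirtB": "flannel", "shirtC": "oxford"}),
}

result = predict_hierarchical("Bootcut jeans in dark denim", parent_scorer, child_scorers)
```

One thing I notice from writing it out: a mistake at the parent level cannot be corrected at the child level, so I am unsure whether this would actually beat a single flat classifier over all 200 categories.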