Identifying solutions to a complex labelling problem

I have a fairly complex classification task that takes long texts as inputs. As it stands, humans perform this task by extracting answers from the long text and then assigning multiple labels to the text based on those extractions. However, the task is not purely extractive: the exact wording of the text matches the labels only some of the time. How would you frame this problem?

  1. Multilabel extractive or abstractive question answering, where the question is “What labels would you assign this text?”
  2. A summarization transformer that is trained to extract features for a multilabel classification model, and evaluated on how well the classification head performs.
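
To make the current human workflow concrete, here is a toy sketch of the extract-then-label process in Python. Everything in it is an assumption for illustration only: the label names and descriptions, the `threshold` value, and the word-overlap (Jaccard) scoring, which is just a stand-in for whatever learned model would do the extraction and classification.

```python
import re

# Hypothetical label set (names and descriptions invented for illustration).
LABELS = {
    "payment_terms": "payment due within days of invoice",
    "termination": "either party may terminate the agreement with notice",
    "confidentiality": "confidential information must not be disclosed",
}

def tokens(text):
    """Lowercased word set; a crude stand-in for a learned encoder."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Overlap between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def extract(document, label_desc, top_k=1):
    """Stage 1: return the sentences that best overlap a label description."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    desc_tokens = tokens(label_desc)
    ranked = sorted(sentences, key=lambda s: jaccard(tokens(s), desc_tokens), reverse=True)
    return ranked[:top_k]

def classify(document, threshold=0.2):
    """Stage 2: assign every label whose best extraction clears the threshold.

    Returns a dict mapping each assigned label to its supporting sentence.
    """
    assigned = {}
    for label, desc in LABELS.items():
        best = extract(document, desc)
        if best and jaccard(tokens(best[0]), tokens(desc)) >= threshold:
            assigned[label] = best[0]
    return assigned
```

A scorer like this only works when the text shares wording with the label, so a paraphrase such as "Fees must be settled a month after billing." gets no label at all. That gap is the non-extractive part of the task I am unsure how to frame.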

I have a feeling this isn’t as simple as a multilabel classification problem, because there is an aspect of summarization going on, and a plain multilabel classifier might be inefficient given that there are thousands of possible labels. Any opinion/experience helps!