Multi-label classification for large free-text input

I have a dataset of around 100k rows with 2 columns:

  1. A large amount of free text (around 10k words per cell).
  2. A list of the labels that the free text belongs to, usually between 5 and 30 items.

There are around 12k unique labels in total.

I was wondering what the best way to train a model on this dataset would be, as the majority of transformers accept a maximum of only 512 or 1024 tokens as input.
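To make the input-limit problem concrete, here is a minimal sketch of one common workaround: split each document into overlapping windows that fit the model, score each window, and pool the per-window label scores. This is an assumption for illustration, not something proposed in this thread. The function names (`split_into_windows`, `max_pool_scores`), the `stride` value, and the whitespace tokenization (a stand-in for a real subword tokenizer) are all hypothetical.

```python
from typing import Dict, List


def split_into_windows(
    text: str, max_tokens: int = 512, stride: int = 384
) -> List[List[str]]:
    """Split text into overlapping windows of at most max_tokens tokens.

    Whitespace splitting is an assumption standing in for a real
    subword tokenizer; consecutive windows overlap by
    (max_tokens - stride) tokens.
    """
    if max_tokens <= 0 or not 0 < stride <= max_tokens:
        raise ValueError("need max_tokens > 0 and 0 < stride <= max_tokens")
    tokens = text.split()
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_tokens])
        # Stop once the current window already reaches the end of the text.
        if start + max_tokens >= len(tokens):
            break
        start += stride
    return windows


def max_pool_scores(window_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-window label scores by keeping the highest score per label."""
    pooled: Dict[str, float] = {}
    for scores in window_scores:
        for label, score in scores.items():
            if label not in pooled or score > pooled[label]:
                pooled[label] = score
    return pooled
```

Each window would then go through whatever classifier is chosen, and `max_pool_scores` merges the results so that a label counts as present if any window supports it. Max pooling is only one option; averaging the scores is another.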

Sounds like Annif could be usable for you. See also Annif-wiki and Annif-tutorial for instructions.

(FYI, I’m one of the Annif developers.)

I was not looking for a ready-made model, but for the best-practice concepts that should be used in this case.

Thank you for your input.
