Zero-shot classification for long-form text

I’m looking to do topic prediction/classification on long-form text (podcasts/transcripts), and I’m curious whether anyone knows of a model for this. I’ve looked through the existing zero-shot classification models, but they all appear to be optimized for short-form text like questions.

If anyone knows of such a model, I would appreciate it.

cc @joeddav who is the zero-shot expert here

Tbh your best approach is probably just to use one of the existing models and either (1) truncate the longer documents or (2) split them into smaller segments and ensemble the model’s predictions to get an overall label. There might be something more amenable to long sequences, but I doubt it would do much better than that if there is.
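
For anyone who wants to try option (2), here is a rough sketch of the split-and-ensemble idea. It is only one way to do it: the helper names (split_into_chunks, classify_long_text), the 200-word chunk size, and plain averaging of per-chunk scores are all assumptions of this sketch, not anything the library provides. Splitting on words is also only an approximation, since the model's real limit is in tokens. The classifier is passed in as a callable so the ensembling logic stays independent of any particular model.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def split_into_chunks(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Word count is a rough stand-in for the model's token limit (assumption).
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def classify_long_text(
    text: str,
    labels: List[str],
    classify: Callable[[str, List[str]], Dict[str, list]],
    max_words: int = 200,
) -> Dict[str, float]:
    """Average per-chunk label scores into one score per label.

    classify(chunk, labels) is expected to return a dict with parallel
    "labels" and "scores" lists, which is the shape the zero-shot
    pipeline returns for a single input string.
    """
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        raise ValueError("text is empty")

    totals: Dict[str, float] = defaultdict(float)
    for chunk in chunks:
        result = classify(chunk, labels)
        for label, score in zip(result["labels"], result["scores"]):
            totals[label] += score

    # Simple mean over chunks; other ensembling schemes are possible.
    return {label: totals[label] / len(chunks) for label in labels}


# Possible usage with the transformers pipeline (downloads a model, so it is
# left commented out; the model name is just one common choice):
#
# from transformers import pipeline
# zsc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# scores = classify_long_text(
#     transcript,
#     ["sports", "politics", "technology"],
#     lambda chunk, labels: zsc(chunk, candidate_labels=labels),
# )
# best_label = max(scores, key=scores.get)
```

Averaging treats every chunk equally. If some parts of a transcript are off-topic chatter, taking the max per label or weighting chunks by length might work better, but that is something to check on your own data.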

Just to clarify, the current default behavior of the library when running ZeroShotClassificationPipeline on very long text is (1), i.e. truncation.

(@joeddav correct me if I’m wrong; that’s what I infer from the transformers.pipelines.zero_shot_classification page of the transformers 4.13.0.dev0 documentation.)