Speed up Longformer Tokenizer

Hello,

My current Longformer model takes 2.5 hours to classify 80k documents, and most of that time is spent in the tokenizer. Is there any way to speed up the tokenizer?
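
Not a definitive fix, but one thing that often helps is making sure the Rust-backed fast tokenizer is used and passing documents in batches rather than one at a time. A minimal sketch under those assumptions (the checkpoint name, batch size, and max length below are placeholders to adapt to your setup):

```python
from transformers import LongformerTokenizerFast

# Fast (Rust-backed) tokenizer; noticeably quicker than the pure-Python one.
tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

def tokenize_in_batches(documents, batch_size=256):
    """Tokenize a list of documents in batches instead of one call per document."""
    encodings = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        enc = tokenizer(
            batch,                      # list of strings is tokenized as a batch
            padding="max_length",
            truncation=True,
            max_length=4096,            # Longformer's usual maximum; adjust as needed
            return_tensors="pt",
        )
        encodings.append(enc)
    return encodings
```

Batching lets the fast tokenizer parallelize internally, which is usually where the wall-clock savings come from.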

Which environment are you using, GPU or CPU?
I have tried transformers as well. Maybe you need some data structure to compress the input data; I am looking for such a data structure myself (e.g., a dictionary), so any update from you would be appreciated.

I am currently using a p3.2xlarge instance on AWS, which has a Tesla V100 GPU. How does compressing the data improve the speed of tokenization? Could you please explain?