I've got an issue where, fairly often, somewhere in the chain my disk goes to 100% read utilization (~500 MB/s for 10-20 minutes) and then the service crashes. I've put loggers everywhere to narrow it down, and the last logger to fire is usually the one right after loading the summarization model (code here, model used). It could be anything, since that's the most common model used on the service; all I know is that it's transformers, because it's always that file/module that triggers it.
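For reference, the loading step looks roughly like this. This is just a minimal sketch (the model name is a placeholder, since the forum won't let me post the real links), not my actual code:

```python
from transformers import pipeline

# Hypothetical reproduction of the loading step that the last logger sits after.
# The real code/model are behind the links I couldn't include.
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",  # placeholder model name
    device=0,                         # run on GPU 0
)

print(summarizer("Some long text to summarize...", max_length=60, min_length=20))
```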
My models are all persisted, so it's not re-downloading. I'm developing with Docker on Windows (WSL2 with nvidia-docker, dev channel). I know that sounds like the smoking gun, but it happens on my actual Ubuntu server too. Versions: pytorch=1.6.0, cuda=10.1, cudnn=7, transformers=3.3.1, python=3.8 (github/lefnire/dockerfiles; the forum won't let me post more than 2 links). I saw github/huggingface/transformers/issues/5001, which made me wonder whether it's a bad pytorch <-> cuda <-> transformers version match (though that ticket is quite old, CUDA 9.2, etc.). Is there a recommended/common version combo of PyTorch, CUDA, and Python for transformers?
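In case it helps, this is the quick check I run inside the container to confirm the exact combo (standard version attributes, nothing exotic):

```python
import sys
import torch
import transformers

# Dump the environment so the version combo is unambiguous
print("python       :", sys.version.split()[0])
print("torch        :", torch.__version__)
print("cuda (torch) :", torch.version.cuda)
print("cudnn        :", torch.backends.cudnn.version())
print("transformers :", transformers.__version__)
print("gpu available:", torch.cuda.is_available())
```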
Could it be something with the .lock files?
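If it is the .lock files, this is roughly how I'd inspect what's sitting in the cache. It assumes the default transformers 3.x cache path; adjust if TRANSFORMERS_CACHE points somewhere else in the container:

```python
import glob
import os

# Assumed default cache location for transformers 3.x;
# override with the value of TRANSFORMERS_CACHE if it's set.
cache_dir = os.path.expanduser("~/.cache/torch/transformers")

# List any leftover lock files and their sizes
for lock in glob.glob(os.path.join(cache_dir, "*.lock")):
    print(lock, os.path.getsize(lock))
```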