With a single GPU, I get reasonable outputs. Output single GPU: "hi there, I'm a newbie to this forum and I'm looking for some help"
As soon as I use multiple GPUs, I get: Output multi GPU: "hi there header driv EUannotuta voor measurements shooting variableslowea grayĹbestÄŻbinding"
There seems to be a deeper problem: it appears to involve some interaction between the hardware, the drivers, and the latest version of transformers/tokenizers. We have contacted NVIDIA about this.
Since this is only indirectly related to transformers, the issue can be closed.
For me, it turned out to be an NCCL issue. We had to disable ACS on the HPC system I was working on, and that resolved the problem (see: Troubleshooting - NCCL 2.19.3 documentation).
ACS interfered with the communication between the GPUs.
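In case it helps others check their own machine: the NCCL troubleshooting page suggests looking at the `ACSCtl` lines in `lspci -vvv` output, where `SrcValid+` indicates that ACS may be enabled. Below is a minimal sketch of that check in Python. The function names `acs_enabled_lines` and `read_lspci` are my own, not part of any library, and `lspci` usually needs root to show the full capability listing, so treat an empty result with caution.

```python
import subprocess


def acs_enabled_lines(lspci_output):
    """Return the ACSCtl lines on which source validation is on (SrcValid+)."""
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if "ACSCtl" in line and "SrcValid+" in line
    ]


def read_lspci():
    """Run `lspci -vvv` and return its stdout, or "" if lspci is unavailable."""
    try:
        result = subprocess.run(
            ["lspci", "-vvv"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return ""
    return result.stdout


if __name__ == "__main__":
    hits = acs_enabled_lines(read_lspci())
    if hits:
        print(f"ACS may be enabled on {len(hits)} PCI bridge(s):")
        for line in hits:
            print("  " + line)
    else:
        print("No ACSCtl line with SrcValid+ found (or not enough privileges).")
```

This only reports the current state. Actually disabling ACS is done in the BIOS or by an administrator, as described in the NCCL documentation.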
@Dragon777: Is the general setup different between the two cases? If the eight GPUs are spread across different nodes of your HPC and the 4 GPUs in the first case are not, I could imagine that something is going wrong with the inter-node communication. I think the NCCL performance test is a good tool for diagnosing the problem: GitHub - NVIDIA/nccl-tests: NCCL Tests
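As a quicker first check before building nccl-tests, one could verify that plain GPU-to-GPU copies inside one node arrive intact. The sketch below is not part of nccl-tests and does not exercise NCCL itself. It assumes PyTorch is installed, copies a random tensor between every pair of visible GPUs, and compares the result on the CPU. The helper names `gpu_pairs` and `check_peer_copies` are made up for this example.

```python
import importlib.util


def gpu_pairs(n):
    """All ordered (src, dst) pairs of distinct device indices below n."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


def check_peer_copies():
    """Copy a tensor between each GPU pair; map (src, dst) to True if intact."""
    import torch  # imported lazily so gpu_pairs works without PyTorch

    results = {}
    for src, dst in gpu_pairs(torch.cuda.device_count()):
        x = torch.randn(1024, 1024, device=f"cuda:{src}")
        y = x.to(f"cuda:{dst}")
        # Compare on the CPU so the comparison does not rely on GPU-to-GPU traffic.
        results[(src, dst)] = torch.equal(x.cpu(), y.cpu())
    return results


if __name__ == "__main__":
    if importlib.util.find_spec("torch") is None:
        print("PyTorch is not installed; skipping the check.")
    else:
        results = check_peer_copies()
        if not results:
            print("Fewer than two GPUs visible; nothing to check.")
        for (src, dst), ok in results.items():
            print(f"cuda:{src} -> cuda:{dst}: {'OK' if ok else 'CORRUPTED'}")
```

If any pair reports a corrupted copy, the problem is likely below the transformers level. A clean result does not rule out NCCL or inter-node problems, so nccl-tests remains the more thorough diagnostic.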
Later on we switched to the instruct version of Mistral (mistral-7b-instruct-v0.2), and then these settings had to be removed, but perhaps playing with these options will help!
I am facing this on a single node with two Nvidia A40s. It works on a single A100, and, strangely, using two A100s produces an out-of-memory error. On the two A40s, I get output like this: "missionaries ŕšŕ¸ŕ¸Łŕ¸˛ŕ¸° вŃŃĐ°ŃиconditionallyŕĽŕ¤ trĂŹnhéĄă stĹĂฤศภst".
The pattern of huge models producing unintelligible output in a multi-GPU environment seems similar to the symptoms of this post.
If it also appears in the official HF service, it may not be a bug in the HF library itself, but it may still be an unresolved issue.