I Don't Understand the Model Parallelism Approach in the Llama Code

I am a developer just starting out with AI, so please bear with me if my question seems a bit naive.

When I look at the code in the meta-llama repository on GitHub, I can see that multiple processes each read one of the split .pth checkpoint files. In the chat_completion() and generate() functions, it then looks as if each process runs generation independently on its own part of the model and produces its own output.
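
For concreteness, my reading of the checkpoint-loading step boils down to the sketch below. This is a simplified paraphrase, not the exact repository code, and the checkpoint directory is just a placeholder.

```python
# Simplified sketch of the loading logic as I understand it from Llama.build()
# in llama/generation.py; "llama-2-7b" is a placeholder path.
import os
from pathlib import Path

import torch
from fairscale.nn.model_parallel.initialize import (
    get_model_parallel_rank,
    initialize_model_parallel,
)

torch.distributed.init_process_group("nccl")
initialize_model_parallel(int(os.environ["WORLD_SIZE"]))

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Every process sees the same list of shards, but each one loads only the
# shard that matches its model-parallel rank.
checkpoints = sorted(Path("llama-2-7b").glob("*.pth"))
ckpt_path = checkpoints[get_model_parallel_rank()]
checkpoint = torch.load(ckpt_path, map_location="cpu")
```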

From that code, it feels like each process simply takes its share of the billions of parameters and runs generate() on its own. I can't find any place where the final outputs of the different processes are compared or combined, and I don't see any explicit pipeline- or tensor-parallel communication in the generation loop either, so I'm curious why it was written this way.
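
The closest thing I can find is that the layers in model.py are built from fairscale's model-parallel linear modules. My simplified paraphrase of the feed-forward block is below (the real block also has a third projection, w3, which I've dropped for brevity). Is this where the combining across processes is supposed to happen behind the scenes?

```python
# My simplified paraphrase of the FeedForward block in llama/model.py.
# This assumes fairscale's initialize_model_parallel() has already been called.
import torch
import torch.nn as nn
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import ColumnParallelLinear, RowParallelLinear


class FeedForwardSketch(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # Each process holds only a column slice of w1; gather_output=False
        # keeps the activations sharded across processes.
        self.w1 = ColumnParallelLinear(dim, hidden_dim, bias=False, gather_output=False)
        # Each process holds a row slice of w2; input_is_parallel=True tells it
        # that the incoming activations are already sharded.
        self.w2 = RowParallelLinear(hidden_dim, dim, bias=False, input_is_parallel=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)))
```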

Additionally, when running the llama code across multiple processes, is there a way to have only the process with rank 0 hold (or print) the final result?
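
To make that concrete, this is the kind of thing I have in mind, adapted from example_chat_completion.py. The checkpoint and tokenizer paths are placeholders, and whether this is the intended pattern is exactly what I'm asking.

```python
# Adapted from example_chat_completion.py; paths are placeholders.
import torch.distributed as dist
from llama import Llama

generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",
    tokenizer_path="tokenizer.model",
    max_seq_len=512,
    max_batch_size=4,
)

dialogs = [[{"role": "user", "content": "What is the capital of France?"}]]
results = generator.chat_completion(dialogs, temperature=0.6, top_p=0.9)

# Every process computes `results`, but only rank 0 keeps and prints them.
if dist.get_rank() == 0:
    for result in results:
        print(result["generation"]["content"])
```

I would launch this with torchrun as in the README; my question is whether guarding on dist.get_rank() == 0 like this is the right way to keep the output on a single process.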