I Don't Understand the Model Parallelism Approach in the Llama Code

I am a developer just starting out with AI, so please bear with me if my question seems a bit naive.

When I look at the code in the meta-llama repository on GitHub, I can see that multiple processes each read one of the split .pth checkpoint files. In the chat_completion() and generate() functions, it then looks as if each process runs generation independently on its own part of the model and produces its own output.
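
For concreteness, my reading of the checkpoint-loading step boils down to the sketch below. This is a simplified paraphrase, not the exact repository code, and the checkpoint directory is just a placeholder.

```python
# Simplified sketch of the loading logic as I understand it from Llama.build()
# in llama/generation.py; "llama-2-7b" is a placeholder path.
import os
from pathlib import Path

import torch
from fairscale.nn.model_parallel.initialize import (
    get_model_parallel_rank,
    initialize_model_parallel,
)

torch.distributed.init_process_group("nccl")
initialize_model_parallel(int(os.environ["WORLD_SIZE"]))

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Every process sees the same list of shards, but each one loads only the
# shard that matches its model-parallel rank.
checkpoints = sorted(Path("llama-2-7b").glob("*.pth"))
ckpt_path = checkpoints[get_model_parallel_rank()]
checkpoint = torch.load(ckpt_path, map_location="cpu")
```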

From that code, it feels like each process simply takes its share of the billions of parameters and runs generate() on its own. I can't find any place where the final outputs of the different processes are compared or combined, and I don't see any explicit pipeline- or tensor-parallel communication in the generation loop either, so I'm curious why it was written this way.
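
The closest thing I can find is that the layers in model.py are built from fairscale's model-parallel linear modules. My simplified paraphrase of the feed-forward block is below (the real block also has a third projection, w3, which I've dropped for brevity). Is this where the combining across processes is supposed to happen behind the scenes?

```python
# My simplified paraphrase of the FeedForward block in llama/model.py.
# This assumes fairscale's initialize_model_parallel() has already been called.
import torch
import torch.nn as nn
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import ColumnParallelLinear, RowParallelLinear


class FeedForwardSketch(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # Each process holds only a column slice of w1; gather_output=False
        # keeps the activations sharded across processes.
        self.w1 = ColumnParallelLinear(dim, hidden_dim, bias=False, gather_output=False)
        # Each process holds a row slice of w2; input_is_parallel=True tells it
        # that the incoming activations are already sharded.
        self.w2 = RowParallelLinear(hidden_dim, dim, bias=False, input_is_parallel=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)))
```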

Additionally, when running the llama code across multiple processes, is there a way to have only the process with rank 0 hold (or print) the final result?
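
To make that concrete, this is the kind of thing I have in mind, adapted from example_chat_completion.py. The checkpoint and tokenizer paths are placeholders, and whether this is the intended pattern is exactly what I'm asking.

```python
# Adapted from example_chat_completion.py; paths are placeholders.
import torch.distributed as dist
from llama import Llama

generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",
    tokenizer_path="tokenizer.model",
    max_seq_len=512,
    max_batch_size=4,
)

dialogs = [[{"role": "user", "content": "What is the capital of France?"}]]
results = generator.chat_completion(dialogs, temperature=0.6, top_p=0.9)

# Every process computes `results`, but only rank 0 keeps and prints them.
if dist.get_rank() == 0:
    for result in results:
        print(result["generation"]["content"])
```

I would launch this with torchrun as in the README; my question is whether guarding on dist.get_rank() == 0 like this is the right way to keep the output on a single process.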