Distributed configuration for Llama 3.1 70B FP8 on a 4x H100 GPU server #7980
wakusoftware started this conversation in General
Replies: 1 comment
-
I am facing a very similar problem with an 8-GPU node and smaller 8B models. I second the OP: what is the best way to optimize inference throughput in a single-node, multi-GPU scenario when the model fits on a single GPU?
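Not a definitive answer, just a sketch of the replica approach that is often suggested when the model fits on one GPU: run one independent vLLM server per GPU and spread requests across them. The model name, ports, and memory fraction below are illustrative assumptions, not values from this thread.

```bash
# Sketch: one independent vLLM server per GPU (manual data parallelism).
# Assumes an 8-GPU node and a model that fits on a single GPU; the model
# name, ports, and --gpu-memory-utilization value are placeholders.
MODEL=meta-llama/Llama-3.1-8B-Instruct

for GPU in 0 1 2 3 4 5 6 7; do
  CUDA_VISIBLE_DEVICES=$GPU vllm serve "$MODEL" \
    --port $((8000 + GPU)) \
    --gpu-memory-utilization 0.90 \
    > "vllm_gpu${GPU}.log" 2>&1 &
done

# Requests must then be spread across ports 8000-8007 by an external
# load balancer (nginx, haproxy) or by the client.
wait
```

The trade-off versus a single tensor-parallel server: independent replicas avoid inter-GPU communication and tend to maximize aggregate throughput, but each request is limited to one GPU's compute and KV-cache memory.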
-
Hi. We are getting a server with 4 H100s and want to serve Llama 3.1 70B FP8, which means the model (theoretically) fits on a single GPU. Of course we want to make the best use of our resources: what would be the best configuration for `vllm serve`?
Would `--tensor-parallel-size 4` load the model 4 times?
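For what it's worth (hedging on the exact behavior of your vLLM version): `--tensor-parallel-size 4` does not load four copies of the model. It shards each layer's weights across the 4 GPUs, so the FP8 weights are split roughly evenly and the remaining HBM on every GPU goes to KV cache. A minimal sketch, assuming the base Llama 3.1 70B Instruct checkpoint with vLLM's on-the-fly FP8 quantization (a pre-quantized FP8 checkpoint can be passed instead):

```bash
# Option A: a single server with the weights sharded across all 4 H100s.
# --tensor-parallel-size 4 splits the weight matrices across the GPUs;
# it does not replicate the model four times.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

# Option B (only if the FP8 weights plus enough KV cache truly fit on one
# 80 GB H100): four independent single-GPU replicas behind your own load
# balancer, e.g.
#   CUDA_VISIBLE_DEVICES=0 vllm serve <fp8-model> --port 8000 &
#   CUDA_VISIBLE_DEVICES=1 vllm serve <fp8-model> --port 8001 &
#   ... and likewise for GPUs 2 and 3.
```

Roughly: tensor parallelism pools the memory of all four GPUs (more KV-cache headroom, longer contexts, lower per-request latency, at the cost of NVLink communication), while independent replicas can win on aggregate throughput when the model fits comfortably on one GPU but would leave very little KV-cache room for ~70 GB of FP8 weights on an 80 GB card.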