AMD GPU Performance for LLM Inference: A Deep Dive

by Eero Laaksonen | on October 31, 2024

TL;DR:

AMD's MI300X GPU outperforms Nvidia's H100 in our LLM inference benchmarks thanks to its larger memory (192 GB vs. 80/94 GB) and higher memory bandwidth (5.3 TB/s vs. 3.3–3.9 TB/s), making it a better fit for serving large models on a single GPU. In our tests, the MI300X nearly doubles request throughput and significantly reduces latency, which makes it a promising contender in the AI hardware space.


The introduction of AMD's MI300X GPU has stirred up significant interest in the AI community, particularly when compared to Nvidia's H100. Two key factors set the MI300X apart:

  1. Memory size (192 GB on a single MI300X vs. 80 or 94 GB on a single H100)
  2. Memory bandwidth (5.3 TB/s on a single MI300X vs. 3.3–3.9 TB/s on a single H100)

In this post, we report on our benchmarks comparing the MI300X and H100 for large language model (LLM) inference.

Why Single-GPU Performance Matters

Models like Mistral’s Mixtral and Llama 3 are pushing the boundaries of what's possible on a single GPU with limited memory.

  • Llama3-70B-Instruct (fp16): 141 GB + change (fits in 1 MI300X, would require 2 H100s)
  • Mixtral-8x7B-Instruct (fp16): 93 GB + change (fits in 1 MI300X, would require 2 H100s)
  • Mixtral-8x22B-Instruct (fp16): 281 GB + change (fits in 2 MI300X, would require 4 H100s)

Dealing with multiple GPUs can be painful and complex, not to mention incurring higher infrastructure costs. A single GPU that can handle these models is a game-changer.

We’re naturally aware that quantized models can help you squeeze a larger model into a tighter space, but we’ll be talking about non-quantized, 16-bit floating point (fp16) models in this post. (For local inference, e.g., on your Apple Silicon laptop, we heartily recommend quantized models.)
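
To sanity-check the sizes above: an fp16 model needs roughly two bytes per parameter for its weights alone, before counting the KV cache and activations. A minimal back-of-the-envelope sketch (parameter counts are the published totals for each model):

```python
GB = 1e9

# Published total parameter counts; fp16 stores each parameter in 2 bytes.
models = {
    "Llama3-70B-Instruct": 70.6e9,
    "Mixtral-8x7B-Instruct": 46.7e9,
    "Mixtral-8x22B-Instruct": 141e9,
}

for name, params in models.items():
    weights_gb = params * 2 / GB  # weights only; KV cache etc. come on top
    print(
        f"{name}: ~{weights_gb:.0f} GB of fp16 weights "
        f"(one 192 GB MI300X: {'fits' if weights_gb < 192 else 'does not fit'}, "
        f"one 80 GB H100: {'fits' if weights_gb < 80 else 'does not fit'})"
    )
```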

Let’s Talk about Batching

Batching is a crucial concept in LLM inference that can significantly impact performance. It refers to the process of grouping multiple input sequences (i.e., user input) together and processing them simultaneously instead of one input at a time. Just like with parallel processing in general, batching is important to increase the total throughput of the system and to ensure the hardware is utilized to its full potential.

While batching can greatly improve performance, it presents certain challenges. As alluded to before, with batching, GPU memory usage is increased since all of the parallel inputs need to fit into memory. (This is, as you may imagine, less of a problem with an MI300X, given its relatively larger memory size.) Latency may be increased for an individual request, as the LLM inference machinery may need to wait to be able to schedule the request into a new batch. In a multi-GPU setup, all of this is even trickier. (Remember what we said above about having a single MI300X being nice?)
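
To make this concrete, here is a minimal sketch of offline batched inference with vLLM's Python API (the prompts and sampling settings are illustrative); the engine takes care of scheduling the prompts into batches on the GPU:

```python
from vllm import LLM, SamplingParams

# A batch of user inputs submitted together rather than one at a time.
prompts = [
    "Explain memory bandwidth in one sentence.",
    "What is a Mixture of Experts model?",
    "Why does batching improve GPU utilization?",
]
sampling = SamplingParams(temperature=0.8, max_tokens=128)

# The same model we benchmark later in this post.
llm = LLM(model="Qwen/Qwen1.5-MoE-A2.7B-Chat")

# vLLM batches these requests internally (continuous batching) and returns
# one result per prompt.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```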

LLM Inference Software Stacks

There are two commonly used distribution formats (GGUF and HF Safetensors) and a multitude of inference stacks (libraries and software) available for running LLMs. There is no gold standard for performance today, and each combination of format and inference library can yield vastly different results.

  1. vLLM: Popular for production-grade LLM serving. Builds on top of Transformers with custom optimizations. AMD maintains its own fork (rocm/vllm).
  2. Hugging Face Transformers: Widely used in the AI community; a Torch-based generic framework for running and training transformer-architecture models.
  3. llama.cpp: A ground-up C++ implementation of various LLM architectures. Known for its efficiency, especially on consumer hardware. This project powers the popular Ollama product.
  4. JAX: Google's numerical computing library, gaining traction in the AI space. Comes with big promises of being completely hardware agnostic.

… and others such as ONNX and TensorRT – and given the velocity we’re seeing in the LLM field today, chances are this list is already outdated by the time we publish this post. Beyond new entrants, all of these stacks are constantly updated, and we expect to see non-trivial performance boosts from version to version. The picture is further complicated by the fact that drivers and firmware for the accelerators themselves see frequent updates.

Each stack has its strengths and optimizations, which impact performance on different infrastructure. The choice of inference stack often depends on factors such as:

  • The specific hardware being used (e.g., NVIDIA GPUs, AMD GPUs, Intel CPUs, Apple Silicon)
  • The model architecture and size
  • Performance requirements (latency, throughput)
  • Deployment environment – serving clients in the cloud or on the edge; serving a single user on their local device, or even on a mobile device
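
For comparison with the vLLM example above, the Hugging Face Transformers route for the same model looks roughly like this (a minimal single-prompt sketch, not a tuned serving setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # non-quantized fp16 weights, as discussed above
    device_map="auto",          # place the model on the available GPU(s)
)

inputs = tokenizer(
    "Why does memory bandwidth matter for LLM inference?", return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```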

The Challenges of Benchmarking

While we're working on benchmarks for fine-tuning and training, this post focuses primarily on inference performance. This is partly because there are some software ecosystem challenges for efficient fine-tuning on varied hardware (and benchmarking thereof), and partly because modern LLMs can be very good at zero-shot/RAG inference even without explicit fine-tuning.

When benchmarking LLM inference performance, this diversity in inference solutions contributes to the complexity of making fair comparisons between different setups.

Even within a single stack, we’ve observed that scarcely documented environment variables can have a significant impact on inference performance, and it’s difficult to know whether a particular configuration is as efficient as it could be.
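
As an example, the ROCm-specific switches we ended up using on the MI300X (listed in the appendix) have to be in place before vLLM is loaded; a minimal sketch, assuming the offline Python API:

```python
import os

# Environment variables from our MI300X runs (see the appendix). Their effect
# can change between vLLM, ROCm, and driver versions, so treat them as a
# starting point rather than a known-good configuration.
os.environ["HIP_FORCE_DEV_KERNARG"] = "1"
os.environ["VLLM_USE_TRITON_FLASH_ATTN"] = "0"

# Import vLLM only after the environment is set so the flags take effect.
from vllm import LLM

llm = LLM(model="Qwen/Qwen1.5-MoE-A2.7B-Chat")
```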

The particular test scenario also makes a difference: a single-user setup, such as a local user prompting llama.cpp, is a rather different thing from a vLLM server processing a batch of 100 queries simultaneously.

These variables make it challenging to perform truly apples-to-apples comparisons between different setups.

Benchmark Results

All of the above said, we’ve run some numbers. The conclusion is that the MI300X holds up to or beats the H100 and is a great piece of hardware for LLM inference. This time we focused only on single-GPU benchmarks, i.e., one H100 versus one MI300X. This keeps our benchmark models a little smaller, as they also need to fit in a single H100’s measly 80 GB of memory. In our opinion, given that most real-world use cases would need a second H100 just to load a model, this makes the results even more interesting.

The numbers below suggest that the MI300X is very efficient at serving medium-sized Mixture-of-Experts (MoE) LLMs such as Qwen1.5-MoE.

On this benchmark, the MI300X consistently outperforms the H100 across all measured metrics. The MI300X achieves nearly double (1.97×) the request throughput at lower latency, serving requests from 1,000 simulated clients in 64 seconds, while the H100 takes roughly twice as long. In this configuration, the MI300X is also about 2.7 times faster on the time-to-first-token metric (i.e., how long it takes for a client to receive any output from the LLM).
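
The ratios quoted above come straight from the throughput table in the appendix; a quick calculation for reference (small differences from the figures in the prose are due to rounding in the table values):

```python
# Figures copied from the throughput test table in the appendix.
mi300x = {"duration_s": 64.07, "req_per_s": 15.61, "mean_ttft_ms": 8422.88}
h100 = {"duration_s": 126.71, "req_per_s": 7.89, "mean_ttft_ms": 22586.57}

print(f"Request throughput (MI300X/H100): {mi300x['req_per_s'] / h100['req_per_s']:.2f}x")
print(f"Benchmark duration (H100/MI300X): {h100['duration_s'] / mi300x['duration_s']:.2f}x")
print(f"Mean TTFT (H100/MI300X):          {h100['mean_ttft_ms'] / mi300x['mean_ttft_ms']:.2f}x")
```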

Conclusion

It’s great to see AMD leveling the playing field with NVIDIA on the GPU front. The rate of progress we have seen from AMD on gaming performance is starting to translate to ML workloads, which should result in better and more reasonably priced hardware for everyone in the long run. AMD has already confirmed the release of the MI325X and is working on future advancements all the way to the MI400X. We will be keeping a very close eye on AMD and expanding our benchmarks to training and to new hardware as it becomes available to us.


Appendix: Benchmark Details

We used the Qwen1.5-MoE-A2.7B-Chat model on an MI300X machine hosted by RunPod.

Latency Test (vLLM benchmark_latency)

| Latency (s) | p10 | p25 | p50 | p75 | p90 | p99 | average |
|---|---|---|---|---|---|---|---|
| H100 | 3.446 | 3.472 | 3.486 | 3.499 | 3.513 | 3.519 | 3.482 |
| MI300X | 2.110 | 2.131 | 2.145 | 2.161 | 2.166 | 2.178 | 2.413 |
python benchmark_latency.py --model Qwen/Qwen1.5-MoE-A2.7B-Chat

Throughput Test (vLLM benchmark_serving)

| Metric | MI300X | H100 |
|---|---|---|
| Successful requests | 1000 | 1000 |
| Benchmark duration (s) | 64.07 | 126.71 |
| Total input tokens | 217,393 | 217,393 |
| Total generated tokens | 185,616 | 185,142 |
| Request throughput (req/s) | 15.61 | 7.89 |
| Output token throughput (tok/s) | 2,896.94 | 1,461.09 |
| Total token throughput (tok/s) | 6,289.83 | 3,176.70 |
| Mean TTFT (ms) | 8,422.88 | 22,586.57 |
| Median TTFT (ms) | 6,116.67 | 16,504.55 |
| P99 TTFT (ms) | 23,657.62 | 63,382.86 |
| Mean TPOT (ms) | 80.35 | 160.50 |
| Median TPOT (ms) | 72.41 | 146.94 |
| P99 TPOT (ms) | 232.86 | 496.63 |
| Mean ITL (ms) | 66.83 | 134.89 |
| Median ITL (ms) | 45.95 | 90.53 |
| P99 ITL (ms) | 341.85 | 450.19 |

TTFT = time to first token; TPOT = time per output token; ITL = inter-token latency.
vllm serve Qwen/Qwen1.5-MoE-A2.7B-Chat --swap-space 16 --disable-log-requests
python benchmark_serving.py  --model Qwen/Qwen1.5-MoE-A2.7B-Chat --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json  --request-rate 50

Technical details

On the MI300X machine:

  • ROCM 6.1.0.60100-82
  • torch==2.4.1+rocm6.1
  • vllm==0.6.1.post2+rocm614 (built from the main branch of rocm/vllm on 2024-10-24)
  • environment variables: HIP_FORCE_DEV_KERNARG=1 VLLM_USE_TRITON_FLASH_ATTN=0

On the H100 machine:

  • vllm==0.6.1.post2 + vllm-flash-attn==2.6.1 installed from PyPI