AWS HPC Blog
Application deep-dive into the AWS Graviton3E-based Amazon EC2 Hpc7g instance
This post was created by Neil Ashton, Karthik Raman, Stephen Sachs, and Heidi Poxon from HPC Engineering; Rami Malladi and Jun Tang from Annapurna Labs; and Dnyanesh Digraskar from the HPC Partner organization.
Last week we announced the Amazon EC2 Hpc7g instance type, which joins the family of HPC-specific instances in Amazon Elastic Compute Cloud (Amazon EC2) that started with Hpc6a in January 2022. Hpc7g offers up to 70% better performance and almost 3x better price-performance compared to the previous generation of AWS Graviton-based instances for compute-intensive workloads. These instances also consume up to 60% less energy for the same work than comparable Amazon EC2 instances.
In this post we’ll discuss details of this new instance type powered by the AWS Graviton3E, and we’ll show you performance results from our work running some real workloads from computational fluid dynamics (CFD), finite-element analysis (FEA), molecular dynamics, and numerical weather prediction.
We’re thankful for the work Siemens is doing with our teams, which has already resulted in the latest version of Siemens Simcenter STAR-CCM+ being ported to AWS Graviton with some really pleasing results that carry over to Hpc7g, too.
We’re also excited to show – for the first time – results from a beta release of Ansys Fluent, which has very recently been ported to work on Graviton. We’ve worked closely with the team there to port other Ansys applications, including Ansys LS-DYNA, which we also show results for in this post.
And we’re including a customer use-case here from our friends at Flying Whales to highlight why we build these things: we want the scientists and engineers who run their HPC workloads on AWS to be the most productive people in their field because they have access to amazing tools, any time they need them.
The path to Hpc7g
Since the launch of AWS Graviton2 in instances like C6g, we’ve been hearing from customers doing serious work in CFD, molecular dynamics, and numerical weather prediction. We blogged about how those instances offered up to 37% better price-performance than the traditional alternatives at the time. But we knew we could do even better, and we worked to improve our Graviton offerings in two dimensions.
First, we designed and built the third-generation Graviton3 processor, with several new capabilities that make it great for HPC workloads. Instances based on it were the first in the Amazon EC2 fleet to offer DDR5 memory, and the processor packs a communication mesh connecting all 64 cores on the die.
Second, we embarked on developing an HPC-specific family of instances tailored to the needs of HPC customers. That meant higher network bandwidth between instances, deep capacity pools in placement groups close to the other services HPC customers use (to limit the effects of latency), and a lower price point. This makes them a truly competitive alternative to highly optimized on-premises systems.
The intersection of these two engineering programs is the Hpc7g instance, based on the AWS Graviton3E processor. Hpc7g instances come in three configurations (see Table 1) with up to 64 physical cores, 128 GiB of memory, 200 Gbps of network performance through the Elastic Fabric Adapter (EFA), and 20 Gbps of Amazon Elastic Block Store (EBS) performance. These instances are possible because of the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor that offloads many traditional virtualization functions to dedicated hardware, resulting in performance that’s virtually indistinguishable from bare metal.
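If you want to check these per-size specs yourself, here’s a minimal sketch that queries the EC2 DescribeInstanceTypes API with boto3. It assumes you have AWS credentials and a Region where Hpc7g is available configured, and the two smaller size names are our assumption rather than something stated above.

```python
import boto3

# Query the published specs for the Hpc7g sizes. Assumes credentials and a
# Region where Hpc7g is available are configured; the two smaller size names
# below are an assumption for illustration.
ec2 = boto3.client("ec2")

resp = ec2.describe_instance_types(
    InstanceTypes=["hpc7g.4xlarge", "hpc7g.8xlarge", "hpc7g.16xlarge"]
)

for it in sorted(resp["InstanceTypes"], key=lambda t: t["VCpuInfo"]["DefaultVCpus"]):
    print(
        it["InstanceType"],
        it["VCpuInfo"]["DefaultVCpus"], "cores,",
        it["MemoryInfo"]["SizeInMiB"] // 1024, "GiB memory,",
        "EFA supported:", it["NetworkInfo"]["EfaSupported"],
    )
```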
And like we said a minute ago, Hpc7g instances use up to 60% less energy for the same work than other comparable Amazon EC2 instances.
Important specs
Hpc7 instance sizes are different from normal Amazon EC2 instance sizes
All 7th generation HPC instances will be available in different sizes. Usually in Amazon EC2, a smaller size reflects a smaller slice of the underlying hardware. However, for the HPC instances starting with Hpc7g, each size in the instance family will have the same engineering specs and price, and will differ only by the number of cores offered.
You have always been able to manually disable cores or use process pinning (affinity) to carefully place threads around the CPU. But doing this optimally needs in-depth knowledge of the chip architecture, like the number of NUMA domains and the memory layout. It also means you have to know MPI well, and have clear sight of what your job submission scripts do when the scheduler receives them.
By offering instance sizes that already have the right cores turned off, you’ll be able to maximize the performance of your code with less work. This will be a boost for customers who need to achieve the absolute best performance per core for their workloads, which often involve commercially-licensed ISV applications. In this case, customers are driven by pricing concerns to get the best possible results from each per-core license they buy. You’ll find a detailed explanation in our other post on this channel in the next few days, along with performance comparisons that will help you understand our methodology.
The cost stays the same across all Hpc7g instance sizes. That’s because you still get access to the entire node’s memory and network, with selected cores turned off so that the available memory bandwidth is dedicated to the remaining cores. We encourage you to benchmark your own codes and find the right balance for your specific needs.
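For comparison, here’s a minimal sketch of the manual alternative these dedicated sizes save you from: underpopulating an instance with Open MPI and pinning each rank to its own core. The solver binary and hostfile are placeholders, and the flags shown are Open MPI 4.x syntax rather than the exact launch lines we used.

```python
import shlex
import subprocess

def launch_underpopulated(app, nodes, ranks_per_node, hostfile="hosts"):
    """Run fewer MPI ranks than cores per node, pinned one rank per core,
    so the idle cores' share of memory bandwidth goes to the active ranks.
    Open MPI 4.x flags; 'app' and the hostfile are placeholders."""
    cmd = (
        f"mpirun -np {nodes * ranks_per_node} --hostfile {hostfile} "
        f"--map-by ppr:{ranks_per_node}:node "  # spread ranks evenly per node
        f"--bind-to core --report-bindings "    # pin each rank, print the layout
        f"{app}"
    )
    subprocess.run(shlex.split(cmd), check=True)

# e.g. use 32 of the 64 cores on each of 4 instances (placeholder solver):
# launch_underpopulated("./solver -parallel", nodes=4, ranks_per_node=32)
```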
Performance
To illustrate why Hpc7g instances are so appealing, let’s look at six common HPC codes:
- Siemens Simcenter STAR-CCM+ for CFD
- Ansys Fluent for CFD
- Ansys LS-Dyna for FEA
- OpenFOAM for CFD (DrivAer and Flying Whales customer case)
- GROMACS for molecular dynamics
- WRF for numerical weather prediction
In Figures 1-14, you’ll see comparisons using c6gn.16xlarge (based on Graviton2) and hpc7g.16xlarge (based on Graviton3E).
The highlight is that for these codes, Hpc7g delivered around 70% better performance (and in a few cases even more) and almost 3x better price-performance, thanks to a combination of DDR5 memory offering up to 50% more memory bandwidth than DDR4, the Graviton3E processor itself, and 200 Gbps networking from EFA.
Let’s now dive a bit deeper into each of these results so you can see the specific performance and price-performance increases for a range of test-cases and codes.
Siemens Simcenter STAR-CCM+
The first code is Siemens Simcenter STAR-CCM+, which Siemens recently ported to support Arm-based processors like the Graviton3E that powers Hpc7g. To demonstrate the performance we took the AeroSUV 320M-cell automotive test case, a useful public case with similar characteristics to production automotive external-aerodynamics models. We ran this with Siemens Simcenter STAR-CCM+ 2302, using Open MPI 4.1.4. The graphs below show the iterations per minute and the cost in dollars for a complete simulation (1000 iterations) using Amazon EC2 On-Demand pricing (you’ll get a better deal from Reserved Instances or Savings Plans, but On-Demand is a simpler benchmark to use and works well for our purposes in this post).
As shown in Figures 1 and 2, the performance improvement from c6gn.16xlarge instances with Graviton2 to hpc7g.16xlarge instances with Graviton3E is between 71% (256 cores) and 74% (1280 cores), with between 2.83x (256 cores) and 2.91x (1280 cores) better price-performance.
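To make the price-performance arithmetic concrete, here’s a sketch of the cost-per-simulation calculation we’re describing; the throughput and hourly rate in the example are placeholders, not the measured values behind Figures 1 and 2.

```python
def cost_per_simulation(iterations, iters_per_minute, instances, hourly_rate):
    """Dollar cost of one complete run at On-Demand pricing.
    iters_per_minute is the measured throughput of the whole job
    (all instances working together)."""
    hours = iterations / iters_per_minute / 60.0
    return hours * instances * hourly_rate

# Placeholder numbers only, to show the shape of the calculation:
print(cost_per_simulation(iterations=1000, iters_per_minute=25.0,
                          instances=4, hourly_rate=1.68))
```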
Ansys Fluent (new to Graviton)
One of the most widely used CFD codes in the world is Ansys Fluent. We have worked closely with Ansys to help them port their code to Graviton instances. They recently released a public beta of this ported version, so we were keen to see the performance and price-performance improvements from Graviton2 to Graviton3E.
To do this we ran Ansys Fluent 2023 R1 on the F1 race car 140M-cell case, a standard steady-state case that gives indicative performance for any external-aerodynamics type of case. We used Open MPI v4.1.4 for all the runs, and we show the Solver Rating for performance as well as the cost per simulation (based on 1000 iterations).
Figures 3 and 4 show that the performance improvement is between 61% (512 cores) and 64% (256 cores), and the price-performance improvement is between 2.64x (512 cores) and 2.7x (256 cores).
Ansys LS-DYNA
Ansys LS-DYNA is explicit simulation software used for modeling vehicle crashes, occupant safety, drop tests, and impacts. We work closely with Ansys to optimize their LS-DYNA Arm binaries for Graviton.
To see the performance improvement on Hpc7g, we ran Ansys LS-DYNA R12 on the Offset Deformable Barrier (ODB) model with 10M elements. ODB is a standard LS-DYNA benchmarking case that simulates frontal impact on a sedan car. We used OpenMPI v4.0 for all the runs.
As you can see from Figures 5 and 6, the performance improvement is between 56% (128 cores) and 59% (320 cores), and the price-performance improvement is between 2.56x (128 cores) and 2.62x (320 cores).
OpenFOAM
Next, we tested OpenFOAM with the 128M-cell DrivAer fastback vehicle case (from the AutoCFD workshop), with the mesh generated using the ANSA preprocessing software from BETA-CAE Systems, and ran it in hybrid RANS-LES mode using the pimpleFoam solver. We used OpenFOAM v2112 compiled with GNU C++ Compiler v9.4 and Open MPI v4.1.4. To schedule these jobs (as with the rest of the codes discussed here) we used AWS ParallelCluster, following the same procedures as the AWS ParallelCluster CFD workshops but using C6gn and Hpc7g instances to run the tests.
We ran the case using 64 MPI ranks per instance (fully populated) and scaled from 2 instances (128 cores) to 16 instances (1024 cores) across both C6gn and Hpc7g. The graphs below show the iterations per minute and the cost in dollars for 10,000 time-steps using Amazon EC2 On-Demand pricing.
In Figure 7 below, you can see that for this test case, Hpc7g offers between 60% (256 cores) and 72% (1024 cores) better performance compared to the previous-generation Graviton2 instances.
Figure 8 shows between 2.7x (256 cores) and 2.98x (1024 cores) better price-performance. Much of this is thanks to the increased memory bandwidth going from Graviton2 to Graviton3E and the optimized HPC pricing.
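If you want to reproduce a run like this on your own cluster, here’s a rough sketch of how a fully-populated submission could look with Slurm on AWS ParallelCluster; the partition name, paths, and the assumption that the case is already decomposed into 64 subdomains per node are ours, not taken from the workshop scripts.

```python
import subprocess
import textwrap

def submit_pimplefoam(nodes, case_dir, partition="compute"):
    """Submit a fully-populated pimpleFoam run: 64 MPI ranks per instance,
    with the case already decomposed into nodes * 64 subdomains.
    Partition name and paths are placeholders."""
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name=drivaer
        #SBATCH --partition={partition}
        #SBATCH --nodes={nodes}
        #SBATCH --ntasks-per-node=64
        #SBATCH --exclusive
        cd {case_dir}
        mpirun -np $SLURM_NTASKS pimpleFoam -parallel > log.pimpleFoam 2>&1
        """)
    with open("run_drivaer.sbatch", "w") as fh:
        fh.write(script)
    subprocess.run(["sbatch", "run_drivaer.sbatch"], check=True)

# submit_pimplefoam(nodes=16, case_dir="/shared/drivaer-128M")
```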
Our customer’s use case: Flying Whales
Flying Whales is a French company developing a 60-ton payload cargo airship for the heavy lift and outsize cargo market. It’s a remarkable project, born from France’s ambition to provide efficient, environmentally friendly transportation for collecting wood in remote areas. You can read more about them and how they use AWS in this case study.
The Flying Whales team uses OpenFOAM to run complex CFD simulations of their vehicles – and they have a continual focus on driving more accuracy and efficiency from their CFD simulations. They were naturally very interested in testing the Hpc7g instance.
Their test case has 31M cells and uses the simpleFoam solver within OpenFOAM. For this case we used OpenFOAM v2212, compiled (as in our previous tests) with GNU C++ Compiler v9.4 and run with Open MPI v4.1.4.
They ran the test to 8,000 steady-state iterations. Figures 9 and 10 plot the results, and show that Hpc7g provided a boost in performance over c6gn of between 62% (64 cores) and 75% (640 cores) for the same core-count and between 2.67x (64 cores) and 2.84x (640 cores) lower cost per simulation.
The scaling efficiency also improved with Hpc7g, which we attribute to the newer generation EFA and an extra 100 Gbps of network bandwidth (200 Gbps vs 100 Gbps).
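For readers who want to compute the same scaling-efficiency numbers for their own runs, this is the simple calculation we mean, shown with placeholder throughput values rather than the measured ones.

```python
def scaling_efficiency(throughput, cores):
    """Strong-scaling efficiency relative to the smallest run:
    1.0 means perfectly linear scaling, lower means lost efficiency.
    throughput can be any 'higher is better' rate (e.g. iterations/minute)."""
    base_t, base_c = throughput[0], cores[0]
    return [(t / base_t) / (c / base_c) for t, c in zip(throughput, cores)]

# Placeholder values, just to show the shape of the calculation:
print(scaling_efficiency([10.0, 19.2, 36.0], [64, 128, 256]))
```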
GROMACS
Next, we looked at the standard benchRIB case, 2M atoms simulating a ribosome in water. We used GROMACS v2023, which we compiled using GNU C++ Compiler v9 and ran with Open MPI v4.1.4 on AWS ParallelCluster.
Like last time, we ran with 64 MPI ranks per node to fully populate each instance, and scaled from 1 instance to 16 instances for a total of 1024 cores. For these graphs, which show nanoseconds of simulated time per day (ns/day), higher is better.
We saw between 39% (1024 cores) and 55% (128 cores) better performance and between 2.28x (1024 cores) and 2.54x (128 cores) better price-performance than the previous generation Graviton2 instances.
WRF
Finally, we looked at CONUS 2.5km benchmark performance using WRF v4.2.2, compiled with GCC 11 and Open MPI 4.1.4. When running, we used 8 ranks per instance and 8 threads per rank, with affinity set so that all 64 cores on each instance were used.
We ran the benchmark scaling tests up to 32 instances (2,048 cores). We used total elapsed time to calculate the simulation speedup (number of computing steps that can be simulated each day) — higher is better.
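To make that metric concrete, here’s a small sketch of the calculation, along with a comment showing roughly how the 8-ranks-by-8-threads layout can be expressed with Open MPI; the wrf.exe path and the example numbers are placeholders, and this is a sketch of the setup rather than our exact launch line.

```python
def wrf_steps_per_day(total_steps, elapsed_seconds):
    """The speedup metric used above: how many computing steps the
    benchmark can advance per 24-hour wall-clock day (higher is better)."""
    return total_steps / elapsed_seconds * 86400.0

# Roughly how the 8-ranks x 8-threads layout can be expressed with Open MPI
# (8 ranks per instance, 8 cores reserved per rank, OMP_NUM_THREADS exported);
# the wrf.exe path is a placeholder:
#   mpirun -np <8 * instances> --map-by ppr:8:node:PE=8 --bind-to core \
#          -x OMP_NUM_THREADS=8 ./wrf.exe

# Placeholder numbers, just to show the shape of the calculation:
print(wrf_steps_per_day(total_steps=5400, elapsed_seconds=1800))
```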
We saw between 58% (2048 cores) and 62% (256 cores) better performance and between 2.73x (2048 cores) and 2.8x (256 cores) better price-performance than the previous-generation Graviton2 instances.
Conclusion
In this post we introduced the new Hpc7g instance type that joins the Amazon EC2 HPC instance family. We ran several popular HPC applications and showed that it offers up to 70% better performance and almost 3x better price-performance compared to the previous generation of Graviton-based instances.
Graviton3E-based instances also consume up to 60% less energy for the same work than comparable Amazon EC2 instances, which makes them a more sustainable approach to HPC.
We’ve shown what the core technology improvements deliver for some important workloads from CFD, FEA, molecular dynamics, and numerical weather prediction. And we’ve shown how this translates into real benefits for customers like Flying Whales and their engineering design simulations.
You can find a list of technical resources to help you move to Graviton for HPC here, and you can learn more by visiting the HPC page here.