Parameters to compare GPUs
Introduction
       Steps to Identify the Ideal GPU for a Use Case
      Categorization of GPU
   Performance Comparison of Current Inventory
   Some Popular GPU Benchmarks
     HPC/AI Performance Benchmarks
      Video Rendering Performance Benchmarks
      Analytical Processing Performance Benchmarks
Introduction
To determine the best GPU for a specific use case, several factors should be considered: the characteristics of the problem, the
target performance of the solution, the cost and power budget, and driver support for the GPU. The software ecosystem around the GPU
hardware should also be taken into account.
    Steps to Identify the Ideal GPU for a Use Case
   Identify the primary use case and its compute, memory and network requirements.
   Compare the key specifications of the candidate GPUs to establish a hierarchy.
   Add existing benchmarks that are specific to the use case.
   Add the price-to-performance ratio to the GPU hierarchy.
   Check the compatibility of each GPU with the dependencies of the existing code base.
   Evaluate software and ecosystem support.
   Consider future-proofing.
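The price-to-performance step can be sketched as follows. This is a minimal illustration, not a prescribed method: the prices and FP32 figures are copied from the inventory table later in this document, while the `GpuEntry` class, the helper names and the choice of FP32 TFLOPS per $1000 as the metric are assumptions made for the example. A real ranking would use the benchmark that matches the use case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuEntry:
    name: str
    price_usd: float    # price listed in the inventory table
    fp32_tflops: float  # FP32 throughput listed in the inventory table


# Subset of the inventory table; the selection is illustrative only.
INVENTORY = [
    GpuEntry("RTX 8000", 4000, 14.9),
    GpuEntry("A2", 1400, 4.5),
    GpuEntry("A100", 50000, 19.5),
    GpuEntry("L4", 1500, 30.3),
    GpuEntry("L40S", 6000, 91.6),
    GpuEntry("RTX 4090", 1500, 82.58),
]


def tflops_per_kusd(gpu: GpuEntry) -> float:
    """FP32 TFLOPS obtained per 1000 USD (hypothetical metric for this sketch)."""
    return gpu.fp32_tflops / (gpu.price_usd / 1000.0)


def rank_by_price_performance(gpus):
    """Return the GPUs sorted from best to worst price-to-performance."""
    return sorted(gpus, key=tflops_per_kusd, reverse=True)


ranking = rank_by_price_performance(INVENTORY)
for gpu in ranking:
    print(f"{gpu.name:10s} {tflops_per_kusd(gpu):7.2f} FP32 TFLOPS per $1000")
```

A single metric like this only orders the hierarchy; compatibility and ecosystem checks from the later steps still apply to whichever GPU ranks first.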
    Categorization of GPU
The GPUs should be clearly categorized according to their capabilities in terms of:
   Number of parallel compute components, such as CUDA cores (NVIDIA) or Stream Processors (AMD).
   Memory bandwidth; higher bandwidth can make a crucial difference in inference servers.
   Available dedicated memory and memory type (e.g., GDDR6, HBM2).
   Availability and count of specialized cores/accelerators, such as RT Cores and Tensor Cores in NVIDIA GPUs and Ray Accelerators in AMD GPUs.
   Use-case benchmarks such as MLPerf or DLPerf, or a custom performance benchmark.
   Form factor.
   Software and features: support for features such as NVIDIA's DLSS (Deep Learning Super Sampling) or AMD's FidelityFX Super
   Resolution, as well as driver stability and the software ecosystem.
   Support for APIs such as DirectX, Vulkan and OpenGL.
     AMD Radeon RX 6800 XT Review | TechSpot: a comparison of this kind, drawn for HPC/AI or rendering workloads, could
     give the user an intuitive insight.
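The categorization above can be turned into a simple shortlist filter. The sketch below is an assumption-laden example: `GpuSpec`, `shortlist` and the threshold parameters are hypothetical names, and only a few of the listed criteria (memory, bandwidth, Tensor Core count, multi-instance support) are modelled. The figures are taken from the inventory table later in this document; for the A100 the 80 GB variant with 1935 GB/s is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GpuSpec:
    name: str
    memory_gb: int
    bandwidth_gbs: float
    tensor_cores: int
    supports_mig: bool


# Illustrative subset of the inventory table.
SPECS = [
    GpuSpec("A2", 16, 200, 40, False),
    GpuSpec("A100", 80, 1935, 432, True),
    GpuSpec("L4", 24, 300, 240, False),
    GpuSpec("L40S", 48, 864, 568, False),
    GpuSpec("RTX 4090", 24, 1008, 512, False),
]


def shortlist(specs: List[GpuSpec], min_memory_gb: int = 0,
              min_bandwidth_gbs: float = 0.0, need_mig: bool = False) -> List[str]:
    """Return the names of GPUs that satisfy every stated requirement."""
    selected = []
    for spec in specs:
        if spec.memory_gb < min_memory_gb:
            continue
        if spec.bandwidth_gbs < min_bandwidth_gbs:
            continue
        if need_mig and not spec.supports_mig:
            continue
        selected.append(spec.name)
    return selected


# Example: an inference server that needs at least 24 GB and 800 GB/s.
print(shortlist(SPECS, min_memory_gb=24, min_bandwidth_gbs=800))
```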
    Performance Comparison of Current Inventory
Legend: ? = unknown, NA = not applicable, - = none, .. = no entry, * = with sparsity. Prices are in USD.

Target audience and price
GPU            Price ($)   Target audience
RTX 8000       4000        ..
A2             1400        Edge / Entry Level
A10            ..          ..
A30            15000       AI Inference and Analytics
A40            ..          ..
A100           50000       AI/HPC
L4             1500        Edge / Entry Level for AI and Data Analysis
L40S           6000        Generative AI / LLMs / Rendering
H100 PCIe      ..          ..
H100 NVL       ..          ..
H100 SXM       ..          ..
RTX A6000      3500        ..
RTX 6000 Ada   2500        ..
RTX 4090       1500        Rendering

Memory
GPU            Memory                 Bus width   Bandwidth (GB/s)   Memory clock (MHz)
RTX 8000       48 GB GDDR6            384-bit     672                ..
A2             16 GB GDDR6            128-bit     200                6251
A10            24 GB GDDR6            384-bit     600                6251
A30            24 GB HBM2             3072-bit    933                1215
A40            48 GB                  384-bit     696                7251
A100           40/80 GB HBM2          5120-bit    1935 | 2039        1512
L4             24 GB GDDR6            192-bit     300                6251
L40S           48 GB GDDR6 with ECC   384-bit     864                9001
H100 PCIe      80 GB HBM2e            5120-bit    2000               1593
H100 NVL       94 GB HBM2e            ..          3900               ..
H100 SXM       80 GB HBM2e            ..          3350               ..
RTX A6000      48 GB GDDR6            384-bit     768.0              ..
RTX 6000 Ada   48 GB GDDR6            384-bit     768.0              ..
RTX 4090       24 GB GDDR6            384-bit     1008               ..

Clocks, cores and GPU chip
GPU            Clock, base-boost (MHz)   CUDA cores   Tensor cores   RT cores   GPU chip
RTX 8000       ?                         4608         576            72         ..
A2             1440-1770                 1280         40 (Gen 3)     10         ..
A10            885-1695                  ..           ..             ..         GA102-890
A30            930-1440                  3rd gen      224            72         ..
A40            1305-1740                 ..           ..             ..         GA102-895
A100           1065-1410                 6912         432            ?          GA100-893FF, GA100-893FFF, GA100-893HH, GA100-893HHH
L4             795-2040                  7424         240            60         ..
L40S           1065-2520                 18176        568            142        ..
H100 PCIe      1125-1755                 ..           ..             ..         GH100-200
H100 NVL       ..                        ..           ..             ..         ..
H100 SXM       ..                        ..           ..             ..         ..
RTX A6000      1410                      10752        336            84         ..
RTX 6000 Ada   2175                      18176        568            142        ..
RTX 4090       2235                      16384        512            128        ..

Compute performance
GPU            Tensor perf (TFLOPS)   FP64 | FP64 Tensor Core (TFLOPS)   FP32 (TFLOPS)   RT core perf (TFLOPS)
RTX 8000       119.4                  ..                                 14.9            ?
A2             ?                      ..                                 4.5             ?
A10            ..                     ..                                 ..              ..
A30            ?                      5.2 | 10.3                         10.3            ?
A40            ..                     ..                                 ..              ..
A100           312                    9.7 | 19.5                         19.5            ?
L4             ?                      ..                                 30.3            ?
L40S           1466                   ..                                 91.6            212
H100 PCIe      ..                     ..                                 ..              ..
H100 NVL       ..                     30 | 60                            60              ..
H100 SXM       ..                     34 | 67                            67              ..
RTX A6000      309.7                  ..                                 38.7            75.6
RTX 6000 Ada   1457.0                 ..                                 91.1            210.6
RTX 4090       ?                      ..                                 82.58           ?

Reduced-precision performance
GPU            TF32 (TFLOPS)   FP16 Tensor Core (TFLOPS)   FP8 (TFLOPS)   INT8 (TOPS)    INT4 (TOPS)
RTX 8000       ..              ..                          ..             ..             ..
A2             9 | 18*         18 | 36*                    ..             36 | 144*      72 | 144*
A10            ..              ..                          ..             ..             ..
A30            82 | 165*       165 | 330*                  -              330 | 661*     661 | 1321*
A40            ..              ..                          ..             ..             ..
A100           156 | 312*      312 | 624*                  ..             624 | 1248*    ..
L4             120             242                         485            485            ..
L40S           183 | 366*      362.05 | 733*               733 | 1466*    733 | 1466*    733 | 1466*
H100 PCIe      ..              ..                          ..             ..             ..
H100 NVL       835             1671                        3341           3341           ..
H100 SXM       989             1979                        3958           3958           ..
RTX A6000      ..              ..                          ..             ..             ..
RTX 6000 Ada   ..              ..                          ..             ..             ..
RTX 4090       ..              ..                          ..             ..             ..

Media engines and speciality
GPU            Encoders   Decoders   Speciality
RTX 8000       1          1          1st Ray tracing GPU
A2             1          2          ?
A10            ..         ..         ..
A30            NA         4          1 OFA, 1 NVJPEG
A40            ..         ..         ..
A100           NA         5          ?
L4             2          4          4 JPEG DEC
L40S           3          3          Transformer Engine
H100 PCIe      0          7          7 JPEG DEC
H100 NVL       0          7          7 JPEG DEC
H100 SXM       0          7          7 JPEG DEC
RTX A6000      1          2          ?
RTX 6000 Ada   3          3          ?
RTX 4090       2          1          ?

Architecture, series, cooling and power
GPU            Architecture   Series    Cooling   Power consumption
RTX 8000       Turing         Quadro    Active    250W [260-295W]
A2             Ampere         Tesla     Passive   40-60W
A10            Ampere         Tesla     Passive   150W
A30            Ampere         Tesla     Passive   165W
A40            ..             ..        ..        ..
A100           Ampere         Tesla     Passive   250W (40GB), 150W-300W (80GB), 400W (SXM)
L4             Ada Lovelace   Tesla     ?         72W
L40S           Ada Lovelace   Tesla     Passive   350W
H100 PCIe      Hopper         Tesla     Passive   300-350W
H100 NVL       Hopper         Tesla     Passive   350-400W
H100 SXM       Hopper         Tesla     Passive   700W
RTX A6000      Ampere         Quadro    ?         300W
RTX 6000 Ada   Ada Lovelace   Quadro    ?         300W
RTX 4090       Ada Lovelace   GeForce   ?         450W

Capability ratings
GPU            Ray tracing   AI capabilities   ML training   Rendering capabilities
RTX 8000       DECENT        LOW               LOW           LOW
A2             LOW           LOW               LOW           LOW
A10            ..            ..                ..            ..
A30            LOW           MEDIUM            MEDIUM        LOW
A40            ..            ..                ..            ..
A100           GOOD          GOOD              GOOD          GOOD
L4             MEDIUM        MEDIUM            MEDIUM        MEDIUM
L40S           GOOD          GOOD              GOOD          GOOD
H100 PCIe      ..            ..                ..            ..
H100 NVL       ..            ..                ..            ..
H100 SXM       ..            ..                ..            ..
RTX A6000      MEDIUM        MEDIUM            MEDIUM        MEDIUM
RTX 6000 Ada   GOOD          MEDIUM            MEDIUM        MEDIUM
RTX 4090       GOOD          MEDIUM            MEDIUM        GOOD

Multi-instance support and NVLink / NVSwitch
GPU            Multi-instance support                        NVLink / NVSwitch
RTX 8000       NA                                            Connects 2 at 100 GB/s bi-directional
A2             NA                                            NA
A10            NA                                            NA
A30            4 MIGs @ 6GB, 2 MIGs @ 12GB, 1 MIG @ 24GB     1x 3rd Gen NVLink, 200 GB/s
A40            ..                                            1x 3rd Gen NVLink, 112.5 GB/s
A100           Up to 7 GPU instances                         3x 3rd Gen NVLink, 600 GB/s
L4             NA                                            NA
L40S           NA                                            NA
H100 PCIe      ..                                            ..
H100 NVL       ..                                            ..
H100 SXM       ..                                            ..
RTX A6000      NA                                            Connects 2 at 112.5 GB/s (bidirectional)
RTX 6000 Ada   NA                                            NA
RTX 4090       NA                                            NA

Form factor and interface
GPU            Form factor                                 Interface
RTX 8000       4.4” H x 10.5” L, FHFL, DW                  PCIe 3.0 x16
A2             HHHL, SW, (LP) PCIe                         PCIe 4.0 x8, x4; 3.0 x8
A10            FHFL, SW                                    PCIe 4.0 x8, x16; 3.0 x16
A30            FHFL, DW                                    PCIe 4.0 x16
A40            FHFL, DW                                    PCIe 4.0 x16
A100           4/8 SXM GPUs in NVIDIA HGX™ A100; PCIe      PCIe 4.0 x16
L4             1-slot low-profile, PCIe (169mm x 69mm)     PCIe 4.0 x16, x8; 3.0 x16
L40S           FHFL, DW                                    PCIe 4.0 x16; 64GB/s bidirectional
H100 PCIe      ..                                          PCIe 5.0 x16, x8; 4.0 x16
H100 NVL       ..                                          ..
H100 SXM       ..                                          ..
RTX A6000      FHFL, DW                                    PCIe 4.0 x16
RTX 6000 Ada   FHFL, DW                                    ..
RTX 4090       ?                                           ..

Auxiliary power connector and four-part ID
GPU            Auxiliary power connector   Four-part ID (VID:DEVID:SVID:SSID)
RTX 8000       8-pin                       10DE:1E78:10DE:13D8
A2             -                           10DE:25B6:10DE:157E
A10            8-pin                       10DE:2236:10DE:1482
A30            8-pin                       10DE:20B7:10DE:1532
A40            8-pin                       10DE:2235:10DE:145A
A100           8-pin                       80GB: 10DE:20B5:10DE:1533; 40GB: 10DE:20F1:10DE:145F
L4             -                           10DE:27B8:10DE:16CA
L40S           16-pin                      10DE:26B9:10DE:1851
H100 PCIe      16-pin                      ..
H100 NVL       ..                          ..
H100 SXM       ..                          ..
RTX A6000      8-pin                       10DE:2230:10DE:1459
RTX 6000 Ada   ..                          ..
RTX 4090       ..                          ..
      https://cloudspacetechnologies-my.sharepoint.com/:x:/g/personal/siddharth_mishra_myrealdata_in/Eff_b2M2R1pJuy42n1WO2uMB-0rqyo2iBQXOdv55_CxWEg?e=NzojcI
     NVIDIA Data Center Platform | Line Card
                                                    Precision format support in NVIDIA GPU Architectures
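The memory columns of the inventory table can be cross-checked against each other. The sketch below assumes that peak bandwidth equals the bus width in bytes multiplied by the memory clock and by two transfers per clock; that factor of two is an assumption that happens to reproduce the listed figures for the GPUs shown, not a statement about every memory type. The function name `peak_bandwidth_gbs` and the `ROWS` list are hypothetical, and the values are copied from the table.

```python
def peak_bandwidth_gbs(bus_width_bits: int, memory_clock_mhz: float,
                       transfers_per_clock: int = 2) -> float:
    """Estimate peak memory bandwidth in GB/s (assumed double-data-rate model)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# (GPU, bus width in bits, memory clock in MHz, bandwidth listed in the table in GB/s)
ROWS = [
    ("A2", 128, 6251, 200),
    ("A10", 384, 6251, 600),
    ("A30", 3072, 1215, 933),
    ("A40", 384, 7251, 696),
    ("L4", 192, 6251, 300),
    ("L40S", 384, 9001, 864),
]

for name, bus_bits, clock_mhz, listed in ROWS:
    estimate = peak_bandwidth_gbs(bus_bits, clock_mhz)
    deviation = abs(estimate - listed) / listed * 100
    print(f"{name:5s} estimated {estimate:8.1f} GB/s, listed {listed:5d} GB/s, "
          f"deviation {deviation:.2f}%")
```

A large deviation for a row would suggest a transcription error in one of the three columns rather than a property of the hardware.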
Some Popular GPU Benchmarks
   HPC/AI Performance Benchmarks
1. MLPerf: A broad and widely recognized benchmark suite for machine learning performance. MLPerf covers a range of AI tasks
  including training and inference for different types of neural networks across various hardware platforms.
2. HPL (High Performance Linpack): Traditionally used to rank supercomputers in the TOP500 list, HPL measures a system's floating-
  point computing power by solving a dense system of linear equations, which is relevant for both HPC and certain AI workloads.
3. HPCG (High Performance Conjugate Gradients): Complements HPL by testing computational and data access patterns that are
  more characteristic of real-world HPC applications than HPL's dense linear algebra focus.
4. GuideLLM: A tool for evaluating and optimizing the deployment of large language models (LLMs). By simulating real-world
  inference workloads, GuideLLM helps users gauge the performance, resource needs, and cost implications of deploying LLMs on
  various hardware configurations (for vLLM inference scenarios).
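To make the HPL measurement concrete, the sketch below solves a small dense linear system and converts the elapsed time into a floating-point rate. It is a CPU-only, pure-Python illustration of the metric, not HPL itself: the problem size is tiny, the solver is a plain Gaussian elimination with partial pivoting, and the function names are hypothetical. The operation count 2/3·n³ + 3/2·n² is the one HPL is understood to use when reporting its rate.

```python
import random
import time


def solve_dense(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] for row in a]  # work on copies so the inputs stay intact
    b = b[:]
    for k in range(n):
        # Pick the row with the largest pivot candidate in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
            b[k], b[p] = b[p], b[k]
        row_k = a[k]
        for i in range(k + 1, n):
            factor = a[i][k] / row_k[k]
            row_i = a[i]
            for j in range(k, n):
                row_i[j] -= factor * row_k[j]
            b[i] -= factor * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def hpl_flop_count(n):
    """Floating-point operations credited for an n x n solve."""
    return (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2


def run(n=120, seed=0):
    """Time one solve and return (GFLOPS, max absolute residual)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    residual = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
                   for i in range(n))
    return hpl_flop_count(n) / elapsed / 1e9, residual


gflops, residual = run()
print(f"{gflops:.4f} GFLOPS, max residual {residual:.2e}")
```

The residual check matters as much as the rate: a run only counts if the computed solution actually satisfies the system to within floating-point tolerance.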
   Video Rendering Performance Benchmarks
1. OctaneRender: Uses GPU rendering to measure how well a GPU can handle photorealistic rendering using the OctaneRender engine.
  This is particularly relevant for professionals in visual effects and animation.
2. Blender Benchmark: The benchmarking tool of Blender, the open-source 3D creation suite, for measuring the performance of GPUs
  (and CPUs) in rendering tasks. It is widely used due to Blender's popularity in the 3D modeling and animation industry.
3. Redshift Benchmark: Designed for the Redshift rendering engine, this benchmark measures the performance of GPUs in rendering
  scenes that are representative of motion pictures and visual effects workloads.
4. V-Ray Benchmark: A free tool that measures how fast a system renders. Rendering performance can be evaluated using CPUs,
  NVIDIA GPUs, or a combination of both. Chaos® V-Ray® is a 3D rendering plugin available for all major 3D design and CAD
  programs; it works with 3ds Max, Cinema 4D, Houdini, Maya, Nuke, Revit, Rhino, SketchUp, and Unreal.
   Analytical Processing Performance Benchmarks
1. SPECviewperf: Widely used to evaluate the performance of GPUs in professional visualization applications, including energy, medical,
  and financial analysis tasks. It measures the graphics performance of systems in professional applications.
2. Geekbench: Provides CPU and GPU compute benchmarks that measure performance in various computational tasks,
  including those relevant to analytical processing.
3. SiSoftware Sandra: Offers a suite of benchmarks that can test various aspects of GPU performance, including processing capability,
  memory bandwidth, and latency, relevant for analytical processing tasks.