DGX-2 User Guide
The NVIDIA® DGX-2™ System is the world’s first two-petaFLOPS system, engaging
16 fully interconnected GPUs for accelerated deep learning performance. The DGX-2
System is powered by the NVIDIA® DGX™ software stack and an architecture designed
for deep learning, high-performance computing, and analytics.
Unless otherwise indicated, references to the DGX-2 in this User Guide also apply to the
DGX-2H.
HARDWARE OVERVIEW
Major Components
The following diagram shows the major components of the DGX-2 System.
CPU (DGX-2H): Dual Intel Xeon Platinum 8174, 3.1 GHz, 24 cores
6 System Memory: 1.5 TB
7 Storage (RAID 0 cache): 8 NVMe SSDs, 3.84 TB each (30 TB total)
8 Network (storage): 2 high-speed Ethernet ports, 10/25/40/100 GbE
Can be expanded with the purchase and installation of a second dual-port network
adapter.
Mechanical Specifications
Feature Description
Form Factor 10U Rackmount
Height 17.32” (440 mm)
Width 19" (482.6 mm)
Depth 31.3" (795 mm)
Gross Weight 360 lbs (163.29 kg)
Power Specifications
Input: 200-240 volts AC, 16 A, 50-60 Hz
Specification for each power supply: 3000 W @ 200-240 V
Maximum system power: DGX-2: 10 kW max.; DGX-2H: 12 kW max.
Comments: The DGX-2/2H System contains six load-balancing power supplies.
The DGX-2/2H also supports operating in a degraded power mode when more than one
PSU fails. If only 3 or 4 PSUs are operating, then performance is degraded slightly but
the system is still operational.
Note: The DGX-2 will not operate with less than three PSUs.
! WARNING: To avoid electric shock or fire, do not connect other power cords to the
DGX-2. For more details, see B.6. Electrical Precautions.
Environmental Specifications
Feature Specification
Operating Temperature DGX-2: 5 °C to 35 °C (41 °F to 95 °F)
DGX-2H: 5 °C to 25 °C (41 °F to 77 °F)
Relative Humidity 20% to 80%, noncondensing
Airflow DGX-2: 1000 CFM @ 35 °C
DGX-2H: 1200 CFM @ 25 °C
Heat Output DGX-2: 34,122 BTU/hr
DGX-2H: 40,945 BTU/hr
ID Qty Description
1 4 Upper GPU tray fans
2 4 Lower GPU tray fans
3 8 Solid state drives (default). Additional SSDs are available for purchase
to expand to 16.
4 2 Motherboard tray fans
ID Qty Description
5 1 Front console board:
USB 3.0 (2x)
VGA (1x)
6 1 Power and ID buttons:
Bottom: ID button
Press to cause an LED on the back of the unit to flash as an
identifier during servicing.
! IMPORTANT: See the section Turning the DGX-2 On and Off for instructions on how to
properly turn the system on or off.
ID Qty Description
1 1 EMI shield
2 6 Power supplies and connectors
3 1 I/O tray
4 1 Motherboard tray
5 2 Handles to pull power supply carrier
ID Qty Description
1 2 NVIDIA NVLink™ plane card
ID Qty Description
1 1 (Optional) High profile PCI card slot (for network storage)
2 2 (Default) QSFP28 network ports (for network storage)
Left side port designation: enp134s0f0
Right side port designation: enp134s0f1
3 1 RJ45 network port (for in-band management)
4 2 USB 3.0 ports
5 1 IPMI port (for out-of-band management (BMC))
6 1 VGA port
7 1 Serial port (DB-9)
8 1 System ID LED
Blinks blue when ID button is pressed from the front of the unit
as an aid in identifying the unit needing servicing
9 1 BMC reset button
10 1 Power and BMC heartbeat LED
On/Off – BMC is not ready
Blinking – BMC is ready
NETWORK PORTS
The following figure highlights the available network ports and their purpose.
INFINIBAND CABLES
The DGX-2 System is not shipped with InfiniBand cables. For a list of cables compatible
with the Mellanox ConnectX-5 VPI cards installed in the DGX-2 system, visit the
Mellanox ConnectX-5 Firmware Download page, select the appropriate FW version,
OPN (model), and PSID, and then select Release Notes from the Documentation
column.
To connect the DGX-2 system to an existing 10 or 25 GbE network, you can purchase the
following adapters from NVIDIA.
DGX OS SOFTWARE
The DGX-2 System comes installed with a base OS incorporating
An Ubuntu server distribution with supporting packages
The NVIDIA driver
Docker CE
NVIDIA Container Runtime for Docker
The following health monitoring software
● NVIDIA System Management (NVSM)
Provides active health monitoring and system alerts for NVIDIA DGX nodes in a
data center. It also provides simple commands for checking the health of the
DGX-2 System from the command line.
● Data Center GPU Management (DCGM)
This software enables node-wide administration of GPUs and can be used for
cluster and data-center level management.
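For reference, a quick health check from the command line might look like the following sketch; nvsm is referenced later in this guide, and dcgmi is the standard DCGM command-line interface (the exact subcommands shown here are illustrative, not taken from this guide).
$ sudo nvsm show health      # summary health report for the DGX-2 System
$ dcgmi discovery -l         # list the GPUs visible to the DCGM host engine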
ADDITIONAL DOCUMENTATION
Note: Some of the documentation listed below is not available at the time of
publication. See https://docs.nvidia.com/dgx/ for the latest status.
CUSTOMER SUPPORT
Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or
diagnosing problems with your DGX-2 System. Also contact NVIDIA Enterprise
Support for assistance in installing or moving the DGX-2 System. You can contact
NVIDIA Enterprise Support in the following ways.
Our support team can help collect appropriate information about your issue and involve
internal resources as needed.
Connect to the DGX-2 console using either a direct connection, a remote connection
through the BMC, or through an SSH connection.
CAUTION: Connect directly to the DGX-2 console if the DGX-2 System is connected
to a 172.17.xx.xx subnet.
DGX OS Server software installs Docker CE which uses the 172.17.xx.xx subnet by
default for Docker containers. If the DGX-2 System is on the same subnet, you will not
be able to establish a network connection to the DGX-2 System.
Refer to the section Configuring Docker IP Addresses for instructions on how to change
the default Docker network settings.
DIRECT CONNECTION
At either the front or the back of the DGX-2 System, connect a display to the VGA
connector, and a keyboard to any of the USB ports.
REMOTE CONNECTION THROUGH THE BMC
This method requires that you have the BMC login credentials. These credentials
depend on the following conditions:
Prior to first time boot: The default credentials are
Username: admin
Password: admin
After first boot setup: The administrator username that was set up during the
initial boot is used for both the BMC username and the BMC password.
Username: <administrator-username>
Password: <administrator-username>
After first boot setup with changed password: The BMC password can be changed
from the default (the administrator username), in which case the credentials are
Username: <administrator-username>
Password: <new-bmc-password>
1. Make sure you have connected the BMC port on the DGX-2 System to your LAN.
2. Open a browser within your LAN and go to:
https://<bmc-ip-address>/
Make sure popups are allowed for the BMC address.
3. Log in.
SSH CONNECTION
You can also establish an SSH connection to the DGX-2 System through the network
port. See the section Network Ports to identify the port to use, and the section
Configuring Static IP Addresses for the Network Ports if you need to configure a static
IP address.
While NVIDIA service personnel will install the DGX-2 System at the site and perform
the first boot setup, the first boot setup instructions are provided here for reference and
to support any re-imaging of the server.
These instructions describe the setup process that occurs the first time the DGX-2 System
is powered on after delivery or after the server is re-imaged.
Be prepared to accept all End User License Agreements (EULAs) and to set up your
username and password.
1. Connect to the DGX-2 console as explained in Connecting to the DGX-2 Console.
! CAUTION: Once you create your login credentials, the default admin/admin login will
no longer work.
Note: The BMC software will not accept "sysadmin" as a user name. If you create this
user name for the system login, "sysadmin" will not be available for logging in to the
BMC.
● Choose a primary network interface for the DGX-2 System; for example, enp6s0.
This should typically be the interface that you will use for subsequent system
configuration or in-band management.
Note: After you select the primary network interface, the system attempts to configure
the interface for DHCP and then asks you to enter a hostname for the system. If DHCP
is not available, you will have the option to configure the network manually. If you
need to configure a static IP address on a network interface connected to a DHCP
network, select Cancel at the Network configuration – Please enter the
hostname for the system screen. The system will then present a screen with the
option to configure the network manually.
Note: RAID 1 Rebuild in Progress - When the system is booted after restoring the
image, software RAID begins the process of rebuilding the RAID 1 array - creating a
mirror of (or resynchronizing) the drive containing the software. System performance
may be affected during the RAID 1 rebuild process, which can take an hour to
complete.
During this time, the command “nvsm show health” will report a warning that the RAID
volume is resyncing.
You can check the status of the RAID 1 rebuild process using “sudo mdadm -D
/dev/md0”.
This chapter provides basic requirements and instructions for using the DGX-2 System,
including how to perform a preliminary health check and how to prepare for running
containers. Be sure to visit the DGX documentation website at
https://docs.nvidia.com/dgx/ for additional product documentation.
REGISTRATION
Be sure to register your DGX-2 System with NVIDIA as soon as you receive your
purchase confirmation e-mail. Registration enables your hardware warranty and allows
you to set up an NVIDIA GPU Cloud for DGX account.
To register your DGX-2 System, you will need information provided in your purchase
confirmation e-mail. If you do not have the information, send an e-mail to NVIDIA
Enterprise Support at enterprisesupport@nvidia.com.
1. From a browser, go to the NVIDIA DGX Product Registration page
(https://www.nvidia.com/object/dgx-product-registration).
2. Enter all required information and then click SUBMIT to complete the registration
process and receive all warranty entitlements and DGX-2 support services
entitlements.
Before installation, make sure you have completed the Site Survey and have given all
relevant site information to your Installation Partner.
Work with NVIDIA Enterprise Support to set up an NGC enterprise account if you are
the organization administrator for your DGX-2 purchase. See the NGC Container
Registry for DGX User Guide (https://docs.nvidia.com/dgx/ngc-registry-for-dgx-user-
guide/) for detailed instructions on getting an NGC enterprise account.
Startup Considerations
In order to keep your DGX-2 running smoothly, allow up to a minute of idle time after
reaching the login prompt. This ensures that all components are able to complete their
initialization.
Shutdown Considerations
! WARNING: Risk of Danger - Removing power cables or using Power Distribution Units
(PDUs) to shut off the system while the Operating System is running may cause damage
to sensitive components in the DGX-2 server.
When shutting down the DGX-2, always initiate the shutdown from the operating
system, with a momentary press of the power button, or by using Graceful Shutdown
from the BMC, and wait until the system enters a powered-off state before performing
any maintenance.
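For reference, a graceful shutdown initiated from the operating system typically uses the standard Linux command shown in this sketch:
$ sudo shutdown -h now       # cleanly stop the OS and power off the system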
See the NVIDIA Containers and Deep Learning Frameworks User Guide at
https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html for further
instructions, including an example of logging into the NGC container registry and
launching a deep learning container.
This chapter describes key network considerations and instructions for the DGX-2
System.
BMC SECURITY
NVIDIA recommends that customers follow best security practices for BMC
management (IPMI port). These include, but are not limited to, such measures as:
Restricting the DGX-2 IPMI port to an isolated, dedicated, management network
Using a separate, firewalled subnet
Configuring a separate VLAN for BMC traffic if a dedicated network is not available
https_proxy="https://<username>:<password>@<host>:<port>/";
no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com"
HTTP_PROXY="http://<username>:<password>@<host>:<port>/"
FTP_PROXY="ftp://<username>:<password>@<host>:<port>/";
HTTPS_PROXY="https://<username>:<password>@<host>:<port>/";
NO_PROXY="localhost,127.0.0.1,localaddress,.localdomain.com"
Example:
http_proxy="http://myproxy.server.com:8080/"
ftp_proxy="ftp://myproxy.server.com:8080/";
https_proxy="https://myproxy.server.com:8080/";
For apt
Edit (or create) a proxy config file /etc/apt/apt.conf.d/myproxy and include the
following lines
Acquire::http::proxy "http://<username>:<password>@<host>:<port>/";
Acquire::ftp::proxy "ftp://<username>:<password>@<host>:<port>/";
Acquire::https::proxy "https://<username>:<password>@<host>:<port>/";
Example:
Acquire::http::proxy "http://myproxy.server.com:8080/";
Acquire::ftp::proxy "ftp://myproxy.server.com:8080/";
Acquire::https::proxy "https://myproxy.server.com:8080/";
For Docker
To ensure that Docker can access the NGC container registry through a proxy, Docker
uses environment variables. For best practice recommendations on configuring proxy
environment variables for Docker,
see https://docs.docker.com/engine/admin/systemd/#http-proxy.
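As a sketch of the approach described at that link, a systemd drop-in file for the Docker service can export the proxy variables; the file name and proxy values below are placeholders to adapt to your site.
# /etc/systemd/system/docker.service.d/http-proxy.conf  (placeholder values)
[Service]
Environment="HTTP_PROXY=http://myproxy.server.com:8080/"
Environment="HTTPS_PROXY=https://myproxy.server.com:8080/"
Environment="NO_PROXY=localhost,127.0.0.1"
After adding the file, reload systemd and restart Docker for the change to take effect.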
By default, Docker uses the 172.17.0.0/16 subnet. Consult your network administrator
to find out which IP addresses are used by your network. If your network does not
conflict with the default Docker IP address range, then no changes are needed and
you can skip this section.
However, if your network uses the addresses within this range for the DGX-2 System,
you should change the default Docker network addresses.
You can change the default Docker network addresses by modifying either the
/etc/docker/daemon.json file or the
/etc/systemd/system/docker.service.d/docker-override.conf file. These instructions
provide an example of modifying /etc/systemd/system/docker.service.d/docker-
override.conf to override the default Docker network addresses.
1. Open the docker-override.conf file for editing.
$ sudo vi /etc/systemd/system/docker.service.d/docker-override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -s overlay2
LimitMEMLOCK=infinity
LimitSTACK=67108864
2. Make the changes indicated below (the --bip and --fixed-cidr options), setting the
correct bridge IP address and IP address ranges for your network. Consult your IT
administrator for the correct addresses.
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 --bip=192.168.127.1/24
--fixed-cidr=192.168.127.128/25
LimitMEMLOCK=infinity
LimitSTACK=67108864
Save and close the /etc/systemd/system/docker.service.d/docker-
override.conf file when done.
3. Reload the systemctl daemon.
4. Restart Docker.
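For reference, steps 3 and 4 typically correspond to the standard systemd commands:
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker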
OPENING PORTS
Make sure that the ports listed in the following table are open and available on
your firewall to the DGX-2 System:
CONNECTIVITY REQUIREMENTS
To run NVIDIA NGC containers from the NGC container registry, your network must
be able to access the following URLs:
http://archive.ubuntu.com/ubuntu/
http://security.ubuntu.com/ubuntu/
http://international.download.nvidia.com/dgx/repos/
(To be accessed using apt-get, not through a browser.)
https://apt.dockerproject.org/repo/
https://download.docker.com/linux/ubuntu/
https://nvcr.io/
To verify connection to nvcr.io, run
$ wget https://nvcr.io/v2
Note: If you cannot access the DGX-2 System remotely, then connect a display
(1440x900 or lower resolution) and keyboard directly to the DGX-2 System.
● To set the subnet mask, enter the following command, replacing the placeholder
text with your information.
● To set the default gateway IP (“Router IP address” in the BIOS settings), enter the
following command, replacing the placeholder text with your information.
3. At the BIOS Setup Utility screen, navigate to the Server Mgmt tab on the top menu,
then scroll to BMC network configuration and press Enter.
4. Scroll to Configuration Address Source and press Enter, then at the Configuration
Address source pop-up, select Static and then press Enter.
5. Set the addresses for the Station IP address, Subnet mask, and Router IP address as
needed by performing the following for each:
6. When finished making all your changes, press F10 to save and exit.
You can now access the BMC over the network.
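As an alternative to the BIOS method, BMC LAN settings can often be set from the DGX OS using ipmitool; the following is a hedged sketch in which the channel number (1) and the addresses are assumptions that you must adapt to your site.
$ sudo ipmitool lan set 1 ipsrc static
$ sudo ipmitool lan set 1 ipaddr 192.168.1.10
$ sudo ipmitool lan set 1 netmask 255.255.255.0
$ sudo ipmitool lan set 1 defgw ipaddr 192.168.1.1
$ sudo ipmitool lan print 1          # verify the settings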
Note: If you cannot access the DGX-2 System remotely, then connect a display
(1440x900 or lower resolution) and keyboard directly to the DGX-2 System.
1. Determine the port designation that you want to configure, based on the physical
ethernet port that you have connected to your network.
1 enp134s0f0
2 enp134s0f1
3 enp6s0
network:
  version: 2
  renderer: networkd
  ethernets:
    <port-designation>:
      dhcp4: no
      dhcp6: no
      addresses: [10.10.10.2/24]
      gateway4: 10.10.10.1
      nameservers:
        search: [<mydomain>, <other-domain>]
        addresses: [10.10.10.1, 1.1.1.1]
Consult your network administrator for the appropriate values, such as the network,
gateway, and nameserver addresses, and use the port designations that you
determined in step 1.
3. When finished with your edits, press ESC to switch to command mode, then save
the file to the disk and exit the editor.
4. Apply the changes.
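On DGX OS releases that use netplan, as the configuration file above indicates, this step typically corresponds to:
$ sudo netplan apply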
Note: If you are not returned to the command line prompt after a minute, then reboot
the system.
For these changes to work properly, the configured port must connect to a networking
switch that matches the port configuration. In other words, if the port configuration is
set to InfiniBand, then the external switch should be an InfiniBand switch with the
corresponding InfiniBand cables. Likewise, if the port configuration is set to Ethernet,
then the switch should also be Ethernet.
2. To verify that the Mellanox Software Tools (MST) services are running, enter the
following.
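The command for this step is not reproduced in this extract; on systems with the Mellanox firmware tools installed, output of the form shown below is typically produced by the following commands (an assumption, not necessarily the exact commands from the original document).
$ sudo mst start       # load the MST kernel modules if they are not already running
$ sudo mst status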
MST devices:
------------
/dev/mst/mt4119_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:35:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4119_pciconf1 - PCI configuration cycles access.
domain:bus:dev.fn=0000:3a:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4119_pciconf2 - PCI configuration cycles access.
domain:bus:dev.fn=0000:58:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4119_pciconf3 - PCI configuration cycles access.
domain:bus:dev.fn=0000:5d:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4119_pciconf4 - PCI configuration cycles access.
domain:bus:dev.fn=0000:86:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4119_pciconf5 - PCI configuration cycles access.
domain:bus:dev.fn=0000:b8:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4119_pciconf6 - PCI configuration cycles access.
domain:bus:dev.fn=0000:bd:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4119_pciconf7 - PCI configuration cycles access.
domain:bus:dev.fn=0000:e1:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
/dev/mst/mt4119_pciconf8 - PCI configuration cycles access.
domain:bus:dev.fn=0000:e6:00.0 addr.reg=88 data.reg=92
Chip revision is: 00
$
This output shows that the first eight cards are configured for InfiniBand and correspond
to the cluster network ports. The last card has two ports, which correspond to the two
network storage ports. These are configured for Ethernet and should not be changed.
Map the device bus numbers from your output to the device names in the mst
status output on your system. For example, the example output above shows that the
device name for bus bd is /dev/mst/mt4119_pciconf6. You will need the device
name when changing the configuration.
Device: 0000:e6:00.0
LINK_TYPE_P1 ETH (1)
Device #8:
Device type: ConnectX5
Device: 0000:58:00.0
LINK_TYPE_P1 ETH (1)
Device #9:
Device type: ConnectX5
Device: 0000:86:00.0
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
Device #5:
Device type: ConnectX5
Device: 0000:35:00.0
LINK_TYPE_P1 IB(1)
Device #6:
Device type: ConnectX5
Device: 0000:5d:00.0
LINK_TYPE_P1 IB(1)
Device #7:
Device type: ConnectX5
Device: 0000:e6:00.0
LINK_TYPE_P1 IB(1)
Device #8:
Device type: ConnectX5
Device: 0000:58:00.0
LINK_TYPE_P1 IB(1)
Device #9:
Device type: ConnectX5
Device: 0000:86:00.0
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
By default, the DGX-2 System includes eight SSDs in a RAID 0 configuration. These
SSDs are intended for application caching, so you must set up your own NFS storage for
long term data storage. The following instructions describe how to mount the NFS onto
the DGX-2 System, and how to cache the NFS using the DGX-2 SSDs for improved
performance.
Make sure that you have an NFS server with one or more exports with data to be
accessed by the DGX-2 System, and that there is network access between the DGX-2
System and the NFS server.
1. Configure an NFS mount for the DGX-2 System.
a) Edit the filesystem tables configuration.
sudo vi /etc/fstab
b) Add a new line for the NFS mount, using the local mount point of /mnt.
<nfs_server>:<export_path> /mnt nfs
rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0
― /mnt is used here as an example mount point.
― Consult your Network Administrator for the correct values for <nfs_server> and
<export_path>.
― The nfs arguments presented here are a list of recommended values based on
typical use cases. However, "fsc" must always be included as that argument
specifies use of FS-Cache.
c) Save the changes.
2. Verify the NFS server is reachable.
ping <nfs_server>
Use the server IP address or the server name provided by your network
administrator.
3. Mount the NFS export.
sudo mount /mnt
/mnt is an example mount point.
4. Verify caching is enabled.
cat /proc/fs/nfsfs/volumes
Look for the text FSC=yes in the output.
The NFS will be mounted and cached on the DGX-2 System automatically upon
subsequent reboot cycles.
This chapter describes specific features of the DGX-2 server to consider during setup
and operation.
SETTING MAXQ/MAXP
The maximum power consumption of the DGX-2 system is 10 kW. Beginning with DGX
OS 4.0.5, you can reduce the power consumption of the GPUs in the DGX-2 system to
accommodate server racks with a power budget of 18 kW. This allows you to install two
DGX-2 systems in the rack, instead of being limited to one.
Notes:
MaxQ is supported on DGX-2 systems with BMC firmware version 1.04.03 or later.
MaxQ is not supported on DGX-2H systems.
Commands to switch to MaxP or MaxQ, or to see the current power state, are not
supported on DGX-2H systems.
Setting MaxP/MaxQ is not supported on DGX-2 systems configured to run kernel virtual
machines (KVM mode).
MaxQ
Maximum efficiency mode
Allows two DGX-2 systems to be installed in racks that have a power budget of 18
kW.
Switch to MaxQ mode as follows:
MaxP
Default mode that provides maximum performance
GPUs operate unconstrained up to the thermal design power (TDP) level.
In this setting, the maximum DGX-2 power consumption is 10 kW.
Provides reduced performance when only 3 or 4 PSUs are working, but still better
performance than MaxQ.
If you switch to MaxQ mode, you can switch back to the default power mode (MaxP)
as follows:
/usr/sbin/dgx-kdump-config enable-dmesg-dump
/usr/sbin/dgx-kdump-config enable-vmcore-dump
/usr/sbin/dgx-kdump-config disable
This option disables the use of kdump and ensures that no memory is reserved for the
crash kernel.
Since SBIOS updates do not overwrite existing settings, the DGX-2 automatically
disables ACS upon rebooting the system as part of the SBIOS update.
If you are using the DGX-2 in KVM mode, ACS will be enabled automatically as part
of the conversion from bare-metal to KVM host.
When converting back to bare-metal mode from KVM mode and then rebooting, the
DGX-2 automatically disables ACS.
If the DGX-2 software image becomes corrupted (or both OS NVMe drives are replaced),
restore the DGX-2 software image to its original factory condition from a pristine copy of
the image.
Note: The DGX OS Server software is restored on one of the two NVMe M.2 drives.
When the system is booted after restoring the image, software RAID begins the process
of rebuilding the RAID 1 array - creating a mirror of (or resynchronizing) the drive
containing the software. System performance may be affected during the RAID 1
rebuild process, which can take an hour to complete.
Before re-imaging the system remotely, ensure that the correct DGX-2 software image is
saved to your local disk. For more information, see Obtaining the DGX-2 Software ISO
Image and Checksum File.
1. Log in to the BMC.
2. Click Remote Control and then click Launch KVM.
3. Set up the ISO image as virtual media.
a) From the top bar, click Browse File and then locate the re-image ISO file and
click Open.
b) Click Start Media.
4. Reboot, install the image, and complete the DGX-2 System setup.
a) From the top menu, click Power and then select Hard Reset, then click Perform
Action.
b) Click Yes and then OK at the Power Control dialogs, then wait for the system to
power down and then come back online.
c) At the boot selection screen, select Install DGX Server.
If you are an advanced user who is not using the RAID disks as cache and want
to keep data on the RAID disks, then select Install DGX Server without formatting
RAID. See the section Retaining the RAID Partition While Installing the OS for
more information.
d) Press Enter.
The DGX-2 System will reboot from the ISO image and proceed to install the image.
This can take approximately 15 minutes.
After the installation is completed, the system ejects the virtual CD and then reboots into
the OS.
Refer to Setting Up the DGX-2 System for the steps to take when booting up the DGX-2
System for the first time after a fresh installation.
Note: If you are restoring the software image remotely through the BMC, you do not
need a bootable installation medium and you can omit this task.
If you are creating a bootable USB flash drive, follow the instructions for the platform
that you are using:
● On a text-only Linux distribution, see Creating a Bootable USB Flash Drive by Using
the dd Command.
● On Windows, see Creating a Bootable USB Flash Drive by Using Akeo Rufus.
If you are creating a bootable DVD-ROM, you can use any of the methods described
in Burning the ISO on to a DVD on the Ubuntu Community Help Wiki.
Note: To ensure that the resulting flash drive is bootable, use the dd command
to perform a device bit copy of the image. If you use other commands to
perform a simple file copy of the image, the resulting flash drive may not be
bootable.
! CAUTION: The dd command erases all data on the device that you specify in the of
option of the command. To avoid losing data, ensure that you specify the correct path
to the USB flash drive.
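The following is a hedged sketch of such a device bit copy; the ISO file name and the /dev/sdX device node are placeholders that you must replace after verifying the correct target device.
$ lsblk                                                    # identify the USB flash drive, for example /dev/sdX
$ sudo dd if=<dgx-software-image>.iso of=/dev/sdX bs=2048 status=progress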
3. Under Boot selection, click SELECT and then locate and select the ISO image.
4. Under Partition scheme, select GPT.
5. Under File system, select FAT32.
6. Click Start. Because the image is a hybrid ISO file, you are prompted to select
whether to write the image in ISO Image (file copy) mode or DD Image (disk image)
mode.
Before re-imaging the system from a USB flash drive, ensure that you have a bootable
USB flash drive that contains the current DGX-2 software image.
1. Plug the USB flash drive containing the OS image into the DGX-2 System.
2. Connect a monitor and keyboard directly to the DGX-2 System.
3. Boot the system and press F11 when the NVIDIA logo appears to get to the boot
menu.
4. Select the USB volume name that corresponds to the inserted USB flash drive, and
boot the system from it.
5. When the system boots up, select Install DGX Server on the startup screen.
If you are an advanced user who is not using the RAID disks as cache and want to
keep data on the RAID disks, then select Install DGX Server without formatting
RAID. See the section Retaining the RAID Partition While Installing the OS for more
information.
6. Press Enter.
The DGX-2 System will reboot and proceed to install the image. This can take more
than 15 minutes.
After the installation is completed, the system then reboots into the OS.
Refer to Setting Up the DGX-2 System for the steps to take when booting up the DGX-2
System for the first time after a fresh installation.
Since the RAID array on the DGX-2 System is intended to be used as a cache and not for
long-term data storage, this should not be disruptive. However, if you are an advanced
user and have set up the disks for a non-cache purpose and want to keep the data on
those drives, then select the Install DGX Server without formatting RAID option at the
boot menu during the boot installation. This option retains data on the RAID disks and
performs the following:
Installs the cache daemon but leaves it disabled by commenting out the RUN=yes line
in /etc/default/cachefilesd.
Creates a /raid directory and leaves it out of the file system table by commenting out
the entry containing “/raid” in /etc/fstab.
Does not format the RAID disks.
When the installation is completed, you can repeat any configuration steps that you
had performed to use the RAID disks as other than cache disks.
You can always choose to use the RAID disks as cache disks at a later time by
enabling cachefilesd and adding /raid to the file system table as follows:
1. Uncomment the #RUN=yes line in /etc/default/cachefilesd.
2. Uncomment the /raid line in /etc/fstab.
3. Run the following:
a) Mount /raid.
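Step 3 a) typically corresponds to:
$ sudo mount /raid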
You must register your DGX-2 System in order to receive email notification whenever a
new software update is available.
These instructions explain how to update the DGX-2 software through an internet
connection to the NVIDIA public repository. The process updates a DGX-2 System
image to the latest QA’d versions of the entire DGX-2 software stack, including the
drivers, for the latest update within a specific release; for example, to update to the latest
Release 4.0 update from an earlier Release 4.0 version.
For instructions on upgrading from one release to another (for example, from Release 3.1
to Release 4.1), consult the release notes for the target release.
$ wget -O f5-download http://download.docker.com/linux/ubuntu/dists/bionic/Release
$ wget -O f6-international http://international.download.nvidia.com/dgx/repos/bionic/dists/bionic/Release
All the wget commands should be successful and there should be six files in the
directory with non-zero content.
UPDATE INSTRUCTIONS
! CAUTION: These instructions update all software for which updates are available from
your configured software sources, including applications that you installed yourself. If
you want to prevent an application from being updated, you can instruct the Ubuntu
package manager to keep the current version. For more information, see Introduction
to Holding Packages on the Ubuntu Community Help Wiki.
To prevent an application from being updated, instruct the Ubuntu package manager
to keep the current version. See Introduction to Holding Packages.
3. Upgrade to the latest version.
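On the Ubuntu-based DGX OS, the update steps generally follow the standard apt workflow; a hedged sketch:
$ sudo apt update            # refresh the package lists from the configured repositories
$ sudo apt full-upgrade      # upgrade all installed packages, including the DGX software stack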
This section provides instructions for updating firmware for the NVIDIA® DGX server
firmware using a Docker container.
See the DGX-2 System Firmware Update Container Release Notes for information about
each release.
For reference, the following naming scheme is used for the package, container image,
and run file, depending on the FW update container version.
Starting with version 19.03.1, the container naming format has changed from
nvfw_dgx2_version to nvfw_dgx2:tag, where tag indicates the version.
Example output after loading nvfw-dgx2_19.03.1.tar.gz.
The output shows the onboard version, the version in the manifest, and whether the
firmware is up-to-date.
COMMAND SYNTAX
sudo docker run --rm [-e auto=1] --privileged -ti -v /:/hostfs <image-name> update_fw [-f] <target>
all
to update all firmware components (SBIOS, BMC)
SBIOS
to update the SBIOS
BMC
to update the BMC firmware
Note: Other components may be supported beyond those listed here. Query the
firmware manifest to see all the components supported by the container.
The command will scan the specified firmware components and update any that are
down-level.
See the section Additional Options for an explanation of the [-e auto=1] and [-f]
options.
Note: While the progress output shows the current and manifest firmware versions, the
versions may be truncated due to space limitations. You can confirm the updated
version after the update is completed using the show_version option.
You can also update a subset of all the components. For example, to update both the
BMC firmware and the system BIOS, enter the following:
$ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:19.03.1 update_fw BMC SBIOS
ADDITIONAL OPTIONS
When the -f option is used, the container will not check the onboard versions against the manifest.
To update the firmware without encountering the prompt, omit the -ti option and
instead use the -e auto=1 and -t options as follows.
$ sudo docker run -e auto=1 --rm --privileged -t -v /:/hostfs <image-name> update_fw <target>
COMMAND SUMMARY
Show the manifest.
$ sudo docker run --rm --privileged -v /:/hostfs <image-name> show_fw_manifest
Show version information.
$ sudo docker run --rm --privileged -v /:/hostfs <image-name> show_version
Check the onboard firmware against the manifest and update any down-level
firmware.
$ sudo docker run --rm --privileged -ti -v /:/hostfs <image-name> update_fw <target>
Bypass the version check and update the firmware.
In this case, specify only the container repository and not the tag.
USING THE .RUN FILE
Make the .run file executable and then run it.
$ chmod +x /<run-file-name>.run
$ sudo ./<run-file-name>.run
This command is the same as running the container with the update_fw all
option.
The .run file accepts the same options that are used when running the container.
Examples:
Show the manifest.
$ sudo ./<run-file-name>.run show_fw_manifest
Show version information.
$ sudo ./<run-file-name>.run show_version
Check the onboard firmware against the manifest and update any down-level
firmware.
$ sudo ./<run-file-name>.run update_fw <target>
Bypass the version check and update the firmware.
$ sudo ./<run-file-name>.run update_fw -f <target>
TROUBLESHOOTING
Make sure all PSUs are fully inserted and that power cords to all PSUs are fully inserted
and secured. If the firmware update still fails, then run nvsm dump health and send
the resulting archive containing the output to NVIDIA Enterprise Support
(https://nvid.nvidia.com/dashboard/) for failure analysis.
Do not attempt any further firmware updates until the issue is resolved or cleared by
NVIDIA Enterprise Support.
The NVIDIA DGX-2 System comes with a baseboard management controller (BMC) for
monitoring and controlling various hardware devices on the system. It monitors system
sensors and other parameters.
3. Log in.
QuickLinks …
Provides quick access to several tasks.
Note: Depending on the BMC firmware version, the following quick links may appear:
• Maintenance->Firmware Update
• Settings->NvMeManagement->NvMe P3700Vpd Info
Do not access these tasks using the Quick Links dropdown menu, as the resulting pages
are not fully functional.
Sensor
Provides status and readings for system sensors, such as SSD, PSUs, voltages, CPU
temperatures, DIMM temperatures, and fan speeds.
FRU Information
Provides chassis, board, and product information for each FRU device.
Settings
Configure the following settings
Remote Control
Opens the KVM Launch page for accessing the DGX-2 console remotely.
Power Control
Perform various power actions
Maintenance
IMPORTANT: While you can update the BMC firmware from this page, NVIDIA
recommends using the NVIDIA Firmware Update Container instead (see section
Updating Firmware for instructions).
Do not update from versions earlier than 01.04.02 using the BMC UI, as the sensor data
record (SDR) is erroneously preserved, which can result in the BMC UI reporting a critical
3V Battery sensor error. This is corrected in version 01.04.02; updating from 01.04.02
does not preserve the SDR.
If you need to update from this page, click Dual Firmware Update and then select
whichever is the Current Active Image to update.
OVERVIEW
Note: NVIDIA KVM is also supported on the NVIDIA DGX-2H. References to DGX-2 in this
chapter also apply to DGX-2H.
The following diagram depicts an overview of the NVIDIA KVM architecture, showing
the hardware layer, the DGX Server KVM OS, and the virtual machines.
Using NVIDIA KVM, the DGX-2 System can be converted to include a bare metal
hypervisor to provide GPU multi-tenant virtualization. This is referred to as the DGX-2
KVM host. It allows different users to run concurrent deep learning jobs using multiple
virtual machines (guest GPU VMs) within a single DGX-2 System. Just like the bare-
metal DGX-2 System, each GPU-enabled VM contains a DGX OS software image which
includes NVIDIA drivers, CUDA, the NVIDIA Container Runtime for Docker, and other
software components for running deep learning containers.
Note: Unlike the bare-metal DGX-2 system or the KVM host OS, the guest VM OS is
configured for English only, with no option to switch to languages such as Chinese. To
set up a guest VM for a different language, install the appropriate language pack onto
the guest VM.
Example of installing a Chinese language pack:
Guest-vm-2g4-5:~$ sudo apt install language-pack-zh-hant language-pack-zh-hans language-pack-zh-hans-base language-pack-zh-hant-base
Running NVIDIA containers on the VM is just like running containers on a DGX-2 bare
metal system with DGX OS software installed.
While NVIDIA KVM turns your DGX system into a hypervisor supporting multiple
guest GPU VMs, it does not currently provide support for the following:
oVirt, virt-manager
The DGX-2 OS incorporates Ubuntu server, which does not include a graphics
manager required by oVirt and virt-manager.
Orchestration/resource manager
GPU VMs are static and cannot be altered once created.
NVMe drives as pass-through devices
To preserve the existing RAID configuration on the DGX-2 System and simplify the
process of reusing this resource if the server were ever to be reverted from KVM,
NVMe drives should not be set up as pass-through devices. However, if you want to use
NVMe as pass-through devices for performance reasons, refer to the KVM
Performance Tuning section of the DGX Best Practices guide for instructions.
The DGX-2 KVM host cannot be used to run containers.
NVIDIA GPUDirect™ is not supported on multi-GPU guest VMs across InfiniBand.
There is no guest UEFI BIOS support.
About nvidia-vm
Guest GPU VMs can be managed using the virsh (see https://linux.die.net/man/1/virsh)
program or using libvirt-based XML templates. For the NVIDIA KVM, NVIDIA has
taken the most common virsh options and configuration steps and incorporated them
into the tool nvidia-vm, provided with the DGX KVM package. nvidia-vm simplifies
the process of creating guest GPU VMs and allocating resources. In addition, you can
use nvidia-vm to modify default options to suit your needs for the VM and manage
VM images installed on the system.
You can view the man pages by entering the following from the DGX-2 KVM host.
man nvidia-vm
Note: Using nvidia-vm requires root or sudo privilege. This includes deleting VMs,
running health-check, or other operations.
This step updates the GRUB menu options so the Linux kernel is made KVM-ready,
and binds the virtualization drivers to the NVIDIA devices. It also creates the GPU
health database.
Example of selecting image dgx-kvm-image-4-1-1:
sudo apt-get install dgx-kvm-sw dgx-kvm-image-4-1-1
See the section Installing Images for more information about installing KVM guest
images, including how to view the image contents.
6. Reboot the system.
Rebooting the system is needed to finalize the KVM preparation of the DGX-2
System.
sudo reboot
Your DGX-2 System is now ready for you to create VMs.
! CAUTION: Reverting the server back to a bare metal system destroys all guest GPU VMs
that were created as well as any data. Be sure to save your data before removing the
KVM software.
The domain of each guest GPU VM is either based on the username of the VM creator
appended with a timestamp, or is specified by the VM creator. The domain is then
appended with a suffix to indicate the number of GPUs and their indices using the
format
<number-of-gpus>g<starting-index>-<ending-index>.
Examples:
Inspect the list to determine the GPU indices that are available to you.
Syntax
where
--gpu-count The allowed number of GPUs to assign to the VM, depending on availability.
Acceptable values: 1, 2, 4, 8, 16
--gpu-index For the purposes of the KVM, GPUs on the DGX-2 System are
distinguished by a zero-based, sequential index. gpu_index specifies the starting
index value for the group of sequentially indexed GPUs to be assigned to the VM.
Allowed values for gpu_index depend on the number of GPUs assigned to the VM,
as shown in the following table.
GPU Count    Allowed gpu-index Values
1            0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
2            0,2,4,6,8,10,12,14
4            0,4,8,12
8            0,8
16           0
--image (Optional) Specifies the KVM image to use as the basis for the VM. If not
specified, the latest version that is installed will be used. See the section
Managing the Images for instructions on how to install images and also how to
view which images are installed.
--user-data (Optional) Starting with dgx-kvm-sw 4.1.1, you can use cloud-init by specifying the
cloud-config file containing setup parameters for the VM. See section Using cloud-
init to Initialize the Guest VM.
--meta-data (Optional) Starting with dgx-kvm-sw 4.1.1, you can use cloud-init to specify the
meta-data file containing the meta-data that you want to include in your VM. See
section Using cloud-init to Initialize the Guest VM.
Command Help:
[sudo] nvidia-vm create --help
This command does not require “sudo”; however, using sudo affects the VM name as
described in this section.
Command Examples:
Basic command
Specifying an image
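The example commands themselves are not reproduced in this extract; based on the options described above, the invocations might look like the following sketch (the GPU count, index, and image name are illustrative).
$ sudo nvidia-vm create --gpu-count 4 --gpu-index 0                                  # basic command
$ sudo nvidia-vm create --gpu-count 4 --gpu-index 0 --image dgx-kvm-image-4-1-1      # specifying an image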
Using cloud-init involves creating two configuration files - cloud-config and instance-
data.json - and then calling them when creating the guest VM using the following
options:
--user-data <cloud-config>
--meta-data <meta-data file>
Example:
$ nvidia-vm create --gpu-count <#> --verbose --user-data /home/lab/cloud-config --meta-data /home/lab/instance-data.json
Shutting Down a VM
You can perform a graceful shutdown of a VM, which does the following:
Releases the CPUs, memory, GPUs, and NVLink
Retains allocation of the OS and data disks
Note: Since allocation of the OS and data disks are retained, the creation of other VMs
is still impacted by the shut-down VM.
In the event that nvidia-vm shutdown fails to shut down the VM (for example, if the
VM OS is unresponsive), you will need to delete the VM as explained in the section
Deleting a VM.
Starting an Inactive VM
To restart a VM that has been shut down (not deleted), run the following.
sudo nvidia-vm start <vm-domain>
You can also connect to the console automatically upon restarting the VM using the
following command.
sudo nvidia-vm start --console <vm-domain>
Deleting a VM
Like the process of creating a guest GPU VM, deleting a VM involves several virsh
commands. For this reason, NVIDIA provides a simple way to delete a VM using
nvidia-vm. Deleting a VM using nvidia-vm does the following:
Stops the VM if it is running
Erases data on disks that the VM is using and releases the disks
Deletes any temporary support files
You should delete your VM instead of merely stopping it in order to release all resources
and to remove unused files.
! CAUTION: VMs that are deleted cannot be recovered. Be sure to save any data before
deleting any VMs.
Syntax
sudo nvidia-vm delete --domain <vm-domain>
Command Help
sudo nvidia-vm delete --help
Command Examples
Deleting an individual VM
Stopping a VM
You can stop (or destroy) a VM, which forcefully stops the VM but leaves its resources
intact.
sudo nvidia-vm destroy --domain <vm-domain>
Rebooting a VM
sudo nvidia-vm reboot --domain <vm-domain> --mode <shutdown-mode>
Where the shutdown mode string is one of the following: acpi, agent, initctl,
signal, paravirt.
Determining IP Addresses
You can determine the IP address of your VM by entering the following.
virsh domifaddr <vm-domain> --source agent
This command returns IP addresses for the default network configuration (macvtap) as
well as private networks. Refer to the section Network Configuration for a description of
each network type.
Example:
$ virsh domifaddr 1gpu-vm-1g1 --source agent
The default guest VM login credentials are:
Login: nvidia
Password: nvidia
These can be changed. See the section Changing Login Credentials for instructions.
After creating a new user account, remove the default nvidia account:
deluser -r nvidia
To run virsh commands, the new user must then be added to the libvirt and libvirt-
qemu groups.
Using cloud-init
You can also use cloud-init to establish a unique username and password when you
create the VM. Specify the username and password in the cloud config file. See the
section Using cloud-init to Initialize the Guest VM for more information about using
cloud-init.
Using cloud-init
You can also use cloud-init to establish the SSH keys on a per-user basis when you
create the VM. Specify the SSH authorized keys in the cloud config file. See the section
Using cloud-init to Initialize the Guest VM for more information about using cloud-init.
MANAGING IMAGES
Guest GPU VMs are based on an installed KVM image using thin provisioning for
resource efficiency.
! IMPORTANT: A KVM guest VM runs a thin-provisioned copy of the source image. If the
source image is ever uninstalled, the guest VM may not work properly. To keep guest
VMs running uninterrupted, save the KVM source image to another location before
uninstalling it.
Syntax
sudo nvidia-vm image [options]
Command Help
sudo nvidia-vm image --help
Installing Images
The KVM image is typically installed at the time the KVM package is installed. Since
updated KVM images may be available from the repository, you can install any of these
images for use in creating a guest GPU VM.
To check available DGX KVM images, enter the following.
apt-cache policy dgx-kvm-image*
Syntax
apt show <kvm-image>
Example
<snip>
To install a KVM image from the list, use the nvidia-vm image install
command.
Syntax
sudo nvidia-vm image install <kvm-image>
Example
sudo nvidia-vm image install dgx-kvm-image-4-1-1
Note: This command applies only to VMs that are not running. Currently, the command
returns “Unknown” for any guest VMs that are running.
Uninstalling Images
If you convert the DGX-2 System from a KVM OS back to the bare metal system, you
need to uninstall all the dgx-kvm images that were installed.
! IMPORTANT: If you uninstall KVM images without converting the system back to bare
metal (for example, to recover space on the Hypervisor or to upgrade to a newer image),
then you should make a copy of the image first.
A KVM guest VM runs a thin-provisioned copy of the source image. If the source image is
ever uninstalled, the guest VM may not work properly. To keep guest VMs running
uninterrupted, save the KVM source image to another location before uninstalling it.
Guest OS Drive
DGX-2 KVM Host software uses the existing RAID-1 volume as the OS drive of each
guest (/dev/vda1), which by default is 50 GB. Since the OS drive resides on the RAID-1
array of the KVM Host, its data is always persistent.
Using the nvidia-vm tool, a system administrator can change the default OS drive size.
Data Drives
The DGX-2 KVM host software assigns a virtual disk to each guest GPU VM, referred to
here as the Data Drive. It is based on filesystem directory-based volumes and can be
used either as scratch space or as a cache drive.
DGX-2 software sets up a storage pool on top of the existing RAID-0 volume on the
KVM Host for Data Drives on the guests. The Data Drive is automatically carved out of
the storage pool by the nvidia-vm tool and allocated to each GPU VM (/dev/vdb1),
which is automatically mounted on /raid. The Data Drive size is pre-configured
according to the size of the GPU VM. For example, a 16-GPU VM gets a very large
Data Drive (see the Resource Allocations section for size details).
Since the Data Drive is created on the Host RAID-0 array, data is not intended to be
persistent. Therefore, when the GPU VM is destroyed, the Data Drive is automatically
deleted and data is not preserved.
Using the nvidia-vm tool, a system administrator can change the default Data Drive size.
$ virsh pool-list
Name State Autostart
-------------------------------------------
dgx-kvm-pool active yes
Create a VM:
To see the volumes that are created for each VM, enter the following.
nvidia@dgx2vm-rootTue1616-1g0:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 50G 0 disk
└─vda1 252:1 0 50G 0 part /
vdb 252:16 0 54.9G 0 disk
└─vdb1 252:17 0 54.9G 0 part /raid
nvidia@dgx2vm-rootTue1616-1g0:~$
NETWORK CONFIGURATION
Networks can be configured in several ways. The following table shows the available
options and describes their application.
Private Yes No No
Private Network
Specify --privateIP while creating the VM so that the second virtual network
interface will be added based on private-net network for Host-to-VM connectivity.
Example:
sudo nvidia-vm create --gpu-count 4 --gpu-index 12 --privateIP
Since the reboot step will stop any running guest VMs, they should be stopped first to
avoid an uncontrolled or unexpected interruption which can lead to corruption of the
VM.
! IMPORTANT: A KVM guest VM runs a thin-provisioned copy of the source image. If the
source image is ever uninstalled, the guest VM may not work properly. To keep guest
VMs running uninterrupted, save the KVM source image to another location before
uninstalling it.
SUPPLEMENTAL INFORMATION
Resource Allocations
By default, the KVM software assigns the following resources in approximate
proportion to the number of assigned GPUs:
GPU            1    2    4    8    16
vCPU/HT        5    10   22   46   92
InfiniBand     N/A  1    2    4    8
OS Drive (GB)  50   50   50   50   50
NVLink         N/A  1    3    6    6
Data drive values indicate the maximum space that will be used. The actual space is
allocated as needed.
You can use command options to customize memory allocation, OS disk size, and
number of vCPUs to assign.
Resource Management
NVIDIA KVM optimizes resources to maximize the performance of the VM.
vCPU
vCPUs are pinned to each VM to be NUMA-aware and to provide better VM
performance.
InfiniBand
InfiniBand (IB) devices are set up as passthrough devices to maximize performance.
GPU
GPUs are set up as passthrough devices to maximize performance.
Data Drive
Data drives are intended to be used as scratch space cache.
NVSwitch
NVSwitch assignments are optimized for NVLink peer-to-peer performance.
NVLink
An NVLink connection is the connection between each GPU and the NVSwitch fabric.
Each NVLink connection allows up to 25 GB/s uni-directional performance.
To identify failed GPUs, the KVM host automatically polls the state of any GPUs to be
used upon launching a VM. When a failed GPU is identified by the software, the DGX-2
System is marked as ‘degraded’ and operates in degraded mode until all bad GPUs are
replaced.
Examples:
Example output
+--------------------------+--------------------------------------+
| Health Monitor Report |
+==========================+======================================+
| Overall Health | Healthy |
+--------------------------+--------------------------------------+
The following is an example of launching a VM when GPU 12 and 13 have been marked
as degraded or in a failed state.
sudo nvidia-vm create --gpu-count 8 --gpu-index 8
Note: If you attempt to launch a VM with a failed GPU before the system has
identified its failed state, the VM will fail to launch but without an error
message. If this happens, keep trying to launch the VM until the message
appears.
Your VM should restart successfully if none of the associated GPUs failed. However, if
one or more of the GPUs associated with your VM failed, then the response depends on
whether the system has had a chance to identify the GPU as unavailable.
Failed GPU identified as unavailable
The system will return an error indicating that the GPU is missing or unavailable and
that the VM is unable to start.
Failed GPU not yet identified as unavailable
The server must be powered off when performing the replacement. After GPU
replacement and upon powering on the server, the KVM software runs a health scan to
add any new GPUs to the health database.
TROUBLESHOOTING TOOLS
This section discusses tools to assist in gathering data to help NVIDIA Enterprise
Services troubleshoot GPU VM issues. Most of the tools are used at the KVM host level.
If you are using guest VMs and do not have access to the KVM host, then request the
help of your system administrator.
Run nvsysinfo from within the guest VM as well as on the KVM host, collect the
output and provide to NVIDIA Enterprise Services.
virt-install.log
$ grep -iE 'error|fail' $HOME/.cache/virt-manager/virt-install.log
From the KVM host, connect to the VM console to verify guest VM operation.
$ virsh net-list
Example output
Name State Autostart Persistent
----------------------------------------------------------
macvtap-net active yes yes
private-net active yes yes
Example output:
$ virsh domifaddr 1gpu-vm-1g2 --source agent
You can configure the VM for Host-to-VM network connectivity by using privateIP. See
How to Configure the Guest VM with privateIP for instructions.
Instructions for fixing a degraded RAID-0 array and its impact on GPU VMs are currently
beyond the scope of this document.
Example output
Node SN Model Namespace Usage Format FW Rev
------------ -------------- -------------------------- -- -------------------- ---------- --------
/dev/nvme0n1 S2X6NX0K501953 SAMSUNG MZ1LW960HMJP-00003 1 61.79 GB / 960.20 GB 512 B + 0 B CXV8601Q
<snip> ...
3. View the health of each NVMe drive by running the following command:
:~$ sudo nvme smart-log /dev/nvme9n1
Example output
Smart Log for NVME device:nvme9n1 namespace-id:ffffffff
critical_warning : 0
<snip> ...
Example output
/dev/md0:
Version : 1.2
Creation Time : Tue Aug 13 08:23:52 2019
Raid Level : raid1
Array Size : 937034752 (893.63 GiB 959.52 GB)
Used Dev Size : 937034752 (893.63 GiB 959.52 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Name : dgx-18-04:0
UUID : 98c0057f:11b0d131:2b689147:c780f126
Events : 2014
Example Output
The DGX-2 KVM Host software uses the existing RAID-1 volume as the OS drive of each
Guest (/dev/vda1), which by default is 50 GB. Each GPU VM also gets a virtual disk
called the Data Drive. It is based on filesystem directory-based volumes and can be used
either as scratch space or as a cache drive.
From within the guest VM, run the following command.
:~# lsblk
Example output
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 50G 0 disk
└─vda1 252:1 0 50G 0 part /
vdb 252:16 0 13.9T 0 disk
└─vdb1 252:17 0 13.9T 0 part /raid
From within the Hypervisor, perform the following command and verify that the
output shows “No errors found”.
$ virsh domblkerror <vm-name>
No errors found
Examples
Check console/syslog
Example
$ sudo virt-df -d 1gpu-vm-1g0
List filesystems, partitions, block devices, LVM on the guest VM’s disk image.
Known Issues
For a list of known issues with using GPU VMs on DGX-2 systems, refer to DGX-2
Server Software Release Notes.
Reference Resources
The following are some useful resources for debugging and troubleshooting KVM
issues.
Linux KVM: Guest OS debugging
For security purposes, some installations require that systems be isolated from the
internet or outside networks. Since most DGX-2 software updates are accomplished
through an over-the-network process with NVIDIA servers, this section explains how
updates can be made when using an over-the-network method is not an option. It
includes a process for installing Docker containers as well.
Alternately, you can update the DGX-2 software by performing a network update from
a local repository. This method is available only for software versions that are available
for over-the-network updates.
! CAUTION: This process destroys all data and software customizations that you
have made on the DGX-2 System. Be sure to back up any data that you want to
preserve, and push any Docker images that you want to keep to a trusted
registry.
Using Cloud-init
Setting Up the Cloud-Config File
Set up a cloud-config file that your guest VM will use when it is created. Refer to
https://cloudinit.readthedocs.io/en/latest/topics/examples.html for a description of the
file format and options.
Following are the minimum set of options to appear in the cloud-config file.
name
The user’s login name. The default file contains a dummy value which must be
replaced with your own.
primary_group
Defines the primary group. Defaults to a new group named after the user.
The default file contains a dummy value which must be replaced with your own.
groups
shell
lock_passwd
passwd
The hash - not the password itself - of the password you want to use for this user.
ssh_authorized_keys
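A minimal sketch of such a cloud-config file follows; the user name, group, password hash, and SSH key are placeholders that must be replaced with your own values.
#cloud-config
users:
  - name: dgxuser                         # placeholder login name
    primary_group: dgxuser                # placeholder primary group
    groups: sudo
    shell: /bin/bash
    lock_passwd: false
    passwd: <hashed-password>             # hash of the password, not the password itself
    ssh_authorized_keys:
      - ssh-rsa AAAA... user@example.com  # placeholder public key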
Setting Up Instance-Data
Set up an instance-data.json file with the metadata that you want to include in your VM. Refer to https://cloudinit.readthedocs.io/en/latest/topics/instancedata.html#format-of-instance-data-json for a list of metadata attributes and file format.
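For illustration only, a small instance-data.json file might be created as follows. The keys shown are assumptions based on the standardized v1 namespace described at the link above, and the values are placeholders; consult that documentation for the authoritative schema and any keys your deployment requires.
# Illustrative sketch only: keys assumed from the cloud-init v1 instance-data namespace
$ cat > instance-data.json <<'EOF'
{
  "v1": {
    "instance_id": "gpu-vm-01",
    "local_hostname": "gpu-vm-01"
  }
}
EOF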
Safety Information
To reduce the risk of bodily injury, electrical shock, fire, and equipment damage, read
this document and observe all warnings and precautions in this guide before installing
or maintaining your server product.
In the event of a conflict between the information in this document and information
provided with the product or on the website for a particular product, the product
documentation takes precedence.
Your server should be integrated and serviced only by technically qualified persons.
You must adhere to the guidelines in this guide and the assembly instructions in your
server manuals to ensure and maintain compliance with existing product certifications
and approvals. Use only the described, regulated components specified in this guide.
Use of other products / components will void the UL Listing and other regulatory
approvals of the product, and may result in noncompliance with product regulations in
the region(s) in which the product is sold.
Symbol Meaning
CAUTION Indicates the presence of a hazard that may cause minor personal
injury or property damage if the CAUTION is ignored.
The rail racks are designed to carry only the weight of the server
system. Do not use rail-mounted equipment as a workspace. Do not
place additional load onto any rail-mounted equipment.
This product was evaluated as Information Technology Equipment (ITE), which may be
installed in offices, schools, computer rooms, and similar commercial type locations. The
suitability of this product for other product categories and environments (such as
medical, industrial, residential, alarm systems, and test equipment), other than an ITE
application, may require further evaluation.
Site Selection
Choose a site that is:
Clean, dry, and free of airborne particles (other than normal room dust).
Well-ventilated and away from sources of heat including direct sunlight and
radiators.
Away from sources of vibration or physical shock.
In regions that are susceptible to electrical storms, we recommend you plug your
system into a surge suppressor and disconnect telecommunication lines to your
modem during an electrical storm.
Provided with a properly grounded wall outlet.
Provided with sufficient space to access the power supply cord(s), because they serve
as the product's main power disconnect.
Conform to local occupational health and safety requirements when moving and
lifting equipment.
Use mechanical assistance or other suitable assistance when moving and lifting
equipment.
Electrical Precautions
Power and Electrical Warnings
Caution: The power button, indicated by the stand-by power marking, DOES NOT
completely turn off the system AC power; standby power is active whenever the system
is plugged in. To remove power from the system, you must unplug the AC power cord from the wall outlet. Make sure all AC power cords are unplugged before you open the chassis, or add or remove any non-hot-plug components.
Do not attempt to modify or use an AC power cord if it is not the exact type required. A
separate AC cord is required for each system power supply.
Some power supplies in servers use Neutral Pole Fusing. To avoid risk of shock, use caution when working with power supplies that use Neutral Pole Fusing.
The power supply in this product contains no user-serviceable parts. Do not open the power supply. Hazardous voltage, current, and energy levels are present inside the power supply. Return it to the manufacturer for servicing.
When replacing a hot-plug power supply, unplug the power cord to the power supply
being replaced before removing it from the server.
To avoid risk of electric shock, turn off the server and disconnect the power cords,
telecommunications systems, networks, and modems attached to the server before
opening it.
Caution: To avoid electrical shock or fire, check the power cord(s) that will be used with
the product as follows:
Do not attempt to modify or use the AC power cord(s) if they are not the exact type
required to fit into the grounded electrical outlets.
The power cord(s) must meet the following criteria:
● The power cord must have an electrical rating that is greater than the current rating marked on the product.
● The power cord must have a safety ground pin or contact that is suitable for the electrical outlet.
● The power supply cord(s) is/are the main disconnect device to AC power. The socket outlet(s) must be near the equipment and readily accessible for disconnection.
● The power supply cord(s) must be plugged into socket-outlet(s) that is/are provided with a suitable earth ground.
Caution: If the server has been running, any installed processor(s) and heat sink(s)
may be hot.
Unless you are adding or removing a hot-plug component, allow the system to cool
before opening the covers. To avoid the possibility of coming into contact with hot
component(s) during a hot-plug installation, be careful when removing or installing the
hot-plug component(s).
Caution: To avoid injury, do not contact moving fan blades. Your system is supplied with a guard over the fan; do not operate the system without the fan guard in place.
Note: The following installation guidelines are required by UL for maintaining safety
compliance when installing your system into a rack.
Install equipment in the rack from the bottom up with the heaviest equipment at the
bottom of the rack.
You are responsible for installing a main power disconnect for the entire rack unit. This
main disconnect must be readily accessible, and it must be labeled as controlling power
to the entire unit, not just to the server(s).
To avoid risk of potential electric shock, a proper safety ground must be implemented
for the rack and each piece of equipment installed in it.
Reduced Air Flow - Installation of the equipment in a rack should be such that the amount of air flow required for safe operation of the equipment is not compromised.
Mechanical Loading - Mounting of the equipment in the rack should be such that a hazardous condition is not achieved due to uneven mechanical loading.
Particular attention should be given to supply connections other than direct connections
to the branch circuit (e.g. use of power strips).
Caution: ESD can damage drives, boards, and other parts. We recommend that you
perform all procedures at an ESD workstation. If one is not available, provide some ESD
protection by wearing an antistatic wrist strap attached to chassis ground -- any
unpainted metal surface -- on your server when handling parts.
Always handle boards carefully. They can be extremely sensitive to ESD. Hold boards
only by their edges. After removing a board from its protective wrapper or from the
server, place the board component side up on a grounded, static free surface. Use a
conductive foam pad if available but not the board wrapper. Do not slide board over
any surface.
Other Hazards
NICKEL
NVIDIA Bezel. The bezel’s decorative metal foam contains some nickel. The metal
foam is not intended for direct and prolonged skin contact. Please use the handles to
remove, attach or carry the bezel. While nickel exposure is unlikely to be a problem, you
should be aware of the possibility in case you’re susceptible to nickel-related reactions.
Battery Replacement
Caution: There is the danger of explosion if the battery is incorrectly replaced. When
replacing the battery, use only the battery recommended by the equipment
manufacturer.
The NVIDIA DGX-2 is compliant with the regulations listed in this section.
United States
Product: DGX-2, DGX-2H
This device complies with part 15 of the FCC Rules. Operation is subject to the following
two conditions: (1) this device may not cause harmful interference, and (2) this device
must accept any interference received, including any interference that may cause
undesired operation of the device.
NOTE: This equipment has been tested and found to comply with the limits for a Class
A digital device, pursuant to part 15 of the FCC Rules. These limits are designed to
provide reasonable protection against harmful interference when the equipment is
operated in a commercial environment. This equipment generates, uses, and can radiate
radio frequency energy and, if not installed and used in accordance with the instruction
manual, may cause harmful interference to radio communications. Operation of this
equipment in a residential area is likely to cause harmful interference in which case the
user will be required to correct the interference at his own expense.
California Department of Toxic Substances Control: Perchlorate Material - special handling may
apply. See www.dtsc.ca.gov/hazardouswaste/perchlorate.
Canada
Product: DGX-2
CAN ICES-3(A)/NMB-3(A)
The Class A digital apparatus meets all requirements of the Canadian Interference-
Causing Equipment Regulation.
Cet appareil numérique de la classe A respecte toutes les exigences du Règlement sur le matériel brouilleur du Canada.
CE
Product: DGX-2
This is a Class A product. In a domestic environment this product may cause radio
frequency interference in which case the user may be required to take adequate
measures.
Japan
Product: DGX-2, DGX-2H
In a domestic environment this product may cause radio interference, in which case the
user may be required to take corrective actions. VCCI-A
2008年、日本における製品含有表示方法、JIS C 0950が公示されました。製造事業者は、2006年7月1日以降に販売される電気・電子機器の特定化学物質の含有に付きまして情報提供を義務付けられました。製品の部材表示に付きましては、
A Japanese regulatory requirement, defined by specification JIS C 0950, 2008, mandates that
manufacturers provide Material Content Declarations for certain categories of electronic products
offered for sale after July 1, 2006.
To view the JIS C 0950 material declaration for this product, visit www.nvidia.com
日本工業規格JIS C 0950:2008により、2006年7月1日以降に販売される特定分野の電気および電子機器について、製造者による含有物質の表示が義務付けられます。
機器名称：DGX-2

主な分類                                  特定化学物質記号
                                          Pb        Hg   Cd   Cr(VI)   PBB   PBDE
筐体                                      除外項目   0    0    0        0     0
プリント基板                              除外項目   0    0    0        0     0
プロセッサー                              除外項目   0    0    0        0     0
マザーボード                              除外項目   0    0    0        0     0
電源                                      除外項目   0    0    0        0     0
システムメモリ                            除外項目   0    0    0        0     0
ハードディスクドライブ                    除外項目   0    0    0        0     0
ケーブル/コネクター                       除外項目   0    0    0        0     0
はんだ付け材料                            0          0    0    0        0     0
フラックス、クリームはんだ、ラベル、その他消耗品
                                          0          0    0    0        0     0
注：
1. 「0」は、特定化学物質の含有率が日本工業規格JIS C 0950:2008に記載されている含有率基準値より低いことを示します。
2. 「除外項目」は、特定化学物質が含有マークの除外項目に該当するため、特定化学物質について、日本工業規格JIS C 0950:2008に基づく含有マークの表示が不要であることを示します。
3. 「0.1wt%超」または「0.01wt%超」は、特定化学物質の含有率が日本工業規格JIS C 0950:2008に記載されている含有率基準値を超えていることを示します。
A Japanese regulatory requirement, defined by specification JIS C 0950: 2008, mandates that manufacturers provide Material Content
Declarations for certain categories of electronic products offered for sale after July 1, 2006.
Category                                                    Pb      Hg   Cd   Cr(VI)   PBB   PBDE
Chassis                                                     Exempt  0    0    0        0     0
PCA                                                         Exempt  0    0    0        0     0
Processor                                                   Exempt  0    0    0        0     0
Motherboard                                                 Exempt  0    0    0        0     0
Power supply                                                Exempt  0    0    0        0     0
System memory                                               Exempt  0    0    0        0     0
Hard drive                                                  Exempt  0    0    0        0     0
Cables/Connectors                                           Exempt  0    0    0        0     0
Soldering material                                          0       0    0    0        0     0
Flux, solder paste, label and other consumable materials    0       0    0    0        0     0
Notes:
1. “0” indicates that the level of the specified chemical substance is less than the threshold level specified in the standard, JIS C 0950: 2008.
2. “Exempt” indicates that the specified chemical substance is exempt from marking and it is not required to display the marking for that
specified chemical substance per the standard, JIS C 0950: 2008.
3. “Exceeding 0.1wt%” or “Exceeding 0.01wt%” is entered in the table if the level of the specified chemical substance exceeds the threshold
level specified in the standard, JIS C 0950: 2008.
This product meets the applicable EMC requirements for Class A ITE (Information Technology Equipment).
China
Product: DGX-2
产品中有害物质的名称及含量
The Table of Hazardous Substances and their Content
根据中国《电器电子产品有害物质限制使用管理办法》
as required by China’s Management Methods for the Restriction of Hazardous Substances Used in Electrical and Electronic Products
部件名称                                                    有害物质 Hazardous Substances
Parts                                                       铅      汞     镉     六价铬      多溴联苯   多溴联苯醚
                                                            (Pb)    (Hg)   (Cd)   (Cr(VI))    (PBB)      (PBDE)
机箱 Chassis                                                X       O      O      O           O          O
印刷电路部件 PCA                                            X       O      O      O           O          O
处理器 Processor                                            X       O      O      O           O          O
主板 Motherboard                                            X       O      O      O           O          O
电源设备 Power supply                                       X       O      O      O           O          O
存储设备 System memory                                      X       O      O      O           O          O
硬盘驱动器 Hard drive                                       X       O      O      O           O          O
机械部件 (风扇、散热器、面板等)
Mechanical parts (fan, heat sink, bezel…)                   X       O      O      O           O          O
线材/连接器 Cables/Connectors                               X       O      O      O           O          O
焊接金属 Soldering material                                 O       O      O      O           O          O
助焊剂,锡膏,标签及其他耗材
Flux, Solder Paste, label and other consumable materials    O       O      O      O           O          O
注:环保使用期限的参考标识取决于产品正常工作的温度和湿度等条件
Note: The referenced Environmental Protection Use Period Marking was determined according to normal
operating use conditions of the product such as temperature and humidity.
THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED
FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE
USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE
PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE
SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF
FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART,
FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.
NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use
without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is
customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the
necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s
product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions
and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage,
costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to
this guide, or (ii) customer product designs.
Other than the right for customer to use the information in this guide with the product, no other license, either expressed or
implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if
reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated
conditions, limitations, and notices.
Trademarks
NVIDIA, the NVIDIA logo, and DGX-2 are trademarks and/or registered trademarks of NVIDIA Corporation in the United States
and other countries. Other company and product names may be trademarks of the respective companies with which they are
associated.
Copyright
© 2018-2019 NVIDIA Corporation. All rights reserved.
www.nvidia.com