Open
Description
I have two GPUs on one node:
- GPU-2cf4c3f6-25e7-7c24-d621-479c5d7150fd;
- GPU-4969baef-c6e6-039b-5e1d-c72f494347b6.
Both are visible in the screenshot below:
However, we only see metrics from one GPU, GPU-4969baef-c6e6-039b-5e1d-c72f494347b6:
and no metrics from GPU-2cf4c3f6-25e7-7c24-d621-479c5d7150fd.
We are sure that GPU-2cf4c3f6-25e7-7c24-d621-479c5d7150fd is under load:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:2A:00.0 Off |                    0 |
| N/A   48C    P0            194W /  300W |   74367MiB /  81920MiB |     91%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  |   00000000:3D:00.0 Off |                    0 |
| N/A   49C    P0            190W /  300W |   74367MiB /  81920MiB |     89%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         4184729      C   ...3) --multiprocessing-fork_TP0      74358MiB |
|    1   N/A  N/A         4184730      C   ...7) --multiprocessing-fork_TP1      74358MiB |
+-----------------------------------------------------------------------------------------+
What could be the problem?
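To narrow this down, it may help to check the raw metrics output directly for both UUIDs, rather than relying on the dashboard. The following is only a minimal sketch, assuming the metrics are exposed as Prometheus-style text and that each GPU's UUID appears somewhere in its sample lines. The function name `missing_gpu_uuids` and the metric name in the sample are hypothetical and not taken from any exporter.

```python
import re

# UUIDs of the two GPUs on the node, as listed in this report.
EXPECTED_UUIDS = {
    "GPU-2cf4c3f6-25e7-7c24-d621-479c5d7150fd",
    "GPU-4969baef-c6e6-039b-5e1d-c72f494347b6",
}

# Matches a GPU UUID anywhere in a line, e.g. inside a label value.
UUID_RE = re.compile(r"GPU-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def missing_gpu_uuids(metrics_text, expected=EXPECTED_UUIDS):
    """Return the expected UUIDs that never appear in any sample line."""
    seen = set()
    for line in metrics_text.splitlines():
        # Skip comment lines (HELP/TYPE) of the text exposition format.
        if line.startswith("#"):
            continue
        seen.update(UUID_RE.findall(line))
    return set(expected) - seen


# Hypothetical sample: the metric name and label names are made up.
sample = (
    "# HELP example_gpu_util GPU utilization\n"
    'example_gpu_util{gpu="1",UUID="GPU-4969baef-c6e6-039b-5e1d-c72f494347b6"} 89\n'
)
print(missing_gpu_uuids(sample))
```

If the first UUID shows up as missing in the raw output too, the gap is on the exporter side and not in the dashboard or its queries.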
eabykov