

@ewagner12
Contributor

This adds another function to the script called "remove", which stops the display manager, unloads the driver, removes the PCIe addresses of the eGPU, reloads the driver if necessary, then restarts the display manager. This is purely an enhancement. In my experience, this allows physically unplugging the eGPU anytime after the "remove" process completes; without it, I would experience a system hang if the eGPU were unplugged. Let me know your experience with unplugging the eGPU, whether this works for you or not, or if you have another way.
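Roughly, the sequence looks like this (a sketch only: the PCIe address below is a placeholder, and the real script determines the driver and addresses itself):

systemctl stop display-manager.service
modprobe -r amdgpu                                 # unload the eGPU driver (nvidia on NVIDIA setups)
echo 1 > /sys/bus/pci/devices/0000:0a:00.0/remove  # detach the eGPU's PCIe address (placeholder address)
modprobe amdgpu                                    # reload the driver if other devices still need it
systemctl start display-manager.service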

Note that AMD card users will see the following error message:
[drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon
Of the dozens of removals I've tried, only one produced an error that required a hard reboot to solve. It seems patches are in the works for better AMD removal (which might not even require restarting X). As it stands with current kernels, all the steps I've added are strictly necessary for unplugging, and they should remain the "safe" option even if better unplugging support comes later.

Also, speaking of the kernel: I tried a couple of older mainline kernels and found 5.0+ to be the requirement. Earlier kernels did not work, while later ones worked fine on both the Linux Mint and Ubuntu systems I tested. Again, let me know what you think of this. I'd like to see it tested with more laptop configurations, especially dGPU+eGPU. As it is, I would consider this a "beta" feature that the user can call only if they want.

Finally, this also cleans up a little bit of code from my previous PR by allowing it to read the hex ID from the is_egpu_connected function.

Adds a function to remove the eGPU's drivers and PCIe addresses. In my testing, this allows physically unplugging the eGPU after switching to internal mode and logging out; without it, unplugging would cause a system hang. Tested on Ubuntu 20.04 with kernel 5.4.
Made eGPU PCIe address removal a separate function and cleaned up some existing code
Checks whether any devices need the GPU drivers (amdgpu, nvidia) before automatically reloading those drivers. Loads the drivers if any devices still need them, but doesn't if none do. Prevents errors and the loading of unnecessary modules.
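That check follows the same pattern that comes up later in this thread (a sketch; ${vga_driver} holds the driver name, e.g. amdgpu or nvidia):

if [ $(lspci -k | grep -c ${vga_driver}) -gt 0 ]; then
	modprobe ${vga_driver}
fi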
@hertg
Owner

hertg commented Jul 10, 2020

Thanks for the PR!
I will have a look at it and test it on my machine.

Could you elaborate on the ( trap '' HUP TERM ... method?
I am not that familiar with the trap command; does the code inside those brackets execute on script exit?
So the indented code actually executes after the systemctl stop display-manager.service, right?

@ewagner12
Contributor Author

That's correct, it executes the code in the parentheses after the display manager is stopped. Normally, user-started processes are killed when the display manager stops, so without the trap command the script would stop before running the rest of the commands. This webpage describes the HUP and TERM signals used (the same as SIGHUP and SIGTERM).
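A minimal sketch of the pattern (illustrative only; the body here is abbreviated, not the exact script code):

( trap '' HUP TERM	# ignore the signals sent to the session when the display manager stops
	systemctl stop display-manager.service
	# ... unload drivers, remove PCIe addresses ...
	systemctl start display-manager.service
)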

Another method that I tried and got working was to make the remove function a separate script and add a remove.service file that calls it. Then the egpu-switcher script can just run systemctl start remove.service, which also performs the full removal without being stopped or needing the trap command. I can make a separate branch with that method if you want to see it. In some ways this would be the more elegant solution, but it also relies more on systemd.
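For reference, that alternative would look roughly like this (a hypothetical unit; the script path is a placeholder for the separate removal script):

[Unit]
Description=eGPU removal helper

[Service]
Type=oneshot
ExecStart=/usr/local/bin/egpu-remove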

@hertg
Owner

hertg commented Jul 11, 2020

Alright, thanks for the explanation.

Although I still don't get 100% why this code works. 😄
Wouldn't the trap command prevent the display-manager from actually stopping until all of the code has executed? Just looking at the code, I would expect it to get stuck in the while loop indefinitely (?)

while [ "$(systemctl status display-manager | awk '/Active:/{print$2}')" = "active" ]; do
	sleep 1
done

No need for the other PR. Even though I'm still a bit confused, the trap command seems to be a more elegant solution to the problem than creating a separate systemd service just for the remove method. It's also easier to manage in terms of updates and such. :)

@hertg
Owner

hertg commented Jul 12, 2020

I tested the function on my machine, but it doesn't seem to work as expected. I really appreciate your contributions code-wise and your help in the issues, so I'm trying to be as thorough as possible.

Display-Manager doesn't start automatically
While connected to the eGPU, running sudo egpu-switcher remove and confirming with y does change the Xorg configuration. However, restarting the display-manager doesn't seem to work: the display-manager stops, but then I end up on a black screen with a blinking cursor on tty7.

Switching to tty2 and running sudo systemctl status display-manager confirms that the display-manager service stays "inactive". I still need to manually start the display-manager with sudo systemctl start display-manager in another tty to make it work.

eGPU can't be unplugged after executing the remove method

this allows physically unplugging the eGPU anytime after the "remove" process completes.

I wasn't able to reproduce that either. Running egpu-switcher remove, then manually starting the display-manager and unplugging the eGPU, still eventually resulted in the system hanging after login.

All in all, the behavior of the following two procedures was identical on my system, both resulting in a system freeze after login.

  1. Using the remove method
    1.1 sudo egpu-switcher remove
    1.2 sudo systemctl start display-manager
    1.3 Unplug the eGPU
    1.4 Login
    1.5 freeze

  2. Not using the remove method
    2.1 sudo egpu-switcher switch internal
    2.2 sudo systemctl restart display-manager
    2.3 Unplug the eGPU
    2.4 Login
    2.5 freeze

Unplugging the eGPU before starting display-manager does work
I also ran a third procedure to test whether the behaviour differs when I unplug the eGPU before (re)starting the display-manager.

  1. sudo egpu-switcher remove
  2. Unplug the eGPU before starting the display-manager
  3. sudo systemctl start display-manager

This method did indeed work better (meaning the system didn't freeze after login). But the behavior didn't seem to differ between the remove method and the plain switch method.

Personal bad experiences with hot-plugging
In general, I've had bad experiences with trying to hot-plug my eGPU/dock in the past, sometimes resulting in weird keystroke glitches from the keyboard connected through Thunderbolt. I experienced these issues during my tests today as well.

When I removed my eGPU with the third method mentioned above (unplug before starting the display-manager) and then re-connected it with the following procedure:

  1. Plug in the eGPU
  2. sudo egpu-switcher switch egpu
  3. sudo systemctl restart display-manager

The eGPU reconnected without a problem, but my keystrokes weren't properly recognized after that. It sometimes missed keystrokes and even some key-releases, so it kept entering a key that I wasn't pressing anymoreeeeeeeeeeeeeee (you get what I mean 😄).

Summary
Because of those issues, I'm a bit reluctant to hot-plug my eGPU, and I'd rather just reboot my machine for a more stable experience. Given that your method still needs to restart the display-manager, and therefore kills all of the user's processes, it doesn't seem to be a big advantage compared to a reboot. Because of these two points, I'm not sure if we should add this feature.

If you still want to add it as an experimental feature and troubleshoot the issues I've had, I'm happy to further test it for you. Just tell me what I should do or if there's anything more you need to know.

It would be great if you could open another PR just for the cleanup with the HEX-IDs; that seems unrelated to the new remove feature, and a bit of code cleanup is always welcome. :)

Background information about my setup

  1. I'm running Arch Linux with AwesomeWM and LightDM (kernel 5.7.7-arch1-1)
  2. I'm using a Thinkpad X1 Extreme with an internal dedicated NVIDIA GTX1050TI MaxQ.
  3. Intel integrated graphics are completely disabled on my machine by setting Graphics to Dedicated rather than Hybrid in the BIOS. (I did that because I've otherwise experienced some issues when working with the (non-eGPU) Thunderbolt dock in my office.)
  4. My xorg.conf.internal is completely empty, therefore not specifying the NVIDIA driver explicitly for the internal card.
  5. My eGPU is a Mantiz Venus MZ-02 with an NVIDIA GTX 1080. (I do also use the eGPU as docking station, meaning my peripherals are all connected through the eGPU)

@ewagner12
Contributor Author

Thanks for the comments. I was concerned that the removal would work differently on NVIDIA dGPU + eGPU setups, since I can't test that. After you do sudo egpu-switcher remove, does lspci still show the eGPU card? Before starting the display manager, can you run sudo modprobe nvidia and see if there's an error there as well?

Hot-plugging can be problematic, I agree. The main issue I'd like to address is the eGPU removal. I should mention that not all user processes are stopped when doing this; for example, something non-graphical running on a different tty can survive. I definitely agree that this would be an experimental feature if added, but I'd like to try to get it working.

@hertg
Owner

hertg commented Jul 13, 2020

After you do sudo egpu-switcher remove does lspci still show the eGPU card?

The eGPU still shows up in lspci after the remove command. I wrote the output of lspci to a file before and after running sudo egpu-switcher remove; comparing both files with diff lspci_before_remove lspci_after_remove shows no difference in the output.

can you do sudo modprobe nvidia and see if there's an error there as well

Running sudo modprobe nvidia doesn't print anything (adding the verbose flag -v didn't print anything either).

I did also try running the commands from the remove method manually, and it seems it's unable to remove the nvidia_uvm and nvidia modules.

sudo egpu-switcher switch internal
sudo systemctl stop display-manager
sudo modprobe -r nvidia_uvm # didn't work
sudo modprobe -r nvidia_drm # works
sudo modprobe -r nvidia_modeset # works
sudo modprobe -r nvidia # didn't work

I get the following error on these two modules:

modprobe: FATAL: Module nvidia is in use.

This made me think that I should probably add a logging feature with a bit more verbose output to /var/log/egpu-switcher.log or something like that. It would definitely also help a lot in assisting users that run into trouble :)
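Something as simple as this would probably do (a hypothetical sketch, not existing script code):

LOGFILE=/var/log/egpu-switcher.log
log() {
	echo "$(date '+%F %T') $*" >> "$LOGFILE"	# timestamped line appended to the log file
}
log "unloading nvidia modules"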

I should mention that not all user processes are stopped when doing this.

I'm aware of that, but my guess is that 99% of people using the egpu-switcher script don't really run stuff on other ttys.

I definitely agree that this would be an experimental feature if added, but I'd like to try to get it working.

Sounds great to me, let me know if I can further assist you with testing.

Stops the nvidia persistence daemon before driver removal, since it could cause issues removing the eGPU if persistence mode was on.
@ewagner12
Contributor Author

I just added a line to stop the nvidia persistence daemon. This fixes an issue on my end; let me know if it fixes your issues.
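The added step amounts to something like this (assuming the daemon runs as the usual nvidia-persistenced systemd unit):

systemctl stop nvidia-persistenced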

@hertg
Owner

hertg commented Jul 14, 2020

Unfortunately this didn't fix any of my issues.

A quick look at sudo systemctl status nvidia-persistenced also revealed that this service doesn't seem to be active before running the remove command.

● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: disabled)
     Active: inactive (dead)

@ewagner12
Contributor Author

Can you try blacklisting the nvidia_uvm module? Obviously that's not a permanent solution, but I want to make sure that module is the problem.
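For a quick test, something like this should keep it from being auto-loaded (hypothetical file name; delete the file again after testing):

echo "blacklist nvidia_uvm" | sudo tee /etc/modprobe.d/egpu-test.conf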
Also, can you post which driver version you are using? I'm testing with 440.100 on a GT 710. Here are the packages I have installed as well (though this is on Ubuntu 20.04, so they might have different names on Arch): nvidia-compute-utils-440 nvidia-dkms-440 nvidia-driver-440 nvidia-kernel-common-440 nvidia-kernel-source-440 nvidia-prime nvidia-settings nvidia-utils-440
The nvidia-persistenced service is always active on my machine. I tried AwesomeWM and the mainline 5.7.7 kernel and still couldn't find the problem. Maybe run sudo lsof | grep nvidia-uvm to try to see what is using that module?

@hertg
Owner

hertg commented Jul 14, 2020

Oops, my bad. I've found via sudo lsof | grep nvidia-uvm that I've been running the Folding@Home service in the background. Obviously that one blocked the nvidia modules.

After stopping and disabling the foldingathome.service, all of the specified modules can be removed via modprobe -r <modulename>. Maybe adding a check for whether the drivers are still in use, and printing an error about it, would help in the future (I'm aware that the output can't be seen by the user as of now, but that could be resolved with a log file later).
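A sketch of such a check (illustrative, not necessarily the committed code; the third lsmod column is the module's use count):

used=$(lsmod | awk '$1=="nvidia" {print $3}')	# use count of the nvidia module, empty if not loaded
if [ -n "$used" ] && [ "$used" -gt 0 ]; then
	echo "Error: nvidia module is still in use" >&2
fi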

The remove method still didn't work after that, but with some debugging I found that the nvidia_drm module also needs to be loaded, and I had to add a small delay. Without these changes, the display-manager failed to start.

if [ $(lspci -k | grep -c ${vga_driver}) -gt 0 ]; then
	modprobe ${vga_driver}
	modprobe nvidia_drm # added this
	sleep 1 # added this
fi

Obviously, the nvidia_drm shouldn't be hardcoded there; that's just for testing purposes.
But with these changes, I was able to get the remove function working on my system. I'm also able to unplug my eGPU without getting a system freeze.

For completeness, here are my installed nvidia packages.

sudo pacman -Qs nvidia

local/egl-wayland 1.1.5-1
    EGLStream-based Wayland external platform
local/lib32-nvidia-utils 450.57-1
    NVIDIA drivers utilities (32-bit)
local/libvdpau 1.4-1
    Nvidia VDPAU library
local/libxnvctrl 450.57-1
    NVIDIA NV-CONTROL X extension
local/nvidia 450.57-2
    NVIDIA drivers for linux
local/nvidia-dkms 450.57-2
    NVIDIA drivers - module sources
local/nvidia-settings 450.57-1
    Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 450.57-2
    NVIDIA drivers utilities
local/opencl-nvidia 450.57-2
    OpenCL implemention for NVIDIA

I'm on the kernel 5.7.8-arch1-1 right now.

@ewagner12
Contributor Author

Nice! Thanks for doing the debugging work on this. It wasn't as easy as I'd hoped, but I'm glad you got it working. I'll consider how best to implement the check for whether the drivers are still in use and make a commit for all those changes.
I should've thought of F@H as a potential issue. I was using it this winter, but replaced it with BOINC projects recently.

Added a check for whether the drivers are in use before unloading them. Previously, a program like Folding@Home could keep the drivers active and prevent the remove function from completing, causing a black screen. Now, if remove is called while a program like F@H is active, the script prints an error, stops attempting the removal, and restarts the display manager.
@ewagner12
Contributor Author

Hey @hertg, have you had time to test my latest commits? No worries, I know we're all busy; I just want to get this PR wrapped up if possible.

@hertg
Owner

hertg commented Aug 2, 2020

Just tested it, and it worked flawlessly 😄

@hertg hertg merged commit 18952b6 into hertg:master Aug 2, 2020