-
-
Notifications
You must be signed in to change notification settings - Fork 58
Add eGPU "remove" function #44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adds a function to remove the eGPU's drivers and PCIe addresses. In my testing, this allows physically unplugging after switching to the internal mode and logging out. In my testing, without this unplugging would cause a system hang. Tested on Ubuntu 20.04 with kernel 5.4.
Made eGPU pcie address removal separate function and cleanup some existing code
Checks whether any devices need the GPU drivers (amdgpu, nvidia) before automatically reloading those drivers. Loads the drivers if any devices still need it, but doesn't if none do. Prevents errors and loading of unnecessary modules.
|
Thanks for the PR! Could you elaborate on the |
|
That's correct, it executes the code in the parenthesis after the display manager is stopped. Without the trap command normally user started processes stop when the display manager stops so the script stops without doing the rest of the commands. This webpage describes the HUP and TERM signals used (same as SIGHUP and SIGTERM). Another method that I tried and got working was to make the remove function a different script and make a remove.service file that calls the separate removal script. Then in the egpu-switcher script it can just do |
|
Alright, thanks for the explanation. Although i still don't get 100% why this code works. 😄 while [ "$(systemctl status display-manager | awk '/Active:/{print$2}')" = "active" ]; do
sleep 1
doneNo need for the other PR, even though I'm still a bit confused, the trap command seems to be a more elegant solution to the problem than creating a separate systemd service just for the remove method. It's also easier to manage in terms of updates and stuff :) |
|
I tested the function on my machine, but it doesn't seem to work as expected. I really appreciate your contributions code-wise and your help in the issues, so I'm trying to be as thorough as possible. Display-Manager doesn't start automatically Switching to tty2 and running eGPU can't be unplugged after executing the remove method
Additionally, I also wasn't able to reproduce that. Running All-in-all, the behavior of the following two procedures were identical on my system. Both resulting in a system freeze after login.
Unplugging the eGPU before starting display-manager does work
This method did indeed work better (meaning the system didn't freeze after login). But the behavior also didn't seem to be different whether I'm using the remove method or just the switch method. Personal bad experiences with hot-plugging When I removed my eGPU with the third method mentioned above (unplug before starting display-manager), and then re-connected it with the following procedure:
The eGPU reconnected without a problem, but my keystrokes weren't properly recognized after that. It sometimes missed keystrokes and even missed some key-releases, so it kept entering a key that i wasn't pressing anymoreeeeeeeeeeeeeee (you get what i mean 😄). Summary If you still want to add it as an experimental feature and troubleshoot the issues I've had, I'm happy to further test it for you. Just tell me what I should do or if there's anything more you need to know. Would be great if you could open another PR just for the cleanup with the HEX-IDs, this seems to be unrelated to the new remove feature and a bit of code-cleanup is always welcome. :) Background information about my setup
|
|
Thanks for the comments. I was concerned that the removal would work differently on Nvidia dGPU + eGPU setups since I can't test that. After you do Hot plugging can be problematic, I agree. The main issue I'd like to address is the eGPU removal. I should mention that not all user processes are stopped when doing this. For example if you have something non-graphical running on a different tty that can survive. I definitely agree that this would be an experimental feature if added, but I'd like to try to get it working. |
The eGPU still shows up in
Running I did also try to run the commands from the remove method manually and it seems like it's unable to remove the sudo egpu-switcher switch internal
sudo systemctl stop display-manager
sudo modprobe -r nvidia_uvm # didn't work
sudo modprobe -r nvidia_drm # works
sudo modprobe -r nvidia_modeset # works
sudo modprobe -r nvidia # didn't workI get the following error on these two modules: This made me think, that I should probably add a logging feature with a bit more verbose logging to
I'm aware of that, but my guess is that 99% of people using the egpu-switcher script don't really run stuff on other ttys.
Sounds great to me, let me know if I can further assist you with testing. |
Stops the nvidia persistence daemon before driver removal. This could cause issues removing the eGPU if persistence mode was on.
|
I just added a line to stop the nvidia persistence daemon. This fixes an issue, let me know if it fixes your issues. |
|
Unfortunately this didn't fix any of my issues. A quick look at |
|
Can you try blacklisting the nvidia_uvm module? Obviously that's not a permanent solution, but I want to make sure that module is the problem. |
|
Oops, my bad. I've found via After stopping and disabling the The remove method still didn't work after that, but with some debugging I've found that the if [ $(lspci -k | grep -c ${vga_driver}) -gt 0 ]; then
modprobe ${vga_driver}
modprobe nvidia_drm # added this
sleep 1 # added this
fiObviously, the For completeness, here are my installed nvidia packages. sudo pacman -Qs nvidia I'm on the kernel 5.7.8-arch1-1 right now. |
|
Nice! Thanks for doing the debugging work on this. It wasn't as easy as I hoped, but I'm glad you got it working. I'll consider how to best implement the check on if the drivers are still being used and make a commit for all those changes. |
Added a check for if the drivers are in use before unloading them. Previously a program like Folding @ Home could keep the drivers active and prevent the remove function from completing, causing a black screen. Now if remove is called while a program like F@H is active, the script will print an error, stop trying the removal and restart the display manager.
|
Hey @hertg have you had time to test my latest commits? No worries, I know we're all busy, just want to get this PR wrapped up if possible. |
|
Just tested it, and it worked flawlessly 😄 |
This adds another function to the script called "remove" which stops the display manager, unloads the driver, removes the PCIe addresses of the eGPU, reloads the driver if necessary, then restarts the display manager. This is purely an enhancement. In my experience, this allows physically unplugging the eGPU anytime after the "remove" process completes. Without this, I would experience a system hang if the eGPU were unplugged. Let me know your experience with unplugging the eGPU, if this works for you or not, or if you have another way.
Note that AMD card users will see the following error message:
[drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbconOf the dozens of removals I've tried, I only had one error that required a hard reboot to solve. It seems like patches are in the works for better AMD removal (that might not even need restarting X). As it stands with the current kernels, all the steps I've added are strictly necessary for unplugging and they should remain the "safe" option even if better unplugging support comes later.
Also speaking of the kernel, I tried with a couple of old mainline kernels and found 5.0+ to be the requirement. Earlier ones did not work, but later worked fine on both the Linux Mint and Ubuntu systems I tested. Again, let me know what you think of this. I'd like to see it tested with more laptop configurations especially dGPU+eGPU. As it is I would consider this to be a "beta" feature that the user can call only if they want.
Finally, this also cleans up a little bit of code from my previous PR by allowing it to read the hex ID from the is_egpu_connected function.