Sorry, but could you please elaborate? I’ve been using Nvidia forever on Linux machines, both at work and at home. I work in AI, so using Nvidia GPUs is a must. Maybe there’s something I missed, but my experience has been pretty solid so far.
At home I am using openSUSE Tumbleweed with KDE on Wayland, and at work headless Ubuntu.
deleted by creator
Do you mean in terms of gaming? I admit that I don’t do much gaming on Linux. Usually just development and browsing.
I also use the proprietary Nvidia drivers, if that makes a difference.
deleted by creator
These days ROCm support is more common than a few years ago so you’re no longer entirely dependent on CUDA for machine learning. (Although I wish fewer tools required non-CUDA users to manually install Torch in their venv because the auto-installer assumes CUDA. At least take a parameter or something if you don’t want to implement autodetection.)
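For what it’s worth, the “take a parameter” idea could be as small as the sketch below. `torch_index_url` is a hypothetical helper, not something any existing installer ships, and the version suffixes in the wheel index URLs (`cu121`, `rocm6.0`) are only examples that age quickly, so check PyTorch’s install selector for the current ones.

```shell
#!/bin/sh
# Hypothetical helper: map a backend name to a PyTorch wheel index URL.
# The version suffixes below are assumptions and will go stale.
torch_index_url() {
    case "$1" in
        cuda) echo "https://download.pytorch.org/whl/cu121" ;;
        rocm) echo "https://download.pytorch.org/whl/rocm6.0" ;;
        cpu)  echo "https://download.pytorch.org/whl/cpu" ;;
        *)    echo "unknown backend: $1" >&2; return 1 ;;
    esac
}

# Possible usage inside an activated venv:
#   pip install torch --index-url "$(torch_index_url rocm)"
```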
Nvidia’s Linux drivers generally are a bit behind AMD’s; e.g. driver versions before 555 tended not to play well with Wayland.
Also, Nvidia’s drivers tend not to give any meaningful information in case of a problem. There’s typically just an error code for “the driver has crashed”, no matter what reason it crashed for.
Personal anecdote for the last one: I had a wonky 4080, and tracing the problem to the card took months because the log (both on Linux and Windows) didn’t contain error information beyond “something bad happened” and the behavior had dozens of possible causes, ranging from “the 4080 is unstable if you use XMP on some mainboards” through “some BIOS setting might need to be changed” and “sometimes the card doesn’t like a specific CPU/PSU/RAM/mainboard” to “it’s a manufacturing defect”.
Sure, manufacturing defects can happen to anyone; I can’t fault Nvidia for that. But the combination of useless logs and 4000-series cards having so many things they can possibly (but rarely) get hung up on made error diagnosis incredibly painful. I finally just bought a 7900 XTX instead. It’s slower but I like the driver better.
Finally, thanks for the clear-cut answer. I don’t have any experience with training on AMD, but the errors from Nvidia are usually very obscure.
As for using GPUs other than Nvidia’s, there’s a slew of problems. The main one is that in the cloud, where most of our projects are deployed, our options seem limited to either Nvidia GPUs or cloud TPUs.
Each AI experiment can cost thousands of dollars and use a cluster of GPUs. We have built and modified our system to fully utilize such an environment. I can’t even imagine shifting to AMD GPUs at this point. The amount of work involved, and the red tape... (shudder)
Oh yeah, the equation completely changes for the cloud. I’m only familiar with local usage where you can’t easily scale out of your resource constraints (and into budgetary ones). It’s certainly easier to pivot to a different vendor/ecosystem locally.
By the way, AMD does have one additional edge locally: They tend to put more RAM into consumer GPUs at a comparable price point – for example, the 7900 XTX competes with the 4080 on price but has as much memory as a 4090. In systems with one or few GPUs (like a hobbyist mixed-use machine) those few extra gigabytes can make a real difference. Of course this leads to a trade-off between Nvidia’s superior speed and AMD’s superior capacity.
The only two things that have ever been broken by an update for me are Hyprland and the Nvidia drivers, multiple times.
Even that seems to have stopped happening recently, though; they patched one of the really big issues this year.
Yeah, Tumbleweed has a good track record with NVIDIA drivers in my experience. As with updates in general.
Although I still use X11, as Wayland still has graphical issues in some apps for me, usually Flatpaks. That makes it unusable for me for the time being.
Edit: I have an older card (1050 Ti), so maybe I don’t get the latest drivers anymore? I’m on version 550.
Ah, the problem you are describing on Wayland usually only happens with Electron apps. Most Electron apps have to be forced to run natively on Wayland. By default they usually run on X (via XWayland), which causes all sorts of glitches. You can use xeyes to check whether an app is using XWayland: if the eyes move when you move the cursor inside the app’s window, it’s on XWayland.
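If xeyes isn’t installed, `xlsclients` (from the X11 utilities) lists the clients connected to the X server, which under Wayland means the apps going through XWayland. A rough sketch: `is_x11_client` is a made-up helper, and it assumes the usual `host  command` output layout, so apps that don’t report a command won’t be caught.

```shell
#!/bin/sh
# Reads `xlsclients` output on stdin and succeeds if a client whose
# command (second field) equals $1 is listed. Helper name is hypothetical.
is_x11_client() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Possible usage ("element-desktop" is just an example command name):
#   xlsclients | is_x11_client element-desktop && echo "running on XWayland"
```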
To resolve the issues with Electron apps, I pass these parameters:
--enable-features=UseOzonePlatform --ozone-platform=wayland
Getting these args to Flatpaks can be a bit tricky. You can usually find AppImages that let you run these apps on Wayland easily.
I am also on version 550.120, so I doubt the driver is the issue here.
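For Flatpaks, one approach that may work is passing the flags after the app ID on `flatpak run`, since anything after the ID is handed to the app itself. The sketch below only prints the command so you can inspect it first; `flatpak_wayland_cmd` is a hypothetical helper and the app ID is whatever `flatpak list` shows for your app.

```shell
#!/bin/sh
# Hypothetical helper: print a `flatpak run` command that grants the
# Wayland socket and passes the Ozone flags through to the app.
flatpak_wayland_cmd() {
    app_id="$1"
    printf 'flatpak run --socket=wayland %s %s\n' \
        "$app_id" \
        "--enable-features=UseOzonePlatform --ozone-platform=wayland"
}

# Possible usage (replace the placeholder ID with a real one):
#   eval "$(flatpak_wayland_cmd org.example.App)"
```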
😂 I need to fix Element (Matrix client) with that 🥳 Finally found the cause of the issue.
Thank you very much!
How could that be fixed by the devs? Is it something Electron has to update, or does every Electron app have to be updated individually for it to work out of the box?
When I was last following the issue on GitHub, it needed to be supported by Electron first. It’s in the works, but for now (and take this with a grain of salt) I think the recommendation in the issue was to add the options to the app’s desktop file or executable yourself.
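Editing the desktop file by hand is easy enough, but as a sketch of the same thing scripted: the filter below inserts the flags right after the command on every `Exec=` line. `add_ozone_flags` is a made-up name, and it assumes the command path contains no spaces or quoting.

```shell
#!/bin/sh
# Hypothetical filter: reads a .desktop file on stdin and writes it to
# stdout with the Ozone flags inserted after the command on Exec= lines.
# Assumes the command itself has no spaces or quotes in it.
add_ozone_flags() {
    sed 's/^\(Exec=[^ ]*\)/\1 --enable-features=UseOzonePlatform --ozone-platform=wayland/'
}

# Possible usage (file names are placeholders); a copy under
# ~/.local/share/applications normally overrides the system-wide one:
#   add_ozone_flags < /usr/share/applications/app.desktop \
#       > ~/.local/share/applications/app.desktop
```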
If you are distributing the app with the flags, just a reminder to set them up in a way that also works on X.
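One way to keep X working is to add the flags only when a Wayland session is detected. A minimal sketch, assuming the session exposes `WAYLAND_DISPLAY` or `XDG_SESSION_TYPE` (most do, but that is an assumption); `ozone_flags` and the binary name in the usage line are placeholders.

```shell
#!/bin/sh
# Print the Ozone flags only when a Wayland session is detected;
# print nothing under X11 so the app starts with its defaults.
ozone_flags() {
    if [ -n "${WAYLAND_DISPLAY:-}" ] || [ "${XDG_SESSION_TYPE:-}" = "wayland" ]; then
        printf '%s\n' "--enable-features=UseOzonePlatform --ozone-platform=wayland"
    fi
}

# Possible usage in a launcher script; the unquoted substitution is
# intentional so the two flags are split into separate arguments:
#   exec my-electron-app $(ozone_flags) "$@"
```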