Are you really running six GPUs in your workstation?
I got an email the other day about the following picture, which I posted on Twitter a while back.
The direct answer is no, I am not running 6 GPUs in my system, for several reasons. First and foremost, those Quadro NVS 420 cards are dual-GPU cards, so technically that picture shows 12 GPUs. A more down-to-earth reason is that I did it just for fun one day to see if it was possible to run that many GPUs. Those are very old cards and aren’t capable of pushing the resolution I am running these days: 11520×2160. However, before anyone starts to freak out about running 12 GPUs at once… you can’t run that many Nvidia GPUs under Linux. Not currently… to my knowledge anyway.
I’m not sure where the exact limit is implemented. I don’t know if it’s a limit in the kernel or a limit in the Nvidia driver; all I know for sure is that the OS will not detect more than 8 GPUs at a time. I have tested this on three systems, and all three hit the wall at 8 GPUs. There is also a post here by an Nvidia employee, TMurray, stating that the limit was “at least 8”. That, along with my own experience of 8 being the limit, leads me to believe it is the hard limit for ‘some’ reason.
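One way to start separating a kernel limit from a driver limit is to compare how many Nvidia GPUs the kernel enumerates on the PCI bus with how many the driver reports. The sketch below is only an illustration, not something from the tests described above. It assumes a Linux system with sysfs mounted at /sys, and it assumes the proprietary driver exposes one entry per GPU under /proc/driver/nvidia/gpus (the layout of that directory can differ between driver versions). The function names are made up for this example.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"    # PCI vendor ID assigned to Nvidia
DISPLAY_CLASS_PREFIX = "0x03"  # PCI base class 0x03 = display controller


def pci_nvidia_gpus(sysfs_root="/sys/bus/pci/devices"):
    """Return PCI addresses of Nvidia display controllers the kernel enumerated."""
    root = Path(sysfs_root)
    found = []
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
            pci_class = (dev / "class").read_text().strip().lower()
        except OSError:
            continue  # entry without readable vendor/class files; skip it
        # Filtering on the class skips the HDMI audio function of each card.
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith(DISPLAY_CLASS_PREFIX):
            found.append(dev.name)
    return found


def driver_nvidia_gpus(proc_root="/proc/driver/nvidia/gpus"):
    """Return the per-GPU entries the Nvidia driver exposes (assumed layout)."""
    root = Path(proc_root)
    if not root.is_dir():
        return []  # driver not loaded, or a different layout than assumed
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    on_bus = pci_nvidia_gpus()
    in_driver = driver_nvidia_gpus()
    print(f"Nvidia GPUs visible on the PCI bus: {len(on_bus)}")
    print(f"Nvidia GPUs reported by the driver: {len(in_driver)}")
```

If the first number is higher than the second, the kernel saw the devices and the cutoff is more likely in the driver. If both numbers stop at the same value below the installed count, the BIOS, motherboard, or PCI enumeration becomes the more likely suspect.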
To be completely academic about this, I found a thread here where someone claims to have built a rig with 18 GPUs, but provides only text output to validate the claim. It would have been nice to see a screenshot or a photo of this alleged system; without one, I’m left wondering whether those comments are factual or whether someone is out to inflate their e-peen. I would love for it to be legit, and I really wish more information had been provided.
I’ve done quite a bit of research online and have read a whole slew of claims that it will work, as well as a slew of people claiming it won’t work, for a myriad of reasons. The reasons people usually quote are:
1) It’s a BIOS Problem
2) Not enough PCIe lanes (usually valid, but on a dual-CPU system like mine that’s not an issue)
3) Kernel Limitation
4) Driver Limitation
5) Some other random motherboard issue that’s never clearly identified
6) Nvidia is being mean (Yes, I’ve actually seen that being claimed)
I would love to find some actual hard facts on the issue, not because I have a serious need to run that many GPUs, but because I’m naturally curious. One thing I have not tried is using the Nouveau driver to see if the limitation persists. My guess is that it would, not necessarily because of some code from Nvidia, but because in my experience the Nouveau driver is rather lacking in features at the extreme use cases.
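For anyone who does try the Nouveau comparison, it helps to confirm which kernel driver is actually bound to each card before counting GPUs. Here is a minimal sketch, assuming Linux sysfs, where each PCI device directory has a `driver` symlink when a driver is bound. The helper name `bound_driver` is hypothetical, and the PCI address in the usage line is just an example.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, pci_address, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or no such device
    # The symlink points at the driver's directory, e.g. .../drivers/nouveau
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Example address only; substitute one reported by your own system.
    print(bound_driver("0000:01:00.0"))
```

A result of `nouveau` or `nvidia` for every card tells you which driver the GPU count applies to; `None` means no driver claimed that device at all.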
My gut feeling is that it’s an Nvidia driver issue, considering Nvidia has showcased VCA units like this with 8x Tesla K80 cards (dual-GPU cards, so 16 GPUs). But to be fair, for a unit like that, it’s possible they are running a custom BIOS and kernel with a significant set of patches.
If anyone out there has more concrete information on this, please contact me. I’d love to dig into this issue a bit more and learn the root cause.