Problem description
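As the title says: on an Alibaba Cloud GPU server, I created a container from an image that had previously worked fine, but the GPU driver inside the new container was broken. Running nvidia-smi failed with: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.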
Reinstalling the GPU driver with the .run file from the official website failed with: ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.
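Going by the error message, I suspected a wrong kernel version or gcc version. I switched between several kernel and gcc versions and tried many of the fixes posted online for these two errors, but none of them worked and I was completely stuck.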
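The kernel and gcc suspicion can be checked directly before swapping versions by hand. The following is a minimal sketch, not part of the original troubleshooting: it assumes a Linux system where /proc/version records the compiler used to build the running kernel and /proc/modules lists loaded modules. The helper names (gcc_version_in, nouveau_loaded, read_or_empty, local_gcc_banner) are hypothetical and exist only for this illustration.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional


def gcc_version_in(text: str) -> Optional[str]:
    """Return the first dotted version number that follows 'gcc' on a line."""
    match = re.search(r"gcc.*?(\d+\.\d+(?:\.\d+)?)", text, re.IGNORECASE)
    return match.group(1) if match else None


def nouveau_loaded(proc_modules: str) -> bool:
    """True if a module named exactly 'nouveau' is listed in /proc/modules text."""
    return any(
        line.split(" ", 1)[0] == "nouveau" for line in proc_modules.splitlines()
    )


def read_or_empty(path: str) -> str:
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def local_gcc_banner() -> str:
    """Output of `gcc --version`, or an empty string if gcc cannot be run."""
    try:
        result = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True
        )
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    kernel_gcc = gcc_version_in(read_or_empty("/proc/version"))
    installed_gcc = gcc_version_in(local_gcc_banner())
    print("gcc used to build the kernel:", kernel_gcc)
    print("gcc installed now:", installed_gcc)
    print("versions match:", kernel_gcc is not None and kernel_gcc == installed_gcc)
    print("nouveau loaded:", nouveau_loaded(read_or_empty("/proc/modules")))
```

If the versions match and nouveau is not loaded, the two causes named in the installer error are less likely, which points toward a cause outside the operating system.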
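I gave up on the original image and created a new, empty container, but the empty container also reported "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver", and, surprisingly, the GPU driver could not be installed in it either. That made me suspect the container itself.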
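The check on a fresh container can be scripted so that the same test is repeated on every new container. This is a rough sketch under stated assumptions: it only looks for the failure text quoted in this post and for /proc/driver/nvidia/version, a file that the NVIDIA kernel module exposes while it is loaded. The function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Substring of the error quoted in this post; avoids the apostrophe,
# which may be straight or curly depending on where the text was copied from.
FAILURE_MARKER = "communicate with the NVIDIA driver"


def driver_unreachable(nvidia_smi_output: str) -> bool:
    """True if the output contains the 'couldn't communicate' failure text."""
    return FAILURE_MARKER in nvidia_smi_output


def kernel_module_loaded(proc_root: str = "/proc") -> bool:
    """True if the nvidia kernel module's version file is present."""
    return Path(proc_root, "driver", "nvidia", "version").is_file()


def run_nvidia_smi() -> Optional[str]:
    """Combined stdout and stderr of nvidia-smi, or None if it cannot be run."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    except OSError:
        return None
    return result.stdout + result.stderr


if __name__ == "__main__":
    output = run_nvidia_smi()
    if output is None:
        print("nvidia-smi is not installed or not on PATH")
    else:
        print("driver unreachable:", driver_unreachable(output))
    print("nvidia kernel module loaded:", kernel_module_loaded())
```

If a brand-new container reports the driver as unreachable and the kernel module never loads after installation, the cause is probably not the image.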
Solution
I found that the container's own configuration was the likely cause: when it was set to the GPU compute type, the driver installed normally, but when it was set to the GPU visualization compute type, the errors above appeared.
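After asking Alibaba Cloud support, I learned that the GPU visualization compute type requires a specific compatible driver, which has to be requested by submitting a ticket; only the GPU compute type can use the driver downloaded from the official website. Once I obtained the compatible driver through a ticket, it installed normally and the problem was solved.
Source: http://www.zghlxwxcb.cn/news/detail-512385.html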
Takeaway
If even an empty container on a cloud server cannot get the driver installed, stop tinkering with it yourself. Most likely something is wrong with the container itself, so ask the cloud provider.