Note: this is something I learned at the second internet company I joined after graduating from university.


Background

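Recently one of our GPUs failed as well. An error showed up in the alerting group chat, and we then found that the GPU driver on the server was unusable. Let's take a look.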

Current state

$ nvidia-smi
Unable to determine the device handle for GPU 0000:09:00.0: Unknown Error

$ cat /var/log/messages

error occurred while collecting nvidia stats for container
Failed to initialize NVML: Driver/library version mismatch
NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1451)
NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
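
Lines like the ones above are easy to miss in a busy system log, so a small filter can help. This is only a minimal sketch: the function name `nvidia_log_lines` is hypothetical, and the pattern simply covers the message prefixes seen in this log (`NVRM`, `NVML`, `nvidia`).

```shell
#!/bin/sh
# Hypothetical helper: keep only NVIDIA-related lines from a log stream on stdin.
# `|| true` keeps the function from returning non-zero when nothing matches.
nvidia_log_lines() {
    grep -iE 'nvrm|nvml|nvidia' || true
}

# Example use on a real host (log paths differ between distributions):
#   nvidia_log_lines < /var/log/messages
#   dmesg | nvidia_log_lines
```

Reading from stdin keeps the helper independent of where the log lives, so the same function works on a log file, on `dmesg` output, or on a saved excerpt.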

First, try reinstalling the GPU driver

$ ./NVIDIA-Linux-x86_64-510.68.02.run

The uninstall and the install both completed normally, with no errors.

Once that finished, I ran nvidia-smi again.

# It reports an error
$ nvidia-smi
No devices were found

# Try a reboot
$ reboot

$ nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Strange. Why does it now say the GPU driver is missing again?
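
One thing worth ruling out at this point is that the kernel module currently loaded differs from the one installed on disk, which is a frequently reported cause of the `Driver/library version mismatch` message seen in the log. The following is a minimal sketch under that assumption, not the check that solved this case. The function name `compare_driver_versions` is hypothetical; the example invocation assumes the module exposes its version under `/sys/module/nvidia/version` and that `modinfo` is available.

```shell
#!/bin/sh
# Hypothetical helper: compare two driver version strings and report the result.
compare_driver_versions() {
    loaded="$1"   # version of the module currently loaded in the kernel
    ondisk="$2"   # version of the module file installed on disk
    if [ -z "$loaded" ]; then
        echo "not-loaded"
    elif [ "$loaded" = "$ondisk" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Example use on a real host:
#   compare_driver_versions "$(cat /sys/module/nvidia/version 2>/dev/null)" \
#                           "$(modinfo -F version nvidia 2>/dev/null)"
```

A result of `not-loaded` or `mismatch` points at a software problem. A `match` suggests looking elsewhere, for example at the hardware.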

Check the hardware

$ lsmod | grep nvidia
nvidia_uvm 1073152 0
nvidia_drm 61440 0
nvidia_modeset 1159168 1 nvidia_drm
nvidia 39104512 33 nvidia_uvm,nvidia_modeset
i2c_nvidia_gpu 16384 0
drm_kms_helper 176128 1 nvidia_drm
drm 495616 6 drm_kms_helper,drm_vram_helper,nvidia,nvidia_drm,ttm

$ yum install pciutils -y
$ lspci | grep -i nvidia
af:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
af:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
af:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
af:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
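
The `lspci` listing above can also be summarized per GPU. This is a rough sketch with a hypothetical function name, `nvidia_gpu_status`. It flags a revision of `ff`, which is often reported when a device has stopped responding on the bus; that interpretation is an assumption on my part, and the output above shows `rev a1`, so this card would be reported as `ok`.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci` output on stdin and print one line per
# NVIDIA VGA/3D controller: "<bus address> ok" or "<bus address> rev-ff".
nvidia_gpu_status() {
    grep -i 'nvidia' | grep -E 'VGA compatible controller|3D controller' |
    while read -r addr rest; do
        case "$rest" in
            *"(rev ff)"*) echo "$addr rev-ff" ;;   # device not answering (assumed meaning)
            *)            echo "$addr ok" ;;
        esac
    done
}

# Example use on a real host:
#   lspci | nvidia_gpu_status
```

An `ok` here only means the card is still enumerated on the PCIe bus. As this case shows, a card with failed memory can pass this check.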

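The hardware check shows that the GPU is present. After a long time digging without finding the cause, I connected a monitor directly to the server. That turned out to be revealing: the screen was covered in artifacts.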

Summary

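In the end I sent the card in for after-sales service. The technician even asked whether the card had been used for mining. The VRAM had failed: the GPU temperature shot up to 90°C as soon as the card powered on, so the only option was to pay to have the VRAM replaced.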
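
On hosts where nvidia-smi still responds, temperatures like the one mentioned above can be watched before a card fails outright. This is a minimal sketch: the function name `hot_gpus` and the threshold of 85 are illustrative assumptions, and it would not have worked on this machine once the driver stopped talking to the card.

```shell
#!/bin/sh
# Hypothetical helper: read "index, temperature" CSV lines on stdin and print
# the index of every GPU at or above the given threshold (degrees Celsius).
hot_gpus() {
    threshold="$1"
    awk -F', *' -v t="$threshold" '$2 + 0 >= t + 0 { print $1 }'
}

# Example use on a real host:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 85
```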

References

Nvidia PCIE Passthrough on Ubuntu VM returning “Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error”

No devices were found