References:
1. https://www.cnblogs.com/klvchen/p/17295624.html
   https://zhuanlan.zhihu.com/p/664599034
2. https://www.nvidia.com/download/
3. https://www.cnblogs.com/devilmaycry812839668/p/17269217.html
NVIDIA official documentation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
1. Environment
Server: R730
GPUs: P100, 2080TIS
OS: Ubuntu 22.04
Docker + PyTorch: driver installation and environment deployment
2. Configuration
Reference: https://www.cnblogs.com/klvchen/p/17295624.html
2.1 Install the NVIDIA GPU driver
# Before installing, identify the GPU model in the machine
sudo lspci | grep -i nvidia
# Download the matching driver from the official site
https://www.nvidia.cn/Download/index.aspx?lang=cn
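# Disable the nouveau driver
# The GPU driver bundled with Ubuntu is nouveau, an open-source third-party driver for NVIDIA hardware. It must be blacklisted before the official NVIDIA driver is installed.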
sudo vim /etc/modprobe.d/blacklist.conf
# Append the following lines at the end of the file
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist rivatv
blacklist nvidiafb
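The manual edit above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name add_blacklist_entries is hypothetical, and it appends each entry only when it is missing, so it can be run more than once without creating duplicates.

```shell
# Append each blacklist entry to FILE only if it is not already present.
# FILE defaults to /etc/modprobe.d/blacklist.conf (the file edited above).
add_blacklist_entries() {
    local file="${1:-/etc/modprobe.d/blacklist.conf}"
    local module
    # If the file exists and does not end with a newline, add one first
    # so the first appended entry starts on its own line.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        echo >> "$file"
    fi
    for module in vga16fb nouveau rivafb rivatv nvidiafb; do
        # -x matches the whole line, -F treats the pattern as a fixed string
        if ! grep -qxF "blacklist ${module}" "$file" 2>/dev/null; then
            echo "blacklist ${module}" >> "$file"
        fi
    done
}
# Example: try it on a scratch copy first, then copy the result back with sudo.
#   cp /etc/modprobe.d/blacklist.conf /tmp/blacklist.conf
#   add_blacklist_entries /tmp/blacklist.conf
```

Writing to the real file under /etc requires root, which is why the example works on a copy.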
sudo update-initramfs -u
# Reboot, then run the lsmod command below. No output means nouveau was disabled successfully.
sudo reboot
sudo lsmod | grep nouveau
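The "no output" check can be turned into an explicit pass/fail test. This is a small sketch; the helper name nouveau_disabled is hypothetical, and it reads lsmod output on standard input.

```shell
# Reads `lsmod` output on stdin.
# Exit status 0 means nouveau is NOT loaded; 1 means it is still loaded.
nouveau_disabled() {
    # The first column of lsmod output is the module name.
    # awk exits 0 when nouveau is found, so the leading ! inverts the result.
    ! awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}
# Example:
#   if lsmod | nouveau_disabled; then echo "nouveau is disabled"; else echo "nouveau is still loaded"; fi
```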
# Install the build tools
sudo apt install gcc make -y
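# Run the installer downloaded earlier (adjust the directory and file name to match your download)
cd /data/software
sudo chmod +x NVIDIA-Linux-x86_64-525.105.17.run
sudo ./NVIDIA-Linux-x86_64-525.105.17.run --no-x-check --no-nouveau-check --no-opengl-files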
# Verify the installation
nvidia-smi
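For a more compact check than the full nvidia-smi table, the query options of nvidia-smi can print one line per GPU. The wrapper below is an illustrative sketch; the function name list_gpus is hypothetical.

```shell
# Print one "name, driver_version" line per GPU.
# Returns 1 with a message on stderr if nvidia-smi is not on PATH.
list_gpus() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; the driver is probably not installed" >&2
        return 1
    fi
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
# Example:
#   list_gpus
```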
2.2 Install nvidia-docker
sudo apt -y install docker.io
# Set up the package repository and the GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Update the package lists and install nvidia-container-toolkit
sudo apt-get -y update
sudo apt-get install -y nvidia-container-toolkit
# Configure the Docker daemon to recognize the NVIDIA container runtime
sudo nvidia-ctk runtime configure --runtime=docker
# Restart Docker
sudo systemctl restart docker
# Test
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
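If the test container fails to start, one thing worth checking is whether the runtime was actually registered. The sketch below assumes that nvidia-ctk writes its runtime entry to /etc/docker/daemon.json (the default location); the helper name has_nvidia_runtime is hypothetical, and the check is a simple text match rather than a full JSON parse.

```shell
# Succeeds if the given Docker daemon config mentions an "nvidia" runtime entry.
# The config path defaults to /etc/docker/daemon.json.
has_nvidia_runtime() {
    local config="${1:-/etc/docker/daemon.json}"
    grep -q '"nvidia"' "$config" 2>/dev/null
}
# Example:
#   has_nvidia_runtime || echo "nvidia runtime missing; re-run: sudo nvidia-ctk runtime configure --runtime=docker"
#
# A PyTorch-level check, assuming the pytorch/pytorch image from Docker Hub
# (an image chosen for illustration, not named in the original notes):
#   sudo docker run --rm --gpus all pytorch/pytorch python -c "import torch; print(torch.cuda.is_available())"
```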
3. Common problems
3.1 After the machine has been running for a while, nvidia-smi fails with: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
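These notes do not record a cause for this error. One commonly reported cause (an assumption here, not something the notes confirm) is an automatic kernel update: a driver installed from the .run file is built for the kernel running at install time, so a newer kernel may have no matching nvidia module unless DKMS was enabled during installation. The sketch below checks for that situation; the helper name nvidia_module_present is hypothetical.

```shell
# Succeeds if an nvidia kernel module file exists for the given kernel release.
# Arguments: KERNEL_RELEASE (default: the running kernel),
#            MODULES_ROOT   (default: /lib/modules)
nvidia_module_present() {
    local release="${1:-$(uname -r)}"
    local root="${2:-/lib/modules}"
    # Matches nvidia.ko as well as compressed variants such as nvidia.ko.zst
    [ -n "$(find "$root/$release" -name 'nvidia.ko*' 2>/dev/null | head -n 1)" ]
}
# Example:
#   nvidia_module_present || echo "No nvidia module for kernel $(uname -r); try re-running the driver installer"
```

If the module is missing for the running kernel, re-running the installer from section 2.1 is a reasonable first step.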
3.2 Driver installation errors
Note: on Ubuntu 22.04, choose CUDA 12 or another recent CUDA version. Versions that are too old cause a variety of errors.
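To see which CUDA version the installed driver supports, the header of the nvidia-smi output includes a "CUDA Version" field, which is the highest CUDA runtime version that driver can serve. The helper below is a small sketch that extracts it; the name max_cuda_version is hypothetical, and it reads nvidia-smi output on standard input.

```shell
# Reads `nvidia-smi` output on stdin and prints the "CUDA Version" value
# from its header line (first match only).
max_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
# Example:
#   nvidia-smi | max_cuda_version
```

Container images built for a CUDA version at or below the printed value should be usable with that driver.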