Official NVIDIA documentation: NVIDIA/k8s-device-plugin, the NVIDIA device plugin for Kubernetes
Prerequisites
Install the NVIDIA Container Toolkit
Official guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
Update the runtime configuration
Docker
nvidia-ctk runtime configure --runtime=docker
Docker running in rootless mode
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
Containerd
nvidia-ctk runtime configure --runtime=containerd
If containerd is used only through nerdctl, no configuration is needed; run nerdctl run --gpus=all directly.
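If the change above has no effect, make nvidia the default runtime by hand.
Edit /etc/containerd/config.toml.
Find the section
plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc
runtime_type = "io.containerd.runc.v2"
and change the runtime_type value to io.containerd.runtime.v1.linux
Then find the section
io.containerd.runtime.v1.linux
runtime = "runc"
and change it to runtime = "nvidia-container-runtime"
Note that the legacy io.containerd.runtime.v1.linux runtime was removed in containerd 2.0, so this edit only applies to containerd 1.x.
systemctl daemon-reload
containerd must also be restarted before it picks up the edited configuration.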
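As an alternative to editing config.toml by hand, nvidia-ctk can set the default runtime itself. The following is a minimal sketch, assuming nvidia-ctk is installed, that the installed version supports the --set-as-default flag, and that containerd is managed by systemd. The function name set_nvidia_default_runtime is made up for illustration.

```shell
# Sketch only: assumes nvidia-ctk is installed, supports --set-as-default,
# and that containerd runs as a systemd service.
set_nvidia_default_runtime() {
    # Let nvidia-ctk update the containerd config and mark "nvidia" as default.
    sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default || return 1
    # containerd reads its configuration at startup, so restart it.
    sudo systemctl restart containerd
}

# Usage (on the GPU node):
#   set_nvidia_default_runtime
```

With nvidia as the default runtime, pods do not need a RuntimeClass to reach the GPU, which is what the static device-plugin manifest relies on.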
CRI-O
nvidia-ctk runtime configure --runtime=crio
Deploy nvidia-device-plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml
Check the logs of the device-plugin pod on the GPU node
Normal case (the GPU is registered with the kubelet):
I0820 08:26:47.926058 1 main.go:317] Retrieving plugins.
No GPU found (try making nvidia the default runtime):
I0820 08:26:53.370854 1 main.go:317] Retrieving plugins.
After updating the runtime, restart the device-plugin daemonset
kubectl rollout restart ds -n kube-system nvidia-device-plugin-daemonset
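After the restart, the logs on the GPU node can be checked again. The helper below is a rough sketch of one way to find the plugin pod scheduled on a given node and print its logs. It assumes the pods run in kube-system and carry nvidia-device-plugin in their names, as the daemonset name above suggests. The function name device_plugin_logs and the node name in the usage line are hypothetical.

```shell
# Sketch only: assumes the plugin pods live in kube-system and that their
# names contain "nvidia-device-plugin".
device_plugin_logs() {
    node="$1"   # name of the GPU node
    # List the pods scheduled on that node and keep the device-plugin one.
    pod=$(kubectl get pods -n kube-system \
        --field-selector "spec.nodeName=$node" -o name \
        | grep nvidia-device-plugin | head -n 1)
    if [ -z "$pod" ]; then
        echo "no device-plugin pod found on $node" >&2
        return 1
    fi
    kubectl logs -n kube-system "$pod"
}

# Usage:
#   device_plugin_logs gpu-node-1
```

Once the plugin has registered the GPU, kubectl describe node for that node should also list nvidia.com/gpu under its capacity and allocatable resources.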
Test
# Label the GPU node
apiVersion: v1
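For reference, a GPU smoke test usually comes down to a pod that requests the nvidia.com/gpu resource. The sketch below writes such a manifest to a file. It is not the manifest from this post: the pod name, container name, and image tag are assumptions modelled on the sample in the k8s-device-plugin README, and write_gpu_test_pod is a made-up helper name.

```shell
# Sketch only: the pod name, container name, and image tag are assumptions
# for illustration, not taken from the original test manifest.
write_gpu_test_pod() {
    # $1: path of the manifest file to create
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the device plugin
EOF
}

# Usage (requires a working cluster):
#   write_gpu_test_pod gpu-pod.yaml
#   kubectl apply -f gpu-pod.yaml
#   kubectl logs gpu-pod
```

If the pod stays Pending with an "Insufficient nvidia.com/gpu" event, the node is not advertising the GPU yet, and the device-plugin logs are the first place to look.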