[GPU] RKE2 Setup and GPU Testing on RockyLinux 9.4 - V - NVIDIA GPU Operator …
Hello,
this is 꿈꾸는여행자.
Continuing from the previous posts in this series,
this installment covers the prerequisite checks for installing the NVIDIA GPU Operator and the installation itself.
The details are below.
Thank you.
> Below
Table of Contents
3. Installing the NVIDIA GPU Operator
3.1. Prerequisites
3.1.1. You have the kubectl and helm CLIs available on a client machine.
3.1.2. Label the namespace for the Operator
3.2. Procedure
3.2.1. Add the NVIDIA Helm repository:
3.2.2. Install the GPU Operator.
3.2.2.1. values.yaml
3.2.3. Verify the installation
3.2.3.1. Check GPU resources
3.2.3.2. Check Node Feature Discovery
Details
________________
3. Installing the NVIDIA GPU Operator
Reference: Installing the NVIDIA GPU Operator
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
3.1. Prerequisites
3.1.1. You have the kubectl and helm CLIs available on a client machine.
You can run the following commands to install the Helm CLI:
curl -fsSL -o get_helm.sh \
https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
[root@host ~]# curl -fsSL -o get_helm.sh \
https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
Downloading https://get.helm.sh/helm-v3.16.2-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
[root@host ~]#
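If the CLIs are already installed, a quick check such as the following (an extra step, not part of the original transcript) confirms both are available on the client machine:
helm version --short
kubectl version --client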
All worker nodes or node groups to run GPU workloads in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems.
For worker nodes or node groups that run CPU workloads only, the nodes can run any operating system because the GPU Operator does not perform any configuration or management of nodes for CPU-only workloads.
Nodes must be configured with a container engine such as CRI-O or containerd.
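On RKE2 the runtime is containerd by default; as an optional extra check, the wide node listing shows what each node reports in its CONTAINER-RUNTIME column:
kubectl get nodes -o wide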
3.1.2. Label the namespace for the Operator
If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged:
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
[root@host ~]# kubectl create ns gpu-operator
namespace/gpu-operator created
[root@host ~]# kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
namespace/gpu-operator labeled
[root@host ~]#
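As an optional extra check, the enforcement label can be verified on the namespace afterwards:
kubectl get ns gpu-operator --show-labels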
Node Feature Discovery (NFD) is a dependency for the Operator on each node. By default, NFD master and worker are automatically deployed by the Operator. If NFD is already running in the cluster, then you must disable deploying NFD when you install the Operator.
One way to determine if NFD is already running in the cluster is to check for a NFD label on your nodes:
kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
[root@host ~]# kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
false
[root@host ~]#
If the command output is true, then NFD is already running in the cluster.
Node Feature Discovery (NFD) is the component the NVIDIA GPU Operator relies on to detect each node's hardware features and to expose special hardware resources such as GPUs to the Kubernetes cluster. The GPU Operator deploys NFD automatically by default, so if NFD is already installed in the cluster, the automatic NFD deployment must be disabled to avoid a duplicate installation.
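If the check returns true and you want to see exactly which NFD labels are already present, a jq filter along these lines (an illustrative sketch, not from the original transcript) lists them per node:
kubectl get nodes -o json | jq '.items[] | {node: .metadata.name, nfd: (.metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io"))) | keys)}'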
3.2. Procedure
3.2.1. Add the NVIDIA Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
[root@host ~]# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
[root@host ~]#
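Before installing, it can be helpful to list the chart versions the repository offers, for example to pin the v24.6.2 release referenced below; this lookup is an optional extra step:
helm search repo nvidia/gpu-operator --versions | head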
3.2.2. Install the GPU Operator.
3.2.2.1. values.yaml
* gpu-operator
* https://github.com/NVIDIA/gpu-operator/tree/v24.6.2
* https://github.com/NVIDIA/gpu-operator/blob/v24.6.2/deployments/gpu-operator/values.yaml
Modify values.yaml and apply the following changes:
* Set defaultRuntime to containerd
* The toolkit component must be disabled
* When the toolkit is enabled, its daemonset container keeps restarting and failing (in this series the NVIDIA Container Toolkit was already installed on the host in the previous part)
vi gpu-operator-values.yaml
operator:
  defaultRuntime: "containerd"
toolkit:
  enabled: false
devicePlugin:
  enabled: true  # enable the NVIDIA Device Plugin
[root@host 20241017_RKE2]# vi gpu-operator-values.yaml
[root@host 20241017_RKE2]# cat gpu-operator-values.yaml
operator:
  defaultRuntime: "containerd"
toolkit:
  enabled: false
devicePlugin:
  enabled: true  # enable the NVIDIA Device Plugin
[root@host 20241017_RKE2]#
If a previous release is present, it can be removed first:
helm list -n gpu-operator
helm uninstall gpu-operator -n gpu-operator
kubectl get pods -n gpu-operator
kubectl delete namespace gpu-operator
Install the Operator, passing the values file prepared above:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
-f gpu-operator-values.yaml
kubectl get all -n gpu-operator
kubectl get pods -n gpu-operator -w
[root@host ~]# helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
NAME: gpu-operator-1729148604
LAST DEPLOYED: Thu Oct 17 16:03:27 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
[root@host ~]#
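As an aside, the overrides carried in gpu-operator-values.yaml could also be passed directly with --set flags; the following is only a sketch of an equivalent invocation, not the command actually run above:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set operator.defaultRuntime=containerd \
--set toolkit.enabled=false \
--set devicePlugin.enabled=true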
Refer to the Common Chart Customization Options and Common Deployment Scenarios for more information.
3.2.3. Verify the installation
3.2.3.1. Check GPU resources
kubectl describe nodes | grep -i nvidia
[root@host ~]# kubectl describe nodes | grep -i nvidia
nvidia.com/cuda.driver-version.full=560.35.03
nvidia.com/cuda.driver-version.major=560
nvidia.com/cuda.driver-version.minor=35
nvidia.com/cuda.driver-version.revision=03
nvidia.com/cuda.driver.major=560
nvidia.com/cuda.driver.minor=35
nvidia.com/cuda.driver.rev=03
nvidia.com/cuda.runtime-version.full=12.6
nvidia.com/cuda.runtime-version.major=12
nvidia.com/cuda.runtime-version.minor=6
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=6
nvidia.com/gfd.timestamp=1729148646
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=7
nvidia.com/gpu.compute.minor=5
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=turing
nvidia.com/gpu.machine=20QNCTO1WW
nvidia.com/gpu.memory=4096
nvidia.com/gpu.mode=graphics
nvidia.com/gpu.present=true
nvidia.com/gpu.product=Quadro-T1000
nvidia.com/gpu.replicas=1
nvidia.com/gpu.sharing-strategy=none
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single
nvidia.com/mps.capable=false
nvidia.com/vgpu.present=false
nvidia.com/gpu-driver-upgrade-enabled: true
nvidia.com/gpu: 1
nvidia.com/gpu: 1
gpu-operator nvidia-container-toolkit-daemonset-mfl5w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m42s
gpu-operator nvidia-dcgm-exporter-xsw9c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m41s
gpu-operator nvidia-device-plugin-daemonset-85grp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m42s
gpu-operator nvidia-operator-validator-mc85f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m42s
nvidia.com/gpu 0 0
Normal GPUDriverUpgrade 2m28s nvidia-gpu-operator Successfully updated node state label to [upgrade-done]%!(EXTRA <nil>)
[root@host ~]#
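A more compact way to confirm the allocatable GPU count per node is a custom-columns query (an optional extra check; note that the dots in the resource name must be escaped):
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'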
3.2.3.2. Check Node Feature Discovery
kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
[root@host ~]# kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
true
[root@host ~]#
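The NFD master and worker pods deployed by the Operator can also be listed directly; the grep pattern below assumes the default chart naming, in which the pod names contain node-feature-discovery:
kubectl get pods -n gpu-operator | grep -i node-feature-discovery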