[GPU] RKE2 Setup and GPU Testing on RockyLinux 9.4 - V - NVIDIA GPU Operator …
Hello,
this is 꿈꾸는여행자.
Continuing from the previous posts in this series,
this installment covers the prerequisite checks for installing the NVIDIA GPU Operator and the installation itself.
The details are below.
Thank you.
> Below
Table of Contents
3. Installing the NVIDIA GPU Operator
3.1. Prerequisites
3.1.1. You have the kubectl and helm CLIs available on a client machine.
3.1.2. Label the namespace for the Operator
3.2. Procedure
3.2.1. Add the NVIDIA Helm repository:
3.2.2. Install the GPU Operator.
3.2.2.1. values.yaml
3.2.3. Verify the installation
3.2.3.1. Check GPU resources
3.2.3.2. Check Node Feature Discovery
Details
________________
3. Installing the NVIDIA GPU Operator
Reference: Installing the NVIDIA GPU Operator
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
3.1. Prerequisites
3.1.1. You have the kubectl and helm CLIs available on a client machine.
You can run the following commands to install the Helm CLI:
curl -fsSL -o get_helm.sh \
https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
[root@host ~]# curl -fsSL -o get_helm.sh \
https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
Downloading https://get.helm.sh/helm-v3.16.2-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
[root@host ~]#
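If the CLIs are already installed, a quick check such as the following (an extra step, not part of the original transcript) confirms both are available on the client machine:
helm version --short
kubectl version --client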
All worker nodes or node groups to run GPU workloads in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems.
For worker nodes or node groups that run CPU workloads only, the nodes can run any operating system because the GPU Operator does not perform any configuration or management of nodes for CPU-only workloads.
Nodes must be configured with a container engine such as CRI-O or containerd.
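On RKE2 the runtime is containerd by default; as an optional extra check, the wide node listing shows what each node reports in its CONTAINER-RUNTIME column:
kubectl get nodes -o wide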
3.1.2. Label the namespace for the Operator
If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged:
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
[root@host ~]# kubectl create ns gpu-operator
namespace/gpu-operator created
[root@host ~]# kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
namespace/gpu-operator labeled
[root@host ~]#
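As an optional extra check, the enforcement label can be verified on the namespace afterwards:
kubectl get ns gpu-operator --show-labels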
Node Feature Discovery (NFD) is a dependency for the Operator on each node. By default, NFD master and worker are automatically deployed by the Operator. If NFD is already running in the cluster, then you must disable deploying NFD when you install the Operator.
One way to determine if NFD is already running in the cluster is to check for a NFD label on your nodes:
kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
[root@host ~]# kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
false
[root@host ~]#
If the command output is true, then NFD is already running in the cluster.
Node Feature Discovery (NFD) is the component the NVIDIA GPU Operator relies on to detect each node's hardware features and to expose special hardware resources such as GPUs to the Kubernetes cluster. The GPU Operator deploys NFD automatically by default, so if NFD is already installed in the cluster, the automatic NFD deployment must be disabled to avoid a duplicate installation.
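If the check returns true and you want to see exactly which NFD labels are already present, a jq filter along these lines (an illustrative sketch, not from the original transcript) lists them per node:
kubectl get nodes -o json | jq '.items[] | {node: .metadata.name, nfd: (.metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io"))) | keys)}'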
3.2. Procedure
3.2.1. Add the NVIDIA Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
[root@host ~]# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
[root@host ~]#
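Before installing, it can be helpful to list the chart versions the repository offers, for example to pin the v24.6.2 release referenced below; this lookup is an optional extra step:
helm search repo nvidia/gpu-operator --versions | head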
3.2.2. Install the GPU Operator.
3.2.2.1. values.yaml
* gpu-operator
* https://github.com/NVIDIA/gpu-operator/tree/v24.6.2
* https://github.com/NVIDIA/gpu-operator/blob/v24.6.2/deployments/gpu-operator/values.yaml
Modify values.yaml and apply the following changes:
* Set defaultRuntime to containerd
* The toolkit component must be disabled
* When the toolkit is enabled, its daemonset container keeps restarting and failing (in this series the NVIDIA Container Toolkit was already installed on the host in the previous part)
vi gpu-operator-values.yaml
operator:
  defaultRuntime: "containerd"
toolkit:
  enabled: false
devicePlugin:
  enabled: true  # enable the NVIDIA Device Plugin
[root@host 20241017_RKE2]# vi gpu-operator-values.yaml
[root@host 20241017_RKE2]# cat gpu-operator-values.yaml
operator:
  defaultRuntime: "containerd"
toolkit:
  enabled: false
devicePlugin:
  enabled: true  # enable the NVIDIA Device Plugin
[root@host 20241017_RKE2]#
If a previous release is present, it can be removed first:
helm list -n gpu-operator
helm uninstall gpu-operator -n gpu-operator
kubectl get pods -n gpu-operator
kubectl delete namespace gpu-operator
Install the Operator, passing the values file prepared above:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
-f gpu-operator-values.yaml
kubectl get all -n gpu-operator
kubectl get pods -n gpu-operator -w
[root@host ~]# helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
NAME: gpu-operator-1729148604
LAST DEPLOYED: Thu Oct 17 16:03:27 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
[root@host ~]#
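As an aside, the overrides carried in gpu-operator-values.yaml could also be passed directly with --set flags; the following is only a sketch of an equivalent invocation, not the command actually run above:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set operator.defaultRuntime=containerd \
--set toolkit.enabled=false \
--set devicePlugin.enabled=true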
Refer to the Common Chart Customization Options and Common Deployment Scenarios for more information.
3.2.3. Verify the installation
3.2.3.1. Check GPU resources
kubectl describe nodes | grep -i nvidia
[root@host ~]# kubectl describe nodes | grep -i nvidia
nvidia.com/cuda.driver-version.full=560.35.03
nvidia.com/cuda.driver-version.major=560
nvidia.com/cuda.driver-version.minor=35
nvidia.com/cuda.driver-version.revision=03
nvidia.com/cuda.driver.major=560
nvidia.com/cuda.driver.minor=35
nvidia.com/cuda.driver.rev=03
nvidia.com/cuda.runtime-version.full=12.6
nvidia.com/cuda.runtime-version.major=12
nvidia.com/cuda.runtime-version.minor=6
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=6
nvidia.com/gfd.timestamp=1729148646
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=7
nvidia.com/gpu.compute.minor=5
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=turing
nvidia.com/gpu.machine=20QNCTO1WW
nvidia.com/gpu.memory=4096
nvidia.com/gpu.mode=graphics
nvidia.com/gpu.present=true
nvidia.com/gpu.product=Quadro-T1000
nvidia.com/gpu.replicas=1
nvidia.com/gpu.sharing-strategy=none
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single
nvidia.com/mps.capable=false
nvidia.com/vgpu.present=false
nvidia.com/gpu-driver-upgrade-enabled: true
nvidia.com/gpu: 1
nvidia.com/gpu: 1
gpu-operator nvidia-container-toolkit-daemonset-mfl5w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m42s
gpu-operator nvidia-dcgm-exporter-xsw9c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m41s
gpu-operator nvidia-device-plugin-daemonset-85grp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m42s
gpu-operator nvidia-operator-validator-mc85f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m42s
nvidia.com/gpu 0 0
Normal GPUDriverUpgrade 2m28s nvidia-gpu-operator Successfully updated node state label to [upgrade-done]%!(EXTRA <nil>)
[root@host ~]#
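A more compact way to confirm the allocatable GPU count per node is a custom-columns query (an optional extra check; note that the dots in the resource name must be escaped):
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'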
3.2.3.2. Check Node Feature Discovery
kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
[root@host ~]# kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
true
[root@host ~]#
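The NFD master and worker pods deployed by the Operator can also be listed directly; the grep pattern below assumes the default chart naming, in which the pod names contain node-feature-discovery:
kubectl get pods -n gpu-operator | grep -i node-feature-discovery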