[GPU] RKE2 Configuration and GPU Testing on RockyLinux 9.4 - V - Installing the NVIDIA GPU Operator


Hello,

This is 꿈꾸는여행자 (Dreaming Traveler).

Continuing from the previous posts in this series, this installment covers the prerequisites for and installation of the NVIDIA GPU Operator.

The details are as follows.

Thank you.

> Below


Table of Contents

 

 

3. Installing the NVIDIA GPU Operator

3.1. Prerequisites

3.1.1. You have the kubectl and helm CLIs available on a client machine.

3.1.2. Label the namespace for the Operator

3.2. Procedure

3.2.1. Add the NVIDIA Helm repository:

3.2.2. Install the GPU Operator.

3.2.2.1. values.yaml

3.2.3. Verify the installation

3.2.3.1. Check GPU resources

3.2.3.2. Check Node Feature Discovery

 



 

Details

 

 

 

 


________________



3. Installing the NVIDIA GPU Operator

Reference: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html



3.1. Prerequisites

3.1.1. You have the kubectl and helm CLIs available on a client machine.

You can run the following commands to install the Helm CLI:



curl -fsSL -o get_helm.sh \
    https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh

[root@host ~]# curl -fsSL -o get_helm.sh \
    https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh

Downloading https://get.helm.sh/helm-v3.16.2-linux-amd64.tar.gz

Verifying checksum... Done.

Preparing to install helm into /usr/local/bin

helm installed into /usr/local/bin/helm

[root@host ~]# 
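As a quick sanity check (an optional step, not part of the original transcript), you can confirm that both CLIs respond before proceeding:

kubectl version --client
helm version --short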




All worker nodes or node groups to run GPU workloads in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems.


For worker nodes or node groups that run CPU workloads only, the nodes can run any operating system because the GPU Operator does not perform any configuration or management of nodes for CPU-only workloads.



Nodes must be configured with a container engine such as CRI-O or containerd.
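One way to confirm the runtime in use (an optional check, not in the original post) is the CONTAINER-RUNTIME column that kubectl reports for each node; on RKE2 it should show a containerd:// version string:

kubectl get nodes -o wide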



3.1.2. Label the namespace for the Operator

If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged:



kubectl create ns gpu-operator



kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

[root@host ~]# kubectl create ns gpu-operator

namespace/gpu-operator created

[root@host ~]# kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

namespace/gpu-operator labeled

[root@host ~]# 
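To double-check that the enforcement label was applied (an optional verification), list the namespace labels:

kubectl get ns gpu-operator --show-labels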


Node Feature Discovery (NFD) is a dependency for the Operator on each node. By default, NFD master and worker are automatically deployed by the Operator. If NFD is already running in the cluster, then you must disable deploying NFD when you install the Operator.


One way to determine whether NFD is already running in the cluster is to check for an NFD label on your nodes:


kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

[root@host ~]# kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

false

[root@host ~]#




If the command output is true, then NFD is already running in the cluster.



Node Feature Discovery detects the hardware characteristics of each node and is the component the NVIDIA GPU Operator requires to expose special hardware resources such as GPUs to the Kubernetes cluster. The GPU Operator deploys NFD automatically by default, so if NFD is already installed in the cluster, you must disable the automatic NFD deployment to avoid a duplicate installation.
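Per the NVIDIA documentation, the bundled NFD deployment is controlled by the chart value nfd.enabled. A minimal sketch of an install with NFD disabled (only needed when the check above returns true):

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set nfd.enabled=false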



3.2. Procedure

3.2.1. Add the NVIDIA Helm repository:



helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

[root@host ~]# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

"nvidia" has been added to your repositories

Hang tight while we grab the latest from your chart repositories...

...Successfully got an update from the "nvidia" chart repository

Update Complete. ⎈Happy Helming!⎈

[root@host ~]# 
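Optionally, you can list the chart versions now available from the repository (an extra step, not in the original transcript):

helm search repo nvidia/gpu-operator --versions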


3.2.2. Install the GPU Operator.

3.2.2.1. values.yaml 

* gpu-operator

   * https://github.com/NVIDIA/gpu-operator/tree/v24.6.2

   * https://github.com/NVIDIA/gpu-operator/blob/v24.6.2/deployments/gpu-operator/values.yaml



Apply the following changes in values.yaml:

* Set defaultRuntime to containerd

* The toolkit must be disabled

   * With the toolkit enabled, the toolkit daemon container kept restarting and failing in this environment

vi gpu-operator-values.yaml



operator:

  defaultRuntime: "containerd"



toolkit:

  enabled: false



devicePlugin:

  enabled: true  # enable the NVIDIA Device Plugin

[root@host 20241017_RKE2]# vi gpu-operator-values.yaml 

[root@host 20241017_RKE2]# cat gpu-operator-values.yaml 

operator:

  defaultRuntime: "containerd"



toolkit:

  enabled: false



devicePlugin:

  enabled: true  # enable the NVIDIA Device Plugin

[root@host 20241017_RKE2]#
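To compare these overrides against the chart defaults (an optional check), dump the chart's full default values:

helm show values nvidia/gpu-operator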


If a previous release of the Operator is still present, remove it first:

helm list -n gpu-operator

helm uninstall gpu-operator -n gpu-operator

kubectl get pods -n gpu-operator

kubectl delete namespace gpu-operator

Then install the Operator, passing the values file prepared above (note that the transcript below was captured with the default configuration, without -f):



helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    -f gpu-operator-values.yaml



kubectl get all -n gpu-operator

kubectl get pods -n gpu-operator -w


[root@host ~]# helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator

NAME: gpu-operator-1729148604

LAST DEPLOYED: Thu Oct 17 16:03:27 2024

NAMESPACE: gpu-operator

STATUS: deployed

REVISION: 1

TEST SUITE: None

[root@host ~]# 


Refer to the Common Chart Customization Options and Common Deployment Scenarios for more information.



3.2.3. Verify the installation

3.2.3.1. Check GPU resources



kubectl describe nodes | grep -i nvidia

[root@host ~]# kubectl describe nodes | grep -i nvidia

                    nvidia.com/cuda.driver-version.full=560.35.03

                    nvidia.com/cuda.driver-version.major=560

                    nvidia.com/cuda.driver-version.minor=35

                    nvidia.com/cuda.driver-version.revision=03

                    nvidia.com/cuda.driver.major=560

                    nvidia.com/cuda.driver.minor=35

                    nvidia.com/cuda.driver.rev=03

                    nvidia.com/cuda.runtime-version.full=12.6

                    nvidia.com/cuda.runtime-version.major=12

                    nvidia.com/cuda.runtime-version.minor=6

                    nvidia.com/cuda.runtime.major=12

                    nvidia.com/cuda.runtime.minor=6

                    nvidia.com/gfd.timestamp=1729148646

                    nvidia.com/gpu-driver-upgrade-state=upgrade-done

                    nvidia.com/gpu.compute.major=7

                    nvidia.com/gpu.compute.minor=5

                    nvidia.com/gpu.count=1

                    nvidia.com/gpu.deploy.container-toolkit=true

                    nvidia.com/gpu.deploy.dcgm=true

                    nvidia.com/gpu.deploy.dcgm-exporter=true

                    nvidia.com/gpu.deploy.device-plugin=true

                    nvidia.com/gpu.deploy.driver=pre-installed

                    nvidia.com/gpu.deploy.gpu-feature-discovery=true

                    nvidia.com/gpu.deploy.node-status-exporter=true

                    nvidia.com/gpu.deploy.operator-validator=true

                    nvidia.com/gpu.family=turing

                    nvidia.com/gpu.machine=20QNCTO1WW

                    nvidia.com/gpu.memory=4096

                    nvidia.com/gpu.mode=graphics

                    nvidia.com/gpu.present=true

                    nvidia.com/gpu.product=Quadro-T1000

                    nvidia.com/gpu.replicas=1

                    nvidia.com/gpu.sharing-strategy=none

                    nvidia.com/mig.capable=false

                    nvidia.com/mig.strategy=single

                    nvidia.com/mps.capable=false

                    nvidia.com/vgpu.present=false

                    nvidia.com/gpu-driver-upgrade-enabled: true

  nvidia.com/gpu:     1

  nvidia.com/gpu:     1

  gpu-operator                nvidia-container-toolkit-daemonset-mfl5w                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m42s

  gpu-operator                nvidia-dcgm-exporter-xsw9c                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m41s

  gpu-operator                nvidia-device-plugin-daemonset-85grp                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m42s

  gpu-operator                nvidia-operator-validator-mc85f                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m42s

  nvidia.com/gpu     0            0

  Normal  GPUDriverUpgrade  2m28s  nvidia-gpu-operator  Successfully updated node state label to [upgrade-done]%!(EXTRA <nil>)

[root@host ~]# 
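Beyond inspecting labels, a minimal GPU workload confirms that the nvidia.com/gpu resource is actually schedulable. The pod below is a sketch modeled on NVIDIA's cuda-vectoradd sample (the image tag is an assumption; adjust it to a sample image available in your registry):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    # sample image tag assumed; replace with one available to you
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1  # request exactly one GPU
EOF

kubectl logs pod/cuda-vectoradd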


3.2.3.2. Check Node Feature Discovery

kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

[root@host ~]# kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

true

[root@host ~]# 
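To see exactly which NFD labels were applied (an optional follow-up using the same jq tooling), list the feature label keys themselves:

kubectl get nodes -o json | jq '.items[].metadata.labels | keys[] | select(startswith("feature.node.kubernetes.io"))'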

 

 

 
