# Installing the GPU Addon

## Prerequisites

* Verify that the GPU device is visible on the GPU node

```
$ lspci -nnk | grep -i nvidia

00:05.0 3D controller [0302]: NVIDIA Corporation Device [10de:20b7] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1532]
    Kernel modules: nvidiafb
```
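To script this check, the sketch below counts NVIDIA PCI devices by their vendor ID `10de` in `lspci -nn` style text. The function name `count_nvidia_devices` and the embedded sample (copied from the output above) are illustrative assumptions, not part of the product tooling.

```shell
# Count PCI devices whose vendor ID is 10de (NVIDIA) in `lspci -nn` style text.
# Only device lines (those starting with a bus address) are counted, so the
# "Subsystem:" line, which also carries a [10de:....] ID, is ignored.
count_nvidia_devices() {
  grep -Eci '^([0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f] .*\[10de:[0-9a-f]{4}\]' || true
}

# Sample text taken from the output above (a stand-in for a live node).
sample_lspci='00:05.0 3D controller [0302]: NVIDIA Corporation Device [10de:20b7] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1532]
    Kernel modules: nvidiafb'

nvidia_count=$(printf '%s\n' "$sample_lspci" | count_nvidia_devices)
echo "NVIDIA devices found: ${nvidia_count}"
```

On a real GPU node, the same function can read live data with `lspci -nn | count_nvidia_devices`.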

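* Install the NVIDIA driver
  * The NVIDIA driver must be installed in advance.
  * For the supported NVIDIA driver versions, see the link below.
  * <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags>
* For an air-gapped installation, upload the GPU Addon images and charts first.
  * For the required files and upload instructions, contact <escho@acornsoft.io>.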

## GPU Addon Installation

* List the installable addon names and profiles
  * Default when `--kubeconfig` is omitted: `${CUBE_HOME}/config/{{CLUSTER}}/acloud-client-kubeconfig`
  * Default file applied when `--profile` is omitted: `${CUBE_HOME}/extends/addon/profile/gpu-operator/default.yaml`

```
$ bin/cubectl addon list
┌────────────────┬─────────┬──────────┬─────────┬─────────────────────────────┐
│ ADDON NAME     │ VERSION │ STATUS   │ PROFILE │ VALUES PATH                 │
├────────────────┼─────────┼──────────┼─────────┼─────────────────────────────┤
│ csi-driver-nfs │ v4.8.0  │ disabled │         │ csi-driver-nfs/default.yaml │
│ gpu-operator   │ v23.9.0 │ disabled │         │ gpu-operator/default.yaml   │
│                │         │          │ redhat  │ gpu-operator/redhat.yaml    │
│                │         │          │ ubuntu  │ gpu-operator/ubuntu.yaml    │
│ kore-board     │ 0.5.5   │ disabled │         │ kore-board/default.yaml     │
└────────────────┴─────────┴──────────┴─────────┴─────────────────────────────┘
Duration 73.639078ms time
```

* Edit `${CUBE_HOME}/extends/addon/profile/gpu-operator/default.yaml`, or the profile YAML that matches the OS you are installing on
  * default.yaml is identical to the ubuntu profile.
  * To apply redhat.yaml, pass that file name as the profile argument.
    * ex) `bin/cubectl addon enable gpu-operator --profile redhat`
  * For an air-gapped installation, prepend `{{ registry_domain }}/` to each repository value.
  * `repository: nvcr.io/nvidia` -> `repository: {{ registry_domain }}/nvcr.io/nvidia`
  * `repository: nvcr.io/nvidia/cloud-native` -> `repository: {{ registry_domain }}/nvcr.io/nvidia/cloud-native`
  * `repository: nvcr.io/nvidia/k8s` -> `repository: {{ registry_domain }}/nvcr.io/nvidia/k8s`

```
$ vi ${CUBE_HOME}/extends/addon/profile/gpu-operator/default.yaml
```
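For air-gapped installs, the repository edits can also be applied with `sed` instead of by hand. The following is a minimal sketch under stated assumptions: the function name `prefix_repositories` and the sample file contents are hypothetical, and the pattern only handles unquoted `repository:` values that begin with `nvcr.io/`. It inserts the literal placeholder text described above and prints the result, leaving the input file untouched.

```shell
# Prepend the literal placeholder "{{ registry_domain }}/" to every
# `repository:` value that starts with nvcr.io/. Lines that already carry
# the placeholder do not match the pattern, so re-running is harmless.
prefix_repositories() {
  sed -E 's#^([[:space:]]*repository:[[:space:]]*)(nvcr\.io/)#\1{{ registry_domain }}/\2#' "$1"
}

# Hypothetical excerpt of a profile file, written to a temporary location.
sample_values=$(mktemp)
cat > "$sample_values" <<'EOF'
operator:
  repository: nvcr.io/nvidia
validator:
  repository: nvcr.io/nvidia/cloud-native
dcgmExporter:
  repository: nvcr.io/nvidia/k8s
EOF

# Print the rewritten content; the input file itself is not modified.
prefixed_values=$(prefix_repositories "$sample_values")
printf '%s\n' "$prefixed_values"
rm -f "$sample_values"
```

Review the printed result before replacing the real profile file, for example by redirecting the output to a new file and comparing it with `diff`.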

* Install the addon

```
$ bin/cubectl addon enable gpu-operator

addon enable start: gpu-operator ...
addon enable complete: gpu-operator
Duration 52.093538621s time
```

* Verify the addon installation

```
$ bin/cubectl addon list
┌────────────────┬─────────┬────────────┬─────────┬─────────────────────────────┐
│ ADDON NAME     │ VERSION │ STATUS     │ PROFILE │ VALUES PATH                 │
├────────────────┼─────────┼────────────┼─────────┼─────────────────────────────┤
│ csi-driver-nfs │ v4.8.0  │ disabled   │         │ csi-driver-nfs/default.yaml │
│ gpu-operator   │ v23.9.0 │ enabled ✅ │         │ gpu-operator/default.yaml   │
│                │         │            │ redhat  │ gpu-operator/redhat.yaml    │
│                │         │            │ ubuntu  │ gpu-operator/ubuntu.yaml    │
│ kore-board     │ 0.5.5   │ disabled   │         │ kore-board/default.yaml     │
└────────────────┴─────────┴────────────┴─────────┴─────────────────────────────┘
Duration 75.061448ms time

$ kubectl get pods -n gpu-operator

NAMESPACE      NAME                                                          READY   STATUS    RESTARTS      AGE   IP             NODE        NOMINATED NODE   READINESS GATES
gpu-operator   gpu-operator-5564789746-rlpzk                                 1/1     Running   0             65s   10.4.185.65    cp-node-1   <none>           <none>
gpu-operator   gpu-operator-node-feature-discovery-gc-78b479ccc6-ngfnd       1/1     Running   0             65s   10.4.211.67    wk-node-1   <none>           <none>
gpu-operator   gpu-operator-node-feature-discovery-master-569bfcd8bc-5xb8h   1/1     Running   0             65s   10.4.111.193   cp-node-3   <none>           <none>
gpu-operator   gpu-operator-node-feature-discovery-worker-dlxxh              1/1     Running   0             65s   10.4.111.194   cp-node-3   <none>           <none>
gpu-operator   gpu-operator-node-feature-discovery-worker-fmlmb              1/1     Running   0             65s   10.4.185.66    cp-node-1   <none>           <none>
gpu-operator   gpu-operator-node-feature-discovery-worker-gqn8z              1/1     Running   0             65s   10.4.238.68    cp-node-2   <none>           <none>
gpu-operator   gpu-operator-node-feature-discovery-worker-pksh4              1/1     Running   0             65s   10.4.109.2     wk-node-2   <none>           <none>
gpu-operator   gpu-operator-node-feature-discovery-worker-xx6gb              1/1     Running   0             65s   10.4.211.66    wk-node-1   <none>           <none>
```
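The pod listing can be checked automatically rather than by eye. The sketch below counts pods whose `STATUS` is neither `Running` nor `Completed`. The function name `count_not_running_pods` is an assumption for illustration, and the sample rows are copied from the output above with the trailing columns trimmed.

```shell
# Count pods whose STATUS is neither Running nor Completed in
# `kubectl get pods` text. The STATUS column is located from the header
# row, so the function works with or without the NAMESPACE column.
count_not_running_pods() {
  awk '
    !col { for (i = 1; i <= NF; i++) if ($i == "STATUS") col = i; next }
    NF >= col && $col != "Running" && $col != "Completed" { n++ }
    END { print n + 0 }
  '
}

# Sample rows copied from the output above (trailing columns trimmed).
sample_pods='NAMESPACE      NAME                                                          READY   STATUS    RESTARTS      AGE
gpu-operator   gpu-operator-5564789746-rlpzk                                 1/1     Running   0             65s
gpu-operator   gpu-operator-node-feature-discovery-worker-dlxxh              1/1     Running   0             65s'

not_running=$(printf '%s\n' "$sample_pods" | count_not_running_pods)
echo "Pods not yet running: ${not_running}"
```

Against a live cluster, the input would come from `kubectl get pods -n gpu-operator | count_not_running_pods`. A result of `0` suggests that every listed pod has started.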

* Remove the addon

```
$ bin/cubectl addon disable gpu-operator
```

## Configuring MIG (Multi-Instance GPU)

> Applies only to GPUs that support MIG

* Check which profiles each GPU supports
  * See the [official NVIDIA Supported MIG Profiles guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-profiles) for the profiles supported by each GPU.
  * The MIG profiles are also listed in `configmap/default-mig-parted-config`.
  * `kubectl describe cm default-mig-parted-config -n gpu-operator`
* Apply the profile name as a label on the GPU node

```
# Example for the 1g.5gb profile

$ kubectl label nodes $NODE nvidia.com/mig.config=all-1g.5gb --overwrite
```

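* Example of verifying the result (the sample output below was captured on an NVIDIA A30, where the MIG devices appear as `1g.6gb`)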

```
$ kubectl -n gpu-operator exec -it nvidia-dcgm-exporter-gc6bm -- bash

root@nvidia-dcgm-exporter-gc6bm:/# nvidia-smi
Thu Dec  7 06:02:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:00:05.0 Off |                   On |
| N/A   63C    P0              72W / 165W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@nvidia-dcgm-exporter-gc6bm:/# nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-79e36614-3f62-d3dd-cdd0-48b00aa446e0)
  MIG 1g.6gb      Device  0: (UUID: MIG-39f52290-ccf4-5e32-b8b8-cc1877a32051)
  MIG 1g.6gb      Device  1: (UUID: MIG-dbf3834e-128b-5965-88b7-2e3d2fe5a0aa)
  MIG 1g.6gb      Device  2: (UUID: MIG-c735f798-c9d5-5c0e-972c-e0bc6cdb05e7)
  MIG 1g.6gb      Device  3: (UUID: MIG-233d355f-f84e-530d-8526-797b5a867669)
```
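To confirm that the expected number of MIG instances was created, the `nvidia-smi -L` listing can be counted in a script. This is a minimal sketch: the function name `count_mig_devices` and the variable `expected_mig_devices` are assumptions for illustration, and the sample text is copied from the output above.

```shell
# Count MIG device lines in `nvidia-smi -L` style text.
count_mig_devices() {
  grep -Ec '^[[:space:]]+MIG [^[:space:]]+[[:space:]]+Device' || true
}

# Sample text copied from the output above.
sample_list='GPU 0: NVIDIA A30 (UUID: GPU-79e36614-3f62-d3dd-cdd0-48b00aa446e0)
  MIG 1g.6gb      Device  0: (UUID: MIG-39f52290-ccf4-5e32-b8b8-cc1877a32051)
  MIG 1g.6gb      Device  1: (UUID: MIG-dbf3834e-128b-5965-88b7-2e3d2fe5a0aa)
  MIG 1g.6gb      Device  2: (UUID: MIG-c735f798-c9d5-5c0e-972c-e0bc6cdb05e7)
  MIG 1g.6gb      Device  3: (UUID: MIG-233d355f-f84e-530d-8526-797b5a867669)'

# Number of instances expected for the profile in use (assumed value).
expected_mig_devices=4

mig_count=$(printf '%s\n' "$sample_list" | count_mig_devices)
if [ "$mig_count" -eq "$expected_mig_devices" ]; then
  echo "OK: found ${mig_count} MIG devices"
else
  echo "Mismatch: found ${mig_count}, expected ${expected_mig_devices}"
fi
```

On a live cluster, the listing would come from running `nvidia-smi -L` inside one of the NVIDIA pods, as in the session above.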

* For details, see the [official NVIDIA MIG guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html).

## Configuring Time-Slicing

> Share a single GPU among workloads on hardware that does not support MIG

* Create the ConfigMap
  * Write a ConfigMap that defines how the GPU is divided by time-slicing.

```
# Example that splits each GPU into 4 replicas

$ cat <<EOF > time-slicing-config-all.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF

$ kubectl apply -n gpu-operator -f time-slicing-config-all.yaml
```

* Reference the created ConfigMap in the NVIDIA ClusterPolicy object

```
$ kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'
```

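* Verify the configuration
  * After the patch, the gpu-feature-discovery and nvidia-device-plugin-daemonset pods restart automatically. Once they are running again, describe the GPU node to confirm that the configuration was applied.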

```
$ kubectl describe node $GPU_NODE
```
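With time-slicing enabled, each physical GPU is advertised `replicas` times, so a node with one GPU and `replicas: 4` should report `nvidia.com/gpu: 4`. The sketch below compares that expectation with the value found in `kubectl describe node` text. The function names and the sample excerpt are hypothetical, since the real output depends on the node.

```shell
# Expected number of advertised GPUs: physical GPUs multiplied by replicas.
# $1 = physical GPUs on the node, $2 = replicas from the ConfigMap
expected_gpu_capacity() {
  echo $(( $1 * $2 ))
}

# Extract the first nvidia.com/gpu value from `kubectl describe node` text.
advertised_gpu_capacity() {
  awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Hypothetical excerpt of `kubectl describe node` for a node with one GPU.
sample_describe='Capacity:
  cpu:             16
  nvidia.com/gpu:  4
Allocatable:
  cpu:             16
  nvidia.com/gpu:  4'

expected=$(expected_gpu_capacity 1 4)
advertised=$(printf '%s\n' "$sample_describe" | advertised_gpu_capacity)

if [ "$advertised" = "$expected" ]; then
  echo "OK: node advertises ${advertised} GPUs"
else
  echo "Mismatch: advertised=${advertised}, expected=${expected}"
fi
```

On a live cluster, pipe the actual `kubectl describe node` output into `advertised_gpu_capacity` in place of the sample text.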

* For details, see the [official NVIDIA Time-Slicing guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html).


