Installing the GPU Addon

Prerequisite

  • GPU λ…Έλ“œμ—μ„œ λ””λ°”μ΄μŠ€ 확인

  • NVIDIA λ“œλΌμ΄λ²„ μ„€μΉ˜

    • NVIDIA λ“œλΌμ΄λ²„λŠ” 사전에 μ„€μΉ˜λ˜μ–΄ μžˆμ–΄μ•Ό ν•©λ‹ˆλ‹€.

    • μ§€μ›ν•˜λŠ” NVIDIA λ“œλΌμ΄λ²„λŠ” μ•„λž˜ 링크 μ°Έμ‘°ν•˜μ„Έμš”.

    • https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags

  • 폐쇄망 μ„€μΉ˜μΌ 경우 GPU Addon 이미지 및 차트 μ—…λ‘œλ“œν•©λ‹ˆλ‹€.

    • κ΄€λ ¨ 파일 및 μ—…λ‘œλ“œ 방법은 escho@acornsoft.io 둜 문의 λ°”λžλ‹ˆλ‹€.

GPU Addon Installation

  • μ„€μΉ˜ κ°€λŠ₯ν•œ μ• λ“œμ˜¨ λͺ…μΉ­ 및 profile 쑰회

    • --kubeconfig λ―Έμž…λ ₯ μ‹œ κΈ°λ³Έκ°’ : ${CUBE_HOME}/config/{{CLUSTER}}/acloud-client-kubeconfig

    • --profile λ―Έμž…λ ₯ μ‹œ κΈ°λ³Έ 적용 파일 : ${CUBE_HOME}/extends/addon/profile/gpu-operator/default.yaml

  • ${CUBE_HOME}/extends/addon/profile/gpu-operator/default.yaml μ„€μΉ˜ν•˜λ €λŠ” OSEXT의 yaml μˆ˜μ •

    • default.yaml은 ubuntu와 λ™μΌν•©λ‹ˆλ‹€.

    • redhat.yaml을 μ μš©ν•˜λ €λ©΄ profile μΈμžκ°’μ— ν•΄λ‹Ή 파일λͺ…을 μž…λ ₯ν•©λ‹ˆλ‹€.

  • Addon μ„€μΉ˜

  • Addon μ„€μΉ˜ 확인

  • Addon μ‚­μ œ

How to configure MIG (Multi-Instance GPU)

For GPUs that support MIG

  • Check the profiles supported by each GPU

    • See the NVIDIA Supported MIG Profiles official guide for the profiles each GPU supports.

    • The MIG profiles can also be checked in configmap/default-mig-parted-config.

  • Example of verifying the result

  • For details, see the NVIDIA MIG official guide.

How to configure Time-Slicing

Sharing a single GPU on devices that do not support MIG

  • Create a ConfigMap

    • Write a ConfigMap that defines how the GPU is time-sliced.

  • Reference the created ConfigMap in the NVIDIA ClusterPolicy object

  • Verify the result

    • After the change is applied, the gpu-feature-discovery and nvidia-device-plugin-daemonset pods restart automatically. Once they are running again, describe the GPU node to confirm that the configuration was applied.

  • For details, see the NVIDIA Time-Slicing official guide.

$ lspci -nnk | grep -i nvidia

00:05.0 3D controller [0302]: NVIDIA Corporation Device [10de:20b7] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:1532]
    Kernel modules: nvidiafb

ex) bin/cubectl addon enable gpu-operator --profile redhat

  • 폐쇄망 μ„€μΉ˜ μ‹œ repository κ°’ μ•žλΆ€λΆ„μ— "{{ registry_domain }}/" 을 μΆ”κ°€ν•©λ‹ˆλ‹€.

  • repository: repository: nvcr.io/nvidia -> repository: {{ registry_domain }}/repository: nvcr.io/nvidia

  • repository: nvcr.io/nvidia/cloud-native -> repository: {{ registry_domain }}/nvcr.io/nvidia/cloud-native

  • repository: nvcr.io/nvidia/k8s -> repository: {{ registry_domain }}/nvcr.io/nvidia/k8s

  • kubectl describe cm default-mig-parted-config -n gpu-operator

  • GPUλ…Έλ“œμ— label에 ν”„λ‘œν•„λͺ… 적용

  • NVIDIA Supported MIG Profiles 곡식 κ°€μ΄λ“œarrow-up-right
    NVIDIA MIG 곡식 κ°€μ΄λ“œarrow-up-right
    NVIDIA Time-Slicing 곡식 κ°€μ΄λ“œarrow-up-right
    $ bin/cubectl addon list
    ┌────────────────┬─────────┬──────────┬─────────┬─────────────────────────────┐
    │ ADDON NAME     │ VERSION │ STATUS   │ PROFILE │ VALUES PATH                 │
    ├────────────────┼─────────┼──────────┼─────────┼─────────────────────────────┤
    │ csi-driver-nfs │ v4.8.0  │ disabled │         │ csi-driver-nfs/default.yaml │
    │ gpu-operator   │ v23.9.0 │ disabled │         │ gpu-operator/default.yaml   │
    │                │         │          │ redhat  │ gpu-operator/redhat.yaml    │
    │                │         │          │ ubuntu  │ gpu-operator/ubuntu.yaml    │
    │ kore-board     │ 0.5.5   │ disabled │         │ kore-board/default.yaml     │
    └────────────────┴─────────┴──────────┴─────────┴─────────────────────────────┘
    Duration 73.639078ms time
    $ vi ${CUBE_HOME}/extends/addon/profile/gpu-operator/default.yaml
    $ bin/cubectl addon enable gpu-operator
    
    addon enable start: gpu-operator ...
    addon enable complete: gpu-operator
    Duration 52.093538621s time
    $ bin/cubectl addon list
    ┌────────────────┬─────────┬────────────┬─────────┬─────────────────────────────┐
    │ ADDON NAME     │ VERSION │ STATUS     │ PROFILE │ VALUES PATH                 │
    ├────────────────┼─────────┼────────────┼─────────┼─────────────────────────────┤
    │ csi-driver-nfs │ v4.8.0  │ disabled   │         │ csi-driver-nfs/default.yaml │
    │ gpu-operator   │ v23.9.0 │ enabled ✅ │         │ gpu-operator/default.yaml   │
    │                │         │            │ redhat  │ gpu-operator/redhat.yaml    │
    │                │         │            │ ubuntu  │ gpu-operator/ubuntu.yaml    │
    │ kore-board     │ 0.5.5   │ disabled   │         │ kore-board/default.yaml     │
    └────────────────┴─────────┴────────────┴─────────┴─────────────────────────────┘
    Duration 75.061448ms time
    
    $ kubectl get pods -n gpu-operator
    
    NAMESPACE      NAME                                                          READY   STATUS    RESTARTS      AGE   IP             NODE        NOMINATED NODE   READINESS GATES
    gpu-operator   gpu-operator-5564789746-rlpzk                                 1/1     Running   0             65s   10.4.185.65    cp-node-1   <none>           <none>
    gpu-operator   gpu-operator-node-feature-discovery-gc-78b479ccc6-ngfnd       1/1     Running   0             65s   10.4.211.67    wk-node-1   <none>           <none>
    gpu-operator   gpu-operator-node-feature-discovery-master-569bfcd8bc-5xb8h   1/1     Running   0             65s   10.4.111.193   cp-node-3   <none>           <none>
    gpu-operator   gpu-operator-node-feature-discovery-worker-dlxxh              1/1     Running   0             65s   10.4.111.194   cp-node-3   <none>           <none>
    gpu-operator   gpu-operator-node-feature-discovery-worker-fmlmb              1/1     Running   0             65s   10.4.185.66    cp-node-1   <none>           <none>
    gpu-operator   gpu-operator-node-feature-discovery-worker-gqn8z              1/1     Running   0             65s   10.4.238.68    cp-node-2   <none>           <none>
    gpu-operator   gpu-operator-node-feature-discovery-worker-pksh4              1/1     Running   0             65s   10.4.109.2     wk-node-2   <none>           <none>
    gpu-operator   gpu-operator-node-feature-discovery-worker-xx6gb              1/1     Running   0             65s   10.4.211.66    wk-node-1   <none>           <none>
    $ bin/cubectl addon disable gpu-operator
    # Example for the 1g.5gb profile; substitute a profile that your GPU supports
    
    $ kubectl label nodes $NODE nvidia.com/mig.config=all-1g.5gb --overwrite
    $ kubectl -n gpu-operator exec -it nvidia-dcgm-exporter-gc6bm -- bash
    
    root@nvidia-dcgm-exporter-gc6bm:/# nvidia-smi
    Thu Dec  7 06:02:55 2023
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A30                     On  | 00000000:00:05.0 Off |                   On |
    | N/A   63C    P0              72W / 165W |                  N/A |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | MIG devices:                                                                          |
    +------------------+--------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
    |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
    |                  |                                |        ECC|                       |
    |==================+================================+===========+=======================|
    |  0    3   0   0  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
    |                  |               0MiB /  8191MiB  |           |                       |
    +------------------+--------------------------------+-----------+-----------------------+
    |  0    4   0   1  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
    |                  |               0MiB /  8191MiB  |           |                       |
    +------------------+--------------------------------+-----------+-----------------------+
    |  0    5   0   2  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
    |                  |               0MiB /  8191MiB  |           |                       |
    +------------------+--------------------------------+-----------+-----------------------+
    |  0    6   0   3  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
    |                  |               0MiB /  8191MiB  |           |                       |
    +------------------+--------------------------------+-----------+-----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
    root@nvidia-dcgm-exporter-gc6bm:/# nvidia-smi -L
    GPU 0: NVIDIA A30 (UUID: GPU-79e36614-3f62-d3dd-cdd0-48b00aa446e0)
      MIG 1g.6gb      Device  0: (UUID: MIG-39f52290-ccf4-5e32-b8b8-cc1877a32051)
      MIG 1g.6gb      Device  1: (UUID: MIG-dbf3834e-128b-5965-88b7-2e3d2fe5a0aa)
      MIG 1g.6gb      Device  2: (UUID: MIG-c735f798-c9d5-5c0e-972c-e0bc6cdb05e7)
      MIG 1g.6gb      Device  3: (UUID: MIG-233d355f-f84e-530d-8526-797b5a867669)
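
To check the result without reading the listing by eye, the `nvidia-smi -L` output can be summarized per MIG profile. This is a minimal sketch, assuming the line format shown above (`MIG <profile> Device <n>: ...`). `count_mig_devices` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print one
# "<profile> <count>" line per MIG profile found.
count_mig_devices() {
  # MIG device lines start with the word "MIG" followed by the profile name.
  # sort keeps the output order stable when several profiles are present.
  awk '$1 == "MIG" { n[$2]++ } END { for (p in n) print p, n[p] }' | sort
}

# Usage (the pod name is a placeholder for one of your GPU operator pods):
# kubectl -n gpu-operator exec <pod> -- nvidia-smi -L | count_mig_devices
```

For the A30 listing above, this would report four devices under the single `1g.6gb` profile.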
    # Example: splitting each GPU into 4 replicas
    
    $ cat <<EOF > time-slicing-config-all.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config-all
    data:
      any: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 4
    EOF
    
    $ kubectl apply -n gpu-operator -f time-slicing-config-all.yaml
    $ kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'
    $ kubectl describe node $GPU_NODE
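
With time-slicing, each physical GPU is advertised as `replicas` schedulable `nvidia.com/gpu` resources, so a node with one GPU and `replicas: 4` should report 4. The sketch below compares that expectation with the value found in `kubectl describe node` output. The function names are hypothetical, and it assumes the default behaviour where the resource is not renamed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking a time-slicing rollout.

# Expected advertised count = physical GPUs x replicas from the ConfigMap.
expected_gpu_count() {
  echo $(( $1 * $2 ))
}

# Read `kubectl describe node` output on stdin and print the first
# "nvidia.com/gpu:" value, which is the Capacity entry.
advertised_gpu_count() {
  awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Usage (GPU_NODE is a placeholder for your node name):
# actual="$(kubectl describe node "$GPU_NODE" | advertised_gpu_count)"
# [ "$actual" = "$(expected_gpu_count 1 4)" ] && echo "time-slicing applied"
```

If the two numbers differ, the device plugin pods may still be restarting, so it is worth repeating the check after they are Running.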