Even Scheduling on Virtual GPUs
In AI training, inference, and scientific computing, a single GPU often cannot provide enough compute power or memory, so multiple GPUs must work together. However, dedicating entire GPU cards to a task wastes resources when the task needs only part of each card's memory or compute power. Even scheduling on virtual GPUs addresses this by allocating GPU memory and compute power across multiple cards, so that tasks use only what they request and overall resource utilization improves.
This policy evenly distributes the requested GPU resources (GPU memory and compute power) across multiple virtual GPUs. A pod can therefore use several virtual GPUs at once, each providing an equal share of the requested resources, which enables fine-grained allocation and efficient use of GPUs. Even scheduling on virtual GPUs supports GPU memory isolation (configure volcano.sh/gpu-mem.128Mi) and compute-GPU memory isolation (configure both volcano.sh/gpu-mem.128Mi and volcano.sh/gpu-core.percentage).
- GPU memory isolation: A task's requested GPU memory can be split across multiple cards. For example, if an application requests M MiB of GPU memory and specifies N GPU cards on a single node, CCE evenly allocates the M MiB across the N cards. During execution, the task can use at most M/N MiB of memory on each card, ensuring memory isolation between tasks and preventing resource contention.
- Compute-GPU memory isolation: A task's requested GPU memory and compute power can both be split across multiple cards. For example, if an application requests M MiB of GPU memory and T% of compute power and specifies N GPU cards on a single node, CCE evenly allocates the M MiB of memory and T% of compute power across the N cards. During execution, the task can use at most M/N MiB of memory and T/N% of compute power on each card.

In GPU virtualization, GPU memory is allocated in integer multiples of 128 MiB, so the per-card memory M/N must be a multiple of 128 MiB. Similarly, compute power is allocated in integer multiples of 5%, so the per-card compute power T/N must be a multiple of 5.
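As a quick illustration of these constraints, the following hypothetical values (not part of the example later on this page) request 8192 MiB of GPU memory and 40% of compute power across two GPUs. The per-card shares, 4096 MiB and 20%, are multiples of 128 MiB and 5, so the request can be scheduled; by contrast, the same request across three GPUs (about 2730 MiB and 13.3% per card) could not be.
volcano.sh/gpu-num: '2'                 # Pod label: N = 2 GPU cards
volcano.sh/gpu-mem.128Mi: '64'          # Resource request: M = 64 x 128 MiB = 8192 MiB in total, 4096 MiB per card
volcano.sh/gpu-core.percentage: '40'    # Resource request: T = 40% in total, 20% per card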
Prerequisites
- A CCE standard or Turbo cluster of v1.27.16-r20, v1.28.15-r10, v1.29.10-r10, v1.30.6-r10, v1.31.4-r0, or later is available.
- CCE AI Suite (NVIDIA GPU) has been installed in the cluster. For details, see CCE AI Suite (NVIDIA GPU). The add-on version must meet the following requirements:
- If the cluster version is 1.27 or earlier, the add-on version must be 2.1.41 or later.
- If the cluster version is 1.28 or later, the add-on version must be 2.7.57 or later.
- GPU nodes with virtualization enabled at the cluster or node pool level are available in the cluster. For details, see Preparing Virtualized GPU Resources.
- Volcano v1.16.10 or later has been installed. For details, see Volcano Scheduler. (A quick way to check for an existing installation is shown after this list.)
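If you are not sure whether Volcano is already running, one simple check is to look for its scheduler workloads; in a typical installation they run in the kube-system namespace, although the exact deployment names and namespace may differ in your environment:
kubectl get deployment -n kube-system | grep -i volcano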
Notes and Constraints
- Even scheduling on virtual GPUs is not compatible with Kubernetes' default GPU scheduling, that is, workloads that request nvidia.com/gpu resources.
- Workloads with virtual GPU scheduling enabled cannot trigger auto scaling in the cluster's node pools.
Examples
The following example shows how to create a workload that uses even scheduling on virtual GPUs with GPU memory isolation. The workload runs one pod that requests 8 GiB of GPU memory spread across two GPUs, so each GPU provides 4 GiB of memory. After the workload is created, CCE automatically schedules it to a GPU node that meets these conditions.
- Use kubectl to access the cluster.
- Run the following command to create a YAML file for a workload that requires even scheduling on virtual GPUs:
vim gpu-app.yaml
The file content is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
        volcano.sh/gpu-num: '2'    # Number of GPUs for even scheduling. In this example, the pod requests two GPUs, with each GPU providing 4 GiB of memory.
    spec:
      schedulerName: volcano
      containers:
        - image: <your_image_address>    # Replace it with your image address.
          name: container-0
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
              volcano.sh/gpu-mem.128Mi: '64'    # Requested GPU memory. The value 64 indicates that 8 GiB of GPU memory is requested (64 x 128 MiB = 8 GiB).
            limits:
              cpu: 250m
              memory: 512Mi
              volcano.sh/gpu-mem.128Mi: '64'    # Upper limit of the GPU memory that can be used, which is 8 GiB.
      imagePullSecrets:
        - name: default-secret
To enable compute-GPU memory isolation, configure volcano.sh/gpu-core.percentage in both resources.requests and resources.limits so that GPU compute power is also allocated to the pod. Keep in mind that the per-card share must be a multiple of 5. For example, with two GPUs, set volcano.sh/gpu-core.percentage to '10' so that each GPU provides 5% of compute power, as shown in the sketch below.
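The following is a minimal sketch of the container's resources section with compute-GPU memory isolation enabled, keeping the two-GPU layout from the manifest above (each GPU would then provide 4 GiB of memory and 5% of compute power):
resources:
  requests:
    cpu: 250m
    memory: 512Mi
    volcano.sh/gpu-mem.128Mi: '64'          # 8 GiB of GPU memory in total, 4 GiB per GPU
    volcano.sh/gpu-core.percentage: '10'    # 10% of compute power in total, 5% per GPU
  limits:
    cpu: 250m
    memory: 512Mi
    volcano.sh/gpu-mem.128Mi: '64'
    volcano.sh/gpu-core.percentage: '10'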
- Run the following command to create the workload:
kubectl apply -f gpu-app.yaml
If information similar to the following is displayed, the workload has been created:
deployment.apps/gpu-app created
- Run the following command to view the created pod:
kubectl get pod -n default
Information similar to the following is displayed:
NAME                      READY   STATUS    RESTARTS   AGE
gpu-app-6bdb4d7cb-pmtc2   1/1     Running   0          21s
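If the pod instead stays in the Pending state, its scheduling events usually explain why (for example, no node has enough free virtual GPU resources). The pod name below is from this example; replace it with your own:
kubectl describe pod gpu-app-6bdb4d7cb-pmtc2 -n default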
- Log in to the pod and check the total GPU memory allocated to the pod.
kubectl exec -it gpu-app-6bdb4d7cb-pmtc2 -- nvidia-smi
Expected output:
Fri Mar  7 03:36:03 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   33C    P8               9W /  70W |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:0E.0 Off |                    0 |
| N/A   34C    P8               9W /  70W |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
The command output shows that the pod can use two GPUs, each providing 4096 MiB (4 GiB) of GPU memory. The pod's requested 8 GiB of GPU memory has been evenly distributed across the two GPUs, and each GPU's memory allocation is isolated from other tasks.
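As an optional further check, you can inspect a GPU node's allocatable extended resources; on a node with GPU virtualization enabled, volcano.sh/gpu-mem.128Mi (and volcano.sh/gpu-core.percentage, if compute power is virtualized) should be listed there. The node name is a placeholder:
kubectl get node <node_name> -o jsonpath='{.status.allocatable}'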