Install OpenShift AI and GPU Support
This guide installs the platform stack: Red Hat OpenShift AI for model serving, plus GPU infrastructure for on-cluster inference. Installing the GPU operators before RHOAI produces more reliable results because NFD labels are already present when the DataScienceCluster reconciles.
Path B users
If you're running on Developer Sandbox or CRC and using an external LLM endpoint, skip this guide. You don't need OpenShift AI or GPU operators installed.
Prerequisites
- OpenShift 4.20+ on AWS (see Choosing a Cluster)
- cluster-admin rights
- oc logged in to the cluster
- Budget for a GPU instance (approximately $1.60/hr for g6e.4xlarge at time of writing)
This tutorial targets Red Hat OpenShift AI 3.x via the fast-3.x
channel (validated on 3.3.1). RHOAI 3.x requires OpenShift 4.19 or later.
RHOAI 3.4 for supplementary modules
The core tutorial (Modules 0-11) works on RHOAI 3.3+. The
Models as a Service and
MCP Gateway supplementary modules
require RHOAI 3.4, which reached GA on May 14, 2026. If you're on 3.3,
the fast-3.x channel will deliver 3.4 automatically — approve the
upgrade in OLM when prompted. No operator reinstall is needed.
Multi-cluster safety
Every oc command in this guide includes --context="$CTX" to avoid
targeting the wrong cluster. Set it once per shell session:
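One way to set it, sketched under the assumption that your current kubeconfig context already points at the tutorial cluster (`set_ctx` is a hypothetical helper name, not something shipped in the repo):

```shell
# Capture the active kubeconfig context in CTX so later commands can
# pass --context="$CTX". Assumes `oc login` has already been run.
set_ctx() {
  CTX="$(oc config current-context)" || return 1
  export CTX
  echo "Using context: $CTX"
}
```

Run `set_ctx` once per shell. If the current context is not the right one, pick a name from `oc config get-contexts` and export `CTX` by hand instead.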
Get the manifests
The YAML manifests used in this guide live in the repo under manifests/platform/.
Clone the repo if you haven't already:
Each step below also shows the YAML inline for reference, so you can review what each manifest contains before applying it.
Step 1: Install Node Feature Discovery (NFD)
NFD detects hardware features on each node — including GPUs — and exposes them as Kubernetes labels. The NVIDIA GPU Operator depends on these labels, so NFD must be installed first.
Create the namespace, then apply the OperatorGroup and Subscription:
oc create namespace openshift-nfd --context="$CTX" --dry-run=client -o yaml \
| oc apply --context="$CTX" -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd-group
  namespace: openshift-nfd
spec:
  targetNamespaces:
    - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
Wait for the operator to install:
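A minimal sketch of that wait, assuming the Subscription name used in the manifest above (`nfd`). The helper name `wait_for_csv` and its retry settings are hypothetical. It reads the CSV name the Subscription reports as installed, then polls that CSV's phase:

```shell
# Poll until the CSV installed by a Subscription reaches phase Succeeded.
# Usage: wait_for_csv <namespace> <subscription-name>
# MAX_TRIES and POLL_SECONDS are illustrative knobs, not OLM settings.
wait_for_csv() {
  ns="$1"
  sub="$2"
  tries=0
  while [ "$tries" -lt "${MAX_TRIES:-60}" ]; do
    # status.installedCSV stays empty until OLM has installed something.
    csv="$(oc get subscription "$sub" --context="$CTX" -n "$ns" \
      -o jsonpath='{.status.installedCSV}' 2>/dev/null)"
    if [ -n "$csv" ]; then
      phase="$(oc get csv "$csv" --context="$CTX" -n "$ns" \
        -o jsonpath='{.status.phase}' 2>/dev/null)"
      if [ "$phase" = "Succeeded" ]; then
        echo "$csv: Succeeded"
        return 0
      fi
    fi
    tries=$((tries + 1))
    sleep "${POLL_SECONDS:-10}"
  done
  echo "Timed out waiting for $sub in $ns" >&2
  return 1
}
```

For this step the call would be `wait_for_csv openshift-nfd nfd`. Going through the Subscription rather than listing every CSV in the namespace avoids matching CSVs that OLM copies in from cluster-wide operators.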
Once the CSV phase shows Succeeded, create the NFD instance operand.
This tells the operator to start scanning nodes:
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
Step 2: Install the NVIDIA GPU Operator
Create the namespace:
oc create namespace nvidia-gpu-operator --context="$CTX" --dry-run=client -o yaml \
| oc apply --context="$CTX" -f -
Apply the OperatorGroup and Subscription. The GPU Operator uses Manual install plan approval so you control exactly which version lands on the cluster:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v25.3
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
Because installPlanApproval is Manual, you need to find and approve the
InstallPlan:
Wait for the InstallPlan to appear:
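A small polling sketch for that wait. It reuses the same jsonpath filter as the approval command in this step; the helper name `wait_for_install_plan` and the retry settings are hypothetical:

```shell
# Poll until an unapproved InstallPlan exists in nvidia-gpu-operator,
# then print its name. MAX_TRIES and POLL_SECONDS are illustrative.
wait_for_install_plan() {
  tries=0
  while [ "$tries" -lt "${MAX_TRIES:-30}" ]; do
    plan="$(oc get installplan --context="$CTX" -n nvidia-gpu-operator \
      -o jsonpath='{.items[?(@.spec.approved==false)].metadata.name}' \
      2>/dev/null)"
    if [ -n "$plan" ]; then
      echo "$plan"
      return 0
    fi
    tries=$((tries + 1))
    sleep "${POLL_SECONDS:-5}"
  done
  echo "No pending InstallPlan found" >&2
  return 1
}
```

Running plain `oc get installplan --context="$CTX" -n nvidia-gpu-operator` afterwards shows the plan together with its APPROVED column.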
Once it appears with APPROVED=false, approve it:
INSTALL_PLAN=$(oc get installplan --context="$CTX" -n nvidia-gpu-operator \
-o jsonpath='{.items[?(@.spec.approved==false)].metadata.name}')
oc patch installplan "$INSTALL_PLAN" --context="$CTX" -n nvidia-gpu-operator \
--type merge -p '{"spec": {"approved": true}}'
Wait for the CSV to succeed:
Step 3: Install Red Hat OpenShift AI
Already have RHOAI on a shared cluster?
If Red Hat OpenShift AI is already installed (e.g., by a cluster admin), skip the Namespace, OperatorGroup, and Subscription steps below. Jump straight to Create a DataScienceCluster.
From the OpenShift web console:
- Navigate to Operators -> OperatorHub.
- Search for Red Hat OpenShift AI.
- Click the tile, then Install.
- Choose the fast-3.x channel (3.2 or later).
- Accept the defaults (installed into redhat-ods-operator).
Or from the CLI — create the namespace first, then apply the OperatorGroup and Subscription:
oc create namespace redhat-ods-operator --context="$CTX" \
--dry-run=client -o yaml | oc apply --context="$CTX" -f -
oc apply --context="$CTX" -n redhat-ods-operator -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: fast-3.x  # stable-3.x also works and is more common in production clusters
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
Wait until the operator pods are running:
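One way to block on that, sketched with `oc wait`; the helper name `wait_for_rhoai_operator` and the five-minute timeout are assumptions rather than values from the repo:

```shell
# Block until every pod in the operator namespace reports Ready, or give
# up after five minutes. `oc wait` fails immediately if no pods exist yet,
# so rerun it if the Subscription was only just created.
wait_for_rhoai_operator() {
  oc wait pod --all --context="$CTX" -n redhat-ods-operator \
    --for=condition=Ready --timeout=300s
}
```

`oc get pods --context="$CTX" -n redhat-ods-operator` gives the same information as a one-off listing if you prefer to check by eye.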
Step 4: Create a GPU MachineSet
OpenShift manages worker nodes through MachineSets. Rather than writing one from scratch, clone an existing worker MachineSet and modify it for GPU use.
The script below exports the first worker MachineSet, changes the instance
type to g6e.4xlarge (1 NVIDIA L40S, 48 GB VRAM), increases the disk to
200 GB, adds the nvidia.com/gpu taint, and sets replicas to 1:
Clone an existing worker MachineSet for GPU:
WORKER_MS=$(oc get machineset --context="$CTX" -n openshift-machine-api \
-o jsonpath='{.items[0].metadata.name}')
oc get machineset "$WORKER_MS" --context="$CTX" -n openshift-machine-api -o json | \
jq --arg name "gpu-${WORKER_MS}" '
.metadata.name = $name |
.metadata.resourceVersion = null |
.metadata.uid = null |
.metadata.creationTimestamp = null |
.spec.replicas = 1 |
.spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"] = $name |
.spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"] = $name |
.spec.template.spec.providerSpec.value.instanceType = "g6e.4xlarge" |
.spec.template.spec.providerSpec.value.blockDevices[0].ebs.volumeSize = 200 |
.spec.template.spec.taints = [{"key": "nvidia.com/gpu", "value": "", "effect": "NoSchedule"}]
' | oc apply --context="$CTX" -f -
Parallel work
Steps 4 and 5 can run in parallel — the GPU MachineSet provisions while the DataScienceCluster reconciles. Both take a few minutes.
Node provisioning takes ~15 minutes
AWS needs time to launch the instance, and the GPU Operator needs time to install drivers on the new node. Watch progress:
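A sketch of a progress check, listing each Machine with its phase and instance type (`show_machine_phases` is a hypothetical helper name):

```shell
# Print every Machine with its lifecycle phase and instance type.
# Rerun it, or add -w to the oc command, to follow the GPU machine.
show_machine_phases() {
  oc get machines --context="$CTX" -n openshift-machine-api \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,TYPE:.spec.providerSpec.value.instanceType
}
```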
The Machine will progress through Provisioning -> Provisioned ->
Running. Once it's Running, wait for the corresponding Node to become
Ready:
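The node wait can be sketched with `oc wait`, selecting the node by the well-known `node.kubernetes.io/instance-type` label. The helper name and the 20-minute timeout are assumptions:

```shell
# Wait for nodes of the GPU instance type to report Ready.
# Usage: wait_for_gpu_node [instance-type]   (defaults to g6e.4xlarge)
# `oc wait` errors out if no node carries the label yet, so start it
# only after the Machine has reached Running.
wait_for_gpu_node() {
  instance_type="${1:-g6e.4xlarge}"
  oc wait node --context="$CTX" \
    -l "node.kubernetes.io/instance-type=${instance_type}" \
    --for=condition=Ready --timeout=20m
}
```

Pass a different type, for example `wait_for_gpu_node g5.4xlarge`, if you changed `instanceType` in the jq command.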
Instance type alternatives
g6e.4xlarge provides an L40S with 48 GB VRAM (approximately $1.60/hr
at time of writing). If your region doesn't have g6e availability,
g5.4xlarge (A10G, 24 GB VRAM, approximately $1.20/hr at time of
writing) also works for tutorial-sized models. Adjust instanceType in
the jq command above. Check current AWS pricing for your region and
account agreement.
Step 5: Create a DataScienceCluster
The DataScienceCluster (DSC) custom resource tells the operator which
components to enable. For this tutorial you need kserve managed; leave
the rest at the operator's defaults. Two non-obvious choices below get
inline comments.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    kserve:
      managementState: Managed
      # The rest of the tutorial assumes KServe Raw with a Headless predictor
      # service; that's what produces the `:8000` URL caveat in serve-an-llm.md
      # and install-ogx.md. RHOAI 3.x defaults to Headless; setting it
      # explicitly documents the dependency.
      rawDeploymentServiceConfig: Headless
    dashboard:
      managementState: Managed
    modelregistry:
      managementState: Managed
      registriesNamespace: rhoai-model-registries
    llamastackoperator:
      # RHOAI 3.x bundles a LlamaStack/OGX operator. We install the upstream
      # ogx-k8s-operator in install-ogx.md instead (the rebrand hasn't shipped
      # via RHOAI yet); leaving this Removed avoids two operators reconciling
      # the same LlamaStackDistribution.
      managementState: Removed
Model registry
Enabling modelregistry makes the Red Hat AI model catalog available in
the RHOAI dashboard under AI hub, so you can browse and deploy models
from the UI.
Apply it:
Wait for GPU node readiness
If you started Steps 4 and 5 in parallel, pause here until your GPU
node is fully ready. The ClusterPolicy's driver daemonsets need a GPU
node to schedule on. Confirm the node is Ready and reporting GPU
capacity:
oc get nodes --context="$CTX" \
-o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].type,GPU:.status.capacity."nvidia\.com/gpu"
You should see one node with 1 in the GPU column before proceeding.
Step 6: Apply the NVIDIA ClusterPolicy
The NVIDIA ClusterPolicy tells the GPU Operator how to configure drivers, device plugins, and monitoring on GPU nodes. Apply it after the operator is installed:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: crio
    use_ocp_driver_toolkit: true
  driver:
    enabled: true
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      maxUnavailable: 25%
  devicePlugin:
    enabled: true
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
    serviceMonitor:
      enabled: true
  gfd:
    enabled: true
  migManager:
    enabled: true
    config:
      default: all-disabled
  nodeStatusExporter:
    enabled: true
  toolkit:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: "false"
  daemonsets:
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    updateStrategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: "1"
Key settings
- use_ocp_driver_toolkit: true: uses the OpenShift Driver Toolkit to build driver containers matched to your cluster's kernel version.
- serviceMonitor: enabled: true: exposes GPU metrics (utilization, temperature, memory) to OpenShift's built-in Prometheus stack.
- The daemonsets.tolerations block ensures GPU Operator pods can schedule onto nodes with the nvidia.com/gpu taint from Step 4.
Wait for the ClusterPolicy to reach ready state. This can take several
minutes as the operator builds and loads driver containers:
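A polling sketch for that wait, reading the ClusterPolicy's `status.state` field. The helper name and retry settings are hypothetical:

```shell
# Poll the ClusterPolicy until status.state is "ready", printing the
# state on each attempt. MAX_TRIES and POLL_SECONDS are illustrative.
wait_for_cluster_policy() {
  tries=0
  while [ "$tries" -lt "${MAX_TRIES:-60}" ]; do
    state="$(oc get clusterpolicy gpu-cluster-policy --context="$CTX" \
      -o jsonpath='{.status.state}' 2>/dev/null)"
    echo "ClusterPolicy state: ${state:-unknown}"
    if [ "$state" = "ready" ]; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "${POLL_SECONDS:-15}"
  done
  echo "ClusterPolicy did not become ready in time" >&2
  return 1
}
```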
Note
The ClusterPolicy status may briefly show notReady while driver
daemonsets initialize on the GPU node. This is normal — wait for it to
settle to ready.
Step 7: Verify
Check that the DataScienceCluster is ready:
The output should show Ready.
DSC may show False in the READY column
oc get dsc may display READY: False even when the cluster is
fully operational. This happens because the summary column reflects
removed components. Check the actual readiness with:
oc get dsc default-dsc --context="$CTX" \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
If this prints True, your cluster is healthy.
The Dashboard hostname should also resolve. RHOAI 3.x exposes the
dashboard via Gateway API (not a plain Route in redhat-ods-applications):
oc get route data-science-gateway --context="$CTX" -n openshift-ingress \
-o jsonpath='{.spec.host}'
Confirm the GPU is visible to Kubernetes:
oc get nodes --context="$CTX" \
-o custom-columns=NAME:.metadata.name,GPU:.status.capacity."nvidia\.com/gpu"
You should see one node reporting a GPU capacity of 1. If the GPU column
shows <none> for all nodes, the driver pods may still be initializing.
Check their status:
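As a sketch, the helper below lists only the GPU Operator pods that need attention, by filtering the STATUS column of the plain pod listing (`unhealthy_gpu_pods` is a hypothetical name):

```shell
# Print the name and status of every GPU Operator pod whose STATUS
# column is neither Running nor Completed. Empty output means nothing
# needs attention.
unhealthy_gpu_pods() {
  oc get pods --context="$CTX" -n nvidia-gpu-operator --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Drop the `awk` filter to see the full listing, and use `oc logs` on any pod the helper reports.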
All pods should be Running or Completed. If any are stuck in
CrashLoopBackOff, check their logs for driver compatibility issues.
If kserve shows Ready in the DSC status and a GPU node reports capacity,
you're done.
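To see the per-component conditions rather than only the summary, a sketch like the following prints every condition on the DSC as `type=status`. The helper name is hypothetical, and the exact condition type names depend on the RHOAI version, so treat the kserve entry's name as something to look up in the output rather than a fixed string:

```shell
# Print each DataScienceCluster condition as "<type>=<status>", one per
# line, so component-level readiness is visible at a glance.
show_dsc_conditions() {
  oc get dsc default-dsc --context="$CTX" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

Piping the output through `grep -i kserve` narrows it to the component this tutorial depends on.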
Step 8: Create a Hardware Profile
A HardwareProfile makes GPU resources selectable in the RHOAI dashboard when deploying models. Before applying the manifest, open the RHOAI dashboard and navigate to Settings > Hardware profiles. Note the default profile that is already listed. Then apply the manifest and refresh the page to see the new NVIDIA GPU profile appear.
apiVersion: infrastructure.opendatahub.io/v1
kind: HardwareProfile
metadata:
  name: nvidia-gpu
  namespace: redhat-ods-applications
  labels:
    app.kubernetes.io/part-of: hardwareprofile
    app.opendatahub.io/hardwareprofile: "true"
  annotations:
    opendatahub.io/display-name: "NVIDIA GPU"
    opendatahub.io/description: "NVIDIA GPU accelerator for AI/ML workloads"
    opendatahub.io/disabled: "false"
spec:
  identifiers:
    - displayName: CPU
      identifier: cpu
      defaultCount: 2
      maxCount: 8
      minCount: 1
      resourceType: CPU
    - displayName: Memory
      identifier: memory
      defaultCount: 8Gi
      maxCount: 32Gi
      minCount: 2Gi
      resourceType: Memory
    - displayName: GPU
      identifier: nvidia.com/gpu
      defaultCount: 1
      maxCount: 1
      minCount: 1
      resourceType: Accelerator
  scheduling:
    type: Node
    node:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule