Serve an LLM

The agent in this tutorial talks to an OpenAI-compatible LLM endpoint. The reference model is RedHatAI/gpt-oss-20b served via vLLM. It's a 20 B-parameter Apache-2.0 reasoning model with native tool-calling support; Red Hat AI's MXFP4-quantized variant fits in ~16 GB of VRAM and runs on a single L40S/A100/H100 — a realistic single-GPU target for the tutorial. It also matches the model the workshop-setup-ogx playbook validates end-to-end against OGX, so Module 10's platform-mode install slots in without re-tuning.

This guide covers two main paths (a third, Models as a Service, is summarized near the end):

  • Path A — deploy vLLM on your OpenShift cluster (recommended).
  • Path B — point the tutorial at an external OpenAI-compatible endpoint.

Both paths produce the same three variables, which the rest of the tutorial reads. For Path A, the values depend on the deploy method:

Variable | Catalog deploy | Manual deploy
MODEL_ENDPOINT | https://redhataigpt-oss-20b-gpt-oss-model.apps.<cluster>/v1 (local dev) or https://redhataigpt-oss-20b-predictor.gpt-oss-model.svc.cluster.local:8443/v1 (on-cluster) | http://gpt-oss-predictor.gpt-oss-model.svc.cluster.local:8000/v1
MODEL_NAME | redhataigpt-oss-20b | RedHatAI/gpt-oss-20b
OPENAI_API_KEY | Service account token from dashboard | not-required

Path A: Serve on-cluster with vLLM

There are two ways to deploy the model. The catalog deploy is faster and recommended for most users. The manual deploy gives full control over the ServingRuntime and InferenceService configuration.

Multi-cluster safety

Every oc command in this guide includes --context="$CTX" to avoid targeting the wrong cluster. Set it once per shell session:

export CTX=$(oc config current-context)
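If CTX is empty, an oc command with --context="$CTX" will likely fall back to whatever context is current, which is the situation this convention tries to avoid. A minimal guard, assuming a POSIX shell; require_ctx is a hypothetical helper name, not part of oc:

```shell
# Hypothetical helper (not part of oc): fail fast when CTX is unset or empty,
# so later `oc ... --context="$CTX"` calls never run with an empty context name.
require_ctx() {
  if [ -z "${CTX:-}" ]; then
    echo 'CTX is not set; run: export CTX=$(oc config current-context)' >&2
    return 1
  fi
  # Print the context name so the caller can see which cluster is targeted.
  printf '%s\n' "$CTX"
}

# Example usage (commented out because it needs a cluster login):
# require_ctx && oc get nodes --context="$CTX"
```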

Catalog deploy

Prerequisites

  • OpenShift AI 3.x installed with model registry enabled
  • One GPU node with ~24 GB VRAM (L40S, A10, A100, or H100 all work)
  • RHOAI dashboard accessible

1. Create the namespace

oc new-project gpt-oss-model --context="$CTX"
oc label namespace gpt-oss-model opendatahub.io/dashboard=true --context="$CTX"

2. Deploy from the AI hub

Open the RHOAI dashboard and navigate to AI hub > Models.

Find gpt-oss-20b (from RedHatAI) and click Deploy.

In the deployment form:

  • Set the project to gpt-oss-model
  • Select the NVIDIA GPU hardware profile (created in the install guide)
  • Accept the remaining defaults

3. Wait for the model

Wait for the model to show Ready status in the dashboard under Models > Deployed models.

4. Get the endpoint URL

Once ready, click on the deployed model. Under Inference endpoints, you will find:

  • Internal URL: https://redhataigpt-oss-20b-predictor.gpt-oss-model.svc.cluster.local:8443 (for agents deployed on the cluster)
  • External URL: https://redhataigpt-oss-20b-gpt-oss-model.apps.<cluster-domain> (for local development)
  • Token: a service account token for authentication

Model ID difference

The catalog deploy registers the model as redhataigpt-oss-20b (no slashes). This differs from the Hugging Face ID RedHatAI/gpt-oss-20b used in manual deploys. Use the catalog name when setting MODEL_NAME.
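The smoke test in the next step reads MODEL_ENDPOINT and OPENAI_API_KEY from the shell. The sketch below shows one way to set them for local development. catalog_external_endpoint is a hypothetical helper, and the host pattern is copied from the example External URL above, so it may not match a deployment or project with a different name.

```shell
# Hypothetical helper: build the external endpoint for the catalog deploy.
# $1 is the part of the route host that follows "apps." (your cluster domain).
# The host pattern is an assumption taken from the example URL in this guide.
catalog_external_endpoint() {
  printf 'https://redhataigpt-oss-20b-gpt-oss-model.apps.%s/v1\n' "$1"
}

# Example usage; replace both placeholders with values from the dashboard:
# export MODEL_ENDPOINT="$(catalog_external_endpoint '<cluster-domain>')"
# export MODEL_NAME="redhataigpt-oss-20b"
# export OPENAI_API_KEY='<service-account-token>'
```

If the dashboard shows a different External URL, copy that value and append /v1 rather than relying on the pattern.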

5. Smoke test

Using the external URL:

curl -sk "$MODEL_ENDPOINT/models" \
  -H "Authorization: Bearer $OPENAI_API_KEY" | jq
curl -sk "$MODEL_ENDPOINT/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "redhataigpt-oss-20b",
    "messages": [{"role": "user", "content": "In one short sentence, say hello."}],
    "max_tokens": 300
  }' | jq '.choices[0].message'

Manual deploy

Prerequisites

  • OpenShift AI 3.x installed with KServe managed
  • One GPU node with ~24 GB VRAM (L40S, A10, A100, or H100 all work)
  • A namespace to host the model — this guide uses gpt-oss-model

1. Create the namespace

oc new-project gpt-oss-model --context="$CTX"
oc label namespace gpt-oss-model opendatahub.io/dashboard=true --context="$CTX"

The label puts the InferenceService under Models > Deployed models in the RHOAI dashboard.

RedHatAI/gpt-oss-20b is Apache-2.0 and not gated, so no Hugging Face token is required. If you swap in a gated model, see the KServe storage docs for credential setup.

2. Create a PVC for the weight cache

The model is ~13 GB on disk. Backing HF_HOME with a PVC means the weights download once and persist across pod restarts.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gpt-oss-model-cache
  namespace: gpt-oss-model
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-csi
  resources:
    requests:
      storage: 100Gi

Substitute your cluster's default block storage class for gp3-csi if needed (oc get storageclass).

3. Define a ServingRuntime

The image is Red Hat AI Inference Server (RHAIIS) — RHOAI's productized vLLM, recent enough to ship native MXFP4 + Marlin kernels and the openai_gptoss reasoning parser.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-gpt-oss-runtime
  namespace: gpt-oss-model
  labels:
    opendatahub.io/dashboard: "true"
  annotations:
    openshift.io/display-name: "vLLM gpt-oss-20b (CUDA)"
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  multiModel: false
  containers:
    - name: kserve-container
      image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3
      command: ["vllm", "serve"]
      args:
        - RedHatAI/gpt-oss-20b
        - --port
        - "8000"
        - --dtype
        - bfloat16
        - --max-model-len
        - "131072"
        - --kv-cache-dtype
        - fp8_e4m3
        - --gpu-memory-utilization
        - "0.95"
        - --max-num-seqs
        - "8"
        - --enforce-eager
        - --enable-auto-tool-choice
        - --tool-call-parser
        - openai
        - --enable-prefix-caching
      env:
        - name: HOME
          value: /tmp/home
        - name: HF_HOME
          value: /models/huggingface
        - name: HF_HUB_OFFLINE
          value: "0"
        - name: TRANSFORMERS_CACHE
          value: /models/huggingface
        - name: VLLM_CACHE_DIR
          value: /models/vllm-cache
        - name: XDG_CACHE_HOME
          value: /tmp/cache
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 32Gi
          nvidia.com/gpu: "1"
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 300
        periodSeconds: 15
        failureThreshold: 40
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 300
        periodSeconds: 30
        failureThreshold: 10
      volumeMounts:
        - name: model-cache
          mountPath: /models
        - name: shm
          mountPath: /dev/shm
        - name: tmp-cache
          mountPath: /tmp/cache
        - name: tmp-home
          mountPath: /tmp/home
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: gpt-oss-model-cache
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 8Gi
    - name: tmp-cache
      emptyDir: {}
    - name: tmp-home
      emptyDir: {}
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

Why these vLLM args

  • --dtype bfloat16 + --kv-cache-dtype fp8_e4m3: bf16 for activations and any unquantized weights (the MXFP4-quantized weights stay in MXFP4), with an FP8 KV cache; tuned for L40S-class hardware
  • --enforce-eager: disables CUDA graph capture; required for the MXFP4 + Marlin kernel path
  • --enable-auto-tool-choice --tool-call-parser openai: exposes gpt-oss's native tool calls in the standard OpenAI tool_calls field. Modules 4-9 of this tutorial rely on this.
  • --max-model-len 131072: gpt-oss-20b's full 128k-token context (128 × 1024 = 131072)

4. Define an InferenceService

The RawDeployment annotation tells KServe to use a plain Deployment + Service rather than Knative — required on RHOAI 3.x clusters where KServe's rawDeploymentServiceConfig is Headless.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gpt-oss
  namespace: gpt-oss-model
  labels:
    opendatahub.io/dashboard: "true"
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-gpt-oss-runtime

Save the three manifests above as pvc.yaml, vllm-runtime.yaml, and inferenceservice.yaml, then apply them:

oc apply --context="$CTX" -n gpt-oss-model -f pvc.yaml
oc apply --context="$CTX" -n gpt-oss-model -f vllm-runtime.yaml
oc apply --context="$CTX" -n gpt-oss-model -f inferenceservice.yaml

Joining an existing cluster?

If the model was deployed by someone else, the KServe service name may differ from gpt-oss-predictor (e.g. it could be gpt-oss-20b-predictor or any other name chosen at deploy time). Run oc get svc -n <model-namespace> --context="$CTX" to find the correct service name and substitute it in the MODEL_ENDPOINT URL below.

The first startup pulls the model weights and can take 5-15 minutes. Watch progress:

oc logs --context="$CTX" -n gpt-oss-model -l serving.kserve.io/inferenceservice=gpt-oss -f

Wait for the InferenceService to report the Ready condition:

oc wait --context="$CTX" --for=condition=Ready inferenceservice/gpt-oss \
  -n gpt-oss-model --timeout=900s

5. Get the endpoint URL

URL=$(oc get inferenceservice gpt-oss --context="$CTX" -n gpt-oss-model \
  -o jsonpath='{.status.url}')
echo "$URL"

On RHOAI 3.x with Headless service config, the reported URL is missing the port

Many RHOAI 3.x DataScienceClusters set kserve.rawDeploymentServiceConfig: Headless. With Headless, the predictor Service has ClusterIP: None, DNS resolves directly to pod IPs, and the Service's port: 80 -> targetPort: 8000 mapping doesn't apply — you must hit the pod's listening port (:8000) directly. The InferenceService's .status.url won't include it.

Check with:

oc get svc --context="$CTX" -n gpt-oss-model -l serving.kserve.io/inferenceservice=gpt-oss \
  -o jsonpath='{.items[0].spec.clusterIP}{"\n"}'

If it prints None, append :8000 to the URL — for example http://gpt-oss-predictor.gpt-oss-model.svc.cluster.local:8000.
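The check and the port fix can be combined in a small helper. This is a sketch under the assumptions stated in this guide (the pod listens on 8000 and the headless Service reports a clusterIP of None); with_headless_port is a hypothetical name.

```shell
# Hypothetical helper: append :8000 to a URL when the Service is headless.
# $1 = URL from .status.url, $2 = the Service's spec.clusterIP.
with_headless_port() {
  whp_url=$1
  whp_ip=$2
  # A normal ClusterIP Service maps its own port, so leave the URL alone.
  if [ "$whp_ip" != "None" ]; then
    printf '%s\n' "$whp_url"
    return 0
  fi
  whp_scheme=${whp_url%%://*}                     # e.g. http
  whp_host=${whp_url#*://}                        # drop the scheme
  whp_host=${whp_host%%/*}                        # drop any path
  whp_rest=${whp_url#"$whp_scheme://$whp_host"}   # path, possibly empty
  case "$whp_host" in
    *:*) printf '%s\n' "$whp_url" ;;              # a port is already present
    *)   printf '%s://%s:8000%s\n' "$whp_scheme" "$whp_host" "$whp_rest" ;;
  esac
}

# Example usage with the values gathered in this step (needs a cluster login):
# CLUSTER_IP=$(oc get svc --context="$CTX" -n gpt-oss-model \
#   -l serving.kserve.io/inferenceservice=gpt-oss \
#   -o jsonpath='{.items[0].spec.clusterIP}')
# export MODEL_ENDPOINT="$(with_headless_port "$URL" "$CLUSTER_IP")/v1"
```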

6. Smoke test

RawDeployment + Headless doesn't expose a Route by default, so use port-forward:

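# Forward local port 18000 to the predictor's port 8000 in the background
oc port-forward --context="$CTX" -n gpt-oss-model deployment/gpt-oss-predictor 18000:8000 &
sleep 3  # give the port-forward a moment to start listening
# List the registered models
curl -s http://localhost:18000/v1/models | jq
# Send a short chat completion
curl -s http://localhost:18000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/gpt-oss-20b",
    "messages": [{"role": "user", "content": "In one short sentence, say hello."}],
    "max_tokens": 300
  }' | jq '.choices[0].message'
# Stop the background port-forward
kill %1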

You should see the registered model in the first response, and in the second a message whose content is a short greeting (for example "Hello!") alongside a populated reasoning_content field.

Reasoning content

gpt-oss-20b emits its chain-of-thought into a separate reasoning_content field on the response message. The visible content is the final answer only. If you ever see content: null with finish_reason: "length", the response was truncated mid-reasoning — raise max_tokens.
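The truncation case can be detected mechanically. The sketch below assumes jq is installed and that the response uses the standard OpenAI chat completion shape (choices[0].message.content and choices[0].finish_reason); check_reasoning_truncation is a hypothetical helper name.

```shell
# Hypothetical helper: read one chat completion JSON document on stdin and
# return 1 when content is null and finish_reason is "length". Requires jq.
# Invalid JSON is not reported by this sketch; it is treated as "not truncated".
check_reasoning_truncation() {
  if jq -e '.choices[0] | (.message.content == null and .finish_reason == "length")' >/dev/null; then
    echo 'response truncated mid-reasoning; raise max_tokens' >&2
    return 1
  fi
  return 0
}

# Example usage against the port-forwarded endpoint from the smoke test:
# curl -s http://localhost:18000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "RedHatAI/gpt-oss-20b", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 300}' \
#   | check_reasoning_truncation
```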

Note your values

Before moving on, note these three values from your deployment. You'll use them in Module 2 when deploying the agent to OpenShift.

Catalog deploy:

Value | Where to find it
MODEL_ENDPOINT | RHOAI dashboard > your model > Inference endpoints > Internal URL, with /v1 appended
MODEL_NAME | The model ID shown in the dashboard (e.g. redhataigpt-oss-20b)
OPENAI_API_KEY | The service account token shown under Inference endpoints

Manual deploy:

Value | Where to find it
MODEL_ENDPOINT | http://<name>-predictor.<namespace>.svc.cluster.local:8000/v1 (see Headless warning above)
MODEL_NAME | The Hugging Face model ID (e.g. RedHatAI/gpt-oss-20b)
OPENAI_API_KEY | not-required (vLLM is unauthenticated by default)

You're ready for Module 1.

Path B: External OpenAI-compatible endpoint

If you don't have GPU access — Developer Sandbox, CRC, or a cluster with no GPU node pool — point the tutorial at an external endpoint instead.

Anything that speaks the OpenAI Chat Completions API works:

  • A vLLM you run elsewhere (a workstation with a GPU, a separate cluster, a cloud VM)
  • A corporate inference gateway (e.g., a shared llm-d deployment, MaaS, or LiteLLM proxy)
  • Any OpenAI-compatible third-party API

You need three things:

Item | Notes
Endpoint URL | Must end at the API root, e.g. https://api.example.com/v1
Model name | What that endpoint calls the model
API key (if required) | Set as OPENAI_API_KEY. For unauthenticated endpoints, set it to any non-empty string (e.g. not-required); the OpenAI SDK requires the variable to exist.
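Two of these requirements are easy to get wrong by hand: the trailing /v1 and the non-empty key. The sketch below handles both in a POSIX shell. api_root is a hypothetical helper, and it assumes your provider's API root really is /v1, as in the example URL above; some gateways use a different path.

```shell
# Hypothetical helper: normalize a base URL so it ends at the /v1 API root.
api_root() {
  ar_url=${1%/}                      # drop one trailing slash, if any
  case "$ar_url" in
    */v1) printf '%s\n' "$ar_url" ;; # already ends at the API root
    *)    printf '%s/v1\n' "$ar_url" ;;
  esac
}

# Keep any key that is already set; otherwise use a non-empty placeholder
# so that clients requiring the variable still start.
export OPENAI_API_KEY="${OPENAI_API_KEY:-not-required}"

# Example usage (api.example.com is the placeholder host from this guide):
# export MODEL_ENDPOINT="$(api_root https://api.example.com)"
```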

Smoke test before moving on (substitute your actual URL and key):

curl -s https://api.example.com/v1/models \
  -H "Authorization: Bearer <your-api-key>" | jq

If you get a list of models including the one you'll use, you're ready.
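Rather than reading the list by eye, the check can be scripted. This sketch assumes jq is installed and that the endpoint returns the standard OpenAI list shape, an object with a data array whose entries carry an id; has_model is a hypothetical helper name.

```shell
# Hypothetical helper: read a /models response on stdin and succeed only if
# an entry with the given id is present. Requires jq.
has_model() {
  jq -e --arg m "$1" 'any(.data[]; .id == $m)' >/dev/null
}

# Example usage, assuming MODEL_ENDPOINT, MODEL_NAME, and OPENAI_API_KEY are set:
# curl -s "$MODEL_ENDPOINT/models" \
#   -H "Authorization: Bearer $OPENAI_API_KEY" \
#   | has_model "$MODEL_NAME" && echo "model found"
```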

Note these three values; you'll use them in Module 2 when deploying the agent:

Value | What to use
MODEL_ENDPOINT | Your provider's API URL ending with /v1
MODEL_NAME | The model identifier your endpoint expects
OPENAI_API_KEY | Your API key from the provider

The rest of the tutorial works identically — only the LLM lives elsewhere.

Path C: Models as a Service

If your cluster runs RHOAI 3.4+, you can serve models through the MaaS gateway — centralized access with API keys, token-based quotas, and cost visibility. See the Models as a Service supplementary module after completing the core tutorial.

Picking a different model

RedHatAI/gpt-oss-20b is the reference because it's reasoning-capable with native tool calls, fits in ~16 GB of VRAM, and is what the workshop-setup-ogx playbook validates end-to-end. Any instruction-tuned model with reliable tool-calling support works, though most won't share the reasoning-content split:

  • Granite 3.3 8B Instruct (ibm-granite/granite-3.3-8b-instruct): fewer parameters, but needs more VRAM than the reference model (~24 GB at fp16); no reasoning track, FIPS-friendly, gated on Hugging Face.
  • Llama 3.3 70B Instruct: substantially larger; needs multi-GPU or aggressive quantization.
  • Mistral Large: also large; tool calling reliable.

If you switch models, check vLLM's CLI args against the model's docs (quantization, KV-cache dtype, max context). Smaller models (under ~7 B params) may struggle with tool-call JSON for some calculus tools.
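A quick back-of-the-envelope check helps when sizing an alternative. The sketch below estimates weight memory only, as parameters (in billions) times bits per weight divided by 8, giving decimal GB. It ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound; weight_gb is a hypothetical helper name.

```shell
# Hypothetical helper: rough lower bound on weight memory in decimal GB.
# $1 = parameters in billions, $2 = bits per weight.
# Integer arithmetic, so the result rounds down.
weight_gb() {
  echo $(( $1 * $2 / 8 ))
}

weight_gb 8 16    # Granite 3.3 8B at fp16            -> 16
weight_gb 70 16   # Llama 3.3 70B at fp16             -> 140
weight_gb 70 4    # the same model at 4 bits per weight -> 35
```

The 16 GB figure for an 8 B model at fp16 is consistent with the ~24 GB VRAM quoted above once cache and overhead are added, and 140 GB shows why a 70 B model needs multiple GPUs or quantization.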

Next

Install CLI Tools.