
Drop-in OpenAI-compatible inference on OpenShift — RHAIIS packages vLLM for production, with hardware flexibility and a secure external endpoint out of the box - AI generated
Introduction
In this post, I describe how to deploy the Red Hat AI Inference Server (RHAIIS) on OpenShift and expose it as an OpenAI-compatible API endpoint. The post builds on Deploying OpenShift on AWS with Automated Cluster Provisioning, which covers getting a working OpenShift cluster into place. If you already have a cluster running, you can skip directly to the deployment steps.
The inference server will load a model from Hugging Face Hub and expose a /v1/chat/completions endpoint that any OpenAI-compatible client can talk to. At the end, I show how to connect the endpoint to the Open WebUI setup described in My Local AI Stack.
What is Red Hat AI Inference Server
vLLM is an open-source inference engine designed for high-throughput LLM serving. It handles memory-efficient attention via PagedAttention, continuous batching, and GPU-optimized execution, and it exposes an OpenAI-compatible HTTP API out of the box. I covered how to run vLLM on the GPU cloud provider RunPod in a previous post.
The Red Hat AI Inference Server is the supported, enterprise-packaged distribution of vLLM. Red Hat provides a hardened container image distributed through registry.redhat.io, tested against specific GPU driver and CUDA versions and with a defined support lifecycle. The API surface is identical to upstream vLLM. Any client that works against a plain vLLM inference server works against RHAIIS without modification.
Deploying RHAIIS directly on OpenShift is one way to reach a running inference endpoint with Red Hat technology. Red Hat OpenShift AI offers other paths. One is model serving through KServe, where OpenShift AI manages the deployment lifecycle via a web dashboard and exposes RHAIIS through a ServingRuntime. Another is a Model as a Service approach that provisions shared inference endpoints across a cluster, so teams can consume models without operating their own deployment. The approach in this post is the most direct option and suits cases where you want a single inference endpoint.
Prerequisites
This setup requires the following:
- A running OpenShift cluster with at least one GPU-enabled worker node. The post Deploying OpenShift on AWS covers one way to get there.
- Node Feature Discovery (NFD) Operator installed and running to detect GPU hardware on the node.
- NVIDIA GPU Operator installed to provide the CUDA runtime and device plugin.
- OpenShift CLI (oc) installed and logged in to the cluster. It is required for every command in this post.
- A Hugging Face access token if you intend to use a gated model. Publicly available models like Granite do not require one.
Deploying the Red Hat AI Inference Server
The deployment consists of a namespace, two secrets, a ConfigMap, a PersistentVolumeClaim for model caching, a Deployment, a Service, and a Route. All deployment files are available in the smichard/agent_on_ocp GitHub repository. The steps below apply them in sequence.
- Clone the repository:
git clone https://github.com/smichard/agent_on_ocp.git
cd agent_on_ocp/rhaiis
- Create a Namespace
oc new-project rhaiis
- Create the required Secrets
Hugging Face access token:
oc create secret generic hf-secret \
--from-literal=HF_TOKEN=<your_huggingface_token> \
-n rhaiis
API key for the inference endpoint:
The server requires clients to present an API key as a bearer token. Storing it as a secret keeps it out of the Deployment spec.
oc create secret generic vllm-api-key-secret \
--from-literal=VLLM_API_KEY=$(openssl rand -hex 32) \
-n rhaiis
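If you want to see what the key generation produces before storing it in a secret, it can be run on its own. This is a minimal sketch that assumes openssl is available locally; the variable name API_KEY is only illustrative and is not used anywhere else in this post. 32 random bytes in hex encoding give a 64-character string.

```bash
# Generate a candidate key the same way the secret command does:
# 32 random bytes, hex-encoded, which yields 64 characters.
API_KEY="$(openssl rand -hex 32)"

# Print only the length so the key itself does not end up in the terminal scrollback.
echo "key length: ${#API_KEY}"
```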
- Create the ConfigMap
Set the Hugging Face model ID you want to serve. Research which model fits your use case before settling on one. The only hard requirement is that the model is supported by the vLLM inference server. The ConfigMap also carries the tool call parser name, which the Deployment references to set the correct parsing mode for the chosen model.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: rhaiis
data:
  MODEL_NAME: "Qwen/Qwen3-Coder-30B-A3B-Instruct"
  TOOL_CALL_PARSER: "qwen3_coder"
Apply the file to create the ConfigMap:
oc apply -f configmap.yaml
- Create a PersistentVolumeClaim
The model weights are downloaded once on first startup and cached on a persistent volume. This avoids re-downloading the model on every pod restart.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 150Gi
Apply the file to create the PVC:
oc apply -f pvc.yaml
- Deploy the Inference Server
The Deployment below references the RHAIIS container image and pulls the model ID from the ConfigMap created in step 4. To serve a different model, update the ConfigMap rather than editing the Deployment spec. The HF_TOKEN and VLLM_API_KEY values are injected from the secrets created in step 3.
Note
Depending on the model size, the number of GPUs and the CPU and memory allocations will need to be adjusted. The example below was tested on an AWS g5.12xlarge node (4x NVIDIA A10G, 24 GB VRAM per GPU) and uses all four GPUs via tensor parallelism.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rhaiis-vllm
  template:
    metadata:
      labels:
        app: rhaiis-vllm
    spec:
      tolerations:
        - key: nvidia.com/gpu
          effect: NoSchedule
          operator: Exists
      serviceAccountName: default
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
      containers:
        - name: vllm
          image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.1-1775680192
          imagePullPolicy: Always
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: HF_TOKEN
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-key-secret
                  key: VLLM_API_KEY
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: vllm-config
                  key: MODEL_NAME
            - name: HF_HOME
              value: /cache
            - name: HF_HUB_OFFLINE
              value: '0'
            - name: VLLM_ALLOW_LONG_MAX_MODEL_LEN
              value: '1'
            - name: TOOL_CALL_PARSER
              valueFrom:
                configMapKeyRef:
                  name: vllm-config
                  key: TOOL_CALL_PARSER
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=$(MODEL_NAME)'
            - '--served-model-name=$(MODEL_NAME)'
            - '--tensor-parallel-size=4'
            - '--gpu-memory-utilization=0.85'
            - '--max-model-len=65536'
            - '--enable-auto-tool-choice'
            - '--tool-call-parser=$(TOOL_CALL_PARSER)'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '4'
              memory: 128Gi
            requests:
              cpu: '2'
              memory: 32Gi
              nvidia.com/gpu: '4'
          volumeMounts:
            - name: model-cache
              mountPath: /cache
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always
Apply the file to create the deployment:
oc apply -f deployment.yaml
The container reads the model ID from the ConfigMap at startup and downloads the model from Hugging Face into /cache (backed by the PVC). Initial startup takes several minutes depending on model size and network speed.
Follow the progress with:
oc logs -f deployment/rhaiis-vllm -n rhaiis
The server is ready when the log shows Application startup complete.

vLLM server log output on startup, showing all registered API routes and the final Application startup complete confirmation
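Instead of watching the log by hand, the readiness line can be checked from a script. The sketch below is one possible way to do that; the function name startup_complete is my own and not part of any tool. It only looks for the log line mentioned above and does not replace a proper readiness probe.

```bash
# Succeeds (exit code 0) as soon as the readiness line appears on stdin.
startup_complete() {
  grep -q "Application startup complete"
}

# Hypothetical usage against the deployment from this post,
# polling the log every 10 seconds until the line shows up:
#   until oc logs deployment/rhaiis-vllm -n rhaiis | startup_complete; do
#     sleep 10
#   done
#   echo "server is ready"
```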
Once the pod is running, you can verify GPU access from the pod terminal with nvidia-smi. All four GPUs should be visible, each running a tensor-parallel worker process.

nvidia-smi output from inside the vLLM pod, confirming all four A10G GPUs are visible and each tensor-parallel worker has allocated approximately 20 GB of VRAM
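The roughly 20 GB per GPU follows from the --gpu-memory-utilization flag. As a rough back-of-the-envelope check, the sketch below multiplies the nominal VRAM size by that fraction. The function name gpu_budget_gb is hypothetical, and the calculation uses the nominal 24 GB per A10G, so the numbers reported by nvidia-smi will differ slightly.

```bash
# Memory budget vLLM may claim: VRAM size times --gpu-memory-utilization,
# optionally multiplied by the number of GPUs.
gpu_budget_gb() {
  # $1 = VRAM per GPU in GB, $2 = utilization fraction, $3 = GPU count (default 1)
  awk -v vram="$1" -v util="$2" -v n="${3:-1}" \
    'BEGIN { printf "%.1f\n", vram * util * n }'
}

gpu_budget_gb 24 0.85     # per GPU: prints 20.4
gpu_budget_gb 24 0.85 4   # all four GPUs combined: prints 81.6
```

Assuming 16-bit weights, a model with about 30 billion parameters needs roughly 60 GB for the weights alone. That exceeds a single 24 GB GPU but fits into the combined budget, with the remainder left for the KV cache and activations.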
- Create a Service and Route
Create a Service that exposes port 8000 and forwards it to port 8000 on the pod:
apiVersion: v1
kind: Service
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  selector:
    app: rhaiis-vllm
  ports:
    - name: http
      protocol: TCP
      port: 8000
      targetPort: 8000
Create a TLS-terminated Route if you want to expose the endpoint outside the cluster:
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  to:
    kind: Service
    name: rhaiis-vllm
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
Apply both and retrieve the assigned hostname:
oc apply -f service.yaml
oc apply -f route.yaml
oc get route rhaiis-vllm -n rhaiis -o jsonpath='{.spec.host}'
OpenShift builds the hostname from the route and namespace names following the pattern <route-name>-<namespace>.apps.<cluster-domain>. The result looks something like rhaiis-vllm-rhaiis.apps.ocp.example.com.
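The hostname pattern can be written out as a small helper, for example to predict the URL before the Route exists. This is only a sketch of the default pattern described above; the function name route_host and the domain ocp.example.com are placeholders, and a Route with an explicit spec.host will not follow it.

```bash
# Build the default Route hostname from its three parts.
# $1 = route name, $2 = namespace, $3 = cluster domain (differs per cluster)
route_host() {
  printf '%s-%s.apps.%s\n' "$1" "$2" "$3"
}

route_host rhaiis-vllm rhaiis ocp.example.com
# prints: rhaiis-vllm-rhaiis.apps.ocp.example.com
```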
Testing the Endpoint
Store the hostname, API key, and model name in shell variables once to keep the following commands readable:
export RHAIIS_HOST=$(oc get route rhaiis-vllm -n rhaiis -o jsonpath='{.spec.host}')
export RHAIIS_API_KEY=$(oc get secret vllm-api-key-secret -n rhaiis \
-o jsonpath='{.data.VLLM_API_KEY}' | base64 -d)
export MODEL=$(oc get configmap vllm-config -n rhaiis \
-o jsonpath='{.data.MODEL_NAME}')
Verify all three are populated before proceeding:
echo "RHAIIS_HOST : ${RHAIIS_HOST}"
echo "RHAIIS_API_KEY : ${RHAIIS_API_KEY}"
echo "Model: ${MODEL}"
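If you would rather not print the API key to the terminal, the same check can be done without echoing the values. The helper below is a sketch; require_vars is a made-up name. It relies on the variables being exported, as they are in the commands above.

```bash
# Return non-zero and name the first variable that is unset or empty.
require_vars() {
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      return 1
    fi
  done
}

# Hypothetical usage with the variables from this section:
#   require_vars RHAIIS_HOST RHAIIS_API_KEY MODEL && echo "all set"
```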
List available models:
curl -s https://$RHAIIS_HOST/v1/models \
-H "Authorization: Bearer $RHAIIS_API_KEY" | jq .
Send a chat completion request:
curl -sS \
"https://${RHAIIS_HOST}/v1/chat/completions" \
-H "Authorization: Bearer ${RHAIIS_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"model": "'"${MODEL}"'",
"messages": [{"role": "user", "content": "What is OpenShift?"}],
"temperature": 0.1,
"max_tokens": 200
}' | jq -r '.choices[0].message.content'
A successful response confirms the server is running, the model is loaded, and the API key authentication is working.
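Because the Deployment enables automatic tool choice, the same endpoint can also be tested with a tool definition in the request. The sketch below builds such a request body with jq, which avoids splicing shell variables into a quoted JSON string. The function name build_tool_request and the get_weather tool with its parameters are invented for illustration; the field layout follows the OpenAI Chat Completions format.

```bash
# Build a chat completion request body that offers one tool to the model.
# $1 = model name, $2 = user prompt
build_tool_request() {
  jq -n --arg model "$1" --arg prompt "$2" '{
    "model": $model,
    "messages": [{"role": "user", "content": $prompt}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
}

# Hypothetical usage with the variables set earlier:
#   build_tool_request "$MODEL" "What is the weather in Berlin?" \
#     | curl -sS "https://${RHAIIS_HOST}/v1/chat/completions" \
#         -H "Authorization: Bearer ${RHAIIS_API_KEY}" \
#         -H "Content-Type: application/json" \
#         -d @- | jq '.choices[0].message.tool_calls'
```

Whether the model actually answers with a tool call depends on the model and the prompt, so a null result here does not necessarily mean the parser is misconfigured.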
Connecting to Open WebUI
The inference server exposes a standard OpenAI-compatible API, which means Open WebUI can connect to it directly as an external provider. The setup in My Local AI Stack already runs Open WebUI. Adding the RHAIIS endpoint as a direct external connection requires no changes to the existing stack.
In Open WebUI, go to Settings > Connections and add a new external connection. Set the URL to the route hostname with the /v1 suffix, add the API key created in step 3 as a bearer token, set the provider type to OpenAI, and the API type to Chat Completions. Leave the model ID field empty so Open WebUI queries the /v1/models endpoint and discovers available models automatically.

Open WebUI external connection configured against the Red Hat AI Inference Server endpoint
Once saved, the deployed model appears in the model selector alongside any other configured providers.
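Other OpenAI-compatible clients need the same two values as Open WebUI: the base URL with the /v1 suffix and the API key. The sketch below derives the base URL from the route hostname; openai_base_url is a hypothetical helper name. The environment variable names in the usage comment are the ones the official OpenAI SDKs read, and other clients may expect different names.

```bash
# Turn a route hostname into the base URL that OpenAI-compatible clients expect.
openai_base_url() {
  printf 'https://%s/v1\n' "$1"
}

# Hypothetical usage with the variables from the testing section:
#   export OPENAI_BASE_URL="$(openai_base_url "$RHAIIS_HOST")"
#   export OPENAI_API_KEY="$RHAIIS_API_KEY"
```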
Conclusion
The Red Hat AI Inference Server puts the vLLM engine into OpenShift, or any other supported platform, with a supported container image and a deployment pattern that fits standard Kubernetes workflows. The outcome is an OpenAI-compatible endpoint running on your own cluster, backed by a model from Hugging Face Hub, secured with an API key, and accessible over a TLS-terminated OpenShift Route. Any client that speaks the OpenAI Chat Completions format can talk to it, including Open WebUI, which connects to it the same way it connects to any other provider.
References
- GitHub repository with deployment files - link
- Deploying OpenShift on AWS with Automated Cluster Provisioning - link
- My Local AI Stack: Open WebUI, LiteLLM, SearXNG, and Docling - link
- Extending the Local AI Stack with On-Demand GPU Inference on RunPod - link
- Model as a Service GitHub repository - link
- Node Feature Discovery Operator - link
- NVIDIA GPU Operator - link
- OpenShift CLI (oc) - link
- Granite family of models on Hugging Face - link
- smichard/agent_on_ocp - GitHub repository - link
- Red Hat AI Inference Server - Documentation - link
- Deploying Red Hat AI Inference Server on OpenShift - link
- vLLM - upstream project - link
- vLLM - OpenAI-compatible server documentation - link
- Open WebUI - project site - link