Openshift on Home

Running the Red Hat AI Inference Server on OpenShift

Sun, 17 May 2026 00:00:00 +0000

Drop-in OpenAI-compatible inference on OpenShift — RHAIIS packages vLLM for production, with hardware flexibility and a secure external endpoint out of the box - AI generated

Introduction

In this post, I want to describe how to deploy the Red Hat AI Inference Server (RHAIIS) on OpenShift and expose it as an OpenAI-compatible API endpoint. This post builds on Deploying OpenShift on AWS with Automated Cluster Provisioning, which covers getting a working OpenShift cluster into place. If you already have a cluster running, you can skip directly to the deployment steps.

The inference server will load a model from Hugging Face Hub and expose a /v1/chat/completions endpoint that any OpenAI-compatible client can talk to. At the end, I show how to connect the endpoint to the Open WebUI setup described in My Local AI Stack.

What is Red Hat AI Inference Server

vLLM is an open-source inference engine designed for high-throughput LLM serving. It handles memory-efficient attention via PagedAttention, continuous batching, and GPU-optimized execution, and it exposes an OpenAI-compatible HTTP API out of the box. I covered how to run vLLM on the GPU cloud provider RunPod in a previous post.

The Red Hat AI Inference Server is the supported, enterprise-packaged distribution of vLLM. Red Hat provides a hardened container image distributed through registry.redhat.io, tested against specific GPU driver and CUDA versions and with a defined support lifecycle. The API surface is identical to upstream vLLM. Any client that works against a plain vLLM inference server works against RHAIIS without modification.

Deploying RHAIIS directly on OpenShift is one way to reach a running inference endpoint through Red Hat technology. Red Hat OpenShift AI offers other paths, e.g. model serving through KServe, where OpenShift AI manages the deployment lifecycle via a web dashboard and exposes RHAIIS through a ServingRuntime, or a Model as a Service approach that provisions shared inference endpoints across a cluster, so teams can consume models without operating their own deployment. The approach in this post is the most direct option, suited for cases where you want a single inference endpoint.

Prerequisites

This setup requires the following:

A running OpenShift cluster with at least one GPU-enabled worker node. The post Deploying OpenShift on AWS covers one way to get there.
Node Feature Discovery (NFD) Operator installed and running to detect GPU hardware on the node.
NVIDIA GPU Operator installed to provide the CUDA runtime and device plugin.
OpenShift CLI (oc), installed and logged in to the cluster.
A Hugging Face access token if you intend to use a gated model. Publicly available models like Granite do not require one.

Deploying the Red Hat AI Inference Server

The deployment consists of a namespace, two secrets, a ConfigMap, a PersistentVolumeClaim for model caching, a Deployment, a Service, and a Route. All deployment files are available in the smichard/agent_on_ocp GitHub repository. The steps below apply them in sequence.

Clone the repository:

git clone https://github.com/smichard/agent_on_ocp.git
cd agent_on_ocp/rhaiis

Create a Namespace

oc new-project rhaiis

Create the required Secrets

Hugging Face access token:

oc create secret generic hf-secret \
 --from-literal=HF_TOKEN=<your_huggingface_token> \
 -n rhaiis

API key for the inference endpoint:

The server requires clients to present an API key as a bearer token. Storing it as a secret keeps it out of the Deployment spec.

oc create secret generic vllm-api-key-secret \
 --from-literal=VLLM_API_KEY=$(openssl rand -hex 32) \
 -n rhaiis

Create the ConfigMap

Set the Hugging Face model ID you want to serve. Research which model fits your use case before settling on one. The only hard requirement is that the model is supported by the vLLM inference server. The ConfigMap also carries the tool call parser name, which the Deployment references to set the correct parsing mode for the chosen model.

apiVersion: v1
kind: ConfigMap
metadata:
 name: vllm-config
 namespace: rhaiis
data:
 MODEL_NAME: "Qwen/Qwen3-Coder-30B-A3B-Instruct"
 TOOL_CALL_PARSER: "qwen3_coder"

Apply the file to create the ConfigMap:

oc apply -f configmap.yaml

Create a PersistentVolumeClaim

The model weights are downloaded once on first startup and cached on a persistent volume. This avoids re-downloading the model on every pod restart.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 150Gi

Apply the file to create the PVC:

oc apply -f pvc.yaml

Deploy the Inference Server

The Deployment below references the RHAIIS container image and pulls the model ID from the ConfigMap created in the "Create the ConfigMap" step. To serve a different model, update the ConfigMap and restart the Deployment rather than editing the Deployment spec. The HF_TOKEN and VLLM_API_KEY values are injected from the secrets created in the "Create the required Secrets" step.

Note

Depending on the model size, the number of GPUs and the CPU and memory allocations will need to be adjusted. The example below was tested on an AWS g5.12xlarge node (4x NVIDIA A10G, 24 GB VRAM per GPU) and uses all four GPUs via tensor parallelism.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rhaiis-vllm
  template:
    metadata:
      labels:
        app: rhaiis-vllm
    spec:
      tolerations:
        - key: nvidia.com/gpu
          effect: NoSchedule
          operator: Exists
      serviceAccountName: default
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
      containers:
        - name: vllm
          image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.1-1775680192
          imagePullPolicy: Always
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: HF_TOKEN
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-key-secret
                  key: VLLM_API_KEY
            - name: MODEL_NAME
              valueFrom:
                configMapKeyRef:
                  name: vllm-config
                  key: MODEL_NAME
            - name: HF_HOME
              value: /cache
            - name: HF_HUB_OFFLINE
              value: '0'
            - name: VLLM_ALLOW_LONG_MAX_MODEL_LEN
              value: '1'
            - name: TOOL_CALL_PARSER
              valueFrom:
                configMapKeyRef:
                  name: vllm-config
                  key: TOOL_CALL_PARSER
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=$(MODEL_NAME)'
            - '--served-model-name=$(MODEL_NAME)'
            - '--tensor-parallel-size=4'
            - '--gpu-memory-utilization=0.85'
            - '--max-model-len=65536'
            - '--enable-auto-tool-choice'
            - '--tool-call-parser=$(TOOL_CALL_PARSER)'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '4'
              memory: 128Gi
            requests:
              cpu: '2'
              memory: 32Gi
              nvidia.com/gpu: '4'
          volumeMounts:
            - name: model-cache
              mountPath: /cache
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always

Apply the file to create the deployment:

oc apply -f deployment.yaml
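The sizing note for this Deployment can be made concrete with some rough arithmetic. The sketch below is only an estimate: it assumes roughly 30 billion parameters (taken from the model name) stored as 16-bit weights at 2 bytes each, treats GB and GiB as interchangeable, and ignores activation and runtime overhead. The function names are hypothetical helpers, not part of any tool.

```shell
# Rough VRAM sizing sketch. Assumptions: ~30 billion parameters (from the
# model name), 16-bit weights (2 bytes each), runtime overhead ignored.

# weights_gb <params_in_billions> <bytes_per_param>
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b }'
}

# vram_budget_gb <gpu_count> <gb_per_gpu> <gpu_memory_utilization>
vram_budget_gb() {
  awk -v n="$1" -v g="$2" -v u="$3" 'BEGIN { printf "%.1f\n", n * g * u }'
}

# kv_headroom_gb <budget_gb> <weights_gb>: what is left for the KV cache
kv_headroom_gb() {
  awk -v t="$1" -v w="$2" 'BEGIN { printf "%.1f\n", t - w }'
}
```

With the values from this post, `weights_gb 30 2` prints 60.0 and `vram_budget_gb 4 24 0.85` prints 81.6, so `kv_headroom_gb 81.6 60.0` leaves about 21.6 for the KV cache. A single A10G at `--gpu-memory-utilization=0.85` comes out at 20.4 (`vram_budget_gb 1 24 0.85`), which is why the model needs tensor parallelism across all four GPUs.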

The container reads the model ID from the ConfigMap at startup and downloads the model from Hugging Face into /cache (backed by the PVC). Initial startup takes several minutes depending on model size and network speed. Follow the progress with:

oc logs -f deployment/rhaiis-vllm -n rhaiis

The server is ready when the log shows Application startup complete.
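If you would rather script the wait than watch the log, a small polling loop does the job. This is a minimal sketch: `wait_until` and `startup_complete` are hypothetical helper names, and the check simply greps the Deployment log for the same "Application startup complete" line mentioned above.

```shell
# wait_until <attempts> <delay_seconds> <command...>
# Re-runs the command until it succeeds or the attempts are used up.
wait_until() {
  wu_attempts=$1
  wu_delay=$2
  shift 2
  wu_i=1
  while [ "$wu_i" -le "$wu_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt follows.
    if [ "$wu_i" -lt "$wu_attempts" ]; then
      sleep "$wu_delay"
    fi
    wu_i=$((wu_i + 1))
  done
  return 1
}

# startup_complete: succeeds once the readiness line shows up in the log.
startup_complete() {
  oc logs deployment/rhaiis-vllm -n rhaiis 2>/dev/null \
    | grep -q 'Application startup complete'
}
```

For example, `wait_until 60 30 startup_complete` checks every 30 seconds for up to roughly 30 minutes and returns a non-zero status if the line never appears.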

vLLM server log output on startup, showing all registered API routes and the final Application startup complete confirmation

Once the pod is running, you can verify GPU access from the pod terminal with nvidia-smi. All four GPUs should be visible, each running a tensor-parallel worker process.

nvidia-smi output from inside the vLLM pod, confirming all four A10G GPUs are visible and each tensor-parallel worker has allocated approximately 20 GB of VRAM

Create a Service and Route

Create a Service that exposes port 8000 and forwards traffic to port 8000 on the pod:

apiVersion: v1
kind: Service
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  selector:
    app: rhaiis-vllm
  ports:
    - name: http
      protocol: TCP
      port: 8000
      targetPort: 8000

Create a TLS-terminated Route if you want to expose the endpoint outside the cluster:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: rhaiis-vllm
  namespace: rhaiis
  labels:
    app: rhaiis-vllm
spec:
  to:
    kind: Service
    name: rhaiis-vllm
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect

Apply both and retrieve the assigned hostname:

oc apply -f service.yaml
oc apply -f route.yaml
oc get route rhaiis-vllm -n rhaiis -o jsonpath='{.spec.host}'

OpenShift builds the hostname from the route and namespace names following the pattern <route-name>-<namespace>.apps.<cluster-domain>. The result looks something like rhaiis-vllm-rhaiis.apps.ocp.example.com.
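The naming pattern is simple enough to reproduce in a one-line helper, which can be handy for predicting the hostname before the Route exists. This is only a sketch of the default pattern described above; `default_route_host` is a hypothetical name, and a Route with an explicit spec.host would not follow it.

```shell
# default_route_host <route_name> <namespace> <cluster_domain>
# Mirrors the <route-name>-<namespace>.apps.<cluster-domain> pattern.
default_route_host() {
  printf '%s-%s.apps.%s\n' "$1" "$2" "$3"
}
```

Calling `default_route_host rhaiis-vllm rhaiis ocp.example.com` prints rhaiis-vllm-rhaiis.apps.ocp.example.com, which should match what `oc get route` reports for a cluster with that domain.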

Testing the Endpoint

Store the hostname, API key, and model name in shell variables to keep the commands readable:

export RHAIIS_HOST=$(oc get route rhaiis-vllm -n rhaiis -o jsonpath='{.spec.host}')
export RHAIIS_API_KEY=$(oc get secret vllm-api-key-secret -n rhaiis \
 -o jsonpath='{.data.VLLM_API_KEY}' | base64 -d)
export MODEL=$(oc get configmap vllm-config -n rhaiis \
 -o jsonpath='{.data.MODEL_NAME}')

Verify all three are populated before proceeding:

echo "RHAIIS_HOST : ${RHAIIS_HOST}"
echo "RHAIIS_API_KEY : ${RHAIIS_API_KEY}"
echo "Model: ${MODEL}"

List available models:

curl -s https://$RHAIIS_HOST/v1/models \
 -H "Authorization: Bearer $RHAIIS_API_KEY" | jq .

Send a chat completion request:

curl -sS \
 "https://${RHAIIS_HOST}/v1/chat/completions" \
 -H "Authorization: Bearer ${RHAIIS_API_KEY}" \
 -H "Content-Type: application/json" \
 -d '{
 "model": "'"${MODEL}"'",
 "messages": [{"role": "user", "content": "What is OpenShift?"}],
 "temperature": 0.1,
 "max_tokens": 200
 }' | jq -r '.choices[0].message.content'

A successful response confirms the server is running, the model is loaded, and the API key authentication is working.
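When a request does not succeed, the HTTP status code usually narrows down the cause quickly. The sketch below is a troubleshooting aid under stated assumptions: the helper names are hypothetical, and the mapping of codes to causes reflects what I would expect from a vLLM server behind an OpenShift Route rather than a documented contract.

```shell
# http_status <url> [bearer_token]: prints only the HTTP status code.
http_status() {
  if [ -n "${2:-}" ]; then
    curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $2" "$1"
  else
    curl -s -o /dev/null -w '%{http_code}' "$1"
  fi
}

# explain_status <code>: maps a status code to a likely cause (assumed mapping).
explain_status() {
  case "$1" in
    200) echo "ok: request accepted" ;;
    401) echo "unauthorized: API key missing or wrong" ;;
    404) echo "not found: check the path and the model name" ;;
    503) echo "unavailable: route has no ready pod behind it" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

For example, `explain_status "$(http_status "https://$RHAIIS_HOST/v1/models")"` should report an unauthorized request, while passing `"$RHAIIS_API_KEY"` as the second argument should report ok.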

Connecting to Open WebUI

The inference server exposes a standard OpenAI-compatible API, which means Open WebUI can connect to it directly as an external provider. The setup in My Local AI Stack already runs Open WebUI. Adding the RHAIIS endpoint as a direct external connection requires no changes to the existing stack.

In Open WebUI, go to Settings > Connections and add a new external connection. Set the URL to the route hostname with the /v1 suffix, add the API key stored in vllm-api-key-secret as a bearer token, set the provider type to OpenAI, and the API type to Chat Completions. Leave the model ID field empty so Open WebUI queries the /v1/models endpoint and discovers available models automatically.

Open WebUI external connection configured against the Red Hat AI Inference Server endpoint

Once saved, the deployed model appears in the model selector alongside any other configured providers.

Conclusion

The Red Hat AI Inference Server puts the vLLM engine into OpenShift, or any other supported platform, with a supported container image and a deployment pattern that fits standard Kubernetes workflows. The outcome is an OpenAI-compatible endpoint running on your own cluster, backed by a model from Hugging Face Hub, secured with an API key, and accessible over a TLS-terminated OpenShift Route. Any client that speaks the OpenAI Chat Completions format can talk to it, including Open WebUI, which connects to it the same way it connects to any other provider.

References

GitHub repository with deployment files - link
Deploying OpenShift on AWS with Automated Cluster Provisioning - link
My Local AI Stack: Open WebUI, LiteLLM, SearXNG, and Docling - link
Extending the Local AI Stack with On-Demand GPU Inference on RunPod - link
Model as a Service GitHub repository - link
Node Feature Discovery Operator - link
NVIDIA GPU Operator - link
OpenShift CLI (oc) - link
Granite family of models on Hugging Face - link
smichard/agent_on_ocp - GitHub repository - link
Red Hat AI Inference Server - Documentation - link
Deploying Red Hat AI Inference Server on OpenShift - link
vLLM - upstream project - link
vLLM - OpenAI-compatible server documentation - link
Open WebUI - project site - link

Installing OpenShift AI on OpenShift

Thu, 14 May 2026 00:00:00 +0000

From GitOps repo to OpenShift AI deployment with verified GPU access in minutes - AI generated

Introduction

In this post, I want to describe how to install Red Hat OpenShift AI on an existing OpenShift cluster and configure it to run GPU-accelerated workloads. The approach uses the rhoai-gitops repository, created and maintained by my teammate Álvaro López Medina, which automates the installation of OpenShift AI, the required operators, and the NVIDIA GPU stack through a single script backed by a GitOps approach.

If you do not have an OpenShift cluster available yet and want to provision one on AWS, a previous post Deploying OpenShift on AWS with Automated Cluster Provisioning covers exactly that. The steps below pick up where that post leaves off, though they apply equally to any running OpenShift cluster.

Prerequisites

Before proceeding, ensure the following are in place:

A running OpenShift cluster with sufficient compute capacity
The OpenShift CLI (oc) installed and available on your workstation
Cluster-admin access
If GPU support is needed: sufficient AWS quota for GPU instance types

Selecting the correct GPU instance type

Choosing the GPU instance type is a decision worth getting right before you provision anything. The instance family determines not just raw performance but also GPU memory capacity, which directly constrains which models you can load and at what precision. Undersizing leads to out-of-memory failures; oversizing means paying for capacity you do not use.

Consult the AWS recommended GPU instances for deep learning to identify instance families suited to your workload, then cross-reference with the EC2 instance type availability by region to confirm that your target region actually offers the instance type you need. GPU instance availability varies significantly across regions and is a common source of unexpected quota errors at deployment time.
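The regional availability check can also be done from the command line. The sketch below wraps the AWS CLI's describe-instance-type-offerings call in a hypothetical helper named `instance_type_zones`; it assumes the AWS CLI is installed and configured with credentials that may query EC2.

```shell
# instance_type_zones <instance_type> <region>
# Prints the Availability Zones in the region that offer the instance type.
instance_type_zones() {
  aws ec2 describe-instance-type-offerings \
    --region "$2" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}
```

For example, `instance_type_zones g5.12xlarge eu-west-1` lists the zones in eu-west-1 that offer g5.12xlarge. An empty result means the type is not offered in that region. Note that this checks availability only, not your account quota.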

The following AWS instance types are commonly used in OpenShift AI GPU deployments:

Instance Name	GPU	GPU RAM	vCPUs	RAM
g5.4xlarge	1x NVIDIA A10G	24 GiB	16	64 GiB
g5.12xlarge	4x NVIDIA A10G	96 GiB	48	192 GiB
g5.24xlarge	4x NVIDIA A10G	96 GiB	96	384 GiB
g5.48xlarge	8x NVIDIA A10G	192 GiB	192	768 GiB
p4d.24xlarge	8x NVIDIA A100	320 GiB	96	1,152 GiB

Installing OpenShift AI

Clone the rhoai-gitops repository:

git clone https://github.com/alvarolop/rhoai-gitops
cd rhoai-gitops

Open the installation script and review the GPU-related configuration:

vi auto-install.sh

The three parameters that matter most for GPU-enabled deployments:

CREATE_GPU_MACHINESETS (Line 9): When set to true, the script automatically creates MachineSets for GPU nodes. Set to false if you do not need GPU support initially.
GPU_NODE_COUNT (Line 10): Total number of GPU nodes to provision. The nodes are distributed across Availability Zones a, b, and c for resilience.
AWS_GPU_INSTANCE (Line 18): Defaults to g5.4xlarge, which provides an NVIDIA A10G GPU per node. Adjust based on the workload requirements and available quota.
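Line numbers in a script tend to drift between commits, so it can be more reliable to look the settings up by name. The helper below is a small sketch; `show_gpu_settings` is a hypothetical name, and it assumes the three parameters are assigned with a plain NAME=value syntax somewhere in the script.

```shell
# show_gpu_settings <script_path>
# Prints the lines that assign the GPU parameters, prefixed with line numbers.
show_gpu_settings() {
  grep -nE '(CREATE_GPU_MACHINESETS|GPU_NODE_COUNT|AWS_GPU_INSTANCE)=' "$1"
}
```

Running `show_gpu_settings auto-install.sh` before editing shows the current values and where they live in your checkout.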

Throughout the following steps, any value written in <angle brackets> is a placeholder and must be replaced with your actual value before running the command.

oc login -u <user_name> <cluster_api_url>

Run the installation script:

./auto-install.sh

The script installs the required operators — including the OpenShift AI Operator, the Node Feature Discovery Operator, and the NVIDIA GPU Operator — and provisions GPU MachineSets if configured to do so. Depending on node provisioning times, the complete process takes 15 to 30 minutes.

Confirm that the GPU worker nodes have joined the cluster:

oc get machineset -n openshift-machine-api
oc get machine -n openshift-machine-api
oc get nodes
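To narrow the node list down to GPU nodes, a label selector helps. The sketch below assumes the nodes carry the nvidia.com/gpu.present=true label, which I would expect once the NVIDIA GPU Operator and Node Feature Discovery have finished labeling them. `gpu_nodes` is a hypothetical helper name.

```shell
# gpu_nodes: lists GPU-labeled nodes with their allocatable GPU count.
# Assumes the nvidia.com/gpu.present=true label has been applied.
gpu_nodes() {
  oc get nodes -l nvidia.com/gpu.present=true \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

If `gpu_nodes` returns no rows shortly after the installation, the operators may still be rolling out. Checking again after a few minutes is usually enough.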

Verify that the NVIDIA driver is loaded and that the GPU is accessible:

oc exec -it -n nvidia-gpu-operator \
 $(oc get pod -l openshift.driver-toolkit=true \
 -o jsonpath="{.items[0].metadata.name}" \
 -n nvidia-gpu-operator) \
 -- nvidia-smi

nvidia-smi output confirming GPU access from within the NVIDIA GPU Operator pod

Check the Argo CD applications deployed as part of the GitOps installation:

Argo CD application overview after the rhoai-gitops installation completes

All applications should be in a healthy and synced state before proceeding to configuration.

Configuring OpenShift AI for GPU Workloads

With OpenShift AI installed, a small amount of configuration is needed to allow workbenches to schedule onto the GPU nodes. GPU nodes in OpenShift are typically tainted with nvidia.com/gpu:NoSchedule to prevent standard workloads from landing on them accidentally. Workbenches that need GPU access must be configured with a matching toleration.

Check the taints applied to the GPU nodes:

oc get nodes
oc describe node <gpu_node_name>

The relevant taint will appear as nvidia.com/gpu=:NoSchedule in the node description.

In the OpenShift AI console, navigate to Settings > Hardware Profiles and create a new profile (for example, nvidia-gpu).
Add a Toleration with the following values:

Field	Value
Key	`nvidia.com/gpu`
Effect	`NoSchedule`
Operator	`Exists`

Configuring a toleration for the NVIDIA GPU taint in the Hardware Profile

This toleration allows workbenches assigned to this profile to be scheduled onto GPU nodes while keeping those nodes unavailable to other workloads.

Create a new workbench and select the nvidia-gpu hardware profile. The workbench pod will be scheduled on a GPU node.
Once the workbench is running, open a terminal and confirm GPU access:

nvidia-smi

nvidia-smi output from inside an OpenShift AI workbench, confirming direct access to the NVIDIA A10G GPU

For a complete reference on hardware profiles and toleration configuration, the Red Hat OpenShift AI documentation covers the options in detail.

Conclusion

The rhoai-gitops repository makes the Red Hat OpenShift AI installation genuinely straightforward: one script handles the operator stack, the GPU node provisioning, and the GitOps wiring. The manual steps that remain — creating the hardware profile and configuring the workbench — are minimal and need to be done only once per cluster.

The end result is an OpenShift AI environment with full GPU access, ready for running Jupyter notebooks, training jobs, or serving models. If you provisioned the underlying cluster using the approach described in Deploying OpenShift on AWS with Automated Cluster Provisioning, the two repositories together cover the entire path from a blank AWS account to a working AI platform in approximately two hours.

References

rhoai-gitops - GitHub repository by Álvaro López Medina - link
ocp-on-aws - GitHub repository by Álvaro López Medina - link
Red Hat OpenShift AI - Managing Hardware Profiles - link
OpenShift AI - Product documentation - link
OpenShift CLI (oc) - Getting started - link
NVIDIA GPU Operator documentation - link
AWS EC2 instance type availability by region - link
AWS recommended GPU instances for deep learning - link
Amazon EC2 G5 Instances - link
Amazon EC2 P4 Instances - link

Deploying OpenShift on AWS with Automated Cluster Provisioning

Sat, 09 May 2026 00:00:00 +0000

The full provisioning pipeline: CLI setup, ocp-on-aws config, and a single script that spins up VPCs, EC2 instances, DNS records, and an Argo CD baseline - AI generated

Introduction

In this post, I want to describe how to deploy Red Hat OpenShift in a blank Amazon Web Services (AWS) environment using a fully automated and repeatable approach. This post is the first in a series of two: it covers the cluster provisioning step, and the installation of OpenShift AI on top of the running OpenShift cluster is covered in a separate post, Installing OpenShift AI on OpenShift. If you already have an OpenShift cluster available, feel free to jump straight to that post. The two workflows build on two GitHub repositories that cover infrastructure provisioning and the installation of the AI platform components, and they reduce what could easily be a multi-hour manual effort to a handful of shell commands.

I should be upfront: one purpose of this post is also to serve as a personal reference for future me, who will inevitably return here after six months asking “wait, what was the exact command again?” Consider this the written documentation I should have filed away the first time.

A special thanks goes to my teammate Álvaro López Medina, who created and maintains the ocp-on-aws and rhoai-gitops repositories. Without his work and support, setting up this environment would have been significantly more involved.

Prerequisites

Before starting, a Linux workstation or jump host is recommended for running the commands. The following command line tools must be installed and configured:

OpenShift CLI (oc) – required to interact with the OpenShift cluster
AWS CLI – required to provision and manage AWS infrastructure
htpasswd – required to generate user credentials for the cluster

These are fundamental prerequisites. The installation scripts will fail or behave unexpectedly without them.
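A quick pre-flight check catches a missing tool before the installer does. The helper below is a minimal sketch; `missing_tools` is a hypothetical name, and it only confirms that each tool is on the PATH, not that it is configured correctly.

```shell
# missing_tools <tool...>: prints each tool that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running `missing_tools oc aws htpasswd` prints nothing when all three are installed, and otherwise lists the ones that still need attention.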

Ordering an AWS Blank Environment

For Red Hat employees and Red Hat partners, the easiest starting point is an AWS Blank Open Environment from the Red Hat Demo Platform (RHDP). Otherwise, an existing AWS account accessed through the AWS Web Console works just as well.

This tutorial was validated against eu-west-1. The blank environment provides a clean, ephemeral AWS account with the necessary IAM permissions and service quotas to support an Installer-Provisioned Infrastructure (IPI) deployment of OpenShift.

Once the environment is provisioned, the service overview page contains the AWS access credentials and the base DNS zone that will be needed in the configuration step below.

Deploying OpenShift on AWS

With the AWS environment in place, the ocp-on-aws repository handles the rest of the cluster provisioning. The repository wraps the OpenShift IPI installer in a shell script and manages user creation, cluster-admin group configuration, and the pull secret in a structured, repeatable way.

Preparing the repository

Throughout the following steps, any value written in <angle brackets> is a placeholder and must be replaced with your actual value before running the command.

Clone the repository:

git clone https://github.com/alvarolop/ocp-on-aws
cd ocp-on-aws

Copy the authentication file templates:

cp auth/users.htpasswd.example auth/users.htpasswd
cp auth/group-cluster-admins.yaml.example auth/group-cluster-admins.yaml

Generate a password hash for your user:

htpasswd -b -B auth/users.htpasswd <user_name> <password>

Adjust auth/group-cluster-admins.yaml to list the users that should receive cluster-admin privileges:

apiVersion: user.openshift.io/v1
kind: Group
metadata:
 name: cluster-admins
users:
 - redhat
 - <user_name>

Configuring the installation

Copy the configuration template:

cp aws-ocp4-config aws-ocp4-config-labs

Open the configuration file and adjust the following parameters:

vi aws-ocp4-config-labs

The key values to review:

OPENSHIFT_VERSION (Line 6): Set this to match your local oc client version for maximum compatibility.
RHPDS_TOP_LEVEL_ROUTE53_DOMAIN (Line 9): The base DNS zone for your cluster; find this in the RHDP service overview.
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (Lines 16–18): The programmatic access credentials from the RHDP environment, required to create the VPC and EC2 instances.
RHOCM_PULL_SECRET (Line 31): Retrieve this from the Hybrid Cloud Console.
WORKER_REPLICAS (Line 47): Set to the number of worker nodes required for your workload.
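To find the value for OPENSHIFT_VERSION, you can read the version straight from the local client. The sketch below assumes `oc version --client` prints a line starting with "Client Version:", which is what recent oc releases do. `client_version` is a hypothetical helper that extracts the number from that output.

```shell
# client_version: reads `oc version --client` output on stdin and prints
# the version from the "Client Version:" line (output format is assumed).
client_version() {
  sed -n 's/^Client Version: *//p'
}
```

Used as `oc version --client | client_version`, it prints just the version string, which can then be copied into the configuration file.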

Running the installation

Start the cluster installation:

./aws-ocp4-install.sh aws-ocp4-config-labs

The script invokes the OpenShift IPI installer and creates all required AWS infrastructure: VPC, subnets, EC2 instances, Elastic Load Balancers, and Route53 DNS records. The process typically takes 30 to 45 minutes. It is worth monitoring the AWS console in the corresponding region during this time to observe the resources coming up.

EC2 instances and load balancers provisioned in AWS after the installation completes

Once the installer finishes, the cluster API and console URLs, along with the kubeconfig file, will be available in the output and in the auth/ directory of the repository.

Argo CD applications deployed as part of the cluster bootstrap

The installation script also bootstraps a set of Argo CD applications that manage cluster-level configurations through GitOps from the start. This gives the cluster a solid, declarative baseline before any additional workloads are installed.

Conclusion

The combination of the AWS blank environment and the ocp-on-aws repository makes it straightforward to spin up a fully functional OpenShift cluster in under an hour with minimal manual intervention. The IPI installer handles the infrastructure details, and the GitOps bootstrap ensures a consistent cluster configuration from the first login.

With the cluster in place, the next step is installing OpenShift AI and enabling GPU support, which is covered in the follow-up post: Installing OpenShift AI on OpenShift.

References

ocp-on-aws - GitHub repository by Álvaro López Medina - link
rhoai-gitops - GitHub repository by Álvaro López Medina - link
Red Hat Demo Platform - link
OpenShift CLI - Getting started - link
AWS CLI - Installation guide - link
htpasswd - link
Red Hat Hybrid Cloud Console - Pull Secret - link