8. Production Hardening¶
Your full stack is deployed: agent, MCP server, gateway, UI, and code execution sandbox. Everything works in development. This final module covers what it takes to run the stack in production -- secrets management, FIPS compliance, authentication, security policy, resource limits, and monitoring.
FIPS compliance¶
Federal Information Processing Standards (FIPS 140-2/140-3) mandate the use of validated cryptographic modules. FIPS compliance is required for U.S. government workloads and many enterprise environments. If your OpenShift cluster runs with FIPS mode enabled, every container in the cluster must use FIPS-validated crypto.
Red Hat UBI base images are FIPS-capable out of the box. When the host kernel
has fips=1 set, UBI's OpenSSL automatically restricts itself to
FIPS-validated algorithms. No application-level configuration is needed.
To verify FIPS mode is active in a running pod:
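One way to check, assuming the agent Deployment is named calculus-agent as elsewhere in this module (adjust the name and namespace for your environment):

```shell
# Read the kernel FIPS flag from inside a running pod
oc exec deploy/calculus-agent -n calculus-agent -- cat /proc/sys/crypto/fips_enabled
```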
A value of 1 means FIPS mode is active; 0 means the host kernel does not
have FIPS enabled.
What breaks under FIPS
MD5 hashing raises an error unless called with usedforsecurity=False.
TLS is restricted to AEAD cipher suites (AES-GCM, AES-CCM). If your agent
calls an external endpoint that requires legacy ciphers (CBC, RC4), the TLS
handshake will fail. The fix is on the remote endpoint, not your agent.
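For non-security uses of MD5 (cache keys, content checksums), the portable fix is the usedforsecurity flag, available since Python 3.9. A quick sketch:

```shell
# Hashing for a checksum, not for security: allowed even under FIPS
python3 -c "import hashlib; print(hashlib.md5(b'cache-key', usedforsecurity=False).hexdigest())"
```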
The litellm migration¶
The framework originally used litellm as an LLM abstraction layer. Two problems forced a switch:
- FIPS incompatibility. litellm's dependency tree pulls in cryptographic libraries that are not FIPS-validated. On a FIPS-enabled cluster, these libraries either fail at import time or silently use non-compliant algorithms.
- Supply chain compromise. litellm versions 1.82.7 and 1.82.8 were compromised in a supply chain attack (March 2026). The malicious versions exfiltrated API keys to an external endpoint.
The fix was straightforward: replace litellm with the openai async SDK.
vLLM, LlamaStack, llm-d, and most inference servers expose an
OpenAI-compatible API, so litellm's abstraction layer was adding complexity
without adding value. The result is a simpler dependency tree that is easier to
audit, FIPS-compliant, and free of supply chain risk.
Never install litellm 1.82.7 or 1.82.8
These versions are compromised. If you encounter them in a lockfile or
dependency tree, pin to a version outside that range: <=1.82.6 or >=1.83.0.
The takeaway: fewer dependencies means a smaller attack surface. Prefer standard SDKs over abstraction layers when the abstraction doesn't carry its weight.
Secrets management¶
Production credentials must never appear in agent.yaml, prompts, or source
code. OpenShift Secrets are the standard mechanism for injecting sensitive
values at runtime.
Create a Secret¶
The openai SDK requires OPENAI_API_KEY to be set, even when calling
unauthenticated endpoints like vLLM (set it to any non-empty string in that
case). For endpoints that require real credentials, create a Secret:
oc create secret generic llm-credentials \
--from-literal=OPENAI_API_KEY=sk-your-real-key-here \
-n calculus-agent
Mount via Helm values¶
The Helm chart's env section supports secretKeyRef for injecting Secret
values as environment variables. Add this to your values.yaml:
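A sketch of the env entry, assuming the llm-credentials Secret created above; the exact chart keys may differ from yours:

```yaml
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-credentials
        key: OPENAI_API_KEY
```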
Then upgrade the release:
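The release name and chart path below are placeholders; substitute the ones you used in earlier modules:

```shell
helm upgrade calculus-agent ./helm -n calculus-agent -f values.yaml
```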
The Deployment template injects the Secret value as an environment variable.
The ${OPENAI_API_KEY} reference in agent.yaml picks it up at runtime
through normal env var substitution.
Secrets vs ConfigMaps
Use ConfigMaps for non-sensitive configuration (MODEL_ENDPOINT,
LOG_LEVEL). Use Secrets for credentials, API keys, and tokens. Secrets
are base64-encoded at rest and can be encrypted with etcd encryption if
your cluster is configured for it.
MCP server authentication¶
The calculus-helper MCP server includes JWT authentication support in
src/core/auth.py. When enabled, the server validates a bearer token on every
request before executing any tool.
Enable JWT auth on the MCP server¶
Set these environment variables on the MCP server deployment:
| Variable | Purpose |
|---|---|
| MCP_AUTH_JWT_ALG | Algorithm: RS256, HS256, etc. Auth is disabled if unset |
| MCP_AUTH_JWT_SECRET | Shared secret for HMAC algorithms |
| MCP_AUTH_JWT_JWKS_URI | JWKS endpoint URL (alternative to a static key) |
| MCP_AUTH_JWT_ISSUER | Expected iss claim in the token |
| MCP_AUTH_JWT_AUDIENCE | Expected aud claim in the token |
For HMAC-based auth (simplest to set up):
oc create secret generic mcp-auth \
--from-literal=MCP_AUTH_JWT_SECRET=your-shared-secret \
-n calculus-agent
Then add the env vars to the MCP server's Helm values:
env:
  - name: MCP_AUTH_JWT_ALG
    value: HS256
  - name: MCP_AUTH_JWT_SECRET
    valueFrom:
      secretKeyRef:
        name: mcp-auth
        key: MCP_AUTH_JWT_SECRET
Configure the agent for authenticated MCP¶
On the agent side, the MCP server entry in agent.yaml supports auth headers.
The agent passes a bearer token when connecting:
mcp_servers:
  - url: ${MCP_CALCULUS_URL:-http://mcp-server.calculus-mcp.svc.cluster.local:8080/mcp/}
    auth:
      token: ${MCP_AUTH_TOKEN}
Store the token in a Secret and inject it the same way as OPENAI_API_KEY.
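For example (the Secret name mcp-client-token is an arbitrary choice):

```shell
oc create secret generic mcp-client-token \
  --from-literal=MCP_AUTH_TOKEN=your-token-here \
  -n calculus-agent
```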
Production auth patterns
For production, prefer RS256 with a JWKS endpoint over shared secrets. This
lets you rotate keys without redeploying. Set MCP_AUTH_JWT_JWKS_URI to
your identity provider's JWKS URL (e.g., Keycloak or Red Hat SSO).
Security configuration¶
The security section in agent.yaml controls runtime security behavior:
security:
  mode: ${SECURITY_MODE:-enforce}
  tool_inspection:
    enabled: ${TOOL_INSPECTION_ENABLED:-true}
Enforce vs observe mode¶
| Mode | Behavior | Use when |
|---|---|---|
| enforce | Blocks execution when a security finding is detected | Production |
| observe | Logs findings but allows execution to continue | Tuning and testing |
Start with observe when you first deploy to understand what the security
layer flags. Once you've reviewed the findings and confirmed they're legitimate,
switch to enforce.
Tool inspection¶
When tool_inspection.enabled is true, the framework validates tool inputs
and outputs against their declared schemas before and after execution. This
catches malformed tool calls from the LLM and unexpected return values from
tool implementations.
You can override the global mode per layer. For example, to enforce tool inspection but only observe guardrails while tuning them:
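A sketch of what that could look like; the exact per-layer keys (in particular the guardrails block) depend on your framework version, so treat this as illustrative:

```yaml
security:
  mode: enforce           # global default
  tool_inspection:
    enabled: true         # inherits the global enforce mode
  guardrails:
    mode: observe         # per-layer override while tuning
```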
Resource limits and scaling¶
Resource limits¶
The default Helm values set conservative resource limits because agents are I/O-bound -- they spend most of their time waiting for LLM and MCP responses:
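Illustrative values in the shape of a standard Kubernetes resources block; the chart's actual defaults live in its values.yaml:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```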
Adjust these based on your agent's actual usage. An agent that processes large context windows or runs heavy tool-result parsing may need more memory.
Horizontal scaling¶
The agent and MCP server scale independently. Add a HorizontalPodAutoscaler to scale the agent based on CPU utilization:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: calculus-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: calculus-agent
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Apply it with oc apply -f hpa.yaml -n calculus-agent. The same pattern works
for the MCP server -- create a separate HPA targeting its Deployment.
Apply HPA after your final Helm upgrade
The HPA takes ownership of .spec.replicas once applied. Any subsequent
helm upgrade will conflict with the HPA over the replica count. Apply
the HPA as the last step, after all Helm configuration is finalized. If
you need to run helm upgrade later, delete the HPA first, upgrade, then
re-apply.
Why min 2 replicas?
A single replica means any pod restart causes downtime. Two replicas ensure one pod is always available during rolling updates and restarts.
Monitoring¶
Health and readiness probes¶
The agent exposes /healthz for liveness probes. The Helm chart includes
probe definitions -- enable them in values.yaml:
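The exact keys depend on the chart; a typical shape, assuming the /healthz path noted above (the port and timing values here are placeholders):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
```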
This configures Kubernetes to restart the pod if /healthz stops responding
(liveness) and to stop routing traffic to it during startup (readiness).
Pod logs¶
The most immediate debugging tool. Watch logs in real time:
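For example, against the agent Deployment:

```shell
oc logs -f deployment/calculus-agent -n calculus-agent
```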
Key log patterns to watch for:
| Pattern | Meaning |
|---|---|
| Uvicorn running on | Agent started successfully |
| Connected to MCP server | MCP connection established |
| Tool inspection finding | Security layer flagged a tool call |
| Retrying after error | Backoff triggered on a failed LLM call |
| Max iterations reached | Agent hit the loop ceiling -- check loop.max_iterations |
Set LOG_LEVEL to DEBUG temporarily when investigating issues, then return
to INFO or WARNING for normal operation.
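One way to toggle it without a full redeploy, assuming LOG_LEVEL is read from the environment (note this sets the variable directly on the Deployment, overriding any ConfigMap-sourced value until you revert it):

```shell
oc set env deployment/calculus-agent LOG_LEVEL=DEBUG -n calculus-agent
# When finished investigating:
oc set env deployment/calculus-agent LOG_LEVEL=INFO -n calculus-agent
```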
Route timeouts¶
OpenShift Routes have a default timeout of 30 seconds. LLM calls regularly exceed this, especially with large context windows. If you haven't already set this in Module 5, annotate the agent's Route:
oc annotate route calculus-gateway \
haproxy.router.openshift.io/timeout=120s \
-n calculus-agent --overwrite
Do the same for the agent Route if it is also directly exposed. The UI Route typically doesn't need a longer timeout since it serves static assets.
What's next¶
You've built and hardened a complete AI agent system across eight modules:
- Scaffolded an agent project and understood every file
- Configured the agent for a real LLM and deployed it to OpenShift
- Built an MCP server with calculus tools
- Wired the MCP tools into the agent
- Deployed a gateway and chat UI for browser-based interaction
- Added a code execution sandbox for numerical computation
- Extended the agent with AI-assisted slash commands
- Hardened the stack with secrets, authentication, security policy, and monitoring
The calculus-agent/ and calculus-helper/ directories in this repository
serve as complete reference implementations. Use them as starting points for
your own agents.
For deeper dives into specific topics, see the Reference pages: agent.yaml configuration, Helm chart anatomy, BaseAgent API, and MCP protocol details.