
8. Production Hardening

Your full stack is deployed: agent, MCP server, gateway, UI, and code execution sandbox. Everything works in development. This module covers what it takes to run the stack in production -- secrets management, FIPS compliance, authentication, security policy, resource limits, monitoring, and observability.

FIPS compliance

Federal Information Processing Standards (FIPS 140-2/140-3) mandate the use of validated cryptographic modules. FIPS compliance is required for U.S. government workloads and many enterprise environments. If your OpenShift cluster runs with FIPS mode enabled, every container in the cluster must use FIPS-validated crypto.

Red Hat UBI base images are FIPS-capable out of the box. When the host kernel has fips=1 set, UBI's OpenSSL automatically restricts itself to FIPS-validated algorithms. No application-level configuration is needed.

Multi-cluster safety

If you work with more than one cluster, pin the context at the top of your terminal session so every command targets the right cluster:

export CTX=$(oc config current-context)

Every oc and helm command in this module uses $CTX.

To verify FIPS mode is active in a running pod, first confirm the pod is up:

oc get pods --context="$CTX" -n calculus-agent -l app.kubernetes.io/instance=calculus-agent

You should see a pod in Running state with READY 1/1 (or 2/2 if the sandbox sidecar from Module 6 is enabled). If the pod is CrashLoopBackOff, ImagePullBackOff, or Pending, the next command will fail with "no running pod found" -- that's a deployment problem, not a FIPS problem; run oc describe deployment calculus-agent --context="$CTX" -n calculus-agent to diagnose.

Once the pod is running:

oc exec deployment/calculus-agent --context="$CTX" -n calculus-agent -- \
  cat /proc/sys/crypto/fips_enabled

A return value of 1 means FIPS mode is active. A return value of 0 means the host kernel does not have FIPS enabled.
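The same check can be made from inside the agent process, for example to log the FIPS state at startup. This is a minimal sketch that reads the kernel flag file used by the command above; the function name `fips_enabled` is illustrative and not part of fipsagents.

```python
from pathlib import Path

# Kernel flag file inspected by the `oc exec ... cat` command above.
FIPS_FLAG = Path("/proc/sys/crypto/fips_enabled")


def fips_enabled(flag_path: Path = FIPS_FLAG) -> bool:
    """Return True when the flag file contains "1" (FIPS mode active)."""
    try:
        return flag_path.read_text().strip() == "1"
    except OSError:
        # File missing or unreadable (for example on a non-Linux host):
        # report "not in FIPS mode" rather than failing.
        return False


if __name__ == "__main__":
    print("FIPS mode:", "enabled" if fips_enabled() else "disabled")
```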

What breaks under FIPS

In Python, hashlib.md5() raises an error under FIPS unless it is called with usedforsecurity=False. TLS is restricted to AEAD cipher suites (AES-GCM, AES-CCM). If your agent calls an external endpoint that requires legacy ciphers (CBC, RC4), the TLS handshake will fail. The fix is on the remote endpoint, not your agent.
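As an illustration of the MD5 rule, the sketch below separates a non-security checksum from a security-relevant digest. The function names are hypothetical; `usedforsecurity` is the standard hashlib keyword available in Python 3.9 and later.

```python
import hashlib


def cache_key(payload: bytes) -> str:
    """Non-security checksum, for example a cache key."""
    # usedforsecurity=False declares that this digest protects nothing,
    # which is what allows MD5 on a FIPS-enabled host.
    return hashlib.md5(payload, usedforsecurity=False).hexdigest()


def content_digest(payload: bytes) -> str:
    """Security-relevant digest: use a FIPS-approved algorithm instead."""
    return hashlib.sha256(payload).hexdigest()
```

Where you control the code, switching to SHA-256 is usually simpler than annotating every MD5 call.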

Secrets management

Production credentials must never appear in agent.yaml, prompts, or source code. OpenShift Secrets are the standard mechanism for injecting sensitive values at runtime.

Create a Secret

The openai SDK requires OPENAI_API_KEY to be set, even when calling unauthenticated endpoints like vLLM (set it to any non-empty string in that case). For endpoints that require real credentials, create a Secret:

oc create secret generic llm-credentials \
  --from-literal=OPENAI_API_KEY=sk-your-real-key-here \
  --context="$CTX" -n calculus-agent

Mount via Helm values

The Helm chart's env section supports secretKeyRef for injecting Secret values as environment variables. Add this to chart/values.yaml in your agent project:

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-credentials
        key: OPENAI_API_KEY

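Run the upgrade from the calculus-agent/ directory so that chart/ resolves correctly. Note that --reuse-values reuses the previous release's computed values, so defaults newly added to chart/values.yaml may not take effect. If the variable is missing from the pod afterwards, Helm 3.14 and later offer --reset-then-reuse-values, which re-reads the chart defaults before re-applying your earlier overrides. The upgrade command is: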

helm upgrade calculus-agent chart/ --reuse-values --kube-context="$CTX" -n calculus-agent

The Deployment template injects the Secret value as an environment variable. The ${OPENAI_API_KEY} reference in agent.yaml picks it up at runtime through normal env var substitution.
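To make the substitution step concrete, here is a rough sketch of how `${NAME}` and `${NAME:-default}` references could be expanded. It is an assumption about the behavior, not the fipsagents loader itself, and the `expand` helper is hypothetical.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(text: str, env=None) -> str:
    """Replace ${NAME} / ${NAME:-default} using env (defaults to os.environ)."""
    env = os.environ if env is None else env

    def replace(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value:  # set and non-empty, as with the shell's ":-" operator
            return value
        if default is not None:
            return default
        raise KeyError(f"required variable {name} is not set or is empty")

    return _PATTERN.sub(replace, text)
```

Failing on a missing variable with no default, as this sketch does, surfaces a forgotten Secret at startup instead of on the first LLM call.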

Secrets vs ConfigMaps

Use ConfigMaps for non-sensitive configuration (MODEL_ENDPOINT, LOG_LEVEL). Use Secrets for credentials, API keys, and tokens. By default, Secrets are stored in etcd base64-encoded, which is an encoding and not encryption; enable etcd encryption on the cluster if Secret data must be encrypted at rest.
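Base64 offers no confidentiality: anyone who can read the Secret object can recover the value. A short demonstration, using the placeholder key from the earlier command:

```python
import base64

# What the API server stores for the Secret's data field.
stored = base64.b64encode(b"sk-your-real-key-here").decode("ascii")

# Decoding needs no key or password.
recovered = base64.b64decode(stored).decode("ascii")

print(stored)
print(recovered)
```

This is why RBAC on Secret objects matters as much as where the value is stored.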

MCP server authentication

The calculus-helper MCP server includes JWT authentication support in src/core/auth.py. When enabled, the server validates a bearer token on every request before executing any tool.

Enable JWT auth on the MCP server

Set these environment variables on the MCP server deployment:

| Variable | Purpose |
| --- | --- |
| MCP_AUTH_JWT_ALG | Algorithm: RS256, HS256, etc. Auth is disabled if unset |
| MCP_AUTH_JWT_SECRET | Shared secret for HMAC algorithms |
| MCP_AUTH_JWT_JWKS_URI | JWKS endpoint URL (alternative to a static key) |
| MCP_AUTH_JWT_ISSUER | Expected iss claim in the token |
| MCP_AUTH_JWT_AUDIENCE | Expected aud claim in the token |

For HMAC-based auth (simplest to set up), create the shared secret in the MCP server's namespace so its Deployment can reference it:

oc create secret generic mcp-auth \
  --from-literal=MCP_AUTH_JWT_SECRET=your-shared-secret \
  --context="$CTX" -n calculus-mcp

Add the env vars to the MCP server's openshift.yaml deployment spec, or set them directly:

oc set env deployment/mcp-server \
  MCP_AUTH_JWT_ALG=HS256 \
  --from=secret/mcp-auth \
  --context="$CTX" -n calculus-mcp

Configure the agent for authenticated MCP

On the agent side, the MCP server entry in agent.yaml supports auth headers. The agent passes a bearer token when connecting:

mcp_servers:
  - url: ${MCP_CALCULUS_URL:-http://mcp-server.calculus-mcp.svc.cluster.local:8080/mcp/}
    auth:
      token: ${MCP_AUTH_TOKEN}

Note

The auth.token field requires fipsagents v0.28.0 or later. If your agent version doesn't support it, pass the token via the MCP_AUTH_TOKEN environment variable instead -- the MCP client reads it automatically.

Store the token in a Secret and inject it the same way as OPENAI_API_KEY.
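The text does not show how the value of MCP_AUTH_TOKEN is produced. For the HS256 setup, a token can be minted from the shared secret; the sketch below does this with the standard library only. Which claims the server actually requires is an assumption here (iss, aud, iat, exp), and both function names are hypothetical. In practice an identity provider or a maintained JWT library would normally issue the token.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_hs256_token(secret, issuer, audience, ttl_seconds=3600, now=None):
    """Build a signed HS256 JWT carrying iss, aud, iat, and exp claims."""
    now = int(time.time()) if now is None else int(now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"iss": issuer, "aud": audience, "iat": now, "exp": now + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256_signature(token, secret):
    """Check only the signature; claim validation is left to the server."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return hmac.compare_digest(_b64url(expected), signature)
```

The issuer and audience passed in should match MCP_AUTH_JWT_ISSUER and MCP_AUTH_JWT_AUDIENCE on the server.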

Production auth patterns

For production, prefer RS256 with a JWKS endpoint over shared secrets. This lets you rotate keys without redeploying. Set MCP_AUTH_JWT_JWKS_URI to your identity provider's JWKS URL (e.g., Keycloak or Red Hat SSO).

Security configuration

The security section in agent.yaml controls runtime security behavior:

security:
  mode: ${SECURITY_MODE:-enforce}
  tool_inspection:
    enabled: ${TOOL_INSPECTION_ENABLED:-true}

Enforce vs observe mode

| Mode | Behavior | Use when |
| --- | --- | --- |
| enforce | Blocks execution when a security finding is detected | Production |
| observe | Logs findings but allows execution to continue | Tuning and testing |

Start with observe when you first deploy to understand what the security layer flags. Once you've reviewed the findings and confirmed they're legitimate, switch to enforce.

Tool inspection

When tool_inspection.enabled is true, the ToolInspector scans tool call arguments for secrets, C2 patterns, and prompt injection before execution. Findings are logged to fipsagents.security.audit. In enforce mode, flagged calls are blocked; in observe mode, they are logged but allowed.

You can override the global mode per layer. For example, to enforce tool inspection but only observe guardrails while tuning them:

security:
  mode: enforce
  tool_inspection:
    enabled: true
  guardrails:
    mode: observe

Resource limits and scaling

Resource limits

The default Helm values set conservative resource limits because agents are I/O-bound -- they spend most of their time waiting for LLM and MCP responses:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Adjust these based on your agent's actual usage. An agent that processes large context windows or runs heavy tool-result parsing may need more memory.

Horizontal scaling

The agent and MCP server scale independently. Add a HorizontalPodAutoscaler to scale the agent based on CPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: calculus-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: calculus-agent
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Save the YAML above to hpa.yaml (in your current directory), then apply it:

oc apply -f hpa.yaml --context="$CTX" -n calculus-agent

The same pattern works for the MCP server -- create a separate HPA targeting its Deployment.
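To reason about how this HPA will behave, it helps to work through the scaling formula that Kubernetes documents: desired replicas = ceil(current replicas x current utilization / target utilization), clamped to the min and max. The sketch below is a simplification that ignores pod readiness and stabilization windows; the function name is illustrative.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization=70, min_replicas=2,
                     max_replicas=8, tolerance=0.1):
    """Approximate the HPA replica calculation for a Utilization target."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance of the target: keep the current count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two pods averaging 140% of their CPU request scale out to four.
    print(desired_replicas(2, 140))
```

Utilization is measured against the container's CPU request, not its limit. With the default request of 100m, a 70% target corresponds to an average of 70m per pod, so a low request makes the HPA scale out early.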

Apply HPA after your final Helm upgrade

The HPA takes ownership of .spec.replicas once applied. Any subsequent helm upgrade will conflict with the HPA over the replica count. Apply the HPA as the last step, after all Helm configuration is finalized. If you need to run helm upgrade later, delete the HPA first, upgrade, then re-apply.

Why min 2 replicas?

A single replica means any pod restart causes downtime. Two replicas ensure one pod is always available during rolling updates and restarts.

Monitoring

Health and readiness probes

The agent exposes /healthz for liveness probes and /readyz for readiness probes. The Helm chart includes probe definitions -- enable them in values.yaml:

probes:
  enabled: true

This configures Kubernetes to restart the pod if /healthz stops responding (liveness) and to hold traffic until /readyz returns 200 during startup (readiness).

Pod logs

The most immediate debugging tool. Watch logs in real time:

oc logs deployment/calculus-agent --context="$CTX" -n calculus-agent -f

Key log patterns to watch for:

| Pattern | Meaning |
| --- | --- |
| Uvicorn running on | Agent started successfully |
| Connected to MCP server | MCP connection established |
| Tool inspection finding | Security layer flagged a tool call |
| Retrying after error | Backoff triggered on a failed LLM call |
| Max iterations reached | Agent hit the loop ceiling -- check loop.max_iterations |

Set LOG_LEVEL to DEBUG temporarily when investigating issues, then return to INFO or WARNING for normal operation.

Route timeouts

OpenShift Routes have a default timeout of 30 seconds. LLM calls regularly exceed this, especially with large context windows. If you haven't already set this in Module 5, annotate the gateway's Route:

oc annotate route calculus-gateway \
  haproxy.router.openshift.io/timeout=120s \
  --context="$CTX" -n calculus-gateway --overwrite

Do the same for the agent Route if it is also directly exposed. The UI Route typically doesn't need a longer timeout since it serves static assets.

Observability

The agent runtime includes built-in observability features for production deployments: session persistence, Prometheus metrics, structured trace collection, and optional OpenTelemetry export. All are configured through agent.yaml and share a common storage backend.

Session persistence

Enable session persistence to maintain conversation continuity across requests. Sessions are stored in the shared storage backend and expire automatically.

server:
  storage:
    backend: sqlite             # or: postgres
    sqlite_path: ./agent.db
  sessions:
    enabled: true
    max_age_hours: 168          # 7-day expiry

Override via Helm:

helm upgrade calculus-agent chart/ \
  --set config.STORAGE_BACKEND=sqlite \
  --set config.SESSIONS_ENABLED=true \
  --kube-context="$CTX" -n calculus-agent

The server exposes POST /v1/sessions, GET /v1/sessions/{id}, and DELETE /v1/sessions/{id} for explicit session management. You can also pass a session_id on any ChatCompletionRequest to auto-create the session on first use. See the BaseAgent API reference for details.

Prometheus metrics

The agent exposes Prometheus-format metrics at GET /metrics. Enable with:

server:
  metrics:
    enabled: true

Requires the [metrics] extra: pip install fipsagents[metrics].

Available metrics:

| Metric | Type | Labels |
| --- | --- | --- |
| agent_requests_total | counter | model, status, stream |
| agent_request_duration_seconds | histogram | model |
| agent_model_call_duration_seconds | histogram | model |
| agent_tool_call_total | counter | tool_name, status |
| agent_tokens_total | counter | model, direction |

To scrape metrics with OpenShift user-workload monitoring, create a ServiceMonitor.

Prerequisite: user-workload monitoring

The ServiceMonitor requires OpenShift's user-workload monitoring to be enabled. See the OpenShift documentation if your cluster doesn't have it configured.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: calculus-agent-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: calculus-agent
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
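Before wiring up dashboards, a quick sanity check is to read /metrics directly and compute an error ratio from agent_requests_total. The sketch below parses the Prometheus text format in a simplified way (labelled samples only, no timestamps). The sample payload and the status label values "success" and "error" are invented for illustration and may not match what the agent emits.

```python
import re

_SAMPLE = re.compile(r"^(\w+)\{([^}]*)\}\s+(\S+)$")
_LABEL = re.compile(r'(\w+)="([^"]*)"')

# Illustrative payload only; real output comes from GET /metrics.
SAMPLE_TEXT = """\
# TYPE agent_requests_total counter
agent_requests_total{model="m",status="success",stream="false"} 90
agent_requests_total{model="m",status="error",stream="false"} 10
"""


def parse_samples(text):
    """Yield (metric_name, labels, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match:
            labels = dict(_LABEL.findall(match.group(2)))
            yield match.group(1), labels, float(match.group(3))


def error_ratio(text, error_status="error"):
    """Fraction of agent_requests_total samples with the error status."""
    total = errors = 0.0
    for name, labels, value in parse_samples(text):
        if name != "agent_requests_total":
            continue
        total += value
        if labels.get("status") == error_status:
            errors += value
    return errors / total if total else 0.0
```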

Trace collection

TraceCollector records structured spans for every request -- model calls, tool invocations, and durations. Enable traces alongside storage:

server:
  storage:
    backend: sqlite
  traces:
    enabled: true
    sampling_rate: 1.0

Query traces via GET /v1/traces and GET /v1/traces/{id}. Each trace includes duration, span count, tool calls, and the model used. See the BaseAgent API reference for the full trace schema.

OTEL export (optional)

For enterprise observability stacks, export traces to an OpenTelemetry Collector via OTLP:

server:
  traces:
    enabled: true
    exporter: otel
    otel_endpoint: http://otel-collector:4317
    service_name: calculus-agent

Requires the [otel] extra: pip install fipsagents[otel].

The server automatically propagates W3C Trace Context (traceparent header) -- extracting it from incoming requests and injecting it into outgoing RemoteNode calls. This links spans across multi-agent workflows into a single distributed trace without any application-level code.
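For reference, a version-00 traceparent header has four hyphen-separated hex fields: version, a 32-character trace id, a 16-character parent span id, and flags. The following sketch parses one and derives the header for a child span; the runtime does this for you, so the helpers here are purely illustrative.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_traceparent(header):
    """Return the fields of a traceparent header, or None if invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming, span_id=None):
    """Keep the trace id, swap in a new span id for the outgoing call."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return None
    span_id = span_id or secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{span_id}-{flags}"
```

Because every hop keeps the same trace id, a collector can stitch the spans from several agents into one trace.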

User feedback collection

Metrics tell you the agent is fast. Traces tell you what it did. Neither tells you whether users were happy with the answer. Feedback collection closes that gap by storing thumbs-up / thumbs-down ratings, optional comments, and corrections -- joined to the trace that produced each response so you can replay the conversation behind a bad rating.

This data is the raw material for two downstream pipelines: dashboards that surface degradations early, and labelled datasets for fine-tuning or RLHF.

Enable feedback alongside tracing:

server:
  storage:
    backend: sqlite             # or: postgres
  traces:
    enabled: true               # so feedback can join to a trace
  feedback:
    enabled: true
    max_age_hours: 720          # keep 30 days

Override via Helm or env vars:

helm upgrade calculus-agent chart/ \
  --set config.STORAGE_BACKEND=sqlite \
  --set config.TRACES_ENABLED=true \
  --set config.FEEDBACK_ENABLED=true \
  --kube-context="$CTX" -n calculus-agent

REST endpoints

The server exposes four feedback endpoints:

| Path | Method | Purpose |
| --- | --- | --- |
| /v1/feedback | POST | Submit a rating (1 = thumbs-up, -1 = thumbs-down) |
| /v1/feedback | GET | Query records, filterable by trace_id, session_id, time window |
| /v1/feedback/{feedback_id} | PATCH | Edit an existing record in place: change the rating or revise the comment |
| /v1/feedback/stats | GET | Aggregated counts grouped by time window (hour / day / week) |

Grab the agent's route URL so curl commands work from your workstation:

AGENT_ROUTE=$(oc get route calculus-agent --context="$CTX" -n calculus-agent -o jsonpath='{.spec.host}')

A minimal POST looks like this:

curl -X POST https://$AGENT_ROUTE/v1/feedback \
  -H 'Content-Type: application/json' \
  -d '{"trace_id":"trace_abc123","rating":1,"comment":"clear explanation"}'

trace_id is optional -- if omitted, the server synthesises a stand-alone identifier so feedback works even when tracing is disabled or sampled out. Records keyed to a real trace can be joined to the trace store; orphan records are still useful as raw rating data.

When a user changes their mind on an already-rated message, send a PATCH with the new fields rather than posting again -- the record updates in place, no duplicate row is created. PATCH bodies are partial: omitted fields stay as they were.

Capture the feedback_id from the original POST response, then update it:

curl -X PATCH https://$AGENT_ROUTE/v1/feedback/fb_abc123 \
  -H 'Content-Type: application/json' \
  -d '{"rating":-1,"comment":"on second look, this was wrong"}'

Returns 200 with the full updated record, or 404 if the id is unknown.
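The POST-then-PATCH flow above can also be scripted. This sketch uses only the standard library; it assumes the POST response is a JSON object with a feedback_id field, which the text implies but does not show, and the helper names are hypothetical.

```python
import json
import urllib.request


def feedback_request(base_url, method, path, body):
    """Build (but do not send) a JSON request for a feedback endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def rate_then_revise(base_url, trace_id):
    """Submit a thumbs-up, then change it to a thumbs-down in place."""
    created = send(feedback_request(
        base_url, "POST", "/v1/feedback",
        {"trace_id": trace_id, "rating": 1, "comment": "clear explanation"},
    ))
    feedback_id = created["feedback_id"]  # assumed response field name
    return send(feedback_request(
        base_url, "PATCH", f"/v1/feedback/{feedback_id}",
        {"rating": -1, "comment": "on second look, this was wrong"},
    ))
```

Call it as rate_then_revise(f"https://{AGENT_ROUTE}", "trace_abc123"). urlopen raises urllib.error.HTTPError on a 404, so an unknown id surfaces as an exception.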

Where the trace_id comes from

Every chat completion response carries an X-Trace-Id header (for both synchronous and streaming responses) and a top-level trace_id field on the final SSE usage chunk. UI clients capture either value and attach it to subsequent feedback POSTs. The gateway preserves the value verbatim:

Browser  ──POST /v1/feedback──▶  UI proxy  ──▶  Gateway  ──▶  Agent
                                                  └─ forwards Authorization,
                                                     X-User-ID for attribution

UI integration

The chat UI scaffolded by fips-agents create ui includes thumbs-up / thumbs-down icons that hover-reveal on completed assistant messages. Thumbs-up records a positive rating immediately. Thumbs-down opens a small modal asking for a category (Inaccurate / Not helpful / Harmful / Too long / Other) plus an optional free-text comment, then POSTs to /v1/feedback via the gateway. Categories are encoded as a bracketed prefix on the comment field ([Inaccurate] verbose detail) so they round-trip through the existing schema and remain recoverable from queries.
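When you query feedback later, the category has to be split back out of the comment. A small sketch of the round trip, assuming the prefix format shown above; the helper names are hypothetical and the UI's own encoding code may differ in detail.

```python
import re

CATEGORIES = ("Inaccurate", "Not helpful", "Harmful", "Too long", "Other")
_PREFIX = re.compile(r"^\[([^\]]+)\]\s*(.*)$", re.DOTALL)


def encode_comment(category, comment=""):
    """Prefix the comment with its bracketed category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return f"[{category}] {comment}".rstrip()


def decode_comment(stored):
    """Return (category, comment); category is None when no prefix matches."""
    match = _PREFIX.match(stored or "")
    if match and match.group(1) in CATEGORIES:
        return match.group(1), match.group(2)
    return None, stored or ""
```

Checking the prefix against the known categories keeps a user comment that happens to start with a bracket from being misread as a category.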

Querying feedback

List the most recent records for a session:

curl "https://$AGENT_ROUTE/v1/feedback?session_id=demo-1&limit=20" | jq

Get aggregated stats for the last 7 days, bucketed by day:

curl "https://$AGENT_ROUTE/v1/feedback/stats?window=day&since=2026-04-19T00:00:00Z" | jq

Each stats row contains window_start, window_end, agent_type, thumbs_up, thumbs_down, and total. Pipe these to your analytics stack -- a Grafana panel keyed off the SQLite or Postgres backend is typical.
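As a starting point before building a dashboard, the stats rows can be reduced to an approval rate and the worst-performing window. The sketch below uses the thumbs_up, thumbs_down, and total fields named above; the sample rows are invented for illustration, and the function names are hypothetical.

```python
def approval_rate(rows):
    """Share of thumbs-up among all ratings, or None with no ratings."""
    up = sum(row["thumbs_up"] for row in rows)
    down = sum(row["thumbs_down"] for row in rows)
    rated = up + down
    return up / rated if rated else None


def worst_window(rows, min_total=5):
    """Row with the lowest thumbs-up share among sufficiently rated windows."""
    threshold = max(min_total, 1)  # avoid dividing by zero
    candidates = [row for row in rows if row["total"] >= threshold]
    if not candidates:
        return None
    return min(candidates, key=lambda row: row["thumbs_up"] / row["total"])


# Invented sample rows in the shape described above.
rows = [
    {"window_start": "2026-04-19T00:00:00Z", "thumbs_up": 8, "thumbs_down": 2, "total": 10},
    {"window_start": "2026-04-20T00:00:00Z", "thumbs_up": 3, "thumbs_down": 7, "total": 10},
]
```

The min_total floor keeps a window with one or two ratings from dominating the result.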

Lab exercise

Enable feedback on the calculus agent with sqlite storage:

  1. Set server.feedback.enabled: true and server.storage.backend: sqlite in agent.yaml.
  2. Add fipsagents[feedback] to the dependencies list in pyproject.toml (or run pip install 'fipsagents[feedback]' in your venv).
  3. Redeploy:

    oc start-build calculus-agent --from-dir=. --follow -n calculus-agent --context="$CTX"
    oc rollout restart deployment/calculus-agent -n calculus-agent --context="$CTX"
    
  4. Open the chat UI, run several conversations, click thumbs-up on good answers and thumbs-down (with a category) on bad ones.
  5. Query /v1/feedback/stats?window=hour to see your ratings aggregated.
  6. Pick a low-rated trace and fetch it: GET /v1/traces/{trace_id} -- the full conversation, tool calls, and timings are recoverable. That is your first labelled training example.

What's next

You've built and hardened a complete AI agent system across the first eight modules:

  1. Scaffolded an agent project and understood every file
  2. Configured the agent for a real LLM and deployed it to OpenShift
  3. Built an MCP server with calculus tools
  4. Wired the MCP tools into the agent
  5. Deployed a gateway and chat UI for browser-based interaction
  6. Added a code execution sandbox for numerical computation
  7. Extended the agent with AI-assisted slash commands
  8. Hardened the stack with secrets, authentication, security policy, monitoring, observability, and user feedback collection

The calculus-agent/ and calculus-helper/ directories in this repository serve as complete reference implementations. Use them as starting points for your own agents.

For deeper dives into specific topics, see the Reference pages: agent.yaml configuration, Helm chart anatomy, BaseAgent API, and MCP protocol details.

When you're ready to teach the agent to read user-supplied documents, Module 9 covers the file-upload track end-to-end: drag-drop UI, streaming gateway proxy, Docling parsing, and ClamAV virus scanning.