8. Production Hardening¶

Your full stack is deployed: agent, MCP server, gateway, UI, and code execution sandbox. Everything works in development. This module covers what it takes to run the stack in production -- secrets management, FIPS compliance, authentication, security policy, resource limits, monitoring, and observability.

FIPS compliance¶

Federal Information Processing Standards (FIPS 140-2/140-3) mandate the use of validated cryptographic modules. FIPS compliance is required for U.S. government workloads and many enterprise environments. If your OpenShift cluster runs with FIPS mode enabled, every container in the cluster must use FIPS-validated crypto.

Red Hat UBI base images are FIPS-capable out of the box. When the host kernel has fips=1 set, UBI's OpenSSL automatically restricts itself to FIPS-validated algorithms. No application-level configuration is needed.

Multi-cluster safety

If you work with more than one cluster, pin the context at the top of your terminal session so every command targets the right cluster:

export CTX=$(oc config current-context)

Every oc and helm command in this module uses $CTX.

To verify FIPS mode is active in a running pod, first confirm the pod is up:

oc get pods --context="$CTX" -n calculus-agent -l app.kubernetes.io/instance=calculus-agent

You should see a pod in Running state with READY 1/1 (or 2/2 if the sandbox sidecar from Module 6 is enabled). If the pod is in CrashLoopBackOff, ImagePullBackOff, or Pending, the next command will fail with "no running pod found". That is a deployment problem, not a FIPS problem; run oc describe deployment calculus-agent --context="$CTX" -n calculus-agent to diagnose.

Once the pod is running:

oc exec deployment/calculus-agent --context="$CTX" -n calculus-agent -- \
  cat /proc/sys/crypto/fips_enabled

A return value of 1 means FIPS mode is active. A return value of 0 means the host kernel does not have FIPS enabled.

What breaks under FIPS

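In Python, hashlib.md5() raises an error under FIPS unless it is called with usedforsecurity=False. TLS is restricted to FIPS-approved cipher suites (in practice AES-based suites such as AES-GCM and AES-CCM), so non-approved algorithms such as RC4 and ChaCha20-Poly1305 are unavailable. If your agent calls an external endpoint that only offers non-approved ciphers, the TLS handshake will fail. The fix is on the remote endpoint, not your agent.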
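The MD5 point above can be sketched as follows. This is a minimal illustration, assuming Python 3.9 or later (where hashlib accepts usedforsecurity); the helper names cache_key and integrity_digest are hypothetical and not part of fipsagents.

```python
import hashlib


def cache_key(data: bytes) -> str:
    """Non-security fingerprint, e.g. for cache keys or ETags.

    usedforsecurity=False declares that MD5 is not protecting anything,
    which is what allows the call on a FIPS-enabled host.
    """
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def integrity_digest(data: bytes) -> str:
    """Security-relevant digest: use a FIPS-approved hash such as SHA-256."""
    return hashlib.sha256(data).hexdigest()
```

If a dependency calls hashlib.md5() without the flag, the error surfaces inside that library; upgrading the dependency or switching it to SHA-256 is usually the way out.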

Secrets management¶

Production credentials must never appear in agent.yaml, prompts, or source code. OpenShift Secrets are the standard mechanism for injecting sensitive values at runtime.

Create a Secret¶

The openai SDK requires OPENAI_API_KEY to be set, even when calling unauthenticated endpoints like vLLM (set it to any non-empty string in that case). For endpoints that require real credentials, create a Secret:

oc create secret generic llm-credentials \
  --from-literal=OPENAI_API_KEY=sk-your-real-key-here \
  --context="$CTX" -n calculus-agent

Mount via Helm values¶

The Helm chart's env section supports secretKeyRef for injecting Secret values as environment variables. Add this to chart/values.yaml in your agent project:

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-credentials
        key: OPENAI_API_KEY

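After adding the block, run the upgrade from the calculus-agent/ directory so that chart/ resolves correctly. Because --reuse-values carries over the previous release's values and does not re-read the chart defaults, pass the edited file explicitly with -f:

helm upgrade calculus-agent chart/ --reuse-values -f chart/values.yaml --kube-context="$CTX" -n calculus-agent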

The Deployment template injects the Secret value as an environment variable. The ${OPENAI_API_KEY} reference in agent.yaml picks it up at runtime through normal env var substitution.
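To make the substitution step concrete, here is a rough sketch of shell-style expansion for the ${VAR} and ${VAR:-default} forms that agent.yaml uses. It is not the fipsagents loader, whose exact behavior (for example on unset variables) may differ; the function name expand and the fail-fast KeyError are assumptions made for illustration.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(text: str, env=None) -> str:
    """Replace ${VAR} / ${VAR:-default} references with values from env."""
    env = os.environ if env is None else env

    def _replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value:                      # set and non-empty: use it
            return value
        if default is not None:        # unset or empty: fall back to default
            return default
        if value is None:              # unset, no default: fail loudly
            raise KeyError(f"required variable {name} is not set")
        return value                   # set but empty, no default

    return _VAR.sub(_replace, text)
```

With the Secret mounted, expand("api_key: ${OPENAI_API_KEY}") would pick up the injected value, while a reference such as ${SECURITY_MODE:-enforce} falls back to enforce when the variable is absent.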

Secrets vs ConfigMaps

Use ConfigMaps for non-sensitive configuration (MODEL_ENDPOINT, LOG_LEVEL). Use Secrets for credentials, API keys, and tokens. By default, Secret values are only base64-encoded, which is an encoding and not encryption; they are encrypted at rest only if etcd encryption is enabled on your cluster.
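Base64 is reversible by anyone who can read the Secret object, which is why access control and etcd encryption matter. A short sketch (the helper names are illustrative) of what the data field of a Secret actually holds:

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """What ends up in the Secret's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Anyone with read access to the Secret can do this; no key is needed."""
    return base64.b64decode(encoded).decode("utf-8")
```

Treat read access to Secrets in the namespace as equivalent to knowing the credentials, and scope RBAC accordingly.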

MCP server authentication¶

The calculus-helper MCP server includes JWT authentication support in src/core/auth.py. When enabled, the server validates a bearer token on every request before executing any tool.

Enable JWT auth on the MCP server¶

Set these environment variables on the MCP server deployment:

Variable	Purpose
`MCP_AUTH_JWT_ALG`	Algorithm: `RS256`, `HS256`, etc. Auth is disabled if unset
`MCP_AUTH_JWT_SECRET`	Shared secret for HMAC algorithms
`MCP_AUTH_JWT_JWKS_URI`	JWKS endpoint URL (alternative to a static key)
`MCP_AUTH_JWT_ISSUER`	Expected `iss` claim in the token
`MCP_AUTH_JWT_AUDIENCE`	Expected `aud` claim in the token

For HMAC-based auth (simplest to set up), create the Secret in the MCP server's namespace so its Deployment can reference it:

oc create secret generic mcp-auth \
  --from-literal=MCP_AUTH_JWT_SECRET=your-shared-secret \
  --context="$CTX" -n calculus-mcp

Add the env vars to the MCP server's openshift.yaml deployment spec, or set them directly:

oc set env deployment/mcp-server \
  MCP_AUTH_JWT_ALG=HS256 \
  --from=secret/mcp-auth \
  --context="$CTX" -n calculus-mcp

Configure the agent for authenticated MCP¶

On the agent side, the MCP server entry in agent.yaml supports auth headers. The agent passes a bearer token when connecting:

mcp_servers:
  - url: ${MCP_CALCULUS_URL:-http://mcp-server.calculus-mcp.svc.cluster.local:8080/mcp/}
    auth:
      token: ${MCP_AUTH_TOKEN}

Note

The auth.token field requires fipsagents v0.28.0 or later. If your agent version doesn't support it, pass the token via the MCP_AUTH_TOKEN environment variable instead -- the MCP client reads it automatically.

Store the token in a Secret and inject it the same way as OPENAI_API_KEY.
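For the HS256 setup, you need a token signed with the same shared secret. The following is a standard-library sketch of minting one for testing; the function names are hypothetical, the issuer and audience arguments must match whatever you set in MCP_AUTH_JWT_ISSUER and MCP_AUTH_JWT_AUDIENCE, and which additional claims src/core/auth.py checks is an assumption you should verify against that file.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(raw: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode("utf-8"), signing_input.encode("utf-8"),
                      hashlib.sha256).digest()
    return _b64url(digest)


def mint_hs256_token(secret, issuer, audience, ttl_seconds=3600, now=None):
    """Build header.payload.signature with iss, aud, iat, and exp claims."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"iss": issuer, "aud": audience,
              "iat": issued_at, "exp": issued_at + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_sign(signing_input, secret)}"


def check_signature(token, secret):
    """Verify the HMAC only (not exp/iss/aud) and return the claims."""
    signing_input, _, signature = token.rpartition(".")
    expected = _sign(signing_input, secret)
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        raise ValueError("signature mismatch")
    return json.loads(_b64url_decode(signing_input.split(".")[1]))
```

Keep the lifetime short for test tokens, and store the result in a Secret rather than pasting it into agent.yaml.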

Production auth patterns

For production, prefer RS256 with a JWKS endpoint over shared secrets. This lets you rotate keys without redeploying. Set MCP_AUTH_JWT_JWKS_URI to your identity provider's JWKS URL (e.g., Keycloak or Red Hat SSO).

Security configuration¶

The security section in agent.yaml controls runtime security behavior:

security:
  mode: ${SECURITY_MODE:-enforce}
  tool_inspection:
    enabled: ${TOOL_INSPECTION_ENABLED:-true}

Enforce vs observe mode¶

Mode	Behavior	Use when
`enforce`	Blocks execution when a security finding is detected	Production
`observe`	Logs findings but allows execution to continue	Tuning and testing

Start with observe when you first deploy to understand what the security layer flags. Once you've reviewed the findings and confirmed that they are genuine problems rather than false positives on normal traffic, switch to enforce.

Tool inspection¶

When tool_inspection.enabled is true, the ToolInspector scans tool call arguments for secrets, C2 patterns, and prompt injection before execution. Findings are logged to fipsagents.security.audit. In enforce mode, flagged calls are blocked; in observe mode, they are logged but allowed.

You can override the global mode per layer. For example, to enforce tool inspection but only observe guardrails while tuning them:

security:
  mode: enforce
  tool_inspection:
    enabled: true
  guardrails:
    mode: observe

Resource limits and scaling¶

Resource limits¶

The default Helm values set conservative resource limits because agents are I/O-bound -- they spend most of their time waiting for LLM and MCP responses:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Adjust these based on your agent's actual usage. An agent that processes large context windows or runs heavy tool-result parsing may need more memory.

Horizontal scaling¶

The agent and MCP server scale independently. Add a HorizontalPodAutoscaler to scale the agent based on CPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: calculus-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: calculus-agent
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Save the YAML above to hpa.yaml (in your current directory), then apply it:

oc apply -f hpa.yaml --context="$CTX" -n calculus-agent

The same pattern works for the MCP server -- create a separate HPA targeting its Deployment.

Apply HPA after your final Helm upgrade

The HPA takes ownership of .spec.replicas once applied. Any subsequent helm upgrade will conflict with the HPA over the replica count. Apply the HPA as the last step, after all Helm configuration is finalized. If you need to run helm upgrade later, delete the HPA first, upgrade, then re-apply.

Why min 2 replicas?

A single replica means any pod restart causes downtime. Two replicas ensure one pod is always available during rolling updates and restarts.
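To get a feel for how the HPA above behaves, the core of the Kubernetes scaling rule is desired = ceil(current replicas x current utilization / target utilization), skipped when the ratio is within a 10% tolerance of 1.0. Utilization is measured against the CPU request, so with requests.cpu: 100m a 70% target means about 70m average per pod. The sketch below reproduces only that core rule; the real controller also applies stabilization windows and pod-readiness handling.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70, min_replicas: int = 2,
                     max_replicas: int = 8, tolerance: float = 0.1) -> int:
    """Core HPA formula, clamped to the min/max from the manifest."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas          # close enough: no scaling
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, two pods averaging 140% of their CPU request scale to four, while three pods at 75% stay at three because the ratio falls inside the tolerance band.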

Monitoring¶

Health and readiness probes¶

The agent exposes /healthz for liveness probes and /readyz for readiness probes. The Helm chart includes probe definitions -- enable them in values.yaml:

probes:
  enabled: true

This configures Kubernetes to restart the pod if /healthz stops responding (liveness) and to hold traffic until /readyz returns 200 during startup (readiness).

Pod logs¶

Pod logs are the most immediate debugging tool. Watch them in real time:

oc logs deployment/calculus-agent --context="$CTX" -n calculus-agent -f

Key log patterns to watch for:

Pattern	Meaning
`Uvicorn running on`	Agent started successfully
`Connected to MCP server`	MCP connection established
`Tool inspection finding`	Security layer flagged a tool call
`Retrying after error`	Backoff triggered on a failed LLM call
`Max iterations reached`	Agent hit the loop ceiling -- check `loop.max_iterations`

Set LOG_LEVEL to DEBUG temporarily when investigating issues, then return to INFO or WARNING for normal operation.
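When the log volume is high, counting these patterns is quicker than reading the stream. A small sketch, using only the substrings from the table above; the function names and the short labels are illustrative, and the exact log line format is an assumption.

```python
# Substrings from the table above, mapped to short labels.
LOG_PATTERNS = {
    "Uvicorn running on": "agent started",
    "Connected to MCP server": "MCP connected",
    "Tool inspection finding": "security finding",
    "Retrying after error": "LLM call retried",
    "Max iterations reached": "loop ceiling hit",
}


def classify(line: str):
    """Return the label for the first matching pattern, or None."""
    for pattern, label in LOG_PATTERNS.items():
        if pattern in line:
            return label
    return None


def summarize(lines) -> dict:
    """Count how often each known pattern appears in an iterable of lines."""
    counts = {}
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling summarize(sys.stdin) in a small script lets you pipe oc logs output into it; a rising count of retries or loop ceiling hits is usually the first sign of an upstream problem.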

Route timeouts¶

OpenShift Routes have a default timeout of 30 seconds. LLM calls regularly exceed this, especially with large context windows. If you haven't already set this in Module 5, annotate the gateway's Route:

oc annotate route calculus-gateway \
  haproxy.router.openshift.io/timeout=120s \
  --context="$CTX" -n calculus-gateway --overwrite

Do the same for the agent Route if it is also directly exposed. The UI Route typically doesn't need a longer timeout since it serves static assets.

Observability¶

The agent runtime includes built-in observability features for production deployments: session persistence, Prometheus metrics, structured trace collection, and optional OpenTelemetry export. All are configured through agent.yaml and share a common storage backend.

Session persistence¶

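Enable session persistence to maintain conversation continuity across requests. Sessions are stored in the shared storage backend and expire automatically. A SQLite file is normally local to a single pod, so if you run more than one replica, use the postgres backend so that every pod sees the same sessions.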

server:
  storage:
    backend: sqlite             # or: postgres
    sqlite_path: ./agent.db
  sessions:
    enabled: true
    max_age_hours: 168          # 7-day expiry

Override via Helm:

helm upgrade calculus-agent chart/ \
  --set config.STORAGE_BACKEND=sqlite \
  --set config.SESSIONS_ENABLED=true \
  --kube-context="$CTX" -n calculus-agent

The server exposes POST /v1/sessions, GET /v1/sessions/{id}, and DELETE /v1/sessions/{id} for explicit session management. You can also pass a session_id on any ChatCompletionRequest to auto-create the session on first use. See the BaseAgent API reference for details.

Prometheus metrics¶

The agent exposes Prometheus-format metrics at GET /metrics. Enable with:

server:
  metrics:
    enabled: true

Requires the [metrics] extra: pip install 'fipsagents[metrics]' (the quotes keep shells such as zsh from interpreting the brackets).

Available metrics:

Metric	Type	Labels
`agent_requests_total`	counter	model, status, stream
`agent_request_duration_seconds`	histogram	model
`agent_model_call_duration_seconds`	histogram	model
`agent_tool_call_total`	counter	tool_name, status
`agent_tokens_total`	counter	model, direction

To scrape metrics with OpenShift user-workload monitoring, create a ServiceMonitor.

Prerequisite: user-workload monitoring

The ServiceMonitor requires OpenShift's user-workload monitoring to be enabled. See the OpenShift documentation if your cluster doesn't have it configured.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: calculus-agent-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: calculus-agent
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
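Before wiring up dashboards, you can sanity-check the /metrics output by hand. The sketch below parses the Prometheus text format and derives an error rate from agent_requests_total. The parser handles only simple sample lines, and the status label value "error" is an assumption; check the values your agent actually emits.

```python
import re
from collections import defaultdict

_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str):
    """Return (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def error_rate(text: str, error_status: str = "error") -> float:
    """Share of agent_requests_total samples carrying the error status."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "agent_requests_total":
            totals[labels.get("status", "")] += value
    total = sum(totals.values())
    return totals[error_status] / total if total else 0.0
```

In Prometheus itself the equivalent is a rate() query over the same counter; the sketch is only meant for quick checks against a curl of /metrics.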

Trace collection¶

TraceCollector records structured spans for every request -- model calls, tool invocations, and durations. Enable traces alongside storage:

server:
  storage:
    backend: sqlite
  traces:
    enabled: true
    sampling_rate: 1.0

Query traces via GET /v1/traces and GET /v1/traces/{id}. Each trace includes duration, span count, tool calls, and the model used. See the BaseAgent API reference for the full trace schema.

OTEL export (optional)¶

For enterprise observability stacks, export traces to an OpenTelemetry Collector via OTLP:

server:
  traces:
    enabled: true
    exporter: otel
    otel_endpoint: http://otel-collector:4317
    service_name: calculus-agent

Requires the [otel] extra: pip install 'fipsagents[otel]'.

The server automatically propagates W3C Trace Context (traceparent header) -- extracting it from incoming requests and injecting it into outgoing RemoteNode calls. This links spans across multi-agent workflows into a single distributed trace without any application-level code.
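The traceparent header has a fixed layout defined by W3C Trace Context: version, 32-hex trace id, 16-hex parent span id, and 2-hex flags, joined by hyphens. The sketch below shows how a service outside the agent runtime could parse an incoming header and derive the one it forwards; the function names are illustrative and this is not the server's own code.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 1)}


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random ids."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the trace id, mint a new span id for the outgoing call."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

Because the trace id is carried through unchanged, every hop lands in the same distributed trace, which is what links the spans described above.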

User feedback collection¶

Metrics tell you the agent is fast. Traces tell you what it did. Neither tells you whether users were happy with the answer. Feedback collection closes that gap by storing thumbs-up / thumbs-down ratings, optional comments, and corrections -- joined to the trace that produced each response so you can replay the conversation behind a bad rating.

This data is the raw material for two downstream pipelines: dashboards that surface degradations early, and labelled datasets for fine-tuning or RLHF.

Enable feedback alongside tracing:

server:
  storage:
    backend: sqlite             # or: postgres
  traces:
    enabled: true               # so feedback can join to a trace
  feedback:
    enabled: true
    max_age_hours: 720          # keep 30 days

Override via Helm or env vars:

helm upgrade calculus-agent chart/ \
  --set config.STORAGE_BACKEND=sqlite \
  --set config.TRACES_ENABLED=true \
  --set config.FEEDBACK_ENABLED=true \
  --kube-context="$CTX" -n calculus-agent

REST endpoints¶

The server exposes four feedback endpoints:

Path	Method	Purpose
`/v1/feedback`	POST	Submit a rating (1 = thumbs-up, -1 = thumbs-down)
`/v1/feedback`	GET	Query records, filterable by `trace_id`, `session_id`, time window
`/v1/feedback/{feedback_id}`	PATCH	Edit an existing record in place: change the rating or revise the comment
`/v1/feedback/stats`	GET	Aggregated counts grouped by time window (`hour` / `day` / `week`)

Grab the agent's route URL so curl commands work from your workstation:

AGENT_ROUTE=$(oc get route calculus-agent --context="$CTX" -n calculus-agent -o jsonpath='{.spec.host}')

A minimal POST looks like this:

curl -X POST https://$AGENT_ROUTE/v1/feedback \
  -H 'Content-Type: application/json' \
  -d '{"trace_id":"trace_abc123","rating":1,"comment":"clear explanation"}'

trace_id is optional -- if omitted, the server synthesises a stand-alone identifier so feedback works even when tracing is disabled or sampled out. Records keyed to a real trace can be joined to the trace store; orphan records are still useful as raw rating data.

When a user changes their mind on an already-rated message, send a PATCH with the new fields rather than posting again -- the record updates in place, no duplicate row is created. PATCH bodies are partial: omitted fields stay as they were.

Capture the feedback_id from the original POST response, then update it:

curl -X PATCH https://$AGENT_ROUTE/v1/feedback/fb_abc123 \
  -H 'Content-Type: application/json' \
  -d '{"rating":-1,"comment":"on second look, this was wrong"}'

Returns 200 with the full updated record, or 404 if the id is unknown.
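The same two calls can be made from Python. This sketch builds the requests with the standard library; the function names are hypothetical, the field names follow the curl examples above, and the Authorization header your gateway may require is left out.

```python
import json
import urllib.request


def feedback_payload(rating: int, trace_id=None, comment=None) -> dict:
    """Body for POST /v1/feedback; rating is 1 (up) or -1 (down)."""
    if rating not in (1, -1):
        raise ValueError("rating must be 1 or -1")
    body = {"rating": rating}
    if trace_id is not None:
        body["trace_id"] = trace_id
    if comment is not None:
        body["comment"] = comment
    return body


def feedback_request(base_url: str, body: dict, feedback_id=None):
    """POST creates a record; passing feedback_id turns it into a PATCH."""
    url = f"{base_url.rstrip('/')}/v1/feedback"
    method = "POST"
    if feedback_id is not None:
        url, method = f"{url}/{feedback_id}", "PATCH"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
```

Send a request with urllib.request.urlopen(req) and read the JSON response. For a PATCH, pass only the fields that change, for example {"comment": "on second look, this was wrong"}.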

Where the trace_id comes from¶

Every chat completion response carries an X-Trace-Id header (for both synchronous and streaming responses), and streaming responses also include a top-level trace_id field on the final SSE usage chunk. UI clients capture either value and attach it to subsequent feedback POSTs. The gateway preserves the value verbatim:

Browser  ──POST /v1/feedback──▶  UI proxy  ──▶  Gateway  ──▶  Agent
                                                  └─ forwards Authorization,
                                                     X-User-ID for attribution

UI integration¶

The chat UI scaffolded by fips-agents create ui includes thumbs-up / thumbs-down icons that hover-reveal on completed assistant messages. Thumbs-up records a positive rating immediately. Thumbs-down opens a small modal asking for a category (Inaccurate / Not helpful / Harmful / Too long / Other) plus an optional free-text comment, then POSTs to /v1/feedback via the gateway. Categories are encoded as a bracketed prefix on the comment field ([Inaccurate] verbose detail) so they round-trip through the existing schema and remain recoverable from queries.
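If you consume the feedback outside the scaffolded UI, you will need to split the category back out of the comment. A sketch of both directions, assuming the bracketed-prefix convention described above; the function names are illustrative.

```python
import re

CATEGORIES = ("Inaccurate", "Not helpful", "Harmful", "Too long", "Other")
_PREFIX = re.compile(r"^\[([^\]]+)\]\s*(.*)$", re.DOTALL)


def encode_comment(category: str, detail: str = "") -> str:
    """Prefix the free-text detail with the bracketed category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return f"[{category}] {detail}".rstrip()


def decode_comment(comment):
    """Return (category, detail); category is None when no known prefix is present."""
    match = _PREFIX.match(comment or "")
    if match and match.group(1) in CATEGORIES:
        return match.group(1), match.group(2)
    return None, comment or ""
```

Comments that merely start with a bracket but do not name a known category are returned untouched, so thumbs-up comments and free-form text pass through unchanged.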

Querying feedback¶

List the most recent records for a session:

curl "https://$AGENT_ROUTE/v1/feedback?session_id=demo-1&limit=20" | jq

Get aggregated stats for the last 7 days, bucketed by day:

curl "https://$AGENT_ROUTE/v1/feedback/stats?window=day&since=2026-04-19T00:00:00Z" | jq

Each stats row contains window_start, window_end, agent_type, thumbs_up, thumbs_down, and total. Pipe these to your analytics stack -- a Grafana panel keyed off the SQLite or Postgres backend is typical.
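Two small helpers make the stats endpoint easier to use from a script: one computes the since parameter instead of hard-coding a date, the other flags windows whose approval rate drops below a threshold. The row fields follow the list above; the 0.8 threshold and the minimum of five ratings are arbitrary starting points, not recommendations from the runtime.

```python
from datetime import datetime, timedelta, timezone


def since_param(days: int, now=None) -> str:
    """UTC midnight `days` ago, formatted for the `since` query parameter."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    start = (now - timedelta(days=days)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return start.strftime("%Y-%m-%dT%H:%M:%SZ")


def approval_rate(row: dict):
    """thumbs_up / (thumbs_up + thumbs_down), or None with no ratings."""
    rated = row["thumbs_up"] + row["thumbs_down"]
    return row["thumbs_up"] / rated if rated else None


def degraded_windows(rows, threshold: float = 0.8, min_ratings: int = 5):
    """window_start values whose approval rate is below the threshold."""
    flagged = []
    for row in rows:
        rated = row["thumbs_up"] + row["thumbs_down"]
        if rated >= min_ratings and approval_rate(row) < threshold:
            flagged.append(row["window_start"])
    return flagged
```

The min_ratings guard keeps a single thumbs-down in a quiet hour from raising an alert.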

Lab exercise¶

Enable feedback on the calculus agent with sqlite storage:

1. Set server.feedback.enabled: true and server.storage.backend: sqlite in agent.yaml.
2. Add fipsagents[feedback] to the dependencies list in pyproject.toml (or run pip install 'fipsagents[feedback]' in your venv).
3. Redeploy:
```
oc start-build calculus-agent --from-dir=. --follow -n calculus-agent --context="$CTX"
oc rollout restart deployment/calculus-agent -n calculus-agent --context="$CTX"
```
4. Open the chat UI, run several conversations, click thumbs-up on good answers and thumbs-down (with a category) on bad ones.
5. Query /v1/feedback/stats?window=hour to see your ratings aggregated.
6. Pick a low-rated trace and fetch it: GET /v1/traces/{trace_id} -- the full conversation, tool calls, and timings are recoverable. That is your first labelled training example.

What's next¶

You've built and hardened a complete AI agent system across the first eight modules:

1. Scaffolded an agent project and understood every file
2. Configured the agent for a real LLM and deployed it to OpenShift
3. Built an MCP server with calculus tools
4. Wired the MCP tools into the agent
5. Deployed a gateway and chat UI for browser-based interaction
6. Added a code execution sandbox for numerical computation
7. Extended the agent with AI-assisted slash commands
8. Hardened the stack with secrets, authentication, security policy, monitoring, observability, and user feedback collection

The calculus-agent/ and calculus-helper/ directories in this repository serve as complete reference implementations. Use them as starting points for your own agents.

For deeper dives into specific topics, see the Reference pages: agent.yaml configuration, Helm chart anatomy, BaseAgent API, and MCP protocol details.

When you're ready to teach the agent to read user-supplied documents, Module 9 covers the file-upload track end-to-end: drag-drop UI, streaming gateway proxy, Docling parsing, and ClamAV virus scanning.