Context

The Backbone Helm chart currently has minimal configuration support. Users cannot configure the broker through Helm values, making it difficult to deploy in production environments. The chart needs to support all configuration options from the README and follow Helm best practices for production deployments.

Constraints

  • Must maintain backward compatibility where possible
  • Must align with environment variables documented in README
  • Must follow Kubernetes and Helm best practices
  • Storage backend (SQLite, PostgreSQL) may require persistent volumes

Stakeholders

  • Kubernetes operators deploying Backbone
  • Users requiring production-grade deployments with HA, monitoring, and persistence

Goals / Non-Goals

Goals

  • Expose all README environment variables through Helm values
  • Support persistent storage for /data directory with configurable storage class
  • Follow Helm best practices (resources, probes, security contexts, labels)
  • Enable production-ready deployments with proper health checks
  • Support ingress for HTTP API exposure

Non-Goals

  • StatefulSet conversion (deployment is sufficient for single-replica MQTT broker)
  • Horizontal Pod Autoscaling (MQTT broker state management complexity)
  • Built-in monitoring/metrics exporters (separate concern)
  • Multi-replica support with Redis clustering (future enhancement)

Decisions

Decision: Use PVC for /data persistence

Rationale: The application may use SQLite or store session data in /data. A PVC ensures data survives pod restarts and enables backup/restore workflows.

Alternatives considered:

  • emptyDir: Loses data on pod restart, unsuitable for production
  • hostPath: Ties pod to specific node, reduces portability
  • PVC (chosen): Standard Kubernetes pattern, supports storage classes, backup-friendly

Implementation:

  • Optional PVC controlled by persistence.enabled flag
  • Configurable storage class, size, and access mode
  • Defaults to disabled for backward compatibility
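
A sketch of the corresponding values block and PVC template (key names are illustrative, not final):

persistence:
  enabled: false # disabled by default for backward compatibility
  storageClass: '' # '' falls back to the cluster default StorageClass
  accessMode: ReadWriteOnce
  size: 1Gi

And a guarded templates/pvc.yaml, reusing the release name the chart already leans on:

{{- if .Values.persistence.enabled }}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Release.Name }}-data
spec:
  accessModes:
    - {{ .Values.persistence.accessMode }}
  {{- with .Values.persistence.storageClass }}
  storageClassName: {{ . }}
  {{- end }}
  resources:
    requests:
      storage: {{ .Values.persistence.size }}
{{- end }}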

Decision: Environment variable structure in values.yaml

Rationale: Group environment variables under shallow, logical sections (config, k8s, ws, api, tcp, oidc, redis) rather than deep nesting, for better readability.

Structure:

config:
  adminToken: ''
  jwtSecret: ''
  httpPort: 8883
  tcpPort: 1883

k8s:
  enabled: true # default true since chart runs in K8s

ws:
  enabled: false

api:
  enabled: false

tcp:
  enabled: false

oidc:
  enabled: false
  discovery: ''
  clientId: ''
  clientSecret: ''
  groupField: 'groups'
  adminGroup: ''
  writerGroup: ''
  readerGroup: ''

redis:
  enabled: false
  host: 'localhost'
  port: 6379
  password: ''
  db: 0
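
For illustration, the deployment template would map these values onto the README's environment variables roughly as below. Only K8S_ENABLED and API_ENABLED are names confirmed in this document; the others are placeholders showing the pattern, and secrets (adminToken, jwtSecret) should be injected via secretKeyRef rather than inline values:

env:
  - name: K8S_ENABLED
    value: {{ .Values.k8s.enabled | quote }}
  - name: API_ENABLED
    value: {{ .Values.api.enabled | quote }}
  - name: HTTP_PORT # placeholder name, confirm against the README
    value: {{ .Values.config.httpPort | quote }}
  {{- if .Values.redis.enabled }}
  - name: REDIS_HOST # placeholder name
    value: {{ .Values.redis.host | quote }}
  {{- end }}
  - name: ADMIN_TOKEN # placeholder name; sourced from a Secret
    valueFrom:
      secretKeyRef:
        name: {{ .Release.Name }}-secrets
        key: adminToken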

Decision: ServiceAccount template instead of hardcoded name

Rationale: The current deployment references {{ .Release.Name }} as its ServiceAccount but the chart never creates that ServiceAccount. Extract it into a proper template with a configurable name and annotations.

Migration: Existing deployments that reference the release name continue to work, since the templated name defaults to the release name.
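
A minimal templates/serviceaccount.yaml along these lines (the serviceAccount.* value keys are illustrative); defaulting the name to the release name is what keeps existing deployments working:

{{- if .Values.serviceAccount.create }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Values.serviceAccount.name | default .Release.Name }}
  {{- with .Values.serviceAccount.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}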

Decision: Default K8S_ENABLED to true in chart

Rationale: The Helm chart is deployed TO Kubernetes, so K8s integration should default to enabled. Users can disable if running in non-operator mode.

Decision: Security context defaults

Apply a restricted security context by default. Note that fsGroup is a pod-level field while readOnlyRootFilesystem is container-level, so the values split into the two blocks Kubernetes expects:

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

securityContext:
  readOnlyRootFilesystem: false # /data needs write access

Rationale: Follows Kubernetes security best practices. readOnlyRootFilesystem stays disabled because SQLite needs write access to /data; once persistence is enabled, /data becomes a volume mount (writable even under a read-only root filesystem), so this default could be tightened later.
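
In the deployment template this wires up as two separate blocks, following the convention from helm create (a sketch; the container name is illustrative):

spec:
  template:
    spec:
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: backbone
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}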

Decision: Probe configuration

Add both liveness and readiness probes with sensible defaults:

  • Liveness: HTTP GET /health on port 8883 (requires API_ENABLED)
  • Readiness: HTTP GET /health on port 8883
  • Fallback: TCP socket checks on the broker ports when the API is disabled

Rationale: Enables Kubernetes to detect unhealthy pods and route traffic appropriately.
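
A deployment-template sketch of that conditional, assuming the values structure above and the MQTT TCP port as the fallback target:

{{- if .Values.api.enabled }}
livenessProbe:
  httpGet:
    path: /health
    port: {{ .Values.config.httpPort }}
readinessProbe:
  httpGet:
    path: /health
    port: {{ .Values.config.httpPort }}
{{- else }}
livenessProbe:
  tcpSocket:
    port: {{ .Values.config.tcpPort }}
readinessProbe:
  tcpSocket:
    port: {{ .Values.config.tcpPort }}
{{- end }}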

Risks / Trade-offs

Risk: Breaking changes for existing deployments

Mitigation:

  • Set conservative defaults matching current behavior where possible
  • Document migration path in CHANGELOG or upgrade notes
  • Version bump signals breaking changes (0.1.0 → 0.2.0)

Risk: Complex values.yaml overwhelming users

Mitigation:

  • Provide comprehensive comments
  • Include examples in comments
  • Keep sensible defaults for the 90% use case
  • Create example values files for common scenarios

Risk: Storage class availability varies by cluster

Mitigation:

  • Make storage class configurable (default: "" uses cluster default)
  • Document common storage classes in values comments
  • Support disabling persistence entirely

Migration Plan

For existing deployments:

  1. Review values.yaml changes
  2. Set persistence.enabled: false to maintain stateless behavior (if desired)
  3. Move any environment variables previously set via manual env overrides into the new values structure
  4. Update service types if non-default required
  5. Helm upgrade with new chart version
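
Applying the upgrade (step 5) could then look like this; the release name and values file are illustrative:

# Upgrade to the new chart version with your customized values
helm upgrade backbone ./charts -f production-values.yaml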

Rollback:

Standard Helm rollback:

helm rollback <release> <revision>

Validation:

# Dry-run
helm upgrade --install backbone ./charts --dry-run --debug

# Lint
helm lint ./charts

# Template verification
helm template backbone ./charts > manifests.yaml
kubectl apply --dry-run=client -f manifests.yaml

Open Questions

  1. Should probes be enabled by default?

    • Proposal: Yes, but only if api.enabled=true, otherwise use TCP checks
  2. Default persistence size?

    • Proposal: 1Gi for SQLite database and session data
  3. Should we support initContainers for DB migrations?

    • Proposal: No, out of scope for this change (future enhancement)
  4. Ingress class defaults?

    • Proposal: Empty string, user must specify their ingress class