Context

The Backbone Helm chart currently has minimal configuration support. Users cannot configure the broker through Helm values, making it difficult to deploy in production environments. The chart needs to support all configuration options from the README and follow Helm best practices for production deployments.

Constraints

  • Must maintain backward compatibility where possible
  • Must align with environment variables documented in README
  • Must follow Kubernetes and Helm best practices
  • Storage backend (SQLite, PostgreSQL) may require persistent volumes

Stakeholders

  • Kubernetes operators deploying Backbone
  • Users requiring production-grade deployments with HA, monitoring, and persistence

Goals / Non-Goals

Goals

  • Expose all README environment variables through Helm values
  • Support persistent storage for /data directory with configurable storage class
  • Follow Helm best practices (resources, probes, security contexts, labels)
  • Enable production-ready deployments with proper health checks
  • Support ingress for HTTP API exposure

Non-Goals

  • StatefulSet conversion (deployment is sufficient for single-replica MQTT broker)
  • Horizontal Pod Autoscaling (MQTT broker state management complexity)
  • Built-in monitoring/metrics exporters (separate concern)
  • Multi-replica support with Redis clustering (future enhancement)

Decisions

Decision: Use PVC for /data persistence

Rationale: The application may use SQLite or store session data in /data. A PVC ensures data survives pod restarts and enables backup/restore workflows.

Alternatives considered:

  • emptyDir: Loses data on pod restart, unsuitable for production
  • hostPath: Ties pod to specific node, reduces portability
  • PVC (chosen): Standard Kubernetes pattern, supports storage classes, backup-friendly

Implementation:

  • Optional PVC controlled by persistence.enabled flag
  • Configurable storage class, size, and access mode
  • Defaults to disabled for backward compatibility
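
A sketch of the corresponding values block and PVC template (key names are illustrative, not final):

persistence:
  enabled: false # disabled by default for backward compatibility
  storageClass: '' # '' falls back to the cluster default StorageClass
  accessMode: ReadWriteOnce
  size: 1Gi

And a guarded templates/pvc.yaml, reusing the release name the chart already leans on:

{{- if .Values.persistence.enabled }}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Release.Name }}-data
spec:
  accessModes:
    - {{ .Values.persistence.accessMode }}
  {{- with .Values.persistence.storageClass }}
  storageClassName: {{ . }}
  {{- end }}
  resources:
    requests:
      storage: {{ .Values.persistence.size }}
{{- end }}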

Decision: Environment variable structure in values.yaml

Rationale: Group environment variables under shallow, logical sections (config, k8s, ws, api, tcp, oidc, redis) rather than deep nesting, for better readability.

Structure:

config:
  adminToken: ''
  jwtSecret: ''
  httpPort: 8883
  tcpPort: 1883

k8s:
  enabled: true # default true since chart runs in K8s

ws:
  enabled: false

api:
  enabled: false

tcp:
  enabled: false

oidc:
  enabled: false
  discovery: ''
  clientId: ''
  clientSecret: ''
  groupField: 'groups'
  adminGroup: ''
  writerGroup: ''
  readerGroup: ''

redis:
  enabled: false
  host: 'localhost'
  port: 6379
  password: ''
  db: 0
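
For illustration, the deployment template would map these values onto the README's environment variables roughly as below. Only K8S_ENABLED and API_ENABLED are names confirmed in this document; the others are placeholders showing the pattern, and secrets (adminToken, jwtSecret) should be injected via secretKeyRef rather than inline values:

env:
  - name: K8S_ENABLED
    value: {{ .Values.k8s.enabled | quote }}
  - name: API_ENABLED
    value: {{ .Values.api.enabled | quote }}
  - name: HTTP_PORT # placeholder name, confirm against the README
    value: {{ .Values.config.httpPort | quote }}
  {{- if .Values.redis.enabled }}
  - name: REDIS_HOST # placeholder name
    value: {{ .Values.redis.host | quote }}
  {{- end }}
  - name: ADMIN_TOKEN # placeholder name; sourced from a Secret
    valueFrom:
      secretKeyRef:
        name: {{ .Release.Name }}-secrets
        key: adminToken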

Decision: ServiceAccount template instead of hardcoded name

Rationale: The current deployment references {{ .Release.Name }} as its ServiceAccount but the chart never creates that ServiceAccount. Extract it into a proper template with a configurable name and annotations.

Migration: Existing deployments that reference the release name continue to work, since the templated name defaults to the release name.
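
A minimal templates/serviceaccount.yaml along these lines (the serviceAccount.* value keys are illustrative); defaulting the name to the release name is what keeps existing deployments working:

{{- if .Values.serviceAccount.create }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Values.serviceAccount.name | default .Release.Name }}
  {{- with .Values.serviceAccount.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}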

Decision: Default K8S_ENABLED to true in chart

Rationale: The Helm chart is deployed TO Kubernetes, so K8s integration should default to enabled. Users can disable if running in non-operator mode.

Decision: Security context defaults

Apply a restricted security context by default. Note that fsGroup is a pod-level field while readOnlyRootFilesystem is container-level, so the values split into the two blocks Kubernetes expects:

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

securityContext:
  readOnlyRootFilesystem: false # /data needs write access

Rationale: Follows Kubernetes security best practices. readOnlyRootFilesystem stays disabled because SQLite needs write access to /data; once persistence is enabled, /data becomes a volume mount (writable even under a read-only root filesystem), so this default could be tightened later.
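
In the deployment template this wires up as two separate blocks, following the convention from helm create (a sketch; the container name is illustrative):

spec:
  template:
    spec:
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: backbone
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}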

Decision: Probe configuration

Add both liveness and readiness probes with sensible defaults:

  • Liveness: HTTP GET /health on port 8883 (requires API_ENABLED)
  • Readiness: HTTP GET /health on port 8883
  • Fallback: TCP socket checks on the broker ports when the API is disabled

Rationale: Enables Kubernetes to detect unhealthy pods and route traffic appropriately.
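
A deployment-template sketch of that conditional, assuming the values structure above and the MQTT TCP port as the fallback target:

{{- if .Values.api.enabled }}
livenessProbe:
  httpGet:
    path: /health
    port: {{ .Values.config.httpPort }}
readinessProbe:
  httpGet:
    path: /health
    port: {{ .Values.config.httpPort }}
{{- else }}
livenessProbe:
  tcpSocket:
    port: {{ .Values.config.tcpPort }}
readinessProbe:
  tcpSocket:
    port: {{ .Values.config.tcpPort }}
{{- end }}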

Risks / Trade-offs

Risk: Breaking changes for existing deployments

Mitigation:

  • Set conservative defaults matching current behavior where possible
  • Document migration path in CHANGELOG or upgrade notes
  • Version bump signals breaking changes (0.1.0 → 0.2.0)

Risk: Complex values.yaml overwhelming users

Mitigation:

  • Provide comprehensive comments
  • Include examples in comments
  • Keep sensible defaults for the 90% use case
  • Create example values files for common scenarios

Risk: Storage class availability varies by cluster

Mitigation:

  • Make storage class configurable (default: "" uses cluster default)
  • Document common storage classes in values comments
  • Support disabling persistence entirely

Migration Plan

For existing deployments:

  1. Review values.yaml changes
  2. Set persistence.enabled: false to maintain stateless behavior (if desired)
  3. Move any environment variables previously set via manual env overrides into the new values structure
  4. Update service types if non-default required
  5. Helm upgrade with new chart version
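
Applying the upgrade (step 5) could then look like this; the release name and values file are illustrative:

# Upgrade to the new chart version with your customized values
helm upgrade backbone ./charts -f production-values.yaml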

Rollback:

Standard Helm rollback:

helm rollback <release> <revision>

Validation:

# Dry-run
helm upgrade --install backbone ./charts --dry-run --debug

# Lint
helm lint ./charts

# Template verification
helm template backbone ./charts > manifests.yaml
kubectl apply --dry-run=client -f manifests.yaml

Open Questions

  1. Should probes be enabled by default?

    • Proposal: Yes, but only if api.enabled=true, otherwise use TCP checks
  2. Default persistence size?

    • Proposal: 1Gi for SQLite database and session data
  3. Should we support initContainers for DB migrations?

    • Proposal: No, out of scope for this change (future enhancement)
  4. Ingress class defaults?

    • Proposal: Empty string, user must specify their ingress class