Files
nuclei-operator/DESIGN.md
Morten Olsen 12d681ada1 feat: implement pod-based scanning architecture
This major refactor moves from synchronous subprocess-based scanning to
asynchronous pod-based scanning using Kubernetes Jobs.

## Architecture Changes
- Scanner jobs are now Kubernetes Jobs with TTLAfterFinished for automatic cleanup
- Jobs have owner references for garbage collection when NucleiScan is deleted
- Configurable concurrency limits, timeouts, and resource requirements

## New Features
- Dual-mode binary: --mode=controller (default) or --mode=scanner
- Annotation-based configuration for Ingress/VirtualService resources
- Operator-level configuration via environment variables
- Startup recovery for orphaned scans after operator restart
- Periodic cleanup of stuck jobs

## New Files
- DESIGN.md: Comprehensive architecture design document
- internal/jobmanager/: Job Manager for creating/monitoring scanner jobs
- internal/scanner/runner.go: Scanner mode implementation
- internal/annotations/: Annotation parsing utilities
- charts/nuclei-operator/templates/scanner-rbac.yaml: Scanner RBAC

## API Changes
- Added ScannerConfig struct for per-scan scanner configuration
- Added JobReference struct for tracking scanner jobs
- Added ScannerConfig field to NucleiScanSpec
- Added JobRef and ScanStartTime fields to NucleiScanStatus

## Supported Annotations
- nuclei.homelab.mortenolsen.pro/enabled
- nuclei.homelab.mortenolsen.pro/templates
- nuclei.homelab.mortenolsen.pro/severity
- nuclei.homelab.mortenolsen.pro/schedule
- nuclei.homelab.mortenolsen.pro/timeout
- nuclei.homelab.mortenolsen.pro/scanner-image

## RBAC Updates
- Added Job and Pod permissions for operator
- Created separate scanner service account with minimal permissions

## Documentation
- Updated README, user-guide, api.md, and Helm chart README
- Added example annotated Ingress resources
2025-12-12 20:55:09 +01:00

22 KiB

Pod-Based Scanning Architecture Design

Executive Summary

This document describes the new architecture for the nuclei-operator that moves from synchronous subprocess-based scanning to asynchronous pod-based scanning. This change improves scalability, reliability, and operational flexibility while maintaining backward compatibility.

1. Architecture Overview

1.1 Current State Problems

The current implementation has several limitations:

  1. Blocking Reconcile Loop: Scans execute synchronously within the operator pod, blocking the reconcile loop for up to 30 minutes
  2. Single Point of Failure: All scans run in the operator pod - if it restarts, running scans are lost
  3. Resource Contention: Multiple concurrent scans compete for operator pod resources
  4. No Horizontal Scaling: Cannot distribute scan workload across multiple pods
  5. Limited Configuration: No annotation-based configuration for individual Ingress/VirtualService resources

1.2 New Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           KUBERNETES CLUSTER                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────┐    ┌──────────────────┐    ┌─────────────────────────┐   │
│  │   Ingress    │───▶│ IngressReconciler │───▶│                         │   │
│  └──────────────┘    └──────────────────┘    │                         │   │
│                              │                │     NucleiScan CRD      │   │
│  ┌──────────────┐    ┌──────────────────┐    │                         │   │
│  │VirtualService│───▶│   VSReconciler   │───▶│  spec:                  │   │
│  └──────────────┘    └──────────────────┘    │    sourceRef            │   │
│                              │                │    targets[]            │   │
│                              │                │    templates[]          │   │
│                              ▼                │    severity[]           │   │
│                      ┌──────────────────┐    │    schedule             │   │
│                      │  Owner Reference │    │  status:                │   │
│                      │  (GC on delete)  │    │    phase                │   │
│                      └──────────────────┘    │    findings[]           │   │
│                                              │    summary              │   │
│                                              │    jobRef               │   │
│                                              └───────────┬─────────────┘   │
│                                                          │                  │
│                                                          ▼                  │
│                                              ┌─────────────────────────┐   │
│                                              │ NucleiScanReconciler    │   │
│                                              │                         │   │
│                                              │  1. Check phase         │   │
│                                              │  2. Create/monitor Job  │   │
│                                              │  3. Handle completion   │   │
│                                              └───────────┬─────────────┘   │
│                                                          │                  │
│                                                          ▼                  │
│                                              ┌─────────────────────────┐   │
│                                              │   Scanner Jobs          │   │
│                                              │   (Kubernetes Jobs)     │   │
│                                              │                         │   │
│                                              │   - Isolated execution  │   │
│                                              │   - Direct status update│   │
│                                              │   - Auto cleanup (TTL)  │   │
│                                              └─────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

1.3 Key Design Decisions

Decision 1: Kubernetes Jobs vs Bare Pods

Choice: Kubernetes Jobs with TTLAfterFinished

Rationale:

  • Jobs provide built-in completion tracking and retry mechanisms
  • TTLAfterFinished enables automatic cleanup of completed jobs
  • Jobs maintain history for debugging and auditing
  • Better integration with Kubernetes ecosystem tools
apiVersion: batch/v1
kind: Job
metadata:
  name: nucleiscan-myapp-abc123
  namespace: default
spec:
  ttlSecondsAfterFinished: 3600  # Clean up 1 hour after completion
  backoffLimit: 2                 # Retry failed scans twice
  activeDeadlineSeconds: 1800     # 30 minute timeout
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: scanner
        image: ghcr.io/morten-olsen/homelab-nuclei-operator:latest
        args: ["--mode=scanner", "--scan-id=myapp-abc123"]

Decision 2: Result Communication

Choice: Dual-mode operator image with direct API access

Rationale:

  • Single image simplifies deployment and versioning
  • Scanner mode has direct Kubernetes API access to update NucleiScan status
  • No intermediate storage needed (ConfigMaps or logs)
  • Results are immediately available in the CRD status
  • Consistent error handling and status updates

The operator binary supports two modes:

  1. Controller Mode (default): Runs the operator controllers
  2. Scanner Mode (--mode=scanner): Executes a single scan and updates the NucleiScan status

Decision 3: Template Distribution

Choice: Hybrid approach with configurable options

  1. Default: Use projectdiscovery/nuclei built-in templates (updated with each nuclei release)
  2. Custom Templates: Mount via ConfigMap for small template sets
  3. Git Sync: Init container that clones template repositories at runtime
  4. Custom Image: For air-gapped environments, bake templates into a custom scanner image

Configuration hierarchy:

Operator Defaults < NucleiScan Spec < Ingress/VS Annotations

2. Component Design

2.1 NucleiScan Controller Changes

The controller transitions from executing scans to managing scan jobs:

// Simplified reconciliation flow
func (r *NucleiScanReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    nucleiScan := &nucleiv1alpha1.NucleiScan{}
    if err := r.Get(ctx, req.NamespacedName, nucleiScan); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    switch nucleiScan.Status.Phase {
    case ScanPhasePending:
        return r.handlePending(ctx, nucleiScan)    // Create Job
    case ScanPhaseRunning:
        return r.handleRunning(ctx, nucleiScan)    // Monitor Job
    case ScanPhaseCompleted, ScanPhaseFailed:
        return r.handleCompleted(ctx, nucleiScan)  // Schedule next or cleanup
    }
}

2.2 Job Manager Component

New component responsible for:

  • Creating scanner jobs with proper configuration
  • Monitoring job status and updating NucleiScan accordingly
  • Cleaning up orphaned jobs on operator restart
  • Enforcing concurrency limits
type JobManager struct {
    client.Client
    Scheme          *runtime.Scheme
    ScannerImage    string
    MaxConcurrent   int
    DefaultTimeout  time.Duration
}

func (m *JobManager) CreateScanJob(ctx context.Context, scan *nucleiv1alpha1.NucleiScan) (*batchv1.Job, error) {
    job := m.buildJob(scan)
    if err := controllerutil.SetControllerReference(scan, job, m.Scheme); err != nil {
        return nil, err
    }
    return job, m.Create(ctx, job)
}

2.3 Scanner Mode Implementation

The operator binary in scanner mode:

func runScannerMode(scanID string) error {
    // 1. Initialize Kubernetes client
    config, _ := rest.InClusterConfig()
    client, _ := client.New(config, client.Options{})
    
    // 2. Fetch the NucleiScan resource
    scan := &nucleiv1alpha1.NucleiScan{}
    client.Get(ctx, types.NamespacedName{...}, scan)
    
    // 3. Execute the scan
    result, err := scanner.Scan(ctx, scan.Spec.Targets, options)
    
    // 4. Update NucleiScan status directly
    scan.Status.Phase = ScanPhaseCompleted
    scan.Status.Findings = result.Findings
    scan.Status.Summary = result.Summary
    client.Status().Update(ctx, scan)
    
    return nil
}

3. API Changes

3.1 NucleiScan CRD Updates

New fields added to the spec and status:

// NucleiScanSpec additions
type NucleiScanSpec struct {
    // ... existing fields ...
    
    // ScannerConfig allows overriding scanner settings for this scan
    // +optional
    ScannerConfig *ScannerConfig `json:"scannerConfig,omitempty"`
}

// ScannerConfig defines scanner-specific configuration
type ScannerConfig struct {
    // Image overrides the default scanner image
    // +optional
    Image string `json:"image,omitempty"`
    
    // Resources defines resource requirements for the scanner pod
    // +optional
    Resources *corev1.ResourceRequirements `json:"resources,omitempty"`
    
    // Timeout overrides the default scan timeout
    // +optional
    Timeout *metav1.Duration `json:"timeout,omitempty"`
    
    // TemplateURLs specifies additional template repositories to clone
    // +optional
    TemplateURLs []string `json:"templateURLs,omitempty"`
    
    // NodeSelector for scanner pod scheduling
    // +optional
    NodeSelector map[string]string `json:"nodeSelector,omitempty"`
    
    // Tolerations for scanner pod scheduling
    // +optional
    Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}

// NucleiScanStatus additions
type NucleiScanStatus struct {
    // ... existing fields ...
    
    // JobRef references the current or last scanner job
    // +optional
    JobRef *JobReference `json:"jobRef,omitempty"`
    
    // ScanStartTime is when the scanner pod actually started scanning
    // +optional
    ScanStartTime *metav1.Time `json:"scanStartTime,omitempty"`
}

// JobReference contains information about the scanner job
type JobReference struct {
    // Name of the Job
    Name string `json:"name"`
    
    // UID of the Job
    UID string `json:"uid"`
    
    // PodName is the name of the scanner pod (for log retrieval)
    // +optional
    PodName string `json:"podName,omitempty"`
    
    // StartTime when the job was created
    StartTime *metav1.Time `json:"startTime,omitempty"`
}

4. Annotation Schema

4.1 Supported Annotations

Annotations on Ingress/VirtualService resources to configure scanning:

Annotation Type Default Description
nuclei.homelab.mortenolsen.pro/enabled bool true Enable/disable scanning for this resource
nuclei.homelab.mortenolsen.pro/templates string - Comma-separated list of template paths or tags
nuclei.homelab.mortenolsen.pro/severity string - Comma-separated severity filter: info,low,medium,high,critical
nuclei.homelab.mortenolsen.pro/schedule string - Cron schedule for periodic scans
nuclei.homelab.mortenolsen.pro/timeout duration 30m Scan timeout
nuclei.homelab.mortenolsen.pro/scanner-image string - Override scanner image
nuclei.homelab.mortenolsen.pro/exclude-templates string - Templates to exclude
nuclei.homelab.mortenolsen.pro/rate-limit int 150 Requests per second limit
nuclei.homelab.mortenolsen.pro/tags string - Template tags to include
nuclei.homelab.mortenolsen.pro/exclude-tags string - Template tags to exclude

4.2 Example Annotated Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  namespace: production
  annotations:
    nuclei.homelab.mortenolsen.pro/enabled: "true"
    nuclei.homelab.mortenolsen.pro/severity: "medium,high,critical"
    nuclei.homelab.mortenolsen.pro/schedule: "0 2 * * *"
    nuclei.homelab.mortenolsen.pro/templates: "cves/,vulnerabilities/,exposures/"
    nuclei.homelab.mortenolsen.pro/exclude-tags: "dos,fuzz"
    nuclei.homelab.mortenolsen.pro/timeout: "45m"
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80

5. State Machine

5.1 Updated Scan Lifecycle

                    ┌─────────────────────────────────────┐
                    │                                     │
                    ▼                                     │
┌─────────┐    ┌─────────┐    ┌───────────┐    ┌────────┴─┐
│ Created │───▶│ Pending │───▶│  Running  │───▶│ Completed│
└─────────┘    └────┬────┘    └─────┬─────┘    └──────────┘
                    │               │                │
                    │               │                │ (schedule/rescanAge)
                    │               ▼                │
                    │          ┌─────────┐           │
                    │          │ Failed  │◀──────────┘
                    │          └────┬────┘
                    │               │
                    └───────────────┘ (spec change triggers retry)

5.2 Phase Definitions

Phase Description Job State Actions
Pending Waiting to start None Create scanner job
Running Scan in progress Active Monitor job, check timeout
Completed Scan finished successfully Succeeded Parse results, schedule next
Failed Scan failed Failed Record error, retry logic

6. Error Handling

6.1 Failure Scenarios

Scenario Detection Recovery
Job creation fails API error Retry with backoff, update status
Pod fails to schedule Job pending timeout Alert, manual intervention
Scan timeout activeDeadlineSeconds Mark failed, retry
Scanner crashes Job failed status Retry based on backoffLimit
Operator restarts Running phase with no job Reset to Pending
Target unavailable HTTP check fails Exponential backoff retry
Results too large Status update fails Truncate findings, log warning

6.2 Operator Restart Recovery

On startup, the operator must handle orphaned state:

func (r *NucleiScanReconciler) RecoverOrphanedScans(ctx context.Context) error {
    // List all NucleiScans in Running phase
    scanList := &nucleiv1alpha1.NucleiScanList{}
    if err := r.List(ctx, scanList); err != nil {
        return err
    }
    
    for _, scan := range scanList.Items {
        if scan.Status.Phase != ScanPhaseRunning {
            continue
        }
        
        // Check if the referenced job still exists
        if scan.Status.JobRef != nil {
            job := &batchv1.Job{}
            err := r.Get(ctx, types.NamespacedName{
                Name:      scan.Status.JobRef.Name,
                Namespace: scan.Namespace,
            }, job)
            
            if apierrors.IsNotFound(err) {
                // Job is gone - reset scan to Pending
                scan.Status.Phase = ScanPhasePending
                scan.Status.LastError = "Recovered from operator restart - job not found"
                scan.Status.JobRef = nil
                r.Status().Update(ctx, &scan)
            }
            // If job exists, normal reconciliation will handle it
        }
    }
    
    return nil
}

6.3 Job Cleanup

Orphaned jobs are cleaned up via:

  1. Owner References: Jobs have NucleiScan as owner - deleted when scan is deleted
  2. TTLAfterFinished: Kubernetes automatically cleans up completed jobs
  3. Periodic Cleanup: Background goroutine removes stuck jobs

7. Security Considerations

7.1 RBAC Updates

The operator needs additional permissions for Job management:

# Additional rules for config/rbac/role.yaml
rules:
  # Job management
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  
  # Pod logs for debugging
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]

Scanner pods need minimal RBAC - only to update their specific NucleiScan:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nuclei-scanner-role
rules:
  - apiGroups: ["nuclei.homelab.mortenolsen.pro"]
    resources: ["nucleiscans"]
    verbs: ["get"]
  - apiGroups: ["nuclei.homelab.mortenolsen.pro"]
    resources: ["nucleiscans/status"]
    verbs: ["get", "update", "patch"]

7.2 Pod Security

Scanner pods run with restricted security context:

securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  fsGroup: 65532
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false  # Nuclei needs temp files
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop:
      - ALL

8. Migration Path

8.1 Version Strategy

Version Changes Compatibility
v0.x Current synchronous scanning -
v1.0 Pod-based scanning, new status fields Backward compatible
v1.1 Annotation support Additive
v2.0 Remove synchronous mode Breaking

8.2 Migration Steps

  1. Phase 1: Add new fields to CRD (non-breaking, all optional)
  2. Phase 2: Dual-mode operation with feature flag
  3. Phase 3: Add annotation support
  4. Phase 4: Deprecate synchronous mode
  5. Phase 5: Remove synchronous mode (v2.0)

8.3 Rollback Plan

If issues are discovered:

  1. Immediate: Set scanner.mode: sync in Helm values
  2. Short-term: Pin to previous operator version
  3. Long-term: Fix issues in pod-based mode

9. Configuration Reference

9.1 Helm Values

# Scanner configuration
scanner:
  # Scanning mode: "pod" or "sync" (legacy)
  mode: "pod"
  
  # Default scanner image
  image: ghcr.io/morten-olsen/homelab-nuclei-operator:latest
  
  # Default scan timeout
  timeout: 30m
  
  # Maximum concurrent scan jobs
  maxConcurrent: 5
  
  # Job TTL after completion (seconds)
  ttlAfterFinished: 3600
  
  # Default resource requirements for scanner pods
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 1Gi

# Template configuration
templates:
  # Built-in templates to use
  defaults:
    - cves/
    - vulnerabilities/
  
  # Git repositories to clone (init container)
  repositories: []
  # - url: https://github.com/projectdiscovery/nuclei-templates
  #   branch: main
  #   path: /templates/community

# Operator configuration
operator:
  # Rescan age - trigger rescan if results older than this
  rescanAge: 168h
  
  # Backoff for target availability checks
  backoff:
    initial: 10s
    max: 10m
    multiplier: 2.0

9.2 Environment Variables

Variable Description Default
SCANNER_MODE pod or sync pod
SCANNER_IMAGE Default scanner image operator image
SCANNER_TIMEOUT Default scan timeout 30m
MAX_CONCURRENT_SCANS Max parallel jobs 5
JOB_TTL_AFTER_FINISHED Job cleanup TTL 3600
NUCLEI_TEMPLATES_PATH Template directory /nuclei-templates

10. Observability

10.1 Metrics

New Prometheus metrics:

  • nuclei_scan_jobs_created_total - Total scanner jobs created
  • nuclei_scan_job_duration_seconds - Duration histogram of scan jobs
  • nuclei_active_scan_jobs - Currently running scan jobs

10.2 Events

Kubernetes events for key state transitions:

  • ScanJobCreated - Scanner job created
  • ScanCompleted - Scan finished successfully
  • ScanFailed - Scan failed

10.3 Logging

Structured logging with consistent fields:

  • scan - NucleiScan name
  • namespace - Namespace
  • targets - Number of targets
  • timeout - Scan timeout