feat: implement pod-based scanning architecture

This major refactor moves from synchronous subprocess-based scanning to
asynchronous pod-based scanning using Kubernetes Jobs.

## Architecture Changes
- Scanner jobs are now Kubernetes Jobs with TTLAfterFinished for automatic cleanup
- Jobs have owner references for garbage collection when NucleiScan is deleted
- Configurable concurrency limits, timeouts, and resource requirements

## New Features
- Dual-mode binary: --mode=controller (default) or --mode=scanner
- Annotation-based configuration for Ingress/VirtualService resources
- Operator-level configuration via environment variables
- Startup recovery for orphaned scans after operator restart
- Periodic cleanup of stuck jobs

## New Files
- DESIGN.md: Comprehensive architecture design document
- internal/jobmanager/: Job Manager for creating/monitoring scanner jobs
- internal/scanner/runner.go: Scanner mode implementation
- internal/annotations/: Annotation parsing utilities
- charts/nuclei-operator/templates/scanner-rbac.yaml: Scanner RBAC

## API Changes
- Added ScannerConfig struct for per-scan scanner configuration
- Added JobReference struct for tracking scanner jobs
- Added ScannerConfig field to NucleiScanSpec
- Added JobRef and ScanStartTime fields to NucleiScanStatus

## Supported Annotations
- nuclei.homelab.mortenolsen.pro/enabled
- nuclei.homelab.mortenolsen.pro/templates
- nuclei.homelab.mortenolsen.pro/severity
- nuclei.homelab.mortenolsen.pro/schedule
- nuclei.homelab.mortenolsen.pro/timeout
- nuclei.homelab.mortenolsen.pro/scanner-image

## RBAC Updates
- Added Job and Pod permissions for operator
- Created separate scanner service account with minimal permissions

## Documentation
- Updated README, user-guide, api.md, and Helm chart README
- Added example annotated Ingress resources
---

Author: Morten Olsen
Date: 2025-12-12 20:51:23 +01:00
Commit: 335689da22 (parent: 519ed32de3)
22 changed files with 3060 additions and 245 deletions

---

DESIGN.md (new file, 587 lines):
# Pod-Based Scanning Architecture Design
## Executive Summary
This document describes the new architecture for the nuclei-operator that moves from synchronous subprocess-based scanning to asynchronous pod-based scanning. This change improves scalability, reliability, and operational flexibility while maintaining backward compatibility.
## 1. Architecture Overview
### 1.1 Current State Problems
The current implementation has several limitations:
1. **Blocking Reconcile Loop**: Scans execute synchronously within the operator pod, blocking the reconcile loop for up to 30 minutes
2. **Single Point of Failure**: All scans run in the operator pod - if it restarts, running scans are lost
3. **Resource Contention**: Multiple concurrent scans compete for operator pod resources
4. **No Horizontal Scaling**: Cannot distribute scan workload across multiple pods
5. **Limited Configuration**: No annotation-based configuration for individual Ingress/VirtualService resources
### 1.2 New Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌─────────────────────────┐ │
│ │ Ingress │───▶│ IngressReconciler │───▶│ │ │
│ └──────────────┘ └──────────────────┘ │ │ │
│ │ │ NucleiScan CRD │ │
│ ┌──────────────┐ ┌──────────────────┐ │ │ │
│ │VirtualService│───▶│ VSReconciler │───▶│ spec: │ │
│ └──────────────┘ └──────────────────┘ │ sourceRef │ │
│ │ │ targets[] │ │
│ │ │ templates[] │ │
│ ▼ │ severity[] │ │
│ ┌──────────────────┐ │ schedule │ │
│ │ Owner Reference │ │ status: │ │
│ │ (GC on delete) │ │ phase │ │
│ └──────────────────┘ │ findings[] │ │
│ │ summary │ │
│ │ jobRef │ │
│ └───────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ NucleiScanReconciler │ │
│ │ │ │
│ │ 1. Check phase │ │
│ │ 2. Create/monitor Job │ │
│ │ 3. Handle completion │ │
│ └───────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Scanner Jobs │ │
│ │ (Kubernetes Jobs) │ │
│ │ │ │
│ │ - Isolated execution │ │
│ │ - Direct status update│ │
│ │ - Auto cleanup (TTL) │ │
│ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### 1.3 Key Design Decisions
#### Decision 1: Kubernetes Jobs vs Bare Pods
**Choice: Kubernetes Jobs with TTLAfterFinished**
Rationale:
- Jobs provide built-in completion tracking and retry mechanisms
- TTLAfterFinished enables automatic cleanup of completed jobs
- Jobs maintain history for debugging and auditing
- Better integration with Kubernetes ecosystem tools
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nucleiscan-myapp-abc123
  namespace: default
spec:
  ttlSecondsAfterFinished: 3600  # Clean up 1 hour after completion
  backoffLimit: 2                # Retry failed scans twice
  activeDeadlineSeconds: 1800    # 30 minute timeout
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: scanner
          image: ghcr.io/morten-olsen/homelab-nuclei-operator:latest
          args: ["--mode=scanner", "--scan-id=myapp-abc123"]
```
#### Decision 2: Result Communication
**Choice: Dual-mode operator image with direct API access**
Rationale:
- Single image simplifies deployment and versioning
- Scanner mode has direct Kubernetes API access to update NucleiScan status
- No intermediate storage needed (ConfigMaps or logs)
- Results are immediately available in the CRD status
- Consistent error handling and status updates
The operator binary supports two modes:
1. **Controller Mode** (default): Runs the operator controllers
2. **Scanner Mode** (`--mode=scanner`): Executes a single scan and updates the NucleiScan status
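A minimal sketch of how the dual-mode dispatch might look. The flag names (`--mode`, `--scan-id`) come from the design above; `parseMode` and the exact error handling are illustrative, not the actual implementation:

```go
package main

import (
	"flag"
	"fmt"
)

// parseMode parses the operator command line and validates the mode.
// Flag names follow the design (--mode, --scan-id); the helper itself
// is illustrative.
func parseMode(args []string) (mode, scanID string, err error) {
	fs := flag.NewFlagSet("nuclei-operator", flag.ContinueOnError)
	m := fs.String("mode", "controller", "controller or scanner")
	id := fs.String("scan-id", "", "NucleiScan to execute (scanner mode only)")
	if err := fs.Parse(args); err != nil {
		return "", "", err
	}
	switch *m {
	case "controller", "scanner":
		return *m, *id, nil
	default:
		return "", "", fmt.Errorf("unknown --mode %q", *m)
	}
}

func main() {
	mode, scanID, err := parseMode([]string{"--mode=scanner", "--scan-id=myapp-abc123"})
	if err != nil {
		panic(err)
	}
	// A real main would now start the controllers or run a single scan.
	fmt.Println(mode, scanID) // prints "scanner myapp-abc123"
}
```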
#### Decision 3: Template Distribution
**Choice: Hybrid approach with configurable options**
1. **Default**: Use projectdiscovery/nuclei built-in templates (updated with each nuclei release)
2. **Custom Templates**: Mount via ConfigMap for small template sets
3. **Git Sync**: Init container that clones template repositories at runtime
4. **Custom Image**: For air-gapped environments, bake templates into a custom scanner image
Configuration hierarchy:
```
Operator Defaults < NucleiScan Spec < Ingress/VS Annotations
```
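The precedence chain can be reduced to "first non-empty value wins, highest precedence first". A sketch, with illustrative names (the real merge logic operates on typed spec fields, not strings):

```go
package main

import "fmt"

// firstNonEmpty returns the first non-empty value, encoding the
// precedence chain: annotation beats NucleiScan spec beats operator default.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	operatorDefault := "30m"
	specTimeout := ""   // not set on the NucleiScan
	annotation := "45m" // nuclei.homelab.mortenolsen.pro/timeout
	fmt.Println(firstNonEmpty(annotation, specTimeout, operatorDefault)) // prints "45m"
}
```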
## 2. Component Design
### 2.1 NucleiScan Controller Changes
The controller transitions from executing scans to managing scan jobs:
```go
// Simplified reconciliation flow
func (r *NucleiScanReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	nucleiScan := &nucleiv1alpha1.NucleiScan{}
	if err := r.Get(ctx, req.NamespacedName, nucleiScan); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	switch nucleiScan.Status.Phase {
	case ScanPhasePending:
		return r.handlePending(ctx, nucleiScan) // Create Job
	case ScanPhaseRunning:
		return r.handleRunning(ctx, nucleiScan) // Monitor Job
	case ScanPhaseCompleted, ScanPhaseFailed:
		return r.handleCompleted(ctx, nucleiScan) // Schedule next or cleanup
	default:
		// New resource without a phase yet - treat it as Pending
		return r.handlePending(ctx, nucleiScan)
	}
}
```
### 2.2 Job Manager Component
New component responsible for:
- Creating scanner jobs with proper configuration
- Monitoring job status and updating NucleiScan accordingly
- Cleaning up orphaned jobs on operator restart
- Enforcing concurrency limits
```go
type JobManager struct {
	client.Client
	Scheme         *runtime.Scheme
	ScannerImage   string
	MaxConcurrent  int
	DefaultTimeout time.Duration
}

func (m *JobManager) CreateScanJob(ctx context.Context, scan *nucleiv1alpha1.NucleiScan) (*batchv1.Job, error) {
	job := m.buildJob(scan)
	// Owner reference ties the job's lifetime to the NucleiScan
	if err := controllerutil.SetControllerReference(scan, job, m.Scheme); err != nil {
		return nil, err
	}
	if err := m.Create(ctx, job); err != nil {
		return nil, err
	}
	return job, nil
}
```
### 2.3 Scanner Mode Implementation
The operator binary in scanner mode:
```go
func runScannerMode(ctx context.Context, scanID string) error {
	// 1. Initialize a Kubernetes client from the in-cluster config
	config, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	c, err := client.New(config, client.Options{})
	if err != nil {
		return err
	}

	// 2. Fetch the NucleiScan resource
	scan := &nucleiv1alpha1.NucleiScan{}
	if err := c.Get(ctx, types.NamespacedName{...}, scan); err != nil {
		return err
	}

	// 3. Execute the scan
	result, err := scanner.Scan(ctx, scan.Spec.Targets, options)
	if err != nil {
		return err
	}

	// 4. Update NucleiScan status directly
	scan.Status.Phase = ScanPhaseCompleted
	scan.Status.Findings = result.Findings
	scan.Status.Summary = result.Summary
	return c.Status().Update(ctx, scan)
}
```
## 3. API Changes
### 3.1 NucleiScan CRD Updates
New fields added to the spec and status:
```go
// NucleiScanSpec additions
type NucleiScanSpec struct {
	// ... existing fields ...

	// ScannerConfig allows overriding scanner settings for this scan
	// +optional
	ScannerConfig *ScannerConfig `json:"scannerConfig,omitempty"`
}

// ScannerConfig defines scanner-specific configuration
type ScannerConfig struct {
	// Image overrides the default scanner image
	// +optional
	Image string `json:"image,omitempty"`

	// Resources defines resource requirements for the scanner pod
	// +optional
	Resources *corev1.ResourceRequirements `json:"resources,omitempty"`

	// Timeout overrides the default scan timeout
	// +optional
	Timeout *metav1.Duration `json:"timeout,omitempty"`

	// TemplateURLs specifies additional template repositories to clone
	// +optional
	TemplateURLs []string `json:"templateURLs,omitempty"`

	// NodeSelector for scanner pod scheduling
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

	// Tolerations for scanner pod scheduling
	// +optional
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}

// NucleiScanStatus additions
type NucleiScanStatus struct {
	// ... existing fields ...

	// JobRef references the current or last scanner job
	// +optional
	JobRef *JobReference `json:"jobRef,omitempty"`

	// ScanStartTime is when the scanner pod actually started scanning
	// +optional
	ScanStartTime *metav1.Time `json:"scanStartTime,omitempty"`
}

// JobReference contains information about the scanner job
type JobReference struct {
	// Name of the Job
	Name string `json:"name"`

	// UID of the Job
	UID string `json:"uid"`

	// PodName is the name of the scanner pod (for log retrieval)
	// +optional
	PodName string `json:"podName,omitempty"`

	// StartTime when the job was created
	StartTime *metav1.Time `json:"startTime,omitempty"`
}
```
## 4. Annotation Schema
### 4.1 Supported Annotations
Annotations on Ingress/VirtualService resources to configure scanning:
| Annotation | Type | Default | Description |
|------------|------|---------|-------------|
| `nuclei.homelab.mortenolsen.pro/enabled` | bool | `true` | Enable/disable scanning for this resource |
| `nuclei.homelab.mortenolsen.pro/templates` | string | - | Comma-separated list of template paths or tags |
| `nuclei.homelab.mortenolsen.pro/severity` | string | - | Comma-separated severity filter: info,low,medium,high,critical |
| `nuclei.homelab.mortenolsen.pro/schedule` | string | - | Cron schedule for periodic scans |
| `nuclei.homelab.mortenolsen.pro/timeout` | duration | `30m` | Scan timeout |
| `nuclei.homelab.mortenolsen.pro/scanner-image` | string | - | Override scanner image |
| `nuclei.homelab.mortenolsen.pro/exclude-templates` | string | - | Templates to exclude |
| `nuclei.homelab.mortenolsen.pro/rate-limit` | int | `150` | Requests per second limit |
| `nuclei.homelab.mortenolsen.pro/tags` | string | - | Template tags to include |
| `nuclei.homelab.mortenolsen.pro/exclude-tags` | string | - | Template tags to exclude |
### 4.2 Example Annotated Ingress
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  namespace: production
  annotations:
    nuclei.homelab.mortenolsen.pro/enabled: "true"
    nuclei.homelab.mortenolsen.pro/severity: "medium,high,critical"
    nuclei.homelab.mortenolsen.pro/schedule: "0 2 * * *"
    nuclei.homelab.mortenolsen.pro/templates: "cves/,vulnerabilities/,exposures/"
    nuclei.homelab.mortenolsen.pro/exclude-tags: "dos,fuzz"
    nuclei.homelab.mortenolsen.pro/timeout: "45m"
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
```
## 5. State Machine
### 5.1 Updated Scan Lifecycle
```
┌─────────────────────────────────────┐
│ │
▼ │
┌─────────┐ ┌─────────┐ ┌───────────┐ ┌────────┴─┐
│ Created │───▶│ Pending │───▶│ Running │───▶│ Completed│
└─────────┘ └────┬────┘ └─────┬─────┘ └──────────┘
│ │ │
│ │ │ (schedule/rescanAge)
│ ▼ │
│ ┌─────────┐ │
│ │ Failed │◀──────────┘
│ └────┬────┘
│ │
└───────────────┘ (spec change triggers retry)
```
### 5.2 Phase Definitions
| Phase | Description | Job State | Actions |
|-------|-------------|-----------|---------|
| `Pending` | Waiting to start | None | Create scanner job |
| `Running` | Scan in progress | Active | Monitor job, check timeout |
| `Completed` | Scan finished successfully | Succeeded | Parse results, schedule next |
| `Failed` | Scan failed | Failed | Record error, retry logic |
## 6. Error Handling
### 6.1 Failure Scenarios
| Scenario | Detection | Recovery |
|----------|-----------|----------|
| Job creation fails | API error | Retry with backoff, update status |
| Pod fails to schedule | Job pending timeout | Alert, manual intervention |
| Scan timeout | activeDeadlineSeconds | Mark failed, retry |
| Scanner crashes | Job failed status | Retry based on backoffLimit |
| Operator restarts | Running phase with no job | Reset to Pending |
| Target unavailable | HTTP check fails | Exponential backoff retry |
| Results too large | Status update fails | Truncate findings, log warning |
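The truncation strategy in the last row can be sketched as "keep the most severe findings first". The `Finding` type and cap-by-count approach below are illustrative; the real status type and size limit differ:

```go
package main

import (
	"fmt"
	"sort"
)

// Finding is a minimal stand-in for the CRD's finding type.
type Finding struct {
	Severity string // info, low, medium, high, critical
	Name     string
}

var severityRank = map[string]int{
	"critical": 4, "high": 3, "medium": 2, "low": 1, "info": 0,
}

// truncateFindings keeps at most max findings, preferring the most
// severe, so a status-size failure drops noise before signal.
func truncateFindings(findings []Finding, max int) []Finding {
	sort.SliceStable(findings, func(i, j int) bool {
		return severityRank[findings[i].Severity] > severityRank[findings[j].Severity]
	})
	if len(findings) > max {
		return findings[:max]
	}
	return findings
}

func main() {
	kept := truncateFindings([]Finding{
		{"info", "banner"}, {"critical", "rce"}, {"low", "header"},
	}, 2)
	fmt.Println(len(kept), kept[0].Severity) // prints "2 critical"
}
```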
### 6.2 Operator Restart Recovery
On startup, the operator must handle orphaned state:
```go
func (r *NucleiScanReconciler) RecoverOrphanedScans(ctx context.Context) error {
	// List all NucleiScans and look for ones stuck in the Running phase
	scanList := &nucleiv1alpha1.NucleiScanList{}
	if err := r.List(ctx, scanList); err != nil {
		return err
	}
	for i := range scanList.Items {
		scan := &scanList.Items[i]
		if scan.Status.Phase != ScanPhaseRunning || scan.Status.JobRef == nil {
			continue
		}
		// Check if the referenced job still exists
		job := &batchv1.Job{}
		err := r.Get(ctx, types.NamespacedName{
			Name:      scan.Status.JobRef.Name,
			Namespace: scan.Namespace,
		}, job)
		if apierrors.IsNotFound(err) {
			// Job is gone - reset scan to Pending so it is rescheduled
			scan.Status.Phase = ScanPhasePending
			scan.Status.LastError = "Recovered from operator restart - job not found"
			scan.Status.JobRef = nil
			if err := r.Status().Update(ctx, scan); err != nil {
				return err
			}
		} else if err != nil {
			return err
		}
		// If the job exists, normal reconciliation will handle it
	}
	return nil
}
```
### 6.3 Job Cleanup
Orphaned jobs are cleaned up via:
1. **Owner References**: Jobs have NucleiScan as owner - deleted when scan is deleted
2. **TTLAfterFinished**: Kubernetes automatically cleans up completed jobs
3. **Periodic Cleanup**: Background goroutine removes stuck jobs
## 7. Security Considerations
### 7.1 RBAC Updates
The operator needs additional permissions for Job management:
```yaml
# Additional rules for config/rbac/role.yaml
rules:
  # Job management
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Pod logs for debugging
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
```
Scanner pods need minimal RBAC - only to update their specific NucleiScan:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nuclei-scanner-role
rules:
  - apiGroups: ["nuclei.homelab.mortenolsen.pro"]
    resources: ["nucleiscans"]
    verbs: ["get"]
  - apiGroups: ["nuclei.homelab.mortenolsen.pro"]
    resources: ["nucleiscans/status"]
    verbs: ["get", "update", "patch"]
```
### 7.2 Pod Security
Scanner pods run with restricted security context:
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  fsGroup: 65532
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false  # Nuclei needs temp files
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop:
      - ALL
```
## 8. Migration Path
### 8.1 Version Strategy
| Version | Changes | Compatibility |
|---------|---------|---------------|
| v0.x | Current synchronous scanning | - |
| v1.0 | Pod-based scanning, new status fields | Backward compatible |
| v1.1 | Annotation support | Additive |
| v2.0 | Remove synchronous mode | Breaking |
### 8.2 Migration Steps
1. **Phase 1**: Add new fields to CRD (non-breaking, all optional)
2. **Phase 2**: Dual-mode operation with feature flag
3. **Phase 3**: Add annotation support
4. **Phase 4**: Deprecate synchronous mode
5. **Phase 5**: Remove synchronous mode (v2.0)
### 8.3 Rollback Plan
If issues are discovered:
1. **Immediate**: Set `scanner.mode: sync` in Helm values
2. **Short-term**: Pin to previous operator version
3. **Long-term**: Fix issues in pod-based mode
## 9. Configuration Reference
### 9.1 Helm Values
```yaml
# Scanner configuration
scanner:
  # Scanning mode: "pod" or "sync" (legacy)
  mode: "pod"
  # Default scanner image
  image: ghcr.io/morten-olsen/homelab-nuclei-operator:latest
  # Default scan timeout
  timeout: 30m
  # Maximum concurrent scan jobs
  maxConcurrent: 5
  # Job TTL after completion (seconds)
  ttlAfterFinished: 3600
  # Default resource requirements for scanner pods
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 1Gi

# Template configuration
templates:
  # Built-in templates to use
  defaults:
    - cves/
    - vulnerabilities/
  # Git repositories to clone (init container)
  repositories: []
  # - url: https://github.com/projectdiscovery/nuclei-templates
  #   branch: main
  #   path: /templates/community

# Operator configuration
operator:
  # Rescan age - trigger rescan if results older than this
  rescanAge: 168h
  # Backoff for target availability checks
  backoff:
    initial: 10s
    max: 10m
    multiplier: 2.0
```
### 9.2 Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `SCANNER_MODE` | pod or sync | pod |
| `SCANNER_IMAGE` | Default scanner image | operator image |
| `SCANNER_TIMEOUT` | Default scan timeout | 30m |
| `MAX_CONCURRENT_SCANS` | Max parallel jobs | 5 |
| `JOB_TTL_AFTER_FINISHED` | Job cleanup TTL | 3600 |
| `NUCLEI_TEMPLATES_PATH` | Template directory | /nuclei-templates |
## 10. Observability
### 10.1 Metrics
New Prometheus metrics:
- `nuclei_scan_jobs_created_total` - Total scanner jobs created
- `nuclei_scan_job_duration_seconds` - Duration histogram of scan jobs
- `nuclei_active_scan_jobs` - Currently running scan jobs
### 10.2 Events
Kubernetes events for key state transitions:
- `ScanJobCreated` - Scanner job created
- `ScanCompleted` - Scan finished successfully
- `ScanFailed` - Scan failed
### 10.3 Logging
Structured logging with consistent fields:
- `scan` - NucleiScan name
- `namespace` - Namespace
- `targets` - Number of targets
- `timeout` - Scan timeout