feat: implement pod-based scanning architecture

This major refactor moves from synchronous subprocess-based scanning to
asynchronous pod-based scanning using Kubernetes Jobs.

## Architecture Changes
- Scanner jobs are now Kubernetes Jobs with `ttlSecondsAfterFinished` for automatic cleanup
- Jobs have owner references for garbage collection when the NucleiScan is deleted (see the sketch below)
- Configurable concurrency limits, timeouts, and resource requirements
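
A minimal sketch of how the Job Manager might construct such a Job; the helper name and defaults here are illustrative, not the operator's actual code:

```go
package jobmanager

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/utils/ptr"
)

// buildScannerJob is a hypothetical helper illustrating the two properties this
// change relies on: a TTL so finished Jobs are cleaned up automatically, and an
// owner reference so deleting the NucleiScan garbage-collects its Jobs.
func buildScannerJob(scanName, namespace string, scanUID types.UID, image string, ttlSeconds int32) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: scanName + "-",
			Namespace:    namespace,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "nuclei.homelab.mortenolsen.pro/v1alpha1",
				Kind:       "NucleiScan",
				Name:       scanName,
				UID:        scanUID,
				Controller: ptr.To(true),
			}},
		},
		Spec: batchv1.JobSpec{
			// Finished Jobs are removed by the TTL controller after this many seconds.
			TTLSecondsAfterFinished: &ttlSeconds,
			BackoffLimit:            ptr.To(int32(0)),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "scanner",
						Image: image,
						Args:  []string{"--mode=scanner"},
					}},
				},
			},
		},
	}
}
```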

## New Features
- Dual-mode binary: --mode=controller (default) or --mode=scanner (see the sketch after this list)
- Annotation-based configuration for Ingress/VirtualService resources
- Operator-level configuration via environment variables
- Startup recovery for orphaned scans after operator restart
- Periodic cleanup of stuck jobs
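
The mode switch could be as simple as a flag dispatch in main; a sketch under that assumption, where runController and runScanner are placeholders for the real entrypoints:

```go
package main

import (
	"flag"
	"log"
)

func main() {
	// One binary, two roles: the controller manages NucleiScan resources and
	// creates scanner Jobs; the scanner runs a single scan inside a Job pod.
	mode := flag.String("mode", "controller", "run mode: controller or scanner")
	flag.Parse()

	switch *mode {
	case "controller":
		runController() // controller-runtime manager, reconcilers, job manager
	case "scanner":
		runScanner() // the internal/scanner/runner.go equivalent
	default:
		log.Fatalf("unknown mode %q", *mode)
	}
}

// Placeholders standing in for the real controller and scanner entrypoints.
func runController() {}
func runScanner()    {}
```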

## New Files
- DESIGN.md: Comprehensive architecture design document
- internal/jobmanager/: Job Manager for creating/monitoring scanner jobs
- internal/scanner/runner.go: Scanner mode implementation
- internal/annotations/: Annotation parsing utilities
- charts/nuclei-operator/templates/scanner-rbac.yaml: Scanner RBAC

## API Changes
- Added ScannerConfig struct for per-scan scanner configuration
- Added JobReference struct for tracking scanner jobs
- Added ScannerConfig field to NucleiScanSpec
- Added JobRef and ScanStartTime fields to NucleiScanStatus

## Supported Annotations
- nuclei.homelab.mortenolsen.pro/enabled
- nuclei.homelab.mortenolsen.pro/templates
- nuclei.homelab.mortenolsen.pro/severity
- nuclei.homelab.mortenolsen.pro/schedule
- nuclei.homelab.mortenolsen.pro/timeout
- nuclei.homelab.mortenolsen.pro/scanner-image

## RBAC Updates
- Added Job and Pod permissions for operator
- Created separate scanner service account with minimal permissions

## Documentation
- Updated README, user-guide, api.md, and Helm chart README
- Added example annotated Ingress resources

Author: Morten Olsen
Date: 2025-12-12 20:51:23 +01:00
Parent: 519ed32de3
Commit: 12d681ada1
22 changed files with 3060 additions and 245 deletions


@@ -10,6 +10,8 @@ This document provides a complete reference for the Nuclei Operator Custom Resou
- [Status](#status)
- [Type Definitions](#type-definitions)
- [SourceReference](#sourcereference)
- [ScannerConfig](#scannerconfig)
- [JobReference](#jobreference)
- [Finding](#finding)
- [ScanSummary](#scansummary)
- [ScanPhase](#scanphase)
@@ -62,6 +64,16 @@ spec:
- critical
schedule: "@every 24h"
suspend: false
scannerConfig:
image: "custom-scanner:latest"
timeout: "1h"
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
```
#### Spec Fields
@@ -74,6 +86,7 @@ spec:
| `severity` | []string | No | Severity filter. Valid values: `info`, `low`, `medium`, `high`, `critical` |
| `schedule` | string | No | Cron schedule for periodic rescanning |
| `suspend` | bool | No | When true, suspends scheduled scans |
| `scannerConfig` | [ScannerConfig](#scannerconfig) | No | Scanner-specific configuration overrides |
#### Schedule Format
@@ -110,6 +123,12 @@ status:
lastScanTime: "2024-01-15T10:30:00Z"
completionTime: "2024-01-15T10:35:00Z"
nextScheduledTime: "2024-01-16T10:30:00Z"
scanStartTime: "2024-01-15T10:30:05Z"
jobRef:
name: my-app-scan-abc123
uid: "job-uid-12345"
podName: my-app-scan-abc123-xyz
startTime: "2024-01-15T10:30:00Z"
summary:
totalFindings: 3
findingsBySeverity:
@@ -127,6 +146,7 @@ status:
timestamp: "2024-01-15T10:32:00Z"
lastError: ""
observedGeneration: 1
retryCount: 0
```
#### Status Fields
@@ -138,10 +158,14 @@ status:
| `lastScanTime` | *Time | When the last scan was initiated |
| `completionTime` | *Time | When the last scan completed |
| `nextScheduledTime` | *Time | When the next scheduled scan will run |
| `scanStartTime` | *Time | When the scanner pod actually started scanning |
| `jobRef` | *[JobReference](#jobreference) | Reference to the current or last scanner job |
| `summary` | *[ScanSummary](#scansummary) | Aggregated scan statistics |
| `findings` | [][Finding](#finding) | Array of scan results |
| `lastError` | string | Error message if the scan failed |
| `observedGeneration` | int64 | Generation observed by the controller |
| `retryCount` | int | Number of consecutive availability check retries |
| `lastRetryTime` | *Time | When the last availability check retry occurred |
#### Conditions
@@ -188,6 +212,82 @@ type SourceReference struct {
| `namespace` | string | Yes | Namespace of the source resource |
| `uid` | string | Yes | UID of the source resource |
### ScannerConfig
`ScannerConfig` defines scanner-specific configuration that can override default settings.
```go
type ScannerConfig struct {
Image string `json:"image,omitempty"`
Resources *corev1.ResourceRequirements `json:"resources,omitempty"`
Timeout *metav1.Duration `json:"timeout,omitempty"`
TemplateURLs []string `json:"templateURLs,omitempty"`
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `image` | string | No | Override the default scanner image |
| `resources` | ResourceRequirements | No | Resource requirements for the scanner pod |
| `timeout` | Duration | No | Override the default scan timeout |
| `templateURLs` | []string | No | Additional template repositories to clone |
| `nodeSelector` | map[string]string | No | Node selector for scanner pod scheduling |
| `tolerations` | []Toleration | No | Tolerations for scanner pod scheduling |
**Example:**
```yaml
scannerConfig:
image: "ghcr.io/custom/scanner:v1.0.0"
timeout: "1h"
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: "2"
memory: 2Gi
nodeSelector:
node-type: scanner
tolerations:
- key: "dedicated"
operator: "Equal"
value: "scanner"
effect: "NoSchedule"
```
### JobReference
`JobReference` contains information about the scanner job for tracking and debugging.
```go
type JobReference struct {
Name string `json:"name"`
UID string `json:"uid"`
PodName string `json:"podName,omitempty"`
StartTime *metav1.Time `json:"startTime,omitempty"`
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | Yes | Name of the Kubernetes Job |
| `uid` | string | Yes | UID of the Job |
| `podName` | string | No | Name of the scanner pod, used for log retrieval (see the sketch below) |
| `startTime` | *Time | No | When the job was created |
**Example:**
```yaml
jobRef:
name: my-scan-abc123
uid: "12345678-1234-1234-1234-123456789012"
podName: my-scan-abc123-xyz
startTime: "2024-01-15T10:30:00Z"
```
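Since `podName` exists for log retrieval, a client-go sketch along these lines could stream the scanner's logs; the helper is illustrative, and only the `jobRef` fields come from this API:
```go
package scanlogs

import (
	"context"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// StreamScannerLogs prints the logs of the pod referenced by status.jobRef.podName.
func StreamScannerLogs(ctx context.Context, namespace, podName string) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	stream, err := clientset.CoreV1().Pods(namespace).GetLogs(podName, &corev1.PodLogOptions{}).Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()
	_, err = io.Copy(os.Stdout, stream)
	return err
}
```
From the command line, `kubectl logs <podName> -n <namespace>` achieves the same thing.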
### Finding
`Finding` represents a single vulnerability or issue discovered during a scan.


@@ -7,6 +7,8 @@ This guide provides detailed instructions for using the Nuclei Operator to autom
- [Introduction](#introduction)
- [Installation](#installation)
- [Basic Usage](#basic-usage)
- [Scanner Architecture](#scanner-architecture)
- [Annotation-Based Configuration](#annotation-based-configuration)
- [Configuration Options](#configuration-options)
- [Working with Ingress Resources](#working-with-ingress-resources)
- [Working with VirtualService Resources](#working-with-virtualservice-resources)
@@ -24,11 +26,13 @@ The Nuclei Operator automates security scanning by watching for Kubernetes Ingre
1. Extracts target URLs from the resource
2. Creates a NucleiScan custom resource
3. Creates a Kubernetes Job to execute the Nuclei security scan in an isolated pod
4. Stores the results in the NucleiScan status
This enables continuous security monitoring of your web applications without manual intervention.
The operator uses a **pod-based scanning architecture** where each scan runs in its own isolated Kubernetes Job, providing better scalability, reliability, and resource control.
---
## Installation
@@ -151,6 +155,224 @@ kubectl apply -f manual-scan.yaml
---
## Scanner Architecture
The nuclei-operator uses a pod-based scanning architecture for improved scalability and reliability:
1. **Operator Pod**: Manages NucleiScan resources and creates scanner jobs
2. **Scanner Jobs**: Kubernetes Jobs that execute nuclei scans in isolated pods
3. **Direct Status Updates**: Scanner pods update NucleiScan status directly via the Kubernetes API (a sketch follows the diagram below)
### Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────────┐ ┌──────────────────────────────────────┐ │
│ │ Operator Pod │ │ Scanner Jobs │ │
│ │ │ │ │ │
│ │ ┌────────────┐ │ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Controller │──┼─────┼─▶│ Job 1 │ │ Job 2 │ ... │ │
│ │ │ Manager │ │ │ │(Scanner)│ │(Scanner)│ │ │
│ │ └────────────┘ │ │ └────┬────┘ └────┬────┘ │ │
│ │ │ │ │ │ │ │ │
│ └────────┼─────────┘ └───────┼────────────┼─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Kubernetes API Server │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ NucleiScan │ │ NucleiScan │ │ NucleiScan │ ... │ │
│ │ │ Resource │ │ Resource │ │ Resource │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
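Point 3 above could be implemented as a plain status-subresource update from the scanner pod; a sketch under that assumption, where the import path is a placeholder and the Go field names are inferred from the status fields documented in api.md:
```go
package scanner

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Placeholder import path; the real module path will differ.
	nucleiv1alpha1 "example.com/nuclei-operator/api/v1alpha1"
)

// publishResults writes scan findings back to the owning NucleiScan.
// How the scan's name and namespace reach the pod (env vars, flags) is an
// assumption; only the status fields themselves are documented.
func publishResults(ctx context.Context, c client.Client, name, namespace string, findings []nucleiv1alpha1.Finding) error {
	var scan nucleiv1alpha1.NucleiScan
	if err := c.Get(ctx, client.ObjectKey{Name: name, Namespace: namespace}, &scan); err != nil {
		return err
	}
	now := metav1.Now()
	scan.Status.Findings = findings
	scan.Status.CompletionTime = &now
	// Update only the status subresource so the spec is never touched.
	return c.Status().Update(ctx, &scan)
}
```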
### Benefits
- **Scalability**: Multiple scans can run concurrently across the cluster
- **Isolation**: Each scan runs in its own pod with dedicated resources
- **Reliability**: Scans survive operator restarts
- **Resource Control**: Per-scan resource limits and quotas
- **Observability**: Individual pod logs for each scan
### Scanner Configuration
Configure scanner behavior via Helm values:
```yaml
scanner:
# Enable scanner RBAC resources
enabled: true
# Scanner image (defaults to operator image)
image: "ghcr.io/morten-olsen/nuclei-operator:latest"
# Default scan timeout
timeout: "30m"
# Maximum concurrent scan jobs
maxConcurrent: 5
# Job TTL after completion (seconds)
ttlAfterFinished: 3600
# Default resource requirements for scanner pods
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: "1"
memory: 1Gi
# Default templates to use
defaultTemplates: []
# Default severity filter
defaultSeverity: []
```
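These Helm values presumably reach the operator as environment variables (the commit notes operator-level configuration via environment variables); a sketch with hypothetical variable names, falling back to the chart's documented defaults:
```go
package config

import (
	"os"
	"strconv"
	"time"
)

// ScannerDefaults mirrors the Helm values shown above. The environment
// variable names below are hypothetical, used only to illustrate the lookup.
type ScannerDefaults struct {
	Image         string
	Timeout       time.Duration
	MaxConcurrent int
}

// FromEnv loads scanner defaults from the environment.
func FromEnv() ScannerDefaults {
	d := ScannerDefaults{
		Image:         "ghcr.io/morten-olsen/nuclei-operator:latest",
		Timeout:       30 * time.Minute,
		MaxConcurrent: 5,
	}
	if v := os.Getenv("SCANNER_IMAGE"); v != "" { // hypothetical variable name
		d.Image = v
	}
	if v, err := time.ParseDuration(os.Getenv("SCANNER_TIMEOUT")); err == nil && v > 0 {
		d.Timeout = v
	}
	if v, err := strconv.Atoi(os.Getenv("SCANNER_MAX_CONCURRENT")); err == nil && v > 0 {
		d.MaxConcurrent = v
	}
	return d
}
```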
### Per-Scan Scanner Configuration
You can override scanner settings for individual scans using the `scannerConfig` field in the NucleiScan spec:
```yaml
apiVersion: nuclei.homelab.mortenolsen.pro/v1alpha1
kind: NucleiScan
metadata:
name: custom-scan
spec:
sourceRef:
apiVersion: networking.k8s.io/v1
kind: Ingress
name: my-ingress
namespace: default
uid: "abc123"
targets:
- https://example.com
scannerConfig:
# Override scanner image
image: "custom-scanner:latest"
# Override timeout
timeout: "1h"
# Custom resource requirements
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: "2"
memory: 2Gi
# Node selector for scanner pod
nodeSelector:
node-type: scanner
# Tolerations for scanner pod
tolerations:
- key: "scanner"
operator: "Equal"
value: "true"
effect: "NoSchedule"
```
---
## Annotation-Based Configuration
You can configure scanning behavior for individual Ingress or VirtualService resources using annotations; a parsing sketch follows the table below.
### Supported Annotations
| Annotation | Type | Default | Description |
|------------|------|---------|-------------|
| `nuclei.homelab.mortenolsen.pro/enabled` | bool | `true` | Enable/disable scanning for this resource |
| `nuclei.homelab.mortenolsen.pro/templates` | string | - | Comma-separated list of template paths or tags |
| `nuclei.homelab.mortenolsen.pro/severity` | string | - | Comma-separated severity filter: info,low,medium,high,critical |
| `nuclei.homelab.mortenolsen.pro/schedule` | string | - | Cron schedule for periodic scans |
| `nuclei.homelab.mortenolsen.pro/timeout` | duration | `30m` | Scan timeout |
| `nuclei.homelab.mortenolsen.pro/scanner-image` | string | - | Override scanner image |
| `nuclei.homelab.mortenolsen.pro/exclude-templates` | string | - | Templates to exclude |
| `nuclei.homelab.mortenolsen.pro/tags` | string | - | Template tags to include |
| `nuclei.homelab.mortenolsen.pro/exclude-tags` | string | - | Template tags to exclude |
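The internal/annotations package presumably parses these values along the following lines; the type and helper names here are illustrative, while the annotation keys and defaults come from the table above:
```go
package annotations

import (
	"strconv"
	"strings"
	"time"
)

const prefix = "nuclei.homelab.mortenolsen.pro/"

// ScanOptions collects the per-resource settings parsed from annotations.
type ScanOptions struct {
	Enabled  bool
	Severity []string
	Schedule string
	Timeout  time.Duration
	Image    string
}

// Parse reads the supported annotations from an Ingress/VirtualService
// metadata map, falling back to the documented defaults.
func Parse(a map[string]string) ScanOptions {
	opts := ScanOptions{Enabled: true, Timeout: 30 * time.Minute}
	if v, ok := a[prefix+"enabled"]; ok {
		if b, err := strconv.ParseBool(v); err == nil {
			opts.Enabled = b
		}
	}
	if v, ok := a[prefix+"severity"]; ok {
		for _, s := range strings.Split(v, ",") {
			opts.Severity = append(opts.Severity, strings.TrimSpace(s))
		}
	}
	if v, ok := a[prefix+"schedule"]; ok {
		opts.Schedule = v
	}
	if v, ok := a[prefix+"timeout"]; ok {
		if d, err := time.ParseDuration(v); err == nil {
			opts.Timeout = d
		}
	}
	if v, ok := a[prefix+"scanner-image"]; ok {
		opts.Image = v
	}
	return opts
}
```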
### Example Annotated Ingress
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: myapp-ingress
annotations:
nuclei.homelab.mortenolsen.pro/enabled: "true"
nuclei.homelab.mortenolsen.pro/severity: "medium,high,critical"
nuclei.homelab.mortenolsen.pro/schedule: "0 2 * * *"
nuclei.homelab.mortenolsen.pro/templates: "cves/,vulnerabilities/"
spec:
rules:
- host: myapp.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: myapp
port:
number: 80
```
### Example Annotated VirtualService
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: myapp-vs
annotations:
nuclei.homelab.mortenolsen.pro/enabled: "true"
nuclei.homelab.mortenolsen.pro/severity: "high,critical"
nuclei.homelab.mortenolsen.pro/timeout: "1h"
nuclei.homelab.mortenolsen.pro/tags: "cve,oast"
spec:
hosts:
- myapp.example.com
gateways:
- my-gateway
http:
- route:
- destination:
host: myapp
port:
number: 80
```
### Disabling Scanning
To disable scanning for a specific resource:
```yaml
metadata:
annotations:
nuclei.homelab.mortenolsen.pro/enabled: "false"
```
This is useful when you want to temporarily exclude certain resources from scanning without removing them from the cluster.
### Annotation Precedence
When both annotations and NucleiScan spec fields are present, the following precedence applies (a resolution sketch follows the list):
1. **NucleiScan spec fields** (highest priority) - Direct configuration in the NucleiScan resource
2. **Annotations** - Configuration from the source Ingress/VirtualService
3. **Helm values** - Default configuration from the operator deployment
4. **Built-in defaults** (lowest priority) - Hardcoded defaults in the operator
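In code, that precedence amounts to taking the first non-empty value from the most specific source; a minimal illustrative helper, not the operator's actual merge logic:
```go
package config

// firstNonEmpty returns the first non-empty value, so callers list candidates
// from highest to lowest priority: spec field, annotation, Helm value,
// built-in default.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// Example: resolving the scanner image for a scan.
//   image := firstNonEmpty(
//       specImage,       // 1. NucleiScan spec (scannerConfig.image)
//       annotationImage, // 2. source resource annotation (scanner-image)
//       helmImage,       // 3. Helm values (scanner.image)
//       "ghcr.io/morten-olsen/nuclei-operator:latest", // 4. built-in default
//   )
```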
---
## Configuration Options
### Severity Filtering