ADR-0101 — Nightly Scan-to-Issue Pipeline Rewrite

Status: Accepted
Date: 2026-05-19 UTC
Deciders: Kristerpher (operator)
Supersedes: Composite card #2159
Related: feedback_bandit_in_tests_policy, feedback_security_scan_per_file_grouping, feedback_nightly_scan_dark_is_high, ADR-0091


Context

The nightly security scan pipeline (.github/workflows/nightly-security-scan.yml + .github/scripts/security_aggregate.py + .github/scripts/security_file_issues.py) has accreted six distinct failure modes, all discovered concurrently during the 2026-05-19 security triage four days before v1 launch:

Root Cause Analysis

| # | Failure | Root cause |
|---|---------|------------|
| 1 | Issues file under github-actions[bot] instead of raxx-ops-bot | The PEM in the vault is PKCS#1. The prior openssl dgst -sha256 -sign workflow step was patched to PyJWT; the PKCS#1 vs PKCS#8 mismatch was documented in the incident, but the vault was never updated. Bot-identity fallback triggers silently. |
| 2 | B310 (urlopen) and B608 (hardcoded SQL) promoted from MEDIUM to HIGH | security_aggregate.py applies a blanket {"HIGH": "CRITICAL", "MEDIUM": "HIGH", "LOW": "MEDIUM"} mapping to bandit severity, ignoring that bandit's MEDIUM covers both genuinely risky findings and well-understood false-positive-class checks. No per-rule override table exists. |
| 3 | Multiple issues filed per (file, rule_id) | security_file_issues.py titles issues using the full filename:line description, so each occurrence in a file generates a unique title that fails the de-dup check. |
| 4 | gitleaks generic-api-key in test fixtures filed as CRIT | No path-based or rule-based allowlist is applied at the issue-filing stage. feedback_bandit_in_tests_policy covers bandit but was never extended to gitleaks. Partial AKIA strings (e.g., AKIA...XXXX) used as test stubs fire as real credential detections. |
| 5 | BILLING_DB_PATH + POSTMARK_BILLING_TIER absent from vault | billing-collector-cron.yml probes the same vault paths as the security scan; the secrets were never seeded. Unrelated to scan logic; it rides in the same composite card. |
| 6 | Scan workflow status unverifiable by raxx-dev-bot | raxx-dev-bot's GitHub App installation has no Actions:read scope. When the scan job itself crashes before emitting the findings artifact, the detection-gap monitor cannot tell whether the scan ran. |

Invariants


Decision

Rewrite the scan-to-issue pipeline as three composable Python scripts, plus a gap-detection script, that replace security_aggregate.py and security_file_issues.py, with two small config files that encode the severity-mapping and auto-close rules:

.github/scripts/
  scan_normalize.py       # raw scanner JSON → normalized findings.jsonl
  scan_deduplicate.py     # findings.jsonl → grouped.jsonl (one entry per file+rule)
  scan_file_issues.py     # grouped.jsonl → gh issue create/update/close
  scan_detect_gap.py      # called when scan artifact is absent → emits HIGH finding

.github/config/
  scan_severity_map.yaml  # per-tool, per-rule severity overrides
  scan_autoclose.yaml     # test-fixture false-positive rules

The workflow wires them in sequence. Each script is independently testable with fixture inputs.


Data Flow

flowchart TD
    A[Scanner outputs\nbandit.json\ngitleaks.json\npip-audit-*.json\nnpm-audit.json\ntrivy.json] --> B[scan_normalize.py\nPer-scanner parser\nEmits findings.jsonl]
    B --> C[scan_deduplicate.py\nGroup by file+rule_id\nEmits grouped.jsonl]
    C --> D[scan_file_issues.py\nApply severity_map\nApply autoclose rules\nCreate / update / close GH issues]
    E[scan_detect_gap.py\nRuns if scan artifact absent] --> D
    D --> F[GitHub Issues\nraxx-ops-bot attribution]
    D --> G[reports/filed.jsonl\nAudit trail]

Normalized Finding Schema

Each record in findings.jsonl (one JSON object per line):

{
  "tool":        str,   # "bandit" | "gitleaks" | "pip-audit" | "npm-audit" | "trivy" | "detect-gap"
  "rule_id":     str,   # tool-native rule ID: "B310", "generic-api-key", "CVE-2024-XXXX", etc.
  "file":        str,   # repo-relative path, "" for dep findings
  "line":        int,   # 0 for dep/package findings
  "tool_sev":    str,   # tool's own severity: "HIGH" | "MEDIUM" | "LOW" etc.
  "title":       str,   # human-readable summary (no file:line, no raw secret value)
  "detail":      str,   # additional context, secrets redacted to "[REDACTED]"
  "fingerprint": str,   # sha256(tool+rule_id+file) — stable across runs
}

The fingerprint field is the de-dup and update key. It is stable even when line numbers shift — a line-number change on the same file+rule is an update, not a new finding.
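
The stability property can be illustrated with a minimal sketch. The ":" separator and the function name `finding_fingerprint` are assumptions for illustration; the schema only specifies sha256(tool+rule_id+file).

```python
import hashlib


def finding_fingerprint(tool: str, rule_id: str, file: str) -> str:
    """Return a hex SHA-256 over tool, rule_id and file.

    The ":" separator is an assumption. The line number is deliberately
    left out so the value does not change when code moves within a file.
    """
    key = f"{tool}:{rule_id}:{file}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Two hits of the same rule in the same file, on different lines.
a = {"tool": "bandit", "rule_id": "B310", "file": "app/fetch.py", "line": 12}
b = {"tool": "bandit", "rule_id": "B310", "file": "app/fetch.py", "line": 40}

fp_a = finding_fingerprint(a["tool"], a["rule_id"], a["file"])
fp_b = finding_fingerprint(b["tool"], b["rule_id"], b["file"])
# fp_a and fp_b are equal: the line shift is an update, not a new finding.
```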


Grouping Algorithm

scan_deduplicate.py groups findings.jsonl by (tool, rule_id, file). For dependency findings (empty file), the group key is (tool, rule_id, package_name).

Output (grouped.jsonl) has one record per group:

{
  "group_key":    str,   # "<tool>:<rule_id>:<file>" or "<tool>:<rule_id>:dep:<pkg>"
  "fingerprint":  str,   # sha256(group_key) — matches GH issue `<!-- fp: ... -->` marker
  "tool":         str,
  "rule_id":      str,
  "file":         str,
  "occurrences":  int,   # number of individual hits in this file
  "lines":        list[int],  # first N line numbers (max 20)
  "tool_sev":     str,
  "title":        str,   # "<tool>: <rule_id> in <file> (N occurrences)"
  "detail":       str,
}
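
A minimal sketch of the grouping step, under stated assumptions: the function name `group_findings` is hypothetical, dependency findings are keyed on a hypothetical `package` field (the normalized schema above does not define one), and `tool_sev` and `detail` are taken from the first hit in each group.

```python
import hashlib
from collections import defaultdict


def group_findings(findings: list[dict], max_lines: int = 20) -> list[dict]:
    """Collapse per-occurrence findings into one record per group key."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for f in findings:
        if f["file"]:
            key = f"{f['tool']}:{f['rule_id']}:{f['file']}"
        else:
            # Assumption: dependency findings carry a "package" field.
            key = f"{f['tool']}:{f['rule_id']}:dep:{f.get('package', '')}"
        buckets[key].append(f)

    grouped = []
    for key, hits in buckets.items():
        first = hits[0]
        # Unique, sorted, non-zero line numbers, capped at max_lines.
        lines = sorted({h["line"] for h in hits if h["line"] > 0})[:max_lines]
        grouped.append({
            "group_key": key,
            "fingerprint": hashlib.sha256(key.encode("utf-8")).hexdigest(),
            "tool": first["tool"],
            "rule_id": first["rule_id"],
            "file": first["file"],
            "occurrences": len(hits),
            "lines": lines,
            "tool_sev": first["tool_sev"],
            "title": (
                f"{first['tool']}: {first['rule_id']} in {first['file']} "
                f"({len(hits)} occurrences)"
            ),
            "detail": first["detail"],
        })
    return grouped
```

Keying on the group string rather than on the per-hit title is what collapses several hits in one file into a single record, which addresses failure 3 above.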

Issue title format: [security] <RAXX_SEV>: <tool> <rule_id> in <file>. The title is stable across runs, and the fingerprint is embedded in the issue body as an HTML comment so that scan_file_issues.py can find and update (not re-create) the issue on subsequent nights.
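
The find-and-update lookup might be sketched as follows. The helper names are hypothetical, and the input shape (a list of dicts with `number` and `body`, such as the output of `gh issue list --json number,body`) is an assumption about how scan_file_issues.py fetches existing issues.

```python
import re
from typing import Optional

# Matches the marker format shown above: <!-- fp: <64 hex chars> -->
FP_MARKER = re.compile(r"<!--\s*fp:\s*([0-9a-f]{64})\s*-->")


def embed_fingerprint(body: str, fingerprint: str) -> str:
    """Append the fingerprint marker to an issue body."""
    return f"{body}\n\n<!-- fp: {fingerprint} -->"


def extract_fingerprint(body: str) -> Optional[str]:
    """Return the fingerprint embedded in an issue body, or None."""
    m = FP_MARKER.search(body or "")
    return m.group(1) if m else None


def index_open_issues(issues: list[dict]) -> dict[str, int]:
    """Map fingerprint -> issue number for issues that carry a marker.

    If two issues share a fingerprint, the first one seen wins.
    """
    index: dict[str, int] = {}
    for issue in issues:
        fp = extract_fingerprint(issue.get("body") or "")
        if fp is not None:
            index.setdefault(fp, issue["number"])
    return index
```

A grouped record whose fingerprint is already in the index would be updated in place; one that is absent would be created.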


Severity Mapping

scan_severity_map.yaml structure:

# format: tool.rule_id: raxx_severity
# Absent entries fall through to the default_map below.

default_map:
  bandit:
    HIGH: CRITICAL
    MEDIUM: HIGH
    LOW: MEDIUM
  gitleaks:
    HIGH: CRITICAL
    MEDIUM: HIGH
    LOW: HIGH           # gitleaks LOW is still a secret — treat as HIGH
  pip-audit:
    CRITICAL: CRITICAL
    HIGH: HIGH
    MODERATE: MEDIUM
  npm-audit:
    critical: CRITICAL
    high: HIGH
    moderate: MEDIUM
    low: LOW
  trivy:
    CRITICAL: CRITICAL
    HIGH: HIGH
    MEDIUM: MEDIUM
  detect-gap:
    HIGH: HIGH          # absence of scan output is always HIGH

overrides:
  # bandit: well-understood, low-exploitation MEDIUM rules downgraded
  bandit.B310: MEDIUM   # urllib urlopen — real concern but not HIGH in this codebase
  bandit.B608: MEDIUM   # hardcoded SQL — aggregate script uses string formatting intentionally
  bandit.B105: MEDIUM   # hardcoded_password_string default
  bandit.B106: MEDIUM   # hardcoded_password_funcarg default
  bandit.B107: MEDIUM   # hardcoded_password_default default
  # gitleaks: partial/placeholder patterns
  gitleaks.generic-api-key: HIGH   # confirm genuine before CRIT; auto-close if test path

New entries can be added without a redeploy — the workflow reads the YAML on each run. Any rule not in overrides falls through to default_map.
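
The lookup order (override first, then default_map) can be sketched as below. The config is inlined as a dict excerpt rather than parsed from YAML to keep the example self-contained. The `fallback` value for a tool or severity that appears in neither table is an assumption, since the config above does not say what happens in that case.

```python
# Excerpt of scan_severity_map.yaml, inlined as a dict for illustration.
SEVERITY_MAP = {
    "default_map": {
        "bandit": {"HIGH": "CRITICAL", "MEDIUM": "HIGH", "LOW": "MEDIUM"},
        "gitleaks": {"HIGH": "CRITICAL", "MEDIUM": "HIGH", "LOW": "HIGH"},
    },
    "overrides": {
        "bandit.B310": "MEDIUM",
        "bandit.B608": "MEDIUM",
        "gitleaks.generic-api-key": "HIGH",
    },
}


def resolve_severity(
    tool: str,
    rule_id: str,
    tool_sev: str,
    config: dict = SEVERITY_MAP,
    fallback: str = "HIGH",  # assumption: unknown inputs stay visible
) -> str:
    """Return the Raxx severity for a finding.

    A per-rule override wins; otherwise the tool's default_map applies.
    """
    override = config["overrides"].get(f"{tool}.{rule_id}")
    if override is not None:
        return override
    return config["default_map"].get(tool, {}).get(tool_sev, fallback)
```

With this excerpt, a bandit B310 finding reported at MEDIUM resolves to MEDIUM through the override, while a bandit MEDIUM finding for a rule without an override still resolves to HIGH.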


Auto-Close Ruleset

scan_autoclose.yaml drives scan_file_issues.py auto-close logic. An issue matching all conditions in a rule is closed with a standardised comment citing the rule name.

rules:
  - name: bandit-in-tests
    description: "Test-fixture artifacts — not exploitable in production"
    match:
      tool: bandit
      rule_ids:
        - B101  # assert_used
        - B105  # hardcoded_password_string
        - B106  # hardcoded_password_funcarg
        - B107  # hardcoded_password_default
        - B324  # hashlib
        - B608  # hardcoded_sql_expressions (test path only)
        - B110  # try_except_pass
      file_path_pattern: "*/tests/*"
    action: close
    close_reason: "Test-fixture artifact. Rule {rule_id} in {file} is a non-exploitable pattern per scan_autoclose.yaml#bandit-in-tests."

  - name: gitleaks-test-stub
    description: "Partial/placeholder credential patterns in test files — not real secrets"
    match:
      tool: gitleaks
      rule_ids:
        - generic-api-key
        - aws-access-token
      file_path_pattern: "*/tests/*"
      # Additionally: match if the leaked value contains placeholder markers
      value_pattern_any:
        - "AKIA...XXXX"
        - "sk-test-"
        - "test_key_"
        - "EXAMPLE"
        - "PLACEHOLDER"
    action: close
    close_reason: "Test-fixture stub value. Pattern matched scan_autoclose.yaml#gitleaks-test-stub. Confirm no real secret committed."

  - name: gitleaks-akia-partial
    description: "Partial AKIA references in any file (redacted/masked patterns)"
    match:
      tool: gitleaks
      rule_ids:
        - aws-access-token
      value_pattern_any:
        - "AKIA...XXXX"
        - "AKIA[.]{3,}"
    action: close
    close_reason: "Partial AKIA placeholder. Pattern matched scan_autoclose.yaml#gitleaks-akia-partial."

Auto-close leaves an audit comment on the issue before closing; the issue is not deleted. Re-open is triggered on the next nightly run if the finding recurs outside the auto-close conditions.
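
A minimal sketch of rule matching, with several assumptions labeled: `file_path_pattern` is treated as a shell-style glob (Python's `fnmatch`), `value_pattern_any` entries are treated as regular expressions, every condition present in a rule must hold, and the raw matched value is passed in separately as `value` because the normalized `detail` field is redacted. The function names are hypothetical.

```python
import fnmatch
import re

# One rule from scan_autoclose.yaml, inlined as a dict (rule_ids abbreviated).
BANDIT_IN_TESTS = {
    "name": "bandit-in-tests",
    "match": {
        "tool": "bandit",
        "rule_ids": ["B101", "B105", "B608"],
        "file_path_pattern": "*/tests/*",
    },
    "action": "close",
    "close_reason": (
        "Test-fixture artifact. Rule {rule_id} in {file} is a non-exploitable "
        "pattern per scan_autoclose.yaml#bandit-in-tests."
    ),
}


def rule_matches(rule: dict, finding: dict, value: str = "") -> bool:
    """Return True if the finding satisfies every condition in the rule."""
    match = rule["match"]
    if finding["tool"] != match["tool"]:
        return False
    rule_ids = match.get("rule_ids")
    if rule_ids and finding["rule_id"] not in rule_ids:
        return False
    pattern = match.get("file_path_pattern")
    if pattern and not fnmatch.fnmatchcase(finding["file"], pattern):
        return False
    value_patterns = match.get("value_pattern_any")
    if value_patterns and not any(re.search(p, value) for p in value_patterns):
        return False
    return True


def close_comment(rule: dict, finding: dict) -> str:
    """Render the audit comment left on the issue before closing."""
    return rule["close_reason"].format(
        rule_id=finding["rule_id"], file=finding["file"]
    )
```

Under `fnmatch` semantics the pattern "*/tests/*" matches backend/tests/test_a.py but not a root-level tests/test_a.py, because the pattern requires a "/" before "tests". Whether that matters depends on the repository layout.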


Bot Identity Story

Problem (incident 2026-05-15): openssl dgst -sha256 -sign rejected PKCS#1 PEM on openssl 3.0 (ubuntu-22.04). The workflow was patched to PyJWT, but the vault secret was never re-exported in PKCS#8.

Current state (post-PyJWT patch): PyJWT handles both PKCS#1 and PKCS#8 transparently. The vault secret does not need to be converted — PyJWT's cryptography backend loads -----BEGIN RSA PRIVATE KEY----- without error.

Actual remaining gap: The vault secret PRIVATE_KEY_PEM at /MooseQuest/raxx-ops-bot env=prod is base64-encoded in the GitHub Actions secret (RAXX_OPS_BOT_PRIVATE_KEY). If the base64 encode/decode round-trip is broken (extra newline, wrong padding), PyJWT will fail and fall back to github-actions[bot]. The fix is to verify the base64 decode round-trip produces a valid RSA PEM header — an operator-action SC documents the verification steps.
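
One way the round-trip check could look, as a sketch only; the operator-action SC remains the source of truth for the actual steps. The function name `decodes_to_pem` is hypothetical, and ignoring whitespace inside the encoded secret is an assumption. The check confirms only that the header and footer are present after decoding, not that the key itself is valid.

```python
import base64
import binascii

PEM_HEADERS = (
    "-----BEGIN RSA PRIVATE KEY-----",  # PKCS#1
    "-----BEGIN PRIVATE KEY-----",      # PKCS#8
)


def decodes_to_pem(secret_b64: str) -> bool:
    """Return True if the secret base64-decodes to a PEM-framed private key."""
    # Assumption: whitespace (including a trailing newline) is ignored.
    compact = "".join(secret_b64.split())
    try:
        # validate=True rejects non-alphabet characters; bad padding raises too.
        text = base64.b64decode(compact, validate=True).decode("ascii")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return False
    text = text.strip()
    for header in PEM_HEADERS:
        footer = header.replace("BEGIN", "END")
        if text.startswith(header) and text.endswith(footer):
            return True
    return False
```

Running a check like this before the PyJWT mint step would turn a silent fallback to github-actions[bot] into an explicit failure.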

Identity guarantee in the rewritten pipeline:
- Bot-identity fallback is logged as a structured WARNING in GITHUB_STEP_SUMMARY.
- If BOT_IDENTITY != "raxx-ops-bot", the commit-summary step fails loudly (existing behavior retained).
- The issue-filing script checks GH_TOKEN source at startup and writes a header comment on any issue filed under fallback identity, so triage can identify attribution gaps in historical issues.


Detection Gap Monitor

scan_detect_gap.py is invoked in a dedicated detect-gap job that runs if: always() and checks whether the scan job uploaded its artifact:

sequenceDiagram
    participant WF as Workflow
    participant DG as detect-gap job
    participant GH as GitHub Issues

    WF->>DG: scan job result (success/failure/skipped)
    DG->>DG: Check: was "security-scan-reports-*" artifact uploaded?
    alt artifact present
        DG->>DG: exit 0 (scan ran, normal path)
    else artifact absent
        DG->>GH: scan_detect_gap.py emits detect-gap finding
        GH->>GH: File/update issue: [security] HIGH: scan output absent YYYY-MM-DD
    end

The detect-gap job needs only issues:write. It does NOT need the full raxx-ops-bot token — it can use GITHUB_TOKEN since it is not creating PRs, only filing issues. (Exception: if raxx-ops-bot attribution is required for consistency, the same PyJWT mint pattern applies.)

The finding fingerprint for detection-gap issues is detect-gap:scan-absent:<YYYY-MM-DD>, so consecutive failure nights each generate a distinct open issue rather than updating a single one (making the run-dark gap visible in the dashboard count).
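
A minimal sketch of the per-night record. The function name, the `rule_id` value "scan-absent" (inferred from the key format above), and the `detail` wording are assumptions.

```python
import hashlib
from datetime import date


def detect_gap_finding(run_date: date) -> dict:
    """Build a detect-gap finding whose fingerprint is unique per date."""
    day = run_date.isoformat()  # YYYY-MM-DD
    key = f"detect-gap:scan-absent:{day}"
    return {
        "tool": "detect-gap",
        "rule_id": "scan-absent",  # assumption, inferred from the key format
        "file": "",
        "line": 0,
        "tool_sev": "HIGH",
        "title": f"scan output absent {day}",
        "detail": "Scan artifact was not uploaded for this run.",
        "fingerprint": hashlib.sha256(key.encode("utf-8")).hexdigest(),
    }


night_1 = detect_gap_finding(date(2026, 5, 19))
night_2 = detect_gap_finding(date(2026, 5, 20))
# The fingerprints differ, so each dark night files its own issue.
```

Including the date in the key is what makes consecutive failures accumulate as separate open issues instead of refreshing one.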


Migration

No schema changes. The rewrite replaces two Python scripts with four new ones and adds two YAML config files. The workflow YAML gains a new detect-gap job, and the aggregate and file-issues steps are updated to call the new scripts.

Rollout:

  1. Dark (SC-2 through SC-5): Scripts land in .github/scripts/ behind a new security_rewrite_enabled env var in the workflow. Existing scripts remain alongside. Nightly run executes both paths; outputs are compared in step summary.

  2. Flag cutover (SC-6): Remove old scripts once two consecutive nightly runs produce identical (or better) issue sets.

  3. GA: Remove security_rewrite_enabled toggle.

Given the 4-day launch window, the operator may choose to skip dark mode and cut over directly; this is raised as item 1 under Open Questions.


Rollout Plan

| Phase | Trigger | Rollback |
|-------|---------|----------|
| dark | SC-5 merged | Remove new scripts; old scripts still present |
| flag cutover | Two clean nightly runs | Revert flag env var in workflow |
| ga | Operator approval | N/A |

Security Considerations


Open Questions

  1. Dark-mode skip (launch proximity): With 4 days to v1 launch, should the rewrite skip the dark-mode parallel run and cut over directly? Risk: a regression in issue filing goes undetected for 24 h. Recommendation: cut over directly given the existing pipeline's known broken state, but schedule a manual workflow_dispatch run on the day SC-5 merges before the nightly window.

  2. B310 / B608 permanent downgrade: The override table downgrades both to MEDIUM. If the codebase ever adds untrusted-URL urlopen or unsanitised SQL outside tests, MEDIUM will not trigger the HIGH dashboard alert. Operator should confirm whether a nosec annotation + MEDIUM is acceptable or whether these rules need re-evaluation per PR.

  3. raxx-dev-bot Actions:read scope: SC-7 (operator-action) requires adding Actions:read to the raxx-dev-bot GitHub App installation permissions. This is a GitHub App permission change that requires operator approval in the GitHub UI. Confirm before SC-7 is dispatched.

  4. findings.jsonl artifact retention: Currently 30 days. Is this sufficient for GDPR audit purposes, or should scan artifacts be shipped to a longer-lived store (S3, Postgres) post-launch?