Scanner Pro: evidence-grade tool verification

The distinction that matters

The free scanner at /scan/ has run on tens of thousands of repositories. It finds things. Real things — hardcoded credentials, unpinned dependencies, IaC misconfigurations, GPL licence violations sitting next to a closed-source codebase.

But all of its findings carry an implicit asterisk: heuristic. Pattern matching. AST analysis. Regex. Pure Python, no external tools. Good enough to flag issues, not good enough to cite in a formal evidence record.

That distinction didn't matter when the scanner was a demo. It started mattering when we wired Gatekeep to produce evidence.json that feeds into devcontract evaluations. An evaluator that sees a finding needs to know how it was produced before deciding whether to block a merge. A heuristic finding and a tool-verified finding are not the same thing.

Scanner Pro is the answer to that distinction. Same interface. Different machinery under the hood.

What actually runs

The free scanner runs on the main ticketyboo-api Lambda — a 512 MB, 90-second function. The six free layers are all pure Python, designed to stay inside that budget.

Scanner Pro runs on ticketyboo-scanner-pro — a separate container image Lambda. 1024 MB, 300 seconds, deployed from an ECR Private image built with --platform linux/amd64. The container ships bandit, semgrep, checkov, detect-secrets, pip-audit, and ruff as proper installed packages, not Python re-implementations.

Six tools, each backed by its real ruleset library or advisory database. Not approximations of them.

bandit — Python SAST

bandit uses its bandit.core programmatic API directly. No subprocess. The BanditManager discovers and runs over all Python files in the extracted repo, returning structured Issue objects with CWE mappings, severity, confidence, and line locations. Every issue becomes a Finding with category: "security", severity mapped from bandit's HIGH/MEDIUM/LOW, and confidence from its three-tier scale.
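A minimal sketch of that flow, assuming bandit is installed in the image. The Finding field names and the lower-case severity vocabulary below are illustrative assumptions, not the scanner's actual schema. The bandit import is deferred so the mapping helper can be exercised on its own.

```python
from typing import Any

# Assumed mapping from bandit's three-tier scale to the Finding vocabulary.
LEVELS = {"HIGH": "high", "MEDIUM": "medium", "LOW": "low"}


def issue_to_finding(issue: dict[str, Any], tool_version: str) -> dict[str, Any]:
    """Convert one bandit issue dict (the shape of Issue.as_dict()) to a Finding."""
    cwe = issue.get("issue_cwe") or {}
    return {
        "category": "security",
        "severity": LEVELS.get(issue["issue_severity"], "low"),
        "confidence": LEVELS.get(issue["issue_confidence"], "low"),
        "rule_id": issue["test_id"],
        "cwe": cwe.get("id"),
        "file": issue["filename"],
        "line": issue["line_number"],
        "message": issue["issue_text"],
        "tool_version": tool_version,
        "method_label": "tool_verified",
    }


def run_bandit(repo_path: str) -> list[dict[str, Any]]:
    """Run bandit in-process over repo_path; no subprocess involved."""
    import bandit
    from bandit.core import config as b_config
    from bandit.core import manager as b_manager

    mgr = b_manager.BanditManager(b_config.BanditConfig(), "file")
    mgr.discover_files([repo_path], recursive=True)
    mgr.run_tests()
    version = bandit.__version__
    return [issue_to_finding(i.as_dict(), version) for i in mgr.get_issue_list()]
```

Keeping the conversion separate from the run makes the severity mapping testable without a repository on disk.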

semgrep — multi-language SAST

semgrep runs as a subprocess with --config p/ci. That's the curated CI ruleset from the Semgrep registry, covering Python, JavaScript/TypeScript, Go, Java, and Ruby. Not --config auto, which selects and downloads rules from the registry at scan time. Inside a Lambda that needs deterministic, offline behaviour, auto is the wrong choice. The p/ci rules are bundled in the container at image build time.

Exit code 0 (no findings) and exit code 1 (findings reported) are both treated as success. Only exit codes above 1 indicate an error. That matches semgrep's documented exit codes, where 1 is returned for findings only when the run is invoked with --error.
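The exit-code handling can be sketched roughly as follows. The function names and the timeout value are hypothetical; the command line mirrors the flags described above plus --json and --quiet for machine-readable output.

```python
import json
import subprocess
from typing import Any


def semgrep_succeeded(returncode: int) -> bool:
    """0 = clean run, 1 = run completed with findings; anything else is a failure."""
    return returncode in (0, 1)


def run_semgrep(repo_path: str, timeout_s: int = 240) -> list[dict[str, Any]]:
    """Run semgrep over repo_path and return its raw result objects."""
    proc = subprocess.run(
        ["semgrep", "scan", "--config", "p/ci", "--json", "--quiet", repo_path],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # hypothetical budget, kept under the 300-second Lambda limit
    )
    if not semgrep_succeeded(proc.returncode):
        raise RuntimeError(f"semgrep failed ({proc.returncode}): {proc.stderr[:500]}")
    # semgrep's JSON report keeps findings under the top-level "results" key.
    return json.loads(proc.stdout).get("results", [])
```

Note that a process killed by a signal reports a negative return code, which this check also treats as a failure.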

checkov — IaC security

checkov uses its RunnerRegistry programmatic API with four runners wired in: Terraform, CloudFormation, Kubernetes, Dockerfile. It runs over the extracted repo path and returns structured check results with CKV IDs, resource names, file paths, and line ranges. Over 1,000 built-in policies.

detect-secrets — credential scanning

detect-secrets uses its SecretsCollection API to walk the filesystem. Plugin-based detection — AWS keys, Stripe keys, private keys, JWT tokens, Base64 high-entropy strings. Every finding has the secret value redacted in the output. Line numbers, file paths, and secret type are retained.

pip-audit — CVE database

pip-audit runs as a subprocess, walking the repo for Python manifests: requirements.txt, Pipfile, pyproject.toml. For each manifest it queries the PyPA advisory database and returns structured JSON with CVE IDs, affected version ranges, and fix versions. No fix version = unfixed vulnerability.
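As an illustration of the "no fix version" rule, here is a small parser for a pip-audit JSON report. It assumes the report shape produced by pip-audit --format json (a top-level "dependencies" list whose entries carry "vulns" with "id" and "fix_versions"); the output field names are hypothetical.

```python
import json
from typing import Any


def parse_pip_audit(report_json: str) -> list[dict[str, Any]]:
    """Flatten a pip-audit JSON report into one finding per vulnerability."""
    report = json.loads(report_json)
    findings = []
    for dep in report.get("dependencies", []):
        for vuln in dep.get("vulns", []):
            fixes = vuln.get("fix_versions", [])
            findings.append(
                {
                    "package": dep["name"],
                    "installed_version": dep["version"],
                    "advisory_id": vuln["id"],
                    "aliases": vuln.get("aliases", []),
                    "fix_versions": fixes,
                    # An empty fix list means no released version resolves it.
                    "unfixed": not fixes,
                    "method_label": "tool_verified",
                }
            )
    return findings
```

Dependencies with an empty "vulns" list produce no findings, so a clean manifest yields an empty result.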

ruff — code quality

ruff runs as a subprocess with --output-format json. E and F rule codes map to high severity findings (pycodestyle errors and Pyflakes checks respectively). W rule codes (pycodestyle warnings) map to medium. ruff is very fast, often sub-second even on large codebases. That matters when you're inside a 300-second Lambda budget.
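The severity mapping could look something like this. One detail worth getting right: several ruff rule families also begin with E or F (ERA, EM, FBT and others), so the sketch matches on the full letter prefix rather than the first character. The "low" fallback for every other family is an assumption, not something the scanner is known to do.

```python
import re

# Letter prefix of a rule code, e.g. "E" from "E501" or "ERA" from "ERA001".
_PREFIX = re.compile(r"([A-Z]+)\d")


def ruff_severity(code: str) -> str:
    """Map a ruff rule code to a Finding severity."""
    match = _PREFIX.match(code or "")
    prefix = match.group(1) if match else ""
    if prefix in ("E", "F"):
        return "high"
    if prefix == "W":
        return "medium"
    return "low"  # assumed default for all other rule families
```

For example, ruff_severity("F401") returns "high" while ruff_severity("ERA001") falls through to the default.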

Why method_label matters

Every finding from every Pro layer carries two fields that the free scanner doesn't set:

{
  "tool_version": "1.7.9",
  "method_label": "tool_verified"
}

method_label is a vocabulary token from the Gate evidence schema. It answers the question: how was this finding produced?

The schema has two primary values:

"heuristic" — produced by pattern matching, AST analysis, or statistical inference. Informative. Not citeable in a blocking gate.
"tool_verified" — produced by a deterministic, versioned security tool with a published CVE database or ruleset. Citeable. A Gate evaluator can act on it.

When Gatekeep evaluates a devcontract and encounters a finding, it reads method_label before deciding whether a blocking gate should fail. A finding labelled "heuristic" can appear in the evidence report as an advisory. A finding labelled "tool_verified" can block a merge.

This matters most when the contract includes gates like:

- id: no-critical-cve
  layer: dependency
  severity: blocking
  method_label_required: tool_verified

A gate with method_label_required: tool_verified will not trigger on free-tier heuristic findings. It will trigger on pip-audit output. That's the point.
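A rough sketch of that check, assuming a gate is loaded as a dict mirroring the YAML above and that each finding carries "layer" and "method_label" keys. Real gate evaluation presumably also filters on finding severity and CVE data; that part is omitted here.

```python
from typing import Any


def gate_triggers(gate: dict[str, Any], findings: list[dict[str, Any]]) -> bool:
    """Return True if any finding is eligible to trip this gate."""
    required = gate.get("method_label_required")
    for finding in findings:
        if finding.get("layer") != gate.get("layer"):
            continue
        # Findings without the required provenance are advisory only.
        if required and finding.get("method_label") != required:
            continue
        return True
    return False


gate = {
    "id": "no-critical-cve",
    "layer": "dependency",
    "severity": "blocking",
    "method_label_required": "tool_verified",
}
```

With this gate, a dependency finding labelled "heuristic" is skipped, while the same finding labelled "tool_verified" trips it.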

Evidence signing

After the six layers complete, the Pro scanner builds an evidence.json record:

{
  "scan_id": "pro-20260331-a1b2c3",
  "scan_tier": "pro",
  "owner": "acme",
  "repo": "payments-service",
  "tools_used": [
    { "name": "bandit",         "version": "1.7.9",  "method_label": "tool_verified" },
    { "name": "semgrep",        "version": "1.72.0", "method_label": "tool_verified" },
    { "name": "checkov",        "version": "3.2.0",  "method_label": "tool_verified" },
    { "name": "detect-secrets", "version": "1.5.0",  "method_label": "tool_verified" },
    { "name": "pip-audit",      "version": "2.7.3",  "method_label": "tool_verified" },
    { "name": "ruff",           "version": "0.4.1",  "method_label": "tool_verified" }
  ],
  "finding_count": 14,
  "health_score": 72,
  "evidence_hash": "sha256:e3b0c44298fc1c14...",
  "signature": "hmac-sha256:7f4e3d...",
  "signed_by": "ticketyboo-scanner-pro-v1",
  "timestamp": "2026-03-31T14:22:01Z"
}

The evidence_hash is SHA-256 over the canonical JSON of the full report. The signature is HMAC-SHA256 over the hash using a key stored in SSM Parameter Store (/ticketyboo/scanner-pro-signing-key). The signing key never appears in the Lambda environment variables; it is loaded at scan time via boto3.client('ssm').get_parameter(Name='/ticketyboo/scanner-pro-signing-key', WithDecryption=True).

signed_by: "ticketyboo-scanner-pro-v1" is the identity token. When a Gate evaluator receives this evidence, it can verify the HMAC against the expected signing identity before accepting the method_label: "tool_verified" claims.
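The hash-then-sign scheme can be sketched with the standard library alone. Several details here are assumptions: "canonical JSON" is taken to mean sorted keys with compact separators, the HMAC is computed over the hex digest string, and the key is passed in as bytes rather than fetched from SSM.

```python
import hashlib
import hmac
import json
from typing import Any


def canonical_json(report: dict[str, Any]) -> bytes:
    """Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(
        report, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sign_report(report: dict[str, Any], key: bytes) -> dict[str, str]:
    """Hash the report, then HMAC the hash with the signing key."""
    digest = hashlib.sha256(canonical_json(report)).hexdigest()
    signature = hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return {
        "evidence_hash": f"sha256:{digest}",
        "signature": f"hmac-sha256:{signature}",
    }


def verify_report(report: dict[str, Any], envelope: dict[str, str], key: bytes) -> bool:
    """Recompute both values and compare them in constant time."""
    expected = sign_report(report, key)
    return hmac.compare_digest(
        expected["evidence_hash"], envelope.get("evidence_hash", "")
    ) and hmac.compare_digest(expected["signature"], envelope.get("signature", ""))
```

Because HMAC is symmetric, an evaluator verifying the signature needs access to the same key; hmac.compare_digest avoids leaking how many leading characters matched.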

Credit model

Pro scans cost 1 credit. The deduction happens in handler.py — not in the pro scanner Lambda itself. The sequence is:

1. JWT authentication: Cognito RS256 token verification
2. Validate repo URL: extract owner/repo
3. Check credits ≥ 1 (DynamoDB GetItem)
4. Create scan record in DynamoDB with status: pending
5. Deduct 1 credit: atomic ADD -1 with ConditionExpression: credits_remaining >= :one
6. Async-invoke ticketyboo-scanner-pro
7. Return scan ID to client, which polls for completion

The atomic ConditionExpression means two simultaneous scans can't both succeed if only one credit remains. The condition fails the second request with a 402. No silent overages.
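A sketch of the conditional deduction using the low-level DynamoDB client. The table name and the user_id key schema are hypothetical; only the ADD expression and the condition come from the description above. The client is passed in (for example boto3.client("dynamodb")) so the request shape can be inspected separately.

```python
from typing import Any


def build_deduct_request(user_id: str, table: str = "ticketyboo-users") -> dict[str, Any]:
    """Build UpdateItem arguments that subtract one credit only if one is available."""
    return {
        "TableName": table,  # hypothetical table name
        "Key": {"user_id": {"S": user_id}},  # hypothetical key schema
        "UpdateExpression": "ADD credits_remaining :neg_one",
        "ConditionExpression": "credits_remaining >= :one",
        "ExpressionAttributeValues": {
            ":neg_one": {"N": "-1"},
            ":one": {"N": "1"},
        },
        "ReturnValues": "UPDATED_NEW",
    }


def deduct_credit(client: Any, user_id: str) -> bool:
    """Return True if a credit was deducted, False if the balance was too low."""
    try:
        client.update_item(**build_deduct_request(user_id))
    except client.exceptions.ConditionalCheckFailedException:
        return False  # caller maps this to a 402 response
    return True
```

DynamoDB evaluates the condition and applies the update as a single atomic operation on the item, which is why the earlier GetItem check is only a fast path and not the real guard.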

Credit logic lives in one place: handler.py. The pro scanner Lambda does not touch credits. If the scan fails after deduction, the credit is spent. That's the same model as any cloud security scanner. The alternative (refund on failure) requires a transaction log and a rollback mechanism, which is more complexity than it is worth at this scale.

The container boundary

Why a container Lambda and not a zip Lambda with a layer? semgrep and checkov together exceed 250 MB, the unzipped size limit for a zip deployment (function code plus layers). Container images on ECR Private support up to 10 GB. The container is built for linux/amd64 (ARM is not yet supported by all the native extension modules in these tools).

The image uses public.ecr.aws/lambda/python:3.12 as the base. All six tools are pinned at build time. The build script tags :latest plus a timestamp tag for rollback. ECR Private is in eu-north-1 alongside the Lambda — no cross-region pulls.

The Terraform module provisions the ECR repository, the Lambda function, and an IAM policy attachment that gives the main ticketyboo-api Lambda permission to async-invoke it. The pro scanner Lambda's own role gets the minimum permissions it needs: SSM read for the signing key, S3 write for report storage, DynamoDB write for findings and scan status.

What this enables

The practical result: a repository scan that produces the same structured evidence format as a Gatekeep PR run. The findings are in the same schema, carry the same method_label token, and are stored in the same S3 path structure.

A team using Gatekeep for PR enforcement can run a Pro scan against their repo baseline and get findings that are directly comparable to what Gatekeep would produce. Same tool versions, same ruleset IDs, same evidence structure.

For teams evaluating ticketyboo: the Pro scan is a concrete demonstration of what method_label: "tool_verified" means in practice before committing to the PR enforcement workflow.
