How It Works

A technical overview of the Code Corgi detection pipeline, from webhook to verdict.

Code Corgi uses a multi-stage asynchronous pipeline to analyze pull requests. Each stage is independently scalable and connected via NATS JetStream message queues.

Architecture Overview

GitHub / GitLab Webhook
        │
        ▼
 Webhook Ingestion          ← HTTP 202 in < 200ms
        │
        ▼ NATS JetStream
 Normalizer                 ← fetches diff, splits into files
        │
        ▼ NATS JetStream (fan-out)
 ┌──────┴──────┬──────────────────┐
 │             │                  │
Unicode     Homoglyph         Semantic
Detector    Detector          Analyzer
 │             │                  │
 └──────┬──────┴──────────────────┘
        │
        ▼ NATS JetStream
 Scoring Engine             ← aggregates, classifies severity
        │
        ├──▶ Audit Logger   ← append-only PostgreSQL
        │
        └──▶ Notification   ← GitHub status check, Slack, webhooks

All inter-service communication uses NATS JetStream — no direct HTTP between services. This provides back-pressure handling, at-least-once delivery guarantees, and horizontal scalability.

Stage 1: Webhook Ingestion

The webhook handler receives the pull request event and immediately returns HTTP 202 Accepted. It performs HMAC-SHA256 signature verification before publishing a pr.received message to NATS JetStream.

Target response time: under 200ms — fast enough to never block CI pipelines.
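
The signature check can be sketched as follows — a minimal example assuming a GitHub-style X-Hub-Signature-256 header; the function name and argument shapes are illustrative, not the actual handler:

```python
import hashlib
import hmac

def verify_signature(payload: bytes, secret: bytes, signature_header: str) -> bool:
    """Verify a GitHub-style X-Hub-Signature-256 header against the raw body.

    GitHub sends "sha256=<hexdigest>". We recompute the HMAC-SHA256 over the
    raw request body and compare in constant time to avoid timing attacks.
    """
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

Constant-time comparison (hmac.compare_digest) matters here: a naive == comparison leaks how many leading bytes matched, which an attacker can exploit to forge signatures byte by byte.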

Stage 2: Normalization

The normalizer fetches the actual diff from the VCS API, decomposes it into individual file changes, and emits one file.changed message per file. This allows downstream detectors to process files in parallel.

Each message includes: repository metadata, file path, inferred language, raw diff content, and PR context (author, branch, base commit SHA).
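
A file.changed payload with those fields might look like this sketch — the class and field names are illustrative, not the actual wire schema:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class FileChanged:
    """Illustrative shape of a file.changed message (field names assumed)."""
    repo: str          # repository metadata
    pr_number: int
    path: str          # file path
    language: str      # inferred language
    diff: str          # raw diff content
    author: str        # PR context
    branch: str
    base_sha: str      # base commit SHA

    def to_message(self) -> bytes:
        """Serialize for publication on the file.changed subject."""
        return json.dumps(asdict(self)).encode()
```

Emitting one such message per file is what lets the three detectors fan out and process a large PR in parallel.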

Stage 3: Detection (3 Layers in Parallel)

Layer 1 — Unicode Detector

Scans every added line for non-ASCII codepoints. For each finding, records:

  • The Unicode codepoint (e.g., U+202E)
  • The Unicode name (e.g., RIGHT-TO-LEFT OVERRIDE)
  • Line number and column offset
  • Character category (control character, letter-like, modifier, etc.)
  • Context classification (identifier, string literal, comment, operator)

High-risk categories include: bidirectional control characters, zero-width characters, non-printing characters, and confusable lookalikes.
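
A minimal version of this scan using Python's unicodedata module — the high-risk set here is a simplification of the categories above, and the finding shape is illustrative:

```python
import unicodedata

# Simplified: Cf covers format characters (BiDi controls, zero-width chars),
# Cc covers control characters. The real detector tracks finer categories.
HIGH_RISK_CATEGORIES = {"Cf", "Cc"}

def scan_line(line: str, line_no: int) -> list[dict]:
    """Record every non-ASCII codepoint on an added line."""
    findings = []
    for col, ch in enumerate(line):
        if ord(ch) < 128:
            continue
        category = unicodedata.category(ch)
        findings.append({
            "codepoint": f"U+{ord(ch):04X}",
            "name": unicodedata.name(ch, "<unnamed>"),
            "line": line_no,
            "col": col,
            "category": category,
            "high_risk": category in HIGH_RISK_CATEGORIES,
        })
    return findings
```

Running this on a line containing U+202E produces a finding with name RIGHT-TO-LEFT OVERRIDE and category Cf, which is exactly the BiDi-control case that motivated the detector.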

Layer 2 — Homoglyph Detector

Tokenizes identifiers and string values, then compares each token against a precomputed homoglyph database covering Latin, Cyrillic, Greek, Arabic, Armenian, and CJK character ranges.

Uses a modified edit distance metric that treats visually confusable character pairs as distance-0. For example, Latin a and Cyrillic а (U+0430) are considered identical substitutions.
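
This can be sketched as a Levenshtein variant whose substitution cost is zero for confusable pairs — the confusables table below is a tiny illustrative excerpt, not the real precomputed database:

```python
# Tiny excerpt of a confusables table. The real database covers Latin,
# Cyrillic, Greek, Arabic, Armenian, and CJK ranges.
CONFUSABLE = {("a", "\u0430"), ("e", "\u0435"), ("o", "\u043e"), ("p", "\u0440")}

def confusable(x: str, y: str) -> bool:
    return x == y or (x, y) in CONFUSABLE or (y, x) in CONFUSABLE

def homoglyph_distance(s: str, t: str) -> int:
    """Levenshtein distance where visually confusable pairs cost 0."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            sub_cost = prev[j - 1] + (0 if confusable(cs, ct) else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, sub_cost))
        prev = cur
    return prev[-1]
```

Under this metric, "paypal" and "pаypаl" (with Cyrillic а) are distance 0 — the signal that an identifier has been spoofed.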

Layer 3 — Semantic Analyzer

Parses source code into an Abstract Syntax Tree using tree-sitter (WASM build, vendored in the container image — no runtime downloads). Walks the AST looking for:

  Pattern                                               Languages
  ────────────────────────────────────────────────────  ──────────────
  eval(), exec(), Function()                            JS, TS, Python
  Dynamic import() / require() with non-literal paths   JS, TS
  __import__, importlib.import_module                   Python
  os.system, subprocess with shell=True                 Python
  Base64-encoded string literals                        All
  Hex-encoded payloads > 32 chars                       All
  reflect.Value.Call with external input                Go
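
For illustration, the same idea expressed with Python's stdlib ast module rather than tree-sitter — the function and set names are hypothetical, and the real analyzer walks tree-sitter ASTs across all supported languages:

```python
import ast

DANGEROUS_CALLS = {"eval", "exec", "__import__"}

def find_dynamic_execution(source: str) -> list[tuple[str, int, bool]]:
    """Flag calls to eval/exec/__import__, noting whether any argument
    is non-literal (a dynamic input rather than a constant)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            dynamic = any(not isinstance(a, ast.Constant) for a in node.args)
            findings.append((node.func.id, node.lineno, dynamic))
    return findings
```

The literal-vs-dynamic distinction is what separates a CRITICAL finding (eval with external input) from mere noise (eval of a constant string), matching the severity rules in the next stage.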

Stage 4: Scoring

The scoring engine aggregates findings from all three detectors and assigns a final severity:

  • CRITICAL — BiDi override or homoglyph in an identifier; eval() with dynamic input
  • HIGH — Invisible character in a string literal; dynamic import with external path
  • MEDIUM — Unusual Unicode in comments; base64 payload in config
  • INFO — Non-ASCII in string value; encoded data within expected bounds

Multiple findings on the same file escalate severity. The aggregate determines whether the PR status check passes or fails.

Stage 5: Audit Logging

Every event is written to the audit_events PostgreSQL table using an INSERT-only role — no UPDATE or DELETE permissions. This makes the log tamper-evident by design.

Logged events include: raw PR payload, each detector finding (with full evidence), the final score, and any alert delivery outcomes.

Stage 6: Notifications

The notification service reads from the findings.ready NATS stream and:

  1. Posts a GitHub or GitLab status check with a pass/fail result and summary
  2. Sends a Slack message (if configured) with severity, file paths, and codepoint details
  3. Triggers configured outbound webhooks (SIEM integration, PagerDuty, etc.)

CRITICAL findings generate an immediate alert. Lower-severity findings are batched into a PR summary comment.
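
That split between immediate alerts and the batched summary can be sketched as (the structure is illustrative):

```python
def route(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split findings into immediate alerts (CRITICAL) and the batch
    destined for the PR summary comment (everything else)."""
    immediate, batched = [], []
    for f in findings:
        (immediate if f["severity"] == "CRITICAL" else batched).append(f)
    return immediate, batched
```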