How It Works
A technical overview of the Code Corgi detection pipeline, from webhook to verdict.
Code Corgi uses a multi-stage asynchronous pipeline to analyze pull requests. Each stage is independently scalable and connected via NATS JetStream message queues.
Architecture Overview
    GitHub / GitLab Webhook
              │
              ▼
      Webhook Ingestion           ← HTTP 202 in < 200ms
              │
              ▼ NATS JetStream
         Normalizer               ← fetches diff, splits into files
              │
              ▼ NATS JetStream (fan-out)
       ┌──────┴──────┬──────────────────┐
       │             │                  │
    Unicode      Homoglyph          Semantic
    Detector     Detector           Analyzer
       │             │                  │
       └──────┬──────┴──────────────────┘
              │
              ▼ NATS JetStream
       Scoring Engine             ← aggregates, classifies severity
              │
              ├──▶ Audit Logger   ← append-only PostgreSQL
              │
              └──▶ Notification   ← GitHub status check, Slack, webhooks
All inter-service communication uses NATS JetStream — no direct HTTP between services. This provides back-pressure handling, at-least-once delivery guarantees, and horizontal scalability.
Stage 1: Webhook Ingestion
The webhook handler receives the pull request event and immediately returns HTTP 202 Accepted. It performs HMAC-SHA256 signature verification before publishing a pr.received message to NATS JetStream.
Target response time: under 200ms — fast enough to never block CI pipelines.
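The verify-then-publish step can be sketched as follows. This is a minimal illustration, not the service's actual handler: it assumes the GitHub convention of a `sha256=<hex digest>` value in the `X-Hub-Signature-256` header, and `handle_webhook`, the injected `publish` callable, and the 401 response on a bad signature are all hypothetical.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True if the header carries the HMAC-SHA256 of the raw request body."""
    prefix = "sha256="
    if not signature_header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking the expected digest through timing.
    return hmac.compare_digest(expected.encode(), received.encode())


def handle_webhook(secret: bytes, body: bytes, signature_header: str, publish) -> int:
    """Verify the payload, enqueue it, and return the HTTP status to send back."""
    if not verify_signature(secret, body, signature_header):
        return 401  # assumption: unsigned or mis-signed payloads are rejected
    # `publish` is a stand-in for a JetStream publish call (hypothetical).
    publish("pr.received", body)
    return 202
```

The signature is computed over the raw request bytes, so the body must be read before any JSON parsing or re-serialization.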
Stage 2: Normalization
The normalizer fetches the actual diff from the VCS API, decomposes it into individual file changes, and emits one file.changed message per file. This allows downstream detectors to process files in parallel.
Each message includes: repository metadata, file path, inferred language, raw diff content, and PR context (author, branch, base commit SHA).
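A rough sketch of the per-file split, assuming the VCS API returns a git-style unified diff whose file sections begin with `diff --git` headers. The `FileChanged` dataclass, the `LANGUAGES` table, and `split_diff` are illustrative names, not the normalizer's real schema.

```python
import os
import re
from dataclasses import dataclass

# Hypothetical extension-to-language table; a real one would be much larger.
LANGUAGES = {".py": "python", ".js": "javascript", ".ts": "typescript", ".go": "go"}

# Paths containing " b/" are ambiguous in this header format and not handled here.
FILE_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


@dataclass
class FileChanged:
    repo: str
    path: str
    language: str
    diff: str
    author: str
    branch: str
    base_sha: str


def infer_language(path: str) -> str:
    """Guess the language from the file extension alone."""
    return LANGUAGES.get(os.path.splitext(path)[1], "unknown")


def split_diff(raw_diff: str, repo: str, author: str, branch: str, base_sha: str):
    """Return one FileChanged message per file section of the diff."""
    sections = []  # list of (path, lines) pairs in diff order
    for line in raw_diff.split("\n"):
        match = FILE_HEADER.match(line)
        if match:
            sections.append((match.group(2), [line]))
        elif sections:
            sections[-1][1].append(line)
    return [
        FileChanged(repo, path, infer_language(path), "\n".join(lines),
                    author, branch, base_sha)
        for path, lines in sections
    ]
```

Each returned message is self-contained, which is what lets the detectors consume them independently and in parallel.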
Stage 3: Detection (3 Layers in Parallel)
Layer 1 — Unicode Detector
Scans every added line for non-ASCII codepoints. For each finding, records:
- The Unicode codepoint (e.g., U+202E)
- The Unicode name (e.g., RIGHT-TO-LEFT OVERRIDE)
- Line number and column offset
- Character category (control character, letter-like, modifier, etc.)
- Context classification (identifier, string literal, comment, operator)
High-risk categories include: bidirectional control characters, zero-width characters, non-printing characters, and confusable lookalikes.
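The scan described above might look roughly like this. It is a sketch under stated assumptions: the input is a git-style unified diff, the `Finding` fields and the `risk` labels are invented for illustration, the two codepoint sets are small hand-picked samples rather than a complete list, and context classification (identifier, string, comment) is left out because it needs a tokenizer.

```python
import re
import unicodedata
from dataclasses import dataclass

# Illustrative subsets only, not an exhaustive list.
BIDI_CONTROLS = {0x061C, 0x200E, 0x200F, *range(0x202A, 0x202F), *range(0x2066, 0x206A)}
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)")


@dataclass
class Finding:
    codepoint: str  # e.g. "U+202E"
    name: str       # Unicode character name
    line: int       # 1-based line number in the new file
    column: int     # 0-based offset within the added line
    category: str   # Unicode general category, e.g. "Cf"
    risk: str       # hypothetical risk label


def classify(ch: str) -> str:
    if ord(ch) in BIDI_CONTROLS:
        return "bidi-control"
    if ord(ch) in ZERO_WIDTH:
        return "zero-width"
    if unicodedata.category(ch).startswith("C"):
        return "non-printing"
    return "other"


def scan_added_lines(diff: str) -> list[Finding]:
    findings = []
    line_no = None  # stays None until the first hunk header of a file
    # split("\n") rather than splitlines(): the latter also breaks on
    # U+2028/U+2029, which would hide exactly the characters we look for.
    for raw in diff.split("\n"):
        hunk = HUNK_HEADER.match(raw)
        if hunk:
            line_no = int(hunk.group(1))
            continue
        if raw.startswith("diff --git"):
            line_no = None
            continue
        if line_no is None or raw.startswith("-") or raw.startswith("\\"):
            continue  # file headers, removed lines, "\ No newline" markers
        if raw.startswith("+"):
            for column, ch in enumerate(raw[1:]):
                if ord(ch) > 0x7F:
                    findings.append(Finding(
                        f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"),
                        line_no, column, unicodedata.category(ch), classify(ch)))
        line_no += 1  # added and context lines both advance the new-file counter
    return findings
```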
Layer 2 — Homoglyph Detector
Tokenizes identifiers and string values, then compares each token against a precomputed homoglyph database covering Latin, Cyrillic, Greek, Arabic, Armenian, and CJK character ranges.
Uses a modified edit distance metric in which substituting one visually confusable character for another costs 0. For example, Latin a and Cyrillic а (U+0430) count as the same character, so a token that differs from a known identifier only by such swaps is at distance 0 from it.
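One way to realize such a metric is a standard Levenshtein distance whose substitution cost drops to 0 for confusable pairs. The sketch below assumes that approach; the `CONFUSABLES` table holds only a handful of entries for illustration, whereas the real database is described as covering several scripts, and the function names are hypothetical.

```python
# Tiny illustrative table mapping lookalike characters to a Latin "skeleton".
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(ch: str) -> str:
    """Map a character to its canonical lookalike, or to itself if unknown."""
    return CONFUSABLES.get(ch, ch)


def confusable_distance(a: str, b: str) -> int:
    """Levenshtein distance in which confusable substitutions cost 0."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = 0 if skeleton(ca) == skeleton(cb) else 1
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + substitution,  # substitute, free if confusable
            ))
        prev = curr
    return prev[-1]


def is_homoglyph_spoof(token: str, known: str) -> bool:
    """True when token looks identical to known but uses different codepoints."""
    return token != known and confusable_distance(token, known) == 0
```

A token at distance 0 from a known identifier, yet not byte-for-byte equal to it, is the suspicious case: it renders the same but resolves to a different name.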
Layer 3 — Semantic Analyzer
Parses source code into an Abstract Syntax Tree using tree-sitter (WASM build, vendored in the container image — no runtime downloads). Walks the AST looking for:
| Pattern | Languages |
|---|---|
| `eval()`, `exec()`, `Function()` | JS, TS, Python |
| Dynamic `import()` / `require()` with non-literal paths | JS, TS |
| `__import__`, `importlib.import_module` | Python |
| `os.system`, `subprocess` with `shell=True` | Python |
| Base64-encoded string literals | All |
| Hex-encoded payloads > 32 chars | All |
| `reflect.Value.Call` with external input | Go |
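To illustrate the AST walk for the Python rows of the table, the sketch below swaps in Python's standard-library `ast` module for tree-sitter so that it runs without extra dependencies. It is not the analyzer's implementation: `RISKY_CALLS` and `find_risky_calls` are hypothetical names, and it matches call names literally, so aliased imports such as `import os as o` would slip through.

```python
import ast

# Call names flagged outright (Python rows of the pattern table).
RISKY_CALLS = {"eval", "exec", "__import__", "importlib.import_module", "os.system"}


def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, or None for anything else."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def has_shell_true(call: ast.Call) -> bool:
    """True if the call passes the literal keyword argument shell=True."""
    return any(
        kw.arg == "shell"
        and isinstance(kw.value, ast.Constant)
        and kw.value.value is True
        for kw in call.keywords
    )


def find_risky_calls(source: str):
    """Return sorted (line, pattern) pairs for risky calls in the source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue
        if name in RISKY_CALLS:
            findings.append((node.lineno, name))
        elif name.startswith("subprocess.") and has_shell_true(node):
            findings.append((node.lineno, f"{name}(shell=True)"))
    return sorted(findings)
```

Working on the syntax tree rather than on raw text means a string or comment that merely mentions `eval` is not reported.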
Stage 4: Scoring
The scoring engine aggregates findings from all three detectors and assigns a final severity:
- CRITICAL: BiDi override or homoglyph in an identifier; `eval()` with dynamic input
- HIGH: Invisible character in a string literal; dynamic import with external path
- MEDIUM: Unusual Unicode in comments; base64 payload in config
- INFO: Non-ASCII in string value; encoded data within expected bounds
Multiple findings on the same file escalate severity. The aggregate determines whether the PR status check passes or fails.
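The escalation rule is not spelled out here, so the following is only one plausible reading: a file's score is its worst finding, bumped one level when the file has two or more findings, and the PR fails once any file reaches HIGH. The threshold `escalate_at`, the cut-off `fail_at`, and both function names are assumptions made for the sketch.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    INFO = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def score_file(severities: list[Severity], escalate_at: int = 2) -> Optional[Severity]:
    """Worst severity for one file, raised one level if findings pile up."""
    if not severities:
        return None
    worst = max(severities)
    if len(severities) >= escalate_at:
        # Assumed escalation rule: bump by one level, capped at CRITICAL.
        worst = Severity(min(worst + 1, Severity.CRITICAL))
    return worst


def pr_passes(per_file: dict[str, list[Severity]],
              fail_at: Severity = Severity.HIGH) -> bool:
    """The status check passes only if every file scores below fail_at."""
    scores = (score_file(found) for found in per_file.values())
    return all(score is None or score < fail_at for score in scores)
```

Under these assumptions a single MEDIUM finding passes, while two MEDIUM findings in the same file escalate to HIGH and fail the check.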
Stage 5: Audit Logging
Every event is written to the audit_events PostgreSQL table using an INSERT-only role — no UPDATE or DELETE permissions. This makes the log tamper-evident by design.
Logged events include: raw PR payload, each detector finding (with full evidence), the final score, and any alert delivery outcomes.
Stage 6: Notifications
The notification service reads from the findings.ready NATS stream and:
- Posts a GitHub or GitLab status check with a pass/fail result and summary
- Sends a Slack message (if configured) with severity, file paths, and codepoint details
- Triggers configured outbound webhooks (SIEM integration, PagerDuty, etc.)
CRITICAL findings generate an immediate alert. Lower-severity findings are batched into a PR summary comment.
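The immediate-versus-batched split can be sketched in a few lines. The finding shape (a dict with `severity` and `pr` keys) and the `route_findings` name are assumptions for illustration; actual delivery to the status check, Slack, and webhooks is out of scope here.

```python
from collections import defaultdict


def route_findings(findings):
    """Separate CRITICAL findings from those batched into per-PR summaries."""
    immediate = []
    batched = defaultdict(list)  # PR identifier -> lower-severity findings
    for finding in findings:
        if finding["severity"] == "CRITICAL":
            immediate.append(finding)  # alert right away
        else:
            batched[finding["pr"]].append(finding)  # goes into the PR summary comment
    return immediate, dict(batched)
```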