How It Works

A technical overview of the Code Corgi detection pipeline, from webhook to verdict.

Code Corgi uses a multi-stage asynchronous pipeline to analyze pull requests. Each stage is independently scalable and connected via NATS JetStream message queues.

Architecture Overview

GitHub / GitLab Webhook
        │
        ▼
 Webhook Ingestion          ← HTTP 202 in < 200ms
        │
        ▼ NATS JetStream
 Normalizer                 ← fetches diff, splits into files
        │
        ▼ NATS JetStream (fan-out)
 ┌──────┴──────┬──────────────────┐
 │             │                  │
Unicode     Homoglyph         Semantic
Detector    Detector          Analyzer
 │             │                  │
 └──────┬──────┴──────────────────┘
        │
        ▼ NATS JetStream
 Scoring Engine             ← aggregates, classifies severity
        │
        ├──▶ Audit Logger   ← append-only PostgreSQL
        │
        └──▶ Notification   ← GitHub status check, Slack, webhooks

All inter-service communication uses NATS JetStream — no direct HTTP between services. This provides back-pressure handling, at-least-once delivery guarantees, and horizontal scalability.

Stage 1: Webhook Ingestion

The webhook handler receives the pull request event and immediately returns HTTP 202 Accepted. It performs HMAC-SHA256 signature verification before publishing a pr.received message to NATS JetStream.

Target response time: under 200ms — fast enough to never block CI pipelines.
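
The signature check can be sketched as follows — a minimal example assuming a GitHub-style X-Hub-Signature-256 header; the function name and argument shapes are illustrative, not the actual handler:

```python
import hashlib
import hmac

def verify_signature(payload: bytes, secret: bytes, signature_header: str) -> bool:
    """Verify a GitHub-style X-Hub-Signature-256 header against the raw body.

    GitHub sends "sha256=<hexdigest>". We recompute the HMAC-SHA256 over the
    raw request body and compare in constant time to avoid timing attacks.
    """
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

Constant-time comparison (hmac.compare_digest) matters here: a naive == comparison leaks how many leading bytes matched, which an attacker can exploit to forge signatures byte by byte.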

Stage 2: Normalization

The normalizer fetches the actual diff from the VCS API, decomposes it into individual file changes, and emits one file.changed message per file. This allows downstream detectors to process files in parallel.

Each message includes: repository metadata, file path, inferred language, raw diff content, and PR context (author, branch, base commit SHA).
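
A file.changed payload with those fields might look like this sketch — the class and field names are illustrative, not the actual wire schema:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class FileChanged:
    """Illustrative shape of a file.changed message (field names assumed)."""
    repo: str          # repository metadata
    pr_number: int
    path: str          # file path
    language: str      # inferred language
    diff: str          # raw diff content
    author: str        # PR context
    branch: str
    base_sha: str      # base commit SHA

    def to_message(self) -> bytes:
        """Serialize for publication on the file.changed subject."""
        return json.dumps(asdict(self)).encode()
```

Emitting one such message per file is what lets the three detectors fan out and process a large PR in parallel.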

Stage 3: Detection (3 Layers in Parallel)

Layer 1 — Unicode Detector

Scans every added line for non-ASCII codepoints. For each finding, records:

  • The Unicode codepoint (e.g., U+202E)
  • The Unicode name (e.g., RIGHT-TO-LEFT OVERRIDE)
  • Line number and column offset
  • Character category (control character, letter-like, modifier, etc.)
  • Context classification (identifier, string literal, comment, operator)

High-risk categories include: bidirectional control characters, zero-width characters, non-printing characters, and confusable lookalikes.
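
A minimal version of this scan using Python's unicodedata module — the high-risk set here is a simplification of the categories above, and the finding shape is illustrative:

```python
import unicodedata

# Simplified: Cf covers format characters (BiDi controls, zero-width chars),
# Cc covers control characters. The real detector tracks finer categories.
HIGH_RISK_CATEGORIES = {"Cf", "Cc"}

def scan_line(line: str, line_no: int) -> list[dict]:
    """Record every non-ASCII codepoint on an added line."""
    findings = []
    for col, ch in enumerate(line):
        if ord(ch) < 128:
            continue
        category = unicodedata.category(ch)
        findings.append({
            "codepoint": f"U+{ord(ch):04X}",
            "name": unicodedata.name(ch, "<unnamed>"),
            "line": line_no,
            "col": col,
            "category": category,
            "high_risk": category in HIGH_RISK_CATEGORIES,
        })
    return findings
```

Running this on a line containing U+202E produces a finding with name RIGHT-TO-LEFT OVERRIDE and category Cf, which is exactly the BiDi-control case that motivated the detector.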

Layer 2 — Homoglyph Detector

Tokenizes identifiers and string values, then compares each token against a precomputed homoglyph database covering Latin, Cyrillic, Greek, Arabic, Armenian, and CJK character ranges.

Uses a modified edit distance metric that treats visually confusable character pairs as distance-0. For example, Latin a and Cyrillic а (U+0430) are considered identical substitutions.
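
This can be sketched as a Levenshtein variant whose substitution cost is zero for confusable pairs — the confusables table below is a tiny illustrative excerpt, not the real precomputed database:

```python
# Tiny excerpt of a confusables table. The real database covers Latin,
# Cyrillic, Greek, Arabic, Armenian, and CJK ranges.
CONFUSABLE = {("a", "\u0430"), ("e", "\u0435"), ("o", "\u043e"), ("p", "\u0440")}

def confusable(x: str, y: str) -> bool:
    return x == y or (x, y) in CONFUSABLE or (y, x) in CONFUSABLE

def homoglyph_distance(s: str, t: str) -> int:
    """Levenshtein distance where visually confusable pairs cost 0."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            sub_cost = prev[j - 1] + (0 if confusable(cs, ct) else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, sub_cost))
        prev = cur
    return prev[-1]
```

Under this metric, "paypal" and "pаypаl" (with Cyrillic а) are distance 0 — the signal that an identifier has been spoofed.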

Layer 3 — Semantic Analyzer

Parses source code into an Abstract Syntax Tree using tree-sitter (WASM build, vendored in the container image — no runtime downloads). Walks the AST looking for:

  Pattern                                               Languages
  ────────────────────────────────────────────────────  ──────────────
  eval(), exec(), Function()                            JS, TS, Python
  Dynamic import() / require() with non-literal paths   JS, TS
  __import__, importlib.import_module                   Python
  os.system, subprocess with shell=True                 Python
  Base64-encoded string literals                        All
  Hex-encoded payloads > 32 chars                       All
  reflect.Value.Call with external input                Go
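
For illustration, the same idea expressed with Python's stdlib ast module rather than tree-sitter — the function and set names are hypothetical, and the real analyzer walks tree-sitter ASTs across all supported languages:

```python
import ast

DANGEROUS_CALLS = {"eval", "exec", "__import__"}

def find_dynamic_execution(source: str) -> list[tuple[str, int, bool]]:
    """Flag calls to eval/exec/__import__, noting whether any argument
    is non-literal (a dynamic input rather than a constant)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            dynamic = any(not isinstance(a, ast.Constant) for a in node.args)
            findings.append((node.func.id, node.lineno, dynamic))
    return findings
```

The literal-vs-dynamic distinction is what separates a CRITICAL finding (eval with external input) from mere noise (eval of a constant string), matching the severity rules in the next stage.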

Stage 4: Scoring

The scoring engine aggregates findings from all three detectors and assigns a final severity:

  • CRITICAL — BiDi override or homoglyph in an identifier; eval() with dynamic input
  • HIGH — Invisible character in a string literal; dynamic import with external path
  • MEDIUM — Unusual Unicode in comments; base64 payload in config
  • INFO — Non-ASCII in string value; encoded data within expected bounds

Multiple findings on the same file escalate severity. The aggregate determines whether the PR status check passes or fails.

Stage 5: Audit Logging

Every event is written to the audit_events PostgreSQL table using an INSERT-only role — no UPDATE or DELETE permissions. This makes the log tamper-evident by design.

Logged events include: raw PR payload, each detector finding (with full evidence), the final score, and any alert delivery outcomes.

Stage 6: Notifications

The notification service reads from the findings.ready NATS stream and:

  1. Posts a GitHub or GitLab status check with a pass/fail result and summary
  2. Sends a Slack message (if configured) with severity, file paths, and codepoint details
  3. Triggers configured outbound webhooks (SIEM integration, PagerDuty, etc.)

CRITICAL findings generate an immediate alert. Lower-severity findings are batched into a PR summary comment.
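
That split between immediate alerts and the batched summary can be sketched as (the structure is illustrative):

```python
def route(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split findings into immediate alerts (CRITICAL) and the batch
    destined for the PR summary comment (everything else)."""
    immediate, batched = [], []
    for f in findings:
        (immediate if f["severity"] == "CRITICAL" else batched).append(f)
    return immediate, batched
```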