System Architecture

How the three cascade layers connect on Azure

Published

April 30, 2026

Design philosophy

The architecture is shaped by three constraints, in priority order:

  1. Cost. Minimise Azure OpenAI tokens spent per production frame.
  2. Latency. The common case (no defect) must complete in tens of milliseconds.
  3. Recall over precision at every gate. A missed defect on the line is expensive; a false positive is just a few extra ms of compute downstream.

Layer 1 — Gatekeeper (Convolutional Autoencoder)

The autoencoder is trained only on defect-free metal — the union of KSDD2 train normals and Severstal normals (~7,500 images at 256×256). When shown clean metal, it reconstructs the input with low MSE; when shown something it has never seen as “normal” — pitting, inclusion, scratches — the reconstruction degrades and MSE spikes.

flowchart LR
    IMG[Input 256×256×3]
    DOMAIN{domain hint}
    ENC["Encoder<br/>4× Conv2d ↓"]
    LAT["Latent 256×16×16"]
    DEC["Decoder<br/>4× ConvTranspose2d ↑"]
    RECON["Reconstruction"]
    MSE["MSE error"]
    GATE_K{"MSE > τ_ksdd2?"}
    GATE_S{"MSE > τ_severstal?"}
    DISCARD([No defect])
    PASS([Pass to Layer 2])

    IMG --> ENC --> LAT --> DEC --> RECON
    IMG --> MSE
    RECON --> MSE
    DOMAIN -- ksdd2 --> GATE_K
    DOMAIN -- severstal --> GATE_S
    MSE --> GATE_K
    MSE --> GATE_S
    GATE_K -- No --> DISCARD
    GATE_S -- No --> DISCARD
    GATE_K -- Yes --> PASS
    GATE_S -- Yes --> PASS

    style DISCARD fill:#dcfce7
    style PASS fill:#fef9c3
Figure 1: Layer 1: per-domain MSE thresholding

Per-domain thresholds are necessary because KSDD2 and Severstal have different intrinsic reconstruction-error scales. Both τ_ksdd2 and τ_severstal are derived independently as mean + 3σ of the AE’s MSE on the domain’s held-out normals. See Data Strategy for why this matters.
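The mean + 3σ rule above can be sketched in a few lines. This is a minimal illustration, not the project's training code: the held-out MSE values are invented, the helper names `derive_threshold` and `passes_gate` are hypothetical, and the choice of population standard deviation is an assumption.

```python
import statistics

def derive_threshold(normal_mse: list[float], k: float = 3.0) -> float:
    """Return mean + k * sigma of reconstruction errors on held-out normals."""
    mu = statistics.fmean(normal_mse)
    # Population std is an assumption; a sample std would be an equally valid choice.
    sigma = statistics.pstdev(normal_mse)
    return mu + k * sigma

# Hypothetical held-out MSE values per domain (illustrative numbers only).
held_out = {
    "ksdd2": [0.010, 0.012, 0.011, 0.009, 0.013],
    "severstal": [0.030, 0.042, 0.035, 0.028, 0.040],
}

# Each domain gets its own threshold, derived independently.
thresholds = {domain: derive_threshold(errs) for domain, errs in held_out.items()}

def passes_gate(mse: float, domain: str) -> bool:
    """True when the frame's MSE exceeds its domain threshold (pass to Layer 2)."""
    return mse > thresholds[domain]
```

With these made-up numbers, an MSE of 0.02 would be flagged on KSDD2 but discarded on Severstal, which is the scale mismatch that a single shared threshold cannot handle.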


Layer 2 — Specialist (YOLOv8n on Severstal)

YOLOv8n is trained purely on Severstal’s four defect classes (pitting, inclusion, scratch, patch). RLE masks from the Kaggle release are decoded to per-class binary masks, then converted to bounding boxes via cv2.connectedComponentsWithStats (drop components below 16 px²).
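The mask-to-box conversion can be illustrated without OpenCV. The sketch below decodes Severstal-style RLE (1-indexed `start length` pairs, numbered top to bottom and then left to right) and uses a plain breadth-first search as a stand-in for cv2.connectedComponentsWithStats with its default 8-connectivity. The function names and the tiny 8×10 example are hypothetical; the real pipeline uses the OpenCV call named above.

```python
from collections import deque

def rle_decode(rle: str, height: int, width: int) -> list[list[int]]:
    """Decode 'start length ...' pairs (1-indexed, column-major) into a binary mask."""
    mask = [[0] * width for _ in range(height)]
    nums = [int(tok) for tok in rle.split()]
    for start, length in zip(nums[0::2], nums[1::2]):
        for idx in range(start - 1, start - 1 + length):
            col, row = divmod(idx, height)  # pixels run down each column first
            mask[row][col] = 1
    return mask

def mask_to_boxes(mask: list[list[int]], min_area: int = 16) -> list[tuple[int, int, int, int]]:
    """Return (x, y, w, h) per 8-connected component with at least min_area pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            ys, xs = [], []
            while queue:
                cy, cx = queue.popleft()
                ys.append(cy)
                xs.append(cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(ys) >= min_area:  # drop components below the area floor
                boxes.append((min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1))
    return boxes

# A 4x4 blob (area 16, kept) plus a single-pixel speck (dropped).
demo_mask = rle_decode("11 4 19 4 27 4 35 4 65 1", height=8, width=10)
demo_boxes = mask_to_boxes(demo_mask)
```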

flowchart LR
    IN[Flagged frame from L1]
    YOLO["YOLOv8n<br/>4-class detector"]
    BOXES["Predicted bboxes<br/>+ class + confidence"]
    BEST["Top detection"]
    GATE{conf ≥ 0.85?}
    LOG([Log defect: class + bbox])
    ESC([Escalate to Layer 3])

    IN --> YOLO --> BOXES --> BEST --> GATE
    GATE -- Yes --> LOG
    GATE -- No --> ESC

    style LOG fill:#dcfce7
    style ESC fill:#fee2e2
Figure 2: Layer 2: YOLOv8n inference and confidence routing

KSDD2 defects are intentionally out-of-distribution for YOLO. Track C of the evaluation measures how badly a Severstal-trained detector misses commutator defects — the result motivates why Layer 3 exists.
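The routing in Figure 2 amounts to a single comparison on the top detection. The sketch below assumes a hypothetical `Detection` record and also assumes that a frame with no detections at all is escalated, which the source does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional

CONF_THRESHOLD = 0.85  # gate value from Figure 2

@dataclass
class Detection:
    cls: str
    conf: float
    bbox: tuple  # (x, y, w, h); layout is an assumption

def route(detections: list) -> tuple:
    """Return ("log" | "escalate", top detection or None)."""
    if not detections:
        # Assumption: L1 flagged the frame but YOLO saw nothing, so ask Layer 3.
        return "escalate", None
    best: Optional[Detection] = max(detections, key=lambda d: d.conf)
    action = "log" if best.conf >= CONF_THRESHOLD else "escalate"
    return action, best

confident = route([Detection("scratch", 0.93, (10, 20, 30, 8)),
                   Detection("patch", 0.40, (0, 0, 5, 5))])
uncertain = route([Detection("inclusion", 0.61, (4, 4, 12, 12))])
empty = route([])
```

Under this reading, an out-of-distribution KSDD2 frame would typically produce either no boxes or low-confidence ones, so it falls through to Layer 3 in both cases.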


Layer 3 — Oracle (Azure OpenAI vision)

Only frames where YOLO is uncertain (or KSDD2-style frames YOLO doesn’t know how to classify) reach this layer. The Oracle uses a 5-shot prompt — one reference image per Severstal class plus one KSDD2 anomaly — and returns a Pydantic-validated JSON object via client.beta.chat.completions.parse().

sequenceDiagram
    participant L2 as Layer 2 (YOLO)
    participant L3 as Layer 3 (Oracle)
    participant AZ as Azure OpenAI<br/>gpt-4.1-mini
    participant DB as Logs

    L2->>L3: POST /predict (image, low-conf)
    L3->>L3: Build 5-shot prompt
    L3->>AZ: parse(image, schema=DefectPrediction)
    AZ-->>L3: {defect_class, confidence, reasoning}
    L3->>DB: log + add to retrain queue
    L3-->>L2: structured response
Figure 3: Layer 3: Azure OpenAI vision with structured outputs

The Pydantic schema is enforced at the API call — malformed model output raises ValidationError rather than corrupting the label store.
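A schema along these lines would give that behaviour. The field names come from the sequence diagram; the label set (four Severstal classes plus a `ksdd2_anomaly` bucket) and the `parse_prediction` helper are assumptions for illustration. The sketch validates a raw JSON string locally rather than calling Azure, so it shows only the validation step.

```python
import json
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class DefectPrediction(BaseModel):
    # Label set is an assumption: four Severstal classes plus a hypothetical KSDD2 bucket.
    defect_class: Literal["pitting", "inclusion", "scratch", "patch", "ksdd2_anomaly"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

def parse_prediction(raw_json: str):
    """Return a validated DefectPrediction, or None when the payload is malformed."""
    try:
        return DefectPrediction(**json.loads(raw_json))
    except ValidationError:
        # A rejected payload never reaches the label store.
        return None

good = parse_prediction(
    '{"defect_class": "scratch", "confidence": 0.62, "reasoning": "thin linear mark"}'
)
bad = parse_prediction(
    '{"defect_class": "rust", "confidence": 1.7, "reasoning": ""}'
)
```

In the deployed path the same model class would be handed to client.beta.chat.completions.parse() as the response format, so validation happens as part of the API call rather than in a separate helper.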


Azure topology

flowchart TB
    subgraph INGEST["Ingestion"]
        CAM[Factory camera<br/>RTSP stream]
        UPLOADER[Frame uploader]
    end

    subgraph BLOB["Azure Blob Storage<br/>cascadedev6ya7a3px"]
        RAW[raw/]
        MODELS[models/<br/>autoencoder_metal/<br/>yolo_metal/]
        LOGS[logs/anomalies/]
    end

    subgraph SB["Service Bus<br/>(decoupled queue)"]
        QUEUE[defect-queue]
    end

    subgraph ACA["Container Apps<br/>cascade-dev-aca-env"]
        ROUTER["cascade-router<br/>(public ingress)"]
        L1APP["cascade-l1-ae<br/>min=0 max=5"]
        L2APP["cascade-l2-yolo<br/>gpu-t4 profile"]
        L3APP["cascade-l3-oracle"]
    end

    subgraph OPENAI["Azure OpenAI<br/>cascade-dev-aoai"]
        GPT["gpt-4.1-mini deployment"]
    end

    subgraph ACR["Azure Container Registry<br/>cascadedevacr"]
        IMG[cascade-base, layer1, layer2, layer3, router]
    end

    CAM --> UPLOADER --> RAW
    RAW --> ROUTER
    ROUTER --> L1APP
    L1APP -- "MSE > τ" --> ROUTER
    ROUTER --> L2APP
    L2APP -- "conf < 0.85" --> ROUTER
    ROUTER --> L3APP
    L3APP --> GPT
    L3APP --> LOGS
    L2APP --> LOGS
    MODELS --> L1APP
    MODELS --> L2APP
    ACR --> ACA

    style ACA fill:#dbeafe,stroke:#3b82f6
    style SB fill:#fef9c3,stroke:#eab308
    style OPENAI fill:#f3e8ff,stroke:#a855f7
Figure 4: Deployed Azure infrastructure (West Europe, RG cascade-dev-rg)

The router is the single orchestrator — Layer 2 never escalates to Layer 3 directly, which keeps the call graph linear and easy to trace.
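A minimal sketch of that orchestration, with each layer passed in as a plain callable. The function signature, the tuple-shaped Layer 2 result, and the stub layers are all hypothetical; the real router talks to the three Container Apps over HTTP.

```python
def run_cascade(frame, domain, layer1, layer2, layer3, conf_threshold=0.85):
    """Call the layers in order from one place; return (verdict, trace of layers hit)."""
    trace = ["l1"]
    if not layer1(frame, domain):          # MSE at or below the domain threshold
        return "no_defect", trace
    trace.append("l2")
    detection = layer2(frame)              # assumed shape: (class, confidence) or None
    if detection is not None and detection[1] >= conf_threshold:
        return detection[0], trace
    trace.append("l3")                     # only the router ever reaches Layer 3
    return layer3(frame), trace

# Stub layers standing in for the deployed services.
flagged = lambda frame, domain: True
clean = lambda frame, domain: False
sure = lambda frame: ("scratch", 0.93)
unsure = lambda frame: ("patch", 0.41)
oracle = lambda frame: "inclusion"

result_clean = run_cascade(b"", "severstal", clean, sure, oracle)
result_sure = run_cascade(b"", "severstal", flagged, sure, oracle)
result_unsure = run_cascade(b"", "ksdd2", flagged, unsure, oracle)
```

Because every hop goes through one function, the trace list is always a prefix of l1, l2, l3, which is what makes the call graph easy to follow in logs.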


Deployment notes

Step | Notes
Resource group | cascade-dev-rg in westeurope (already provisioned via infra/main.bicep).
GPU | ACA gpu-t4 workload profile; falls back to Consumption (CPU) if quota lapses.
Model weights | Pulled from Blob on container startup via BLOB_ACCOUNT + *_MODEL_BLOB env vars, so retraining never requires a new image push.
Threshold updates | mseThreshold parameter on the L1 Bicep module; per-domain values land as MSE_THRESHOLD_KSDD2 / MSE_THRESHOLD_SEVERSTAL env vars.
Cold start | First request after scale-to-zero pays a ~10–30 s penalty (image pull + weight load). Excluded from the latency numbers reported on the Evaluation page.
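For the threshold row, the container side might read the two environment variables roughly as follows. The `load_thresholds` helper and the fail-fast behaviour on a missing variable are assumptions, not something the deployment notes specify.

```python
import os

ENV_NAMES = {
    "ksdd2": "MSE_THRESHOLD_KSDD2",
    "severstal": "MSE_THRESHOLD_SEVERSTAL",
}

def load_thresholds(env=os.environ) -> dict:
    """Read per-domain MSE thresholds from the environment as floats."""
    missing = [name for name in ENV_NAMES.values() if name not in env]
    if missing:
        # Assumption: refusing to start is safer than gating with a default.
        raise RuntimeError(f"missing threshold env vars: {missing}")
    return {domain: float(env[name]) for domain, name in ENV_NAMES.items()}

# Example with an explicit mapping instead of the real process environment.
example = load_thresholds({
    "MSE_THRESHOLD_KSDD2": "0.015",
    "MSE_THRESHOLD_SEVERSTAL": "0.05",
})
```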
Warning: GPU quota reality

Modern GPU SKUs (T4, A10, A100, H100) currently have zero VM quota on this subscription. ACA’s gpu-t4 workload profile sidesteps the VM quota entirely, which is why all training and inference runs through Container Apps Jobs rather than a managed compute cluster.

Why these specific choices?

A reviewer’s-eye view of the eight non-obvious decisions in this build, each phrased as the alternative I rejected and why.

Layer 1 — PatchCore-lite, not anomalib’s full PatchCore

Rejected: anomalib’s PatchCore (WideResNet50 backbone, greedy-coreset sub-sampling, Mahalanobis scoring, ~250 ms / image on CPU).
Chose: ResNet18 backbone (~11M params), random 10% coreset, plain cosine kNN. ~50 ms on CPU; ~11 ms via ONNX.
Why: The L1 latency budget is 100 ms p50, and the random coreset is statistically equivalent to the greedy version for our 6k-image validation pool while being 50× faster to build. The accuracy cost is zero on KSDD2 (Δμ +8.18σ vs +6σ in the paper) and near-zero on Severstal (the hard case is hard for both).
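The "random coreset plus cosine kNN" idea can be sketched with toy vectors. This is an illustration under assumptions: the three-dimensional features stand in for ResNet18 patch embeddings, the helper names are hypothetical, and only a per-patch 1-nearest-neighbour score is shown (how patch scores are pooled into an image score is not covered here).

```python
import math
import random

def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_coreset(features, fraction=0.10, seed=0):
    """Keep a uniformly random subset of the normal-patch features."""
    rng = random.Random(seed)
    k = max(1, round(len(features) * fraction))
    return [l2_normalise(f) for f in rng.sample(features, k)]

def anomaly_score(patch, coreset):
    """Cosine distance from the patch to its nearest coreset member."""
    q = l2_normalise(patch)
    best_sim = max(sum(a * b for a, b in zip(q, m)) for m in coreset)
    return 1.0 - best_sim

# Toy "normal" features clustered around the first axis.
_rng = random.Random(1)
normals = [[1.0, _rng.uniform(-0.1, 0.1), _rng.uniform(-0.1, 0.1)] for _ in range(200)]
coreset = build_coreset(normals)          # 10% of 200 = 20 vectors

normal_score = anomaly_score([1.0, 0.05, -0.02], coreset)
odd_score = anomaly_score([0.0, 1.0, 0.0], coreset)
```

Building the coreset is a single random sample, which is where the build-time saving over greedy selection comes from.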

Layer 2 — YOLOv8n single-shot, not YOLOv8m or two-stage

Rejected: YOLOv8m (mAP +5–8 points, ~3× compute) and Faster-RCNN (another +5 points, ~10×).
Chose: YOLOv8n at 640 px, 50 epochs on a single T4.
Why: Layer 2 is a throughput component, not the system’s accuracy ceiling; Layer 3 is. Spending the next dollar on a bigger Layer 2 buys ~3 mAP at 3× the cost; spending it on better Layer 1 calibration moves the cascade by ~7 F1 points (see Phase K). A 3M-param backbone also exports cleanly to ONNX for edge deployment.

Layer 3 — gpt-4.1-mini, not gpt-4o

Rejected: gpt-4o ($2.50 in / $10.00 out per 1M tokens).
Chose: gpt-4.1-mini ($0.40 / $1.60).
Why: A roughly 6× cost reduction (6.25× on both input and output prices) at no measurable quality loss for this task. Severstal class labels are a closed 4-way set with strong visual cues, not a free-form description. Validated on a 60-frame sample: agreement with gpt-4o was 58/60.

Cache — dHash with 6-bit Hamming radius, not CLIP embeddings

Rejected: CLIP-image embeddings + cosine similarity (the “right” way to build a perceptual cache).
Chose: 64-bit difference-hash with a Hamming-distance ≤ 6 radius.
Why: Severstal coil frames within a roll are near-duplicates (same physical strip moving under the camera at frame rate). dHash is designed for exactly this: pixel-level near-identity. CLIP would generalise too much (different rolls of the same defect class would collide). The cache ships at a 47% hit rate on Track C and contributes 0% to false positives.
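A difference hash with a Hamming-radius lookup fits in a few lines. The sketch assumes each frame has already been reduced to a 9-wide by 8-tall grayscale thumbnail upstream; the `NearDuplicateCache` class and its linear scan are hypothetical stand-ins for whatever store the build actually uses.

```python
def dhash(gray):
    """64-bit difference hash of a 9-wide x 8-tall grayscale thumbnail."""
    assert len(gray) == 8 and all(len(row) == 9 for row in gray)
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):   # 8 comparisons per row
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

class NearDuplicateCache:
    def __init__(self, radius=6):
        self.radius = radius
        self.entries = []                       # list of (hash, verdict)

    def get(self, frame_hash):
        for stored, verdict in self.entries:
            if hamming(stored, frame_hash) <= self.radius:
                return verdict
        return None

    def put(self, frame_hash, verdict):
        self.entries.append((frame_hash, verdict))

# Frame A: brightness falls left to right. Frame B: same with one noisy pixel.
frame_a = [[200 - 10 * x + y for x in range(9)] for y in range(8)]
frame_b = [row[:] for row in frame_a]
frame_b[0][4] = 255
# Frame C: brightness rises left to right, so every gradient bit flips.
frame_c = [[10 * x + y for x in range(9)] for y in range(8)]

cache = NearDuplicateCache()
cache.put(dhash(frame_a), "scratch")
```

Here frame B differs from frame A by one bit and hits the cache, while frame C differs in all 64 bits and misses.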

Calibration — per-domain τ, not a single τ

Rejected: A single z-threshold across both metal domains.
Chose: {severstal: -0.5, ksdd2: +1.0}, persisted into the model summary.
Why: PatchCore’s score scale depends on the texture statistics of the normal pool. KSDD2’s per-patch features cluster tightly; Severstal’s spread out. A single τ either over-escalates KSDD2 (kills the cost story) or under-escalates Severstal (kills recall). The Pareto curves on the evaluation page make the asymmetry explicit.
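To make the asymmetry concrete, here is a small sketch of a per-domain z-score gate. The two τ values are the ones quoted above; everything else is assumed for illustration, including the definition of z (raw score standardised against that domain's normal pool), the normal-pool statistics, and the rule that a frame escalates when z exceeds τ.

```python
Z_THRESHOLDS = {"severstal": -0.5, "ksdd2": 1.0}   # values quoted in the text

# Hypothetical (mean, std) of raw anomaly scores on each domain's normal pool.
NORMAL_STATS = {"severstal": (0.40, 0.10), "ksdd2": (0.20, 0.05)}

def z_score(raw_score, domain):
    mean, std = NORMAL_STATS[domain]
    return (raw_score - mean) / std

def should_escalate(raw_score, domain, thresholds=Z_THRESHOLDS):
    """Assumed gate: escalate when the standardised score exceeds the domain's tau."""
    return z_score(raw_score, domain) > thresholds[domain]

# A borderline Severstal frame (z is about -0.3) escalates under the per-domain tau...
per_domain = should_escalate(0.37, "severstal")
# ...but would be dropped if the KSDD2 value were applied to both domains.
single_tau = should_escalate(0.37, "severstal", {"severstal": 1.0, "ksdd2": 1.0})
# A mildly elevated KSDD2 frame (z is about 0.6) stays below its stricter tau.
ksdd2_case = should_escalate(0.23, "ksdd2")
```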

Confidence intervals — bootstrap, not analytical Wilson / Clopper-Pearson

Rejected: Wilson score interval on F1 (analytical, but it would assume F1 is binomially distributed).
Chose: 1,000-resample non-parametric bootstrap.
Why: F1 is a ratio of sums of two correlated counts; its sampling distribution at n=180 is not analytically tractable. The bootstrap makes no distributional assumption and reuses the same code path for F1, precision, and recall. It also needs zero dependencies: it is implemented in stats.py in ~30 lines.
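A dependency-free percentile bootstrap looks roughly like this. It is a sketch, not the contents of stats.py: the function names, the percentile method, the fixed seed, and the 180 synthetic label pairs are all assumptions.

```python
import random

def f1_score(pairs):
    """pairs: iterable of (truth, prediction) with truthy = defect."""
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(pairs, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample images with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        metric([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic evaluation set of 180 images: 80 TP, 10 FP, 10 FN, 80 TN.
pairs = [(1, 1)] * 80 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(0, 0)] * 80
point = f1_score(pairs)
lo, hi = bootstrap_ci(pairs, f1_score)
```

Swapping `f1_score` for a precision or recall function reuses `bootstrap_ci` unchanged, which is the "same code path" benefit mentioned above.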

System comparison — McNemar, not paired-t

Rejected: Paired t-test on per-image correct/incorrect indicators.
Chose: McNemar’s test, using the exact binomial form when n < 25 and the chi-square form otherwise.
Why: Per-image correctness is binary, not continuous; the paired-t assumes a normal difference distribution that doesn’t exist here. McNemar conditions on the discordant pairs only (cascade-correct + oracle-wrong vs the converse), which is the actual quantity of interest: “in the cases where the two systems disagree, who’s right more often?” Result on this build: 34 cascade-only-correct vs 10 oracle-only-correct, p = 0.0005.
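The test itself needs only the two discordant counts. The sketch below assumes that n refers to the number of discordant pairs and that the chi-square branch applies a continuity correction; neither detail is stated in the text, and the `mcnemar` function name is hypothetical.

```python
import math

def mcnemar(b: int, c: int) -> float:
    """Two-sided p-value from discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    if n < 25:
        # Exact binomial under H0: each discordant pair is a fair coin flip.
        k = min(b, c)
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)
    # Chi-square with 1 degree of freedom and continuity correction (assumed).
    stat = (abs(b - c) - 1) ** 2 / n
    # Survival function of chi-square(1): P(X > s) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(stat / 2))

p_build = mcnemar(34, 10)    # the discordant counts reported for this build
p_small = mcnemar(5, 5)      # small, balanced case takes the exact branch
```

With 34 versus 10 discordant pairs the chi-square branch applies (44 ≥ 25), and under the continuity-correction assumption the p-value lands close to the reported 0.0005.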

Robustness — per-image stability, not polarity preservation

Rejected: “Defective z-scores stay above normal z-scores under every perturbation” (the obvious test).
Chose: “Each image’s perturbed z-score stays within ±5σ of its clean z-score and remains finite.”
Why: Severstal’s gate has Δμ +0.52σ, so at n=3 the polarity test fails on clean images, let alone perturbed ones. The stability test catches what we actually care about (silent breakage in I/O, normalisation, or backbone weights) without making a false claim about Severstal separability.
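The chosen check is easy to state in code. This sketch assumes z-scores are already expressed in σ units, so "within ±5σ" becomes an absolute difference of at most 5; the function names, image ids, and perturbation names are hypothetical.

```python
import math

def is_stable(clean_z: float, perturbed_z: float, tolerance: float = 5.0) -> bool:
    """Perturbed score must be finite and within the tolerance of the clean score."""
    return math.isfinite(perturbed_z) and abs(perturbed_z - clean_z) <= tolerance

def stability_failures(clean: dict, perturbed: dict) -> list:
    """Return (perturbation, image_id) pairs that violate the stability rule."""
    return [
        (name, image_id)
        for name, scores in perturbed.items()
        for image_id, z in scores.items()
        if not is_stable(clean[image_id], z)
    ]

clean_scores = {"img_a": 0.4, "img_b": 2.1, "img_c": -0.3}
perturbed_scores = {
    "jpeg_q50": {"img_a": 0.9, "img_b": 1.7, "img_c": -0.1},           # small drift: fine
    "brightness": {"img_a": 0.2, "img_b": float("nan"), "img_c": 9.0}, # NaN and a large jump
}
failures = stability_failures(clean_scores, perturbed_scores)
```

Note that nothing here compares defective scores against normal ones, so the check passes or fails independently of how separable a domain is.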