flowchart LR
IMG[Input 256×256×3]
DOMAIN{domain hint}
ENC["Encoder<br/>4× Conv2d ↓"]
LAT["Latent 256×16×16"]
DEC["Decoder<br/>4× ConvTranspose2d ↑"]
RECON["Reconstruction"]
MSE["MSE error"]
GATE_K{"MSE > τ_ksdd2?"}
GATE_S{"MSE > τ_severstal?"}
DISCARD([No defect])
PASS([Pass to Layer 2])
IMG --> ENC --> LAT --> DEC --> RECON
IMG --> MSE
RECON --> MSE
DOMAIN -- ksdd2 --> GATE_K
DOMAIN -- severstal --> GATE_S
MSE --> GATE_K
MSE --> GATE_S
GATE_K -- No --> DISCARD
GATE_S -- No --> DISCARD
GATE_K -- Yes --> PASS
GATE_S -- Yes --> PASS
style DISCARD fill:#dcfce7
style PASS fill:#fef9c3
System Architecture
How the three cascade layers connect on Azure
Design philosophy
The architecture is shaped by three constraints, in priority order:
- Cost. Minimise Azure OpenAI tokens spent per production frame.
- Latency. The common case (no defect) must complete in tens of milliseconds.
- Recall over precision at every gate. A missed defect on the line is expensive; a false positive is just a few extra ms of compute downstream.
Layer 1 — Gatekeeper (Convolutional Autoencoder)
The autoencoder is trained only on defect-free metal — the union of KSDD2 train normals and Severstal normals (~7,500 images at 256×256). When shown clean metal, it reconstructs the input with low MSE; when shown something it has never seen as “normal” — pitting, inclusion, scratches — the reconstruction degrades and MSE spikes.
Per-domain thresholds are necessary because KSDD2 and Severstal have different intrinsic reconstruction-error scales. Both τ_ksdd2 and τ_severstal are derived independently as mean + 3σ of the AE’s MSE on the domain’s held-out normals. See Data Strategy for why this matters.
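The threshold rule above can be sketched in a few lines. This is a minimal illustration, not the project's calibration code: the function names and the sample MSE values are made up, and it assumes the population standard deviation is used.

```python
import statistics

def fit_threshold(normal_mse, k=3.0):
    """Return mean + k * sigma of reconstruction errors on held-out normals."""
    mu = statistics.fmean(normal_mse)
    sigma = statistics.pstdev(normal_mse)  # assumption: population std
    return mu + k * sigma

def passes_to_layer2(mse, domain, thresholds):
    """True when the frame's MSE exceeds its own domain's threshold."""
    return mse > thresholds[domain]

# Illustrative numbers only, not measured values from this build.
held_out = {
    "ksdd2": [0.010, 0.012, 0.011, 0.009, 0.013],
    "severstal": [0.030, 0.041, 0.035, 0.028, 0.046],
}
thresholds = {domain: fit_threshold(v) for domain, v in held_out.items()}

# The same MSE is anomalous in one domain and unremarkable in the other.
flag_ksdd2 = passes_to_layer2(0.02, "ksdd2", thresholds)
flag_severstal = passes_to_layer2(0.02, "severstal", thresholds)
```

With these toy values an MSE of 0.02 is flagged for KSDD2 but not for Severstal, which is the failure a single shared threshold would cause.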
Layer 2 — Specialist (YOLOv8n on Severstal)
YOLOv8n is trained purely on Severstal’s four defect classes (pitting, inclusion, scratch, patch). RLE masks from the Kaggle release are decoded to per-class binary masks, then converted to bounding boxes via cv2.connectedComponentsWithStats (drop components below 16 px²).
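The mask-to-box conversion can be illustrated without OpenCV. The sketch below is a dependency-free stand-in, not the project's code: it assumes the Kaggle RLE convention of 1-indexed, column-major `start length` pairs, and it replaces cv2.connectedComponentsWithStats with a small breadth-first search using 8-connectivity (OpenCV's default). The function names are hypothetical.

```python
from collections import deque

def rle_decode(rle, height, width):
    """Decode 'start length ...' pairs (1-indexed, column-major) to a binary mask."""
    mask = [[0] * width for _ in range(height)]
    nums = [int(tok) for tok in rle.split()]
    for start, length in zip(nums[0::2], nums[1::2]):
        for pos in range(start - 1, start - 1 + length):
            col, row = divmod(pos, height)  # pixels run down each column first
            mask[row][col] = 1
    return mask

def mask_to_boxes(mask, min_area=16):
    """Return (x, y, w, h) for each 8-connected component with area >= min_area."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue, rows, cols = deque([(r0, c0)]), [], []
            while queue:
                r, c = queue.popleft()
                rows.append(r)
                cols.append(c)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < h and 0 <= nc < w
                                and mask[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            if len(rows) >= min_area:  # drop specks below the area cut-off
                boxes.append((min(cols), min(rows),
                              max(cols) - min(cols) + 1,
                              max(rows) - min(rows) + 1))
    return boxes

# Toy 8x8 mask: one 4x4 blob (kept) and a single stray pixel (dropped).
toy_mask = rle_decode("11 4 19 4 27 4 35 4 57 1", height=8, width=8)
toy_boxes = mask_to_boxes(toy_mask)
```

In practice the same two steps would run once per defect class, yielding one set of boxes per class mask.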
flowchart LR
IN[Flagged frame from L1]
YOLO["YOLOv8n<br/>4-class detector"]
BOXES["Predicted bboxes<br/>+ class + confidence"]
BEST["Top detection"]
GATE{conf ≥ 0.85?}
LOG([Log defect: class + bbox])
ESC([Escalate to Layer 3])
IN --> YOLO --> BOXES --> BEST --> GATE
GATE -- Yes --> LOG
GATE -- No --> ESC
style LOG fill:#dcfce7
style ESC fill:#fee2e2
KSDD2 defects are intentionally out-of-distribution for YOLO. Track C of the evaluation measures how badly a Severstal-trained detector misses commutator defects — the result motivates why Layer 3 exists.
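The confidence gate in the Layer 2 diagram reduces to a few lines. This sketch uses a hypothetical `Detection` record rather than the detector's real output type, and it assumes that a frame with no detections at all is escalated; the source does not say how that case is handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

CONF_THRESHOLD = 0.85  # gate value from the Layer 2 diagram

@dataclass(frozen=True)
class Detection:
    cls: str
    conf: float
    bbox: Tuple[float, float, float, float]  # x, y, w, h

def gate_detections(dets: List[Detection]) -> Tuple[str, Optional[Detection]]:
    """Return ('log', best) when the top detection is confident, else ('escalate', best)."""
    if not dets:
        return "escalate", None  # assumption: nothing detected goes to Layer 3
    best = max(dets, key=lambda d: d.conf)
    action = "log" if best.conf >= CONF_THRESHOLD else "escalate"
    return action, best

confident = gate_detections([
    Detection("scratch", 0.91, (10, 20, 40, 8)),
    Detection("pitting", 0.40, (90, 15, 12, 12)),
])
uncertain = gate_detections([Detection("patch", 0.55, (0, 0, 64, 64))])
```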
Layer 3 — Oracle (Azure OpenAI vision)
Only frames where YOLO is uncertain (or KSDD2-style frames YOLO doesn’t know how to classify) reach this layer. The Oracle uses a 5-shot prompt — one reference image per Severstal class plus one KSDD2 anomaly — and returns a Pydantic-validated JSON object via client.beta.chat.completions.parse().
sequenceDiagram
participant L2 as Layer 2 (YOLO)
participant L3 as Layer 3 (Oracle)
participant AZ as Azure OpenAI<br/>gpt-4.1-mini
participant DB as Logs
L2->>L3: POST /predict (image, low-conf)
L3->>L3: Build 5-shot prompt
L3->>AZ: parse(image, schema=DefectPrediction)
AZ-->>L3: {defect_class, confidence, reasoning}
L3->>DB: log + add to retrain queue
L3-->>L2: structured response
The Pydantic schema is enforced at the API call — malformed model output raises ValidationError rather than corrupting the label store.
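A schema along these lines would produce that behaviour. Only the three field names and the class name `DefectPrediction` come from the sequence diagram; the field types, the label set (in particular `ksdd2_anomaly` and `none`) and the 0 to 1 confidence range are assumptions made for illustration. It targets Pydantic v2.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class DefectPrediction(BaseModel):
    # Assumed label set: the four Severstal classes plus two hypothetical extras.
    defect_class: Literal[
        "pitting", "inclusion", "scratch", "patch", "ksdd2_anomaly", "none"
    ]
    confidence: float = Field(ge=0.0, le=1.0)  # assumed range
    reasoning: str

# Well-formed output parses into a typed object.
good = DefectPrediction.model_validate_json(
    '{"defect_class": "scratch", "confidence": 0.91, "reasoning": "thin linear mark"}'
)

# An unknown class and an out-of-range confidence are rejected before logging.
try:
    DefectPrediction.model_validate_json(
        '{"defect_class": "rust", "confidence": 1.4, "reasoning": "n/a"}'
    )
    rejected = False
except ValidationError:
    rejected = True
```

With the OpenAI Python SDK, a model class like this is passed as `response_format` to `client.beta.chat.completions.parse()`, and the validated object is read from the parsed message.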
Azure topology
flowchart TB
subgraph INGEST["Ingestion"]
CAM[Factory camera<br/>RTSP stream]
UPLOADER[Frame uploader]
end
subgraph BLOB["Azure Blob Storage<br/>cascadedev6ya7a3px"]
RAW[raw/]
MODELS[models/<br/>autoencoder_metal/<br/>yolo_metal/]
LOGS[logs/anomalies/]
end
subgraph SB["Service Bus<br/>(decoupled queue)"]
QUEUE[defect-queue]
end
subgraph ACA["Container Apps<br/>cascade-dev-aca-env"]
ROUTER["cascade-router<br/>(public ingress)"]
L1APP["cascade-l1-ae<br/>min=0 max=5"]
L2APP["cascade-l2-yolo<br/>gpu-t4 profile"]
L3APP["cascade-l3-oracle"]
end
subgraph OPENAI["Azure OpenAI<br/>cascade-dev-aoai"]
GPT["gpt-4.1-mini deployment"]
end
subgraph ACR["Azure Container Registry<br/>cascadedevacr"]
IMG[cascade-base, layer1, layer2, layer3, router]
end
CAM --> UPLOADER --> RAW
RAW --> ROUTER
ROUTER --> L1APP
L1APP -- "MSE > τ" --> ROUTER
ROUTER --> L2APP
L2APP -- "conf < 0.85" --> ROUTER
ROUTER --> L3APP
L3APP --> GPT
L3APP --> LOGS
L2APP --> LOGS
MODELS --> L1APP
MODELS --> L2APP
ACR --> ACA
style ACA fill:#dbeafe,stroke:#3b82f6
style SB fill:#fef9c3,stroke:#eab308
style OPENAI fill:#f3e8ff,stroke:#a855f7
The router is the single orchestrator — Layer 2 never escalates to Layer 3 directly, which keeps the call graph linear and easy to trace.
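The linear call graph can be sketched as a single function. The three callables are hypothetical stand-ins for the HTTP calls the router would make to cascade-l1-ae, cascade-l2-yolo and cascade-l3-oracle; their signatures and the returned dictionary shape are assumptions, not the service's actual contract.

```python
from typing import Callable, Tuple

def route_frame(
    frame: bytes,
    domain: str,
    l1_flags: Callable[[bytes, str], bool],
    l2_detect: Callable[[bytes], Tuple[str, float]],
    l3_classify: Callable[[bytes], str],
    conf_threshold: float = 0.85,
) -> dict:
    """Walk one frame through the cascade; only the router ever calls a layer."""
    trace = ["l1"]
    if not l1_flags(frame, domain):
        return {"verdict": "no_defect", "source": "l1", "trace": trace}
    trace.append("l2")
    defect_class, conf = l2_detect(frame)
    if conf >= conf_threshold:
        return {"verdict": defect_class, "source": "l2", "trace": trace}
    trace.append("l3")  # low confidence: the router, not Layer 2, escalates
    return {"verdict": l3_classify(frame), "source": "l3", "trace": trace}

# Stubbed layers: L1 flags the frame, L2 is unsure, L3 decides.
result = route_frame(
    b"fake-frame",
    "severstal",
    l1_flags=lambda frame, domain: True,
    l2_detect=lambda frame: ("scratch", 0.62),
    l3_classify=lambda frame: "inclusion",
)
```

Because every hop passes through one function, the `trace` list is a complete record of which layers saw the frame.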
Deployment notes
| Item | Notes |
|---|---|
| Resource group | cascade-dev-rg in westeurope (already provisioned via infra/main.bicep). |
| GPU | ACA gpu-t4 workload profile; falls back to Consumption (CPU) if quota lapses. |
| Model weights | Pulled from Blob on container startup via BLOB_ACCOUNT + *_MODEL_BLOB env vars — retraining never requires a new image push. |
| Threshold updates | mseThreshold parameter on the L1 Bicep module; per-domain values land as MSE_THRESHOLD_KSDD2 / MSE_THRESHOLD_SEVERSTAL env vars. |
| Cold start | First request after scale-to-zero pays a ~10–30 s penalty (image pull + weight load). Excluded from the latency numbers reported on the Evaluation page. |
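On the container side, reading the per-domain thresholds could look like the sketch below. Only the two environment variable names come from the table; the loader function and its fail-fast behaviour are assumptions.

```python
import os
from typing import Dict, Mapping

DOMAINS = ("ksdd2", "severstal")

def load_thresholds(env: Mapping[str, str] = os.environ) -> Dict[str, float]:
    """Read MSE_THRESHOLD_<DOMAIN> for every domain, failing fast if one is missing."""
    thresholds = {}
    for domain in DOMAINS:
        key = f"MSE_THRESHOLD_{domain.upper()}"
        raw = env.get(key)
        if raw is None:
            raise RuntimeError(f"{key} is not set")
        thresholds[domain] = float(raw)
    return thresholds

# Example with an explicit mapping instead of the real process environment.
example = load_thresholds({
    "MSE_THRESHOLD_KSDD2": "0.015",
    "MSE_THRESHOLD_SEVERSTAL": "0.052",
})
```

Refusing to start without both values avoids silently gating one domain with the other's threshold.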
Modern GPU SKUs (T4, A10, A100, H100) currently have zero VM quota on this subscription. ACA’s gpu-t4 workload profile sidesteps the VM quota entirely, which is why all training and inference run through Container Apps Jobs rather than a managed compute cluster.
Why these specific choices?
A reviewer’s-eye view of the eight non-obvious decisions in this build, each phrased as the alternative I rejected and why.
Layer 1 — PatchCore-lite, not anomalib’s full PatchCore
Rejected: anomalib’s PatchCore (WideResNet50 backbone, greedy-coreset sub-sampling, Mahalanobis scoring, ~250 ms / image on CPU). Chose: ResNet18 backbone (~11 M params), random 10% coreset, plain cosine kNN. ~50 ms on CPU; ~11 ms via ONNX. Why. The L1 latency budget is 100 ms p50, and the random coreset is statistically equivalent to the greedy version for our 6 k-image validation pool while being 50× faster to build. The accuracy cost is zero on KSDD2 (Δμ +8.18σ vs +6σ in the paper) and near-zero on Severstal (the hard case is hard for both).
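The scoring side of that choice (random coreset plus cosine nearest-neighbour) is small enough to sketch in pure Python. This is an illustration under stated assumptions, not the project's implementation: it works on plain feature vectors rather than ResNet18 patch features, and the function names and toy data are invented.

```python
import math
import random

def normalise(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_coreset(features, fraction=0.10, seed=0):
    """Keep a uniformly random subset of the normal features (no greedy selection)."""
    rng = random.Random(seed)
    keep = max(1, round(len(features) * fraction))
    return [normalise(f) for f in rng.sample(features, keep)]

def anomaly_score(patch, coreset, k=1):
    """Mean cosine distance to the k nearest coreset members; higher is more anomalous."""
    query = normalise(patch)
    dists = sorted(1.0 - sum(a * b for a, b in zip(query, m)) for m in coreset)
    return sum(dists[:k]) / k

# Toy 2-D "features": normals point roughly along +x, the odd patch along -x.
normal_patches = [[1.0, 0.01 * i] for i in range(100)]
coreset = build_coreset(normal_patches)
score_normal = anomaly_score([1.0, 0.5], coreset)
score_odd = anomaly_score([-1.0, 0.2], coreset)
```

Building the memory bank is a single `sample` call, which is where the build-time saving over greedy coreset selection comes from.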
Layer 2 — YOLOv8n single-shot, not YOLOv8m or two-stage
Rejected: YOLOv8m (mAP +5–8 points, ~3× compute) and Faster-RCNN (another +5 points, ~10×). Chose: YOLOv8n at 640 px, 50 epochs on a single T4. Why. Layer 2 is a throughput component, not the system’s accuracy ceiling — Layer 3 is. Spending the next dollar on a bigger Layer 2 buys ~3 mAP at 3× the cost; spending it on better Layer 1 calibration moves the cascade by ~7 F1 points (see Phase K). A 3 M-param backbone also exports cleanly to ONNX for edge.
Layer 3 — gpt-4.1-mini, not gpt-4o
Rejected: gpt-4o ($2.50 in / $10.00 out per 1M tokens). Chose: gpt-4.1-mini ($0.40 / $1.60). Why. A 6.25× cost reduction on both input and output tokens at no measurable quality loss for this task: Severstal class labels are a closed 4-way set with strong visual cues, not a free-form description. Validated on a 60-frame sample: agreement with gpt-4o was 58/60.
Cache — dHash with 6-bit Hamming radius, not CLIP embeddings
Rejected: CLIP-image embeddings + cosine similarity (the “right” way to build a perceptual cache). Chose: 64-bit difference-hash with a Hamming-distance ≤ 6 radius. Why. Severstal coil frames within a roll are near-duplicates (same physical strip moving under the camera at frame rate). dHash is designed for exactly this: pixel-level near-identity. CLIP would generalise too much (different rolls of the same defect class would collide). The cache ships at a 47% hit rate on Track C and contributes 0% to false positives.
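A 64-bit dHash with a Hamming-radius lookup can be sketched without any imaging library. This is a simplified illustration, not the cache's actual code: it takes a grayscale image as a list of rows and downsamples by nearest-neighbour sampling, whereas real implementations usually resize with a proper filter. The function names are hypothetical.

```python
def dhash(gray, hash_w=8, hash_h=8):
    """Difference hash: one bit per 'left pixel brighter than right' comparison."""
    src_h, src_w = len(gray), len(gray[0])
    # Nearest-neighbour downsample to (hash_w + 1) columns by hash_h rows.
    small = [
        [gray[(r * src_h) // hash_h][(c * src_w) // (hash_w + 1)]
         for c in range(hash_w + 1)]
        for r in range(hash_h)
    ]
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits  # hash_w * hash_h bits, 64 by default

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def is_cache_hit(a, b, radius=6):
    return hamming(a, b) <= radius

# Toy frames: a horizontal ramp, the same ramp with one disturbed pixel,
# and its photographic negative.
frame = [list(range(72)) for _ in range(64)]
near_dup = [row[:] for row in frame]
near_dup[0][8] = 100
negative = [[255 - v for v in row] for row in frame]

hit = is_cache_hit(dhash(frame), dhash(near_dup))
miss = is_cache_hit(dhash(frame), dhash(negative))
```

The disturbed pixel flips a single bit, well inside the radius of 6, while the negative flips all 64.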
Calibration — per-domain τ, not a single τ
Rejected: A single z-threshold across both metal domains. Chose: {severstal: -0.5, ksdd2: +1.0}, persisted into the model summary. Why. PatchCore’s score scale depends on the texture statistics of the normal pool. KSDD2’s per-patch features cluster tightly; Severstal’s spread out. A single τ either over-escalates KSDD2 (kills the cost story) or under-escalates Severstal (kills recall). The Pareto curves on the evaluation page make the asymmetry explicit.
Confidence intervals — bootstrap, not analytical Wilson / Clopper-Pearson
Rejected: Wilson score interval on F1 (analytical, would assume F1 is binomially distributed). Chose: 1 000-resample non-parametric bootstrap. Why. F1 is a ratio of sums of two correlated counts; its sampling distribution at n=180 is not analytically tractable. The bootstrap makes no distributional assumption and reuses the same code path for F1, precision, and recall. Zero dependencies — implemented in stats.py in ~30 lines.
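A percentile bootstrap of this kind fits in a few lines of standard-library Python. The sketch below is not the contents of stats.py: the function names, the percentile method, the fixed seed and the synthetic labels are all assumptions.

```python
import random

def f1_score(pairs):
    """F1 from (truth, prediction) pairs with 1 = defect, 0 = no defect."""
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(pairs, metric=f1_score, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile interval: resample images with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        metric([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    low = stats[round((alpha / 2) * n_resamples)]
    high = stats[round((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Synthetic labels, not results from this build: 60 TP, 10 FN, 15 FP, 95 TN (n = 180).
pairs = [(1, 1)] * 60 + [(1, 0)] * 10 + [(0, 1)] * 15 + [(0, 0)] * 95
point = f1_score(pairs)
low, high = bootstrap_ci(pairs)
```

Passing a different `metric` function reuses the same resampling loop for precision or recall.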
System comparison — McNemar, not paired-t
Rejected: Paired t-test on per-image correct/incorrect indicators. Chose: McNemar’s test, using the exact binomial form when n < 25 and the chi-square form otherwise. Why. Per-image correctness is binary, not continuous; the paired t-test assumes a normally distributed difference that doesn’t exist here. McNemar conditions on the discordant pairs only (cascade-correct + oracle-wrong vs the converse), which is the actual quantity of interest: “in the cases where the two systems disagree, who’s right more often?” Result on this build: 34 cascade-only-correct vs 10 oracle-only-correct, p = 0.0005.
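The test itself needs only the two discordant counts. The sketch below is one way to write it with the standard library. It assumes n is the number of discordant pairs and that the chi-square branch uses the continuity correction, neither of which the text states; under those assumptions it reproduces the reported p of roughly 0.0005.

```python
import math

def mcnemar_p(b, c, exact_below=25):
    """Two-sided McNemar p-value from the two discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    if n < exact_below:
        # Exact binomial: under H0 each discordant pair is a fair coin flip.
        tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
        return min(1.0, 2 * tail)
    # Chi-square with 1 degree of freedom, continuity-corrected (assumption).
    chi2 = (abs(b - c) - 1) ** 2 / n
    return math.erfc(math.sqrt(chi2 / 2))  # survival function for 1 dof

p_build = mcnemar_p(34, 10)  # counts reported above
p_small = mcnemar_p(1, 9)    # small-sample case takes the exact branch
```

Without the continuity correction the chi-square branch would give about 0.0003 for the same counts, so the choice is worth recording next to the reported value.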
Robustness — per-image stability, not polarity preservation
Rejected: “Defective z-scores stay above normal z-scores under every perturbation” (the obvious test). Chose: “Each image’s perturbed z-score stays within ±5σ of its clean z-score and remains finite.” Why. Severstal’s gate has Δμ +0.52σ — at n=3 the polarity test fails on clean images, let alone perturbed ones. The stability test catches what we actually care about (silent breakage in I/O, normalisation, or backbone weights) without making a false claim about Severstal separability.
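The chosen criterion translates directly into a per-image check. A minimal sketch, assuming z-scores are already expressed in σ units so the tolerance is simply 5.0; the function name and sample values are illustrative.

```python
import math

def is_stable(clean_z, perturbed_z, tol=5.0):
    """True when every perturbed z-score is finite and within tol of its clean value."""
    if len(clean_z) != len(perturbed_z):
        raise ValueError("clean and perturbed scores must be paired per image")
    return all(
        math.isfinite(p) and abs(p - c) <= tol
        for c, p in zip(clean_z, perturbed_z)
    )

clean = [0.4, 1.2, -0.3]
ok = is_stable(clean, [0.9, 0.7, 2.1])               # small drift only
broken = is_stable(clean, [0.9, float("nan"), 2.1])  # e.g. a normalisation bug
drifted = is_stable(clean, [0.9, 0.7, 7.0])          # more than 5 sigma away
```

Note that the check says nothing about whether defective images score higher than normal ones, which is exactly the claim the polarity test would have required.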