```mermaid
flowchart LR
    IMG["Input Image<br/>256×256×3"]
    ENC["Encoder<br/>4× Conv2d<br/>↓ stride-2"]
    LAT["Latent Space<br/>256×16×16"]
    DEC["Decoder<br/>4× ConvTranspose2d<br/>↑ stride-2"]
    RECON["Reconstructed<br/>Image 256×256×3"]
    MSE["MSE<br/>Reconstruction<br/>Error"]
    GATE{"MSE > τ?"}
    DISCARD(["Discard<br/>(no defect)"])
    PASS(["Pass to<br/>Layer 2"])
    IMG --> ENC --> LAT --> DEC --> RECON
    IMG --> MSE
    RECON --> MSE
    MSE --> GATE
    GATE -- No --> DISCARD
    GATE -- Yes --> PASS
    style DISCARD fill:#dcfce7
    style PASS fill:#fef9c3
```
# System Architecture

How the three cascade layers connect on Azure.
## Design Philosophy
The architecture is guided by three constraints:
- **Cost:** Minimise GPT-4o API tokens consumed per production frame.
- **Latency:** Ensure the common case (no defect) completes in < 20 ms.
- **Accuracy:** Never miss a true defect — optimise for recall over precision at each gate.
## Layer 1 — The Gatekeeper (Convolutional Autoencoder)
The autoencoder is trained only on defect-free images. When shown a defective image, the network reconstructs it poorly, producing a high Mean Squared Error (MSE). This asymmetry is the entire trick.
**Threshold Selection:** The MSE threshold τ is calibrated on the validation set to achieve 99% recall — we accept more false positives at this stage to ensure no defect escapes undetected.
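The scoring and calibration step is only a few lines. The sketch below assumes frames are NumPy arrays and that `calibrate_threshold` is a hypothetical helper picking τ as the quantile of defect-image MSEs that hits the target recall:

```python
import numpy as np

def reconstruction_mse(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Per-frame anomaly score: mean squared error over all pixels and channels."""
    diff = original.astype(np.float32) - reconstructed.astype(np.float32)
    return float(np.mean(diff ** 2))

def calibrate_threshold(defect_mses: np.ndarray, target_recall: float = 0.99) -> float:
    """Choose tau as the (1 - recall)-quantile of MSEs from known-defective
    validation frames, so roughly target_recall of them score above tau."""
    return float(np.quantile(defect_mses, 1.0 - target_recall))

def layer1_gate(mse: float, tau: float) -> bool:
    """True -> pass the frame to Layer 2; False -> discard (no defect)."""
    return mse > tau

# Toy calibration: defective frames reconstruct poorly, so their MSEs run high.
defect_mses = np.array([0.08, 0.12, 0.09, 0.30, 0.11, 0.07, 0.15, 0.22])
tau = calibrate_threshold(defect_mses, target_recall=0.99)
```

In production `reconstructed` is the autoencoder's forward pass; the MSE distribution of defect-free validation frames should be inspected too, so the false-positive load on Layer 2 stays tolerable.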
## Layer 2 — The Specialist (YOLOv8n)
YOLOv8n is a lightweight single-stage object detector. It classifies and localises the defect in a single forward pass at ~15 ms on a T4 GPU.
```mermaid
flowchart LR
    IN["Flagged Frame<br/>from Layer 1"]
    YOLO["YOLOv8n<br/>Backbone + Head"]
    BOXES["Predicted Bounding<br/>Boxes + Classes"]
    BEST["Best Detection<br/>(highest conf)"]
    GATE{"conf ≥ 0.85?"}
    LOG(["📋 Log Defect<br/>class + bbox"])
    ESC(["Escalate to<br/>Layer 3"])
    IN --> YOLO --> BOXES --> BEST --> GATE
    GATE -- Yes --> LOG
    GATE -- No --> ESC
    style LOG fill:#dcfce7
    style ESC fill:#fee2e2
```
**Training Data:** YOLOv8n is trained purely on pseudo-labels generated by GPT-4o in the offline annotation phase. See Data Strategy for details.
## Layer 3 — The Oracle (GPT-4o)
Only frames where YOLOv8 is uncertain (confidence < 85%) reach this layer. GPT-4o acts as a high-accuracy fallback and simultaneously generates annotations that are fed back into the YOLOv8 retraining queue.
```mermaid
sequenceDiagram
    participant L2 as Layer 2<br/>(YOLOv8)
    participant L3 as Layer 3<br/>(GPT-4o)
    participant AZ as Azure OpenAI<br/>gpt-4o
    participant DB as ADLS Logs
    L2->>L3: POST /predict (image, low-conf)
    L3->>L3: Build few-shot prompt<br/>(18 seed images)
    L3->>AZ: Structured output request<br/>(Pydantic schema)
    AZ-->>L3: DefectPrediction JSON<br/>{class, confidence, reasoning}
    L3->>DB: Log prediction + add to retrain queue
    L3-->>L2: Return result
```
**JSON Schema Enforcement:** We use `client.beta.chat.completions.parse()` with a Pydantic `DefectPrediction` model to guarantee valid JSON output — GPT-4o never returns free-form text.
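A minimal sketch of the schema side. The field names follow the JSON shown in the sequence diagram (`defect_class` is an assumed Python-side name, since `class` is a reserved word); the commented call shows how the model plugs into `parse()`, with the client setup and prompt contents left as assumptions:

```python
from pydantic import BaseModel, Field

class DefectPrediction(BaseModel):
    defect_class: str = Field(alias="class")   # "class" is reserved in Python
    confidence: float = Field(ge=0.0, le=1.0)  # reject out-of-range confidences
    reasoning: str

# With an Azure OpenAI client, the model class itself is the response format:
#
#   completion = client.beta.chat.completions.parse(
#       model="gpt-4o",                   # your deployment name
#       messages=few_shot_messages,       # prompt built from the 18 seed images
#       response_format=DefectPrediction,
#   )
#   prediction = completion.choices[0].message.parsed  # -> DefectPrediction
#
# The same model also validates logged responses when replaying the retrain queue:
raw = '{"class": "scratch", "confidence": 0.92, "reasoning": "linear bright mark"}'
prediction = DefectPrediction.model_validate_json(raw)
```

Keeping one Pydantic model for both the live call and the log-replay path means a schema change in one place propagates to the retraining pipeline automatically.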
## Azure Infrastructure
```mermaid
flowchart TB
    subgraph INGEST["Data Ingestion"]
        CAM["🎥 Factory Camera<br/>RTSP Stream"]
        UPLOADER["Frame Uploader<br/>Python script"]
    end
    subgraph ADLS["Azure Data Lake Gen2<br/>(cascade-defect-adls)"]
        RAW["raw/"]
        PROCESSED["processed/<br/>pseudo_labels.json"]
        LOGS["logs/anomalies/"]
    end
    subgraph SB["Azure Service Bus<br/>(cascade-defect-sb)"]
        QUEUE["defect-queue"]
    end
    subgraph ACA["Azure Container Apps — West Europe<br/>Consumption-GPU-NC8as-T4"]
        L1APP["layer1-autoencoder<br/>min=0 max=5"]
        L2APP["layer2-yolo<br/>min=0 max=10<br/>KEDA: queue length"]
    end
    subgraph OPENAI["Azure OpenAI<br/>(swedencentral)"]
        GPT4O["gpt-4o<br/>deployment"]
    end
    subgraph AML["Azure ML Workspace<br/>(offline training only)"]
        CLUSTER["Standard_NC6s_v3<br/>transient compute"]
        MLFLOW["MLflow Tracking"]
        REGISTRY["Model Registry<br/>autoencoder_v1<br/>yolo_pseudo_v1"]
    end
    subgraph ACR["Azure Container Registry<br/>(cascadedefectacr)"]
        IMG1["layer1-autoencoder:latest"]
        IMG2["layer2-yolo:latest"]
    end
    CAM --> UPLOADER --> RAW
    RAW --> L1APP
    L1APP -- "MSE > τ<br/>enqueue image_uri" --> QUEUE
    QUEUE -- "KEDA trigger<br/>scale-up" --> L2APP
    L2APP -- "conf < 0.85" --> GPT4O
    GPT4O --> LOGS
    L2APP -- "conf ≥ 0.85" --> LOGS
    AML --> REGISTRY
    REGISTRY --> ACR
    ACR --> L1APP
    ACR --> L2APP
    style ACA fill:#dbeafe,stroke:#3b82f6
    style SB fill:#fef9c3,stroke:#eab308
    style OPENAI fill:#f3e8ff,stroke:#a855f7
    style AML fill:#f0fdf4,stroke:#16a34a
```
## Deployment Checklist
| Step | Command / Action |
|---|---|
| 1. Request T4 GPU quota | Azure Portal → Support → “Managed Environment Consumption T4 Gpus” |
| 2. Create resource group | `az group create --name cascade-defect-rg --location westeurope` |
| 3. Create ACR | `az acr create --resource-group cascade-defect-rg --name cascadedefectacr --sku Basic` |
| 4. Build & push images | `az acr build --registry cascadedefectacr ...` |
| 5. Create ACA environment | See azure_container_apps.md |
| 6. Deploy container apps | `az containerapp create ...` |
| 7. Configure KEDA | `az containerapp update --scale-rule-type azure-servicebus ...` |
**Cold Starts:** Scaling from zero replicas with a T4 GPU attached takes 30–90 seconds (image pull + GPU attachment). This is not part of the inference latency and must be measured and documented separately. In production, keep Layer 2 at `--min-replicas 1` during active factory shifts.