System Architecture

How the three cascade layers connect on Azure

Published April 24, 2026

Design Philosophy

The architecture is guided by three constraints:

  1. Cost: Minimise GPT-4o API tokens consumed per production frame.
  2. Latency: Ensure the common case (no defect) completes in < 20 ms.
  3. Accuracy: Never miss a true defect — optimise for recall over precision at each gate.

Layer 1 — The Gatekeeper (Convolutional Autoencoder)

The autoencoder is trained only on defect-free images. When shown a defective image, the network reconstructs it poorly, producing a high Mean Squared Error (MSE). This asymmetry is the entire trick.

flowchart LR
    IMG[Input Image\n256×256×3]
    ENC["Encoder\n4× Conv2d\n↓ stride-2"]
    LAT["Latent Space\n256×16×16"]
    DEC["Decoder\n4× ConvTranspose2d\n↑ stride-2"]
    RECON["Reconstructed\nImage 256×256×3"]
    MSE["MSE\nReconstruction\nError"]
    GATE{"MSE > τ?"}
    DISCARD(["Discard\n(no defect)"])
    PASS(["Pass to\nLayer 2"])

    IMG --> ENC --> LAT --> DEC --> RECON
    IMG --> MSE
    RECON --> MSE
    MSE --> GATE
    GATE -- No --> DISCARD
    GATE -- Yes --> PASS

    style DISCARD fill:#dcfce7
    style PASS fill:#fef9c3
Figure 1: Layer 1: Autoencoder reconstruction pipeline

Threshold Selection: The MSE threshold τ is calibrated on the validation set to achieve 99% recall — we accept more false positives at this stage to ensure no defect escapes undetected.
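A minimal sketch of the calibration step, assuming per-image MSE scores for the validation set have already been computed (the array contents here are synthetic and purely illustrative): τ is set at the (1 − recall) quantile of the defect scores, so 99% of known-defective images land above the gate.

```python
import numpy as np

def mse(original: np.ndarray, reconstructed: np.ndarray) -> np.ndarray:
    """Per-image mean squared reconstruction error over H, W, C."""
    return ((original - reconstructed) ** 2).mean(axis=(1, 2, 3))

def calibrate_threshold(defect_mse: np.ndarray, target_recall: float = 0.99) -> float:
    """Pick tau so that target_recall of known-defective validation
    images score strictly above it."""
    return float(np.quantile(defect_mse, 1.0 - target_recall))

# Synthetic validation scores: defect-free frames reconstruct well (low MSE),
# defective frames reconstruct poorly (high MSE).
rng = np.random.default_rng(0)
clean_mse = rng.uniform(0.001, 0.010, size=1000)
defect_mse = rng.uniform(0.020, 0.100, size=200)

tau = calibrate_threshold(defect_mse, target_recall=0.99)
recall = (defect_mse > tau).mean()  # fraction of true defects flagged
fpr = (clean_mse > tau).mean()      # defect-free frames wrongly escalated
```

Tuning `target_recall` trades Layer 2 load against miss risk: lowering τ sends more clean frames downstream but keeps the "never miss a true defect" constraint intact.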


Layer 2 — The Specialist (YOLOv8n)

YOLOv8n is a lightweight single-stage object detector. It classifies and localises the defect in a single forward pass at ~15 ms on a T4 GPU.

flowchart LR
    IN[Flagged Frame\nfrom Layer 1]
    YOLO["YOLOv8n\nBackbone + Head"]
    BOXES["Predicted Bounding\nBoxes + Classes"]
    BEST["Best Detection\n(highest conf)"]
    GATE{"conf ≥ 0.85?"}
    LOG(["📋 Log Defect\nclass + bbox"])
    ESC(["Escalate to\nLayer 3"])

    IN --> YOLO --> BOXES --> BEST --> GATE
    GATE -- Yes --> LOG
    GATE -- No --> ESC

    style LOG fill:#dcfce7
    style ESC fill:#fee2e2
Figure 2: Layer 2: YOLOv8 inference and confidence routing

Training Data: YOLOv8 is trained purely on pseudo-labels generated by GPT-4o in the offline annotation phase. See Data Strategy for details.
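The routing step in Figure 2 can be sketched as a small pure-Python function; the `Detection` type and class names here are illustrative, and the commented Ultralytics snippet shows how real detections would be produced (shown for orientation only, not run here):

```python
from typing import NamedTuple, Optional, Tuple, List

CONF_GATE = 0.85  # Layer 2 confidence gate from Figure 2

class Detection(NamedTuple):
    cls: str
    conf: float
    bbox: Tuple[float, float, float, float]  # x1, y1, x2, y2

def route(detections: List[Detection]) -> Tuple[str, Optional[Detection]]:
    """Pick the highest-confidence detection and decide where the frame
    goes: 'log' the defect locally, or 'escalate' to the GPT-4o layer."""
    if not detections:
        # Flagged by Layer 1 but YOLO found nothing: let the oracle decide.
        return "escalate", None
    best = max(detections, key=lambda d: d.conf)
    return ("log", best) if best.conf >= CONF_GATE else ("escalate", best)

# In production the detections would come from the trained model, e.g.:
#   from ultralytics import YOLO
#   results = YOLO("yolo_pseudo_v1.pt")(frame)[0]
#   detections = [Detection(results.names[int(b.cls)], float(b.conf),
#                           tuple(b.xyxy[0].tolist()))
#                 for b in results.boxes]
```

Escalating on an empty detection set is a deliberate choice: a frame that passed the Layer 1 gate but yields no boxes is exactly the ambiguous case the oracle exists for.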


Layer 3 — The Oracle (GPT-4o)

Only frames where YOLOv8 is uncertain (confidence < 85%) reach this layer. GPT-4o acts as a high-accuracy fallback and simultaneously generates annotations that are fed back into the YOLOv8 retraining queue.

sequenceDiagram
    participant L2 as Layer 2<br/>(YOLOv8)
    participant L3 as Layer 3<br/>(GPT-4o)
    participant AZ as Azure OpenAI<br/>gpt-4o
    participant DB as ADLS Logs

    L2->>L3: POST /predict (image, low-conf)
    L3->>L3: Build few-shot prompt<br/>(18 seed images)
    L3->>AZ: Structured output request<br/>(Pydantic schema)
    AZ-->>L3: DefectPrediction JSON<br/>{class, confidence, reasoning}
    L3->>DB: Log prediction + add to retrain queue
    L3-->>L2: Return result
Figure 3: Layer 3: GPT-4o few-shot classification

JSON Schema Enforcement: We use client.beta.chat.completions.parse() with a Pydantic DefectPrediction model to guarantee valid JSON output — GPT-4o never returns free-form text.
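A sketch of the schema and call, assuming the field names from Figure 3 and an Azure deployment named gpt-4o; the defect label set shown is illustrative (the real classes come from the annotation phase):

```python
from enum import Enum
from pydantic import BaseModel, Field

class DefectClass(str, Enum):
    # Illustrative labels only.
    scratch = "scratch"
    dent = "dent"
    contamination = "contamination"
    none = "none"

class DefectPrediction(BaseModel):
    """Schema enforced on every GPT-4o response (fields as in Figure 3)."""
    defect_class: DefectClass
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

def classify(client, image_b64: str) -> DefectPrediction:
    # client is an AzureOpenAI instance; "gpt-4o" is the deployment name.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Classify the defect in the image."},
            {"role": "user", "content": [{"type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}]},
        ],
        response_format=DefectPrediction,
    )
    return completion.choices[0].message.parsed
```

Because `response_format` carries the Pydantic model, the SDK returns a validated `DefectPrediction` instance rather than raw text, so downstream logging code never has to defend against malformed JSON.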


Azure Infrastructure

flowchart TB
    subgraph INGEST["Data Ingestion"]
        CAM["🎥 Factory Camera\nRTSP Stream"]
        UPLOADER["Frame Uploader\nPython script"]
    end

    subgraph ADLS["Azure Data Lake Gen2\n(cascade-defect-adls)"]
        RAW["raw/"]
        PROCESSED["processed/\npseudo_labels.json"]
        LOGS["logs/anomalies/"]
    end

    subgraph SB["Azure Service Bus\n(cascade-defect-sb)"]
        QUEUE["defect-queue"]
    end

    subgraph ACA["Azure Container Apps — West Europe\nConsumption-GPU-NC8as-T4"]
        L1APP["layer1-autoencoder\nmin=0 max=5"]
        L2APP["layer2-yolo\nmin=0 max=10\nKEDA: queue length"]
    end

    subgraph OPENAI["Azure OpenAI\n(swedencentral)"]
        GPT4O["gpt-4o\ndeployment"]
    end

    subgraph AML["Azure ML Workspace\n(offline training only)"]
        CLUSTER["Standard_NC6s_v3\ntransient compute"]
        MLFLOW["MLflow Tracking"]
        REGISTRY["Model Registry\nautoencoder_v1\nyolo_pseudo_v1"]
    end

    subgraph ACR["Azure Container Registry\n(cascadedefectacr)"]
        IMG1["layer1-autoencoder:latest"]
        IMG2["layer2-yolo:latest"]
    end

    CAM --> UPLOADER --> RAW
    RAW --> L1APP
    L1APP -- "MSE > τ\nenqueue image_uri" --> QUEUE
    QUEUE -- "KEDA trigger\nscale-up" --> L2APP
    L2APP -- "conf < 0.85" --> GPT4O
    GPT4O --> LOGS
    L2APP -- "conf ≥ 0.85" --> LOGS
    AML --> REGISTRY
    REGISTRY --> ACR
    ACR --> L1APP
    ACR --> L2APP

    style ACA fill:#dbeafe,stroke:#3b82f6
    style SB fill:#fef9c3,stroke:#eab308
    style OPENAI fill:#f3e8ff,stroke:#a855f7
    style AML fill:#f0fdf4,stroke:#16a34a
Figure 4: Full Azure infrastructure diagram

Deployment Checklist

  1. Request T4 GPU quota: Azure Portal → Support → “Managed Environment Consumption T4 Gpus”
  2. Create resource group: az group create --name cascade-defect-rg --location westeurope
  3. Create ACR: az acr create --resource-group cascade-defect-rg --name cascadedefectacr --sku Basic
  4. Build & push images: az acr build --registry cascadedefectacr ...
  5. Create ACA environment: see azure_container_apps.md
  6. Deploy container apps: az containerapp create ...
  7. Configure KEDA: az containerapp update --scale-rule-type azure-servicebus ...
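A fuller sketch of the KEDA step, using the resource names from Figure 4; the scale-rule name, message-count target, and the secret name sb-conn are assumptions for illustration:

```shell
# Scale layer2-yolo on Service Bus queue depth.
az containerapp update \
  --name layer2-yolo \
  --resource-group cascade-defect-rg \
  --min-replicas 0 --max-replicas 10 \
  --scale-rule-name defect-queue-depth \
  --scale-rule-type azure-servicebus \
  --scale-rule-metadata "queueName=defect-queue" "messageCount=5" \
  --scale-rule-auth "connection=sb-conn"
```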
Warning: Cold-Start Penalty

Scaling from zero replicas with a T4 GPU attached takes 30–90 seconds (image pull + GPU attachment). This is not part of the inference latency and must be measured and documented separately. In production, keep Layer 2 at --min-replicas 1 during active factory shifts.
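Pinning and releasing the warm replica is a one-flag change (resource names from Figure 4):

```shell
# Shift start: keep one warm replica so no frame pays the cold-start cost.
az containerapp update --name layer2-yolo --resource-group cascade-defect-rg --min-replicas 1

# Shift end: allow scale-to-zero again to stop paying for the idle GPU.
az containerapp update --name layer2-yolo --resource-group cascade-defect-rg --min-replicas 0
```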