Data Strategy

From 1,800 raw images to a pseudo-labelled training set

Published

April 24, 2026

Dataset: NEU Metal Surface Defects

The foundation of this project is the NEU Metal Surface Defects Database, a widely used benchmark for surface defect detection in rolled steel manufacturing.

| Property | Value |
| --- | --- |
| Source | Kaggle (`yiyang-feng/neu-metal-surface-defects-database`) |
| Total images | 1,800 |
| Classes | 6 (see below) |
| Image size | 200×200 px (greyscale) |
| Images per class | 300 (balanced) |
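
After downloading, it is worth sanity-checking that the archive really is balanced. A minimal sketch, assuming the common layout of one sub-directory per defect class under the images root (the exact Kaggle folder structure may differ):

```python
from collections import Counter
from pathlib import Path

def class_counts(root: Path) -> Counter:
    """Count files per class sub-directory (assumed layout: root/<class>/<image>)."""
    return Counter(
        sub.name
        for sub in sorted(root.iterdir()) if sub.is_dir()
        for f in sub.iterdir() if f.is_file()
    )

# Usage (path is illustrative):
# counts = class_counts(Path("data/raw/NEU-DET/images"))
# assert all(n == 300 for n in counts.values()), counts
```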

Defect Classes

flowchart TD
    ROOT["NEU Surface Defects\n1,800 images"]
    C1["🔵 Crazing\n300 imgs"]
    C2["🟢 Inclusion\n300 imgs"]
    C3["🟡 Patches\n300 imgs"]
    C4["🟠 Pitted Surface\n300 imgs"]
    C5["🔴 Rolled-in Scale\n300 imgs"]
    C6["🟣 Scratches\n300 imgs"]

    ROOT --> C1
    ROOT --> C2
    ROOT --> C3
    ROOT --> C4
    ROOT --> C5
    ROOT --> C6
Figure 1: NEU dataset defect class taxonomy

The Three-Way Split

We simulate a real-world factory scenario where only a tiny fraction of images are manually labelled.

%%| label: fig-split
%%| fig-cap: "Dataset split strategy"
flowchart LR
    RAW["NEU Dataset\n1,800 images\n6 classes"]

    subgraph SEED["🌱 Few-Shot Seed (1%)"]
        S["18 images\n3 per class\nLabels kept"]
    end

    subgraph UNLABELLED["🔓 Unlabelled Pool (79%)"]
        U["~1,420 images\nLabels stripped\nPseudo-labelled by GPT-4o"]
    end

    subgraph TEST["🧪 Golden Test Set (20%)"]
        T["360 images\nGround-truth labels\nNever used in training"]
    end

    RAW --> SEED
    RAW --> UNLABELLED
    RAW --> TEST

    S --> PROMPT["GPT-4o\nFew-Shot System Prompt"]
    U --> GPT4O["GPT-4o\nBatch Annotation"]
    PROMPT --> GPT4O
    GPT4O --> PSEUDO["Pseudo-Labels\n(YOLO format)"]
    PSEUDO --> YOLO["YOLOv8n\nTraining"]

    style SEED fill:#dcfce7,stroke:#16a34a
    style UNLABELLED fill:#fef9c3,stroke:#ca8a04
    style TEST fill:#fee2e2,stroke:#dc2626

Why Pseudo-Labels?

This split simulates the core value proposition: a high-quality detection model can be trained with almost no manual labelling effort by using a powerful MLLM to annotate the unlabelled pool. The 360-image golden test set then measures how a model trained on pseudo-labels compares to one trained on human annotations.
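
The split itself can be sketched as a small stratified routine. This is an illustration, not the actual `src/cascade_defect/data/split.py`; the function name and signature are assumptions, but the per-class arithmetic (3 seed + 20% test + remainder unlabelled) matches the figures above:

```python
import random

def stratified_split(files_by_class, seed_per_class=3, test_frac=0.2, rng_seed=42):
    """Split each class into seed / test / unlabelled pools.

    files_by_class: dict mapping class name -> list of image paths.
    Returns three dicts keyed by class name.
    """
    rng = random.Random(rng_seed)   # fixed seed so the split is reproducible
    seed, test, unlabelled = {}, {}, {}
    for cls, files in files_by_class.items():
        files = files[:]            # don't mutate the caller's list
        rng.shuffle(files)
        n_test = int(len(files) * test_frac)
        seed[cls] = files[:seed_per_class]
        test[cls] = files[seed_per_class:seed_per_class + n_test]
        unlabelled[cls] = files[seed_per_class + n_test:]
    return seed, test, unlabelled
```

With 6 classes of 300 images each, this yields 18 seed, 360 test, and 1,422 unlabelled images, matching the ~1,420 quoted above.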


GPT-4o Annotation Pipeline

sequenceDiagram
    participant SEED as Seed Images (18)
    participant ANNOT as annotate.py
    participant AZ as Azure OpenAI<br/>gpt-4o
    participant STORE as ADLS<br/>pseudo_labels.json

    ANNOT->>SEED: Load 3 examples per class
    ANNOT->>ANNOT: Encode to base64
    loop For each unlabelled image (~1,420)
        ANNOT->>AZ: Few-shot prompt<br/>+ query image
        AZ-->>ANNOT: DefectPrediction JSON<br/>{class, conf, reasoning}
        ANNOT->>STORE: Append result
    end
    ANNOT->>STORE: Write final pseudo_labels.json
Figure 2: Offline pseudo-labelling pipeline
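
The request-building half of this loop can be sketched as follows. The helper names and the prompt text are illustrative, not lifted from `annotate.py`; what the sketch shows is the base64 encoding step and how the 18 seed examples plus one query image are packed into a single chat request:

```python
import base64
from pathlib import Path

# Illustrative prompt; the real system prompt is longer.
SYSTEM_PROMPT = "You are a steel-surface defect annotator. Classify the final image."

def encode_image(path: str) -> dict:
    """Wrap a local image as an OpenAI image_url content part (base64 data URL)."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def build_messages(seed_examples, query_part):
    """seed_examples: list of (label, content_part); query_part: the image to annotate."""
    user_content = []
    for label, part in seed_examples:
        user_content.append({"type": "text", "text": f"Example ({label}):"})
        user_content.append(part)
    user_content.append({"type": "text", "text": "Classify this image:"})
    user_content.append(query_part)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
```

Because the 18 seed parts are encoded once and reused for every request, only the query image changes between iterations of the loop.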

Cost estimate for annotation run:

| Item | Value |
| --- | --- |
| Images to annotate | ~1,420 |
| Avg input tokens per request | ~2,500 (18 seed images + system prompt) |
| Avg output tokens per response | ~50 |
| GPT-4o input price | $2.50 / 1M tokens |
| GPT-4o output price | $10.00 / 1M tokens |
| Estimated total cost | ~$9.60 (3.55M input + 71k output tokens) |
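
The estimate is straightforward to reproduce from the per-token prices:

```python
# Back-of-envelope cost for the pseudo-labelling run.
N_IMAGES = 1_420
IN_TOKENS, OUT_TOKENS = 2_500, 50      # avg tokens per request / per response
IN_PRICE, OUT_PRICE = 2.50, 10.00      # USD per 1M tokens

input_cost = N_IMAGES * IN_TOKENS / 1e6 * IN_PRICE     # 3.55M tokens -> ~$8.9
output_cost = N_IMAGES * OUT_TOKENS / 1e6 * OUT_PRICE  # 71k tokens -> ~$0.7
total = input_cost + output_cost
print(f"total ≈ ${total:.2f}")
```

Input tokens dominate because every request re-sends the 18 base64-encoded seed images; the 50-token structured responses are almost free by comparison.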
Tip: One-Off Cost

The annotation run is a one-time offline cost. Once pseudo_labels.json is generated and stored in ADLS, YOLOv8 can be retrained from it at negligible cost. This is the key economic argument for the cascade approach.


Data Engineering on Azure

flowchart TB
    subgraph LOCAL["Local / GitHub Codespaces"]
        DOWNLOAD["kaggle datasets download\nneu-metal-surface-defects"]
        SPLIT["uv run python\nsrc/cascade_defect/data/split.py"]
    end

    subgraph ADLS["Azure Data Lake Gen2"]
        direction LR
        R["raw/NEU-DET/\n1800 images"]
        SP["splits/\n├── seed/ (18)\n├── unlabelled/ (~1420)\n└── test/ (360)"]
        PL["processed/\npseudo_labels.json"]
    end

    subgraph AML["Azure ML Workspace"]
        DS["Registered Dataset\nneu-defects-v1"]
        JOB["Training Job\nStandard_NC6s_v3"]
    end

    DOWNLOAD --> R
    R --> SPLIT
    SPLIT --> SP
    SP --> PL
    SP --> DS
    DS --> JOB
Figure 3: End-to-end data pipeline on Azure

Commands

# 1. Download from Kaggle (requires kaggle.json in ~/.kaggle/)
uv run kaggle datasets download yiyang-feng/neu-metal-surface-defects-database -p data/raw/

# 2. Run the split
uv run python src/cascade_defect/data/split.py \
  --raw-dir data/raw/NEU-DET/images \
  --output-dir data/splits

# 3. Upload to ADLS
az storage blob upload-batch \
  --account-name cascadedefectadls \
  --destination raw \
  --source data/raw \
  --auth-mode login

az storage blob upload-batch \
  --account-name cascadedefectadls \
  --destination splits \
  --source data/splits \
  --auth-mode login

Label Schema (Pydantic)

The structured output from GPT-4o is validated against this Pydantic model before being stored:

from pydantic import BaseModel

class DefectPrediction(BaseModel):
    defect_class: str       # One of 6 NEU classes, or "no_defect"
    confidence: float       # 0.0 – 1.0
    reasoning: str          # One-sentence justification
    bounding_box_present: bool  # Did GPT-4o identify a localised region?

This schema is enforced at the API level using client.beta.chat.completions.parse(), which guarantees that malformed GPT-4o responses raise a ValidationError rather than silently corrupting the label database.
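
As a dependency-free illustration of what that validation guards against, a hand-rolled checker for the same schema might look like the following. The class-name spellings are assumptions (the exact label strings the pipeline uses are not shown here); the real pipeline relies on Pydantic raising `ValidationError` instead:

```python
import json

# Assumed label strings; the pipeline's actual class names may be spelled differently.
ALLOWED_CLASSES = {
    "crazing", "inclusion", "patches",
    "pitted_surface", "rolled-in_scale", "scratches", "no_defect",
}

def validate_prediction(raw: str) -> dict:
    """Parse one GPT-4o response and reject anything outside the schema."""
    pred = json.loads(raw)
    if pred.get("defect_class") not in ALLOWED_CLASSES:
        raise ValueError(f"unknown class: {pred.get('defect_class')!r}")
    conf = pred.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence out of range: {conf!r}")
    if not isinstance(pred.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    if not isinstance(pred.get("bounding_box_present"), bool):
        raise ValueError("bounding_box_present must be a bool")
    return pred
```

Rejected responses can then be logged and retried rather than appended to `pseudo_labels.json`, keeping the label database clean.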