flowchart LR
subgraph KSDD["KSDD2 (3,335 images)"]
K_NORM["Normal × 2,979"]
K_DEF["Defective × 356"]
end
subgraph SEV["Severstal (~12,500 images)"]
S_NORM["Normal × ~5,900"]
S_DEF["Defective × ~6,600<br/>(RLE masks, 4 classes)"]
end
AE["Layer 1<br/>Autoencoder<br/>(union of normals)"]
YOLO["Layer 2<br/>YOLOv8n<br/>(Severstal only)"]
OOD["Cascade test — Track C<br/>OOD generalisation"]
INDOM["Cascade test — Track A<br/>in-domain"]
KTEST["Cascade test — Track B<br/>second domain, AE+Oracle"]
K_NORM --> AE
S_NORM --> AE
S_DEF --> YOLO
S_DEF --> INDOM
S_NORM --> INDOM
K_DEF --> KTEST
K_NORM --> KTEST
K_DEF --> OOD
style AE fill:#dbeafe,stroke:#3b82f6
style YOLO fill:#dcfce7,stroke:#22c55e
Data Strategy
Two industrial-metal datasets, one autoencoder, one detector
Why two datasets
The cascade architecture has two distinct data appetites:
- The autoencoder needs a lot of defect-free metal. The more variety it sees as “normal”, the less it over-flags benign textures at inference time. It also benefits from domain breadth — exposure to multiple plausible “normal metal surface” textures keeps it from overfitting to a single factory’s lighting.
- The detector needs labelled defects with bounding boxes. Few public industrial datasets supply both abundant normals and labelled defects at scale.
No single public dataset gives you both at the volume this project needs. Combining two does.
| Dataset | Role | Why we use it | Why we don’t use it for everything |
|---|---|---|---|
| KSDD2 | AE normals + second-domain cascade test + OOD detector test | 2,979 defect-free metal images, real industrial capture | Only 356 defectives; weak masks; non-commercial licence |
| Severstal | AE normals + YOLO training + in-domain test | ~12,500 images, ~6,600 defectives with RLE masks across 4 classes | Single domain (flat steel) — would over-specialise the AE |
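Severstal labels defects as run-length-encoded masks rather than boxes, so the detector's labels have to be derived from them. The sketch below shows one way that conversion could look. It relies on the competition's documented encoding (start/length pairs, 1-indexed, numbered top to bottom and then left to right) and on the 256-pixel-high, 1600-pixel-wide frame. The function name `rle_to_yolo_box` is hypothetical, and wrapping the whole mask in a single enclosing box is a simplifying assumption; a real pipeline might split the mask into connected components first.

```python
def rle_to_yolo_box(
    rle: str, height: int = 256, width: int = 1600
) -> tuple[float, float, float, float]:
    """Enclose a run-length-encoded mask in one normalised (cx, cy, w, h) box."""
    values = [int(v) for v in rle.split()]
    if not values or len(values) % 2:
        raise ValueError("RLE must be a non-empty list of start/length pairs")

    rows: list[int] = []
    cols: list[int] = []
    for start, length in zip(values[0::2], values[1::2]):
        first = start - 1            # starts are 1-indexed
        last = first + length - 1    # inclusive end of the run
        # Pixels are numbered down each column, then left to right.
        cols += [first // height, last // height]
        if first // height == last // height:
            rows += [first % height, last % height]
        else:
            rows += [0, height - 1]  # run wraps past the bottom of a column

    x_min, y_min = min(cols), min(rows)
    box_w = max(cols) - x_min + 1
    box_h = max(rows) - y_min + 1
    return (
        (x_min + box_w / 2) / width,   # centre x
        (y_min + box_h / 2) / height,  # centre y
        box_w / width,
        box_h / height,
    )


# Tiny 4×4 illustration: two pixels in the second column, rows 1 and 2.
box = rle_to_yolo_box("6 2", height=4, width=4)
```

Prefixing the four numbers with a class index gives one line in the plain-text label format that YOLO-style trainers read.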
The combination strategy
Naively concatenating the two datasets fails for a subtle reason: KSDD2 images are roughly 230×640 portrait crops of a commutator surface, while Severstal is 1600×256 wide strips of flat steel. Their per-pixel intensity statistics are not comparable. If you train one autoencoder on the union and pick a single MSE threshold, the dataset with higher intrinsic reconstruction error dominates — and the other gets either flagged on every frame or ignored entirely.
The plan handles this in three places:
- Common canvas. Every image is resized to a 256×256 square before it touches the AE — KSDD2 by centre-crop, Severstal by tile-sampling. This is imperfect (Severstal tiles lose long-range context) but keeps the model simple and CPU-trainable.
- Per-domain thresholds. After training, we compute the AE’s MSE on separate held-out normals from each dataset and derive τ_ksdd2 and τ_severstal independently as mean + 3σ. The router selects which τ to apply based on a domain hint at request time.
- Train YOLO on Severstal alone. KSDD2’s defect masks are coarse and single-class; mixing them in would dilute the four-class signal Severstal gives us. KSDD2 defects become an explicit OOD test instead.
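The per-domain threshold step can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the use of the population standard deviation, and the MSE values are all assumptions chosen to show the mechanics of mean + 3σ and the domain-keyed lookup.

```python
from statistics import mean, pstdev


def derive_threshold(mse_values: list[float], k: float = 3.0) -> float:
    """Return mean + k·σ over reconstruction errors from held-out normals."""
    # Assumption: population σ; a sample σ (statistics.stdev) would also be defensible.
    return mean(mse_values) + k * pstdev(mse_values)


def build_thresholds(val_errors: dict[str, list[float]]) -> dict[str, float]:
    """Derive one τ per domain, each from that domain's own validation normals."""
    return {domain: derive_threshold(errors) for domain, errors in val_errors.items()}


def is_anomalous(mse: float, domain: str, thresholds: dict[str, float]) -> bool:
    """Apply the τ that matches the request's domain hint."""
    if domain not in thresholds:
        raise KeyError(f"no threshold derived for domain {domain!r}")
    return mse > thresholds[domain]


# Illustrative numbers only: Severstal tiles reconstruct worse than KSDD2 crops.
val_errors = {
    "ksdd2": [0.010, 0.012, 0.011, 0.013],
    "severstal": [0.030, 0.034, 0.032, 0.036],
}
thresholds = build_thresholds(val_errors)

# The same reconstruction error is routed differently depending on the hint.
flag_ksdd2 = is_anomalous(0.02, "ksdd2", thresholds)
flag_severstal = is_anomalous(0.02, "severstal", thresholds)
```

With these made-up numbers an MSE of 0.02 exceeds τ_ksdd2 but sits well under τ_severstal, which is exactly the case a single shared threshold would get wrong for one of the two domains.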
The split layout
src/cascade_defect/data/split_metal.py produces this deterministic layout under data/splits_metal/:
ae_train/                          # KSDD2 + Severstal normals (~7,500 imgs)
ae_val/
    ksdd2/                         # 10% of KSDD2 train normals → derive τ_ksdd2
    severstal/                     # 10% of remaining Severstal normals → derive τ_severstal
yolo_train/                        # 80% of Severstal defectives
yolo_val/                          # 20% of Severstal defectives
cascade_test/
    severstal/{normal,defective}   # 20% holdout from each Severstal class
    ksdd2/{normal,defective}       # KSDD2 official test split (untouched)
manifest.json
The split is reproducible — same seed, same on-disk dataset, same files in every directory.
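One common way to get that property is to sort the file list before shuffling it with a privately seeded generator. The sketch below illustrates the idea; `deterministic_split`, the seed value, and the file names are hypothetical, and the actual split_metal.py may do this differently.

```python
import random


def deterministic_split(
    filenames: list[str], holdout_fraction: float, seed: int
) -> tuple[list[str], list[str]]:
    """Split filenames into (kept, holdout) in a way that depends only on names and seed."""
    ordered = sorted(filenames)   # remove any dependence on directory listing order
    rng = random.Random(seed)     # private generator, untouched by global random state
    rng.shuffle(ordered)
    n_holdout = round(len(ordered) * holdout_fraction)
    return ordered[n_holdout:], ordered[:n_holdout]


# Hypothetical file names; the listing order differs but the split does not.
files = [f"img_{i:04d}.jpg" for i in range(100)]
kept_a, holdout_a = deterministic_split(files, 0.2, seed=0)
kept_b, holdout_b = deterministic_split(list(reversed(files)), 0.2, seed=0)
```

Sorting first matters because directory listings are not guaranteed to come back in the same order on every filesystem; without it, the same seed could still produce different splits on different machines.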
The three evaluation tracks
This is the part that the v1 NEU build couldn’t do honestly:
| Track | Test set | What it measures |
|---|---|---|
| A — In-domain | Severstal cascade_test (normals + defectives) | Headline number: precision, recall, F1, cost/latency. With real negatives, the L1 drop rate finally pays for itself. |
| B — Second domain | KSDD2 cascade_test (AE + Oracle, no YOLO) | Does the cascade survive a different metal domain without retraining the detector? |
| C — OOD detector | KSDD2 defectives only, YOLO standalone | How badly does a Severstal-trained YOLO miss commutator defects? Quantifies the cost of the missing class. |
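For reference, the headline metrics in Track A reduce to a handful of counts. The sketch below assumes image-level confusion counts and defines the L1 drop rate as the share of all test frames that the autoencoder passes as normal without forwarding them; that definition, the function name, and the numbers are illustrative assumptions rather than the project's evaluation code.

```python
def summarise_track(tp: int, fp: int, fn: int, tn: int, dropped_at_l1: int) -> dict[str, float]:
    """Compute precision, recall, F1 and the L1 drop rate from image-level counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumed definition: frames stopped at Layer 1 divided by all frames seen.
        "l1_drop_rate": dropped_at_l1 / total if total else 0.0,
    }


# Made-up counts for a 1,000-frame test set with real negatives.
track_a = summarise_track(tp=80, fp=20, fn=20, tn=880, dropped_at_l1=850)
```

On a defectives-only set such as Track C there are no negatives, so no false positives can occur at image level and recall is the number that carries the information.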
Layer 3 prompt
The Oracle keeps its few-shot, structured-output design from v1. The schema shrinks to five defect labels (Severstal’s four named classes plus a catch-all for KSDD2-style anomalies the YOLO doesn’t know about) and an explicit no_defect:
from typing import Literal

from pydantic import BaseModel


class DefectPrediction(BaseModel):
    defect_class: Literal[
        "pitting", "inclusion", "scratch", "patch",
        "surface_anomaly", "no_defect"
    ]
    confidence: float            # 0.0 – 1.0
    reasoning: str               # one-sentence justification
    bounding_box_present: bool

Five reference images go in the system prompt: one from each Severstal class plus one KSDD2 defect.
Reproducing the data step
# 1. KSDD2 — auto-downloaded; current copy lives at data/raw/KolektorSDD2/
# (Kolektor publishes via direct ZIP, no auth required.)
# 2. Severstal — manual: accept the Kaggle competition rules at
# https://www.kaggle.com/competitions/severstal-steel-defect-detection
# and unzip into data/raw/severstal-steel-defect-detection/ so it
# contains train.csv + train_images/.
# 3. Build the unified split.
uv run python -m cascade_defect.data.split_metal
# 4. Sanity counts.
uv run python -m cascade_defect.data.ksdd2
uv run python -m cascade_defect.data.severstal