Image Validation for Machine Learning Datasets

Automatically filter bad images before model training

"Garbage in, garbage out" is the oldest law in ML. But dataset hygiene usually means deduplication or label cleaning — almost nobody validates raw image quality before training. That's a mistake. Blurry, corrupted, overexposed, and pixelated images don't just waste training compute — they actively harm model generalisation by teaching the model to associate class labels with artefacts.

This guide covers why image quality matters in ML datasets and how to automate the validation step in Python before you train.

How bad images harm model training

The damage bad images cause depends on the type of problem:

Image Issue | Effect on Training
Blurry images | Model learns blurry features as class-discriminative signals. Fails on sharp test data.
Heavily compressed images | Model learns JPEG grid artefacts. Brittle on images from different sources.
Overexposed images | Model learns washed-out colours. Fails on normally exposed images of the same class.
Low-resolution images | Model trained on upscaled images hallucinates fine detail at inference.
Noisy images | Model learns noise patterns as signal. Produces inconsistent results at inference.
Real example: A product classifier trained partly on heavily compressed web-scraped images can develop an implicit bias toward JPEG artefacts. When you later add a cleaner data source, accuracy drops — because the model partially relied on artefact patterns that are no longer present.
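To build intuition for how a quality signal like blur can be measured at all, here is a common heuristic: the variance of the Laplacian, which is high for sharp images and collapses when high-frequency detail is lost. This is a generic sketch using NumPy, not imageguard's actual metric.

```python
import numpy as np

def blur_score(gray: np.ndarray) -> float:
    """Variance of the Laplacian: sharp images score high, blurry ones low."""
    # 4-neighbour Laplacian via array shifts (wraparound edges are fine for a heuristic)
    lap = (
        -4.0 * gray
        + np.roll(gray, 1, axis=0) + np.roll(gray, -1, axis=0)
        + np.roll(gray, 1, axis=1) + np.roll(gray, -1, axis=1)
    )
    return float(lap.var())

# Synthetic check: a 5x5 box blur should sharply lower the score of a detailed image.
rng = np.random.default_rng(0)
sharp = rng.random((64, 64))
blurred = sum(
    np.roll(np.roll(sharp, dy, axis=0), dx, axis=1)
    for dy in range(-2, 3) for dx in range(-2, 3)
) / 25.0
```

Production validators combine several such signals; the point here is only that "blurry" is quantifiable, so filtering on it can be automated.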

When to validate: train time vs. ingest time

You can validate images at two points:

Train time: filter inside the Dataset or DataLoader, so bad images are skipped again on every epoch.
Ingest time: filter once when images enter the dataset, before training ever starts.

Ingest-time validation is strongly preferred: it's cheaper to reject once than to skip at every epoch, and it makes your dataset auditable.
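The ingest-time pattern reduces to a small gatekeeper: validate, then copy into the clean dataset only on success. A minimal sketch, with the validator passed in as a callable (e.g. imageguard's validate); the stub validator in the demo is purely illustrative.

```python
import shutil
import tempfile
import types
from pathlib import Path

def ingest(src: Path, clean_dir: Path, validator) -> bool:
    """Copy src into clean_dir only if the validator accepts it; return the decision."""
    result = validator(src)
    if result.ok:
        clean_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, clean_dir / src.name)
    return result.ok

# Demo with a stub validator so the pattern runs without any real checks.
tmp = Path(tempfile.mkdtemp())
raw = tmp / "a.jpg"
raw.write_bytes(b"\xff\xd8fake-jpeg-bytes")
accept_all = lambda p: types.SimpleNamespace(ok=True)
kept = ingest(raw, tmp / "clean", accept_all)
```

Because rejected files never enter the clean directory, everything downstream (training, dedup, labeling) can trust it without re-checking.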

What to check

For most ML datasets, the six quality signals worth checking are blur, compression artefacts, exposure, resolution, noise, and file corruption: the first five are the issues from the table above, and corrupt files fail to decode at all, breaking training batches outright.

Automated validation with imageguard

imageguard — image quality validation for Python

imageguard checks all six signals in a single call. It's designed for exactly this use case: cleaning datasets before training.

View on GitHub →

Install

pip install imageguard

Validate a full dataset folder

import csv
from pathlib import Path
from imageguard import validate

dataset_dir = Path("raw_dataset/")
extensions = {".jpg", ".jpeg", ".png", ".webp"}

passed, rejected = [], []

for path in dataset_dir.rglob("*"):
    if path.suffix.lower() not in extensions:
        continue
    result = validate(path)
    row = {"path": str(path), "score": result.score, "issues": result.issues, "reason": result.reason}
    (passed if result.ok else rejected).append(row)

print(f"Accepted: {len(passed)} | Rejected: {len(rejected)}")

# Write rejected list for audit
with open("rejected.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["path", "score", "issues", "reason"])
    w.writeheader()
    w.writerows(rejected)

Integrate into a PyTorch DataLoader

from torch.utils.data import Dataset
from imageguard import validate
from PIL import Image
import torchvision.transforms as T

class ValidatedImageDataset(Dataset):
    def __init__(self, paths, transform=None, min_score=0.5):
        self.transform = transform
        # Filter at construction time — once, not every epoch
        self.paths = [
            p for p in paths
            if validate(p).score >= min_score
        ]

    def __len__(self): return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img
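Filtering at construction time still re-runs validation every time the Dataset is rebuilt (new process, new run, hyperparameter sweep). A small on-disk score cache avoids that; this is a generic sketch with a pluggable validator, and the stub below stands in for imageguard's validate.

```python
import json
import tempfile
import types
from pathlib import Path

def cached_scores(paths, validator, cache_file):
    """Validate each path once; later runs reuse scores from a JSON cache."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for p in map(str, paths):
        if p not in cache:
            cache[p] = validator(Path(p)).score
    cache_path.write_text(json.dumps(cache))
    return cache

# Demo with a call-counting stub in place of a real validator.
calls = []
def stub(path):
    calls.append(str(path))
    return types.SimpleNamespace(score=0.9)

cache_file = Path(tempfile.mkdtemp()) / "scores.json"
cached_scores(["img1.jpg", "img2.jpg"], stub, cache_file)
scores = cached_scores(["img1.jpg", "img2.jpg"], stub, cache_file)  # served from cache
```

Invalidate the cache (delete the JSON file) whenever you change thresholds or upgrade the validator, since stored scores would otherwise go stale.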

Recommended thresholds for ML datasets

Use Case | blur_score | resolution_score | noise_score
General classification | 40 | 60 | 30
Fine-grained classification | 60 | 75 | 40
Object detection | 50 | 70 | 35
Segmentation | 55 | 70 | 40
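One way to keep these thresholds maintainable is to store them as data rather than scattering magic numbers through the pipeline. A sketch under one assumption: that per-signal scores are available as a plain dict (how imageguard surfaces individual signals may differ).

```python
# Floors from the table above, keyed by use case.
THRESHOLDS = {
    "general_classification":      {"blur_score": 40, "resolution_score": 60, "noise_score": 30},
    "fine_grained_classification": {"blur_score": 60, "resolution_score": 75, "noise_score": 40},
    "object_detection":            {"blur_score": 50, "resolution_score": 70, "noise_score": 35},
    "segmentation":                {"blur_score": 55, "resolution_score": 70, "noise_score": 40},
}

def meets_thresholds(scores: dict, use_case: str) -> bool:
    """True only if every signal clears the floor for the chosen use case."""
    floors = THRESHOLDS[use_case]
    return all(scores.get(signal, 0) >= floor for signal, floor in floors.items())
```

The same image can then pass for general classification but fail the stricter fine-grained floors, which matches how the table is meant to be read.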

What to do with rejected images

Don't simply delete them. Log the rejection reasons to a CSV and review them periodically: a single source or class accounting for most rejections usually points to an upstream pipeline problem rather than random bad luck, and a spike in one rejection reason may mean a threshold is set too aggressively.

For hands-on inspection, upload a sample to the free image quality checker to see individual scores before writing your validation script.