
How to Filter Blurry Images Before OCR

Catch out-of-focus images in Python before they reach your text extraction pipeline

OCR engines — Tesseract, Google Vision, AWS Textract, Azure Read — all share the same weakness: they assume the input is sharp. Feed them a blurry document and you get corrupted text, missing characters, and wrong line breaks. The bad news is they don't warn you; they just return bad text with high confidence.

This guide shows you how to detect blur in Python, pick the right threshold for your documents, and wire it into your pipeline so only sharp images reach the OCR step.

Why blur destroys OCR accuracy

OCR engines find characters by detecting sharp transitions — the edges that make an 'A' look like an 'A' instead of a smear. Blur, by definition, reduces edge contrast. At a Laplacian variance below roughly 200 (computed on raw 8-bit grayscale values), most OCR engines start producing noticeable errors; below roughly 50, results are essentially random.

Blur creeps into document pipelines from a few practical sources: camera shake in handheld phone captures, a misfocused or hunting autofocus lens, paper moving through a sheet-fed scanner, and long exposures in low light.

The standard approach: Laplacian variance

The most reliable fast blur detector is the variance of the Laplacian. The Laplacian is a second-derivative operator — it responds strongly to edges. If an image is blurry, it has few edges, so the Laplacian has low variance.

import cv2
import numpy as np

def laplacian_variance(image_path: str) -> float:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:  # imread returns None silently on a bad path
        raise FileNotFoundError(f"Could not read image: {image_path}")
    lap = cv2.Laplacian(img.astype(np.float64), cv2.CV_64F)
    return float(lap.var())

score = laplacian_variance("scan.jpg")
if score < 100:
    print("Too blurry for OCR")

Choosing the right threshold

Laplacian variance   Image quality   OCR suitability
------------------   -------------   ---------------------------
> 500                Very sharp      Excellent
200 – 500            Acceptable      Good, minor errors possible
50 – 200             Soft            Degraded accuracy
< 50                 Blurry          Not suitable
Tip: Raw Laplacian variance depends on image resolution. A 4K photo has a different scale than a 300 DPI scan. Always normalise by texture variance (divide by np.var(gray_float) + 1e-6) when comparing across different image sizes.
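The effect of the division is easy to verify for contrast at least: scaling an image's brightness range multiplies the Laplacian variance and the texture variance by the same factor, so the normalised score stays put while the raw score moves by orders of magnitude. A minimal sketch, using a plain 4-neighbour NumPy Laplacian as a stand-in for cv2.Laplacian:

```python
import numpy as np

def lap_scores(gray: np.ndarray) -> tuple[float, float]:
    """Return (raw, normalised) Laplacian variance.

    Uses a 4-neighbour NumPy Laplacian so the sketch has no
    OpenCV dependency; cv2.Laplacian behaves the same way here.
    """
    g = gray.astype(np.float64)
    # Discrete Laplacian on the interior pixels: sum of the four
    # neighbours minus four times the centre
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    raw = float(lap.var())
    return raw, raw / (float(g.var()) + 1e-6)

# The same sharp checkerboard pattern at two contrast levels
pattern = (np.indices((64, 64)).sum(axis=0) % 2).astype(np.float64)
raw_hi, norm_hi = lap_scores(pattern * 200.0)  # strong contrast
raw_lo, norm_lo = lap_scores(pattern * 5.0)    # faint but equally sharp

# raw_hi is ~1600x raw_lo, yet the normalised scores agree
```

Note that the division cancels contrast scaling exactly but resolution effects only partially, so treat normalised scores from very different image sizes as approximately, not perfectly, comparable.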

A better approach: combined sharpness score

Raw Laplacian variance can be fooled by low-contrast images with little texture (a blank white page scores as "blurry" even though it's perfectly sharp). A more robust approach combines Laplacian variance with Tenengrad (Sobel-based gradient energy) and normalises by texture variance:

import cv2
import numpy as np

def sharpness_score(gray: np.ndarray) -> float:
    """Return 0–100 sharpness score; below 40 is blurry."""
    gf = gray.astype(np.float64)
    texture_var = float(np.var(gf)) + 1e-6

    lap = cv2.Laplacian(gf, cv2.CV_64F)
    lap_norm = float(lap.var()) / texture_var

    sx = cv2.Sobel(gf, cv2.CV_64F, 1, 0, ksize=3)
    sy = cv2.Sobel(gf, cv2.CV_64F, 0, 1, ksize=3)
    ten_norm = float(np.hypot(sx, sy).var()) / texture_var

    blur_index = 0.5 * lap_norm + 0.5 * ten_norm
    return max(0.0, min(100.0, (blur_index / 1.2) * 100.0))
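The behaviour of the combined metric can be sanity-checked without OpenCV. In this sketch a 4-neighbour NumPy Laplacian and np.gradient stand in for cv2.Laplacian and cv2.Sobel, so the absolute numbers differ from sharpness_score() above, but the ordering is the same: a blurred copy of an image always scores lower than the original.

```python
import numpy as np

def combined_score(gray: np.ndarray) -> float:
    """NumPy-only approximation of the combined sharpness metric."""
    g = gray.astype(np.float64)
    texture_var = float(g.var()) + 1e-6

    # 4-neighbour discrete Laplacian (interior pixels only)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])

    # Central-difference gradients as a stand-in for Sobel
    gy, gx = np.gradient(g)
    ten = float(np.hypot(gx, gy).var())

    return 0.5 * float(lap.var()) / texture_var + 0.5 * ten / texture_var

def box_blur3(g: np.ndarray) -> np.ndarray:
    """3x3 mean filter via shifted sums (edge-padded)."""
    p = np.pad(g.astype(np.float64), 1, mode="edge")
    h, w = g.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

rng = np.random.default_rng(0)
sharp = rng.uniform(0, 255, size=(64, 64))
blurred = box_blur3(sharp)
# combined_score(sharp) is several times combined_score(blurred)
```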

Wire it into your OCR pipeline


Here's a complete example that validates sharpness, checks resolution, and only calls Tesseract when the image passes:

import cv2
import pytesseract
from pathlib import Path

from imageguard import validate

def ocr_with_validation(image_path: str) -> str | None:
    # Validate before sending to OCR
    result = validate(
        image_path,
        thresholds={
            "blur_score": 60.0,       # strict: need sharp text
            "resolution_score": 70.0,  # need enough pixels
        },
    )
    if not result.ok:
        print(f"Skipping {image_path}: {result.reason} (score {result.score:.2f})")
        return None

    img = cv2.imread(image_path)
    text = pytesseract.image_to_string(img)
    return text

imageguard — blur detection + 5 other quality signals

imageguard combines blur detection with resolution, noise, exposure, compression, and pixelation checks into a single validate() call. Open-source, no API keys, no cloud dependency.


Handling depth-of-field images

One edge case: product photos and portraits often have intentional background blur (bokeh). A naive Laplacian check will flag these as blurry even though the subject is sharp. The fix is to compare the centre region's sharpness against the edge regions. If the centre is 2× sharper than the edges, it's almost certainly intentional DOF — not a focus problem.
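The centre-vs-edge comparison can be sketched with the same NumPy Laplacian used earlier. The one-third centre crop and the 2× default ratio below are illustrative assumptions, not tuned values:

```python
import numpy as np

def lap_var(gray: np.ndarray) -> float:
    """Variance of a 4-neighbour NumPy Laplacian
    (stand-in for cv2.Laplacian(...).var())."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def is_intentional_dof(gray: np.ndarray, ratio: float = 2.0) -> bool:
    """True if the centre third is at least `ratio` times sharper
    than the border strips — likely deliberate background blur."""
    h, w = gray.shape
    centre = lap_var(gray[h // 3: 2 * h // 3, w // 3: 2 * w // 3])
    edges = np.mean([
        lap_var(gray[: h // 3, :]),      # top strip
        lap_var(gray[2 * h // 3:, :]),   # bottom strip
        lap_var(gray[:, : w // 3]),      # left strip
        lap_var(gray[:, 2 * w // 3:]),   # right strip
    ])
    return bool(centre / (edges + 1e-6) >= ratio)
```

A sharp subject on a smooth background passes the check; an image that is uniformly sharp (or uniformly blurry) fails it and falls through to the normal blur gate.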

Batch filtering a folder of documents

from pathlib import Path
from imageguard import validate

doc_folder = Path("scans/")
sharp, blurry = [], []

for p in doc_folder.glob("*.jpg"):
    r = validate(p, thresholds={"blur_score": 60.0})
    (sharp if r.ok else blurry).append((p, r.score))

print(f"Sharp: {len(sharp)} | Blurry (rejected): {len(blurry)}")

You can also use the free online image quality checker to inspect individual documents without writing any code.