OCR engines — Tesseract, Google Vision, AWS Textract, Azure Read — all share the same weakness: they assume the input is sharp. Feed them a blurry document and you get corrupted text, missing characters, and wrong line breaks. The bad news is they don't warn you; they just return bad text with high confidence.
This guide shows you how to detect blur in Python, pick the right threshold for your documents, and wire it into your pipeline so only sharp images reach the OCR step.
## Why blur destroys OCR accuracy
OCR engines find characters by detecting sharp transitions — the edges that make an 'A' look like an 'A' instead of a smear. Blur, by definition, reduces edge contrast. At a Laplacian variance below ~200 (raw pixel values), most OCR engines start producing errors. Below ~50, results are essentially random.
The practical sources of blur in document pipelines are:
- Camera shake from handheld phone scanning
- Out-of-focus due to autofocus failure on flat documents
- Motion blur from documents on a conveyor belt
- Re-scanned faxes or photocopies of photocopies
- Thumbnail-sized images that have been upscaled
## The standard approach: Laplacian variance
The most reliable fast blur detector is the variance of the Laplacian. The Laplacian is a second-derivative operator that responds strongly to edges; blur weakens those edges, so the Laplacian response — and therefore its variance — drops.
```python
import cv2
import numpy as np

def laplacian_variance(image_path: str) -> float:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    lap = cv2.Laplacian(img.astype(np.float64), cv2.CV_64F)
    return float(lap.var())

score = laplacian_variance("scan.jpg")
if score < 100:
    print("Too blurry for OCR")
```
## Choosing the right threshold
| Laplacian Variance | Image Quality | OCR Suitability |
|---|---|---|
| > 500 | Very sharp | Excellent |
| 200 – 500 | Acceptable | Good, minor errors possible |
| 50 – 200 | Soft | Degraded accuracy |
| < 50 | Blurry | Not suitable |
These bands assume raw pixel values; raw Laplacian variance also depends on contrast and resolution, so divide by the image's texture variance (`np.var(gray_float) + 1e-6`) when comparing scores across different image sizes.
## A better approach: combined sharpness score
Raw Laplacian variance can be fooled by low-contrast images with little texture (a blank white page scores as "blurry" even though it's perfectly sharp). A more robust approach combines Laplacian variance with Tenengrad (Sobel-based gradient energy) and normalises by texture variance:
```python
import cv2
import numpy as np

def sharpness_score(gray: np.ndarray) -> float:
    """Return a 0–100 sharpness score; below 40 is blurry."""
    gf = gray.astype(np.float64)
    texture_var = float(np.var(gf)) + 1e-6  # epsilon avoids division by zero on blank pages
    lap = cv2.Laplacian(gf, cv2.CV_64F)
    lap_norm = float(lap.var()) / texture_var
    sx = cv2.Sobel(gf, cv2.CV_64F, 1, 0, ksize=3)
    sy = cv2.Sobel(gf, cv2.CV_64F, 0, 1, ksize=3)
    ten_norm = float(np.hypot(sx, sy).var()) / texture_var  # Tenengrad gradient energy
    blur_index = 0.5 * lap_norm + 0.5 * ten_norm
    # Map the combined index onto a 0–100 scale, clamped at both ends.
    return max(0.0, min(100.0, (blur_index / 1.2) * 100.0))
```
## Wire it into your OCR pipeline
Here's a complete example that validates sharpness, checks resolution, and only calls Tesseract when the image passes:
```python
import cv2
import pytesseract
from imageguard import validate

def ocr_with_validation(image_path: str) -> str | None:
    # Validate before sending to OCR
    result = validate(
        image_path,
        thresholds={
            "blur_score": 60.0,        # strict: need sharp text
            "resolution_score": 70.0,  # need enough pixels
        },
    )
    if not result.ok:
        print(f"Skipping {image_path}: {result.reason} (score {result.score:.2f})")
        return None
    img = cv2.imread(image_path)
    return pytesseract.image_to_string(img)
```
## imageguard — blur detection + 5 other quality signals
imageguard combines blur detection with resolution, noise, exposure, compression, and pixelation checks into a single validate() call. Open-source, no API keys, no cloud dependency.
## Handling depth-of-field images
One edge case: product photos and portraits often have intentional background blur (bokeh). A naive Laplacian check will flag these as blurry even though the subject is sharp. The fix is to compare the centre region's sharpness against the edge regions. If the centre is 2× sharper than the edges, it's almost certainly intentional DOF — not a focus problem.
## Batch filtering a folder of documents
```python
from pathlib import Path
from imageguard import validate

doc_folder = Path("scans/")
sharp, blurry = [], []
for p in doc_folder.glob("*.jpg"):
    r = validate(p, thresholds={"blur_score": 60.0})
    (sharp if r.ok else blurry).append((p, r.score))

print(f"Sharp: {len(sharp)} | Blurry (rejected): {len(blurry)}")
```
You can also use the free online image quality checker to inspect individual documents without writing any code.