Robotics · Safety · Computer Vision

Cross-Corruption Safety Monitoring for Robot Manipulation

A visual safety monitor with 33K trainable parameters predicts when camera corruption will cause a robot policy to fail, even for corruption types it never saw during training.

GitHub · RA-L paper in preparation

Rain occlusion at 50% budget: the policy fails 40% of the time. The P(failure) gauge comes from a monitor trained without any rain examples.

The Task

A vision-language-action model (InternVLA-M1) picks up a Coke can from a table in simulation (SimplerEnv / SAPIEN). The can starts at a random position each episode. Under clean conditions, the policy achieves a 100% success rate.

What happens when the camera lens is dirty, wet, or otherwise degraded? We test 9 corruption types spanning 6 physical mechanisms and ask whether a lightweight safety monitor can predict failure — even for corruption types it has never encountered.

9 Camera Corruption Types

The corruption types follow the ImageNet-C taxonomy, adapted for fixed-mount robot cameras. Each corruption has a budget level (0–100%) that controls its severity.
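As an illustration of how a budget can drive severity, the sketch below implements two of the simpler corruptions, low-light and Gaussian noise, in NumPy. The function names, the linear budget-to-severity mapping, and the `max_darkening` and `max_sigma` constants are assumptions made for this example, not the project's actual parameters. Budgets are passed as fractions in [0, 1] rather than percentages.

```python
import numpy as np


def _check_budget(budget: float) -> None:
    """Budgets are fractions in [0, 1]; 0.9 corresponds to a 90% budget."""
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must lie in [0, 1]")


def apply_low_light(image: np.ndarray, budget: float,
                    max_darkening: float = 0.9) -> np.ndarray:
    """Darken a uint8 image; max_darkening is a hypothetical constant."""
    _check_budget(budget)
    gain = 1.0 - budget * max_darkening  # assumed linear mapping
    out = image.astype(np.float32) * gain
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def apply_gaussian_noise(image: np.ndarray, budget: float,
                         rng: np.random.Generator,
                         max_sigma: float = 50.0) -> np.ndarray:
    """Add zero-mean Gaussian noise; max_sigma is a hypothetical constant."""
    _check_budget(budget)
    sigma = budget * max_sigma  # assumed linear mapping
    if sigma == 0.0:
        return image.copy()
    noise = rng.normal(0.0, sigma, size=image.shape)
    out = image.astype(np.float32) + noise
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```

A budget of 0 leaves the frame unchanged in both functions, so the clean condition is the zero-budget case of every corruption.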

Each animation shows 10 real episodes where the policy sees the corrupted frames, with a real-time P(failure) gauge and running success counter. The P(failure) predictions come from a model trained without the displayed corruption type.

Format: corruption type, budget level (policy success rate)

Clean (100%)
Fingerprint 90% (100%)
Rain 50% (60%)
Glare 90% (100%)
Defocus 90% (60%)
Motion Blur 90% (80%)
Gaussian Noise 90% (100%)
Dust/Mud 90% (80%)
Low-Light 90% (80%)
JPEG 90% (100%)

Safety Predictor

A lightweight CNN predicts P(failure) from a single camera frame: the corrupted image the policy actually sees, with no knowledge of the corruption type, budget, or parameters.

Component	Choice	Note
Backbone	ResNet-18	Frozen, ImageNet-pretrained
Trainable Params	33K	0.3% of total
Input	Single frame	No history needed
Output	P(failure)	[0, 1] per frame

Key insight: The frozen backbone provides corruption-agnostic visual features (edges, textures, contrast, sharpness) that degrade in recognizable ways regardless of the physical corruption mechanism. Only the 33K-parameter head learns which degradation patterns lead to task failure.

Leave-One-Out Cross-Corruption Generalization

We hold out each of the 9 corruption types in turn, train on the other 8, and evaluate on the held-out type. We report two metrics:

Spearman ρ — does the monitor correctly rank “more corruption = more danger”? ρ = 1.0 is perfect, 0 is no correlation, negative means it gets it backwards.
AUROC — can the monitor distinguish episodes that will fail from those that will succeed? 1.0 is perfect, 0.5 is random guessing.
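The leave-one-out protocol described above can be sketched in plain Python. The corruption labels and the `(corruption, payload)` sample layout are illustrative assumptions; the sketch also assumes that any sample not tagged with the held-out type (including clean frames, if present) goes to the training split.

```python
from typing import Any, Iterator, List, Sequence, Tuple

# Illustrative labels for the 9 corruption types; the project's names may differ.
CORRUPTIONS = [
    "fingerprint", "rain", "glare", "defocus", "motion_blur",
    "gaussian_noise", "dust_mud", "low_light", "jpeg",
]


def leave_one_out_folds(types: Sequence[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (held_out_type, training_types) for every corruption type."""
    for held_out in types:
        yield held_out, [t for t in types if t != held_out]


def split_samples(samples: Sequence[Tuple[str, Any]],
                  held_out: str) -> Tuple[list, list]:
    """Split (corruption, payload) samples so the held-out type never enters training."""
    train = [s for s in samples if s[0] != held_out]
    test = [s for s in samples if s[0] == held_out]
    return train, test


folds = list(leave_one_out_folds(CORRUPTIONS))
```

Each fold trains a fresh head on 8 types and scores it only on frames from the ninth.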

Held-out Type	Category	Spearman ρ	p-value	AUROC
Rain	Lens contact	1.000	<0.001	0.981
Defocus blur	Optical	0.894	0.041	0.973
Dust / mud	Environmental	0.783	0.118	0.943
Motion blur	Blur	0.707	0.182	1.000
Low-light	Illumination	0.707	0.182	1.000
Fingerprint	Lens contact	0.667	0.219	0.636
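For reference, both metrics can be computed in a few lines of NumPy. The sketch below is not the project's evaluation code (the stack lists scipy and scikit-learn, whose `scipy.stats.spearmanr` and `sklearn.metrics.roc_auc_score` compute the same quantities). It assumes that ρ compares the monitor's mean P(failure) at each budget level with the observed failure rate at that level, and that AUROC is computed over one score per episode. The example numbers are made up.

```python
import numpy as np


def average_ranks(values) -> np.ndarray:
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks


def spearman_rho(x, y) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def auroc(scores, failed) -> float:
    """Probability that a failed episode scores higher than a successful one."""
    failed = np.asarray(failed, dtype=bool)
    n_pos, n_neg = int(failed.sum()), int((~failed).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one failure and one success")
    ranks = average_ranks(scores)
    u_stat = ranks[failed].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


# Made-up example: five budget levels, failures only at the highest budget.
mean_p_failure = [0.10, 0.20, 0.30, 0.40, 0.50]
observed_failure_rate = [0.0, 0.0, 0.0, 0.0, 0.2]
rho = spearman_rho(mean_p_failure, observed_failure_rate)
```

Ties matter with only 5 budget levels: in the toy example the predictions increase strictly, but failures occur only at the highest budget, and ρ comes out near 0.71 rather than 1.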

Summary	Value	Note
Mean Spearman ρ	0.793 ± 0.118	Across 6 failure-causing types
Mean AUROC	0.922 ± 0.129	Episode-level classification
From-Scratch Baseline	ρ = −0.62	Anti-correlated without pretraining
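The summary means and spreads can be reproduced from the six table rows. The short check below uses the population standard deviation (NumPy's default, `ddof=0`), which is the convention that matches the reported ± values; that choice is inferred from the numbers rather than stated in the text.

```python
import numpy as np

# Values copied from the leave-one-out table (6 failure-causing types).
spearman = np.array([1.000, 0.894, 0.783, 0.707, 0.707, 0.667])
auroc_values = np.array([0.981, 0.973, 0.943, 1.000, 1.000, 0.636])


def mean_and_std(values: np.ndarray) -> tuple:
    """Mean and population standard deviation (ddof=0)."""
    return float(values.mean()), float(values.std())


rho_mean, rho_std = mean_and_std(spearman)
auc_mean, auc_std = mean_and_std(auroc_values)

print(f"Mean Spearman rho: {rho_mean:.3f} ± {rho_std:.3f}")  # 0.793 ± 0.118
print(f"Mean AUROC:        {auc_mean:.3f} ± {auc_std:.3f}")  # 0.922 ± 0.129
```

The sample standard deviation (`ddof=1`) would give wider intervals, roughly ±0.129 and ±0.142, so the two conventions are worth distinguishing with only six folds.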

In-Distribution Performance

When trained and evaluated on the same corruption types, the predictor accurately ranks severity.

Feature-Space Similarity

How similar do these corruptions look to the frozen ResNet-18 backbone? Pairwise cosine similarity in the 512-dimensional feature space:

Left: Raw feature similarity. Most corruptions cluster tightly (>0.9), but rain is an outlier (0.75). Right: Feature-difference similarity (Mintun et al., NeurIPS 2021), the cosine similarity between the directions in which corruptions push features away from the clean features.
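The two similarity measures in the figure can be sketched as follows. The helper names are hypothetical, and averaging the 512-dimensional features over frames before comparing is an assumption; the synthetic features at the bottom are only there to show why the two measures can disagree.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def raw_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Cosine similarity of mean features; inputs are (n_frames, 512)."""
    return cosine(feats_a.mean(axis=0), feats_b.mean(axis=0))


def difference_similarity(feats_clean: np.ndarray,
                          feats_a: np.ndarray,
                          feats_b: np.ndarray) -> float:
    """Cosine similarity of the shifts each corruption induces relative to clean."""
    clean = feats_clean.mean(axis=0)
    return cosine(feats_a.mean(axis=0) - clean, feats_b.mean(axis=0) - clean)


# Synthetic features: two corruptions nudge the clean features along
# orthogonal axes, and a third pushes further along the first axis.
clean = np.ones((5, 512))
shift_a = np.zeros(512)
shift_a[0] = 0.1
shift_b = np.zeros(512)
shift_b[1] = 0.1
corrupt_a = clean + shift_a
corrupt_b = clean + shift_b
corrupt_c = clean + 2.0 * shift_a

raw_ab = raw_similarity(corrupt_a, corrupt_b)
diff_ab = difference_similarity(clean, corrupt_a, corrupt_b)
diff_ac = difference_similarity(clean, corrupt_a, corrupt_c)
```

In the synthetic case the raw similarity is close to 1 because the shared clean component dominates, while the difference similarity separates corruptions that move features in unrelated directions from those that move them the same way.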

Technical Details

Setting	Value
Policy	InternVLA-M1
Environment	SimplerEnv / SAPIEN
Task	Pick Coke Can
Corruption Types	9
LOO Folds	9
Episodes / Condition	
Budget Levels	5 (10%–90%)
GPU	RTX 4090 (vast.ai)

SOTIF Context

SOTIF (ISO 21448) addresses safety of the intended functionality: hazards that arise from limitations of the intended function, such as perception, rather than from hardware faults. Camera corruption is a canonical SOTIF triggering condition. On this simulated pick task, the results suggest that pretrained visual features can support corruption-agnostic safety monitoring, which could reduce the need for corruption-specific validation.

Stack: Python 3.11 · PyTorch · torchvision · NumPy · OpenCV · SimplerEnv · ManiSkill2 · InternVLA-M1 · scikit-learn · scipy