Robotics · Safety · Computer Vision

Cross-Corruption Safety Monitoring for Robot Manipulation

A visual safety monitor with only 33K trainable parameters predicts when camera corruption will cause a robot to fail, even for corruption types it has never seen in training.

GitHub · RA-L paper in preparation

Rain occlusion at 50% budget: the policy fails 40% of the time. The P(failure) gauge comes from a model that never saw rain during training.

The Task

A vision-language-action model (InternVLA-M1) picks up a Coke can from a table in simulation (SimplerEnv / SAPIEN). The can starts at a random position each episode. Under clean conditions, the policy achieves a 100% success rate.

What happens when the camera lens is dirty, wet, or otherwise degraded? We test 9 corruption types spanning 6 physical mechanisms and ask whether a lightweight safety monitor can predict failure — even for corruption types it has never encountered.

9 Camera Corruption Types

The corruption types follow the ImageNet-C taxonomy, adapted for fixed-mount robot cameras. Each corruption has a budget level (0–100%) that controls its severity.
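To make the budget idea concrete, here is a minimal sketch of a single corruption parameterized by a budget. The function name `apply_low_light` and the linear mapping from budget to brightness are assumptions for illustration only; the project's actual corruption implementations and budget-to-severity mappings are not described on this page.

```python
import numpy as np

def apply_low_light(frame: np.ndarray, budget: float) -> np.ndarray:
    """Hypothetical low-light corruption: darken an RGB uint8 frame.

    budget is in [0, 1]; 0 leaves the frame unchanged and 1 makes it black.
    The linear mapping is an assumption, not the project's implementation.
    """
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be in [0, 1]")
    scaled = frame.astype(np.float32) * (1.0 - budget)
    # Round before casting so values are not silently truncated.
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

frame = np.full((4, 4, 3), 200, dtype=np.uint8)  # dummy uniform frame
dark = apply_low_light(frame, budget=0.9)        # 90% budget
print(dark[0, 0])
```

A single scalar per corruption keeps the severity axis comparable across types, which is what allows a budget sweep to be ranked against predicted P(failure).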

9 corruption types at high severity

Each animation shows 10 real episodes where the policy sees the corrupted frames, with a real-time P(failure) gauge and running success counter. The P(failure) predictions come from a model trained without the displayed corruption type.

Clean (100%)

Fingerprint 90% (100%)

Rain 50% (60%)

Glare 90% (100%)

Defocus 90% (60%)

Motion Blur 90% (80%)

Gauss. Noise 90% (100%)

Dust/Mud 90% (80%)

Low-Light 90% (80%)

JPEG 90% (100%)

Format: Type budget (success rate)

Safety Predictor

A lightweight CNN predicts P(failure) from a single camera frame: the corrupted image the policy actually sees, with no knowledge of the corruption type, budget, or parameters.

SafetyNet architecture

Backbone: ResNet-18 (frozen, ImageNet-pretrained)

Trainable Params: 33K (0.3% of total)

Input: Single frame (no history needed)

Output: P(failure), in [0, 1] per frame

Key insight: The frozen backbone provides corruption-agnostic visual features — edges, textures, contrast, sharpness — that degrade in recognizable ways regardless of the physical corruption mechanism. Only the 33K-parameter head learns what degradation patterns lead to task failure.

Leave-One-Out Cross-Corruption Generalization

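For each of the 9 corruption types, we hold it out entirely, train on the other 8, and evaluate on the held-out type. We report two metrics, Spearman ρ (rank correlation) and AUROC (episode-level failure classification), for the 6 types that cause failures: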

LOO results
Held-out Type    Category        Spearman ρ    p-value    AUROC
Rain             Lens contact    1.000         0.000      0.981
Defocus blur     Optical         0.894         0.041      0.973
Dust / mud       Environmental   0.783         0.118      0.943
Motion blur      Blur            0.707         0.182      1.000
Low-light        Illumination    0.707         0.182      1.000
Fingerprint      Lens contact    0.667         0.219      0.636
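As a consistency check, the p-values in the table can be recovered from ρ alone with the standard Student-t approximation. This assumes each correlation is computed over 5 points (one per budget level). The sample size is an inference from the numbers, not something the page states, and the helper name `spearman_p_value` is hypothetical.

```python
import math
from scipy import stats

def spearman_p_value(rho: float, n: int) -> float:
    """Two-sided p-value for a Spearman rho over n paired points (t approximation)."""
    if abs(rho) >= 1.0:
        return 0.0  # a perfect monotone ranking has p = 0 under this approximation
    t_stat = rho * math.sqrt((n - 2) / (1.0 - rho * rho))
    return float(2.0 * stats.t.sf(abs(t_stat), df=n - 2))

N_POINTS = 5  # assumption: one point per budget level

# (rho, reported p-value) pairs copied from the table above
rows = {
    "Rain": (1.000, 0.000),
    "Defocus blur": (0.894, 0.041),
    "Dust / mud": (0.783, 0.118),
    "Motion blur": (0.707, 0.182),
    "Low-light": (0.707, 0.182),
    "Fingerprint": (0.667, 0.219),
}

for name, (rho, reported) in rows.items():
    p = spearman_p_value(rho, N_POINTS)
    print(f"{name:13s} rho={rho:.3f}  recomputed p={p:.3f}  reported p={reported:.3f}")
```

Under that assumption the recomputed values agree with the table to within rounding of ρ. With so few points per type, only near-perfect rankings can fall below p = 0.05, so the larger p-values reflect the small sample as much as the predictor.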

Mean Spearman ρ: 0.793 ± 0.118 (across 6 failure-causing types)

Mean AUROC: 0.922 ± 0.129 (episode-level classification)

From-Scratch Baseline: ρ = −0.62 (anti-correlated without pretraining)
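The summary means and spreads can be reproduced from the six per-type rows in the results table. The sketch below assumes the ± values are population standard deviations (`ddof=0`), which is the setting that matches the reported numbers.

```python
import numpy as np

# Per-type values copied from the leave-one-out results table
rho = np.array([1.000, 0.894, 0.783, 0.707, 0.707, 0.667])
auroc = np.array([0.981, 0.973, 0.943, 1.000, 1.000, 0.636])

def mean_and_std(values: np.ndarray) -> tuple[float, float]:
    """Mean and population standard deviation (ddof=0) of a 1-D array."""
    return float(np.mean(values)), float(np.std(values))

rho_mean, rho_std = mean_and_std(rho)
auroc_mean, auroc_std = mean_and_std(auroc)

print(f"Spearman rho: {rho_mean:.3f} ± {rho_std:.3f}")   # 0.793 ± 0.118
print(f"AUROC:        {auroc_mean:.3f} ± {auroc_std:.3f}")  # 0.922 ± 0.129
```

The sample standard deviation (`ddof=1`) would give somewhat larger spreads. Most of the AUROC spread comes from the fingerprint fold (0.636).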

In-Distribution Performance

When trained and evaluated on the same corruption types, the predictor accurately ranks severity:

Per-fold budget vs prediction

Feature-Space Similarity

How similar do these corruptions look to the frozen ResNet-18 backbone? Pairwise cosine similarity in the 512-dimensional feature space:

Feature similarity heatmap

Left: Raw feature similarity. Most corruptions cluster tightly (>0.9), but rain is an outlier (0.75). Right: Feature-difference similarity (Mintun et al., NeurIPS 2021) — cosine similarity between the directions corruptions push features away from clean.
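The two similarity measures in the caption can be sketched with NumPy. The function names are hypothetical. Averaging each corruption's features before taking the cosine is an assumption; the project may aggregate differently, for example by averaging pairwise similarities. The features here are synthetic stand-ins for 512-dimensional backbone outputs.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def raw_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Cosine similarity between the mean feature vectors of two corruptions."""
    return cosine(feat_a.mean(axis=0), feat_b.mean(axis=0))

def difference_similarity(feat_a, feat_b, feat_clean) -> float:
    """Cosine similarity between the shifts two corruptions induce relative to clean."""
    clean_mean = feat_clean.mean(axis=0)
    return cosine(feat_a.mean(axis=0) - clean_mean,
                  feat_b.mean(axis=0) - clean_mean)

# Synthetic features: 10 frames x 512 dims, with two corruptions that shift
# the clean features along orthogonal directions.
rng = np.random.default_rng(0)
clean = rng.normal(size=(10, 512)) + 5.0
shift_a = np.zeros(512)
shift_a[0] = 1.0
shift_b = np.zeros(512)
shift_b[1] = 1.0
feat_a, feat_b = clean + shift_a, clean + shift_b

print(raw_similarity(feat_a, feat_b))                # close to 1
print(difference_similarity(feat_a, feat_b, clean))  # close to 0
```

The synthetic case shows why the two panels can disagree: raw similarity is dominated by what the corrupted features share with the clean ones, while feature-difference similarity compares only the direction of the shift.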

Technical Details

Policy: InternVLA-M1

Environment: SimplerEnv / SAPIEN

Task: Pick Coke Can

Corruption Types: 9

LOO Folds: 9

Episodes / Condition: 10

Budget Levels: 5 (10%–90%)

GPU: RTX 4090 (vast.ai)

SOTIF Context

SOTIF (ISO 21448) addresses safety of the intended functionality: failures that stem from limitations of perception rather than hardware faults. Camera corruption is a canonical SOTIF triggering condition. The results here suggest that pretrained visual features can support corruption-agnostic safety monitoring, which could reduce the need for corruption-specific validation.

Stack: Python 3.11 · PyTorch · torchvision · NumPy · OpenCV · SimplerEnv · ManiSkill2 · InternVLA-M1 · scikit-learn · scipy

Related Work

CNN Robustness to Camera Occlusion

The foundational study: a SOTIF-style robustness audit of a CNN traffic sign classifier against realistic sensor noise, using the open-source camera-occlusion library.