Robotics · Safety · Computer Vision
A 33K-parameter visual safety monitor predicts when camera corruption will cause a robot to fail — even for corruption types it has never seen in training.
Rain occlusion at 50% budget: the policy fails 40% of the time. The P(failure) gauge comes from a model that has never seen rain in training.
A vision-language-action model (InternVLA-M1) picks up a Coke can from a table in simulation (SimplerEnv / SAPIEN). The can starts at a random position each episode. Under clean conditions, the policy achieves a 100% success rate.
What happens when the camera lens is dirty, wet, or otherwise degraded? We test 9 corruption types spanning 6 physical mechanisms and ask whether a lightweight safety monitor can predict failure — even for corruption types it has never encountered.
The corruption types follow the ImageNet-C taxonomy, adapted for fixed-mount robot cameras. Each corruption has a budget level (0–100%) that controls its severity.
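To make the idea of a severity budget concrete, here is a minimal sketch of how a budget in [0, 1] could scale two of the simpler corruptions. The function name `apply_corruption`, the two `kind` strings, and the scaling constants (0.9 and 0.5) are illustrative assumptions, not the parameters used in this project.

```python
import numpy as np

def apply_corruption(frame, kind, budget, rng=None):
    """Corrupt a uint8 HxWx3 frame; budget in [0, 1] scales severity.

    The budget-to-severity mappings below are hypothetical placeholders.
    """
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must lie in [0, 1]")
    img = frame.astype(np.float32) / 255.0
    if kind == "low_light":
        # Assumed mapping: darken linearly, keeping 10% brightness at full budget.
        img = img * (1.0 - 0.9 * budget)
    elif kind == "gaussian_noise":
        # Assumed mapping: noise standard deviation grows linearly with budget.
        if rng is None:
            rng = np.random.default_rng(0)
        img = img + rng.normal(0.0, 0.5 * budget, size=img.shape)
    else:
        raise ValueError(f"unknown corruption kind: {kind}")
    return (np.clip(img, 0.0, 1.0) * 255.0).round().astype(np.uint8)

# Example: a flat gray frame at two low-light budgets.
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
mild = apply_corruption(frame, "low_light", 0.1)
severe = apply_corruption(frame, "low_light", 0.9)
```

A budget of 0 leaves the frame unchanged, so the same code path can serve the clean condition.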
Each animation shows 10 real episodes where the policy sees the corrupted frames, with a real-time P(failure) gauge and running success counter. The P(failure) predictions come from a model trained without the displayed corruption type.
Format: Type budget (success rate)
Clean (100%)
Fingerprint 90% (100%)
Rain 50% (60%)
Glare 90% (100%)
Defocus 90% (60%)
Motion Blur 90% (80%)
Gauss. Noise 90% (100%)
Dust/Mud 90% (80%)
Low-Light 90% (80%)
JPEG 90% (100%)
A lightweight CNN predicts P(failure) from a single camera frame — the occluded image the policy actually sees, with no knowledge of the corruption type, budget, or parameters.
| Component | Value | Note |
|---|---|---|
| Backbone | ResNet-18 | Frozen, ImageNet-pretrained |
| Trainable params | 33K | 0.3% of total |
| Input | Single frame | No history needed |
| Output | P(failure) | [0, 1] per frame |
Key insight: The frozen backbone provides corruption-agnostic visual features — edges, textures, contrast, sharpness — that degrade in recognizable ways regardless of the physical corruption mechanism. Only the 33K-parameter head learns what degradation patterns lead to task failure.
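For each of the 9 corruption types, we hold it out entirely, train on the other 8, and evaluate on the held-out type. Two metrics are reported per fold: Spearman rank correlation (ρ, with its p-value) and AUROC for episode-level failure classification.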
| Held-out Type | Category | Spearman ρ | p-value | AUROC |
|---|---|---|---|---|
| Rain | Lens contact | 1.000 | < 0.001 | 0.981 |
| Defocus blur | Optical | 0.894 | 0.041 | 0.973 |
| Dust / mud | Environmental | 0.783 | 0.118 | 0.943 |
| Motion blur | Blur | 0.707 | 0.182 | 1.000 |
| Low-light | Illumination | 0.707 | 0.182 | 1.000 |
| Fingerprint | Lens contact | 0.667 | 0.219 | 0.636 |
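The leave-one-out protocol and the two metrics in the table can be sketched as below. The corruption identifiers, the helper names, and the exact inputs to the Spearman correlation are assumptions: here ρ compares the observed failure rate per budget level with the mean predicted P(failure) per budget level, and AUROC scores one prediction per episode against its failure label. The project's actual aggregation may differ.

```python
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# Hypothetical identifiers for the 9 corruption types named on this page.
CORRUPTIONS = [
    "fingerprint", "rain", "glare", "defocus", "motion_blur",
    "gaussian_noise", "dust_mud", "low_light", "jpeg",
]

def leave_one_out_folds(types):
    """Yield (train_types, held_out_type): 8 types to train on, 1 to evaluate."""
    for held_out in types:
        yield [t for t in types if t != held_out], held_out

def evaluate_fold(failure_rate_by_budget, mean_pred_by_budget,
                  episode_scores, episode_failed):
    """Return (rho, p_value, auroc) for one held-out corruption type."""
    # Rank agreement between observed and predicted severity across budgets.
    rho, p_value = spearmanr(failure_rate_by_budget, mean_pred_by_budget)
    # Episode-level classification: does a higher score mean failure?
    auroc = roc_auc_score(episode_failed, episode_scores)
    return rho, p_value, auroc

folds = list(leave_one_out_folds(CORRUPTIONS))

# Toy numbers, not results from this project.
rho, p_value, auroc = evaluate_fold(
    failure_rate_by_budget=[0.0, 0.1, 0.2, 0.4, 0.6],
    mean_pred_by_budget=[0.05, 0.10, 0.30, 0.50, 0.70],
    episode_scores=[0.1, 0.4, 0.35, 0.8],
    episode_failed=[0, 0, 1, 1],
)
```

With only five budget levels per type, ρ can take few distinct values, which is worth keeping in mind when reading the p-values.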
| Summary | Value | Note |
|---|---|---|
| Mean Spearman ρ | 0.793 ± 0.118 | Across 6 failure-causing types |
| Mean AUROC | 0.922 ± 0.129 | Episode-level classification |
| From-scratch baseline | ρ = −0.62 | Anti-correlated without pretraining |
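As a quick arithmetic check, the summary means and spreads can be recomputed from the six rows of the leave-one-out table. The use of the population standard deviation (`ddof=0`) is inferred from the fact that it reproduces the reported values; the page does not state which estimator was used.

```python
import numpy as np

# Per-type values copied from the leave-one-out table
# (rain, defocus, dust/mud, motion blur, low-light, fingerprint).
rho = np.array([1.000, 0.894, 0.783, 0.707, 0.707, 0.667])
auroc = np.array([0.981, 0.973, 0.943, 1.000, 1.000, 0.636])

rho_mean, rho_std = rho.mean(), rho.std(ddof=0)
auroc_mean, auroc_std = auroc.mean(), auroc.std(ddof=0)

print(f"Spearman rho: {rho_mean:.3f} +/- {rho_std:.3f}")
print(f"AUROC:        {auroc_mean:.3f} +/- {auroc_std:.3f}")
```

The fingerprint row (AUROC 0.636) accounts for most of the AUROC spread; the other five types all sit above 0.94.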
When trained and evaluated on the same corruption types, the predictor accurately ranks severity.
How similar do these corruptions look to the frozen ResNet-18 backbone? Pairwise cosine similarity in the 512-dimensional feature space:
Left: Raw feature similarity. Most corruptions cluster tightly (>0.9), but rain is an outlier (0.75). Right: Feature-difference similarity (Mintun et al., NeurIPS 2021), the cosine similarity between the directions in which corruptions push features away from the clean features.
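The two similarity measures in the figure can be sketched with NumPy. This assumes each corruption is summarized by a single mean feature vector from the backbone (512-dimensional in the project; 2-dimensional toy vectors are used here for readability). The function names are illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_matrices(clean, corrupted):
    """Return (names, raw, diff) pairwise cosine-similarity matrices.

    raw[i, j]  compares the corrupted features themselves.
    diff[i, j] compares the shifts away from the clean features.
    """
    names = list(corrupted)
    shifts = {n: corrupted[n] - clean for n in names}
    raw = np.array([[cosine(corrupted[i], corrupted[j]) for j in names]
                    for i in names])
    diff = np.array([[cosine(shifts[i], shifts[j]) for j in names]
                     for i in names])
    return names, raw, diff

# Toy vectors, not features from this project.
clean = np.array([1.0, 0.0])
corrupted = {"a": np.array([2.0, 0.0]), "b": np.array([1.0, 1.0])}
names, raw, diff = similarity_matrices(clean, corrupted)
```

In the toy example the raw features look fairly similar (about 0.71) while the shift directions are orthogonal (0.0), which illustrates why the two panels can disagree.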
| Setting | Value |
|---|---|
| Policy | InternVLA-M1 |
| Environment | SimplerEnv / SAPIEN |
| Task | Pick Coke Can |
| Corruption types | 9 |
| LOO folds | 9 |
| Episodes / condition | 10 |
| Budget levels | 5 (10%–90%) |
| GPU | RTX 4090 (vast.ai) |
SOTIF (ISO 21448) addresses safety of the intended functionality — failures from limitations of perception, not hardware faults. Camera corruption is a canonical SOTIF triggering condition. This project demonstrates that pretrained visual features enable corruption-agnostic safety monitoring, reducing the need for corruption-specific validation.
Stack: Python 3.11 · PyTorch · torchvision · NumPy · OpenCV · SimplerEnv · ManiSkill2 · InternVLA-M1 · scikit-learn · scipy