Silent Killers: Error Suppression in LLM-Generated Code

Abstract

While humans generally understand that code which runs without errors is not necessarily correct, language models routinely produce exception handlers that silently swallow errors, yielding code that appears to work but produces corrupted results. We audit exception handling in Python code generated by 25 large language models spanning seven provider families on three scientific computing tasks. An abstract syntax tree analysis of 1,500 responses (1,245 yielding parsable code) reveals that most families produce broad catches that silently suppress errors, at rates up to 87% on complex tasks, through five recurring failure modes: bare catches, sentinel returns, NaN imputation, zero-fill suppression, and logged-but-suppressed errors.

Rates vary across families in ways that defy simple explanation: open-weight families with similar training diverge sharply, and reasoning-optimized models show lower rates within some provider families but not others. A temperature control experiment confirms that the variation is not a sampling artifact. A prompt ablation study reveals two regimes: on simple tasks, a single instructional sentence nearly eliminates suppression, while on complex tasks, residual rates persist, requiring static analysis as a safety net.

Model	Generation		Diagnosis
	Original	+Re-raise	Original	+Re-raise
Claude Sonnet 4	0.97	0.04	0.68	0.35
DeepSeek V3	0.27	0.00	0.62	0.24
GPT-4.1	0.00	0.00	0.57	0.11
o3-mini	0.00	0.00	0.00	0.00
Qwen3.5-9B	0.58	0.14	0.22	0.26

Methodology

A handler is classified as unsafe if and only if it satisfies both conditions: (1) Broad catch — the except clause catches Exception, BaseException, or uses a bare except:, and (2) No re-raise — the handler body contains no raise statement. This two-criterion test is conservative: handlers that catch broadly but re-raise are classified as safe, and handlers targeting narrow exceptions (e.g., FileNotFoundError) are excluded.
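The two-criterion test above can be sketched with Python's standard ast module. This is a minimal illustration under stated assumptions, not the study's actual implementation: the names BROAD_NAMES, find_unsafe_handlers, and SAMPLE are hypothetical, and the sketch counts any raise found inside the handler body (including one in a nested function) as a re-raise.

```python
import ast

# Exception names treated as "broad" by the default test described above.
BROAD_NAMES = {"Exception", "BaseException"}


def _is_broad(handler: ast.ExceptHandler) -> bool:
    """True for a bare `except:` or a clause naming Exception/BaseException."""
    if handler.type is None:  # bare except:
        return True
    # A clause may name a single exception or a tuple of them.
    if isinstance(handler.type, ast.Tuple):
        targets = handler.type.elts
    else:
        targets = [handler.type]
    return any(isinstance(t, ast.Name) and t.id in BROAD_NAMES for t in targets)


def _has_raise(handler: ast.ExceptHandler) -> bool:
    """True if a raise statement appears anywhere inside the handler body."""
    return any(
        isinstance(node, ast.Raise)
        for stmt in handler.body
        for node in ast.walk(stmt)
    )


def find_unsafe_handlers(source: str) -> list[int]:
    """Line numbers of handlers that catch broadly and never re-raise."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler)
        and _is_broad(node)
        and not _has_raise(node)
    )


# Hypothetical input: one unsafe handler, one broad-but-re-raising handler,
# and one narrow handler that the default test excludes.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        return None

def load_reraise(path):
    try:
        return open(path).read()
    except Exception as exc:
        raise RuntimeError("load failed") from exc

def load_narrow(path):
    try:
        return open(path).read()
    except FileNotFoundError:
        return None
'''

print(find_unsafe_handlers(SAMPLE))  # [5]
```

Only the first handler is reported: the second catches broadly but re-raises, and the third targets a narrow exception.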

The open-source silent-killers PyPI package provides both this default mode and a strict mode (any handler without raise, regardless of exception type) for safety-critical contexts.
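The difference between the two modes can be illustrated with a short sketch. It does not use or reproduce the silent-killers package's interface: flag_handlers, its strict parameter, and SNIPPET are hypothetical names, and the logic is only one plausible reading of the two definitions given here.

```python
import ast


def flag_handlers(source: str, strict: bool = False) -> list[int]:
    """Hypothetical helper: line numbers of handlers flagged in each mode.

    Default mode flags broad handlers without a raise; strict mode flags
    any handler without a raise, regardless of exception type.
    """
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        reraises = any(
            isinstance(inner, ast.Raise)
            for stmt in node.body
            for inner in ast.walk(stmt)
        )
        if reraises:
            continue  # handlers that re-raise are safe in both modes
        if node.type is None:
            broad = True  # bare except:
        else:
            targets = (
                node.type.elts if isinstance(node.type, ast.Tuple) else [node.type]
            )
            broad = any(
                isinstance(t, ast.Name) and t.id in {"Exception", "BaseException"}
                for t in targets
            )
        if strict or broad:
            flagged.append(node.lineno)
    return sorted(flagged)


# A narrow handler that substitutes a default value instead of failing.
SNIPPET = '''
try:
    value = float(raw)
except ValueError:
    value = 0.0
'''

print(flag_handlers(SNIPPET))               # [] in default mode
print(flag_handlers(SNIPPET, strict=True))  # [4] in strict mode
```

The narrow ValueError handler passes the default test but is flagged in strict mode, which is the behavior a safety-critical setting would want when any substituted value could corrupt a result.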

Why It Matters

Proxy satisfaction: Models may optimize for "code runs without visible errors" rather than "code is correct" — a concrete instance of Goodhart's Law in code generation.
Evaluation brittleness: Pass@k and execution-accuracy benchmarks do not penalize silent suppression. A script that swallows all exceptions will pass every execution test.
Agentic amplification: In multi-step pipelines (ReAct, SWE-agent), a suppressed exception produces output that looks valid to downstream steps, compounding errors invisibly.
Reproducibility under $100: All experiments use publicly available model APIs and open-weight models. The full study can be reproduced for under $100 in API costs.
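The evaluation-brittleness point can be made concrete with a toy example. Everything here is illustrative and assumed: mean_ratio_suppressed, mean_ratio_strict, and runs_without_error are hypothetical names, and runs_without_error is only a stand-in for a check that asks whether code ran, not a model of any particular benchmark.

```python
import math


def mean_ratio_suppressed(numerators, denominators):
    """Illustrative only: a broad handler turns every failure into NaN."""
    ratios = []
    for num, den in zip(numerators, denominators):
        try:
            ratios.append(num / den)
        except Exception:
            ratios.append(float("nan"))  # the error is silently replaced
    return sum(ratios) / len(ratios)


def mean_ratio_strict(numerators, denominators):
    """Same computation, but failures propagate to the caller."""
    ratios = [num / den for num, den in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)


def runs_without_error(func, *args):
    """Toy execution check: passes whenever no exception escapes func."""
    try:
        func(*args)
    except Exception:
        return False
    return True


# The second denominator is zero, so one division fails.
data = ([1.0, 2.0, 3.0], [2.0, 0.0, 4.0])

print(runs_without_error(mean_ratio_suppressed, *data))  # True
print(math.isnan(mean_ratio_suppressed(*data)))          # True: result is NaN
print(runs_without_error(mean_ratio_strict, *data))      # False
```

The suppressed version passes the execution check while returning NaN, and a downstream step that receives that value has no signal that a division failed.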
