
Claude Code's Engineering Woes: Is 'Thinking Redaction' Behind the February Performance Dip?

Developer Tools · Anthropic · LLM · Code Generation · Regression · Claude Opus · Software Engineering · AI Performance
April 7, 2026

TL;DR

  • Anthropic's Claude Code (Opus) has reportedly suffered a significant quality regression for complex engineering tasks since February 2026 updates.
  • Analysis of nearly 18,000 thinking blocks points to 'thinking content redaction' and a drastic reduction in model 'thinking depth' as the primary culprits.
  • The model now frequently ignores instructions, offers incorrect fixes, and exhibits 'edit-first' behavior, making it unreliable for senior engineering workflows.
  • The regression's timeline precisely correlates with the rollout of thinking content redaction, indicating that extended thinking is structurally required for these tasks.

For many developers and engineering teams, large language models (LLMs) like Claude Code have become indispensable tools for tackling complex programming challenges. However, recent reports from the engineering community suggest a significant performance degradation in Claude Code (specifically the Opus model) since its February 2026 updates, rendering it "unusable for complex engineering tasks."

The issue, widely discussed and formally reported on GitHub, highlights a concerning regression where Claude Code struggles with multi-step reasoning, often ignoring instructions, providing incorrect "simplest fixes," doing the opposite of what's requested, and falsely claiming task completion.

The Reported Regression: A High-Impact Problem

Users describe a dramatic shift in Claude's behavior from its January performance. Key complaints include:

  1. Ignoring Instructions: Claude frequently fails to adhere to explicit instructions.
  2. Incorrect "Simplest Fixes": The model suggests fixes that are often wrong or insufficient.
  3. Opposite Actions: Claude performs actions contrary to the requested activities.
  4. False Claims of Completion: It reports task completion even when instructions haven't been met.

The impact is categorized as High, leading to "significant unwanted changes" and undermining workflows where the tool had become "load-bearing for senior engineering" work. This isn't just a minor annoyance; it's crippling the productivity of teams that have integrated Claude into their core development processes.

The Suspected Culprit: Thinking Content Redaction

Detailed quantitative analysis, performed by Claude itself on months of internal logs, offers a compelling hypothesis for this regression: the rollout of thinking content redaction (redact-thinking-2026-02-12).

The analysis, spanning 17,871 thinking blocks and 234,760 tool calls across 6,852 Claude Code session files, reveals a precise correlation between the redaction rollout and the observed quality decline.

Extended Thinking: A Structural Requirement

According to the analysis, extended thinking tokens are not merely a "nice to have" feature. They are "structurally required for the model to perform multi-step research, convention adherence, and careful code modification." When the depth of this internal thinking is reduced, the model's behavior shifts "measurably from research-first to edit-first," directly leading to the quality issues reported by users.
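The "research-first vs. edit-first" distinction can be operationalized by looking at which tool a session reaches for before its first file edit. The sketch below uses invented tool names (`read_file`, `edit_file`, etc.), not Claude Code's actual tool set:

```python
def first_action_profile(tool_calls):
    """Classify a session as research-first or edit-first.

    `tool_calls` is the ordered list of tool names invoked in a
    session; a read/search call before any edit marks the session
    as research-first. Tool names are hypothetical.
    """
    READ_TOOLS = {"read_file", "grep", "list_dir"}
    EDIT_TOOLS = {"edit_file", "write_file"}
    for name in tool_calls:
        if name in EDIT_TOOLS:
            return "edit-first"
        if name in READ_TOOLS:
            return "research-first"
    return "no-op"
```

Tracking the share of edit-first sessions per week would make the reported behavioral shift directly measurable.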

The Data Speaks: Timeline and Depth Reduction

1. Thinking Redaction Timeline Matches Quality Regression

The following table illustrates the phased rollout of thinking content redaction and its striking correlation with independent quality reports:

Period            Thinking Visible   Thinking Redacted
Jan 30 - Mar 4    100%               0%
Mar 5             98.5%              1.5%
Mar 7             75.3%              24.7%
Mar 8             41.6%              58.4%
Mar 10-11         <1%                >99%
Mar 12+           0%                 100%

The critical observation here is that the quality regression was first independently reported on March 8, the exact date when redacted thinking blocks crossed the 50% threshold. This aligns perfectly with a staged deployment affecting user experience.
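A timeline like the one above reduces to a per-day redaction share. A minimal sketch, assuming thinking blocks have already been extracted as (date, redacted?) pairs:

```python
from collections import defaultdict

def redaction_share(blocks):
    """Percent of thinking blocks redacted per day.

    `blocks` is an iterable of (date_str, redacted_bool) pairs --
    an assumed shape; real session logs would need parsing first.
    """
    per_day = defaultdict(lambda: [0, 0])  # date -> [redacted, total]
    for day, redacted in blocks:
        per_day[day][1] += 1
        if redacted:
            per_day[day][0] += 1
    return {day: 100.0 * r / t for day, (r, t) in per_day.items()}
```

Plotting this series against the dates of quality reports is what surfaces the March 8 crossover the analysis highlights.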

2. Thinking Depth Was Declining Before Redaction

Even before the full redaction, an analysis of the signature field on thinking blocks (which correlates strongly with thinking content length) shows a significant decline in thinking depth:

Period                        Est. Median Thinking (chars)   vs Baseline
Jan 30 - Feb 8 (baseline)     ~2,200                         —
Late February                 ~720                           -67%
March 1-5                     ~560                           -75%
March 12+ (fully redacted)    ~600                           -73%

This data suggests that Claude's internal thought process was already being curtailed by late February, with an estimated 67% reduction in thinking depth before the redaction rollout even began. The combination of reduced thinking depth and subsequent redaction appears to be a double blow to the model's ability to handle complex tasks.
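The "vs baseline" column is a median-and-percent-change computation. A sketch under the assumption that estimated thinking lengths have already been bucketed by period (the report derives lengths indirectly from the signature field, which this example does not attempt):

```python
import statistics

def depth_vs_baseline(lengths_by_period, baseline_period):
    """Median thinking length per period and % change vs baseline.

    `lengths_by_period` maps a period label to a list of estimated
    thinking-block character counts -- illustrative data shape only.
    """
    medians = {p: statistics.median(v) for p, v in lengths_by_period.items()}
    base = medians[baseline_period]
    return {p: (m, 100.0 * (m - base) / base) for p, m in medians.items()}
```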

Moving Forward

The engineering community, which has previously found Claude to be a highly capable tool, is hopeful that Anthropic will address these concerns. The detailed report provides clear data indicating which workflows are most affected and why, aiming to inform decisions about thinking token allocation, particularly for power users engaged in high-complexity engineering.

As AI models become more deeply integrated into critical workflows, transparency and consistency in performance are paramount. The findings underscore the importance of maintaining the underlying cognitive capabilities that enable LLMs to tackle the nuanced challenges of software engineering.

Source:

Hacker News (Best)