
"Perhaps the most prominent change for most users is that in the consumer app experiences (web, mobile, and desktop), Claude will be less prone to abruptly hard-stopping conversations because they have run too long. The improvement to memory within a single conversation applies not just to Opus 4.5, but to any current Claude models in the apps. Users who experienced abrupt endings (despite having room left in their session and weekly usage budgets) were hitting a hard context window (200,000 tokens)."
"Whereas some large language model implementations simply start trimming earlier messages from the context when a conversation runs past the maximum in the window, Claude simply ended the conversation rather than allow the user to experience an increasingly incoherent conversation where the model would start forgetting things based on how old they are. Now, Claude will instead go through a behind-the-scenes process of summarizing the key points from the earlier parts of the conversation, attempting to discard what it deems extraneous while keeping what's important."
"Opus 4.5 is the first model to surpass an accuracy score of 80 percent-specifically, 80.9 percent in the SWE-Bench Verified benchmark, narrowly beating OpenAI's recently released GPT-5.1-Codex-Max (77.9 percent) and Google's Gemini 3 Pro (76.2 percent). The model performs particularly well in agentic coding and agentic tool use benchmarks, but still lags behind GPT-5.1 in visual reasoning (MMMU)."
Opus 4.5 reduces abrupt conversation terminations in the consumer apps by improving memory within a single conversation, an improvement that applies to all current Claude models in the apps. Conversations that previously ended on hitting the 200,000-token context limit are now compacted through behind-the-scenes summarization that discards extraneous content while preserving important details. Developers can apply similar techniques via the API's context management and context compaction features. Opus 4.5 achieved 80.9 percent on the SWE-Bench Verified benchmark, outperforming recent competitor models and showing particular strength in agentic coding and tool use, while trailing GPT-5.1 in visual reasoning (MMMU).
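To make the technique concrete, here is a minimal sketch of client-side context compaction using the `anthropic` Python SDK: when the running conversation approaches a token budget, older turns are summarized into a single message and only the recent turns are kept verbatim. This is not Anthropic's actual in-app implementation; the model name, threshold, prompt, and helper names are assumptions for illustration.

```python
# Illustrative sketch of client-side context compaction, NOT Anthropic's
# actual in-app mechanism. Assumes the `anthropic` Python SDK; the model
# name, budget, and prompts below are invented for this example.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

MODEL = "claude-opus-4-5"        # model id assumed for illustration
TOKEN_BUDGET = 150_000           # compact well before the 200K hard limit
KEEP_RECENT = 10                 # even count preserves user/assistant alternation


def count_input_tokens(messages):
    """Use the token-counting endpoint to measure the current context size."""
    return client.messages.count_tokens(model=MODEL, messages=messages).input_tokens


def compact(messages):
    """Summarize older turns into one message; keep recent turns verbatim."""
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    if not older:
        return messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system="Summarize the key facts, decisions, and open questions from "
               "this conversation. Discard pleasantries and detours.",
        messages=[{"role": "user", "content": transcript}],
    )
    # Replace the older turns with a single compacted user message.
    compacted = [{
        "role": "user",
        "content": f"[Summary of earlier conversation]\n{summary.content[0].text}",
    }]
    return compacted + recent


def send(messages, user_text):
    """Append a user turn, compacting first if the context has grown too large."""
    messages = messages + [{"role": "user", "content": user_text}]
    if count_input_tokens(messages) > TOKEN_BUDGET:
        messages = compact(messages)
    reply = client.messages.create(model=MODEL, max_tokens=1024, messages=messages)
    messages.append({"role": "assistant", "content": reply.content[0].text})
    return messages
```

The API-side context management and compaction features mentioned above cover similar ground server-side; the sketch only shows the shape of the technique that the consumer apps now apply automatically.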
Read at Ars Technica