
GLM-5.1: A Leap Forward for Long-Horizon Agentic Tasks

AI · LLM · Software Engineering · NL2Repo · SWE-Bench · Agentic Engineering · GLM-5.1 · Vector Database
April 8, 2026

TL;DR

  • GLM-5.1 surpasses previous models on complex coding benchmarks like SWE-Bench Pro and NL2Repo.
  • Unlike earlier models, GLM-5.1 maintains performance over extended agentic tasks, demonstrating sustained optimization through iteration.
  • The model excels at breaking down complex problems, analyzing results, and revising strategies over hundreds of rounds of tool calls.

GLM-5.1: Towards Long-Horizon Tasks

GLM-5.1 represents the next generation in agentic engineering, showcasing significantly enhanced coding capabilities compared to its predecessor, GLM-5. It achieves state-of-the-art performance on the challenging SWE-Bench Pro benchmark and demonstrates a substantial lead over GLM-5 on both NL2Repo (repository generation) and Terminal-Bench 2.0 (real-world terminal tasks).

Complex Software Engineering Tasks

GLM-5.1's performance is particularly notable on complex software engineering tasks. The table below summarizes its performance against other leading models:

Model            SWE-Bench Pro
GLM-5.1          58.4
GPT-5.4 Opus     57.7
Gemini 3.1 Pro   57.3
GLM-5            55.1
Opus 4.6         54.2

While achieving leading initial performance is important, the most significant advancement with GLM-5.1 lies in its ability to sustain effectiveness over longer agentic tasks. Previous models often reach a plateau after initial gains, where increased processing time doesn’t yield further improvements. GLM-5.1, however, is designed for prolonged engagement, exhibiting improved judgment in ambiguous situations and maintaining productivity over extended sessions.

This model excels at decomposing complex problems, running experiments, interpreting results, and identifying roadblocks with precision. Crucially, GLM-5.1 revisits its reasoning and refines its approach through repeated iteration, enabling sustained optimization across hundreds of rounds and thousands of tool calls. The longer the model runs, the better the results become.

This capability was demonstrated across three tasks with varying degrees of structured feedback: vector search optimization with a single numeric metric, GPU kernel benchmarking with per-problem speedup measurements, and an open-ended web application build with only the model’s own judgment to guide improvements.

Scenario 1: Optimizing a Vector Database Over 600 Iterations

VectorDBBench is an open-source challenge focused on building a high-performance database for approximate nearest neighbor search. The model receives a Rust skeleton with HTTP API endpoints and empty implementation stubs, then uses tool calls for file manipulation, compilation, testing, and profiling within a 50-turn tool-call limit.

The benchmark assesses performance on the SIFT-1M dataset, ranking models by QPS (queries per second) while maintaining a Recall rate of at least 95%. The previous best result, 3,547 QPS, was achieved by Claude Opus 4.6.
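The two metrics interact as a gate: submissions are ranked by QPS, but a result only counts if recall stays at or above 95%. A minimal sketch of how recall is computed per query (the function name and toy data here are illustrative, not the actual VectorDBBench harness):

```rust
use std::collections::HashSet;

/// Fraction of the ground-truth nearest neighbors that the index
/// actually returned. Recall@k = |retrieved ∩ truth| / |truth|.
fn recall_at_k(retrieved: &[u32], ground_truth: &[u32]) -> f64 {
    let truth: HashSet<u32> = ground_truth.iter().copied().collect();
    let hits = retrieved.iter().filter(|id| truth.contains(id)).count();
    hits as f64 / ground_truth.len() as f64
}

fn main() {
    // Toy example: 4 of 5 true neighbors found -> recall 0.8,
    // which would fail the benchmark's 0.95 gate.
    let retrieved = [1, 2, 3, 4, 9];
    let truth = [1, 2, 3, 4, 5];
    let r = recall_at_k(&retrieved, &truth);
    println!("recall = {r}");
    assert!((r - 0.8).abs() < 1e-9);
}
```

The benchmark averages this quantity over the SIFT-1M query set; any speed trick (quantization, aggressive pruning) is only admissible if the averaged recall stays above the threshold.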

The evaluation was restructured to remove the 50-turn limit, employing a framework inspired by Claude Code. In this setup, the model could utilize an unlimited number of tool calls to edit code, compile, test, and profile, submitting new versions for benchmarking as needed. The model autonomously decided when to submit and what improvements to attempt.

GLM-5.1 didn’t plateau after 50 or 100 submissions. Instead, it continued to find meaningful improvements over 600+ iterations, totaling over 6,000 tool calls, ultimately reaching 21.5k QPS, approximately 6x the previous best result from a single 50-turn session. The optimization process followed a staircase pattern: periods of incremental tuning were punctuated by structural changes that expanded the performance boundary. These changes included shifting from full-corpus scanning to IVF cluster probing with f16 vector compression, and introducing a two-stage pipeline (u8 prescoring followed by f16 reranking).
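The two-stage idea behind those staircase jumps can be sketched in a few lines: score every candidate with a cheap quantized metric, then rerank only a shortlist with the exact one. This is an illustrative stdlib-only sketch, not GLM-5.1's actual code; it reranks in f32 rather than f16, since stable Rust has no f16 primitive, and the function names and ranges are assumptions:

```rust
/// Quantize a vector to u8 given a per-dataset value range (assumed [lo, hi]).
fn quantize(v: &[f32], lo: f32, hi: f32) -> Vec<u8> {
    v.iter()
        .map(|x| (((x - lo) / (hi - lo)).clamp(0.0, 1.0) * 255.0) as u8)
        .collect()
}

/// Cheap integer squared-L2 distance used for stage-1 prescoring.
fn l2_u8(a: &[u8], b: &[u8]) -> u64 {
    a.iter().zip(b).map(|(x, y)| {
        let d = *x as i64 - *y as i64;
        (d * d) as u64
    }).sum()
}

/// Exact squared-L2 distance used only on the shortlist.
fn l2_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Two-stage search: u8 prescoring over the full corpus, exact rerank
/// of the `shortlist` best candidates. Returns the winning corpus index.
fn search(query: &[f32], corpus: &[Vec<f32>], corpus_q: &[Vec<u8>], shortlist: usize) -> usize {
    let q_q = quantize(query, 0.0, 1.0);
    // Stage 1: rank everything by the cheap quantized distance.
    let mut order: Vec<usize> = (0..corpus.len()).collect();
    order.sort_by_key(|&i| l2_u8(&q_q, &corpus_q[i]));
    // Stage 2: exact rerank of the shortlist only.
    order.truncate(shortlist);
    order.into_iter()
        .min_by(|&a, &b| l2_f32(query, &corpus[a]).total_cmp(&l2_f32(query, &corpus[b])))
        .unwrap()
}

fn main() {
    let corpus = vec![vec![0.1, 0.9], vec![0.5, 0.5], vec![0.9, 0.1]];
    let corpus_q: Vec<Vec<u8>> = corpus.iter().map(|v| quantize(v, 0.0, 1.0)).collect();
    let nearest = search(&[0.48, 0.52], &corpus, &corpus_q, 2);
    println!("nearest = {nearest}"); // the middle vector wins after reranking
    assert_eq!(nearest, 1);
}
```

The payoff is that the exact metric, the expensive part, runs on a handful of candidates instead of the whole corpus; combined with IVF probing (scanning only a few clusters rather than all of SIFT-1M), this is the kind of structural change that produces a step in the staircase rather than a tuning increment.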

Source: Hacker News Best