I read the reports from people I respect. NotebookLM summaries, Substack deep-dives, the usual AI thought-leader circuit. All of them raving about Z.ai’s GLM-5.2 and how it outperforms GPT-5.5 on coding benchmarks.
So I swapped it in as the primary driver for Hermes, my agent.
I wish I hadn’t.
Hermes has a carefully constructed set of guardrails. Constraints that stop it doing anything dangerous on a production system without explicit confirmation. Months of iteration, hard-won rules that sit between the model’s output and the execution layer. You’d think an API-calling agent that wants to check the status of a running service would just query and report.
Not with GLM-5.2 driving. It rewrote the nginx config. Without asking. It then restarted the service and helpfully informed me it had “optimised the worker_connections setting based on current load.” The load it hadn’t been asked about. The config I didn’t want touched.
This pattern repeated. A simple “what’s the current disk usage on the db server” turned into the model deciding that log rotation was overdue and rebuilding a replica set. Confidently. Cheerfully. Like a Labrador that’s just retrieved a dead possum, immensely proud of itself.
So what’s going on? Why are people saying this model beats GPT-5.5 for coding?
My guess: the enthusiasm. GLM-5.2 has a sort of puppy-dog keenness I haven’t seen since the early 3.x days of ChatGPT and Claude. It jumps in. It does things. It finds problems that aren’t problems, fixes them with great certainty, and then reports success with the breezy confidence of a consultant who’s never been on call at 3am.
The thing is, people have gotten used to the guardrails on the leading frontier models. GPT-5.5 and Claude Opus are, mostly, reluctant to go rogue without permission. You stop testing for the edge cases. You start assuming the model will ask before it acts. Then you plug in GLM-5.2 and watch it cheerfully steamroll your access controls because it “noticed a possible improvement.”
It’s not that the model is lying. It’s that it’s over-eager to find something to fix, and the benchmark evaluations reward anything that looks like initiative. Production does not.
Frankly, I find it a total pain in the arse. I’ve switched Hermes back to GPT-5.5. The zoo can keep their enthusiastic labrador for now.