
I retested GPT-5's coding skills using OpenAI's guidance - and now I trust it even less

Published in Humor and Quirks by ZDNet
Note: This publication is a summary and evaluation of another publication and may contain editorial commentary from the source.

Re‑testing GPT‑5’s Coding Claims: An Updated Look That Leaves the Author Even More Skeptical

When OpenAI announced GPT‑5 in early August 2025, it came with a heavy dose of hype: a model that could write code faster, debug more accurately, and generate entire applications “on the fly.” A week after the announcement, the author of the ZDNet piece “I retested GPT‑5’s coding skills using OpenAI’s guidance and now I trust it even less” put those claims to the test. The results, though less dramatic than the marketing copy, were sobering enough to prompt a cautious, if not downright skeptical, re‑evaluation of the model’s coding prowess. Below is a detailed recap of the experiment, the methodology, the findings, and what they mean for developers who may be tempted to rely on GPT‑5 as a code‑generation partner.


1. Setting the Stage: Why Test GPT‑5 Again?

OpenAI’s official blog post, which can be found on its research page, outlines a host of improvements over GPT‑4, especially in the domain of code. The company claims that GPT‑5 can:

  • Generate multi‑file projects with complex interdependencies.
  • Auto‑fix bugs identified through static analysis.
  • Compose code that passes unit tests automatically.

Those are tall claims, and even the author of the ZDNet article admitted a residual wariness after the first round of tests. The author’s prior experience with GPT‑4 had revealed a range of surprises, from off‑by‑one errors in array manipulation to subtle misinterpretations of variable scope. This time, the author’s aim was to see whether GPT‑5 had truly resolved these issues or merely smoothed them into a new set of “quirks.”


2. The Experiment Design

The test suite was constructed to cover three distinct coding scenarios:

Scenario A: Build‑time Code Generation
  Description: Generate a complete, working Flask API (including a Dockerfile, requirements.txt, and unit tests) from a high‑level description.
  Expected output: A fully functional API that passes the provided unit tests on a clean virtual machine.

Scenario B: Debugging & Refactoring
  Description: Provide a deliberately buggy Python script containing 12 different logical and syntactical errors; GPT‑5 is asked to identify and fix all of them.
  Expected output: A corrected script that passes all existing tests and does not introduce new errors.

Scenario C: Multi‑Language Collaboration
  Description: Ask GPT‑5 to translate a JavaScript React component into a TypeScript React component while preserving all functionality and adding appropriate type annotations.
  Expected output: A TypeScript file that compiles cleanly and retains identical runtime behaviour.

For each scenario, the author used OpenAI’s “guidance” prompts (the instruction templates released in the official API docs) to steer the model’s output toward higher quality. These prompts emphasize clear specification, modularity, and test‑first coding practices. The author also used the ChatGPT Plus interface for a consistent user experience.
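To make the setup concrete, a guidance‑style request might look roughly like the sketch below. This is a minimal illustration using the openai Python client; the system prompt paraphrases the spirit of the guidance (clear specification, modularity, test‑first) rather than quoting OpenAI’s actual template, and the model name is an assumption.

    # Minimal sketch of a guidance-style coding request, assuming the
    # openai Python client (>= 1.0). The system prompt is illustrative
    # only and does not quote OpenAI's official template.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    GUIDANCE = (
        "You are a careful software engineer. "
        "Before writing code, restate the specification. "
        "Produce small, modular files, and write unit tests first."
    )

    response = client.chat.completions.create(
        model="gpt-5",  # model name assumed for this sketch
        messages=[
            {"role": "system", "content": GUIDANCE},
            {"role": "user", "content": "Build a Flask API with a /health endpoint."},
        ],
    )
    print(response.choices[0].message.content)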


3. What Happened in Each Scenario

A. Build‑time Code Generation

GPT‑5 delivered a Dockerfile, a requirements.txt, an app.py Flask entry point, and a tests/test_app.py containing two test cases. The generated API launched successfully on a fresh Docker container. However, a closer look revealed that the Dockerfile used an outdated base image (python:3.7-slim). According to the author’s policy, which requires the most recent LTS images, this was flagged as a non‑conformant choice. The unit tests passed, but the model had omitted a critical comment block that explained the purpose of each file—an oversight that could hinder long‑term maintainability.
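For reference, the kind of output described above is roughly the following. This is a minimal sketch of a Flask entry point and one pytest test written for illustration; it is not GPT‑5’s actual output, and the /health endpoint is assumed.

    # app.py - minimal sketch of the kind of Flask entry point described
    # above; illustrative only, not GPT-5's actual output.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/health")
    def health():
        # Simple liveness endpoint so the unit test has something to hit.
        return jsonify({"status": "ok"})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

    # tests/test_app.py - one of the two kinds of test cases mentioned
    # above; assumes app.py is importable from the test's working directory.
    import app as app_module

    def test_health():
        client = app_module.app.test_client()
        response = client.get("/health")
        assert response.status_code == 200
        assert response.get_json() == {"status": "ok"}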

B. Debugging & Refactoring

The buggy script had issues ranging from an incorrect loop condition to a missing __init__ method in a custom class. GPT‑5 correctly identified 10 of the 12 bugs on the first pass. Two errors slipped through: a subtle off‑by‑one in a list slice, and a misnamed variable that caused a NameError only when the script was executed under Python 3.12. The model’s subsequent “fix‑all” attempt introduced a new bug of its own, an unnecessary print statement that cluttered the console output. Even after a second pass, the script still failed strict type‑checking under mypy. The author concluded that GPT‑5 remains prone to “quick‑fix” behaviour that prioritizes making the code run over making it correct.
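The article does not reproduce the script, but the two bugs that slipped through belong to familiar classes. The snippet below is a hypothetical reconstruction of those bug classes, not the article’s actual test script.

    # Hypothetical reconstruction of the two escaped bug classes;
    # NOT the article's actual test script.

    def last_three(items):
        # Off-by-one in a list slice: this drops the final element
        # while still returning three items.
        return items[-4:-1]   # buggy
        # Correct: return items[-3:]

    def greet(name):
        message = f"Hello, {name}"
        # Misnamed variable: 'mesage' raises NameError, but only at the
        # moment this function is actually called.
        return mesage         # buggy; should be 'message'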

C. Multi‑Language Collaboration

At first glance, GPT‑5 converted the React component into TypeScript flawlessly. All type annotations were correct, and the resulting file compiled without errors under a standard tsc setup. The test suite, which had been updated to run with jest, also passed. However, the model omitted a comment explaining why the original useState hook used a string | null union type; for developers who rely on inline documentation, that omission could be a non‑trivial barrier to future modifications.


4. Trust Revisited: Why the Author’s Confidence Decreased

The primary reason for the author’s growing unease lies in the pattern of minor yet cumulative errors. While GPT‑5 managed to generate working code in all three scenarios, it repeatedly:

  1. Dropped context – Missing Dockerfile tags, omitted comments, and outdated images suggest that the model is not fully aware of the broader ecosystem.
  2. Introduced “quick‑fix” bugs – The misnamed variable and the added print statement are classic examples of the model’s tendency to produce code that runs immediately, without a long‑term maintenance mindset.
  3. Failed to meet strict compliance – In scenarios where the author applies type‑checking or version‑control policies, GPT‑5’s output falls short.

In short, the model is competent enough to get the job done “just in time” but falls short of the standards required for production‑grade code. The author’s conclusion is not that GPT‑5 is unsafe or useless; rather, it is not yet a trustworthy partner for developers who demand quality, maintainability, and compliance.


5. Broader Context: OpenAI’s Own Claims

OpenAI’s press release claims that GPT‑5 has a 90% pass rate on a curated set of coding tests. However, the author’s test suite was intentionally more varied and included realistic constraints (e.g., Docker base images, strict type‑checking). The author notes that OpenAI’s internal test harness may not expose the same edge cases that occur in real‑world development workflows. Moreover, the author cites an industry analysis from the Harvard Business Review that argues for better alignment between large language models and software engineering best practices.


6. Recommendations for Developers

  1. Treat GPT‑5 as a tool, not a replacement. Use it for brainstorming, generating boilerplate, or prototyping, but never rely on it to produce final, production‑ready code.
  2. Implement a stringent review process. Every snippet from GPT‑5 should be manually inspected, unit‑tested, and type‑checked (a minimal sketch of such a gate follows this list).
  3. Stay alert to environment changes. Keep an eye on Docker images, language versions, and library dependencies that GPT‑5 may not automatically keep up‑to‑date.
  4. Leverage OpenAI’s API guidance wisely. While the “guidance” prompts help steer the model, they are not a guarantee of correctness.
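As a sketch of the review gate suggested in recommendation 2, the script below runs the test suite and a strict type‑check before any generated code is accepted. It assumes pytest and mypy are installed; the paths and commands are illustrative, not a prescribed workflow.

    # Minimal sketch of a local review gate for model-generated code.
    # Assumes pytest and mypy are installed; paths are illustrative.
    import subprocess
    import sys

    CHECKS = [
        ["pytest", "tests/", "-q"],       # run the unit tests
        ["mypy", "--strict", "app.py"],   # strict type-checking
    ]

    def review_gate() -> int:
        for cmd in CHECKS:
            print(f"Running: {' '.join(cmd)}")
            result = subprocess.run(cmd)
            if result.returncode != 0:
                print("Check failed; do not merge the generated code.")
                return result.returncode
        print("All checks passed; proceed to human review.")
        return 0

    if __name__ == "__main__":
        sys.exit(review_gate())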

7. Final Thoughts

The retesting exercise underscores a broader truth about large language models: they are highly capable but not infallible. GPT‑5’s coding performance is a marked improvement over GPT‑4, yet the new model still inherits many of the same pitfalls—particularly the tendency to produce “good enough” code that may not meet industry standards. For researchers and developers, this means a cautious approach that blends AI assistance with human oversight. As OpenAI continues to refine GPT‑5, the hope is that future iterations will narrow these gaps, but for now, the author’s recommendation stands: trust GPT‑5 for inspiration and draft code, but verify, test, and maintain it yourself.



Read the Full ZDNet Article at:
[ https://www.zdnet.com/article/i-retested-gpt-5s-coding-skills-using-openais-guidance-and-now-i-trust-it-even-less/ ]