My experience is that AIs amplify what you put in them.
If you put in lazy problem definitions, provide the bare minimum context, and review the code cursorily, then the output is equally lackluster.
However, if you spend a good amount of time describing the problem, carefully construct a context that includes examples, documentation, and relevant files, and then review the code with care, you can get some very good code out of them. As I've used them more and more, I've noticed that the LLM responds in a thankful way when I provide good context.
> Always ask for alternatives
> Trust but verify
I treat the AI as I would a promising junior engineer. And this article is right: you don't have to accept the first solution from either a junior engineer or the AI. I constantly question the AI's decisions, even when I think they are right. I just checked AI Studio, and the last message I sent to Gemini was "what is the reasoning behind using metadata in this case instead of a pricing column?", the context being a db design discussion where the LLM suggested using an existing JSONB metadata column rather than using a new column. Sometimes I already agree with the approach; I just want to see the AI give an explanation.
And on the trust front, I often let the AI coding agent write the code how it wants to write it rather than force it to write it exactly like I would, just as I would with a junior engineer. Sometimes it gets it right and I learn something. Sometimes it gets it wrong and I have to correct it. I would estimate that 1 out of 10 changes has an obvious error or problem where I have to intervene.
I think of it this way: I control the input and I verify the output.
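For concreteness, the metadata-vs-pricing-column question above roughly comes down to the trade-off sketched below. Table and column names are invented and this assumes PostgreSQL; it is not the actual schema under discussion.

    # Option A: reuse the existing JSONB metadata column. No migration, but the
    # value lives as text inside JSON, needs a cast on every read, and range
    # queries only get fast with an expression index.
    OPTION_A = """
        CREATE INDEX products_meta_price_idx
            ON products (((metadata->>'price')::numeric));
        SELECT id FROM products
        WHERE (metadata->>'price')::numeric > 100;
    """

    # Option B: add a dedicated, typed column. Requires a migration, but the
    # schema enforces the type and a plain btree index covers the query.
    OPTION_B = """
        ALTER TABLE products ADD COLUMN price numeric;
        CREATE INDEX products_price_idx ON products (price);
        SELECT id FROM products WHERE price > 100;
    """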
lubujackson 2 hours ago [-]
My experience with LLMs currently is that they can handle any level of abstraction and focus, but you have to discern the "layer" to isolate and resolve.
The next improvement may be something like "abstraction isolation" but for now I can vibe code a new feature which will produce something mediocre. Then I ask "is that the cleanest approach?" and it will improve it.
Then I might ask "is this performant?" Or "does this follow the structure used elsewhere?" Or "does this use existing data structures appropriately?" Etc.
Much like the blind men describing an elephant, they all might be right but collectively can still be wrong. Newer, slower models are definitely better at this, but I think that rather than throwing infinite context at problems, if they were designed with a more top-down architectural view and a checklist of competing concerns, we might get a lot further in less time.
This seems to be how a lot of people are using them effectively right now - create an architecture, implement piecemeal.
vjerancrnjak 1 hours ago [-]
I've found it never able to produce the cleanest approach. I can spend 5 hours and get something very simple, and even if I give it that as an example, it cannot extrapolate.
It can't even write algorithms that rely on the fact that something is sorted. It adds intermediate glue that is not necessary, etc.; massive noise.
Tried single-allocation algorithms: it can't do that. Tried guiding it to exploit invariants: it can't find single-pass workflows.
The public training data is just bad, and it can't really understand what "good" actually is.
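For what it's worth, the kind of code being asked for here, a single pass that leans on the sorted invariant with no intermediate glue, looks something like this generic sketch (not the commenter's actual workload):

    def sorted_intersection(a, b):
        """Intersect two already-sorted sequences in one pass, allocating only
        the result list. A generic sketch of invariant-exploiting code, not the
        problem described above."""
        out = []
        i = j = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1  # a[i] is smaller than everything left in b, so skip it
            else:
                j += 1
        return out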
bongodongobob 9 minutes ago [-]
Define "clean" - what does that even mean? If you tell it to write efficient code, for example, how would it know whether you're talking about caching, RAM, Big O, I/O, etc., unless you tell it?
nonethewiser 1 hours ago [-]
This has been my experience too. It's largely about breaking things down into smaller problems. LLMs just stop being effective when the scope gets too large.
Architecture documentation is helpful too, as you mentioned. It's basically a set of rules and intentions - kind of a compressed version of your codebase.
Of course, this means the programmer still has to do all the real work.
cryptonym 1 hours ago [-]
Sounds like coding with extra steps.
nonethewiser 37 minutes ago [-]
What is the extra step? You have to do the upfront legwork either way.
jbellis 3 hours ago [-]
I think this is correct, and I also think it holds for reviewing human-authored code: it's hard to do the job well without first having your own idea in your head of what the correct solution looks like [even if that idea is itself flawed].
danielbln 6 hours ago [-]
I put the examples he gave into Claude 4 (Sonnet), purely asking it to eval the code, and it pointed out every single issue in the code snippets (N+1 query, race condition, memory leak). The article doesn't mention which model was used, how exactly it was used, or in which environment/IDE it was used.
The rest of the advice in there is sound, but without more specifics I don't know how actionable the section "The spectrum of AI-appropriate tasks" really is.
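For anyone who wants to reproduce that check, it needs nothing more elaborate than a plain API call. A minimal sketch with the Anthropic Python SDK, where the model id is a placeholder and `snippet` stands in for one of the article's examples (not the exact setup used above):

    # Minimal sketch of asking a model to review a snippet.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    snippet = "..."  # paste one of the article's code examples here
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: use whatever Sonnet id is current
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Review this code and list any correctness or "
                       "performance issues:\n\n" + snippet,
        }],
    )
    print(resp.content[0].text)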
metalrain 2 hours ago [-]
It's not about "model quality". Most models can improve their code output when asked, but the problem is the lack of introspection by the user.
Basically the same problem as copy-paste coding, but the LLM can (sometimes) know your exact variable names and types, so it's easier to forget that you need to understand and check the code.
shayonj 5 hours ago [-]
My experience hasn't changed between models, given the core issue mentioned in the article. Primarily I have used Gemini and Claude 3.x and 4. Some GPT 4.1 here and there.
All via Cursor, some internal tools, and Tines Workbench.
soulofmischief 2 hours ago [-]
My experience changes just throughout the day on the same model. It seems pretty clear that during peak hours (lately most of the daytime) Anthropic is degrading their models to meet demand. Claude becomes a confident idiot, and the difference is quite noticeable.
genewitch 41 minutes ago [-]
this is on paid plans?
lazy_moderator1 56 minutes ago [-]
Did it detect the N+1 in the first one, the race condition in the second, and the memory leak in the third?
danielbln 36 minutes ago [-]
It did, yeah.
nonethewiser 2 hours ago [-]
This largely seems like an alternative way of saying "you have to validate the results of an LLM." Is there any "premature closure" risk if you simply validate the results?
Premature closure is definitely a risk with LLMs, but I think code is much less at risk because you can and SHOULD test it. It's a bigger problem for things you can't validate.
I might start calling this "the original sin" with LLMs: not validating the output. There are many problems people have identified with using LLMs, and perhaps all of them come back to not validating.
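In code, "validating the output" can be as mundane as pinning the generated function down with tests before trusting it. A tiny sketch, where `parse_price` is a hypothetical LLM-written helper:

    # Treat the LLM-written helper as untrusted until tests pin its behaviour down.
    # `parse_price` is hypothetical; the point is the tests, not the helper.
    import pytest

    def parse_price(raw: str) -> int:
        """Convert a price string like '$1,234.50' to integer cents."""
        cleaned = raw.strip().lstrip("$").replace(",", "")
        dollars, _, cents = cleaned.partition(".")
        return int(dollars) * 100 + int((cents or "0").ljust(2, "0")[:2])

    def test_happy_path():
        assert parse_price("$1,234.50") == 123450

    def test_whole_dollars():
        assert parse_price("12") == 1200

    def test_rejects_garbage():
        with pytest.raises(ValueError):
            parse_price("twelve dollars")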
bwfan123 1 hours ago [-]
> I might start calling this "the original sin" with LLMs: not validating the output.
I would rephrase it: the original sin of LLMs is not "understanding" what they output. By "understanding" I mean the "why" of the output: starting from the original problem and reasoning through to the output solution, i.e., the causation process behind the output. What LLMs do is pattern-match the most "plausible output" to the input. The output is not born out of a process of causation; it is born out of pattern matching.
Humans can find meaning in the output of LLMs, but machines choke on it, which is why LLM code looks fine at first glance until someone tries to run it. Another way to put it: LLMs sound persuasive to humans, but at the core they are rote students who don't understand what they are saying.
nonethewiser 33 minutes ago [-]
I agree that it's important to understand they are just predicting what you should expect to see, and that is a more fundamental truth about the LLM itself. But I'm thinking more in terms of using LLMs. That distinction doesn't really matter if the tests pass.
"But what if the real issue is an N+1 query pattern that the index would merely mask? What if the performance problem stems from inefficient data modeling that a different approach might solve more elegantly?"
In the best case you would have to feed every important piece of information into the context: these are my indexes, this is my function, these are my models. After that the model can find problematic code. So the main problem is to give your model all of the important information; that is what has to be fixed if the user isn't doing that part. (Obviously that doesn't mean the LLM will find the right problem, but it can improve the results.)
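The N+1 shape the quoted passage alludes to looks roughly like the sketch below; table and column names are invented, and `db` stands in for a DB-API connection (qmark parameter style, as in sqlite3). An index on `user_id` only makes the first version less painful; the batched query is what actually removes the problem.

    # Generic sketch of an N+1 query pattern and its batched fix.
    def order_totals_n_plus_one(db, user_ids):
        totals = {}
        for uid in user_ids:  # one query per user: the "+N" that an index merely masks
            cur = db.execute(
                "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE user_id = ?",
                (uid,))
            totals[uid] = cur.fetchone()[0]
        return totals

    def order_totals_batched(db, user_ids):
        placeholders = ",".join("?" * len(user_ids))
        cur = db.execute(
            "SELECT user_id, SUM(amount) FROM orders "
            "WHERE user_id IN (%s) GROUP BY user_id" % placeholders,
            list(user_ids))
        totals = {uid: 0 for uid in user_ids}  # one round trip instead of N
        totals.update(dict(cur.fetchall()))
        return totals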
MontagFTB 2 hours ago [-]
This isn't new. Have we not seen this everywhere already? The example at the top of the article (in a completely different field, no less) just goes to show that humans had this particular sin nailed well before AI came along.
Bloated software and unstable code bases abound. This is especially prevalent in legacy code whose maintenance is handed down from one developer to the next, where their understanding of the code base differs from their predecessor’s. Combine that with pressures to ship now vs. getting it right, and you have the perfect recipe for an insipid form of technical debt.
suddenlybananas 6 hours ago [-]
I initially thought the layout of the sections was an odd and terrible poem.
tempodox 2 hours ago [-]
Now that you mention it, me too.
shayonj 5 hours ago [-]
haha! I didn't see it that way originally. Shall take it as a compliment and rework that ToC UI a bit :D.
shayonj 2 hours ago [-]
ok ok - i put in a little touch up :D
mock-possum 2 hours ago [-]
Oh wow me too - I kinda like it that way.
But if it’s meant to be a table of contents, it really should be styled like a list, rather than a block quote.