A lot is being written about AI these days, specifically LLM-based AI. I am not going to add anything novel here, but I thought it'd be interesting to document some of what I think about these tools and how I use them today so that I will be able to look back and compare in the future.
LLMs and the systems built around them are clearly impressive. They can also be really dumb in some ways; what some people refer to as jagged intelligence. In part because of that, and in part because I'm from the last millennium and hesitant about the privacy implications of everything "cloud", I don't use them very much in my personal life. I will very occasionally poke at ChatGPT about something or go into Google search's AI mode to drill down on a search. I have also dabbled in local models. But that's pretty much it. I regularly encounter people who seem to use these tools much more than me in their personal lives despite being generally less technical. To put this in context, I have always been hesitant and deliberate about the "smart" devices that I let into my life.
I do have free GitHub Copilot access on my personal GitHub account, and I do use that occasionally on personal projects, but not enough to have really spent a whole lot of time exploring what I can do with it. So, this is mostly going to be about using AI at work, specifically in software development and adjacent activities.
Outside of Software Development
My software development work is centered on LLVM and adjacent compiler-y projects. However, I also spend a large fraction of my working hours collaborating with our hardware architects. This second kind of work very much requires that I, as a human, deeply understand the systems that we are planning to build. I need to understand. By definition, an AI tool cannot directly help with this. There are sub-problems that AI tools can help with, but they are fairly limited in scope.
For example, I may want to make some simulation results interpretable in a way that requires ad hoc number crunching and visualization. I have written Python scripts using matplotlib by hand in the past, but an AI agent can do it faster. I sanity check the results as part of trying to understand them and the underlying processes and implications, just like I would sanity check a similar report produced by a colleague. If in doubt, I will dig into the script that the agent produced. But I don't feel the need to understand all the details of such scripts all the time. They are throwaway code.
Some people suggest that AI tools be used for writing text. I strongly disagree for the most part.
Writing is a form of thinking. In the process of writing and editing a text, such as a design document, a specification, this post, or even just an email or code review, I transfer my thoughts from their fuzzy state in my brain into the less forgiving shape of concrete words. In doing so, I implicitly review and refine them. Am I making a fully logical argument? Is there some corner case or a precondition that I missed? If I were to let an LLM write the text, I would skip this process and diminish my understanding. I would not be doing my job right. In fact, while I have gotten completely used to and have become a fan of LLM-based auto-complete for code, I specifically turn it off for text documents and also for comments in code. Getting those auto-completions in text always reminds me of the ST:TNG image macro where Q looks like he is about to whisper something into the ear of a visibly annoyed Picard.
That said, I only mostly disagree. For example, I have noticed colleagues use LLMs to help with a language barrier. I don't begrudge them that, though I sometimes wish they made more of an effort to find their own voice. If you do this, please be sure to make your own thorough editing pass on the resulting text! As another example, I have occasionally used LLMs to brainstorm synonyms and antonyms, metaphors, and names.
Editing source code
With that out of the way, let's move on to the meat that is all the rage these days: trying to use LLM-based tools for serious software development beyond the one-off, throwaway scripts I mentioned above.
Working on a large and subtle project like LLVM obviously biases me here, and so I'm going to start by rejecting the term "code generation". Most serious software development in the kind of project that I'm interested in is, or should be, code editing instead. It is true that empirically, software always tends towards net growth in terms of lines of code, but every line of code creates a liability. It is unfortunate that current AI tools are clearly more biased towards code generation.
A key aspect of all types of tooling, not just LLM-based tools, is interactivity. Does the tool give a response in seconds, minutes, hours? As a general rule, tools that are in the 5 seconds to minutes range are a giant drag on productivity and are bad for mental health. Sitting around idly waiting for the tool is boring. The temptation to context switch is high, but context switches are mentally draining. Current LLM-based tools are generally unable to give useful responses in under 5 seconds, and that puts a hard ceiling on where their use is desirable. Auto-complete is an obvious exception to this, in large part because it uses a model that isn't actually all that large by today's standards.
Given that the latency of LLMs is more than just a few seconds, the way to use them is to kick off tasks that can run in the background without creating frequent and/or costly context switches.
One example of such a use is that I occasionally have a question about a part of a large code base that I am not very familiar with. In that case, I might put the question to an AI agent integrated into my IDE while then also researching the question myself in parallel. Whether the agent response is still helpful by the time it shows up, or even at all, is hit and miss. There is also a danger here. The agent will almost always create a confident-sounding response, but it may just be incorrect. You must treat it not as an oracle but as a "rubber duck" that can talk back to you. Still, this has been helpful often enough that I feel almost, but not quite, entirely comfortable calling this a habit by now. I expect it will become one over time.
Of course, the big use case is agentic coding, but only if the agent does not have to ask for permission before common actions. We must give the agent a sandbox in which it can run without having to ask for permission.
Sandboxing
My current approach to sandboxing uses defense in depth. I use Claude Code with its integrated sandbox mechanism, into which I unfortunately had to poke some minor holes. Because of that, but more importantly because Claude Code seems rather vibe coded and not that robust, and I am the kind of person who has used Linux since before the kernel reached version 1.0, I created a separate "agents" user account on my workstation that the agent runs under. This user account cannot access my main account's home directory, and it has no credentials at all except for the key to the LLM gateway provided by my employer. There is a clone of LLVM in the agents home directory that has no remotes. I use git push and pull from my main account to locally exchange Git trees between the agent work space and my main work environment. I also have a Visual Studio Code workspace that opens the agent's clone so that I can easily make manual edits to the version of the code that the agent is working on. I use the SetGID bit on directories so that files inherit the agents group of which my main user account is a member.
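The SetGID part of this setup can be sketched in a few lines of Python. This is only an illustration of the mechanism, not the actual setup used here: the function names are hypothetical, and the directory's group is assumed to already be the shared "agents" group. On Linux, files created inside a directory with the SetGID bit inherit the directory's group, and new subdirectories inherit the bit itself.

```python
import os
import stat


def shared_dir_mode(mode: int) -> int:
    """Return `mode` with the SetGID bit and group read/write/execute added."""
    return mode | stat.S_ISGID | stat.S_IRWXG


def make_group_shared(path: str) -> None:
    """Apply the shared-directory mode to an existing directory.

    Hypothetical helper: assumes the directory already belongs to the
    shared group (for example a group named "agents").
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, shared_dir_mode(current))
```

For a directory with mode 0755, this is equivalent to running `chmod g+rwx,g+s` on it, which results in mode 2775.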
I know of people who use Docker instead of user accounts to achieve similar defense in depth. That might work better in some "cloud" environments. I generally prefer to avoid unnecessary complexity in my software stacks, and the separate user account works well for me.
Workflow
I use coding agents almost exclusively in the background. I set them a task and do something else in parallel, checking in with the agent every once in a while. I do not use agents for coding on whatever my current main focus is. This requires some mental discipline and task switching that I am still getting used to. There are many days on which my main focus is so intense that I do not use agents at all. I simply do not have the spare mental bandwidth to set useful tasks for an agent and to adequately handle the output.
Coding agents do not currently have good taste for software architecture. They cannot be trusted to make good choices on tasks that have architectural freedom, but I still use them for coding tasks that are intended for production use.
I tend to use coding agents for tasks that are relatively narrow and well-defined steps of a larger project. Sometimes, that project is my main focus and I identify a future step that is largely independent of what I am currently working on. If I can get the agent to take care of that step already, I won't have to do it myself later. Sometimes, there is a larger project that is not my main focus but that is simple enough that I already have a good understanding of how to break it down into smaller steps that an agent can work through over time.
Sometimes, I prompt the agent for the task directly in its chat interface. More often, I prepare a description of the task in a markdown file. This file typically ends up being somewhere between 10 and 20 lines long. This line count includes whitespace and bullet lists. Initially, I made these files read-only for the agent. More recently, I ask the agent to append a progress log to this file as it works through the task. This works reasonably well, but it is certainly a part of my workflow that I can see changing further.
I have two reasons for putting what is essentially the prompt in a separate file. First, I like having these files as a record of the tasks I have given to agents. Second, as I noted above, writing is thinking. I end up implicitly checking myself. Being just a little more thorough and precise also helps keep the agent aligned with my intentions.
When the agent is done, I read its final report to see if anything suspicious stands out to me. For example, the agent may report that it ran into some issue and worked around it in a certain way. I may ask the agent to change something, though I have found this kind of re-prompting to not be successful often enough for my tastes. Unless I find anything obviously problematic quickly, the next step is to thoroughly review all changes with an editor and the diff open side by side. I make changes very liberally as I go. Occasionally, I end up breaking something accidentally in doing so, which often ends up being an important learning experience. I also completely rewrite any commit messages from scratch. Communicating human intent in the commit message is the right thing to do, and not just because that's LLVM policy.
Speaking of LLVM specifically, there are a few common issues that I end up editing almost always.
- LLMs are ridiculously comment-happy. The LLVM code base is very sparsely commented. While I personally think that a few more comments often wouldn't hurt, the comments produced by LLMs are of incredibly low value on average. Most of the time, they restate the obvious.
As of right now, I am hesitant to try to guide the LLM away from generating so many comments for three reasons. First, anything done to push the model away from its trained preference seems likely to have a negative effect elsewhere. Second, while the comments would generally hurt when trying to read the code from scratch, they do often help to review the specific changes that the model was trying to achieve. Third, deleting the comments is quick and easy and prevents me from becoming lazy.
- LLMs are trained for the wrong idea of defensive programming by LLVM's standards. In LLVM, we want bad internal state to fail hard and fast. If a pointer is unexpectedly null, we'd rather abort immediately. This biases us strongly towards using assertions. One could call this defensive programming, but it is very different from the more traditional form of defensive programming where you try to limp along when faced with bad internal state, which is what LLMs are biased towards.
I have tried to guide the LLMs away from this behavior with mixed results. I do wonder if the situation could be improved with additional deterministic tooling that would help purely human development as well. Roughly speaking, the idea would be to observe the coverage of conditions in the code while running the test suite. If a condition is never exercised, that's a good sign that there is either a missing test case or the condition should be converted into an assertion. However, there are also many exceptions to this rule.
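The difference between the two styles of defensive programming is easy to show in a small sketch. The example below uses Python rather than LLVM's C++, and the `Instruction` class and both functions are hypothetical stand-ins, not LLVM APIs. The assumed invariant is that an instruction always has a parent block by the time these functions are called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instruction:
    """Hypothetical stand-in for an IR instruction."""
    opcode: str
    parent: Optional[str]  # name of the containing block; assumed never None


def parent_name_limp_along(inst: Optional[Instruction]) -> str:
    """Traditional defensive style: tolerate bad internal state and carry on."""
    if inst is None or inst.parent is None:
        return ""  # the broken invariant is silently hidden from the caller
    return inst.parent


def parent_name_fail_fast(inst: Optional[Instruction]) -> str:
    """Assertion style: a broken invariant stops execution right here."""
    assert inst is not None, "instruction must not be null"
    assert inst.parent is not None, "instruction must be inserted in a block"
    return inst.parent
```

With the first version, a bug that leaves `parent` unset surfaces much later as a mysterious empty string. With the second, it surfaces at the point where the invariant is violated. Note that Python strips `assert` statements when run with `-O`, so the sketch only fails fast in a default run.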
Other common issues that are not as specific to LLVM:
- LLMs produce overly verbose code. If a piece of domain logic is already implemented in one place, but they need it in a second, they tend to duplicate the logic instead of refactoring it out into a common location.
It is unsurprising that LLMs are bad at making judgments about this. Even humans have a hard time and come up with pithy heuristics like DRY and WET that fail to grasp the issue properly. What you really need is a proper understanding of the problem domain and its mapping to the code. This allows you to decide whether two pieces of logic are inherently identical because they map to the same aspect of the domain, or whether they only incidentally look the same but are in fact conceptually separate.
- Confused and convoluted logic. This is more likely to happen when the agent fails to one-shot a solution and ends up iterating in a debug-retry loop. Similar to the previous bullet, the resulting logic does not map properly to the underlying problem domain. It is not uncommon for the agent to produce a chunk of code that I can fairly easily reduce down to half the length by properly mapping the logic to the problem domain.
So far, I have never used the code changes produced by an agent as-is. I have always done at least some minimal amount of editing, and not just for the sake of it.
Is agentic coding worth it?
That question has a lot of facets.
At its most myopic, one can ask whether the use of agentic coding saves enough time to be worth the money that is paid for the evaluation and generation of tokens. I currently believe that to be the case.
Getting to quality with agentic coding still requires a lot of human involvement. That said, making robust changes requires more than just editing code. In all but the most trivial changes, your first edit will have bugs. For LLVM specifically, rebuilding the project and running even basic offline tests takes more than a minute and falls into that awkward window of non-interactivity that I described earlier. When a test fails, the failure needs to be debugged and understood, and a fix applied. And even for the kinds of relatively simple tasks that I give the agents, I may easily end up in multiple iterations of this loop if I do it myself. Having an LLM-based agent drive this tedious loop is genuinely useful. By the time the agent has finished and I review its output, it will usually have completed the loop. And if the agent addressed a bug incorrectly, at least it has already done some initial root-cause analysis. The analysis is not always reliable, but it does allow me to bring my own understanding to bear more quickly. And so that's the biggest part of how agentic coding saves time in my experience.
Over the course of a day, I might spend something like 20 minutes prompting the agent every once in a while, and another half hour to 45 minutes editing and wrangling the result of something that may well have taken me an afternoon without agentic coding. That's far from the 10x that is promised by some, and it only speeds up a fraction of my work. Amdahl's law applies. But that doesn't matter for the question of whether the cost of the tokens is justified on some accountant's spreadsheet. By the statistics I can see on my token usage, it does seem justified by a healthy margin. Probably not by quite as much as the loudest AI advocates would like, but I'm going to bet neither against further improvements in the models and agent harnesses nor against further cost optimizations in the underlying inference technology (making the latter happen is part of my job).
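To make the Amdahl's law point concrete, here is a minimal sketch of the formula. The numbers are illustrative assumptions, not measurements from this post: a local speedup of 4x roughly corresponds to an afternoon of work shrinking to about an hour, and the fractions of overall work that benefit are made up.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup when `fraction` of the work gets `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical fractions of total work that an agent can take over.
for fraction in (0.1, 0.25, 0.5):
    overall = amdahl_speedup(fraction, 4.0)
    print(f"{fraction:.0%} of work sped up 4x -> {overall:.2f}x overall")
```

Even if half of all work were sped up by 4x, the overall speedup would only be 1.6x, which is why a large local gain on a fraction of the work stays far from 10x overall.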
What about other facets of the question? Is agentic coding worth enough to justify the current investment bonanza? What about the circular financial shenanigans that are being played? Are the externalities of data centers properly priced in? Not to mention the hypocrisy around copyright. And what does the cocktail of social media and generative AI do to the fabric of our society? There are good reasons for concern on each of those points and more, but those are big topics, and this post is already quite long, so please forgive my leaving it at that.
Agentic coding for better code quality
One facet that I do want to touch on is that the momentum around agentic coding at large is very clearly moving towards a reduction in software quality. This is inherent to the idea of "vibe coding", and the term "slop" is thrown around for good reasons. The influx of crap in software projects is an urgent problem, especially where it comes to extractive contributions. It is no wonder that we're seeing many projects adopt new policies around AI tooling.
However, agentic coding can be used to improve code quality. I know that because I have already done so, and I believe that this experience could apply more widely. Every sufficiently senior software developer knows the struggle with tech debt. There is a general pattern where cleaning up tech debt requires a significant amount of refactoring work that is fairly tedious and wasn't traditionally viable to automate. Simple search-and-replace is usually not flexible enough. Writing a rule for a refactoring tool like clang-tidy or Coccinelle requires a lot of expertise. LLM-based agents aren't deterministic and are therefore not as reliable as those tools, but they do have a lot of flexibility. They can be used to make the refactors directly, and that's what I have already done. Or perhaps we can get them to write rules for those deterministic tools. That is something I still want to explore.
I also have the impression that by using both my brain and an agent, I get some of the same quality benefits that talking to another human would give. For example, the agent sometimes finds an approach that is better than what I had thought of.
At least in large companies, quality is fundamentally never going to be the path of least resistance because the cost of tech debt is so difficult to quantify. But if the cost of reducing tech debt goes down, it becomes easier to justify. That gives me some hope. I encourage others to explore this direction as well. Of course, for this to become a net benefit, we do also need to figure out how to cap the rate at which tech debt accumulates through the careless use of agentic coding.
Closing thoughts
I do not believe the maximalist AI hype.
There are those who publish pseudo-scientific mathematical models by which they predict massive upheaval within just a few years. Those models are so dubious and sensitive to chosen parameters that one may as well read tea leaves. Aside from this negative reason for disbelieving the hype, there are positive reasons to believe that change is going to happen more slowly. One of them is that perhaps the closest analogy we have today is self-driving cars. It's not that those aren't coming, but the timelines were massively exaggerated. Another reason is related to the observations around the lack of taste shown by coding agents today. Lack of taste manifests as compounding tech debt when coding agents are left to their own devices. This happens on time horizons that are much longer than even the longest contexts supported by LLMs today, so (1) it seems likely that some black swan innovations are required before the situation improves significantly and (2) even if some such innovations are made, there is no fundamental reason why they would end up scaling in the same way that LLMs have scaled so far.
I also do not believe the maximalist AI denial.
As laid out in this post, I believe that inference for agentic coding is economically viable today. I suspect that generative AI inference is economically viable for a whole bunch of other use cases as well, though I cannot really speak to those. We may well be in a bubble; but the models don't disappear when the bubble pops, and agentic coding is here to stay in some form.
In other words, there is a lot of uncertainty around timelines, but agentic coding is absolutely changing software development. And I do not feel entirely comfortable about that.