The Assist

At work we recently started experimenting with generative AI for assistance with programming. We have a new Visual Studio Code plugin: we ask it questions in English and it spits back code. It was a really interesting piece of, well, mandatory training. I've formed some opinions.

The main thing I dislike about AI coding assistance is that I have to very carefully review the AI's code to make sure that it does the right thing and none of the wrong things. And I, personally, find that reviewing code is more difficult than writing equivalent code myself.

I don't know if anybody else has this experience, but I understand code better when I interact with it directly by hand. Either by bringing code into existence, or by refactoring code which was already there to do something new, or to do the same thing more effectively. I need to take things apart and put them back together. I need to play with the code.

Passively reading someone else's code in the dead form of a git diff just doesn't engage my brain in the same way. It takes more focus, it's much easier to overlook vital details, there's no flow state, and it's much less interesting. Of course, I do it, all the time, and I have no problem with it. Half of the job is code review, and unblocking my fellow developers is always a high priority. But I know which half of the job I enjoy more. (There are several other "halves" of any development job, but let's not confuse things.)

It also takes more time. I can write a simple switch/case statement or an explanatory docstring for a medium-sized method more quickly than I can get an AI to do this for me and then manually verify its correctness to an equivalent level of confidence. This applies even if the AI's writing time is zero.

*

For example. I have no idea how to launch an HTTP request from C++. The assistant I tried out was able to provide apparently working code for this, shortcutting my need to spend hours familiarising myself with new syntax and APIs. It compiled and ran and did the thing.

However! Without a priori knowledge of C++ or the HTTP library in question, how am I supposed to know that the AI hasn't made some tragic blunder with its seemingly working implementation? Or misused the APIs in question? I don't even know what classes of problems to look for. So I'm back to checking the AI's work against the documentation. And the value the AI has provided to me — or whatever experienced C++ developer I inflict this PR on — is not zero, but it's not the giant leap head start it looks like at face value.

A similar example would be if I said, "Hey computer, this project uses webpack as its bundler. I want you to convert it to use esbuild instead." Even if the AI appears to do the thing, I'm still going to have to wade through some esbuild documentation to make sure it did it right, right?
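
To make "did it right" a bit more concrete: even a bare-bones esbuild setup involves a pile of small decisions. The sketch below is purely illustrative, with a made-up entry point, output path and options rather than anything from a real project, but it's the sort of thing I'd still be checking against the documentation line by line:

// build.mjs: a hypothetical, minimal esbuild build script.
// Entry point, outfile and options are placeholders, not a faithful
// conversion of any particular webpack config.
import * as esbuild from 'esbuild'

await esbuild.build({
  entryPoints: ['src/index.js'],
  bundle: true,
  outfile: 'dist/bundle.js',
  sourcemap: true,
  minify: true
})

And even if that builds, does it reproduce everything the webpack config was doing (loaders, aliases, dev server behaviour)? I still don't know until I've read the docs.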

Additionally, I find that describing my requirements to the AI in technical detail is often (not always) basically equivalent in complexity to just writing the code, and takes a similar amount of time.

Code is actually really efficient, relative to human language. It's extremely dense, expressive and specific. Compare the amount of time taken to write out:

For each element in the field array, we need to emit a debug log entry explaining what element we're working on. Then, examine the type property on the element. If it's "bool", then we want to add the element to the array of Boolean fields, otherwise, add it to the array of non-Boolean fields. In either case, emit a debug log entry saying what we did.

versus:

for (const field of fieldArray) {
  logger.debug('processing field', field)
  if (field.type === 'bool') {
    booleanFields.push(field)
    logger.debug('field is Boolean')
  } else {
    nonBooleanFields.push(field)
    logger.debug('field is not Boolean')
  }
}

The amount of typing required for each of these is comparable, and so is the time: properly phrasing the English prompt takes about as long as just writing the code directly. The phrasing will need altering a few times to get workable output. And the amount of refactoring of the generated code to create something equivalent to the handwritten code is always non-zero. So, have we gained anything?

Or imagine the AI has already generated a piece of code for us, but missed something. Compare the effort of messaging it to say:

If the first argument isn't a number, the function should throw an exception saying "not a number".

and having it regenerate its work, versus just plumbing in:

if (typeof a !== 'number') {
  throw Error('not a number')
}

The English is fifty percent longer. It's also far more ambiguous. What about NaN, BigInts, numeric strings, boxed Number objects? The code is shorter, and abundantly clear on what should happen in all of those cases.
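
For the avoidance of doubt, here is a quick sketch of how that typeof check treats each of those cases (throwaway values, purely for illustration):

typeof NaN           // 'number' - NaN sails straight through the check
typeof 10n           // 'bigint' - BigInts get the exception
typeof '5'           // 'string' - numeric strings get the exception
typeof new Number(5) // 'object' - boxed Numbers get the exception

Every one of those behaviours is pinned down by the three lines of code. The English sentence pins down none of them.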

On top of all of that, I think I would have less of a problem with manually reviewing AI-generated code if I felt that this was actually helping someone get better at coding. I know that feeding positive and negative votes back to the AI will influence its internal logic and ultimately improve the quality of its output. Good for it. But, to borrow a phrase and twist it a bit, I value individuals and interactions over processes and tools. That is, I'm more personally invested in my fellow developers' professional development than I am in training any machine. As much of a chore as it can be, truthfully, I do enjoy reviewing my teammates' code because it's an opportunity to share good practices, and make good decisions, and, as months pass, watch them develop.

*

To back up a little: Humans are extremely bad at software development and we need all the help we can get.

I'm hugely in favour of human processes which make this better. Peer review, best practices, rules of thumb, comments and documentation, forms and checklists. I'm also hugely in favour of automated tools which augment the human process. Linters, strong type systems, fuzzers, mutation testing. But this shouldn't be an either/or thing. I want all the human checks and all the technological checks.

And as it's currently positioned, AI coding assistance seems like an addition, but it actually removes the human from a part of the process where the human was extremely valuable. When a human writes code, we have the original programmer to vouch for their code, plus another person to review and double-check their work. With AI assistance, we don't get that first "guarantee". Nobody with the vital domain knowledge which comes from inhabiting physical reality, or who was present at the meetings where we discussed these new features, participated in the creation of the code. We basically just have two peer reviews.

Essentially, AI assistance replaces coding with code review.

Which is bad, for all the reasons I mentioned, but also, and mainly, because I love coding!

(In the same way, self-driving cars replace the experience of driving with the experience of being a driving instructor. A lot of the same objections apply here. Most of us who have some experience with driving are capable of acting as a driving instructor, but it is more difficult and stressful than driving, requiring greater vigilance. It's a different mental process from driving and it's much less enjoyable than driving.)

So I feel that the better place for the AI assistant is in a supporting role like: "Hey, AI. I have written this code. Do you see anything wrong with it? Could it be improved? Have I missed an obvious simplification? Do my unit tests seem to be exhaustive?"

This also makes it much simpler for us to evaluate the AI's contributions, in a relatively objective fashion. Does it have anything useful to say? Is it generally worth listening to or is it wasting our time?

I think this approach generalises pretty well to other fields where AI is being applied. Not to do work and then have a human review it, but to review human work. "Do you agree? What have I missed?"

And if the AI's feedback is nothing but false positives, well... I will disable a tool which doesn't add value. I've done it before.

Discussion (25)

2024-04-05 02:03:57 by qntm:

Based on a Twitter thread from a few days ago.

2024-04-05 02:18:26 by osmarks:

I think your arguments about the density of code are missing the point. Empirically, LLMs find code easier to predict than English (by perplexity). Your examples correctly indicate that exhaustively describing the behaviour of code in English is longer than the code, but in most cases it's pretty obvious what it should be from context - I'd expect that saying "check first argument is number" or "filter and log boolean fields" would get you something close enough. I do kind of agree with you in general, mostly because I think most of the code assistant tools are trying to be like a coworker you might delegate to, but are less competent, which is annoying and unpleasant. A good one, in my view, would try and act more like an extension of my thoughts, filling in blanks and correcting silly mistakes and making tweaks to background files to align them with what I'm currently doing.

2024-04-05 03:13:06 by Emily:

Several of my coworkers use this thing and like it. I gave up half an hour into trying it out. I now find myself wondering if the underlying thing here is that those people like reviewing code more than you and I do. There is also AI code review. Yes, that's a thing. Yes, it's as bad as you expect, because it's severely lacking in the context one needs to effectively review code. (Including the obvious "you have no idea what this function does and therefore cannot say whether I'm using it wrong", but also shockingly often, *knowing what language the code is even written in*. One major category of bad suggestions was "you should check for null/wrong-typed values here!" in a language with a static type system that models nullability.) Even my very AI-friendly company decided not to continue after our trial run with it.

2024-04-05 03:29:15 by ebenezer:

Docs person here. The current project at work is to add documentation for feature XYZ, using a page of content provided by a developer. My approach this time is to read through and attempt to grasp the content (and understand the feature) myself, then write the documentation myself. For some reason your comments about writing vs. reviewing code reminded me of this.

2024-04-05 06:12:34 by Tetragramm:

I much prefer the autocomplete version. It only does something when I'm writing code, and it follows my train of thought. So I am already thinking about what comes next, and it requires approximately zero effort to think "that's right" and accept the suggestion. Or deny it and just keep typing. It's especially useful for things with almost repeated but slightly different phrases, like switch statements, and input validation. Using your example, by the time you get halfway through typeof, it's typically suggesting the rest, and I'm done.

2024-04-05 07:20:32 by Aybri:

I don't use AI much in my workflow, as I don't particularly trust most projects that use it. I see AI as a useful tool when it comes to mere support and assistance, not replacement, and replacement is the direction most projects have been going in. However, my father (who is also a programmer, and someone I frequently butt heads with about this kind of thing) uses and actually appreciates the auto-complete version of the AI. He showed me a bit of it and spoke about how it greatly improved his workflow. It didn't seem too terrible, but I quickly noticed it tends to miss context and make mistakes. In this particular instance, it accidentally made a nested loop, which I had to point out to my father when he began to get confused as to why his code wasn't working. I suppose it's a useful tool if you like reading code but not writing it. I don't see much use in it myself, but I like to experiment and toy around, metaphorically "get my hands dirty." Again, I don't like being replaced. But that's just my two cents, really.

2024-04-05 11:35:33 by zaratustra:

"it's pretty obvious what it should be from context" is where i spend most of my working hours.

2024-04-05 12:45:20 by mavant:

Agree on all points, but especially deeply agree about needing to play with the code - I just can't get myself to really focus and understand a diff without doing a bit of pushing it around to see why (and if) the new state is really a local optimum.

2024-04-05 12:59:43 by James Fryer:

> For each element in the field array, we need to emit a debug log entry explaining what element we're working on. Then...

This isn't how I use AI for coding though. It's pretty much describing the implementation, in English, then checking the generated code. Writing the code is surely faster in this case. I currently only use AI for small data-processing scripts that I'd use once then throw away. In this case my prompt would include input and output samples, describe any other requirements ("output to stdout", "use csv module") and then let the method take care of itself. This declarative style of programming using AI I think is useful, although I haven't tried it with any complex requirements.

2024-04-05 15:41:50 by Oliver:

I find that “get the AI to review rather than just generate” applies in a large range of domains. Maybe (with the current rate of AI progress, probably) someday it will be good enough that it will work like a human programmer and we won’t worry so much about checking it, but until then, review is almost always better than generation.

2024-04-06 15:05:29 by BoppreH:

Good point on human review of AI code being a waste of training. The thought never occurred to me. What do you think of AI generating test cases? Tests are usually boilerplate-heavy, conceptually shallow, and have a lower security/quality bar. Seems like the perfect job for AI. "During tests, the branch a=="-" is never taken when b==[]. Doing so crashes the function with SomeError (see below). Add failing test case to suite?" "This function has no tests. Select from suggested tests to add to suite: (13 tests, 2 failing, 98% coverage):"

2024-04-06 22:49:50 by Nathan:

I like AI programming assistants in two cases.

Case 1: I have no idea how to do this thing. In this case I assume the code the AI assistant produces is going to be wrong, but that's OK because my initial stab at it would probably also be wrong. I'm fine iterating toward a solution with the AI's code as a starting point. If I have to do research either way, the AI code narrows the scope of my research.

Case 2: I know exactly what I want to do but I don't really feel like typing it all out. There's a decent chance that the function signature plus a succinct comment is all the AI assistant needs to write something that's close enough to the code in my head that I don't need to spend any more time proving to myself that it's correct than I would if I had typed it with my own hands. This sort of saving is minimal, but I find it actually enhances the flow state because instead of getting bogged down in writing some trivial string parsing code or JSON deserialization or a unit test that confirms the function with input x produces output y, I'm thinking at a higher level and letting the AI fill in the details.

To each their own, of course.

2024-04-07 13:45:11 by JP:

I’m with you; I spend a lot longer debugging LLM derived code than the time I gain back from its availability. I find it feels “more useful” when I’m feeling lazy, but in those moods I’m usually more comfortable spending twice as long tutting at & correcting its failures than I would have done getting my head into gear and getting a good working model in my head. Context is key I think (mine or the LLM’s lack); if I don’t care about quality (eg. throw-away script in a language or framework I’m less familiar with) I’ll happily let LLMs take a stab then work on correcting, minimizing how much context either of us needs to get to “good enough”. I’d be interested to see an LLM with hooks that’d let it pull in metrics about the result (test failures, code quality metrics) and provide it with more context. Today it’s a lot like trying to wrangle a less experienced engineer who doesn’t care to learn.

2024-04-07 23:35:53 by TJSomething:

My big thing is that LLMs are pretty decent at extremely repetitive code where I already know the next several lines of code exactly. Maybe I'm copying values out of an ad hoc DTO. Maybe I'm building an object with a lot of fields and I have all the variables in scope. Maybe it's a simple algorithm that's so common that it would probably be in the standard library if the language had a decent standard library and/or rich enough semantics to make a general purpose function for that use. For all of these kinds of cases, LLMs can usually make pretty good inferences. And this honestly helps a lot with my RSI. Any code that's so obvious that I can type continuously for a minute or two ends up putting me in a small but noticeable amount of pain.

2024-04-08 00:39:19 by qntm:

I hid a comment by a user named "jakedowns" which was responding to multiple points I never made in my essay and in general seemed to be hallucinatory/AI-generated. If you are jakedowns and you want your comment restored, get in contact and we can talk about this.

> What do you think of AI generating test cases?

Test cases are also code, so all the observations regarding code review vs. coding proper apply.

2024-04-09 12:41:31 by RobFisher:

What disturbs me about this critique is: I like the idea of having AI help me (and people in general) to get more done; but what if it does all the fun parts? You know who is good at getting things done while trusting the details they don't understand to others? Managers. So far I have avoided becoming a manager. Where AI has helped me: tedious things (write a command line interface for this function). Tasks of a certain size that are easy to describe in English (I want a Python program that does a regex search and replace on whatever is in the clipboard and works in both Windows and Linux). And helping me ask better questions about unfamiliar domains (what if I test my data samples by choosing some at random lots of times and seeing how variable the results are? Apparently in statistics this is called bootstrapping, and the AI helped me translate statistics speak into something I could understand). I remain hopeful.

2024-04-09 22:15:36 by sdfgeoff:

AI turns coders into code reviewers in the same way CNC machines turned experienced machinists into material loaders and quality control supervisors. CNC machines used to be unreliable things that maybe saved time. Now they are huge force multipliers. 2 years ago AI code generation was definitely a net negative productivity gain. Now it is approximately a zero productivity gain. In a few years will we have a software equivalent of the "Ballad of John Henry" where man races the machine to his own peril? Who knows.

2024-04-11 04:56:55 by dmz:

I think you didn't take the analogy with self-driving cars far enough. Level 2 "self-driving" capabilities like in Teslas are nearly useless because they still require constant supervision. But Level 4+ autonomous vehicles like Waymos are amazing. When coding assistants reach that level of reliability, they will be great as well.

2024-04-14 11:33:28 by asdf123:

You're absolutely right

2024-04-14 23:06:43 by mesyaria:

I think with image generation the human being the reviewer is clearly the way to go. The model can do the bulk of the work quickly and well but stumbles on details, which the human operator can take care of. Even that last bit of work can be done in large part by the model with just a bit of inpainting to nudge it in the right direction.

2024-04-20 20:57:55 by David:

sdfgeoff: > In a few years will we have a software equivalent of the "Ballad of John Henry" where man races the machine to his own peril? Who knows. There's a yearly programming competition called Advent of Code, and in day 3 of 2022, ostwilkens used code completion combined with automatic input downloading and submission to get a humanly impossible 10-second time for part 1. However, the LLM's first attempt for part 2 had a bug, and the LLM came in seven seconds short of a (human) programmer named 5space. I wrote a version of the "Ballad of John Henry" to commemorate the occasion, that you might enjoy: https://hallofdreams.org/posts/advent-of-code-2022/#day-3-john-5space-henry What's funny is that now, a year and a half later, GPT-4 could probably write the poem to boot.

2024-05-03 01:27:01 by randomhuman:

hi hi! nothing related to this post, but just wanted to say thank you for this blog

2024-05-04 00:18:06 by Jared:

I’m just here to cast a vote for the autocomplete version - I get it to write boilerplate for me. As a physicist I do a lot of plotting, which usually looks the same and is very predictable, but requires a lot of typing to get what I want. For example, I’ve got to set up my figure and my axes, plot each line, add labels, add axis labels, add titles, maybe fiddle with some finer points. What used to take 5-10 minutes now takes just a minute because the AI predicts what I want, doubly so if I write a brief comment at the start like “# plot entropy on linear and log time scales”. It’ll guess appropriate labels based on the variables, predict what goes in each axis based on what I named it, add in appropriate type hints so I get autocomplete suggestions and docs, etc. The library (matplotlib) is also pretty vast, so it helps that it can just draw on that knowledge. Sometimes it’s taught me things I didn’t know. And it learns from my project (this is Copilot) to figure out my code style. Sometimes it makes bad predictions, but that’s fine. I just ignore those. And the worst case? My graph looks dodgy and I fix it, which happens pretty often when I code manually. It’s not so useful when I’m writing simulation code as it often jumps to conclusions about what I’m trying to accomplish, but I hope it’ll get there in a few years.

2024-05-19 09:21:56 by Toricon:

> To back up a little: Humans are extremely bad at software development and we need all the help we can get. I strongly agree with this and am going to quote it whenever possible.

2024-05-19 18:16:40 by lalaithion:

I write almost 100% golang at work and I have found that there are a few things that AI autocomplete helps with, but my favorite one is this: Golang uses multi-argument return values to propagate errors. This means the code base is full of

x, err := GetItemFromCache(ctx, itemName, time.Now())
if err != nil {
  return nil, err
}

There's a problem with this, though; you'll run your program, get an error, and the error will be something like "error: network timeout", with no idea which part of the program hit that error. In practice, we solve this by writing

x, err := GetItemFromCache(ctx, itemName, time.Now())
if err != nil {
  return nil, fmt.Errorf("GetItemFromCache(%q): %w", itemName, err)
}

where we manually construct stack traces with custom messages that include relevant values passed into functions. If you have this pattern _everywhere_ an err is returned, AI is pretty good at guessing which arguments are important enough to go in the error, and reading these three lines and validating they're correct is faster for me than typing it out. There's occasional other boilerplate it helps with, but this is actually useful.
