So, given all the praise I heard about Cursor AI, after a number of months trying to use local LLMs on my laptop with limited success, I decided to spend 20 USD of my hard-earned salary to witness this amazing new tool in all of its glory.

The results have been a far cry from all the "10x" and "100x" improvements in productivity that have been claimed left and right.

TLDR - My experience:

- Tab (Cursor AI's version of autocomplete) worked very well about 75% of the time. It is actually a pretty nice addition; however, it comes at the cost of losing long-term memory if all you do is rely on it.
- The rate at which agents produced reasonable solutions was very hit-and-miss, and even when they produced decent code, it required a lot of work on my part: either extensive re-prompting, prompting from scratch, or pretty heavy hands-on review.
- Will I continue paying for Cursor AI? Frankly, I am not sure. I find that most of the use cases for which agents seem to work really well can be solved with good "cookie-cutter" templates, something like the "full-stack-fastapi-template". It is even faster than using an LLM (just git clone) and consumes way less energy.

The good

Tab works very well, better than any other autocomplete model that I tried. It is quite often the case that a change needs to be repeated in multiple places throughout the code, and the tab-tab-tab flow is surprisingly functional and accurate. However, care must be taken that it doesn't become too automatic, since the suggestions are not always spot on.

Another good point is that agents, when they work, tend to work pretty decently. Especially for tasks where extensive, good-quality training data is available, such as HTML, bash, and well-known web frameworks (e.g. React), the generation seems competent enough, although I would never skip review if the code I am writing has any importance at all (i.e. it is not a throw-away project). For example, when I asked the agent to create a login page and collapsible sidebar for the FastAPI-based project I was working on, the output was a reasonable Bootstrap 5 layout, and the sidebar started working as desired after a couple of extra prompts.

Unfortunately the good, at least to the extent that I tested the models, stops here.

The bad

LLMs are still too confidently wrong

At one point, I asked the agent to use poetry to output the requirements.txt during the Docker image build stage, instead of adding it directly to the codebase. At first, the agent gave me the right line, which is the following:

RUN poetry export --without-hashes -f requirements.txt -o requirements.txt

However, since I hadn't used poetry to output the requirements file in a while, the repeated presence of requirements.txt in the command looked somewhat out of place to me (I had forgotten that the -f flag is the short form of --format, so the format name just happens to coincide with the output filename), and I asked the model to double-check. Note that I never said that the line was wrong, only that it seemed unusual to me.

What would you have expected the agent to tell me? Ex post, I was certainly expecting something along the lines of: "I have double-checked, and the command I gave you is correct." Instead, the model proceeded to apologize and then gave me a wrong version of the command, which forced me to read the poetry docs to refresh my memory on the matter.

Agents are extremely inconsistent

As I mentioned, when I first prompted the agent to generate a login page and sidebar, it chose to use Bootstrap 5 (without asking first, mind you). At that point I had not yet written any rules for the agents to follow.

Then, when I asked the agent to write OIDC-based authentication, it decided that it was a good idea to write everything from scratch instead of using Authlib. This is not only highly inconsistent (why not write CSS from scratch too?) but also much more error-prone: OIDC is not particularly difficult in and of itself, but it is pretty hard to get right. I would feel much more comfortable with the AI suggesting code that uses well-known and well-tested libraries first.
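For reference, delegating the whole OIDC flow to Authlib in a FastAPI app only takes a handful of lines. The sketch below is purely illustrative (client credentials, the provider URL, and route names are placeholders, and minor API details differ between Authlib versions), but it shows the shape of the solution I would have preferred the agent to suggest:

```python
from authlib.integrations.starlette_client import OAuth
from fastapi import FastAPI, Request
from starlette.middleware.sessions import SessionMiddleware

app = FastAPI()
# Authlib's Starlette client keeps OAuth state (state/nonce) in the session.
app.add_middleware(SessionMiddleware, secret_key="change-me")

oauth = OAuth()
oauth.register(
    name="oidc",
    client_id="my-client-id",          # placeholder
    client_secret="my-client-secret",  # placeholder
    server_metadata_url="https://idp.example.com/.well-known/openid-configuration",
    client_kwargs={"scope": "openid email profile"},
)

@app.get("/login")
async def login(request: Request):
    # Redirect to the identity provider; Authlib builds the authorization URL.
    return await oauth.oidc.authorize_redirect(request, str(request.url_for("auth")))

@app.get("/auth")
async def auth(request: Request):
    # Authlib exchanges the authorization code and validates the ID token.
    token = await oauth.oidc.authorize_access_token(request)
    request.session["user"] = dict(token["userinfo"])
    return {"logged_in": True}
```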

It was at this point that I wrote my first rule file to let the agent know about my stack of choice. The rule file is as follows:

# General

This is an application dedicated to the collection of data, and generation of reports based on such data.
The code for the application is contained in the `app` directory.

## Stack

The application uses the following stack (this list may be outdated):
- Backend: Python 3.12
    - Poetry
    - FastAPI
    - SQLAlchemy
    - Alembic
    - Authlib
- Interactivity:
    - HTMX
- UI framework:
    - Bootstrap 5
- Database:
    - PostgreSQL

The ugly

Agents don't follow rules I

This is when I asked the agent to output different content on certain pages depending on the user's status (logged in or not). I tried to embrace the whole vibe-coding concept and didn't even review the changes; I just accepted them as-is. However, it turned out that the agent had produced multiple Jinja blocks with the same name, so I added the relevant log lines to the chat's context and asked it to fix the problem.

Remember that I had written a rule file describing my tech stack? Well, the agent modified the Jinja templates with Flask-specific constructs, such as message flashing, in particular the get_flashed_messages function.
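For the curious, this is why that cannot work: get_flashed_messages is a helper that Flask injects into its own Jinja environment, while the plain Jinja2 environment behind FastAPI/Starlette's Jinja2Templates knows nothing about it, so the template fails at render time. A minimal illustration (the template string is just an example, not my actual template):

```python
from jinja2 import Environment
from jinja2.exceptions import UndefinedError

# A bare Jinja2 environment, like the one wrapped by Starlette's Jinja2Templates,
# has no Flask-provided globals such as get_flashed_messages.
env = Environment()
template = env.from_string(
    "{% for message in get_flashed_messages() %}{{ message }}{% endfor %}"
)

try:
    template.render()
except UndefinedError as err:
    print(f"Render failed: {err}")  # 'get_flashed_messages' is undefined
```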

Agents have "no memory" and no real understanding of the codebase

In the whole mess that I described in the previous section, the agent also managed to reference the authenticated_user variable in the Jinja templates, the only problem being that such a variable was never passed to the rendering engine. Instead, the user variable was passed, and not only that: it had been created and named in one of my previous interactions with the agent!
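To make the mismatch concrete, the FastAPI side looks roughly like this (names and the template directory are illustrative): the context handed to the template defines user, so any reference to authenticated_user resolves to Jinja's undefined value instead of the actual account.

```python
from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="app/templates")  # illustrative path

@app.get("/")
async def index(request: Request):
    user = {"name": "demo"}  # stand-in for the real current-user lookup
    # Only "user" is placed in the template context; a template that reads
    # "authenticated_user" gets an undefined value instead.
    return templates.TemplateResponse(
        "index.html", {"request": request, "user": user}
    )
```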

This clearly demonstrates that language prediction alone is not enough to write good code, and that if you think that engineers can be replaced by AI, you're making a big mistake.

Maybe a time will come when AI models will be as good as engineers, but that time is not now.

Agents don't follow rules II

Since I had been bitten before by the tendency of the agent to modify way more than necessary to satisfy my prompts, I decided to add a new rule file. The file is as follows:

# Do only the minimum necessary

- Do only the minimum necessary to satisfy the user's request
- Do not change something that was working fine
- Do not make assumptions on what programming and/or layout frameworks should be used

Immediately after I added this rule file, I asked the agent to change the login page in particular, because it was not centered, and the sidebar should never be shown on the login page anyway.

What do you think happened? The agent changed the layout of the entire site, including the position of the hamburger button, and the animation of the sidebar when showing/hiding.

The end result was good enough, but that's beside the point: the agent went against precise instructions and changed way too much of the codebase.

Conclusion

I happen to work for a company that trains and uses AI models. I also happen to have a PhD in electronics engineering and computer science (this doesn't mean that I am particularly smart, but it does mean that I worked pretty damn hard to get it). Here is my opinion.

  1. Don't drink the Kool-Aid: coding assistants will not make you 10x more productive. They might in certain specific areas, but not in general.
  2. All this hype about vibe-coding and prompt-engineering is just BS. Never trust the output of an LLM as-is, just as you wouldn't blindly trust the output of a colleague. And don't even think of vibe-coding your way to production.
  3. Language is not enough. MCP (Model Context Protocol) is going to make things better (who would have guessed, even LLMs need APIs and structure), but it is not going to change the fundamental issues with the current approach.
  4. Use AI agents and LLM-based code completion responsibly: when the time comes (it is a when, not an if) that the AI infrastructure you rely on breaks, or becomes too expensive (there are very few economies of scale in this particular instance), what are you going to do?