Gemini Coding LLMs Competence


😎 Note, if you find these symbols: ___ It means they are words that are hidden to you and YOU CAN’T ALTER OR REPLACE. You must keep the corresponding underscores in the corrected text, assuming they are correct and valid words.

“Every single LLM has the capability of suddenly seeming extremely competent, and suddenly seeming extremely incompetent, just by generating a response. Particularly in coding problems, sometimes It’s a matter of luck to strike a good answer with one of these probabilistic machines… Because, in this use case (coding with LLMs is very particular compared to other typical use cases) they are just that, machines that we use for a clearly utilitarian and measurable purpose: if 1 out of 10 tries the code would work and, more importantly, you would spend prompting and copying/pasting code less than a tenth of the time you would spend composing everything from scratch, it is worth your time getting help from the LLM. The problem is that this is so difficult to predict.

This is not to counterargument this post, though. It might be true that Gemini has significantly improved very recently and this kind of answers that seem to proactively admit ‘guilt’ or incompetence’ might be a good indicative that the model is improving. The ‘I’m truly at a loss now’ though, is obviously complete ___… as ‘____’ as the much more predictable and annoying response you would get repeating the same mistake over and over and, when you tell it it’s wrong, reply with the classic ‘I apologize’ and overly vague and disrespectful language, seemingly eluding all accountability by appealing to your own cognitive competence (‘I apologize for the confusion caused’), sentiment (‘I understand your frustration’), and just blatantly denying the facts: before saying something like ‘I can’t do this because I lack the competence,’ no matter how many times you tell them, they are incredibly well preconditioned to always generate a response no matter how stupid or nonsense, and defend it with the mose blatant customer service ‘b💥💥💥💥💥💥t’ excuse: ‘you are correct, the previous solution was not the most accurate because…’ They are incredibly ‘competent’ in doing that without any trace of the genuine self-awareness, empathy, or selfishness and stupidity of a human agent. They don’t ‘understand’ or feel anything and just generate messages minimizing a loss function that represents how much they predict you would ‘dislike’ the response). A big problem with LLMs is that every user has different expectations and needs, but I believe all LLM users can agree the most outrageus thing is their precondition to never admit their incompetence for a tak. They easily admit lack of knowledge (the classic ChagGPT: ‘as of my latest update’… whatever), but they are blatantly incompetent at acknowledging incompetence.
This is an interesting quality, by the way, because it seems to parallel the way human competence and skills work (Dunning & Kruger: “Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments.” https://doi.org/10.1037/0022-3514.77.6.1121).

Sorry for the long text. I agree this is a good and surprising response, but it might be just a matter of luck and it’s difficult to verify. You can regenerate the response multiple times, and restart the chat trying to replicate a similar conversation, but the problem is that you will never get the exact same response in a given message, so your “universe of experiments” becomes larger and larger in each interaction, like a chess game, making it virtually impossible to reach deterministic conclusions about how “good” the model is… The beauty of generative models is perhaps that they have almost infinite possible statuses and such a high level of entropy they make us unconsciously question whether they are ‘intelligent’ or just the best ever practical approximation of the ‘Infinite Monkey Theorem.’

I always regarded PaLM/Bard/Gemini as highly competent and underrated models (because they always lacked the marketing around ChatGPT), but not as good for coding as GPT-4 used to be. There’s some consensus about GPT-4o being complete trash for coding, so people who code are now exploring other models more than we used to. Another possibility might be that Gemini has not improved much for coding, but that GPT-4o has regressed so much in that sense we now look at other models and ‘suddenly look extremely competent’… at least compared to GPT :)”

🤖 “Every single LLM has the capability of suddenly seeming extremely competent, and suddenly seeming extremely incompetent, just by generating a response. Particularly in coding problems, sometimes it’s a matter of luck to strike a good answer with one of these probabilistic machines. Because, in this use case (coding with LLMs is very particular compared to other typical use cases), they are just that, machines that we use for a clearly utilitarian and measurable purpose. If 1 out of 10 tries the code works and, more importantly, you spend less than a tenth of the time prompting and copying/pasting code compared to composing everything from scratch, it is worth your time getting help from the LLM. The problem is that this is so difficult to predict.

This is not to counter this post, though. It might be true that Gemini has significantly improved very recently, and these kinds of answers that seem to proactively admit ‘guilt’ or ‘incompetence’ might be a good indication that the model is improving. The ‘I’m truly at a loss now’ though, is obviously complete ___. It is as ‘___’ as the much more predictable and annoying response you would get repeating the same mistake over and over and, when you tell it it’s wrong, replying with the classic ‘I apologize’ and overly vague and disrespectful language, seemingly eluding all accountability by appealing to your own cognitive competence (‘I apologize for the confusion caused’), sentiment (‘I understand your frustration’), and just blatantly denying the facts. Before saying something like ‘I can’t do this because I lack the competence,’ no matter how many times you tell them, they are incredibly well-preconditioned to always generate a response, no matter how stupid or nonsensical, and defend it with the most blatant customer service ‘b💥💥💥💥💥💥t’ excuse: ‘You are correct, the previous solution was not the most accurate because…’ They are incredibly ‘competent’ in doing that without any trace of genuine self-awareness, empathy, or selfishness and stupidity of a human agent. They don’t ‘understand’ or feel anything and just generate messages minimizing a loss function that represents how much they predict you would ‘dislike’ the response. A big problem with LLMs is that every user has different expectations and needs, but I believe all LLM users can agree the most outrageous thing is their precondition to never admit their incompetence for a task. They easily admit lack of knowledge (the classic ChatGPT: ‘as of my latest update’… whatever), but they are blatantly incompetent at acknowledging incompetence.
This is an interesting quality, by the way, because it seems to parallel the way human competence and skills work (Dunning & Kruger: ‘Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments.’ https://doi.org/10.1037/0022-3514.77.6.1121).

Sorry for the long text. I agree this is a good and surprising response, but it might be just a matter of luck and it’s difficult to verify. You can regenerate the response multiple times, and restart the chat trying to replicate a similar conversation, but the problem is that you will never get the exact same response in a given message, so your ‘universe of experiments’ becomes larger and larger in each interaction, like a chess game, making it virtually impossible to reach deterministic conclusions about how ‘good’ the model is. The beauty of generative models is perhaps that they have almost infinite possible statuses and such a high level of entropy they make us unconsciously question whether they are ‘intelligent’ or just the best ever practical approximation of the ‘Infinite Monkey Theorem.’

I always regarded PaLM/Bard/Gemini as highly competent and underrated models (because they always lacked the marketing around ChatGPT), but not as good for coding as GPT-4 used to be. There’s some consensus about GPT-4o being complete trash for coding, so people who code are now exploring other models more than we used to. Another possibility might be that Gemini has not improved much for coding, but that GPT-4o has regressed so much in that sense we now look at other models and ‘suddenly look extremely competent’… at least compared to GPT :)”

Note: Changes were made to correct grammar, punctuation, and clarity.
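
The "1 out of 10 tries" rule of thumb in the opening paragraph of the message can be checked with a little arithmetic. What follows is only a minimal sketch, assuming every try is independent and succeeds with the same fixed probability (a simplification, since the message itself says this is hard to predict). Under that assumption the number of tries until the first working answer follows a geometric distribution, so the expected number of tries is 1 divided by the success probability. The function names and the sample numbers are hypothetical and do not come from the conversation.

```python
def expected_llm_time(time_per_try: float, success_prob: float) -> float:
    """Expected total time spent prompting until the first working answer.

    Assumes independent tries with a constant success probability, so the
    expected number of tries is 1 / success_prob (geometric distribution).
    """
    if not 0 < success_prob <= 1:
        raise ValueError("success_prob must be in the interval (0, 1]")
    if time_per_try < 0:
        raise ValueError("time_per_try must be non-negative")
    return time_per_try / success_prob


def llm_is_worth_it(time_per_try: float, success_prob: float,
                    time_from_scratch: float) -> bool:
    """True when the expected LLM time beats writing everything by hand."""
    return expected_llm_time(time_per_try, success_prob) < time_from_scratch


if __name__ == "__main__":
    # Hypothetical numbers: 60 minutes to write the code from scratch,
    # and 1 try in 10 produces working code.
    print(llm_is_worth_it(5, 0.1, 60))  # True: 5 min per try is under 60 / 10
    print(llm_is_worth_it(7, 0.1, 60))  # False: 7 min per try is over 60 / 10
```

With a success rate of 1 in 10, the break-even point is a per-try cost of one tenth of the from-scratch time, which matches the threshold stated in the message.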

Hugging Face Dataset Metrics

All the conversation prompts, responses, and metrics are available to download and explore on Hugging Face dataset reddgr/talking-to-chatbots-chats.