Testing multimodal chatbot capabilities with stacked area chart analysis results

Loading data from Talking to Chatbots Dataset reddgr/talking-to-chatbots-chats

😎 “Today’s post contains an easy practical use case for multimodal chatbots that I’ve been willing to test for quite some time, but it still proves we’re still quite far from expecting reliable ‘reasoning’ from LLM tools when incorporating fairly rudimentary visual elements such as time series charts. I challenged ChatGPT 4o (as discussed in an earlier post, o1 in the web app still does not support image input), Gemini, and Claude to analyze a stacked area chart that visually represents the evolution in the net worth of an individual. After several attempts and blatant hallucinations by all models, I share the best outputs which, as usual in most of the SCBN battles published on Talking to Chatbots, were those from ChatGPT.”


“willing to test” makes no sense. Try to fix that part so it’s more straightforward
🤖 “Today’s post contains an easy practical use case for multimodal chatbots that I’ve been wanting to test for quite some time, but it still proves we’re quite far from expecting reliable ‘reasoning’ from LLM tools when incorporating fairly rudimentary visual elements such as time series charts. I challenged ChatGPT 4o (as discussed in an earlier post, o1 in the web app still does not support image input), Gemini, and Claude to analyze a stacked area chart that visually represents the evolution in the net worth of an individual. After several attempts and blatant hallucinations by all models, I share the best outputs which, as usual in most of the SCBN battles published on Talking to Chatbots, were those from ChatGPT.”

*Note:* “willing to test” was changed to “wanting to test,” and the redundant second “still” was removed, for clarity.

Hugging Face Dataset Metrics

All the conversation prompts, responses, and metrics are available to download and explore on the Hugging Face dataset reddgr/talking-to-chatbots-chats.
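
As a minimal sketch of how to pull the data, assuming the dataset loads with the standard `datasets` library (the `train` split and the inspected columns are assumptions; check the dataset card on the Hub before relying on them):

```python
# Minimal sketch: loading the Talking to Chatbots chats dataset from the
# Hugging Face Hub. The "train" split name is an assumption; verify the
# available splits and columns on the dataset card.
from datasets import load_dataset

ds = load_dataset("reddgr/talking-to-chatbots-chats", split="train")

# Inspect the schema and a sample record to see how prompts, responses,
# and metrics are stored.
print(ds.column_names)
print(ds[0])
```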