Today’s post covers a simple, practical use case for multimodal chatbots that I’ve been wanting to test and show here for quite some time. It also shows that we’re still far from getting reliable ‘reasoning’ from LLM tools when even fairly rudimentary visual elements, such as time series charts, are involved. I challenged ChatGPT 4o (as discussed in an earlier post, o1 in the web app still does not support image input), Gemini, and Claude to analyze a stacked area chart showing the evolution of an individual’s net worth. After several attempts and blatant hallucinations by all models, I share the best outputs which, as in most of the SCBN battles published on Talking to Chatbots, were those from ChatGPT.

Below is the notebook I submitted (late) to the LMSYS – Chatbot Arena Human Preference Predictions competition on Kaggle. The notebook applies NLP text classification techniques using popular Python libraries such as scikit-learn and TextBlob, along with my own fine-tuned versions of DistilBERT. It introduces the first standardized version of the SCBN (Specificity, Coherency, Brevity, Novelty) quantitative scores for evaluating chatbot response performance. It also introduces a new benchmark for classifying prompts named RQTL (Request vs Question, Test vs Learn), which aims to refine human choice predictions and provide context for the SCBN scores based on inferred user intent. You …

Predicting LMSYS Chatbot Arena Votes With the SCBN and RQTL Benchmarks Read more »

Introducing a new section in Talking to Chatbots: Hiring Chatbots. In this series of chats, we will be conducting job interviews with chatbots – either a standard version of the most popular LLMs or a customized GPT or Character – and we’ll make them compete for the job. The chatbot with the best answer to our question wins and gets published. The rest of the answers will be listed, ranked, and shared in a signature SCBN chatbot battle chart. With the SCBN scores, I value adherence to my prompt instructions (role-playing is important in this case, which most chatbots fail …

Hiring Chatbots: Cloudflare’s Lava Lamps Interview Question Read more »