Below is the notebook I submitted (late) to the LMSYS – Chatbot Arena Human Preference Predictions competition on Kaggle. The notebook applies NLP text-classification techniques with popular Python libraries such as scikit-learn and TextBlob, as well as my own fine-tuned versions of DistilBERT. It introduces the first standardized version of SCBN (Specificity, Coherency, Brevity, Novelty), a set of quantitative scores for evaluating chatbot response performance. It also introduces a new benchmark for classifying prompts, RQTL (Request vs Question, Test vs Learn), which aims to refine human choice predictions and provide context for the SCBN scores based on inferred user intent. You can check all the code, annotations, and charts in the Kaggle widget below.
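As a rough illustration of how the fine-tuned DistilBERT classifiers are invoked, here is a minimal sketch using the Hugging Face transformers pipeline. The model id shown is a placeholder rather than an actual checkpoint name, and the returned labels depend on the specific classifier (check my HuggingFace profile for the real models).

```python
# Minimal sketch: score a prompt with a fine-tuned DistilBERT classifier.
# The model id below is a placeholder, not an actual checkpoint name.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="reddgr/rqtl-prompt-classifier",  # placeholder id; see my HuggingFace profile
)

prompt = "Explain the difference between a list and a tuple in Python."
result = classifier(prompt)
print(result)  # e.g. [{'label': '...', 'score': 0.97}], labels depend on the classifier
```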
RQTL (Request-Question-Test-Learn) Prompts Word Frequency Stats
To get a better sense of the RQTL prompt classification, for which I'm developing training and testing datasets and models on HuggingFace (check my profile for the latest versions), I have compiled some statistics on each type of prompt and share a few charts below:
You can find the notebook where I calculated the word frequency stats in the GitHub repository, along with other related notebooks and Python scripts. You can also explore the original version of the notebook in this Gist:
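For reference, the sketch below shows roughly the kind of per-class word-frequency computation that notebook runs. The column names ("prompt", "rq_label") and the toy rows are illustrative assumptions, not the actual dataset schema.

```python
# Minimal sketch of per-class word-frequency counting (column names are illustrative).
import re
from collections import Counter

import pandas as pd

# Toy data standing in for the labeled RQTL prompts.
df = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Write a short poem about rain.",
    ],
    "rq_label": ["question", "request"],
})

def tokenize(text: str) -> list[str]:
    # Deliberately simple tokenizer: lowercase word characters only.
    return re.findall(r"[a-z']+", text.lower())

word_freqs = {}
for label, group in df.groupby("rq_label"):
    counter = Counter()
    for prompt in group["prompt"]:
        counter.update(tokenize(prompt))
    word_freqs[label] = counter.most_common(20)  # top 20 words per prompt class

print(word_freqs["question"][:5])
```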
Request vs Question & Test vs Learn Datasets on Huggingface
Since I started this project, I’ve been maintaining a couple of public datasets on HuggingFace, with prompts extracted from the lmsys/lmsys-chat-1m dataset and labeled manually.
reddgr/rq-request-question-prompts: Annotated Request vs Question Prompts
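A minimal sketch of pulling the annotated prompts with the Hugging Face datasets library follows; the split and column names come from the dataset card, so the "train" split and the indexing shown here are assumptions.

```python
# Minimal sketch: load the Request vs Question dataset from the Hugging Face Hub.
from datasets import load_dataset

rq_ds = load_dataset("reddgr/rq-request-question-prompts")
print(rq_ds)  # inspect the available splits and columns

# Assuming a "train" split exists (check the dataset card for the actual splits).
print(rq_ds["train"][0])
```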