Black and white line drawing generated by the ControlNet Canny model, depicting a woman in a wetsuit holding a surfboard on the beach. A text from CLIP Interrogator, describing an image, is superimposed over the drawing and reads: "a woman in a wet suit holding a surfboard on the beach with waves in the background and a blue sky, promotional image, a colorized photo, precisionism."

The intense competition in the chatbot space is reflected in the ever-increasing amount of contenders in the LMSYS Chatbot Arena leaderboard, or in my modest contribution with the SCBN Chatbot Battles I’ve introduced in this blog and complete as time allows. Today we’re exploring WildVision Arena, a new project in Hugging Face Spaces that brings vision-language models to contend. The mechanics of WildVision Arena are similar to that of LMSYS Chatbot Arena. It is a crowd-sourced ranking based on people’s votes, where you can enter any image (plus an optional text prompt), and you will be presented with two responses from two different models, keeping the name of the model hidden until you vote by choosing the answer that looks better to you. I’m sharing a few examples of what I’m testing so far, and we’ll end this post with a traditional ‘SCBN’ battle where I will evaluate the vision-language models based on my use cases.