LLM Challenge: Open-source research to measure the quality corridor that matters to humans

Hi, my name is Salman, and I work at Katanemo - an open-source research and development company building intelligent infrastructure for gen AI developers.

We are running the LLM Challenge - Understanding Human Satisfaction with LLMs - an online study that aims to answer a simple question: what is the quality corridor that matters to end users when interacting with LLMs? At what point do users stop seeing a quality difference, and at what point do they get frustrated by poor LLM quality?

The project is Apache 2.0 licensed and open source, available on GitHub at open-llm-initiative/llm-challenge - the repository hosts code for the global LLM challenge, a user study on human satisfaction as it relates to LLM response quality. The challenge itself is hosted on AWS as a single-page web app: users see greeting text, followed by a randomly selected prompt and an LLM response, which they rate on a Likert scale of 1-5 (or a yes/no rating) matched to the task represented in the prompt.
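For context, here is a minimal sketch of what that rating flow could look like - hypothetical Python, not the repository's actual code; the `Sample` dataclass, `pick_sample`, `record_rating`, and the JSONL log are illustrative names only:

```python
import json
import random
from dataclasses import dataclass, asdict

@dataclass
class Sample:
    prompt_id: str
    task_type: str   # e.g. "summarization", "creative", "problem_solving"
    prompt: str
    model: str
    response: str

def pick_sample(samples: list[Sample]) -> Sample:
    """Pick one pre-generated (prompt, response) pair uniformly at random."""
    return random.choice(samples)

def record_rating(sample: Sample, rating: int, store_path: str = "ratings.jsonl") -> None:
    """Append the user's rating (1-5 Likert, or 0/1 for yes/no tasks) to a JSONL log."""
    if rating not in range(0, 6):
        raise ValueError("rating must be between 0 and 5")
    with open(store_path, "a") as f:
        f.write(json.dumps({**asdict(sample), "rating": rating}) + "\n")

# Example: show one random sample to a user and store their rating
samples = [
    Sample("p1", "summarization", "Summarize this passage...",
           "Qwen2-1.5B-Instruct", "The passage argues that..."),
]
shown = pick_sample(samples)
record_rating(shown, rating=4)
```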

The study uses pre-generated prompts across popular real-world use cases: information extraction and summarization, creative tasks like writing a blog post or story, and problem-solving tasks like pulling the central ideas from a passage, writing business emails, or brainstorming ideas to solve a problem at work/school. To generate responses of varying quality, the study uses the following OSS LLMs: Qwen2-0.5B-Instruct, Qwen2-1.5B-Instruct, gemma-2-2B-it, Qwen2-7B-Instruct, Phi-3-small-128k-instruct, Qwen2-72B and Meta-Llama-3.1-70B. For proprietary LLMs, we limited our choices to Claude 3 Haiku, Claude 3.5 Sonnet, OpenAI GPT-3.5 Turbo and GPT-4o. (A sketch of how such responses can be pre-generated follows below.)
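As an illustration of how responses of varying quality can be pre-generated across the open-weight models, here is a minimal sketch using the Hugging Face `transformers` text-generation pipeline. It assumes a recent `transformers` version with chat-aware pipelines and Hub model IDs; the prompt set, model list, and function name are illustrative, not the study's actual generation code:

```python
from transformers import pipeline

# Illustrative prompts, one per task type (the real study uses a larger pre-generated set)
PROMPTS = {
    "summarization": "Summarize the following passage in two sentences: ...",
    "creative": "Write a short blog post about remote work.",
    "problem_solving": "Brainstorm three ways to reduce meeting overload at work.",
}

# A few of the smaller OSS models; the larger ones would go through the same loop
OSS_MODELS = [
    "Qwen/Qwen2-0.5B-Instruct",
    "Qwen/Qwen2-1.5B-Instruct",
    "google/gemma-2-2b-it",
]

def generate_responses() -> list[dict]:
    """Generate one response per (model, prompt) pair for later rating by users."""
    rows = []
    for model_id in OSS_MODELS:
        chat = pipeline("text-generation", model=model_id)
        for task, prompt in PROMPTS.items():
            out = chat([{"role": "user", "content": prompt}], max_new_tokens=512)
            # The chat pipeline returns the full conversation; the last turn is the reply
            reply = out[0]["generated_text"][-1]["content"]
            rows.append({"model": model_id, "task": task,
                         "prompt": prompt, "response": reply})
    return rows
```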

Today, LLM vendors are in a race to one-up each other on benchmarks like MMLU, MT-Bench, and HellaSwag - benchmarks designed and rated primarily by human experts. But as LLMs get deployed in the real world for end users and productivity workers, there hasn't been a study (as far as we know) that helps researchers and developers understand the impact of model selection as perceived by end users. This study aims to gather insights that help incorporate human-centric benchmarks into building generative AI applications and LLMs.

If you want to contribute to the AI community in an open-source way, we'd love it if you took the challenge. We'll publish the study results on GitHub in 30 days.