How to Test Chatbots Like ChatGPT

Scientists are trying to figure out better ways to test artificial intelligence (AI) systems like ChatGPT to understand exactly what they can and can't do. This article explores some of the questions around testing chatbots, and why it matters if we want to use AI safely and effectively in areas like healthcare and education.

Paul Blackhurst

7/22/2023 · 4 min read


You've probably seen or used chatbots like ChatGPT that can have surprisingly human-like conversations. People are blown away by them! But while these bots seem smart, they still have many limitations compared to human intelligence.


Why Testing Matters

Here's an everyday example of why good testing matters...

Imagine your friend Adeel tells you he's a brilliant footballer who should go pro, because he's super-fast and has an amazing kick. You might think, wow, he sounds fantastic!

But then when you play a match with him, you notice he only uses his left foot and can only kick the ball straight. He also gets tired after sprinting for 30 seconds.

Actual matches reveal his strengths, but also weaknesses that weren't obvious from his claims alone. Testing him in a real game gives you a much more complete picture.

It's similar with AI systems like ChatGPT. We need rigorous testing to truly understand their capabilities and limitations, beyond hype or small samples.

The Famous Turing Test

Back in 1950, the British mathematician Alan Turing proposed a famous test for machine intelligence.

The idea was simple: have human judges chat to both a person and a hidden computer. If the judge can't reliably tell which is which, it suggests intelligence! This became known as the Turing test.
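
To make the protocol concrete, here's a minimal sketch of how a Turing-test evaluation could be scored. Everything in it (the judge function, the sample replies) is a hypothetical illustration, not a real experimental setup. The machine "passes" when the judge's accuracy drops to around chance (0.5).

```python
# A minimal Turing-test scoring sketch, assuming a judge function that
# labels each transcript "human" or "machine". All names are hypothetical.
import random

def run_turing_test(judge, human_replies, machine_replies, trials=100):
    """Return the judge's accuracy at spotting the machine.

    Accuracy near 0.5 means the judge is guessing at chance,
    which is the classic 'pass' condition for the machine.
    """
    correct = 0
    for _ in range(trials):
        # Secretly pick a human or a machine transcript for this round.
        is_machine = random.random() < 0.5
        transcript = random.choice(machine_replies if is_machine else human_replies)
        verdict = judge(transcript)  # judge returns "machine" or "human"
        if (verdict == "machine") == is_machine:
            correct += 1
    return correct / trials

# Toy usage: a naive judge that flags one stock phrase as machine-like.
naive_judge = lambda text: "machine" if "as an ai" in text.lower() else "human"
humans = ["lol no idea, ask mum", "saw it yesterday, pretty good tbh"]
machines = ["As an AI language model, I cannot watch films."]
print(run_turing_test(naive_judge, humans, machines))
```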

Many experts now say ChatGPT could pass a Turing test by fooling humans in short text chats. So, does this mean it's as smart as a person?

Not so fast...

Problems with the Turing Test

While ChatGPT might pass a limited Turing test, many AI experts argue the test has flaws:

• It rewards trickery over usefulness.

• You can fool judges without intelligence, like a magic trick.

• It only tests conversation, not versatility.

Adeel might declare himself an elite footballer just through conversation. But as we saw before, we'd want to test him in matches to see his true strengths and weaknesses.

Simply mimicking human chat has little to do with true intelligence. So, researchers want better and more comprehensive ways to test AI systems.

The Debate Over Reasoning

Right now, there's a big debate around whether ChatGPT and systems like it can reason, or just respond based on patterns.

Some experts think ChatGPT shows basic reasoning, pointing to its strong performance on school exams. Others argue it doesn't understand the concepts at all and is just leaning on statistical tricks.

More rigorous testing is needed to settle this debate. But it's hard to design benchmarks that definitively prove or disprove reasoning abilities.

Researchers stress these are not human brains! We can't just give them an IQ test made for people. We need to test AI capabilities in a nuanced way on their own terms.

Problems with Existing Benchmarks

Many existing tests and benchmarks used to evaluate AI have flaws:

• Training data contamination - ChatGPT may have "seen" test questions during training, giving it an unfair edge (a simple check for this is sketched at the end of this section).

• Narrow focus - Doing well on limited tests doesn't mean it will handle the real world.

• Interpretation pitfalls - High scores don't necessarily indicate intelligence the way they would for humans.

Researchers need larger, more challenging benchmark tests that require broader cognitive skills. Existing tests fall short of thoroughly evaluating these systems.
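
As flagged in the first bullet above, here's a minimal sketch of the kind of contamination check researchers run, assuming we can compare benchmark questions against the training text. The corpus and questions below are toy placeholders; real checks work over vastly larger corpora with fuzzier matching, but the idea is the same.

```python
# A toy training-data contamination check: flag any test question that
# shares a long word n-gram with the (tiny, hypothetical) training corpus.

def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_questions, training_corpus, n=8):
    """Fraction of test questions sharing any n-gram with the training text."""
    train_grams = ngrams(training_corpus, n)
    flagged = sum(1 for q in test_questions if ngrams(q, n) & train_grams)
    return flagged / len(test_questions)

corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank"
questions = [
    "the quick brown fox jumps over the lazy dog near the river",  # overlaps
    "what is the boiling point of water at sea level in celsius",  # novel
]
print(contamination_rate(questions, corpus, n=5))  # -> 0.5
```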

Creating Better Benchmarks

Given the limits of the Turing test and existing benchmarks, AI experts are trying to design better tests.

For example, François Chollet created a visual puzzle challenge called ARC (Abstraction and Reasoning Corpus). Humans score 80% on it, while the best AI scores only 21%!

Melanie Mitchell then created ConceptARC - simplified visual puzzles focused on specific concepts like sameness and alignment. Again, humans scored over 90% while ChatGPT barely reached 30%, revealing its limitations.

These puzzles require recognizing abstract concepts and applying them creatively to new examples the solver has never seen before. This remains a big challenge for AI.
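
To give a flavour of what these puzzles look like to a program, here's a toy, hypothetical ARC-style task. The grids, the "mirror" rule, and the solver are all made up for illustration; the point is the format: a few demonstration pairs, one hidden test pair, and all-or-nothing exact-match scoring.

```python
# A toy ARC-style task sketch: grids are small lists of lists of colour
# codes, and a solver must infer the transformation (here: mirror the grid
# left-to-right) from demonstration pairs, then apply it to a test input.

demos = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 0, 0], [0, 4, 0]], [[0, 0, 3], [0, 4, 0]]),
]
test_input = [[5, 0, 0]]
test_target = [[0, 0, 5]]

def mirror_solver(grid):
    """A hand-written 'solver' that guesses the rule is a horizontal flip."""
    return [list(reversed(row)) for row in grid]

# Check the guessed rule against the demonstrations, then score the test pair.
assert all(mirror_solver(x) == y for x, y in demos)
print("correct" if mirror_solver(test_input) == test_target else "wrong")
```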

Experts are also testing ChatGPT with academic exams and professional certification tests meant for humans. But while scores seem impressive, many worry ChatGPT may be finding shortcuts rather than truly demonstrating skills.

There's no perfect single test of intelligence. But a diverse array of benchmarks focused on reasoning could reveal strengths, weaknesses, and differences between AI systems and humans.
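
Here's a small sketch of what "a diverse array of benchmarks" means in practice: score a system per benchmark and read the whole profile, rather than collapsing everything into one number. The model and benchmark names are hypothetical placeholders.

```python
# Profile a system across several benchmarks instead of trusting one score.

def profile(model, benchmarks):
    """Run a model callable over named benchmarks and return a score profile."""
    return {name: sum(model(q) == a for q, a in items) / len(items)
            for name, items in benchmarks.items()}

# Toy model: echoes the question's last word as its 'answer'.
toy_model = lambda question: question.split()[-1]

benchmarks = {
    "pattern_completion": [("red green red green", "green"), ("a b a b", "b")],
    "abstraction": [("opposite of up is", "down"), ("opposite of hot is", "cold")],
}
print(profile(toy_model, benchmarks))
# -> {'pattern_completion': 1.0, 'abstraction': 0.0}: strong on surface
#    patterns, weak on concepts - a gap a single averaged score would hide.
```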

The Risk of Anthropomorphising

Here's another key point...

We have a natural tendency to "anthropomorphise" things - to assign human traits and experiences to non-human things.

Like imagining your robot vacuum has a personality, or that your cat understands you and has complex thoughts about life. We like to humanise things.

But when it comes to AI, this tendency can cloud our judgment about what's really going on under the hood. We might think systems like ChatGPT "understand" us and "think" like humans. But the reality is very different.

These systems don't comprehend language or reason logically like people. They use statistical patterns from massive training data to generate plausible responses. It's important we don't get misled by the illusion of human qualities in AI.
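
A toy example makes the "statistical patterns" point concrete. The bigram model below, a deliberately crude sketch, picks each next word purely from counts of what followed it in its training text. It produces plausible strings without any notion of cats, mats, or meaning; real chatbots use vastly larger neural models, but they too predict continuations from statistics.

```python
# A minimal bigram text generator: plausible continuations from raw counts.
import random
from collections import defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count which words follow which in the training text.
following = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev].append(nxt)

def generate(start, length=6):
    words = [start]
    for _ in range(length):
        options = following.get(words[-1])
        if not options:
            break
        words.append(random.choice(options))  # plausible, not 'understood'
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat on the rug"
```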

Testing rigorously using benchmarks tailored to AI capabilities, rather than human cognition, can help avoid this trap.

Moving Forward

Chatbots like ChatGPT are impressive, but also limited. To guide progress, experts need to probe their boundaries through diverse, stringent testing focused on reasoning.

This will reveal shortcomings to improve, and differences from human cognition. It will also help us deploy these technologies safely and effectively as they grow more advanced.

The goal should not be to mimic human conversation, but to quantify actual abilities. Only with rigorous empirical research across many benchmarks can we chart the path ahead for AI.

So, in summary, while chatbots seem amazingly human sometimes, targeted testing exposes their flaws. And revealing flaws is a good thing! It's how science and innovation move forward. Through extensive testing, we'll unlock AI's true potential.