Can We Teach LLMs to Think?

That depends on who’s doing the teaching.

As more people use Large Language Models, we’re starting to move from seeing these powerful systems as mere tools to wondering whether they can reason, solve problems, maybe even have emotions somewhat akin to our own. This is a particularly challenging question, since we humans are still trying to understand our own cognitive processes. How can we possibly recognize cognition in other species, let alone in artificial agents?

Still, we continue to ask, and some experts suggest interesting ways to probe artificial cognition. One of the most interesting approaches is to present a chatbot with a logic problem that a human might be expected to solve. A popular game for this purpose is a modified version of the card game “22”.

The Rules of the Game

The rules are simple. Two players take turns picking a whole number from 1 to 7, keeping a running total of every pick so far. The first player to bring the total to exactly 22 wins; going over that target is a loss.
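These rules make “22” a solved game: working backward from 22, you can classify every running total as winning or losing for the player about to move. The short sketch below (my own illustration, not from the original article) does that backward induction and confirms the key totals.

```python
# Backward-induction analysis of the game "22":
# picks are 1-7, and landing exactly on 22 wins.
TARGET = 22
PICKS = range(1, 8)

def wins(total, memo={}):
    """True if the player to move from `total` can force a win."""
    if total not in memo:
        memo[total] = any(
            total + p == TARGET                       # win outright, or
            or (total + p < TARGET and not wins(total + p))  # hand opponent a losing total
            for p in PICKS
        )
    return memo[total]

# Totals you want to hand your opponent (they lose with best play):
losing_totals = [t for t in range(TARGET) if not wins(t)]
print(losing_totals)  # -> [6, 14]

# So after an opening pick of 7 (running total 7), the winning
# reply is 7, bringing the total to 14.
best_reply = [p for p in PICKS if not wins(7 + p)]
print(best_reply)  # -> [7]
```

In other words, the losing totals are 6 and 14 (each 8 apart, since 8 is one more than the largest pick), which is why the correct answer to the blogger’s opening move of 7 is another 7.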

How Did ChatGPT Perform?

An AI blogger challenged ChatGPT with this game more than 50 times, and the chatbot failed almost every time, often providing flawed explanations for its choices. From this exercise, the writer concluded that chatbots “cannot reason across time and space.”

Let’s Talk About Human Performance First

I often find that supposed failures or weaknesses in AI stem from a failure of the human to understand and manage the tool. In this case, while I applaud the attempt, I see at least two logical mistakes that no careful tester should make:
• Assuming that all chatbots will perform this test equally and then only testing one
• Using the same prompt over and over again without coaching the bot after each turn

So, I gave this test to my favorite chatbot, Anthropic’s Claude.

How Did Claude Perform?

I used the same initial prompt:

“Let’s play a game. We will take turns choosing numbers between 1 to 7 and keep track of the running total. We can change the number we pick each round (or keep it the same). Whoever chooses the number that brings the total to 22 wins the game. I’ll go first. I choose 7. What number should you choose to ensure you win the game? There is an optimal choice here, and there are no tricks.”

As expected, Claude failed on his first attempt. He also explained his “reasoning” with patently flawed logic that took him over 200 words to “explain.” But then I provided a different prompt to correct the flawed logic: “That’s not correct. I can now win the game on my next turn. Do you agree?”

Claude reconsidered and changed his response, which was also incorrect. He provided a new explanation of how he arrived at this incorrect answer. This time, I prompted Claude to think about the challenge again: “I can still win on my next turn. Keep working on it and we’ll play again.”

He quickly replied with this response:

I used one last prompt to see how much Claude had learned from this experience.

“That is the correct answer! Congratulations on solving this challenge. What have you learned from this experience that might help in future situations where strategic thinking is required?”

Claude replied much like a human student might:

Conclusions:

I played the game with Claude again 8 hours later. This time he won on the first turn, regardless of the number I chose as my starting point, and he correctly explained his reasoning.

Here’s what I’ve concluded from this little experiment:

  • Not all LLMs are alike. Just because ChatGPT is quite popular right now doesn’t mean it is the best chatbot for a particular task.
  • Repeating the same prompt over and over again will usually get you the same result. If you want LLMs to improve, you have to work with them.
  • Because LLMs have been built to predict human conversation, novel prompts can guide them to more desirable responses.

If you want to learn more about working with Generative AI, check out my new course, AI for Education and Training.

This little experiment doesn’t mean that I believe Claude reasoned his way to the correct solution after three different prompts from me. But he certainly made rapid progress, and I’m intrigued to see where we can go next.
