Something weird is happening with LLMs and chess


A year ago, there was a lot of talk about large language models (LLMs) playing chess. Word was that if you trained a big enough model on enough text, then you could send it a partially played game, ask it to predict the next move, and it would play at the level of an advanced amateur.

This seemed important. These are “language” models, after all, designed to predict language.

Now, modern LLMs are trained on a sizeable fraction of all the text ever created. This surely includes many chess games. But they weren’t designed to be good at chess. And the games that are available are just lists of moves. Yet people found that LLMs could play all the way through to the end game, with never-before-seen boards.

Did the language models build up some kind of internal representation of board state? And learn how to construct that state from lists of moves in chess’s extremely confusing notation? And how valuable different pieces and positions are? And how to force checkmate in an endgame? And they did this all “by accident”, as part of their goal of predicting general text?

If language models can do all that for chess, then maybe it’s a hint of how they deal with other situations too.

So that was very exciting. A year ago.

Since then, there’s mostly been silence. So I decided to check in and see how things are going. Having done that, I can now report: Weirdly.

What I did

To make LLMs play chess, I sent them prompts like this:

You are a chess grandmaster.
Please choose your next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.
Here is a representation of the position:

[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]

1. e4 e6 2. d3 c5 3. Nf3 Nc6 4. g3 Nf6 5.

I used the output as a move. I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting.
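To make this concrete, here is a minimal sketch of that loop in Python, using the python-chess library and a local Stockfish binary. The query_llm helper is hypothetical (it stands in for whatever API call returns the model's completion), and the time limit and exact Stockfish configuration are assumptions; the text above only specifies the prompt format, that the LLM plays White, and the lowest difficulty setting.

import chess
import chess.engine

# Fixed preamble copied from the example prompt above.
PROMPT_HEADER = """You are a chess grandmaster.
Please choose your next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.
Here is a representation of the position:

[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]

"""

def prompt_for(board: chess.Board) -> str:
    """Header plus the game so far in numbered SAN, ending with the upcoming move number."""
    parts, replay = [], chess.Board()
    for i, move in enumerate(board.move_stack):
        if i % 2 == 0:
            parts.append(f"{i // 2 + 1}.")
        parts.append(replay.san(move))
        replay.push(move)
    parts.append(f"{board.fullmove_number}.")  # ends like "... Nf6 5." with no trailing space
    return PROMPT_HEADER + " ".join(parts)

def play_one_game(query_llm) -> chess.Board:
    """LLM as White vs. Stockfish on its lowest setting; returns the finished board."""
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 0})  # lowest difficulty
    while not board.is_game_over():
        if board.turn == chess.WHITE:  # the LLM always plays White
            san = query_llm(prompt_for(board)).strip()
            board.push_san(san)  # raises if illegal; the real runs constrained or retried moves (see Details)
        else:
            result = engine.play(board, chess.engine.Limit(time=0.1))
            board.push(result.move)
    engine.quit()
    return board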

The first model I tried was llama-3.2-3b. This is a “base model”, meaning it is mostly trained to output text, not to chat with you or obey instructions. It’s quite small by modern standards, with only 3 billion parameters. For reference, GPT-2, released back in 2019, had 1.5 billion parameters, and GPT-4 is rumored to have around 1.8 trillion.

I had it play 50 games, then had a chess engine score each board after each turn in “centipawns”. This is a measure where a pawn is worth 100 points, with adjustments for position. If the game was over, I assigned a score of +1500 if the LLM won, 0 if there was a draw, and -1500 if it lost.
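A sketch of that scoring scheme, again with python-chess. I'm assuming Stockfish is also the engine doing the scoring, and the depth-15 limit and mate-score handling are guesses rather than anything stated above.

import chess
import chess.engine

WIN, DRAW, LOSS = 1500, 0, -1500

def score_position(engine: chess.engine.SimpleEngine, board: chess.Board) -> int:
    """Centipawn evaluation from White's (i.e. the LLM's) point of view."""
    if board.is_game_over():
        result = board.result()  # "1-0", "0-1", or "1/2-1/2"
        return WIN if result == "1-0" else LOSS if result == "0-1" else DRAW
    info = engine.analyse(board, chess.engine.Limit(depth=15))
    return info["score"].white().score(mate_score=WIN)  # map forced mates onto a large value

Calling this after every move yields one evaluation trace per game, which is what the figures below plot.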

The results were:

[Figure: evaluation after each turn over 50 games, llama-3.2-3b vs. Stockfish]

Terrible.

In the above figure, there’s one light line for each game, and the black line shows the per-turn median. The LLM can play standard openings for a few moves but then quickly starts throwing away pieces. It lost every single game, even though Stockfish was on the lowest setting.
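Incidentally, a figure like that is only a few lines of matplotlib. In this sketch, traces is assumed to be a list of per-game score lists like the ones produced by the scoring function above.

import numpy as np
import matplotlib.pyplot as plt

def plot_traces(traces):
    """One faint line per game, plus the per-turn median in black."""
    for trace in traces:
        plt.plot(trace, color="tab:blue", alpha=0.2)
    longest = max(len(t) for t in traces)
    medians = [np.median([t[i] for t in traces if len(t) > i]) for i in range(longest)]
    plt.plot(medians, color="black", linewidth=2)
    plt.xlabel("turn")
    plt.ylabel("evaluation (centipawns)")
    plt.show()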

Maybe that model is too small? So I got llama-3.1-70b, which is a similar model but with 70 billion parameters instead of 3 billion. The results were:

[Figure: evaluation after each turn, llama-3.1-70b vs. Stockfish]

Terrible. A little better, but still extremely bad.

Next I tried llama-3.1-70b-instruct, a similar model, except trained to be better at following instructions. The results were:

[Figure: evaluation after each turn, llama-3.1-70b-instruct vs. Stockfish]

Terrible.

Maybe there’s something wrong with the Llama models or datasets? So I tried Qwen-2.5-72b.

[Figure: evaluation after each turn, Qwen-2.5-72b vs. Stockfish]

Terrible.

Maybe Qwen is somehow defective too? So I tried command-r-v01, a 35 billion parameter model.

[Figure: evaluation after each turn, command-r-v01 vs. Stockfish]

Terrible.

And then I tried gemma-2-27b.

[Figure: evaluation after each turn, gemma-2-27b vs. Stockfish]

Terrible.

And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky. I only ran 10 trials since AI companies have inexplicably neglected to send me free API keys and this was costing The Automator money. The results were:

[Figure: evaluation after each turn over 10 games, gpt-3.5-turbo-instruct vs. Stockfish]

Excellent. Very, very good.

Even if you raise Stockfish’s level a few clicks, this model will still win every game.

Moving on… I next tried gpt-3.5-turbo, a model that’s similar, except tuned to be more chatty and conversational.

[Figure: evaluation after each turn, gpt-3.5-turbo vs. Stockfish]

Terrible.

And then I tried gpt-4o-mini, which is a newer chat model.

[Figure: evaluation after each turn, gpt-4o-mini vs. Stockfish]

Terrible.

And then I tried gpt-4o, a bigger chat model.

[Figure: evaluation after each turn, gpt-4o vs. Stockfish]

Terrible.

It lost every single game, though it lost a bit more slowly.

Finally, I tried o1-mini, a model that’s supposed to be able to solve complex tasks. (I’m too poor for o1.)

[Figure: evaluation after each turn, o1-mini vs. Stockfish]

Terrible.

So, umm:

Model                     Quality
Llama-3.2-3b              Terrible
Llama-3.2-3b-instruct     Terrible
Llama-3.1-70b             Terrible
Llama-3.1-70b-instruct    Terrible
Qwen-2.5-72b              Terrible
command-r-v01             Terrible
gemma-2-27b               Terrible
gemma-2-27b-it            Terrible
gpt-3.5-turbo-instruct    Excellent
gpt-3.5-turbo             Terrible
gpt-4o-mini               Terrible
gpt-4o                    Terrible
o1-mini                   Terrible

And, uhh:

[Figure: results across all models]

Notice anything? Any patterns jump out at you?

Discussion

There are lots of people on the internet who have tried to get LLMs to play chess. The history seems to go something like this: a burst of excitement about a year ago, when gpt-3.5-turbo-instruct turned out to play decent chess, followed by mostly silence.

I can only assume that lots of other people are experimenting with recent models, getting terrible results, and then mostly not saying anything. I haven’t seen anyone say explicitly that only gpt-3.5-turbo-instruct is good at chess. No other LLM is remotely close.

To be fair, a year ago, many people did notice that gpt-3.5-turbo-instruct was much better than gpt-3.5-turbo. Many speculated at the time that this is because gpt-3.5-turbo was subject to additional tuning to be good at chatting.

That might be true. Here’s a comparison of three models where we have similar versions with or without additional chat tuning.

[Figure: base vs. instruction-tuned versions of three models]

(Again, do not be confused by the name gpt-3.5-turbo-instruct. I stress that this is more like a base model than gpt-3.5-turbo. This is the opposite of the naming scheme everyone else uses where “instruct” or “it” means more tuning to be good at chatting.)

In all cases, additional instruction tuning makes the model worse. But the difference is very small in two cases, and enormous in the other.

Possible theories

I can think of four possible explanations.

Theory 1: Base models at sufficient scale can play chess, but instruction tuning destroys it.

This would be consistent with our data. But I did manage to get llama-3.1-405b to play a couple games. Despite being larger than gpt-3.5-turbo, it was still terrible.

Theory 2: gpt-3.5-turbo-instruct was trained on more chess games.

All models were clearly trained on a lot of chess games. But it’s hard to know exactly how many.

Theory 3: There’s something particular about different transformer architectures.

I doubt this, but it could be that for some reason, Llama type models are uniquely bad at chess.

Theory 4: There’s “competition” between different types of data.

We know that transformers trained specifically on chess games can be extremely good at chess. Maybe gpt-3.5-turbo-instruct happens to have been trained on a higher fraction of chess games, so it decided to dedicate a larger fraction of its parameters to chess.

That is, maybe LLMs sort of have little “chess subnetworks” hidden inside of them, but the size of the subnetworks depends on the fraction of data that was chess. (If this theory were true, we should probably expect that big enough models should become good at chess, provided they are trained on enough chess games, even if the fraction of chess games is low.)

Details

I did things this way (i.e. by working with standard algebraic notation) because this is how people got good results two years ago, and in preliminary experiments I also found it to work best.

If you want to know exactly how I did things, here are some words: I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is. For the open models I manually generated the set of legal moves and then used grammars to constrain the models, so they always generated legal moves. Since OpenAI is lame and doesn’t support full grammars, for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly. For the chat models llama-3.1-70b-instruct, gemma-2-27b-it, gpt-3.5-turbo, gpt-4o-mini, and gpt-4o I changed the system prompt to “You are a chess grandmaster. You will be given a partially completed game. After seeing it, you should choose the next move.” It’s impossible to change the system prompt for o1-mini, so I didn’t. I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models. The fact that OpenAI has “open” as part of their name sure made this paragraph hard to write.
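For the closed models, the try-10-times-then-pick-randomly logic looks roughly like this. The query_openai helper is a hypothetical wrapper around the completion API, and the way the reply gets cleaned up before the legality check is an assumption.

import random
import chess

def move_from_closed_model(board: chess.Board, prompt: str, query_openai, max_tries: int = 10) -> chess.Move:
    """Ask the model repeatedly for a legal SAN move; fall back to a random legal move."""
    legal_sans = {board.san(m): m for m in board.legal_moves}
    for _ in range(max_tries):
        reply = query_openai(prompt).strip()
        candidate = reply.split()[0].rstrip(".") if reply else ""
        if candidate in legal_sans:
            return legal_sans[candidate]
    return random.choice(list(legal_sans.values()))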

Token weirdness

One extremely strange thing I noticed was that if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1. e4 e5 2.” (without the space) and let the model generate the space itself. Huh?

After some confusion, I’m pretty sure this is because of the tokenizer. Look at how the Llama tokenizer breaks up a string of moves:

[Image: the Llama tokenizer's segmentation of a sequence of moves]

After the “1.”, the tokenizer produces “ e” as a single token. That’s not the same as a space token followed by an “e” token. So ending the prompt with a space forces the model into a token sequence it rarely saw during training, which leads to bad predictions.
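You can see this with any Llama-family tokenizer from the transformers library (the checkpoint name below is just an example; use whichever one you have access to):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
print(tok.tokenize("1. e4 e5 2. Nf3"))
# The space typically gets attached to the following character, so you see a single
# token for " e" (often displayed as "Ġe") rather than a space token followed by "e".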

The right way to deal with this is “token healing”—to delete the last token of the input and then do constrained generation over all strings that start with the deleted stuff. But I couldn’t figure out any easy way to do that. So, instead I left the space out and modified the grammar so that the model could generate a space (or not), then one of the current legal moves, and then another space (or not).
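Concretely, generating a llama.cpp-style GBNF grammar with the spaces made optional looks roughly like this; treat it as a sketch of the idea rather than the exact grammar from these experiments.

import chess

def move_grammar(board: chess.Board) -> str:
    """GBNF grammar: optional space, then any currently legal SAN move, then an optional space."""
    sans = sorted(board.san(m) for m in board.legal_moves)
    alternatives = " | ".join(f'"{san}"' for san in sans)
    return f'root ::= " "? move " "?\nmove ::= {alternatives}\n'

board = chess.Board()
board.push_san("e4")
board.push_san("e5")
print(move_grammar(board))
# root ::= " "? move " "?
# move ::= "Ba6" | "Bb5" | "Bc4" | ... | "Nf3" | ... | "h4"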

P.S.

Some people have asked to see all the games from gpt-3.5-turbo-instruct. Behold: games 1 through 10.