Perchance.ai
Can LLMs like ChatGPT Understand Chess?
Assessing Large Language Models

Table of Contents:
Geometric Stability
ChessQA
Fluid Intelligence
LLMs and chess: two things we know a bit about. LLM stands for Large Language Model, the technology behind AI text generators like ChatGPT.
LLM research has been using chess as a sandbox to play in. Chess is not an end in itself here, but a well-defined testbed for assessing what LLMs can and can't do.
One concept that's big in the AI field is explainability: can the LLM justify and explain its outputs? Neural nets are a 'black box', hard or impossible for humans to understand. They involve complicated mathematical operations that can't easily be conceptualized in a way that makes sense. If neural nets are to be trusted, we need them to explain how they arrived at an output. If LLMs are deployed in healthcare, for example, we obviously want to know why the AI gives a certain recommendation. If we don't know, how can we trust it?
Another thing to look at is how an LLM copes with novel situations (generalizability). Will it be effective in a new situation that may not have been in its training data? Automated driving, for example, needs to cope with some random crazy thing happening on the road and demonstrate flexibility.
Adversarial attacks are deliberately chosen inputs designed to confuse an AI, with the aim of altering its decisions for some purpose. Limiting the possibility of interference from such attacks is another focus of LLM research, so adversarial attacks are simulated with the aim of learning how to defend the LLM against them.
So this is why explainability, generalizability and adversarial attacks are being tested in the context of chess: chess is well defined, and there are clear ways to test an LLM's outputs, by comparing them to known chess rules and tactics, Stockfish evals, or annotations by human experts. Using chess as a sandbox has parallels to the way chess was used in the 60s/70s as a model for creating artificial intelligence. The idea was that if we could recreate chess ability on a machine, we would have created genuine intelligence, since chess was supposed to be the peak of human intellect.
This project failed, because chess skill could be created in inhuman ways through chess engines. Engines operate using the minimax algorithm: the engine (Player A) picks the move that maximizes its own advantage, assuming the opponent will reply with the move that minimizes Player A's advantage. Engines can search millions of lines, far more than humans. The effectiveness of that brute-force search combined with minimax meant that chess turned out not to be relevant to developing artificial intelligence.
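To make the minimax idea concrete, here's a minimal sketch on a toy game tree. This is just an illustration of the maximize/minimize alternation described above, not how real engines are built (they add alpha-beta pruning, iterative deepening, and hand-tuned or learned evaluation functions):

```python
def minimax(state, depth, maximizing, children, evaluate):
    """Plain minimax: Player A maximizes the evaluation on its turns,
    and assumes the opponent minimizes it on theirs."""
    kids = children(state)
    if depth == 0 or not kids:
        return evaluate(state)
    if maximizing:
        return max(minimax(k, depth - 1, False, children, evaluate) for k in kids)
    return min(minimax(k, depth - 1, True, children, evaluate) for k in kids)

# Toy game tree: inner nodes are lists of children, leaves are static evals.
tree = [[3, 5], [2, 9]]
children = lambda s: s if isinstance(s, list) else []
evaluate = lambda s: s

# The opponent would answer the second branch with 2, so Player A
# prefers the first branch, guaranteeing 3.
print(minimax(tree, 2, True, children, evaluate))
```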
Very interestingly, chess has made itself relevant to artificial intelligence again, this time as a way of testing how LLMs function. We know that LLMs can produce contextual language very well. However, they struggle with tasks that require clear-cut symbolic, logical thinking.
Let's see how LLMs square up in chess by looking at papers published on the topic in the past four months:
Geometric Stability (Song et al. 2025)
One way of probing how well LLMs operate is through geometric transformations of chess positions: rotating the board 90 degrees, reflecting it, or swapping the colors. This is also an example of an adversarial attack. Rotations are only done in pawnless positions, so the positions are relationally equivalent after the transformation (pawns only move in one direction, so rotating a position containing pawns changes its meaning).
Examples of adversarial attacks. When an LLM did not give similar evals for relationally equivalent positions, this was called a consistency error.
You give the LLM a position and ask for the eval of the best move. Then you rotate/reflect/color-swap the board and ask again. Since these transformations keep the relations between the pieces the same, the LLM should give the same eval for the transformed position.
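The paper doesn't publish its transformation code, but a 90-degree rotation of a FEN's piece-placement field can be sketched with plain string handling (meaningful only for pawnless positions, as noted above):

```python
def rotate_fen_field_90(piece_field):
    """Rotate the piece-placement field of a FEN 90 degrees clockwise.
    Only the board field is handled; side to move, castling rights etc.
    are out of scope for this sketch."""
    # Expand each rank: digits become runs of empty squares ('.')
    rows = []
    for rank in piece_field.split("/"):
        row = []
        for ch in rank:
            row.extend("." * int(ch) if ch.isdigit() else ch)
        rows.append(row)
    # Clockwise rotation of the 8x8 grid: new[r][c] = old[7-c][r]
    rotated = [[rows[7 - c][r] for c in range(8)] for r in range(8)]
    # Re-compress runs of empty squares back into digits
    ranks = []
    for row in rotated:
        out, run = "", 0
        for ch in row:
            if ch == ".":
                run += 1
            else:
                out += (str(run) if run else "") + ch
                run = 0
        ranks.append(out + (str(run) if run else ""))
    return "/".join(ranks)

# A lone white rook on a1 ends up on a8 after one clockwise rotation.
print(rotate_fen_field_90("8/8/8/8/8/8/8/R7"))  # R7/8/8/8/8/8/8/8
```

A sanity check on code like this: applying the rotation four times must reproduce the original string exactly.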
The positions used were generated by playing random moves from the starting position, so the LLMs couldn't rely on positions seen before in their training data.
Another way of trying to trick the LLMs was to present illegal positions: positions reached through illegal moves, two kings, both kings in check, castling while in check, castling when otherwise illegal, etc. To pass, the LLM had to state that the position was not legal and refrain from outputting an eval or a move. It got partial credit for saying the position was not legal but outputting an undefined centipawn eval.
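Some of these illegality checks are cheap to express in code, which is what makes the LLM failures notable. A minimal sketch of two structural checks (the exact checks and thresholds the paper used beyond the list above are not specified; full legality, e.g. the side not to move standing in check, needs a real chess library):

```python
def is_plausible_fen_field(piece_field):
    """Cheap structural checks on a FEN piece-placement field:
    8 ranks, exactly one king per side, no pawns on the back ranks.
    Passing these does NOT guarantee the position is fully legal."""
    ranks = piece_field.split("/")
    if len(ranks) != 8:
        return False
    flat = "".join(ch for ch in piece_field if ch.isalpha())
    if flat.count("K") != 1 or flat.count("k") != 1:
        return False
    # Pawns may never stand on rank 8 (ranks[0]) or rank 1 (ranks[7])
    if any(p in ranks[0] + ranks[7] for p in "Pp"):
        return False
    return True

print(is_plausible_fen_field("8/8/8/8/8/8/8/KKk5"))  # False: two white kings
```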
They also looked at whether the LLM's outputs were consistent across different input formats: FEN, PGN and natural-language descriptions (describing activity, king safety, material balance, salient threats and structural imbalances).
They also created pairs of positions: a position was randomly generated, then one random move was played from it to create a pair (position A, position A + 1 random move). These pairs were filtered so that only pairs with over 90% textual similarity in the piece field of the FEN and a large difference in eval remained. This ensured that similar-looking positions that were actually very different could be used to test how robust the LLMs were.
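That filter could look something like the following sketch. The paper states only the 90% similarity threshold and "a large difference in eval"; the choice of `difflib` ratio as the similarity metric and the 300 cp gap are my assumptions:

```python
import difflib

def keep_pair(fen_a, fen_b, eval_a_cp, eval_b_cp,
              min_similarity=0.90, min_eval_gap_cp=300):
    """Keep a pair if the FEN piece fields are textually very similar
    but the engine evals (in centipawns) diverge sharply."""
    field_a, field_b = fen_a.split()[0], fen_b.split()[0]
    similarity = difflib.SequenceMatcher(None, field_a, field_b).ratio()
    return similarity > min_similarity and abs(eval_a_cp - eval_b_cp) >= min_eval_gap_cp

fen = "8/8/8/8/3Q4/8/8/K6k w - - 0 1"
print(keep_pair(fen, fen, 0, 900))  # identical fields + big eval swing -> kept
```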
The study didn't say what the temperature settings for the LLMs were (for the FEN/PGN/natural-language task it only stated that 'We query the LLM three times under identical prompting (same temperature and instruction style)'). Temperature refers to the randomness of the LLM's output; a temperature of 0 results in the LLM giving essentially the same output for the exact same prompt.
Centipawn loss (the difference between the original position's eval and the new eval). Note the high error for ChatGPT when the board is rotated by 90 degrees.
When given a winning position (>10,000 cp) with one unique winning move, failure to evaluate the position as winning for that side was called tactical blindness. These positions were taken from the Lichess Open Database.
Accuracy error by LLM.
Accuracy of rejecting illegal moves as a valid input.
Consistency errors are defined as >300 cp of eval change between equivalent positions. Hallucination errors are when the LLM doesn't realize a position is invalid. Tactical errors are when the LLM doesn't realize that a winning position with a unique winning move is actually winning.
Tactical errors: LLMs miss unique best moves because they smooth over discrete jumps in the training data. Tactical blows get averaged out, causing the LLM to give a wrong eval.
Consistency errors (a change in eval for a rotated/reflected/color-swapped position that is relationally the same): these stem from equivalent positions being represented as different vectors in the embedding space, because the FEN strings are textually different and are therefore tokenized differently. LLMs memorize local piece relations at the expense of global relations, so relationally identical positions get processed in disparate ways.
Hallucinations (accepting invalid positions) stem from the fact that the set of valid FEN strings is discrete: a string is either valid or invalid, with a hard cutoff. The set of valid FEN strings is also thin, varying along a narrow region. But LLM outputs are designed to be continuous and dense. As a conceptual example: if you imagine the valid FEN strings as a rectangular prism, the LLM puffs this up into an egg shape with a bigger volume, creating an overhang at the edges. Invalid FEN strings that land in the overhang get treated as valid.
You can see that ChatGPT makes the most errors overall. It also has the highest number of tactical inaccuracies. A lot of the errors stem from a lack of consistency when the board is rotated 90 degrees (ChatGPT changes its eval by about 25 pawns). The suggested reason is ChatGPT's large amount of training data: it overfits to a narrow domain, so FEN strings that are not close to its training data are seen as vastly different.
Kimi and DeepSeek have the lowest consistency errors, meaning they maintain stabler evaluations when the board is altered so that the FEN differs but the underlying relations between the pieces stay the same. One suggestion was that these models convert text to tokens more efficiently when dealing with structured data (tokens are the units an LLM processes). The authors, however, said it was likely due to data augmentation: these models were probably trained on geometrically perturbed data specifically to maintain geometric consistency.
Gemini has a low number of tactical errors, said to be due to its emergent chain-of-thought process, which can search more deeply in the latent space. But Gemini is also second lowest in overall accuracy, so it is not that accurate on the whole; it just has the ability to avoid tactical errors (large eval shifts).
Grok wasn't run in reasoning mode because it took more time to produce output and used more tokens than the other models did in reasoning mode, so to balance things they put Grok in standard inference mode. Grok didn't do well at verifying that positions were illegal, nor did it do well at anything else.
Gemini and Claude lead at rejecting invalid board positions, which was said to be because these models were specifically reinforced through human feedback to reject invalid board states. But Claude is middle of the pack when it comes to producing legal moves itself.
The authors noted that performance does not scale with model size and that LLMs are currently functioning as 'brittle pattern matchers rather than grounded reasoners'.
ChessQA (Wen et al. 2025)
The authors of this study created a benchmark called ChessQA, a way to test LLM ability on different metrics, conceptualized as the development of chess skill from rule learning to semantic understanding.
The metrics:
Structural assesses the LLM's knowledge of the rules of chess, using tasks like outputting any legal move, detecting checks, or giving the new board position after a series of moves.
Motifs tests whether the LLM can label a tactical motif in a position, such as pins, skewers and forks. The motif had to be identified either in the given position or after a 1-ply search from it.
Short Tactics tests whether the LLM can find the unique best move in a position containing a short tactical sequence. The positions were taken from the Lichess Puzzles Database.
Positional Judgement was assessed by having the LLM choose the eval of a position. It was multiple choice with options of -400, -200, 0, 200 and 400 centipawns; the real evals were within 50 points of these preset values. The positions came from the Lichess Evals Database, which contains positions paired with their evals.
Semantics tests the ability to choose the appropriate natural-language description given a FEN and an instruction stating which move had just been played (4 options, only one correct). The commentary was taken from ChessBase 17 and filtered so that comments of more than 5 words had to contain at least one defined chess concept, like zugzwang or outpost.
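Because the true evals in the Positional Judgement task sit within 50 cp of one preset option, grading reduces to a nearest-option lookup. A sketch (the tie-breaking rule for exactly-equidistant evals is my assumption; it never triggers under the paper's 50 cp guarantee):

```python
def nearest_option(eval_cp, options=(-400, -200, 0, 200, 400)):
    """Map an engine eval (centipawns) to the closest multiple-choice
    option. Ties go to the option listed first."""
    return min(options, key=lambda opt: abs(opt - eval_cp))

print(nearest_option(170))  # 200: within 50 cp of a preset, so unambiguous
```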
The default sampling parameters of the LLMs were used when testing them on ChessQA.
Performance on different metrics of ChessQA by LLM.
As you can see, ChatGPT is dominant, especially at finding the move in short tactics positions (77%). DeepSeek and Claude follow behind. ChatGPT manages a near-perfect (97%) performance on demonstrating the rules of chess, and high performance on labeling motifs (84%) and picking the proper natural-language comment (80%).
But why did ChatGPT not do so well in the tactical positions in the previous study then, where it had the highest amount of errors?
Firstly, the previous study used ChatGPT-5.1 in general reasoning mode, not thinking mode. This study used two versions of ChatGPT-5, one with thinking enabled and one without. ChatGPT without thinking sits sixth on the chart, well below ChatGPT with thinking, though it is still second best in short-tactics performance. Thinking mode increases output time and breaks the process down into multiple steps to help avoid errors.
Another factor might be that in this study the tactics came from the Lichess Puzzles Database, while in the other study they came from the Lichess Open Database, where real-game positions with one unique winning move were filtered and selected. The difference is that the unique moves from the Open Database might not be simple tactics; they could be more obscure and not fall into classic puzzle categories. So being tested on short tactics of fewer than 6 ply might be easier.
But ChatGPT with thinking also falls down like the rest when it comes to giving the eval of a position, only getting it right 2 out of 5 times on a multiple-choice question.
Fluid Intelligence (Pleiss et al. 2026)
Another way of analyzing the capabilities of LLMs is to see how well they generalize to out-of-distribution problems, meaning data the LLM was not explicitly trained on. Can the LLM solve problems that are not similar to its training set?
Fluid intelligence refers to the ability to solve a novel problem from first principles, rather than by simply recalling previously acquired knowledge and skills. This is analogous to an LLM solving an out-of-distribution problem. Applying solutions based on previous knowledge was referred to as crystallized intelligence (using within-distribution data).
They tested how well ChatGPT can come up with the best move under three conditions: Within Distribution, Near Distribution and Out of Distribution.
Example positions by condition.
Within Distribution positions were positions that appeared at least 1,000 times in the Lichess Masters Database (common opening positions). The authors didn't know exactly what data ChatGPT was trained on, but common opening positions are discussed so frequently that they must be in the training data.
Near Distribution positions were created by playing 10 random moves from a Within Distribution position, producing structurally similar positions that were not explicitly in the training data.
Out of Distribution positions were created by randomly sampling up to 10 pieces for each side, while ensuring the created position was valid. This produces positions that were not in the training distribution.
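The out-of-distribution sampling could be sketched as below. This is an assumption-laden simplification: pawns are skipped to sidestep back-rank rules, and a real pipeline (the paper's actual procedure isn't published) would also verify full legality with a chess library before accepting a position:

```python
import random

def random_piece_field(max_per_side=10, seed=0):
    """Sample an out-of-distribution FEN piece field: one king per side
    plus up to max_per_side-1 random non-pawn pieces each, on distinct
    squares. Full legality checking is deliberately out of scope."""
    rng = random.Random(seed)
    board = ["."] * 64
    free = list(range(64))
    rng.shuffle(free)
    board[free.pop()] = "K"  # kings first, exactly one per side
    board[free.pop()] = "k"
    for symbols in ("QRBN", "qrbn"):
        for _ in range(rng.randint(0, max_per_side - 1)):
            board[free.pop()] = rng.choice(symbols)
    # Assemble the FEN piece field, rank 8 first
    ranks = []
    for r in range(7, -1, -1):
        row, run = "", 0
        for c in range(8):
            ch = board[r * 8 + c]
            if ch == ".":
                run += 1
            else:
                row += (str(run) if run else "") + ch
                run = 0
        ranks.append(row + (str(run) if run else ""))
    return "/".join(ranks)

print(random_piece_field(seed=1))
```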
They quantified performance as the average centipawn loss of the move ChatGPT gave. They used 3 versions of ChatGPT (3.5, 4 and 5) to control for architecture and so that the effects of architectural improvement could be seen directly. If ChatGPT output an illegal move, it was scored as a 1,000 centipawn loss so it could sit on the same scale as legal moves. They first looked at ChatGPT without reasoning mode (a mode designed to help LLMs solve multi-step tasks needing logical reasoning), and set the temperature (randomness) to 0 to ensure deterministic outputs.
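The scoring rule reduces to a few lines. The flat 1,000 cp penalty for illegal moves comes from the paper; computing the loss as best-move eval minus played-move eval (floored at zero) is my assumption about the details:

```python
def average_centipawn_loss(samples, illegal_penalty_cp=1000):
    """samples: list of (best_eval_cp, model_eval_cp) pairs, where
    model_eval_cp is None when the model's move was illegal.
    An illegal move counts as a flat 1000 cp loss."""
    total = 0
    for best_eval_cp, model_eval_cp in samples:
        if model_eval_cp is None:
            total += illegal_penalty_cp
        else:
            total += max(0, best_eval_cp - model_eval_cp)
    return total / len(samples)

# One perfect move and one illegal move average out to 500 cp of loss.
print(average_centipawn_loss([(120, 120), (50, None)]))
```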
As you can see, ChatGPT has improved immensely over time at producing the best move within distribution. On the other hand, both near-distribution and out-of-distribution centipawn losses are high, meaning ChatGPT can't generalize to new chess positions it hasn't been trained on. The share of legal moves also decreases as the positions shift from within distribution to out of distribution: ChatGPT only gives a legal move about 1 in 4 times in a position it has never seen.
Removing the effect of illegal moves shows the same pattern: the quality of legal moves in near- and out-of-distribution positions is no better than random moves. In the image on the right, extrapolating move-quality improvement to projected future versions of ChatGPT suggests that the inability to generalize to new positions is likely a flaw in the architecture of current LLMs themselves, and that scaling up LLMs will likely not improve move quality.
Then they looked at how increasing reasoning from minimal to moderate in ChatGPT improved its performance:
Performance with moderate reasoning enabled. Reasoning reduces illegal moves to almost zero and cuts centipawn loss by a large factor. However, centipawn loss is still high for out-of-distribution positions. On the bottom left, the computational cost (tokens used) increases as positions become less familiar; on the bottom right, even though out-of-distribution positions used the most tokens, their performance gain per token was the lowest.
You can see that enabling moderate reasoning helps a lot. However, even with reasoning mode enabled, the centipawn loss for out-of-distribution positions was still very high (2.99 pawns' worth on average), so ChatGPT still blundered the equivalent of a knight in positions it hadn't seen before. There are also diminishing returns from reasoning as the positions shift from within to out of distribution. Reasoning boosts performance on known positions by improving recall, but it does not fundamentally change the subpar performance on out-of-distribution positions.
The authors concluded that this is a qualitative problem, not a quantitative one, and noted that measuring LLM performance on within-distribution tasks can hide serious flaws that emerge when the LLM is tested outside its training distribution. They also said the use of LLMs in areas where safety is critical should be 'approached with caution'.
Summary
- LLM evals change drastically when a position is rotated/reflected/color-swapped. Even though these positions are relationally identical, the centipawn loss is in the hundreds, indicating that LLMs make errors on the scale of piece blunders. ChatGPT shifts its eval by about 8 knights' worth when the board is rotated 90 degrees in pawnless positions.
- ChatGPT in thinking mode can almost perfectly perform rule-appropriate actions, and performs well at labelling motifs, solving short tactical puzzles from the Lichess Puzzles Database and choosing a suitable natural-language description of a position.
- ChatGPT performs well on common memorized positions (e.g. common opening lines like the Ruy Lopez).
- ChatGPT struggles to evaluate out of distribution positions even with moderate reasoning effort.
- Scaling up LLMs by increasing parameters, training data and compute is not likely to reduce errors on out-of-distribution positions (positions never seen before). The authors of the first and last papers noted that addressing this will require changes or additions to LLM architecture, such as hybrid reasoning, ways to ground geometric transformations, neurosymbolic verification layers, and new forms of data representation and inference.
