
Expected Score of Grandmasters based on Evaluation

How is the SF evaluation used? Is it the eval that appears highest in the game before a certain move?
The evaluation is usually highest near or at the last move played in the game, and often climbs during a winning game.

Or do you first check whether the game was won, and then look for the point after which the evaluation never fell below a certain level? For example, the eval was +0.5 at some point for the winning side, and from that point on it never dropped below +0.5 again?

SF evals are often exaggerated for a human. I often see an eval of +1.5 given by SF, but that is only valid if the human follows a deep line, sometimes 20 or more moves deep, until the material advantage is actually on the board. It is hard to say whether SF based its eval on such a material advantage 20 moves in advance, or whether even SF could not see that for sure and the eval was based on a positional advantage.

Sometimes the eval rises suddenly by +2.0 or more pawn units, only to drop again if the very best move is missed. Often enough such a line isn't a pawn-winning tactic, and it is indeed very difficult to understand, while other +2.0 evals are a pawn-winning tactic. SF's eval is such that +1.0 isn't really "a pawn up"; if you have a genuinely clean extra pawn on the board, the eval at that point is already much higher than +1.0, probably because SF already sees the future endgame 20 moves ahead.

Anyway, good work, interesting graphs.
A similar study measured the average error per move compared to the best move suggested by an older chess engine.
The less a GM's moves deviated from the engine's best moves, the stronger the GM was, and you could estimate Elo strength from the average error. However, that study used a much weaker engine than is available nowadays, and the analysis depth was only 12 or 13 plies. It could have been a Fritz engine, I can't remember.
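The "average error per move" idea can be sketched in a few lines. This is a minimal illustration, not the method of the study mentioned above: it measures, for White's moves only, how much the engine evaluation drops after each move, with evals capped so a single huge blunder doesn't dominate the average. The eval trace at the bottom is invented.

```python
def avg_centipawn_loss_white(evals, cap=1000):
    """Average eval drop per White move, a rough "error per move" measure.

    evals[0] is the engine eval (centipawns, White's view) before White's
    first move; evals[1] is after it; evals[2] after Black's reply, etc.
    White therefore moves at even indices. Evals are clamped to +/-cap.
    """
    losses = []
    for i in range(0, len(evals) - 1, 2):  # White's moves only
        before = max(min(evals[i], cap), -cap)
        after = max(min(evals[i + 1], cap), -cap)
        losses.append(max(before - after, 0))  # best move loses nothing
    return sum(losses) / len(losses) if losses else 0.0

# Invented trace: White makes a small inaccuracy (5 cp) and one 80 cp mistake.
trace = [20, 15, 10, -70, -60]
print(avg_centipawn_loss_white(trace))  # (5 + 80) / 2 = 42.5
```

A real study would average this over many games per player and relate the result to rating.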

Such analysis made it possible to compare the strength of old grandmasters like Lasker or Capablanca with modern GMs.
This analysis came to the conclusion that Magnus Carlsen is indeed the strongest chess player that ever walked the earth. Which is not that surprising: I think modern GMs are indeed stronger than the best chess players of 100 years ago, who didn't have chess engines and chess databases at their disposal.
Such analysis also eliminates Elo deflation/inflation (e.g. was Fischer's FIDE Elo worth the same as today's FIDE Elo?). The influx of new players during the Covid years alone caused some deflation (a drop) in ratings at every level.


@Munich said in #21:

> How is the SF evaluation used? Is it the eval that appears highest in the game before a certain move?
> The evaluation is usually highest near or at the last move played in the game, and often climbs during a winning game.
>
> Or do you first check whether the game was won, and then look for the point after which the evaluation never fell below a certain level? For example, the eval was +0.5 at some point for the winning side, and from that point on it never dropped below +0.5 again?

To find the expected score at an evaluation of +1, I went through the games, and whenever one side reached an advantage of +1, I counted that game and its result. The evaluation was free to rise or drop afterwards.
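The counting method described here can be sketched as follows. This is an illustrative reconstruction, not the author's actual code: a game is counted the first time either side's evaluation reaches the threshold, and that game's final result is tallied from the perspective of the side that reached it. The function name and the sample data are made up.

```python
def expected_score_at(games, threshold_cp=100):
    """Expected score once a side first reaches an eval of +threshold_cp.

    games: list of (evals, result), where evals are centipawn evaluations
    from White's view over the game, and result is 1.0 / 0.5 / 0.0 for White.
    """
    points = 0.0
    count = 0
    for evals, result in games:
        for cp in evals:
            if cp >= threshold_cp:        # White first reaches +1
                points += result
                count += 1
                break
            if cp <= -threshold_cp:       # Black first reaches +1
                points += 1.0 - result    # score from Black's view
                count += 1
                break
    return points / count if count else None

sample = [
    ([20, 60, 110, 150], 1.0),   # White reaches +1 and wins
    ([10, -120, -200], 0.0),     # Black reaches +1 and wins -> scores 1.0
    ([30, 105, 40, 0], 0.5),     # White reaches +1 but only draws
]
print(round(expected_score_at(sample), 3))  # (1.0 + 1.0 + 0.5) / 3 = 0.833
```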

> SF evals are often exaggerated for a human. I often see an eval of +1.5 given by SF, but that is only valid if the human follows a deep line, sometimes 20 or more moves deep, until the material advantage is actually on the board. It is hard to say whether SF based its eval on such a material advantage 20 moves in advance, or whether even SF could not see that for sure and the eval was based on a positional advantage.
>
> Sometimes the eval rises suddenly by +2.0 or more pawn units, only to drop again if the very best move is missed. Often enough such a line isn't a pawn-winning tactic, and it is indeed very difficult to understand, while other +2.0 evals are a pawn-winning tactic. SF's eval is such that +1.0 isn't really "a pawn up"; if you have a genuinely clean extra pawn on the board, the eval at that point is already much higher than +1.0, probably because SF already sees the future endgame 20 moves ahead.

That's exactly why I tried to connect the evaluation to the expected score in human games. I think this gives a better feeling for what to expect. But as others have already mentioned, there is obviously much more to a position than a single evaluation.
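Empirical eval-to-score points like those in the article are often smoothed with a logistic curve; this is a common modelling choice, not necessarily the one used here, and the slope constant below is purely illustrative.

```python
import math

def expected_score(cp, k=0.004):
    """Map a centipawn eval (White's view) to White's expected score in [0, 1].

    Logistic shape: 0.5 at equality, approaching 1.0 as the advantage grows.
    k controls how quickly an advantage converts; k=0.004 is a made-up value.
    """
    return 1.0 / (1.0 + math.exp(-k * cp))

for cp in (0, 100, 200, 400):
    print(cp, round(expected_score(cp), 3))
```

By symmetry, `expected_score(cp) + expected_score(-cp) == 1`, so the same curve serves both sides.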

> Anyway, good work, interesting graphs.
> A similar study measured the average error per move compared to the best move suggested by an older chess engine.
> The less a GM's moves deviated from the engine's best moves, the stronger the GM was, and you could estimate Elo strength from the average error. However, that study used a much weaker engine than is available nowadays, and the analysis depth was only 12 or 13 plies. It could have been a Fritz engine, I can't remember.
>
> Such analysis made it possible to compare the strength of old grandmasters like Lasker or Capablanca with modern GMs.
> This analysis came to the conclusion that Magnus Carlsen is indeed the strongest chess player that ever walked the earth. Which is not that surprising: I think modern GMs are indeed stronger than the best chess players of 100 years ago, who didn't have chess engines and chess databases at their disposal.
> Such analysis also eliminates Elo deflation/inflation (e.g. was Fischer's FIDE Elo worth the same as today's FIDE Elo?). The influx of new players during the Covid years alone caused some deflation (a drop) in ratings at every level.

I find this historical analysis fascinating, but one thing to note is that it's often difficult to adjust for the strength of the opponents. I think Carlsen faces much tougher competition than Capablanca did, so his accuracy might not necessarily be higher even though he is stronger. Also, modern players often aim to add imbalance to a position, preferring to play an objectively slightly worse but much more complicated position. This makes all the comparisons even more difficult.


Style is also a factor. Think of Mikhail Tal: his style was wild tactical positions, so naturally his margin of error was likely higher than that of other masters of old times.

But I have seen a retro-fitted curve, and from it you can predict that a margin of zero error (perfect play) corresponds to the curve leveling out at around 3600 Elo.
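The extrapolation described here can be illustrated with a toy fit: regress rating on average error and read off the intercept at zero error. The (error, Elo) pairs below are invented purely for illustration, and a simple linear fit stands in for whatever curve the original study used.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x, stdlib only.

    Returns (intercept, slope); the intercept is the predicted y at x = 0.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical (average centipawn loss, Elo) pairs -- NOT real data.
acpl = [10, 20, 35, 55, 80]
elo = [3450, 3280, 3050, 2760, 2400]
a, b = linear_fit(acpl, elo)
print(round(a))  # extrapolated Elo at zero error (perfect play)
```

With these made-up points the intercept lands in the mid-3500s; the real curve in the study would be fitted to measured engine and human data.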

There was always the question of how high a FIDE Elo could theoretically be. Well, at some point chess engines become so strong that even if one engine is slightly less accurate, its error only results in a "pawn down" endgame, which it can still hold to a draw. There is probably a threshold of error allowance: you can lose a pawn and still hold a draw, and no matter how strong the stronger engine is, it doesn't matter as long as the weaker engine didn't blunder too hard.

So I am curious whether there will ever be much higher ratings than the strongest chess engines currently hold: around 3630 Elo for SF 17, I think?

This isn't "solving chess" completely, but most games do indeed end in a draw, and when they don't, it is because the engines are forced to play certain (inferior) openings just to get fewer draws (introducing error right at the beginning, in the opening).


Cool. Being the functional programming enthusiast that I am, I'm wondering how difficult it would be to come up with a function that takes a rating and returns your function. So extending your

f :: (k, cp) -> score

to

g :: rating -> (k, cp) -> score

or, maybe more likely

g :: (rangeMin, rangeMax) -> (k, cp) -> score
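The curried shape suggested by these signatures can be sketched in Python with a closure: `make_score_fn` plays the role of `g`, taking a rating and returning the eval-to-score function. How the slope varies with rating is a pure guess for illustration (weaker players convert advantages less reliably, so their curve should be flatter); in practice each rating band's curve would be fitted to that band's games.

```python
import math

def make_score_fn(rating):
    """Return an eval-to-expected-score function tuned to a rating.

    The rating-dependent slope below is hypothetical: it just makes the
    logistic curve flatter for lower-rated players.
    """
    k = 0.004 * rating / 2500.0  # made-up slope scaling

    def score(cp):
        """Map a centipawn eval to an expected score in [0, 1]."""
        return 1.0 / (1.0 + math.exp(-k * cp))

    return score

gm_score = make_score_fn(2700)
club_score = make_score_fn(1600)
print(round(gm_score(100), 3), round(club_score(100), 3))
```

A +1.0 advantage is then worth more expected points in the hands of the GM than the club player, which is the behaviour the rating parameter is meant to capture.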
