Why Opening Statistics Are Wrong

Definitely the binning and the pair averaging are likely to lose a lot of information. I even wonder if that is why I noticed recently that, in correspondence at least, they started averaging even the single pair of ratings for the last games in the explorer. In the past we could see that game's actual pair of ratings. Might have been a fluke. I wonder if I need to read to the end with all the figures.

I think the evolving chess player benchmark should not be the noodle stuck on the ceiling but some kind of self-play. I hope I am not going astray. I need hypotheses while reading like that, whether right or wrong; then reading becomes a debate of sorts, or some confirmation by another thinker going into the same region of thought. So don't mind me, just journaling here.

OK, I've got no time to read the examples. But OP, if I did not get it yet, let me know, and then I might. I think Maia has it wrong in the same way you are proposing. They discarded a lot of information from the full joint probability problem. And you just might have put a pin on it in better language or exposition. Unfortunately reading is hard for me. Lots of internal thought competition. But I think you are right.

Toadofsky

@marcusbuffett said in #23:

> This is an incredibly interesting post, especially as someone who has dealt extensively with opening data from Lichess (I run chessbook.com). I'm going to use the proposed method here in our next iteration of our openings database.

I'd like to see which openings tend to be more or less resilient as compared to the ratings-based expected outcome. Perhaps I should prepare different opening repertoires based upon my opponent's rating?

dboing

edited

> The fix: Instead of grouping the games based on the average rating of both players, we should group by the rating of white players if we want to see things from white’s point of view. Similarly, we should group by the rating of players with black, if we want to see things from their point of view. Think of this like the personal statistics tab in the Opening Explorer, where you get to choose between your games with white and your games with black. In the example above, if we construct the “2000-2200” white category by aggregating all the games of B and C (and nothing more), we can easily see that both X and Y come with a win rate of 50%, equal to the assumed truth.

This should have had some header for those of us wanting to skip the parts of the stream of text that might not be needed in a first-pass reading, if we already agree with the distortion potential from either your introduction or our own experience and previous thinking about this. I almost missed it, as it was not made salient at all.

This is a long blog, about a game where visual information is part of the core, meaning some of us might like spatial cues. I know the figures might sort of do that, but those are part of the illustration section, not the main ideation, so they were not spatial enough in this case. I had to be hinted by a previous post mentioning a proposal before I went looking for that segment of the blog stream (it seems like a stream to me).

A few headers to show the articulation of the blog might have helped me, or helped you help me be able to read at that level. This is not about the ideas you proposed, just about the form here. The Markdown that Lichess allows can do that helpful spatialization for long articles. Your blog is worth that.

Shapeless text is not an obligation in blogs, compared to, say, study text modules (an example picked out of the blue, purely uniformly and independently at random), or here in the forum. I could multiply the quoting to refer to the blog.

But a blog is not the same level of communication as a discussion in the heat of things. I mean long blogs, which have actual segments and some articulation and flow beyond the lower level of the stream.
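
For readers who prefer code, the fix quoted at the top of this comment can be sketched roughly as below. This is only an illustration: the tuple layout, the `win_rates` and `bin_label` names, the 200-point bin width and the four sample games are all assumptions made up here, not data or code from the blog post.

```python
from collections import defaultdict

def bin_label(rating, width=200):
    """Label the rating bin, e.g. 2100 -> '2000-2200'."""
    low = int(rating) // width * width
    return f"{low}-{low + width}"

def win_rates(games, group_by="white"):
    """Average White score per (rating bin, move).

    games: iterable of (white_rating, black_rating, move, white_score),
    where white_score is 1, 0.5 or 0.
    group_by: "white" bins on White's rating, "average" on the pair average.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (bin, move) -> [score sum, games]
    for white, black, move, score in games:
        rating = white if group_by == "white" else (white + black) / 2
        cell = totals[(bin_label(rating), move)]
        cell[0] += score
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical games, invented for illustration only.
games = [
    (2100, 2100, "X", 1.0),
    (2100, 2100, "X", 0.0),
    (2100, 1950, "Y", 1.0),
    (2250, 2100, "Y", 1.0),  # White is outside 2000-2200, the pair average is inside
]

by_average = win_rates(games, group_by="average")
by_white = win_rates(games, group_by="white")
```

In this made-up data the last game lands in the 2000-2200 bin when grouping by average, even though White is rated 2250; grouping by White's rating moves it to the 2200-2400 bin instead.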

D2D4C2C4

@dboing said in #33:

> This should have had some header for those of us wanting to skip the parts of the stream of text that might not be needed in a first-pass reading, if we already agree with the distortion potential from either your introduction or our own experience and previous thinking about this. I almost missed it, as it was not made salient at all.
>
> This is a long blog, about a game where visual information is part of the core, meaning some of us might like spatial cues. I know the figures might sort of do that, but those are part of the illustration section, not the main ideation, so they were not spatial enough in this case. I had to be hinted by a previous post mentioning a proposal before I went looking for that segment of the blog stream (it seems like a stream to me).
>
> A few headers to show the articulation of the blog might have helped me, or helped you help me be able to read at that level. This is not about the ideas you proposed, just about the form here. The Markdown that Lichess allows can do that helpful spatialization for long articles. Your blog is worth that.
>
> Shapeless text is not an obligation in blogs, compared to, say, study text modules (an example picked out of the blue, purely uniformly and independently at random), or here in the forum. I could multiply the quoting to refer to the blog.
>
> But a blog is not the same level of communication as a discussion in the heat of things. I mean long blogs, which have actual segments and some articulation and flow beyond the lower level of the stream.

Thanks for the comment and the suggestion. Indeed, headers would have helped given the length of the post; I had no idea how long it would turn out to be. I'll keep that in mind for future posts!

GnocchiPup

I wonder if this quick workaround might work.

Instead of filtering 2000 to 2200 to see how 2100 fares in any given move, why not filter it to 2100 to 2200 to avoid the down bias?

As a rough analogy, depending on the quirks of a restaurant, I might order medium, knowing that they are conservative, so that I'll get the medium rare which is what I really wanted in the first place.

I might do some simulations, if time permits.

GnocchiPup

Lol, a simple inspection of your table shows that my quick fix made matters worse.

Sorry

MichaelScotch92

Your premise is correct (that opening statistics are wrong), but your methodology is completely wrong. I don't know how Black having > 50% win rate did not tip you off on that. According to your method, if I beat up some 1200 players and have 100% winrate with some garbage opening that would be more accurate than those games not counting. The only correct way to measure opening performance is by taking the performance rating of players using that opening and comparing it against their average rating. This way you know if the players are performing above or below their rating using these openings.

Fortunately, ChessBase provides these stats. Here you are:

White's first move:

1. e4 +1.44%
1. d4 +1.43%
1. Nf3 +1.23%
1. c4 +1.18%

Unsurprisingly, e4 is best by test.

Black's responses against 1.e4:

1.. c5 -1.22%
1.. e5 -1.46%
1.. e6 -1.28%
1.. c6 -1.36%
1.. d6 -2.15%
1.. d5 -1.59%

c5 is the best, as expected. d6 (or d5) is considerably lower.

And finally, against 1.d4:

1.. Nf6 -1.23%
1.. d5 -1.19%
1.. e6 -1.70%
1.. f5 -1.59%
1.. d6 -1.80%

Which gives the expected result of Nf6 and d5 being the best moves.
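
The comparison proposed in this comment (performance rating of the players using an opening, set against their average rating) could be sketched as follows. This is a minimal sketch under stated assumptions: it uses the standard Elo logistic expectation and a bisection search, and the function names are invented here. The thread does not say how ChessBase computes its percentages, so this should not be read as their method.

```python
def expected_score(player, opponent):
    """Standard Elo expected score of `player` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - player) / 400))

def performance_rating(opponents, total_score, low=0.0, high=4000.0):
    """Rating whose Elo-expected total against `opponents` equals `total_score`.

    Found by bisection. Undefined for a 0% or 100% score, so those are rejected.
    """
    if not 0 < total_score < len(opponents):
        raise ValueError("performance rating is undefined for a 0% or 100% score")
    for _ in range(60):
        mid = (low + high) / 2
        if sum(expected_score(mid, opp) for opp in opponents) < total_score:
            low = mid   # this rating would score too little, look higher
        else:
            high = mid  # this rating would score enough, look lower
    return (low + high) / 2

def rating_edge(own_ratings, opponents, total_score):
    """Performance rating minus the players' average rating (positive = overperforming)."""
    average = sum(own_ratings) / len(own_ratings)
    return performance_rating(opponents, total_score) - average
```

For example, `rating_edge(own, opps, score)` could be computed separately over the games of each opening move. Note that a perfect score against much weaker opponents has no finite performance rating, which is why the sketch rejects it rather than returning a number.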

D2D4C2C4

@MichaelScotch92 said in #37:

> Your premise is correct (that opening statistics are wrong), but your methodology is completely wrong. I don't know how Black having > 50% win rate did not tip you off on that. According to your method, if I beat up some 1200 players and have 100% winrate with some garbage opening that would be more accurate than those games not counting. The only correct way to measure opening performance is by taking the performance rating of players using that opening and comparing it against their average rating. This way you know if the players are performing above or below their rating using these openings.
>
> Fortunately, ChessBase provides these stats. Here you are:
>
> White's first move:
>
> 1. e4 +1.44%
> 1. d4 +1.43%
> 1. Nf3 +1.23%
> 1. c4 +1.18%
>
> Unsurprisingly, e4 is best by test.
>
> Black's responses against 1.e4:
>
> 1.. c5 -1.22%
> 1.. e5 -1.46%
> 1.. e6 -1.28%
> 1.. c6 -1.36%
> 1.. d6 -2.15%
> 1.. d5 -1.59%
>
> c5 is the best, as expected. d6 (or d5) is considerably lower.
>
> And finally, against 1.d4:
>
> 1.. Nf6 -1.23%
> 1.. d5 -1.19%
> 1.. e6 -1.70%
> 1.. f5 -1.59%
> 1.. d6 -1.80%
>
> Which gives the expected result of Nf6 and d5 being the best moves.

In the post I focused on the need to move away from grouping by average rating, towards grouping by white rating to study the statistics from white's point of view. There are many other considerations and improvements that can be made. I have already agreed in the comments that removing opponents with lower ratings is also a useful adjustment, unless you actually want to see which openings are successful against the full distribution of potential opponents, including much lower-rated ones. In that case, it is definitely important at least not to count all wins as equal regardless of the opponent's strength. I would say that using the rating points gained or lost, as one of the first comments suggested, would be an improvement. I may get back to this in the future and change it accordingly.
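
As a rough sketch of the "rating points gained or lost" idea: each game can be scored by a plain Elo update instead of a flat win/loss count. The assumptions here are loud ones: Lichess actually rates with Glicko-2, so the simple Elo formula and the K-factor of 20 below are stand-ins chosen for illustration, and `mean_rating_change` is a name invented for this sketch.

```python
def elo_expected(player, opponent):
    """Standard Elo expected score of `player` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - player) / 400))

def mean_rating_change(games, k=20.0):
    """Average Elo points gained per game.

    games: iterable of (player_rating, opponent_rating, score),
    with score 1, 0.5 or 0 from the player's point of view.
    k: assumed K-factor (hypothetical; not the Lichess rating system).
    """
    games = list(games)
    gained = sum(k * (score - elo_expected(p, o)) for p, o, score in games)
    return gained / len(games)

# A 2000-rated player beating 1200-rated opponents gains almost nothing per game,
# so a 100% win rate against them no longer looks like a strong opening result.
easy_wins = [(2000, 1200, 1.0)] * 10
small_gain = mean_rating_change(easy_wins)
```

Under this weighting, wins are no longer counted as equal regardless of the opponent's strength, which is the adjustment described above.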

D2D4C2C4 edited

We can frame this as an issue of bias and noise. The switch from grouping by the average rating of both players to grouping by white rating is intended to remove the bias induced by players of type A and type D in my example.

The presence of lower-rated opponents, mainly from tournaments, creates primarily noise. It creates bias only if 2000+ white players are more likely to play a different, and perhaps strange and aggressive, move when paired with a much lower-rated opponent. So, if I wanted to look into the statistics of 1.g4, I'd definitely want that adjustment too.

It also creates bias in the general win rates, but what I am mostly interested in is the relative win rates of different moves, that is, their difference from the benchmark.

dboing

edited

I also wonder if your approach might be one of a few, or many, others that simply include in the probability model the extra information that comes with each game. It might depend on the actual nature of the questions that the statistical modeling is geared to shed light on, providing new information by adding previously averaged-out dimensions.

I am not yet sure I have understood the question that your approach asks. But the first part is about the dismissed information, and its potential interpretation when questions become not just about a player but about an opening AND a player. Or maybe it is an opening, a player, and a pool of players...

Maybe the pair of posts 37 and 38 is indeed about making the target questions better explained or defined. Not sure. But 37 is the current well-posed pair of dependent variables considered and question defined. I think it is about using more known factors (we would all agree that the pair of respective ratings in one game, and their difference, is more informative about the game outcome than the average of the pair, or we would not even use ratings based on probability of winning). And then the fog starts for me: how does that interact with the restrictions on which subsets of early moves, or should I say subtrees, are targeted to estimate some quantities derived from the set of all ratings over the corresponding pools of games and their player pairs?

That is where I need to keep reading to figure it out. Don't mind this post if not applicable.

In #37, the question is this: given a definition of an opening repertoire or subtree, one can define, in a pool of games, the subset determined by that specific well-defined subtree criterion. (I would guess some names might have variable actual meaning when talking loosely, but with actual game information they would have well-defined decision-tree boundaries; this is my ignorance, or opening theory blindness and total ineptitude, speaking.) The actual means do not matter: as long as one can define the game subsets, one can then use the average rating "projected" onto that sub-pool and call that the performance rating. I think one can just apply the same computations but restricted to the pool of such games. I forgot whether one can always relate it to the global average rating over the maximal pool. Not important. My point was about defining the flow of information well. That is where I am at. Now I would need to work on the blog. Later.
