<Comment deleted by user>
I think this is arguing over semantics, because regardless of the exact win rates listed, bigger is still better. The relative comparison of win % between moves still tells the story of which move is practically better, just not at the exact % indicated. When you include all games you get mismatches, sure, but you can hover your mouse over the color bar to see the average rating. For e4 in the 2000 category I see a 2098 average for white and 2098 for black. Sure, 8% of games (at least in your data) appeared to be 2200s vs 1800s, plus some 1800s vs 2200s (colors reversed, same opening), which will make the opening appear closer to 50-50 than it actually is, but the ranking of moves still generally holds.
Interesting observation, that the statistics can be biased by the average-rating window. In general, you should look at the masters database or at deep engine evaluation to find the best human moves, so I consider this rating-window picker only a titbit. But now I will keep in mind that it's even less reliable.
In the proposed solution, will it actually be easier to find better moves? How should we interpret the fact that in your sample database 1.d4 gets a 58.8% win rate while all black responses together have a 54% win rate? In the current solution it is 52.6% for white and 47.5% for black, which adds up to ~100%. (This 12% difference obviously comes from the fact that you operate on two databases that differ from the unified database currently used by Lichess.)
Why are all the totals wrong??
Quite simply, if we were to run a regression, we would need to account for the ratings of both players (and the time control), which makes total sense.
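A minimal sketch of what such a regression could look like (Python; the logistic specification and the coefficients are my own toy assumptions, not anything Lichess actually runs):

```python
import math

# Sketch of the regression idea: model the probability that white wins as a
# function of the move played AND the rating difference, so the move's
# coefficient is read net of rating mismatches. All coefficients are made up.
def p_white_wins(move_is_d4, rating_diff, b0=0.0, b_move=0.05, b_diff=0.005):
    """Toy logistic model with hypothetical coefficients, for illustration only."""
    z = b0 + b_move * move_is_d4 + b_diff * rating_diff
    return 1.0 / (1.0 + math.exp(-z))

# Same 100-point rating edge, with and without 1.d4; the gap between the two
# probabilities is the move's effect holding the mismatch fixed.
print(p_white_wins(1, 100))
print(p_white_wins(0, 100))
```

The point is only that the move indicator and the rating gap enter as separate terms, so the estimate for the move is no longer contaminated by who happened to be paired against whom.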
I really appreciate your analysis (especially the attention to detail and the transparent methodology), and the results, where the win rates for different responses to e4 equalize with your correction to the data assembly, show that you are really onto something.
However, I'm personally convinced that the mechanism you're describing (and trying to correct for) isn't the main reason win rates fail to reflect the "practical" value of a given move; after all, your correction shifts the relative win rates by at most 3%, which isn't all that significant.
I believe win rates are skewed more by not correcting for the rating difference in individual games. There may be players on Lichess who literally always play the same opening (or use Zen mode), but most players will adjust their opening choices (and probably also their middle- and endgame strategy) based on opponent rating.
Hence it would make sense either to correct the win rate of each individual game for the expected win rate based on Elo, OR to simply take into account only games where the rating difference is less than, say, 50 points.
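For illustration, here is a minimal sketch of both corrections (Python; the three games and their ratings are made up):

```python
def elo_expected(rating_a, rating_b):
    """Expected score of player A vs player B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Toy data: (white_rating, black_rating, white_score), score in {1, 0.5, 0}
games = [(2100, 2100, 1.0), (2050, 2250, 0.0), (2150, 2120, 0.5)]

# Correction 1: average the deviation of each result from the Elo expectation.
overperformance = sum(s - elo_expected(w, b) for w, b, s in games) / len(games)
print(overperformance)  # positive means white did better than ratings predict

# Correction 2: keep only near-equal matchups (rating difference < 50 points).
close = [(w, b, s) for w, b, s in games if abs(w - b) < 50]
print(sum(s for _, _, s in close) / len(close))
```

Either way, a win against a much weaker opponent no longer counts the same as a win against an equal one.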
Thank you very much for the comments, this is a very constructive discussion!
I want to start by sharing some preliminary results using the blitz subsample by @n1000, which is great btw: I compared a) the statistics of the opening explorer on Lichess, b) similarly constructed statistics using the subsample, and c) statistics based on my proposed methodology using the subsample. It turns out that using the subsample is enough to remove the bias of the explorer statistics, and running my proposed methodology on the subsample yields only a very marginal improvement (at least for the first move by white and black; I'm now looking into the second moves). I believe this is because the subsample does not include blitz games played in tournaments, which results in fewer "type-A" players entering the sample and distorting the statistics: matchups between highly different ratings are more common in tournaments than in regular games. Or perhaps my mind just gravitates towards the explanation that fits my theory. Anyway, based on what I've seen so far, I'd conclude that simply removing the tournament games would be an equally good option.
I'm also quite certain that adding a restriction on rating differences would work about equally well, although I have not run this analysis thoroughly (yet).
The ideal choice of adjustment depends on which question we want to ask of the data:
A) "What are the win rates of white players in the 2000-2200 range when they face a typical distribution of opponents (worse, equal and better) in regular games and tournaments?" I'd say this question is better answered by the method I proposed.
B) "What are the win rates of white players in the 2000-2200 range when they face opponents relatively close to their own strength?" This question is better answered by imposing a restriction on rating differences (or by filtering out tournament games). The merits of this approach have been highlighted by many, such as @Tim404 and @mkubecek. I think @mvhk's methodology would also work, by deflating the effect of wins against lower-rated opponents (and of losses against higher-rated opponents).
In practice the various adjustment proposals may not differ much in their results, and ease of implementation should be a deciding factor.
@Viethai2012 said in #14:
> Why are all the totals wrong??
The total includes all moves, not just the 5 most common ones presented in the tables. I didn't show the less common moves because their sample sizes are smaller and their standard errors are larger, so there's a lot of noise there.
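To put a rough number on that: under the usual binomial approximation, the standard error of an observed win rate is sqrt(p(1-p)/n), so rare moves come with much wider error bars (Python sketch; the game counts are made up):

```python
import math

# Standard error of a win rate p observed over n games (binomial approximation).
def win_rate_se(p, n):
    return math.sqrt(p * (1 - p) / n)

# A popular first move vs a rare sideline at the same nominal 50% win rate:
print(win_rate_se(0.5, 100_000))  # roughly 0.16 percentage points
print(win_rate_se(0.5, 200))      # roughly 3.5 percentage points
```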
@AragornFitrion said in #13:
> Interesting observation, that the statistics can be biased by the average-rating window. In general, you should look at the masters database or at deep engine evaluation to find the best human moves, so I consider this rating-window picker only a titbit. But now I will keep in mind that it's even less reliable.
>
> In the proposed solution, will it actually be easier to find better moves? How should we interpret the fact that in your sample database 1.d4 gets a 58.8% win rate while all black responses together have a 54% win rate? In the current solution it is 52.6% for white and 47.5% for black, which adds up to ~100%. (This 12% difference obviously comes from the fact that you operate on two databases that differ from the unified database currently used by Lichess.)
These numbers do look strange at first; I had the same reaction when I first saw them. They are actually a direct result of the methodology I propose:
a) In the Lichess opening explorer as it is, the 2000-2200 category comprises all the games played by players whose average rating is between 2000 and 2200. So for every win there is a loss, and the percentages sum to 100.
b) When I select only the players who individually have a rating of 2000-2200 and look at what they do with white and (separately) with black, the win rates are substantially above 50% for both colors (a bit more for white, naturally). That is because players in that range tend to win more often than the average player: win rates go up with rating, and if I remember correctly this range is firmly in the top 10%.
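To make the difference between the two selections concrete, here is a toy sketch (Python; the three games are invented):

```python
games = [
    # (white_rating, black_rating, white_score)
    (2100, 2100, 1.0),
    (2150, 1850, 1.0),  # strong player beats a weaker one with white...
    (1850, 2150, 0.0),  # ...and the mirror mismatch with colors reversed
]

# a) Explorer-style selection: games whose AVERAGE rating is in 2000-2200.
avg_sample = [g for g in games if 2000 <= (g[0] + g[1]) / 2 < 2200]
print(sum(g[2] for g in avg_sample) / len(avg_sample))  # mirrored mismatches pull this toward 50%

# b) My selection: games where the WHITE player individually is 2000-2200.
indiv_sample = [g for g in games if 2000 <= g[0] < 2200]
print(sum(g[2] for g in indiv_sample) / len(indiv_sample))  # above-average players score above 50%
```

In selection (a) the two mismatched games cancel each other out; in selection (b) only the in-range player's results count, so the win rate sits above 50%.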
For example, my blitz rating is usually between 2000 and 2100, and I have 53% wins and 43% losses with white (a 58% win rate). With black I have a 45.5% win rate, so the sum of my win rates is above 100%. I clearly do worse with black than the 2017 players of my range, but it is very likely that a 2100-rated player in 2017 was stronger than me in 2024, since there has been an influx of new players in recent years. Also, I usually occupy the lower half of the 2000-2200 range, so those fluctuating between 2100 and 2200 should have better win rates.
@Awesome-Days said in #12:
> I think this is arguing over semantics, because regardless of the exact win rates listed, bigger is still better. The relative comparison of win % between moves still tells the story of which move is practically better, just not at the exact % indicated. When you include all games you get mismatches, sure, but you can hover your mouse over the color bar to see the average rating. For e4 in the 2000 category I see a 2098 average for white and 2098 for black. Sure, 8% of games (at least in your data) appeared to be 2200s vs 1800s, plus some 1800s vs 2200s (colors reversed, same opening), which will make the opening appear closer to 50-50 than it actually is, but the ranking of moves still generally holds.
Bigger is better, but the size of the difference matters too: I may be willing to modify my opening repertoire to chase a +5% advantage, but a +1% advantage is not worth the effort (for me at least).
