Ratings Are Broken

@Toscani said in #68:

> When it comes to a true rating system, we must never be permitted to pick our opponents. When we meet an opponent again we must alternate colors. So a common record should be maintained by FIDE to know which color we must play against a given player. With the electronic age, this should be automated.
>
> The "Erdős number" should be included in the pairings. If a rated player enters a tournament and has never played the opponent, then the first rating change should be treated as if it was a provisional rating. If two players have played each other before, then the ratings should not be considered provisional.

Strange idea. This may work only for online servers and non-tournament games. In all other cases, the current games in the current tournament are much more important for pairing than the players' old statistics against each other.


peppie23 edited

@A_Kireev said in #69:
You have a wrong idea about what exactly is going on. I'll try to explain.

The stability of a rating system depends on the difference between how much rating is taken out by the improvers and how much rating is injected by the disimprovers.
Years ago FIDE made a calculated estimate of the number of improvers, how much the improvers improve per year on average, the number of disimprovers, and how much the disimprovers disimprove per year on average.

FIDE calculated that (number of improvers × average improvement per year) is about twice (number of disimprovers × average decline per year). So FIDE solved this by introducing different K-factors. The rating injected by the disimprovers would be doubled so the balance is restored.

However, FIDE is now noticing that its initial estimates no longer hold. The number of improvers has exploded and the improvement per year has also gone up drastically.

Now you can try to solve this with a bigger K-factor for the improvers (as some here claimed), but you will probably need something like K = 200.

Winning 200 points with a single game makes no sense. We need some other solution. Going back to a minimum rating of 2200 will most likely work, but then only a small fraction of players will play FIDE-rated games. FIDE will most likely go bankrupt if it no longer gets the money from the rated games played by players below 2200.

So something else is needed. My proposal is to use TPR (tournament performance rating) for all under-18 players with fewer than 24 rated games in the last year, so that they can't impact the rating system and they get very quick corrections. This needs to be worked out by simulations, but I believe there is potential in it, as it decreases the impact of the improvers on the total rating system. We need to find a new balance, or get as close to one as possible.
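
To make the K-factor argument concrete, here is a minimal sketch of the standard Elo update, assuming the usual logistic expectation on a 400-point scale. The function names and the K values in the loop are illustrative assumptions, not a reproduction of FIDE's regulations.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expectation (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def rating_change(rating: float, opponent: float, score: float, k: float) -> float:
    """Points gained (or lost) from one game: K * (actual - expected)."""
    return k * (score - expected_score(rating, opponent))


if __name__ == "__main__":
    # A win against an equally rated opponent is worth K / 2 points.
    for k in (10, 20, 40, 200):
        print(k, rating_change(1800, 1800, 1.0, k))

    # Unequal K-factors: the winner (K = 40) gains more than the
    # loser (K = 20) gives up, so points are injected into the pool.
    gain = rating_change(1800, 1800, 1.0, 40)
    loss = rating_change(1800, 1800, 0.0, 20)
    print(gain + loss)
```

With equal K-factors the exchange is zero-sum; with K = 40 for the winner and K = 20 for the loser, the pool as a whole gains points. At K = 200 a win against an equal opponent is worth 100 points, and a win against a much stronger opponent approaches the full 200.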


dboing

edited

or the model could include time windows of histories of pairings (distinguishing IDs from ratings, to define how much of the maximal pool has actually been effectively sampled... many games at the same rating but against different individuals, or many games with the same individuals). also, what if we want to play the same person for many games, and ratings be damned....

should rating only serve to put everyone on some ladder? or just be about expected game difficulty given entrant ratings.. I guess that depends on whether chess is only happening in tournaments or has a background murmur of activity, where single games across the pool do happen without tournament wrapping, or within an effective pool, a subset of the maximal pool.

this brings us back to what people want the rating to help with.. then mitigate that with what systematic measurement systems at the population scale can do.. not what we might abuse them to be.

the idea of true rating or true strength might be wanting too much, and is not defined in any rating system anyway.

I am not my rating.. never assumed that.. and I only want my online rating to give other players that don't know me some sense of the difficulty of my play if they were to accept playing with me... I would not want to be either too much underrated or too much overrated. but I care mostly about having a good game... so I do need ratings... But I am not talking about tournaments. We have seen that they have their own constraints (some essential to their physical possibility).

Actually the idea of looking at a history time window for relative pool "isotropic" sampling was a bit coarse. For the Glicko** rating system convergence results (pool size? or number of game events? not sure) to hold, I am not sure that the pairings of each individual at some current rating (or during a time window, also not sure) need to sample all possible ratings. perhaps there is only the need, given a fuzziness or confidence level at the population ordering level (ladder focus), that there be enough connectivity between strata (I said given a confidence level, as I would bet it would affect the size of the bands of ratings around any rating that need to be connected to other bands).

I think in tournaments (from afar and outside), the tiering obligation would also be dependent on that, although I think the purpose of the tournament is to clarify, from those band approximations of ordering, through a limited entry pool and duration of the whole event, a clearer ordering... There might be population size and time window criticality in any context... anyway.

** (and/or that kind of rating system based on individual estimate assumptions, not whole population)

Toscani

Each extra move needed to complete a game should have an impact on the rating points exchanged.
A checkmate in 10 moves suggests a more unbalanced pairing than a checkmate in 40 moves.
That's a 1:4 ratio that could be a factor to consider in chess ratings.


dboing

https://lichess.org/stat/rating/distribution/blitz

andreas111111x

A 1500 cannot win against a 2100,
so this means that nowadays 15 percent of all new 1500 players are cheaters.
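
For reference, a worked number under the plain Elo model (an assumption for illustration; Lichess uses Glicko-2, where the expectation also depends on rating deviation). With $R_A = 1500$ and $R_B = 2100$, the expected score of the lower-rated player is:

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
    = \frac{1}{1 + 10^{(2100 - 1500)/400}}
    = \frac{1}{1 + 10^{1.5}}
    \approx 0.031
```

So the model gives the 1500 an expected score of roughly 3% per game (draws counting as half a point), which is small but not zero.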


Achilleus30

I've also known very highly rated players to start over with a new user handle to see how high a rating they could achieve. This is dishonest because one's rating is supposed to reflect the sum of all one's performance over a lifetime. Furthermore, when these 2200-2300 players log in with a brand new handle, they are initially way underrated, which leads to them beating anyone who is 0-2000, which means they drive all these lower-rated players' ratings down while giving themselves an ego boost. They are also likely to do this when their recent performance rating has been particularly good, meaning they're on a hot streak. They feed their own egos dishonestly, at the expense of others'.

pawngrid

1500 rating should be engineered to be consistent for a given skill level, so that there wouldn't be inflation/deflation.

I'm not interested in how good I am relative to average players. I want to know whether or not I've improved. That's difficult for small rating improvements, since my recent rating increase could merely be the result of rating inflation due to an influx of newer players.


Toscani

Take a look at post #60. @andreas111111x
Ever since I installed the extension, I have found it more pleasing to play chess online. The average opponent rating and the win % have to seem logical.

I don't think it's normal to have statistics like:
Quote: NaN% (0)
Avg Opponent: 0
So I'm not starting to block them. I cannot explain why they show zero, when they obviously have games played.
I was really hoping someone in the world would be able to explain why it shows zero statistics.

If a 1500 has Avg Opponent 2100, and a percentage of 50% wins, then their rating should be increasing fast.

Toscani (1595) +3
Quote: 45.11% (5267) (I have 45.11% wins when my average opponents are at a rating of 1677)
Avg Opponent: 1677.21 (My pairing rating range needs to be lowered a bit so that I have a win of 50%).
Best Win: 2173 (I don't know how this is calculated. I cannot find that number in my profile)
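
As a rough check on numbers like these, the Elo expectation can be inverted to get the rating at which a given score fraction against a given average opponent would be exactly "as expected". This is only a sketch: the function name is hypothetical, the win % is treated as a score fraction (which ignores draws), and the plain Elo curve is assumed rather than Lichess's Glicko-2.

```python
import math


def implied_performance(avg_opponent: float, score_fraction: float) -> float:
    """Rating whose Elo expected score vs. avg_opponent equals score_fraction."""
    if not 0.0 < score_fraction < 1.0:
        # 0% or 100% (or no games at all) has no finite implied rating.
        raise ValueError("score_fraction must be strictly between 0 and 1")
    return avg_opponent + 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


if __name__ == "__main__":
    # Figures from the post: 45.11% against an average opponent of 1677.21.
    print(round(implied_performance(1677.21, 0.4511)))
    # A player scoring 50% against 2100-rated opponents performs at 2100.
    print(round(implied_performance(2100.0, 0.5)))
```

Under these assumptions, 45.11% against 1677.21 corresponds to a performance in the low 1640s, and a 1500 scoring 50% against 2100-rated opponents is performing at 2100, which is why their rating would climb quickly.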


Byuffe edited

Lichess Elo, and more generally online chess Elo (possibly even offline Elo, but that's debatable), has been known to be too high and inflated for many years, and now the author wants to solve this problem by inflating it even harder. That's insane. On the other hand, it's quite welcome that players who are 2000 Elo on Lichess get pushed closer to their FIDE rating.

Okay, there is one issue: Let's say two players are both 2000 FIDE Elo and 2100 Lichess Elo and retain their actual playing strength over the coming decade. One of those two players decides not to play on Lichess while the other one continues to play on Lichess. Now, with the influx of good new players, the one who plays will get punished for playing and will drop some MMR.

The solution is to add some MMR decay and some increase of MMR uncertainty. Also, isn't there an uncertainty implementation of Elo in place at Lichess, so that new players gain and lose more MMR in their very first games? If that is the case, then the statement that Lichess's implementation of Elo is zero-sum wouldn't even hold true.
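
On the uncertainty point: Lichess ratings are based on Glicko-2, which tracks a rating deviation (RD) per player. The sketch below uses the simpler Glicko-1 rule for how RD grows during inactivity, purely as an illustration. The function name and the value of `c` are assumptions, not Lichess's actual parameters, and note that this rule widens the uncertainty without decaying the rating itself.

```python
import math

MAX_RD = 350.0  # Glicko-1 RD assigned to an unrated player


def inflate_rd(rd: float, periods_inactive: int, c: float) -> float:
    """Glicko-1 RD growth: sqrt(RD^2 + c^2 * t), capped at MAX_RD."""
    return min(math.sqrt(rd * rd + c * c * periods_inactive), MAX_RD)


if __name__ == "__main__":
    # Assumed c: chosen so that RD = 50 grows back to 350 after 100 periods.
    c = math.sqrt((MAX_RD ** 2 - 50.0 ** 2) / 100.0)
    for t in (0, 10, 50, 100):
        print(t, round(inflate_rd(50.0, t, c), 1))
```

A larger RD means the returning player's rating moves more per game, which is one way an inactive player can be re-aligned with the current pool without an explicit rating decay.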
