Exact Ratings for Everyone on Lichess

As a computer chess aficionado, I have long been thinking about the disadvantages of the incremental rating methods used for human play (in chess and in other games). The assumption of strength not changing between games is obviously not correct for humans, but it's usually close enough to the truth. This blog post and the work behind it are very interesting.

In theory, it is still possible to get another substantial improvement in accuracy over this model, but it involves complications: instead of making one value (the rating) converge, you need two (or more) values. The simplest model involves a rating and a "volatility" value that determines the spread of the expected outcomes (for example, a high-volatility player would be more likely to beat a player rated much higher, but also more likely to lose to a player rated much lower; obviously this property fundamentally depends on the player pool). But as far as I know, nobody has really bothered making a rating tool (or updating an existing one such as Ordo) to support this, although computer chess has some easily repeatable examples of engines with a noticeably different "volatility".

By the way, I don't think that the 58%/59% numbers you computed are the best reflection of a rating algorithm's accuracy. As far as I can tell from your description in the blog post, the only thing that matters there is the ordering. It tests "is the higher-rated player winning", but disregards by how much that player is higher rated. It is probably better to measure the difference between expected outcome and actual outcome with MSE (mean squared error).
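To make the ordering-versus-MSE point concrete, here is a minimal sketch. It assumes the standard Elo logistic curve on a 400-point scale, which is not necessarily what the blog post used, and the `games` list is made-up illustration data. Stretching every rating gap by a factor of three keeps the ordering (and so the "higher-rated player wins" percentage) identical, while the MSE changes.

```python
def expected_score(rating_a, rating_b):
    """Elo-style expected score of A against B (assumed logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def ordering_accuracy(games):
    """Share of decisive games that were won by the higher-rated player."""
    decisive = [(a, b, s) for a, b, s in games if s != 0.5 and a != b]
    hits = sum(1 for a, b, s in decisive if (a > b) == (s == 1.0))
    return hits / len(decisive)


def mean_squared_error(games):
    """Mean squared gap between the actual score and the expected score."""
    return sum((s - expected_score(a, b)) ** 2 for a, b, s in games) / len(games)


# Hypothetical games: (rating of A, rating of B, score of A: 1, 0.5 or 0).
games = [(1600, 1500, 1.0), (1500, 1600, 0.5), (1800, 1400, 0.0), (1450, 1500, 1.0)]

# Same players and results, but every rating is pushed three times further from 1500.
stretched = [(1500 + 3 * (a - 1500), 1500 + 3 * (b - 1500), s) for a, b, s in games]

print(ordering_accuracy(games), ordering_accuracy(stretched))    # identical
print(mean_squared_error(games), mean_squared_error(stretched))  # different
```

The ordering metric cannot tell the two rating lists apart, whereas MSE penalises the list whose rating gaps are badly scaled.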

justaz

@IndigoEngun You might be conflating volatility and deviation, at least in Glicko-2 terminology. Altho I'm not sure myself XD

Rating Deviation is the confidence interval of a rating. Deviation is the spread of expected outcomes, and a 2000 ± 50 losing to a 1800 ± 200 will not hurt their rating that much, but losing to a 1900 ± 50 will hurt a lot.

Volatility in Glicko-2 is a player's change in rating over time. A low-volatility player will gain and lose fewer points than a high-volatility player. However, when calculating the expected outcome of a game, volatility plays no role. It only plays a role when updating the ratings.

Ordo does calculate the deviations (extremely slowly) by running a bunch of simulations, not quite sure how it works. But Ordo does not have any volatility metric, so you do have a point. Ordo has the incorrect assumption, for our use case, that skill levels are constant.

> I don't think that the 58%/59% numbers you computed are the best reflection of a rating algorithm accuracy.

I agree, check my earlier response in this thread. I'll try doing more rigorous calculations for next month.
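As a rough illustration of the "volatility plays no role in the expected outcome" point, here is a sketch of the expected-score term from Glickman's published Glicko-2 description. The function names are my own, and this is not Lichess's actual code. The only inputs are the two ratings and the opponent's rating deviation; no volatility value appears anywhere.

```python
import math

GLICKO2_SCALE = 173.7178  # converts Glicko-style ratings/RDs to the Glicko-2 scale


def g(phi):
    """Weight that shrinks the rating gap when the opponent's deviation is large."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected_score(rating, opp_rating, opp_rd):
    """Expected score against an opponent, as used in the Glicko-2 update step."""
    mu = (rating - 1500.0) / GLICKO2_SCALE
    mu_j = (opp_rating - 1500.0) / GLICKO2_SCALE
    phi_j = opp_rd / GLICKO2_SCALE
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


# A 2000-rated player against an 1800 with an uncertain vs. a well-known rating.
print(expected_score(2000, 1800, 200))
print(expected_score(2000, 1800, 50))
```

The larger the opponent's deviation, the closer the expected score is pulled towards 0.5.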

LeNoobzer

Cool blog! I also have the problem of rating addiction and only play higher-rated opponents.

Godspeed107

Will this blog be updated for December Rankings?
By the way, cool blog!

F3ynman69

Those lists were only for November, right? Will there be similar lists available moving forward?

justaz

@Godspeed107 @F3ynman69
Yep!

IndigoEngun

@justaz said in #12:

> You might be conflating volatility and deviation, at least in Glicko-2 terminology.

I was not.

Glicko deviation is about how uncertain the rating is. It mostly makes sense for an incremental algorithm, so that players for which the rating is better known act as "anchors", their rating being moved around less upon updates by results against players with an uncertain rating.

Whereas volatility, as I used the term, describes an actual player characteristic.

We want rating systems to boil players down to one number, so that they can be ordered according to it, so there is no real way to have an algorithm that could predict A > B > C > A even if results showed this. But the assumptions go further. The assumption is that all players with rating R1 have the same win probabilities against players of rating R2.

This assumption is false. If you take two 1600 players that are evenly matched in a direct confrontation, it's nonetheless very likely that one of them will be much more likely than the other to snag a win against 1800 rated players - or to lose against a 1400 rated player.

If you want a concrete example, compare Stockfish Classic with contempt 0 against Stockfish Classic with contempt 40. The former will perform better against stronger opponents, but worse against weaker opponents. With proper NPS adjustment you can get them to be perfectly evenly matched in head to head, and so logically deserving of the same rating, but with an actually quite different real distribution of win probabilities against other opponents. In such cases, selecting mostly weaker or stronger opponents can spuriously affect the predictions of traditional rating systems.

Obviously, there are further simplifying assumptions about the shape of the win probability distribution. Designing a rating system is at its core an exercise in simplifying raw data into something more practical, and striking a balance between accuracy and practicality. Ignoring what I termed above "volatility" is the simplification with the most impact on accuracy.
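A toy sketch of what such a two-value model could look like. Everything in it is an assumption made for illustration: the `spread` parameter, the way the two spreads are combined, and the numbers are not taken from any existing rating tool. Two 1600 players are exactly even head to head, yet the one with the wider spread scores better against an 1800 and worse against a 1400.

```python
import math


def win_probability(rating_a, spread_a, rating_b, spread_b):
    """Hypothetical two-parameter model: Elo curve with a pair-dependent width.

    spread = 1.0 reproduces the usual 400-point Elo curve; larger values
    flatten the curve, so results depend less on the rating gap.
    """
    width = 400.0 * math.sqrt((spread_a ** 2 + spread_b ** 2) / 2.0)
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / width))


steady = (1600, 1.0)    # hypothetical low-volatility player
volatile = (1600, 2.0)  # hypothetical high-volatility player, same rating

print(win_probability(*steady, *volatile))      # even head to head
print(win_probability(*steady, 1800, 1.0))      # steady player vs. a stronger opponent
print(win_probability(*volatile, 1800, 1.0))    # volatile player does better here
print(win_probability(*steady, 1400, 1.0))      # steady player vs. a weaker opponent
print(win_probability(*volatile, 1400, 1.0))    # volatile player does worse here
```

A single-number rating system has to give both players the same prediction against every opponent, which is exactly the simplification described above.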

@justaz said in #12:

> Ordo does not have any volatility metric, so you do have a point. Ordo has the incorrect assumption, for our use case, that skill levels are constant.

Yes, that's indeed something that may make the "volatility" concept further relevant for analysis of human data during a "rating period", although my mention of it started from something that is observable when comparing chess engines, which have a near constant strength across games.

Any rating following the Ordo idea will fundamentally be a rating of the performance displayed during the rating period, rather than the best estimate of strength at the very end of the rating period.

@justaz said in #12:

> I agree, check my earlier response in this thread. I'll try doing more rigorous calculations for next month.

I haven't read the earlier comments yet; I will do so later on.

pemula64

Does it only calculate based on the last month? Won't it be less accurate because of the smaller number of games used as a sample?

justaz

@pemula64 said in #18:

> Does it only calculate based on the last month? Won't it be less accurate because of the smaller number of games used as a sample?

That's correct, yet it's still arguably better than Lichess ratings. I'll experiment with bigger and smaller rating windows later on.

petri999

@justaz said in #10:

> The key factor in Ordo being slow is that it's doing gradient descent on a single CPU thread. If I use my brain maybe I'll manage to put it on the GPU with Torch or something, and get a 1000x speedup. As for the period size, I'll look into it when I compute the whole backlog of ratings, maybe 3 months is perfect, maybe 3 weeks, we will see. Keep an eye out for my next rating release when December database drops!

Using a GPU is a good idea. But maximum likelihood in a multivariable problem is a complex problem, and gradient descent is most likely a good choice for the optimization. There are obviously other methods, and some of them might be better suited to parallel execution than others, such as differential evolution, which is a bit of an ad hoc development but popular.

The most popular Go site in the West is KGS, and it has a small article on what they do: https://www.gokgs.com/help/rmath.html

New ratings are published at 15-second intervals, so after a game you have to wait a while to see whether you got to the next kyu/dan level. Obviously, the handicap system further complicates matters.
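For readers who want to see what "maximum likelihood by gradient descent" means here, below is a minimal single-threaded sketch. It assumes a plain Bradley-Terry/Elo logistic model and is not Ordo's or KGS's actual implementation; the function name, learning rate, step count and sample games are all made up for illustration.

```python
import math


def fit_ratings(games, steps=2000, learning_rate=0.1):
    """Fit one rating per player by gradient ascent on the log-likelihood.

    games: list of (player_a, player_b, score_a) with score_a in {1, 0.5, 0}.
    Returns ratings on an Elo-like scale, centred on 1500.
    """
    players = sorted({p for a, b, _ in games for p in (a, b)})
    theta = {p: 0.0 for p in players}  # strengths in natural-log odds units
    for _ in range(steps):
        grad = {p: 0.0 for p in players}
        for a, b, score_a in games:
            p_a = 1.0 / (1.0 + math.exp(theta[b] - theta[a]))
            grad[a] += score_a - p_a  # actual minus expected score
            grad[b] -= score_a - p_a
        for p in players:
            theta[p] += learning_rate * grad[p]
        # Only rating differences matter, so pin the average to zero.
        mean = sum(theta.values()) / len(theta)
        for p in players:
            theta[p] -= mean
    return {p: 1500.0 + theta[p] * 400.0 / math.log(10.0) for p in players}


# Hypothetical results: every pairing is played four times, 3-1 for the first player.
games = (
    [("A", "B", 1.0)] * 3 + [("A", "B", 0.0)]
    + [("B", "C", 1.0)] * 3 + [("B", "C", 0.0)]
    + [("A", "C", 1.0)] * 3 + [("A", "C", 0.0)]
)
print(fit_ratings(games))
```

The inner loop over games is the part that parallelises well: each game contributes an independent term to the gradient, which is why moving it to a GPU is plausible.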
