@justaz said in #12:
> You might be conflating volatility and deviation, at least in Glicko-2 terminology.
I was not.
Glicko deviation is about how uncertain the rating is. It mostly makes sense for an incremental algorithm, so that players for which the rating is better known act as "anchors", their rating being moved around less upon updates by results against players with an uncertain rating.
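To make the "anchor" effect concrete, here is a minimal sketch of a single-game update. It uses the simpler original Glicko (Glicko-1) formulas rather than Glicko-2 and skips the step that inflates RD between rating periods. The function names and the example numbers are my own choices for illustration.

```python
import math

Q = math.log(10) / 400.0  # Glicko-1 scaling constant


def g(rd):
    """Attenuation factor: the more uncertain the opponent, the smaller g."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def glicko1_update(r, rd, opp_r, opp_rd, score):
    """Return (new rating, new RD) after one game (score: 1, 0.5 or 0)."""
    g_opp = g(opp_rd)
    expected = 1.0 / (1.0 + 10.0 ** (-g_opp * (r - opp_r) / 400.0))
    d2_inv = Q**2 * g_opp**2 * expected * (1.0 - expected)  # this is 1 / d^2
    new_var = 1.0 / (1.0 / rd**2 + d2_inv)
    new_r = r + Q * new_var * g_opp * (score - expected)
    return new_r, math.sqrt(new_var)


# Same win against an equally rated opponent, three situations.
settled_gain = glicko1_update(1500, 50, 1500, 30, 1.0)[0] - 1500
fresh_gain = glicko1_update(1500, 350, 1500, 30, 1.0)[0] - 1500
vs_uncertain_gain = glicko1_update(1500, 50, 1500, 350, 1.0)[0] - 1500

print(f"well-known player (RD 50) beats well-known opponent:  {settled_gain:+.1f}")
print(f"uncertain player (RD 350) beats well-known opponent:  {fresh_gain:+.1f}")
print(f"well-known player (RD 50) beats uncertain opponent:   {vs_uncertain_gain:+.1f}")
```

The player with a small RD barely moves, and moves even less when the opponent's own rating is uncertain. None of this says anything about how the player's results are distributed against different opponents, which is the separate point I make next.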
Whereas volatility, as I used the term, describes an actual player characteristic.
We want rating systems to boil players down to one number, so that they can be ordered according to it, so there is no real way to have an algorithm that could predict A > B > C > A even if results showed this. But the assumptions go further. The assumption is that all players with rating R1 have the same win probabilities against players of rating R2.
This assumption is false. If you take two 1600 players who are evenly matched in a direct confrontation, it's nonetheless very likely that one of them will be much better than the other at snagging a win against 1800-rated players, or much more prone to losing against 1400-rated players.
If you want a concrete example, compare Stockfish Classic with contempt 0 against Stockfish Classic with contempt 40. The former will perform better against stronger opponents, but worse against weaker opponents. With proper NPS adjustment you can get them to be perfectly evenly matched head-to-head, and so logically deserving of the same rating, but with quite different actual win probabilities against other opponents. In such cases, selecting mostly weaker or stronger opponents can spuriously affect the predictions of traditional rating systems.
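A toy illustration of that last sentence, not a model of contempt itself: I assume two hypothetical players who both score exactly 50% against 1600-rated opposition, but one of them has a flatter expected-score curve (the `scale` parameter below is my own invention, not something a standard Elo-type system has). The sketch then computes the performance rating a standard 400-scale system would assign from results against a single pool of opponents.

```python
import math


def expected_score(rating, opp_rating, scale=400.0):
    """Logistic expected score; `scale` is a hypothetical per-player spread."""
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / scale))


def performance_rating(opp_rating, score):
    """Rating a standard 400-scale system infers from a score against one pool."""
    return opp_rating + 400.0 * math.log10(score / (1.0 - score))


for name, scale in [("standard curve", 400.0), ("flatter curve", 600.0)]:
    for pool in (1400, 1600, 1800):
        score = expected_score(1600, pool, scale)
        perf = performance_rating(pool, score)
        print(f"{name:14s} vs {pool}: score {score:.3f}, performance {perf:.0f}")
```

Under these assumptions the standard-curve player comes out at 1600 whatever the pool, while the flatter-curve player looks like roughly 1667 when fed only 1800 opponents and roughly 1533 when fed only 1400 opponents, despite being a perfect match for the other player head-to-head.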
Obviously, there are further simplifying assumptions about the shape of the win probability distribution. Designing a rating system is at its core an exercise in simplifying raw data into something more practical, and striking a balance between accuracy and practicality. Ignoring what I termed above "volatility" is the simplification with the most impact on accuracy.
@justaz said in #12:
> Ordo does not have any volatility metric, so you do have a point. Ordo has the incorrect assumption, for our use case, that skill levels are constant.
Yes, that's indeed something that may make the "volatility" concept even more relevant for analysis of human data during a "rating period", although my mention of it started from something that is observable when comparing chess engines, which have near-constant strength across games.
Any rating following the Ordo idea will fundamentally be a rating of the performance displayed during the rating period, rather than the best estimate of strength at the very end of the rating period.
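A small sketch of what I mean, with made-up numbers: a hypothetical player is 1500 strength in the first half of the period and 1700 in the second half, always facing 1600-rated opponents. A single rating fitted to the pooled results (which is the spirit of a whole-period fit, not Ordo's actual algorithm) reflects the average performance, not the strength at the end.

```python
import math


def expected_score(rating, opp_rating):
    """Standard logistic (Elo-style) expected score."""
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))


def performance_rating(opp_rating, score):
    """Single rating that explains a pooled score against one opponent level."""
    return opp_rating + 400.0 * math.log10(score / (1.0 - score))


first_half = expected_score(1500, 1600)   # strength early in the period
second_half = expected_score(1700, 1600)  # strength late in the period
pooled = (first_half + second_half) / 2.0

period_rating = performance_rating(1600, pooled)
print(f"pooled score {pooled:.3f} -> period rating {period_rating:.0f}, end strength 1700")
```

The pooled fit lands at 1600, a hundred points below where the player actually is when the period ends.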
@justaz said in #12:
> I agree, check my earlier response in this thread. I'll try doing more rigorous calculations for next month.
I haven't read the earlier comments yet; I will do so later on.