Tuesday, August 24, 2010

Ratings

I recently found out about the bridge ratings provided by Chris Champion, which he calls the "Power ratings". Clearly, an accurate current rating, recalculated each month, is preferable to the accumulation of points, which is the ACBL's way of keeping track (and making money), especially when those points suffer from inflation. Champion's power ratings are similar to the "Lehman" ratings provided on the OKBridge site. When I was playing on OKBridge (ten or more years ago), I was part of a committee set up to "fix" the Lehman rating system. As far as I know, the ratings have never been fixed, although my friend Stephen Pickett calculates a more accurate version of the rating for both OKBridge and BBO. He's more interested in finding cheaters than in determining who the best players are (clue: very good players, such as world champions, have ratings north of 60; cheaters have ratings in the 70s).

The fundamental flaw, in my opinion, of the Lehman rating scheme was that it assumed all ratings involved to be accurate – that is, with no margin of error. This was true even for a player who had never before played on OKBridge and who was therefore assigned a rating of 50 (average). In practice, this made OKBridge a very unfriendly place for new players: nobody wanted to play with them (unless they happened to be called Meckstroth, Rodwell, etc.), because they were likely to perform below average and thus bring down their partner's rating. Many tables were limited to certain rating ranges. Although I embraced the Lehmans when I first joined OKBridge, they were ultimately what drove me away.
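
To make that flaw concrete, here's a toy calculation – my own illustration with made-up numbers, not the actual Lehman formula – in which a pair's expected session score is simply the mean of the two players' ratings. If the newcomer's nominal 50 is treated as exact, every bit of the estimation error lands on the partner:

    # Toy model (not the Lehman formula): pair score = mean of the two ratings,
    # so the partner's implied rating = 2 * session_score - newcomer's rating.
    session_score = 48.0
    newbie_assumed, newbie_true = 50.0, 42.0   # made-up numbers

    partner_implied = 2 * session_score - newbie_assumed  # 46.0: partner looks below average
    partner_actual = 2 * session_score - newbie_true      # 54.0: partner actually played well
    print(partner_implied, partner_actual)

No wonder nobody wanted to sit down opposite an unknown 50.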

And so it is with the Champion ratings, although to a lesser extent, I believe. Champion requires a minimum number of games with rated partners in order to be rated (see his explanation). He explains that his method is based on the notion of an "average level of play" for a particular player. Every game that he uses is therefore a sample of that average level of play. This is akin to sampling voters to predict an election outcome. But whereas election predictions are always accompanied by a margin of error – usually plus or minus 3%* – the power ratings are not. We are therefore inclined to infer that player A is better than player B even though A's power rating is only 0.01 higher. This is obviously nonsense – the difference is not statistically significant. There are also secondary effects arising from the fact that players play different numbers of games with different partners. Suppose player A partners player B for 50 games and player C for 50 games. Should B and C therefore contribute equally to A's rating? Not if B has played 100 other games while C has played only 12. It's not clear to me whether Champion accounts for this in his workings, but I suspect not. And because iterative statistical calculations are very sensitive to errors (he notes that each month may require 2000 iterations before convergence is reached), what seem like very minor issues can have major effects.
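
To put a number on that missing margin of error, here's a minimal sketch. It assumes each session percentage is an independent sample around the player's true level, with a per-session standard deviation that I've guessed at 5 percentage points – Champion publishes no such figure:

    import math

    def rating_margin(sessions, session_sd=5.0, z=1.96):
        # 95% margin of error for a rating that is the mean of `sessions`
        # independent session scores; session_sd = 5.0 is my guess, not a
        # published figure.
        return z * session_sd / math.sqrt(sessions)

    print(rating_margin(50))    # ~1.4: +/-1.4 rating points after 50 games
    print(rating_margin(120))   # ~0.9: +/-0.9 after 120 games

Under that assumption, a rating based on 50 games is only good to about plus or minus 1.4 points, so a difference of 0.01 between two players is pure noise.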

Another major problem, which afflicts all rating schemes, is "pooling". If a group of friends play bridge together every Saturday evening, they will (potentially) be rated. But those ratings will be meaningless unless at least one of the players regularly plays with other rated players. The Lehman solution is simply to include all the data, whether significant or not. The Champion solution is not to rate players unless they have played against 12 other rated players. But that is somewhat arbitrary. I believe it's better to quote the error bounds (confidence interval) of the ratings. A player from a small pool, such as the group of friends, will have a rating, but it will be accompanied by a low degree of confidence and will not contribute to the ratings of other players as strongly as the ratings of ubiquitous players do.
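
Detecting such pools is really a graph-connectivity question. Here's a sketch (my own, not Champion's method, and the session format is hypothetical): link any two players who have appeared in the same session, and a pool is then a connected component of that graph:

    from collections import defaultdict

    def find_pools(sessions):
        # `sessions` is a list of sets of player names (a hypothetical format).
        # Two players are linked if they have ever appeared in the same session;
        # each pool is a connected component of the resulting graph.
        adj = defaultdict(set)
        for players in sessions:
            players = list(players)
            for i, a in enumerate(players):
                for b in players[i + 1:]:
                    adj[a].add(b)
                    adj[b].add(a)
        pools, seen = [], set()
        for start in adj:
            if start in seen:
                continue
            stack, pool = [start], set()
            while stack:
                p = stack.pop()
                if p in pool:
                    continue
                pool.add(p)
                stack.extend(adj[p] - pool)
            seen |= pool
            pools.append(pool)
        return pools

    # The Saturday-evening foursome comes out as its own isolated pool:
    print(find_pools([{"Ann", "Bob", "Cal", "Dee"},
                      {"Eve", "Fay", "Gus", "Hal"},
                      {"Eve", "Fay", "Ian", "Jan"}]))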

Another source of error is that the quality of the E/W opponents is not taken into account. This is probably very minor, but we might note that in seeded games of 13 rounds, an A pair typically plays 4 A pairs, 4 B pairs and 5 C pairs, provided of course that there is a uniform distribution of A, B and C pairs. The fact that a particular board may not be played uniformly by A, B and C pairs is also a very minor factor.

Let's take two hypothetical players, whom we'll call X and Y. X has a rating of 31.5 and has played 50 rated games. Y has a rating of 31.0 and has played 120 games. How confident are we that X is actually a better player than Y? This can be derived by combining (subtracting, to be precise) the two normal distributions that represent the probabilities of X's and Y's ratings. The answer we get is that we are 52.55% confident that X is a better player than Y. That's not a lot of confidence, is it? And there are actually five players (in EMBA) ranked between X and Y!
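
For the record, here is the shape of that calculation. It's a sketch: the answer depends entirely on the per-session standard deviation you assume, and I don't know what value reproduces the 52.55% exactly, so `session_sd` is an explicit input:

    import math

    def confidence_x_better(rx, nx, ry, ny, session_sd):
        # Each rating is the mean of n sessions, so its standard error is
        # session_sd / sqrt(n).  The difference X - Y is then normal with
        # mean rx - ry, and we want P(difference > 0).
        se_diff = session_sd * math.sqrt(1.0 / nx + 1.0 / ny)
        z = (rx - ry) / se_diff
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF at z

    # ~0.72 under this particular guess at the spread; larger spreads
    # push the answer toward 50%.
    print(confidence_x_better(31.5, 50, 31.0, 120, session_sd=5.0))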

Now let's look at the top-rated player, P, and the lowest shown player, A (actually the one at the 25th percentile of rated players). How sure are we that P is a better player than A? We are 73% confident. But that isn't certainty. Of course, common sense and experience tell us that P is better than A. For a start, P is a Grand Life Master. But aren't we looking for something a little more precise than our gut feel and ACBL bridge rankings?

There are other possible sources of error in this calculation. Perhaps the most obvious is that the raw data on which the ratings are based is about pairs, not about players. There's no true justification for inferring individual ratings from partnership ratings. Some partnerships are symbiotic – the partners click and perform better than expected – and some are not.

One possible source of error, which isn't clear to me at present, depends on what starting values are used each month. Do we assume that last month's ratings are accurate? Or do we go back to arbitrary values based on ACBL masterpoints? Or do we allow for a little randomness? If we always assume last month's values to be accurate, players can get unfairly "typecast", because even if a player performs much better, half the improvement is attributed to the partner. I'd be interested to know how well the iterations converge, that's to say how much residual error remains. If the residual error is relatively small, then the system should eventually (after a year or so, perhaps) converge to a fairly precise list of ratings. But unless the number of sessions used gets into the thousands, the level of confidence in the ratings will always be low. I will be interested to see how much the players move up and down each month. A high degree of mobility will indicate a fluid system that is not overly dependent on believing previous ratings; low mobility will indicate an over-constrained and likely imprecise rating scheme.
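
Since Champion doesn't spell out his algorithm, here is a minimal sketch of one plausible iterative scheme, just to make the convergence questions concrete. It assumes a pair's expected session score is the mean of its two players' ratings, re-estimates each player after netting out the partner, and damps the update (undamped, this kind of iteration can oscillate forever). The `start` argument is where the warm-start-versus-cold-start question shows up:

    def fit_ratings(sessions, start=None, damping=0.5, tol=1e-6, max_iters=5000):
        # `sessions` is a list of (player_a, player_b, score) tuples; the model
        # (my guess, not Champion's published method) is score = (r_a + r_b) / 2,
        # so each session implies r_p = 2 * score - r_partner.
        players = {p for a, b, _ in sessions for p in (a, b)}
        r = {p: (start or {}).get(p, 50.0) for p in players}  # warm or cold start
        iteration, residual = 0, float("inf")
        for iteration in range(1, max_iters + 1):
            new = {}
            for p in players:
                implied = [2 * score - r[b if p == a else a]
                           for a, b, score in sessions if p in (a, b)]
                target = sum(implied) / len(implied)
                new[p] = r[p] + damping * (target - r[p])  # damped update
            residual = max(abs(new[p] - r[p]) for p in players)
            r = new
            if residual < tol:  # residual = largest rating change this pass
                break
        return r, iteration, residual

    ratings, iters, residual = fit_ratings(
        [("W", "X", 58.0), ("W", "Y", 55.0), ("X", "Y", 53.0), ("X", "Z", 47.0)])
    print(iters, residual)  # how many passes it took, and how much error remains

Note that with a scheme like this the answer can genuinely depend on `start` – the system is underdetermined in places – which is exactly the typecasting worry.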

One other error, although you could argue that fixing it would generate a completely new set of ratings, is the (presumed) absence of team results. The handling of IMPs was one of the worst aspects of the Lehman ratings, which simply converted IMPs to match-points and bundled everything in together. A one-IMP gain on a board was considered equivalent to 64%, and a seven-IMP gain was 100%! Most experts believe that IMPs is "real" bridge, while matchpoints is "bad" bridge. I would have more confidence in a rating scheme that predicted how many IMPs a pair would be likely to gain per board.
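
Reconstructing that conversion from the two figures quoted above (the straight line in between is my interpolation, not a published table), the absurdity is easy to see:

    def lehman_percent(imps):
        # IMP-to-matchpoint conversion reconstructed from the two anchors
        # quoted in the post: +1 IMP -> 64%, +7 or more -> 100%.
        if imps == 0:
            return 50.0
        pct = 58.0 + 6.0 * min(abs(imps), 7)  # 1 -> 64, 7 -> 100
        return pct if imps > 0 else 100.0 - pct

    for gain in (0, 1, 3, 7, 13):
        print(gain, lehman_percent(gain))  # a 13-IMP swing scores no more than a 7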

There's one other aspect of ratings based on percentages at bridge sessions that I think makes the ratings somewhat meaningless. For the better players, bridge is all about winning. A good player who senses that he is having a 60% game might take a strategic risk on a late board to get up to 62%, because, from his point of view, a win is so much more likely with 62% than with 60%. He doesn't really mind if it drops him to 58%, because merely scratching holds little interest. But perhaps two-thirds of the time the strategy will backfire, and the ratings will record that pair overall as a 59.3% pair (one-third of 62% plus two-thirds of 58%), despite the fact that their skill level is more consistent with 60%. Admittedly, it's a small point. But these minuscule differences in percentages are what drive the placings in the ratings table.

There's one other issue which I think is a big one: privacy. What if I don't want the entire world to know how often I play bridge in a given period? What if I'm not overjoyed with my rating and believe that I deserve better? Perhaps having a low rating will impact my ability to get good partners. Does Chris Champion really have the right to calculate and disseminate this information on the internet? I don't think so, although I have to admit there are far worse privacy breaches made possible by the internet. And what if I don't see my name on the list? Is it because I'm unrated (haven't played sufficiently with rated players)? Or is it because I'm in the bottom 25% of rated players? And in either case, just how bad am I? What will other people think when they don't see my name?

Having so far dwelt almost entirely on the negative aspects of the ratings, let me now sing their praises a little. They certainly seem to have done well in putting Jeff Meckstroth at the top in ACBL land. Those guys at the top are all so good that it really must be hard to distinguish between them. But I think that if you asked experts who the best players are (as Zeke Jabbour did recently in his ACBL column), Meckstroth would be the likely MVP.

So, how did the ratings do in our own unit (EMBA – 108), on a subjective level? The top three, Pat, Sheila and Bob, clearly belong in everyone's top five, so there's no question of accuracy there. I'd be hard-pressed to distinguish them, but I'm comfortable with the same ordering. There are some names in the top twenty that I'd have expected to see higher, and there are a couple of names I wouldn't expect to see even in the top thirty. And one or perhaps two in the top fifty seem greatly over-rated. I wonder if the ratings take into account the different DODs (degrees of difficulty) between day-time bridge, which tends to include a lot of quite bad unrated players, and evening/weekend bridge, where there are still quite a few unrated players, but they are typically somewhat better. That would be a very subjective adjustment, but perhaps a necessary one.

As always, comments welcome.

* 3% is the margin of error when your sample size is 1067 and you want 95% confidence that the true value is in the range thus defined.
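
For anyone who wants to check that arithmetic, the standard worst-case formula for the margin of error of a proportion gives exactly that figure:

    import math

    n, p, z = 1067, 0.5, 1.96             # worst case p = 0.5, 95% confidence
    moe = z * math.sqrt(p * (1 - p) / n)  # margin of error for a proportion
    print(round(100 * moe, 1))            # 3.0 (percentage points)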

3 comments:

  1. "Lehman ratings" weren't designed to deliver any specific meaning assigned to any number, as to being above or below 50. The only thing that's supposed to matter in interpreting them is to observe the delta: is your own rating going up, or down, from where it was the last time you looked at it? That only indicates if you've been playing above or below your own standard over the past series of recent boards. That's all it was designed to measure, calculated recursively on one's individual performance, and on an assumption that one would be playing with a variety of partners and opponents.

    I know that a lot of people have misunderstood that business about the delta, and have taken a high number to mean more simply that somebody's a good player, or a low number to mean that somebody would be an undesirable partner. Well, they're mistaken!

    It was a temporary feature that happened to become surprisingly popular. It saddens me that you're claiming it drove you away.

    Brad Lehman

  2. See also my introductory article about it, from 14 years ago, here:
    http://www-personal.umich.edu/~bpl/oksimple.html

    1. Hi Brad, as I mentioned, I was a supporter at first. But, unfortunately, as with any statistical measure, there are those who will misinterpret its relevance/accuracy/whatever. Once people (whom I knew from real life were not very good players) started labeling their tables with "52+ only", or some such annotation, I decided enough was enough. Yes, I'm familiar with your article(s). I certainly don't mean in any way to blame you for any of this. I know what you did and what you intended it for. And, as far as it went, your system was good, with the one exception noted in the blog.
