Saturday, 4 February 2012

On probability of winning a tennis match

I present end to end solution for predicting a winner of ATP tennis match from names of both tennis players, surface and match type. I start with presenting requirements for tennis predictions, then I discuss mathematical model applied for predicting a winner of tennis match, followed by software design and implementation of prediction application. Validating accuracy of tennis prediction models formulates a topic of my future work on tennis markets.

The product of my efforts presented here is a software library written in Scala programming language, which calculates probability of winning a tennis match. This library is open sourced and it is publicly available on GitHub. If you want to play with it, I recommended starting with unit tests, which should make it a little more clear how to use it.
  1. https://github.com/danielkorzekwa/tennis-player-compare
  2. https://github.com/danielkorzekwa/tennis-probability-calculator
  3. https://github.com/danielkorzekwa/atpworldtour-api
Requirements

Tennis prediction application is intended for supporting trading decisions on a Betfair betting exchange markets.

User story:

As a trader,
I want to calculate probability of winning a tennis match,
so that I can support my trading decisions.

Input:
  • Full names of tennis players, e.g. Roger Federer and Milos Raonic. ATP players only.
  • Court surface - CLAY, GRASS, HARD.
  • Match type - THREE_SET_MATCH, FIVE_SET_MATCH.
  • Prediction year - For calculating tennis probability for the current day, specify the current year (2012). To obtain tennis probability at the end of the past year, specify a year from the past, e.g. 2011. In the future, trader would like to specify a full date for prediction, either from the past, e.g. 2 year ago, or from the future, e.g. in 6 months. 
Output:
  • Probability of winning a tennis match for the first specified tennis player - It is a number between 0 and 1 (probability). Probability of winning a tennis match by the second player is 1-probability.
  • Error should be reported if probability cannot be calculated. For example, if a player is not recorded on http://www.atpworldtour.com/ website.
  • In the future, trader may want to know what is the confidence level of winning probability. Consider two games for the same player P1 against players P2 and P3. The probability of winning the game by player P1 against players P2 and P3 may be the same value of 0.4. But it may happen that the player P3 hasn't played any tennis match for last 3 months, which potentially could make us less confident about prediction for a game including this player. Trader would like to know about it. In other words, the probability of winning a game should be defined by two numbers, probability and a confidence level, possibly as a http://en.wikipedia.org/wiki/Normal_distribution.
Tennis prediction model

I considered a few different prediction methods for calculating probability of winning a tennis match, which could be classified into three categories:
I rejected prediction methods based on ATP ratings and logistic regression, which predict match winner from historical match outcomes, not taking into account probabilities of winning a tennis point. I believe the more data the better - there are far more points than matches played in tennis. Nonetheless, I still can see a value in applying logistic regression on top of probabilistic model. For instance, probabilistic model might be used for generating ratings for tennis players, which could be fit with logistic regression into tennis probability function.

From both neural network and markov chain methods I prefer the latter, because it allows for modelling probability of winning a tennis match with two numbers:  probability and uncertainty level, modeled as 'mean' and standard deviation respectively (http://en.wikipedia.org/wiki/Normal_distribution). This lets to differentiate between two players with similar probability of winning a point but with different uncertainty levels, which could be interpreted as player's consistency in a game. 

Eventually, I selected markov chain model, which calculates tennis match winner from probabilities of winning a point on serve and return by both players. Probability of winning a point is calculated from player statistics such as probability of first serve, probability of winning a point on first and second serve and probability of first and second serve return. Then probabilities of winning a game, set, tiebreak and match are calculated. My implementation of this prediction model is based on those two research papers:
Accuracy of this model depends on two factors. First, probability of winning a point on serve by both players must be precise. Every precision error at this level propagates down through probabilities of winning a tennis game, set and match.

Second, the algorithm for calculating game, set and match probabilities, which is based on markov chain, makes assumption that points in tennis match are independently and identically distributed. In other words, whether the probability of winning every point on serve by a player in a tennis match is constant or maybe it depends on a current score or who won last couple of points.  Klassen and Magnum show in their paper (Klaassen, Magnus - 2001 - Are Points in Tennis Independent and Identically distributed Evidence From a Dynamic Binary Panel Data Model), that points are not independent and identically distributed, however deviations are small and making this idd assumption should be fine in many use cases. How making this assumption impacts trader's needs for precise tennis predictions, I don't know yet.

I'm not expecting this model to be very accurate, at least in its generic form, where probability of winning a point is derived purely from percentages of points won on serve and return by both players and by 'average player'. I implemented it to get some reference point I can compare other models with. I'm planning to discount most recent tennis matches less and I also want to compare it with Bradley-Terry, Elo, Glicko and TrueSkill models.

Design and implementation

There are three dedicated components contributing to functionality of comparing ATP tennis players: tennis-trader-compare, atpworldtour-api and tennis-probability-calculator.

Component diagram for comparing ATP tennis players

The tennis-trader-compare component provides a high level interface for predicting outcome of tennis match. Interface definition.

The atpworldtour-api is an adapter for http://www.atpworldtour.com/ tennis statistics, which are used for calculating probability of winning a tennis point. Interface definition.

Tennis-probability-calculator contains functions for calculating probability of winning a point, gem, set, tie break and tennis match. Interface definition.

Flow of control

Sequence diagram for comparing ATP tennis players
  1. Trader calls tennis-trader-compare with player full names, surface, match type and prediction year and receives probability of winning a tennis match.
  2. Tennis-trader-compare component obtains ATP match facts from atpworldtour-api component, e.g. frst serve %,1st serve point won %, 2nd serve point won %.
  3. Tennis-trader-compare calls tennis-probability-calculator to calculate probability of winning a point and tennis match by both players.