Saturday, 4 February 2012

On probability of winning a tennis match

I present end to end solution for predicting a winner of ATP tennis match from names of both tennis players, surface and match type. I start with presenting requirements for tennis predictions, then I discuss mathematical model applied for predicting a winner of tennis match, followed by software design and implementation of prediction application. Validating accuracy of tennis prediction models formulates a topic of my future work on tennis markets.

The product of my efforts presented here is a software library written in Scala programming language, which calculates probability of winning a tennis match. This library is open sourced and it is publicly available on GitHub. If you want to play with it, I recommended starting with unit tests, which should make it a little more clear how to use it.
  1. https://github.com/danielkorzekwa/tennis-player-compare
  2. https://github.com/danielkorzekwa/tennis-probability-calculator
  3. https://github.com/danielkorzekwa/atpworldtour-api
Requirements

Tennis prediction application is intended for supporting trading decisions on a Betfair betting exchange markets.

User story:

As a trader,
I want to calculate probability of winning a tennis match,
so that I can support my trading decisions.

Input:
  • Full names of tennis players, e.g. Roger Federer and Milos Raonic. ATP players only.
  • Court surface - CLAY, GRASS, HARD.
  • Match type - THREE_SET_MATCH, FIVE_SET_MATCH.
  • Prediction year - For calculating tennis probability for the current day, specify the current year (2012). To obtain tennis probability at the end of the past year, specify a year from the past, e.g. 2011. In the future, trader would like to specify a full date for prediction, either from the past, e.g. 2 year ago, or from the future, e.g. in 6 months. 
Output:
  • Probability of winning a tennis match for the first specified tennis player - It is a number between 0 and 1 (probability). Probability of winning a tennis match by the second player is 1-probability.
  • Error should be reported if probability cannot be calculated. For example, if a player is not recorded on http://www.atpworldtour.com/ website.
  • In the future, trader may want to know what is the confidence level of winning probability. Consider two games for the same player P1 against players P2 and P3. The probability of winning the game by player P1 against players P2 and P3 may be the same value of 0.4. But it may happen that the player P3 hasn't played any tennis match for last 3 months, which potentially could make us less confident about prediction for a game including this player. Trader would like to know about it. In other words, the probability of winning a game should be defined by two numbers, probability and a confidence level, possibly as a http://en.wikipedia.org/wiki/Normal_distribution.
Tennis prediction model

I considered a few different prediction methods for calculating probability of winning a tennis match, which could be classified into three categories:
I rejected prediction methods based on ATP ratings and logistic regression, which predict match winner from historical match outcomes, not taking into account probabilities of winning a tennis point. I believe the more data the better - there are far more points than matches played in tennis. Nonetheless, I still can see a value in applying logistic regression on top of probabilistic model. For instance, probabilistic model might be used for generating ratings for tennis players, which could be fit with logistic regression into tennis probability function.

From both neural network and markov chain methods I prefer the latter, because it allows for modelling probability of winning a tennis match with two numbers:  probability and uncertainty level, modeled as 'mean' and standard deviation respectively (http://en.wikipedia.org/wiki/Normal_distribution). This lets to differentiate between two players with similar probability of winning a point but with different uncertainty levels, which could be interpreted as player's consistency in a game. 

Eventually, I selected markov chain model, which calculates tennis match winner from probabilities of winning a point on serve and return by both players. Probability of winning a point is calculated from player statistics such as probability of first serve, probability of winning a point on first and second serve and probability of first and second serve return. Then probabilities of winning a game, set, tiebreak and match are calculated. My implementation of this prediction model is based on those two research papers:
Accuracy of this model depends on two factors. First, probability of winning a point on serve by both players must be precise. Every precision error at this level propagates down through probabilities of winning a tennis game, set and match.

Second, the algorithm for calculating game, set and match probabilities, which is based on markov chain, makes assumption that points in tennis match are independently and identically distributed. In other words, whether the probability of winning every point on serve by a player in a tennis match is constant or maybe it depends on a current score or who won last couple of points.  Klassen and Magnum show in their paper (Klaassen, Magnus - 2001 - Are Points in Tennis Independent and Identically distributed Evidence From a Dynamic Binary Panel Data Model), that points are not independent and identically distributed, however deviations are small and making this idd assumption should be fine in many use cases. How making this assumption impacts trader's needs for precise tennis predictions, I don't know yet.

I'm not expecting this model to be very accurate, at least in its generic form, where probability of winning a point is derived purely from percentages of points won on serve and return by both players and by 'average player'. I implemented it to get some reference point I can compare other models with. I'm planning to discount most recent tennis matches less and I also want to compare it with Bradley-Terry, Elo, Glicko and TrueSkill models.

Design and implementation

There are three dedicated components contributing to functionality of comparing ATP tennis players: tennis-trader-compare, atpworldtour-api and tennis-probability-calculator.

Component diagram for comparing ATP tennis players

The tennis-trader-compare component provides a high level interface for predicting outcome of tennis match. Interface definition.

The atpworldtour-api is an adapter for http://www.atpworldtour.com/ tennis statistics, which are used for calculating probability of winning a tennis point. Interface definition.

Tennis-probability-calculator contains functions for calculating probability of winning a point, gem, set, tie break and tennis match. Interface definition.

Flow of control

Sequence diagram for comparing ATP tennis players
  1. Trader calls tennis-trader-compare with player full names, surface, match type and prediction year and receives probability of winning a tennis match.
  2. Tennis-trader-compare component obtains ATP match facts from atpworldtour-api component, e.g. frst serve %,1st serve point won %, 2nd serve point won %.
  3. Tennis-trader-compare calls tennis-probability-calculator to calculate probability of winning a point and tennis match by both players.

15 comments:

  1. Hello Daniel
    I am enjoying your blog/research and experiments. Well done!
    Martin

    ReplyDelete
  2. I enjoyed reading this article too.

    Post a little more often, please!

    You can read my ideas over at Betfair Pro Trader

    ReplyDelete
  3. Hello,

    Its great to read another article that connects probability and the tennis game. I read a similar article called "Points, Games, and Life Lessons" on www.binfikr.com. Check it out, its interesting :)

    ReplyDelete
  4. Hi,

    How are you retrieving the ATP World Tour statistics here:

    https://github.com/danielkorzekwa/atpworldtour-api

    Is there a freely available api others can access?

    ReplyDelete
  5. How to retrieve tennis matches for year 2011 and store them in a csv file?

    Tennis matchesCSV example:
    event_time, event_name, surface, num_of_sets, playerA,playerB, winner, score, round, duration_minutes, playerATotalServicePointsWon, playerATotalServicePoints, playerBTotalServicePointsWon, playerBTotalServicePoints
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Robin Soderling,Ryan Harrison,Robin Soderling,6-2; 6-4,R32,66,39,52,28,50
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Michael Berrer,Dudi Sela,Michael Berrer,1-6; 7-6(3); 6-2,R32,152,62,99,62,103
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Matthew Ebden,John Millman,Matthew Ebden,4-6; 6-2; 6-4,R32,104,55,85,48,76
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Denis Istomin,Thiemo de Bakker,Denis Istomin,7-6(5); 6-4,R32,93,49,63,48,72
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Mardy Fish,Adrian Mannarino,Mardy Fish,6-1; 6-4,R32,79,39,51,35,68
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Radek Stepanek,Tobias Kamke,Radek Stepanek,5-7; 6-1; 6-4,R32,124,53,80,52,87
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Ricardas Berankis,Arnaud Clement,Ricardas Berankis,6-4; 6-3,R32,86,37,55,37,64
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Florian Mayer,Bernard Tomic,Florian Mayer,6-2; 6-2,R32,55,32,35,28,58
    2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Feliciano Lopez,Philipp Petzschner,Feliciano Lopez,6-4; 7-6(11),R32,100,54,77,47,68
    ...

    Scala application:

    package dk.test

    import dk.atp.api.tournament.GenericTournamentAtpApi
    import dk.atp.api._

    object MatchesLoadSimple extends App {
    //Increase 3000ms connection timeout, if loading tennis matches fails.
    //There are dozens of requests sent to atp website, so
    //even with a higher timeout the loadMatches method may sometimes fail.
    //With a high speed internet connection, loading all tennis matches
    //for 2011 takes between 3-6 seconds.
    var tournamentApi: GenericTournamentAtpApi = new GenericTournamentAtpApi(3000)
    val genericATPMatchesLoader = new GenericATPMatchesLoader(tournamentApi)
    val matches = genericATPMatchesLoader.loadMatches(2011)
    CSVATPMatchesLoader.toCSVFile(matches, "./target/matches.csv")
    }

    I also added this instruction to https://github.com/danielkorzekwa/atpworldtour-api README file.

    I hope this will help.

    Additionally please look into Junit tests, which illustrate how to use atp API.

    ReplyDelete
  6. The instruction above, shows how to get tennis matches, which in fact could be used to obtain all tennis stats for any tennis player and time period.

    But I also implemented API for getting tennis flat facts at the end of a year directly from atp website. This junit test shows how to use it:

    https://github.com/danielkorzekwa/atpworldtour-api/blob/master/src/test/scala/dk/atp/api/facts/GenericAtpFactsApiTest.scala

    However, personally I load tennis matches first and use them to calculate stats I need.

    ReplyDelete
  7. Thanks for your response.

    Am I correct in saying that you are requesting the ATP website's pages, converting them to a string and then parsing that string to find the data you need?

    I don't know anything about Scala I'm more of a Javascript programmer. The reason I came across your blog post was because I'm interested in doing some data visualisation with tennis stats so I was wondering if there were some freely available stats somewhere I could use.

    ReplyDelete
  8. Correct, I parse ATP web pages into data model in Scala and then I write tennis matches into CSV files.

    If the tennis data in a CSV format I presented above is what you need but you don't know how to use the Scala code example I gave, then contact me by email and I will send you zipped file with tennis data.

    ReplyDelete
  9. I was using a similar markov chain model I developed to trade. Model helped in figuring out entry point to scalp service games. Mixed success. Was thinking about extending the model to project probabilities of winning set and probabilities of winning game. This would potentially help in identifying value bets for trading. Looks like you have done this but I am discouraged by my your expectation that model would not be very accurate. Nevertheless, would be interested in giving your system a try would probably save me a fair amount of time.

    Do you have instructions on how to setup and run your system? Wanted to avoid learning a new language since that stuff does not really interest me too much any more but if this is a necessary step point me in the direction of what I would need to know.

    Also, I think you are on the right track in having some weighting of stats so that more recent performance carries more weight. The other stat I would be interested in seeing is some form of risk measure that captures the variability of the estimated probability. Ideally something that highlights significant turning points in the probability of winning real time would be helpful. In writing this is just occurred to me that Bet Angel's Tennis Trader tool has some of this stuff. I must investigate. When last I looked at it it seemed to be working of a markov chain also but also factored in real time market odds. Sorry for the long post bad day trading and just thinking out loud...

    ReplyDelete
  10. Do you have instructions on how to setup and run your system?

    Unfortunately not yet, however I'm planning to post on my blog on validating tennis prediction models including markov model, elo, glicko and true skill. Then I will probably explain in more details how to use those models.

    ReplyDelete
  11. Hi DK, very interesting stuff. Got any results as to how accurate the model is?

    ReplyDelete
  12. Yes, I have. I compared log likelihood for Elo, Glicko, Glicko2, model based on decayed percentage of winning points, against betting exchange prices. The best model so far is glicko 2, combined with logistic regression, which is used for learning Glicko 2 parameters. It's accuracy is close to accuracy of SP exchange prices, and it outperforms exchange prices with betting volume < 1000 British Pounds. All the models I evaluated so far are better or worse approximations, which are fine for providing ratings but not for precise estimations of player skills and market probabilities. The model I'm working now is based on Dynamic Bayesian Networks, learned with Expectation Maximization.

    ReplyDelete
  13. I'm currently investigating GA / ANN techniques with a view to creating a trading bot for betfair horse racing markets. Just wondering, it's been a while since you last posted... are you actually making an income from automated trading yet?

    ReplyDelete
    Replies
    1. Yes, I am, but it doesn't compensate yet the time I've spent on designing and implementing some trading strategies.

      I'm not posting, because I'm still working on Bayesian Inference in Tennis. I will post when I have something interesting to share.

      Delete
    2. It's encouraging to hear that you're making some money yet somewhat disappointing that you're not making enough to compensate for time spent.

      I've built a simple GA-NN using C language, and for further interest and extra performance I have rewrote parts of it using OpenCL in order to take advantage of parallel GPU hardware. Net result is that searching should be a lot faster.

      So far I haven't used any real betfair data, just a few academic data sets. I'm now at the stage where I'm trying to decide how to preprocess my recorded betfair data. I'm not mathematically minded so it's tough going - too many variables and too few brain cells left haha!

      Delete