I present end to end solution for predicting a winner of ATP tennis match from names of both tennis players, surface and match type. I start with presenting requirements for tennis predictions, then I discuss mathematical model applied for predicting a winner of tennis match, followed by software design and implementation of prediction application. Validating accuracy of tennis prediction models formulates a topic of my future work on tennis markets.
The product of my efforts presented here is a software library written in Scala programming language, which calculates probability of winning a tennis match. This library is open sourced and it is publicly available on GitHub. If you want to play with it, I recommended starting with unit tests, which should make it a little more clear how to use it.
Tennis-probability-calculator contains functions for calculating probability of winning a point, gem, set, tie break and tennis match. Interface definition.
The product of my efforts presented here is a software library written in Scala programming language, which calculates probability of winning a tennis match. This library is open sourced and it is publicly available on GitHub. If you want to play with it, I recommended starting with unit tests, which should make it a little more clear how to use it.
- https://github.com/danielkorzekwa/tennis-player-compare
- https://github.com/danielkorzekwa/tennis-probability-calculator
- https://github.com/danielkorzekwa/atpworldtour-api
Requirements
Tennis prediction application is intended for supporting trading decisions on a Betfair betting exchange markets.
User story:
As a trader,
I want to calculate probability of winning a tennis match,
so that I can support my trading decisions.
Input:
There are three dedicated components contributing to functionality of comparing ATP tennis players: tennis-trader-compare, atpworldtour-api and tennis-probability-calculator.
The tennis-trader-compare component provides a high level interface for predicting outcome of tennis match. Interface definition.
The atpworldtour-api is an adapter for http://www.atpworldtour.com/ tennis statistics, which are used for calculating probability of winning a tennis point. Interface definition.
Tennis prediction application is intended for supporting trading decisions on a Betfair betting exchange markets.
User story:
As a trader,
I want to calculate probability of winning a tennis match,
so that I can support my trading decisions.
Input:
- Full names of tennis players, e.g. Roger Federer and Milos Raonic. ATP players only.
- Court surface - CLAY, GRASS, HARD.
- Match type - THREE_SET_MATCH, FIVE_SET_MATCH.
- Prediction year - For calculating tennis probability for the current day, specify the current year (2012). To obtain tennis probability at the end of the past year, specify a year from the past, e.g. 2011. In the future, trader would like to specify a full date for prediction, either from the past, e.g. 2 year ago, or from the future, e.g. in 6 months.
Output:
- Probability of winning a tennis match for the first specified tennis player - It is a number between 0 and 1 (probability). Probability of winning a tennis match by the second player is 1-probability.
- Error should be reported if probability cannot be calculated. For example, if a player is not recorded on http://www.atpworldtour.com/ website.
- In the future, trader may want to know what is the confidence level of winning probability. Consider two games for the same player P1 against players P2 and P3. The probability of winning the game by player P1 against players P2 and P3 may be the same value of 0.4. But it may happen that the player P3 hasn't played any tennis match for last 3 months, which potentially could make us less confident about prediction for a game including this player. Trader would like to know about it. In other words, the probability of winning a game should be defined by two numbers, probability and a confidence level, possibly as a http://en.wikipedia.org/wiki/Normal_distribution.
Tennis prediction model
I considered a few different prediction methods for calculating probability of winning a tennis match, which could be classified into three categories:
Eventually, I selected markov chain model, which calculates tennis match winner from probabilities of winning a point on serve and return by both players. Probability of winning a point is calculated from player statistics such as probability of first serve, probability of winning a point on first and second serve and probability of first and second serve return. Then probabilities of winning a game, set, tiebreak and match are calculated. My implementation of this prediction model is based on those two research papers:
Design and implementation- Predicting tennis outcome from probabilities of winning a point on serve and serve return by both players. Modelling a tennis game as a finite markov chain.
- Klaassen, Magnus - 2001 - Are Points in Tennis Independent and Identically Distributed Evidence From a Dynamic Binary Panel Data Model
- Barnett, Clarke - 2005 - Combining player statistics to predict outcomes of tennis matches
- Barnett, Brown, Clarke - 2006 - Developing a model that reflects outcomes of tennis matches
- O'Malley - 2008 - Probability Formulas and Statistical Analysis in Tennis
- Newton, Aslam - 2009 - Monte Carlo Tennis A Stochastic Markov Chain Model
- Neural networks. Predicting market outcome from tennis player statistics including percentage of first serve, winning percentage on the first serve return and few others.
- Official ATP ratings and logistic regression.
From both neural network and markov chain methods I prefer the latter, because it allows for modelling probability of winning a tennis match with two numbers: probability and uncertainty level, modeled as 'mean' and standard deviation respectively (http://en.wikipedia.org/wiki/Normal_distribution). This lets to differentiate between two players with similar probability of winning a point but with different uncertainty levels, which could be interpreted as player's consistency in a game.
Eventually, I selected markov chain model, which calculates tennis match winner from probabilities of winning a point on serve and return by both players. Probability of winning a point is calculated from player statistics such as probability of first serve, probability of winning a point on first and second serve and probability of first and second serve return. Then probabilities of winning a game, set, tiebreak and match are calculated. My implementation of this prediction model is based on those two research papers:
- O'Malley - 2008 - Probability Formulas and Statistical Analysis in Tennis
- Newton, Aslam - 2009 - Monte Carlo Tennis A Stochastic Markov Chain Model
Accuracy of this model depends on two factors. First, probability of winning a point on serve by both players must be precise. Every precision error at this level propagates down through probabilities of winning a tennis game, set and match.
Second, the algorithm for calculating game, set and match probabilities, which is based on markov chain, makes assumption that points in tennis match are independently and identically distributed. In other words, whether the probability of winning every point on serve by a player in a tennis match is constant or maybe it depends on a current score or who won last couple of points. Klassen and Magnum show in their paper (Klaassen, Magnus - 2001 - Are Points in Tennis Independent and Identically distributed Evidence From a Dynamic Binary Panel Data Model), that points are not independent and identically distributed, however deviations are small and making this idd assumption should be fine in many use cases. How making this assumption impacts trader's needs for precise tennis predictions, I don't know yet.
I'm not expecting this model to be very accurate, at least in its generic form, where probability of winning a point is derived purely from percentages of points won on serve and return by both players and by 'average player'. I implemented it to get some reference point I can compare other models with. I'm planning to discount most recent tennis matches less and I also want to compare it with Bradley-Terry, Elo, Glicko and TrueSkill models.
Second, the algorithm for calculating game, set and match probabilities, which is based on markov chain, makes assumption that points in tennis match are independently and identically distributed. In other words, whether the probability of winning every point on serve by a player in a tennis match is constant or maybe it depends on a current score or who won last couple of points. Klassen and Magnum show in their paper (Klaassen, Magnus - 2001 - Are Points in Tennis Independent and Identically distributed Evidence From a Dynamic Binary Panel Data Model), that points are not independent and identically distributed, however deviations are small and making this idd assumption should be fine in many use cases. How making this assumption impacts trader's needs for precise tennis predictions, I don't know yet.
I'm not expecting this model to be very accurate, at least in its generic form, where probability of winning a point is derived purely from percentages of points won on serve and return by both players and by 'average player'. I implemented it to get some reference point I can compare other models with. I'm planning to discount most recent tennis matches less and I also want to compare it with Bradley-Terry, Elo, Glicko and TrueSkill models.
There are three dedicated components contributing to functionality of comparing ATP tennis players: tennis-trader-compare, atpworldtour-api and tennis-probability-calculator.
![]() |
| Component diagram for comparing ATP tennis players |
The tennis-trader-compare component provides a high level interface for predicting outcome of tennis match. Interface definition.
The atpworldtour-api is an adapter for http://www.atpworldtour.com/ tennis statistics, which are used for calculating probability of winning a tennis point. Interface definition.
Tennis-probability-calculator contains functions for calculating probability of winning a point, gem, set, tie break and tennis match. Interface definition.
Flow of control
![]() |
| Sequence diagram for comparing ATP tennis players |
- Trader calls tennis-trader-compare with player full names, surface, match type and prediction year and receives probability of winning a tennis match.
- Tennis-trader-compare component obtains ATP match facts from atpworldtour-api component, e.g. frst serve %,1st serve point won %, 2nd serve point won %.
- Tennis-trader-compare calls tennis-probability-calculator to calculate probability of winning a point and tennis match by both players.
Resources
This list below contains additional research papers, which I collected for a purpose of my work described in this blog post.
This list below contains additional research papers, which I collected for a purpose of my work described in this blog post.
- Clarke, Dyte - 2000 - Using official ratings to simulate major tennis tournaments
- Barnett, Clarke - 2002 - Using Microsoft Excel to model a tennis match
- Klaassen, Magnus - 2003 - On the probability of winning a tennis match
- Barnett - 2006 - MATHEMATICAL MODELLING IN HIERARCHICAL GAMES WITH SPECIFIC REFERENCE TO TENNIS
- Chen et al. - 2008 - Pricing Combinatorial Markets for Tournaments Extended Abstract
- Huang - 2011 - Inferring Tennis Match Progress from In-Play Betting Odds


Hello Daniel
ReplyDeleteI am enjoying your blog/research and experiments. Well done!
Martin
I enjoyed reading this article too.
ReplyDeletePost a little more often, please!
You can read my ideas over at Betfair Pro Trader
Hello,
ReplyDeleteIts great to read another article that connects probability and the tennis game. I read a similar article called "Points, Games, and Life Lessons" on www.binfikr.com. Check it out, its interesting :)
Hi,
ReplyDeleteHow are you retrieving the ATP World Tour statistics here:
https://github.com/danielkorzekwa/atpworldtour-api
Is there a freely available api others can access?
How to retrieve tennis matches for year 2011 and store them in a csv file?
ReplyDeleteTennis matchesCSV example:
event_time, event_name, surface, num_of_sets, playerA,playerB, winner, score, round, duration_minutes, playerATotalServicePointsWon, playerATotalServicePoints, playerBTotalServicePointsWon, playerBTotalServicePoints
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Robin Soderling,Ryan Harrison,Robin Soderling,6-2; 6-4,R32,66,39,52,28,50
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Michael Berrer,Dudi Sela,Michael Berrer,1-6; 7-6(3); 6-2,R32,152,62,99,62,103
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Matthew Ebden,John Millman,Matthew Ebden,4-6; 6-2; 6-4,R32,104,55,85,48,76
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Denis Istomin,Thiemo de Bakker,Denis Istomin,7-6(5); 6-4,R32,93,49,63,48,72
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Mardy Fish,Adrian Mannarino,Mardy Fish,6-1; 6-4,R32,79,39,51,35,68
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Radek Stepanek,Tobias Kamke,Radek Stepanek,5-7; 6-1; 6-4,R32,124,53,80,52,87
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Ricardas Berankis,Arnaud Clement,Ricardas Berankis,6-4; 6-3,R32,86,37,55,37,64
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Florian Mayer,Bernard Tomic,Florian Mayer,6-2; 6-2,R32,55,32,35,28,58
2011-01-02 00:00:00.000,Brisbane Australia ATP World Tour 250,HARD,2,Feliciano Lopez,Philipp Petzschner,Feliciano Lopez,6-4; 7-6(11),R32,100,54,77,47,68
...
Scala application:
package dk.test
import dk.atp.api.tournament.GenericTournamentAtpApi
import dk.atp.api._
object MatchesLoadSimple extends App {
//Increase 3000ms connection timeout, if loading tennis matches fails.
//There are dozens of requests sent to atp website, so
//even with a higher timeout the loadMatches method may sometimes fail.
//With a high speed internet connection, loading all tennis matches
//for 2011 takes between 3-6 seconds.
var tournamentApi: GenericTournamentAtpApi = new GenericTournamentAtpApi(3000)
val genericATPMatchesLoader = new GenericATPMatchesLoader(tournamentApi)
val matches = genericATPMatchesLoader.loadMatches(2011)
CSVATPMatchesLoader.toCSVFile(matches, "./target/matches.csv")
}
I also added this instruction to https://github.com/danielkorzekwa/atpworldtour-api README file.
I hope this will help.
Additionally please look into Junit tests, which illustrate how to use atp API.
The instruction above, shows how to get tennis matches, which in fact could be used to obtain all tennis stats for any tennis player and time period.
ReplyDeleteBut I also implemented API for getting tennis flat facts at the end of a year directly from atp website. This junit test shows how to use it:
https://github.com/danielkorzekwa/atpworldtour-api/blob/master/src/test/scala/dk/atp/api/facts/GenericAtpFactsApiTest.scala
However, personally I load tennis matches first and use them to calculate stats I need.
Thanks for your response.
ReplyDeleteAm I correct in saying that you are requesting the ATP website's pages, converting them to a string and then parsing that string to find the data you need?
I don't know anything about Scala I'm more of a Javascript programmer. The reason I came across your blog post was because I'm interested in doing some data visualisation with tennis stats so I was wondering if there were some freely available stats somewhere I could use.
Correct, I parse ATP web pages into data model in Scala and then I write tennis matches into CSV files.
ReplyDeleteIf the tennis data in a CSV format I presented above is what you need but you don't know how to use the Scala code example I gave, then contact me by email and I will send you zipped file with tennis data.
I was using a similar markov chain model I developed to trade. Model helped in figuring out entry point to scalp service games. Mixed success. Was thinking about extending the model to project probabilities of winning set and probabilities of winning game. This would potentially help in identifying value bets for trading. Looks like you have done this but I am discouraged by my your expectation that model would not be very accurate. Nevertheless, would be interested in giving your system a try would probably save me a fair amount of time.
ReplyDeleteDo you have instructions on how to setup and run your system? Wanted to avoid learning a new language since that stuff does not really interest me too much any more but if this is a necessary step point me in the direction of what I would need to know.
Also, I think you are on the right track in having some weighting of stats so that more recent performance carries more weight. The other stat I would be interested in seeing is some form of risk measure that captures the variability of the estimated probability. Ideally something that highlights significant turning points in the probability of winning real time would be helpful. In writing this is just occurred to me that Bet Angel's Tennis Trader tool has some of this stuff. I must investigate. When last I looked at it it seemed to be working of a markov chain also but also factored in real time market odds. Sorry for the long post bad day trading and just thinking out loud...
Do you have instructions on how to setup and run your system?
ReplyDeleteUnfortunately not yet, however I'm planning to post on my blog on validating tennis prediction models including markov model, elo, glicko and true skill. Then I will probably explain in more details how to use those models.
Hi DK, very interesting stuff. Got any results as to how accurate the model is?
ReplyDeleteYes, I have. I compared log likelihood for Elo, Glicko, Glicko2, model based on decayed percentage of winning points, against betting exchange prices. The best model so far is glicko 2, combined with logistic regression, which is used for learning Glicko 2 parameters. It's accuracy is close to accuracy of SP exchange prices, and it outperforms exchange prices with betting volume < 1000 British Pounds. All the models I evaluated so far are better or worse approximations, which are fine for providing ratings but not for precise estimations of player skills and market probabilities. The model I'm working now is based on Dynamic Bayesian Networks, learned with Expectation Maximization.
ReplyDeleteI'm currently investigating GA / ANN techniques with a view to creating a trading bot for betfair horse racing markets. Just wondering, it's been a while since you last posted... are you actually making an income from automated trading yet?
ReplyDeleteYes, I am, but it doesn't compensate yet the time I've spent on designing and implementing some trading strategies.
DeleteI'm not posting, because I'm still working on Bayesian Inference in Tennis. I will post when I have something interesting to share.
It's encouraging to hear that you're making some money yet somewhat disappointing that you're not making enough to compensate for time spent.
DeleteI've built a simple GA-NN using C language, and for further interest and extra performance I have rewrote parts of it using OpenCL in order to take advantage of parallel GPU hardware. Net result is that searching should be a lot faster.
So far I haven't used any real betfair data, just a few academic data sets. I'm now at the stage where I'm trying to decide how to preprocess my recorded betfair data. I'm not mathematically minded so it's tough going - too many variables and too few brain cells left haha!