Friday, 9 November 2012

Predicting outcomes of tennis matches with Dynamic Bayesian Networks

For the purpose of building prediction models in Tennis markets, I've developed Bayesian Inference Engine in Scala Programming Language. One of tutorials I wrote on this tool is dedicated to predicting outcomes of tennis matches with Dynamic Bayesian Networks and Expectation Maximization techniques.

Bayes-Scala Project Home Page

Getting Started - Learning parameters with Expectation Maximisation in Unrolled Dynamic Bayesian Networks from incomplete data

Currently I'm evaluating this prediction model on a real data set of tennis matches for last 5 years and I will publish prediction results later on.

Saturday, 4 February 2012

On probability of winning a tennis match

I present end to end solution for predicting a winner of ATP tennis match from names of both tennis players, surface and match type. I start with presenting requirements for tennis predictions, then I discuss mathematical model applied for predicting a winner of tennis match, followed by software design and implementation of prediction application. Validating accuracy of tennis prediction models formulates a topic of my future work on tennis markets.

The product of my efforts presented here is a software library written in Scala programming language, which calculates probability of winning a tennis match. This library is open sourced and it is publicly available on GitHub. If you want to play with it, I recommended starting with unit tests, which should make it a little more clear how to use it.
  1. https://github.com/danielkorzekwa/tennis-player-compare
  2. https://github.com/danielkorzekwa/tennis-probability-calculator
  3. https://github.com/danielkorzekwa/atpworldtour-api
Requirements

Tennis prediction application is intended for supporting trading decisions on a Betfair betting exchange markets.

User story:

As a trader,
I want to calculate probability of winning a tennis match,
so that I can support my trading decisions.

Input:
  • Full names of tennis players, e.g. Roger Federer and Milos Raonic. ATP players only.
  • Court surface - CLAY, GRASS, HARD.
  • Match type - THREE_SET_MATCH, FIVE_SET_MATCH.
  • Prediction year - For calculating tennis probability for the current day, specify the current year (2012). To obtain tennis probability at the end of the past year, specify a year from the past, e.g. 2011. In the future, trader would like to specify a full date for prediction, either from the past, e.g. 2 year ago, or from the future, e.g. in 6 months. 
Output:
  • Probability of winning a tennis match for the first specified tennis player - It is a number between 0 and 1 (probability). Probability of winning a tennis match by the second player is 1-probability.
  • Error should be reported if probability cannot be calculated. For example, if a player is not recorded on http://www.atpworldtour.com/ website.
  • In the future, trader may want to know what is the confidence level of winning probability. Consider two games for the same player P1 against players P2 and P3. The probability of winning the game by player P1 against players P2 and P3 may be the same value of 0.4. But it may happen that the player P3 hasn't played any tennis match for last 3 months, which potentially could make us less confident about prediction for a game including this player. Trader would like to know about it. In other words, the probability of winning a game should be defined by two numbers, probability and a confidence level, possibly as a http://en.wikipedia.org/wiki/Normal_distribution.
Tennis prediction model

I considered a few different prediction methods for calculating probability of winning a tennis match, which could be classified into three categories:
I rejected prediction methods based on ATP ratings and logistic regression, which predict match winner from historical match outcomes, not taking into account probabilities of winning a tennis point. I believe the more data the better - there are far more points than matches played in tennis. Nonetheless, I still can see a value in applying logistic regression on top of probabilistic model. For instance, probabilistic model might be used for generating ratings for tennis players, which could be fit with logistic regression into tennis probability function.

From both neural network and markov chain methods I prefer the latter, because it allows for modelling probability of winning a tennis match with two numbers:  probability and uncertainty level, modeled as 'mean' and standard deviation respectively (http://en.wikipedia.org/wiki/Normal_distribution). This lets to differentiate between two players with similar probability of winning a point but with different uncertainty levels, which could be interpreted as player's consistency in a game. 

Eventually, I selected markov chain model, which calculates tennis match winner from probabilities of winning a point on serve and return by both players. Probability of winning a point is calculated from player statistics such as probability of first serve, probability of winning a point on first and second serve and probability of first and second serve return. Then probabilities of winning a game, set, tiebreak and match are calculated. My implementation of this prediction model is based on those two research papers:
Accuracy of this model depends on two factors. First, probability of winning a point on serve by both players must be precise. Every precision error at this level propagates down through probabilities of winning a tennis game, set and match.

Second, the algorithm for calculating game, set and match probabilities, which is based on markov chain, makes assumption that points in tennis match are independently and identically distributed. In other words, whether the probability of winning every point on serve by a player in a tennis match is constant or maybe it depends on a current score or who won last couple of points.  Klassen and Magnum show in their paper (Klaassen, Magnus - 2001 - Are Points in Tennis Independent and Identically distributed Evidence From a Dynamic Binary Panel Data Model), that points are not independent and identically distributed, however deviations are small and making this idd assumption should be fine in many use cases. How making this assumption impacts trader's needs for precise tennis predictions, I don't know yet.

I'm not expecting this model to be very accurate, at least in its generic form, where probability of winning a point is derived purely from percentages of points won on serve and return by both players and by 'average player'. I implemented it to get some reference point I can compare other models with. I'm planning to discount most recent tennis matches less and I also want to compare it with Bradley-Terry, Elo, Glicko and TrueSkill models.

Design and implementation

There are three dedicated components contributing to functionality of comparing ATP tennis players: tennis-trader-compare, atpworldtour-api and tennis-probability-calculator.

Component diagram for comparing ATP tennis players

The tennis-trader-compare component provides a high level interface for predicting outcome of tennis match. Interface definition.

The atpworldtour-api is an adapter for http://www.atpworldtour.com/ tennis statistics, which are used for calculating probability of winning a tennis point. Interface definition.

Tennis-probability-calculator contains functions for calculating probability of winning a point, gem, set, tie break and tennis match. Interface definition.

Flow of control

Sequence diagram for comparing ATP tennis players
  1. Trader calls tennis-trader-compare with player full names, surface, match type and prediction year and receives probability of winning a tennis match.
  2. Tennis-trader-compare component obtains ATP match facts from atpworldtour-api component, e.g. frst serve %,1st serve point won %, 2nd serve point won %.
  3. Tennis-trader-compare calls tennis-probability-calculator to calculate probability of winning a point and tennis match by both players.

Wednesday, 11 January 2012

Neural Networks and Genetic Algorithm in trading on a Betting Exchange

This is a report on learning neural networks for trading on betting exchange horse racing markets.

Neural Network [1] was learned using Market Simulator [2] with data collected for a period of 3 months between January and April 2011 for Horse Racing GB win only markets, which allows for replaying betting exchange market on a betting exchange and evaluating trading decisions.

Example of collected market data:
{"time":1282481934744,"eventType":"CREATE_MARKET","marketId":101655582,"marketName":"7f Mdn Stks","eventName":"/GB/Folk 22nd Aug","numOfWinners":1,"marketTime":"2010-08-22 14:10:00","runners": [{"runnerId":4848193,"runnerName":"Dubawi Gulf"},{"runnerId":4818333,"runnerName":"Apace"},{"runnerId":4932434,"runnerName":"Menha"},{"runnerId":4932435,"runnerName":"Flying Arch"},{"runnerId":4883524,"runnerName":"Hello Tomorrow"},{"runnerId":4932436,"runnerName":"Mazagee"}]}
{"time":1282481934744,"eventType":"PLACE_BET","betSize":38.0,"betPrice":5.9,"betType":"LAY","marketId":101655582,"runnerId":4818333}
...
{"time":1282482515736,"eventType":"PLACE_BET","betSize":314.0,"betPrice":1.61,"betType":"BACK","marketId":101655582,"runnerId":4848193}
{"time":1282482515736,"eventType":"CANCEL_BETS","betsSize":5.0,"betPrice":18.0,"betType":"LAY","marketId":101655582,"runnerId":4932435}
{"time":1282482515736,"eventType":"PLACE_BET","betSize":17.0,"betPrice":20.0,"betType":"LAY","marketId":101655582,"runnerId":4932435}
Collected data included market prices, traded volume, volume of unmatched bets by price, current risk position, but no knowledge from historical market data was taken into account such as how good a given horse is comparing to other horses in a race. The output of neural network was either trading action to be taken in a current state, e.g. place bet, cancel bet, or indicator whether a given action, e.g. place bet ( (specified as an input to neural network), is good or not decision.

Neural Network wasn't learned from examples, as those didn't exists on how to trade in HR markets, but rather than that, it was optimized with genetic algorithm [3] and simulated annealing [4]. This type of learning neural networks could be perceived as an example of reinforcement learning [5] in an environment, in which we don't know how to act to achieve the goal, e.g. what trading decisions should be taken, but we know when we reach the goal, in this case it is positive expected profit [6] from trading on a betting exchange markets.

The best instance of learned neural network gives a positive profit per market by average of 5£ including 5% winning commission from trading on HR betting exchange markets.  But it also involves a big number of betting transactions, which causes additional transaction fees and takes overall profit per market from 5£ down to 2£. Adding additional costs around setting up hardware, technology and maintenance, limits the usage of presented trading technique to a research activity only. Involving a knowledge gained from historical data potentially could take this technique from a lab project to production ready system.

Neural network structure

Structure 1:
  • Input neurons - Runner prices, traded volume, unmatched volume, derivatives of price and traded volume, risk position.
  • Hidden layer - Single hidden layer.
  • Output neurons - Actions to be taken, e.g. 3 neurons: A (place back bet), B (place lay bet), C(place hedge bet). 
  • Activation function - Hyperbolic tangent. If output neuron value is higher than 0 then action is taken.
Structure 2:
  • Input neurons -  Runner prices, traded volume, unmatched volume , derivatives of price and traded volume, risk position, a proposal for a single action to be taken, e.g. place back bet.
  • Hidden layer - Single hidden layer.
  • Output neuron - Single neuron indicating how good action specified by an input layer is. Before taking any action, all possible actions are evaluated with neural network and the best one is executed.
The latter neural network produces good results (positive profit) faster than the former one, due to a smaller number of weights inside the neural network to be learned.

Learning process

First, neural network is initialized with an initial set of random weights, then market simulation is executed evaluating trading quality (market expected profit) of an agent over 100-300 markets. The overall expected profit from all markets is a quality value of a neural network. In the next step, neural network weights are updated applying evolutionary algorithm or simulated annealing and simulation process is repeated. A single learning iteration takes about 1 minute and the whole process took weeks till neural network was trained for giving positive profit. Learning was conducted on Intel Core i7 2.8GHz 8 Cores PC.

The best learning results were achieved by applying co-evolution version of learning described above, where a number of traders have to fight each other on a betting exchange markets. With this approach weights of best traders are crossed each other by applying genetic algorithm.

What's next?

In the next step I plan to take advantage of a knowledge from historical market data while learning and then executing trading agents. I'm looking at adopting probability theory, Bayesian networks [7], logistic regression [8] and other techniques suitable for analysing huge amount of historical market data.

Notes

A number of trading examples examined for a purpose of this research work are available here:
http://code.google.com/p/betting-ai/source/browse/#svn%2Ftrunk%2Ftrader-examples

All software for a purpose of this project is written in a Scala programming language.

References
  1. Neural Network - http://en.wikipedia.org/wiki/Neural_network
  2. Market Simulator -  http://blog.danmachine.com/2010/08/horse-racing-market-simulation.html
  3. Genetic algorithm -  http://en.wikipedia.org/wiki/Genetic_algorithm
  4. Simulated annealing -  http://en.wikipedia.org/wiki/Simulated_annealing
  5. Reinforcement learning -  http://en.wikipedia.org/wiki/Reinforcement_learning
  6. Expected value -  http://en.wikipedia.org/wiki/Expected_value
  7. Bayesian networks -  http://en.wikipedia.org/wiki/Bayesian_networks
  8. Logistic regression -  http://en.wikipedia.org/wiki/Logistic_regression