For the last couple of months I have been (slowly) working on an expected goals model for the Argentine league and I am very happy to finally present a first version here. I will explain the basics of the model here and I will also try to show some of the underlying numbers, something I haven’t seen so much in posts from others.
For those who want to skip this post and just see the results: click here.
In two earlier posts you can read more about the process and background of expected goals models, and about the underlying shot numbers of this model, for which 23,932 shots have been used. All shots come from the Argentine Primera Division and do not include penalties or own goals. All data is provided by DataFactory.
Expected goals models
An expected goals model assigns a probability (between 0 and 1 that is) that a shot will be converted into a goal. This is primarily based on the location of the shot but also on the situation in which the shot is taken: think about a dribble before the shot, the type of assist received and whether the shot came from a fast counter-attack. A value is assigned based on the average of historical conversion in comparable situations. It is mainly used to evaluate player and team performance in competitions but also serves as a nice post-game recap.
Such a model is nothing new anymore and is actually quite popular in the growing football analytics community. Several football analytics bloggers have their own xG models, and the ones that (for me) are most widely known are those of Paul Riley (here), 11tegen11 (here) and Michael Caley (here). You can find examples of their resulting game xG maps here and here. Arsene Wenger also seems to be among the people in (professional) football who use the metric (or at least his analytics department does).
Another xG model, WHYYY?!
I decided to develop yet another xG model as I have never seen something similar for Argentina, or for South America in general. Most, if not all, analytics work is focused on European competitions and the MLS. However, some of the best football is played in Latin America of course, ha!
Also, as said earlier, I was very interested in the underlying shot data of those models and wanted to present it. From there it was a small and interesting step to the model. I aim to use the model for three purposes:
- Better insight into statistics. I am not satisfied with just seeing general match statistics. The number of shots and possession have long been shown not to give an accurate picture of what was really going on in a game. Expected goals provide a lot more insight. See this example of Uruguay.
- Content. Analytics twitter is already full of xG goal maps and rankings and I really like them. However, on this side of the equator there is not so much (nothing!) going on in that area. I hope to start this and get people interested. I work at a data company and we mainly focus on media; although it is not the simplest model, the visualizations can be very clear and serve their discussions.
- Use for prediction models. Everyone loves predicting: predicting league outcomes, giving probabilities for matches and tournaments, it’s great. Half of the discussions on football are about predictions. Various bloggers have shown that expected goals are more reliable than actual goals when making predictions (I will find the links!).
I ran a binary logistic regression to find the significance and beta value of each variable, and the constant of the model.
The variables included in this version of the model are: angle to goal (location), type of assist, length of assist, time in game, game state, time since the opponent’s last possession, forward speed of the last 3 passes, the shooter’s shot number in the game (#shot) and, for shots from small angles, the shortest distance to goal. From those variables a linear formula is constructed (y = constant + b1 * x1 + b2 * x2 + .., where b indicates the beta value and x the value of each variable). As in any logistic regression, y is then passed through the logistic function, 1 / (1 + e^(-y)), to get a value between 0 and 1 for each shot.
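To make the formula concrete, here is a minimal Python sketch of how a fitted logistic regression turns one shot into a probability. It is not the actual model code: the function name `shot_xg`, the two variable names and all coefficient values are made up for illustration, and the real model uses more variables.

```python
import math

def shot_xg(constant, betas, values):
    """Goal probability for one shot from logistic regression coefficients."""
    # Linear part of the formula: y = constant + b1*x1 + b2*x2 + ..
    y = constant + sum(betas[name] * values[name] for name in betas)
    # The logistic function squeezes y into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical coefficients and shot values, for illustration only
constant = -2.0
betas = {"angle_to_goal": 1.5, "assist_length": -0.01}
shot = {"angle_to_goal": 0.8, "assist_length": 20.0}

xg = shot_xg(constant, betas, shot)
print(round(xg, 3))
```

A positive beta pushes the probability up as the variable grows, a negative beta pushes it down, and the constant sets the baseline for a shot where every variable is zero.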
Note. One of the most important variables, as found by other modelers, is something that I do not have, and cannot calculate, in my data: the type of shot, whether it is a left/right-footed shot, a header or something else.
Having calculated the goal probability for each of the shots, here’s the distribution of xG values:
And the conversion rate per expected goal value (taking the conversion over a 0.06 rolling average of the xG):
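One way to read the rolling conversion is sketched below in Python. It assumes the window is 0.06 wide and centred on the xG value being plotted, which is my interpretation rather than something stated above; the function name `rolling_conversion` and the shots are made up.

```python
def rolling_conversion(shots, center, width=0.06):
    """Conversion rate of the shots whose xG lies within width/2 of center.

    shots: list of (xg, is_goal) pairs, with is_goal equal to 1 or 0.
    Returns None when no shot falls inside the window.
    """
    half = width / 2
    in_window = [goal for xg, goal in shots if abs(xg - center) <= half]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)

# Made-up shots, for illustration only
shots = [(0.08, 0), (0.10, 0), (0.11, 1), (0.12, 0), (0.30, 1)]

rate = rolling_conversion(shots, center=0.10)
print(rate)
```

If the model is well calibrated, the rate for a window should sit close to the xG value at its centre, as long as the window contains enough shots.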
Expected goals vs real goals
To check the accuracy of the model I have made three simple graphs, including the R² coefficients: expected goals scored vs actual goals scored (R² = 0.947), expected goals conceded vs actual goals conceded (R² = 0.945) and expected goal difference vs actual goal difference (R² = 0.76). This data includes all 23,932 shots mentioned before. A few teams competed in only one season, but all of them still had at least 273 shots to take into account.
Edit: as Nils Mackay mentions here, this is a far from ideal way to evaluate the performance of the model. The most important flaw is that the formula is checked against the same data that was used to build the model, so a high R² is not so surprising. The most important check, seeing how the model evaluates future shots, is yet to come. I’ll probably do this by making a distribution of all xG values (when I have a lot) and checking whether the average conversion holds against those values: see the first graph above here.
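A rough Python sketch of what such a check on future shots could look like: group the new shots into xG bins and compare the average xG in each bin with the actual conversion. The function name `calibration_table`, the choice of 10 equal bins and the shots are all assumptions made for illustration.

```python
def calibration_table(shots, n_bins=10):
    """Compare mean xG with actual conversion per xG bin.

    shots: list of (xg, is_goal) pairs from matches NOT used to fit the model.
    Returns {bin_index: (n_shots, mean_xg, conversion_rate)}.
    """
    bins = {}
    for xg, goal in shots:
        # Bin 0 covers xG in [0, 0.1), bin 1 covers [0.1, 0.2), and so on;
        # an xG of exactly 1.0 is placed in the last bin
        index = min(int(xg * n_bins), n_bins - 1)
        bins.setdefault(index, []).append((xg, goal))
    table = {}
    for index, members in sorted(bins.items()):
        n = len(members)
        mean_xg = sum(xg for xg, _ in members) / n
        conversion = sum(goal for _, goal in members) / n
        table[index] = (n, mean_xg, conversion)
    return table

# Made-up new shots, for illustration only
new_shots = [(0.05, 0), (0.08, 0), (0.35, 1), (0.32, 0)]

table = calibration_table(new_shots)
for index, (n, mean_xg, conversion) in table.items():
    print(index, n, round(mean_xg, 3), conversion)
```

With only a handful of shots per bin the conversion is very noisy, which is why a check like this needs a lot of new shots before it says anything.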
As you can see, the R² values are quite high. Over 76% of the variance in the actual goal difference can be explained by the expected goal difference.
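For readers who want to reproduce this kind of number, here is a small Python sketch. It assumes R² means the squared Pearson correlation between the two columns, which is what a linear trendline on a scatter plot reports; the function name `r_squared` and the team totals are made up.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

# Made-up goal differences per team, for illustration only
expected_gd = [10.0, 2.0, -3.0, -9.0]
actual_gd = [12.0, -1.0, -2.0, -9.0]

r2 = r_squared(expected_gd, actual_gd)
print(round(r2, 3))
```

An R² of 1 means the points lie exactly on a straight line, and an R² of 0 means there is no linear relation at all.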
The total number of goals predicted by the model is 2,440; the actual number of goals scored was 2,435. This seems quite OK as I did not do any tweaking to get these numbers closer to each other. On the other hand, they’re probably supposed to be close, as both the formula for expected goals and the actual goals come from the same dataset. Although the formula uses no information on the outcome of the shot itself, it does use the average of the outcomes of all comparable shots.
So, the total of xG vs actual goals seems about right. There are some big differences when considering individual teams though (see the last column):
In this short blog I will show a few examples of xG maps and rankings for the current season.
Let’s see how the model holds up when the formula, built from the old dataset, calculates the expected goals for the new data; that is obviously what I will use it for.