I am currently developing an expected goals model for the Argentine top league and I am keeping track of my progress through this blog. It helps me keep an overview, remember the flaws and problems I encounter, and opens the work up to input and criticism from others.
Instead of simply taking the number of shots (or shots on goal) and goals to evaluate performance in a match or competition, this model takes into account the location from which a shot was taken. Every shot is assigned a probability of being converted into a goal, which gives it an expected goals value (xG) between 0 and 1. This prediction is primarily based on historical shot data for that location. To make the model more reliable, more variables will (have to) be added (see the corresponding paragraph below).
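To make the idea of "historical shot data for that location" concrete, here is a minimal Python sketch. It is not the actual model: the 2 meter grid, the function names and the toy shots are all assumptions for illustration. It divides the pitch into cells and uses the historical conversion rate of each cell as the xG of a new shot from that cell.

```python
from collections import defaultdict

def fit_location_xg(shots, cell_size=2.0):
    """Build a lookup table {grid cell: goals / shots} from historical shots.

    `shots` is an iterable of (x, y, is_goal) tuples with x and y in meters.
    The grid and its cell size are illustrative assumptions.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [goals, shots]
    for x, y, is_goal in shots:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell][0] += int(is_goal)
        counts[cell][1] += 1
    return {cell: goals / total for cell, (goals, total) in counts.items()}

def shot_xg(table, x, y, cell_size=2.0, default=0.0):
    """Look up the xG of a shot; cells without history fall back to `default`."""
    cell = (int(x // cell_size), int(y // cell_size))
    return table.get(cell, default)

# Toy data, invented for the example: four shots from one cell, one from another.
history = [
    (5.0, 34.0, True),
    (5.5, 35.0, False),
    (4.2, 34.9, False),
    (4.8, 35.5, True),
    (25.0, 30.0, False),
]
table = fit_location_xg(history)
print(shot_xg(table, 5.1, 34.2))  # conversion rate of the cell with four shots
```

With real data the cells far from goal contain few shots, so some form of smoothing or a regression on angle and distance is probably a better choice than raw cell averages.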
Such xG models already exist and are widely used: several football analytics bloggers have their own xG models, and the ones best known (to me) are those of 11tegen11 (here) and Michael Caley (here). You can find examples of the resulting game xG maps here and here. Arsene Wenger also seems to be among the people in professional football who use the metric (or his analytics department does).
I have collected data on 23,931 shots from the last 5 tournaments (1,209 games) in the top division of Argentina, all with x and y coordinates of the origin of the shot and the result (goal, saved, post, wide). I took out all penalties, as those are very isolated moments in a football match and would require a different analysis. I will probably add the expected goals for penalties (the league average conversion rate of a penalty) to the totals per team. Own goals have also been removed from the data set (shots on a team's own goal did not appear in the first place).
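The cleaning step and the planned penalty adjustment could look roughly like the Python sketch below. The field names (`team`, `xg`, `penalty`, `own_goal`, `goal`) and the toy records are hypothetical, not the layout of the real data set; the penalty rate is computed from the data rather than assumed.

```python
from collections import defaultdict

def clean_shots(shots):
    """Split raw records into regular shots and penalties; own goals are dropped."""
    regular = [s for s in shots if not s["penalty"] and not s["own_goal"]]
    penalties = [s for s in shots if s["penalty"]]
    return regular, penalties

def penalty_conversion_rate(penalties):
    """League average conversion rate of a penalty (0.0 if there are none)."""
    if not penalties:
        return 0.0
    return sum(1 for s in penalties if s["goal"]) / len(penalties)

def team_xg_totals(regular, penalties, pen_rate):
    """Sum shot xG per team and add the flat penalty xG for every penalty taken."""
    totals = defaultdict(float)
    for s in regular:
        totals[s["team"]] += s["xg"]
    for s in penalties:
        totals[s["team"]] += pen_rate
    return dict(totals)

# Toy records, invented for the example.
raw = [
    {"team": "A", "xg": 0.1, "penalty": False, "own_goal": False, "goal": False},
    {"team": "A", "xg": 0.3, "penalty": False, "own_goal": False, "goal": True},
    {"team": "A", "xg": 0.0, "penalty": True, "own_goal": False, "goal": True},
    {"team": "B", "xg": 0.2, "penalty": False, "own_goal": False, "goal": False},
    {"team": "B", "xg": 0.0, "penalty": True, "own_goal": False, "goal": False},
    {"team": "B", "xg": 0.0, "penalty": False, "own_goal": True, "goal": True},
]
regular, penalties = clean_shots(raw)
pen_rate = penalty_conversion_rate(penalties)
totals = team_xg_totals(regular, penalties, pen_rate)
print(pen_rate, totals)
```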
For all shot locations the angle to the goal is determined, using the distance to the goal line (‘x’ in the next image) and the distances along the goal line to the first post (‘a’) and the second post (‘b’).
Here’s a distribution graph of the shots per angle:
(The angle for a penalty kick – although not included in the numbers – is a bit under 37º)
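As a rough sketch of the angle calculation in Python: the angle the goal subtends is the difference between the directions to the two posts. The coordinate convention (x as distance to the goal line, y measured across a 68 m wide pitch) and the function name are my assumptions for the example. The penalty spot, 11 m from the goal line and centered, serves as a check.

```python
import math

GOAL_WIDTH = 7.32  # meters

def goal_angle(x, y, pitch_width=68.0):
    """Angle (degrees) between the lines from the shot location to both posts.

    x: distance to the goal line in meters (>= 0).
    y: position across the pitch in meters, from 0 to pitch_width.
    """
    # Signed distances along the goal line to the two posts ('a' and 'b').
    a = pitch_width / 2 - GOAL_WIDTH / 2 - y
    b = pitch_width / 2 + GOAL_WIDTH / 2 - y
    return math.degrees(math.atan2(b, x) - math.atan2(a, x))

penalty_angle = goal_angle(11.0, 34.0)
print(round(penalty_angle, 1))  # a bit under 37 degrees

# A ball on the goal line, outside the posts, sees the goal at an angle of 0.
print(goal_angle(0.0, 10.0))
```

Using `atan2` with signed offsets avoids separate cases for locations between the posts and locations wide of the goal.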
Besides the angle, shot distance is calculated from relative distances, standardizing the x,y-coordinates to an average pitch size in meters, since all pitches have different measurements. For this model a pitch of 110m x 68m is used. The size of the penalty area is standard: 16.5m x 40.32m. The goal is 7.32m wide and is placed … in the middle of each goal line! The small box is 5.5m deep.
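The standardization can be sketched in Python as follows. I assume here that the raw coordinates are fractions between 0 and 1 of the pitch length and width, with the attacked goal at x = 1; the real feed may use a different convention, and the function names are made up for the example.

```python
import math

PITCH_LENGTH = 110.0  # meters, the standard size used for this model
PITCH_WIDTH = 68.0    # meters

def to_meters(x_rel, y_rel):
    """Convert relative coordinates (assumed 0..1) to meters on the standard pitch."""
    return x_rel * PITCH_LENGTH, y_rel * PITCH_WIDTH

def shot_distance(x_rel, y_rel):
    """Distance in meters from the shot to the center of the goal at x_rel = 1."""
    x_m, y_m = to_meters(x_rel, y_rel)
    return math.hypot(PITCH_LENGTH - x_m, PITCH_WIDTH / 2 - y_m)

# A central shot at 90% of the pitch length is 11 m from goal on a 110 m pitch.
print(round(shot_distance(0.9, 0.5), 2))
```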
And the basic numbers
Flaws: shot locations are manually uploaded and interpreted by different people over time. For that reason there is no guarantee that they all use the same criteria or that all shot locations are 100% accurate.
The following variables might or might not have an effect on a shot. I would love to add all of them to the model in case they have a significant effect, but I will see about that later.
A lot of the variables have already been found significant by 11tegen11. Here’s a great article on his model. A complicating factor here is that a lot of data is missing, so I had to calculate some variables myself (e.g. types of assists, rebounds, big chances, dribbles) instead of getting them from the raw data. A positive note is that ALL matches are uploaded to YouTube and thus available for reviewing.
Type of shot: header, preferred foot, other foot, other body part (no access to this data)
Distance: although the primary factor in the model is the angle to the goal, and that metric already contains the variable distance, distance might have an effect of its own. Think of a simple example: a ball on the goal line outside of the goal (an angle of 0º). It is easier to score this one from 20 meters away from the post than from right next to it, although both have the same angle. On the other hand, when two shots have the same angle, the one from further out gives the goalkeeper more time to respond. This might be reinforced when the shot comes from a cross, which is generally taken first time, as Michael Caley suggests here, but I cannot find more information or evidence for this in my data.
Assisted/unassisted shot: an assist is expected to have a positive effect on the shot.
Depth of assist: a vertical pass forward (bringing the ball closer to the goal at a fast pace) probably has a positive effect on the situation as defenders are taken out.
Length of assist: a short pass is easier to control or to shoot first time than a long pass.
Squad role of shooter: position of the player in the team.
“Confidence” of shooter: has the player already scored in the game or in previous games? This will be taken as the conversion percentage over a past period (I took the game itself, 21 days and the season).
Number of shots taken by team: the number of shots taken by a team in the game.
Number of shots taken by player: the number of shots taken by a player in the game.
Time on pitch: a player who recently entered the pitch might take advantage of the fatigue of others. Conversely, a player might not shoot as well after 90 minutes of football.
Game state: it has been suggested that the score in the game has an effect on conversion. In general I assume this is largely based on quality: a team in a +2 game state is (in most cases) better than the other team. Or the goalkeeper might have a terrible day.
Time: a shot in the 90th minute might be different than one in the opening seconds.
Position of opponents: goalkeeper position is very important of course, as is the proximity of other defenders (no access to this data).
Dribble: if a player dribbles with the ball before shooting this possibly has a positive effect as the player is probably moving away from defenders.
Rebounds: generally all players are out of position. I have no data on this, but I consider the number of seconds since the last shot.
Pass sequence: the number of passes, the time of possession and the point where possession was won (imagine a ball won in the opponent’s box with defenders out of position) can have an influence on the shot.
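Several of the variables above have to be derived from the shot list itself. As an illustration, here is a small Python sketch that computes two of them: the seconds since the previous shot in the same match (the stand-in for rebounds) and the running count of shots by the same player in the game. The field names `match_id`, `second` and `player` are hypothetical, as are the toy records.

```python
from collections import defaultdict

def add_sequence_features(shots):
    """Return the shots ordered by match and time, with two derived fields added.

    secs_since_last_shot: seconds since the previous shot in the same match
                          (None for the first shot of a match).
    player_shot_no:       1 for a player's first shot in the match, 2 for the next...
    """
    ordered = sorted(shots, key=lambda s: (s["match_id"], s["second"]))
    last_shot_time = {}               # match_id -> second of the previous shot
    player_counts = defaultdict(int)  # (match_id, player) -> shots so far
    result = []
    for s in ordered:
        match = s["match_id"]
        previous = last_shot_time.get(match)
        gap = None if previous is None else s["second"] - previous
        player_counts[(match, s["player"])] += 1
        result.append({
            **s,
            "secs_since_last_shot": gap,
            "player_shot_no": player_counts[(match, s["player"])],
        })
        last_shot_time[match] = s["second"]
    return result

# Toy records, invented for the example.
shots = [
    {"match_id": 1, "second": 103, "player": "p2"},
    {"match_id": 1, "second": 100, "player": "p1"},
    {"match_id": 2, "second": 50, "player": "p1"},
    {"match_id": 1, "second": 900, "player": "p1"},
]
features = add_sequence_features(shots)
for row in features:
    print(row)
```

Whether a short gap really marks a rebound would still have to be checked against video, since the gap alone cannot tell a rebound from a quick second attack.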
My first model
The working model in Tableau can be found here (note the model only contains shot location data at this moment and none of the other suggested variables).
The low number of shots from just inside the penalty box is interesting. It could be explained by the fact that a team does not want to let the opponent enter the box, so the striker more often decides to shoot from just outside the box instead of making an extra effort to enter the area. I will check whether there is a bias in how the data producers record the coordinates. If anyone else has shot data showing the same (or a different) picture, please let me know!
Follow-up questions
Which players give passes that lead to shots with high xG? And which ones are involved in the previous passes leading up to the chance?
Do different players really have the same chance of converting an ‘identical’ shot, as the model assumes?
Other (unmentioned) sources
American Soccer Analysis: http://www.americansocceranalysis.com/explanation/
Different game: https://differentgame.wordpress.com/2014/05/19/a-shooting-model-an-expglanation-and-application/
Michael Caley: http://cartilagefreecaptain.sbnation.com/2015/10/19/9295905/premier-league-projections-and-new-expected-goals
SB Nation: http://cartilagefreecaptain.sbnation.com/2014/2/28/5452786/shot-matrix-tottenham-hotspur-stats-analysis-expected-goals
SB Nation: http://cartilagefreecaptain.sbnation.com/2015/4/10/8381071/football-statistics-expected-goals-michael-caley-deadspin
Note: All raw match data that has been used comes from statistics provider DataFactory