Welcome to my Expected Goals model, built on NHL play-by-play (PBP) data. Expected Goals models can be built in several ways, but at their core they all try to predict the probability of a shot, or a group of shots, becoming a goal. Over a game you can assess a player's impact with this metric even if they don't score, and you can likewise discount a player's performance if they simply got lucky and scored. Expected Goals is one of the newer metrics in hockey, and I built mine on a multiple logistic regression.
I took the values provided in the data set, like the location of the shot or the shot type, and added values I extrapolated from the data set, like the angle the shot was taken from relative to the net, whether the shot came shortly after a turnover, and many more.
There were several difficulties up front. It was hard to find a data set that included shot locations at all. Then creating parameters like shot_angle required an understanding of the dimensions of an ice rink and how they map onto the data, and deriving variables like is_rebound required understanding how the plays are sequenced.
I'll walk through the model at a high level; for a more in-depth view of the data cleaning, please see the R file in my GitHub repository: https://github.com/rdoherty2019/rdoherty2019.github.io/blob/master/hockey/expectGoalModel.R
I sourced the data from https://www.kaggle.com/datasets/martinellis/nhl-game-data, then merged the tables together and split the values into seasons in Python. I'm more comfortable working in Python, and this produced several CSVs of data for the model to be built on.
Find that Python file here: https://github.com/rdoherty2019/rdoherty2019.github.io/blob/master/hockey/hockeyDataMerge.ipynb
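For readers who would rather stay in R, the same merge can be sketched there as well. This is a hedged sketch, not the pipeline I actually ran: it assumes the Kaggle file names game_plays.csv, game_plays_players.csv, and game.csv, and that the plays and players tables share play_id and game_id as join keys.
library(readr)
library(dplyr)
# Hypothetical R version of the merge; the actual work was done in Python
plays   <- read_csv('game_plays.csv')          # one row per play event
players <- read_csv('game_plays_players.csv')  # players attached to each play
games   <- read_csv('game.csv')                # game-level metadata
Train_PbP_Data <- plays %>%
  left_join(players, by = c('play_id', 'game_id')) %>%  # assumed join keys
  left_join(games, by = 'game_id') %>%
  filter(season %in% c(20152016, 20162017))             # training seasons (assumption)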
After merging several normalized data sets together, there is a vast amount of information available to us for each play:
colnames(Train_PbP_Data)
## [1] "X1" "play_id" "game_id"
## [4] "team_id_for" "team_id_against" "event"
## [7] "secondaryType" "x" "y"
## [10] "period" "periodType" "periodTime"
## [13] "periodTimeRemaining" "dateTime" "goals_away"
## [16] "goals_home" "description" "st_x"
## [19] "st_y" "player_id" "playerType"
## [22] "season" "type" "date_time_GMT"
## [25] "away_team_id" "home_team_id" "away_goals"
## [28] "home_goals" "outcome" "home_rink_side_start"
## [31] "venue" "venue_link" "venue_time_zone_id"
## [34] "venue_time_zone_offset" "venue_time_zone_tz"
Below is a snippet of the data I have to work with:
# Inspect column types and sample values
str(Train_PbP_Data, max.level = 1)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1248673 obs. of 35 variables:
## $ X1 : num 0 1 2 3 4 5 6 7 8 9 ...
## $ play_id : chr "2016020045_4" "2016020045_4" "2016020045_5" "2016020045_5" ...
## $ game_id : num 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ team_id_for : num 16 16 16 16 16 16 16 16 4 4 ...
## $ team_id_against : num 4 4 4 4 4 4 4 4 16 16 ...
## $ event : chr "Faceoff" "Faceoff" "Shot" "Shot" ...
## $ secondaryType : chr NA NA "Wrist Shot" "Wrist Shot" ...
## $ x : num 0 0 -71 -71 -88 -88 -88 -88 0 0 ...
## $ y : num 0 0 9 9 5 5 5 5 0 0 ...
## $ period : num 1 1 1 1 1 1 1 1 1 1 ...
## $ periodType : chr "REGULAR" "REGULAR" "REGULAR" "REGULAR" ...
## $ periodTime : num 0 0 54 54 56 56 56 56 58 58 ...
## $ periodTimeRemaining : num 1200 1200 1146 1146 1144 ...
## $ dateTime : POSIXct, format: "2016-10-19 01:40:50" "2016-10-19 01:40:50" ...
## $ goals_away : num 0 0 0 0 0 0 0 0 0 0 ...
## $ goals_home : num 0 0 0 0 1 1 1 1 1 1 ...
## $ description : chr "Jonathan Toews faceoff won against Claude Giroux" "Jonathan Toews faceoff won against Claude Giroux" "Artem Anisimov Wrist Shot saved by Michal Neuvirth" "Artem Anisimov Wrist Shot saved by Michal Neuvirth" ...
## $ st_x : num 0 0 71 71 88 88 88 88 0 0 ...
## $ st_y : num 0 0 -9 -9 -5 -5 -5 -5 0 0 ...
## $ player_id : num 8473604 8473512 8473573 8473607 8474141 ...
## $ playerType : chr "Winner" "Loser" "Shooter" "Goalie" ...
## $ season : num 20162017 20162017 20162017 20162017 20162017 ...
## $ type : chr "R" "R" "R" "R" ...
## $ date_time_GMT : POSIXct, format: "2016-10-19 00:30:00" "2016-10-19 00:30:00" ...
## $ away_team_id : num 4 4 4 4 4 4 4 4 4 4 ...
## $ home_team_id : num 16 16 16 16 16 16 16 16 16 16 ...
## $ away_goals : num 4 4 4 4 4 4 4 4 4 4 ...
## $ home_goals : num 7 7 7 7 7 7 7 7 7 7 ...
## $ outcome : chr "home win REG" "home win REG" "home win REG" "home win REG" ...
## $ home_rink_side_start : chr "right" "right" "right" "right" ...
## $ venue : chr "United Center" "United Center" "United Center" "United Center" ...
## $ venue_link : chr "/api/v1/venues/null" "/api/v1/venues/null" "/api/v1/venues/null" "/api/v1/venues/null" ...
## $ venue_time_zone_id : chr "America/Chicago" "America/Chicago" "America/Chicago" "America/Chicago" ...
## $ venue_time_zone_offset: num -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 ...
## $ venue_time_zone_tz : chr "CDT" "CDT" "CDT" "CDT" ...
There are many play types in this data set, all with location parameters and other information that will affect our model.
unique(Train_PbP_Data$event)
## [1] "Faceoff" "Shot" "Goal" "Takeaway" "Hit"
## [6] "Blocked Shot" "Giveaway" "Missed Shot" "Penalty"
I had to remove the non-shot events from the data set so those values would not affect the model, and also remove the rows attached to players who didn't take the shot, like goalies and other skaters on the ice.
# Before filtering
dim(Train_PbP_Data)
## [1] 1248673 35
# After filtering
dim(Train_PbP_Data)
## [1] 686222 35
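The exact filters live in the R file linked above, but the idea can be sketched as follows. This is a minimal sketch assuming a Fenwick-style definition (unblocked shot attempts: shots, goals, and missed shots) and keeping only the row for the player credited with the attempt; the event and playerType values here are assumptions, not a copy of my cleaning code.
library(dplyr)
# Sketch: keep unblocked shot attempts and only the shooter/scorer rows
Train_Fenwick_Data <- Train_PbP_Data %>%
  filter(event %in% c('Shot', 'Goal', 'Missed Shot'),   # Fenwick events
         playerType %in% c('Shooter', 'Scorer')) %>%    # drop goalies, faceoff winners, etc.
  mutate(is_goal = as.numeric(event == 'Goal'))         # binary target for the model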
Although we have the location of the shot and the shot type, there is more information I extracted from the data set that is valuable in predicting the likelihood a shot will turn into a goal.
After feature creation we have several new columns, like distance, shot_angle, is_rebound, and is_rush. I removed the values that would have skewed the model and added values that are valuable predictors of a shot becoming a goal.
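Roughly, the geometric features come from the standardized coordinates. The sketch below assumes st_x/st_y are normalized so the attacking net sits at (89, 0) (the NHL goal line is 89 ft from center ice) and uses a 3-second window for rebounds; both the net-location convention and the window are illustrative assumptions, not necessarily what the repository code uses.
library(dplyr)
Train_Fenwick_Data <- Train_Fenwick_Data %>%
  arrange(game_id, period, periodTime) %>%
  mutate(
    # Straight-line distance from the shot to the net at (89, 0), in feet
    distance = sqrt((89 - abs(st_x))^2 + st_y^2),
    # Angle off the line through the center of the net, in degrees
    shot_angle = abs(atan2(st_y, 89 - abs(st_x))) * 180 / pi,
    # Rebound: prior event in the same game/period was a shot by the same
    # team within 3 seconds (the window is an assumption)
    is_rebound = coalesce(as.numeric(game_id == lag(game_id) &
                                     period == lag(period) &
                                     lag(event) == 'Shot' &
                                     team_id_for == lag(team_id_for) &
                                     periodTime - lag(periodTime) <= 3), 0)
  )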
I used a logistic regression model to predict the likelihood that a shot will become a goal.
# Build Model
xGmodel <- glm(is_goal ~ poly(distance, 3, raw = TRUE) +
poly(shot_angle, 3, raw = TRUE) + secondaryType +
is_rebound + is_rush,
data = Train_Fenwick_Data,
family = binomial(link = 'logit'))
save(xGmodel, file = "xGmodelver2.rda")
summary(xGmodel)
##
## Call:
## glm(formula = is_goal ~ poly(distance, 3, raw = TRUE) + poly(shot_angle,
## 3, raw = TRUE) + secondaryType + is_rebound + is_rush, family = binomial(link = "logit"),
## data = Train_Fenwick_Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0492 -0.4774 -0.3328 -0.2507 2.9773
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.375e+00 5.345e-02 -25.730 < 2e-16 ***
## poly(distance, 3, raw = TRUE)1 -2.039e-02 4.594e-03 -4.437 9.10e-06 ***
## poly(distance, 3, raw = TRUE)2 -1.158e-03 1.305e-04 -8.870 < 2e-16 ***
## poly(distance, 3, raw = TRUE)3 1.270e-05 1.044e-06 12.170 < 2e-16 ***
## poly(shot_angle, 3, raw = TRUE)1 1.474e-03 2.022e-03 0.729 0.46592
## poly(shot_angle, 3, raw = TRUE)2 -3.777e-04 4.642e-05 -8.137 4.06e-16 ***
## poly(shot_angle, 3, raw = TRUE)3 2.604e-06 2.920e-07 8.920 < 2e-16 ***
## secondaryTypeDeflected 5.909e-01 6.117e-02 9.660 < 2e-16 ***
## secondaryTypeSlap Shot 8.199e-01 4.352e-02 18.840 < 2e-16 ***
## secondaryTypeSnap Shot 6.082e-01 3.879e-02 15.677 < 2e-16 ***
## secondaryTypeTip-In 3.714e-01 4.206e-02 8.830 < 2e-16 ***
## secondaryTypeWrap-around -2.834e-01 1.001e-01 -2.831 0.00464 **
## secondaryTypeWrist Shot 3.994e-01 3.268e-02 12.221 < 2e-16 ***
## is_rebound 6.544e-01 2.645e-02 24.741 < 2e-16 ***
## is_rush 2.216e-01 8.472e-02 2.616 0.00891 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 99068 on 163201 degrees of freedom
## Residual deviance: 90332 on 163187 degrees of freedom
## AIC: 90362
##
## Number of Fisher Scoring iterations: 6
coef(xGmodel)
## (Intercept) poly(distance, 3, raw = TRUE)1
## -1.375252e+00 -2.038563e-02
## poly(distance, 3, raw = TRUE)2 poly(distance, 3, raw = TRUE)3
## -1.157918e-03 1.270123e-05
## poly(shot_angle, 3, raw = TRUE)1 poly(shot_angle, 3, raw = TRUE)2
## 1.474089e-03 -3.777363e-04
## poly(shot_angle, 3, raw = TRUE)3 secondaryTypeDeflected
## 2.604261e-06 5.908737e-01
## secondaryTypeSlap Shot secondaryTypeSnap Shot
## 8.198835e-01 6.081882e-01
## secondaryTypeTip-In secondaryTypeWrap-around
## 3.713624e-01 -2.834322e-01
## secondaryTypeWrist Shot is_rebound
## 3.993653e-01 6.544108e-01
## is_rush
## 2.215849e-01
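To build some intuition for these coefficients, you can score one shot by hand: multiply each raw polynomial term by its coefficient, add the dummy for the shot type, and push the total through the inverse logit. Below is a hypothetical wrist shot from 30 feet at a 20-degree angle, not a rebound and not off the rush; the shot itself is made up for illustration.
# Score one hypothetical shot directly from the fitted coefficients
b <- coef(xGmodel)
d <- 30   # distance in feet (hypothetical)
a <- 20   # shot angle in degrees (hypothetical)
eta <- b['(Intercept)'] +
  b['poly(distance, 3, raw = TRUE)1'] * d +
  b['poly(distance, 3, raw = TRUE)2'] * d^2 +
  b['poly(distance, 3, raw = TRUE)3'] * d^3 +
  b['poly(shot_angle, 3, raw = TRUE)1'] * a +
  b['poly(shot_angle, 3, raw = TRUE)2'] * a^2 +
  b['poly(shot_angle, 3, raw = TRUE)3'] * a^3 +
  b['secondaryTypeWrist Shot']   # is_rebound = 0, is_rush = 0
unname(plogis(eta))   # inverse logit: log-odds to probability, about 0.08 here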
I built the model on 2015 and 2016 season data and now want to test it on 2017 season data. There are multiple ways I could have tested and validated the model, but I chose a season-based holdout because it keeps the model from being influenced by any in-season data, and it does not split apart shots that are connected, like a shot and its rebound.
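Before the ROC curve can be computed, the model's predictions need to be attached to the 2017 test set as the xG column used below. A minimal sketch, assuming Test_Fenwick_Data carries the same engineered columns as the training set:
# Attach predicted goal probabilities to the test set as the xG column
Test_Fenwick_Data$xG <- predict(xGmodel, newdata = Test_Fenwick_Data,
                                type = 'response')   # probabilities, not log-odds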
# Test the predictions of our model by plotting its ROC curve
# and calculating the area under the curve
library(pROC)
g <- roc(is_goal ~ xG, data = Test_Fenwick_Data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(g)
# Find the area under the curve
auc(g)
## Area under the curve: 0.7174
We have a reasonably discriminative model, with an area under the ROC curve of about 0.72 (note that AUC measures how well the model ranks goals above non-goals, not raw accuracy). We could improve it in several ways, for example with cross-validation or by removing some of the bias in our data, like shots that are extremely unlikely to ever become goals.
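As one example, a k-fold cross-validation of the AUC on the training seasons could look like the sketch below; the 5 folds and the seed are my choices, and it assumes pROC is already loaded. Note that a naive random split reintroduces the problem of separating connected shots, so grouping folds by game would be safer in practice.
# Hypothetical 5-fold cross-validation of the AUC on the training data
set.seed(42)
folds <- sample(rep(1:5, length.out = nrow(Train_Fenwick_Data)))
cv_auc <- sapply(1:5, function(k) {
  fit <- glm(is_goal ~ poly(distance, 3, raw = TRUE) +
               poly(shot_angle, 3, raw = TRUE) + secondaryType +
               is_rebound + is_rush,
             data = Train_Fenwick_Data[folds != k, ],
             family = binomial(link = 'logit'))
  preds <- predict(fit, newdata = Train_Fenwick_Data[folds == k, ],
                   type = 'response')
  as.numeric(auc(Train_Fenwick_Data$is_goal[folds == k], preds))
})
mean(cv_auc)   # average out-of-fold AUC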
As you can see from the heat map visualization below, shots taken from the slot, with a better angle to the net, are more likely to go in. These are some of the unique insights you can pull out of these models.
avg_xG_by_coord <- Test_Fenwick_Data %>% group_by(x, y) %>%
summarise(xg = mean(xG))
ggplot(avg_xG_by_coord, aes(x, y, fill = xg)) + geom_raster() +
scale_fill_gradient(low = 'blue', high = 'red') +
geom_vline(xintercept = 0, color = 'red') + # center-ice red line
geom_vline(xintercept = 25, color = 'blue') + # blue line
geom_vline(xintercept = 88, color = 'red') + # goal line
xlab('X Coordinates') + ylab('Y Coordinates') +
labs(title = 'Average xG Value by Coordinate')
The model did a good job of predicting how many goals a player should have scored. We can use it to create visualizations like the ones below and judge how well individual players have performed.
# Code to group xG and goals by player
## Go back and add the player names for more information
xg_player <- Test_Fenwick_Data %>%
group_by(player_id, team_id_for) %>%
summarise( xG = sum(xG), Goals = sum(is_goal), Difference = sum(xG) - sum(is_goal))
head(xg_player)
## # A tibble: 6 x 5
## # Groups: player_id [6]
## player_id team_id_for xG Goals Difference
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8464989 5 9.20 7 2.20
## 2 8465009 6 5.64 7 -1.36
## 3 8466138 28 12.1 20 -7.94
## 4 8466139 10 20.8 16 4.85
## 5 8468493 10 3.47 5 -1.53
## 6 8468498 15 2.65 3 -0.351
arrange(xg_player, desc(xG))
## # A tibble: 1,005 x 5
## # Groups: player_id [893]
## player_id team_id_for xG Goals Difference
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8475166 10 42.2 49 -6.84
## 2 8477492 21 38.4 46 -7.59
## 3 8475794 25 37.7 37 0.689
## 4 8475169 28 35.7 32 3.74
## 5 8478414 28 35.6 35 0.584
## 6 8475765 19 35.2 44 -8.77
## 7 8475848 8 34.0 33 0.962
## 8 8475314 2 33.7 29 4.72
## 9 8474715 29 33.1 43 -9.91
## 10 8476881 28 33.0 44 -11.0
## # … with 995 more rows
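The comment above about adding player names can be handled with a join against the Kaggle data set's player_info.csv. This is a hedged sketch: the file and column names (player_id, firstName, lastName) are assumptions about the source data set, and the same pattern works with team_info.csv for the team table further down.
# Hypothetical join to label players with their names
library(readr)
player_info <- read_csv('player_info.csv')   # assumed file from the Kaggle data set
xg_player_named <- xg_player %>%
  left_join(player_info %>% select(player_id, firstName, lastName),
            by = 'player_id') %>%
  arrange(desc(xG))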
ggplot(aes(x = xG, y = Goals), data = xg_player) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = 'Expected Goals vs Goals by Player')
# Linear model
play_xg <- lm(Goals ~ xG, data = xg_player)
summary(play_xg)
##
## Call:
## lm(formula = Goals ~ xG, data = xg_player)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9279 -1.6197 0.2038 1.1817 21.0743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03877 0.15084 -6.887 1.01e-11 ***
## xG 1.16900 0.01351 86.554 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.438 on 1003 degrees of freedom
## Multiple R-squared: 0.8819, Adjusted R-squared: 0.8818
## F-statistic: 7492 on 1 and 1003 DF, p-value: < 2.2e-16
We can even use the model to assess how well a team performed: how many more or fewer goals they scored than the model expected.
# By team
## Join the team name in to add more context
xg_team <- Test_Fenwick_Data %>%
group_by(team_id_for) %>%
summarise( xG = sum(xG), Goals = sum(is_goal), Difference = sum(xG) - sum(is_goal))
arrange(xg_team, desc(xG))
## # A tibble: 34 x 4
## team_id_for xG Goals Difference
## <dbl> <dbl> <dbl> <dbl>
## 1 28 316. 346 -30.1
## 2 19 316. 318 -2.30
## 3 12 304. 281 22.9
## 4 6 302. 336 -34.1
## 5 54 289. 271 18.4
## 6 10 287. 302 -14.7
## 7 21 279. 292 -12.7
## 8 25 276. 243 32.8
## 9 5 273. 277 -3.68
## 10 20 266. 300 -34.1
## # … with 24 more rows
team_xg <- lm(Goals ~ xG, data = xg_team)
summary(team_xg)
##
## Call:
## lm(formula = Goals ~ xG, data = xg_team)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.910 -13.491 -2.756 14.411 64.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.83398 12.48907 0.227 0.822
## xG 1.02283 0.05164 19.805 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23 on 32 degrees of freedom
## Multiple R-squared: 0.9246, Adjusted R-squared: 0.9222
## F-statistic: 392.3 on 1 and 32 DF, p-value: < 2.2e-16
ggplot(aes(x = xG, y = Goals), data = xg_team) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = 'Expected Goals vs Goals by Team')