Welcome to my Expected Goals Model built on NHL Play By Play (PBP) data. Expected Goals Models can be built in several ways, but at the core of it they are all trying to predict the probability of a shot, or a group of shots, becoming a goal. Over a game you can assess the player’s impact based on this metric even if they don’t score. You can also downgrade a players performance based on this metric if they just happen to get lucky and score a goal. Expected Goals is the new metric in hockey and I built mine off a multiple logistic regression.

I took the values provided to me in the data set, like the location of the shot or the shot type, and added values I extrapolated from the data set, like the angle the shot was take from the net, if the shot was take closely after a turnover, and many more.

There were several difficulties up front. It was difficult to find data with shot location data. Then creating parameters like ‘Shot Angle’ required an understanding of the dimensions of an ice rink and how that fits into the data, or understanding how the data is structured to determine variables like is_rebound.

I’ll walkthrough the model at a high level but for a more in depth view of the data cleaning please see the r file in my github repository. (https://github.com/rdoherty2019/rdoherty2019.github.io/blob/master/hockey/expectGoalModel.R)

The Data

I was able to source the data from : https://www.kaggle.com/datasets/martinellis/nhl-game-data I was able to merge it together and separate the values into seasons in Python. I’m more comfortable working in Python and was able to create several CSVs worth of data for this model to be built on.

Find that Python file here: https://github.com/rdoherty2019/rdoherty2019.github.io/blob/master/hockey/hockeyDataMerge.ipynb

Data Set

After merging several normalized data sets together. There is a vast amount of information available to us for each play

colnames(Train_PbP_Data)
##  [1] "X1"                     "play_id"                "game_id"               
##  [4] "team_id_for"            "team_id_against"        "event"                 
##  [7] "secondaryType"          "x"                      "y"                     
## [10] "period"                 "periodType"             "periodTime"            
## [13] "periodTimeRemaining"    "dateTime"               "goals_away"            
## [16] "goals_home"             "description"            "st_x"                  
## [19] "st_y"                   "player_id"              "playerType"            
## [22] "season"                 "type"                   "date_time_GMT"         
## [25] "away_team_id"           "home_team_id"           "away_goals"            
## [28] "home_goals"             "outcome"                "home_rink_side_start"  
## [31] "venue"                  "venue_link"             "venue_time_zone_id"    
## [34] "venue_time_zone_offset" "venue_time_zone_tz"

Below is a snippet of the data I have to work with:

# Get Columns
str(Train_PbP_Data, max.level = 1)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1248673 obs. of  35 variables:
##  $ X1                    : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ play_id               : chr  "2016020045_4" "2016020045_4" "2016020045_5" "2016020045_5" ...
##  $ game_id               : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ team_id_for           : num  16 16 16 16 16 16 16 16 4 4 ...
##  $ team_id_against       : num  4 4 4 4 4 4 4 4 16 16 ...
##  $ event                 : chr  "Faceoff" "Faceoff" "Shot" "Shot" ...
##  $ secondaryType         : chr  NA NA "Wrist Shot" "Wrist Shot" ...
##  $ x                     : num  0 0 -71 -71 -88 -88 -88 -88 0 0 ...
##  $ y                     : num  0 0 9 9 5 5 5 5 0 0 ...
##  $ period                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ periodType            : chr  "REGULAR" "REGULAR" "REGULAR" "REGULAR" ...
##  $ periodTime            : num  0 0 54 54 56 56 56 56 58 58 ...
##  $ periodTimeRemaining   : num  1200 1200 1146 1146 1144 ...
##  $ dateTime              : POSIXct, format: "2016-10-19 01:40:50" "2016-10-19 01:40:50" ...
##  $ goals_away            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ goals_home            : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ description           : chr  "Jonathan Toews faceoff won against Claude Giroux" "Jonathan Toews faceoff won against Claude Giroux" "Artem Anisimov Wrist Shot saved by Michal Neuvirth" "Artem Anisimov Wrist Shot saved by Michal Neuvirth" ...
##  $ st_x                  : num  0 0 71 71 88 88 88 88 0 0 ...
##  $ st_y                  : num  0 0 -9 -9 -5 -5 -5 -5 0 0 ...
##  $ player_id             : num  8473604 8473512 8473573 8473607 8474141 ...
##  $ playerType            : chr  "Winner" "Loser" "Shooter" "Goalie" ...
##  $ season                : num  20162017 20162017 20162017 20162017 20162017 ...
##  $ type                  : chr  "R" "R" "R" "R" ...
##  $ date_time_GMT         : POSIXct, format: "2016-10-19 00:30:00" "2016-10-19 00:30:00" ...
##  $ away_team_id          : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ home_team_id          : num  16 16 16 16 16 16 16 16 16 16 ...
##  $ away_goals            : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ home_goals            : num  7 7 7 7 7 7 7 7 7 7 ...
##  $ outcome               : chr  "home win REG" "home win REG" "home win REG" "home win REG" ...
##  $ home_rink_side_start  : chr  "right" "right" "right" "right" ...
##  $ venue                 : chr  "United Center" "United Center" "United Center" "United Center" ...
##  $ venue_link            : chr  "/api/v1/venues/null" "/api/v1/venues/null" "/api/v1/venues/null" "/api/v1/venues/null" ...
##  $ venue_time_zone_id    : chr  "America/Chicago" "America/Chicago" "America/Chicago" "America/Chicago" ...
##  $ venue_time_zone_offset: num  -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 ...
##  $ venue_time_zone_tz    : chr  "CDT" "CDT" "CDT" "CDT" ...

Data Cleaning

There are many play types in this data set that all have location parameters and other information that will affect our model.

unique(Train_PbP_Data$event)
## [1] "Faceoff"      "Shot"         "Goal"         "Takeaway"     "Hit"         
## [6] "Blocked Shot" "Giveaway"     "Missed Shot"  "Penalty"

I had to remove the non-shot type events from the data set so those values will not affect the model. We also had to remove events from players that didn’t take the shot, like goalies or other players on the ice.

Before

dim(Train_PbP_Data)
## [1] 1248673      35

After

dim(Train_PbP_Data)
## [1] 686222     35

Data Creation

Although we have the location of the shot and the shot type, there is more information I extracted from the data set was valuable in predicting the likelihood a shot is going to turn into a goal.

After data creation we have several new columns: colnames(Train_PbP_Data) . Like the distance, shot_angle, is_rebound, is_rush.

Model Generation

I removed all the values that will affect the model and added values that will be valuable predictors for a shot becoming a goal.

I used a Logistic Regression model to predict the likelihood a shot will become a goal.

# Build Model
xGmodel <- glm(is_goal ~ poly(distance, 3, raw = TRUE) + 
                 poly(shot_angle, 3, raw = TRUE) + secondaryType +
                 is_rebound + is_rush,
               data = Train_Fenwick_Data, 
               family = binomial(link = 'logit'))
save(xGmodel, file = "xGmodelver2.rda")
summary(xGmodel)
## 
## Call:
## glm(formula = is_goal ~ poly(distance, 3, raw = TRUE) + poly(shot_angle, 
##     3, raw = TRUE) + secondaryType + is_rebound + is_rush, family = binomial(link = "logit"), 
##     data = Train_Fenwick_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0492  -0.4774  -0.3328  -0.2507   2.9773  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -1.375e+00  5.345e-02 -25.730  < 2e-16 ***
## poly(distance, 3, raw = TRUE)1   -2.039e-02  4.594e-03  -4.437 9.10e-06 ***
## poly(distance, 3, raw = TRUE)2   -1.158e-03  1.305e-04  -8.870  < 2e-16 ***
## poly(distance, 3, raw = TRUE)3    1.270e-05  1.044e-06  12.170  < 2e-16 ***
## poly(shot_angle, 3, raw = TRUE)1  1.474e-03  2.022e-03   0.729  0.46592    
## poly(shot_angle, 3, raw = TRUE)2 -3.777e-04  4.642e-05  -8.137 4.06e-16 ***
## poly(shot_angle, 3, raw = TRUE)3  2.604e-06  2.920e-07   8.920  < 2e-16 ***
## secondaryTypeDeflected            5.909e-01  6.117e-02   9.660  < 2e-16 ***
## secondaryTypeSlap Shot            8.199e-01  4.352e-02  18.840  < 2e-16 ***
## secondaryTypeSnap Shot            6.082e-01  3.879e-02  15.677  < 2e-16 ***
## secondaryTypeTip-In               3.714e-01  4.206e-02   8.830  < 2e-16 ***
## secondaryTypeWrap-around         -2.834e-01  1.001e-01  -2.831  0.00464 ** 
## secondaryTypeWrist Shot           3.994e-01  3.268e-02  12.221  < 2e-16 ***
## is_rebound                        6.544e-01  2.645e-02  24.741  < 2e-16 ***
## is_rush                           2.216e-01  8.472e-02   2.616  0.00891 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 99068  on 163201  degrees of freedom
## Residual deviance: 90332  on 163187  degrees of freedom
## AIC: 90362
## 
## Number of Fisher Scoring iterations: 6
coef(xGmodel)
##                      (Intercept)   poly(distance, 3, raw = TRUE)1 
##                    -1.375252e+00                    -2.038563e-02 
##   poly(distance, 3, raw = TRUE)2   poly(distance, 3, raw = TRUE)3 
##                    -1.157918e-03                     1.270123e-05 
## poly(shot_angle, 3, raw = TRUE)1 poly(shot_angle, 3, raw = TRUE)2 
##                     1.474089e-03                    -3.777363e-04 
## poly(shot_angle, 3, raw = TRUE)3           secondaryTypeDeflected 
##                     2.604261e-06                     5.908737e-01 
##           secondaryTypeSlap Shot           secondaryTypeSnap Shot 
##                     8.198835e-01                     6.081882e-01 
##              secondaryTypeTip-In         secondaryTypeWrap-around 
##                     3.713624e-01                    -2.834322e-01 
##          secondaryTypeWrist Shot                       is_rebound 
##                     3.993653e-01                     6.544108e-01 
##                          is_rush 
##                     2.215849e-01

Test Model

We built the model on 2015 and 2016 season data and now want to test on 2017 season data. There are multiple ways I could have tested and validated the data but I chose this approach because it would not allow my model to be affected by any of the in-season data, it would also not separate shot values that are connected.

# test the predictions of our model is to plot it’s ROC curve 
# calculate the area underneath the curve
g <- roc(is_goal ~ xG, data = Test_Fenwick_Data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(g)

#Find the area underneath the curve
auc(g)
## Area under the curve: 0.7174

We have a relatively accurate model with a higher than 72% accuracy in predicting goals based on shots. We could improve the accuracy of the plot several ways. We could use cross validation or remove some of the bias in our data like shots that are unlikely to be turned into a goal.

Shooting Plot

As you can see from this heat map visualization shots take from the slot with a better angle to the net are more likely to go in. These are some of the unique opportunites you can make from these models.

avg_xG_by_coord <- Test_Fenwick_Data %>% group_by(x, y) %>%
  summarise(xg = mean(xG))

ggplot(avg_xG_by_coord, aes(x, y, fill = xg)) + geom_raster() +
  scale_fill_gradient(low = 'blue', high = 'red')+
  geom_vline(xintercept = 0, color = 'red') +
  geom_vline(xintercept = 25, color = 'blue') +
  geom_vline(xintercept = 88, color = 'red') +
  xlab('X Coordinates') + ylab('Y Coordinates') +
  labs(title = 'Average xG Value by Coordinate')

Player Evaluation

The model I created did a good job of predicting how many goals a player should have. We can use this model and then create visualizations like the ones below to determine how well players have performed.

#code to group xg and goals by player
## Go back ad add the player names for more information
xg_player <- Test_Fenwick_Data %>%
  group_by(player_id, team_id_for) %>%
  summarise( xG = sum(xG), Goals = sum(is_goal), Difference = sum(xG) - sum(is_goal))
head(xg_player)
## # A tibble: 6 x 5
## # Groups:   player_id [6]
##   player_id team_id_for    xG Goals Difference
##       <dbl>       <dbl> <dbl> <dbl>      <dbl>
## 1   8464989           5  9.20     7      2.20 
## 2   8465009           6  5.64     7     -1.36 
## 3   8466138          28 12.1     20     -7.94 
## 4   8466139          10 20.8     16      4.85 
## 5   8468493          10  3.47     5     -1.53 
## 6   8468498          15  2.65     3     -0.351
arrange(xg_player, desc(xG))
## # A tibble: 1,005 x 5
## # Groups:   player_id [893]
##    player_id team_id_for    xG Goals Difference
##        <dbl>       <dbl> <dbl> <dbl>      <dbl>
##  1   8475166          10  42.2    49     -6.84 
##  2   8477492          21  38.4    46     -7.59 
##  3   8475794          25  37.7    37      0.689
##  4   8475169          28  35.7    32      3.74 
##  5   8478414          28  35.6    35      0.584
##  6   8475765          19  35.2    44     -8.77 
##  7   8475848           8  34.0    33      0.962
##  8   8475314           2  33.7    29      4.72 
##  9   8474715          29  33.1    43     -9.91 
## 10   8476881          28  33.0    44    -11.0  
## # … with 995 more rows
ggplot(aes(x = xG, y = Goals), data = xg_player) +
  geom_point() + 
  geom_smooth(method = 'lm') +
  labs(title = 'Expected Goals vs Goals by Player')

# Linear model
play_xg <- lm(Goals ~ xG, data = xg_player)
summary(play_xg)
## 
## Call:
## lm(formula = Goals ~ xG, data = xg_player)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9279  -1.6197   0.2038   1.1817  21.0743 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.03877    0.15084  -6.887 1.01e-11 ***
## xG           1.16900    0.01351  86.554  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.438 on 1003 degrees of freedom
## Multiple R-squared:  0.8819, Adjusted R-squared:  0.8818 
## F-statistic:  7492 on 1 and 1003 DF,  p-value: < 2.2e-16

Team Evaluation

We can even use the model to assess how well a team performed. How many more or less goals they scored than expected by the model I created.

# By team 
## Join the team name in to add more context
xg_team <- Test_Fenwick_Data %>%
  group_by(team_id_for) %>%
  summarise( xG = sum(xG), Goals = sum(is_goal), Difference = sum(xG) - sum(is_goal))

arrange(xg_team, desc(xG))
## # A tibble: 34 x 4
##    team_id_for    xG Goals Difference
##          <dbl> <dbl> <dbl>      <dbl>
##  1          28  316.   346     -30.1 
##  2          19  316.   318      -2.30
##  3          12  304.   281      22.9 
##  4           6  302.   336     -34.1 
##  5          54  289.   271      18.4 
##  6          10  287.   302     -14.7 
##  7          21  279.   292     -12.7 
##  8          25  276.   243      32.8 
##  9           5  273.   277      -3.68
## 10          20  266.   300     -34.1 
## # … with 24 more rows
team_xg <- lm(Goals ~ xG, data = xg_team)
summary(team_xg)
## 
## Call:
## lm(formula = Goals ~ xG, data = xg_team)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.910 -13.491  -2.756  14.411  64.357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.83398   12.48907   0.227    0.822    
## xG           1.02283    0.05164  19.805   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23 on 32 degrees of freedom
## Multiple R-squared:  0.9246, Adjusted R-squared:  0.9222 
## F-statistic: 392.3 on 1 and 32 DF,  p-value: < 2.2e-16
ggplot(aes(x = xG, y = Goals), data = xg_team) +
  geom_point() + 
  geom_smooth(method = 'lm') +
  labs(title = 'Expected Goals vs Goals by Team')