Fredrik Olsson
Data Scientist
With the current Premier League season 2023/2024 moving rapidly towards its final phase, we are going to take a small step back in time and focus on the previous three seasons. While there is of course more to the sport than scoring goals, it is arguably the most enticing part of the game. So, in this blog post, we will outline an approach to predicting the top scorers in the Premier League season 2022/2023 using machine learning techniques and player data from previous seasons.
More specifically, we want to build a model that - given a set of information on the player - can predict the number of goals that player will score during the Premier League season. With such a model, we can make predictions on all the Premier League players and thus get a list of predicted top scorers and their respective goal counts.
While this is a regression problem, it doesn't quite fit the standard linear regression setting, and the reason lies with the target variable: the number of goals. In standard linear regression, the target variable is assumed to follow a normal distribution, i.e. $Y_i \sim N(\mu_i, \sigma^2)$, and we model the expected value as a linear function of our features:

$$\mu_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$$
This, however, assumes that the target can take on any real number, which in our case is not true. The number of goals scored is a count variable, i.e. non-negative integers only. Therefore, we instead assume that the number of goals follows a Poisson distribution, i.e. $Y_i \sim \mathrm{Poi}(\mu_i)$, and we model the logarithm of the expected value as a linear function of our features:

$$\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$$
This is what is called a Poisson Regression problem.
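As a minimal sketch of what Poisson regression looks like in code, here is scikit-learn's `PoissonRegressor` fitted on made-up data (not the actual player dataset, which we get to below):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Hypothetical toy data standing in for player features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_mu = np.exp(0.5 + 0.8 * X[:, 0] - 0.3 * X[:, 1])
y = rng.poisson(true_mu)  # counts: non-negative integers, like goals

# PoissonRegressor fits a GLM with a log link,
# log(E[Y]) = b0 + b1*x1 + b2*x2, i.e. the model described above.
model = PoissonRegressor(alpha=0.0)
model.fit(X, y)
preds = model.predict(X)

# The log link guarantees strictly positive predicted rates.
print(preds.min() > 0)
```

Note how the log link keeps every prediction positive, which a plain linear regression would not.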
Alright, let's dive into the datasets we are going to use to solve this Poisson regression problem. We have three different sets: one from 2022/2023 and two from the two Premier League seasons before that. All data comes from transfermarkt.com. For the previous two seasons we have the following features - as well as the target variable (number of goals scored) - available in the dataset:
We will use the one from the 2020/2021 season as our training data and the one from the 2021/2022 season as the validation set when training and evaluating a model, which we can then use to make predictions on the dataset for the current season, 2022/2023. The observant reader will notice that some features in the train and validation sets are not available before the season starts (when we want to make our predictions for the 2022/2023 season), namely:
games
subs_on
subs_off
assists
We actually don't have `placement` either, but here we will use the value from the previous season. This is more about giving the model a sense of the level of the team the player plays for than the exact position the team finishes in. But when it comes to the number of games (and substitutions on and off) and the assists, we will have to deal with not having those features available when we make the predictions for the 2022/2023 season.
Note that goalkeepers and defenders have been excluded from the datasets, so we are only working with midfielders and attackers. All players who did not score any goals during the previous seasons have also been removed from the training and validation datasets.
Let's start to build some models! We'll start simple and use only the relevant features that are available to us in all datasets:
position
age
market_value
placement
Since we have simple tabular data, we will make use of tree-based models:
Random Forest Regressor
XGBoost Regressor
which often perform best on this type of data. Note that we use Poisson deviance as the split criterion for the Random Forest Regressor, and a Poisson objective function in the XGBoost Regressor, to handle this Poisson regression problem properly.
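A sketch of how the Poisson-aware setting looks for the Random Forest in scikit-learn (the XGBoost analogue would be `XGBRegressor(objective="count:poisson")`); the feature matrix here is synthetic, not the real player data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for the basic features (position/age/market value/placement).
rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 4))
y = rng.poisson(np.exp(1.0 + X[:, 2]))  # goal counts driven mostly by one feature

# criterion="poisson" makes each split minimise Poisson deviance instead of
# squared error, matching the count nature of the target.
rf = RandomForestRegressor(criterion="poisson", n_estimators=100, random_state=1)
rf.fit(X, y)
rf_preds = rf.predict(X)
print(rf_preds[:3])
```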
We will also need an evaluation metric to find the best model, both in terms of model type and hyper-parameters. A natural choice in the case of Poisson regression is the Mean Poisson Deviance:

$$D = \frac{2}{n}\sum_{i=1}^{n}\left(y_i \log\frac{y_i}{\hat{y}_i} - y_i + \hat{y}_i\right)$$
As with many loss functions, the value we get is quite hard to interpret beyond "lower is better". We will therefore instead report a metric that has a natural interpretation in our case and captures what we want our model to achieve, namely the Mean Absolute Error:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
The MAE tells us by how many goals our predictions on average differ from the actual number of goals scored by the players. This aligns well with wanting a model that can predict the number of goals a player will score, and the value is easy to interpret.
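Both metrics are available in scikit-learn; a tiny example with made-up numbers shows how much easier the MAE is to read:

```python
from sklearn.metrics import mean_absolute_error, mean_poisson_deviance

# Hand-made example values, purely for illustration.
y_true = [0, 1, 2, 5, 10]
y_pred = [0.5, 1.2, 2.5, 4.0, 8.0]

# Mean Poisson Deviance: lower is better, but the scale is hard to interpret.
print(mean_poisson_deviance(y_true, y_pred))

# MAE: on average these predictions miss by 0.84 goals.
print(mean_absolute_error(y_true, y_pred))
```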
After training and hyper-parameter tuning the models on the training (2020/2021) and validation (2021/2022) datasets, the best model (according to validation MAE) was an XGBoost Regressor:
Features | Model type | Training MAE | Validation MAE |
---|---|---|---|
Basic | XGBoost | 1.4265 | 2.1397 |
So, on average, this model's goal prediction for a player differs by a little more than 2 goals from the actual number of goals scored.
Okay, so we have our first model. Now, let's try to improve the results! We are going to ignore the fact that we have some features that are not available to us in the dataset for the season 2022/2023, and let the model use them anyway. In other words, let's include:
games
subs_on
subs_off
assists
in the model as well. Hopefully this will improve the model performance; we will deal with these features not being available in all datasets later on.
The inclusion of highly relevant extra information, such as the number of games played, helped the models reach better performance. Again, the XGBoost model (with a different set of hyper-parameters) was the best performer:
Features | Model type | Training MAE | Validation MAE |
---|---|---|---|
Basic | XGBoost | 1.4265 | 2.1397 |
Fully extended | XGBoost | 1.1163 | 1.7314 |
By adding these extra features to the model, we managed to get our predictions ~0.4 goals closer to the actual value on average.
From the results, we would obviously like to include these extra features, but the problem unfortunately still stands: they are missing from the 2022/2023 dataset we want to make predictions on, since we want to predict at the start of the season, before this data is known to us. There are solutions for that, which we will look at later, but it is definitely easier to deal with one missing feature than with four of them at once. Therefore, we will look at the feature importances of the model, see if one of these four is more important than the others, and try to use just that one.
We will use the feature importance technique called Permutation Feature Importance. The idea is that - one feature at a time - we randomly shuffle the values of that feature between the rows in the dataset, thus breaking the relationship between that feature and the target variable. We then compare the model score (in this case validation MAE) on the original data with the score on the distorted data with the shuffled feature. This difference gives us an indication of how much the model depends on that feature. Due to the random nature of the procedure, we repeat it several times for every feature and end up with the following box plot of feature importances (on both the train and validation datasets):
We can clearly see that the `games` feature, together with `position` and `market_value`, proves to be among the most important features for the model. The `subs_on` feature also seems to be pretty useful, but let's stick with only one missing feature.
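The shuffling procedure described above maps directly onto `sklearn.inspection.permutation_importance`; here is a sketch on synthetic data, where by construction only the first feature actually matters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data where only feature 0 drives the target.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = rng.poisson(np.exp(1.0 + 1.5 * X[:, 0]))

X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

model = RandomForestRegressor(criterion="poisson", random_state=2)
model.fit(X_train, y_train)

# Shuffle one feature at a time on the validation set and measure how much
# the score (here negated MAE, as in the post) degrades.
result = permutation_importance(
    model, X_val, y_val,
    scoring="neg_mean_absolute_error",
    n_repeats=10,
    random_state=2,
)
print(result.importances_mean)  # feature 0 should dominate
```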
Before we tackle the problem of the `games` feature being missing in the 2022/2023 dataset, let's train and validate the model using the first set of features (see "Basic features") together with only the `games` feature, leaving out the other ones added for the previous model, and see how well it performs in the setting where that feature is available to us.
The XGBoost model still performs best among the chosen model types, and comparing the validation MAE between this one ("Extended features") and the previous model ("Fully extended features"), we are not that far off from the performance we got when also including the other extra features:
Features | Model type | Training MAE | Validation MAE |
---|---|---|---|
Basic | XGBoost | 1.4265 | 2.1397 |
Fully extended | XGBoost | 1.1163 | 1.7314 |
Extended | XGBoost | 1.2427 | 1.7747 |
At last, we are going to deal with the `games` feature being missing from the latest season's dataset, the one we want to make predictions on. To simulate this setting, we will remove the values for this feature in the validation dataset and then try to impute reasonable values.
The method we will use for this is called $k$-Nearest Neighbours ($k$-NN) imputation. Here we will make use of the fact that the `games` feature is available in the training data. The idea is that, for every data point in the validation dataset, we find the $k$ (5 in our case) most similar data points in the training dataset based on the features we do have available, and then impute the average of those data points' `games` values. So we are essentially modelling the missing feature using the available ones.
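This imputation step can be sketched with `KNeighborsRegressor`, which averages the `games` values of the 5 nearest training players; the arrays below are hypothetical stand-ins for the real datasets:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-ins: shared features (position/age/market value/placement)
# plus games played (1-38), known only for the training players.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(100, 4))
games_train = rng.integers(1, 39, size=100)
X_val = rng.normal(size=(20, 4))  # validation players, `games` missing

# For each validation player, average the `games` of the 5 most similar
# training players (Euclidean distance in the shared feature space).
imputer = KNeighborsRegressor(n_neighbors=5, weights="uniform")
imputer.fit(X_train, games_train)
games_val_imputed = imputer.predict(X_val)
print(games_val_imputed[:3])
```

In practice the shared features should be scaled to comparable ranges first, since $k$-NN distances are scale-sensitive.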
So, for our final model evaluation, we will evaluate the models using the same features as in the previous case ("Extended features"), but now with the `games` feature imputed in the validation set - which is the setting we will have when predicting the 2022/2023 season's top scorers. In this final case too, the XGBoost model came out on top as the best option, with the following evaluation scores:
Features | Model type | Training MAE | Validation MAE |
---|---|---|---|
Basic | XGBoost | 1.4265 | 2.1397 |
Fully extended | XGBoost | 1.1163 | 1.7314 |
Extended | XGBoost | 1.2427 | 1.7747 |
Extended - with data imputation | XGBoost | 1.2796 | 2.0896 |
As we can see, we unfortunately get a performance decrease compared to our previous model, where the `games` feature was available in the validation dataset. When we looked at feature importances, this proved to be a very important feature for the model, and the imputation doesn't quite seem to fully replace the quality of the actual `games` data.
This model is, however, an improvement on the first one we made, and since these two are the only ones we can actually use on the dataset we want to predict, it is our best model for the problem at hand!
Using our best model option, the "Extended features - with data imputation" variant of the XGBoost, we get the following predicted top 15 goal scorers for the 2022/2023 season in Premier League:
Name | Position | Team | Goal prediction |
---|---|---|---|
Harry Kane | Forward | Tottenham | 23 |
Mohamed Salah | Winger | Liverpool | 22 |
Kevin De Bruyne | Attacking Midfield | Manchester City | 19 |
Heung-min Son | Winger | Tottenham | 17 |
Erling Haaland | Forward | Manchester City | 17 |
Jarrod Bowen | Winger | West Ham | 12 |
Bernardo Silva | Attacking Midfield | Manchester City | 12 |
Bruno Fernandes | Attacking Midfield | Manchester United | 11 |
Cristiano Ronaldo | Forward | Manchester United | 11 |
Jack Grealish | Winger | Manchester City | 10 |
James Maddison | Attacking Midfield | Leicester City | 10 |
Gabriel Jesus | Forward | Arsenal | 10 |
Richarlison | Forward | Tottenham | 9 |
Raheem Sterling | Winger | Chelsea | 9 |
Dominic Calvert-Lewin | Forward | Everton | 9 |
Below follows the actual top 15 goal scorers:
Name | Position | Team | Actual goals |
---|---|---|---|
Erling Haaland | Forward | Manchester City | 36 |
Harry Kane | Forward | Tottenham | 30 |
Ivan Toney | Forward | Brentford | 20 |
Mohamed Salah | Winger | Liverpool | 19 |
Callum Wilson | Forward | Newcastle | 18 |
Marcus Rashford | Winger | Manchester United | 17 |
Martin Ødegaard | Attacking Midfield | Arsenal | 15 |
Ollie Watkins | Forward | Aston Villa | 15 |
Gabriel Martinelli | Forward | Arsenal | 15 |
Bukayo Saka | Winger | Arsenal | 14 |
Aleksandar Mitrović | Forward | Fulham | 14 |
Harvey Barnes | Winger | Leicester City | 13 |
Rodrigo | Forward | Leeds United | 13 |
Gabriel Jesus | Forward | Arsenal | 11 |
Miguel Almirón | Winger | Newcastle | 11 |
Comparing the predicted and actual lists of top 15 goal scorers in the Premier League season 2022/2023, this clearly proved to be a very tricky problem. Only four players show up in both tables: Harry Kane, Mohamed Salah, Erling Haaland and Gabriel Jesus.
I believe the main thing making this problem really hard is that some very important factors are simply not captured in our feature set:
Team form: teams performing exceptionally well compared to previous seasons, like Arsenal playing at their highest level for many years, with players like Martin Ødegaard, Gabriel Martinelli and Bukayo Saka making the list of top scorers.
Player form: players either finding their goal-scoring form compared to previous seasons, like Marcus Rashford, or losing theirs, like Jarrod Bowen.
New players like Erling Haaland on whom we do not have any historical data.
Injuries and January transfers causing players (aside from form) to play fewer games than expected - for example Cristiano Ronaldo, who left the Premier League in the January transfer window.
Apart from the lack of features, we are also working with very small datasets with high variance, which is not an optimal setting for a successful machine learning solution.
To mitigate some of these issues, and to try and improve the model performance in general, a few ideas to make the model better could be:
Increase the amount of data:
Include more historical data from previous Premier League seasons. We would probably have to deal with the time aspect in the modelling, i.e. that goal-scoring behaviour tends to change over time.
Include data from other leagues. Other leagues can of course differ from the Premier League in terms of goal scoring, so that would have to be handled in the modelling phase.
Explore the possibilities of finding additional useful player data that can be added to the feature set.
While contradicting the efforts to increase the amount of data, we could set a higher inclusion threshold than at least 1 goal for the training and validation data. The idea is to model only players that have a chance of making the list of top scorers, rather than diluting the dataset with players who only score a few goals.
We could try a more robust modelling of the missing `games` feature than simple $k$-NN imputation.
We could do some more feature engineering on the data we have, to maybe find useful combinations of features that the model does not find on its own during training.
All in all, a very fun project with plenty of room for improvements!
Published on February 23rd 2024
Last updated on February 23rd 2024, 15:34