Building a Football Expected Goals (xG) Model with XGBoost
In football analytics, Expected Goals (xG) has become one of the most important metrics for evaluating team and player performance. In this article, I'll walk you through how I built an xG model using XGBoost and deployed it as an interactive web application.
What is Expected Goals (xG)?
Expected Goals (xG) is a statistical measure that quantifies the quality of a shot based on various factors. Instead of simply counting goals, xG tells us how likely a shot was to result in a goal. An xG of 0.7 means the shot had a 70% chance of being scored. Key factors that influence xG include:
- Shot location and angle to goal
- Body part used (foot, head, etc.)
- Type of attack (open play, set piece, etc.)
- Defensive pressure
- Whether it was a one-on-one situation
- And many other variables
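To make the definition concrete, here is a small illustration using made-up shot values (they are not taken from the model). Summing the xG of each shot gives the number of goals a team would be expected to score from those chances, and, if we assume shots are independent, the individual values also give the chance of scoring at least once.

```python
# Hypothetical xG values for five shots taken by one team in a match.
shot_xg = [0.70, 0.05, 0.12, 0.31, 0.02]

# Expected goals for the match is the sum of the per-shot probabilities.
total_xg = sum(shot_xg)

# Probability that none of the shots is scored, treating shots as independent.
p_no_goal = 1.0
for xg in shot_xg:
    p_no_goal *= 1.0 - xg

print(f"Total xG: {total_xg:.2f}")
print(f"Chance of scoring at least once: {1.0 - p_no_goal:.2f}")
```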
The Data Source
For this project, I used data from StatsBomb, which provides event data for football matches. Their open data repository includes competitions like the Women's World Cup and men's Champions League finals.
Building the xG Model
Data Preparation
The first step was to extract shot data from the Statsbomb dataset and prepare it for modelling. This involved:
- Extracting shot locations and outcomes
- Calculating derived features like distance and angle to goal
- Encoding categorical features like shot type and body part
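The derived features in the list above can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes StatsBomb's pitch coordinates (120 x 80 units, with the attacking goal centred at (120, 40) and the posts at y = 36 and y = 44), and the function names and category labels are my own placeholders.

```python
import math

# Assumed StatsBomb pitch coordinates: 120 x 80 units, goal centred at (120, 40),
# with the posts at y = 36 and y = 44.
GOAL_X = 120.0
LEFT_POST_Y = 36.0
RIGHT_POST_Y = 44.0
GOAL_CENTRE_Y = 40.0


def shot_distance(x: float, y: float) -> float:
    """Straight-line distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def shot_angle(x: float, y: float) -> float:
    """Angle (radians) subtended by the two goalposts as seen from the shot location."""
    to_left = math.atan2(LEFT_POST_Y - y, GOAL_X - x)
    to_right = math.atan2(RIGHT_POST_Y - y, GOAL_X - x)
    return abs(to_right - to_left)


def one_hot(value: str, categories: list[str], prefix: str) -> dict[str, int]:
    """One-hot encode a categorical value against a fixed list of categories."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


# Example: a header from the penalty spot (illustrative category labels).
features = {
    "distance": shot_distance(108.0, 40.0),
    "angle": shot_angle(108.0, 40.0),
    **one_hot("Head", ["Right Foot", "Left Foot", "Head", "Other"], "body_part"),
}
print(features)
```

Using the angle subtended by the posts, rather than the angle to the goal centre, means wide shots near the byline get a small angle even when they are close to goal.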
Model Selection
After experimenting with different algorithms, I chose XGBoost for the final model because it performed best in my comparisons. I compared it against a Logistic Regression model and used StatsBomb's own xG values as a baseline.
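The article does not state which metrics were used for the comparison. One common choice for probabilistic models like xG is a pair of proper scoring rules, the Brier score and log loss. The sketch below implements both in plain Python and applies them to made-up predictions from two hypothetical models.

```python
import math


def brier_score(y_true: list[int], y_prob: list[float]) -> float:
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true: list[int], y_prob: list[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of the outcomes (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up outcomes (1 = goal) and predictions from two hypothetical models.
goals = [0, 0, 1, 0, 1, 0]
model_a = [0.05, 0.10, 0.60, 0.08, 0.45, 0.03]
model_b = [0.10, 0.10, 0.30, 0.10, 0.30, 0.10]

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(brier_score(goals, preds), 4), round(log_loss(goals, preds), 4))
```

Both metrics reward confident predictions only when they turn out to be right, which makes them better suited to xG than plain accuracy, since most shots are not goals.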
Feature Importance
One of the advantages of using XGBoost is that it is easy to visualise which features matter most for predicting goal probability.
I also used SHAP (SHapley Additive exPlanations) values to understand how each feature contributes to individual predictions.
Model Calibration
A model is well calibrated if, when it predicts that a shot has a 30% chance of being a goal, shots like it are actually scored about 30% of the time. I verified this using calibration curves.
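The numbers behind a calibration curve can be computed with a few lines of plain Python. The sketch below is a generic illustration with made-up data (the function name `calibration_bins` is my own): it groups predictions into equal-width bins and compares the mean predicted probability in each bin with the observed goal rate.

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width bins and compare mean prediction with goal rate.

    Returns a list of (mean_predicted, observed_rate, count) for each non-empty bin.
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, goals, count]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    return [(s / n, g / n, n) for s, g, n in sums if n > 0]


# Made-up example: ten shots predicted at 0.3, three of which were scored.
probs = [0.3] * 10
goals = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
for mean_pred, goal_rate, count in calibration_bins(goals, probs, n_bins=5):
    print(f"predicted {mean_pred:.2f}  observed {goal_rate:.2f}  shots {count}")
```

Plotting the observed rate against the mean prediction gives the calibration curve; a perfectly calibrated model lies on the diagonal.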
Deploying the Model
To make the model accessible and interactive, I built a web application using:
- FastAPI for the backend API
- HTML/JavaScript for the frontend interface
- Render for hosting the application
The application allows users to:
- Place a shot on a football pitch
- Configure shot parameters (body part, shot type, etc.)
- Add defensive players
- Get an instant xG prediction
The Technical Implementation
The backend is built with FastAPI, which handles the model predictions:
    @app.post("/predict_xg")
    def predict_xg(shot: ShotInput):
        # Extract and calculate features
        features = prepare_features_for_model(shot)
        # Encode categorical features
        X = encode_categorical_features(features)
        # Predict xG
        xg_prediction = model.predict_proba(X)[0, 1]
        return {
            "xG": float(xg_prediction),
            "features": features,
        }
The frontend allows users to interact with a visual representation of a football pitch and sends the shot data to the API for prediction.
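The helper functions called by the endpoint are not shown here. As a rough, hypothetical sketch of the defender-related part of what a function like `prepare_features_for_model` might compute, the code below derives two features from the shot and defender positions. The `Shot` class, the function names and the coordinate system (StatsBomb-style, posts at (120, 36) and (120, 44)) are all assumptions for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

# Assumed StatsBomb-style coordinates: posts at (120, 36) and (120, 44).
LEFT_POST = (120.0, 36.0)
RIGHT_POST = (120.0, 44.0)

Point = tuple[float, float]


@dataclass
class Shot:
    """Hypothetical shot payload; the real ShotInput schema may differ."""
    x: float
    y: float
    defenders: list[Point] = field(default_factory=list)


def nearest_defender_distance(shot: Shot) -> Optional[float]:
    """Distance from the shooter to the closest defender, or None if there are none."""
    if not shot.defenders:
        return None
    return min(math.hypot(dx - shot.x, dy - shot.y) for dx, dy in shot.defenders)


def _side(p: Point, a: Point, b: Point) -> float:
    """Signed area test: which side of the line a-b the point p lies on."""
    return (p[0] - b[0]) * (a[1] - b[1]) - (a[0] - b[0]) * (p[1] - b[1])


def defenders_in_shot_cone(shot: Shot) -> int:
    """Count defenders inside the triangle formed by the shooter and both posts."""
    shooter = (shot.x, shot.y)
    count = 0
    for d in shot.defenders:
        d1 = _side(d, shooter, LEFT_POST)
        d2 = _side(d, LEFT_POST, RIGHT_POST)
        d3 = _side(d, RIGHT_POST, shooter)
        has_neg = d1 < 0 or d2 < 0 or d3 < 0
        has_pos = d1 > 0 or d2 > 0 or d3 > 0
        if not (has_neg and has_pos):  # all signs agree, so the point is inside
            count += 1
    return count


shot = Shot(x=108.0, y=40.0, defenders=[(114.0, 40.0), (100.0, 40.0), (114.0, 60.0)])
print(nearest_defender_distance(shot), defenders_in_shot_cone(shot))
```

Whatever features are computed at prediction time, they need to match the ones used in training, in the same column order, or the model's output will be meaningless.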
Insights and Learnings
Building this xG model revealed several interesting insights:
- Shot location matters most: Distance to goal and the goalkeeper's distance from the shot are the strongest predictors of goal probability.
- Defensive pressure is crucial: The presence and position of defenders significantly affect xG. However, if the nearest defender is more than a few yards from the shooter, xG is almost the same as if no defender were there.
- Context matters: One-on-one situations with the goalkeeper dramatically increase xG.
One weakness of the model is that it can dramatically under-predict or over-predict xG in unusual situations. For example, an open goal from 40 yards out with the keeper upfield is predicted at only 0.08 xG, whereas most professional footballers would be very likely to score from there.
Similarly, as you can see below, the model over-predicts the likelihood of scoring from beyond the halfway line against a well-positioned goalkeeper (the prediction is similar to that for a shot from 35 yards against a well-positioned keeper). In reality we would expect the keeper to save 99.9% of shots from this distance, but the model has most likely never seen a shot like this in its training data.
Try It Yourself
You can play with the xG model yourself at ‣.
Future Improvements
While the current model performs well, there are several ways it could be enhanced:
- Adding more contextual features such as match state, fatigue, and the actions leading up to the shot.
- Adding data such as the height of the ball at impact.
- Training on a larger dataset for better generalisation.
- Developing position-specific models for different types of shots.
- Imposing monotonic constraints on features (this was tested during hyperparameter tuning but did not improve the model, so a more careful implementation may be required).
Additionally, the model may be extended to post-shot xG if this type of data becomes available.
Conclusion
Building an xG model is a fascinating intersection of football knowledge and data science. The model not only helps quantify shot quality but also provides insights into what makes a good scoring opportunity. As football analytics continues to evolve, metrics like xG will become increasingly important for teams, analysts, and fans alike.
If you're interested in exploring the code and methodology further, check out the GitHub repository for this project.