Building a Football xG Model with XGBoost

In football analytics, Expected Goals (xG) has become one of the most important metrics for evaluating team and player performance. In this article, I'll walk you through how I built an xG model using XGBoost and deployed it as an interactive web application.

What is Expected Goals (xG)?

Expected Goals (xG) is a statistical measure that quantifies the quality of a shot based on various factors. Instead of simply counting goals, xG tells us how likely a shot was to result in a goal. A shot with an xG of 0.7 had a 70% chance of being scored. Key factors that influence xG include:

Shot location and angle to goal

Body part used (foot, head, etc.)

Type of attack (open play, set piece, etc.)

Defensive pressure

Whether it was a one-on-one situation

And many other variables
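
Because xG is a probability, per-shot values can be combined with simple arithmetic. The sketch below uses made-up xG values (not output from the model in this article) to show how a team's total xG and its chance of scoring at least once could be derived, assuming the shots are independent:

```python
import math

# Hypothetical xG values for three shots by one team (illustrative only).
shot_xgs = [0.7, 0.1, 0.05]

# Summing per-shot xG gives the number of goals we would expect on average.
expected_goals = sum(shot_xgs)

# Treating shots as independent (a simplifying assumption), the chance of
# scoring at least once is one minus the chance that every shot misses.
p_at_least_one = 1 - math.prod(1 - xg for xg in shot_xgs)

print(f"Expected goals: {expected_goals:.2f}")
print(f"P(at least one goal): {p_at_least_one:.3f}")
```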

The Data Source

For this project, I used data from StatsBomb, which provides event data for football matches. Its open data repository includes competitions such as the Women's World Cup and men's Champions League finals.

Building the xG Model

Data Preparation

The first step was to extract shot data from the StatsBomb dataset and prepare it for modelling. This involved:

Extracting shot locations and outcomes

Calculating derived features like distance and angle to goal

Encoding categorical features like shot type and body part
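
As a rough illustration of the derived features listed above, here is a minimal sketch of how distance, angle, and a one-hot encoding might be computed. It assumes StatsBomb's 120 x 80 pitch coordinates with the goal posts at y = 36 and y = 44; the function names and the body-part categories are hypothetical and not taken from the project code:

```python
import math

# Assumed StatsBomb pitch coordinates: 120 x 80, attacking goal centred at
# (120, 40) with posts at y = 36 and y = 44.
GOAL_X = 120.0
GOAL_CENTRE_Y = 40.0
LEFT_POST_Y = 36.0
RIGHT_POST_Y = 44.0


def distance_to_goal(x, y):
    """Straight-line distance from the shot to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def angle_to_goal(x, y):
    """Angle (in radians) between the two posts as seen from the shot."""
    to_left = math.atan2(LEFT_POST_Y - y, GOAL_X - x)
    to_right = math.atan2(RIGHT_POST_Y - y, GOAL_X - x)
    return abs(to_right - to_left)


def one_hot(value, categories):
    """Encode a categorical value as 0/1 indicator columns."""
    return {f"is_{c}": int(value == c) for c in categories}


# Hypothetical category list; the real project may use different labels.
BODY_PARTS = ["right_foot", "left_foot", "head", "other"]

# Example: a right-footed shot from the penalty spot (108, 40).
features = {
    "distance": distance_to_goal(108.0, 40.0),
    "angle": angle_to_goal(108.0, 40.0),
    **one_hot("right_foot", BODY_PARTS),
}
print(features)
```

The angle here is the opening between the posts rather than the angle to the goal centre, which is one common choice; the article does not say which definition the model uses.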

Model Selection

After experimenting with different algorithms, I chose XGBoost for the final model because it performed better than the alternatives I tried. I compared it against a Logistic Regression model, with StatsBomb's own xG values as a baseline.

ROC Curve comparison of XGBoost vs Logistic Regression vs StatsBomb
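
A ROC comparison like this is usually summarised by the area under the curve (AUC): the probability that a randomly chosen goal receives a higher predicted xG than a randomly chosen non-goal. The sketch below computes it directly from that definition in plain Python. The labels and scores are toy values, not results from this project, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead:

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (goal, non-goal) pairs ranked correctly.

    Ties between a goal and a non-goal count as half a correct pair.
    """
    goals = [s for y, s in zip(y_true, y_score) if y == 1]
    misses = [s for y, s in zip(y_true, y_score) if y == 0]
    if not goals or not misses:
        raise ValueError("Need at least one goal and one non-goal.")

    correct = 0.0
    for g in goals:
        for m in misses:
            if g > m:
                correct += 1.0
            elif g == m:
                correct += 0.5
    return correct / (len(goals) * len(misses))


# Toy data: 1 = goal, 0 = no goal, with predictions from two hypothetical models.
outcomes = [0, 0, 1, 1]
model_a = [0.10, 0.40, 0.35, 0.80]
model_b = [0.10, 0.20, 0.35, 0.80]

print("Model A AUC:", roc_auc(outcomes, model_a))
print("Model B AUC:", roc_auc(outcomes, model_b))
```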

Feature Importance

One of the advantages of using XGBoost is that we can easily visualise which features are most important for predicting goal probability:

I also used SHAP (SHapley Additive exPlanations) values to understand how each feature contributes to individual predictions:
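
To give a feel for what a SHAP value represents, the sketch below computes exact Shapley values by brute force for a tiny hand-written scoring function. This is not the `shap` library and not the model from this article: `toy_xg`, its coefficients, and the baseline shot are all invented for illustration. The point is the additive property, where the per-feature contributions sum to the difference between the prediction and the baseline prediction:

```python
import math
from itertools import permutations


def toy_xg(features):
    """Hypothetical xG function of [distance, angle, is_header]."""
    distance, angle, is_header = features
    logit = 0.5 - 0.12 * distance + 1.5 * angle - 0.8 * is_header
    return 1.0 / (1.0 + math.exp(-logit))


def shapley_values(model, x, baseline):
    """Exact Shapley values, averaging over every feature ordering."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]           # switch feature i to its real value
            updated = model(current)
            contributions[i] += updated - previous
            previous = updated
    return [c / len(orderings) for c in contributions]


baseline_shot = [20.0, 0.3, 0]   # assumed "typical" shot
close_header = [8.0, 0.9, 1]     # the shot being explained

phi = shapley_values(toy_xg, close_header, baseline_shot)
for name, value in zip(["distance", "angle", "is_header"], phi):
    print(f"{name}: {value:+.3f}")
```

Brute force only works for a handful of features; the `shap` library uses much more efficient algorithms for tree models.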

Model Calibration

In a well-calibrated model, shots predicted to have a 30% chance of being a goal should actually result in a goal about 30% of the time. I verified this using calibration curves:
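
The idea behind a calibration curve can be sketched in a few lines: group shots into bins by predicted probability, then compare the mean prediction in each bin with the observed goal rate. The function below is a minimal plain-Python version using toy data, not the code or results from this project (scikit-learn's `calibration_curve` offers a similar calculation):

```python
def calibration_table(y_true, y_prob, n_bins=10):
    """Return (mean predicted xG, observed goal rate, shot count) per bin.

    Bins are equal-width slices of [0, 1]; empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for outcome, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in last bin
        bins[index].append((outcome, prob))

    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            goal_rate = sum(o for o, _ in members) / len(members)
            rows.append((mean_pred, goal_rate, len(members)))
    return rows


# Toy data: 1 = goal, 0 = no goal.
outcomes = [0, 0, 1, 0, 1, 1]
predictions = [0.10, 0.15, 0.12, 0.80, 0.85, 0.90]

for mean_pred, goal_rate, count in calibration_table(outcomes, predictions, n_bins=2):
    print(f"predicted {mean_pred:.2f}  observed {goal_rate:.2f}  shots {count}")
```

For a well-calibrated model the predicted and observed columns should be close in every bin that contains enough shots.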

Deploying the Model

To make the model accessible and interactive, I built a web application using:

FastAPI for the backend API

HTML/JavaScript for the frontend interface

Render for hosting the application

The application allows users to:

Place a shot on a football pitch

Configure shot parameters (body part, shot type, etc.)

Add defensive players

Get an instant xG prediction

The Technical Implementation

The backend is built with FastAPI, which handles the model predictions:


@app.post("/predict_xg")
def predict_xg(shot: ShotInput):
    # Extract and calculate features
    features = prepare_features_for_model(shot)

    # Encode categorical features
    X = encode_categorical_features(features)

    # Predict xG
    xg_prediction = model.predict_proba(X)[0, 1]

    return {
        "xG": float(xg_prediction),
        "features": features,
    }

The frontend allows users to interact with a visual representation of a football pitch and sends the shot data to the API for prediction.
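
To illustrate the request the frontend makes, here is a Python stand-in for the JavaScript call. Only the `/predict_xg` path comes from the endpoint above; the payload field names, the `build_payload` and `post_shot` helpers, and the local base URL are assumptions, since the `ShotInput` schema is not shown in this article:

```python
import json
import urllib.request


def build_payload(x, y, body_part, shot_type, defenders):
    """Assemble a shot description (hypothetical field names)."""
    return {
        "x": x,
        "y": y,
        "body_part": body_part,
        "shot_type": shot_type,
        "defenders": [{"x": dx, "y": dy} for dx, dy in defenders],
    }


def post_shot(payload, base_url="http://localhost:8000"):
    """POST the shot as JSON to /predict_xg and return the decoded reply.

    base_url is an assumed local development address.
    """
    request = urllib.request.Request(
        f"{base_url}/predict_xg",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(108.0, 40.0, "right_foot", "open_play", [(112.0, 41.0)])
print(json.dumps(payload, indent=2))
# With the API running locally, post_shot(payload) would return a dict
# containing "xG" and "features", matching the endpoint's return value.
```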

Insights and Learnings

Building this xG model revealed several interesting insights:

Shot location matters most: Distance to goal and the goalkeeper's distance from the shot are the strongest predictors of goal probability.

Defensive pressure is crucial: The presence and position of defenders significantly affect xG. That said, if the nearest defender is more than a few yards away from the shooter, xG is almost the same as if no defender were there.

Context matters: One-on-one situations with the goalkeeper dramatically increase xG.
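
The defender and goalkeeper features mentioned in these insights could be derived from player positions along the following lines. This is only a sketch: the function names are hypothetical, positions are assumed to be (x, y) tuples in pitch coordinates, and the article does not show how the real model defines these features:

```python
import math


def nearest_defender_distance(shot, defenders):
    """Distance from the shooter to the closest defender.

    Returns None when there are no defenders, which could be passed to the
    model as a missing value.
    """
    if not defenders:
        return None
    return min(math.dist(shot, defender) for defender in defenders)


def keeper_distance_from_shot(shot, keeper):
    """Distance between the goalkeeper and the point the shot is taken from."""
    return math.dist(shot, keeper)


# Hypothetical positions.
shot = (108.0, 40.0)
defenders = [(111.0, 44.0), (100.0, 40.0)]
keeper = (118.0, 40.0)

print("Nearest defender:", nearest_defender_distance(shot, defenders))
print("Keeper distance:", keeper_distance_from_shot(shot, keeper))
```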

The model also has weaknesses: it can dramatically under-predict or over-predict xG in unusual situations. For example, an open goal from 40 yards out, with the keeper upfield, is predicted at only 0.08 xG, whereas most professional footballers would be likely to score from this position.

Similarly, as you can see below, the model over-predicts the likelihood of scoring from beyond the halfway line against a well-positioned goalkeeper (the prediction is similar to that for a shot from 35 yards against a well-positioned keeper). In reality, we would expect the keeper to save 99.9% of shots from this distance, but the model has most likely never seen a shot like this in its training data.

Try It Yourself

You can play with the xG model yourself at ‣.

Future Improvements

While the current model performs well, there are several ways it could be enhanced:

Adding more contextual features, such as match state, fatigue, and the actions leading up to the shot.

Adding data such as the height of the ball at impact.

Training on a larger dataset for better generalisation.

Developing position-specific models for different types of shots.

Imposing monotonic constraints on features (this was tested during hyperparameter tuning but did not improve the model, so a more careful implementation may be required).
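
For the monotonic-constraints idea, XGBoost accepts a `monotone_constraints` parameter with one entry per feature: 1 for increasing, -1 for decreasing, and 0 for unconstrained. The helper below is a hypothetical sketch of building that value in column order so it cannot drift out of sync with the feature list. The feature names and chosen directions are assumptions for illustration, not the configuration tested in this project:

```python
def build_monotone_constraints(feature_names, directions):
    """Build an XGBoost-style constraint string such as "(-1,1,0)".

    directions maps a feature name to -1, 0 or 1; features that are not
    listed are left unconstrained (0).
    """
    unknown = set(directions) - set(feature_names)
    if unknown:
        raise ValueError(f"Unknown features: {sorted(unknown)}")
    values = [directions.get(name, 0) for name in feature_names]
    return "(" + ",".join(str(v) for v in values) + ")"


# Hypothetical feature columns, in the order they appear in the training data.
FEATURES = ["distance", "angle", "nearest_defender_distance", "is_header"]

constraints = build_monotone_constraints(
    FEATURES,
    {
        "distance": -1,                  # further from goal should not raise xG
        "angle": 1,                      # a wider view of goal should not lower xG
        "nearest_defender_distance": 1,  # more space should not lower xG
    },
)
print(constraints)
# The string could then be supplied when creating the model, for example
# xgboost.XGBClassifier(monotone_constraints=constraints).
```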

Additionally, the model may be extended to post-shot xG if this type of data becomes available.

Conclusion

Building an xG model is a fascinating intersection of football knowledge and data science. The model not only helps quantify shot quality but also provides insights into what makes a good scoring opportunity. As football analytics continues to evolve, metrics like xG will become increasingly important for teams, analysts, and fans alike.

If you're interested in exploring the code and methodology further, check out the GitHub repository for this project.