Enhancing Football Predictions with a Deep Learning-Based Expected Goals Model

Tags: xG, AI, Machine Learning, Deep Learning
Published: January 22, 2025
Author: Philip Redford
In this post we will build an expected goals (xG) model using a neural network to predict the likely outcome of a shot.

What is xG?

Expected goals (xG) is a statistical metric that revolutionised the way we analyse football. It measures the quality of a shot by calculating the probability of it resulting in a goal. This probability is determined based on various factors, such as:
  • Shot location: Shots taken from closer to the goal and at better angles have a higher xG.
  • Shot type: Headers, volleys, and shots from distance typically have lower xG values.
  • Shot angle: Shots taken from acute angles are less likely to score.
  • Defensive pressure: Shots taken under pressure are less likely to score.
By assigning an xG value to each shot, we can get a more objective measure of a team or player’s performance. For example, if a team has an xG of 2.0 but only scores 1 goal, we can infer that they were unlucky and should have scored more. Conversely, if a team scores 3 goals but has an xG of 1.5, we can say that they were fortunate.
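As a tiny illustration (made-up shot values), a team's match xG is just the sum of its individual shot probabilities:

shot_probabilities = [0.9, 0.7, 0.4]  # illustrative xG values for three shots
match_xg = sum(shot_probabilities)    # 2.0 expected goals for the match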

Deep Learning and xG

Traditional xG models rely on simple statistical models to calculate shot probabilities. However, recent advancements in Machine Learning, and more specifically Deep Learning, have opened up new possibilities for more accurate and sophisticated xG models.
Deep learning models can:
  • Learn complex patterns: Deep neural networks can learn complex patterns and relationships in data that are difficult for humans to discern. This allows them to consider a wider range of factors when calculating xG, such as player positioning, defensive positioning, and the flow of the game.
  • Handle large datasets: Deep learning models can handle massive amounts of data, which is essential for training accurate xG models. This allows them to learn from a larger and more diverse set of shots, leading to more accurate predictions.
  • Continuously improve: Deep learning models can be continuously trained and updated with new data, allowing them to adapt to changing trends and improve their accuracy over time.

Datasets

We will use the StatsBomb Open Data to train our model. This dataset contains data on completed matches, events within matches, tournament statistics and more. For our purposes we will use the events data across all adult male competitions, filtered to include only shots. This gives us a dataset of 73,874 shots, broken down by competition below (a sketch of one way to pull this data follows the list).
  • La Liga: 21210
  • Premier League: 10837
  • Ligue 1: 10346
  • Serie A: 10033
  • 1. Bundesliga: 8747
  • FIFA World Cup: 3904
  • Indian Super League: 3095
  • UEFA Euro: 2629
  • African Cup of Nations: 1244
  • Copa America: 790
  • Champions League: 594
  • Major League Soccer: 146
  • UEFA Europa League: 89
  • Copa del Rey: 74
  • Liga Profesional: 56
  • North American League: 50
  • FIFA U20 World Cup: 30
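
A sketch of one way to pull this data, assuming the statsbombpy package as the access layer (the exact flattened column names can vary slightly, and looping over every match is slow without caching):

import pandas as pd
from statsbombpy import sb

# All competitions in the open data, filtered to male competitions
competitions = sb.competitions()
male_comps = competitions[competitions["competition_gender"] == "male"]

shot_events = []
for _, comp in male_comps.iterrows():
    matches = sb.matches(competition_id=comp["competition_id"], season_id=comp["season_id"])
    for match_id in matches["match_id"]:
        events = sb.events(match_id=match_id)
        # Keep only shot events
        shot_events.append(events[events["type"] == "Shot"])

df = pd.concat(shot_events, ignore_index=True)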

Feature Engineering

The raw dataset contains useful information, such as player locations at the moment the shot was taken, as well as binary variables like whether the shot was a one-on-one or a first-time shot. We can augment this with our own computed features.

Distance to goal

We can calculate the distance to goal using the location metric. We should expect shots taken from further away to have a lower probability of resulting in a goal.
def calculate_distance_to_goal(location):
    goal_center = np.array([120, 40])
    shot_location = np.array(location)
    distance_to_goal = np.linalg.norm(shot_location - goal_center)
    return distance_to_goal

df["distance_to_goal"] = df["location"].apply(calculate_distance_to_goal)

Angle to goal

Similarly we may expect shots taken from tighter angles to be less likely to result in a goal.
Angle from shooter to the posts. A wider angle indicates a more central position.
def calculate_shot_angle(location):
    left_post = np.array([120, 36])
    right_post = np.array([120, 44])
    shot_location = np.array(location)
    # Exclude shots taken from the same x-coordinate as the goal line
    if shot_location[0] == 120:
        return np.nan
    # Calculate the triangle's side lengths using the Euclidean distance
    a = np.linalg.norm(left_post - right_post)
    b = np.linalg.norm(shot_location - left_post)
    c = np.linalg.norm(shot_location - right_post)
    # Cosine rule for the angle at the shot location (opposite the goal mouth, side a):
    # cos(A) = (b^2 + c^2 - a^2) / (2bc)
    cos_angle = (b**2 + c**2 - a**2) / (2 * b * c)
    shot_angle_radians = np.arccos(np.clip(cos_angle, -1.0, 1.0))  # Clip to handle floating point errors
    # Convert the angle from radians to degrees
    shot_angle_degrees = np.degrees(shot_angle_radians)
    return shot_angle_degrees

df["shot_angle"] = df["location"].apply(calculate_shot_angle)

Number of close opponents

Similarly, we may expect the number of nearby opponents to negatively affect the likelihood of scoring.
def find_opponents_within_radius(row, radius=1.5):
    try:
        freeze_frame_df = pd.json_normalize(row["shot_freeze_frame"])
        opponents_df = freeze_frame_df[freeze_frame_df["teammate"] == False]
    except:
        return np.nan
    if len(opponents_df) == 0:
        return 0
    shot_location = np.array(row["location"])
    opponents_locations = list(opponents_df["location"].values)
    # Calculate distances from the shooter to each opponent
    distances = [distance.euclidean(opponent, shot_location) for opponent in opponents_locations]
    # Count the number of opponents within the specified radius
    num_opponents_within_radius = sum(d <= radius for d in distances)
    return num_opponents_within_radius

df["opponents_within_radius"] = df.apply(find_opponents_within_radius, axis=1)

GK distance to goal

We expect that if the GK is out of position then the shooter is more likely to be successful, although the distance may not always indicate that the keeper is out of position.
def calculate_gk_distance_to_goal(row):
    try:
        freeze_frame_df = pd.json_normalize(row["shot_freeze_frame"])
        # Filter for the goalkeeper
        gk_df = freeze_frame_df[freeze_frame_df["position.name"] == "Goalkeeper"]
        if gk_df.empty:
            return np.nan
        gk_location = np.array(gk_df.iloc[0]["location"])
    except:
        return np.nan
    goal_center = np.array([120, 40])
    gk_distance_to_goal = np.linalg.norm(gk_location - goal_center)
    return gk_distance_to_goal

df["gk_distance_to_goal"] = df.apply(calculate_gk_distance_to_goal, axis=1)

Number of players in the shot triangle

We would expect that if there are more players in between the shooter and the goal then the likelihood of scoring decreases.
def calculate_players_in_shot_triangle(row):
    try:
        freeze_frame_df = pd.json_normalize(row["shot_freeze_frame"])
        # Keep only opposition players (teammates are excluded)
        non_shooters_df = freeze_frame_df[freeze_frame_df["teammate"] == False]
    except:
        return np.nan
    if non_shooters_df.empty:
        return 0
    # Define the goalposts and shot location
    left_post = np.array([120, 36])
    right_post = np.array([120, 44])
    shot_location = np.array(row["location"])
    # Create a triangle using the shot location and the goalposts
    shot_triangle = Polygon([shot_location, left_post, right_post])
    # Count players inside the triangle
    num_players_in_triangle = 0
    for location in non_shooters_df["location"]:
        player_point = Point(location)
        if shot_triangle.contains(player_point):
            num_players_in_triangle += 1
    return num_players_in_triangle

df["players_in_shot_triangle"] = df.apply(calculate_players_in_shot_triangle, axis=1)

Position Encoding

In order to utilise the position label, we will summarise the granular labels provided by StatsBomb into the simpler categories [Goalkeeper, Defender, Midfielder, Forward].
df["position_simplified"] = df["position"].map(position_map)

Convert Data Types and fill NaN

Our model cannot take boolean values as input so we need to convert these to integers.
boolean_columns = ['shot_aerial_won', 'shot_first_time', 'shot_one_on_one', 'under_pressure']
df[boolean_columns] = df[boolean_columns].fillna(0).astype(int)
We will also need to fill missing values with statistically meaningful values.
opponents_within_radius_median = df['opponents_within_radius'].median()
gk_distance_to_goal_median = df['gk_distance_to_goal'].median()
players_in_shot_triangle_median = df['players_in_shot_triangle'].median()

df['opponents_within_radius'] = df['opponents_within_radius'].fillna(opponents_within_radius_median)
df['gk_distance_to_goal'] = df['gk_distance_to_goal'].fillna(gk_distance_to_goal_median)
df['players_in_shot_triangle'] = df['players_in_shot_triangle'].fillna(players_in_shot_triangle_median)

Targets and Baseline xG

Our target is the shot outcome, which we will convert to a binary variable of 0 or 1.
df["target"] = df["shot_outcome"].apply(lambda x: 1 if x == 'Goal' else 0)
We can use the StatsBomb xG model as our baseline. We will use accuracy as our metric, i.e. the proportion of shots correctly predicted as either a goal or no goal. The accuracy of the StatsBomb xG predictions is 0.90, meaning it correctly predicts 90% of shots.
threshold = 0.5
predictions = (df['shot_statsbomb_xg'] > threshold).astype(int)

# Calculate accuracy
accuracy = (predictions == df['target']).mean()
print(f"Accuracy of StatsBomb xG predictions: {accuracy:.2f}")

Model Details

Categorical Features

Our model does not natively understand categorical features, so we need to convert these values to a numerical representation:
labelencoder = defaultdict(LabelEncoder)
df_inputs.loc[:, categorical_columns] = df_inputs.loc[:, categorical_columns].apply(
    lambda x: labelencoder[x.name].fit_transform(x)
)
df_inputs[categorical_columns] = df_inputs[categorical_columns].astype('category')
This converts each category into an integer code. The problem with using this number directly as an input is that the model may infer an ordinal ranking, i.e. 3 > 2 > 1, which is not a meaning we want it to learn. Thus we create an embedding layer in our network for each of these inputs:
emb_c = {n: len(col.cat.categories) for n, col in df_inputs.items() if col.dtype == 'category'}
emb_cols = emb_c.keys()  # names of columns chosen for embedding
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in emb_c.items()]  # embedding sizes for the chosen columns
nn.ModuleList([nn.Embedding(categories, size) for categories, size in emb_szs])
The above approach is taken from FastAI. It creates an embedding layer for each categorical feature, where the size of the layer grows with the number of unique categories, capped at 50.
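As a quick worked example (illustrative cardinalities, not taken from the actual dataset), the sizing rule maps small and large category counts like this:

# Embedding size is half the cardinality (rounded up), capped at 50
for n_categories in [2, 6, 40, 300]:
    emb_size = min(50, (n_categories + 1) // 2)
    print(n_categories, "->", emb_size)  # 2 -> 1, 6 -> 3, 40 -> 20, 300 -> 50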

Numerical Features

For optimal performance, we should standardise the numerical data so that it has zero mean and unit variance. We can save the scaler to a file so we can use it to transform the data later.
Before transforming the data we need to split it into training and validation sets, so that no information leaks from the validation data into the fitted scaler.
X_train, X_val, y_train, y_val = train_test_split(
    df_inputs.iloc[:, :-1], df_inputs.iloc[:, -1], test_size=0.1, random_state=42, shuffle=True
)
scaler = StandardScaler()
X_train.loc[:, num_columns] = scaler.fit_transform(X_train.loc[:, num_columns])
X_val.loc[:, num_columns] = scaler.transform(X_val.loc[:, num_columns])
pickle.dump(scaler, open('std_scaler.pkl', 'wb'))
We need to be careful to always transform our data in the same way when doing inference later, that means scaling the numerical data using the same scaler.
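As a minimal sketch of what that looks like at inference time (assuming the same num_columns and a hypothetical dataframe of unseen shots called new_shots):

import pickle

# Load the scaler that was fitted on the training data
with open('std_scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Apply the *same* transformation to unseen shots before feeding them to the model
new_shots.loc[:, num_columns] = scaler.transform(new_shots.loc[:, num_columns])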

Create a Data Loader

We can create a data loader that outputs the categorical, numerical, and target variables.
class ShotsDataset(Dataset):
    def __init__(self, X, Y, emb_cols):
        X = X.copy()
        self.X1 = X.loc[:, emb_cols].copy().values.astype(np.int64)  # categorical columns
        self.X2 = X.drop(columns=emb_cols).copy().values.astype(np.float32)  # numerical columns
        self.y = Y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X1[idx], self.X2[idx], self.y[idx]

# Creating train and valid datasets
train_ds = ShotsDataset(X_train, y_train, emb_cols)
valid_ds = ShotsDataset(X_val, y_val, emb_cols)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=True)

Model Definition

Our model definition is below. It is a simple neural network that can be configured with a given number of layers with ReLU activations. It outputs a single neuron that predicts the probability of a goal being scored. The forward pass also defines the loss function, which is the binary cross-entropy loss. We use dropout regularisation and Xavier and Kaiming weight initialisation, and AdamW as the optimiser with a learning rate scheduler. The hyperparameters below are the best combination found from an exhaustive grid search.
batch_size = 256  # also tried 128, 64
epochs = 100
hidden_size = 200
dropout_rate = 0.1
num_blocks = 2
learning_rate = 0.03
min_lr = learning_rate * 0.1
total_steps = batch_size * epochs
warmup_iters = total_steps * 0.1
lr_decay_iters = total_steps * 0.9
weight_decay = 0.01
betas = (0.9, 0.95)
class Block(nn.Module):
    def __init__(self, in_features, out_features, dropout_rate=0.1):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.bn = nn.BatchNorm1d(out_features)

    def forward(self, x):
        # Residual connection
        x = x + self.dropout(self.relu(self.fc(self.bn(x))))
        return x


class Model(nn.Module):
    def __init__(
        self,
        embedding_sizes: list[tuple[int, int]],
        n_numerical: int,
        hyperparameters: dict | None = None,
        hidden_size: int | None = 200,
        dropout_rate: float | None = 0.1,
        num_blocks: int | None = 4,
        return_loss: bool = True,
    ):
        super().__init__()
        # Default hyperparameters
        default_hyperparameters = {
            "hidden_size": hidden_size,
            "dropout_rate": dropout_rate,
            "num_blocks": num_blocks,
            "block_config": None,
        }
        # Update defaults with any provided hyperparameters
        if hyperparameters:
            default_hyperparameters.update(hyperparameters)
        # Assign hyperparameters
        self.hidden_size = default_hyperparameters["hidden_size"]
        self.dropout_rate = default_hyperparameters["dropout_rate"]
        self.num_blocks = default_hyperparameters["num_blocks"]
        self.block_config = default_hyperparameters["block_config"]
        self.return_loss = return_loss
        # Validate hyperparameters
        assert self.hidden_size > 0, "hidden_size must be positive"
        assert 0 <= self.dropout_rate <= 1, "dropout_rate must be between 0 and 1"
        assert self.num_blocks > 0, "num_blocks must be positive"

        self.criterion = nn.BCEWithLogitsLoss()
        self.embeddings = nn.ModuleList([nn.Embedding(categories, size) for categories, size in embedding_sizes])
        self.n_emb = sum(e.embedding_dim for e in self.embeddings)  # length of all embeddings combined
        self.emb_drop = nn.Dropout(self.dropout_rate)
        self.n_numerical = n_numerical

        # Inputs: combine embeddings and numerical data
        self.bn1 = nn.BatchNorm1d(self.n_numerical)
        self.fc1 = nn.Linear(self.n_emb + self.n_numerical, self.hidden_size)
        self.relu1 = nn.ReLU()
        self.drop1 = nn.Dropout(self.dropout_rate)

        if self.block_config:
            self.hidden_layers = nn.ModuleList(
                [Block(hs_in, hs_out, dropout_rate=dropout_rate) for hs_in, hs_out, dropout_rate in self.block_config]
            )
        else:
            self.hidden_layers = nn.ModuleList(
                [Block(self.hidden_size, self.hidden_size, dropout_rate=self.dropout_rate) for _ in range(self.num_blocks)]
            )

        # Output layer
        self.bn_output = nn.BatchNorm1d(self.hidden_size)
        self.fc_output = nn.Linear(self.hidden_size, 1)

        # init all weights
        self.apply(self._init_weights)
        # report number of parameters
        print(f"number of parameters: {sum(p.numel() for p in self.parameters()):,}")

    def forward(self, x_cat, x_cont, targets=None):
        # Embed the categorical data and concatenate
        x = [e(x_cat[:, i]) for i, e in enumerate(self.embeddings)]
        x = torch.cat(x, 1)
        x = self.emb_drop(x)
        # Join the embeddings and the numerical data
        x2 = self.bn1(x_cont)
        x = torch.cat([x, x2], 1)
        # First layer
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.drop1(x)
        for block in self.hidden_layers:
            x = block(x)
        # Output layer
        x = self.bn_output(x)
        logits = self.fc_output(x)
        if self.return_loss:
            loss = None
            if targets is not None:
                # Use BCEWithLogitsLoss for binary classification
                loss = self.criterion(logits.view(-1), targets.float().view(-1))
            return logits, loss
        else:
            return logits

    def _init_weights(self, module):
        """
        Kaiming uniform initialization for the linear layers and Xavier uniform for the embeddings.
        Generally a good starting point for most models, especially with ReLU activation functions.

        Other options:
        - Kaiming (He) initialization: often recommended for networks with ReLU or variants like
          Leaky ReLU. It helps maintain the variance of activations throughout the layers.
        - Uniform vs. normal distribution: you can choose between uniform and normal distributions
          for initializing weights. Xavier and Kaiming initializations can be applied using either.
        - Bias initialization: initializing biases to zero is common, but sometimes small positive
          values can help, especially in ReLU networks, to ensure that all neurons are active initially.
        """
        if isinstance(module, nn.Linear):
            torch.nn.init.kaiming_uniform_(module.weight, nonlinearity='relu')
            # torch.nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                # torch.nn.init.zeros_(module.bias)
                torch.nn.init.constant_(module.bias, 0.01)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.xavier_uniform_(module.weight)

    def configure_optimizers(
        self,
        weight_decay: float = 0.01,
        learning_rate: float = 3e-4,
        betas: tuple = (0.9, 0.95),
        device_type: str = "cpu",
        min_lr: float = 3e-4,
        warmup_iters: int = 1000,
        lr_decay_iters: int = 10000,
    ):
        """
        Configure the optimizer for the model. We use the AdamW optimizer with weight decay.
        The difference between Adam and AdamW is that Adam uses L2 regularization, while AdamW
        applies weight decay directly to the parameters during updates. Generally AdamW is
        preferred as it decouples weight decay from the gradient update, which helps with
        generalization and reduces overfitting.
        """
        self.learning_rate = learning_rate
        self.weight_decay = weight_decay
        self.betas = betas
        self.min_lr = min_lr
        self.warmup_iters = warmup_iters
        self.lr_decay_iters = lr_decay_iters
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # create optim groups. Any parameter that is 2D will be weight decayed, otherwise no.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {"params": decay_params, "weight_decay": weight_decay},
            {"params": nodecay_params, "weight_decay": 0.0},
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = "fused" in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == "cuda"
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
        print(f"using fused AdamW: {use_fused}")
        return optimizer

Model Training

We will train our model, validating it after each epoch and saving the version with the best validation accuracy.
model = Model(
    emb_szs,
    X_train.shape[1] - len(emb_cols),
    hyperparameters={
        'hidden_size': hidden_size,
        'dropout_rate': dropout_rate,
        'num_blocks': num_blocks,
    },
)
optimizer = model.configure_optimizers(weight_decay=weight_decay, learning_rate=learning_rate)
train_loop(model, epochs, optimizer, save_model=True)
Where our training loop is
def train_loop(model, epochs, optimizer, save_model: bool = False):
    step = 0
    best_accuracy = 0
    no_improvement_epochs = 0
    patience = 15  # Number of epochs to wait for improvement
    for epoch in range(epochs):
        loss, step = train_model(model, train_dl, step, optimizer)
        # print(f"Epoch {epoch} Step {step} training loss: {loss:.3f}")
        loss, accuracy = val_loss(model, valid_dl)
        # print(f"Epoch {epoch} Step {step} valid loss {loss:.3f} and accuracy {accuracy:.3f}")
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            if save_model:
                torch.save(model.state_dict(), 'best_model.pth')
            no_improvement_epochs = 0  # Reset counter if improvement
        else:
            no_improvement_epochs += 1
            if no_improvement_epochs >= patience:
                print("Early stopping due to no improvement in validation accuracy.")
                break
    print(f"best accuracy: {best_accuracy:.3f}")
    return best_accuracy
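The train_model and val_loss helpers live in the repo; a minimal sketch of what they might look like (ignoring the learning rate schedule for brevity) is:

def train_model(model, train_dl, step, optimizer):
    # One pass over the training data; returns the mean loss and the updated step count
    model.train()
    total_loss, total_batches = 0.0, 0
    for x_cat, x_cont, y in train_dl:
        optimizer.zero_grad()
        _, loss = model(x_cat, x_cont, targets=y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        total_batches += 1
        step += 1
    return total_loss / max(total_batches, 1), step


def val_loss(model, valid_dl, threshold=0.5):
    # Mean validation loss and accuracy at the given probability threshold
    model.eval()
    total_loss, total_correct, total_samples, total_batches = 0.0, 0, 0, 0
    with torch.no_grad():
        for x_cat, x_cont, y in valid_dl:
            logits, loss = model(x_cat, x_cont, targets=y)
            preds = (torch.sigmoid(logits.view(-1)) > threshold).long()
            total_correct += (preds == y.view(-1)).sum().item()
            total_samples += y.numel()
            total_loss += loss.item()
            total_batches += 1
    return total_loss / max(total_batches, 1), total_correct / max(total_samples, 1)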
The accuracy achieved by this model is 0.91, a small but notable increase over the baseline model accuracy of 0.90.

Model Calibration (optional)

Model calibration is the process of adjusting the output of a model so it better matches the true probabilities. This is key for an xG model, since we want to read the output as the probability of a goal being scored from a given shot: a shot with an xG of 0.2 should be scored 20% of the time, and a shot with an xG of 0.8 should be scored 80% of the time.
Generally, when MLPs are trained their outputs are not calibrated to the true probabilities. This is where Platt scaling comes in.
Platt scaling is typically used for calibrating binary classifiers. It involves training a logistic regression model on the outputs of your original model using a separate calibration dataset; a mapping from the original scores to calibrated probabilities is fitted via maximum likelihood estimation. Scikit-learn's CalibratedClassifierCV class facilitates this kind of probability calibration with logistic regression, but using it with a PyTorch model requires some additional wrapper code, which can be found in the codebase.
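A minimal sketch of the idea, fitting a logistic regression directly on the model's held-out scores rather than going through CalibratedClassifierCV (the calibration-set variables here are hypothetical; the actual wrapper lives in the repo):

import torch
from sklearn.linear_model import LogisticRegression

# Assumes a held-out calibration set (cal_cat, cal_cont, cal_targets) not used for training
model.eval()
with torch.no_grad():
    logits = model(cal_cat, cal_cont)[0].view(-1).numpy()  # raw, uncalibrated scores

# Platt scaling: fit a one-dimensional logistic regression from scores to outcomes
platt = LogisticRegression()
platt.fit(logits.reshape(-1, 1), cal_targets)

calibrated_xg = platt.predict_proba(logits.reshape(-1, 1))[:, 1]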
As you can observe below, the uncalibrated model actually performs relatively well.
Calibration curve for the uncalibrated model.
The calibrated model shows only minor improvements, so I do not believe this step is necessary for this model.
Calibration curve for the calibrated model.

Feature Analysis

We can run a feature analysis with SHAP to get a better understanding of which features are most important to the model output, and which ones positively or negatively affect the outcome.
SHAP force values for a single data point.
The above plot shows that the most important features are distance to goal (positive impact), shot angle (positive impact), players in the shot triangle (negative impact), and the body part used for the shot (negative impact).
We can also generate a summary plot over all features to get a more complete picture.
SHAP summary plot over many data points.
This plot shows that, in general, shot angle has the largest effect on the output, along with distance to goal. A simpler model might drop some of the more redundant features and focus on enriching the more valuable ones, like shot body part, by including data on each player's weaker foot. We may also want to include player identities in the future, as we would expect certain players to have a higher likelihood of scoring from certain situations. In our tests, adding this data did not help, but that may be due to insufficient data.
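For reference, a rough sketch of how these SHAP values might be produced, assuming the shap package and a small wrapper that turns a flat feature matrix back into the categorical and numerical tensors the model expects (the wrapper, n_categorical, X_val_matrix, and feature_names are hypothetical names):

import numpy as np
import shap
import torch

def predict_proba(features: np.ndarray) -> np.ndarray:
    # Hypothetical wrapper: split the flat feature matrix back into
    # categorical and numerical tensors in the order the model expects
    x_cat = torch.tensor(features[:, :n_categorical], dtype=torch.long)
    x_cont = torch.tensor(features[:, n_categorical:], dtype=torch.float32)
    model.eval()
    with torch.no_grad():
        logits = model(x_cat, x_cont)[0]
    return torch.sigmoid(logits).view(-1).numpy()

# Model-agnostic KernelExplainer on a small background sample for speed
background = shap.sample(X_val_matrix, 100)
explainer = shap.KernelExplainer(predict_proba, background)
shap_values = explainer.shap_values(X_val_matrix[:200])
shap.summary_plot(shap_values, X_val_matrix[:200], feature_names=feature_names)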

Analysing Real Games

We also ran an analysis of the 2004/05 Champions League Final between Liverpool FC and AC Milan. The real score in this game before penalties was 3–3. Our model and the baseline give:
  • StatsBomb Baseline: AC Milan: 3.16 xG, Liverpool: 1.80 xG
  • Neural Network: AC Milan: 2.75 xG, Liverpool: 1.41 xG
This result reflects a broader trend for our model to be more pessimistic about shot odds compared to the baseline model.
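A sketch of how per-team xG for a single match could be aggregated from the model's shot probabilities (match_shots and the column names are illustrative; the dataframe is assumed to have been passed through the same feature engineering, encoding, and scaling steps as the training data):

x_cat = torch.tensor(match_shots[list(emb_cols)].values, dtype=torch.long)
x_cont = torch.tensor(match_shots[num_columns].values, dtype=torch.float32)

model.eval()
with torch.no_grad():
    logits = model(x_cat, x_cont)[0]
match_shots["xg"] = torch.sigmoid(logits).view(-1).numpy()

# Sum shot probabilities per team to get the match-level xG totals
print(match_shots.groupby("team")["xg"].sum())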

Conclusion

This has been a relatively simple introduction to xG models. In further work we could explore more complex feature transforms and bring in data from additional sources. We could also explore player-adjusted xG, and post-impact xG, which takes the ball's flight into consideration (useful for evaluating goalkeepers). A more thorough exploration would also compare this model to other successful model types like gradient-boosted trees (XGBoost) or Naive Bayes classifiers.
For all the code used for data preparation and model training, please refer to my GitHub repo. For the free data, please refer to StatsBomb.