In this post we will build an expected goals (xG) model using a neural network to predict the likely outcome of a shot.
What is xG?
Expected goals (xG) is a statistical metric that revolutionised the way we analyse football. It measures the quality of a shot by calculating the probability of it resulting in a goal. This probability is determined based on various factors, such as:
- Shot location: Shots taken from closer to the goal and at better angles have a higher xG.
- Shot type: Headers, volleys, and shots from distance typically have lower xG values.
- Shot angle: Shots taken from acute angles are less likely to score.
- Defensive pressure: Shots taken under pressure are less likely to score.
By assigning an xG value to each shot, we can get a more objective measure of a team or player’s performance. For example, if a team has an xG of 2.0 but only scores 1 goal, we can infer that they were unlucky and should have scored more. Conversely, if a team scores 3 goals but has an xG of 1.5, we can say that they were fortunate.
Deep Learning and xG
Traditional xG models rely on simple statistical models to calculate shot probabilities. However, recent advancements in Machine Learning, and more specifically Deep Learning, have opened up new possibilities for more accurate and sophisticated xG models.
Deep learning models can:
- Learn complex patterns: Deep neural networks can learn complex patterns and relationships in data that are difficult for humans to discern. This allows them to consider a wider range of factors when calculating xG, such as player positioning, defensive positioning, and the flow of the game.
- Handle large datasets: Deep learning models can handle massive amounts of data, which is essential for training accurate xG models. This allows them to learn from a larger and more diverse set of shots, leading to more accurate predictions.
- Continuously improve: Deep learning models can be continuously trained and updated with new data, allowing them to adapt to changing trends and improve their accuracy over time.
Datasets
We will use the StatsBomb Open Datasets to train our model. This dataset contains datapoints on completed matches, events within matches, tournaments statistics and more. For out purpose we will be using the events data across all adult male competitions and filtering to only include shots data. In the end this gives us a dataset of 73874 shots.
- La Liga: 21210
- Premier League: 10837
- Ligue: 1 10346
- Serie A: 10033
- 1. Bundesliga : 8747
- FIFA World Cup: 3904
- Indian Super league: 3095
- UEFA Euro: 2629
- African Cup of Nations: 1244
- Copa America: 790
- Champions League: 594
- Major League Soccer: 146
- UEFA Europa League: 89
- Copa del Rey: 74
- Liga Profesional: 56
- North American League: 50
- FIFA U20 World Cup: 30
Feature Engineering
The raw dataset contains useful information, like player locations when the shot was taken, as well as binary variables like if the shot was a one-on-one, or was a first time shot. We can augment this with our own computed features.
Distance to goal
We can calculate the distance to goal using the location metric. We should expect shots taken from further away to have a lower probability of resulting in a goal.
def calculate_distance_to_goal(location): goal_center = np.array([120, 40]) shot_location = np.array(location) distance_to_goal = np.linalg.norm(shot_location - goal_center) return distance_to_goal df["distance_to_goal"] = df["location"].apply(calculate_distance_to_goal)
Angle to goal
Similarly we may expect shots taken from tighter angles to be less likely to result in a goal.
def calculate_shot_angle(location): left_post = np.array([120, 36]) right_post = np.array([120, 44]) shot_location = np.array(location) # Exclude shots taken from the same x-coordinate as the goal line if shot_location[0] == 120: return np.nan # Calculate distances using the Euclidean distance a = np.linalg.norm(left_post - right_post) b = np.linalg.norm(shot_location - left_post) c = np.linalg.norm(shot_location - right_post) # Use the cosine rule to calculate the angle # cos(C) = (a^2 + b^2 - c^2) / (2ab) cos_angle = (b**2 + c**2 - a**2) / (2 * b * c) shot_angle_radians = np.arccos(np.clip(cos_angle, -1.0, 1.0)) # Clip to handle floating point errors # Convert the angle from radians to degrees shot_angle_degrees = np.degrees(shot_angle_radians) return shot_angle_degrees df["shot_angle"] = df["location"].apply(calculate_shot_angle)
Number of close opponents
Similarly we may also expect the number of close opponents to negatively effect likelihood of scoring.
def find_opponents_within_radius(row, radius=1.5): try: freeze_frame_df = pd.json_normalize(row["shot_freeze_frame"]) opponents_df = freeze_frame_df[freeze_frame_df["teammate"] == False] except: return np.nan if len(opponents_df) == 0: return 0 shot_location = np.array(row["location"]) opponents_locations = opponents_df["location"].values opponents_locations = [item for item in opponents_locations] # Calculate distances distances = [distance.euclidean(opponent, shot_location) for opponent in opponents_locations] # Count the number of opponents within the specified radius num_opponents_within_radius = sum(d <= radius for d in distances) return num_opponents_within_radius df["opponents_within_radius"] = df.apply(find_opponents_within_radius, axis=1)
GK distance to goal
We expect that if the GK is out of position then the shooter is more likely to be successful, although the distance may not always indicate that the keeper is out of position.
def calculate_gk_distance_to_goal(row): try: freeze_frame_df = pd.json_normalize(row["shot_freeze_frame"]) # Filter for the goalkeeper gk_df = freeze_frame_df[freeze_frame_df["position.name"] == "Goalkeeper"] if gk_df.empty: return np.nan gk_location = np.array(gk_df.iloc[0]["location"]) except: return np.nan goal_center = np.array([120, 40]) gk_distance_to_goal = np.linalg.norm(gk_location - goal_center) return gk_distance_to_goal df["gk_distance_to_goal"] = df.apply(calculate_gk_distance_to_goal, axis=1)
Number of players in the shot triangle
We would expect that if there are more players in between the shooter and the goal then the likelihood of scoring decreases.
def calculate_players_in_shot_triangle(row): try: freeze_frame_df = pd.json_normalize(row["shot_freeze_frame"]) # Exclude the shooter non_shooters_df = freeze_frame_df[freeze_frame_df["teammate"] == False] except: return np.nan if non_shooters_df.empty: return 0 # Define the goalposts and shot location left_post = np.array([120, 36]) right_post = np.array([120, 44]) shot_location = np.array(row["location"]) # Create a triangle using the shot location and the goalposts shot_triangle = Polygon([shot_location, left_post, right_post]) # Count players inside the triangle num_players_in_triangle = 0 for location in non_shooters_df["location"]: player_point = Point(location) if shot_triangle.contains(player_point): num_players_in_triangle += 1 return num_players_in_triangle df["players_in_shot_triangle"] = df.apply(calculate_players_in_shot_triangle, axis=1)
Positition Encoding
In order to utilise the position label, we will summarise the more granular labels provided by StatsBomb into more simple [Goalkeeper, Defender, Midfielder, Forward].
df["position_simplified"] = df["position"].map(position_map)
Convert Data Types and fill NaN
Our model cannot take boolean values as input so we need to convert these to integers.
boolean_columns = ['shot_aerial_won', 'shot_first_time', 'shot_one_on_one', 'under_pressure'] df[boolean_columns] = df[boolean_columns].fillna(0).astype(int) df["target"] = df["shot_outcome"].apply(lambda x: 1 if x == 'Goal' else 0)
We will also need to fill missing values with statistically meaningful values.
opponents_within_radius_median = df['opponents_within_radius'].median() gk_distance_to_goal_median = df['gk_distance_to_goal'].median() players_in_shot_triangle_median = df['players_in_shot_triangle'].median() df['opponents_within_radius'] = df['opponents_within_radius'].fillna(opponents_within_radius_median) df['gk_distance_to_goal'] = df['gk_distance_to_goal'].fillna(gk_distance_to_goal_median) df['players_in_shot_triangle'] = df['players_in_shot_triangle'].fillna(players_in_shot_triangle_median)
Targets and Baseline xG
Our target is the shot outcome, which we will convert to a binary variable of 0 or 1.
df["target"] = df["shot_outcome"].apply(lambda x: 1 if x == 'Goal' else 0)
We can use the StatsBomb xG model as our baseline model. We will use accuracy as our metric, i.e. number of shots correctly predicted as either a goal or no goal. The accuracy of StatsBomb xG predictions is 0.90, meaning it correctly predict 90% of shots.
threshold = 0.5 predictions = (df['shot_statsbomb_xg'] > threshold).astype(int) # Calculate accuracy accuracy = (predictions == df['target']).mean() print(f"Accuracy of StatsBomb xG predictions: {accuracy:.2f}")
Model Details
Categorical Features
Our model does not natively understand categorical features, thus we need to convert these values to numerical representation:
labelencoder = defaultdict(LabelEncoder) df_inputs.loc[:,categorical_columns] = df_inputs.loc[:,categorical_columns].apply(lambda x: labelencoder[x.name].fit_transform(x)) df_inputs[categorical_columns] = df_inputs[categorical_columns].astype('category')
This converts our catergories into integer numbers for each category. The problem with just using this number as an input is that the model may believe the numbers have some ordinal ranking, i.e. 3 > 2 > 1, but actually we do not want the model to learn this meaning. Thus we need to create embedding layer in our network to represent these inputs:
emb_c = {n: len(col.cat.categories) for n, col in df_inputs.items() if col.dtype == 'category'} emb_cols = emb_c.keys() # names of columns chosen for embedding emb_szs = [(c, min(50, (c+1)//2)) for _,c in emb_c.items()] #embedding sizes for the chosen columns nn.ModuleList([nn.Embedding(categories, size) for categories,size in emb_szs])
The above approach is taken from FastAI. It creates an embedding layer for each categorical feature, where the size of the layer is proportional to the number of unique categories.
Numerical Features
For optimal performance, we should standardise the numerical data so that it has unit variance. We can save the scaler to a file so we can use it to transform the data later.
Before transforming the data we need to split the data into training and validation sets so there is no data leakage from the training data to validation data.
X_train, X_val, y_train, y_val = train_test_split(df_inputs.iloc[:, :-1], df_inputs.iloc[:, -1], test_size=0.1, random_state=42, shuffle=True)
scaler = StandardScaler() X_train.loc[:, num_columns] = scaler.fit_transform(X_train.loc[:, num_columns]) X_val.loc[:, num_columns] = scaler.transform(X_val.loc[:, num_columns]) pickle.dump(scaler, open('std_scaler.pkl', 'wb'))
We need to be careful to always transform our data in the same way when doing inference later, that means scaling the numerical data using the same scaler.
Create a Data Loader
We can create a data loader that outputs the categorical, numerical, and target variables.
class ShotsDataset(Dataset): def __init__(self, X, Y, emb_cols): X = X.copy() self.X1 = X.loc[:, emb_cols].copy().values.astype(np.int64) #categorical columns self.X2 = X.drop(columns=emb_cols).copy().values.astype(np.float32) #numerical columns self.y = Y def __len__(self): return len(self.y) def __getitem__(self, idx): return self.X1[idx], self.X2[idx], self.y[idx] #creating train and valid datasets train_ds = ShotsDataset(X_train, y_train, emb_cols) valid_ds = ShotsDataset(X_val, y_val, emb_cols) train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True) valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=True)
Model Definition
Our model definition is below. It is a simple neural network that can be defined with a given number of layers with ReLU activations. It outputs a single neuron that will predict the probability of a goal being scored. The forward pass also defines the loss function with is the Binary Cross Entropy Loss. We use dropout regularisation and Xavier and Kaiming weight initialisation. We also use the AdamW as the optimiser with a learning rate schedular. The hyperparameters found below are the best combination found from an exustive grid search optimisation.
batch_size = 256 # 128, 64 epochs = 100 hidden_size=200 dropout_rate=0.1 num_blocks=2 learning_rate=0.03 min_lr = learning_rate * 0.1 total_steps = batch_size * epochs warmup_iters = total_steps * 0.1 lr_decay_iters = total_steps * 0.9 weight_decay = 0.01 betas = (0.9, 0.95)
class Block(nn.Module): def __init__(self, in_features, out_features, dropout_rate=0.1): super().__init__() self.fc = nn.Linear(in_features, out_features) self.relu = nn.ReLU() self.dropout = nn.Dropout(dropout_rate) self.bn = nn.BatchNorm1d(out_features) def forward(self, x): # Residual connection x = x + self.dropout(self.relu(self.fc(self.bn(x)))) return x class Model(nn.Module): def __init__( self, embedding_sizes: list[tuple[int, int]], n_numerical: int, hyperparameters: dict | None = None, hidden_size: int | None = 200, dropout_rate: float | None = 0.1, num_blocks: int | None = 4, return_loss: bool = True, ): super().__init__() # Default hyperparameters default_hyperparameters = { "hidden_size": hidden_size, "dropout_rate": dropout_rate, "num_blocks": num_blocks, "block_config": None } # Update defaults with any provided hyperparameters if hyperparameters: default_hyperparameters.update(hyperparameters) # Assign hyperparameters self.hidden_size = default_hyperparameters["hidden_size"] self.dropout_rate = default_hyperparameters["dropout_rate"] self.num_blocks = default_hyperparameters["num_blocks"] self.block_config = default_hyperparameters["block_config"] self.return_loss = return_loss # Validate hyperparameters assert self.hidden_size > 0, "hidden_size must be positive" assert 0 <= self.dropout_rate <= 1, "dropout_rate must be between 0 and 1" assert self.num_blocks > 0, "num_blocks must be positive" self.criterion = nn.BCEWithLogitsLoss() self.embeddings = nn.ModuleList([nn.Embedding(categories, size) for categories, size in embedding_sizes]) self.n_emb = sum(e.embedding_dim for e in self.embeddings) # length of all embeddings combined self.emb_drop = nn.Dropout(self.dropout_rate) self.n_numerical = n_numerical # Inputs and combine embeddings and numerical data self.bn1 = nn.BatchNorm1d(self.n_numerical) self.fc1 = nn.Linear(self.n_emb + self.n_numerical, self.hidden_size) self.relu1 = nn.ReLU() self.drop1 = nn.Dropout(self.dropout_rate) if self.block_config: self.hidden_layers = nn.ModuleList( [Block(hs_in, hs_out, dropout_rate=dropout_rate) for hs_in, hs_out, dropout_rate in self.block_config] ) else: self.hidden_layers = nn.ModuleList( [Block(self.hidden_size, self.hidden_size, dropout_rate=self.dropout_rate) for _ in range(self.num_blocks)] ) # Output layer self.bn_output = nn.BatchNorm1d(self.hidden_size) self.fc_output = nn.Linear(self.hidden_size, 1) # init all weights self.apply(self._init_weights) # report number of parameters print(f"number of parameters: {sum(p.numel() for p in self.parameters()):,}") def forward(self, x_cat, x_cont, targets=None): # Embed the categorical data and concatenate x = [e(x_cat[:, i]) for i, e in enumerate(self.embeddings)] x = torch.cat(x, 1) x = self.emb_drop(x) # Join the embeddings and the numerical data x2 = self.bn1(x_cont) x = torch.cat([x, x2], 1) # First layer x = self.fc1(x) x = self.relu1(x) x = self.drop1(x) for block in self.hidden_layers: x = block(x) # Output layer x = self.bn_output(x) logits = self.fc_output(x) if self.return_loss: loss = None if targets is not None: # Use BCEWithLogitsLoss for binary classification loss = self.criterion(logits.view(-1), targets.float().view(-1)) return logits, loss else: return logits def _init_weights(self, module): """ Xavier uniform initialization for the linear and embedding layers. Generally a good starting point for most models, especially with ReLU activation functions. Other Options: Kaiming (He) Initialization: This is often recommended for networks with ReLU or variants like Leaky ReLU. It helps in maintaining the variance of activations throughout the layers. Uniform vs. Normal Distribution: You can choose between uniform and normal distributions for initializing weights. Xavier and Kaiming initializations can be applied using either distribution. Bias Initialization: Initializing biases to zero is common, but sometimes small positive values can help, especially in ReLU networks, to ensure that all neurons are active initially. """ if isinstance(module, nn.Linear): torch.nn.init.kaiming_uniform_(module.weight, nonlinearity='relu') # torch.nn.init.xavier_uniform_(module.weight) if module.bias is not None: # torch.nn.init.zeros_(module.bias) torch.nn.init.constant_(module.bias, 0.01) elif isinstance(module, nn.Embedding): torch.nn.init.xavier_uniform_(module.weight) def configure_optimizers( self, weight_decay: float = 0.01, learning_rate: float = 3e-4, betas: tuple = (0.9, 0.95), device_type: str = "cpu", min_lr: float = 3e-4, warmup_iters: int = 1000, lr_decay_iters: int = 10000, ): """ Configure the optimizer for the model. We use the AdamW optimizer with weight decay. The difference between Adam and AdamW is that Adam uses L2 regularization, while AdamW uses weight decay directly on the parameters duing updates Generally AdamW is preferred as it is more flexible and allows for more control over the learning rate. AdamW also helps with generalization and reducing overfitting. """ self.learning_rate = learning_rate self.weight_decay = weight_decay self.betas = betas self.min_lr = min_lr self.warmup_iters = warmup_iters self.lr_decay_iters = lr_decay_iters # start with all of the candidate parameters param_dict = {pn: p for pn, p in self.named_parameters()} # filter out those that do not require grad param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad} # create optim groups. Any parameters that is 2D will be weight decayed, otherwise no. # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't. decay_params = [p for n, p in param_dict.items() if p.dim() >= 2] nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2] optim_groups = [ {"params": decay_params, "weight_decay": weight_decay}, {"params": nodecay_params, "weight_decay": 0.0}, ] num_decay_params = sum(p.numel() for p in decay_params) num_nodecay_params = sum(p.numel() for p in nodecay_params) print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters") print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters") # Create AdamW optimizer and use the fused version if it is available fused_available = "fused" in inspect.signature(torch.optim.AdamW).parameters use_fused = fused_available and device_type == "cuda" extra_args = dict(fused=True) if use_fused else dict() optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args) print(f"using fused AdamW: {use_fused}") return optimizer
Model Training
We will train our model and validate the model at each step, saving the model with the best results.
model = Model(emb_szs, X_train.shape[1] - len(emb_cols), hyperparameters={ 'hidden_size': hidden_size, 'dropout_rate': dropout_rate, 'num_blocks': num_blocks }) optimizer = model.configure_optimizers(weight_decay=weight_decay, learning_rate=learning_rate) train_loop(model, epochs, optimizer, save_model=True)
Where our training loop is
def train_loop(model, epochs, optimizer, save_model: bool = False): step = 0 best_accuracy = 0 no_improvement_epochs = 0 patience = 15 # Number of epochs to wait for improvement for epoch in range(epochs): loss, step = train_model(model, train_dl, step, optimizer) # print(f"Epoch {epoch} Step {step} training loss: {loss:.3f}") loss, accuracy = val_loss(model, valid_dl) # print(f"Epoch {epoch} Step {step} valid loss {loss:.3f} and accuracy {accuracy:.3f}") if accuracy > best_accuracy: best_accuracy = accuracy if save_model: torch.save(model.state_dict(), 'best_model.pth') no_improvement_epochs = 0 # Reset counter if improvement else: no_improvement_epochs += 1 if no_improvement_epochs >= patience: print("Early stopping due to no improvement in validation accuracy.") break print(f"best accuracy: {best_accuracy:.3f}") return best_accuracy
The accuracy achieved on this model is 0.91, which a small but significant increase over the baseline model accuracy of 0.90.
Model Calibration (optional)
Model calibration is the process of adjusting the output of a model to better match the true probabilities. This is key for utilising an xG model so we can understand the probability of a goal being scored from a given shot. For example a shot with a 0.2 xG should have a 20% chance of being scored, with a 0.8 xG having an 80% chance of being scored.
Generally when MLP are trained the outputs are not calibrated to the true probabilities. This is where Platt Scaling comes in.
Platt Scaling is typically used for calibrating models for binary classification. It involves training a logistic regression model on the output from your original model using a separate calibration dataset. A calibrated probability mapping function from the original probabilities is established by using maximum likelihood estimation. With the assistance of logistic regression, Scikit-learn’s CalibratedClassifierCV class facilitates probability calibration. In order to use the CalibratedClassifierCV with a PyTorch model so additional hacks are required which can be found in the codebase.
As you can observe below, the uncalibrated model actually performs relatively well.
The calibrated model only shows minor improvements so I do not believe this step to be necessary for this model.
Feature Analysis
We can run a feature analysis with SHAP to get a better understanding of which feature at most important to the model output, and which ones positively or negatively effect the outcome.
The above plot show that the most important features are distance to goal (positive impact), shot angle (positive impact), players in the shot triangle (negative impact), and the body part used for the shot (negative impact).
We can also generate a summary plot over all features to get a more complete picture.
This plot those that generally shot angle can have the largest effect on the output, along with distance to goal. A simpler model may remove some of these more redundant features and focus on enriching the more valuable feature like shot body part by including data on players weak foot. We may also want to inlcude details on player names in the future as we would expect certain players to have a higher likilihood of scoring from certain situations. In our tests we found adding this data did not help, but this may be due to insufficient data.
Analysing Real Games
We also ran an analysis of the 2004/05 Champions League Final between Liverpool FC and AC Milan. The real final score before penalties in this game was 3–3. Our model and baselines are below:
- StatsBomb Baseline: AC Milan: 3.16 xG, Liverpool: 1.80 xG
- Neural Network: AC Milan: 2.75 xG, Liverpool: 1.41 xG
This result reflects a broader trend for our model to be more pessimistic about shot odds compared to the baseline model.
Conclusion
This has been a relatively simple introduction to xG models. In further work we could explore more complex feature transforms and bringing in data from additional sources. We can also explore players adjusted xG, and post impact xG which takes into consideration ball flight when calculating xG (useful for evaluating goalkeepers). A more thorough exploration would also compare this models to others successful model types like XGBoost trees or Naive Bayes Classifiers.
For all the code used to for data preparation and model training, please refer to my GitHub repo. For the free data please refer to StatsBomb.