Intro
This blog is a simplified version of the full paper, which can be found here along with all the code for the project.
Recent advances in network science and artificial intelligence have highlighted the complex dynamics of decision-making in interconnected social systems. While individual agents often optimise for personal benefit, such choices can lead to suboptimal collective outcomes, a phenomenon observed across various domains from economics to social networks.
This research addresses a critical challenge: how can we strategically intervene in network structures to promote more cooperative and socially beneficial behaviours? Our approach, PlanNet, introduces a computational framework that uses reinforcement learning and graph neural networks to systematically modify network interactions.
By formulating a Markov Decision Process, PlanNet enables a centralised planner to strategically form or eliminate edges between agents and incentivise action changes. The method leverages graph neural networks to generate policies that can generalise across different network topologies, moving beyond existing approaches that are either computationally limited or dependent on specific graph structural properties.
The research contributes three key innovations:
- A reinforcement learning framework for network optimisation
- A graph neural network-based policy representation
- Empirical validation across synthetic and real-world social network structures
By demonstrating the potential to nudge network dynamics towards more socially beneficial collective outcomes, this work offers insights into computational approaches for understanding and potentially improving complex social interactions.
Motivation
When it comes to public goods - things like roads, libraries, or vaccination programs - we often face a challenging paradox. While everyone benefits from these shared resources, each individual might prefer to let others bear the cost of providing them. This creates what economists call a social dilemma: situations where rational individual choices lead to outcomes that make everyone worse off. Think of vaccination rates - each person who gets vaccinated helps protect the entire community, but some might avoid it due to inconvenience or concerns about side effects, hoping to benefit from others' immunity instead.
Our research focuses on network-based public goods games, where benefits and costs depend on who you're connected to. Imagine a group of neighbouring towns deciding whether to build emergency services. Each town benefits if either they or any of their neighbours build a fire station, but they'd prefer their neighbours make the investment. Without coordination, this could lead to either wasteful duplication or dangerous gaps in coverage. This is where a central planner could help - by strategically adjusting connections between communities or offering targeted incentives, they might guide the network toward more efficient and equitable outcomes.
How can we design systems that encourage cooperation without forcing it? This question drives our investigation into using AI to help find stable, socially optimal solutions in these networked social dilemmas.
Our mechanism works best in scenarios where people can easily change their behaviour or adjust their social connections through repeated interactions. Consider everyday examples like workplace carpooling or vaccination decisions. In carpooling, people's choices are influenced by their colleagues' decisions, while vaccine hesitancy often spreads through social networks where people tend to align with their immediate social circle rather than the broader population. A central planner - like a company's HR department or public health organisation - could strategically encourage new connections between people who make different choices, or offer targeted incentives to key individuals who might influence others. For instance, connecting vaccine-hesitant groups with well-informed peers or incentivising certain employees to start carpooling could create a ripple effect that shifts behaviour across the entire network. This approach is particularly relevant in today's digital age, where social media platforms already influence who we connect with and what information we see.
Learning With Graphs
Recent advances in graph neural networks and deep reinforcement learning have opened up new possibilities for optimising network structures. While traditional approaches relied on mathematical properties of graphs, we can now use AI to learn strategic interventions that maximise social welfare in complex networks. Our approach leverages these technologies to develop a central planner that can make targeted changes to network connections and incentivise individual behaviour changes, moving beyond previous limitations to tackle real-world social coordination problems.
In network games, players' outcomes depend on both their own choices and those of their connected neighbours. We focus on two specific types: the Majority Game, where players benefit by matching their neighbours' choices, and the Best-Shot Public Goods Game, where players decide whether to invest in a shared resource that benefits their entire neighbourhood. While these games naturally settle into stable states (Nash equilibria), these states often aren't the best for overall social welfare - imagine a neighbourhood where everyone's trying to freeload off others' investments. Our solution uses reinforcement learning to act as a "social planner" that can strategically modify network connections or incentivise different choices, guiding the network toward states that benefit everyone. We measure success by looking at the average rewards across all players, aiming to find arrangements that create the most value for the community as a whole.
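To make the payoff structures concrete, here is a minimal sketch of how the two games' payoffs could be computed for binary actions, assuming networkx graphs; the payoff values and contribution cost are illustrative choices, not the exact parameters used in the paper.

```python
import networkx as nx

def majority_payoff(graph, actions, node):
    """Player is rewarded for matching the action chosen by the majority of its neighbours."""
    neighbours = list(graph.neighbors(node))
    if not neighbours:
        return 0.0
    agreeing = sum(actions[v] == actions[node] for v in neighbours)
    return 1.0 if agreeing >= len(neighbours) / 2 else 0.0

def best_shot_payoff(graph, actions, node, cost=0.5):
    """Player benefits if it or any neighbour contributes (action 1), but pays a cost to contribute itself."""
    covered = actions[node] == 1 or any(actions[v] == 1 for v in graph.neighbors(node))
    return (1.0 if covered else 0.0) - (cost if actions[node] == 1 else 0.0)

# Example: a small star graph where only the centre contributes.
g = nx.star_graph(4)                      # node 0 connected to nodes 1-4
actions = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}
print([best_shot_payoff(g, actions, v) for v in g.nodes])  # centre pays the cost, leaves free-ride
```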
Overview
Our research introduces PlanNet, a reinforcement learning system designed to optimise social networks by making strategic modifications to network connections and influencing individual choices. Acting as a central planner with a limited budget, PlanNet can add or remove connections between people (at a set cost) and encourage individuals to change their behaviour (at half that cost). What makes our approach unique is its ability to efficiently process and modify complex social networks using a combination of graph neural networks and deep Q-learning, while generalising to different network sizes and structures.
The key innovation lies in our three-step decision process that first decides the type of intervention, then identifies the specific nodes to target. Our system has proven effective across various scenarios, from small community networks to large-scale social structures, using a single set of parameters. Notably, we've successfully applied this approach to real-world challenges like vaccine hesitancy, where understanding and influencing social network dynamics can have significant public health impacts. Most impressively, our model maintains its effectiveness even when trained only on small networks and applied to much larger ones, demonstrating robust scalability.
Background
Game Theory
Classic Game Theory
Game theory studies how rational players make decisions when their outcomes depend on others' choices. While it applies to traditional games like Chess, it's also used to understand everything from business competition to social behaviour. The key idea is that players choose actions based on their expected payoffs, but these payoffs often depend on what everyone else is doing. A stable situation where no one can benefit by changing their strategy alone is called a Nash equilibrium - think of it as a "social norm" that everyone sticks to because there's no individual advantage in breaking it.
The classic example is the prisoner's dilemma, where two suspects can either cooperate (stay silent) or defect (betray each other). While both would be better off cooperating, the fear of being betrayed leads them to both defect, resulting in a worse outcome for everyone. This highlights a central challenge in game theory: sometimes what's rational for individuals leads to suboptimal results for the group. Solutions can include repeated interactions (where players can build trust over time), or external interventions like taxes or incentives to encourage socially beneficial choices.
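To make the dilemma concrete, here is one standard payoff matrix (the numbers are illustrative) together with a quick check that defecting is always the best response, even though mutual cooperation is better for both players.

```python
# Illustrative payoffs: (row player, column player), higher is better.
payoffs = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-3,  0),
    ("defect",    "cooperate"): ( 0, -3),
    ("defect",    "defect"):    (-2, -2),
}

# Whatever the other player does, defecting gives a strictly higher payoff,
# so (defect, defect) is the Nash equilibrium despite being worse for everyone
# than (cooperate, cooperate).
for other in ("cooperate", "defect"):
    best = max(("cooperate", "defect"), key=lambda a: payoffs[(a, other)][0])
    print(f"If the other player plays {other}, the best response is to {best}.")
```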
Network Game Theory
In network games, players are connected like points in a social network, and their payoffs depend on what their direct connections (neighbours) do. We focus on two interesting variants: the Majority Game and the Best-Shot Public Goods Game. In the Majority Game, players simply try to match what most of their neighbours are doing - like choosing whether to attend a party based on whether your friends are going. This naturally creates stable situations where everyone's either doing the same thing as their local group or split into clear factions.
The Best-Shot Public Goods Game models situations where people decide whether to contribute to a shared resource. The fascinating part is that if any of your neighbours contributes, your best move is to save your money and benefit from their contribution. This creates an interesting dilemma: while it's best for just some well-placed people to contribute, figuring out exactly who should contribute is surprisingly tricky. In fact, a seemingly efficient network structure (like having one central person connected to everyone else) might actually lead to worse outcomes if left to evolve naturally.
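The sketch below shows how best-response dynamics might settle this game into an equilibrium; the sequential update order and the cap on rounds are assumptions made for illustration.

```python
import networkx as nx

def best_shot_best_response(graph, actions, node):
    """Contribute only if no neighbour already does; otherwise free-ride on their contribution."""
    neighbour_contributes = any(actions[v] == 1 for v in graph.neighbors(node))
    return 0 if neighbour_contributes else 1

def best_response_dynamics(graph, actions, max_rounds=100):
    """Repeatedly let each player switch to its best response until nobody wants to change."""
    for _ in range(max_rounds):
        changed = False
        for node in graph.nodes:
            best = best_shot_best_response(graph, actions, node)
            if best != actions[node]:
                actions[node] = best
                changed = True
        if not changed:
            break
    return actions

# On a star graph, the dynamics typically converge to either the centre or all the leaves
# contributing, depending on the initial actions and the update order.
g = nx.star_graph(4)
print(best_response_dynamics(g, {v: 0 for v in g.nodes}))
```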
While individuals naturally focus on their own interests, sometimes we need a bigger-picture view to improve outcomes for everyone. This is where central planners and institutions come in - they can create rules, incentives, and structures that encourage beneficial behaviours across entire populations. These can range from formal systems (like laws and contracts) to informal ones (like social norms and customs) that naturally emerge to guide behaviour.
Reinforcement Learning
Reinforcement learning is like teaching through trial and error - an AI agent learns to make decisions by trying different actions and getting feedback in the form of rewards. Unlike other forms of machine learning where the AI learns from pre-labelled examples or finds patterns in data, reinforcement learning agents must actively explore their environment, figuring out which actions lead to the best outcomes in different situations. The tricky part is balancing the need to exploit actions that are known to work well versus exploring new actions that might work even better - much like a restaurant-goer deciding between ordering a favourite dish or trying something new on the menu.
When problems get complex, like in games such as Go where there are more possible positions than atoms in the universe, simple lookup tables of good moves won't cut it. That's where "deep" reinforcement learning comes in, using neural networks to help the AI generalise from its experiences and make smart decisions in new situations. These neural networks act like sophisticated pattern recognisers, helping the AI understand which features of a situation are important for making good decisions, even in scenarios it hasn't exactly seen before.
Markov Decision Process
Imagine you're playing a strategic game where every move depends on your current situation. That's essentially what a Markov Decision Process (MDP) models - an intelligent agent making decisions in an environment where each action's outcome depends only on the present state. It's like a simplified version of real-world decision-making, where the past doesn't matter as much as what's happening right now.
At its core, an MDP breaks down decision-making into a few key elements: states (your current situation), actions (what you can do), transition probabilities (how likely different outcomes are), rewards (what you gain from each action), and a discount factor (how much you care about future gains compared to immediate ones). The magic of MDPs is their "memoryless" property - they assume that the next state depends only on the current state, which makes complex scenarios more manageable for computational analysis.
In the world of reinforcement learning, an agent's policy is like a strategic playbook that determines its behaviour across different situations, even when those situations aren't fully clear. Think of it as a decision-making guide that can either choose a single best action or assign probabilities to different actions. The core objective is simple yet powerful: maximise total rewards over time. Each action generates a reward that either reinforces or discourages future behaviour, creating a learning mechanism where the agent gradually discovers the most beneficial strategies. The "value" of a state isn't just about immediate payoff, but about its potential to lead to future rewards - imagine a seemingly unremarkable chess position that actually sets up a brilliant checkmate. By constantly estimating these state values, the agent learns to make increasingly sophisticated decisions, transforming raw experience into strategic wisdom. Unlike some approaches that build a detailed environmental model, this method learns directly from interactions, making it adaptable and robust across diverse scenarios.
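As a tiny illustration of why a state's value depends on future rewards rather than just the immediate payoff, consider the discounted return; the discount factor of 0.9 is an arbitrary example value.

```python
# The value of a trajectory is the sum of future rewards weighted by the discount factor gamma.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A state with no immediate payoff can still be valuable if it leads to large rewards later,
# much like the unremarkable chess position that sets up a checkmate.
print(discounted_return([0, 0, 0, 10]))  # 7.29: one delayed reward, still worth a lot
print(discounted_return([1, 1, 1, 1]))   # 3.44: steady small rewards
```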
Temporal Difference Learning
Temporal Difference (TD) learning is a technique in reinforcement learning that updates its understanding after every single step, rather than waiting for a complete episode to finish. It's like constantly adjusting your strategy in a game, learning from each move instead of only reviewing your performance at the end. This approach is particularly powerful because it can handle ongoing, never-ending scenarios - think of continuous tasks like controlling a robot or playing an infinite game.
TD learning introduces a fascinating concept called "bootstrapping," where it estimates future value based on the current best guess. By comparing the reward just received with the predicted value of the next state, the algorithm can quickly refine its understanding. This method is typically more stable and efficient than traditional Monte Carlo methods, which require waiting until the end of an entire sequence to learn. The magic happens through a simple update rule that balances immediate rewards with predicted future value, allowing the agent to learn incrementally and adapt its strategy on the fly.
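For reference, the TD(0) update rule looks like this in code; the learning rate and discount factor are illustrative values.

```python
# TD(0) update for a state-value estimate V, with learning rate alpha and discount gamma.
# The TD error compares the reward just received plus the bootstrapped estimate of the next
# state's value against the current estimate, and nudges the estimate towards that target.
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    td_target = reward + gamma * V[next_state]   # bootstrapped target
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return V

V = {"s0": 0.0, "s1": 0.5}
td_update(V, "s0", reward=1.0, next_state="s1")
print(V["s0"])  # moved slightly towards the target 1.0 + 0.9 * 0.5 = 1.45
```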
Q-Learning, Double Q-Learning, and Deep Q-Networks
Q-learning is a reinforcement learning technique that helps agents learn the best actions in different situations. Unlike traditional methods that focus on state values, Q-learning looks at the value of specific actions in each state. The real magic happens when the agent starts to balance exploring new possibilities with exploiting known good strategies. This is typically done using an "ε-greedy" approach, where the agent sometimes chooses the best-known action and sometimes tries something random, gradually becoming more confident as it learns.
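A minimal ε-greedy selection rule might look like the following; the epsilon value shown, and the common practice of decaying it over training, are typical choices rather than the exact settings used here.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore a random action, otherwise exploit the best-known one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

q_values = [0.2, 0.8, 0.5]
print(epsilon_greedy(q_values, epsilon=0.1))  # usually action 1, occasionally a random action
```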
Deep Q-Networks (DQN) take this concept to the next level by using neural networks to handle complex, high-dimensional environments. To prevent the learning process from becoming unstable, the algorithm uses two tricks: experience replay and a separate target network. Experience replay is like letting the agent review and learn from its past experiences in a random order, while the target network helps stabilise the learning process by providing a more consistent reference point.
One of the most interesting challenges in Q-learning is avoiding overestimation of action values. The Double DQN (Deep Q-Network) approach solves this by using two different networks to select and evaluate actions, kind of like having two judges instead of one. This helps prevent the agent from becoming overly optimistic about certain actions and ensures a more balanced learning approach. The result is a more robust learning algorithm that can handle complex, dynamic environments with greater stability and accuracy.
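Concretely, Double DQN separates action selection from action evaluation. The sketch below assumes PyTorch tensors and made-up Q-values; it shows only the target computation, not the full training loop.

```python
import torch

def double_dqn_target(reward, next_state_q_online, next_state_q_target, gamma=0.99, done=False):
    """The online network picks the argmax action; the target network supplies its value estimate."""
    best_action = torch.argmax(next_state_q_online)    # selection by the online network
    bootstrap = next_state_q_target[best_action]       # evaluation by the target network
    return reward + (0.0 if done else gamma * bootstrap.item())

# Toy example with made-up Q-values for three actions in the next state.
target = double_dqn_target(
    reward=1.0,
    next_state_q_online=torch.tensor([0.1, 0.9, 0.3]),
    next_state_q_target=torch.tensor([0.2, 0.4, 0.8]),
)
print(target)  # 1.0 + 0.99 * 0.4
```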
Graph Neural Networks
Graph Neural Networks (GNNs) are like intelligent mappers that can understand complex relationships in connected data. These networks can tackle tasks like predicting connections, classifying nodes, or understanding entire graph structures, making them incredibly versatile for problems where relationships matter more than individual data points.
Unlike traditional neural networks that treat data as independent elements, GNNs excel at capturing intricate connections by allowing information to flow between neighbouring nodes. They've evolved from early approaches like DeepWalk to more sophisticated methods that can handle massive, dynamic graphs, essentially learning to create meaningful representations of complex interconnected systems. By using techniques like message passing and pooling, these networks can extract insights from data as diverse as social networks, protein interactions, and recommendation engines.
Structure2vec is an approach to understanding complex networks by creating "embeddings" - essentially digital fingerprints for each node in a graph. Think of it like teaching a computer to understand relationships by looking at how connected points influence each other. By iteratively passing information between neighbouring nodes and learning how to weigh their features, the algorithm can create rich representations that capture the subtle nuances of complex networks.
The magic happens through a process called mean-field inference, where the algorithm makes multiple passes through the network, gradually refining its understanding of how each node relates to its neighbours. Unlike traditional methods, structure2vec can handle large, intricate graphs with minimal computational overhead, making it possible to extract meaningful insights from everything from social networks to molecular structures. By learning these smart, adaptive embeddings, the technique opens up new possibilities for understanding interconnected systems in ways that were previously impossible.
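A heavily simplified version of this iterative update is sketched below; the single weight matrix per term, the ReLU non-linearity, and the fixed number of rounds are assumptions for illustration rather than the exact parameterisation of structure2vec.

```python
import numpy as np

def structure2vec_embeddings(adjacency, node_features, dim=16, rounds=4, seed=0):
    """Iteratively combine each node's features with the summed embeddings of its neighbours."""
    rng = np.random.default_rng(seed)
    n, f = node_features.shape
    W1 = rng.normal(scale=0.1, size=(f, dim))     # transforms raw node features
    W2 = rng.normal(scale=0.1, size=(dim, dim))   # transforms aggregated neighbour embeddings
    mu = np.zeros((n, dim))
    for _ in range(rounds):
        neighbour_sum = adjacency @ mu            # aggregate neighbour embeddings
        mu = np.maximum(0.0, node_features @ W1 + neighbour_sum @ W2)  # ReLU non-linearity
    return mu

# Toy 3-node path graph with one-hot node features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)
print(structure2vec_embeddings(A, X).shape)  # (3, 16): one embedding per node
```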
PlanNet
Three Step Markov Decision Process
In our approach, a central planner can tweak a network to nudge players towards a more cooperative outcome. It breaks the process down into a three-step dance: first selecting an action type (add edge, remove edge, flip player action), then choosing the first node, and finally selecting the second node. Each move triggers a recalculation of the network's equilibrium (players change actions if it is optimal for them), allowing the planner to gradually reshape interactions in a strategic, step-by-step manner.
The total number of steps in the game is determined by the planner's budget, which limits the number of actions the planner can take. The planner sees the current state of the graph, represented by a GNN, and selects an action. If that action is not allowed - for example, because it would disconnect the graph or create a self-loop - the move is skipped. After each move, the planner receives a reward.
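The sketch below illustrates how one planner step could be decomposed into these three sub-decisions and checked for validity; the function names and interface are hypothetical, but the constraints mirror those described above (no self-loops, no disconnecting the graph).

```python
import networkx as nx

ACTION_TYPES = ("add_edge", "remove_edge", "flip_action")

def is_valid(graph, action_type, u, v):
    """Reject moves that would create self-loops or disconnect the graph."""
    if action_type == "add_edge":
        return u != v and not graph.has_edge(u, v)
    if action_type == "remove_edge":
        if not graph.has_edge(u, v):
            return False
        trial = graph.copy()
        trial.remove_edge(u, v)
        return nx.is_connected(trial)
    return True  # flipping a player's action only needs the first node

def apply_planner_step(graph, actions, action_type, u, v=None):
    """Apply one full three-part decision; invalid moves are simply skipped."""
    if action_type == "flip_action":
        actions[u] = 1 - actions[u]
    elif is_valid(graph, action_type, u, v):
        graph.add_edge(u, v) if action_type == "add_edge" else graph.remove_edge(u, v)
    # ...then recompute the equilibrium via best-response dynamics and compute the reward
    return graph, actions
```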
Objective and Intermediate Rewards
We set up the planner agent to maximise the network's "social welfare" - essentially how well all players in a system are doing. This approach involves carefully designing rewards to guide an AI agent towards creating more cooperative and beneficial network structures. The challenge is creating a reward system that encourages long-term strategic thinking rather than chasing short-term gains, which can sometimes mean temporarily moving through less optimal states to reach a truly optimal configuration.
The key innovation is introducing "intermediate rewards" that provide feedback after each step, helping the agent understand its progress without getting stuck in local optima. By scaling and shaping these rewards, the AI can learn to make nuanced network modifications that gradually improve overall system performance, much like a chess player thinking several moves ahead to set up a winning strategy.
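One simple way to realise such intermediate rewards is to reward the per-step change in average payoff; the scaling here is an illustrative assumption, and the exact shaping used by PlanNet may differ.

```python
def social_welfare(payoffs):
    """Average payoff across all players in the network."""
    return sum(payoffs) / len(payoffs)

def intermediate_reward(payoffs_before, payoffs_after, scale=1.0):
    """Reward the planner for the per-step improvement in social welfare. The reward can be
    negative, which lets the planner pass through temporarily worse states on the way to a
    better final configuration."""
    return scale * (social_welfare(payoffs_after) - social_welfare(payoffs_before))

print(intermediate_reward([0.2, 0.4, 0.4], [0.5, 0.5, 0.5]))  # ~0.167 improvement
```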
Q-Network
PlanNet is an AI agent designed to strategically modify network structures, combining the power of Double Deep Q-Networks with a graph neural network called structure2vec. The approach breaks decision-making down into three sub-steps, using a shared embedding network that learns to understand the graph's structure while allowing specialised decision-making at each stage. By processing graph- and node-level information simultaneously, the agent can make nuanced choices about adding, removing, or modifying network connections.
At each sub-step, the agent receives detailed information about the current network state, including previous actions and agent behaviours. By using techniques like batch normalisation and carefully designed neural network layers, PlanNet can learn to make strategic modifications that gradually improve the network's overall performance, balancing exploration of new possibilities with exploitation of known good strategies.
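The sketch below gives a rough sense of how three Q-value heads could share a single graph embedding; the dimensions, layer sizes, and interface are assumptions (batch normalisation is omitted for brevity), with the embedding module standing in for structure2vec.

```python
import torch
import torch.nn as nn

class ThreeStepQNetwork(nn.Module):
    """Three Q-value heads over a shared graph embedding: action type, first node, second node.
    Expects a 1-D graph embedding of shape (embed_dim,) and node embeddings of shape (n, embed_dim)."""

    def __init__(self, embed_dim=64, n_action_types=3):
        super().__init__()
        self.q_action_type = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, n_action_types))
        self.q_first_node = nn.Sequential(nn.Linear(2 * embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.q_second_node = nn.Sequential(nn.Linear(3 * embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, graph_embedding, node_embeddings, first_node_embedding=None):
        n = node_embeddings.shape[0]
        graph_rep = graph_embedding.unsqueeze(0).expand(n, -1)
        # Sub-step 1: score each intervention type from the graph-level embedding.
        type_q = self.q_action_type(graph_embedding)
        # Sub-step 2: score each candidate first node, conditioned on the graph embedding.
        first_q = self.q_first_node(torch.cat([node_embeddings, graph_rep], dim=-1))
        # Sub-step 3: score each candidate second node, also conditioned on the chosen first node.
        second_q = None
        if first_node_embedding is not None:
            first_rep = first_node_embedding.unsqueeze(0).expand(n, -1)
            second_q = self.q_second_node(torch.cat([node_embeddings, graph_rep, first_rep], dim=-1))
        return type_q, first_q, second_q
```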
Results
We evaluated our approach by comparing it against several baseline strategies. The first comparison was with the initial equilibrium established through best response dynamics without any interventions. We also tested against a random policy that made completely arbitrary network modifications. In the majority game, we introduced an additional comparison with a hard-coded "preferential edge addition" policy. This baseline strategy targeted nodes with the lowest reward playing the globally non-dominant action, connecting them to high-performing nodes playing the dominant action. The underlying hypothesis was that struggling agents, when strategically linked to successful performers, might be more likely to change their behaviour. By potentially triggering a cascading effect where network connections induce behavioural shifts, this approach aimed to naturally improve social cohesion and overall network welfare, with all actions still constrained by the available network modifications.
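For reference, the preferential edge addition baseline can be sketched roughly as follows; the tie-breaking and the handling of already-connected pairs are assumptions, and the graph is expected to be a networkx-style object.

```python
from collections import Counter

def preferential_edge_addition(graph, actions, rewards):
    """Connect the lowest-reward node playing the globally non-dominant action to the
    highest-reward node playing the dominant action, if they are not already connected."""
    dominant = Counter(actions.values()).most_common(1)[0][0]
    strugglers = [v for v in graph.nodes if actions[v] != dominant]
    leaders = [v for v in graph.nodes if actions[v] == dominant]
    if not strugglers or not leaders:
        return None                       # nothing to do: everyone already agrees
    worst = min(strugglers, key=rewards.get)
    best = max(leaders, key=rewards.get)
    if graph.has_edge(worst, best):
        return None
    graph.add_edge(worst, best)
    return (worst, best)
```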
We evaluated our model by first training and testing on the same type and size of synthetic graph in our dataset. We set an action budget of 10% of the initial edges in the graph for each graph size and type. Thus, for a graph with 20 edges we have a budget of 2 moves, and for a graph with 250 edges we have a budget of 25 moves. We set each action to have the same associated cost.
The validation set performance during training can be seen in the figure below. We observe that, in general, the agent finds it easier to learn a policy resulting in significant social welfare improvements for smaller graphs, with the training time required to converge to a successful policy generally increasing as a function of the graph size.
In the best-shot public goods game, we observe a clear trend for our approach to improve social welfare on all graph sizes, although the absolute social welfare tends to decrease as a function of graph size across all graph types. For the majority game, we observed that the Nash equilibrium found through best-response dynamics was often optimal or near-optimal, with almost all agents converging to playing the same action. This high initial social welfare gave little room for improvements. Even so, our agents could reliably improve the social welfare, often to the optimal value for smaller graphs.
To test how well our approach adapts to new network structures, we evaluated our trained models on test sets with different graph sizes and configurations. Surprisingly, we found remarkably consistent performance across varying graph structures, with only minor variations in out-of-distribution test results. Interestingly, models trained on smaller graphs in the best-shot game demonstrated particularly robust generalisation capabilities. This could be because smaller graphs allow for more efficient learning of core network strategies, while larger graphs require significantly more computational time to train - with training complexity increasing polynomially with network size. The most exciting insight is that the fundamental patterns and strategic approaches discovered in small-scale networks appear to successfully scale up, suggesting that our method can learn generalisable principles of network modification that aren't simply tied to specific graph sizes.
Conclusion
This blog presents a method for learning interventions in network games to achieve social objectives, leveraging advanced graph neural networks and reinforcement learning. By developing a novel approach that avoids direct reliance on graph spectral properties and enables dynamic graph modifications, we demonstrated effective performance across various graph types and social objectives. Our work introduces key technical innovations including a Markov Decision Process decomposition, Double Q-learning to reduce overestimation, and intermediate rewards to address sparse reward signals. Critically, we showed model generalisability from small synthetic to real-world graphs.