The Bipartite Map Equation
Posted on June 14, 2021
by Christopher Blöcker
Tags: Map Equation, PhD
In this post, I give a shorter and less mathy explanation of what I’ve written about in my paper Mapping Flow on Bipartite Networks.
Abstract
The bipartite map equation is a tool for community detection in bipartite networks. It is an adaptation of the map equation and uses the insight that random walks in bipartite networks have to alternate between the two different node types. A tuning parameter controls to what degree the bipartite network structure is used when searching for communities. Applied to real-world networks, for example authorship networks where authors are connected to articles, or social networks where researchers are connected to conferences, the bipartite map equation can find communities at different scales, depending on the chosen parametrisation.
Background: The Map Equation
The map equation is a tool for community detection that uses principles from information theory. It takes a graph and a partition \(M\) of its nodes into modules (\(M\) for “map”) and answers the question of how well the map describes the structure of the graph. For that, it uses the concept of a random walk: in a map that describes the graph’s structure well, a random walk tends to stay within modules more than it switches between them. But in a map that does not describe the structure well, the random walk switches between modules more often.
The simplest possible map assigns all nodes to the same module; we call this the one-level map \(M_1\). Consider the following graph with seven nodes and the map \(M_1\):
How good is this map \(M_1\)? The map equation answers this question by calculating how expensive it is, in terms of codewords, to describe the random walker’s new location every time it takes a step. A codeword is a string of ones and zeros, for example \(1101\) or \(101010\). Each node \(u\) has a unique codeword \(w_u\) according to a Huffman code. Further, each node \(u\) also has a so-called visit rate \(p_u\), that is, the probability that the random walker is at the node at any point in time. Putting codewords and visit rates together, we can calculate how many bits we need on average to describe the random walk: \[ L_G\left(M_1\right) = \sum_{u} p_u \cdot \left|w_u\right|, \] where \(\left|w_u\right|\) stands for the length of codeword \(w_u\). We call \(L_G\left(M_1\right)\) the codelength of \(M_1\).
To calculate the codelength, we need to figure out the visit rate and codeword for each node. Since the graph is undirected and unweighted, we can calculate the visit rates as \[ p_u = \frac{k_u}{\sum_v k_v}, \] where \(k_u\) is the degree of node \(u\). Using the visit rates, we can design a Huffman code to determine the codeword lengths and plug everything into the equation above to calculate \(L_G\left(M_1\right)\). In our example, we get 2.80 bits:
\(u\) | \(p_u\) | \(w_u\) | \(\left|w_u\right|\) | \(p_u \cdot \left|w_u\right|\) |
---|---|---|---|---|
1 | \(0.15\) | 110 | 3 | 0.45 |
2 | \(0.15\) | 111 | 3 | 0.45 |
3 | \(0.15\) | 100 | 3 | 0.45 |
4 | \(0.20\) | 01 | 2 | 0.40 |
5 | \(0.15\) | 101 | 3 | 0.45 |
6 | \(0.10\) | 000 | 3 | 0.30 |
7 | \(0.10\) | 001 | 3 | 0.30 |
\(\sum\) | \(1\) | | | 2.80 |
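As a sanity check, here is a small Haskell sketch of this calculation. The node degrees \(3, 3, 3, 4, 3, 2, 2\) are an assumption consistent with the visit rates in the table, and the codeword lengths are the ones from the Huffman code above.

```haskell
-- Visit rates for an undirected, unweighted graph: degree over total degree.
visitRates :: [Int] -> [Double]
visitRates degrees = [ fromIntegral k / total | k <- degrees ]
  where total = fromIntegral (sum degrees)

-- Average codelength: visit-rate-weighted codeword lengths.
codelength :: [Double] -> [Int] -> Double
codelength ps lengths = sum (zipWith (\p l -> p * fromIntegral l) ps lengths)

-- codelength (visitRates [3, 3, 3, 4, 3, 2, 2]) [3, 3, 3, 2, 3, 3, 3]
-- ==> 2.8 bits, matching the table.
```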
A different but equivalent way to think about this is to consider a random process \(X\) whose value is the random walker’s current node. The state probabilities are the same as the node visit rates \(p_u\), and we can calculate the entropy \(H\left(X\right)\) of the process as \[ H\left(X\right) = - \sum_{u} p_u \log_2 p_u. \]
For the one-level map \(M_1\) in our example graph we get \(H\left(X\right) \approx 2.77\) bits. The reason for the small discrepancy between \(2.80\) bits and \(2.77\) bits is that we used actual codewords in the first approach, but the theoretical lower limit for the codeword length (\(-\log_2 p_u\)) in the second. From now on, we will go with the lower-limit approach, which tells us the best possible codelength we could ever achieve, even though we might not actually be able to achieve it because bits are indivisible and codewords always have to use full bits. In practice this doesn’t matter because we don’t have to define the codewords: we only need to know the shortest possible codelength for a given map to compare it against other maps and decide which one is better.
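The lower-limit version is just the entropy of the visit rates; a minimal sketch:

```haskell
-- Shannon entropy in bits; skips zero-probability states.
entropy :: [Double] -> Double
entropy ps = negate (sum [ p * logBase 2 p | p <- ps, p > 0 ])

-- entropy [0.15, 0.15, 0.15, 0.20, 0.15, 0.10, 0.10]  ~=  2.77 bits
```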
Speaking of comparing maps, let’s take a look at another map, \(M_a = \left\{\left\{1,2,3,4\right\}, \left\{5,6,7\right\}\right\}\), that splits the graph into two modules. Intuitively, this map should make sense: nodes within modules are well-connected to each other and the number of connections between modules is small. But how do we calculate the codelength for a hierarchical map? Of course with recursion! The entropy of a random walk through several modules is a weighted average of the individual module entropies plus the entropy of switching between modules. We can think of this as two or more layers of random processes:
- at the top level, there is a random process \(X_M\) that becomes active when the random walker switches modules (the top level is also called the “index level”)
- at the module level, each module \(m \in M\) has a random process \(X_m\) that becomes active when the random walker is in \(m\)
and each of these processes contributes to some degree, namely the fraction of time that the random walker spends in the corresponding area of the network, or switching between areas. The map equation calculates the codelength of a hierarchical map as \[ L\left(M\right) = p_M \cdot H\left(X_M\right) + \sum_{m \in M} p_m \cdot H\left(X_m\right), \] where \(p_M\) and \(p_m\) are the fractions of time that the random walker spends switching between modules and moving within module \(m\), respectively. Plugging \(M_a\) into the map equation gives \(L_G\left(M_a\right) \approx 2.41\) bits, which is clearly better than the \(L_G\left(M_1\right) \approx 2.77\) bits for \(M_1\).
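In code, the two-level map equation is a one-liner once the rates and entropies are known. This is a minimal sketch with names of my own choosing; it assumes those values are already computed (how to derive them from the graph is covered in the post linked below).

```haskell
-- Two-level map equation: index-level term plus weighted module terms.
-- The first argument pairs the index-level usage rate with its entropy;
-- the list pairs each module's usage rate with its entropy.
mapEquation :: (Double, Double) -> [(Double, Double)] -> Double
mapEquation (pIndex, hIndex) modules =
  pIndex * hIndex + sum [ pm * hm | (pm, hm) <- modules ]
```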
Now you might wonder how exactly to calculate all these values for a specific map. If so, take a look at this post, where I explain the map equation in more detail and from a slightly different point of view, including executable code in Haskell.
Random Walks on Bipartite Graphs
A bipartite graph is a graph that has two sets of nodes, let’s call them \(A\) and \(B\), and edges can only connect nodes from \(A\) with nodes from \(B\). Connections within \(A\) and within \(B\) are forbidden. Popular examples of bipartite networks are authorship networks where authors are connected to the articles they have written, social networks where scientists are connected to conferences, and rating networks where users are connected to the products they rate.
The edge constraints in bipartite graphs have an important consequence for random walks: in every step, a random walker has to switch between a node of type \(A\) and a node of type \(B\). There’s always a user, then a product, then a user, then a product…, but never two users or two products in a row. This means that the random walker spends half of the time on user nodes and half of the time on product nodes.
Using a single random process \(X\) that switches between all nodes in the graph according to their visit rates would be wasteful because it ignores the bipartite constraint: the process could potentially jump between same-type nodes, even though this would never actually happen when we simulate a bipartite random walk. Instead, we can think of switching between two random processes, \(X_A\) and \(X_B\), one for each of the two sets of nodes: half of the time \(X_A\) “selects” a node from \(A\), and half of the time \(X_B\) “selects” a node from \(B\). Let’s call this combined process \(X_{AB}\). This also avoids the problem that random walks in bipartite networks have no stationary visit rates. Instead of thinking about visit rates for all nodes at once, we think about the visit rates of type-\(A\) (or type-\(B\)) nodes, given that the random walker is at a type-\(A\) (or type-\(B\)) node. The entropy of \(X_{AB}\) is simply the average of the entropies of \(X_A\) and \(X_B\): \[ H\left(X_{AB}\right) = \frac{1}{2} H\left(X_A\right) + \frac{1}{2} H\left(X_B\right). \]
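As a sketch, the entropy of \(X_{AB}\) reuses the entropy function from above; the arguments pA and pB (my own names) are the type-conditional visit rates, so each list sums to \(1\):

```haskell
-- Entropy of the combined process X_AB: the average of the entropies
-- of the two type-conditional processes X_A and X_B.
bipartiteEntropy :: [Double] -> [Double] -> Double
bipartiteEntropy pA pB = 0.5 * entropy pA + 0.5 * entropy pB
```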
We can relate \(X\) to \(X_{AB}\) through Bayes’ rule for conditional entropy, \[H\left(X|Y\right) = H\left(X\right) - H\left(Y\right) + H\left(Y|X\right),\] or rewritten into the form we need: \[H\left(X\right) = H\left(Y\right) + H\left(X|Y\right) - H\left(Y|X\right).\] We introduce one more process, \(Y\), that simply alternates between \(A\) and \(B\). That is, \(Y\) generates an infinite sequence \(A, B, A, B, A, B, \ldots\) and has entropy \(H\left(Y\right) = 1\) bit. Note that \(H\left(Y|X\right) = 0\) because knowing the current node fully determines its type, and \(H\left(X|Y\right) = H\left(X_{AB}\right)\) because describing the walker’s position given the current node type is exactly what \(X_A\) and \(X_B\) do. The connection between \(X\) and \(X_{AB}\) is therefore: \[ \begin{align} H\left(X\right) & = H\left(Y\right) + H\left(X_{AB}\right) \\ & = 1 + H\left(X_{AB}\right) \\ & = 1 + \frac{1}{2} H\left(X_A\right) + \frac{1}{2} H\left(X_B\right). \end{align} \] This tells us that, by keeping track of node types, \(X_{AB}\) saves one bit per random-walker step compared to \(X\).
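We can check this numerically on a toy example of my own (not from the paper): the star \(K_{1,3}\) with one type-\(A\) hub connected to three type-\(B\) leaves.

```haskell
-- Sanity check of H(X) = 1 + H(X_AB) on the star K_{1,3}.
-- Unconditional visit rates: hub 1/2, each leaf 1/6.
-- Conditional rates: hub 1 given type A; each leaf 1/3 given type B.
checkOneBitSaving :: Bool
checkOneBitSaving = abs (hX - (1 + hXAB)) < 1e-9
  where
    hX   = entropy [1/2, 1/6, 1/6, 1/6]
    hXAB = bipartiteEntropy [1] [1/3, 1/3, 1/3]
```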
And how does this relate to community detection? Well, instead of calculating the codelength based on \(X\), we can use \(X_{AB}\) for bipartite networks. Let \(G_{bi}\) be a bipartite graph and \(M_1\) the one-level map of \(G_{bi}\); then the codelength of \(M_1\) is \[ \begin{align} L_{G_{bi}}\left(M_1\right) & = H\left(X_{AB}\right) = \frac{1}{2} H\left(X_A\right) + \frac{1}{2} H\left(X_B\right), \end{align} \] which is \(1\) bit lower than with the standard map equation. For hierarchical maps \(M\), the codelength is \[ \begin{align} L_{G_{bi}}\left(M\right) & = \frac{1}{2} \left[p_{A_M} H\left(X_{A_M}\right) + \sum_{m \in M} p_{A_m} H\left(X_{A_m}\right)\right] \\ & + \frac{1}{2} \left[p_{B_M} H\left(X_{B_M}\right) + \sum_{m \in M} p_{B_m} H\left(X_{B_m}\right)\right], \end{align} \] where the subscripts \(A_*\) and \(B_*\) denote restriction to type-\(A\) and type-\(B\) nodes, respectively. For example, \(A_m\) are the type-\(A\) nodes in module \(m\), and \(X_{A_m}\) is a random process over the type-\(A\) nodes in \(m\).
Essentially, we calculate the bipartite codelength by averaging the type-\(A\) and type-\(B\) codelengths. Imagine running a random walk but only considering visits to type-\(A\) nodes; then take the same random walk but only consider visits to type-\(B\) nodes. By averaging the two, we account for both node types.
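Structurally, this is the average of two applications of the two-level sketch from above, one per node type. Again a sketch with assumed, pre-computed inputs:

```haskell
-- Bipartite map equation: apply the two-level map equation to the
-- type-A restriction and to the type-B restriction, then average.
bipartiteMapEquation :: ((Double, Double), [(Double, Double)])  -- type-A terms
                     -> ((Double, Double), [(Double, Double)])  -- type-B terms
                     -> Double
bipartiteMapEquation (idxA, modsA) (idxB, modsB) =
  0.5 * mapEquation idxA modsA + 0.5 * mapEquation idxB modsB
```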
Blurring the Line with Noise
So far, we have treated node types in one of two ways: we have either ignored them or made full use of them. But it turns out that there is an option in between, and we can use node types at any rate \(\alpha\) by introducing noise.
To represent the noise-free case, we introduced the process \(Y\) that alternates between \(A\) and \(B\). Keeping track of node types allowed us to reduce the codelength of the one-level map by the entropy of \(Y\), that is, by exactly \(1\) bit. \(Y\) never makes mistakes about the node type, and if we know what state \(X_{AB}\) is in, we also know the state of \(Y\); that is, the conditional entropy of \(Y\), given \(X_{AB}\), is zero: \(H\left(Y | X_{AB} \right) = 0\).
But what if we now introduce a node-type error rate \(\alpha\)? Then knowing the state of \(X_{AB}\) no longer means that we can determine the state of \(Y\) with certainty. To be clear: we are not changing the true node types, we are only pretending that it is difficult to “measure” them.
Imagine that the current state of \(X_{AB}\) is some type-\(A\) node. We try to measure this, but the probability to make a mistake is \(\alpha\). That means that with probability \(1-\alpha\) we will determine, correctly, that the current node type is \(A\), but with probability \(\alpha\) we make a mistake and end up thinking that the current node type is \(B\).
Let’s call this new node-type process that makes mistakes with probability \(\alpha\) simply \(Y_\alpha\). The state of \(Y_\alpha\) is not fully determined by \(X_{AB}\); instead, we have \(H\left(Y_\alpha | X_{AB} \right) = H\left(\alpha, 1-\alpha\right) = H_\alpha\). The amount by which we can reduce the codelength of the one-level map therefore shrinks from \(1\) bit in the noise-free case to \(1 - H_\alpha\) bits: we gain \(H\left(Y_\alpha\right) = 1\) bit from tracking node types but lose \(H_\alpha\) bits to the measurement noise. By recursion, in hierarchical maps, the codelength of each module is reduced by \(1 - H_\alpha\), where this reduction is weighted by the fraction of time that the module’s respective random process is active.
There are three values for \(\alpha\) that have a simple interpretation (a small sketch follows the list):
- \(\alpha = 0\) corresponds to never making mistakes and being absolutely certain about the node type.
- \(\alpha = \frac{1}{2}\) corresponds to making mistakes half of the time. Effectively, this is the same as not using node types at all and leads to the same results as the standard map equation.
- \(\alpha = 1\) corresponds to always making mistakes and is effectively the same as never making mistakes: when every measurement is wrong, we can simply flip the measured type and recover the truth.
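This sketch makes the three cases concrete: the per-step codelength saving is \(1 - H_\alpha\), with \(H_\alpha\) the binary entropy of the error rate (hAlpha and saving are my own names).

```haskell
-- Binary entropy H(alpha, 1 - alpha) in bits.
hAlpha :: Double -> Double
hAlpha a = negate (sum [ p * logBase 2 p | p <- [a, 1 - a], p > 0 ])

-- Per-step codelength saving from (noisy) node-type information.
saving :: Double -> Double
saving a = 1 - hAlpha a

-- saving 0.0  ==> 1.0 bit  (perfect type information, fully bipartite)
-- saving 0.5  ==> 0.0 bits (no usable information, standard map equation)
-- saving 1.0  ==> 1.0 bit  (always wrong is as good as always right)
```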
Some Results
What can we do with this? Well, we can try out different values for \(\alpha\) and see how the detected communities change depending on how much node-type information we use.
The Fonseca-Ganade ant-plant network is an ecological network where the round nodes represent ant species and the square nodes represent plant species. Ant species are connected to the plant species that they use as a food source or for housing. If we don’t use the bipartite node-type information and set \(\alpha = \frac{1}{2}\), we find six communities. If you want to see more results for different networks, I recommend taking a look at my paper Mapping Flow on Bipartite Networks. I have also implemented the bipartite map equation in a branch of the community detection software package Infomap, in case you want to play with it.