#### Introduction

The map equation provides a way to measure how well a clustering of the nodes in a graph captures that graph’s structure, taking the “connectedness” within and between clusters into account. In this blog post, we will cover the required background to understand the map equation and give an implementation of the map equation in Haskell.

#### Background

Before we can dive into how the map equation works, we need to cover some background on graph theory and Huffman codes. If you feel like you know enough about those topics, you can skip the background section and continue with the map equation.

##### Graph Theory

Graph theory is a branch of mathematics that studies graphs.

A graph \(G = \left( V, E, \delta \right)\) is a triple with:

- \(V\), the set of vertices (or nodes) of \(G\)
- \(E \subseteq V \times V\), the set of edges (or links) of \(G\) and
- \(\delta \colon E \to \mathbb{K}\), a function that assigns values (or weights) over some field \(\mathbb{K}\) to the graph’s edges.

Depending on the application, we can interpret the edge weights in different ways: e.g., they can stand for the distance between nodes or the strength of some sort of flow between nodes. We can calculate the total edge weight \(\Delta\) in a graph by

\[\Delta \left( G \right) = \sum_{e \in E} \delta\left(e\right).\]

In Haskell, we can use the following data types to represent graphs:

```
type Vertex = Int
type Edge = (Vertex, Vertex)

data Graph = Graph { nodes :: [Vertex]
                   , edges :: [Edge]
                   , delta :: Edge -> Double
                   }
```

The total edge weight can be calculated like this:

```
totalWeight :: Graph -> Double
totalWeight (Graph _ edges delta) = sum [delta e | e <- edges]
```

Graphs can be directed or undirected. In directed graphs, edges have a direction, similar to one-way streets, and can only be used in one direction. In undirected graphs, edges can be used in both directions. For simplicity, we only consider undirected graphs here.

For example, we can draw the undirected graph \(G_\triangle = \left( \left\{ 1, 2, 3 \right\} , \left\{ \left( 1, 2 \right), \left( 2, 3 \right), \left( 3, 1 \right) \right\} , e \mapsto 1 \right)\) with \(3\) nodes, \(3\) edges, all edge weights equal to \(1\), and \(\Delta \left( G_\triangle \right) = 3\), like so:

The above graph, \(G_\triangle\), can then be represented like this:

```
g :: Graph
g = let nodes = [1, 2, 3]
        edges = [(1, 2), (2, 3), (3, 1)]
        delta = \(u, v) -> if (u, v) `elem` edges || (v, u) `elem` edges
                             then 1.0
                             else 0.0
    in Graph nodes edges delta
```

##### Huffman Codes

Huffman codes are a class of coding schemes that assign code words to input symbols based on the frequencies of the input symbols. Frequently occurring input symbols correspond to short code words while infrequently occurring input symbols correspond to long code words.

Let \(A = \left\{ a_0, a_1, \ldots, a_{n-1} \right\}\) be a finite alphabet with \(n\) symbols, the source alphabet, and let \(B = \left\{ 0, 1 \right\}\) be a binary alphabet, the coding alphabet. Then, a Huffman code defines an encoding function \(\phi \colon A \to B^+\), i.e., \(\phi\) takes an input symbol \(a \in A\) and returns its code word \(\phi \left( a \right) \in B^+\). Here, \(B^+\) denotes non-empty sequences over \(B\).

Let \(p \colon A \to \mathbb{R}\) be a probability distribution over the symbols in \(A\) and let \(\left| \phi \left( a \right) \right|\) denote the length of the code word for \(a\). Then, a Huffman code assigns *prefix-free* code words to the input symbols \(a \in A\) based on their frequencies \(p \left( a \right)\), such that the average code word length \(L \left( \phi \right)\) is minimised:

\[ L \left( \phi \right) = \sum_{a \in A} p \left( a \right) \left| \phi \left( a \right) \right|. \]

The term prefix-free refers to the fact that no code word is a prefix of any other code word in the code.

For example, consider giving directions in New York City. You can go *left*, *straight*, *right*, or you can make a *u-turn*, so let us define \(A_{NYC} = \left\{ \leftarrow, \uparrow, \rightarrow, \curvearrowleft \right\}\). U-turns and left turns may not be allowed at every intersection, and right turns are much easier and safer than left turns. Furthermore, once you have reached the correct street, you will head straight for a longer stretch. Thus, it is reasonable to expect that heading straight is more common than taking a right turn, and that taking a right turn is much more common than taking a left turn or a u-turn. Let’s say the frequencies of the directions are \(p_{NYC} \left( \leftarrow \right) = 0.1\), \(p_{NYC} \left( \uparrow \right) = 0.7\), \(p_{NYC} \left( \rightarrow \right) = 0.15\), and \(p_{NYC} \left( \curvearrowleft \right) = 0.05\). A possible Huffman code is given in the following table as \(\phi_{NYC}\):

\(a\) | \(p_{NYC} \left( a \right)\) | \(\phi_{NYC} \left( a \right)\) |
---|---|---|
\(\uparrow\) | \(0.7\) | 0 |
\(\rightarrow\) | \(0.15\) | 10 |
\(\leftarrow\) | \(0.1\) | 110 |
\(\curvearrowleft\) | \(0.05\) | 111 |

With this coding scheme, the directions \(\left[ \uparrow, \uparrow, \leftarrow, \uparrow, \rightarrow, \uparrow \right]\) are encoded as \(\left[0, 0, 110, 0, 10, 0 \right]\). As mentioned earlier, we can see that

- the length of code words depends on the frequencies of the input symbols and more frequent symbols have shorter code words and
- no code word is the prefix of any other code word.
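To see why the prefix-free property matters, here is a tiny decoder for \(\phi_{NYC}\) (an illustrative sketch; the spelled-out direction names stand in for the arrow symbols): scanning the input left to right, at most one code word can match, so decoding is unambiguous and needs no separators between code words.

```haskell
-- The code table from above; direction names stand in for the arrows.
code :: [(String, String)]
code = [("0", "straight"), ("10", "right"), ("110", "left"), ("111", "u-turn")]

-- Because the code is prefix-free, at most one code word matches the
-- front of the input, so we can greedily peel off one word at a time.
decode :: String -> [String]
decode [] = []
decode s  = case [ (sym, drop (length w) s) | (w, sym) <- code
                 , take (length w) s == w ] of
              ((sym, rest):_) -> sym : decode rest
              []              -> error "not a valid code sequence"

main :: IO ()
main = print (decode "001100100")
-- ["straight","straight","left","straight","right","straight"]
```

Decoding the concatenation of \(\left[0, 0, 110, 0, 10, 0 \right]\) recovers exactly the directions \(\left[ \uparrow, \uparrow, \leftarrow, \uparrow, \rightarrow, \uparrow \right]\) from above.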

We can get an idea of how good our code is by comparing \(L \left( \phi \right)\) to the theoretical lower bound on the average code word length, which is given by the entropy \(H\):

\[H \left( p \right) = - \sum_{a \in A} p \left( a \right) \log_2 p \left( a \right).\]

In Haskell, we add a normalisation step to make sure that the values add up to \(1\) and calculate the entropy like this:

```
entropy :: [Double] -> Double
entropy xs = - sum [(x / s) * logBase 2 (x / s) | x <- xs, x /= 0]
  where s = sum xs
```

In the New York City example, we have \(L \left( \phi_{NYC} \right) = 1.45\) and \(H \left( p_{NYC} \right) \approx 1.32\).
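These two numbers are easy to verify. The following self-contained sketch does so (`nyc` and `avgLength` are ad-hoc helper names; the entropy helper repeats the definition from above):

```haskell
-- Entropy of a list of (not necessarily normalised) probabilities.
entropy :: [Double] -> Double
entropy xs = - sum [(x / s) * logBase 2 (x / s) | x <- xs, x /= 0]
  where s = sum xs

-- (probability, code word length) pairs for straight, right, left, u-turn.
nyc :: [(Double, Int)]
nyc = [(0.7, 1), (0.15, 2), (0.1, 3), (0.05, 3)]

-- Average code word length L(phi).
avgLength :: [(Double, Int)] -> Double
avgLength sym = sum [p * fromIntegral len | (p, len) <- sym]

main :: IO ()
main = do
  print (avgLength nyc)          -- close to 1.45
  print (entropy (map fst nyc))  -- close to 1.32
```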

#### The Map Equation

Imagine we have a graph \(G = \left( V, E, \delta \right)\) and our goal is to figure out which of its nodes belong together. To be precise, we want to find the best hierarchical clustering of the nodes. By clustering, we mean a partition of the nodes into disjoint sets.

For example, consider the undirected graph \(G = \left( V, E, \delta \right)\) with \(V = \left\{ 1, 2, 3, 4, 5, 6 \right\}\), \(E = \left\{ \left( 1, 2 \right), \left( 2, 3 \right), \left( 3, 1 \right), \left( 3, 4 \right), \left( 4, 5 \right), \left( 5, 6 \right), \left( 6, 4 \right) \right\}\), and \(\delta = e \mapsto 1\). For drawing \(G\), we omit the edge labels but keep in mind that they are all \(1\):

A possible clustering is then \(C = \left\{ 1, 2, 3, 4, 5, 6 \right\}\) on the highest level, \(A = \left\{ 1, 2, 3 \right\}\), \(B = \left\{ 4, 5, 6 \right\}\) on the next finer level, and \(a_1 = \left\{ 1 \right\}\), \(a_2 = \left\{ 2 \right\}\), \(a_3 = \left\{ 3 \right\}\), \(b_1 = \left\{ 4 \right\}\), \(b_2 = \left\{ 5 \right\}\), \(b_3 = \left\{ 6 \right\}\) on the finest level.

In Haskell, we can represent clusters with the following data type

```
data Cluster = Sub [Cluster]
| Leaf Vertex
```

The above clustering would then be

```
cluster :: Cluster
cluster = Sub [ Sub [Leaf 1, Leaf 2, Leaf 3]
              , Sub [Leaf 4, Leaf 5, Leaf 6]
              ]
```

And we can retrieve the children of a cluster, i.e., the nodes that are covered by it, with

```
children :: Cluster -> [Vertex]
children (Sub cs) = concatMap children cs
children (Leaf v) = [v]
```

But how can we measure how good a clustering is? This is exactly what the map equation does: given a graph and a clustering of its nodes, the map equation tells us how good this clustering is. Unfortunately, the problem of finding the best node clustering in a graph is NP-hard. However, we can use the map equation as an optimisation criterion to search the space of possible node clusterings over a graph.

So how does it work? The map equation looks at the graph from the point of view of a random walker. The random walker starts at a random node. Then, in each step, he selects one of the reachable nodes as the next node. His selection is based on the edge weights between the current node and the reachable nodes: a higher weight on an edge corresponds to a higher probability of stepping along that edge. During the random walk, we keep track of how often he visits each of the nodes. Dividing these numbers by the total number of steps gives us the visit rates of the nodes. To obtain the exact visit rates this way, we would have to let the random walker continue for an infinitely long time. However, in undirected graphs, we can calculate the visit rates directly: the visit rate \(\pi\left(v\right)\) for a node \(v\) is given by

\[\pi \left(v\right) = \frac{\sum_{u \in V} \delta \left( \left( u, v \right) \right)}{2 \cdot \Delta \left(G\right)}\]

(In case you’re wondering: in directed graphs, the visit rates are given by the leading eigenvector of the transition matrix which is defined by the graph’s edges. However, there are some complications in the case of nodes that have no incoming or outgoing edges as the random walker can get trapped or would never reach those nodes. But there are tricks to solve those problems and to approximate the visit rates.)

In Haskell, we can calculate the visit rates like this:

```
type VisitRate = Double

visitRate :: Graph -> Vertex -> VisitRate
visitRate g@(Graph nodes _ delta) v =
  sum [delta (u, v) | u <- nodes] / (2 * totalWeight g)
```
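The directed case mentioned in the parenthetical above can be made concrete with a small sketch. This is an illustration under simplifying assumptions, not part of this post’s implementation: we assume a strongly connected, aperiodic chain with no dangling nodes, so plain power iteration converges, and the names `trans` and `visitRates` are made up for this example.

```haskell
-- Row u of the transition matrix holds the probabilities of stepping
-- from node u to each node. Example: 0 -> 1, 1 -> 0 or 2, 2 -> 0.
trans :: [[Double]]
trans = [ [0.0, 1.0, 0.0]
        , [0.5, 0.0, 0.5]
        , [1.0, 0.0, 0.0] ]

-- One step of the walk: multiply the distribution from the left.
step :: [[Double]] -> [Double] -> [Double]
step p v = [ sum [ v !! u * (p !! u !! i) | u <- idx ] | i <- idx ]
  where idx = [0 .. length p - 1]

-- Iterate from the uniform distribution until the change is tiny.
visitRates :: [[Double]] -> [Double]
visitRates p = go (replicate n (1 / fromIntegral n))
  where
    n = length p
    go v =
      let v' = step p v
      in if maximum (zipWith (\a b -> abs (a - b)) v v') < 1e-12
           then v'
           else go v'

main :: IO ()
main = print (visitRates trans)  -- converges towards [0.4, 0.4, 0.2]
```

For this small example the stationary distribution can also be checked by hand: it satisfies \(\pi = \pi P\) with \(\pi = (0.4, 0.4, 0.2)\).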

For the next step it is useful to interpret the node visit rates as the fraction of flow that is generated by each node. Let \(f \left(u,v\right)\) be the flow from \(u\) to \(v\) with

\[f \left(u,v\right) = \pi \left(u\right) \cdot \frac{\delta \left( \left( u, v \right) \right)}{\sum_{w \in V} \delta \left( \left( u, w \right) \right)},\]

i.e., the amount of flow from \(u\) to \(v\) is proportional to the weight of the connection between \(u\) and \(v\), divided by the total weight of all edges connected to \(u\). In Haskell, we have

```
f :: Graph -> Vertex -> Vertex -> Double
f g@(Graph nodes _ delta) u v =
  visitRate g u * delta (u, v) / sum [delta (u, w) | w <- nodes]
```

We can generalise the idea of flow from *flow between nodes* to *flow between sets of nodes* (or clusters). Let \(M, N \subseteq V\) be clusters. Then, the flow from \(M\) to \(N\) is

\[F \left(M, N\right) = \sum_{u \in M} \sum_{v \in N} f \left(u,v\right).\]

In Haskell, we have

```
fs :: Graph -> Cluster -> Cluster -> Double
fs g m n = sum [f g u v | u <- children m, v <- children n]
```

Now, consider a random walker who moves between clusters. The probabilities for his steps are given by the flow between the clusters, i.e., \(F\). And this is all we need in order to design a Huffman code that describes his movements. However, we are not *actually* constructing a Huffman code. Instead, we calculate the theoretical lower bound for the average code word length *given* a clustering. So the goal is really to find a clustering that minimises this lower bound.

Let \(M\) be a cluster, i.e., a *set of nodes*, and let \(\operatorname{sub} \left( M \right)\) be the partitioning of \(M\) into its sub-clusters, i.e., a *set of clusters*. Let \(\operatorname{outside} \left(M \right)\) be the set of nodes outside of \(M\), i.e., those nodes from which the random walker can step into \(M\), with

\[\operatorname{outside} \left( M \right) = \left\{ \begin{array}{cl} V & \text{if }\left| M \right| = 1 \\ V \setminus M & \text{otherwise.} \\ \end{array} \right.\]

The same in Haskell is

```
sub :: Cluster -> [Cluster]
sub (Sub cs) = cs
sub (Leaf _) = []

-- (\\) is list difference from Data.List
outside :: Graph -> Cluster -> Cluster
outside (Graph nodes _ _) (Leaf _) = Sub [Leaf v | v <- nodes]
outside (Graph nodes _ _) m        = Sub [Leaf v | v <- nodes \\ children m]
```

Then, when at some cluster \(M\), a random walker can step into one of \(M\)’s sub-clusters or he can step out of \(M\) to go back to the next higher level. We can visualise the possible steps by drawing a tree that captures the cluster structure. If we were constructing Huffman codes, we would construct a separate code for each cluster. For example, in \(B\), there would be one code word for visiting the sub-clusters \(b_1\), \(b_2\), and \(b_3\), as well as a code word for exiting \(B\) and going one level up to \(C\). A single step in the graph may correspond to stepping across multiple cluster borders, such that the random walker needs to use multiple code words from different Huffman codes to encode his steps.

Let \(\operatorname{steps}\left(M\right)\) be the set of clusters to which a random walker can step from \(M\) with

\[\operatorname{steps}\left(M\right) = \left\{ \operatorname{outside}\left(M\right) \right\} \cup \operatorname{sub}\left(M\right).\]

I.e., the random walker can step out of \(M\) or into one of the sub-clusters. In Haskell, this is

```
steps :: Graph -> Cluster -> [Cluster]
steps g m = outside g m : sub m
```

Let the flows that correspond to these steps be given by

\[\operatorname{flows}\left(M\right) = \left\{ F \left( \operatorname{outside} \left( S \right), S \right) |\, S \in \operatorname{steps} \left( M \right) \right\}.\]

In Haskell, we have:

```
flows :: Graph -> Cluster -> [Double]
flows g m = [fs g (outside g s) s | s <- steps g m]
```

Then, the lower bound for the code length of a Huffman code for \(M\) consists of two parts:

- the entropy of the possible moves in \(M\), weighted by the total flow that corresponds to these moves, and
- the sum of the lower bounds for the code lengths of \(M\)’s sub-clusters.

\[L_{min} \left( M \right) = \left( \sum_{f \in \operatorname{flows}\left(M\right)} f \right) \cdot H\left(\operatorname{flows}\left(M\right)\right) + \sum_{N \in \operatorname{sub} \left(M\right)} L_{min} \left(N\right).\]

In Haskell, we can calculate it by

```
codeLength :: Graph -> Cluster -> Double
codeLength g m = sum (flows g m) * entropy (flows g m)
               + sum [codeLength g n | n <- sub m]
```

For our example cluster from above, we get \(L_{min} \approx 2.32\).

\(L_{min}\) is exactly the measure that the map equation provides and that we want to minimise.
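To tie everything together, here is the code from this post collected into one self-contained module, evaluated on the example graph and clustering (`exampleGraph` and `exampleCluster` are names chosen here for the six-node graph and the clustering from above). Running `main` should print a value close to \(2.32\).

```haskell
import Data.List ((\\))

type Vertex = Int
type Edge = (Vertex, Vertex)

data Graph = Graph { nodes :: [Vertex]
                   , edges :: [Edge]
                   , delta :: Edge -> Double
                   }

data Cluster = Sub [Cluster] | Leaf Vertex

totalWeight :: Graph -> Double
totalWeight (Graph _ es d) = sum [d e | e <- es]

entropy :: [Double] -> Double
entropy xs = - sum [(x / s) * logBase 2 (x / s) | x <- xs, x /= 0]
  where s = sum xs

visitRate :: Graph -> Vertex -> Double
visitRate g@(Graph ns _ d) v = sum [d (u, v) | u <- ns] / (2 * totalWeight g)

f :: Graph -> Vertex -> Vertex -> Double
f g@(Graph ns _ d) u v = visitRate g u * d (u, v) / sum [d (u, w) | w <- ns]

children :: Cluster -> [Vertex]
children (Sub cs) = concatMap children cs
children (Leaf v) = [v]

sub :: Cluster -> [Cluster]
sub (Sub cs) = cs
sub (Leaf _) = []

outside :: Graph -> Cluster -> Cluster
outside (Graph ns _ _) (Leaf _) = Sub [Leaf v | v <- ns]
outside (Graph ns _ _) m        = Sub [Leaf v | v <- ns \\ children m]

fs :: Graph -> Cluster -> Cluster -> Double
fs g m n = sum [f g u v | u <- children m, v <- children n]

steps :: Graph -> Cluster -> [Cluster]
steps g m = outside g m : sub m

flows :: Graph -> Cluster -> [Double]
flows g m = [fs g (outside g s) s | s <- steps g m]

codeLength :: Graph -> Cluster -> Double
codeLength g m = sum (flows g m) * entropy (flows g m)
               + sum [codeLength g n | n <- sub m]

-- The six-node example graph: two triangles joined by the edge (3, 4).
exampleGraph :: Graph
exampleGraph = Graph ns es d
  where
    ns = [1 .. 6]
    es = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]
    d (u, v) | (u, v) `elem` es || (v, u) `elem` es = 1.0
             | otherwise                            = 0.0

exampleCluster :: Cluster
exampleCluster = Sub [ Sub [Leaf 1, Leaf 2, Leaf 3]
                     , Sub [Leaf 4, Leaf 5, Leaf 6]
                     ]

main :: IO ()
main = print (codeLength exampleGraph exampleCluster)  -- close to 2.32
```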

#### A Recursive Graph

Now that we have everything in place to calculate the lower bound for the average code length of Huffman codes that describe movements between clusters in graphs, let’s take a look at another graph. Below is a plot of a triangular Sierpinski graph, i.e., a graph whose nodes form small triangles; these triangles, in turn, form “larger” triangles, and so on. Here, we consider a triangular Sierpinski graph with \(27\) nodes.

Using the Haskell data structures from before, we can represent the graph like this:

```
sierpinski :: Graph
sierpinski =
  let nodes = [1 .. 27]
      edges = [ ( 1, 2), ( 2, 3), ( 3, 1) -- triangle 1
              , ( 4, 5), ( 5, 6), ( 6, 4) -- triangle 2
              , ( 7, 8), ( 8, 9), ( 9, 7) -- triangle 3
              , ( 2, 4), ( 6, 8), ( 7, 3) -- triangles 1+2+3
              , (10,11), (11,12), (12,10) -- triangle 4
              , (13,14), (14,15), (15,13) -- triangle 5
              , (16,17), (17,18), (18,16) -- triangle 6
              , (11,13), (15,17), (16,12) -- triangles 4+5+6
              , (19,20), (20,21), (21,19) -- triangle 7
              , (22,23), (23,24), (24,22) -- triangle 8
              , (25,26), (26,27), (27,25) -- triangle 9
              , (20,22), (24,26), (21,25) -- triangles 7+8+9
              , ( 5,10), ( 9,14), (18,23)
              ]
      delta = \(u, v) -> if (u, v) `elem` edges || (v, u) `elem` edges
                           then 1.0
                           else 0.0
  in Graph nodes edges delta
```

Now we want to find a clustering of the nodes such that the map equation is minimised. A trivial clustering is to assign all nodes to a single cluster, which gives us a lower bound on the code length of \(L_{min} \approx 4.75\).

In Haskell, we can represent this clustering like this:

```
sierpinskiCluster1 :: Cluster
= Sub [ Leaf 1, Leaf 2, Leaf 3
sierpinskiCluster1 Leaf 4, Leaf 5, Leaf 6
, Leaf 7, Leaf 8, Leaf 9
, Leaf 10, Leaf 11, Leaf 12
, Leaf 13, Leaf 14, Leaf 15
, Leaf 16, Leaf 17, Leaf 18
, Leaf 19, Leaf 20, Leaf 21
, Leaf 22, Leaf 23, Leaf 24
, Leaf 25, Leaf 26, Leaf 27
, ]
```

The recursive structure of the graph suggests that we group those nodes which form a triangle into a cluster. This way, we get \(9\) clusters where each cluster contains \(3\) singleton clusters. The code length for this clustering is \(L_{min} \approx 3.57\).

In Haskell, this clustering is given by

```
sierpinskiCluster2 :: Cluster
= Sub [ Sub [Leaf 1, Leaf 2, Leaf 3]
sierpinskiCluster2 Sub [Leaf 4, Leaf 5, Leaf 6]
, Sub [Leaf 7, Leaf 8, Leaf 9]
, Sub [Leaf 10, Leaf 11, Leaf 12]
, Sub [Leaf 13, Leaf 14, Leaf 15]
, Sub [Leaf 16, Leaf 17, Leaf 18]
, Sub [Leaf 19, Leaf 20, Leaf 21]
, Sub [Leaf 22, Leaf 23, Leaf 24]
, Sub [Leaf 25, Leaf 26, Leaf 27]
, ]
```

Alternatively, we could group the larger triangles in the graph together. This way, we get \(3\) larger clusters with \(9\) singleton clusters each. The code length in this case is \(L_{min} \approx 3.68\).

In Haskell, we have this way of clustering the nodes as

```
sierpinskiCluster3 :: Cluster
= Sub [ Sub [ Leaf 1, Leaf 2, Leaf 3
sierpinskiCluster3 Leaf 4, Leaf 5, Leaf 6
, Leaf 7, Leaf 8, Leaf 9
,
]Sub [ Leaf 10, Leaf 11, Leaf 12
, Leaf 13, Leaf 14, Leaf 15
, Leaf 16, Leaf 17, Leaf 18
,
]Sub [ Leaf 19, Leaf 20, Leaf 21
, Leaf 22, Leaf 23, Leaf 24
, Leaf 25, Leaf 26, Leaf 27
,
] ]
```

We can achieve the shortest code length when we strictly follow the recursive structure of the graph. We put each node into a singleton cluster. Then, we group the nodes which form a triangle into a cluster. Then, we group those clusters which form a triangle into a cluster. This way, we get \(3\) clusters on the highest level. Each of those contains \(3\) clusters with \(3\) singleton clusters each. The code length is then \(L_{min} \approx 3.48\).

In Haskell, this clustering is

```
sierpinskiCluster4 :: Cluster
sierpinskiCluster4 = Sub [ Sub [ Sub [Leaf  1, Leaf  2, Leaf  3]
                               , Sub [Leaf  4, Leaf  5, Leaf  6]
                               , Sub [Leaf  7, Leaf  8, Leaf  9]
                               ]
                         , Sub [ Sub [Leaf 10, Leaf 11, Leaf 12]
                               , Sub [Leaf 13, Leaf 14, Leaf 15]
                               , Sub [Leaf 16, Leaf 17, Leaf 18]
                               ]
                         , Sub [ Sub [Leaf 19, Leaf 20, Leaf 21]
                               , Sub [Leaf 22, Leaf 23, Leaf 24]
                               , Sub [Leaf 25, Leaf 26, Leaf 27]
                               ]
                         ]
```

#### Source Code

If you are interested, you can download the full source code from here. The linked file requires stack to set up a Haskell environment. It also contains some additional data structures and functionality to read graphs and clusters from JSON files. Note that **this implementation is not optimised for performance**; the focus here is on simplicity.

After making the file executable, you can run it according to the usage

```
Usage: MapEquation.hs --graph GRAPH --cluster CLUSTER

This program provides an implementation of the map equation. It can be used
to calculate the lower bound on the average code length that describes a
random walk through node clusters in a graph.

Available options:
  -h,--help                Show this help text
  --graph GRAPH            Path to a graph input file
  --cluster CLUSTER        Path to a cluster input file
```

It expects a graph file and a cluster file to calculate the code length.

We represent graphs as edge lists in JSON files. Each edge in the list is made up of a start node, an end node, and an edge weight. For example, the example graph we used in the map equation section would be represented like this:

`[[1,2,1.0], [2,3,1.0], [3,1,1.0], [3,4,1.0], [4,5,1.0], [5,6,1.0], [6,4,1.0]]`

We represent the hierarchical structure of clusters directly as nested lists of nodes. The example cluster we used in the map equation section is:

`[[1,2,3], [4,5,6]]`
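As a sketch of how such nested lists map onto the `Cluster` type from above (the `Nested` intermediate type and `toCluster` are illustrative assumptions; the actual JSON parsing is left to the linked source code):

```haskell
type Vertex = Int

data Cluster = Sub [Cluster]
             | Leaf Vertex
             deriving Show

-- Hypothetical intermediate type for an already-parsed nested list.
data Nested = Node Vertex | List [Nested]

-- A leaf of the nested list becomes a singleton cluster,
-- an inner list becomes a sub-cluster.
toCluster :: Nested -> Cluster
toCluster (Node v)  = Leaf v
toCluster (List ns) = Sub (map toCluster ns)

main :: IO ()
main = print (toCluster (List [ List [Node 1, Node 2, Node 3]
                              , List [Node 4, Node 5, Node 6] ]))
```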

#### Conclusion

We looked at how the map equation captures the movements of a random walker through a graph. Using this walk, we derived visit rates for the graph’s nodes and calculated the flow between pairs of nodes. Then, we clustered the nodes and considered the movement of the random walker between those clusters. With the visit rates and flows, we were able to determine the lower bound on the average code length of a Huffman code that describes this movement. As such, the map equation provides a way to measure how good a node clustering for a given graph is. It can be used as an optimisation criterion to find solutions for the NP-hard node clustering problem.