vf2++ on Scientific Python blog (https://blog.scientific-python.org/tags/vf2++/)

# The VF2++ algorithm

Konstantinos Petridis · 2022-08-10 · https://blog.scientific-python.org/networkx/vf2pp/graph-iso-vf2pp/

Implementing the VF2++ algorithm for Graph Isomorphism.

The final post discussing the VF2++ helpers can be found here. Now that we've figured out how to solve all the sub-problems that VF2++ consists of, we are ready to combine the implemented functionality into the final solver for the Graph Isomorphism problem.

## Introduction

Let's quickly review the individual components used in the VF2++ algorithm:

• Node ordering, which finds the optimal order in which to access the nodes, so that those more likely to match are examined first. This reduces the chance of infeasible searches being explored first.
• Candidate selection, such that, given a node $u$ from $G_1$, we obtain its candidate nodes $v$ from $G_2$.
• Feasibility rules, introducing easy-to-check cutting and consistency conditions; if a candidate pair of nodes $u$ from $G_1$ and $v$ from $G_2$ satisfies them, the mapping is extended.
• $T_i$ updating, which updates the $T_i$ and $\tilde{T}_i$, $i=1,2$ parameters when a new pair is added to the mapping, and restores them when a pair is popped from it.

We are going to use all these functionalities to form our Isomorphism solver.

## VF2++

First of all, let’s describe the algorithm in simple terms, before presenting the pseudocode. The algorithm will look something like this:

1. Check that all preconditions are satisfied before calling the actual solver. For example, there's no point examining two graphs with different numbers of nodes for isomorphism.
2. Initialize all the necessary parameters ($T_i$, $\tilde{T}_i$, $i=1,2$) and cache information that is going to be used later.
3. Take the next unexamined node $u$ from the ordering.
4. Find its candidates and check if there's a candidate $v$ such that the pair $u-v$ satisfies the feasibility rules.
5. If there is one, extend the mapping and go to 3.
6. If not, pop the last pair $\hat{u}-\hat{v}$ from the mapping and try a different candidate $\hat{v}$ from the remaining candidates of $\hat{u}$.
7. The two graphs are isomorphic if the number of mapped nodes equals the number of nodes of the two graphs.
8. The two graphs are not isomorphic if there are no remaining candidates for the first node of the ordering (the root).

The official code for VF2++ is presented below; note that the solver is written as a generator, yielding one complete mapping at a time.

```python
# Check if there's a graph with no nodes in it
if G1.number_of_nodes() == 0 or G2.number_of_nodes() == 0:
    return False

# Check that both graphs have the same number of nodes and degree sequence
if not nx.faster_could_be_isomorphic(G1, G2):
    return False

# Initialize parameters (Ti/Ti_tilde, i=1,2) and cache necessary information about degree and labels
graph_params, state_params = _initialize_parameters(G1, G2, node_labels, default_label)

# Check if G1 and G2 have the same labels, and that number of nodes per label is equal between the two graphs
if not _precheck_label_properties(graph_params):
    return False

# Calculate the optimal node ordering
node_order = _matching_order(graph_params)

# Initialize the stack with the first node of the ordering and its candidates
stack = []
candidates = iter(_find_candidates(node_order[0], graph_params, state_params))
stack.append((node_order[0], candidates))

mapping = state_params.mapping
reverse_mapping = state_params.reverse_mapping

# Index of the node from the order, currently being examined
matching_node = 1

while stack:
    current_node, candidate_nodes = stack[-1]

    try:
        candidate = next(candidate_nodes)
    except StopIteration:
        # All candidates for the current node are exhausted: backtrack
        stack.pop()
        matching_node -= 1
        if stack:
            # Pop the previously added u-v pair, and look for a different candidate _v for u
            popped_node1, _ = stack[-1]
            popped_node2 = mapping[popped_node1]
            mapping.pop(popped_node1)
            reverse_mapping.pop(popped_node2)
            _restore_Tinout(popped_node1, popped_node2, graph_params, state_params)
        continue

    if _feasibility(current_node, candidate, graph_params, state_params):
        # Terminate if the mapping is extended to its full size
        if len(mapping) == G2.number_of_nodes() - 1:
            cp_mapping = mapping.copy()
            cp_mapping[current_node] = candidate
            yield cp_mapping
            continue

        # Feasibility rules pass, so extend the mapping and update the parameters
        mapping[current_node] = candidate
        reverse_mapping[candidate] = current_node
        _update_Tinout(current_node, candidate, graph_params, state_params)
        # Append the next node and its candidates to the stack
        candidates = iter(
            _find_candidates(node_order[matching_node], graph_params, state_params)
        )
        stack.append((node_order[matching_node], candidates))
        matching_node += 1
```


## Performance

This section is dedicated to the performance comparison between VF2 and VF2++. The comparison was performed on random graphs without labels, with node counts ranging from 100 to 2000. The results are depicted in the two following diagrams. We notice that the maximum speedup achieved is 14x, and it continues to increase as the number of nodes increases. It is also notable that increasing the number of nodes doesn't significantly affect the performance of VF2++, in contrast to its drastic impact on the performance of VF2. Our results are almost identical to those presented in the original VF2++ paper, verifying the theoretical analysis and premises of the literature.

## Optimizations

The achieved boost is due to some key improvements and optimizations, specifically:

• Optimal node ordering, which avoids exploring unfruitful branches that would result in infeasible states. We make sure that the nodes with the highest probability of matching are accessed first.
• A non-recursive implementation, avoiding Python's maximum recursion limit while also reducing function-call overhead.
• Caching of both node degrees and nodes per degree at the beginning, so that we don't have to recompute these in every degree check. For example, instead of doing

```python
res = []
for node in G2.nodes():
    if G1.degree[u] == G2.degree[node]:
        res.append(node)
# do stuff with res ...
```

to get the nodes with the same degree as u (which happens many times in the implementation), we just do:

```python
res = G2_nodes_of_degree[G1.degree[u]]
# do stuff with res ...
```

where `G2_nodes_of_degree` maps each degree to the set of G2 nodes with that degree. The same is done with node labels.
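The cache itself can be built in a single pass over the graph before the search starts. A minimal sketch (the name `G2_nodes_of_degree` mirrors the snippet above, but the construction shown here is illustrative, not the library's internal code):

```python
from collections import defaultdict

import networkx as nx

G2 = nx.gnp_random_graph(20, 0.3, seed=42)

# One pass over the graph: map each degree to the set of nodes having it
G2_nodes_of_degree = defaultdict(set)
for node, degree in G2.degree():
    G2_nodes_of_degree[degree].add(node)

# Every later "all nodes with degree d" query is now a single dict lookup
nodes_with_deg_3 = G2_nodes_of_degree[3]
```

The same one-pass construction works for a `G2_nodes_of_label` cache, keyed on node labels instead of degrees.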

• Extra shrinking of the candidate set for each node, by adding more checks to the candidate selection method and removing some from the feasibility checks. In simple terms, instead of checking a lot of conditions on a large set of candidates, we check fewer conditions on a more targeted and significantly smaller set of candidates. For example, in this code:

```python
candidates = set(G2.nodes())
for candidate in candidates:
    if feasibility(u, candidate):
        do_stuff()
```

we take a huge set of candidates, which results in poor performance: `feasibility` is called a maximal number of times, so the feasibility checks are performed over a very large set. Now compare that to the following alternative:

```python
candidates = [
    n
    for n in G2_nodes_of_degree[G1.degree[u]].intersection(
        G2_nodes_of_label[G1_labels[u]]
    )
]
for candidate in candidates:
    if feasibility(u, candidate):
        do_stuff()
```

Immediately we have drastically reduced the number of checks performed and calls to the function, as we now apply them only to nodes with the same degree and label as $u$. This is a simplification for demonstration purposes; in the actual implementation there are more checks and extra shrinking of the candidate set.

## Demo

Let's demonstrate our VF2++ solver on a real graph. We are going to use the graph from the Graph Isomorphism Wikipedia page. Let's start by constructing the graphs from the image above. We'll call the graph on the left G and the graph on the right H:

```python
import networkx as nx

G = nx.Graph(
    [
        ("a", "g"),
        ("a", "h"),
        ("a", "i"),
        ("g", "b"),
        ("g", "c"),
        ("b", "h"),
        ("b", "j"),
        ("h", "d"),
        ("c", "i"),
        ("c", "j"),
        ("i", "d"),
        ("d", "j"),
    ]
)

H = nx.Graph(
    [
        (1, 2),
        (1, 5),
        (1, 4),
        (2, 6),
        (2, 3),
        (3, 7),
        (3, 4),
        (4, 8),
        (5, 6),
        (5, 8),
        (6, 7),
        (7, 8),
    ]
)
```


### use the VF2++ without taking labels into consideration

```python
res = nx.vf2pp_is_isomorphic(G, H, node_label=None)
# res: True

res = nx.vf2pp_isomorphism(G, H, node_label=None)
# res: {1: "a", 2: "h", 3: "d", 4: "i", 5: "g", 6: "b", 7: "j", 8: "c"}

res = list(nx.vf2pp_all_isomorphisms(G, H, node_label=None))
# res: all isomorphic mappings (there might be more than one). This function is a generator.
```
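Whatever mapping is returned, it is easy to verify that it is a genuine isomorphism by checking that it preserves adjacency. The self-contained check below rebuilds the two demo graphs; the direction of the returned dictionary is detected rather than assumed:

```python
import networkx as nx

G = nx.Graph(
    [("a", "g"), ("a", "h"), ("a", "i"), ("g", "b"), ("g", "c"), ("b", "h"),
     ("b", "j"), ("h", "d"), ("c", "i"), ("c", "j"), ("i", "d"), ("d", "j")]
)
H = nx.Graph(
    [(1, 2), (1, 5), (1, 4), (2, 6), (2, 3), (3, 7), (3, 4), (4, 8),
     (5, 6), (5, 8), (6, 7), (7, 8)]
)

m = nx.vf2pp_isomorphism(G, H, node_label=None)
assert m is not None

# Work out which graph the keys belong to, then check every edge is preserved
src, dst = (G, H) if set(m) == set(G) else (H, G)
assert all(dst.has_edge(m[u], m[v]) for u, v in src.edges())
```

Any mapping the solver yields should pass this check; if it does not, the mapping is not an isomorphism.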


### use the VF2++ taking labels into consideration

```python
# Assign some label to each node
G_node_attributes = {
    "a": "blue",
    "g": "green",
    "b": "pink",
    "h": "red",
    "c": "yellow",
    "i": "orange",
    "d": "cyan",
    "j": "purple",
}

nx.set_node_attributes(G, G_node_attributes, name="color")

H_node_attributes = {
    1: "blue",
    2: "red",
    3: "cyan",
    4: "orange",
    5: "green",
    6: "pink",
    7: "purple",
    8: "yellow",
}

nx.set_node_attributes(H, H_node_attributes, name="color")

res = nx.vf2pp_is_isomorphic(G, H, node_label="color")
# res: True

res = nx.vf2pp_isomorphism(G, H, node_label="color")
# res: {1: "a", 2: "h", 3: "d", 4: "i", 5: "g", 6: "b", 7: "j", 8: "c"}

res = list(nx.vf2pp_all_isomorphisms(G, H, node_label="color"))
# res: [{1: "a", 2: "h", 3: "d", 4: "i", 5: "g", 6: "b", 7: "j", 8: "c"}]
# A single mapping, since the unique labels leave only one possibility.
```


Notice how in the first case our solver may return a different mapping every time, since the absence of labels allows a node to map to more than one other node. For example, node 1 can map to both a and h, since the graph is symmetric. In the second case though, the existence of a single, unique label per node means there is only one match for each node, so the returned mapping is deterministic. This is easily observed from the output of list(nx.vf2pp_all_isomorphisms), which in the first case returns all possible mappings, while in the latter it returns a single, unique isomorphic mapping.

---

# ISO Feasibility & Candidates

Konstantinos Petridis · 2022-07-11 · https://blog.scientific-python.org/networkx/vf2pp/iso-feasibility-candidates/

Information about my progress on two important features of the algorithm.

The previous post can be found here; be sure to check it out so you can follow the process step by step. Since then, two more very significant features of the algorithm have been implemented and tested: node-pair candidate selection and feasibility checks.

## Introduction

As previously described, in the ISO problem we are basically trying to create a mapping such that every node from the first graph is matched to a node from the second graph. This search for "feasible pairs" can be visualized as a tree, where each node is the candidate pair that we should examine. This becomes much clearer if we take a look at the figure below. To check whether the graphs $G_1$ and $G_2$ are isomorphic, we examine every candidate pair of nodes; if it is feasible, we extend the mapping and go deeper into the tree of pairs. If it's not feasible, we climb back up and follow a different branch, until every node of $G_1$ is mapped to a node of $G_2$. In our example, we start by examining node 0 from $G_1$ together with node 0 of $G_2$. After some checks (details below), we decide that nodes 0 and 0 match, so we go deeper to map the remaining nodes. The next pair is 1-3, which fails the feasibility check, so we have to examine a different branch as shown. The new branch is 1-2, which is feasible, so we continue with the same logic until all the nodes are mapped.

## Candidate Pair Selection

Although in our example we use a random candidate pair of nodes, in the actual implementation we are able to target specific pairs that are more likely to match, and hence boost the performance of the algorithm. The idea is that, in every step of the algorithm, given a node $u\in V_1$, we compute its candidates $v\in V_2$, where $V_1$ and $V_2$ are the node sets of $G_1$ and $G_2$ respectively. This is a puzzle that does not require a lot of specialized knowledge about graphs or the algorithm itself. Follow along, and you will see it for yourself. First, let $M$ be the mapping so far, which includes all the "covered" nodes up to this point. There are three different types of node $u$ that we might encounter.

### Case 1

Node $u$ has no neighbors (its degree is zero). It would be redundant to consider, as candidates for $u$, nodes from $G_2$ that have any neighbors. So we eliminate most of the possible candidates and keep only those with the same degree as $u$ (in this case, zero). Pretty easy, right?

### Case 2

Node $u$ has neighbors, but none of them belongs to the mapping. This situation is illustrated in the following figure. The grey lines indicate that the nodes of $G_1$ (left 1, 2) are mapped to the nodes of $G_2$ (right 1, 2); they constitute the mapping. Again, given $u$, we observe that candidates $v$ for $u$ should likewise have no neighbors in the mapping, and should have the same degree as $u$ (as in the figure). Notice that if we add a neighbor to $v$, or place one of its neighbors inside the mapping, there is no point examining the pair $u-v$ for matching.

### Case 3

Node $u$ has neighbors, and some of them belong to the mapping. This scenario is depicted in the figure below. In this case, to obtain the candidates for $u$, we must look into the neighborhoods of those nodes of $G_2$ that map back to the covered neighbors of $u$. In our example, $u$ has one covered neighbor (1), and node 1 of $G_1$ maps to node 1 of $G_2$, which has $v$ as a neighbor. Also, for $v$ to be considered a candidate, it must of course have the same degree as $u$. Notice that any node outside the neighborhood of 1 (in $G_2$) cannot be matched to $u$ without breaking the isomorphism.
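The three cases can be sketched in code. This is a simplified illustration, not the actual NetworkX helper: only degrees are checked, label checks are omitted, and the function name and signature are made up:

```python
import networkx as nx

def candidates_for(u, G1, G2, mapping, reverse_mapping):
    # Sketch of the three candidate-selection cases described above
    covered_nbrs = [n for n in G1[u] if n in mapping]
    if G1.degree[u] == 0:
        # Case 1: u is isolated, so only uncovered isolated nodes of G2 qualify
        return {v for v in G2 if v not in reverse_mapping and G2.degree[v] == 0}
    if not covered_nbrs:
        # Case 2: u has neighbors but none is covered; candidates are uncovered
        # nodes of equal degree whose neighbors are also all uncovered
        return {
            v
            for v in G2
            if v not in reverse_mapping
            and G2.degree[v] == G1.degree[u]
            and not any(n in reverse_mapping for n in G2[v])
        }
    # Case 3: intersect the neighborhoods of the images of u's covered neighbors
    common = set.intersection(*(set(G2[mapping[n]]) for n in covered_nbrs))
    return {
        v for v in common
        if v not in reverse_mapping and G2.degree[v] == G1.degree[u]
    }

G1, G2 = nx.path_graph(3), nx.path_graph(3)
# With 0 already mapped to 0, the only viable partner for node 1 is node 1
print(candidates_for(1, G1, G2, {0: 0}, {0: 0}))  # {1}
```

With an empty mapping, node 0 (degree 1) falls into Case 2 and gets both degree-1 nodes of $G_2$ as candidates.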

## ISO Feasibility Rules

Let's assume that, given a node $u$, we obtained its candidate $v$ following the process described in the previous section. At this point, the feasibility rules determine whether the mapping should be extended by the pair $u-v$ or whether we should try another candidate. The feasibility of a pair $u-v$ is examined via consistency and cutting checks.

### Consistency rules

First, I am going to present the mathematical expression of the consistency check. It may seem complicated at first, but it becomes simple with a visual illustration. Using the notation $nbh_i(u)$ for the neighborhood of $u$ in graph $G_i$, the consistency rule is:

$$\forall\tilde{v}\in nbh_2(v)\cap M: (u, M^{-1}(\tilde{v}))\in E_1 \quad\wedge\quad \forall\tilde{u}\in nbh_1(u)\cap M: (v, M(\tilde{u}))\in E_2$$

We are going to use the following simple figure to demystify this equation. The mapping is depicted as grey lines between the nodes that are already mapped, meaning that 1 maps to A and 2 to B. What the equation says is that, for two nodes $u$ and $v$ to pass the consistency check, the neighbors of $u$ that belong to the mapping should map to neighbors of $v$ (and vice versa). This could be checked by code as simple as:

```python
for neighbor in G1[u]:
    if neighbor in mapping:
        if mapping[neighbor] not in G2[v]:
            return False
        elif G1.number_of_edges(u, neighbor) != G2.number_of_edges(
            v, mapping[neighbor]
        ):
            return False
```


where the final lines also check the number of edges between node $u$ and its neighbor $\tilde{u}$, which should be the same as the number of edges between $v$ and the node that $\tilde{u}$ maps to. At a very high level, we could describe this check as a 1-look-ahead check.

### Cutting rules

We have previously discussed what $T_i$ and $\tilde{T_i}$ represent (see the previous post). These sets are used in the cutting checks as follows: the number of neighbors of $u$ that belong to $T_1$ must equal the number of neighbors of $v$ that belong to $T_2$. Take a moment to study the figure below. Once again, node 1 maps to A and 2 to B. The red nodes (4, 5, 6) form $T_1$ and the yellow ones (C, D, E) form $T_2$. Notice that for $u-v$ to be feasible, $u$ must have the same number of neighbors inside $T_1$ as $v$ has inside $T_2$; otherwise the pair is rejected, which can be verified visually. In this example, both nodes have 2 of their neighbors (4, 6 and C, E) in $T_1$ and $T_2$ respectively. Careful! If we delete the $V-E$ edge and connect $V$ to $D$, the cutting condition is still satisfied; the pair will instead be rejected by the consistency checks of the previous section. A simple code to apply the cutting check would be:

```python
if len(T1.intersection(G1[u])) != len(T2.intersection(G2[v])) or len(
    T1out.intersection(G1[u])
) != len(T2out.intersection(G2[v])):
    return False
```

where `T1out` and `T2out` correspond to $\tilde{T_1}$ and $\tilde{T_2}$ respectively. Yes, we have to check those as well; we skipped them in the explanation above for simplicity.
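As a toy illustration of the cutting check, here is the same comparison on plain Python sets. All the sets below are made up for demonstration and merely chosen to mirror the counts described above (two neighbors in each $T_i$, none in the tilde sets):

```python
# Hypothetical state: T1/T2, their complements, and the two neighborhoods
T1, T1out = {4, 5, 6}, {7}
T2, T2out = {"C", "D", "E"}, {"F"}
nbrs_u = {4, 6, 1}        # neighborhood of u in G1 (made up)
nbrs_v = {"C", "E", "A"}  # neighborhood of v in G2 (made up)

cut_ok = (
    len(T1 & nbrs_u) == len(T2 & nbrs_v)
    and len(T1out & nbrs_u) == len(T2out & nbrs_v)
)
# cut_ok: u and v each have two neighbors in T1/T2 and none in the tilde sets
```

Changing either neighborhood so the counts disagree makes `cut_ok` false, which is exactly when the branch is pruned.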

## Conclusion

At this point, we have successfully implemented and tested all the major components of the VF2++ algorithm:

• Node Ordering
• $T_i/\tilde{T_i}$ Updating
• Feasibility Rules
• Candidate Selection

This means that, in the next post, hopefully, we are going to discuss our first, full and functional implementation of VF2++.

---

# Updates on VF2++

Konstantinos Petridis · 2022-07-06 · https://blog.scientific-python.org/networkx/vf2pp/node-ordering-ti-updating/

A summary of the progress on VF2++.

This post includes all the major updates since the last post about VF2++. Each section is dedicated to a different sub-problem and presents the progress made on it so far. General progress, milestones, and related issues can be found here.

## Node ordering

Node ordering is one of the major modifications that VF2++ proposes. The nodes are examined in an order that makes the matching faster by first examining nodes that are more likely to match. This part of the algorithm has been implemented; however, there is an issue: the existence of detached nodes (not connected to the rest of the graph) causes the code to crash. Fixing this bug will be a top priority in the next steps. The ordering implementation is described by the following pseudocode.

Matching Order

1. Set $M = \varnothing$.
2. Set $\bar{V_1}$: the nodes not yet placed in the order.
3. While $\bar{V_1}$ is not empty:
   • $rareNodes=[$nodes from $\bar{V_1}$ with the rarest labels$]$
   • $maxNode=argmax_{degree}(rareNodes)$
   • $T=$ BFS tree with $maxNode$ as its root
   • For every level $d$ of $T$:
     • $V_d=[$nodes of the $d^{th}$ level$]$
     • $\bar{V_1} = \bar{V_1} \setminus V_d$
     • $ProcessLevel(V_d)$
4. Output $M$: the matching order of the nodes.

Process Level

1. While $V_d$ is not empty:
   • $S=[$nodes from $V_d$ with the most neighbors in $M]$
   • $maxNodes=argmax_{degree}(S)$
   • $m=$ the node from $maxNodes$ with the rarest label
   • $V_d = V_d \setminus \{m\}$
   • Append $m$ to $M$

## $T_i$ and $\tilde{T_i}$

According to the VF2++ paper notation:

$$T_1=\{u\in V_1 \setminus m: \exists \tilde{u} \in m: (u,\tilde{u})\in E_1\}$$

where $V_1$ and $E_1$ contain all the nodes and edges of the first graph respectively, and $m$ is a dictionary mapping every node of the first graph to a node of the second graph. Interpreting the above equation, we conclude that $T_1$ contains the uncovered neighbors of covered nodes. In simple terms, it includes all the nodes that do not yet belong to the mapping $m$ but are neighbors of nodes that are in the mapping. In addition,

$$\tilde{T_1}=V_1 \setminus m \setminus T_1$$

The following figure provides some visual intuition for what exactly $T_i$ is. The blue nodes 1, 2, 3 belong to graph $G_1$ and the green nodes A, B, C belong to graph $G_2$. The grey lines connecting them indicate that, in the current state, node 1 is mapped to node A, node 2 to node B, and so on. The yellow edges connect the covered (mapped) nodes to their neighbors. Here, $T_1$ contains the red nodes (4, 5, 6), which are neighbors of the covered nodes 1, 2, 3, and $T_2$ contains the grey ones (D, E, F). None of the nodes depicted would be included in $\tilde{T_1}$ or $\tilde{T_2}$; those sets would contain all the remaining nodes of the two graphs.
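To make the definitions concrete, here is a tiny example computing $T_1$ and $\tilde{T_1}$ directly from the set definitions. The graph and the (partial) mapping are made up for illustration:

```python
import networkx as nx

G1 = nx.path_graph(4)  # edges: 0-1, 1-2, 2-3
mapping = {0: "A"}     # hypothetical: node 0 is already mapped

# T1: uncovered neighbors of covered nodes
T1 = {nbr for node in mapping for nbr in G1[node] if nbr not in mapping}
# T1_tilde: everything that is neither covered nor in T1
T1_tilde = set(G1) - set(mapping) - T1
print(T1, T1_tilde)  # {1} {2, 3}
```

Node 1 is the only uncovered neighbor of the covered node 0, so it lands in $T_1$; nodes 2 and 3 are untouched and end up in $\tilde{T_1}$.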

Regarding the computation of these sets, it is not practical to brute-force it, i.e., to iterate over all nodes at every step of the algorithm to find the desired nodes and recompute $T_i$ and $\tilde{T_i}$. Instead, we use the following observations to implement an incremental computation of $T_i$ and $\tilde{T_i}$, making VF2++ more efficient.

• $T_i$ is empty at the beginning, since there are no mapped nodes ($m=\varnothing$) and therefore no neighbors of mapped nodes.
• $\tilde{T_i}$ initially contains all the nodes of graph $G_i$, $i=1,2$, which follows directly from the definition if we consider both $m$ and $T_i$ to be empty.
• Every step of the algorithm either adds one pair of nodes to the mapping or pops one from it.

We can conclude that, at every step, $T_i$ and $\tilde{T_i}$ can be updated incrementally. This avoids a ton of redundant operations and yields a significant performance improvement. The graph above shows the difference in performance between exhaustively recomputing and incrementally updating $T_i$ and $\tilde{T_i}$. The graph used to obtain these measurements was a $G(n, p)$ random graph with edge probability $0.7$. The execution time of the brute-force method clearly grows much more rapidly with the number of nodes/edges than that of the incremental-update method, as expected. The brute-force method looks like this:

```python
def compute_Ti(G1, G2, mapping, reverse_mapping):
    T1 = {nbr for node in mapping for nbr in G1[node] if nbr not in mapping}
    T2 = {
        nbr
        for node in reverse_mapping
        for nbr in G2[node]
        if nbr not in reverse_mapping
    }

    T1_out = {n1 for n1 in G1.nodes() if n1 not in mapping and n1 not in T1}
    T2_out = {n2 for n2 in G2.nodes() if n2 not in reverse_mapping and n2 not in T2}
    return T1, T2, T1_out, T2_out
```


If we assume that G1 and G2 have the same number of nodes (N), the average number of nodes in the mapping is $N_m$, and the average node degree of the graphs is $D$, then the time complexity of this function is:

$$O(2N_mD + 2N) = O(N_mD + N)$$

in which we have excluded the lookup times in $T_i$, $mapping$ and $reverse\_mapping$ as they are all $O(1)$. Our incremental method works like this:

```python
def update_Tinout(
    G1, G2, T1, T2, T1_out, T2_out, new_node1, new_node2, mapping, reverse_mapping
):
    # This function should be called right after feasibility is established
    # and new_node1 is mapped to new_node2.
    uncovered_neighbors_G1 = {nbr for nbr in G1[new_node1] if nbr not in mapping}
    uncovered_neighbors_G2 = {
        nbr for nbr in G2[new_node2] if nbr not in reverse_mapping
    }

    # Add the uncovered neighbors of new_node1 and new_node2 to T1 and T2,
    # and remove the newly covered nodes themselves, which no longer qualify
    T1 = T1.union(uncovered_neighbors_G1) - {new_node1}
    T2 = T2.union(uncovered_neighbors_G2) - {new_node2}

    # Those neighbors (and the newly covered nodes) also leave T1_out/T2_out
    T1_out = T1_out - uncovered_neighbors_G1 - {new_node1}
    T2_out = T2_out - uncovered_neighbors_G2 - {new_node2}

    return T1, T2, T1_out, T2_out
```


which based on the previous notation, is:

$$O(2D + 2(D + M_{T_1}) + 2D) = O(D + M_{T_1})$$

where $M_{T_1}$ is the expected (average) number of elements in $T_1$.

Certainly, the complexity is much better in this case, as $D$ and $M_{T_1}$ are significantly smaller than $N_mD$ and $N$.
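A quick sanity check for the incremental scheme is to drive it alongside the brute-force recomputation on a small graph and assert that the two always agree. The snippet below inlines standalone versions of both functions (with the newly covered nodes also removed from the sets, as the definitions require); the graph and mapping sequence are arbitrary:

```python
import networkx as nx

def compute_Ti(G1, G2, mapping, reverse_mapping):
    # Brute force: rebuild all four sets from scratch
    T1 = {nbr for node in mapping for nbr in G1[node] if nbr not in mapping}
    T2 = {nbr for node in reverse_mapping for nbr in G2[node] if nbr not in reverse_mapping}
    T1_out = {n for n in G1 if n not in mapping and n not in T1}
    T2_out = {n for n in G2 if n not in reverse_mapping and n not in T2}
    return T1, T2, T1_out, T2_out

def update_Tinout(G1, G2, T1, T2, T1_out, T2_out, n1, n2, mapping, reverse_mapping):
    # Incremental: called right after the pair n1 -> n2 is added to the mapping
    unc1 = {nbr for nbr in G1[n1] if nbr not in mapping}
    unc2 = {nbr for nbr in G2[n2] if nbr not in reverse_mapping}
    T1 = (T1 | unc1) - {n1}
    T2 = (T2 | unc2) - {n2}
    T1_out = T1_out - unc1 - {n1}
    T2_out = T2_out - unc2 - {n2}
    return T1, T2, T1_out, T2_out

G1 = nx.cycle_graph(5)
G2 = nx.cycle_graph(5)
mapping, reverse_mapping = {}, {}
# Initial state: T_i empty, T_i_tilde holds every node (see the observations above)
T1, T2, T1_out, T2_out = set(), set(), set(G1), set(G2)

for u, v in [(0, 0), (1, 1), (2, 2)]:
    mapping[u], reverse_mapping[v] = v, u
    T1, T2, T1_out, T2_out = update_Tinout(
        G1, G2, T1, T2, T1_out, T2_out, u, v, mapping, reverse_mapping
    )
    # The incremental state must match a full recomputation at every step
    assert (T1, T2, T1_out, T2_out) == compute_Ti(G1, G2, mapping, reverse_mapping)
```

The same loop can be extended with pops (restoring the sets) to exercise the backtracking path as well.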

In this post we investigated how node ordering works at a high level, and also how we are able to calculate some important parameters so that the space and time complexity are reduced. The next post will continue with examining two more significant components of the VF2++ algorithm: the candidate node pair selection and the cutting/consistency rules that decide when the mapping should or shouldn’t be extended. Stay tuned!

---

# GSoC 2022: NetworkX VF2++ Implementation

Konstantinos Petridis · 2022-06-09 · https://blog.scientific-python.org/networkx/vf2pp/gsoc-2022/

This is the first post of my GSoC 2022 journey. It includes general information about me and a high-level description of the project.

## Intro

I got accepted as a GSoC contributor, and I am so excited to spend the summer working on such an incredibly interesting project. The mentors are very welcoming, communicative, fun to be around, and I really look forward to collaborating with them. My application for GSoC 2022 can be found here.