### Authors: [Michael Lees](http://www.mhlees.com), Debraj Roy

---

The MIT License (MIT)

Copyright (c) 2015 Michael Lees, Debraj Roy

Permission is hereby granted, free of charge, to any person obtaining a copy

of this software and associated documentation files (the "Software"), to deal

in the Software without restriction, including without limitation the rights

to use, copy, modify, merge, publish, distribute, sublicense, and/or sell

copies of the Software, and to permit persons to whom the Software is

furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all

copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR

IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE

AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER

LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,

OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE

SOFTWARE.

---

# Analysing Real Networks using NetworkX

---

#### NOTE: this assumes python 2.7x if you have python 3 installed, please see: https://docs.python.org/3/howto/pyporting.html

#### If you haven't done so you will need to install all the required python libraries, If you are unfamiliar with pip, Anaconda is an easy way to do this: http://docs.continuum.io/anaconda/install

---

In this class we will look at how to download real network datasets and load them into NetworkX. These networks can also be visualised in Gephi, assuming they are correctly formatted. The purpose of this is to enable you to use and analyse real network data using software, which will be crucial during the group projects.

## Online Data Resources

There are a number of online resources which provide existing datasets that you can play with. Here we list a few, they will offer different networks, of different size and different format (see [formats]). Some will be easier to load in NetworkX/Gephi, but you are free to try any yourself.

- The Koblenz Network Collection (http://konect.uni-koblenz.de/networks)

There are a number of established network file formats that you may used for existing datasets. In the most basic form, all file formats must include: a. A list of nodes (perhaps with associated attributes) b. A list of edges (perhaps with associated attributes). Different file formats will store this in different, and some provide features that others don't. Here we ask you to look at a few file formats:

- Edgelist or CSV (http://networkx.lanl.gov/reference/readwrite.edgelist.html)

By all means study these formats, but as long as you can read a format in NetworkX and/or Gephi that is all that is important. Be aware that sometimes special characters in the date (e.g., quotation, brackets, etc.) can cause Gephi/NetworkX to have problems loading.

## Download the Airline Network

We will now download an edgelist representation of the world-wide airline traffic. Go to http://openflights.org/data.html and read about the dataset. We have already processed the "airports.dat" and "routes.dat" to create an edge list, where nodes are airports and edges exist between nodes if any airline conducts a flight between the two airports. You should now download the edgelist from http://tinyurl.com/n9ohgxg

The edgelist format is simply a Comma Separated Value file (CSV) where each line represent an edge and the first two entries on a line represent the node IDs. In this case the node IDs are the IATA airport codes, so for example the first entry (not including the header line) indicates a flight from Goroka, Papua New Guinea (GKA) to Port Moresby, Papua New Guinea (POM), which is operated by Air Niugini.

## Load the Airilne Network in NetworkX

Now we will try to load the network into NetworkX. First we import the necessary python libraries.

%% Cell type:code id: tags:

``` python

importcsv#import the Python CSV library

importnetworkxasnx#import NetworkX

importnumpyasnp#import numpy for ...

importcommunity#import community (https://pypi.python.org/pypi/python-louvain/0.3)

importpowerlaw#import powerlaw library for testing fits

#force drawing of graphs inline for ipython notebook

%matplotlibinline

importmatplotlib.pyplotasplt#import matplotlib for plotting/drawing grpahs

importoperator#standard python library used for sorting

```

%% Cell type:markdown id: tags:

Next we use the Python open command and the built-in read_edgelist NetworkX command to create our Graph G. In the below code, we open the csv file edgelist.csv as 'rb' which means read (not write) in binary (not plain text).

The parameters of the read_edgelist method can be found in the NetworkX manual here: https://networkx.github.io/documentation/latest/

Goto the webpage and search in the search box for the command read_edgelist, you should end up at: https://networkx.github.io/documentation/latest/reference/generated/networkx.readwrite.edgelist.read_edgelist.html?highlight=read_edgelist

- The first parameter is the file (which we call file_handle)

- The second parameter is a delimiter, which specifies the marker between records, in this case a comma

- The third parameter specifies the types of network to construct, we use a directed graph here (obviously)

- The fourth parameter specifies the type of node, which for us is a string, which is the IATA code e.g., GKA

- The final parameter specifies the encoding used by the file (e.g., UTF-8, ASCII, etc.)

Please spend sometime now becoming familiar with the NetworkX documentation, you should be able to search for help yourself from here on in.

%% Cell type:code id: tags:

``` python

withopen('edgelist.csv','rb')asfile_handle:

next(file_handle,'')# skip the header line (NOTE the first list in the CSV file doesn't contain an edge)

Now we have the graph loaded into NetworkX we can obtain some simple statistics about the Network. For example the number of nodes (N), edges (L) and average degree <k>.

Recall that for a directed network the average degree is simply $<k>= \frac{L}{N} $

%% Cell type:code id: tags:

``` python

N=G.order()#G.order(), gives number of nodes

L=G.size()#G.size(), gives number of edges

avg_deg=float(L)/N#calculate average degree

#print out statistics

print"Nodes: ",N

print"Edges: ",L

print"Average degree: ",avg_deg

```

%%%% Output: stream

Nodes: 3286

Edges: 39428

Average degree: 11.9987827145

%% Cell type:markdown id: tags:

## In-degree and Out-degree

We can now measure the distribution of airports in terms of how many in-coming and out-going flights they have. This is known as the node in-degree and out-degree (respectively). NetworkX provides a simple way to get the in and out degree of all nodes. This data is given in the form of a python dictionary where the key is the node and the value is the in/out degree. Use the NetworkX manual again to search for these methods https://networkx.github.io/documentation/latest/

%% Cell type:code id: tags:

``` python

in_degrees=G.in_degree()# dictionary node:degree

out_degrees=G.out_degree()

print"JFK routes in %d"%in_degrees['"JFK"']#Number routes arriving at JFK international

print"JFK routes out %d"%out_degrees['"JFK"']#Number routes departing from JFK international

print"Heathrow routes in %d"%in_degrees['"LHR"']#routes in London heathrow

print"Heathrow routes out %d"%out_degrees['"LHR"']#routes out London heathrow

print"Singapore routes in %d"%in_degrees['"SIN"']#routes in of Changi, Singapore

print"Singapore routes out %d"%out_degrees['"SIN"']#routes out Changi, Singapore

print"Schipol routes in %d"%in_degrees['"AMS"']#routes in of Schipol, Amsterdam

print"Schipol routes out %d"%out_degrees['"AMS"']#routes out Schipol, Amsterdam

```

%%%% Output: stream

JFK routes in 170

JFK routes out 172

Heathrow routes in 165

Heathrow routes out 165

Singapore routes in 122

Singapore routes out 122

Schipol routes in 239

Schipol routes out 246

%% Cell type:markdown id: tags:

## Plotting In-degree and Out-degree

So it seems Schipol has the most unique routes, also that in-degree and out-degree are correlated! Note this data is from 2012.

Now it would be nice to be able to show the distribution of in-degree and out-degree for all airports, e.g., how many airports have an out-degree of 122, how many have and in-degree of 65. A histogram plot is the best way to do this.

In order to do this we make a set of unique in/out degrees, then for each unique in/out degree we count the number of airports. We will also sort the degrees in increasing order to make our plot more readable. This is basically our degree distribution plot of the airline network.

We also try to fit a powerlaw using the very helpful python package *powerlaw* (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085777)

Now we can ask a question regarding the maximum number of flights (routes) needed to reach any airport from any other airport. This should indicate the longest number of legs required to reach any place in the World! We can also calculate the average path length, which indicates the average number of legs required to travel between different cities in the world.

Recall that the network diameter is the longest shortest path between any two nodes in the network, and also that average path length of graph G: $$l_G = \frac{\sum_{i \neq j} d(n_i, n_j)}{N(N-1)}$$ where $d(n_i, n_j)$ is the shortest path between nodes $n_i$ and $n_j$

%% Cell type:code id: tags:

``` python

#Note some of these things can be calculated more easily in NetworkX

ifnot'avg_path_length'inglobals():#only calculate this if its not been calculated before

max_path_length=0

total=0.0

forninG:#iterate over all nodes

path_length=nx.single_source_shortest_path_length(G,n)# generate shortest paths from node n to all others

total+=sum(path_length.values())#total of all shortest paths from n

ifmax(path_length.values())>max_path_length:#keep track of longest shortest path we see.

Now we can calculate some other statistics about the network, namely:

- Betweenness centrality - which airport lies on most routes

- Closeness centrality - which airport is the shortest number of hops to all other airports

- Eigenvector centrality - which airport is connected to other important networks

Note: the following calculations may take some time.

We also print the top 10 airports with the highest betweenness centrality. Unsuprisingly Frankfurt is highest, this also has the highest degree in the network. However, the airport with the second highest betweenness may surprise you. If you're interested you can read more about this [here](http://toreopsahl.com/2011/08/12/why-anchorage-is-not-that-important-binary-ties-and-sample-selection/)

%% Cell type:code id: tags:

``` python

ifnot'bet_cen'inglobals():#only calculate this if its not been calculated before

plt.hist(bet_cen.values(),bins=200,normed=False)#Try varying the bins, also look at the hist manual

plt.xlabel('Betweenness Centrality')

plt.ylabel('Frequency')

plt.yscale('log')

plt.title('Betweenness Centrality Distribution')

plt.show()

plt.hist(clo_cen.values(),bins=200,normed=False)#Try varying the bins, also look at the hist manual

plt.xlabel('Closeness Centrality')

plt.ylabel('Frequency')

plt.title('Closeness Centrality Distribution')

plt.show()

plt.hist(eig_cen.values(),bins=200,normed=False)#Try varying the bins, also look at the hist manual

plt.xlabel('Eigenvector Centrality')

plt.ylabel('Frequency')

plt.yscale('log')

plt.title('Eigenvector Centrality Distribution')

plt.show()

```

%% Cell type:markdown id: tags:

## Visualise the Network

Now we use some of NetworkX's built in layout algorithms to try and visualise the Network. We do this in two ways, first we visualise (as small circles) all the nodes (with spring_layout) and edges. Then we make a subset of nodes, in particular those that have an out degree greater than 180, and visualise those with larger circles and print their labels. You can play around with this and read the documentation to try and achieve a better (and more informative) layout.

#### NOTE: you can ignore any warnings here

%% Cell type:code id: tags:

``` python

# create the layout

pos=nx.spring_layout(G)

# If you have graphviz installed you can try the following.

We can first use NetworkX to identify the partitions in the network and then colour nodes based on which partition they lie in. To do this we first make the Network undirected.

%% Cell type:code id: tags:

``` python

importcommunity

G_ud=G.to_undirected()#Make the network undirected

nx.transitivity(G_ud)

# find modularity

part=community.best_partition(G_ud)

mod=community.modularity(part,G_ud)

# plot, color nodes using community structure

values=[part.get(node)fornodeinG_ud.nodes()]

pos=nx.spring_layout(G_ud,iterations=100,k=100.0/N)#k indicates separation strength, bigger=> further apart nodes

nx.draw_networkx_nodes(G_ud,pos,

cmap=plt.get_cmap('jet'),

node_color=values,

node_size=30,

with_labels=False,

alpha=0.8)

nx.draw_networkx_edges(G_ud,pos,alpha=0.02)

plt.show()

```

%% Cell type:markdown id: tags:

## Writing output to file

NetworkX provides many ways in which to export the graph data in various formats (see [formats](#formats)). This is quite straightforward and we will show some (you can check the NetworkX documentation for more) examples below.

%% Cell type:code id: tags:

``` python

nx.write_graphml(G,"airports.graphml")#export the directed network

nx.write_gml(G,"airports.gml")#you can add gz to the extension to automatically compress the file

nx.write_gexf(G,"airports.gexf")#write in gephi format

```

%% Cell type:markdown id: tags:

## Visualise in Gephi

NetworkX is a an incredibly flexible way to deal with networks. Once you have everything in Python you can start to implement dynamic processes on the network (e.g., spreading, removing of nodes, etc.). Unfortunately, NetworkX doesn't have great visualisation capabilities built in. If you would like to visualise your network, it's better to export the file and use Gephi.

You should now also be able to load these graphs into Gephi and redo everything you have just done using NetworkX! Once you play around with Gephi a little, you should be able to produce a picture similar to this:

If you manage to complete all of the above, go back and change some of the parameters, see if you can modify the plots (colour, line vs bar, etc.) Once you feel comfortable that you understand all the code you can try a few of the things below.

1. Download another Network data set from the links mentioned at the start of the notebook

2. Load the Airline Network into Gephi and see if you can create the same statistics

3. The average path length of this network gives some indication about how quickly a disease might spread through the network. With a shorter path length indicating a faster spread. Try to find an airport that, if closed, would slow the disease spread the most (i.e., make the biggest increase in average path length). What strategy/statistic might you use to identify that airport?

4. Investigate the NetworkX library to plot more statistics

5. Try to find out if you can add a weight to the edges indicating the number of flights between each airport. This will require finding another data source online, and may or may not be possible!