Commit 7502d25c authored by Mike Lees

Fixed copy/paste error in robustness notebook

parent d6997c1f
%% Cell type:markdown id: tags:
### Authors: [Michael Lees](http://www.mhlees.com), Debraj Roy
---
The MIT License (MIT)
Copyright (c) 2015 Michael Lees, Debraj Roy
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
---
# Analysing Real Networks using NetworkX
---
#### NOTE: this notebook assumes Python 2.7.x. If you have Python 3 installed, please see: https://docs.python.org/3/howto/pyporting.html
#### If you haven't already done so, you will need to install all the required Python libraries. If you are unfamiliar with pip, Anaconda is an easy way to do this: http://docs.continuum.io/anaconda/install
---
In this class we will look at how to download real network datasets and load them into NetworkX. These networks can also be visualised in Gephi, assuming they are correctly formatted. The purpose of this is to enable you to use and analyse real network data using software, which will be crucial during the group projects.
## Online Data Resources
There are a number of online resources that provide existing datasets you can play with. Here we list a few; they offer different networks of different sizes and in different formats (see [formats](#formats)). Some will be easier to load in NetworkX/Gephi, but you are free to try any yourself.
- The Koblenz Network Collection (http://konect.uni-koblenz.de/networks)
- Gephi’s database (http://wiki.gephi.org/index.php/Datasets)
- Mark Newman’s collection (http://www-personal.umich.edu/~mejn/netdata/)
- Barabási Lab (http://www3.nd.edu/~networks/resources.htm)
- Luciano Costa's Group (http://cyvision.ifsc.usp.br/Cyvision/?page=NETWORKS&subpage=Databases)
- Open Flights (http://openflights.org/data.html)
- Tore Opsahl's collection (http://toreopsahl.com/datasets)
- Google (http://www.google.com)
## Network Data Formats
<a name="formats"></a>
There are a number of established network file formats that you may encounter in existing datasets. In the most basic form, all file formats must include: (a) a list of nodes (perhaps with associated attributes) and (b) a list of edges (perhaps with associated attributes). Different file formats store this information in different ways, and some provide features that others don't. Here we ask you to look at a few file formats:
- GraphML (https://www.wikiwand.com/en/GraphML)
- GML (https://www.wikiwand.com/en/Graph_Modelling_Language)
- GEXF, the Gephi format (http://gexf.net/format/)
- GDF (http://gephi.github.io/users/supported-graph-formats/gdf-format/)
- Edgelist or CSV (http://networkx.lanl.gov/reference/readwrite.edgelist.html)
By all means study these formats, but what matters is that you can read a format in NetworkX and/or Gephi. Be aware that special characters in the data (e.g., quotation marks, brackets, etc.) can sometimes cause Gephi/NetworkX problems when loading.
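To make these formats concrete, here is a small optional sketch (not part of the assignment) that builds a tiny directed graph, writes it out both as an edge list and as GraphML, and prints the raw files so you can compare the two representations; the file names are purely illustrative.
%% Cell type:code id: tags:
``` python
import networkx as nx #NetworkX is imported again with the other libraries below

#build a tiny directed graph with three (hypothetical example) airports
tiny = nx.DiGraph()
tiny.add_edge("GKA", "POM")
tiny.add_edge("POM", "SIN")

#write the same graph in two different formats
nx.write_edgelist(tiny, "tiny.edgelist", delimiter=',') #plain text, one edge per line
nx.write_graphml(tiny, "tiny.graphml") #XML-based format, supports attributes

#print the raw file contents to compare the formats
print open("tiny.edgelist").read()
print open("tiny.graphml").read()
```
%% Cell type:markdown id: tags: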
## Download the Airline Network
We will now download an edgelist representation of the world-wide airline traffic. Go to http://openflights.org/data.html and read about the dataset. We have already processed the "airports.dat" and "routes.dat" to create an edge list, where nodes are airports and edges exist between nodes if any airline conducts a flight between the two airports. You should now download the edgelist from http://tinyurl.com/n9ohgxg
The edgelist format is simply a Comma Separated Value (CSV) file where each line represents an edge and the first two entries on a line represent the node IDs. In this case the node IDs are the IATA airport codes, so for example the first entry (not including the header line) indicates a flight from Goroka, Papua New Guinea (GKA) to Port Moresby, Papua New Guinea (POM), which is operated by Air Niugini.
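If you want to inspect the file you downloaded, a minimal sketch like the one below just prints its first few lines (it assumes the file is saved as edgelist.csv in your working directory; the exact header and quoting may differ slightly depending on how the edge list was generated).
%% Cell type:code id: tags:
``` python
#print the first five lines of the downloaded edge list to inspect its structure
#(assumes the file is saved as 'edgelist.csv' in the current working directory)
with open('edgelist.csv', 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print line.strip()
```
%% Cell type:markdown id: tags: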
## Load the Airline Network in NetworkX
Now we will try to load the network into NetworkX. First we import the necessary python libraries.
%% Cell type:code id: tags:
``` python
import csv #import the Python CSV library
import networkx as nx #import NetworkX
import numpy as np #import numpy for numerical calculations (e.g., means)
import community #import community (https://pypi.python.org/pypi/python-louvain/0.3)
import powerlaw #import powerlaw library for testing fits
#force drawing of graphs inline for ipython notebook
%matplotlib inline
import matplotlib.pyplot as plt #import matplotlib for plotting/drawing graphs
import operator #standard python library used for sorting
```
%% Cell type:markdown id: tags:
Next we use the Python open command and the built-in read_edgelist NetworkX command to create our graph G. In the code below, we open the CSV file edgelist.csv with mode 'rb', which means read (not write) in binary (not plain text).
The parameters of the read_edgelist method can be found in the NetworkX manual here: https://networkx.github.io/documentation/latest/
Go to the webpage and search in the search box for the command read_edgelist; you should end up at: https://networkx.github.io/documentation/latest/reference/generated/networkx.readwrite.edgelist.read_edgelist.html?highlight=read_edgelist
- The first parameter is the file (which we call file_handle)
- The second parameter is a delimiter, which specifies the marker between records, in this case a comma
- The third parameter specifies the type of network to construct; we use a directed graph here (obviously)
- The fourth parameter specifies the node type, which for us is a string: the IATA code, e.g., GKA
- The final parameter specifies the encoding used by the file (e.g., UTF-8, ASCII, etc.)
Please spend some time now becoming familiar with the NetworkX documentation; you should be able to search for help yourself from here on.
%% Cell type:code id: tags:
``` python
with open('edgelist.csv', 'rb') as file_handle:
    next(file_handle, '') # skip the header line (NOTE the first line in the CSV file doesn't contain an edge)
    G = nx.read_edgelist(file_handle, delimiter=',', create_using=nx.DiGraph(),
                         nodetype=str, encoding="utf-8")
```
%% Cell type:markdown id: tags:
## Printing Network Statistics
Now we have the graph loaded into NetworkX we can obtain some simple statistics about the network, for example the number of nodes ($N$), edges ($L$) and average degree $\langle k \rangle$.
Recall that for a directed network the average degree is simply $\langle k \rangle = \frac{L}{N}$
%% Cell type:code id: tags:
``` python
N = G.order() #G.order(), gives number of nodes
L = G.size() #G.size(), gives number of edges
avg_deg = float(L) / N #calculate average degree
#print out statistics
print "Nodes: ", N
print "Edges: ", L
print "Average degree: ", avg_deg
```
%%%% Output: stream
Nodes: 3286
Edges: 39428
Average degree: 11.9987827145
%% Cell type:markdown id: tags:
## In-degree and Out-degree
We can now measure the distribution of airports in terms of how many in-coming and out-going flights they have. This is known as the node in-degree and out-degree (respectively). NetworkX provides a simple way to get the in and out degree of all nodes. This data is given in the form of a python dictionary where the key is the node and the value is the in/out degree. Use the NetworkX manual again to search for these methods https://networkx.github.io/documentation/latest/
%% Cell type:code id: tags:
``` python
in_degrees = G.in_degree() # dictionary node:degree
out_degrees = G.out_degree()
print "JFK routes in %d" % in_degrees['"JFK"'] #Number routes arriving at JFK international
print "JFK routes out %d" % out_degrees['"JFK"'] #Number routes departing from JFK international
print "Heathrow routes in %d" % in_degrees['"LHR"'] #routes in London heathrow
print "Heathrow routes out %d" % out_degrees['"LHR"'] #routes out London heathrow
print "Singapore routes in %d" % in_degrees['"SIN"'] #routes in of Changi, Singapore
print "Singapore routes out %d" % out_degrees['"SIN"'] #routes out Changi, Singapore
print "Schipol routes in %d" % in_degrees['"AMS"'] #routes in of Schipol, Amsterdam
print "Schipol routes out %d" % out_degrees['"AMS"'] #routes out Schipol, Amsterdam
```
%%%% Output: stream
JFK routes in 170
JFK routes out 172
Heathrow routes in 165
Heathrow routes out 165
Singapore routes in 122
Singapore routes out 122
Schiphol routes in 239
Schiphol routes out 246
%% Cell type:markdown id: tags:
## Plotting In-degree and Out-degree
So it seems Schiphol has the most unique routes, and that in-degree and out-degree are correlated! Note this data is from 2012.
Now it would be nice to show the distribution of in-degree and out-degree for all airports, e.g., how many airports have an out-degree of 122, how many have an in-degree of 65. A histogram plot is the best way to do this.
In order to do this we make a set of unique in/out degrees, then for each unique in/out degree we count the number of airports. We will also sort the degrees in increasing order to make our plot more readable. This is basically our degree distribution plot of the airline network.
We also try to fit a powerlaw using the very helpful python package *powerlaw* (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085777)
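As a quick check of the claim above that in-degree and out-degree are correlated, the minimal sketch below computes the Pearson correlation between them across all airports (it assumes G, in_degrees and out_degrees from the earlier cells).
%% Cell type:code id: tags:
``` python
#Pearson correlation between in-degree and out-degree across all airports
#(assumes in_degrees and out_degrees were computed in the earlier cell)
nodes = G.nodes()
in_seq = [in_degrees[n] for n in nodes]
out_seq = [out_degrees[n] for n in nodes]
print "Correlation between in-degree and out-degree: %f" % np.corrcoef(in_seq, out_seq)[0, 1]
```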
%% Cell type:code id: tags:
``` python
# Set figure width to 18in and height to 13in
import collections
fig_size= [18,13]
#plt.rcParams["figure.figsize"] = fig_size
plt.rcParams.update({'font.size': 22, "figure.figsize": fig_size})
#Histogram of in-degrees, Plot and save to png
plt.hist(in_degrees.values(), bins=300, normed=False)
plt.title('In degree distribution')
plt.savefig('in-degree.png')
plt.show()
#Plot log-log of in-degree distribution
plt.yscale('log') #set y scale to be log
plt.xscale('log') #set x scale to be log
#create a dictionary where key is degree and value is the number of times that degree is found
#See the python collections Counter for more information
in_degree_counts = dict(collections.Counter(in_degrees.values()))
#Create scatter plot with degree on x-axis and counts on y-axis (red colour, x marker and size 150)
plt.scatter(in_degree_counts.keys(), in_degree_counts.values(), c='r', marker='x', s=150)
plt.xlim((.9, 1e3)) #set x axis limits
plt.ylim((0, 1e3)) #set y axis limits
plt.xlabel('degree')
plt.ylabel('Frequency')
plt.title('In-degree distribution')
plt.show()
fit = powerlaw.Fit(in_degrees.values()) #fit to the degree sequence itself (not the histogram counts)
R, p = fit.distribution_compare('power_law','exponential', normalized_ratio=True)
print R, p
R, p = fit.distribution_compare('power_law','lognormal', normalized_ratio=True)
print R, p
R, p = fit.distribution_compare('exponential','power_law', normalized_ratio=True)
print R, p
print fit.power_law.alpha
print fit.power_law.xmin
#Histogram of out-degrees, plot and save to png
plt.hist(out_degrees.values(), bins=300, normed=False)
plt.title('Out degree distribution')
plt.savefig('out-degree.png')
plt.show()
#Plot log-log of out-degree distribution
plt.yscale('log') #set y scale to be log
plt.xscale('log') #set x scale to be log
#create a dictionary where key is degree and value is the number of times that degree is found
#See the python collections Counter for more information
out_degree_counts = dict(collections.Counter(out_degrees.values()))
#Create scatter plot with degree on x-axis and counts on y-axis (green colour, o marker and size 150)
plt.scatter(out_degree_counts.keys(), out_degree_counts.values(), c='g', marker='o', s=150)
plt.xlim((.9, 1e3)) #set x axis limits
plt.ylim((0, 1e3)) #set y axis limits
plt.xlabel('degree')
plt.ylabel('Frequency')
plt.title('Out-degree distribution')
plt.show()
```
%% Cell type:markdown id: tags:
## Path Lengths in the Airline Network
Now we can ask a question regarding the maximum number of flights (routes) needed to reach any airport from any other airport. This should indicate the largest number of legs required to reach any place in the world! We can also calculate the average path length, which indicates the average number of legs required to travel between different cities in the world.
Recall that the network diameter is the longest shortest path between any two nodes in the network, and also that average path length of graph G: $$l_G = \frac{\sum_{i \neq j} d(n_i, n_j)}{N(N-1)}$$ where $d(n_i, n_j)$ is the shortest path between nodes $n_i$ and $n_j$
%% Cell type:code id: tags:
``` python
#Note some of these things can be calculated more easily in NetworkX
if not 'avg_path_length' in globals(): #only calculate this if it hasn't been calculated before
    max_path_length = 0
    total = 0.0
    for n in G: #iterate over all nodes
        path_length = nx.single_source_shortest_path_length(G, n) # generate shortest paths from node n to all others
        total += sum(path_length.values()) #total of all shortest paths from n
        if max(path_length.values()) > max_path_length: #keep track of the longest shortest path we see
            max_path_length = max(path_length.values())
    avg_path_length = total / (N*(N - 1)) #calculate average
print "Average path length %f" % avg_path_length #print average path length
print "Network Diameter %d" % max_path_length #print diameter
```
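%% Cell type:markdown id: tags:
As the comment in the cell above notes, NetworkX also has built-in functions for these quantities. One caveat: nx.average_shortest_path_length and nx.diameter raise an error for a directed graph that is not strongly connected, so the optional sketch below applies them to the largest strongly connected component instead (this may take a while to run).
%% Cell type:code id: tags:
``` python
#Built-in alternatives, applied to the largest strongly connected component
#(these functions fail on directed graphs that are not strongly connected)
largest_scc = max(nx.strongly_connected_component_subgraphs(G), key=len)
print "Nodes in largest strongly connected component: %d" % largest_scc.order()
print "Average shortest path length (largest SCC): %f" % nx.average_shortest_path_length(largest_scc)
print "Diameter (largest SCC): %d" % nx.diameter(largest_scc)
```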
%% Cell type:markdown id: tags:
## Centrality
Now we can calculate some other statistics about the network, namely:
- Betweenness centrality - which airport lies on the most shortest paths (routes) between other airports
- Closeness centrality - which airport is, on average, the fewest hops away from all other airports
- Eigenvector centrality - which airport is connected to other important airports
Note: the following calculations may take some time.
We also print the top 10 airports with the highest betweenness centrality. Unsurprisingly Frankfurt is highest; it also has the highest degree in the network. However, the airport with the second highest betweenness may surprise you. If you're interested you can read more about this [here](http://toreopsahl.com/2011/08/12/why-anchorage-is-not-that-important-binary-ties-and-sample-selection/)
%% Cell type:code id: tags:
``` python
if not 'bet_cen' in globals(): #only calculate these if they haven't been calculated before
    bet_cen = nx.betweenness_centrality(G)
    clo_cen = nx.closeness_centrality(G)
    eig_cen = nx.eigenvector_centrality(G)
#Mean betweenness and top 10 airports by betweenness
print "Betweenness mean: %f" % np.mean(bet_cen.values()) #mean betweenness
#get a list of airports sorted by betweenness
airports_sorted_by_betweenness = sorted(bet_cen.items(), key=operator.itemgetter(1), reverse=True)
for x in range(10): #print top 10 airports by betweenness
print str(x+1)+ ": " + str(airports_sorted_by_betweenness[x])
#plt.bar(center, hist)
plt.rcParams.update({'font.size': 22})
plt.hist(bet_cen.values(), bins=200, normed=False) #Try varying the bins, also look at the hist manual
plt.xlabel('Betweenness Centrality')
plt.ylabel('Frequency')
plt.yscale('log')
plt.title('Betweenness Centrality Distribution')
plt.show()
plt.hist(clo_cen.values(), bins=200, normed=False) #Try varying the bins, also look at the hist manual
plt.xlabel('Closeness Centrality')
plt.ylabel('Frequency')
plt.title('Closeness Centrality Distribution')
plt.show()
plt.hist(eig_cen.values(), bins=200, normed=False) #Try varying the bins, also look at the hist manual
plt.xlabel('Eigenvector Centrality')
plt.ylabel('Frequency')
plt.yscale('log')
plt.title('Eigenvector Centrality Distribution')
plt.show()
```
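%% Cell type:markdown id: tags:
For comparison, the same ranking idea can be applied to the other two centrality measures. The sketch below (assuming clo_cen and eig_cen from the cell above) prints the top 10 airports by closeness and by eigenvector centrality.
%% Cell type:code id: tags:
``` python
#Top 10 airports by closeness and by eigenvector centrality
#(assumes clo_cen and eig_cen were computed in the previous cell)
airports_by_closeness = sorted(clo_cen.items(), key=operator.itemgetter(1), reverse=True)
airports_by_eigenvector = sorted(eig_cen.items(), key=operator.itemgetter(1), reverse=True)
print "Top 10 by closeness centrality:"
for x in range(10):
    print str(x+1) + ": " + str(airports_by_closeness[x])
print "Top 10 by eigenvector centrality:"
for x in range(10):
    print str(x+1) + ": " + str(airports_by_eigenvector[x])
```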
%% Cell type:markdown id: tags:
## Visualise the Network
Now we use some of NetworkX's built-in layout algorithms to try to visualise the network. We do this in two steps: first we visualise (as small circles) all the nodes (with spring_layout) and edges; then we make a subset of nodes, in particular those with an out-degree greater than 180, and visualise those with larger circles and print their labels. You can play around with this and read the documentation to try to achieve a better (and more informative) layout.
#### NOTE: you can ignore any warnings here
%% Cell type:code id: tags:
``` python
# create the layout
pos = nx.spring_layout(G)
# If you have graphviz installed you can try the following.
#pos=nx.graphviz_layout(G,prog='neato')
# draw the nodes and the edges (all)
nx.draw_networkx_nodes(G,pos,node_color='b',alpha=0.2,node_size=16)
nx.draw_networkx_edges(G,pos,alpha=0.005)
nodes_to_keep = [n for n in out_degrees if out_degrees[n] > 180]
G_s = G.subgraph(nodes_to_keep)
# draw the most important nodes with a different style
nx.draw_networkx_nodes(G_s,pos,node_color='r',alpha=0.4,node_size=450)
# also the labels this time
nx.draw_networkx_labels(G_s,pos,font_size=20,font_color='b')
```
%% Cell type:markdown id: tags:
## Modularity and More Layouts
We can use the community detection package (python-louvain, imported as community) to identify the partitions in the network and then colour nodes based on which partition they lie in. To do this we first make the network undirected.
%% Cell type:code id: tags:
``` python
import community
G_ud = G.to_undirected() #Make the network undirected
print "Transitivity: %f" % nx.transitivity(G_ud) #global clustering coefficient of the undirected network
# find modularity
part = community.best_partition(G_ud)
mod = community.modularity(part, G_ud)
# plot, color nodes using community structure
values = [part.get(node) for node in G_ud.nodes()]
pos = nx.spring_layout(G_ud,iterations=100,k=100.0/N) #k indicates separation strength, bigger=> further apart nodes
nx.draw_networkx_nodes(G_ud,pos,
cmap=plt.get_cmap('jet'),
node_color=values,
node_size=30,
with_labels=False,
alpha=0.8)
nx.draw_networkx_edges(G_ud,pos,alpha=0.02)
plt.show()
```
%% Cell type:markdown id: tags:
## Writing output to file
NetworkX provides many ways in which to export the graph data in various formats (see [formats](#formats)). This is quite straightforward and we will show some (you can check the NetworkX documentation for more) examples below.
%% Cell type:code id: tags:
``` python
nx.write_graphml(G, "airports.graphml") #export the directed network
nx.write_gml(G,"airports.gml") #you can add gz to the extension to automatically compress the file
nx.write_gexf(G, "airports.gexf") #write in gephi format
```
%% Cell type:markdown id: tags:
## Visualise in Gephi
NetworkX is an incredibly flexible way to deal with networks. Once you have everything in Python you can start to implement dynamic processes on the network (e.g., spreading, removing of nodes, etc.). Unfortunately, NetworkX doesn't have great visualisation capabilities built in. If you would like to visualise your network, it's better to export the file and use Gephi.
You should now also be able to load these graphs into Gephi and redo everything you have just done using NetworkX! Once you play around with Gephi a little, you should be able to produce a picture similar to this:
<img src="https://dl.dropboxusercontent.com/u/950215/airports.png">
%% Cell type:markdown id: tags:
## More Questions & Experiments
If you manage to complete all of the above, go back and change some of the parameters and see if you can modify the plots (colour, line vs bar, etc.). Once you feel comfortable that you understand all the code, you can try a few of the things below.
1. Download another Network data set from the links mentioned at the start of the notebook
2. Load the Airline Network into Gephi and see if you can create the same statistics
3. The average path length of this network gives some indication of how quickly a disease might spread through the network, with a shorter path length indicating a faster spread. Try to find an airport that, if closed, would slow the disease spread the most (i.e., make the biggest increase in average path length). What strategy/statistic might you use to identify that airport? One possible starting point is sketched after this list.
4. Investigate the NetworkX library to plot more statistics
5. Try to find out if you can add a weight to the edges indicating the number of flights between each airport. This will require finding another data source online, and may or may not be possible!
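For question 3, one possible starting point (a rough sketch, not the only valid strategy) is to take the airports with the highest betweenness centrality as candidates, remove each one in turn from a copy of the graph, and recompute the average path length using the same counting as the earlier cell. Note that this is slow in pure Python, so only a handful of candidates are tried here.
%% Cell type:code id: tags:
``` python
#Rough sketch for question 3: remove candidate airports one at a time and
#recompute the average path length (assumes G and bet_cen from earlier cells)
def average_path_length(graph):
    n = graph.order()
    total = 0.0
    for node in graph: #iterate over all nodes
        lengths = nx.single_source_shortest_path_length(graph, node)
        total += sum(lengths.values())
    return total / (n * (n - 1))

#candidates: the five airports with the highest betweenness centrality
candidates = [a for a, b in sorted(bet_cen.items(), key=operator.itemgetter(1), reverse=True)[:5]]
print "Baseline average path length: %f" % average_path_length(G)
for airport in candidates: #WARNING: each iteration may take a few minutes
    G_removed = G.copy()
    G_removed.remove_node(airport)
    print "Without %s: %f" % (airport, average_path_length(G_removed))
```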