#### NOTE: this assumes Python 2.7.x. If you have Python 3 installed, please see: https://docs.python.org/3/howto/pyporting.html
#### If you haven't done so, you will need to install all the required Python libraries. If you are unfamiliar with pip, Anaconda is an easy way to do this: http://docs.continuum.io/anaconda/install
---
In this class we will look at how to download real network datasets and load them into NetworkX. These networks can also be visualised in Gephi, assuming they are correctly formatted. The purpose is to enable you to load and analyse real network data in software, which will be crucial during the group projects.
## Online Data Resources
There are a number of online resources which provide existing datasets that you can play with. Here we list a few; they offer networks of different sizes and in different formats (see [formats]). Some will be easier to load into NetworkX/Gephi than others, but you are free to try any of them yourself.
- The Koblenz Network Collection (http://konect.uni-koblenz.de/networks)
There are a number of established network file formats that may be used for existing datasets. In the most basic form, all file formats must include: (a) a list of nodes (perhaps with associated attributes) and (b) a list of edges (perhaps with associated attributes). Different file formats store this in different ways, and some provide features that others don't. Here we ask you to look at a few file formats:
- Edgelist or CSV (http://networkx.lanl.gov/reference/readwrite.edgelist.html)
By all means study these formats, but as long as you can read a format in NetworkX and/or Gephi, that is all that matters. Be aware that special characters in the data (e.g., quotation marks, brackets, etc.) can sometimes cause Gephi/NetworkX to have problems loading a file.
## Download the Airline Network
We will now download an edgelist representation of the world-wide airline traffic. Go to http://openflights.org/data.html and read about the dataset. We have already processed the "airports.dat" and "routes.dat" to create an edge list, where nodes are airports and edges exist between nodes if any airline conducts a flight between the two airports. You should now download the edgelist from http://tinyurl.com/n9ohgxg
The edgelist format is simply a Comma Separated Value (CSV) file where each line represents an edge and the first two entries on a line are the node IDs. In this case the node IDs are the IATA airport codes, so for example the first entry (not including the header line) indicates a flight from Goroka, Papua New Guinea (GKA) to Port Moresby, Papua New Guinea (POM), which is operated by Air Niugini.
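To make the format concrete, here is a minimal sketch that parses a tiny edge list with Python's `csv` module. The sample text is made up for illustration: only the GKA to POM route comes from the description above, and the header names `source,target` and the helper `parse_edges` are assumptions, not the layout of the real file.

``` python
import csv

# Made-up sample in the same shape as an edge list:
# one header line, then one edge per line (source, target).
SAMPLE = "source,target\nGKA,POM\nPOM,GKA\nGKA,LAE\n"

def parse_edges(text):
    """Return a list of (source, target) tuples from CSV text with a header line."""
    reader = csv.reader(text.splitlines())
    next(reader, None)  # skip the header line, which does not contain an edge
    # keep only the first two entries of each line: the two node IDs
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

edges = parse_edges(SAMPLE)
print(edges)
```

Any further columns on a line (for example an airline name) would simply be ignored by this sketch.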
## Load the Airline Network in NetworkX
Now we will try to load the network into NetworkX. First we import the necessary python libraries.
%% Cell type:code id: tags:
``` python
import csv #import the Python CSV library
import networkx as nx #import NetworkX
import numpy as np #import numpy for ...
import community #import community (https://pypi.python.org/pypi/python-louvain/0.3)
import powerlaw #import powerlaw library for testing fits
#force drawing of graphs inline for ipython notebook
%matplotlib inline
import matplotlib.pyplot as plt #import matplotlib for plotting/drawing graphs
```
%% Cell type:markdown id: tags:
Next we use the Python open command and the built-in read_edgelist NetworkX command to create our Graph G. In the below code, we open the csv file edgelist.csv as 'rb' which means read (not write) in binary (not plain text).
The parameters of the read_edgelist method can be found in the NetworkX manual here: https://networkx.github.io/documentation/latest/
Go to the webpage and type read_edgelist into the search box. You should end up at: https://networkx.github.io/documentation/latest/reference/generated/networkx.readwrite.edgelist.read_edgelist.html?highlight=read_edgelist
- The first parameter is the file (which we call file_handle)
- The second parameter is a delimiter, which specifies the marker between records, in this case a comma
- The third parameter specifies the type of network to construct; we use a directed graph here, because a route from one airport to another does not imply a route in the opposite direction
- The fourth parameter specifies the type of node, which for us is a string, which is the IATA code e.g., GKA
- The final parameter specifies the encoding used by the file (e.g., UTF-8, ASCII, etc.)
Please spend some time now becoming familiar with the NetworkX documentation; you should be able to search for help yourself from here on in.
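As a small, self-contained illustration of these parameters, the sketch below passes an in-memory byte stream to `read_edgelist` instead of a file on disk. The routes and header names are made up for illustration (only GKA to POM is taken from the text above), the name `G_toy` is hypothetical, and the UTF-8 encoding is an assumption.

``` python
import io
import networkx as nx

# In-memory stand-in for a file opened with 'rb': a header line plus three edges.
raw = io.BytesIO(b"source,target\nGKA,POM\nPOM,GKA\nGKA,LAE\n")
next(raw, b"")  # skip the header line

G_toy = nx.read_edgelist(raw,
                         delimiter=",",              # marker between records
                         create_using=nx.DiGraph(),  # build a directed graph
                         nodetype=str,               # node IDs are strings
                         encoding="utf-8")           # assumed file encoding

print(sorted(G_toy.edges()))
```

Because the graph is directed, GKA to LAE is present here but LAE to GKA is not.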
%% Cell type:code id: tags:
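``` python
with open('edgelist.csv', 'rb') as file_handle:
    next(file_handle, '')  # skip the header line (NOTE: the first line in the CSV file doesn't contain an edge)
    G = nx.read_edgelist(file_handle, delimiter=',', create_using=nx.DiGraph(),
                         nodetype=str, encoding="utf-8")  # node IDs are strings; encoding assumed to be UTF-8
```
%% Cell type:markdown id: tags: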
Now that we have the graph loaded into NetworkX, we can obtain some simple statistics about the network, for example the number of nodes ($N$), the number of edges ($L$) and the average degree $\langle k \rangle$.
Recall that for a directed network the average degree is simply $\langle k \rangle = \frac{L}{N}$
%% Cell type:code id: tags:
``` python
N = G.order() #G.order(), gives number of nodes
L = G.size() #G.size(), gives number of edges
avg_deg = float(L) / N #calculate average degree
#print out statistics
print "Nodes: ", N
print "Edges: ", L
print "Average degree: ", avg_deg
```
%%%% Output: stream
Nodes: 3286
Edges: 39429
Average degree: 11.9990870359
%% Cell type:markdown id: tags:
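As a quick check of the average-degree formula $L/N$ for directed networks, the sketch below works it out by hand on a toy graph. Every edge adds one to the out-degree of its source and one to the in-degree of its target, so the average in-degree and the average out-degree are both $L/N$. The toy edges and all variable names are made up for illustration.

``` python
# Toy directed graph: 3 nodes, 4 edges (made-up data).
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]

toy_nodes = set()
for source, target in toy_edges:
    toy_nodes.add(source)
    toy_nodes.add(target)

toy_N = len(toy_nodes)               # number of nodes
toy_L = len(toy_edges)               # number of edges
toy_avg_deg = float(toy_L) / toy_N   # average (in- or out-) degree

print("Average degree: %.3f" % toy_avg_deg)
```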
## In-degree and Out-degree
We can now measure the distribution of airports in terms of how many incoming and outgoing routes they have. These are known as the node in-degree and out-degree, respectively. NetworkX provides a simple way to get the in- and out-degree of all nodes. This data is given in the form of a Python dictionary where the key is the node and the value is the in/out-degree. Use the NetworkX manual again to search for these methods: https://networkx.github.io/documentation/latest/
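To show what such a dictionary looks like, here is a plain-Python sketch that counts in- and out-degrees directly from a list of edges. It is not how NetworkX computes them internally, and the four routes are made up for illustration.

``` python
from collections import Counter

# Made-up directed routes (source, target).
toy_routes = [("JFK", "LHR"), ("LHR", "JFK"), ("JFK", "AMS"), ("AMS", "LHR")]

# out-degree: how many edges start at each node
toy_out_degrees = Counter(source for source, target in toy_routes)
# in-degree: how many edges end at each node
toy_in_degrees = Counter(target for source, target in toy_routes)

print("JFK routes in %d" % toy_in_degrees["JFK"])
print("JFK routes out %d" % toy_out_degrees["JFK"])
```

A `Counter` behaves like a dictionary but returns 0 for a node that never appears, which is convenient for nodes with no incoming or no outgoing edges.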
%% Cell type:code id: tags:
``` python
in_degrees = dict(G.in_degree())    #dictionary of node -> in-degree
out_degrees = dict(G.out_degree())  #dictionary of node -> out-degree
print "JFK routes in %d" % in_degrees['"JFK"'] #number of routes arriving at JFK International
print "JFK routes out %d" % out_degrees['"JFK"'] #number of routes departing from JFK International
print "Heathrow routes in %d" % in_degrees['"LHR"'] #routes into London Heathrow
print "Heathrow routes out %d" % out_degrees['"LHR"'] #routes out of London Heathrow
print "Singapore routes in %d" % in_degrees['"SIN"'] #routes into Changi, Singapore
print "Singapore routes out %d" % out_degrees['"SIN"'] #routes out of Changi, Singapore
print "Schipol routes in %d" % in_degrees['"AMS"'] #routes into Schiphol, Amsterdam
print "Schipol routes out %d" % out_degrees['"AMS"'] #routes out of Schiphol, Amsterdam
```
%%%% Output: stream
JFK routes in 170
JFK routes out 172
Heathrow routes in 165
Heathrow routes out 165
Singapore routes in 122
Singapore routes out 122
Schipol routes in 239
Schipol routes out 246
%% Cell type:markdown id: tags:
## Plotting In-degree and Out-degree
So it seems that, of these four airports, Schiphol has the most unique routes, and that in-degree and out-degree are closely correlated. Note that this data is from 2012.
Now it would be nice to show the distribution of in-degree and out-degree for all airports, e.g., how many airports have an out-degree of 122, or how many have an in-degree of 65. A histogram is the best way to do this.
To do this we make a set of unique in/out-degrees, then for each unique in/out-degree we count the number of airports with that degree. We also sort the degrees in increasing order to make our plot more readable. This is essentially the degree distribution plot of the airline network.
We also try to fit a powerlaw using the very helpful python package *powerlaw* (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085777)
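The counting step described above can be sketched in plain Python as follows. The degree dictionary is made up for illustration and the variable names are hypothetical; for the real network you would start from the in- or out-degree dictionary instead.

``` python
from collections import Counter

# Made-up out-degree dictionary: node -> out-degree.
toy_degrees = {"A": 1, "B": 3, "C": 1, "D": 2, "E": 1, "F": 3}

# count how many nodes have each degree value
degree_counts = Counter(toy_degrees.values())

# unique degrees in increasing order, and the number of nodes with each degree
unique_degrees = sorted(degree_counts)
nodes_per_degree = [degree_counts[d] for d in unique_degrees]

for degree, count in zip(unique_degrees, nodes_per_degree):
    print("degree %d: %d node(s)" % (degree, count))
```

The two lists can then be handed to a plotting call such as matplotlib's `plt.loglog(unique_degrees, nodes_per_degree, 'o')` to draw the distribution on logarithmic axes.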
Now we can ask about the maximum number of flights (routes) needed to reach any airport from any other airport. This indicates the largest number of legs required to reach any place in the world. We can also calculate the average path length, which indicates the average number of legs required to travel between different cities in the world.
Recall that the network diameter is the longest shortest path between any two nodes in the network, and that the average path length of graph G is $$l_G = \frac{\sum_{i \neq j} d(n_i, n_j)}{N(N-1)}$$ where $d(n_i, n_j)$ is the length of the shortest path between nodes $n_i$ and $n_j$
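To see both definitions at work on something small enough to check by hand, the sketch below uses a plain breadth-first search on a made-up three-node directed cycle. It is an illustration of the definitions, not the NetworkX implementation, and the helper name `bfs_lengths` is hypothetical.

``` python
from collections import deque

# Made-up directed cycle A -> B -> C -> A, stored as node -> list of successors.
toy_graph = {"A": ["B"], "B": ["C"], "C": ["A"]}

def bfs_lengths(graph, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in dist:  # first visit gives the shortest distance
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

toy_n = len(toy_graph)
toy_total = 0.0
toy_diameter = 0
for src in toy_graph:
    lengths = bfs_lengths(toy_graph, src)
    toy_total += sum(lengths.values())                       # sum of d(src, j) over all j
    toy_diameter = max(toy_diameter, max(lengths.values()))  # longest shortest path so far

toy_avg_path = toy_total / (toy_n * (toy_n - 1))
print("Average path length %f" % toy_avg_path)
print("Diameter %d" % toy_diameter)
```

Note that the formula divides by all $N(N-1)$ ordered pairs. If some airports cannot be reached from others, those pairs add nothing to the sum, so the result underestimates the average over reachable pairs.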
%% Cell type:code id: tags:
``` python
#Note some of these things can be calculated more easily in NetworkX
if 'avg_path_length' not in globals(): #only calculate this if it has not been calculated before
    max_path_length = 0
    total = 0.0
    for n in G: #iterate over all nodes
        path_length = nx.single_source_shortest_path_length(G, n) #shortest path lengths from node n to all others
        total += sum(path_length.values()) #total of all shortest paths from n
        if max(path_length.values()) > max_path_length: #keep track of longest shortest path we see
            max_path_length = max(path_length.values())
    avg_path_length = total / (N*(N - 1)) #calculate average
print "Average path length %f" % avg_path_length #print average path
```
%% Cell type:markdown id: tags:
Now we use some of NetworkX's built-in layout algorithms to try to visualise the network. We do this in two steps: first we draw all the nodes (as small circles, positioned with spring_layout) and edges. Then we take a subset of nodes, in particular those that have an out-degree greater than 180, and draw those with larger circles and print their labels. You can play around with this and read the documentation to try to achieve a better (and more informative) layout.
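A minimal sketch of this two-step drawing is given below, under a few assumptions: the helper names `hub_nodes` and `draw_with_hubs`, the toy graph and the node sizes are all made up for illustration, and matplotlib is imported inside the drawing function so that the hub selection can be tried on its own.

``` python
import networkx as nx

def hub_nodes(G, min_out_degree):
    """Nodes whose out-degree is strictly greater than min_out_degree."""
    return [n for n, d in dict(G.out_degree()).items() if d > min_out_degree]

def draw_with_hubs(G, min_out_degree):
    """Draw all nodes small, then the hubs larger and with labels."""
    import matplotlib.pyplot as plt
    pos = nx.spring_layout(G)                         # node -> (x, y) position
    hubs = hub_nodes(G, min_out_degree)
    nx.draw_networkx_nodes(G, pos, node_size=10)      # every node as a small circle
    nx.draw_networkx_edges(G, pos, alpha=0.2)         # all edges, faint
    nx.draw_networkx_nodes(G, pos, nodelist=hubs, node_size=300)      # hubs, larger
    nx.draw_networkx_labels(G, pos, labels=dict((n, n) for n in hubs))  # label hubs only
    plt.axis('off')
    plt.show()

# Made-up toy graph: one hub with three outgoing edges.
toy = nx.DiGraph()
toy.add_edges_from([("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("A", "HUB")])
toy_hubs = hub_nodes(toy, 2)
print(toy_hubs)
```

For the airline network the call would be along the lines of `draw_with_hubs(G, 180)`, matching the threshold mentioned above.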
#### NOTE: you can ignore any warnings here
%% Cell type:code id: tags:
``` python
# create the layout
pos = nx.spring_layout(G)
# If you have graphviz installed you can try the following.
```
%%%% Output: stream
/usr/local/lib/python2.7/site-packages/matplotlib/collections.py:650: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison