If you recall we were looking at the titanic data, and trying to split the data using sex, class, age etc, and the name of the game is to have these splits consist of mostly survivors or non-survivors, that way when we predict that a particular group will survive, because the majority in that group do, we will be mostly right.
Consider that if of women survived compared with for men, when we make the prediction, reducing these probabilities to classes, we are throwing away a lot of information.
What we have been groping towards is a kind of crude decision tree learning and this, with some more advanced variations, is what I want to implement.
But what is a decision tree? What is a tree? What is a graph? How are we to proceed if we know no graph theory? Well we could probably get away without any. But we’re better than that. So today I will give a primer on graph theory, will this eventual goal in mind. Once we know something about graphs, we can learn about making graphs with Python, and once we’ve done that, we can try them out on the titanic data.
2. The Definition of a Graph
Here is a picture of a graph:
The red filled circles we call ‘vertices’ or sometimes ‘nodes’. The lines connecting the vertices we call ‘edges’.
A graph is basically a way of showing relationships between things, for instance, imagine that the vertices above were people, and the edges represented the relationship of holding hands, awww.
If you have come across the idea of relations before, you may be thinking of reflexivity, transitivity and symmetry.
A ‘reflexive’ edge is what we call a loop, like this:
A ‘symmetric’ edge is what we call undirected, and it just looks like a line, here’s what directed edges look like:
The beginning vertex of a directed edge is called the source, and the ending vertex is called the target.
Transitivity – Homer taller than Bart and Bart taller than Lisa implies Homer taller than Lisa – you could visualise as a triangle:
Now that we have a visual idea of graphs, let’s define them formally.
Definition 1 An undirected graph is a triple , where is a set of vertices, is a set of edges and is a mapping from to subsets of of size or .
And for directed graphs:
Definition 2 A directed graph , where are sets of vertices and edges respectively, and
If you’ve studied maths, then you’ll know that one of the key questions to ask about a new mathematical object is: what are the morphisms? Meaning, what are the functions from graphs to graphs that preserve structure.
A graph homomorphism is a function between two graphs, taking vertices to vertices and edges, that ‘respects’ the structure of the graph, in symbols: , the function commutes with the maps.
Notice that the notation suppresses the difference between the mapping in and , but we’re all grown-ups here, we get context.
As you can probably guess, for directed graphs we require and
A subgraph of a graph is a subset of its vertices, a subset of its edges, with the same mappings but that it is still a graph.
The degree of a vertex is the number of edges connected to it.
3. Types of Graphs
Now we’ll see some of the standard classes of graphs. These will all be simple graphs, meaning that there are no loops, finite numbers of vertices and edges, at most one edge between any two vertices and undirected.
3.1. Linear Graphs
These just look like a line:
3.2. Cyclic Graphs
We’ve already seen two of these, you walk around a cycle and come back to the beginning. We denote the cyclic graph with vertices by .
3.3. Complete Graphs
This is the situation where everything is connected to everything else, the most edges possible. This is denoted
A tree is connected, meaning there is only one ‘piece’ of the graph, and has no cycles, meaning no parts that look a cyclic graph.
The degree 1 vertices are called leaves, and an n-ary tree is one where each non-leaf has n ‘children’.
If you want to learn more about trees, there is an excellent post about it here.
4. Representing Graphs
One very natural way to encode the layout of a graph is an adjacency matrix. If then the matrix will be , where each row represents a vertex as a source, and the columns are vertices as targets, and the entry is the number of edges with that source and target.
So if there is a in row and column , there are two edges from vertex to vertex . Of course the matrix will depend on how you choose to order the vertices. For example, the adjacency matrix of , and the graph looks like this:
Notice that if there are no loops, the diagonal will be s, and if undirected then the matrix is symmetric about the main diagonal. Another neat property is that if is the adjacency matrix of a simple graph , then is the number of paths of length from vertex . Prove it yourself!
The problem with adjacency matrices as a data structure, is that it takes a lot of work to change it when you add or delete a vertex. Another way of doing it is with an adjacency list. This is where you have a list of the vertices, with each vertex ‘pointing’ to a list of vertices that are its targets. This can easily be implemented using dictionaries in Python. This makes it easy to find adjacent vertices, but it doesn’t really treat edges with the respect they deserve. We’ll be implementing something where both edges and vertices are objects that know about who they are connected to, this won’t be as fast as an adjacency list, but it will allow us to keep track of a lot of information.
5. Next Time
This was just to give you an idea of graphs are, and some of terminology needed to discuss them. There is loads more stuff to learn about graphs, you can colour edges and vertices, talk about paths and walks on a graph, planar graphs… but that’s all for now. Next time: object orientated programming with Python.