-
Notifications
You must be signed in to change notification settings - Fork 1
/
README
174 lines (122 loc) · 6.24 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
Author: Chris Wacek
Email: [email protected]
Date: 16 Jan, 2013
Inet Topology Tools
===================
This repository contains a set of tools used to construct network
topology graphs from traceroute data. Traceroutes from a number
of vantage points are combined to create a connected graph, where
each vertex is a point of presence connected to other points of
presence via links present in trace routes.
Points of presence can be thought to represent access points on
the internet. They consist of one or more IP addresses present in
the trace routes. IP addresses are joined to a point of presence
if the distance between them is very small (< 2.5ms) and they are
in the same AS and /24.
Installation/Requirements
-------------------------
The tools are almost entirely written in Python. They are located
in the bin directory and have a setup script for installing them.
As an aside, if you don't work with Python often, I highly
recommend installing `virtualenv`. There's a good explanation of
it [here]( http://iamzed.com/2009/05/07/a-primer-on-virtualenv/ ).
To install simply run `python setup.py install` inside the bin/
directory. The scripts, all of which begin with 'inet\_graph'
will then be usable from your path.
### Dependencies
1. **Redis**
This package consists of many different scripts which act in
concert to produce a network graph from raw traceroute data.
In order to make this work (and save a ton of time reading
and writing from disk), the code all uses the
[Redis]( http://redis.io/ ) key-value store as a backend.
Redis must be installed and running on whatever machine this
is run on. Additionally, the scripts currently just default
to using database 0 on the Redis server. If there is an
existing Redis server operating, this setting may need to be
changed.
2. **PyPy (optional)**
Some of the data processing scripts can take a long time to run.
They are much faster when run using the PyPy interpreter instead
of the standard Python interpreter. You can find more information
about, and obtain PyPy, at the [PyPy site]( http://pypy.org/ ).
Unfortunately, some of the scripts **do not support** PyPy
because they use libraries that do not work with PyPy. `lxml` is
a big culprit here. As a result, while I recommend you use PyPy
for some steps, you do not have to if the idea of switching
interpreters makes you lose sleep.
3. **Assorted Python Libraries**
The scripts use a variety of Python libraries. Rather than
specify them as a global set of dependencies however, each of
the scripts is set up to warn if a dependency is not met.
The Tools
---------
There are a number of different contained in the package, which
perform different steps of the process. This section provides a
brief overview of each of them.
### `inet_graph_process_data`
This script does the hard work of processing traceroutes into the
Redis database and then identifying and joining IP addresses into
points of presence.
It has a number of subcommands:
- `dump_ips`
Dumps the unique set of IP addresses present in a traceroute file.
- `load_IP_data`
Before points of presence can be determined information about
IP<->AS relationships is needed. This command is designed to
load that data into the database so it can be used later.
- `parse`
Parses a traceroute file into the database. This process
involves a certain amount of normalization, such as removing
negative distances and handling missing steps in a traceroute.
The specifics of normalization are found in TraceParser.py
This process can be run multiple times in parallel, although
there are diminishing returns due to IO requirements and the
ability of the Redis backend to handle multiple connections.
16 at a time is pretty effective though.
- `assign_pops`
Assigns every IP address in the database to a point of
presence. This can also be run in parallel, although there
are situations where deadlocks occur, and those have to be
deferred when run in parallel. As a result, after parallel
processing is complete it must be run one more time with the
--process_failed flag.
- `process_joins`
Joins any points of presence that were assigned separately
but which realistically should be the same point of presence.
### `inet_graph_load_as_relationships`
The sole purpose of this script is to parse the CAIDA [*Inferred
AS Relationships Dataset*]( http://http://www.caida.org/data/active/as-relationships/index.xml )
into the Redis database for use with `inet_graph_create`.
### `inet_graph_create`
Utility to produce a GraphML file from traceroute data that has
been loaded into a local Redis database by `inet_graph_process_data`.
This script performs two actions sequentially:
1. Loading the network graph from the Redis database and
attaching Tor relays, clients and destinations to it based on
the data provided.
This step also performs some slight trimming of the graph by
removing unconnected components, trimming degree-1 vertices
(dangling edges of the tree), and collapsing degree-two nodes
where it can be done without losing data.
This intermediate graph representation is saved to disk as an
GraphML file. This is important for repeatability, since
this first step can take quite a while to perform.
2. Reducing the size of the network graph. This step can load a
previously created intermediate representation using the
option or it can take place immediately following step 1.
At a high level, the network graph is reduced by removing all
links that are not on a shortest path between two points of
interest. In this case, points of interest are defined as Tor
relays, clients, and destinations.
If AS peering data is available, then each path is checked
for valley-freeness, and if it is not valley-free, then a
modified shortest-path algorithm is used to find the
shortest valley-free path.
The result of this process is written out as a GraphML file
once complete.
### `inet_graph_aspath`
A tool that has nothing to do with constructing the network
map, but given the resulting output files can be used to
determine the AS path between two IP addresses within the
network.