There are a lot of definitions for geolocation, but for argus data, geolocation is the use of argus object values for geo-relevant positioning. For pure argus data, the Layer 2 and Layer 3 network addresses contained in the flow records provide the basis for geographically placing the data. For argus data derived from Netflow data, AS numbers can be used to provide a form of net-location. Additional data used to provide relative geolocation are TTL (hops), round-trip times, and one-way delay metrics. Layer 2 and 3 network addresses don't convey any sense of where they are, but because they are supposed to be globally unique, at any given moment there should be a single physical location for each of these objects.
To provide geolocation, such as country codes or latitude/longitude (lat/lon) information, argus clients use third-party databases to map Layer 3 addresses to geo-relevant information. Argus clients support two free Internet information systems: the InterNIC databases, which provide country codes, and MaxMind's open source GeoIP database, which can provide geolocation for the registered administrator of the address space.
Country codes are fairly reliable, and many IP address locations in GeoIP are well mapped, so these free systems are very useful.
All argus clients can use the databases from the InterNIC to provide country codes. This support is triggered simply by printing either of the country code fields from the argus record. The database itself is specified in your .rarc file, and is usually stored as /usr/local/argus/delegated-ipv4-latest. The name is pretty weird, but it follows the convention the InterNIC uses for naming its files.
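The .rarc entry looks something like the line below; the variable name is taken from the sample rarc shipped with argus-clients, so verify it against the rarc.5 man page for your version:
RA_DELEGATED_IP="/usr/local/argus/delegated-ipv4-latest"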
But where did this database come from? A starter file is provided in the argus-clients distribution tarfile, in the ./support/Config directory, but it's probably stale by the time you get it. Also in ./support/Config is a shell script called ragetcountrycodes.sh. This script uses wget to retrieve the delegation databases from the various international registries and merges them together to form our database.
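Refreshing the database is simply a matter of running the script; where it leaves its output varies by version, so check the script itself before running:
% cd ./support/Config
% sh ragetcountrycodes.sh
Then copy the resulting file to the path specified in your .rarc, e.g. /usr/local/argus/delegated-ipv4-latest.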
Once the database file is in place at the path specified in your .rarc, printing country codes just works.
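As a quick check, print the source and destination country code fields from any argus data file; "sco" is the source country code field used in the examples later on this page, and "dco" is assumed here to be its destination-side counterpart:
% ra -r argus.data -s stime saddr sco daddr dco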
Argus clients can be configured and compiled to use MaxMind's open source C API libraries, which provide support for both their free and their "pay as you go" databases. To enable this support, obtain the GeoIP C API and install it on your system, following the instructions provided. After installation, configure and rebuild all the argus clients to use these databases with:
% ./configure --with-GeoIP=yes
% make clean; make
Currently, two (2) programs, ralabel() and radium(), provide support for argus record labeling. The MaxMind libraries know where their databases live, so argus client programs built with them don't need any additional configuration for the information store; they just need to know what kind of geolocation information you're interested in working with.
Geolocation data may be relevant for only a short time, so getting the data into the argus records themselves is an important support feature. Geolocation data such as country codes, lat/lon values and AS numbers have structured storage support, so you can filter on them, use them as aggregation keys, and anonymize them. Use ralabel() and its extensive configuration support to specify what geolocation data will be inserted into each argus record it encounters. The amount of data added is not huge, but it will have an impact. Most uses of this information involve some form of pipeline processing, where the geolocation data is added at some point in the pipeline so that a downstream process can act on the records that contain it, after which the data is stripped prior to storing on disk. This form of "semantic pumping" is a common practice in near real-time flow data processing.
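A minimal sketch of that pattern, assuming a live argus source on localhost:561, a ralabel.conf (described below) that enables country code labeling, and that rastrip() is used to drop the label metadata before archiving; the "-M -label" option syntax is an assumption, so check the rastrip.1 man page for your version:
% ralabel -S localhost:561 -f /tmp/ralabel.conf -w - | racluster -m sco -w /tmp/country.stats.out
% ralabel -S localhost:561 -f /tmp/ralabel.conf -w - | rastrip -M -label -w /var/log/argus/archive.out
In a real deployment the labeling would happen once, with the labeled stream fanned out to both the reporting and the archiving consumers.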
Geolocation data for country codes and AS numbers is structured. We have specific DSRs for this information, the data is stored as packed binary data, and because it lives in specific DSRs you can filter and aggregate on these values. Other geolocation data, such as lat/lon, postal address, state/region, zip and area codes, is unstructured. The unstructured data is stored as argus label metadata: ASCII text strings with a very simple syntax. These can be printed, merged, grep'ed, and stripped.
To insert geolocation data into argus data, whether it's in a file or a stream, you use either ralabel() or radium(). For specific information regarding radium() and record classification/labeling, please see the radium() documentation. ralabel() has its own ralabel.conf configuration file that turns on the various labeling features, and all the geolocation support is configured through this file. To get a feel for all the features, grab the sample ralabel.conf that came with the most recent argus-clients distribution and give it a test drive.
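For country codes, the relevant part of that file boils down to two entries. The first directive name is the one used in the example below; the name of the database-file directive varies between argus-clients releases and is an assumption here, so match both lines against the sample ralabel.conf rather than copying this sketch verbatim:
RALABEL_ARIN_COUNTRY_CODES=yes
RA_DELEGATED_IP="/usr/local/argus/delegated-ipv4-latest"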
In the general case, you will either process "primitive" argus data directly or process the data in an informal or formal pipeline. Below is a standard strategy for taking an argus file and labeling both the source and destination IPv4 addresses with geolocation data.
% ralabel -r /tmp/ipaddrs.out -f /tmp/ralabel.conf -w /tmp/ralabel.out
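To spot-check the result, print the country code fields and the label metadata from the labeled file; the field names here follow the earlier examples and assume the default ralabel.conf behavior:
% ra -r /tmp/ralabel.out -s stime saddr sco daddr dco label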
Country code databases are maintained by the Internet Corporation for Assigned Names and Numbers (ICANN). The InterNIC, which maintains the global Domain Name System registration functions, is supported by a collection of Regional Internet Registries (RIRs) that manage the allocation of IP addresses for their regions. These organizations maintain databases of the parties responsible for the IP addresses that have been allocated, and this data store provides the information for IP address/country code databases.
RIR-based country code information is a reasonably good source of geolocation, at least for locating the person or company that claims responsibility for a particular address. The actual physical location of a specific address, however, is outside the scope of what the RIRs do. Still, the information is very useful for many geolocation applications. If you want better information, commercial databases can be more accurate; many of them, however, simply repackage the RIR databases, so take this information with a grain of salt.
As an example, to generate aggregated statistics for country codes, you first insert country codes into the records themselves, using ralabel(), and then aggregate the resulting argus data stream, using racluster(). This requires a ralabel.conf configuration file that turns on RALABEL_ARIN_COUNTRY_CODES labeling. Use the sample ./support/Config/ralabel.conf file as a starting point, and uncomment the two lines that reference "ARIN". Assuming you saved the file as /tmp/ralabel.conf, run:
% racluster -M rmon -m srcid smac saddr -r daily.argus.data -w /tmp/ipaddrs.out - ipv4
% ralabel -r /tmp/ipaddrs.out -f /tmp/ralabel.conf -w /tmp/ralabel.out
% racluster -m sco -r /tmp/ralabel.out -w - | rasort -m pkts -w /tmp/country.stats.out
The first command generates the list of singular IPv4 addresses from your daily.argus.data file. The "-M rmon" option is important here, as it tells racluster() to generate stats for singular objects. The second command labels the argus records with the country code appropriate for each IP address. Then we cluster again, keyed on the "sco" (source country code) field, and sort the output by packet count.
The resulting /tmp/country.stats.out file will contain argus records representing aggregated statistics for each distinct country code found among the IPv4 addresses in your data. We limit the effort to IPv4 because the labeling currently works only for IPv4 addresses; this should be corrected soon.
The resulting output argus record file can be printed, sorted, filtered, whatever. So let's generate a report of traffic per country. Using ra() we specify the columns we want in the report; adding the "-%" option would print the counter columns as a percentage of the total, though the output below shows the raw counts. We only need three (3) digits of decimal precision here, so:
% ra -r /tmp/country.stats.out -s stime dur sco:10 pkts:10 bytes -p3
StartTime Dur sCo TotPkts TotBytes
2009/09/15.00:00:00.901 86402.672 US 9092847 8180857733
2009/09/15.15:26:35.781 6932.988 UA 37323 34939933
2009/09/15.00:01:52.826 85437.820 EU 34853 27606569
2009/09/15.12:08:29.805 41747.223 NO 5338 3510110
2009/09/15.00:01:52.388 85318.320 DE 4374 1960894
2009/09/15.00:01:52.733 85038.320 GB 2063 961983
2009/09/15.00:51:19.951 82470.445 JP 1635 646413
2009/09/15.00:19:22.518 84389.109 SE 1336 500372
2009/09/15.00:00:40.821 85499.008 CA 1233 235801
2009/09/15.00:06:17.104 86020.656 FR 1223 154310
2009/09/15.01:18:37.078 80444.398 KR 1067 74638
2009/09/15.16:35:30.421 7.551 PL 900 336890
2009/09/15.10:44:19.486 41702.855 SI 834 470894
2009/09/15.00:19:22.174 84388.914 NL 596 81189
2009/09/15.09:52:10.142 50289.852 IT 545 360437
2009/09/15.00:19:21.677 85236.164 CH 412 46950
2009/09/15.09:51:57.418 43514.777 AU 396 197180
2009/09/15.09:05:00.989 51842.965 AP 216 35912
2009/09/15.00:06:39.473 81546.859 CN 138 17668
2009/09/15.21:49:08.497 1.763 ZA 80 12868
2009/09/15.01:05:13.283 80195.633 TW 64 9190
2009/09/15.01:22:16.781 57680.957 IE 64 8344
2009/09/15.09:41:22.225 42981.516 IN 48 7456
2009/09/15.00:01:53.272 63859.348 RU 44 5401
2009/09/15.00:01:54.930 0.237 NZ 16 1800
2009/09/15.12:52:03.644 0.196 DK 16 1916
2009/09/15.23:37:05.528 0.577 LU 16 2244
2009/09/15.09:52:27.636 0.141 CS 8 864
2009/09/15.11:39:26.752 0.166 FI 8 1202
2009/09/15.09:43:33.302 0.166 BR 4 458
2009/09/15.09:43:15.856 0.105 BE 4 412
2009/09/15.07:59:15.435 12581.292 MX 4 260
2009/09/15.09:43:39.674 2.363 VE 2 150
2009/09/15.15:08:30.196 2.404 HU 2 150
2009/09/15.11:29:36.073 0.000 CO 1 63
2009/09/15.02:59:00.745 0.000 RO 1 79
2009/09/15.08:33:38.455 0.000 AR 1 418
2009/09/15.08:22:13.605 0.000 ES 1 62
2009/09/15.20:27:30.993 0.000 IL 1 92
2009/09/15.08:30:49.794 0.000 HK 1 62
AS numbers are not strictly geographic location information, but rather network location information. Each IP address resides in a single origin Autonomous System, and, as with country codes, many public and commercial databases carry geolocation information for the entity that manages each AS. Cisco's Netflow records can provide AS numbers for IP addresses, and these can be either Origin ASNs or Peer ASNs; a Peer AS is an Autonomous System that claims to be a good route for traffic headed to a particular IP address, i.e. your next-hop AS for routing. While this information is important for traffic engineering and routing, it is not useful for geolocating the IP address itself.
We use the MaxMind GeoIP Lite database to provide Origin AS Number values for IP addresses. These numbers can be filtered and aggregated, so you can generate views of argus data specific to Origin ASNs. The methods above can be used to generate data views keyed on "sas", the source AS number.
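A sketch of that, following the same pattern as the country code example above, assuming the labeled file /tmp/ralabel.out was produced with GeoIP ASN labeling enabled and that your racluster() accepts "sas" as an aggregation key:
% racluster -m sas -r /tmp/ralabel.out -w - | rasort -m pkts -w /tmp/asn.stats.out
% ra -r /tmp/asn.stats.out -s stime dur sas pkts bytes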
There are a large number of both commercial and public sources of IP address geolocation information that can provide lat/lon data. We provide programmatic support using MaxMind's open source GeoIP APIs (see above), which provide lat/lon for IP addresses. MaxMind's commercial database is reported to be of excellent quality for this information.
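Because lat/lon values ride in the label metadata rather than in a dedicated DSR, they show up when you print the "label" field. This sketch assumes a ralabel.conf with the GeoIP city/lat-lon labeling enabled:
% ralabel -r argus.data -f /tmp/ralabel.conf -w - | ra -s stime saddr daddr label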