Anonymity is a state of "namelessness", in which an object has no identifying properties. Anonymity in network data is a big topic when you consider sharing data for research or collaboration. There are laws in many countries against disclosing personal information, and many corporate, educational and governmental organizations are concerned about disclosing information about the architecture, organization and functions of their networks and information systems. But sharing data is critical for getting things done, so we intend to provide useful mechanisms for anonymizing flow data.
The strategy that we take with argus data anonymization is to preserve the information needed to convey the value of the data, and either change or throw away everything else. Because data sharing isn't always a life-or-death issue, not all uses of anonymization require 'perfect secrecy' or 'totally defendable' results. If you do require this level of protection, use ranonymize() with care and thought. We believe that with these tools you can achieve practical levels of anonymity and still retain useful data.
The IETF has developed a draft document on flow data anonymization, draft-ietf-ipfix-anon, which offers some opinions on, and descriptions of, techniques for flow data anonymization. Argus clients should, at a minimum, support all of the techniques described in this document. If we are missing something that you would like to see in flow data anonymization, please send email to the list.
The argus-client program that performs anonymization is ranonymize(). This program has a very complex configuration, as there are a lot of things that need to be considered when sharing data for any and all purposes. A sample configuration file can be found in the argus-clients distribution in ./support/Config/ranonymize.conf. This file describes each configuration variable and provides detail on what it is designed to do and how to use it. Grab this file and give it a read if you want to do something very clever.
By default, ranonymize() will anonymize network addresses, protocol-specific port numbers, timestamps, transaction reference numbers, TCP base sequence numbers, IP identifiers (ip_id), and any record sequence numbers. How it does that is described below. By default, you will get great anonymization. Great, but not "perfect", in that there are theoretical behavioral analytics that can "reverse engineer" the identifiers if another party has an understanding of even just a subset of the flow data. If you need a greater level of anonymization, you will need to "strip" some of the data elements, such as jitter and the IP attributes data elements, and/or use the configuration file to specify additional anonymization strategies.
Once you have anonymized your data, use ra() to print out all the fields in your resulting argus data, using the ra.print.all.conf configuration file in the ./support/Config directory, to see what data is left over. If you see something you don't like, run ranonymize again over the data with a ranonymize.conf file to deal with the specific item.
Argus has the unique property of supporting the capture of payload data in its flow record status reports. This feature is used for a lot of things, such as protocol identification, protocol conformance verification and validation, security policy enforcement verification, and upper-layer protocol analysis. The feature is completely configurable (by default it's turned off, of course) and you determine how much upper-layer data you want to capture.
There is only one supported anonymization strategy for User Payload Data, and that is to remove it from the argus record. This argus data element has the potential to contain exactly what most sites are worried about sharing/leaking/exposing, so we aren't going to try to do anything with this data. If you want to preserve it (why on earth would you want to do that?) write your own program.
It may be surprising that time anonymization comes before other objects, but time is so important to anyone wanting to defeat your anonymization strategy that we should deal with it first. Both the absolute time and the relative times in argus records should be considered for anonymization, and ranonymize() has lots of support for modifying time, injecting variations in time, etc. By default, ranonymize() will add a constant offset in microseconds, chosen at random, to all the timestamps in all the records in an argus data stream or file. This "Fixed Offset" style of anonymization preserves relative time, interpacket arrival, jitter and transaction duration, which, in general, are the kinds of things that you need when analyzing flow data.
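The "Fixed Offset" idea can be illustrated with a minimal Python sketch. This is not the ranonymize() implementation; the record layout (a dict with integer microsecond timestamps), the field names stime and ltime, and the helper names are assumptions made for illustration only. It shows that adding one randomly chosen constant to every timestamp changes absolute time but leaves durations and gaps between records untouched.

```python
import random


def make_fixed_offset(rng, max_shift_usecs=10**12):
    """Pick one random offset (microseconds) to use for the whole stream."""
    return rng.randrange(-max_shift_usecs, max_shift_usecs + 1)


def shift_record(record, offset_usecs):
    """Return a copy of the record with both timestamps shifted equally."""
    shifted = dict(record)
    shifted["stime"] = record["stime"] + offset_usecs
    shifted["ltime"] = record["ltime"] + offset_usecs
    return shifted


# Hypothetical records: start and last times as integer microseconds.
records = [
    {"stime": 1_331_662_959_000_000, "ltime": 1_331_662_959_250_000},
    {"stime": 1_331_662_960_100_000, "ltime": 1_331_662_961_400_000},
]

offset = make_fixed_offset(random.Random())
anonymized = [shift_record(r, offset) for r in records]

for before, after in zip(records, anonymized):
    # Transaction duration is the same before and after the shift.
    print(before["ltime"] - before["stime"], after["ltime"] - after["stime"])
```

Because the same offset is applied everywhere, anyone who learns the true time of a single record can recover the offset for the whole file, which is one reason the additional time variation options exist.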
The next most important objects in argus data for anonymization are the network addresses. Argus has the unique property, currently, of supporting the capture of many encapsulation identifiers at the same time. Argus can have Ethernet addresses, InfiniBand addresses, tunnel identifiers (GRE, ESP SPI), etc. With regard to anonymization, each of these can provide some form of identification. The most important are the Layer 2 addresses that argus can optionally contain. These addresses are unique to the end system; whether that is a router/switch, a cell phone using Wi-Fi, a laptop or a workstation, the Ethernet address is the most identifying information in the flow record.
ranonymize() has the ability to anonymize the entire address, or portions of the address, in order to preserve certain semantics. Ethernet addresses are interesting in that they contain a Vendor identifier followed by a completely unique station identifier. There are situations where you may want to preserve the Vendor ID, say to convey to the recipient of the anonymized data that the flows are going through a Netgear Wireless Router on one side and a Juniper Router on the other. But you will still want to anonymize the Station ID.
For IP addresses, which are composed of a Network address and a Host address, ranonymize() supports anonymizing the two parts independently. This is important because you may want to preserve the Network address hierarchy, i.e. two different IP addresses that are in the same Network could be anonymized to have the same Network address part, but different Host address parts. We discuss the various strategies for anonymization below.
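As a rough illustration of keeping the Vendor ID while hiding the Station ID, the Python sketch below replaces the last three octets of an Ethernet address with a keyed hash. The keyed-hash approach, the function name anonymize_mac and the key are assumptions for this example; the source does not say how ranonymize() derives its replacement values.

```python
import hashlib
import hmac


def anonymize_mac(mac, key, keep_vendor=True):
    """Replace the station part (or all) of a MAC with keyed-hash bytes."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected six colon-separated octets")
    # The same key and address always give the same pseudonym.
    digest = hmac.new(key, octets, hashlib.sha256).digest()
    if keep_vendor:
        out = octets[:3] + digest[:3]   # keep vendor prefix, hide station
    else:
        out = digest[:6]                # hide the entire address
    return ":".join(f"{b:02x}" for b in out)


key = b"example-secret-key"  # hypothetical key; keep the real one private
print(anonymize_mac("00:1b:2f:aa:bb:cc", key))
print(anonymize_mac("00:1b:2f:aa:bb:cc", key, keep_vendor=False))
```

Truncating the hash to three octets means two stations could, in principle, map to the same pseudonym; a lookup table of randomly assigned values would avoid that at the cost of keeping state.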
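The effect of anonymizing the Network and Host parts independently can be sketched in Python as follows. The fixed /24 split, the 10.0.0.0 base, the class name and the sequential numbering are all assumptions chosen to keep the example small and deterministic; they are not taken from ranonymize(). The point is only that two addresses sharing a network keep sharing one after anonymization.

```python
import ipaddress


class IPv4Anonymizer:
    """Map network and host parts through separate lookup tables."""

    def __init__(self, prefix_len=24, base="10.0.0.0"):
        self.host_bits = 32 - prefix_len
        self.base = int(ipaddress.IPv4Address(base))
        self.net_map = {}    # original network -> replacement index
        self.host_map = {}   # original network -> {host -> replacement}

    def anonymize(self, addr):
        value = int(ipaddress.IPv4Address(addr))
        net = value >> self.host_bits
        host = value & ((1 << self.host_bits) - 1)
        # Each new network or host gets the next unused index.
        new_net = self.net_map.setdefault(net, len(self.net_map))
        hosts = self.host_map.setdefault(net, {})
        new_host = hosts.setdefault(host, len(hosts))
        new_value = self.base + (new_net << self.host_bits) + new_host
        return str(ipaddress.IPv4Address(new_value))


anon = IPv4Anonymizer()
for address in ["192.168.1.10", "192.168.1.77", "172.16.5.10"]:
    print(address, "->", anon.anonymize(address))
```

Sequential assignment reveals the order in which addresses were first seen, so a real deployment would more likely assign replacements at random.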
In any event, all network station identifiers should be considered for anonymization.
The next most important objects in argus data for anonymization are the sequence numbers. Argus records contain a lot of sequence numbers that are copied from the packets themselves. Argus does this to support calculations of loss, but also to aid in identifying network traffic at multiple points in the network. This is, of course, the very condition that we need to protect ourselves from, so all protocol sequence numbers, such as TCP, ESP and DNS transactional sequence numbers, and even the IP fragmentation identifier, need to be anonymized. This is done by default, and you do have some control over this in the configuration file.
The next most important objects in argus data for anonymization are the Service Access Port (SAP) numbers. Most SAPs are not identifiable objects. They are well-known port numbers or protocol numbers, which are so ubiquitous that having the information is not useful, or they have only local significance. But some port numbers, such as the dynamically allocated UDP and TCP Private Ports, can be fairly distinctive and should be considered for anonymization.
ranonymize() provides a lot of flexibility in anonymizing port numbers, because the port numbers have significance to the receiver of the anonymized data: they want to know what services are being referenced, etc.
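One way to hide absolute sequence numbers while keeping them usable for loss calculations is to add a random offset modulo the size of the field. The Python sketch below assumes this offset technique and uses hypothetical helper names; the source does not state which method ranonymize() applies.

```python
import random

TCP_SEQ_MOD = 2**32   # TCP sequence numbers are 32 bits wide
IP_ID_MOD = 2**16     # the IP identifier is 16 bits wide


def anonymize_seq(seq, offset, modulus=TCP_SEQ_MOD):
    """Shift a sequence number by an offset, wrapping at the field size."""
    return (seq + offset) % modulus


def seq_delta(first, last, modulus=TCP_SEQ_MOD):
    """Distance from first to last, allowing for wrap-around."""
    return (last - first) % modulus


# Assumption: a separate random offset is drawn for each flow.
offset = random.Random().randrange(TCP_SEQ_MOD)
base_seq, last_seq = 1_000, 500_000_000

anon_base = anonymize_seq(base_seq, offset)
anon_last = anonymize_seq(last_seq, offset)

# The two distances printed here are equal.
print(seq_delta(base_seq, last_seq), seq_delta(anon_base, anon_last))
```

Using a different offset for each flow matters here: a single offset shared by all flows would let an observer who knows one real sequence number undo the rest.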
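A simple port strategy consistent with the discussion above is to leave service ports alone and remap only the dynamically allocated range. The Python sketch below assumes the IANA dynamic/private range of 49152 to 65535 and a random permutation within that range; the function names are hypothetical and this is not a description of the ranonymize() configuration options.

```python
import random

PRIVATE_FIRST, PRIVATE_LAST = 49152, 65535  # IANA dynamic/private ports


def build_private_port_map(rng):
    """Build a random one-to-one mapping over the private port range."""
    ports = list(range(PRIVATE_FIRST, PRIVATE_LAST + 1))
    shuffled = ports[:]
    rng.shuffle(shuffled)
    return dict(zip(ports, shuffled))


def anonymize_port(port, port_map):
    """Remap private ports; return every other port unchanged."""
    return port_map.get(port, port)


port_map = build_private_port_map(random.Random())
for port in (53, 80, 443, 50000, 61234):
    print(port, "->", anonymize_port(port, port_map))
```

Because the mapping is a permutation, two distinct private ports never collapse into one, and the receiver can still tell service ports from ephemeral ones.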
Page Last Modified: 14:22:39 EDT 13 Mar 2012 ©Copyright 2000 - 2012 QoSient, LLC. All Rights Reserved.