Python is a favorite language and development environment for data scientists. Here, the Argus Project is developing and documenting a Python environment for processing Argus-based network flow data. Structured as a set of libraries, programs, workflows and processes, it aims to provide a way to take any network flow data, imported as Argus flow data, and analyze it using Python. Where possible, the Argus Project will use Python community projects as its starting point, focusing on Python, Anaconda, and Pandas.
In these examples, we'll touch on the principal features of a Python-based ML workflow. First, we need to start Python ...
Start Python
For most of these examples, we'll use Python 3.8.0 on Mac OS X Catalina. No particular reason, other than the Mac is my primary development system.
apophis:~ carter$ python3
Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Import the Necessary Packages
The imports below work only if the packages are installed on your system. An error such as 'ModuleNotFoundError' means a package is missing or was not installed correctly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
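If you would rather check for missing packages before importing them, something like the following sketch may help. It uses only the standard library; the helper name `missing_packages` and the `REQUIRED` list are assumptions made for this illustration.

```python
import importlib.util

# Top-level package names used by the imports in this example (assumed list).
REQUIRED = ["numpy", "pandas", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Anything reported as not installed needs to be added with your usual package manager before continuing.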
The fields available for import are determined when you generate the Argus data CSV. The columns, their order, and their format are not fixed: you can define as many as you want and import as much as is useful. A little parsimony goes a long way toward reducing import processing time, so import only what you need, especially when working with millions of rows of data.
Column names are specific to Argus, but can be modified in the CSV prior to import. The names matter, because almost every later step in this example refers to columns by name.
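As a rough sketch of the import step, the CSV can be read with `pd.read_csv`, optionally restricted to the columns you need. The sample rows, the `load_flows` helper, and the `WANTED` list below are made up for illustration (the rows are modeled on the columns used in this example). In practice you would pass the path of your own Argus CSV instead of the in-memory sample.

```python
import io

import pandas as pd

# Made-up sample rows for illustration; a real run would read your own CSV file.
SAMPLE_CSV = """StartTime,Flgs,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,TotPkts,TotBytes,State
1571803000.0,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT
1571803001.0,e,tcp,192.168.0.203,41888,->,204.12.217.19,80,1,74,REQ
1571803002.0,e,arp,192.168.0.154,,who,192.168.0.109,,1,60,INT
"""

# Hypothetical subset of columns to keep when parsimony matters.
WANTED = ["StartTime", "Proto", "SrcAddr", "DstAddr", "Dport", "TotPkts", "TotBytes"]


def load_flows(source, columns=None):
    """Read Argus CSV data; `columns=None` imports every column."""
    # Ports are read as strings because some rows (e.g. arp) have no port.
    return pd.read_csv(source, usecols=columns, dtype={"Sport": str, "Dport": str})


df = load_flows(io.StringIO(SAMPLE_CSV))             # all columns
small = load_flows(io.StringIO(SAMPLE_CSV), WANTED)  # only what is needed
print(df.shape, small.shape)
```

Limiting `usecols` reduces both parse time and memory, which matters when the file holds millions of rows.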
## The shape is the number of rows and columns
print(df.shape)
(4953136, 11)
print(df.head(5))
      StartTime Flgs Proto        SrcAddr  Sport  Dir        DstAddr Dport  TotPkts  TotBytes State
0  1.571803e+09    e   udp  192.168.0.203  59805   ->    192.168.0.1    53        1        85   INT
1  1.571803e+09    e   udp  192.168.0.203  59805   ->    192.168.0.1    53        1        85   INT
2  1.571803e+09    e   tcp  192.168.0.203  41888   ->  204.12.217.19    80        1        74   REQ
3  1.571803e+09    e   udp  192.168.0.150  13470   ->    192.168.0.1    53        1        88   INT
4  1.571803e+09    e   arp  192.168.0.154         who  192.168.0.109              1        60   INT
Begin Exploring
Once the data is imported into Pandas, there really isn't any limitation to what you can do with the data.
These examples are for illustration only; they are not meant to suggest that some data elements are more important than others.
## Get Unique Destination Ports
print(df["Dport"].unique())
['53' '80' ... '16560' '50563' '57716']
## Get Distinct Destination Addresses with count
print(df.groupby("DstAddr").size())
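With millions of flows, the full list of destination addresses can be long, so it may be more useful to look at only the busiest ones. The sketch below swaps in `value_counts()`, which gives the same per-address counts as the `groupby(...).size()` call but sorted from most to least frequent. The sample data and the `top_destinations` helper are made up for illustration.

```python
import pandas as pd

# Made-up sample flows for illustration only.
sample = pd.DataFrame({
    "DstAddr": ["192.168.0.1", "192.168.0.1", "192.168.0.1",
                "204.12.217.19", "204.12.217.19", "192.168.0.109"],
    "Dport": ["53", "53", "53", "80", "80", "443"],
})


def top_destinations(flows, n=10):
    """Return the n most frequent destination addresses with their flow counts."""
    return flows["DstAddr"].value_counts().head(n)


top = top_destinations(sample, n=2)
n_ports = sample["Dport"].nunique()  # number of distinct destination ports
print(top)
print("distinct destination ports:", n_ports)
```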
Box and whisker plots (box plots, for short) are an exploratory visualization technique that shows the range and distribution of a single variable. The middle line indicates the median of the distribution. The top and bottom sides of the box indicate the 75th and 25th percentiles of the data distribution. The whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the box, where the IQR is the 75th percentile minus the 25th percentile. Outliers are shown as points outside the whiskers. For DNS queries, flows are limited to two packets (a request and a response), but can vary in the number of bytes in each flow.
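The quantities described above can be computed directly, which makes it easier to see what the plot is drawing. The following is a minimal sketch: it assumes DNS flows can be selected by `Dport` equal to "53", and the sample byte counts and the helper names `box_stats` and `plot_dns_bytes` are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def box_stats(values, whis=1.5):
    """Compute the quartiles, IQR, whisker limits and outliers of a Series."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    outliers = values[(values < low) | (values > high)]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "low": low, "high": high, "outliers": list(outliers)}


def plot_dns_bytes(flows):
    """Draw a box plot of TotBytes for flows whose destination port is 53."""
    dns = flows[flows["Dport"].astype(str) == "53"]
    ax = dns.boxplot(column="TotBytes", whis=1.5)
    ax.set_ylabel("bytes per flow")
    ax.set_title("TotBytes for flows to port 53")
    return ax


# Made-up byte counts for illustration only.
sample = pd.DataFrame({
    "Dport": ["53"] * 8,
    "TotBytes": [85, 85, 88, 90, 92, 95, 100, 300],
})
stats = box_stats(sample["TotBytes"])
print(stats)
```

Calling `plot_dns_bytes(df)` followed by `plt.show()` would display the corresponding plot. In the sample, the 300-byte flow falls above the upper whisker limit and would be drawn as an outlier point.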
A scatter matrix is another form of exploratory data visualization. It shows the distribution of each variable on the diagonal and the pairwise relationship between variables in the off-diagonal cells. As expected, bytes and packets for a flow are highly correlated.
## Plot a Scatter Matrix
scatter_matrix(df[["TotPkts", "TotBytes"]])
plt.show()
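To put a number on the relationship that the scatter matrix shows visually, one option is the Pearson correlation coefficient from `DataFrame.corr()`. This is a small sketch; the sample values and the `pkts_bytes_correlation` helper are made up for illustration.

```python
import pandas as pd


def pkts_bytes_correlation(flows):
    """Return the Pearson correlation between TotPkts and TotBytes."""
    corr = flows[["TotPkts", "TotBytes"]].corr()  # 2x2 correlation matrix
    return corr.loc["TotPkts", "TotBytes"]


# Made-up sample flows for illustration only.
sample = pd.DataFrame({
    "TotPkts": [1, 2, 4, 8, 16],
    "TotBytes": [85, 170, 300, 700, 1300],
})
r = pkts_bytes_correlation(sample)
print(f"correlation between packets and bytes: {r:.3f}")
```

A value close to 1 indicates that byte counts rise almost linearly with packet counts.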