For most of these examples, we'll use Python 3.8.0 on macOS Catalina, for no particular reason other than that the Mac is my primary development system.
apophis:~ carter$ python3
Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Import the Necessary Packages
These imports work only if the packages are installed on your system. An error such as 'ModuleNotFoundError' means that a package is missing or not installed correctly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
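If you would rather check for the packages before importing them, a minimal sketch using only the standard library might look like the following. The `required` list simply mirrors the imports above; nothing here is specific to Argus.

```python
import importlib.util

# Top-level package names used in this walkthrough.
required = ["numpy", "pandas", "matplotlib"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any name reported as missing would need to be installed (for example with pip) before the imports succeed.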
Import the Data to Pandas
## Read CSV Argus data from a file
filename = "argus-output.csv"
df = pd.read_csv(filename)
The fields in an Argus CSV are chosen when you generate the file. The columns, their order, and their formats are not fixed: you can define as many as you want and import as much as is useful. A little parsimony goes a long way toward shorter import times, so import only what you need, especially when working with millions of rows of data.
Column names are specific to Argus, but can be modified in the CSV prior to import. The names matter because almost every later step in this example refers to them.
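As a rough illustration of both points, the sketch below loads only a few columns and renames two of them after import rather than editing the CSV. The inline sample is a hypothetical stand-in for argus-output.csv built from three of the flows shown in this example, the new column names are made up for illustration, and reading `Dport` as a string is an assumption that matches how ports are compared later on.

```python
import io
import pandas as pd

# Hypothetical stand-in for argus-output.csv (three sample flows).
sample_csv = io.StringIO(
    "StartTime,Flgs,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,TotPkts,TotBytes,State\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,tcp,192.168.0.203,41888,->,22.214.171.124,80,1,74,REQ\n"
    "1.571803e+09,e,udp,192.168.0.150,13470,->,192.168.0.1,53,1,88,INT\n"
)

# Load only the columns needed for the analysis.
wanted = ["SrcAddr", "DstAddr", "Dport", "TotPkts", "TotBytes"]
subset = pd.read_csv(sample_csv, usecols=wanted, dtype={"Dport": str})

# Rename after import; "packets" and "bytes" are illustrative names only.
subset = subset.rename(columns={"TotPkts": "packets", "TotBytes": "bytes"})
print(subset.columns.tolist())
```

With a real file, the same `usecols` argument would be passed alongside the filename, which avoids parsing columns that are never used.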
## The shape is the number of rows and columns
      StartTime Flgs Proto        SrcAddr  Sport  Dir         DstAddr Dport  TotPkts  TotBytes State
0  1.571803e+09    e   udp  192.168.0.203  59805   ->     192.168.0.1    53        1        85   INT
1  1.571803e+09    e   udp  192.168.0.203  59805   ->     192.168.0.1    53        1        85   INT
2  1.571803e+09    e   tcp  192.168.0.203  41888   ->  22.214.171.124    80        1        74   REQ
3  1.571803e+09    e   udp  192.168.0.150  13470   ->     192.168.0.1    53        1        88   INT
4  1.571803e+09    e   arp  192.168.0.154         who   192.168.0.109           1        60   INT
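The listing above looks like the first five rows of the DataFrame. A minimal sketch of how to inspect the size and the first rows, assuming the standard `shape` attribute and `head()` method, is shown here. The inline sample is a hypothetical stand-in for argus-output.csv that reuses the five rows above, and reading the port columns as strings is an assumption.

```python
import io
import pandas as pd

# Hypothetical stand-in for argus-output.csv (the five rows listed above).
sample_csv = io.StringIO(
    "StartTime,Flgs,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,TotPkts,TotBytes,State\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,tcp,192.168.0.203,41888,->,22.214.171.124,80,1,74,REQ\n"
    "1.571803e+09,e,udp,192.168.0.150,13470,->,192.168.0.1,53,1,88,INT\n"
    "1.571803e+09,e,arp,192.168.0.154,,who,192.168.0.109,,1,60,INT\n"
)
df = pd.read_csv(sample_csv, dtype={"Sport": str, "Dport": str})

# shape is a (rows, columns) tuple.
print(df.shape)

# head() returns the first five rows by default.
print(df.head())
```

Note that the ARP row has no ports, so pandas reports those cells as missing values.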
Once the data is imported into Pandas, there really isn't any limitation to what you can do with the data.
These examples are for illustration only; they are not meant to suggest that any particular set of data elements is more important than others.
## Get Unique Destination Ports
['53' '80' ... '16560' '50563' '57716']
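An array of distinct values like the one above can be obtained with `Series.unique()`. The sketch below is one plausible way to do it, run against a hypothetical inline sample rather than the full capture; dropping missing ports first is a choice made here, not something the original output requires.

```python
import io
import pandas as pd

# Hypothetical stand-in for argus-output.csv (five sample flows).
sample_csv = io.StringIO(
    "StartTime,Flgs,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,TotPkts,TotBytes,State\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,tcp,192.168.0.203,41888,->,22.214.171.124,80,1,74,REQ\n"
    "1.571803e+09,e,udp,192.168.0.150,13470,->,192.168.0.1,53,1,88,INT\n"
    "1.571803e+09,e,arp,192.168.0.154,,who,192.168.0.109,,1,60,INT\n"
)
df = pd.read_csv(sample_csv, dtype={"Sport": str, "Dport": str})

# Drop flows without a port (such as ARP), then list each port once,
# in order of first appearance.
ports = df["Dport"].dropna().unique()
print(ports)
```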
## Get Distinct Destination Addresses with count
DstAddr
0.0.0.1              3833
0.0.0.2                92
                     ...
ff15::efc0:988f      3780
ff:ff:ff:ff:ff:ff     759
Length: 19393, dtype: int64
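The output above is an integer Series indexed by `DstAddr`, which is what grouping by that column and taking the group sizes produces. The following is a sketch of that approach on a hypothetical inline sample; it is one plausible call, not necessarily the one used to generate the listing.

```python
import io
import pandas as pd

# Hypothetical stand-in for argus-output.csv (five sample flows).
sample_csv = io.StringIO(
    "StartTime,Flgs,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,TotPkts,TotBytes,State\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,tcp,192.168.0.203,41888,->,22.214.171.124,80,1,74,REQ\n"
    "1.571803e+09,e,udp,192.168.0.150,13470,->,192.168.0.1,53,1,88,INT\n"
    "1.571803e+09,e,arp,192.168.0.154,,who,192.168.0.109,,1,60,INT\n"
)
df = pd.read_csv(sample_csv, dtype={"Sport": str, "Dport": str})

# Count flows per destination address; the result is sorted by address string.
counts = df.groupby("DstAddr").size()
print(counts)

# Sorting by count instead puts the busiest destinations first.
print(counts.sort_values(ascending=False).head())
```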
Box-and-whisker plots (box plots, for short) are an exploratory visualization technique that shows the range and distribution of a single variable. The middle line marks the median of the distribution. The top and bottom edges of the box mark the 75th and 25th percentiles. The whiskers extend up to 1.5 times the interquartile range (IQR), which is the 75th percentile minus the 25th percentile, beyond the box edges. Outliers are shown as points outside the whiskers. For DNS queries, flows are limited to two packets (a request and a response), but can vary in the number of bytes in each flow.
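To make the arithmetic behind a box plot concrete, here is a small sketch that computes the same quantities by hand. The byte counts are made-up values chosen for illustration, not taken from the capture, and the 1.5 multiplier follows the convention described above (it is also matplotlib's default).

```python
import numpy as np

# Hypothetical byte counts for a handful of flows; not real capture data.
total_bytes = np.array([74, 85, 85, 88, 90, 95, 102, 110, 640])

# Box edges and median.
q1, median, q3 = np.percentile(total_bytes, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the box edges; each whisker stops at the
# most extreme data point that is still inside its fence.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = total_bytes[(total_bytes >= lower_fence) & (total_bytes <= upper_fence)]
outliers = total_bytes[(total_bytes < lower_fence) | (total_bytes > upper_fence)]

print("median:", median, "IQR:", iqr)
print("whiskers:", inside.min(), "to", inside.max())
print("outliers:", outliers)
```

In this toy data set the single 640-byte flow falls outside the upper fence and would be drawn as an individual point.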
## Plot Box Plots
dns = df[df["Dport"] == "53"]
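One way to draw the box plots for the DNS subset is sketched below, using `DataFrame.boxplot` with one panel per column because packets and bytes are on very different scales. The inline sample is a hypothetical stand-in for argus-output.csv, and the Agg backend is selected only so the sketch runs without a display.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for argus-output.csv (five sample flows).
sample_csv = io.StringIO(
    "StartTime,Flgs,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,TotPkts,TotBytes,State\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1.571803e+09,e,tcp,192.168.0.203,41888,->,22.214.171.124,80,1,74,REQ\n"
    "1.571803e+09,e,udp,192.168.0.150,13470,->,192.168.0.1,53,1,88,INT\n"
    "1.571803e+09,e,arp,192.168.0.154,,who,192.168.0.109,,1,60,INT\n"
)
df = pd.read_csv(sample_csv, dtype={"Sport": str, "Dport": str})

# Keep only flows to destination port 53 (ports are strings here).
dns = df[df["Dport"] == "53"]

# One panel per variable so each gets its own y-axis scale.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
dns.boxplot(column="TotPkts", ax=axes[0])
dns.boxplot(column="TotBytes", ax=axes[1])
fig.suptitle("DNS flows (Dport 53)")
```

In an interactive session, calling `plt.show()` afterwards displays the figure.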
A scatter matrix is another form of exploratory data visualization that shows the distribution of each variable on the diagonal and the relationship between each pair of variables in the off-diagonal panels. As expected, bytes and packets for a flow are highly correlated.
## Plot a Scatter Matrix
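A scatter matrix of packets against bytes can be sketched with `scatter_matrix` from `pandas.plotting`. The flow values below are made up for illustration so that the packet counts vary; they are not taken from the capture. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in an interactive session
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical flows with varying sizes; not real capture data.
flows = pd.DataFrame(
    {
        "TotPkts": [1, 2, 2, 4, 6, 10],
        "TotBytes": [85, 170, 180, 420, 610, 1100],
    }
)

# Histograms on the diagonal, pairwise scatter plots elsewhere.
axes = scatter_matrix(flows, figsize=(6, 6), diagonal="hist")

# A single number to go with the picture: the Pearson correlation.
corr = flows["TotPkts"].corr(flows["TotBytes"])
print("correlation:", round(corr, 3))
```

With the real DataFrame, the same call would be made on `df[["TotPkts", "TotBytes"]]`, followed by `plt.show()` in an interactive session.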