Python Data Analysis

Exploratory Data Analysis

Python is a popular language and development environment among data scientists. The Argus Project is developing and documenting a Python environment for processing Argus-based network flow data. Structured as a set of libraries, programs, workflows and processes, it aims to provide a way to take any network flow data, import it as Argus flow data, and analyze it using Python. Where possible, the Argus Project will use Python community projects as its starting point, focusing on Python, Anaconda, and Pandas.

In these examples, we'll touch on the principal features of a Python-based ML workflow. First, we need to start up Python ...

Start Python

For most of these examples, we'll use Python 3.8.0 on macOS Catalina. No particular reason, other than that the Mac is my primary development system.

apophis:~ carter$ python3
Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27) 
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>>

Import the Necessary Packages

This works if the packages are installed on your system. An error such as 'ModuleNotFoundError' indicates that a package is missing or not installed correctly.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
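If you would rather check for the packages before importing them, the following is a minimal sketch using only the standard library. The helper name missing_packages is hypothetical and is not part of Argus or Pandas.

```python
import importlib.util

# Top-level packages used by the examples on this page
required = ["numpy", "pandas", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be located on the current Python path.

    Hypothetical helper for illustration; it only looks the package up,
    it does not import it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Any name reported as not installed can usually be added with your package manager of choice (for example pip or conda) before retrying the imports.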

Import the Data to Pandas

## Read CSV Argus data from a file

filename = "argus-output.csv"
df = pd.read_csv(filename)

Check the Data Import

The fields in an Argus data import are specified when you generate the Argus data CSV. The columns, their order, and their format are not fixed: you can define as many as you want and import as much as is useful. A little parsimony goes a long way toward reducing import processing times, so import only what you need, especially when working with millions of rows of data.
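One way to apply that parsimony on the Pandas side is to restrict the import with the usecols and dtype arguments of read_csv. This is a sketch under assumptions: the in-memory sample stands in for a real Argus CSV file, its rows are illustrative only, and the choice of columns and types is arbitrary.

```python
import io

import pandas as pd

# Small in-memory stand-in for an Argus CSV file (illustrative rows only)
sample_csv = io.StringIO(
    "StartTime,Flgs,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,TotPkts,TotBytes,State\n"
    "1571803000.1,e,udp,192.168.0.203,59805,->,192.168.0.1,53,1,85,INT\n"
    "1571803000.2,e,tcp,192.168.0.203,41888,->,204.12.217.19,80,1,74,REQ\n"
)

# Read only the columns needed for the analysis
wanted = ["StartTime", "Proto", "SrcAddr", "DstAddr", "Dport", "TotPkts", "TotBytes"]

df_small = pd.read_csv(
    sample_csv,
    usecols=wanted,
    # Explicit types avoid per-column type inference; ports stay as strings
    # because the column can be blank for some protocols
    dtype={"Proto": "category", "Dport": str, "TotPkts": "int64", "TotBytes": "int64"},
)

print(df_small.shape)
print(df_small.dtypes)
```

With a real file, replace sample_csv with the file name. Storing low-cardinality text columns such as Proto as a category can also reduce memory use on large imports.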

Column names are specific to Argus, but can be modified in the CSV prior to import. The names matter because almost all of the following steps in this example reference them.
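As an alternative to editing the CSV, columns can also be renamed after import. The sketch below uses DataFrame.rename on a tiny hand-built frame; the new names (src_ip, dst_ip, bytes) are hypothetical, and the rest of this page keeps the original Argus names.

```python
import pandas as pd

# Tiny frame using Argus column names (values are illustrative only)
frame = pd.DataFrame(
    {"SrcAddr": ["192.168.0.203"], "DstAddr": ["192.168.0.1"], "TotBytes": [85]}
)

# rename() returns a new DataFrame; the original is left unchanged
renamed = frame.rename(
    columns={"SrcAddr": "src_ip", "DstAddr": "dst_ip", "TotBytes": "bytes"}
)

print(list(renamed.columns))
```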

## The shape is the number of rows and columns
print(df.shape)

(4953136, 11)

print(df.head(5))

     StartTime  Flgs Proto        SrcAddr Sport  Dir       DstAddr Dport  TotPkts TotBytes State
0 1.571803e+09   e     udp  192.168.0.203 59805   ->   192.168.0.1 53           1       85   INT
1 1.571803e+09   e     udp  192.168.0.203 59805   ->   192.168.0.1 53           1       85   INT
2 1.571803e+09   e     tcp  192.168.0.203 41888   -> 204.12.217.19 80           1       74   REQ
3 1.571803e+09   e     udp  192.168.0.150 13470   ->   192.168.0.1 53           1       88   INT
4 1.571803e+09   e     arp  192.168.0.154        who 192.168.0.109              1       60   INT
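The StartTime values in this output look like Unix epoch seconds. Assuming that is how the CSV was generated, they can be converted to timestamps with pd.to_datetime. This is a minimal sketch on made-up values; the column name StartTimeUTC is hypothetical.

```python
import pandas as pd

# Illustrative epoch-second values in the same range as the output above
flows = pd.DataFrame({"StartTime": [1571803000.25, 1571803060.5]})

# Assumption: StartTime holds seconds since the Unix epoch
flows["StartTimeUTC"] = pd.to_datetime(flows["StartTime"], unit="s", utc=True)

print(flows)
```

Once the column is a datetime, the usual Pandas time-based tools (for example the .dt accessor or resampling) become available.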

Begin Exploring

Once the data is imported into Pandas, there really isn't any limitation to what you can do with the data.

These examples are for illustration only; they are not meant to steer the data scientist toward a specific set of data elements as more important than others.

## Get Unique Destination Ports
print(df["Dport"].unique())

['53' '80' ... '16560' '50563' '57716']
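The ports in this output are strings rather than integers, which is why later comparisons use "53" in quotes. If numeric comparisons are needed, one option is pd.to_numeric with errors="coerce", sketched here on a small made-up series that includes blank and missing entries.

```python
import pandas as pd

# Illustrative port values as they might appear after import:
# strings, with a blank and a missing entry
ports = pd.Series(["53", "80", "", "16560", None], name="Dport")

# Blank, missing, or non-numeric entries become NaN instead of raising an error
ports_num = pd.to_numeric(ports, errors="coerce")

# Numeric comparisons now work, e.g. selecting ports below 1024
low_ports = ports_num[ports_num < 1024]

print(ports_num)
print(low_ports.tolist())
```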

## Get Distinct Destination Addresses with count
print(df.groupby("DstAddr").size())

DstAddr
0.0.0.1               3833
0.0.0.2                 92
              ...  
                     
ff15::efc0:988f       3780
ff:ff:ff:ff:ff:ff      759
Length: 19393, dtype: int64
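With thousands of distinct addresses, a sorted view is often easier to read. The sketch below, on a small made-up frame, uses value_counts to rank destinations by flow count and a grouped sum to rank them by bytes. Both are standard Pandas calls; the data is illustrative only.

```python
import pandas as pd

# Illustrative flows; a real frame would come from pd.read_csv as above
flows = pd.DataFrame(
    {
        "DstAddr": ["192.168.0.1", "192.168.0.1", "204.12.217.19", "192.168.0.1"],
        "TotBytes": [85, 85, 74, 88],
    }
)

# Flow count per destination, largest first, limited to the top 10
top_by_flows = flows["DstAddr"].value_counts().head(10)

# Total bytes per destination, largest first
bytes_per_dst = (
    flows.groupby("DstAddr")["TotBytes"].sum().sort_values(ascending=False)
)

print(top_by_flows)
print(bytes_per_dst)
```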

Visualize the Data

Box-and-whisker plots (box plots, for short) are an exploratory visualization technique that shows the range and distribution of a single variable. The middle line indicates the median of the distribution. The top and bottom sides of the box indicate the 75th and 25th percentiles of the data distribution. The whiskers extend from the box to the most extreme data points that lie within 1.5 times the interquartile range (IQR), which is the 75th percentile minus the 25th percentile. Outliers are shown as points beyond the whiskers. DNS query flows typically contain just two packets (a request and a response), but can vary in the number of bytes in each flow.

## Plot Box Plots
dns = df[df["Dport"] == "53"]
dns[["TotPkts","TotBytes"]].plot(kind='box',subplots=True,layout=(1,2),sharex=False,sharey=False)
plt.show()
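To make the box plot geometry concrete, the quartiles, whisker ends, and outliers can be computed by hand. This sketch uses NumPy on a small made-up set of byte counts and follows the 1.5 x IQR convention, which is also Matplotlib's default.

```python
import numpy as np

# Illustrative byte counts, not taken from the Argus data set
values = np.array([60, 74, 85, 85, 88, 90, 95, 400])

# Box: 25th percentile, median, 75th percentile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these limits are drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = values[(values < lower_limit) | (values > upper_limit)]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers.tolist())
```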

A scatter matrix is another form of exploratory data visualization. It shows the distribution of each variable on the diagonal and pairwise scatter plots of the variables in the off-diagonal cells, which makes any correlation between them visible. As expected, bytes and packets for a flow are highly correlated.

## Plot a Scatter Matrix
scatter_matrix(df[["TotPkts", "TotBytes"]])
plt.show()
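The correlation seen in the scatter matrix can also be quantified. As a minimal sketch, DataFrame.corr computes the pairwise Pearson correlation coefficients; the values below are made up for illustration and are not drawn from the Argus data.

```python
import pandas as pd

# Illustrative packet and byte counts
flows = pd.DataFrame(
    {
        "TotPkts": [1, 2, 4, 10, 20],
        "TotBytes": [85, 170, 420, 9000, 21000],
    }
)

# Pairwise Pearson correlation (the default method)
corr = flows[["TotPkts", "TotBytes"]].corr()
r = corr.loc["TotPkts", "TotBytes"]

print(corr)
```

A coefficient close to 1 indicates a strong positive linear relationship between packets and bytes.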

© Copyright QoSient, LLC.
All Rights Reserved.
site by spliteye