The Argus system is the network data source of choice for many prominent Machine Learning (ML) and AI-based Network-Based Anomaly Detection (NBAD) projects. Unsupervised learning on network flow data has been an active research topic for many years, and organizations such as Oak Ridge National Laboratory (ORNL) have had great results using Argus data in their operational system, SITU.
The very large data capabilities, rich data models, flexible data formats, high performance data generation and processing, metadata enhancement capabilities, streaming and block processing strategies, and technical maturity all come together to provide an environment where successful ML and AI models can be developed, tested, optimized and deployed.
Having massive amounts of non-statistical, historical, transactional network activity data is essential for ML model development and training, and a rich set of attributes for each record is a key ingredient for deep learning. Argus is known as one of the first large sources of well-structured network data, with the right kind of attributes to let ML "peek" into network state and condition in real time: data that is actually useful and reliable.
Deploying ML in operational networks requires a stable and reliable network data platform. Argus provides the most mature streaming network situational awareness capability available, with guarantees on data timeliness, order and state, which makes it a natural choice for ML-based NBAD.
ML and networking have been an interesting pair for a few decades now, and a number of basic concepts have emerged that help the Data Scientist approach the complexities of this topic. The more data, the better. Non-statistical approaches yield the best prediction results. Data designed for the task, rather than whatever data happens to be lying around, is key to successful ML solutions.
Data generation, collection, feature engineering, establishing ground truth, model development and validation, model optimization, and deployment are all complex tasks that must be addressed when considering an operational ML approach to any problem. Of the four broad categories of problems that can leverage ML, the Argus Project focuses on three: clustering, classification and regression of network traffic flow features. Because Argus has been used in many ML network research projects, its data is a proven source of network traffic records that can deliver predictable and reliable results.
The literature contains many papers in which ML has been applied to network traffic prediction, classification, routing, operations, performance and network security. The source of data is either packets (packet-level features) or network flow data (flow-level features). As networks continue to grow in scale, complexity and speed, network flow data is the data of choice.
Traffic classification dominates the network ML literature, and Argus data is designed specifically for flow-feature-based, as well as early and sub-flow-based, traffic analytics and classification. Argus data also contains sampled payload data, so it is well suited to most payload-based traffic classification, especially encrypted traffic classification. And because Argus can be deployed in end-systems and embedded in networking devices, its data is a strong fit for host-behavior-based traffic classification as well as NFV / SDN-based classification. ML for network security leverages traffic classification and focuses on traffic anomaly detection and hybrid intrusion detection.
Argus is at the core of a number of prominent unsupervised ML projects at US National Laboratories, universities and private companies. These projects provide a glimpse of how Argus data can be used in large-scale operations to provide detection and protection for important assets.
The technology needed for effective Machine Learning in network-based anomaly detection involves developing and supporting a set of environments for Data Scientists that cover the whole ML life cycle. We're working on Argus data processing in Python, R, Matlab and Mathematica.
Getting data into the platform is just one step in the process. Many use CSV and JSON, both of which are supported in Argus. Getting streaming data into the platform, however, can be complex and difficult for some applications, and doing so for a 100G network can be very challenging.
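As a minimal sketch of the file-based case, the following Python reads flow records from CSV and from line-delimited JSON using only the standard library, and normalizes both into the same structure. The sample records and the column names (stime, proto, saddr, daddr, pkts, bytes) are illustrative assumptions, not actual Argus output; a real export will carry whichever fields were selected when the data was printed.

```python
import csv
import io
import json

# Hypothetical flow records; field names and values are illustrative only.
CSV_TEXT = """stime,proto,saddr,daddr,pkts,bytes
1700000000.10,tcp,10.0.0.1,10.0.0.2,12,3400
1700000001.25,udp,10.0.0.3,10.0.0.2,2,180
"""

JSON_LINES = """{"stime": 1700000002.5, "proto": "tcp", "saddr": "10.0.0.4", "daddr": "10.0.0.2", "pkts": 40, "bytes": 52000}
"""

NUMERIC_FIELDS = ("stime", "pkts", "bytes")


def normalize(record):
    """Convert numeric fields to float so both formats yield the same types."""
    out = dict(record)
    for name in NUMERIC_FIELDS:
        out[name] = float(out[name])
    return out


def load_csv(text):
    """Parse CSV text with a header row into a list of flow dicts."""
    return [normalize(row) for row in csv.DictReader(io.StringIO(text))]


def load_json_lines(text):
    """Parse one JSON object per line into a list of flow dicts."""
    return [normalize(json.loads(line)) for line in text.splitlines() if line.strip()]


flows = load_csv(CSV_TEXT) + load_json_lines(JSON_LINES)
total_bytes = sum(f["bytes"] for f in flows)
print(len(flows), total_bytes)
```

Normalizing types at load time keeps later feature engineering independent of which export format was used.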
If there are other basic environments that you need, please give us a holler.
There seem to be two basic strategies for effective ML network analysis, depending on whether the ML is processing data streams or block / file-based data.
Block processing, where the ML reads Argus data from files or a database table, is envisioned to support ML model development and testing.
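To make the block strategy concrete, here is a small sketch in Python: the whole data set is read into memory, a baseline is fitted over all of it, and new records are then scored against that baseline. The per-feature z-score baseline stands in for a real model, and the feature names (dur, pkts, bytes) and sample values are assumptions made for illustration.

```python
import csv
import io
import statistics

# Hypothetical block of historical flow records, standing in for a file or table.
BLOCK = """dur,pkts,bytes
0.50,10,1200
0.40,8,1000
0.60,12,1500
0.55,11,1300
0.45,9,1100
"""

FEATURES = ("dur", "pkts", "bytes")


def load_block(text):
    """Read the whole block into memory as rows of floats."""
    reader = csv.DictReader(io.StringIO(text))
    return [[float(row[name]) for name in FEATURES] for row in reader]


def fit_baseline(rows):
    """Fit per-feature mean and sample standard deviation over the full block."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def score(row, baseline):
    """Largest absolute z-score across features; higher means more unusual.

    Assumes every feature varies in the training block (non-zero deviation).
    """
    return max(abs(x - mean) / std for x, (mean, std) in zip(row, baseline))


training = load_block(BLOCK)
baseline = fit_baseline(training)

normal = score([0.5, 10, 1200], baseline)   # resembles the training data
odd = score([0.5, 10, 90000], baseline)     # unusually large byte count
print(round(normal, 2), round(odd, 2))
```

Because the full block is available up front, the same data can be re-read, re-split and re-fitted as often as needed, which is what makes this mode convenient for development and testing.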
Stream processing, where the ML is processing real-time streams of data, is envisioned for operational deployments of ML models for network data analysis.
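The streaming strategy can be sketched in the same spirit. In the Python below, records arrive one at a time, each is scored against statistics of the records seen before it, and the baseline is updated incrementally using Welford's online algorithm so memory use stays constant. The generator `flow_bytes`, the threshold, the warm-up length and the byte counts are all hypothetical; in a deployment the generator would be replaced by whatever reads the live record stream.

```python
import math


class RunningStats:
    """Welford's online algorithm for the mean and variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Sample standard deviation, or 0.0 until two values have been seen."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for records far from the running baseline.

    Each value is scored before it is folded into the baseline, and no
    alerts are raised until `warmup` records have been seen.
    """
    stats = RunningStats()
    for i, value in enumerate(stream):
        std = stats.std()
        if stats.n >= warmup and std > 0 and abs(value - stats.mean) / std > threshold:
            yield i, value
        stats.update(value)


def flow_bytes():
    # Hypothetical per-flow byte counts arriving in order.
    yield from [1200, 1000, 1500, 1300, 1100, 1250, 90000, 1150]


alerts = list(detect(flow_bytes()))
print(alerts)
```

Note that this sketch folds flagged values into the baseline as well; an operational detector would likely exclude or down-weight them so that one outlier does not mask the next.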