JUMPSEC works on a prototype lightweight anomaly detection system
Deploying machine learning models in the cyber security industry is complicated, especially under budget and technology constraints. In anomaly detection in particular, there has been much debate over privacy, balance, budget, robustness, cloud security and reliable implementation.
For cyber security companies using machine learning technologies, ensuring clients’ safety with trustworthy artificial intelligence (AI) must always be the primary objective.
For more sensitive organisations and departments, data privacy and security matter even more, which complicates the use of cloud resources.
Meanwhile, finding an affordable, tailored way to implement machine learning is certainly a challenge for most small-to-medium sized businesses.
As machine learning is increasingly adopted, Machine Learning Operations (MLOps) has emerged to bridge the gap between AI software and traditional software implementation standards.
As an extension of DevOps, MLOps applies the principles of versioning, testing, automation, reproducibility, deployment and monitoring at three levels: data, ML model and code. While MLOps has developed into a standard for the design and deployment of machine learning software, in practice it has not been widely adopted.
This article briefly explores open-source solutions that use Apache Airflow (Airflow) to deploy an anomaly detection model following MLOps principles. Airflow helps visually manage the whole process, from feature extraction to detection, monitoring and debugging. Its Directed Acyclic Graphs (DAGs) [1] and scheduler let users link every step together and run multiple tasks in parallel on a regular schedule.
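To illustrate what a DAG gives a pipeline like this, the sketch below uses Python's standard `graphlib` module rather than Airflow itself, so it runs without an Airflow installation. The task names are hypothetical stages of a log-analysis pipeline, not JUMPSEC's actual tasks. Tasks that land in the same stage have no dependencies on one another, so a scheduler may run them in parallel; in Airflow, the same dependencies would be declared between tasks with the `>>` operator inside a DAG definition.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "parse_logs": {"export_logs"},
    "extract_features": {"parse_logs"},
    "detect_anomalies": {"extract_features"},
    "write_report": {"detect_anomalies"},
    "update_dashboard": {"detect_anomalies"},
}


def parallel_stages(graph):
    """Group tasks into stages whose members do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)  # mark them finished so their successors unlock
    return stages


for stage in parallel_stages(PIPELINE):
    print(stage)
```

Here `write_report` and `update_dashboard` both wait only on `detect_anomalies`, so they end up in the same final stage and could run concurrently.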
In anomaly detection, logs contain the most valuable information that can be used to find security threats and anomalous behaviours.
Combining Airflow with the advanced DeepLog [2] framework to detect anomalous logs, JUMPSEC has been working on a prototype lightweight deployment of a complex anomaly detection system in a non-cloud environment.
When undertaking this, the main objectives were a) scalability, b) flexibility and extensibility, c) integration, d) automation, e) maintainability, f) privacy and data security, and g) traceability and continuous improvement.
By creating custom DAGs using Airflow, we can now eliminate 70% of the manual effort required for analysts to identify threats, with a detection time of one hour or less.
An online learning approach that gathers user feedback improves detection accuracy on new data and keeps the models regularly updated, so they stay current and accurate without extra data storage or computing cost.
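A minimal sketch of how analyst feedback can update a detector online. It follows the DeepLog idea of predicting the next log key from a window of h preceding keys and flagging a key that is not among the top g candidates, but swaps the recurrent network for simple frequency counts to keep it short. The class and method names are hypothetical. When an analyst marks a flagged entry as benign, `feedback` folds that transition into the model immediately, with no retraining from scratch and no need to keep the raw logs.

```python
from collections import Counter, defaultdict


class NextKeyModel:
    """Count-based stand-in for DeepLog's next-log-key predictor (no neural network)."""

    def __init__(self, window=2, top_g=2):
        self.window = window  # h: number of preceding log keys used as context
        self.top_g = top_g    # g: a key is normal if it is among the g likeliest
        self.counts = defaultdict(Counter)

    def _windows(self, keys):
        # Yield (position, history, next_key) for every full window in the sequence.
        for i in range(self.window, len(keys)):
            yield i, tuple(keys[i - self.window:i]), keys[i]

    def fit(self, keys):
        for _, history, nxt in self._windows(keys):
            self.counts[history][nxt] += 1

    def detect(self, keys):
        """Return positions whose key is not in the top-g predictions for its history."""
        flagged = []
        for i, history, nxt in self._windows(keys):
            seen = self.counts.get(history, Counter())  # unseen history: no candidates
            candidates = [key for key, _ in seen.most_common(self.top_g)]
            if nxt not in candidates:
                flagged.append(i)
        return flagged

    def feedback(self, keys, i):
        """An analyst confirmed keys[i] is benign: learn this transition online."""
        history = tuple(keys[i - self.window:i])
        self.counts[history][keys[i]] += 1
```

A single report only lifts a key into the top g when few competing keys outrank it; repeated reports raise its rank further, which mirrors how feedback gradually shifts a trained model.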
SMEs can now benefit from the solution, which has proven able to handle huge volumes of data: in our tests using multiple processing clusters, it extended to multiple clients and billions of data records daily.
The solution delivers a mechanism for traceability and continuous improvement, is flexible enough to cope with multiple log types and sources, and is easy to deploy.
Of course, log data needs parsing to make it machine readable. After extensive research, we opted for an integrated tool, the aptly named Logparser, to structure the data and quickly make sense of diverse, unstructured logs.
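The core idea of log parsing is to separate each line's constant template (its event type) from its variable parameters. Logparser provides learned parsers such as Drain for this; the sketch below is a deliberately simpler regex-masking stand-in, not Logparser's API, and its masking rules are assumptions that would need tuning for real log sources. Each distinct template becomes an integer log key, the form that sequence models such as DeepLog consume.

```python
import re

# Hypothetical masking rules: each regex replaces a variable field with a placeholder.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),  # IPv4, optional port
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message):
    """Mask variable tokens so lines from the same event share one template."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message


def parse(lines):
    """Map each raw line to an integer log key, assigned in order of first appearance."""
    template_ids = {}
    keys = []
    for line in lines:
        template = to_template(line)
        keys.append(template_ids.setdefault(template, len(template_ids)))
    return keys, template_ids
```

For example, two SSH "Accepted password" lines from different addresses and ports collapse into one template and share a key, while a "Failed password" line gets a new key.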
Next, we use Principal Component Analysis (PCA) as the primary machine learning method; it can analyse 24 million lines of console logs in three minutes and gives more reliable results.
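A common way to apply PCA to logs is to count how often each event type appears per session or time window, learn the low-dimensional "normal" subspace of those counts, and flag windows with a large squared prediction error (SPE), i.e. a large residual outside that subspace. The NumPy sketch below shows that shape under stated assumptions: the 95% variance cut-off and the percentile threshold are simplifications chosen for brevity, and the function names are hypothetical.

```python
import numpy as np


def spe(counts, mean, basis):
    """Squared prediction error: energy left after projecting onto the normal subspace."""
    centred = counts - mean
    residual = centred - centred @ basis.T @ basis
    return np.sum(residual ** 2, axis=1)


def fit_pca_detector(counts, variance_kept=0.95, percentile=99.5):
    """Fit on an event-count matrix (rows: windows, columns: event types)."""
    mean = counts.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(counts - mean, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(explained, variance_kept)) + 1, len(s))
    basis = vt[:k]  # the "normal" subspace
    # Simplified threshold: a high percentile of the training windows' SPE.
    threshold = np.percentile(spe(counts, mean, basis), percentile)
    return mean, basis, threshold
```

A window is flagged when its SPE exceeds the threshold, meaning its mix of event counts breaks the correlations seen during training (for instance, a login-failure count that rises without the usual matching events).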
Inside DeepLog, the detection logic combines a Long Short-Term Memory (LSTM) model with time-series analysis. Some applications replace the LSTM with the lighter-weight Gated Recurrent Unit (GRU), which speeds up analysis. We have likewise switched from LSTM to GRU, reducing the volume of training data needed, introduced custom online learning into the framework, and added a mechanism to integrate users' feedback. Online learning ensures the model is updated as new threats or behaviours appear.
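A minimal NumPy forward pass of a GRU cell, following the formulation of Cho et al. (2014), illustrates why the swap can speed things up: a GRU has three gate blocks where an LSTM has four, so for the same input and hidden sizes it has three-quarters of the weights. This is an illustrative sketch, not JUMPSEC's model; the sizes and initialisation are arbitrary assumptions, and training is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Forward pass of a single GRU cell (Cho et al., 2014 formulation)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_size)
        # One (W, U, b) triple per block: update gate z, reset gate r, candidate n.
        self.params = {
            gate: (
                rng.uniform(-scale, scale, (hidden_size, input_size)),
                rng.uniform(-scale, scale, (hidden_size, hidden_size)),
                np.zeros(hidden_size),
            )
            for gate in ("z", "r", "n")
        }

    def step(self, x, h):
        Wz, Uz, bz = self.params["z"]
        Wr, Ur, br = self.params["r"]
        Wn, Un, bn = self.params["n"]
        z = sigmoid(Wz @ x + Uz @ h + bz)        # how much of the old state to keep
        r = sigmoid(Wr @ x + Ur @ h + br)        # how much old state feeds the candidate
        n = np.tanh(Wn @ x + Un @ (r * h) + bn)  # candidate state
        return z * h + (1.0 - z) * n             # blend old state and candidate

    def num_params(self):
        return sum(W.size + U.size + b.size for W, U, b in self.params.values())


def lstm_num_params(input_size, hidden_size):
    """Same count for an LSTM cell, which has four blocks (input, forget, output, cell)."""
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)
```

Stepping the cell over a sequence of one-hot encoded log keys yields the hidden state from which a classifier would predict the next key; `GRUCell(8, 16).num_params()` gives 1,200 weights against 1,600 for the equivalent LSTM.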
We see the middleware detection engine as a plug-in that should not interfere with normal business processes. Because it needs no additional official data storage, the solution also reduces potential attack vectors and increases sustainability.
To further protect business continuity and minimise the impact of anomaly analysis, log data can be exported to a local server at regular intervals, and the detection engine reads the structured JSON data directly from that server. The solution uses ELK [3] for the initial processing of logs; although ELK offers its own anomaly detection plugin, it requires a significant budget compared with an open-source solution such as this prototype. Exported data files are named by log type and timestamp, which makes them easy to trace later and to match with the corresponding prediction reports, which carry identical timestamps.
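A minimal sketch of the naming idea, assuming exported files are named `<log type>_<UTC timestamp>.json` and contain newline-delimited JSON records; the scheme and the function names are hypothetical. Deriving each report's name from its input file guarantees that the two share an identical timestamp, which is what makes later tracing straightforward.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

STAMP = "%Y%m%dT%H%M%SZ"  # compact UTC timestamp, e.g. 20240101T120000Z


def export_name(log_type, when):
    """Build a file name such as 'auth_20240101T120000Z.json' (normalised to UTC)."""
    return f"{log_type}_{when.astimezone(timezone.utc).strftime(STAMP)}.json"


def parse_name(name):
    """Recover (log_type, timestamp); rpartition allows underscores in the log type."""
    log_type, _, stamp = Path(name).stem.rpartition("_")
    return log_type, datetime.strptime(stamp, STAMP).replace(tzinfo=timezone.utc)


def report_name(export_file):
    """Name the detection report after its input so both share one timestamp."""
    log_type, when = parse_name(export_file)
    return f"report_{log_type}_{when.strftime(STAMP)}.json"


def read_records(path):
    """Read newline-delimited JSON records, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```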
Our prototype detection engine emphasises scalability, flexibility, and maintainability. It is both open source and cost-effective. While perhaps still a little rough around the edges, the prototype goes some way to achieving near real-time analysis, which we will continue to research.
[1] A collection of tasks to run, organised to reflect their relationships and dependencies. A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.
[2] A state-of-the-art solution that enables anomalies to be detected from different log sources.
[3] ELK stands for Elasticsearch, Logstash and Kibana. Logstash is the pipeline that collects data from multiple locations and processes and transforms it; Elasticsearch is the database; and Kibana visualises the data in different formats and patterns.