4 Advanced Uses of Data Mining and Machine Learning in Cybersecurity

Karthik Shiraly
August 24, 2021

Cybersecurity has turned into an unequal arms race: hackers deploy new attacks every day, while most businesses are hobbled by weak security solutions. Automating security with artificial intelligence, big data, and data science is a viable way for startups and small businesses like yours to keep the defenses of your information systems on an equal footing while you focus on your core business. Let's explore four ways data mining and machine learning in cybersecurity help keep your business secure.

1. Automate Pentesting Using Machine Learning

Your company’s IT systems are far more dynamic than you realize, with frequent OS updates, application updates, or settings changes by your employees. Each and every email that one of your employees receives is a potential cyber threat that can compromise your network security, computer security, or worst of all, your company’s business information security.

A self-learning pentesting system that constantly monitors and adapts to your highly dynamic IT environment, while keeping up with the latest vulnerabilities discovered by security researchers, is a very attractive proposition. A pentester’s workflow can be automated end to end by combining vulnerability understanding using natural language processing with attack planning using deep reinforcement learning, all customized to your IT systems.

Vulnerability Understanding Using NLP

A vulnerability disclosure

Source: CVE

The first phase of automated pentesting is vulnerability understanding: natural language processing digests the information in a vulnerability disclosure published by databases like NIST’s NVD.

The steps are:

  • Obtain word embeddings: The system processes the disclosure using pre-trained transformer models like BERT to derive context-aware word embeddings for every word. This is a preprocessing step that improves the accuracy of the next step when compared to using raw words.
  • Extract attack entities using NER: Named entity recognition is used to extract important information such as the vulnerable application, relevant versions, vulnerable operating system, impact of the attack, attack approach, and more. LSTM recurrent neural networks or transformer neural networks are used here because they are able to process text by using word order and surrounding context to detect patterns. Given the word embeddings from the previous step as inputs, these networks look for embedding sequences that match the information patterns we are interested in, such as application name or attack approach.
  • Generate attack graph: The extracted information is used to create interaction rules like the one in this example graph that is sent to an attack graph generator like MulVAL.

The attack graph generated here is a partial graph containing just the interaction rules generated from vulnerability descriptions. The rest of the graph is generated in the next phase.
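
As a rough illustration of how extracted entities could feed an attack graph generator, the sketch below turns a hypothetical NER result into a Datalog-style fact. The `vulProperty` predicate follows the shape used in MulVAL's published example inputs; the `VulnEntities` fields, the `to_vul_property` helper, and the placeholder CVE ID are assumptions made for this example, not part of any real pipeline.

```python
from dataclasses import dataclass


@dataclass
class VulnEntities:
    """Entities a NER model might extract from one disclosure (hypothetical schema)."""
    cve_id: str
    application: str
    attack_range: str  # e.g. "remoteExploit" or "localExploit"
    impact: str        # e.g. "privEscalation"


def to_vul_property(entities: VulnEntities) -> str:
    """Format the extracted entities as a MulVAL-style vulProperty fact."""
    return (
        f"vulProperty('{entities.cve_id}', "
        f"{entities.attack_range}, {entities.impact})."
    )


# Placeholder values; a real system would fill these from the NER output.
example = VulnEntities("CVE-0000-0000", "httpd", "remoteExploit", "privEscalation")
fact = to_vul_property(example)
print(fact)
```

Facts like this describe only the vulnerability itself. Which hosts actually run the vulnerable application is not known until the network is scanned in the next phase.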

Attack Planning Using Deep Reinforcement Learning

IVRE Network Scanner

Source: GitHub

The second phase of automated pentesting involves scanning your network to create the full attack graph and using reinforcement learning to automatically look for vulnerable systems. 

The steps are:

  • Scan your systems: Tools like IVRE (for internal networks) or Shodan (for public-facing systems) are used to obtain information about all your systems.
  • Generate full attack graph: The information returned by scanning tools and the partial attack graph from the previous phase are combined to create a full attack graph for your network. The nodes in the graph are the computers, routers, and other devices in your network. The edges are the possible interactions with them. 
  • Use deep Q-learning to find a successful attack path: The attack graph contains all the systems in your network and the interactions that might lead to an unauthorized outcome. However, we don’t know in advance which path through this graph will succeed. A deep Q-network (DQN) controls an agent that searches for attack paths by traversing the nodes and edges of the attack graph. The DQN is trained to approximate the Q-value function Q(s,a), which gives the cumulative discounted reward the agent can expect when it performs an action, “a,” from state, “s,” and acts optimally afterward. Because a neural network approximates this function, it can generalize to states it has not seen before, which is a critical requirement in a field where new vulnerabilities and exploits are discovered every day.
  • DQN reward strategy: Each malicious outcome such as successful login using a cracked password or successful root access carries different reward points. At each step in the graph, the agent executes an interaction that its training says might fetch it the maximum reward points in its current state.
  • DQN tells specialized tools to exploit: The agent delegates the actual execution of an exploit to specialized tools like Metasploit. It proceeds in this way from system to system on the attack graph until one of the attack paths succeeds in a malicious outcome.
  • Report successful attack paths and interactions: The system generates a daily report of successful attack paths and interactions that enabled it to achieve malicious outcomes. Your system administration team can then take corrective actions and update their SOPs to react quickly to such an outcome if it happens in the future. 
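
The planning loop in the steps above can be sketched with tabular Q-learning, a simpler relative of the DQN in which a lookup table replaces the neural network. The toy graph, action names, and reward points below are invented for illustration, and the actions only update a table; nothing here invokes an exploit tool.

```python
import random

# Hypothetical toy attack graph: node -> {action: (next_node, reward_points)}
ATTACK_GRAPH = {
    "internet": {
        "exploit_web": ("web_server", 10),
        "phish_user": ("workstation", 5),
    },
    "web_server": {
        "crack_db_password": ("db_server", 20),
        "scan_lan": ("workstation", 0),
    },
    "workstation": {"steal_creds": ("db_server", 20)},
    "db_server": {"escalate_to_root": ("db_root", 100)},
    "db_root": {},  # terminal node: the malicious outcome was reached
}


def train_q_table(graph, start, episodes=2000, alpha=0.5, gamma=0.9,
                  epsilon=0.3, seed=0):
    """Learn Q(s, a) for every edge with epsilon-greedy tabular Q-learning."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in actions} for s, actions in graph.items()}
    for _ in range(episodes):
        state = start
        while graph[state]:  # walk until a terminal node
            actions = list(graph[state])
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q[state][a])  # exploit
            next_state, reward = graph[state][action]
            best_next = max(q[next_state].values(), default=0.0)
            # Standard Q-learning update toward reward + discounted future value
            q[state][action] += alpha * (
                reward + gamma * best_next - q[state][action]
            )
            state = next_state
    return q


def greedy_path(graph, q, start):
    """Follow the highest-valued action from each node until a terminal node."""
    path, state = [], start
    while graph[state]:
        action = max(q[state], key=q[state].get)
        path.append(action)
        state = graph[state][action][0]
    return path


q_table = train_q_table(ATTACK_GRAPH, "internet")
best_path = greedy_path(ATTACK_GRAPH, q_table, "internet")
print(best_path)
```

A table works here only because the toy graph has five nodes. The appeal of a DQN is that a network can estimate values for states that never appeared in training, which a table cannot do.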

2. Machine Learning for Intrusion Detection

Suricata IDS

Source: Linux Screenshots

If you’re using a commercial or open-source network intrusion detection system in your business, it may already be using some kind of machine learning for network anomaly detection, because purely rule-based systems struggle to detect or prevent the kind of cyber attacks that expert hackers carry out today.

However, not all machine learning methods work equally well, and it’s possible your system's threat detection can be improved. For example, older classifier approaches such as association rules, Naive Bayes, and support vector machines are often outperformed on supervised learning tasks by gradient-boosted decision trees like XGBoost and by deep artificial neural networks. A good defense-in-depth strategy requires you to be aware of current security limitations and improved approaches.

Recurrent neural networks and transformer networks are intuitive models for intrusion detection because an intrusion involves a chain of malicious network activities. For training, network monitoring tools can capture network traffic packets while other tools capture system metrics like CPU usage. Connection metrics like bytes transferred, session metrics like duration, and system metrics like the number of shells launched can then be derived from that raw data. These metrics embed characteristics unique to different kinds of intrusion attacks. Generally, a packet's headers are more useful for this task than its payload.

But there are two inherent problems in such data. First, how do you handle temporal ordering in the data so that sequential models like RNNs can model that ordering correctly? Second, how do you do feature selection such that the detector performs well with very few false negatives — so that actual attacks are not ignored — and few false positives — so that productivity of your IT staff is not reduced due to too many false alerts?

The temporal ordering question requires raw network packets and raw system metrics to contain timestamps. Some datasets, like the popular KDD99 dataset, aggregate the data for all packets associated with a particular connection, drop all timestamp information, and publish only the aggregated records with little clarity on their temporal ordering. This gives reason to doubt the output of RNN or transformer models trained on this dataset or its improved versions like NSL-KDD. Instead, we recommend that you use datasets like UNSW-NB15 that also publish the raw packet data. An additional benefit is that a model trained on raw packet data is likely to be more accurate at real-time intrusion detection than one trained on data aggregated over some duration.
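
To show why timestamps matter, here is a minimal sketch of preparing input for a sequential model: per-packet records are sorted by time and cut into overlapping fixed-length windows. The record fields (`ts`, `bytes`) and the `make_sequences` helper are hypothetical names chosen for this example.

```python
def make_sequences(records, window=3):
    """Sort per-packet records by timestamp and cut them into overlapping windows."""
    ordered = sorted(records, key=lambda r: r["ts"])
    return [ordered[i:i + window] for i in range(len(ordered) - window + 1)]


# Made-up packets that arrive out of order, as capture files often are.
packets = [
    {"ts": 3.0, "bytes": 40},
    {"ts": 1.0, "bytes": 60},
    {"ts": 2.0, "bytes": 1500},
    {"ts": 4.0, "bytes": 52},
]

sequences = make_sequences(packets, window=3)
for seq in sequences:
    print([r["ts"] for r in seq])
```

Without the `ts` field there is nothing to sort on, which is the core problem with datasets that publish only aggregated records.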

The feature selection problem has many approaches. Some researchers have experimented with dimensionality reduction techniques like principal component analysis (PCA) and with feature selection criteria like mutual information. The underlying intuition is that multiple sets of features in the raw data are likely to be correlated; by reducing each set of correlated features to a single composite feature, or by keeping only the most informative features, the model gets less confused by too many variables with noisy data and can identify the output class more easily.
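
To make the mutual information criterion concrete, here is a minimal sketch that scores a discrete feature against the class label, in bits. The feature names and the tiny dataset are made up; a real pipeline would more likely use a library implementation and would also need to handle continuous features.

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(feature)
    p_x = Counter(feature)
    p_y = Counter(labels)
    p_xy = Counter(zip(feature, labels))
    return sum(
        (count / n) * log2((count / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), count in p_xy.items()
    )


labels = ["normal", "normal", "attack", "attack"]
traffic_volume = ["low", "low", "high", "high"]  # tracks the label exactly
protocol = ["tcp", "udp", "tcp", "udp"]          # unrelated to the label

scores = {
    "traffic_volume": mutual_information(traffic_volume, labels),
    "protocol": mutual_information(protocol, labels),
}
print(scores)
```

A feature with a score near zero tells the detector nothing about the class and is a candidate for removal.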

Another approach is to simply use deep neural networks to implicitly select features. Some researchers have used convolutional networks to generate an embedding feature vector for each row that is derived from some subset of the raw features. During training, the function that generates the feature vector automatically combines the raw features in such a way that it maximizes the network’s accuracy. This eliminates the need for manual feature selection using expert knowledge or PCA.

3. Machine Learning for Malware Analysis

Workflow for malware analysis

Malware like malicious executables, viruses, and rootkits can enter your IT system through multiple routes — emails, applications like Excel, files uploaded to your web applications, or pen drives. They have characteristic static patterns in their binary content as well as characteristic dynamic traits when they run, and machine learning can learn both types of patterns to help with malware detection.

For static signature patterns, recurrent neural networks and transformer networks are two obvious approaches since the patterns are sequential both locally and at larger distances. However, an alternative approach is to look for byte patterns in image representations of the files using a hybrid neural network with convolutional and recurrent LSTM layers.

It works by treating the bytes that make up a malware file as a one-dimensional grayscale image with one row and as many pixels as the file has bytes. Each image is then downscaled to a uniform size (say 1x10,000 pixels) because the network expects fixed-size inputs. The downscaling does cause some data loss, but the single dimension ensures that structure and sequences are largely retained. The convolutional layers detect small collocated patterns in the images while the LSTM layers detect larger sequential patterns. These sets of local and global patterns in the binary content were found to be unique enough to enable the model to achieve over 98% detection accuracy.
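
The byte-to-image step can be sketched in a few lines. This version downsamples by averaging equal-sized bins of bytes, which is only one possible choice; the original work may have used a different resampling method, and the `bytes_to_image_row` name is an assumption for this example.

```python
def bytes_to_image_row(data: bytes, width: int = 10_000) -> list:
    """Turn raw file bytes into a 1 x width row of grayscale values in [0, 1]."""
    if not data:
        raise ValueError("file is empty")
    n = len(data)
    row = []
    for i in range(width):
        start = i * n // width
        # Guarantee at least one byte per pixel, so short files are stretched.
        end = max(start + 1, (i + 1) * n // width)
        chunk = data[start:end]
        row.append(sum(chunk) / (255 * len(chunk)))
    return row


# Tiny made-up "file": two dark bytes followed by two bright bytes.
sample = bytes([0, 0, 255, 255])
small_row = bytes_to_image_row(sample, width=2)
stretched_row = bytes_to_image_row(sample, width=8)
print(small_row, len(stretched_row))
```

Because pixels stay in file order, byte sequences that sit next to each other in the binary remain neighbors in the row, which is what the convolutional and LSTM layers rely on.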

Static patterns enable easy real-time detection on regular hardware, including smartphones used for fieldwork outside labs. However, a major problem with detecting static binary signature patterns this way is that malware frequently uses obfuscators, packers, and adversarial attacks to fool ML models.

Looking for patterns in the runtime behavior of malware can overcome this problem. Malware, when activated, still has to make system calls like file creation and memory access to achieve its malicious goals. These system call sequences can be recorded by running the malware in a safe sandbox like Cuckoo Sandbox or Sandboxie. A subset of critical system calls is selected as features, and the call sequences are converted into feature vectors using one-hot encoding. A hybrid neural network with convolutional and recurrent LSTM layers is trained to look for patterns in these call sequences. The convolutional layers detect small patterns of collocated calls. The LSTM layers detect longer sequential dependencies in the call sequences. Each malware family is identifiable by unique combinations of these local and global patterns. A final softmax layer classifies malware based on these features and outputs the malware’s name.
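
The one-hot encoding step might look like the sketch below. The four call names are real Windows native API calls, but choosing them as the "critical" subset is an assumption made for illustration, as is the decision to silently drop calls outside that subset.

```python
# Illustrative subset of system calls treated as features (an assumption).
CRITICAL_CALLS = [
    "NtCreateFile",
    "NtWriteFile",
    "NtAllocateVirtualMemory",
    "NtCreateProcess",
]


def one_hot_sequence(calls, vocabulary=CRITICAL_CALLS):
    """Encode a recorded call trace as a list of one-hot vectors."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vectors = []
    for call in calls:
        if call not in index:
            continue  # calls outside the selected subset are ignored
        vector = [0] * len(vocabulary)
        vector[index[call]] = 1
        vectors.append(vector)
    return vectors


# Made-up trace, as a sandbox report might list it.
trace = ["NtCreateFile", "NtQueryInformationFile", "NtWriteFile", "NtCreateProcess"]
encoded = one_hot_sequence(trace)
for vector in encoded:
    print(vector)
```

The resulting sequence of vectors is what the convolutional and LSTM layers would consume, one time step per retained call.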

The main disadvantage of this behavioral approach is that it can’t be used for real-time detection or used on regular smartphones because the malware needs to run for a considerable duration in specialized sandbox software.

4. Data Mining Using the ELK Stack

ELK Security

Source: GitHub

Custom deep learning models are not always necessary for your defense. There are plenty of ready-to-use tools to keep your business safe while you focus on your core competencies. The ELK Stack (Elasticsearch, Logstash, Kibana) is one such popular system that provides visualization and data mining techniques for your application and system logs. It also comes with built-in security capabilities.

For example, it supports anomaly detection using machine learning out of the box, without you having to train a model yourself. It can detect anomalies in metrics reported by middleware like Nginx, in your business-specific metrics, and in system metrics. It also has attack detection for common security threats like SQL injection, phishing, and intrusion using pattern matching and machine learning techniques. The Kibana interface enables profiling these attacks by date and location to help you detect coordinated attacks on your infrastructure.

Don’t Take Chances With Your Data Security

Cybersecurity is a bit like health. It’s one of those things that you probably just don’t think about until it fails (and your company falls victim to cybercrime). You wouldn’t want a future funding or acquisition offer to fall through because the investor’s or acquirer’s due diligence uncovered security concerns in your IT systems. Just like insurance, machine learning has the potential to raise the protection for your systems and customer data to an acceptable level while you focus on your core business.

We at Scalr think your business deserves a better level of data protection 365 days a year. Come talk to us about your security anxieties and we’ll find a way to automate your security systems.