Introduction
Artificial intelligence, and more precisely machine learning (ML), has become an almost omnipresent topic in the tech industry over the last decade. ML is applied to all kinds of problems, from image and speech recognition to online fraud detection and stock market predictions. It seems only natural to apply it to one of cybersecurity’s biggest challenges as well: threat detection. Malicious actors have created a plethora of cybersecurity threats, ranging from phishing pages to ransomware and wipers, so the ability to detect and prevent attacks before harm is done is an essential building block of every organization’s cyber defense strategy.
In fact, the cybersecurity industry has become effective at fighting malicious URLs and files that are known, or in other words, that have been analyzed already so that signatures have been crafted. The big remaining problem is unknown threats, which can be either entirely new threats or new variants of already existing threat families. Fending off unknown cyber threats is of vital importance for defending your assets.
Machine learning algorithms can solve problems they were not explicitly programmed to solve. By their nature, they find solutions using approaches notably different from human thinking. A model is constructed from a training, validation, and test set, and is fitted automatically to give the best results with respect to various mathematical criteria. Given their machine-oriented nature, such algorithms are capable of processing vast amounts of data and identifying complex patterns that humans struggle to see. As a result, they can detect things that other approaches miss.
With email as the main infection vector, attackers have increasingly focused on using malicious URLs rather than malicious attachments. This most notably includes phishing emails pointing to forged credential harvesting pages, which closely resemble the look and feel of legitimate login pages. VMRay Analyzer can detonate URLs with its unique web analysis engine to collect a plethora of different web site characteristics that allow for detecting phishing web sites in a generic manner rather than relying on static signatures only.
Even on seemingly simple web pages, a myriad of different characteristics can be extracted, each of which can influence the detection verdict. Typical characteristics include page visuals (logos, page layout, etc.), texts presented to the user, hosting information (domain properties, use of encryption, etc.), source code (use of specific libraries, etc.), and so on. The high number of combinations, along with the high-quality, noise-free input provided by VMRay analysis engines, makes the problem well suited to machine learning, where the different characteristics are used as features to feed the ML algorithm.
VMRay has been diligently working on expanding our best-in-class threat detection with ML and is therefore happy to announce the introduction of a new ML phishing detection in VMRay Analyzer v4.5. The solution supplements already existing detections to catch even more unknown phishing pages which can otherwise fall through the cracks.
In the following, we describe how it works.
How VMRay’s ML Phishing Detection Works:
VMRay’s ML phishing detection is currently implemented using five different models, each utilizing a different feature set and therefore showing different model performance. This approach leaves significant room for diversity and flexibility in how we experiment with and utilize our machine learning models to address use cases. It also allows for customization and settings, such as which model to use for a specific scenario and which mode to use based on prediction confidences (e.g. normal, conservative, aggressive).
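The source does not say how the normal, conservative, and aggressive modes are defined. As a minimal sketch, assuming each mode is simply a different cut-off on a model's phishing confidence, the idea could look like this. The threshold values and the names MODE_THRESHOLDS and verdict are hypothetical and chosen for illustration only.

```python
# Hypothetical confidence cut-offs per mode (illustrative values, not VMRay's).
# A higher cut-off flags fewer pages (fewer false positives, more misses).
MODE_THRESHOLDS = {
    "aggressive": 0.5,
    "normal": 0.7,
    "conservative": 0.9,
}


def verdict(phishing_confidence: float, mode: str = "normal") -> str:
    """Map a model's phishing confidence (0..1) to a verdict for a given mode."""
    if mode not in MODE_THRESHOLDS:
        raise ValueError(f"unknown mode: {mode!r}")
    if phishing_confidence >= MODE_THRESHOLDS[mode]:
        return "phishing"
    return "clean"
```

With these assumed values, a page scored at 0.8 would be flagged in aggressive and normal mode but not in conservative mode.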
As mentioned, five models are in place:
- One model based on features crafted from the actual URL path and the main page of the URL
- Two models based on features engineered from the output of VMRay’s analyzers
- One model utilizing all the crafted features generated by the first three models
- One voting model using the predictions and prediction confidences of the other models
The first model is based on features extracted from both the URL path and the contents of the fully loaded main page of the URL. Examples of the URL path based features are:
- URL length
- number / percentage of alphanumeric characters used in the URL path
- shortened URL indicator
The URL path entropy is also included as a feature, because it gives a measure of the randomness of the strings in the actual URL. Apart from the entropy, which has to be computed, these features can be read directly from the URL path.
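The URL path features listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not VMRay's implementation: the entropy is taken to be Shannon entropy over the characters of the path, the shortener host list is a hypothetical placeholder, and the function and feature names are invented for the example.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, deliberately tiny list of URL shortener hosts.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co"}


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url: str) -> dict:
    """Extract a few URL path based features (names are illustrative)."""
    parsed = urlparse(url)
    path = parsed.path
    alnum = sum(ch.isalnum() for ch in path)
    return {
        "url_length": len(url),
        "path_alnum_count": alnum,
        "path_alnum_pct": alnum / len(path) if path else 0.0,
        "is_shortened": (parsed.hostname or "").lower() in SHORTENER_HOSTS,
        "path_entropy": shannon_entropy(path),
    }
```

A path made of random-looking characters yields a higher entropy than a path of repeated or dictionary-like strings, which is why the value is useful as a randomness signal.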
The content-based features are extracted from the contents of the fully loaded page, in other words the page source, including embedded and inline JavaScript. Examples of the features extracted from the source are:
- character counts and percentages
- number and percentage of particular functions used (e.g. eval function)
- variable and function name lengths (i.e. min, max, avg)
- token-count features, based on specific tokens we found noteworthy to track and count
Combined, a total of fifty-five (55) features were used to train this model. Based on training and testing results, this model has the highest accuracy in predicting phishing URLs, with an acceptable number of false positives. However, its false positive count tends to fluctuate when it is tested against real-world phishing samples.
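To make the content-based features more concrete, here is a rough sketch of how a few of them might be computed from page source. It relies on simple regular expressions rather than a real JavaScript parser, so it only approximates declared names; the patterns, the function name content_features, and the feature names are assumptions made for this example.

```python
import re
from statistics import mean

# Simplified pattern: names declared with var/let/const/function.
# A real extractor would use a JavaScript parser instead.
DECLARATION = re.compile(r"\b(?:var|let|const|function)\s+([A-Za-z_$][\w$]*)")
EVAL_CALL = re.compile(r"\beval\s*\(")


def content_features(source: str) -> dict:
    """Compute a few illustrative content-based features from page source."""
    total = len(source)
    digits = sum(ch.isdigit() for ch in source)
    lengths = [len(name) for name in DECLARATION.findall(source)]
    return {
        "char_count": total,
        "digit_pct": digits / total if total else 0.0,
        "eval_count": len(EVAL_CALL.findall(source)),
        "name_len_min": min(lengths, default=0),
        "name_len_max": max(lengths, default=0),
        "name_len_avg": mean(lengths) if lengths else 0.0,
    }
```

Heavily obfuscated scripts often stand out on features like these, for example through frequent eval calls or unusually long or short identifier names.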
The second and third models arguably provide the key differentiation of VMRay’s machine learning implementation. This is because they were trained on features extracted from the output of the dynamic analysis performed by VMRay Analyzers, that is, the clean and complete dynamic analysis logs generated after detonating the URL. The data in these logs is rich enough to be used as features even in its rawest form. However, to engineer more features and to boost model performance, statistical methods, transformations, combinations, and aggregations were applied to the raw data. Examples of the raw features in the analysis logs are:
- host-based information (e.g. WHOIS info)
- complete request and response headers
- URL page and design related resources
One hundred twenty-one (121) features were engineered to train these two models separately. Training and testing results show that these two models are the most stable at predicting phishing URLs, with very low false positives. The same results are seen against real-world phishing samples. This shows that robust and stable models can be built from the clean analysis data generated by the VMRay analyzers.
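The source does not list the engineered features themselves. As a purely hypothetical illustration of what aggregating raw analysis data can mean, the sketch below turns the response headers of all requests seen during one analysis into a handful of numeric features. The input shape (one dict of lower-cased header names per response) and every feature name are assumptions, not the VMRay log format.

```python
from statistics import mean


def header_features(responses: list) -> dict:
    """Aggregate per-response header dicts into numeric features.

    `responses` is assumed to be a list of dicts mapping lower-cased
    header names to values, one dict per HTTP response.
    """
    if not responses:
        return {
            "response_count": 0,
            "avg_header_count": 0.0,
            "max_header_count": 0,
            "distinct_servers": 0,
            "hsts_ratio": 0.0,
        }
    header_counts = [len(headers) for headers in responses]
    servers = {h["server"].lower() for h in responses if h.get("server")}
    with_hsts = sum("strict-transport-security" in h for h in responses)
    return {
        "response_count": len(responses),
        "avg_header_count": mean(header_counts),
        "max_header_count": max(header_counts),
        "distinct_servers": len(servers),
        "hsts_ratio": with_hsts / len(responses),
    }
```

Counts, averages, maxima, and ratios like these are typical of the statistical aggregations that turn variable-length raw logs into fixed-length feature vectors.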
The fourth model is trained using all the features engineered for the three models already described. In general, it is beneficial to train machine learning models on as many (uncorrelated) features as possible, because the algorithms tend to discover more underlying patterns and correlations when given numerous uncorrelated features along with a significant number of data points. As a result, such an approach improves learning in terms of both generalization and memorization. In our case, we want to train a model that has the aggressiveness of the first model and the very low false positives of the other two models. In short, we want a model that is accurate on both counts.
The fifth model is a sort of derivative of the fourth model. Technically, it is not a machine learning model or algorithm. Rather, it is an implementation that uses the consensus of the predictions and prediction confidences of the four models described above. In machine learning terminology, such an approach can also be called “ensembling by prediction”. Ensembling is a proven, effective technique in machine learning modeling for boosting overall accuracy by combining multiple base estimators and aggregating their individual results in different ways.
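One common way to build such a consensus is soft voting, where the confidences of the individual models are averaged and compared against a threshold. The sketch below shows that idea only; the source does not state which aggregation rule is actually used, and the Prediction class, the vote function, the equal weighting, and the 0.5 threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    model: str
    phishing_probability: float  # model confidence that the URL is phishing, 0..1


def vote(predictions, threshold: float = 0.5):
    """Soft-vote over model predictions.

    Returns (is_phishing, averaged_confidence, models_voting_phishing).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    avg = sum(p.phishing_probability for p in predictions) / len(predictions)
    votes = sum(p.phishing_probability >= threshold for p in predictions)
    return avg >= threshold, avg, votes
```

For example, confidences of 0.9, 0.6, 0.2, and 0.7 average to 0.6, so the page would be flagged even though one model disagrees. Returning the vote count alongside the average makes it possible to require a stricter majority where fewer false positives are needed.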
We opted for a machine learning approach with a significant feature engineering component because it allows us to fully utilize the structured data in the clean and rich output of our VMRay Analyzers. We also plan to investigate deep learning in the future, mainly because of its ability to learn features from raw, unstructured data through the design of the neural network’s architecture. This would remove the feature engineering phase, which is mainly carried out by domain experts.
The models described above are the results of the actual training process. However, in machine learning modeling, another essential step, which comes before feature engineering and algorithm selection, is the collection of the sample set. In this phase, we ensured that the samples included in the training and testing sets are the best possible representation of the entire population or universe (i.e. the collection of phishing and clean URLs). By profiling the samples, we ensured that the samples included in the training set were:
- Collected from all the different sources VMRay is using, both for phishing and clean URLs
- Free from duplicates and near-duplicates (e.g. eight similar phishing URLs that can be represented by just one)
- As close to uniform as possible in the number of samples for phishing and clean URLs
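The deduplication and balancing criteria above can be illustrated with a small sketch. It assumes, for the sake of the example, that two URLs count as similar when they share host and path, and that balance is reached by randomly downsampling the larger class; neither choice is confirmed by the source, and dedup_key and build_training_set are hypothetical names.

```python
import random
from urllib.parse import urlparse


def dedup_key(url: str) -> tuple:
    """Hypothetical similarity key: host and path, ignoring scheme, query, fragment."""
    parsed = urlparse(url)
    return ((parsed.hostname or "").lower(), parsed.path.rstrip("/"))


def build_training_set(samples, seed: int = 0) -> list:
    """Deduplicate (url, label) pairs and balance the two classes.

    `label` is expected to be "phishing" or "clean".
    """
    seen = set()
    by_label = {"phishing": [], "clean": []}
    for url, label in samples:
        key = dedup_key(url)
        if key in seen:
            continue  # keep only the first representative of similar URLs
        seen.add(key)
        by_label[label].append(url)

    # Downsample the larger class so both classes have the same size.
    target = min(len(urls) for urls in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, urls in by_label.items():
        balanced.extend((url, label) for url in rng.sample(urls, target))
    return balanced
```

Fixing the random seed keeps the selection reproducible, which helps when comparing models trained on the same sample set.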
To sum up, with VMRay Analyzer’s superior technology and the essential machine learning parts put together, VMRay was able to build its ML phishing models.