Why (and which) data is essential to create a reliable Machine Learning model?
Machine Learning Blog Post Series – 4: By Shazia Saqib
MACHINE LEARNING BLOG SERIES
- Machine Learning & Cybersecurity – An Introduction
- The main concepts of AI and Machine Learning
- Why we need Machine Learning in Cybersecurity, and how it can help
Data is probably one of the most important and valuable commodities in modern-day society. Machine-learning-driven data analytics is used to capture otherwise unseen patterns in almost every industry, from manufacturing to agriculture, from scientific institutions to government organizations. It can even be used to predict disasters before they happen.
Companies with a data-first mentality will have the chance to reimagine and reinvent their business. [1]
In our last article, “Why do we need Machine Learning in cybersecurity and how can it help?”, we gave a general introduction to the state of cybersecurity. We explained why the analytics and automation power of Machine Learning (ML) can help cover the blind spots in cybersecurity by recognizing hidden patterns, identifying attacks, and automatically mitigating them. We also explored 19 of the most prominent AI use cases in cybersecurity listed by Gartner.
Download your complimentary Gartner Report
Machine learning models mainly rely on four primary data types: numerical, categorical, time-series, and text data.
- Numerical data, or quantitative data, is measurable data such as distance, age, weight, or the cost of an electricity bill.
- Categorical data is sorted based on shared characteristics. Social class, gender, and hometown are a few examples of categorical data.
- Time-series data consists of data points indexed at specific points in time. More often than not, this data is collected at consistent intervals.
- Text data is simply words, sentences, or paragraphs that can provide some level of insight to machine learning models.
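As a minimal sketch (with hypothetical column names and values, shown purely for illustration), all four data types can sit side by side in a single table:

```python
# Minimal sketch of the four primary data types in one table (hypothetical example data).
import pandas as pd

samples = pd.DataFrame({
    "file_size_kb": [120.5, 4096.0, 88.2],                       # numerical
    "file_type": pd.Categorical(["pdf", "exe", "docx"]),         # categorical
    "first_seen": pd.to_datetime(["2023-01-05", "2023-01-06",
                                  "2023-01-07"]),                # time series
    "email_subject": ["Invoice overdue", "Your parcel",
                      "Meeting notes"],                          # text
})

print(samples.dtypes)
```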
As cyber-attacks grow exponentially, there is a need to detect the patterns of an attack and find correlations between the attack volumes reported on consecutive days [2].
How to build and evaluate a Machine Learning model
Data is probably the most critical component of a reliable machine learning model. Everything from the model creation process to the model’s actual predictions depends on the data used as input.
Feature extraction:
Feature extraction is the process of selecting features and translating them into a mathematical form that the model can understand. When you build a model, you have numerous parameters, or “features”, that can be used to predict the desired outcome. Feature selection identifies the essential features and eliminates the irrelevant or redundant ones using dimensionality reduction.
VMRay uses a supervised machine learning model, which means that the most relevant features are picked from among numerous potential indicators, such as URL string entropy, whitespace percentage, and so on.
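To make the idea concrete, here is a minimal, hypothetical sketch of feature extraction and selection for URL samples. The feature definitions, toy data, and selection method below are illustrative assumptions, not VMRay’s actual pipeline:

```python
# Illustrative feature extraction and selection for URL samples (not VMRay's actual pipeline).
import math
from collections import Counter

import numpy as np
from sklearn.feature_selection import VarianceThreshold

def url_features(url: str) -> list[float]:
    """Translate a URL string into numerical features the model can understand."""
    counts = Counter(url)
    entropy = -sum((c / len(url)) * math.log2(c / len(url)) for c in counts.values())
    whitespace_pct = url.count(" ") / len(url)
    return [entropy, whitespace_pct, len(url), url.count("."), url.count("-")]

urls = [
    "http://example.com/login",
    "http://a8f3-kz.top/x?id=93j2kd",
    "https://docs.python.org/3/",
]

X = np.array([url_features(u) for u in urls])

# Feature selection: drop features that carry no information in this toy set
# (here, the whitespace percentage is constant across all samples).
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)
print(X.shape, "->", X_selected.shape)
```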
Model engineering:
A model is a parameterized function through which we can map inputs to outputs. Models are created through a careful and meticulous engineering and experimentation process of selecting, validating, and evaluating candidates.
VMRay trains its models by creating a Machine Learning workflow, which ensures explainability for the significant parts of the process that contribute to the prediction, such as sample set collection, feature engineering, feature weights, and inference.
In most cases, the success of model engineering is a matter of avoiding overfitting or underfitting and finding the optimal balance between the accuracy and the false positives of the outcome. It is also essential to create a generalizable model: one that will be almost equally reliable when it encounters a new dataset.
The complexity of the model can be controlled by experimenting with its hyperparameters. The regression example in the figure below shows why an optimal level of model complexity is needed. When the model matches input to output exactly in the training dataset (an example of overfitting), it is hard to apply the model to external data. On the other hand, if the model is too simple, it will not perform well even on the training data, which is what underfitting means.
In short, we need a model that does not “memorize” the mathematical function that links the input data to the prediction; rather, we need it to “learn” the underlying function.
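As a toy sketch of this idea (synthetic data, not the exact setup of the figure above), sweeping the polynomial degree of a simple regression model shows how an overly complex model memorizes the training data while failing on new data, and an overly simple one fails on both:

```python
# Toy sketch of under- vs overfitting: sweeping polynomial degree (model complexity)
# and comparing error on the training data with error on new data from the same process.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n):
    """Noisy observations of an underlying sine function."""
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.2, size=n)
    return X, y

X_train, y_train = sample(30)
X_new, y_new = sample(200)   # external data the model has never seen

for degree in (1, 4, 15):    # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}  "
          f"new-data MSE={mean_squared_error(y_new, model.predict(X_new)):.3f}")
```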
To achieve this, we need a process called “model evaluation”: testing and improving the model to see whether it performs well both on the training set and on new data.
This stage is where the quality of data and expertise of those who build the model pays off.
For model evaluation, the dataset is divided into two subgroups. The first, generally the larger group, is the training set, while the second is called the validation set. The model is trained to find the function that ties the input to the output within the training set, and this function is then tested on the validation set, data on which the model has not been trained.
There are different methods to evaluate the performance on the validation set and find the optimal balance, such as minimizing a cost function or comparing training accuracy (the rate of correct predictions) with validation accuracy. The closer the performance on the training set is to the performance on the validation set, the more generalizable the model is.
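A minimal sketch of this evaluation loop, using synthetic data and a generic classifier purely for illustration, might look like this:

```python
# Sketch of model evaluation with a train/validation split (toy data, illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))                       # five toy features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# The larger subgroup trains the model; the smaller one validates it.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

model = LogisticRegression().fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
val_cost = log_loss(y_val, model.predict_proba(X_val))   # cost function on unseen data

print(f"training accuracy={train_acc:.3f}  validation accuracy={val_acc:.3f}  "
      f"validation log-loss={val_cost:.3f}")
# A small gap between training and validation accuracy suggests the model generalizes.
```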
Another very important tool for model evaluation is the confusion matrix, which gives both a general overview and the means for a deeper analysis of the balance between accuracy and false positives (FP).
For example, when evaluating a model that predicts whether a sample is malware or not, we can use this tool to calculate:
Recall: how much of the actual malware is caught, calculated by dividing the correctly predicted malware by all actual malware: Recall = TP / (TP + FN)
Precision: how many of the “positive” predictions are correct, calculated by dividing the correctly predicted malware by all predicted malware: Precision = TP / (TP + FP)
Accuracy: how many of the predictions are correct, i.e. true predictions divided by all predictions: Accuracy = (TP + TN) / (TP + FP + TN + FN)
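As a hedged illustration with made-up prediction counts (not VMRay results), these three metrics can be computed directly from a confusion matrix:

```python
# Computing recall, precision, and accuracy from a confusion matrix
# (hypothetical labels for a malware/benign classifier).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = malware, 0 = benign
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)                       # how much of the actual malware is caught
precision = tp / (tp + fp)                    # how many "malware" predictions are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)    # how many predictions are correct overall

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```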
Output:
Within the VMRay platform, the machine learning module is an additional layer that is fed by the input derived from VMRay’s cutting-edge dynamic analysis technologies. The output of the Machine Learning model, then, contributes to the unique VTI (VMRay Threat Identifiers) scoring methodology, as a separate VTI rule.
Users can view the ML prediction in the detailed VTI list, although the platform does not deliver it as a “standalone verdict”. By displaying the ML outcome as one of many identifiers, VMRay balances the ML prediction with the decisions of more than 20 unique detection technologies. Users can not only see what the ML engine predicts among the VTIs, but can also adjust the impact of that prediction on the overall verdict.
What it takes to create a reliable model
When we speak about the most important requirement for Artificial Intelligence and Machine Learning, it always comes down to the input data. It’s the data that makes the difference, because data is the raw material of ML.
An HFS study shows that 75% of executives do not have a high level of trust in their data [4]. According to another study from Gartner, 40% of enterprise data is either inaccurate, incomplete, or unavailable [5]. So, the first thing needed to create a Machine Learning model is the most trustworthy input available. And this is where VMRay excels. VMRay provides the model with the highest quality input in three respects:
VERACITY:
“Access to good data is one of the major challenges of AI/ML development,” says Gartner, highlighting the importance of accurate and trustworthy data.
VMRay’s core technologies, and the innovations introduced with every release, enable the platform to see the true face of the enemy. While VMRay Analyzer analyzes a file or URL, the sample displays its genuine behavior, as even the most evasive sample is unaware of being observed. Thus, VMRay brings the most accurate data to the table, which is exactly what is needed to build a reliable Machine Learning model. In short, the input that VMRay uses to feed, train, and validate the model is accurate, noise-free data.
SCOPE:
Covid-19 showed us once again that data from the past loses relevance quickly, because the world is changing at an exponential pace. As a result of this ongoing disruptive transformation, a new concept is emerging: “small and wide data”, where “small” refers to the increasing importance of relevant, to-the-point data instead of big volumes of it. As per Gartner, “70% of organizations will shift their focus from big to small and wide data by 2025” [6].
VMRay specializes in what matters most: the types of threats that others miss. The unknown, zero-day threats that the cybersecurity world is not yet aware of, and the sophisticated, targeted attacks that use evasive techniques to remain hidden. This means that VMRay’s expertise and data are to-the-point when it comes to detecting the hardest-to-detect threats.
RANGE:
To create a reliable and generalizable machine learning model that helps detect new threats, you need data that includes a wide variety of threat types, targets, vectors, and behavior patterns.
VMRay has a diverse client portfolio in terms of verticals, regions, and company sizes. We work with top companies: 4 of the top 5 global tech giants, 14 of the Fortune 100 largest companies, and 17 of the World’s Most Valuable Brands. In addition to private companies, VMRay works with more than 50 critical government customers from 17 countries. This adds a huge range to our expertise and know-how.
VMRay Analyzer also logs every necessary detail about the malicious behavior at each step of its execution, which adds enormous breadth to the data.
Summary
VMRay Analyzer analyzes samples using cutting-edge advanced detection technologies that observe and report the “actual” behavior of malicious samples and generate accurate, noise-free outputs. This is why VMRay Analyzer is trusted by leading private and public organizations to cover the blind spots of their existing security tools and systems and to validate their alerts and false positives.
VMRay’s Machine Learning model works as a module of this already strong technology platform and gets the highest quality input directly from the dynamic analysis: reliable, relevant, wide-ranging data. And these are exactly the qualities needed to create, validate, and evaluate a trustworthy machine learning model.
For further information
DOWNLOAD THE WHITEPAPER
References
- https://www.forbes.com/sites/teradata/2020/10/15/why-data-matters/
- https://www.datarobot.com/blog/the-importance-of-machine-learning-data/
- https://www.rtinsights.com/executives-dont-trust-data/
- https://insightsoftware.com/blog/5-signs-youre-using-bad-data-to-make-business-decisions/
- https://www.gartner.com/en/newsroom/press-releases/2021-05-19-gartner-says-70-percent-of-organizations-will-shift-their-focus-from-big-to-small-and-wide-data-by-2025