Malware is one of the most imminent threats that companies and users face every day. Whether it arrives as a phishing email or as an exploit delivered through the browser, once it is combined with evasion techniques and other security vulnerabilities, today's defense systems often cannot keep up. Frameworks such as Veil, Shellter, and others are used by professionals during pentesting work and are known to be quite effective at bypassing these defenses. Today I am going to show you that Machine Learning can indeed be used to detect malware without relying on either signature detection or behavioral analysis. P.S.: Many products nowadays, like CylanceProtect, SentinelOne, and Carbon Black, are known to leverage these capabilities. The framework we are going to develop throughout this session is not at any level capable of doing what these products do, and I will explain shortly why.
Machine learning: a brief introduction
Machine Learning can be split into two major methods: supervised learning and unsupervised learning. The first means that the data we are going to work with is labeled; the second means it is unlabeled. Malware detection can be attacked using both methods, but we will focus on the first one since our goal is to classify files. Classification is a subdomain of supervised learning. It can be either binary (malware or not malware) or multi-class (cat, dog, pig, llama…), so malware detection falls under binary classification. Explaining Machine Learning is beyond the scope of this article. Nowadays you can find a large amount of resources to learn more about it, and you can check the Appendix for some of them.
The problem set
Machine Learning works by defining a problem, collecting the data, processing the data to make it usable, and then feeding it to the algorithms. This is called the machine learning workflow: it is the minimal set of steps you need to start doing Machine Learning. The amount of resources these steps may require is also what makes Machine Learning hard to apply to every problem. In our case, let's define our workflow:
- First, we need to collect malware samples and clean samples. We cannot work with fewer than 10k samples of each, and it is advisable to use even more.
- We need to extract meaningful features from our samples; these features will be the basis of our study. Features are what describe something. For example, the features of a house are:
    - number of rooms
    - square footage of the house
    - price
- After extracting these features, we need to process all our samples to build a dataset. It can be a database file or a CSV file; this way it will be easier to turn it into vectors, since the algorithms work by performing computations on vectors.
- Lastly, we need metrics. For binary classification there is a multitude of metrics to benchmark the performance of an algorithm (ROC/AUC, Confusion Matrix…). We will use a confusion matrix, since it shows the True Positives and True Negatives as well as the False Positives and False Negatives.
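To make the metric concrete, here is a minimal sketch of how the four cells of a confusion matrix can be counted for a binary problem. The function names and the label encoding (1 for malware, 0 for clean) are assumptions made for this illustration, not something taken from the original code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem (positive = malware)."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # clean file flagged as malware
        else:
            if truth == positive:
                fn += 1  # malware that slipped through as clean
            else:
                tn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


def accuracy(counts):
    """Share of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0


if __name__ == "__main__":
    # Toy labels, made up for the example.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    counts = confusion_counts(y_true, y_pred)
    print(counts, accuracy(counts))
```

For an antivirus, the False Negative cell deserves the most attention, since it counts the malicious files that were reported as clean.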
Collecting samples and feature extraction
I assume the reader knows about the PE File Format; if you do not, you can read about it here. Collecting samples is quite easy: you can either use a paid service (like VirusTotal) or one of the links here. Okay, let's start by discussing our model. For our algorithm to learn from the data you feed it, we need to make that data understandable and clear. In our case, we will use 12 features to teach our algorithm. These features will be extracted from each binary and organized into a CSV file.
Feature extraction
To extract features, we will be using pefile. The first step is to download pefile; I assume you know some Python and how to use pip. From your terminal run:
    pip install pefile
Now that you have the necessary tools, let's write some code. But first, let's discuss what kind of information we want to extract. We are interested in extracting the following fields of a PE File:
- Major Image Version: used to indicate the major version number of the application; in Microsoft Excel version 4.0, it would be 4.
- Virtual Address and Size of the IMAGE_DATA_DIRECTORY
- OS Version
- Import Address Table Address
- Resources Size
- Number Of Sections
- Linker Version
- Size of Stack Reserve
- DLL Characteristics
- Export Table Size and Address
To make our code more organized, let's start by creating a class that represents the PE File information as one object. Then we move on to write a small method that constructs a dictionary for each PE File; each sample will thus be represented as a Python dictionary where the keys are the features and the values are the values of each parsed field. Since we can write code, let's write a script that loops through all samples in a folder, processes each one of them, and then dumps all those dictionaries into one CSV file that we will use. Okay, now we are ready to process some data. I advise you to use the code from my Github.
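The extraction steps described above could look roughly like the sketch below. It is not the code from the author's repository, and it uses plain functions instead of a class to stay short. Several choices are assumptions: the column names are made up, the "OS Version" and "Linker Version" features are read from the major-version fields, and since the article does not say which data directory entry its "Virtual Address and Size" refers to, the debug directory is used here purely as a placeholder.

```python
import csv
from pathlib import Path

# Positions of the entries we read in the optional header's data directory
# array, as defined by the PE format.
EXPORT, RESOURCE, DEBUG, IAT = 0, 2, 6, 12


def extract_features(pe):
    """Build one feature dictionary from a parsed pefile.PE object."""
    opt = pe.OPTIONAL_HEADER
    dirs = opt.DATA_DIRECTORY
    return {
        "MajorImageVersion": opt.MajorImageVersion,
        # Assumption: the debug entry stands in for the unspecified directory.
        "DataDirRVA": dirs[DEBUG].VirtualAddress,
        "DataDirSize": dirs[DEBUG].Size,
        "MajorOSVersion": opt.MajorOperatingSystemVersion,
        "IatRVA": dirs[IAT].VirtualAddress,
        "ResourceSize": dirs[RESOURCE].Size,
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
        "MajorLinkerVersion": opt.MajorLinkerVersion,
        "SizeOfStackReserve": opt.SizeOfStackReserve,
        "DllCharacteristics": opt.DllCharacteristics,
        "ExportSize": dirs[EXPORT].Size,
        "ExportRVA": dirs[EXPORT].VirtualAddress,
    }


def write_csv(rows, csv_path):
    """Dump the feature dictionaries into a single CSV file."""
    if not rows:
        return
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def process_folder(folder, csv_path):
    """Parse every file in `folder` and write one CSV row per valid PE file."""
    # Imported here so the helpers above work without pefile installed.
    import pefile  # third-party: pip install pefile

    rows = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            pe = pefile.PE(str(path), fast_load=True)
        except pefile.PEFormatError:
            continue  # not a PE file, skip it
        try:
            row = extract_features(pe)
        except (AttributeError, IndexError):
            continue  # truncated or unusual headers
        finally:
            pe.close()
        rows.append(row)
    write_csv(rows, csv_path)
    return len(rows)
```

Running `process_folder` once on the malware folder and once on the clean folder (both paths are up to you) would give the two CSV files used in the rest of the article.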
Exploring the data
This step is not strictly needed, but it can be quite an eye-opening experience, since it gives a more intuitive idea of the data as a whole. We can see the discrepancies between the two sets, especially in the first two features. Let's plot some of these features to get a visual idea of those differences.
We can notice the “clustering” of the malicious samples around a tight centroid, while the clean files are spread sparsely along the x axis. Let's now plot other features as well to get an overall understanding of what we have here.
The more we plot and analyze the data, the more we understand it and get a sense of its overall distribution. Of course, a problem arises: what do I do if I have a high-dimensional dataset? What we have here is fairly low-dimensional, but many techniques can be used to reduce the dimensions to the more “important” features. Algorithms like PCA and t-SNE can be used to visualize the data on 3D or even 2D plots.
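The side-by-side comparison of the two sets made earlier in this section can also be approximated without any plotting. The sketch below uses only the standard library; the row format (one dictionary per sample, as read by `csv.DictReader`) and the feature names passed in are assumptions for the example.

```python
from statistics import mean, median


def summarize(rows, feature):
    """Return min, median, mean and max of one feature across samples."""
    values = [float(row[feature]) for row in rows]
    return {
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }


def compare(clean_rows, malicious_rows, features):
    """Print a per-feature summary of both sets, one under the other."""
    for feature in features:
        print(feature)
        print("  clean    ", summarize(clean_rows, feature))
        print("  malicious", summarize(malicious_rows, feature))
```

A feature whose summaries differ strongly between the two sets is a good first candidate to plot.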
Machine learning application
Enough with the statistics, let's do some work. Until now we have not done any actual machine learning; what we did is only part of the whole job: we took some data, cleaned it, and prepared it. Now, to start experimenting with Machine Learning, we have to do a few more things:
- First, we need to merge our datasets (malicious and clean) into one DataFrame.
- We need to split our DataFrame into two parts: the first will be used for training and the second, later, for testing.
- We will then apply a few algorithms and see what happens.
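The merge and split steps listed above can be sketched as follows. The article works with a DataFrame; this library-free stand-in uses lists of dictionaries instead, and the function names, the 0/1 label encoding, and the default test ratio are all assumptions chosen for the illustration.

```python
import random


def label_and_merge(clean_rows, malicious_rows):
    """Tag each row with its class (0 = clean, 1 = malicious) and merge."""
    merged = [dict(row, label=0) for row in clean_rows]
    merged += [dict(row, label=1) for row in malicious_rows]
    return merged


def split_rows(rows, feature_names, test_ratio=0.2, seed=42):
    """Shuffle the rows, then split them into a training and a testing part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    cut = len(shuffled) - n_test
    train, test = shuffled[:cut], shuffled[cut:]

    def to_xy(part):
        # X holds the feature vectors, y the matching labels.
        X = [[row[name] for name in feature_names] for row in part]
        y = [row["label"] for row in part]
        return X, y

    X_train, y_train = to_xy(train)
    X_test, y_test = to_xy(test)
    return X_train, X_test, y_train, y_test
```

Shuffling before the cut matters here: without it, the merged list would keep all clean files first and all malicious files last, and the test part would contain a single class.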
Dataset preparation
Now we have four arrays, and quite big ones. X_train and y_train will be used to train our different classifiers, X_test will be used to predict labels, and y_test will be used for the metrics: we compare the predictions made on X_test against y_test to see how well we performed.
We start with Random Forests, which are an ensemble version of Decision Trees. They work by building many decision trees at training time and, for classification, outputting the class that is the mode of the classes predicted by the individual trees. They tend to perform quite well on binary classification problems.
Notice anything? Six False Positives and four False Negatives, with no parameter tuning and no modifications, is quite good. We were able to detect 697 clean files and 745 malicious ones correctly, which is 1,442 correct predictions out of 1,452 test files, or about 99.3% accuracy. Guess our small Anti-Virus is working :D.
Let's try another classifier this time: we will build a simple neural network and test it on another randomized split. According to Wikipedia, a Multi-Layer Perceptron is the generalized version of the perceptron, which is the basic model of the neuron. Perceptrons are the fundamental building blocks of deep learning methods, where we meet larger and deeper networks.
The almighty Neural Network failed to detect eighteen threats. Not only that, it classified them as clean files, which is a very, very bad problem: imagine your antivirus detecting ransomware as a clean file. This sounds like AV evasion on AI, but let's not be pessimistic. Our Neural Network is very primitive and we can actually make it more accurate, but that is beyond the scope of this article.
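To illustrate the Random Forest idea described above (many trees, each trained on a random view of the data, with the mode of their votes as the answer), here is a deliberately tiny, library-free sketch. It is not the classifier used for the results in this article: each "tree" is a single-split stump, labels are assumed to be 0 (clean) and 1 (malicious), and all names and default values are made up for the example.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Fit a one-split decision tree on a single feature column."""
    values = sorted({row[feature] for row in X})
    # Candidate thresholds sit halfway between neighbouring values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = None
    for threshold in thresholds:
        for low, high in ((0, 1), (1, 0)):
            errors = sum(
                (low if row[feature] <= threshold else high) != label
                for row, label in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return feature, threshold, low, high


def predict_stump(stump, row):
    """Predict 0 or 1 depending on which side of the split the row falls."""
    feature, threshold, low, high = stump
    return low if row[feature] <= threshold else high


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(X)) for _ in range(len(X))]
        feature = rng.randrange(len(X[0]))
        forest.append(
            fit_stump([X[i] for i in picks], [y[i] for i in picks], feature)
        )
    return forest


def predict_forest(forest, row):
    """Return the mode of the individual tree predictions."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]
```

An odd number of trees avoids tied votes in the binary case. A real implementation grows much deeper trees and considers several features per split, which is where most of the accuracy comes from.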
Conclusion
This is just the beginning. I wanted to show that Malware Classification is indeed a solvable problem if we accept 99% as a good accuracy rate. Of course, building and deploying something like this in reality is time-consuming and requires more knowledge and more data. This was merely a preview of the infinite possibilities that machine learning, and AI in general, offer us. I hope this was educational, fun, and insightful.
Sources
- Machine Learning Course by Andrew Ng
- Course that will make you a deep learning practitioner in 7 weeks; only requirement: Python
- Elements of Statistical Learning (Hastie): this is a more theoretical book, but quite insightful
- Selecting features to classify malware