ABSTRACT

We propose a versatile framework in which one can employ different machine learning algorithms to distinguish between malware files and clean files, while aiming to minimise the number of false positives. In this paper we present the ideas behind our framework by working firstly with cascade one-sided perceptrons and secondly with cascade kernelized one-sided perceptrons. After being successfully tested on medium-size datasets of malware and clean files, the ideas behind this framework were submitted to a scaling-up process that enabled us to work with very large datasets of malware and clean files.

In this paper, we present a framework for malware detection that aims to produce as few false positives as possible, using a simple multi-stage combination (cascade) of different versions of the perceptron algorithm. Other automated classification algorithms could also be used in this framework, but we do not explore that alternative here. The main steps performed through this framework are sketched as follows:
1. A set of features is computed for every binary file in the training or test datasets, based on many possible ways of analysing a malware sample.
2. A machine learning system based firstly on one-sided perceptrons, and then on feature-mapped one-sided perceptrons and kernelized one-sided perceptrons, combined with feature selection based on the F1 and F2 scores, is trained on a medium-size dataset consisting of clean and malware files. Cross-validation is then performed in order to choose suitable values for the parameters. Finally, tests are performed on another, unrelated dataset. The results obtained were very encouraging.
3. Finally, we analyse different aspects involved in scaling up our framework to identify malware files using very large training datasets.
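
The training idea in step 2 can be illustrated with a small Python sketch. It is not the framework's actual implementation: it assumes that a "one-sided" perceptron is a standard perceptron that, after each ordinary training pass, keeps training on the clean files alone until none of them is flagged as malware, and that a cascade trains each later stage on the malware the earlier stages missed. The function names, the label convention (+1 malware, -1 clean) and the default parameters are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed convention: label +1 = malware, -1 = clean.
Sample = Tuple[Sequence[float], int]


def score(w: List[float], b: float, x: Sequence[float]) -> float:
    """Linear score; a file is flagged as malware when the score is > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def perceptron_pass(data: List[Sample], w: List[float], b: float, lr: float) -> float:
    """One standard perceptron epoch. Updates w in place, returns the new bias."""
    for x, y in data:
        if y * score(w, b, x) <= 0:  # misclassified (or on the boundary)
            for i, xi in enumerate(x):
                w[i] += lr * y * xi
            b += lr * y
    return b


def false_positives(data: List[Sample], w: List[float], b: float) -> int:
    """Number of clean files that the model flags as malware."""
    return sum(1 for x, y in data if y == -1 and score(w, b, x) > 0)


def train_one_sided(data: List[Sample], n_features: int, lr: float = 1.0,
                    epochs: int = 50, max_inner: int = 1000) -> Tuple[List[float], float]:
    """Assumed one-sided scheme: after every full pass, retrain on the clean
    files only until no clean training file is flagged (or max_inner is hit)."""
    w, b = [0.0] * n_features, 0.0
    clean = [(x, y) for x, y in data if y == -1]
    for _ in range(epochs):
        b = perceptron_pass(data, w, b, lr)
        for _ in range(max_inner):
            if false_positives(clean, w, b) == 0:
                break
            b = perceptron_pass(clean, w, b, lr)
    return w, b


def train_cascade(data: List[Sample], n_features: int, stages: int = 3):
    """Assumed cascade: each stage sees all clean files plus the malware
    that the previous stages failed to flag."""
    models = []
    remaining = list(data)
    for _ in range(stages):
        if all(y == -1 for _, y in remaining):
            break  # no undetected malware left to learn from
        w, b = train_one_sided(remaining, n_features)
        models.append((w, b))
        remaining = [(x, y) for x, y in remaining
                     if y == -1 or score(w, b, x) <= 0]
    return models


def is_malware(models, x: Sequence[float]) -> bool:
    """A file is reported as malware if any stage of the cascade flags it."""
    return any(score(w, b, x) > 0 for w, b in models)
```

Under these assumptions, a call such as `train_cascade(samples, n_features)` on a list of `(feature_vector, label)` pairs returns the list of stage models, and `is_malware(models, x)` classifies a new feature vector. The extra clean-only passes trade some detection rate for fewer false positives on the training set, which matches the stated goal of the framework.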

Software Requirements:

• Anaconda (Python distribution)

• Language – Python 3.8

Hardware Requirements:

• Hard Disk: Greater than 500 GB

• RAM: Greater than 4 GB

• Processor: Intel Core i3 or above