In this paper, we propose a feature-free method for detecting phishing websites using the Normalized Compression Distance (NCD), a parameter-free similarity measure that computes the similarity of two websites by compressing them, eliminating the need for feature extraction and removing any dependence on a specific set of website features. The method examines the HTML of webpages and computes their similarity to known phishing websites in order to classify them. We use the Furthest Point First algorithm to extract phishing prototypes, selecting instances that are representative of a cluster of phishing webpages. We also introduce an incremental learning algorithm as a framework for continuous and adaptive detection that does not require extracting new features when concept drift occurs. On a large dataset, our proposed method significantly outperforms previous methods in detecting phishing websites, with an AUC score of 98.68% and a true positive rate (TPR) of around 90%, while maintaining a low false positive rate (FPR) of 0.58%. Because our approach uses prototypes, it eliminates the need to retain long-term data, and it is feasible to deploy in real systems, with a processing time of roughly 0.3 seconds.
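The compression-based similarity described in the abstract can be illustrated with a minimal sketch. It uses the standard NCD definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor, the function names, and the nearest-prototype helper are assumptions made for illustration; the abstract does not say which compressor or decision rule the authors use. The Furthest Point First routine is the generic greedy version of that algorithm, not necessarily the exact variant in the paper.

```python
import zlib
from typing import Callable, List, Sequence


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: small for similar inputs, larger for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def furthest_point_first(
    items: Sequence[bytes],
    k: int,
    dist: Callable[[bytes, bytes], float] = ncd,
) -> List[bytes]:
    """Greedy Furthest Point First: pick k items that are spread far apart."""
    if not items or k <= 0:
        return []
    chosen = [0]  # start from the first item (an arbitrary choice)
    # Distance from every item to its nearest chosen prototype so far.
    nearest = [dist(items[0], item) for item in items]
    while len(chosen) < k and len(chosen) < len(items):
        candidates = [i for i in range(len(items)) if i not in chosen]
        nxt = max(candidates, key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(nearest[i], dist(items[nxt], items[i]))
                   for i in range(len(items))]
    return [items[i] for i in chosen]


def nearest_prototype_distance(html: bytes, prototypes: Sequence[bytes]) -> float:
    """Distance from a page to its closest phishing prototype (hypothetical helper)."""
    return min(ncd(html, p) for p in prototypes)
```

A page could then be flagged when its nearest-prototype distance falls below a threshold tuned on validation data. That threshold is a free parameter of this sketch, not a value taken from the paper.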
- Malicious websites are the basis of most criminal activity on the internet.
- The dangers posed by malicious sites are enormous, and end users must be prevented from visiting them.
- Users should refrain from clicking on such Uniform Resource Locators (URLs).
- To prevent such attacks, the paper proposes using machine learning algorithms to detect phishing websites.
- The existing PWD (Phishing Website Detection) model is trained on an existing dataset of URLs, each with unique features, and is applied to three machine learning classifiers: support vector machine, logistic regression, and Naïve Bayes. After training and testing, the Naïve Bayes classifier recorded the highest accuracy.
- Low accuracy due to training loss.
- Many website features are not taken into consideration.
- Collect a dataset containing phishing and legitimate websites from open-source platforms.
- Write code to extract the required features from the URL database.
- Analyze and preprocess the dataset using exploratory data analysis (EDA) techniques.
- Divide the dataset into training and testing sets.
- Run the selected machine learning and deep neural network (DNN) algorithms on the dataset.
- Write code to display the evaluation results using accuracy metrics.
- Compare the results of the trained models and identify which performs better.
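The feature extraction, train/test split, and evaluation steps listed above can be sketched as follows, using only the Python standard library. The specific URL features, the function names, and the 80/20 split ratio are illustrative assumptions; the project description does not state which features or ratio it uses.

```python
import random
import re
from typing import Dict, List, Sequence, Tuple
from urllib.parse import urlparse


def extract_features(url: str) -> Dict[str, int]:
    """Turn a URL into a few simple numeric features (an assumed feature set)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": int("@" in url),
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "uses_https": int(parsed.scheme == "https"),
    }


def train_test_split(rows: Sequence, test_ratio: float = 0.2,
                     seed: int = 0) -> Tuple[List, List]:
    """Shuffle the rows reproducibly and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall, where label 1 means phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Calling `evaluate` on the predictions of each trained model produces comparable metric dictionaries, which is one simple way to carry out the final comparison step.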
- DNN: This is also a supervised classification algorithm that is easy to use. It can be used for both classification and regression, but it is more commonly used for classification. Each data item is plotted as a point in an n-dimensional space, where n is the number of features of the data. Classification is then done by finding the boundary that separates the classes, whose data points lie in different regions of that space.
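To make the idea of a data item as a point in n-dimensional space concrete, the following is a minimal sketch of the forward pass of a small feed-forward network that maps such a point to a phishing probability. The layer layout, the ReLU and sigmoid activations, and all weights are hypothetical; the project description does not specify the network architecture, and a real model would learn its weights from the training set.

```python
import math
from typing import List, Sequence, Tuple

# One layer is (weights, biases), with one row of weights per output unit.
Layer = Tuple[List[List[float]], List[float]]


def dense(inputs: Sequence[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """Fully connected layer: weighted sum of the inputs plus a bias per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values: Sequence[float]) -> List[float]:
    """Rectified linear activation for the hidden layers."""
    return [max(0.0, v) for v in values]


def sigmoid(value: float) -> float:
    """Squash the output score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def predict_proba(point: Sequence[float], layers: Sequence[Layer]) -> float:
    """Pass an n-dimensional feature point through the network."""
    activation = list(point)
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(activation, weights, biases)[0])


# Hypothetical hand-set weights: 2 input features, 1 hidden unit, 1 output.
example_layers: List[Layer] = [
    ([[1.0, -1.0]], [0.0]),
    ([[2.0]], [-1.0]),
]
probability = predict_proba([3.0, 1.0], example_layers)
```

A probability above a chosen cutoff (0.5 is a common default) would label the point as phishing, which corresponds to the point lying on one side of the learned decision boundary.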
- Provides a clear picture of how effective each classifier is at phishing email detection.
- Achieves a high level of accuracy by combining the advantages of many classifiers.
- Fast classification, low memory consumption, adaptation over time, and online operation.
HARDWARE AND SOFTWARE REQUIREMENTS
- Front End – Anaconda IDE
- Backend – SQL
- Language – Python 3.8
- Hard Disk: Greater than 500 GB
- RAM: Greater than 4 GB
- Processor: Intel Core i3 or above
Including Packages
* Base Paper
* Complete Source Code
* Complete Documentation
* Complete Presentation Slides
* Flow Diagram
* Database File
* Execution Procedure
* Readme File
* Video Tutorials