A stacking model using URL and HTML features for phishing webpage detection

1 May 2019  ·  Yukun Li, Zhenguo Yang, Xu Chen, Huaping Yuan, Wenyin Liu

In this paper, we present a stacking model that detects phishing webpages using URL and HTML features. For features, we design lightweight URL and HTML features and introduce HTML string embedding, none of which rely on third-party services, making it possible to develop real-time detection applications. Furthermore, we devise a stacking model that combines GBDT, XGBoost and LightGBM in multiple layers, which lets the different models complement one another and thus improves performance on phishing webpage detection. For evaluation, we collect two real-world datasets, named 50K-PD and 50K-IPD. 50K-PD contains 49,947 webpages with URLs and HTML code. 50K-IPD contains 53,103 webpages with screenshots in addition to URLs and HTML code. The proposed approach outperforms several machine learning models on multiple metrics, achieving 97.30% accuracy, a 4.46% missing alarm rate, and a 1.61% false alarm rate on the 50K-PD dataset. On the 50K-IPD dataset, it achieves 98.60% accuracy, a 1.28% missing alarm rate, and a 1.54% false alarm rate.
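
The three reported metrics can be illustrated with a minimal sketch. It assumes the conventional definitions, which the abstract does not spell out: missing alarm rate as the fraction of phishing pages classified as legitimate, FN / (TP + FN), and false alarm rate as the fraction of legitimate pages flagged as phishing, FP / (FP + TN). The function name `detection_metrics` and the 0/1 label encoding are hypothetical choices for illustration, not taken from the paper.

```python
def detection_metrics(y_true, y_pred):
    """Compute accuracy, missing alarm rate and false alarm rate.

    Assumed label encoding (not from the paper): 1 = phishing, 0 = legitimate.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # phishing page correctly flagged
        elif truth == 1 and pred == 0:
            fn += 1  # phishing page missed
        elif truth == 0 and pred == 1:
            fp += 1  # legitimate page wrongly flagged
        else:
            tn += 1  # legitimate page correctly passed

    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Assumed definition: share of phishing pages that were missed.
        "missing_alarm_rate": fn / (tp + fn) if (tp + fn) else 0.0,
        # Assumed definition: share of legitimate pages that were flagged.
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


if __name__ == "__main__":
    # Toy labels: 4 phishing pages (one missed), 6 legitimate pages (one flagged).
    truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    preds = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    print(detection_metrics(truth, preds))
```

Under these definitions a low missing alarm rate means few phishing pages slip through, while a low false alarm rate means few legitimate pages are blocked, so the two rates are worth reading together with accuracy rather than in isolation.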
