Website Classification Using Word Based Multiple N-Gram Models and Random Search Oriented Feature Parameters

21 Dec 2018  ·  Ashadullah Shawon, Syed Tauhid Zuhori, Firoz Mahmud, Md. Jamil-Ur Rahman

Website classification is a convenient starting point for building intelligent web browsers and social networking sites that can learn a user's favorite categories and reliably detect adult or harmful websites. Classifying websites using only the information in the Uniform Resource Locator (URL) is an important and fast technique. URL classification must be highly accurate to be usable in real-world applications, so we propose an improved approach that provides better results. We introduce word-based multiple n-gram models for efficient feature extraction and a Naive Bayes classifier with a multinomial distribution, placed in a Random Search pipeline for hyperparameter optimization that finds the best parameters of the URL features. We compare our experimental results with those of previous research and show an improvement over the existing results. Our approach achieves 88.77% recall and an 87.63% F1-score, which is the best performance reported so far.
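The approach described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the tokenizer, the searched parameter ranges, the accuracy-based selection (the paper reports recall and F1), and the toy URLs and labels are all assumptions made for illustration.

```python
import math
import random
import re
from collections import Counter, defaultdict


def url_word_ngrams(url, n_values=(1, 2)):
    """Split a URL into lowercase word tokens and emit word n-grams for each n."""
    words = [w for w in re.split(r"[^a-z0-9]+", url.lower()) if w]
    grams = []
    for n in n_values:
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


class MultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace/Lidstone) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for feats, label in zip(docs, labels):
            self.feature_counts[label].update(feats)
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            for f in feats:
                if f in self.vocab:  # features never seen in training are ignored
                    score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


def random_search(train, valid, n_iter=10, seed=0):
    """Sample feature/model parameters at random; keep the best on validation data."""
    rng = random.Random(seed)
    best_acc, best_params = -1.0, None
    for _ in range(n_iter):
        # Assumed search space: which n-gram orders to combine, and smoothing strength.
        params = {
            "n_values": rng.choice([(1,), (1, 2), (1, 2, 3)]),
            "alpha": 10 ** rng.uniform(-3, 0),
        }
        model = MultinomialNB(params["alpha"]).fit(
            [url_word_ngrams(u, params["n_values"]) for u, _ in train],
            [y for _, y in train],
        )
        hits = sum(
            model.predict(url_word_ngrams(u, params["n_values"])) == y
            for u, y in valid
        )
        acc = hits / len(valid)
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_acc, best_params


# Hypothetical toy data, for illustration only.
train = [
    ("http://www.sportsnews.com/football/scores", "sports"),
    ("http://example.org/football/league/table", "sports"),
    ("http://example.com/shop/shoes/sale", "shopping"),
    ("http://store.example.net/shop/cart/checkout", "shopping"),
]
valid = [
    ("http://example.org/football/scores", "sports"),
    ("http://example.net/shop/sale", "shopping"),
]

best_acc, best_params = random_search(train, valid)
print(best_acc, best_params)
```

Combining several n-gram orders lets the classifier use both single words and short word sequences from the URL, while the random search treats the choice of orders and the smoothing strength as parameters to tune together rather than fixing them by hand.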
